[SR-12071] String.count returns wrong value for Hindi, Thai #54507

Closed
swift-ci opened this issue Jan 23, 2020 · 4 comments
Labels
bug: A deviation from expected or documented behavior. Also: expected but undesirable behavior.
standard library: Area: Standard library umbrella

Comments

@swift-ci
Collaborator

Previous ID SR-12071
Radar rdar://problem/58853160
Original Reporter claus.joergensen (JIRA User)
Type Bug
Status Closed
Resolution Invalid
Additional Detail from JIRA
Votes 0
Component/s Standard Library
Labels Bug
Assignee None
Priority Medium

md5: 0b1652a4952ea393f26e16b4a8c29849

Issue Description:

The count property of a String in Swift returns an incorrect value for Hindi and Thai, whereas the NSString equivalent returns a correct value for the length property.

We found the issue while manipulating the layout of strings in an NSAttributedString: the positioning was wrong because the string length was calculated incorrectly.

Example:

import Foundation

func checkLength(_ lang: String, _ string: String, _ expectedLength: Int) {
    if string.count != expectedLength {
        print("Swift String.count returned unexpected string length \(string.count) for \(lang), should be \(expectedLength)")
    }

    if (string as NSString).length != expectedLength {
        print("Swift NSString.length returned unexpected string length \((string as NSString).length) for \(lang), should be \(expectedLength)")
    }
}

// String.count and NSString.length both return the correct value
checkLength("English", "New to Chat?", 12)
checkLength("Arabic", "هل أنت حديث العهد باستخدام Chat؟", 32)
checkLength("Chinese (Traditional)", "您是 Chat 的新使用者嗎", 14)
checkLength("Japanese", "Chatを初めてご利用の場合", 14)

// String.count returns incorrect value whereas NSString.length returns the correct value
checkLength("Hindi", "क्या आप Chat पर नए हैं?", 23)
checkLength("Thai", "เพิ่งเริ่มแชทใช่หรอไม่", 22)

Output:

Swift String.count returned unexpected string length 19 for Hindi, should be 23
Swift String.count returned unexpected string length 16 for Thai, should be 22
@stephentyrone
Member

CC @milseman

@swift-ci
Collaborator Author

Comment by Kyle Macomber (JIRA)

@swift-ci create

@lorentey
Member

This behaves correctly – String and NSString operate on different levels of abstraction. NSString counts UTF-16 code units, while String counts Unicode grapheme clusters (which are closer to what people typically consider "characters").

For example, the character "क्" from the beginning of the Hindi example is actually composed of two Unicode scalars, but it's still considered a single character.

"क" – U+0915 DEVANAGARI LETTER KA
"्" – U+094D DEVANAGARI SIGN VIRAMA
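
A minimal sketch for inspecting such a cluster yourself, assuming Swift 5 or later (for `Unicode.Scalar.Properties.name`). The helper name `describeScalars` is hypothetical and only serves this illustration:

```swift
// Hypothetical helper (not part of the standard library): lists each
// Unicode scalar in `text` as "U+XXXX NAME".
func describeScalars(_ text: String) -> [String] {
    return text.unicodeScalars.map { scalar -> String in
        // Format the code point as at least four uppercase hex digits.
        let hex = String(scalar.value, radix: 16, uppercase: true)
        let padded = String(repeating: "0", count: max(0, 4 - hex.count)) + hex
        let name = scalar.properties.name ?? "<unnamed>"
        return "U+" + padded + " " + name
    }
}

// KA followed by VIRAMA, written with escapes so the scalars are explicit.
let cluster = "\u{0915}\u{094D}"
for line in describeScalars(cluster) {
    print(line)
}
print(cluster.unicodeScalars.count) // 2 scalars
print(cluster.utf16.count)          // 2 UTF-16 code units
print(cluster.count)                // 1 Character (grapheme cluster)
```

Both scalars lie in the Basic Multilingual Plane, so each takes a single UTF-16 code unit, which is why the scalar count and the UTF-16 count agree here.
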

NSString and NSAttributedString consider "क्" a string of length 2, while String has its count as 1. Neither of these is incorrect – the correct answer depends on what we mean by "character", which changes from problem to problem. While processing text, you always need to keep in mind what representation you're working with, especially when interfacing with libraries not written in Swift, like Cocoa here.

To interface with NSAttributedString indices, you'll need to use the `.utf16` view on String:

let hindi = "क्या आप Chat पर नए हैं?"
print(hindi.count)  // ⟹ 19
print(hindi.utf16.count) // ⟹ 23
print((hindi as NSString).length) // ⟹ 23

let thai = "เพิ่งเริ่มแชทใช่หรอไม่"
print(thai.count) // ⟹ 16
print(thai.utf16.count) // ⟹ 22
print((thai as NSString).length) // ⟹ 22
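
When you need ranges rather than lengths, one option is to let Foundation do the conversion instead of computing UTF-16 offsets by hand. The following is only a sketch: the helper name `utf16Range(of:in:)` and the `"highlight"` attribute key are made up for illustration, while `NSRange(_:in:)` and `Range(_:in:)` are the Foundation initializers that translate between String indices and UTF-16 offsets.

```swift
import Foundation

// Hypothetical helper: finds `needle` in `text` and returns its position as
// an NSRange (UTF-16 offsets), or nil if `needle` does not occur.
func utf16Range(of needle: String, in text: String) -> NSRange? {
    guard let range = text.range(of: needle) else { return nil }
    // NSRange(_:in:) converts a Range<String.Index> into UTF-16 offsets.
    return NSRange(range, in: text)
}

let hindi = "क्या आप Chat पर नए हैं?"

if let nsRange = utf16Range(of: "Chat", in: hindi) {
    // "Chat" starts 8 UTF-16 code units into the string.
    print(nsRange.location, nsRange.length) // 8 4

    // The NSRange can be passed directly to NSAttributedString APIs.
    // The attribute key below is a placeholder, not a real styling attribute.
    let attributed = NSMutableAttributedString(string: hindi)
    attributed.addAttribute(NSAttributedString.Key(rawValue: "highlight"),
                            value: true,
                            range: nsRange)

    // Going the other way: an NSRange back to a Range<String.Index>.
    if let swiftRange = Range(nsRange, in: hindi) {
        print(hindi[swiftRange]) // Chat
    }
}
```

Using the grapheme-cluster offset of "Chat" as the `location` would point a couple of code units too early, which is the kind of layout error described in the report.
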

Besides `.utf16`, String also provides `.utf8` and `.unicodeScalars` views that provide additional ways to count String contents:

"👯‍♂️".count // ⟹ 1
"👯‍♂️".unicodeScalars.count // ⟹ 4
"👯‍♂️".utf16.count // ⟹ 5
"👯‍♂️".utf8.count // ⟹ 13
("👯‍♂️" as NSString).length // ⟹ 5 (same as .utf16)

From the String documentation:

A string is a collection of extended grapheme clusters, which approximate human-readable characters. Many individual characters, such as “é”, “김”, and “🇮🇳”, can be made up of multiple Unicode scalar values. These scalar values are combined by Unicode’s boundary algorithms into extended grapheme clusters, represented by the Swift Character type. Each element of a string is represented by a Character instance.

For more details, see the rest of the Swift documentation (and, ultimately, the Unicode standard).

@lorentey
Member

Jira's dodgy Unicode support failed me there. The last examples were supposed to include the hugely useful "dancing men wearing bunny ears" emoji, represented by the single-character string "\u{1F46F}\u{200D}\u{2642}\u{FE0F}".

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.