[SR-12071] String.count returns wrong value for Hindi, Thai #54507
Comments
CC @milseman
Comment by Kyle Macomber (JIRA): @swift-ci create
This behaves correctly: String and NSString operate on different levels of abstraction. NSString counts UTF-16 code units, while String counts Unicode grapheme clusters (which are closer to what people typically consider "characters").

For example, the character "क्" from the beginning of the Hindi example is actually composed of two Unicode scalars, but it's still considered a single character:

"क" – U+0915 DEVANAGARI LETTER KA
"्" – U+094D DEVANAGARI SIGN VIRAMA

NSString and NSAttributedString consider "क्" a string of length 2, while String has its count as 1. Neither of these is incorrect. The correct answer depends on what we mean by "character", which changes from problem to problem.

While processing text, you always need to keep in mind what representation you're working with, especially when interfacing with libraries not written in Swift, like Cocoa here. To interface with NSAttributedString indices, you'll need to use the `.utf16` view on String:

let hindi = "क्या आप Chat पर नए हैं?"
print(hindi.count) // ⟹ 19
print(hindi.utf16.count) // ⟹ 23
print((hindi as NSString).length) // ⟹ 23
let thai = "เพิ่งเริ่มแชทใช่หรอไม่"
print(thai.count) // ⟹ 16
print(thai.utf16.count) // ⟹ 22
print((thai as NSString).length) // ⟹ 22

Besides `.utf16`, String also provides `.utf8` and `.unicodeScalars` views that provide additional ways to count String contents:

"�♂️".count // ⟹ 1
"�♂️".unicodeScalars.count // ⟹ 4
"�♂️".utf16.count // ⟹ 5
"�♂️".utf8.count // ⟹ 13
("�♂️" as NSString).length // ⟹ 5 (same as .utf16)

From the String documentation:

A string is a collection of extended grapheme clusters, which approximate human-readable characters. Many individual characters, such as “é”, “김”, and “🇮🇳”, can be made up of multiple Unicode scalar values. These scalar values are combined by Unicode’s boundary algorithms into extended grapheme clusters, represented by the Swift `Character` type.

For more details, see the rest of the Swift documentation (and, ultimately, the Unicode standard).
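As a rough illustration of the advice above, the following sketch converts between a Swift `Range<String.Index>` and an `NSRange` using Foundation's `NSRange(_:in:)` and `Range(_:in:)` initializers. Searching for the substring "Chat" is only an assumption made for the example; it is not taken from the original report.

```swift
import Foundation

let hindi = "क्या आप Chat पर नए हैं?"

// Find "Chat" with Swift's String API; the result is a Range<String.Index>.
guard let swiftRange = hindi.range(of: "Chat") else {
    fatalError("substring not found")
}

// Convert to an NSRange, which is measured in UTF-16 code units.
let nsRange = NSRange(swiftRange, in: hindi)
print(nsRange.location, nsRange.length)  // 8 4

// Convert back. The initializer is failable because an arbitrary NSRange
// may not correspond to valid String indices.
if let roundTrip = Range(nsRange, in: hindi) {
    print(hindi[roundTrip])  // Chat
}
```

Letting Foundation do the conversion avoids mixing grapheme-cluster offsets with UTF-16 offsets by hand, which seems to be the kind of mismatch behind the layout problem in the report.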
Jira's dodgy Unicode support failed me there. The last examples were supposed to include the hugely useful "dancing men wearing bunny ears" emoji, represented by the single-character string "\u{1F46F}\u{200D}\u{2642}\u{FE0F}".
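Because the escaped form survives encoding problems, the counts quoted earlier can be reproduced with a small sketch like the one below. The variable name `dancers` is made up for illustration. The encoding-level counts follow directly from the four scalars; the grapheme count of 1 is the value reported in this thread and depends on the Unicode rules built into the Swift runtime.

```swift
// The emoji spelled with Unicode scalar escapes, as given in the comment above.
let dancers = "\u{1F46F}\u{200D}\u{2642}\u{FE0F}"

// List the individual scalars that make up the single visible character.
for scalar in dancers.unicodeScalars {
    print("U+" + String(scalar.value, radix: 16, uppercase: true))
}

print(dancers.unicodeScalars.count)  // 4
print(dancers.utf16.count)           // 5 (U+1F46F needs a surrogate pair)
print(dancers.utf8.count)            // 13 (4 + 3 + 3 + 3 bytes)
print(dancers.count)                 // reported as 1 in this thread
```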
Additional Detail from JIRA
md5: 0b1652a4952ea393f26e16b4a8c29849
Issue Description:
The count property of a String in Swift returns an incorrect value for Hindi and Thai, whereas the NSString equivalent returns a correct value for the length property.
The issue was found when we were trying to manipulate the layout of the string(s) in an NSAttributedString, and the positioning was wrong because the string length was calculated incorrectly.
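The reporter's own example did not survive the migration, so the following is only a guess at the kind of code involved, using the Thai string from this thread. It shows how building an `NSRange` from `String.count` comes up short, and two ways to get a range that covers the whole string. The variable names are hypothetical.

```swift
import Foundation

let thai = "เพิ่งเริ่มแชทใช่หรอไม่"

// Problematic: String.count is measured in grapheme clusters,
// but NSRange lengths are measured in UTF-16 code units.
let tooShort = NSRange(location: 0, length: thai.count)

// Option 1: take the length from the UTF-16 view.
let viaUTF16 = NSRange(location: 0, length: thai.utf16.count)

// Option 2: let Foundation convert a Swift range.
let viaInit = NSRange(thai.startIndex..<thai.endIndex, in: thai)

print(tooShort.length, viaUTF16.length, viaInit.length)
```

A range built like `tooShort` would leave the tail of the text outside any attributes applied with it, which could plausibly produce the positioning errors described.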
Example:
Output: