[SR-12071] String.count returns wrong value for Hindi, Thai #54507

swift-ci · 2020-01-23T13:01:56Z


Previous ID	SR-12071
Radar	rdar://problem/58853160
Original Reporter	claus.joergensen (JIRA User)
Type	Bug
Status	Closed
Resolution	Invalid

Additional Detail from JIRA


Votes	0
Component/s	Standard Library
Labels	Bug
Assignee	None
Priority	Medium

md5: 0b1652a4952ea393f26e16b4a8c29849

Issue Description:

The count property of a String in Swift returns an incorrect value for Hindi and Thai, whereas the NSString equivalent returns a correct value for the length property.

The issue was found when we were trying to manipulate the layout of the string(s) in an NSAttributedString; the positioning was wrong because the string length was calculated incorrectly.

Example:

import Foundation

func checkLength(_ lang: String, _ string: String, _ expectedLength: Int) {
    if string.count != expectedLength {
        print("Swift String.count returned unexpected string length \(string.count) for \(lang), should be \(expectedLength)")
    }

    if (string as NSString).length != expectedLength {
        print("Swift NSString.length returned unexpected string length \((string as NSString).length) for \(lang), should be \(expectedLength)")
    }
}

// String.count and NSString.length both return the correct value
checkLength("English", "New to Chat?", 12)
checkLength("Arabic", "هل أنت حديث العهد باستخدام Chat؟", 32)
checkLength("Chinese (Traditional)", "您是 Chat 的新使用者嗎", 14)
checkLength("Japanese", "Chatを初めてご利用の場合", 14)

// String.count returns incorrect value whereas NSString.length returns the correct value
checkLength("Hindi", "क्या आप Chat पर नए हैं?", 23)
checkLength("Thai", "เพิ่งเริ่มแชทใช่หรอไม่", 22)

Output:

Swift String.count returned unexpected string length 19 for Hindi, should be 23
Swift String.count returned unexpected string length 16 for Thai, should be 22

stephentyrone · 2020-01-23T18:25:18Z

CC @milseman

swift-ci · 2020-01-24T00:17:27Z

Comment by Kyle Macomber (JIRA)

@swift-ci create

lorentey · 2020-01-24T01:31:20Z

This behaves correctly – String and NSString operate on different levels of abstraction. NSString counts UTF-16 code units, while String counts Unicode grapheme clusters (which are closer to what people typically consider "characters").

For example, the character "क्" from the beginning of the Hindi example is actually composed of two Unicode scalars, but it's still considered a single character.

"क" – U+0915 DEVANAGARI LETTER KA
"्" – U+094D DEVANAGARI SIGN VIRAMA

NSString and NSAttributedString consider "क्" a string of length 2, while String reports its count as 1. Neither of these is incorrect; the correct answer depends on what we mean by "character", which changes from problem to problem. While processing text, you always need to keep in mind what representation you're working with, especially when interfacing with libraries not written in Swift, like Cocoa here.
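As a minimal sketch of the point above, the following uses only the standard library to count the same two-scalar string at each level of abstraction. The constant name `ka` is purely illustrative.

```swift
// "क्" written out as its two Unicode scalars:
// U+0915 DEVANAGARI LETTER KA + U+094D DEVANAGARI SIGN VIRAMA
let ka = "\u{0915}\u{094D}"

print(ka.count)                 // ⟹ 1 (Characters, i.e. extended grapheme clusters)
print(ka.unicodeScalars.count)  // ⟹ 2 (Unicode scalars)
print(ka.utf16.count)           // ⟹ 2 (UTF-16 code units, the unit NSString.length uses)
```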

To interface with NSAttributedString indices, you'll need to use the `.utf16` view on String:

let hindi = "क्या आप Chat पर नए हैं?"
print(hindi.count)  // ⟹ 19
print(hindi.utf16.count) // ⟹ 23
print((hindi as NSString).length) // ⟹ 23

let thai = "เพิ่งเริ่มแชทใช่หรอไม่"
print(thai.count) // ⟹ 16
print(thai.utf16.count) // ⟹ 22
print((thai as NSString).length) // ⟹ 22
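When a range is needed rather than a count, one possible approach is to let Foundation convert between `String.Index` ranges and UTF-16-based `NSRange` values. The sketch below assumes the goal is to locate the word "Chat" in the Hindi example; `utf16Range(of:in:)` is a hypothetical helper written for this illustration, not an existing API.

```swift
import Foundation

/// Hypothetical helper: locates `needle` in `text` and reports the match as an
/// NSRange (UTF-16 offsets), or nil if there is no match.
func utf16Range(of needle: String, in text: String) -> NSRange? {
    guard let range = text.range(of: needle) else { return nil }
    // NSRange(_:in:) translates String indices into UTF-16 offsets.
    return NSRange(range, in: text)
}

let hindi = "क्या आप Chat पर नए हैं?"
if let nsRange = utf16Range(of: "Chat", in: hindi) {
    print(nsRange.location, nsRange.length)  // ⟹ 8 4

    // Going back the other way yields String indices again.
    if let range = Range(nsRange, in: hindi) {
        print(hindi[range])                  // ⟹ Chat
    }
}
```

The resulting `NSRange` can be passed to NSString and NSAttributedString APIs that take ranges, which avoids doing offset arithmetic with `String.count` by hand.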

Besides `.utf16`, String also provides `.utf8` and `.unicodeScalars` views that provide additional ways to count String contents:

"👯‍♂️".count // ⟹ 1

"👯‍♂️".unicodeScalars.count // ⟹ 4

"👯‍♂️".utf16.count // ⟹ 5

"👯‍♂️".utf8.count // ⟹ 13

("👯‍♂️" as NSString).length // ⟹ 5 (same as .utf16)

From the String documentation:

A string is a collection of extended grapheme clusters, which approximate human-readable characters. Many individual characters, such as “é”, “김”, and “🇮🇳”, can be made up of multiple Unicode scalar values. These scalar values are combined by Unicode’s boundary algorithms into extended grapheme clusters, represented by the Swift Character type. Each element of a string is represented by a Character instance.

For more details, see the rest of the Swift documentation (and, ultimately, the Unicode standard).
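To make the quoted passage concrete, here is a small sketch that walks a string Character by Character and lists the Unicode scalars inside each one. The sample string is an assumption chosen for illustration: a decomposed "é" followed by a flag built from two regional indicator scalars.

```swift
// "e" + U+0301 COMBINING ACUTE ACCENT, then two regional indicators forming a flag.
let sample = "e\u{301}\u{1F1EE}\u{1F1F3}"

// Iterating a String yields Character values, one per grapheme cluster.
for character in sample {
    let scalars = character.unicodeScalars
        .map { "U+" + String($0.value, radix: 16, uppercase: true) }
    print(character, scalars)
}

print(sample.count)                 // ⟹ 2
print(sample.unicodeScalars.count)  // ⟹ 4
```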

lorentey · 2020-01-24T01:36:20Z

Jira's dodgy Unicode support failed me there. The last examples were supposed to include the hugely useful "dancing men wearing bunny ears" emoji, represented by the single-character string "\u{1F46F}\u{200D}\u{2642}\u{FE0F}".

swift-ci transferred this issue from apple/swift-issues Apr 25, 2022

This issue was closed.
