[SR-2956] CharacterSet union and formUnion not working properly in Swift 3 with unicode #4318

Open
swift-ci opened this issue Oct 14, 2016 · 6 comments

Comments

@swift-ci
Contributor

Previous ID SR-2956
Radar None
Original Reporter thuss (JIRA User)
Type Bug
Status Reopened
Resolution
Environment

Mac OS Sierra 10.12
Xcode 8.0 (8a218a)
Apple Swift version 3.0 (swiftlang-800.0.46.2 clang-800.0.38)
Target: x86_64-apple-macosx10.9

Additional Detail from JIRA
Votes 2
Component/s Foundation
Labels Bug, SDKOverlay
Assignee None
Priority Medium

md5: eea36f50a39958f934a3b2a2adf3ded2

Issue Description:

You can reproduce with the following code: https://gist.github.com/twobitlabs/5ba150aed3c159d215ef049f0c5739de

import Foundation
var charset = CharacterSet(charactersIn: "a")
charset.formUnion(CharacterSet(charactersIn: "\u{1F600}"))
print(charset.contains("\u{1F600}")) // prints false but should print true
@jepers

jepers commented Feb 16, 2018

The issue is still present in Xcode 9.3 beta 2.

Also, note that .insert can be used to demonstrate the same problem (maybe union uses insert):

var charset = CharacterSet(charactersIn: "a")
charset.insert(charactersIn: "\u{1F600}")
print(charset.contains("\u{1F600}")) // false

And this:

var charset = CharacterSet()
charset.insert(charactersIn: "\u{1F600}")
print(charset.contains("\u{1F600}")) // true

And this:

var charset = CharacterSet()
charset.insert(charactersIn: "a")
charset.insert(charactersIn: "\u{1F600}")
print(charset.contains("\u{1F600}")) // false

And this:

var charset = CharacterSet()
charset.insert(charactersIn: "\u{1F600}")
charset.insert(charactersIn: "a")
print(charset.contains("\u{1F600}")) // true

Since the following works as expected regardless of the order in which a and b are inserted, maybe the problem is in how the ...(charactersIn: String) methods split the String into Unicode scalars?

let a = Unicode.Scalar("a")!
let b = Unicode.Scalar("\u{1F600}")!
var cs = CharacterSet()
cs.insert(a)
cs.insert(b)
print(cs.contains("\u{1F600}")) // true, and
// also true when swapping the order in which
// a and b are inserted.
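
If per-scalar insertion behaves as reported above, one possible workaround is to avoid the String-based methods and insert each scalar individually. The following is a minimal sketch under that assumption; `makeCharacterSet(fromScalarsIn:)` is a hypothetical helper, not a Foundation API, and its result has not been verified against every Foundation version.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): builds a CharacterSet by
// inserting each Unicode scalar of `string` one at a time, instead of
// going through insert(charactersIn:).
func makeCharacterSet(fromScalarsIn string: String) -> CharacterSet {
    var set = CharacterSet()
    for scalar in string.unicodeScalars {
        set.insert(scalar)
    }
    return set
}

let workaroundSet = makeCharacterSet(fromScalarsIn: "a\u{1F600}")
// Expected to be true if per-scalar insertion works as reported in this thread.
print(workaroundSet.contains("\u{1F600}"))
```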

Seems like it has problems handling supplemental code points:

var charset = CharacterSet()
charset.insert(charactersIn: "a")
charset.insert(charactersIn: "\u{1F600}")
print(charset.contains("\u{1F600}")) // false
print(charset.debugDescription)
// <CFCharacterSet Items(U+0061 U+F600)>
// Note the missing upper bits of the supplementary
// code point: U+F600 instead of U+1F600.

var charset = CharacterSet()
//charset.insert(charactersIn: "a") <-- Commented out
charset.insert(charactersIn: "\u{1F600}")
print(charset.contains("\u{1F600}")) // true
print(charset.debugDescription)
// <CFCharacterSet Range(128512, 1)>
// (128512 == 0x1F600)
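
The U+F600 in the first debugDescription is what you get when only the low 16 bits of U+1F600 are kept. The sketch below only illustrates that arithmetic using the Swift standard library, alongside the actual UTF-16 encoding of the scalar; it makes no claim about how Foundation implements `insert(charactersIn:)`.

```swift
let scalar: Unicode.Scalar = "\u{1F600}"

// The full code point value, matching Range(128512, 1) above.
print(scalar.value) // 128512

// Keeping only the low 16 bits yields the value seen in the faulty set.
let truncated = scalar.value & 0xFFFF
print(String(truncated, radix: 16, uppercase: true)) // F600

// The UTF-16 encoding of the same scalar is a surrogate pair.
let units = Array(String(scalar).utf16)
print(units.map { String($0, radix: 16, uppercase: true) }) // ["D83D", "DE00"]
```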

@milseman
Mannequin

milseman mannequin commented Feb 16, 2018

@itaiferber, @parkera @phausler

What are the semantics of CharacterSet, exactly? The description, "A set of Unicode character values for use in search operations," implies that it can at least store Unicode scalar values, but this behavior makes it look like it can only store UTF-16 code units. Is this a bug, and can it be fixed, or is this the (undocumented) semantics of CharacterSet?

@phausler
Member

This is undocumented semantics where we just don't trap on invalid characters.

@milseman
Mannequin

milseman mannequin commented Feb 16, 2018

Could you define character? Is that restricted to BMP scalars?

@itaiferber
Contributor

@milseman CharacterSet and co work primarily in UTF-16 code units. CFString/NSString have APIs which work at higher levels, but the core of all of this is set historically in UTF-16 code units.
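
To make the distinction between Unicode scalars and UTF-16 code units concrete, here is a small standard-library-only sketch. The string is the one used throughout this thread; nothing here touches CharacterSet itself.

```swift
let sample = "a\u{1F600}"

// Two Unicode scalars...
print(sample.unicodeScalars.count) // 2

// ...but three UTF-16 code units, because U+1F600 needs a surrogate pair.
print(sample.utf16.count) // 3

// Number of UTF-16 code units needed for each scalar.
let widths = sample.unicodeScalars.map { UTF16.width($0) }
print(widths) // [1, 2]
```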

@milseman
Mannequin

milseman mannequin commented Feb 16, 2018

Can we at least document the fact that CharacterSet is only for use with BMP scalars?
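
Until such documentation exists, callers could check their input for scalars outside the BMP before handing a string to the String-based CharacterSet methods. A minimal sketch follows; `nonBMPScalars(in:)` is a hypothetical helper introduced here for illustration, not an existing API.

```swift
// Hypothetical helper: returns the scalars of `string` that lie outside
// the Basic Multilingual Plane (code point values above U+FFFF).
func nonBMPScalars(in string: String) -> [Unicode.Scalar] {
    return string.unicodeScalars.filter { $0.value > 0xFFFF }
}

let offenders = nonBMPScalars(in: "a\u{1F600}b")
print(offenders.map { "U+" + String($0.value, radix: 16, uppercase: true) }) // ["U+1F600"]
```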

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
@shahmishal shahmishal transferred this issue from apple/swift May 5, 2022