[SR-5992] String.Index.init?(_:within:) sometimes succeeds even if not grapheme-aligned #48549

Closed
ole opened this issue Sep 26, 2017 · 1 comment
Assignees
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. standard library Area: Standard library umbrella

Comments

@ole
Contributor

ole commented Sep 26, 2017

Previous ID: SR-5992
Radar: None
Original Reporter: @ole
Type: Bug
Status: Resolved
Resolution: Done

Environment:

Swift 4.0, Xcode 9.0, macOS 10.13 GM (17A362a)

Additional Detail from JIRA:
Votes: 0
Component/s: Standard Library
Labels: Bug
Assignee: @milseman
Priority: Medium

md5: 16ba556673580da05fa7c50fda72c6fb

Issue Description:

The documentation for String.Index.init?(_ sourcePosition: String.Index, within target: String) says:

If the index passed as sourcePosition represents the start of an extended grapheme cluster—the element type of a string—then the initializer succeeds.

But the initializer sometimes also succeeds when passed an index (say, from the UTF-16 or UTF-8 view) that is not aligned with the start of a grapheme cluster. This happens when the input string is an emoji ZWJ sequence.

Example:

let str = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f467}" // family of four

print("str.count: \(str.count)")
print("str.utf16.count: \(str.utf16.count)")
print("UTF-16 code units: \(str.utf16.map { "0x\(String($0, radix: 16))" })")
print("")

for (offset, utf16Index) in str.utf16.indices.enumerated() {
    let index = String.Index(utf16Index, within: str)
    print("\(offset): \(String(describing: index))")
}

This prints in Swift 4.0 on macOS 10.13:

str.count: 1
str.utf16.count: 11
UTF-16 code units: ["0xd83d", "0xdc69", "0x200d", "0xd83d", "0xdc69", "0x200d", "0xd83d", "0xdc67", "0x200d", "0xd83d", "0xdc67"]

0: Optional(Swift.String.Index(_compoundOffset: 0, _cache: Swift.String.Index._Cache.character(11)))
1: nil
2: nil
3: Optional(Swift.String.Index(_compoundOffset: 12, _cache: Swift.String.Index._Cache.character(8)))
4: nil
5: nil
6: Optional(Swift.String.Index(_compoundOffset: 24, _cache: Swift.String.Index._Cache.character(5)))
7: nil
8: nil
9: Optional(Swift.String.Index(_compoundOffset: 36, _cache: Swift.String.Index._Cache.character(2)))
10: nil

I'd expect only offset 0 to return a valid String.Index. All offsets from 1 to 10 should result in nil. The problem seems to be that the initializer only checks if the passed-in index constitutes a valid start of a grapheme cluster; it doesn't backtrack to check if there is a valid grapheme cluster boundary between the index position and the code point before it.

In this example, the initializer should see the ZWJ code points that precede the indices at offsets 3, 6 and 9 and return nil for each of them, because UAX #29 includes the rule "Do not break before extending characters or ZWJ."

Unfortunately, doing this correctly has an impact on performance.
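
To illustrate the kind of check I have in mind, here is a minimal sketch that uses only public API. The helper name isCharacterBoundary is hypothetical and not part of the standard library. It walks the string's Character boundaries until it reaches or passes the requested UTF-16 offset, so it is linear in the length of the string, which also shows where the performance cost comes from. It assumes that the runtime's grapheme breaking treats the whole ZWJ sequence as a single Character, as the str.count output above suggests.

```swift
// Hypothetical helper, not part of the standard library.
// Returns true if the given UTF-16 offset falls exactly on a Character boundary.
func isCharacterBoundary(utf16Offset: Int, in string: String) -> Bool {
    let utf16 = string.utf16
    guard utf16Offset >= 0, utf16Offset <= utf16.count else { return false }
    let target = utf16.index(utf16.startIndex, offsetBy: utf16Offset)

    // Step through Character boundaries until we reach or pass the target.
    var boundary = string.startIndex
    while boundary < target {
        boundary = string.index(after: boundary)
    }
    return boundary == target
}

let family = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f467}"
let aligned = (0..<family.utf16.count).filter {
    isCharacterBoundary(utf16Offset: $0, in: family)
}
print("Aligned UTF-16 offsets: \(aligned)")
```

Note that this sketch also treats the end of the string as a boundary, which may or may not be what a caller wants.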

Related links: This report was triggered by a Stack Overflow question and this Twitter discussion.

Note: an answer to the Stack Overflow question points out that the Foundation method rangeOfComposedCharacterSequence(at:) has the correct behavior and can be used as a workaround.
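
A rough sketch of how that workaround might look, assuming the NSString variant of the method, which takes a UTF-16 offset and returns an NSRange. An offset counts as aligned when the returned range starts at that same offset. The exact ranges depend on the Foundation version in use, so I am not listing expected output here.

```swift
import Foundation

let str = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f467}"
let nsString = NSString(string: str)

for offset in 0..<nsString.length {
    // Range of the composed character sequence that contains this UTF-16 offset.
    let cluster = nsString.rangeOfComposedCharacterSequence(at: offset)
    // The offset is aligned only if it is where that sequence begins.
    let isAligned = cluster.location == offset
    print("\(offset): cluster starts at \(cluster.location), length \(cluster.length), aligned: \(isAligned)")
}
```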

@ole
Contributor Author

ole commented Apr 1, 2018

Closing this as fixed in Swift 4.1. The test case I provided in the report now works correctly in Swift 4.1. I can't find the specific commit/PR that fixed this though.
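
For anyone who wants to re-check this on their own toolchain, the loop from the report can be condensed into the following sketch. It assumes Swift 4.1 or later and collects the UTF-16 offsets for which the initializer still returns a non-nil index. With the fix in place, only offset 0 should remain.

```swift
let str = "\u{1f469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f467}" // family of four

// Offsets whose UTF-16 index converts to a valid String.Index within str.
let alignedOffsets = str.utf16.indices.enumerated()
    .filter { String.Index($0.element, within: str) != nil }
    .map { $0.offset }

print("Offsets accepted by String.Index.init?(_:within:): \(alignedOffsets)")
```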

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.