[SR-9802] String.subscript using Range with misaligned Index created by UTF8View is not compatible between Swift 4.x and Swift 5 #52226

Closed
norio-nomura opened this issue Jan 30, 2019 · 11 comments
Labels: bug, regression, standard library, swift 5.0

Comments

@norio-nomura
Contributor

Previous ID SR-9802
Radar None
Original Reporter @norio-nomura
Type Bug
Status Resolved
Resolution Done
Environment

swift-5.0-DEVELOPMENT-SNAPSHOT-2019-01-28-a

Additional Detail from JIRA
Votes 1
Component/s Standard Library
Labels Bug, 5.0Regression
Assignee @milseman
Priority Medium

md5: 8de5a32c30085c0bf4d7d4a6037a1cc5

relates to:

  • SR-9820 String.subscript using Range with misaligned Index created by UTF16View is not compatible between Swift 4.x and Swift 5

Issue Description:

test.swift from forum post:

let café = "café"
let index = café.utf8.index(before: café.utf8.endIndex) // The last byte.
let before = String(café[..<index])
let after = String(café[index...])
print(before + after) // “café”
print(after.utf8.count) // 2 bytes?!?

The results differ between Swift 4.2.1 and 5.0:

$ cat test.swift|xcrun --toolchain org.swift.42120181030a swift
Welcome to Apple Swift version 4.2.1 (swift-4.2.1-RELEASE). Type :help for assistance.
café: String = "café"
index: String.UTF8View.Index = {
  _compoundOffset = 13
  _utf8Buffer = {
    _biasedBits = 43716
  }
  _graphemeStrideCache = 0
}
before: String = "caf"
after: String = "é"
café
2
$ cat test.swift|xcrun --toolchain org.swift.5020190128a swift
Welcome to Apple Swift version 5.0-dev (LLVM eb302e257f, Clang a113643bc4, Swift 8efbb50504).
Type :help for assistance.
café: String = "café"
index: String.UTF8View.Index = {
  _rawBits = 262144
}
before: String = "caf?
after: String = "\u{a9}"
café
1

Swift 4.2.1 is compatible with Swift 4.0.3 and 4.1.3.

Is this change expected?
/cc: @milseman

@lorentey
Member

The printout for after may indicate an additional issue – U+00A9 is the copyright symbol; what is it doing there?

after: String = "\u{a9}"

@ole
Contributor

ole commented Jan 30, 2019

@lorentey A9 is the last byte of the UTF-8 sequence, so that's where it's coming from:

café.utf8.map { String($0, radix: 16, uppercase: true) }
// ["63", "61", "66", "C3", "A9"]

But if you print after, it will (correctly) print U+FFFD (the replacement character for an ill-formed sequence), not the copyright symbol. Maybe the debugger is using a different way to show a String's contents?
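The byte layout described above can be checked directly. This is only a minimal sketch: the helper `isContinuationByte` is a hypothetical name introduced for illustration, and the string is spelled `"caf\u{E9}"` on the assumption that the original literal used the precomposed é.

```swift
// "caf\u{E9}" spells the precomposed é explicitly (an assumption about the
// original literal) so that the byte layout is unambiguous.
let cafe = "caf\u{E9}"
let bytes = Array(cafe.utf8)
print(bytes.map { String($0, radix: 16, uppercase: true) })
// ["63", "61", "66", "C3", "A9"]

// Hypothetical helper: a UTF-8 continuation byte has the bit pattern 10xxxxxx.
func isContinuationByte(_ byte: UInt8) -> Bool {
    return (byte & 0xC0) == 0x80
}

// The index from the report points at the final byte, 0xA9.
let lastIndex = cafe.utf8.index(before: cafe.utf8.endIndex)
let lastByte = cafe.utf8[lastIndex]
print(isContinuationByte(lastByte)) // true

// That position has no exact counterpart in the unicode scalar view.
let isMisaligned = lastIndex.samePosition(in: cafe.unicodeScalars) == nil
print(isMisaligned) // true
```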

@lilyball
Mannequin

lilyball mannequin commented Jan 30, 2019

It definitely seems like a bug that café[index...] can produce an invalid string that contains just a single UTF-8 continuation byte.

Beyond that, the printing of "\u{a9}" in the swift LLDB repl seems like a separate bug. I believe this is using a custom LLDB formatter, so that must be buggy in the presence of invalid UTF-8 sequences, but it seems the debugDescription implementation is also buggy because passing before and after to debugPrint produces

"cafÀ"
"©"

Granted, I'm not quite sure what this should print, because there's no way to write a string literal that would reproduce the original malformed string, but the best option is probably to just use U+FFFD replacement characters instead of this (the © appears to be interpreting the a9 byte as U+A9, but the À is mystifying; the UTF-8 byte here is c3 but À is U+C0).
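For comparison, here is a small sketch of what repairing with U+FFFD looks like when the same ill-formed bytes are decoded through `String(decoding:as:)`. It does not show what `debugDescription` does or should do; it only illustrates the replacement-character behavior suggested above.

```swift
// A lone continuation byte, as left over in `after`.
let loneContinuation: [UInt8] = [0xA9]
let repairedAfter = String(decoding: loneContinuation, as: UTF8.self)
print(repairedAfter.unicodeScalars.map { String($0.value, radix: 16) })
// ["fffd"]

// "caf" followed by a lead byte with no continuation, as left over in `before`.
let truncated: [UInt8] = [0x63, 0x61, 0x66, 0xC3]
let repairedBefore = String(decoding: truncated, as: UTF8.self)
print(repairedBefore.unicodeScalars.map { String($0.value, radix: 16) })
// ["63", "61", "66", "fffd"]
```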

@norio-nomura
Contributor Author

It seems String.UTF16View.index(before:) is also not compatible between Swift 4.x and Swift 5.

$ pbpaste
func printUnicodeScalars<S: StringProtocol>(_ string: S) {
    print(string.unicodeScalars.map {
        "\\u{\(String($0.value, radix: 16))}"
    }.joined())
}
// https://emojipedia.org/family-man-woman-girl-boy/
let family = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}"
printUnicodeScalars(family)

let utf16Index = family.utf16.index(before: family.utf16.endIndex)
let utf16Before = String(family[..<utf16Index])
printUnicodeScalars(utf16Before)

let utf16After = String(family[utf16Index...])
printUnicodeScalars(utf16After)

printUnicodeScalars(utf16Before + utf16After)
print(utf16After.utf16.count)
$ pbpaste|xcrun --toolchain org.swift.42120181030a swift -
\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}
\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{fffd}
\u{fffd}
\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}
1
$ pbpaste|xcrun --toolchain org.swift.5020190129a swift -
\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}
\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}
\u{1f466}
\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}
2
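As a rough illustration of why this index is misaligned: the last UTF-16 code unit of the family emoji is the trailing half of a surrogate pair, so the position before `endIndex` falls inside the scalar U+1F466. The sketch below only inspects the code units and does not subscript the string, so it should not depend on the behavior that differs between versions.

```swift
let family = "\u{1f468}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}"
let units = Array(family.utf16)
print(units.count) // 11: four surrogate pairs plus three zero-width joiners

// U+1F466 is encoded as the surrogate pair D83D DC66.
let lastUnit = units[units.count - 1]
print(String(lastUnit, radix: 16)) // dc66
print(UTF16.isTrailSurrogate(lastUnit)) // true

// The index before endIndex therefore sits in the middle of a scalar.
let utf16Index = family.utf16.index(before: family.utf16.endIndex)
let isMisaligned = utf16Index.samePosition(in: family.unicodeScalars) == nil
print(isMisaligned) // true
```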

@norio-nomura
Contributor Author

Filed a separate issue, https://bugs.swift.org/browse/SR-9820, for String.UTF16View.index(before:).

@norio-nomura
Contributor Author

UnicodeScalar boundaries are:

  • respected by:
    - `String.UTF8View.index(before:)` on Swift 4
    - `String.UTF16View.index(before:)` on Swift 5
  • not respected by:
    - `String.UTF16View.index(before:)` on Swift 4
    - `String.UTF8View.index(before:)` on Swift 5

The behavior was swapped at the same time the native encoding switched from UTF-16 to UTF-8.

@norio-nomura
Contributor Author

It seems I misunderstood.
The problem is how subscript handles a Range whose Index is not aligned to a Unicode scalar boundary, not index(before:).
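One way to sidestep the difference in caller code is to align the index to a scalar boundary before subscripting. The following is only a sketch under stated assumptions: `scalarAligned(_:in:)` is a hypothetical helper, not a standard library API, and it is written only for indices obtained from the UTF-8 view. It is not the fix that was applied to the standard library.

```swift
// Hypothetical helper: rounds a UTF-8 view index down to the nearest
// unicode scalar boundary by stepping back one byte at a time.
func scalarAligned(_ index: String.Index, in string: String) -> String.Index {
    var aligned = index
    while aligned > string.startIndex,
          aligned.samePosition(in: string.unicodeScalars) == nil {
        aligned = string.utf8.index(before: aligned)
    }
    return aligned
}

// "caf\u{E9}" spells the precomposed é explicitly (an assumption about the
// literal used in the report).
let cafe = "caf\u{E9}"
let rawIndex = cafe.utf8.index(before: cafe.utf8.endIndex) // points at 0xA9
let index = scalarAligned(rawIndex, in: cafe)              // points at 0xC3

let before = String(cafe[..<index])
let after = String(cafe[index...])
print(before)           // caf
print(after)            // é
print(after.utf8.count) // 2
```

Because both range bounds are scalar-aligned here, the slices contain only well-formed UTF-8 and the result should not depend on how a given Swift version treats misaligned indices.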

@belkadan
Contributor

belkadan commented Apr 2, 2019

Did this ever get resolved? @lorentey, @milseman

@milseman
Mannequin

milseman mannequin commented Apr 2, 2019

Not yet, I can look at it this week.

@milseman
Mannequin

milseman mannequin commented Apr 10, 2019

(I am investigating this and a whole slew of issues as part of https://bugs.swift.org/browse/SR-10124)

@milseman
Mannequin

milseman mannequin commented Jun 27, 2019

#23834

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
This issue was closed.

5 participants