Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-13852] Diagnostics calculates column numbers incorrectly for non-ASCII source lines #56250

Open
WowbaggersLiquidLunch opened this issue Nov 13, 2020 · 8 comments
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. compiler The Swift compiler in itself diagnostics QoI Bug: Diagnostics Quality of Implementation

Comments

@WowbaggersLiquidLunch
Copy link
Collaborator

Previous ID SR-13852
Radar rdar://problem/71385575
Original Reporter @WowbaggersLiquidLunch
Type Bug

Attachment: Download

Environment

macOS 10.15.7 (19H15)

Apple Swift version 5.3.2 (swiftlang-1200.0.44.1 clang-1200.0.32.28)

Additional Detail from JIRA
Votes 0
Component/s Compiler
Labels Bug, DiagnosticsQoI
Assignee None
Priority Medium

md5: a9c18eeec75c2236196435a189405308

Issue Description:

When compiling this buggy source (also attached to this issue):

let abc: String
let abĉ: String
let abê: String
let 甲乙丙: String
let abc: String
    abc = 0   // indented with a character tabulation (U+0009)
abĉ = 0           // ĉ is U+0063 followed by U+0302
abê = 0           // ê is U+00EA
甲乙丙 = 0           // 甲 is U+7532; 乙 is U+4E59; 丙 is U+4E19
abc = 0           // a is U+FF41; b is U+FF42; c is U+FF43

the compiler gives the following diagnosis:

column.swift:6:8: error: cannot assign value of type 'Int' to type 'String'
        abc = 0   // indented with a character tabulation (U+0009)
              ^
column.swift:7:9: error: cannot assign value of type 'Int' to type 'String'
abĉ = 0           // ĉ is U+0063 followed by U+0302
column.swift:8:8: error: cannot assign value of type 'Int' to type 'String'
abê = 0           // ê is U+00EA
column.swift:9:13: error: cannot assign value of type 'Int' to type 'String'
甲乙丙 = 0           // 甲 is U+7532; 乙 is U+4E59; 丙 is U+4E19
column.swift:10:13: error: cannot assign value of type 'Int' to type 'String'
abc = 0           // a is U+FF41; b is U+FF42; c is U+FF43

Only the column number for line 6 is calculated correctly.

Column number for line 7 is offset by 2.

Column number for line 8 is offset by 1.

Column numbers for line 9 and 10 are offset by 6 if each character takes 1 column, or offset by 3 if each each character takes 2 columns, or correct if each character takes 3 columns.

@typesanitizer
Copy link

@swift-ci create

@swift-ci
Copy link
Collaborator

Comment by Owen Voorhees (JIRA)

To provide some background on this, the "column" numbers in diagnostics are currently just byte offsets from the start of the line, so they make a lot less sense for any lines that aren't 100% ASCII. There are some grapheme breaking algorithms in the compiler we might be able to use to get a more accurate character count, but we don't want to add a full ICU dependency to the compiler and there isn't a 1:1 mapping between characters and columns, so improving this could be difficult.

@WowbaggersLiquidLunch
Copy link
Collaborator Author

the "column" numbers in diagnostics are currently just byte offsets from the start of the line

This explains the amount each column number is offset by in the example. Thanks for providing this background information!

we don't want to add a full ICU dependency to the compiler

What are the disadvantages to adding a full ICU dependency to the compiler? I thought the compiler already assumed the existence of the ICU library so that Swift's strings and characters could be Unicode-compliant. Or was it just the standard library?

there isn't a 1:1 mapping between characters and columns

IMO this might be the biggest hurdle to column-counting. I think we can probably safely assume a 1:1 mapping for accented Latin characters, and maybe Greek and Cyrillic as well. CJK characters are a lot more difficult, because different terminal emulators, text editors, and typefaces render them at different widths. Whitespace characters are also problematic with all the different widths they have. And these are just a tiny fraction of all Unicode characters, although likely among the most common ones in Swift sources.

@typesanitizer
Copy link

Even with ICU, column count doesn't make sense for non-ASCII characters (and things like tabs). For non-ASCII characters, any meaningful interpretation of column count depends on the choice of font.

We could use the East Asian width based on UAX #11 but that still falls over in certain cases. Here's an example from the https://github.com/unicode-rs/unicode-width README:

extern crate unicode_width;
use unicode_width::UnicodeWidthStr;

fn main() {
    assert_eq!(UnicodeWidthStr::width("�"), 2); // Woman
    assert_eq!(UnicodeWidthStr::width("�"), 2); // Microscope
    assert_eq!(UnicodeWidthStr::width("�‍�"), 4); // Woman scientist
}

The standard library uses ICU; the compiler doesn't. I suspect if we were to add it as a dependency, the "right" fix would be to add it to LLVM instead of adding it to Swift directly, but that would be a hard sell given that LLVM has a minimal dependency footprint and ICU is a large library.

It would almost certainly be much simpler to write an implementation of UAX #11 in C++ and add that to the Swift compiler. Looking at the Rust package, it looks like ~600 LoC including the tables and tests.

@benlangmuir
Copy link
Member

Due to the issues that Varun mentioned, there is no unambiguously correct way to return "display columns". The width is going to depend on your platform and possibly on the specific editors or other tools and configurations you are using the view your source code.

Moreover, diagnostic column locations are most often interpreted by tools that highlight, jump to the location, or otherwise interact with it programatically. For example, if you paste the test case here into Xcode, or with an LSP-compatible editor using sourcekit-lsp, it will highlight the correct column in each case.

For tooling, it is better for us to be predictable (avoiding the ambiguities around display width) and backwards compatible. Another factor is how much complexity we are imposing on other tools. Any tool that handles utf-8 should be able to convert a utf-8 offset into whatever kind of position they want to use internally, but if we started using a different notion of column, we might impose much higher complexity.

It seems reasonable to provide a compiler option to opt-in to a different behaviour if you have a specific use case for it, but I think the current behaviour is correct by default.

@benlangmuir
Copy link
Member

I guess one option I didn't mention: we could possibly change the behaviour only for what we print to stderr/stdout, but leave serialized diagnostics and sourcekitd using the existing behaviour. The biggest downside would be that many simpler tools scan for diagnostics in build logs etc. rather than using serialized diagnostics, so it could still break compatibility for those tools.

@typesanitizer
Copy link

Hmm, I thought this wouldn't work with LSP since it uses UTF-16 based offsets instead of UTF-8 based offsets: microsoft/language-server-protocol#376

Maybe I'm misunderstanding something.

@benlangmuir
Copy link
Member

SourceKit-LSP converts between UTF-8 and UTF-16 as needed.

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. compiler The Swift compiler in itself diagnostics QoI Bug: Diagnostics Quality of Implementation
Projects
None yet
Development

No branches or pull requests

4 participants