Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SR-1280] Unicode conformance readLine #43888

Open
swift-ci opened this issue Apr 21, 2016 · 11 comments
Open

[SR-1280] Unicode conformance readLine #43888

swift-ci opened this issue Apr 21, 2016 · 11 comments
Labels
improvement standard library Area: Standard library umbrella

Comments

@swift-ci
Copy link
Collaborator

Previous ID SR-1280
Radar rdar://problem/20013999
Original Reporter tinysun (JIRA User)
Type Improvement
Additional Detail from JIRA
Votes 1
Component/s Standard Library
Labels Improvement
Assignee None
Priority Medium

md5: 001f9dad12b054615ab07d0ec9cf6287

Issue Description:

readLine() is implemented in InputStream.swift, and it contains following comment.

// FIXME: Unicode conformance. To fix this, we need to reimplement the
// code we call above to get a line, since it will only stop on LF.
//
// <rdar://problem/20013999> Recognize Unicode newlines in readLine()
//
// Recognize only LF and CR+LF combinations for now.

I cannot access rdar://problem/*, so posted here.

@belkadan
Copy link
Contributor

The Radar contains this additional information:

See http://unicode.org/reports/tr14/

BK: Mandatory Break (A) (Non-tailorable)
LF: Line Feed (A) (Non-tailorable)
NL: Next Line (A) (Non-tailorable)

See also the section 5.8, Newline Guidelines, in the Unicode spec.

@swift-ci
Copy link
Collaborator Author

Comment by Han Sangjin (JIRA)

I just read the section 5.8, Newline Guidelines. It has superseeded UAX #13 http://unicode.org/reports/tr13, and I think it has a explicit information about the behavior of the readline.

But I'm confused about http://unicode.org/reports/tr14/, LINE BREAKING ALGORITHM.

Copied from the summary of the tr14,
The line breaking algorithm produces a set of "break opportunities", or positions that would be suitable for wrapping lines when preparing text for display.

I think the LINE BREAKING ALGORITHM is only for display, not for readline.

Could you confirm this ?

@swift-ci
Copy link
Collaborator Author

Comment by Han Sangjin (JIRA)

I stopped this task.

When I started, I hoped a common readLine() code which is platform independent. And I wrote a Unicode Newline recognizer in C++ which used getc(stdin) and could be run on Windows and Linux. Its function worked, it stopped at NLF, LS, FF, or PS. (NLF is CR, LF, CRLF, NEL)

But the speed was very slower than old code that used getline(), it was about 9x slower. (loop readLine for 2MB binary file, Linux)

The root cause was not the increased newline code. The getline() scanned the '\n' in the internal buffer using memchr(), while my code calling the getc() for each bytes.

To meet a similar performance, in my opinion, it should be implemented in the low level which access the buffer and is dependent on platform.

Any ideas are welcome.

@milseman
Copy link
Mannequin

milseman mannequin commented Jan 2, 2019

I'm not sure if this should be done. On the one hand, it makes sense to change the implementation to be consistent with Character or Unicode.Scalar's notion of newlines. On the other hand, this is reading from stdin (i.e. from the shell) and it would make sense to read until encountering the platform-specific line delimiter, for consistency with the shell.

@benrimmington
Copy link
Collaborator

@@milseman The recommendations in §5.8 Newline Guidelines of the Unicode Standard include:

Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them.

and

R4 A readline function should stop at NLF, LS, FF, or PS. In the typical implementation, it does not include the NLF, LS, PS, or FF that caused it to stop.

My pull request at apple/swift#21586 follows this recommendation, and also stops at VT to match the Character.isNewline property.

Java has System.getProperties("line.separator") to access the platform-specific NLF (i.e. LF, CR, CRLF, or NEL). Is there an equivalent in the standard C library? Should the print(_:separator:terminator:) function also use NLF by default?

@milseman
Copy link
Mannequin

milseman mannequin commented Jan 4, 2019

@compnerd, what approach do you think would make sense for Windows?

@benrimmington
Copy link
Collaborator

Windows uses text mode by default for stdin and stdout.

Can we require text mode for the print and readLine APIs, in addition to the existing UTF-8 requirement?

@milseman
Copy link
Mannequin

milseman mannequin commented Jan 7, 2019

That sounds great to me; IIUC, the whole point of these facilities is for interacting with the system/command-line. But I can see how others might disagree and I'd like to see some discussion. Could you open a post on the forums under Development/Standard Library?

@milseman
Copy link
Mannequin

milseman mannequin commented Jan 7, 2019

Thanks for following up and chasing this down, BTW!

@benrimmington
Copy link
Collaborator

@benrimmington
Copy link
Collaborator

I've closed my pull request at apple/swift#21586. There were two possible implementations:

  • 49c342e hard-coded the newline sequences.
  • 641ed37 used the `isNewline` property.

@swift-ci swift-ci transferred this issue from apple/swift-issues Apr 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement standard library Area: Standard library umbrella
Projects
None yet
Development

No branches or pull requests

3 participants