[SR-1280] Unicode conformance readLine #43888

swift-ci · 2016-04-21T08:55:34Z


Previous ID	SR-1280
Radar	rdar://problem/20013999
Original Reporter	tinysun (JIRA User)
Type	Improvement

Additional Detail from JIRA


Votes	1
Component/s	Standard Library
Labels	Improvement
Assignee	None
Priority	Medium

md5: 001f9dad12b054615ab07d0ec9cf6287

Issue Description:

readLine() is implemented in InputStream.swift, and it contains following comment.

// FIXME: Unicode conformance. To fix this, we need to reimplement the
// code we call above to get a line, since it will only stop on LF.
//
// <rdar://problem/20013999> Recognize Unicode newlines in readLine()
//
// Recognize only LF and CR+LF combinations for now.

I cannot access rdar://problem/*, so posted here.

belkadan · 2016-04-21T16:34:27Z

The Radar contains this additional information:

See http://unicode.org/reports/tr14/

BK: Mandatory Break (A) (Non-tailorable)
LF: Line Feed (A) (Non-tailorable)
NL: Next Line (A) (Non-tailorable)

See also the section 5.8, Newline Guidelines, in the Unicode spec.

swift-ci · 2016-04-21T22:42:47Z

Comment by Han Sangjin (JIRA)

I just read the section 5.8, Newline Guidelines. It has superseeded UAX #13 http://unicode.org/reports/tr13, and I think it has a explicit information about the behavior of the readline.

But I'm confused about http://unicode.org/reports/tr14/, LINE BREAKING ALGORITHM.

Copied from the summary of the tr14,
The line breaking algorithm produces a set of "break opportunities", or positions that would be suitable for wrapping lines when preparing text for display.

I think the LINE BREAKING ALGORITHM is only for display, not for readline.

Could you confirm this ?

swift-ci · 2016-04-24T09:25:30Z

Comment by Han Sangjin (JIRA)

I stopped this task.

When I started, I hoped a common readLine() code which is platform independent. And I wrote a Unicode Newline recognizer in C++ which used getc(stdin) and could be run on Windows and Linux. Its function worked, it stopped at NLF, LS, FF, or PS. (NLF is CR, LF, CRLF, NEL)

But the speed was very slower than old code that used getline(), it was about 9x slower. (loop readLine for 2MB binary file, Linux)

The root cause was not the increased newline code. The getline() scanned the '\n' in the internal buffer using memchr(), while my code calling the getc() for each bytes.

To meet a similar performance, in my opinion, it should be implemented in the low level which access the buffer and is dependent on platform.

Any ideas are welcome.

milseman · 2019-01-02T22:51:55Z

I'm not sure if this should be done. On the one hand, it makes sense to change the implementation to be consistent with Character or Unicode.Scalar's notion of newlines. On the other hand, this is reading from stdin (i.e. from the shell) and it would make sense to read until encountering the platform-specific line delimiter, for consistency with the shell.

benrimmington · 2019-01-03T22:28:08Z

@@milseman The recommendations in §5.8 Newline Guidelines of the Unicode Standard include:

Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them.

and

R4 A readline function should stop at NLF, LS, FF, or PS. In the typical implementation, it does not include the NLF, LS, PS, or FF that caused it to stop.

My pull request at apple/swift#21586 follows this recommendation, and also stops at VT to match the Character.isNewline property.

Java has System.getProperties("line.separator") to access the platform-specific NLF (i.e. LF, CR, CRLF, or NEL). Is there an equivalent in the standard C library? Should the print(_:separator:terminator:) function also use NLF by default?

milseman · 2019-01-04T01:03:20Z

@compnerd, what approach do you think would make sense for Windows?

benrimmington · 2019-01-05T13:53:07Z

Windows uses text mode by default for stdin and stdout.

Can we require text mode for the print and readLine APIs, in addition to the existing UTF-8 requirement?

milseman · 2019-01-07T19:26:50Z

That sounds great to me; IIUC, the whole point of these facilities is for interacting with the system/command-line. But I can see how others might disagree and I'd like to see some discussion. Could you open a post on the forums under Development/Standard Library?

milseman · 2019-01-07T19:27:20Z

Thanks for following up and chasing this down, BTW!

benrimmington · 2019-01-08T18:55:23Z

https://forums.swift.org/t/newlines-in-text-streams/19346

benrimmington · 2019-02-05T22:49:54Z

I've closed my pull request at apple/swift#21586. There were two possible implementations:

49c342e hard-coded the newline sequences.

641ed37 used the `isNewline` property.

swift-ci transferred this issue from apple/swift-issues Apr 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SR-1280] Unicode conformance readLine #43888

[SR-1280] Unicode conformance readLine #43888

swift-ci commented Apr 21, 2016

belkadan commented Apr 21, 2016

swift-ci commented Apr 21, 2016

swift-ci commented Apr 24, 2016

milseman mannequin commented Jan 2, 2019

benrimmington commented Jan 3, 2019

milseman mannequin commented Jan 4, 2019

benrimmington commented Jan 5, 2019

milseman mannequin commented Jan 7, 2019

milseman mannequin commented Jan 7, 2019

benrimmington commented Jan 8, 2019

benrimmington commented Feb 5, 2019

[SR-1280] Unicode conformance readLine #43888

[SR-1280] Unicode conformance readLine #43888

Comments

swift-ci commented Apr 21, 2016

belkadan commented Apr 21, 2016

swift-ci commented Apr 21, 2016

swift-ci commented Apr 24, 2016

milseman mannequin commented Jan 2, 2019

benrimmington commented Jan 3, 2019

milseman mannequin commented Jan 4, 2019

benrimmington commented Jan 5, 2019

milseman mannequin commented Jan 7, 2019

milseman mannequin commented Jan 7, 2019

benrimmington commented Jan 8, 2019

benrimmington commented Feb 5, 2019