We use ICU heavily for normalization, and doing so efficiently is a source of considerable stdlib complexity (more complexity than just implementing the algorithm). If we have efficient access to the data tables, we should just implement this ourselves.
Our fast-paths check NFC_QC=yes and hasCompBoundaryBefore heavily. Bouncing over to ICU for each query incurs a hefty performance cost compared to checking locally. A local trie-like structure keyed by Unicode.Scalar that can answer these queries efficiently would alleviate this.
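As a rough illustration of what such a local structure could look like, here is a minimal sketch of a two-stage lookup table keyed by scalar value. Every name in it (`NormFlags`, `ScalarTrie`, `fallback`, `sampleTrie`) is hypothetical, and the three exception entries are a hand-picked sample (U+0300 and U+0301 are NFC_QC=Maybe, U+0340 is NFC_QC=No, and all three combine with a preceding base), not the real Unicode data tables.

```swift
// Hypothetical sketch, not stdlib API: per-scalar flags packed into one byte.
struct NormFlags: OptionSet {
  let rawValue: UInt8
  static let nfcQCYes = NormFlags(rawValue: 1 << 0)
  static let compBoundaryBefore = NormFlags(rawValue: 1 << 1)
  static let fastPath: NormFlags = [.nfcQCYes, .compBoundaryBefore]
}

// Two-stage table: the high bits of a scalar select a 256-entry block,
// the low 8 bits select the entry. Blocks without exceptions all share block 0.
struct ScalarTrie {
  static let blockSize = 256
  private var index: [UInt16]   // (scalar >> 8) -> block number
  private var blocks: [UInt8]   // concatenated 256-entry blocks

  init(fallback: NormFlags, exceptions: [UInt32: NormFlags]) {
    index = [UInt16](repeating: 0, count: 0x110000 >> 8)
    blocks = [UInt8](repeating: fallback.rawValue, count: Self.blockSize)
    for (value, flags) in exceptions {
      precondition(value < 0x110000, "not a valid scalar value")
      let high = Int(value >> 8)
      if index[high] == 0 {
        // First exception in this range: allocate a dedicated block.
        index[high] = UInt16(blocks.count / Self.blockSize)
        blocks.append(contentsOf: repeatElement(fallback.rawValue, count: Self.blockSize))
      }
      blocks[Int(index[high]) * Self.blockSize + Int(value & 0xFF)] = flags.rawValue
    }
  }

  // Two array loads per query, no call out to ICU.
  func flags(for scalar: Unicode.Scalar) -> NormFlags {
    let v = scalar.value
    let block = Int(index[Int(v >> 8)])
    return NormFlags(rawValue: blocks[block * Self.blockSize + Int(v & 0xFF)])
  }
}

// Sample data only: a real table would be generated from the Unicode data files.
let sampleTrie = ScalarTrie(
  fallback: .fastPath,
  exceptions: [0x0300: [], 0x0301: [], 0x0340: []]
)
```

With this layout, a fast-path check reduces to `sampleTrie.flags(for: scalar) == .fastPath`. A production table would likely be generated at build time and stored as static data rather than built at runtime.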
Using ICU for normalization involves transcoding UTF-8 to UTF-16 and back. This is costly and is another source of complexity. For example, we need many growable buffers of different code-unit widths, each with its own conservative growth reservation factor.
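To illustrate the kind of check that becomes possible when working on UTF-8 directly, here is a minimal sketch. It relies on two facts: U+0300 is the first scalar whose NFC_QC value is not Yes, and it encodes in UTF-8 with lead byte 0xCC. Since continuation bytes (0x80...0xBF) are also below 0xCC, a buffer whose bytes are all below 0xCC contains only scalars below U+0300 and is therefore already in NFC. The function name `isTriviallyNFC` is hypothetical.

```swift
// Hypothetical helper: a conservative pre-check that runs on UTF-8 code units
// without transcoding to UTF-16 and without allocating any buffers.
// Assumes `bytes` is valid UTF-8.
func isTriviallyNFC<S: Sequence>(_ bytes: S) -> Bool where S.Element == UInt8 {
  // Lead bytes 0x00...0xCB encode scalars below U+0300.
  for byte in bytes where byte >= 0xCC {
    return false  // Might still be NFC; the caller must take the slower path.
  }
  return true
}
```

A `false` result does not mean the input is unnormalized, only that this particular shortcut does not apply.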
We'd like fast-paths for languages with combining characters. Scalar-based queries can only fast-path single-scalar segments, and ICU's implementation of the multi-scalar quick-check (QC) algorithm operates on UTF-16.
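For reference, the multi-scalar quick check described in UAX #15 can be sketched natively over a string's scalars, with no UTF-16 involved. This is a sketch under stated assumptions, not the stdlib's or ICU's implementation: `QuickCheckResult`, `nfcQuickCheck`, and `sampleLookup` are hypothetical names, and `sampleLookup` covers only four hand-picked scalars (U+0300, U+0301, and U+0323 are NFC_QC=Maybe; U+0340 is NFC_QC=No), treating everything else as Yes.

```swift
enum QuickCheckResult { case yes, no, maybe }

// Quick check in the style of UAX #15: scan once, tracking the previous
// canonical combining class to detect misordered combining marks.
func nfcQuickCheck(
  _ scalars: String.UnicodeScalarView,
  lookup: (Unicode.Scalar) -> QuickCheckResult
) -> QuickCheckResult {
  var lastCCC: UInt8 = 0
  var result = QuickCheckResult.yes
  for scalar in scalars {
    let ccc = scalar.properties.canonicalCombiningClass.rawValue
    // Combining marks out of canonical order can never be normalized text.
    if ccc != 0 && lastCCC > ccc { return .no }
    switch lookup(scalar) {
    case .no: return .no
    case .maybe: result = .maybe
    case .yes: break
    }
    lastCCC = ccc
  }
  return result
}

// Sample data only: a real lookup would consult the full NFC_QC table.
func sampleLookup(_ scalar: Unicode.Scalar) -> QuickCheckResult {
  switch scalar.value {
  case 0x0300, 0x0301, 0x0323: return .maybe
  case 0x0340: return .no
  default: return .yes
  }
}
```

A `.yes` result means the whole segment can skip normalization; `.maybe` and `.no` both fall back to the full algorithm.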
Additional Detail from JIRA
md5: 2e7cb865a995734ac55ca28ea9e21299
Parent-Task: