[GTALUG] string representation is tricky
Stewart C. Russell
scruss at gmail.com
Wed Apr 17 16:36:42 EDT 2019
On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
>
> Java, Python before 3, Javascript, Microsoft's C and C++, the Joliet
> filesystem, the NT File system, and many other things use UTF-8.
You meant to say UTF-16?
Collation is difficult in anything but the simplistic "ASCIIbetical"
case. People expect natural sort orders now, with '10' coming after ' 9'
and case being of lesser importance. Once you get outside English*,
things get much more delightful. In Welsh, for instance, 'ff' and 'll'
sort as separate letters from 'f' and 'l', but an initial 'ng' sorts as a
'g' as it's merely an inflected form. Capitalization is a whole
different horror and left as an exercise for the reader. Suffice it to say,
an initial 'ff' (as in the rare Welsh/English surnames ffrench and
ffinch) is never capitalized.
Stewart
*: difficult, because we assimilate everything, accents and all.