[GTALUG] string representation is tricky

Stewart C. Russell scruss at gmail.com
Wed Apr 17 16:36:42 EDT 2019


On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
> 
> Java, Python before 3, JavaScript, Microsoft's C and C++, the Joliet
> filesystem, the NT file system, and many other things use UTF-8.

You meant to say UTF-16?

Collation is difficult in anything but the simplistic "ASCIIbetical"
case. People expect natural sort orders now, with '10' coming after '9'
and case being of lesser importance. Once you get outside English*,
things get much more delightful. In Welsh, for instance, 'ff' and 'll'
are letters in their own right and sort separately from 'f' and 'l', but
an initial 'ng' sorts as a 'g', as it's merely a mutated form of it.
Capitalization is a whole
different horror and left as an exercise for the reader. Suffice to say,
an initial 'ff' (as in the rare Welsh/English surnames ffrench and
ffinch) is never capitalized.
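
To make the two sorting ideas above concrete, here is a rough Python
sketch. It is an illustration only, not how any real collation library
does it: natural_key and welsh_key are hypothetical helper names, the
alphabet table is typed in by hand, and the greedy digraph matching is a
simplifying assumption that gets words like "Bangor" wrong (there the
'n' and 'g' are separate letters). The initial 'ng' rule is not handled.
Real code would lean on CLDR/ICU collation data instead.

```python
import re

def natural_key(s):
    """Sort key: digit runs compare as numbers, text compares caselessly."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions are always digit runs.
    parts = re.split(r"([0-9]+)", s.strip())
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]

# Welsh alphabet with its digraphs as single letters (hand-typed table).
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i",
    "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th",
    "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ALPHABET)}

def welsh_key(word):
    """Sort key: greedily read digraphs, then rank each letter."""
    w = word.casefold()
    key = []
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # Assumption: any adjacent pair that spells a digraph is one.
            key.append(RANK[pair])
            i += 2
        else:
            # Letters outside the table sort after the alphabet.
            key.append(RANK.get(w[i], len(WELSH_ALPHABET) + ord(w[i])))
            i += 1
    return key

names = ["file10", "file9", "File2"]
print(sorted(names))                   # ['File2', 'file10', 'file9']
print(sorted(names, key=natural_key))  # ['File2', 'file9', 'file10']

words = ["ffa", "fy", "llan", "lwc", "fel"]
print(sorted(words))                   # ['fel', 'ffa', 'fy', 'llan', 'lwc']
print(sorted(words, key=welsh_key))    # ['fel', 'fy', 'ffa', 'lwc', 'llan']
```

Note how 'ffa' moves after 'fy' and 'llan' after 'lwc' once the digraphs
are ranked as letters of their own.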

 Stewart

*: difficult, because we assimilate everything, accents and all.

