[GTALUG] string representation is tricky
D. Hugh Redelmeier
hugh at mimosa.com
Wed Apr 17 18:44:52 EDT 2019
| From: Stewart C. Russell via talk <talk at gtalug.org>
| On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
| >
| > Java, Python before 3, JavaScript, Microsoft's C and C++, the Joliet
| > filesystem, the NT file system, and many other things use UTF-8.
|
| You meant to say UTF-16?
Yes. Thanks!
[the rest is for orthographic nerds only]
| Collation is difficult
Yes.
And even just string equality is difficult: two strings can look identical
on screen yet differ in their code points. That has security implications.
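
To make the equality point concrete, here is a minimal Python sketch. The helper name same_text is made up for illustration, and NFC is only one of several normalization forms one might pick. It shows two spellings of the same word that compare unequal until normalized, and a look-alike pair that normalization does not unify, which is roughly where the security trouble starts.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "caf\u00e9"     # e-acute as one code point, U+00E9
decomposed = "cafe\u0301"  # plain e followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: canonically equivalent

# Look-alike letters from different scripts are NOT unified by normalization.
latin = "paypal"
mixed = "p\u0430ypal"  # second letter is U+0430 CYRILLIC SMALL LETTER A
print(same_text(latin, mixed))          # False, although they render alike
```

Normalization handles canonical equivalence only; catching visually confusable strings needs a separate check.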
| Once you get outside English*,
| things get much more delightful.
| *: difficult, because we assimilate everything, accents and all.
Including Scots, accent and all :-)
| In Welsh, for instance, 'ff' and 'll'
| sort as different codepoints to f and l, but an initial 'ng' sorts as a
| 'g' as it's merely an inflected form. Capitalization is a whole
| different horror and left as an exercise for the reader. Suffice to say,
| an initial 'ff' (as in the rare Welsh/English surnames ffrench and
| ffinch) is never capitalized.
And that's not all ffolkes!
But Jasper Fforde, apparently.
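
The Welsh digraph point quoted above can be sketched as a sort key in Python. This is a toy under stated assumptions, not real Welsh collation: the function name welsh_key is made up, digraphs are matched greedily (which is wrong for words where n and g are separate letters), accented vowels are not handled, and the initial 'ng' mutation rule from the quote is not modelled at all.

```python
# Welsh alphabet order; each digraph counts as a single letter.
WELSH = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
         "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
         "t", "th", "u", "w", "y"]
RANK = {letter: i for i, letter in enumerate(WELSH)}

def welsh_key(word):
    """Hypothetical sort key: split into Welsh letters, greedily preferring digraphs."""
    w = word.lower()
    key = []
    i = 0
    while i < len(w):
        pair = w[i:i + 2]
        if pair in RANK:            # two characters that form one Welsh letter
            key.append(RANK[pair])
            i += 2
        else:                       # single letter; unknown characters sort last
            key.append(RANK.get(w[i], len(WELSH)))
            i += 1
    return key

words = ["llan", "lori", "ffa", "fory", "lwc"]
print(sorted(words))                 # ['ffa', 'fory', 'llan', 'lori', 'lwc']
print(sorted(words, key=welsh_key))  # ['fory', 'ffa', 'lori', 'lwc', 'llan']
```

With the key, everything starting with f sorts before anything starting with ff, and likewise for l and ll, which plain code point order gets wrong.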
As far as I know, the idea of upper-case doesn't apply to most languages.
Of course other languages have distinctions that we're not used to. Think
of all the forms of each letter in Arabic.
One Unicode surprise: it has a capital scharfes S. Wikipedia says:
In 2017, the Council for German Orthography ultimately adopted
capital ß (ẞ) into German orthography, ending a
long orthographic debate.[4]
<https://en.wikipedia.org/wiki/%C3%9F>
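
For the curious, this is how the two sharp-S code points behave under Python's built-in case methods. It is a small sketch of Python's behaviour only; other languages and libraries may map these differently.

```python
small = "\u00df"    # ß LATIN SMALL LETTER SHARP S
capital = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S

print(small.upper())       # 'SS': upper-casing does not produce the capital form
print(capital.lower())     # 'ß'
print(small.casefold())    # 'ss'
print(capital.casefold())  # 'ss'

# Case-insensitive comparison should use casefold(), not lower().
print("STRASSE".casefold() == "stra\u00dfe".casefold())  # True
print("STRASSE".lower() == "stra\u00dfe".lower())        # False

# Round-tripping through upper() and lower() changes the spelling.
print("stra\u00dfe".upper().lower())  # 'strasse'
```

Note that upper() changes the length of the string, so code that assumes case conversion is one character in, one character out will break.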
In English, certain "s" letters were written in a way that looks like an f
to us (but the cross is missing or different). I remember thinking "King
Charles the Fecond" was a witty pun (Spring Thaw, 1967). This seems to be
related to the scharfes S.
<https://en.wikipedia.org/wiki/Long_s>
Have a look at the contrasting Britannica pages.
A Google search for "charles the fecond" gets me lots of books.google.* hits
for books that have been OCRed incorrectly: the long s has been taken as an f.
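
The long s has its own code point, U+017F, so text that preserves it will not match a search for the modern spelling unless it is folded first. A minimal Python sketch, assuming compatibility normalization (NFKC) is acceptable for the search in question:

```python
import unicodedata

long_s = "\u017f"  # ſ LATIN SMALL LETTER LONG S
text = "Charles the " + long_s + "econd"

print(text == "Charles the second")         # False: U+017F is not 's'
print(unicodedata.normalize("NFKC", text))  # Charles the second
print(long_s.casefold())                    # 's'
print(long_s.upper())                       # 'S'
```

This only helps when the long s was transcribed as U+017F. The OCR errors described above produced a real letter f, and no Unicode folding will turn "Fecond" back into "Second".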