[GTALUG] string representation is tricky

D. Hugh Redelmeier hugh at mimosa.com
Wed Apr 17 18:44:52 EDT 2019


| From: Stewart C. Russell via talk <talk at gtalug.org>

| On 2019-04-16 10:36 p.m., D. Hugh Redelmeier via talk wrote:
| > 
| > Java, Python before 3, Javascript, Microsoft's C and C++, the Joliet 
| > filesystem, the NT File system, and many other things use UTF-8.
| 
| You meant to say UTF-16?

Yes.  Thanks!

[the rest is for orthographic nerds only]

| Collation is difficult

Yes.

And even just string equality.  And there are security implications
here.
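
A minimal Python sketch of why equality is hard, using only the standard
unicodedata module.  The helper name same_text and the sample strings are
made up for illustration; NFC is one reasonable choice of normalization
form, not the only one.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare after NFC normalization so canonically equivalent strings match."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # plain e followed by COMBINING ACUTE ACCENT

# Both render identically, but a naive comparison sees different code points.
naive_equal = composed == decomposed
normalized_equal = same_text(composed, decomposed)

# The security side: look-alike letters from another script are NOT
# unified by normalization, so they still compare unequal.
latin = "paypal"
mixed = "p\u0430ypal"       # second letter is CYRILLIC SMALL LETTER A
lookalike_equal = same_text(latin, mixed)

print(naive_equal, normalized_equal, lookalike_equal)
```

So normalization handles "same characters, different encoding", but it
does nothing for strings that only look the same to a human.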

| Once you get outside English*,
| things get much more delightful.

| *: difficult, because we assimilate everything, accents and all.

Including Scots, accent and all :-)

| In Welsh, for instance, 'ff' and 'll'
| sort as different codepoints to f and l, but an initial 'ng' sorts as a
| 'g' as it's merely an inflected form. Capitalization is a whole
| different horror and left as an exercise for the reader. Suffice to say,
| an initial 'ff' (as in the rare Welsh/English surnames ffrench and
| ffinch) is never capitalized.

And that's not all ffolkes!

But Jasper Fforde, apparently.

As far as I know, the idea of upper-case doesn't apply to most languages.  
Of course other languages have distinctions that we're not used to.  Think 
of all the forms of each letter in Arabic.

One Unicode surprise: it has a capital scharfes S.  Wikipedia says:

  In 2017, the Council for German Orthography ultimately adopted
  capital ß (ẞ) into German orthography, ending a
  long orthographic debate.[4]

<https://en.wikipedia.org/wiki/%C3%9F>
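
A short Python sketch of how this plays out in practice.  It only shows
what Python's built-in str methods do with these two code points; other
languages and libraries may behave differently.

```python
import unicodedata

small = "\u00df"    # LATIN SMALL LETTER SHARP S
capital = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

capital_name = unicodedata.name(capital)

# Upper-casing the small letter does not give the capital letter:
# the default mapping still expands it to two letters.
small_upper = small.upper()      # "SS"

# Lower-casing the capital letter does give the small one.
capital_lower = capital.lower()  # the small sharp s

# For caseless comparison, casefold() is the safer tool: lower() alone
# leaves the sharp s in place, so the two spellings differ.
lower_match = "stra\u00dfe".lower() == "STRASSE".lower()
fold_match = "stra\u00dfe".casefold() == "STRASSE".casefold()

print(capital_name, small_upper, lower_match, fold_match)
```

Note that upper() followed by lower() is not a round trip here, which is
one more reason case-insensitive equality is tricky.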

In English, certain "s" letters were written in a way that looks like an f 
to us (but the cross is missing or different).  I remember thinking "King 
Charles the Fecond" was a witty pun (Spring Thaw, 1967).  This seems to be 
related to the scharfes S.

<https://en.wikipedia.org/wiki/Long_s>

Have a look at the contrasting Britannica pages.

A google for "charles the fecond" gets me lots of books.google.* hits for 
books that have been OCRed incorrectly: the long s has been taken as an f.
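
The long s has its own code point (U+017F), so text that really contains
it can be matched against the modern spelling.  A small Python sketch,
with sample strings made up for illustration:

```python
import unicodedata

long_s = "\u017f"  # LATIN SMALL LETTER LONG S
old_spelling = "Charles the " + long_s + "econd"

long_s_name = unicodedata.name(long_s)

# Case folding maps the long s to an ordinary s.
folded = old_spelling.casefold()

# Compatibility normalization (NFKC) does the same while keeping case.
compat = unicodedata.normalize("NFKC", old_spelling)

# Upper-casing also lands on a plain S.
shouted = long_s.upper()

# The OCR mistake is a different matter: a real "f" is just an f,
# and no normalization will turn it back into an s.
ocr_error = unicodedata.normalize("NFKC", "Charles the fecond")

print(long_s_name, folded, compat, shouted, ocr_error)
```

So a search that case-folds or NFKC-normalizes will find the long-s
spelling, but the mis-OCRed "fecond" stays wrong.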

