filename completion, UTF-16 [was Re: rsync backup]

Thu Dec 9 22:08:14 UTC 2010

On 10-12-09 02:06 PM, Lennart Sorensen wrote:
> On Thu, Dec 09, 2010 at 12:16:32PM -0500, D. Hugh Redelmeier wrote:
>> Spoken like a touch-typist.  I agree.
>>
>> This is somewhat mitigated by the fact that the odd behaviour on Space is
>> triggered by having previously used tab.  But not completely: it means
>> that you need to remember that the shell is in
>> incomplete-tab-completion mode.  And modes add a cognitive burden.
>> "Don't mode me in" is an old EMACS-culture refrain that I subscribe
>> to.
>>
>> My guess is that touch-typing habits are swamped by the opposite: the
>> mass of thumb-typing folks that want anything that can save
>> thumb-strokes, including a lot of very modish things.  Kids these days!
>>
>> I'm old school.  I like EMACS keystrokes but not its complexity
>> ("there's an app for that").  So I use a small subset implementation
>> called JOVE.  Richer than nano/pico but small enough that it fit on a
>> 64K machine (PDP-11).
>>
>> Some day I'll switch because I'm too lazy to make JOVE support UTF-8.
>> It survives UTF-8 but doesn't support it.  For example, the following
>> line:
>> |>    - If you type a space, that space replaces the trailing slash. (Or the
>>
>> looks like this to me as I edit it in JOVE:
>> |>  \302\240- If you type a space, that space replaces the trailing slash. (Or the
>>
>>
>> My opinion is UTF-16 is a mistake.
>>
>> - a single UTF-16 code unit cannot represent all of Unicode.  So each
>>    character takes one or two 16-bit code units.  So, like UTF-8 (and
>>    unlike ASCII, ISO-8859-1, UCS-2, UTF-32) a character is variable
>>    length, making processing a little more awkward.
>>
>> - it doubles the size of ordinary text (unlike UTF-8)
>>
>> - I think that it is harder to convert old C code to support UTF-16
>>    than UTF-8.
>>
>> - a chunk of plain ASCII in memory does not look like UTF-16 and vice
>>    versa.  But it does look like UTF-8 and vice versa.
>>
>> - UTF-16 raises the big-endian vs little-endian issues (in which order do
>>    the two bytes go?).  Unfortunately, both are legitimate, so software
>>    ends up having to support both.  I seem to recollect HTML standards
>>    have to make this survivable.  Sheesh!
>>
>> Yet UTF-16 is what MS Windows, Java, Python, and who knows what else have
>> adopted.  I imagine the reason was that they thought UCS-2 was good enough
>> but had to back down to UTF-16 when Han Unification was rejected.
>> UTF-16 replaced UCS-2 in 1996 (according to my reading of Wikipedia).
>> Surely that was early enough to prevent some of those adoptions.
>>
>> Am I wrong?
>
> UTF-16 is useless.  It makes ascii take twice the space, and doesn't
> handle all unicode without multiple chunks anyhow.  UTF-8 leaves ascii
> alone, and handles everything, usually more space efficiently than UTF-16.
> UTF-16 is simply a bad mistake.

I tried to track down the truth about Python this morning, but I had 
trouble finding answers that seemed definitive enough.  Python 3, of 
course, now uses Unicode as its native string type.  Internally it 
stores strings as UCS-2 or UCS-4, depending on an option set when the 
Python package is compiled.  This settles the matter as far as pure 
Python library modules are concerned, but modules coded in other 
languages and supplied as .so or .dll have to match, or else.  Most 
pre-compiled Python releases seem to have gone with UCS-2.
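
For anyone who wants to check their own interpreter: as far as I can 
tell, sys.maxunicode reflects which way it was built.  Here is a minimal 
sketch; the helper name unicode_build is mine, purely for illustration, 
and later interpreters may report the larger value regardless of how 
they store strings.

```python
import sys

def unicode_build(maxunicode=sys.maxunicode):
    """Classify an interpreter by the largest code point one string element can hold."""
    if maxunicode == 0xFFFF:
        return "narrow (UCS-2 code units)"
    if maxunicode == 0x10FFFF:
        return "wide (UCS-4, or a newer flexible representation)"
    return "unknown"

# Report the value for the interpreter running this script.
print(hex(sys.maxunicode), unicode_build())
```

A compiled extension module would need to have been built against the 
same kind of interpreter that this reports.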

I'm forced to just assume that the possible UCS-2 glitches are handled 
right, where necessary -- that is, that

a = "我们更喜欢你的时候你得更远"
print(len(a))

will print 13, and that

b = a[5:9]
print(b)

will reliably print "你的时候" (whatever that means -- I can't 
actually read Chinese).
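
One way to see where a UCS-2 build could go wrong is to count 16-bit 
code units separately from characters.  This is only a sketch: the 
helper utf16_units and the sample character U+20000 are my own choices 
for illustration, not anything from the Python documentation.

```python
def utf16_units(s):
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

a = "我们更喜欢你的时候你得更远"

# Every character here sits in the Basic Multilingual Plane, so the
# character count and the code-unit count agree.
print(len(a), utf16_units(a))
print(a[5:9])

# Illustrative counter-example: U+20000 lies outside the BMP and needs
# a surrogate pair, i.e. two code units for one character.
rare = "\U00020000"
print(len(rare), utf16_units(rare))
```

As long as the two counts agree, as they do for the string above, 
indexing by code unit and indexing by character give the same answer, 
so len() and slicing should behave on either kind of build.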

Unicode encodings are separate from concerns about internal 
representation.  It's up to the program to control the encoding when 
Unicode strings are written to files and devices (and the decoding 
when they are read back).  That's where choices like UTF-8 (the only 
encoding I've had any need to use) vs UTF-16 or any of the many others 
come into play.
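
To make that split concrete, here is a minimal sketch of the same 
string being encoded several ways at the boundary.  The io.BytesIO 
object is just my stand-in for a real file or device, so nothing 
touches the disk.

```python
import io

a = "我们更喜欢你的时候你得更远"

# The same 13 characters produce different byte sequences per encoding.
for name in ("utf-8", "utf-16-le", "utf-16-be"):
    data = a.encode(name)
    assert data.decode(name) == a    # decoding with the same codec round-trips
    print(name, len(data), "bytes")

# For plain ASCII, UTF-8 output is identical to the ASCII bytes; UTF-16
# output is not, and its two byte orders disagree with each other.
print("abc".encode("utf-8"))
print("abc".encode("utf-16-le"))
print("abc".encode("utf-16-be"))

# Writing through a text stream: the encoding is chosen when the stream
# is opened, independent of how the string is held in memory.
raw = io.BytesIO()                   # stands in for a file or device
out = io.TextIOWrapper(raw, encoding="utf-8")
out.write(a)
out.flush()
written = raw.getvalue()
print(len(written), "bytes written")
```

For this particular string UTF-8 actually comes out larger (three bytes 
per character against two for UTF-16); the space advantage of UTF-8 
shows up with ASCII-heavy text.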

	Mel.

--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists