filename completion, UTF-16 [was Re: rsync backup]

D. Hugh Redelmeier hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Thu Dec 9 17:16:32 UTC 2010


| From: Giles Orr <gilesorr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>

| On 9 December 2010 01:47, Brandon Sandrowicz <bsandrow-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:

| > z-shell's tab-completion gets this right. When tab-completing a
| > directory, it will put the trailing slash there but it defines the
| > following behavior:
| 
| I'm not sure that's right: I don't want the shell messing with
| anything I see unless I specifically told it to.  By hitting "Tab" I
| said "I want you to fill something in for me."  By hitting "Space" I
| said "insert a space here," not "go back and change something that I
| thought was already written."

Spoken like a touch-typist.  I agree.

This is somewhat mitigated by the fact that the odd behaviour on Space is
triggered by having previously used Tab.  But not completely: it means
that you need to remember that the shell is in
incomplete-tab-completion mode.  And modes add a cognitive burden.
"Don't mode me in" is an old EMACS-culture refrain that I subscribe
to.

My guess is that touch-typing habits are swamped by the opposite: the
mass of thumb-typing folks that want anything that can save
thumb-strokes, including a lot of very modish things.  Kids these days!

I'm old school.  I like EMACS keystrokes but not its complexity
("there's an app for that").  So I use a small subset implementation
called JOVE.  Richer than nano/pico but small enough that it fit on a
64K machine (PDP-11).

Some day I'll switch because I'm too lazy to make JOVE support UTF-8.
It survives UTF-8 but doesn't support it.  For example, the following
line:
| >  - If you type a space, that space replaces the trailing slash. (Or the

looks like this to me as I edit it in JOVE:
| > \302\240- If you type a space, that space replaces the trailing slash. (Or the
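
Those octal escapes, by the way, are the two bytes 0xC2 0xA0: the UTF-8
encoding of U+00A0, NO-BREAK SPACE.  Easy enough to confirm in, say, a
Python 3 session:

    >>> "\u00a0".encode("utf-8")
    b'\xc2\xa0'
    >>> oct(0xc2), oct(0xa0)
    ('0o302', '0o240')

JOVE knows nothing about multi-byte characters, so it just shows the raw
bytes in octal.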


My opinion is UTF-16 is a mistake.

- a single UTF-16 code unit cannot represent all of Unicode.  Characters
  outside the Basic Multilingual Plane take two 16-bit code units (a
  surrogate pair).  So, like UTF-8 (and unlike ASCII, ISO-8859-1, UCS-2,
  UTF-32) a character is variable length, making processing a little
  more awkward.  (The short example after this list illustrates this and
  several of the points below.)

- it doubles the size of ordinary ASCII text (unlike UTF-8)

- I think that it is harder to convert old C code to support UTF-16
  than UTF-8.

- a chunk of plain ASCII in memory is not valid UTF-16, and UTF-16 text
  (full of NUL bytes) does not look like ASCII.  But plain ASCII is
  already valid UTF-8, byte for byte, and UTF-8-encoded ASCII text is
  just the ASCII.

- UTF-16 raises big-endian vs little-endian issues (in which order do
  the two bytes go?).  Unfortunately, both orders are legitimate, so
  everything ends up having to support both.  I seem to recollect that
  the HTML standards have to go out of their way (byte-order marks and
  the like) to make this survivable.  Sheesh!
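
A rough illustration of several of these points, sketched here in
Python 3 because it makes the encodings easy to poke at (nothing below
is specific to Python):

    >>> len("A".encode("utf-8")), len("A".encode("utf-16-le"))
    (1, 2)
    >>> "A".encode("utf-8") == b"A"             # ASCII is already valid UTF-8
    True
    >>> len("\U0001F600".encode("utf-16-le"))   # outside the BMP: a surrogate pair
    4
    >>> "A".encode("utf-16-le"), "A".encode("utf-16-be")
    (b'A\x00', b'\x00A')
    >>> "A".encode("utf-16")                    # BOM plus native order (little-endian here)
    b'\xff\xfeA\x00'

The doubling and the embedded NUL bytes are also why naive C string
handling chokes on UTF-16 but passes ASCII-only UTF-8 through untouched.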

Yet UTF-16 is what MS Windows, Java, Python, and who knows what else have 
adopted.  I imagine the reason was that they thought UCS-2 was good enough 
but had to back down to UTF-16 when Han Unification was rejected.
UTF-16 replaced UCS-2 in 1996 (according to my reading of Wikipedia).  
Surely that was early enough to prevent some of those adoptions.

Am I wrong?

