filename completion, UTF-16 [was Re: rsync backup]
D. Hugh Redelmeier
hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org
Thu Dec 9 17:16:32 UTC 2010
| From: Giles Orr <gilesorr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>
| On 9 December 2010 01:47, Brandon Sandrowicz <bsandrow-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:
| > z-shell's tab-completion gets this right. When tab-completing a
| > directory, it will put the trailing slash there but it defines the
| > following behavior:
|
| I'm not sure that's right: I don't want the shell messing with
| anything I see unless I specifically told it to. By hitting "Tab" I
| said "I want you to fill something in for me." By hitting "Space" I
| said "insert a space here," not "go back and change something that I
| thought was already written."
Spoken like a touch-typist. I agree.
This is somewhat mitigated by the fact that the odd behaviour on Space is
triggered by having previously used Tab. But not completely: you still
need to remember that the shell is in incomplete-tab-completion mode.
And modes add a cognitive burden. "Don't mode me in" is an old
EMACS-culture refrain that I subscribe to.
My guess is that touch-typing habits are swamped by the opposite: the
mass of thumb-typing folks that want anything that can save
thumb-strokes, including a lot of very modish things. Kids these days!
I'm old school. I like EMACS keystrokes but not its complexity
("there's an app for that"). So I use a small subset implementation
called JOVE. Richer than nano/pico but small enough that it fit on a
64K machine (PDP-11).
Some day I'll switch because I'm too lazy to make JOVE support UTF-8.
It survives UTF-8 but doesn't support it. For example, the following
line:
| > - If you type a space, that space replaces the trailing slash. (Or the
looks like this to me as I edit it in JOVE:
| > \302\240- If you type a space, that space replaces the trailing slash. (Or the
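For the record, those octal escapes are the two bytes of the UTF-8
encoding of U+00A0 (NO-BREAK SPACE): 0302 is 0xC2 and 0240 is 0xA0.
A minimal C sketch of the decoding arithmetic, just to show what an
editor that actually supported UTF-8 would have to do:

#include <stdio.h>

int main(void)
{
    unsigned char buf[] = { 0302, 0240 };  /* the bytes JOVE shows in octal */

    /* a two-byte UTF-8 sequence has the shape 110xxxxx 10xxxxxx */
    unsigned cp = ((buf[0] & 0x1Fu) << 6) | (buf[1] & 0x3Fu);
    printf("U+%04X\n", cp);                /* prints U+00A0 */
    return 0;
}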
My opinion is that UTF-16 is a mistake.
- a single UTF-16 code unit cannot represent all of Unicode, so each
  character takes one or two 16-bit code units. Like UTF-8 (and
  unlike ASCII, ISO-8859-1, UCS-2, and UTF-32), a character is
  variable length, making processing a little more awkward (see the
  sketch after this list).
- it doubles the size of ordinary text (unlike UTF-8)
- I think that it is harder to convert old C code to support UTF-16
than UTF-8.
- a chunk of plain ASCII in memory does not look like UTF-16 and vice
  versa. But plain ASCII is already valid UTF-8, byte for byte (and
  all-ASCII UTF-8 is plain ASCII).
- UTF-16 raises big-endian vs little-endian issues (in which order do
  the two bytes of a code unit go?). Unfortunately, both orders are
  legitimate, so everything has to support both. I seem to recollect
  that the HTML standards have to invoke the byte-order mark to make
  this survivable. Sheesh! (The sketch below shows both byte orders.)
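To make the first and last points concrete, here is a small C sketch
(an illustration of mine, not anybody's library code). It encodes
U+1D11E (MUSICAL SYMBOL G CLEF), a code point outside the Basic
Multilingual Plane, as a UTF-16 surrogate pair, then serializes the
pair in both legitimate byte orders:

#include <stdio.h>

int main(void)
{
    unsigned long cp = 0x1D11E;  /* above U+FFFF: needs two code units */

    /* code points above U+FFFF are split into a surrogate pair */
    unsigned long v  = cp - 0x10000;
    unsigned hi = 0xD800u | (unsigned)(v >> 10);    /* high surrogate */
    unsigned lo = 0xDC00u | (unsigned)(v & 0x3FF);  /* low surrogate */
    printf("code units:     %04X %04X\n", hi, lo);  /* D834 DD1E */

    /* the same pair, serialized each way */
    printf("UTF-16BE bytes: %02X %02X %02X %02X\n",
           hi >> 8, hi & 0xFF, lo >> 8, lo & 0xFF);
    printf("UTF-16LE bytes: %02X %02X %02X %02X\n",
           hi & 0xFF, hi >> 8, lo & 0xFF, lo >> 8);
    return 0;
}

Contrast that with the string "Hi": as ASCII or UTF-8 it is the bytes
48 69; as UTF-16 it is 00 48 00 69 or 48 00 69 00, depending on which
byte order you guessed.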
Yet UTF-16 is what MS Windows, Java, Python, and who knows what else have
adopted. I imagine the reason was that they thought UCS-2 was good enough
but had to back down to UTF-16 when it became clear that even Han
Unification could not keep Unicode within 16 bits.
UTF-16 replaced UCS-2 in 1996 (according to my reading of Wikipedia).
Surely that was early enough to prevent some of those adoptions.
Am I wrong?