[GTALUG] tr: Illegal byte sequence

D. Hugh Redelmeier hugh at mimosa.com
Thu Sep 27 00:55:15 EDT 2018


| From: Stewart C. Russell via talk <talk at gtalug.org>

| tr on Mac OS seems to assume input is valid UTF-8 text (if locale is
| suitably UTF-8).

To amplify this, not all byte sequences are valid UTF-8.  Random bytes,
such as the output of /dev/urandom, will usually include invalid
sequences.

Off the top of my head, I think that the following are invalid:

- A continuation byte (0x80 to 0xBF) that is not part of a sequence
  begun by a lead byte

- A string that ends partway through a multi-byte character

- A single character encoded in more than 4 bytes

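Each valid character is either a single byte without the high bit on
(ASCII), or a lead byte (0xC2 to 0xF4) followed by one to three
continuation bytes (0x80 to 0xBF), all with the high bit on.  The lead
byte says how many continuation bytes follow.  The payload bits of
each byte (everything after the marker bits) are concatenated to form
the UTF-32 value.  Overlong encodings, surrogates (U+D800 to U+DFFF),
and values above U+10FFFF are forbidden.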
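
As a rough illustration, here is a minimal sketch of a UTF-8 validity
check in POSIX shell.  It treats a character as either one ASCII byte
or a lead byte followed by one to three continuation bytes, and rejects
everything else.  The function name is_utf8 is made up for this
example, and it assumes od is available.  It is meant to show the
structure of the encoding, not to be fast.

```shell
# is_utf8 (name made up for this sketch): read bytes on stdin and
# return 0 if they are valid UTF-8, 1 otherwise.
is_utf8() {
    need=0            # continuation bytes still expected
    lo=128; hi=191    # allowed range for the next continuation byte
    for b in $(od -An -v -t u1); do
        if [ "$need" -gt 0 ]; then
            # Inside a multi-byte character: only 0x80..0xBF is allowed,
            # narrowed after certain lead bytes (see below).
            if [ "$b" -lt "$lo" ] || [ "$b" -gt "$hi" ]; then return 1; fi
            need=$((need - 1)); lo=128; hi=191
        elif [ "$b" -lt 128 ]; then
            :                               # ASCII, one byte
        elif [ "$b" -ge 194 ] && [ "$b" -le 223 ]; then
            need=1                          # 0xC2..0xDF: 2-byte character
        elif [ "$b" -ge 224 ] && [ "$b" -le 239 ]; then
            need=2                          # 0xE0..0xEF: 3-byte character
            case $b in
                224) lo=160 ;;              # 0xE0: rule out overlong forms
                237) hi=159 ;;              # 0xED: rule out surrogates
            esac
        elif [ "$b" -ge 240 ] && [ "$b" -le 244 ]; then
            need=3                          # 0xF0..0xF4: 4-byte character
            case $b in
                240) lo=144 ;;              # 0xF0: rule out overlong forms
                244) hi=143 ;;              # 0xF4: rule out values > U+10FFFF
            esac
        else
            return 1    # stray continuation byte, or 0xC0, 0xC1, 0xF5..0xFF
        fi
    done
    [ "$need" -eq 0 ]   # fail if the input ended mid-character
}

printf 'caf\303\251' | is_utf8 && echo valid     # C3 A9 is a 2-byte character
printf 'a\200b'      | is_utf8 || echo invalid   # stray continuation byte
printf 'caf\303'     | is_utf8 || echo invalid   # ends mid-character
```

The three calls at the end cover one valid string and two of the
invalid cases: a continuation byte with no lead byte, and a string that
stops partway through a character.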

On the other hand, UTF-8 is UTF-8, whether you are in US or CA locale.
So the different behaviours between the two UTF-8 locales would seem
to be a bug.  (In theory, collating sequences could be different so
ranges in tr could be different, but I would not see that affecting
the ASCII subset you are using in your ranges.)

In the C locale, tr should treat each byte as a character rather than
decoding UTF-8.  So it should work.
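
A quick way to see this, as a sketch: feed tr two bytes that are not
valid UTF-8 (0x80 and 0xFF, written in octal for printf) mixed in with
ASCII.  Under LC_ALL=C they are just bytes outside the set, so they
get deleted like anything else.  The input string is invented for the
example.

```shell
# 0x80 (\200) and 0xFF (\377) can never appear like this in valid UTF-8.
# With LC_ALL=C, tr sees plain bytes and deletes everything that is
# not a lowercase letter or a newline.
printf 'a\200b\377c\n' | LC_ALL=C tr -dc 'a-z\n'
```

This should print abc.  On a tr that decodes UTF-8, the same command
without LC_ALL=C is the kind of input that would be expected to produce
the "Illegal byte sequence" error, though that is untested here.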

This (untested) small change to Giles' script should work.

dd if=/dev/urandom bs=1 count=256 2>/dev/null |
	LC_ALL=C tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' |
	head -c 32

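LC_ALL might be overkill, but it overrides LANG and every other LC_*
variable, so it covers both the character encoding (LC_CTYPE) and the
collating sequence (LC_COLLATE).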

I'd probably add an echo to put a newline at the end.
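
A sketch of the pipeline with that trailing newline added, wrapped in a
function whose name (genpw) is made up here.  The character set is
copied unchanged from the script above.  One thing to be aware of: tr
reads the +-_ inside the set as a range from + to _, so the set also
admits characters such as , . : ; < and >.  Moving the - to the end of
the set would make it a literal hyphen, if that is what is wanted.

```shell
# genpw (hypothetical name): print 32 random characters from the set,
# followed by a newline.
genpw() {
    dd if=/dev/urandom bs=1 count=256 2>/dev/null |
        LC_ALL=C tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' |
        head -c 32
    echo    # head -c leaves no trailing newline, so add one
}

genpw
```

With 256 random bytes going in and roughly a third of all byte values
in the set, running out before 32 characters is extremely unlikely,
but it is not strictly impossible.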

