[GTALUG] tr: Illegal byte sequence
D. Hugh Redelmeier
hugh at mimosa.com
Thu Sep 27 00:55:15 EDT 2018
| From: Stewart C. Russell via talk <talk at gtalug.org>
| tr on Mac OS seems to assume input is valid UTF-8 text (if locale is
| suitably UTF-8).
To amplify this, not all byte sequences are valid UTF-8. Random byte
sequences will sometimes be invalid.
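Off the top of my head, I think that the following are invalid:
- A continuation byte (0x80-0xBF) not preceded by a lead byte or
another continuation byte of the same character
- A string ending partway through a multi-byte sequence
- A lead byte followed by fewer continuation bytes than it announces
- The bytes 0xC0, 0xC1, and 0xF5-0xFF, which never appear in UTF-8
Each valid character is either a single byte without the high bit on
(plain ASCII) or a lead byte (0xC2-0xF4) followed by one to three
continuation bytes (0x80-0xBF). The lead byte says how many
continuation bytes follow. The payload bits of all the bytes are
concatenated to form the UTF-32 value. Overlong encodings, surrogates
(U+D800-U+DFFF), and values above U+10FFFF are forbidden.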
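One way to try these rules out, sketched here on the assumption that
iconv is installed, is to ask iconv to convert from UTF-8 to UTF-8 and
look at its exit status. The helper name is_valid_utf8 is made up for
this example.

```shell
# Hypothetical helper: succeeds only if standard input is valid UTF-8.
# iconv exits with a non-zero status when it meets a sequence it
# cannot decode, so a UTF-8 to UTF-8 conversion doubles as a validator.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# printf octal escapes are used because they are portable across shells.
printf 'abc'      | is_valid_utf8 && echo "plain ASCII: valid"
printf '\303\251' | is_valid_utf8 && echo "0xC3 0xA9 (e-acute): valid"
printf '\200'     | is_valid_utf8 || echo "lone 0x80: invalid"
printf '\303'     | is_valid_utf8 || echo "truncated sequence: invalid"
```

Feeding it the output of /dev/urandom should report invalid input
almost every time.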
On the other hand, UTF-8 is UTF-8, whether you are in a US or CA locale.
So the different behaviours between the two UTF-8 locales would seem
to be a bug. (In theory, collating sequences could be different so
ranges in tr could be different, but I would not see that affecting
the ASCII subset you are using in your ranges.)
Using the C locale should make tr treat every byte as a character of
its own, with no UTF-8 decoding. So it should work.
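A small sketch of that effect, with a made-up helper name
keep_lowercase: the input contains the bytes 0x80 and 0xFF, which are
not valid UTF-8, yet with LC_ALL=C tr should simply delete them along
with everything else outside a-z.

```shell
# Hypothetical demo: keep only lowercase ASCII letters.
# LC_ALL=C applies to this one tr invocation only.
keep_lowercase() {
    LC_ALL=C tr -dc 'a-z'
}

# Input: 0x80, "abc", 0xFF, newline. Only "abc" should survive.
printf '\200abc\377\n' | keep_lowercase
echo    # tr deleted the newline too, so add one back
```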
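This (untested) small change to Giles' script should work. The '-'
goes at the end of the set so that tr takes it literally; written as
'+-_' it is a range that also lets in characters such as ',' '.' ':'
';' '<' and '>'.
    dd if=/dev/urandom bs=1 count=256 2>/dev/null |
        LC_ALL=C tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+_/?\|~`-' |
        head -c 32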
LC_ALL might be overkill. I don't know.
I'd probably add an echo to put a newline at the end.
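Put together, that might look like the sketch below. The function name
genpw is made up here, the '-' is placed last in the set so that tr
reads it as a literal character, and the sketch assumes that 256 random
bytes leave at least 32 characters after filtering, which is
overwhelmingly likely but not guaranteed.

```shell
# Hypothetical wrapper around the pipeline; genpw is an invented name.
genpw() {
    # Read 256 random bytes, keep only the allowed characters,
    # and cut the result to 32 characters.
    dd if=/dev/urandom bs=1 count=256 2>/dev/null |
        LC_ALL=C tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+_/?\|~`-' |
        head -c 32
    echo    # finish the line with a newline
}

genpw
```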