[GTALUG] tr: Illegal byte sequence

Stewart C. Russell scruss at gmail.com
Thu Sep 27 08:52:40 EDT 2018


On 2018-09-27 12:55 AM, D. Hugh Redelmeier via talk wrote:
> 
> On the other hand, UTF-8 is UTF-8, whether you are in US or CA locale.
> So the different behaviours between the two UTF-8 locales would seem
> to be a bug.

The Mac I tested this on used the same CA locale as my Linux box. It
still failed on the Mac. The issue is more likely to be that the Mac OS
'tr' is a BSD version, and the Linux one is GNU.

Mac OS's command line suite is a mish-mash of sources and versions.
Their tr is marked BSD, from 2005. Their sed (which also requires valid
UTF-8 byte streams) is from FreeBSD circa 2004. Mac OS awk is bwk's "One
True awk" (which doesn't seem to care if a byte stream is valid or not),
but a couple of versions behind current.
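
Since the strict tools refuse input that isn't valid UTF-8, it can help to
find the offending byte before piping a file through them. Here is a
minimal sketch in Python (not part of any of the tools above; the function
name is just illustrative) that reports the offset of the first byte that
doesn't decode as UTF-8:

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that failed to decode
        return err.start
    return None


# Valid UTF-8: 'é' encoded as the two bytes C3 A9
valid = "caf\u00e9\n".encode("utf-8")
# Latin-1 style 'é' (a lone E9 byte), which is not valid UTF-8 here
invalid = b"caf\xe9\n"

print(first_invalid_utf8_offset(valid))
print(first_invalid_utf8_offset(invalid))
```

A stray Latin-1 byte like the one in the second sample is the kind of
input I'd expect to trigger "Illegal byte sequence", though I haven't
checked every tool against it.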

Linux distros tend to be more homogeneous. The only common difference
I've found is that Debian tends to prefer mawk (it's much faster) while
others ship with gawk (it has better, but still limited, UTF-8
support). There's still enough difference between the two that it
can trip you up on edge-case input data. Or more likely, it's tripped
*me* up a couple of times: the rest of you will know what you're doing.
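
One way this sort of difference can show up is in string lengths: a
byte-oriented tool and a multibyte-aware one will disagree on any
non-ASCII text. The sketch below illustrates that in Python rather than
awk, and it is an assumption on my part that this is how a given mawk or
gawk version behaves, so check your own:

```python
def char_and_byte_lengths(text: str):
    """Return (characters, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# 'naïve' has five characters, but 'ï' takes two bytes in UTF-8
chars, nbytes = char_and_byte_lengths("na\u00efve")
print(chars, nbytes)

# Pure ASCII input gives the same answer either way
print(char_and_byte_lengths("naive"))
```

On pure ASCII input the two counts agree, which is why this only bites
on edge-case data.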

cheers,
 Stewart


