[GTALUG] interesting article and comments about UCS-16, UTF-16, UTF-8

nick xerofoify at gmail.com
Sun Aug 4 01:14:35 EDT 2019



On 2019-08-03 6:18 p.m., D. Hugh Redelmeier via talk wrote:
> https://news.ycombinator.com/item?id=20600195
> 
> There are so many hairy details!
> 
> UTF-8 gets a bit less coverage since it has fewer hairy details.
> 
> From this I learned that Java and JavaScript now have optimizations to
> use LATIN-1 when they can.  Normally they use UTF-16 (originally
> UCS-16).  I take it that Using Latin-1 is an opportunistic
> optimization hidden from the program.  I don't think Python 3 uses
> this.
> 
> I think that Linux does this right and needs no such hack: just use
> UTF-8.  Of course Java, JavaScript, Python 2, and Python 3 on Linux
> don't get it right.

I looked through it briefly, and a lot of it depends on what the implementers
think the language is being used for. There may be a very good reason for it,
similar to SSO (small string optimization) in most STL implementations. Sure,
a given program may get shot in the foot, but that's one program. I would be
curious to see it measured across a lot of different JavaScript, Python 2/3
and Java programs to see if it's a good idea.

It's the same with hardware heuristics or instructions for compiler backends:
sure, 5% of programs may perform better, but what about the other 95%?

It's interesting to point out, though, that UTF-16 is complex enough that UTF-8
or another less complex encoding would be preferred. If the full encoding is
required, it may cause overhead. The question is how much, and whether hardware
is already handling the encoding.
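
To get a rough feel for the size side of that question, here is a minimal
sketch in Python. The language, the sample strings and the helper name
encoded_sizes are my own choices for illustration, not anything taken from the
article. It only counts bytes per encoding and says nothing about how long
encoding or decoding takes.

```python
# Rough sketch: how many bytes the same text needs in three encodings.
# The sample strings and the helper name are illustrative assumptions.
SAMPLES = {
    "ascii": "hello",
    "latin1": "caf\u00e9",            # "café", fits in Latin-1
    "cjk": "\u65e5\u672c\u8a9e",      # three BMP characters
    "emoji": "\U0001F600",            # outside the BMP, needs a surrogate pair
}


def encoded_sizes(text):
    """Return byte counts per encoding; None where the text cannot be encoded."""
    sizes = {}
    # "utf-16-le" is used so that no byte order mark is counted.
    for name in ("latin-1", "utf-8", "utf-16-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None
    return sizes


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(label, encoded_sizes(text))
```

Text that fits in Latin-1 takes one byte per character there and two in
UTF-16, which is presumably the saving the Java and JavaScript optimization is
after. The last sample is the case where UTF-16 stops being fixed-width: one
character becomes two 16-bit code units.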

Nick


