Test for invalid unicode in file name

Henry Spencer henry-lqW1N6Cllo0sV2N9l4h3zg at public.gmane.org
Fri May 6 18:54:26 UTC 2005


On Fri, 6 May 2005, Lennart Sorensen wrote:
> A valid unicode character (well UTF8 at least) is:
> 0-127 is valid by itself.
> 110xxxxx 10xxxxxx is valid.
> 1110xxxx 10xxxxxx 10xxxxxx is valid
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx is valid
> 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid
> 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid.

The modern definition of UTF-8 is actually somewhat narrower, outlawing
the last two forms and restricting the earlier ones:  overlong encodings,
the surrogate range U+D800..U+DFFF, and values above U+10FFFF are all
illegal.  (For example, encoding the character value 15 as 11000000
10001111 is not legal any more -- it can be encoded *only* as 00001111.)
See RFC 3629 or the Unicode standard rev 4.0.
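
As a rough illustration of the narrower rules, here is a minimal sketch
of a strict checker in Python.  The function name is_valid_utf8 is
hypothetical, and the checks are one reading of RFC 3629 (no 5- or 6-byte
forms, no overlong encodings, no surrogates, nothing above U+10FFFF), not
a reference implementation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data looks like well-formed UTF-8 per RFC 3629."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:               # 0xxxxxxx: valid by itself
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:      # 110xxxxx (0xC0 and 0xC1 are always overlong)
            need, cp, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:    # 1110xxxx
            need, cp, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:    # 11110xxx (0xF5 and up always exceed U+10FFFF)
            need, cp, lowest = 3, b & 0x07, 0x10000
        else:                      # stray continuation byte, or a 5/6-byte lead
            return False
        if i + need >= n:          # sequence is truncated
            return False
        for c in data[i + 1 : i + 1 + need]:
            if c & 0xC0 != 0x80:   # every continuation byte must be 10xxxxxx
                return False
            cp = (cp << 6) | (c & 0x3F)
        if cp < lowest:            # overlong: a shorter form exists
            return False
        if 0xD800 <= cp <= 0xDFFF: # UTF-16 surrogates are not characters
            return False
        if cp > 0x10FFFF:          # beyond the Unicode code space
            return False
        i += need + 1
    return True


# The example from the message: value 15 in one byte versus two bytes.
print(is_valid_utf8(b"\x0f"))      # single-byte form
print(is_valid_utf8(b"\xc0\x8f"))  # overlong two-byte form
```

For the file name case, the bytes would typically come from something
like os.fsencode(name); that pairing is an assumption here, not something
the thread prescribes.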

                                                          Henry Spencer
                                                       henry-lqW1N6Cllo0sV2N9l4h3zg at public.gmane.org

--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




