Test for invalid unicode in file name
Henry Spencer
henry-lqW1N6Cllo0sV2N9l4h3zg at public.gmane.org
Fri May 6 18:54:26 UTC 2005
On Fri, 6 May 2005, Lennart Sorensen wrote:
> A valid unicode character (well UTF8 at least) is:
> 0-127 is valid by itself.
> 110xxxxx 10xxxxxx is valid.
> 1110xxxx 10xxxxxx 10xxxxxx is valid
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx is valid
> 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid
> 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid.
The modern definition of UTF-8 is actually somewhat narrower, outlawing
the last two forms and restricting the earlier ones: overlong encodings,
the surrogate range U+D800-U+DFFF, and anything above U+10FFFF are all
rejected. (For example, encoding the character value 15 as 11000000
10001111 is not legal any more -- it can be encoded *only* as 00001111.)
See RFC 3629 or the Unicode Standard, version 4.0.
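
As a rough illustration of the narrower rules, here is a minimal Python
sketch of a byte-level check, following my reading of the byte ranges in
RFC 3629. The function name is_valid_utf8 is made up for this example; it
is not from any library, and it takes the file name as raw bytes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data looks like well-formed UTF-8 (RFC 3629 rules)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # 0xxxxxxx: 0-127 is valid by itself.
            i += 1
            continue
        if 0xC2 <= b0 <= 0xDF:
            # Two-byte form. Lead bytes 0xC0 and 0xC1 are excluded because
            # they could only produce overlong encodings of 0-127.
            need, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= b0 <= 0xEF:
            # Three-byte form. After 0xE0 the second byte must be >= 0xA0
            # (no overlongs); after 0xED it must be <= 0x9F (no surrogates).
            need = 2
            lo = 0xA0 if b0 == 0xE0 else 0x80
            hi = 0x9F if b0 == 0xED else 0xBF
        elif 0xF0 <= b0 <= 0xF4:
            # Four-byte form. After 0xF0 the second byte must be >= 0x90
            # (no overlongs); after 0xF4 it must be <= 0x8F (max U+10FFFF).
            need = 3
            lo = 0x90 if b0 == 0xF0 else 0x80
            hi = 0x8F if b0 == 0xF4 else 0xBF
        else:
            # Stray continuation byte, 0xC0/0xC1, or 0xF5-0xFF, which
            # covers the old five- and six-byte forms.
            return False
        if i + need >= n:
            return False  # sequence is cut off at the end of the input
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

With this sketch, is_valid_utf8(b"\x0f") is accepted while the overlong
is_valid_utf8(b"\xc0\x8f") is rejected. In practice Python's own strict
decoder, data.decode("utf-8"), should apply the same restrictions, so the
hand-written loop is mainly useful for seeing where each rule bites.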
Henry Spencer
henry-lqW1N6Cllo0sV2N9l4h3zg at public.gmane.org