Test for invalid unicode in file name

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Fri May 6 19:03:13 UTC 2005


On Fri, May 06, 2005 at 02:54:26PM -0400, Henry Spencer wrote:
> On Fri, 6 May 2005, Lennart Sorensen wrote:
> > A valid Unicode character (well, UTF-8 at least) is:
> > 0-127 is valid by itself.
> > 110xxxxx 10xxxxxx is valid.
> > 1110xxxx 10xxxxxx 10xxxxxx is valid
> > 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx is valid
> > 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid
> > 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid.
> 
> The modern definition of UTF-8 is actually somewhat narrower, outlawing
> the last two forms and restricting the earlier ones somewhat.  (For
> example, encoding the character value 15 as 11000000 10001111 is not
> legal any more -- it can be encoded *only* as 00001111.)  See RFC 3629
> or the Unicode standard rev 4.0.

Well yes, you MUST use the shortest form possible. I don't know why anyone
would write a parser that does NOT accept all possible forms, though: you
just write the generator to emit the shortest form, and the parser to
accept anything that fits the general rules.
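
As a rough sketch of that split (not code from any real parser; the names
decode_one, shortest_length and is_strict are made up for illustration), a
Python decoder that follows the general bit patterns, plus a separate
strictness check along the lines of RFC 3629, might look like this:

```python
# Lead-byte patterns from the list above: (mask, expected bits, total length).
# The 5- and 6-byte forms are included only because the lenient rules allow
# them; RFC 3629 does not.
LEAD_PATTERNS = (
    (0xE0, 0xC0, 2),
    (0xF0, 0xE0, 3),
    (0xF8, 0xF0, 4),
    (0xFC, 0xF8, 5),
    (0xFE, 0xFC, 6),
)


def decode_one(data, pos=0):
    """Leniently decode one sequence at data[pos]; return (code_point, length)."""
    first = data[pos]
    if first < 0x80:
        return first, 1
    for mask, bits, length in LEAD_PATTERNS:
        if first & mask == bits:
            break
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    code_point = first & ~mask & 0xFF  # payload bits of the lead byte
    for byte in data[pos + 1:pos + length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point, length


def shortest_length(code_point):
    """Number of bytes the shortest UTF-8 form of code_point uses."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def is_strict(data):
    """True if data is shortest-form UTF-8 with no surrogates or values > U+10FFFF."""
    pos = 0
    while pos < len(data):
        try:
            code_point, length = decode_one(data, pos)
        except ValueError:
            return False
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            return False
        if length != shortest_length(code_point):
            return False
        pos += length
    return True
```

Here decode_one(bytes([0xC0, 0x8F])) accepts the 11000000 10001111 form and
yields the value 15, while is_strict() rejects the same two bytes and
accepts the single byte 00001111.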

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




