Test for invalid unicode in file name

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Fri May 6 19:23:25 UTC 2005


On Fri, May 06, 2005 at 03:17:30PM -0400, William O'Higgins wrote:
> I'm not sure if this will help, but I found this one-liner (reconstruct
> it using " \\" as the separator):
> 
> perl -ne 'use bytes;/^(([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef] \\
> [\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})*)(.*)$/;print "$ARGV:$.:".($ \\
> -[3]+1).":$_" if length($3)'
> 
> I found the above here: http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl

Neat, although to be complete the match should also allow {4} and {5}
continuation bytes (after lead bytes \xf8-\xfb and \xfc-\xfd), since the
original UTF-8 definition permits them, although I don't think there are
any defined characters in that range yet.
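
For illustration, here is a minimal sketch of that extension, written in
Python rather than Perl. The function name first_bad_column and the
two extra lead-byte ranges are assumptions for this example, not part of
the quoted one-liner. Like the original, it only checks byte structure:
it does not reject overlong encodings or surrogates, and RFC 3629 now
limits UTF-8 to 4-byte sequences, so the last two forms may be unwanted
in a strict check.

```python
import os
import re

# Byte-level forms accepted by the quoted one-liner, plus the 5- and
# 6-byte forms (assumed extension based on the older UTF-8 definition).
FORMS = [
    rb"[\x00-\x7f]",                # 1 byte (ASCII)
    rb"[\xc0-\xdf][\x80-\xbf]",     # 2 bytes
    rb"[\xe0-\xef][\x80-\xbf]{2}",  # 3 bytes
    rb"[\xf0-\xf7][\x80-\xbf]{3}",  # 4 bytes
    rb"[\xf8-\xfb][\x80-\xbf]{4}",  # 5 bytes (not allowed by RFC 3629)
    rb"[\xfc-\xfd][\x80-\xbf]{5}",  # 6 bytes (not allowed by RFC 3629)
]
VALID_PREFIX = re.compile(b"(?:" + b"|".join(FORMS) + b")*")


def first_bad_column(data: bytes):
    """Return the 1-based byte column of the first malformed sequence,
    or None if the whole input is structurally valid."""
    end = VALID_PREFIX.match(data).end()
    return None if end == len(data) else end + 1


if __name__ == "__main__":
    # Passing bytes to os.listdir returns the raw file names as bytes.
    for name in os.listdir(b"."):
        column = first_bad_column(name)
        if column is not None:
            print(f"{name!r}: first bad byte at column {column}")
```

Run from a directory, this prints each file name whose bytes do not
follow the structure above, with the column of the first offending byte,
much as the one-liner does for lines of a file.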

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




