Test for invalid unicode in file name

Fri May 6 18:22:05 UTC 2005

On Thu, May 05, 2005 at 11:53:49PM -0400, Madison Kelly wrote:
> Thanks for the reply!
> 
> The trick is though that I have several valid unicode file names (ie: 
> files using Japanese kana/kanji characters). These file names are 
> accepted just fine and it is important that unicode support remains. If 
> there is a regex that caught all valid unicodes and wasn't too expensive 
> that would be great.

Are you sure the filenames aren't in Shift-JIS or something instead?
That would be incompatible with UTF-8 for sure.
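
One way to tell the two apart is to hand the raw bytes to a strict UTF-8
decoder and see whether it complains.  The sketch below uses the core Encode
module for that.  The helper name is_strict_utf8 and the sample byte strings
are made up for illustration, so treat it as a starting point and not a
tested solution:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if $bytes decodes as strict UTF-8, else 0.
sub is_strict_utf8 {
    my ($bytes) = @_;   # work on a copy; decode() may modify its argument
    return eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $sjis = "\x93\xFA\x96\x7B";           # U+65E5 U+672C encoded as Shift-JIS
my $utf8 = "\xE6\x97\xA5\xE6\x9C\xAC";   # the same two kanji encoded as UTF-8

print "Shift-JIS bytes: ", (is_strict_utf8($sjis) ? "ok" : "fail"), "\n";
print "UTF-8 bytes:     ", (is_strict_utf8($utf8) ? "ok" : "fail"), "\n";
```

The Shift-JIS sample starts with 0x93, which has the form 10xxxxxx and so
cannot begin a UTF-8 character; the decoder rejects it.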

A valid Unicode character (well, UTF-8 at least) is one of these byte
sequences:
0xxxxxxx (0-127) is valid by itself.
110xxxxx 10xxxxxx is valid.
1110xxxx 10xxxxxx 10xxxxxx is valid.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx is valid.
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid.
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx is valid.

The 5 and 6 byte forms come from the original UTF-8 definition.  RFC 3629
limits UTF-8 to 4 bytes (nothing above U+10FFFF), so a strict validator
rejects them.

10xxxxxx by itself is never valid.  It must follow a byte indicating how
many 10xxxxxx bytes follow.
1111111x is never valid.
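
Since the question was about a regex, the patterns above can be written down
more or less directly as one.  This is only a sketch: the helper name
looks_like_utf8 is made up, and it checks structure only, so overlong
encodings and surrogates still get through:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the byte string consists only of the
# lead/continuation byte patterns listed above, else 0.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                   # 0xxxxxxx
        | [\xC0-\xDF][\x80-\xBF]        # 110xxxxx + 1 continuation byte
        | [\xE0-\xEF][\x80-\xBF]{2}     # 1110xxxx + 2
        | [\xF0-\xF7][\x80-\xBF]{3}     # 11110xxx + 3
        | [\xF8-\xFB][\x80-\xBF]{4}     # 111110xx + 4
        | [\xFC-\xFD][\x80-\xBF]{5}     # 1111110x + 5
    )*\z/x ? 1 : 0;
}

print looks_like_utf8("12345\xDF\xBF 123") ? "ok\n" : "fail\n";   # ok
print looks_like_utf8("abc\xBF")           ? "ok\n" : "fail\n";   # fail
```

The second call fails because 0xBF is a 10xxxxxx byte with no lead byte in
front of it.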

For example:
$ ./validate.pl "`echo -e '12345\337\277 123\375\277\277\200\277\204\377'`" && echo ok || echo fail
2 byte character found.  Checking bytes 1 2
6 byte character found.  Checking bytes 1 2 3 4 5 6
Invalid character at position 18
fail

$ ./validate.pl "`echo -e '12345\337\277 123\375\277\277\200\277\204'`" && echo ok || echo fail
2 byte character found.  Checking bytes 1 2
6 byte character found.  Checking bytes 1 2 3 4 5 6
ok

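#!/usr/bin/perl
use strict;
use warnings;

# Split the argument into single bytes (nothing decodes @ARGV for us).
my @inputstring = split(//, defined $ARGV[0] ? $ARGV[0] : '');
my $currentposition = 0;

# Return the next byte, or a NUL byte once the input is exhausted.  NUL is
# not a 10xxxxxx byte, so a truncated sequence fails the check below.
sub getnextchar {
        return "\0" if ($currentposition > $#inputstring);
        return $inputstring[$currentposition++];
}

while ($currentposition <= $#inputstring) {
        my $value = ord(getnextchar());
        next if ($value < 0x80);        # plain ASCII
        # 10xxxxxx cannot start a character; 0xFE and 0xFF are never valid.
        die("Invalid character at position $currentposition\n") if ($value >= 0xFE or $value < 0xC0);
        # The lead byte gives the total length of the sequence.
        my $count = $value < 0xE0 ? 2
                  : $value < 0xF0 ? 3
                  : $value < 0xF8 ? 4
                  : $value < 0xFC ? 5
                  :                 6;
        print "$count byte character found.  Checking bytes 1";
        # Every remaining byte of the sequence must look like 10xxxxxx.
        for (my $i = 1; $i < $count; $i++) {
                $value = ord(getnextchar());
                printf(" %d", 1 + $i);
                die("Invalid character at position $currentposition\n") if (($value & 0xC0) != 0x80);
        }
        print "\n";
}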

Minimally tested, but probably works.

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml