Test for invalid unicode in file name

William O'Higgins william.ohiggins-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Fri May 6 19:17:30 UTC 2005


On Thu, May 05, 2005 at 10:20:48PM -0400, Madison Kelly wrote:
>Hi all,
>
>  I've run into a problem where a bulk postgres "COPY..." statement is 
>dying because one of the lines contains a file name with an invalid 
>unicode character. In nautilus this file has '(invalid encoding)' and 
>the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column 
>file_name: "Femme Fatal\uffff.url"'.
>
>  Is there a way in perl (something like 'stat') where I can check to 
>make sure a file name has valid encoding? If there is, then I can catch 
>this problem before adding it to, and corrupting, my COPY statement. I 
>already 'quote' the file names first but that didn't catch it.

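I'm not sure if this will help, but I found this one-liner. For each
input line that contains a byte which is not part of a well-formed UTF-8
sequence, it prints the file name, the line number, the column of the
first offending byte, and the line itself. Enter it as a single line:

perl -ne 'use bytes;/^(([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})*)(.*)$/;print "$ARGV:$.:".($-[3]+1).":$_" if length($3)'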

I found the above here: http://www.cl.cam.ac.uk/~mgk25/unicode.html#perl
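
The one-liner works on lines of input, so to check file names you would
have to feed it a list of them. It also only checks the overall byte
pattern, so it accepts some sequences a stricter decoder rejects (the
overlong pair \xc0\x80, for example). If you would rather test each name
inside your own script before writing it to the COPY data, something like
the following might do. This is only a rough sketch: it assumes the core
Encode module is available and that the names come back from readdir as
raw bytes, and is_valid_utf8 and bad_names_in are names I made up for
illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # With FB_CROAK, decode() dies on malformed input; LEAVE_SRC stops it
    # from modifying $bytes in place. eval traps the die.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: list the entries of $dir whose names fail the check.
sub bad_names_in {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @bad = grep { !is_valid_utf8($_) } readdir $dh;
    closedir $dh;
    return @bad;
}

# When run as a script, report on the directory given as the first
# argument, or on the current directory if none is given.
unless (caller) {
    my $dir = @ARGV ? $ARGV[0] : '.';
    print "invalid UTF-8 in file name: $_\n" for bad_names_in($dir);
}
```

You could then skip or rename anything bad_names_in() returns before it
gets anywhere near the COPY statement. Whether Encode's strict 'UTF-8'
agrees with postgres on every edge case I can't say, so treat it as a
first filter.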

Good luck.
-- 

yours,

William
