Test for invalid unicode in file name

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Wed May 11 17:39:25 UTC 2005


On Tue, May 10, 2005 at 11:00:49PM -0400, Madison Kelly wrote:
> Shoot, it may actually be ShiftJIS... If that is the case, how can I 
> "translate" it into something that postgresql would not choke on? I, 
> wrongly I guess, thought Unicode included Japanese (et al.). How is a
> poor programmer to make a program that can handle all these different 
> encodings in a sane way?
> 
> This being a backup program, is there any way I can handle files that 
> could be named in any number of different ways or do I need to 
> brute-force in support for each possible locale or encoding method?

Well, you can choose what encoding to use when you mount the filesystem
(as far as I know).  It may also be that the application gets to choose
(but I hope not).

Shift JIS is a variable-width encoding for Japanese (one or two bytes per
character), while UTF-8 is an encoding for all of Unicode.  Both encode
ASCII characters as single bytes with the high bit clear, so ASCII
characters are almost the same in both encodings (Shift JIS maps 0x5C to
the yen sign rather than backslash).  Most Shift JIS byte sequences for
non-ASCII characters, however, are invalid in UTF-8, so the names would
have to be converted.  There are routines for converting between
encodings.
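
As a rough illustration of the point above, here is a minimal Python
sketch (not something from the original thread) that tests whether a byte
string is valid UTF-8 and re-encodes Shift JIS bytes as UTF-8.  The helper
names is_valid_utf8 and sjis_to_utf8 and the sample file name are made up
for this example; the same idea should carry over to Perl's conversion
routines.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sjis_to_utf8(raw: bytes) -> bytes:
    """Re-encode Shift JIS bytes as UTF-8.

    Raises UnicodeDecodeError if raw is not valid Shift JIS.
    """
    return raw.decode("shift_jis").encode("utf-8")


# Hypothetical sample name: a Japanese file name encoded as Shift JIS.
sjis_name = "日本語.txt".encode("shift_jis")

print(is_valid_utf8(b"backup.txt"))             # True: pure ASCII
print(is_valid_utf8(sjis_name))                 # False: Shift JIS bytes
print(is_valid_utf8(sjis_to_utf8(sjis_name)))   # True: after conversion
```

Note that the conversion only works if you already know the name is in
Shift JIS; the validity test by itself only tells you that it is not UTF-8.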

You might have to use some Perl routine to convert Shift JIS to UTF-8 for
storage if you want to use the PostgreSQL Unicode format (to support all
encodings).  Another option is, when sending Shift JIS names to the
database, to tell it that you want to use Shift JIS for that session
(the client_encoding setting), and Postgres will convert it to UTF-8 for
internal storage, at least as far as I know.

How you detect the encoding of the filesystem I don't know.  Many legacy
encodings are around since UTF-8 hasn't always existed, and it would have
been much too big and complex for many older systems that could handle a
simple 2-byte encoding for one language.
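
Lacking a reliable way to detect the encoding, one fallback a backup
program could try is guessing: attempt a strict decode with each likely
encoding in turn.  The Python sketch below is only a heuristic under
stated assumptions; the function name guess_decode and the candidate list
are invented for illustration, and a name in some other legacy encoding
can decode "successfully" as Shift JIS and come out as garbage.

```python
# Candidate encodings to try, in order.  This list is an assumption;
# adjust it for the locales you actually expect to see.
CANDIDATES = ("utf-8", "shift_jis")


def guess_decode(raw: bytes, candidates=CANDIDATES):
    """Return (text, encoding_name) for the first candidate that decodes.

    Strict UTF-8 goes first because random legacy bytes rarely happen to
    be valid UTF-8, so a successful decode is a fairly strong hint.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this never fails.  The
    # text may be wrong, but the original bytes can be recovered from it.
    return raw.decode("latin-1"), "latin-1"


text, enc = guess_decode("日本語".encode("shift_jis"))
print(enc)   # shift_jis
```

Recording the guessed encoding next to the converted name would let a
wrong guess be corrected later without losing the original bytes.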

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




