Removing junk characters from text files?

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Mon Feb 7 17:06:52 UTC 2005


On Fri, Feb 04, 2005 at 08:22:01AM -0500, William O'Higgins wrote:
> I get to deal with text files from Windoze and Mac sources on a regular
> basis, and frequently they are filled with junk characters.  I would
> love to be able to de-cruft these files in a systematic way.  I have no
> idea what some of the characters are - they often show up blue in vim,
> and they have numbers like \240 in hex.  I thought that bvi might work
> to let me search and replace then by hex code, but that didn't seem to
> work.  I can usually deal with the infamous "^M" with flip, but I'd love
> something in Perl or vim (so I can understand it - I'm sure it's doable
> in assembly or bash or smalltalk, but then I wouldn't learn anything)
> that will hunt out these weird artifacts of wonky software and remove
> them.  Any suggestions?
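
For the ^M part specifically, you don't need flip.  A one-liner will do
it; something like this (a sketch, untested, but the idiom is standard):

  # strip DOS carriage returns in place (Perl, since you asked)
  perl -pi -e 's/\r$//' badtextfile.txt

  # or from inside vim:
  :%s/\r//g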

\240 is an octal value, not hex (vim shows these in octal, which runs
from 0 to 377, i.e. 0 to 255 decimal or 0 to ff in hex).  It is either a
character from an extended character set, in which case to really read
the file you need to know the character map the file was written with,
or it is Unicode, which is quite likely if such bytes occur in groups of
2 or more together.  In that case any program that is able to read UTF-8
files should be able to handle it.
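
If it turns out to be an 8-bit character set and you can guess which one
(cp1252 from Windows is a common culprit), converting is better than
stripping.  Something along these lines should work (the cp1252 guess is
an assumption, and the output name is just for illustration):

  # convert from Windows cp1252 to UTF-8
  iconv -f CP1252 -t UTF-8 badtextfile.txt > converted.txt

  # or, if you just want the junk gone, delete everything that is not
  # printable ASCII, tab, newline or carriage return (Perl again)
  perl -pi -e 'tr/\x09\x0A\x0D\x20-\x7E//cd' badtextfile.txt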

What does 'file badtextfile.txt' say about it?  It is pretty smart at
guessing the type.
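
If file can't tell, an octal dump shows exactly which bytes are in
there (the \240 vim shows you will appear as 240):

  # dump bytes, printing non-printable ones as octal escapes
  od -c badtextfile.txt | head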

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml
