Removing junk characters from text files?

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Thu Feb 10 17:15:09 UTC 2005


On Thu, Feb 10, 2005 at 12:06:13PM -0500, William O'Higgins wrote:
> Thanks for all of the replies thus far.  To recap:
> 
> I have files with "bad" characters in them - stuff that doesn't print,
> but does screw up the regexes and other text processing.  I identified
> one of these as (I think) \240, but I wasn't sure.
> 
> Several people suggested tricks for removing DOS line endings, both in
> vi and using utilities like dos2unix (I use flip, but we're on the same
> page).
> 
> We also had people suggesting transposition operators, usually looking
> like tr///.  I agree whole-heartedly with this advice - these are good
> tools.
> 
> Lennart asked the incredibly salient question of "what does file say?"
> The answer is that file thinks it is text, encoded with the 8859
> charset.  These files are often multi-generational Windoze documents
> that have passed via the beauty of Object Linking and Embedding through
> several programs, each of which "knows" best.
> 
> The problem I have is that I don't know what to call some of these junk
> characters for transposition.  When vi hands you "||" in blue, what does
> that mean, and how do you strip it?  Thanks.
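
On the question of what to call the junk characters: one way to find out is
to count every byte that is neither printable ASCII nor ordinary whitespace,
and print each one in the octal form that tr accepts. The following is only a
sketch in Python; the names junk_bytes and report are made up for
illustration, and you may want a different idea of "harmless" for your files.

```python
from collections import Counter

# Bytes treated as harmless: tab, newline, and printable ASCII (space to ~).
# Carriage return (octal 015) is left out on purpose, so that DOS line
# endings show up in the report.
HARMLESS = {0x09, 0x0A} | set(range(0x20, 0x7F))


def junk_bytes(data: bytes) -> Counter:
    """Count each byte value in data that is not in HARMLESS."""
    return Counter(b for b in data if b not in HARMLESS)


def report(data: bytes) -> list:
    """Format the counts as octal escapes, most frequent first."""
    return [f"\\{b:03o}  x{n}" for b, n in junk_bytes(data).most_common()]
```

Feeding it the contents of a file opened in binary mode gives lines such as
"\240  x2", and an escape in that form can be handed to tr, for example
tr -d '\240', assuming your tr works on single bytes.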

Maybe what you really want is something like this:

lennartsorensen at debdev1:~$ apt-cache show tcs
Package: tcs
Priority: optional
Section: text
Installed-Size: 356
Maintainer: Frederic Peters <fpeters-8fiUuRrzOP0dnm+yROfE0A at public.gmane.org>
Architecture: i386
Version: 1-9
Depends: libc6 (>= 2.2.4-4)
Filename: pool/main/t/tcs/tcs_1-9_i386.deb
Size: 135404
MD5sum: fd4b28b17575073baf9ddee7e038291f
Description: Character set translator.
 tcs translates character sets from one encoding to another.
 .
 Supported encodings include utf (ISO utf-8), ascii, ISO 8859-[123456789],
 koi8, jis-kanji, ujis, ms-kanji, jis, gb, big5, unicode, tis, msdos, and
 atari.

Looks like a pretty nifty tool actually.
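
If installing tcs is not an option, a similar translation can be approximated
in a few lines of Python. This is a rough sketch and not what tcs does
internally. It assumes the files really are ISO 8859-1, as file reported; if
they are in fact a Windows code page, some bytes will come out as control
characters. The function name clean_latin1 is invented for the example. In
8859-1, \240 is the no-break space, which is swapped for a plain space here.

```python
def clean_latin1(raw: bytes) -> bytes:
    """Decode ISO 8859-1 bytes, tidy them, and return UTF-8 bytes."""
    # Every byte value is defined in ISO 8859-1, so this decode cannot fail.
    text = raw.decode("iso-8859-1")
    # Octal 240 (U+00A0) is the no-break space; replace it with a plain space.
    text = text.replace("\u00a0", " ")
    # Turn DOS line endings (CR LF) into Unix ones (LF).
    text = text.replace("\r\n", "\n")
    return text.encode("utf-8")
```

Read the input with open(name, "rb") and write the result with "wb" so that
Python does not apply a second encoding on the way in or out.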

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




