Removing junk characters from text files?
Lennart Sorensen
lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Thu Feb 10 17:15:09 UTC 2005
On Thu, Feb 10, 2005 at 12:06:13PM -0500, William O'Higgins wrote:
> Thanks for all of the replies thus far. To recap:
>
> I have files with "bad" characters in them - stuff that doesn't print,
> but does screw up the regexes and other text processing. I identified
> one of these as (I think) \240, but I wasn't sure.
>
> Several people suggested tricks for removing DOS line endings, both in
> vi and using utilities like dos2unix (I use flip, but we're on the same
> page).
>
> We also had people suggesting transposition operators, usually looking
> like tr///. I agree whole-heartedly with this advice - these are good
> tools.
>
> Lennart asked the incredibly salient question of "what does file say?"
> The answer is that file thinks it is text, encoded with the 8859
> charset. These files are often multi-generational Windoze documents
> that have passed via the beauty of Object Linking and Embedding through
> several programs, each of which "knows" best.
>
> The problem I have is that I don't know what to call some of these junk
> characters for transposition. When vi hands you "||" in blue, what does
> that mean, and how do you strip it? Thanks.
Maybe what you really want is something like this:
lennartsorensen@debdev1:~$ apt-cache show tcs
Package: tcs
Priority: optional
Section: text
Installed-Size: 356
Maintainer: Frederic Peters <fpeters-8fiUuRrzOP0dnm+yROfE0A at public.gmane.org>
Architecture: i386
Version: 1-9
Depends: libc6 (>= 2.2.4-4)
Filename: pool/main/t/tcs/tcs_1-9_i386.deb
Size: 135404
MD5sum: fd4b28b17575073baf9ddee7e038291f
Description: Character set translator.
 tcs translates character sets from one encoding to another.
 .
 Supported encodings include utf (ISO utf-8), ascii, ISO 8859-[123456789],
 koi8, jis-kanji, ujis, ms-kanji, jis, gb, big5, unicode, tis, msdos, and
 atari.
Looks like a pretty nifty tool actually.
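
On the question of what to call the junk characters: listing the odd bytes by
their octal value gives you the names to hand to tr. Below is a minimal sketch
in Python, assuming the files really are ISO 8859-1 as file reports. The names
PLAIN, report_junk and clean are made up for this illustration, and treating
\240 as a no-break space only holds under that 8859-1 assumption.

```python
from collections import Counter

# Bytes treated as harmless: printable ASCII plus tab, LF and CR.
# This set is an assumption; widen it if the files contain real accented text.
PLAIN = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def report_junk(data: bytes) -> dict:
    """Map each suspicious byte, written as an octal escape, to its count."""
    counts = Counter(b for b in data if b not in PLAIN)
    return {"\\%03o" % b: n for b, n in sorted(counts.items())}


def clean(data: bytes) -> bytes:
    """Turn 0xA0 (octal 240) into a plain space and CRLF into LF."""
    return data.replace(b"\xa0", b" ").replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = b"price:\xa0100\r\nnext line\r\n"
    print(report_junk(sample))  # octal names of the bytes outside PLAIN
    print(clean(sample))        # the same data with those bytes normalised
```

Any escape that report_junk prints (for example \240) can be used as-is in a
tr set. Whether a given byte should be replaced or deleted depends on what it
stood for in the original document, so check the report before extending clean.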
Lennart Sorensen
--
The Toronto Linux Users Group. Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml