Removing junk characters from text files?

Taavi Burns jaaaarel-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Fri Feb 4 18:50:59 UTC 2005


> > work.  I can usually deal with the infamous "^M" with flip, but I'd love
> > something in Perl or vim (so I can understand it - I'm sure it's doable
> > in assembly or bash or smalltalk, but then I wouldn't learn anything)
> > that will hunt out these weird artifacts of wonky software and remove
> > them.  Any suggestions?

Well...UNIX has always used LF (linefeed) to signify the end of a line.
DOS/Windows somewhere along the line decided to use CR/LF (CR
would be carriage return).  Ye Olde Macs use simply CR (newer ones
use LF, as they're actually UNIX).

LF has an ASCII value of 10, and CR has an ASCII value of 13.
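
To see which of these conventions a given file actually uses, the byte
values above can be counted directly. This is only a rough sketch:
classify_newlines is a made-up helper name, and the sample string is
invented for illustration.

```perl
use strict;
use warnings;

# Count each line-ending style in a string of raw bytes.
# classify_newlines is a hypothetical helper, not something from this thread.
sub classify_newlines {
    my ($text) = @_;
    my %count = (crlf => 0, cr => 0, lf => 0);
    # The CR/LF pair is listed first so it is not counted as a lone CR
    # followed by a lone LF.
    while ($text =~ /(\x0d\x0a|\x0d|\x0a)/g) {
        my $eol = $1;
        if    ($eol eq "\x0d\x0a") { $count{crlf}++ }
        elsif ($eol eq "\x0d")     { $count{cr}++ }
        else                       { $count{lf}++ }
    }
    return \%count;
}

# Invented sample: one DOS line, one UNIX line, one old-Mac line.
my $c = classify_newlines("dos\x0d\x0aunix\x0aoldmac\x0d");
printf "CRLF=%d CR=%d LF=%d\n", $c->{crlf}, $c->{cr}, $c->{lf};
```

A file that reports a mix of styles has probably passed through more than
one piece of wonky software.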

> I have this problem as well.  A quick fix for the "^M" in vim/vi you can
> use :.,$s/\r//g and that will get rid of them.  I started to mess around

Sure does.  \n tends to map to LF and \r to CR (though I'm pretty sure
that some DOS compilers will automatically translate \n to CR/LF when
writing files in text mode).
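
Because \n and \r can be translated behind your back, a Perl version of
the vim substitution is probably safest written with explicit byte
values. A minimal sketch, assuming you want every line ending turned into
a plain LF (normalize_newlines is a made-up name):

```perl
use strict;
use warnings;

# Convert CR/LF pairs and lone CRs into a single LF.
# normalize_newlines is a hypothetical helper, not something from this thread.
sub normalize_newlines {
    my ($text) = @_;
    # \x0d\x0a? matches a CR optionally followed by an LF, so both the
    # DOS pair and the old-Mac lone CR collapse to one LF.
    $text =~ s/\x0d\x0a?/\x0a/g;
    return $text;
}

my $dos  = "one\x0d\x0atwo\x0d\x0a";
my $unix = normalize_newlines($dos);
print $unix eq "one\x0atwo\x0a" ? "converted\n" : "unexpected\n";
```

When reading a real file on DOS/Windows, calling binmode on the
filehandle first should keep Perl from doing its own CR/LF translation
before the substitution ever sees the data.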

> with a perl script to get rid of all non-ASCII characters from a file
> but I couldn't seem to get it to work and I am too busy today to spend
> more time on it, but here it is.  You might be able to modify it or
> whatever to get it to work.  The line $line =~s/[^\x00-\x7f]//g;  is
> supposed to replace all non-ASCII characters with nothing but it doesn't
> seem to work here.  Anyway, hope this helps.

Naturally, since the ^M you see (the CR) is a perfectly valid ASCII character.
Perhaps something like:
$line =~ s/[^\x00-\x0c\x0e-\x7f]//g;
might work better for removing the DOS/Win formatting.
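
As a rough, self-contained sketch of that idea: the set of characters to
keep is written here as \x00-\x0c plus \x0e-\x7f, so that CR (0x0d) and
everything above 0x7f fall outside it and get deleted. strip_junk is a
made-up name and the sample string is invented.

```perl
use strict;
use warnings;

# Remove CR (0x0d) and every byte above 0x7f from a string.
# strip_junk is a hypothetical helper, not something from this thread.
sub strip_junk {
    my ($line) = @_;
    # Negated class: anything that is NOT in 0x00-0x0c or 0x0e-0x7f is dropped.
    $line =~ s/[^\x00-\x0c\x0e-\x7f]//g;
    return $line;
}

# Invented sample: a Latin-1 e-acute (0xe9) and a DOS line ending.
my $dirty = "caf\xe9 au lait\x0d\x0a";
my $clean = strip_junk($dirty);
print $clean eq "caf au lait\x0a" ? "cleaned\n" : "unexpected\n";
```

Note that this deletes CRs outright, so an old-Mac file that uses CR alone
would end up as one long line; converting the CRs to LFs first would
avoid that.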

Good luck!  :)

-- 
taa
/*eof*/
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




