Removing junk characters from text files?

Devin Whalen devin-Gq53QDLGkWIleAitJ8REmdBPR1lH4CV8 at public.gmane.org
Fri Feb 4 18:35:55 UTC 2005


On Fri, 2005-02-04 at 13:32 -0500, Devin Whalen wrote:
> On Fri, 2005-02-04 at 08:22 -0500, William O'Higgins wrote:
> > I get to deal with text files from Windoze and Mac sources on a regular
> > basis, and frequently they are filled with junk characters.  I would
> > love to be able to de-cruft these files in a systematic way.  I have no
> > idea what some of the characters are - they often show up blue in vim,
> > and they have numbers like \240 in octal.  I thought that bvi might work
> > to let me search and replace them by hex code, but that didn't seem to
> > work.  I can usually deal with the infamous "^M" with flip, but I'd love
> > something in Perl or vim (so I can understand it - I'm sure it's doable
> > in assembly or bash or smalltalk, but then I wouldn't learn anything)
> > that will hunt out these weird artifacts of wonky software and remove
> > them.  Any suggestions?
> > 
> > Thanks.
> 
> 
> 
> Hey,
> 
> I have this problem as well.  In vim/vi, a quick fix for the "^M" is
> :.,$s/\r//g, which removes them from the current line to the end of the
> file (:%s/\r//g covers the whole file).  I started to mess around
> with a perl script to get rid of all non-ASCII characters from a file
> but I couldn't seem to get it to work and I am too busy today to spend
> more time on it, but here it is.  You might be able to modify it or
> whatever to get it to work.  The line $line =~s/[^\x00-\x7f]//g;  is
> supposed to replace all non-ASCII characters with nothing but it doesn't
> seem to work here.  Anyway, hope this helps.
> 
> Later
> 
> 
> #!/usr/bin/perl -w
> use strict;
> 
> my $file = "sql.sql";
> my $newFile = "onlyascii.sql";
> 
> open(FILE, "<", $file) || die "Could not open $file: $!\n";
> open(NEWFILE, ">", $newFile) || die "Could not open $newFile: $!\n";
> my $line = "";
>         while($line = <FILE>)
>         {
>          chomp($line);
>           #$line =~s/[^\w\d\s]+//g;
>           # Strip non-ASCII characters and carriage returns.
>           $line =~ s/[[:^ascii:]]|\r//g;
>           #$line =~s/[^\x00-\x7f]//g;
>           #warn "$line\n";
>           #$line=~s/[^\w\s<>,.'"*-+=]#:;?\/&%@!\$()]}{_~`\^]/ /g;
> 
>           print(NEWFILE $line."\n");
>         }
> 
> 
> close(FILE);
> close(NEWFILE);
> 
> 
> 
> 

I just realized that "^M" (carriage return, 0x0D) is itself an ASCII
character, so a regex that strips only non-ASCII characters leaves it
alone.  The script may still get rid of the other characters for you; I
just don't have a file with any non-ASCII characters to test it on, only
the "^M".  No wonder it wasn't working :).
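
For what it's worth, here is a minimal sketch of how the two cases could be
handled together.  It is untested against real Windows or Mac exports, and
the helper names strip_junk and junk_report are made up for illustration
(they are not part of the script above).  It assumes the file is read as raw
bytes, so anything above 0x7f counts as junk, and it treats "\r" as junk too
even though it is ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: delete carriage returns and every byte outside
# the 7-bit ASCII range (0x00-0x7f) from a string.
sub strip_junk {
    my ($line) = @_;
    $line =~ s/[^\x00-\x7f]|\r//g;
    return $line;
}

# Hypothetical helper: count each offending byte so you can see what the
# junk actually is before deleting it.  Keys are octal escapes like \240.
sub junk_report {
    my ($text) = @_;
    my %count;
    $count{ sprintf('\\%03o', ord $1) }++
        while $text =~ /([^\x00-\x7f]|\r)/g;
    return \%count;
}

# Small demonstration: 0xe9 and 0xa0 are non-ASCII, "\r" is the "^M".
my $sample = "caf\xe9\xa0ok\r\n";
print strip_junk($sample);    # prints "cafok" followed by a newline

my $report = junk_report($sample);
print "$_ seen $report->{$_} time(s)\n" for sort keys %$report;
```

The same substitution should also work as a one-liner filter, something
like:  perl -pe 's/[^\x00-\x7f]|\r//g' sql.sql > onlyascii.sql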

Later


-- 
Devin Whalen
Programmer
Synaptic Vision Inc
Phone-(416) 539-0801
Fax- (416) 539-8280
1179A King St. West
Toronto, Ontario
Suite 309 M6K 3C5
Home-(416) 653-3982

