Removing junk characters from text files?

Devin Whalen devin-Gq53QDLGkWIleAitJ8REmdBPR1lH4CV8 at public.gmane.org
Fri Feb 4 18:32:46 UTC 2005


On Fri, 2005-02-04 at 08:22 -0500, William O'Higgins wrote:
> I get to deal with text files from Windoze and Mac sources on a regular
> basis, and frequently they are filled with junk characters.  I would
> love to be able to de-cruft these files in a systematic way.  I have no
> idea what some of the characters are - they often show up blue in vim,
> and they have numbers like \240 in hex.  I thought that bvi might work
> to let me search and replace them by hex code, but that didn't seem to
> work.  I can usually deal with the infamous "^M" with flip, but I'd love
> something in Perl or vim (so I can understand it - I'm sure it's doable
> in assembly or bash or smalltalk, but then I wouldn't learn anything)
> that will hunt out these weird artifacts of wonky software and remove
> them.  Any suggestions?
> 
> Thanks.



Hey,

I have this problem as well.  For the "^M" characters, a quick fix in
vim/vi is :%s/\r//g, which strips them from every line of the file
(:.,$s/\r//g does the same, but only from the current line to the end).
I started on a Perl script to strip all non-ASCII characters from a
file, but I couldn't get it to work and I am too busy today to spend
more time on it, so here it is.  You might be able to modify it to get
it working.  The line $line =~ s/[^\x00-\x7f]//g; is supposed to replace
every non-ASCII character with nothing, but it doesn't seem to work
here.  Anyway, hope this helps.
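
To make the intent of that substitution concrete, here is a minimal
sketch that applies it to a string instead of a file.  The helper name
strip_junk and the sample text are made up for illustration, and it
assumes the data is handled as raw bytes (no :utf8 or :encoding layer on
the filehandle).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of the script below.
sub strip_junk {
    my ($text) = @_;
    $text =~ s/\r//g;              # carriage returns, shown as ^M in vim
    $text =~ s/[^\x00-\x7f]//g;    # every byte above 0x7f, such as \240
    return $text;
}

# Sample line with two high bytes (\351 and \240) and a CR before the LF.
my $sample = "caf\351\240au lait\r\n";
print strip_junk($sample);
```

Note that this deletes accented letters along with the junk, so it only
suits files that are meant to be plain ASCII in the first place.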

Later


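#!/usr/bin/perl
use strict;
use warnings;

my $file    = "sql.sql";
my $newFile = "onlyascii.sql";

# Three-argument open with lexical handles; $! says why an open failed.
open(my $in,  '<', $file)    || die "Could not open $file: $!\n";
open(my $out, '>', $newFile) || die "Could not open $newFile: $!\n";

while (my $line = <$in>)
{
    chomp($line);
    # Delete carriage returns and every byte outside 7-bit ASCII.  The
    # negated POSIX class is spelled [[:^ascii:]]; [[^:ascii:]] is read
    # as an ordinary bracket list followed by a literal "]".
    $line =~ s/[^\x00-\x7f]|\r//g;
    #$line =~ s/[[:^ascii:]\r]//g;    # same effect, POSIX-class spelling
    #$line =~ s/[^\w\d\s]+//g;        # stricter: also drops punctuation

    print {$out} $line . "\n";
}

close($in);
close($out) || die "Could not close $newFile: $!\n";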
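
Before deleting anything it can help to see which odd bytes a file
actually contains.  The following is a rough sketch along those lines,
not a tested tool: the sub names count_high_bytes and report are
invented here, and the file to inspect is assumed to be passed as the
first command-line argument.  Each byte is printed in octal and in hex,
so a character that vim shows as \240 is listed as 0xa0 (the
non-breaking space in Latin-1 and Windows-1252).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tally every byte above 0x7f in a string.
# Returns a hash reference mapping byte value => number of occurrences.
sub count_high_bytes {
    my ($data) = @_;
    my %count;
    $count{ ord($1) }++ while $data =~ /([^\x00-\x7f])/g;
    return \%count;
}

# Print one line per offending byte: octal code, hex code, and count.
sub report {
    my ($count) = @_;
    for my $byte (sort { $a <=> $b } keys %$count) {
        printf "\\%03o  0x%02x  %d times\n", $byte, $byte, $count->{$byte};
    }
}

# Only read a file when one is named on the command line.
if (@ARGV) {
    my $path = $ARGV[0];
    open(my $fh, '<:raw', $path) || die "Could not open $path: $!\n";
    local $/;                      # slurp mode: read the whole file at once
    my $data = <$fh>;
    close($fh);
    report(count_high_bytes($data));
}
```

Knowing the byte values makes it possible to replace some of them (for
example turning \240 into an ordinary space) instead of deleting
everything above 0x7f.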




-- 
Devin Whalen
Programmer
Synaptic Vision Inc
Phone-(416) 539-0801
Fax- (416) 539-8280
1179A King St. West
Toronto, Ontario
Suite 309 M6K 3C5
Home-(416) 653-3982


Take back the Web with FireFox....a browser you can trust
www.getfirefox.com

   .-.
   /v\    L   I   N   U   X
  // \\  
 /(   )\
  ^^-^^   

--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




