Removing junk characters from text files?
Devin Whalen
devin-Gq53QDLGkWIleAitJ8REmdBPR1lH4CV8 at public.gmane.org
Fri Feb 4 18:32:46 UTC 2005
On Fri, 2005-02-04 at 08:22 -0500, William O'Higgins wrote:
> I get to deal with text files from Windoze and Mac sources on a regular
> basis, and frequently they are filled with junk characters. I would
> love to be able to de-cruft these files in a systematic way. I have no
> idea what some of the characters are - they often show up blue in vim,
> and they have numbers like \240 in hex. I thought that bvi might work
> to let me search and replace them by hex code, but that didn't seem to
> work. I can usually deal with the infamous "^M" with flip, but I'd love
> something in Perl or vim (so I can understand it - I'm sure it's doable
> in assembly or bash or smalltalk, but then I wouldn't learn anything)
> that will hunt out these weird artifacts of wonky software and remove
> them. Any suggestions?
>
> Thanks.
Hey,
I have this problem as well. As a quick fix for the "^M" in vim/vi, you
can use :.,$s/\r//g, which removes them from the current line to the end
of the file (use :%s/\r//g to cover the whole file). I started messing
around with a Perl script to strip all non-ASCII characters from a file,
but I couldn't get it to work when I tried it and I am too busy today to
spend more time on it, so here it is anyway. You might be able to modify
it to get it working. The line $line =~ s/[^\x00-\x7f]//g; is supposed
to replace every non-ASCII character with nothing, but it didn't seem to
work for me here. Anyway, hope this helps.
Later
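#!/usr/bin/perl -w
use strict;

my $file    = "sql.sql";
my $newFile = "onlyascii.sql";

# Three-argument open with lexical handles; $! says why an open failed.
open(my $in,  '<', $file)    || die "Could not open $file: $!\n";
open(my $out, '>', $newFile) || die "Could not open $newFile: $!\n";

while (my $line = <$in>)
{
    chomp($line);
    # Strip carriage returns and every byte outside the ASCII range.
    # Note: the negated POSIX class is spelled [[:^ascii:]]. Written as
    # [[^:ascii:]] it is parsed as a bracketed class plus a literal "]".
    $line =~ s/[^\x00-\x7f]|\r//g;
    #$line =~ s/[[:^ascii:]]|\r//g;   # same effect, POSIX-class spelling
    print $out $line, "\n";
}

close($in);
close($out) || die "Could not close $newFile: $!\n";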
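
Before stripping anything, it can help to see which odd bytes a file
actually contains. Below is a rough sketch, not a finished tool: the two
helper names (strip_non_ascii, count_non_ascii) and the sample string are
made up for illustration, and it assumes the text is handled as raw bytes
with no encoding layer on the filehandle. If the \240 from the question is
an octal escape, it is byte 0xA0, which is the non-breaking space in
Latin-1.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove carriage returns and every byte above 0x7F.
sub strip_non_ascii {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7f]|\r//g;
    return $text;
}

# Hypothetical helper: tally each non-ASCII byte, keyed by its octal
# escape (the \240 style used in the question).
sub count_non_ascii {
    my ($text) = @_;
    my %seen;
    $seen{ sprintf('\\%03o', ord $1) }++ while $text =~ /([^\x00-\x7f])/g;
    return \%seen;
}

# Made-up sample: a non-breaking space (0xA0) and a DOS line ending.
my $sample = "price:\xa0100\r\n";
my $counts = count_non_ascii($sample);
print "$_ x $counts->{$_}\n" for sort keys %$counts;
print strip_non_ascii($sample);
```

Running the tally first shows whether a byte is worth replacing with
something (a plain space for 0xA0, say) rather than just deleting it.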
--
Devin Whalen