Removing junk characters from text files?
Devin Whalen
devin-Gq53QDLGkWIleAitJ8REmdBPR1lH4CV8 at public.gmane.org
Fri Feb 4 18:35:55 UTC 2005
On Fri, 2005-02-04 at 13:32 -0500, Devin Whalen wrote:
> On Fri, 2005-02-04 at 08:22 -0500, William O'Higgins wrote:
> > I get to deal with text files from Windoze and Mac sources on a regular
> > basis, and frequently they are filled with junk characters. I would
> > love to be able to de-cruft these files in a systematic way. I have no
> > idea what some of the characters are - they often show up blue in vim,
> > and they have numbers like \240 in octal. I thought that bvi might work
> > to let me search and replace them by hex code, but that didn't seem to
> > work. I can usually deal with the infamous "^M" with flip, but I'd love
> > something in Perl or vim (so I can understand it - I'm sure it's doable
> > in assembly or bash or smalltalk, but then I wouldn't learn anything)
> > that will hunt out these weird artifacts of wonky software and remove
> > them. Any suggestions?
> >
> > Thanks.
>
>
>
> Hey,
>
> I have this problem as well. A quick fix for the "^M" in vim/vi is
> :.,$s/\r//g, which removes them from the current line to the end of the
> file (:%s/\r//g covers the whole file). I started to mess around
> with a perl script to get rid of all non-ASCII characters from a file
> but I couldn't seem to get it to work and I am too busy today to spend
> more time on it, but here it is. You might be able to modify it or
> whatever to get it to work. The line $line =~s/[^\x00-\x7f]//g; is
> supposed to replace all non-ASCII characters with nothing but it doesn't
> seem to work here. Anyway, hope this helps.
>
> Later
>
>
> #!/usr/bin/perl -w
> use strict;
>
> my $file = "sql.sql";
> my $newFile = "onlyascii.sql";
>
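> # Three-argument open keeps the mode separate from the file name;
> # $! reports why an open failed.
> open(FILE, "<", $file) || die "Could not open $file: $!\n";
> open(NEWFILE, ">", $newFile) || die "Could not open $newFile: $!\n";
> my $line = "";
> while($line = <FILE>)
> {
>     chomp($line);
>     #$line =~s/[^\w\d\s]+//g;
>     # The negated POSIX class is written [:^ascii:], not [^:ascii:].
>     # This removes every non-ASCII character and every carriage return.
>     $line =~s/[[:^ascii:]\r]//g;
>     #$line =~s/[^\x00-\x7f]//g;
>     #warn "$line\n";
>     #$line=~s/[^\w\s<>,.'"*-+=]#:;?\/&%@!\$()]}{_~`\^]/ /g;
>
>     print(NEWFILE $line."\n");
> }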
>
>
> close(FILE);
> close(NEWFILE);
>
>
>
>
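A minimal sketch of the same filter as a small function, so it can be
tried on a string before pointing it at a file. The name strip_junk and
the sample text are made up for illustration, and the sketch assumes the
input is handled as raw bytes (no decoding layer on the file handle).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_junk is a hypothetical helper name, not part of any module.
# It deletes every byte above 0x7f plus every carriage return
# (0x0d, displayed as "^M" in vim).
sub strip_junk {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7f]|\r//g;
    return $text;
}

# "\xa0" is the byte that shows up as \240 in octal.
my $sample = "price:\xa0100\r\n";
print strip_junk($sample);
```

Deleting the bytes outright is the bluntest option. If the junk byte
stands in for a space, replacing it with " " instead may read better.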
I just realized that "^M" is actually an ASCII character (carriage
return, 0x0D), so a pattern that only strips non-ASCII characters leaves
it alone. The script may still get rid of the other characters for you.
I just don't have a file with any non-ASCII characters to test with,
just the "^M". No wonder it wasn't working :).
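
To see what is actually in a file before deleting anything, something
like the following might help. It is only a sketch: count_junk is a
made-up name, and it assumes the data is read as raw bytes. It tallies
carriage returns separately from the high bytes, since "^M" falls inside
the ASCII range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_junk is a hypothetical helper: it tallies carriage returns and
# bytes above 0x7f in a byte string, keyed by octal escape such as \240.
sub count_junk {
    my ($bytes) = @_;
    my %seen;
    while ($bytes =~ /([^\x00-\x7f]|\r)/g) {
        $seen{ sprintf("\\%03o", ord($1)) }++;
    }
    return \%seen;
}

# Made-up sample: two carriage returns and two 0xa0 bytes.
my $report = count_junk("one\r\ntwo\xa0three\xa0\r\n");
for my $code (sort keys %$report) {
    print "$code: $report->{$code}\n";
}
```

In the report, \015 is the carriage return and \240 is the byte 0xA0.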
Later
--
Devin Whalen