Removing junk characters from text files?
Lennart Sorensen
lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Thu Feb 10 21:18:29 UTC 2005
On Thu, Feb 10, 2005 at 04:01:43PM -0500, William O'Higgins wrote:
> On Thu, Feb 10, 2005 at 01:51:27PM -0500, Stewart C. Russell wrote:
> What I am looking for is a way to strip these characters out. They seem
> to be coming from formatting code, and they have 0 semantic value - they
> just prevent CSV files from being cleanly pulled into databases or
> correctly interpreted by spreadsheets. Basically, the problem is that
> when I see these junk characters (vim syntax colouring shows them in
> blue on a console) I want to do this:
>
> :%s/$junkcharacter//g
>
> The problem is that I don't know how to obtain values for $junkcharacter
> based on the crap I see on the screen. F'rinstance, a CRLF shows up as
> ^M in vim (with a line break) and I know that that is called "\r" in
> my replacement string - but I don't know what to call some of this other
> stuff that I see. I can't copy/paste it, because it is represented on
> the screen as something other than what is found with a regex. Does
> that help?
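One way to find out what the junk actually is would be to list every byte
that is not printable ASCII, along with its numeric code. What follows is
only a rough sketch: junk_report is a made-up helper name, and the sample
string is invented for illustration rather than taken from your files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# junk_report is a hypothetical helper name, not from the original post.
# It counts every character that is neither printable ASCII (32..126)
# nor linefeed (10), keyed by its numeric code.
sub junk_report {
    my ($text) = @_;
    my %count;
    for my $c (split //, $text) {
        my $n = ord($c);
        next if ($n >= 32 && $n <= 126) || $n == 10;
        $count{$n}++;
    }
    return \%count;
}

# Invented sample: one 0x96 byte and a CRLF line ending.
my $sample = "a,b\x96c\r\n";
my $found  = junk_report($sample);

# Print the codes in ascending order, in decimal and hex.
for my $n (sort { $a <=> $b } keys %$found) {
    printf "dec %3d  hex 0x%02X  seen %d time(s)\n", $n, $n, $found->{$n};
}
```

To try it on a real file you would read the file in binary mode and pass
its contents to junk_report. Inside vim, the "ga" command also prints the
code of the character under the cursor, which may be enough on its own.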
How about this for a filter:
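#!/usr/bin/perl
use strict;
use warnings;

# Copy stdin (or the named files) to stdout, dropping junk characters.
while (my $line = <>) {
    foreach my $c (split //, $line) {
        my $n = ord($c);
        # Looser alternative: keep anything in the 7-bit ASCII range.
        #print $c if $n < 128;
        # Keep codes 32..127 plus linefeed (10); CR, other control
        # characters, and all 8-bit characters are dropped.
        print $c if ($n > 31 and $n < 128) or $n == 10;
    }
}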
Just pipe the file through that Perl script and see if that does it.
I think characters 32 to 127 and linefeed are all that you would want
in a Unix text file (strictly, 127 is DEL, so the printable range ends
at 126).
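
The same filter can probably be written more compactly with Perl's tr
operator. This is a sketch under the same assumptions as the loop above
(keep linefeed and codes 32 to 127, delete everything else); strip_junk
is a hypothetical name and the sample string is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_junk is a hypothetical name. It keeps linefeed (0x0A) and codes
# 0x20..0x7F, the same set as the character loop, and deletes the rest.
sub strip_junk {
    my ($text) = @_;
    # /c complements the listed set, /d deletes whatever matches it.
    (my $clean = $text) =~ tr/\x0A\x20-\x7F//cd;
    return $clean;
}

# Invented sample: a stray 0x96 byte and a CRLF line ending.
my $sample = "a,b\x96c\r\n";
print strip_junk($sample);   # prints "a,bc" followed by a linefeed
```

Note that tab (code 9) is deleted too, which matters if the data is
tab-separated; adding \x09 to the kept set would preserve it. As a
one-liner in a pipeline this would be something like
perl -pe 'tr/\x0A\x20-\x7F//cd' < in.csv > out.csv, where the file names
are placeholders.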
Lennart Sorensen