Removing junk characters from text files?

William O'Higgins william.ohiggins-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Fri Feb 11 01:08:25 UTC 2005


I now have a solution that works (well, two), so thanks to all who
helped me out and stuck with this thread.  What follows is a quick
recap.

Problem:  Windoze put formatting characters from the upper reaches of
the 8859 charset in my csv files, and I wanted them stripped.  I didn't
know what to call them, and TLUG came to the rescue.

Henry Spencer's solution:

>        tr -cd '\n\040-\176'
>
>That gets rid of everything except newlines and the printable ASCII
>characters.  (\177, aka character 127, is not printable.)

That worked great, but it took me a while to remember how to put that on
a command line, like so:

cat filewithcrap.csv | tr -cd '\n\040-\176' > crapfree.csv
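
For what it's worth, the same filter can read the file through a
redirect, with no cat needed.  A minimal sketch: strip_junk is a made-up
helper name, and the sample line (0xA0 bytes and a carriage return
standing in for the junk) is an assumption, not data from the real file.

```shell
# strip_junk is a hypothetical helper name, not something from the thread.
strip_junk() {
    # LC_ALL=C asks tr to treat the input as plain bytes.
    # -c complements the set, -d deletes: everything except newline
    # and octal 040-176 (space through tilde) is removed.
    LC_ALL=C tr -cd '\n\040-\176'
}

# Made-up sample line: octal 240 (0xA0) after two fields, CR before the LF.
printf '6\240,Lazard\240,51\r\n' | strip_junk
```

On a real file that would be "strip_junk < filewithcrap.csv > crapfree.csv".
Tabs and carriage returns get dropped as well, since neither is in the
kept set.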

Lennart Sorensen's solution:

>#!/usr/bin/perl
># Keep only linefeed (10) and characters 32 through 127.
>while(<>) {
>        @chars = split(//);
>        foreach $c (@chars) {
>                #print "$c" if (ord($c)<128);
>                print "$c" if ((ord($c) < 128 and ord($c) > 31) or ord($c) == 10);
>        }
>}
>
>Just pipe the file through that perl script and see if that does it.  I
>think characters 32 to 127 and linefeed are all that you would want in a
>unix text file.
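
For comparison, roughly the same filter as the script above can be
written with tr.  This is a sketch, not something posted in the thread,
and both function names are made up.  The one difference from Henry's
range is the upper bound: the perl test ord($c)<128 lets character 127
(DEL) through, so the matching tr range ends at \177 instead of \176.

```shell
# Both helper names are hypothetical, chosen only for this comparison.
henry_filter() {
    # Newline plus octal 040-176: DEL (177) is deleted.
    LC_ALL=C tr -cd '\n\040-\176'
}

perl_like_filter() {
    # Newline plus octal 040-177: same set as the perl test, DEL is kept.
    LC_ALL=C tr -cd '\n\040-\177'
}

# Made-up input: "a", DEL, "b", newline.  Count the bytes that survive.
printf 'a\177b\n' | henry_filter | wc -c
printf 'a\177b\n' | perl_like_filter | wc -c
```

The first count should come out one lower than the second, because only
the DEL byte is treated differently.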

This also works, so that's two ways, which never ever hurts.  Devin
Whalen's very thorough-looking solution didn't work for me, though that
could be user error; I need to spend some time with it to understand it
better.  Here's my response to his message:

On Thu, Feb 10, 2005 at 04:18:38PM -0500, Devin Whalen wrote:
>Can you send a file with some examples?  I am pretty sure the perl
>script I sent will work.  I used it on getting junk characters from a
>file from an AIX server.

I'll put in some example text under my .sig.  Running your script on my
csv file worked like "touch" for me - harmless but not functional.  I
dunno why yet, but I should get some time to look at it soon.
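
A quick way to tell whether a filter actually did anything is to count
the bytes that fall outside newline and printable ASCII, before and
after running it.  A minimal sketch, with count_junk as a made-up helper
name and a made-up sample line rather than my real data:

```shell
# count_junk is a hypothetical helper name.
count_junk() {
    # Delete newline and printable ASCII, then count whatever is left.
    LC_ALL=C tr -d '\n\040-\176' | wc -c
}

# Made-up sample: two 0xA0 bytes (octal 240) among otherwise clean text.
printf '6\240,Lazard\240,51\n' | count_junk
```

If the count is the same on the input and on the output, the filter was
"touch" again; if it drops to zero, the junk is gone.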
-- 

yours,

William


,Advisors with Announced Mergers,,
Rank  ,Advisor,Count  ,Total Value  
1  ,Goldman Sachs  ,109  ,"384,587,655,889  "
2  ,J P Morgan  ,89  ,"315,830,356,519  "
3  ,Morgan Stanley  ,97  ,"261,610,080,134  "
4  ,Merrill Lynch  ,49  ,"256,469,079,383  "
5  ,Lehman Brothers  ,68  ,"178,053,915,635  "
6  ,Lazard  ,51  ,"167,218,432,036  "
7  ,Citigroup  ,63  ,"165,931,845,345  "
8  ,Credit Suisse First Boston  ,61  ,"128,476,680,743  "
9  ,UBS  ,58  ,"122,940,275,398  "
10  ,Rothschild  ,22  ,"93,481,910,892  "
11  ,BNP Paribas  ,2  ,"71,010,611,837  "
12  ,Deutsche Bank  ,37  ,"53,642,317,116  "
13  ,Banc of America Securities  ,72  ,"51,391,768,933  "
14  ,Bear Stearns  ,41  ,"30,962,283,027  "
15  ,Unicredito Italiano Group  ,2  ,"29,436,189,520  "
16  ,Banca Intesa SPA  ,1  ,"29,333,946,470  "
17  ,Mediobanca SPA  ,1  ,"29,333,946,470  "
18  ,Houlihan Lokey Howard & Zukin  ,27  ,"23,316,569,382  "
19  ,HSBC  ,4  ,"22,723,939,539  "
20  ,Cazenove & Co  ,8  ,"19,686,510,001  "
21  ,Wachovia  ,34  ,"18,890,085,079  "
22  ,Sandler O'Neill  ,39  ,"13,265,640,755  "
23  ,Keefe Bruyette & Woods Inc  ,36  ,"13,014,032,089  "
24  ,Greenhill & Co  ,8  ,"10,903,667,720  "
25  ,Petrie Parkman & Co  ,7  ,"10,315,696,517  "