Removing junk characters from text files?
William O'Higgins
william.ohiggins-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Fri Feb 11 01:08:25 UTC 2005
I now have a solution that works (well, two), so thanks to all who
helped me out and stuck with this thread. What follows is a quick
recap.
Problem: Windoze put formatting characters from the upper reaches of
the 8859 charset in my csv files, and I wanted them stripped. I didn't
know what to call them, and TLUG came to the rescue.
Henry Spencer's solution:
> tr -cd '\n\040-\176'
>
>That gets rid of everything except newlines and the printable ASCII
>characters. (\177, aka character 127, is not printable.)
That worked great, but it took me a while to remember how to put that on
a command line, like so:
cat filewithcrap.csv | tr -cd '\n\040-\176' > crapfree.csv
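For anyone who wants to try the filter before pointing it at real data, here is a small self-contained sketch. The file names `sample.csv` and `sample.clean.csv` and the sample line are made up for illustration; the two high-bit bytes (octal 223 and 224) are the Windows-1252 curly quotes, which is only a guess at the kind of junk involved. `LC_ALL=C` is a precaution for `tr` implementations that object to invalid multibyte input, and the file is read by redirection rather than through `cat`.

```shell
# Build a tiny test file (hypothetical name) containing two high-bit bytes
# (octal 223 and 224) and a DOS-style carriage return before the newline.
printf '1 ,\223Goldman Sachs\224 ,109\r\n' > sample.csv

# Same filter as above: -c complements the set, -d deletes, so everything
# except newline and printable ASCII (octal 040-176) is dropped.
LC_ALL=C tr -cd '\n\040-\176' < sample.csv > sample.clean.csv

cat sample.clean.csv
```

Note that the carriage return (octal 015) is outside the kept range, so this filter also converts DOS line endings to Unix ones as a side effect.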
Lennart Sorenson's solution:
>#!/usr/bin/perl
>while (<>) {
>    @chars = split(//);
>    foreach $c (@chars) {
>        #print "$c" if (ord($c)<128);
>        # keep characters 32-127 and linefeed (10); == is the numeric comparison
>        print "$c" if (ord($c)<128 and ord($c)>31 or ord($c) == 10);
>    }
>}
>
>Just pipe the file through that perl script and see if that does it. I
>think characters 32 to 127 and linefeed are all that you would want in a
>unix text file.
This also works, so that's two ways, which never hurts. Devin Whalen's
very thorough-looking solution didn't work for me, but that could be
user error, and I need to spend some time with it to understand it
better. Here's my response to his message:
On Thu, Feb 10, 2005 at 04:18:38PM -0500, Devin Whalen wrote:
>Can you send a file with some examples? I am pretty sure the perl
>script I sent will work. I used it on getting junk characters from a
>file from an AIX server.
I'll put in some example text under my .sig. Running your script on my
csv file worked like "touch" for me - harmless but not functional. I
dunno why yet, but I should get some time to look at it soon.
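A rough way to check whether a filter changed a file at all, and which junk bytes are still in it, is sketched below. The file names are placeholders, the sample line is invented, and plain `cat` stands in for a script that has no effect; swap in the real script to test it.

```shell
# Placeholder input with one high-bit byte (octal 240).
printf 'Rank\240,Advisor\n' > before.csv

# Stand-in for a filter that does nothing; substitute the real script here.
cat before.csv > after.csv

# cmp -s prints nothing and exits 0 only if the files are byte-identical.
if cmp -s before.csv after.csv; then
    status="unchanged"
else
    status="changed"
fi
echo "filter output: $status"

# Invert the tr filter: delete newline and printable ASCII so that only
# the junk bytes remain, then dump those bytes in octal.
leftover=$(LC_ALL=C tr -d '\n\040-\176' < after.csv | od -An -b | tr -d ' \n')
echo "non-ASCII bytes still present: $leftover"
```

An empty list on the last line would mean the file is already clean.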
--
yours,
William
,Advisors with Announced Mergers,,
Rank ,Advisor,Count ,Total Value
1 ,Goldman Sachs ,109 ,"384,587,655,889 "
2 ,J P Morgan ,89 ,"315,830,356,519 "
3 ,Morgan Stanley ,97 ,"261,610,080,134 "
4 ,Merrill Lynch ,49 ,"256,469,079,383 "
5 ,Lehman Brothers ,68 ,"178,053,915,635 "
6 ,Lazard ,51 ,"167,218,432,036 "
7 ,Citigroup ,63 ,"165,931,845,345 "
8 ,Credit Suisse First Boston ,61 ,"128,476,680,743 "
9 ,UBS ,58 ,"122,940,275,398 "
10 ,Rothschild ,22 ,"93,481,910,892 "
11 ,BNP Paribas ,2 ,"71,010,611,837 "
12 ,Deutsche Bank ,37 ,"53,642,317,116 "
13 ,Banc of America Securities ,72 ,"51,391,768,933 "
14 ,Bear Stearns ,41 ,"30,962,283,027 "
15 ,Unicredito Italiano Group ,2 ,"29,436,189,520 "
16 ,Banca Intesa SPA ,1 ,"29,333,946,470 "
17 ,Mediobanca SPA ,1 ,"29,333,946,470 "
18 ,Houlihan Lokey Howard & Zukin ,27 ,"23,316,569,382 "
19 ,HSBC ,4 ,"22,723,939,539 "
20 ,Cazenove & Co ,8 ,"19,686,510,001 "
21 ,Wachovia ,34 ,"18,890,085,079 "
22 ,Sandler O'Neill ,39 ,"13,265,640,755 "
23 ,Keefe Bruyette & Woods Inc ,36 ,"13,014,032,089 "
24 ,Greenhill & Co ,8 ,"10,903,667,720 "
25 ,Petrie Parkman & Co ,7 ,"10,315,696,517 "