Removing junk characters from text files?

William O'Higgins william.ohiggins-H217xnMUJC0sA/PxXw9srA at public.gmane.org
Thu Feb 10 21:01:43 UTC 2005


On Thu, Feb 10, 2005 at 01:51:27PM -0500, Stewart C. Russell wrote:
>tcs sounds like the little brother of GNU recode, which handles more 
>charsets than most people can even imagine could exist; 281, in the 
>version I have.

These look very neat, but the problem isn't really the encoding.

>William, what do you want to do with these 'junk' characters? It's 
>getting harder and harder to work in just plain ASCII these days. It 
>just doesn't support the glyphs that people need to use.

What I am looking for is a way to strip these characters out.  They seem
to be coming from formatting code, and they have no semantic value; they
just prevent CSV files from being cleanly pulled into databases or
correctly interpreted by spreadsheets.  Basically, the problem is that
when I see these junk characters (vim syntax colouring shows them in
blue on a console) I want to do this:

:%s/$junkcharacter//g

The problem is that I don't know how to obtain values for $junkcharacter
based on the crap I see on the screen.  F'rinstance, the CR of a CRLF
shows up as ^M in vim (with a line break) and I know that it is called
"\r" in my search pattern - but I don't know what to call some of this
other stuff that I see.  I can't copy/paste it, because what is shown on
the screen is a stand-in for the real character, not something a regex
will match.  Does that help?
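
One way to get at those values is to look at the raw bytes of the file
instead of what the screen shows.  The following is only a minimal
sketch in Python, not a tested tool: the names KEEP, find_junk and
strip_junk are made up for illustration, and it assumes the data is
meant to be plain ASCII, so anything other than printable ASCII, tab, LF
and CR counts as junk.  It reports the hex code of each offending byte
and can strip them all in one pass.

```python
from collections import Counter

# Bytes treated as legitimate: printable ASCII plus tab, LF and CR.
# This set is an assumption; widen it if the data has non-ASCII text.
KEEP = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def find_junk(data: bytes) -> Counter:
    """Count every byte in data that is not in KEEP."""
    return Counter(b for b in data if b not in KEEP)


def strip_junk(data: bytes) -> bytes:
    """Return data with every byte outside KEEP removed."""
    return bytes(b for b in data if b in KEEP)


# Made-up sample standing in for a CSV file read with open(path, "rb").
sample = b"id,name\x0c\r\n1,Wil\x00liam\x0b\r\n"

# Report each junk byte by its hex code and how often it occurs.
for byte, count in sorted(find_junk(sample).items()):
    print(f"0x{byte:02x} occurs {count} time(s)")

clean = strip_junk(sample)  # b"id,name\r\n1,William\r\n"
```

The hex codes it prints are the values that were missing for
$junkcharacter; in a vim pattern a byte can be written as \%xNN, where
NN is the two-digit hex code.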
-- 

yours,

William
