regexp matching question
William Park
opengeometry-FFYn/CNdgSA at public.gmane.org
Fri Oct 7 01:16:53 UTC 2005
On Thu, Oct 06, 2005 at 12:58:36AM +0300, Peter wrote:
> Basically I have several mail files which contain messages and
> duplicates thereof (possibly with different headers). The goal is to end
> up with one file containing all the messages, with duplicates pruned.
>
> In theory formail should be able to split the files into messages and
> feed them to something that can uniq them (the criteria for uniq is
> identical body, disregarding all headers).
>
> So far I am using a scheme where formail splits out each message, then
> separates the body from the relevant headers, and I compute an md5sum
> hash of the body and use that hash as the key for the message.
> Duplicates end up with the same key and are rejected after a
> comparison. I used md5sum because it is well-tested.
>
> The problem is that formail fails to properly split certain files and I
> do not know why. I tried to understand the problem, but the formail
> source is unmaintainable, IMHO. Thus I am trying to work around it using
> my own matcher. That's where the regexp question came in.
>
> The question was whether anyone has tried to use the regexp C library
> to match and then split whole messages, possibly 5+ MB in size. I think
> that it can be done if the Content-length: header is ignored, but I
> don't know, so I ask. If it is possible, then it may save me some work
> on a 'proper' parser that matches the header start, the header
> end/body start, and the file end/next message start.
If formail is failing, then I doubt regex(3) would help. Find out why
it's failing first.
For a manual regex solution, try
csplit inbox '/^From /' '{*}'
which will split the mbox into one file per message (xx00, xx01, ...).
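As a rough sketch of the whole job (split, hash each body, keep the first
copy), something like the following might do. It assumes GNU csplit and
md5sum. The names body_md5 and dedupe_mbox are made up here for
illustration, and the sketch trusts the hash alone rather than doing the
extra comparison described above.

```shell
# Print the md5 of a message body: everything after the first blank line.
body_md5() {
    sed '1,/^$/d' "$1" | md5sum | cut -d' ' -f1
}

# dedupe_mbox INBOX OUTFILE  (hypothetical helper, not part of formail)
# Writes the first message seen for each distinct body to OUTFILE.
dedupe_mbox() {
    dm_tmp=$(mktemp -d) || return 1
    # -s: quiet; -z: drop the empty piece before the first "From " line
    csplit -s -z -n 6 -f "$dm_tmp/msg." "$1" '/^From /' '{*}' || {
        rm -rf "$dm_tmp"
        return 1
    }
    : > "$dm_tmp/seen"          # hashes of bodies already written
    : > "$2"
    for dm_f in "$dm_tmp"/msg.*; do
        dm_sum=$(body_md5 "$dm_f")
        if ! grep -qxF "$dm_sum" "$dm_tmp/seen"; then
            echo "$dm_sum" >> "$dm_tmp/seen"
            cat "$dm_f" >> "$2"
        fi
    done
    rm -rf "$dm_tmp"
}
```

Usage would be something like `dedupe_mbox inbox unique.mbox`. Two caveats:
a body line that starts with an unquoted 'From ' will be treated as a
message boundary, and two bodies that differ only in trailing blank lines
will hash differently.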
Or, if you are bored, you can try my extended Bash shell,
all=`< inbox`
set -- "${all|,$'\nFrom '}"
The messages are now in $1, $2, etc. Because the separator is consumed,
every message except the first is missing its leading 'From '.
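For those without the extended shell, a rough approximation in stock Bash
could look like the sketch below. The function name split_mbox_array and
the array name msgs are made up for illustration. Unlike the version
above, each message here keeps its own 'From ' line.

```shell
# split_mbox_array FILE  (hypothetical helper)
# Fills the global array msgs[] with one element per message.
split_mbox_array() {
    local line cur=""
    msgs=()
    # The "|| [[ -n $line ]]" part picks up a last line with no newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # A "From " line starts a new message; flush the previous one.
        if [[ $line == "From "* && -n $cur ]]; then
            msgs+=("$cur")
            cur=""
        fi
        cur+="$line"$'\n'
    done < "$1"
    if [[ -n $cur ]]; then
        msgs+=("$cur")
    fi
}
```

After `split_mbox_array inbox`, running `set -- "${msgs[@]}"` puts the
messages in $1, $2, etc. Reading line by line in the shell is slow, so
this is probably only reasonable for small files.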
--
William Park <opengeometry-FFYn/CNdgSA at public.gmane.org>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
--
The Toronto Linux Users Group. Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml