regexp matching question

William Park opengeometry-FFYn/CNdgSA at public.gmane.org
Fri Oct 7 01:16:53 UTC 2005


On Thu, Oct 06, 2005 at 12:58:36AM +0300, Peter wrote:
> Basically I have several mail files which contain messages and 
> duplicates thereof (possibly with different headers). The goal is to end 
> up with one file containing all the messages, with duplicates pruned.
> 
> In theory formail should be able to split the files into messages and 
> feed them to something that can uniq them (the criterion for uniq is an 
> identical body, disregarding all headers).
> 
> So far I am using a scheme where formail splits the files into messages, 
> and then each message into its body and relevant headers. I compute an 
> md5sum hash of the body and use that hash as the key for the message. 
> Duplicates end up with the same key, and I compare the bodies before 
> rejecting one. I used md5sum because it is well-tested.
> 
> The problem is that formail fails to split certain files properly, and I 
> do not know why. I tried to understand the problem, but the formail 
> source is unmaintainable IMHO. So I am trying to work around it with my 
> own matcher, which is where the regexp question came in.
> 
> The question was whether anyone has tried using the regexp C library to 
> match and then split whole messages, possibly 5+ MB in size. I think it 
> can be done if the Content-length: header is ignored, but I don't know, 
> so I ask. If it is possible, it may save me some work on a 'proper' 
> parser that matches the header start, the header end/body start, and 
> the file end/next message start.

If formail is failing, then I doubt regex(3) would help, since a regex
has to find the same message boundaries.  Find out why it's failing
first.

For a manual regex solution, try
    csplit inbox '/^From /' '{*}'
which will split the mbox into one file per message (xx00, xx01, ...).
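
As a fuller sketch of the csplit route, the script below builds a small
sample mbox, splits it, and keeps one message per distinct body, keyed
on the md5sum of the body as Peter describes.  It assumes GNU csplit
('{*}' and -z are GNU extensions) and md5sum; the sample addresses, the
"msg." prefix and the file names "seen" and "merged" are made up for
illustration.  Note that '/^From /' also matches any unescaped body line
that starts with 'From ', so this only works on mboxes where such lines
are quoted as '>From '.

```shell
#!/bin/sh
# Sketch: split an mbox with csplit, then keep one message per distinct body.
set -eu

work=$(mktemp -d)
cd "$work"

# Sample mbox: the first and third messages share a body, headers differ.
cat > inbox <<'EOF'
From alice@example.org Thu Oct  6 00:00:00 2005
Subject: one

hello world

From bob@example.org Thu Oct  6 00:01:00 2005
Subject: two

something else

From carol@example.org Thu Oct  6 00:02:00 2005
Subject: one again
X-Extra: differs

hello world

EOF

# One file per message; -z drops the empty piece before the first 'From '.
csplit -s -z -f msg. inbox '/^From /' '{*}'

: > seen      # body hashes already encountered
: > merged    # output mbox with duplicates pruned

for f in msg.*; do
    # The body is everything after the first blank line of the message.
    key=$(sed '1,/^$/d' "$f" | md5sum | cut -d' ' -f1)
    if ! grep -qxF "$key" seen; then
        echo "$key" >> seen
        cat "$f" >> merged
    fi
done

echo "kept $(grep -c '^From ' merged) of $(grep -c '^From ' inbox) messages"
```

It skips the byte-for-byte comparison Peter does before rejecting a
duplicate; running cmp on the two bodies would restore that.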

Or, if you are bored, you can try my extended Bash shell,
    all=`< inbox`
    set -- "${all|,$'\nFrom '}"
The messages are then in $1, $2, etc.  Because the "\nFrom " separator
is consumed by the split, every message except the first loses its
leading 'From '.
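
Without the patched shell, roughly the same result can be had in plain
POSIX sh.  This is only an illustrative sketch: it is slow on
multi-megabyte mailboxes, shell variables cannot hold NUL bytes, and the
sample mbox and the variable names (cur, line) are assumptions of mine.
Unlike the split above, each message keeps its 'From ' line.

```shell
#!/bin/sh
# Sketch: load each message of an mbox into $1, $2, ... using POSIX sh only.
set -e

work=$(mktemp -d)
cd "$work"

# Small sample mbox with two messages.
cat > inbox <<'EOF'
From alice@example.org Thu Oct  6 00:00:00 2005
Subject: one

hello world

From bob@example.org Thu Oct  6 00:01:00 2005
Subject: two

something else

EOF

set --    # clear the positional parameters
cur=      # message currently being accumulated

while IFS= read -r line || [ -n "$line" ]; do
    case $line in
        "From "*)
            # A new message starts: store the previous one, if any.
            if [ -n "$cur" ]; then
                set -- "$@" "$cur"
                cur=
            fi
            ;;
    esac
    cur="$cur$line
"
done < inbox

# Store the last message.
if [ -n "$cur" ]; then
    set -- "$@" "$cur"
fi

echo "$# messages"
printf '%s' "$2" | head -n 1    # first line of the second message
```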

-- 
William Park <opengeometry-FFYn/CNdgSA at public.gmane.org>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
	   http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
	  http://freshmeat.net/projects/bashdiff/
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




