regexp matching question

Wed Oct 5 21:58:36 UTC 2005

On Wed, 5 Oct 2005, William Park wrote:

> On Wed, Oct 05, 2005 at 12:33:15PM +0300, Peter wrote:
>>
>> Hi all
>>
>> I need to match email messages using regexec(3). I would like to match
>> as much as possible in a piece, i.e. the interesting headers and the
>> body (which could be large). Can this be done and is it economical
>> (speedwise) to use a single hairy regexp to match the whole message or
>> is it better to match the message and then parse it? formail already
>> does this somehow (I have not looked yet).
>
> Such a question is usually asked by a newbie.  But, since you're asking
> this, there must be something else to it.  Can you give an example?
>
> Header and body are separate sections of an email message, so they
> should be searched separately.  There is no reason why an email message
> shouldn't be in memory (e.g. procmail does it), so without more detail,
> it's difficult to answer.

Basically I have several mail files which contain messages and 
duplicates thereof (possibly with different headers). The goal is to end 
up with one file containing all the messages, with duplicates pruned.

In theory formail should be able to split the files into messages and
feed them to something that can uniq them (the criterion for uniqueness
is an identical body, disregarding all headers).

So far I am using a scheme where formail splits the files into
messages, and then each message into its body and relevant headers. I
compute an md5sum hash of the body and use that hash as the key for the
message. Duplicates end up with the same key, and I reject one only
after a comparison confirms the match. I used md5sum because it is
well-tested.
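
The hash-then-compare scheme above can be sketched roughly as follows.
This is only an illustration under stated assumptions: all names
(body_key, body_set_add, struct body_set) are hypothetical, the table is
a fixed size, and a 64-bit FNV-1a hash stands in for md5sum purely to
keep the sketch free of external libraries. The key only narrows the
search; a body is rejected as a duplicate only after memcmp() confirms
it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 1024            /* power of two; fixed for the sketch */

struct seen_body {
    const char *data;              /* not owned; caller keeps it alive */
    size_t len;
    uint64_t key;
    int occupied;
};

struct body_set {                  /* must be zero-initialised before use */
    struct seen_body slot[TABLE_SIZE];
    size_t used;
};

/* 64-bit FNV-1a over the raw body bytes (stand-in for md5sum). */
static uint64_t body_key(const char *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Returns 1 if the body is new and was recorded, 0 if an identical body
 * was already seen, -1 if the table is too full to accept a new entry. */
static int body_set_add(struct body_set *set, const char *data, size_t len)
{
    assert(set != NULL && data != NULL);

    uint64_t key = body_key(data, len);
    size_t i = (size_t)(key & (TABLE_SIZE - 1));

    /* Linear probing; terminates because the table is never more than
     * half full. */
    while (set->slot[i].occupied) {
        const struct seen_body *s = &set->slot[i];
        /* Same key is not enough: compare the bytes before rejecting. */
        if (s->key == key && s->len == len &&
            memcmp(s->data, data, len) == 0)
            return 0;
        i = (i + 1) & (TABLE_SIZE - 1);
    }

    if (set->used >= TABLE_SIZE / 2)
        return -1;

    set->slot[i] = (struct seen_body){ data, len, key, 1 };
    set->used++;
    return 1;
}
```

A caller would zero a struct body_set (for example with memset), call
body_set_add() once per message body, and write the message to the
output file only when it returns 1.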

The problem is that formail fails to split certain files properly and I
do not know why. I tried to understand the problem, but the formail
source is unmaintainable, IMHO. So I am trying to work around it with my
own matcher. That's where the regexp question came in.

The question was whether anyone has tried to use the C regexp library
to match and then split whole messages, possibly 5+ MB in size. I think
it can be done if the Content-Length: header is ignored, but I don't
know, so I ask. If it is possible, it may save me the work of writing a
'proper' parser that matches the header start, the header end/body
start, and the file end/next message start.
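
For illustration, here is a minimal sketch of what such a matcher might
look like with the POSIX regcomp()/regexec() calls. It assumes the
files are in mbox format, with each message starting at a "From " line
that is either at the top of the file or preceded by a blank line, and
it ignores Content-Length: entirely. The function names are
hypothetical. One caveat is certain: regexec() works on NUL-terminated
strings, so a buffer with an embedded NUL byte would be searched only
up to that byte.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Record the offsets of message starts in a NUL-terminated buffer.
 * A start is a line beginning with "From " at offset 0 or right after
 * a blank line. Returns the number of offsets stored (at most max). */
static size_t find_message_starts(const char *buf, size_t *starts,
                                  size_t max)
{
    regex_t re;
    regmatch_t m;
    size_t n = 0;
    const char *p = buf;

    assert(buf != NULL && starts != NULL);

    /* REG_NEWLINE makes '^' match after every newline, not just at the
     * start of the string. */
    if (regcomp(&re, "^From ", REG_EXTENDED | REG_NEWLINE) != 0)
        return 0;

    /* p always points at the start of a line, so REG_NOTBOL is not
     * needed when resuming the search. */
    while (n < max && regexec(&re, p, 1, &m, 0) == 0) {
        size_t off = (size_t)(p - buf) + (size_t)m.rm_so;

        /* Accept only "From " lines preceded by a blank line. */
        if (off < 2 || buf[off - 2] == '\n')
            starts[n++] = off;

        const char *eol = strchr(buf + off, '\n');
        if (eol == NULL)
            break;
        p = eol + 1;               /* resume at the next line */
    }

    regfree(&re);
    return n;
}

/* Offset of the first body byte in msg[0..len): just past the first
 * blank line. Returns len if the message has no blank line. */
static size_t body_offset(const char *msg, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++)
        if (msg[i] == '\n' && msg[i + 1] == '\n')
            return i + 2;
    return len;
}
```

Here the regexp only locates candidate separator lines, and the
header/body boundary is found with a plain scan. That keeps the pattern
trivial and avoids asking one large expression to capture a multi-MB
body, which is a design choice of this sketch rather than a measured
result.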

thanks,
Peter

--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml