regexp matching question
Peter
plp-ysDPMY98cNQDDBjDh4tngg at public.gmane.org
Wed Oct 5 21:58:36 UTC 2005
On Wed, 5 Oct 2005, William Park wrote:
> On Wed, Oct 05, 2005 at 12:33:15PM +0300, Peter wrote:
>>
>> Hi all
>>
>> I need to match email messages using regexec(3). I would like to match
>> as much as possible in a piece, i.e. the interesting headers and the
>> body (which could be large). Can this be done and is it economical
>> (speedwise) to use a single hairy regexp to match the whole message or
>> is it better to match the message and then parse it ? formail already
>> does this somehow (I have not looked yet).
>
> Such question is asked by newbie. But, since you're asking this, there
> must be something else to it. Can you give an example?
>
> Header and body are separate section of email message, so they should be
> searched separately. There is no reason why an email message shouldn't
> be in memory (ie. procmail does it), so without more detail, it's
> difficult to answer.

Basically, I have several mail files which contain messages and
duplicates thereof (possibly with different headers). The goal is to end
up with one file containing all the messages, with duplicates pruned.
In theory, formail should be able to split the files into messages and
feed them to something that can uniq them (the criterion for uniqueness
is an identical body, disregarding all headers).

So far I am using a scheme where formail splits out each message, and
then splits it again into the body and the relevant headers. I then
compute an md5sum hash of the body and refer to the message by that hash
as a key. Duplicates end up with the same key and are rejected only
after a direct comparison confirms the match. I chose md5sum because it
is well tested.
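
The hash-then-compare scheme above can be sketched in C roughly as
follows. This is only an illustration under stated assumptions: it uses
a 64-bit FNV-1a hash as a stand-in for md5sum, it assumes LF line
endings and NUL-terminated messages already held in memory, and the
names message_body, fnv1a64 and same_body are hypothetical helpers, not
part of formail or any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the body, i.e. the text
 * after the first blank line. If there is no blank line, the body is
 * treated as empty. Assumes LF line endings. */
const char *message_body(const char *msg)
{
    const char *sep;

    assert(msg != NULL);
    sep = strstr(msg, "\n\n");
    return sep ? sep + 2 : msg + strlen(msg);
}

/* 64-bit FNV-1a over a NUL-terminated string. This is a stand-in for
 * md5sum, used here only to keep the sketch free of dependencies. */
uint64_t fnv1a64(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */

    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Two messages are duplicates when their bodies hash to the same key
 * AND a direct comparison confirms it; headers are ignored. */
int same_body(const char *a, const char *b)
{
    const char *ba = message_body(a);
    const char *bb = message_body(b);

    if (fnv1a64(ba) != fnv1a64(bb))
        return 0;
    return strcmp(ba, bb) == 0;
}
```

In a real run the hash would be computed once per message and kept as a
lookup key, so that the full comparison is only needed when two keys
collide.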

The problem is that formail fails to split certain files properly, and I
do not know why. I tried to understand the problem, but the formail
source is unmaintainable, IMHO. Thus I am trying to work around it with
my own matcher. That's where the regexp question came in.

The question was whether anyone has tried to use the regexp C library to
match and then split whole messages, possibly 5+MB in size. I think it
can be done if the Content-Length: header is ignored, but I don't know
for sure, so I am asking. If it is possible, it may save me some work
compared with a 'proper' parser that matches the header start, the
header end/body start, and the file end/next message start.
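
For illustration, here is a minimal sketch of finding message starts
with regcomp(3)/regexec(3). It assumes the whole mail file is in memory
as one NUL-terminated string (regexec stops at the first NUL byte), that
each message begins with a line starting with "From ", and that such
lines inside bodies have been escaped as ">From ". The function name
find_message_starts is hypothetical. It ignores Content-Length:
entirely.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store the byte offset of every line beginning
 * with "From " in starts[] (at most max_starts entries) and return the
 * number of offsets found. Returns 0 if the pattern fails to compile. */
size_t find_message_starts(const char *mbox, size_t *starts,
                           size_t max_starts)
{
    regex_t re;
    regmatch_t m;
    size_t count = 0;
    size_t pos = 0;
    size_t len;

    assert(mbox != NULL && starts != NULL);
    len = strlen(mbox);

    /* REG_NEWLINE makes '^' match after every newline, not only at
     * the start of the string. */
    if (regcomp(&re, "^From ", REG_NEWLINE) != 0)
        return 0;

    while (count < max_starts && pos < len) {
        /* When resuming in the middle of a line, tell regexec that
         * the start of this substring is not a line start. */
        int eflags = (pos == 0 || mbox[pos - 1] == '\n') ? 0 : REG_NOTBOL;

        if (regexec(&re, mbox + pos, 1, &m, eflags) != 0)
            break;                          /* no further matches */

        starts[count++] = pos + (size_t)m.rm_so;
        pos += (size_t)m.rm_eo;             /* continue after "From " */
    }

    regfree(&re);
    return count;
}
```

Message i then spans starts[i] up to starts[i + 1] (or the end of the
buffer for the last one), and the header/body boundary inside each
message is the first blank line. Whether this is fast enough on 5+MB
files would need to be measured.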
thanks,
Peter
--
The Toronto Linux Users Group. Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml