parsing HTML with awk or sed

Christopher Browne cbbrowne-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Wed Feb 25 18:01:25 UTC 2009


On 2009-02-25, Lennart Sorensen <lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org> wrote:
> On Tue, Feb 24, 2009 at 10:32:21PM -0500, Giles Orr wrote:
>> I'd like to extract the contents of paragraph tags (<p>) from an HTML
>> file.  Don't want anything else, just that - the P tags and what's
>> inside them, all other tags and contents not printed.  Unfortunately,
>> some are single line:
>>
>> <p>data</p>
>>
>> and some are multi-line:
>>
>> <p>
>> More data
>>
>> </p>
>
> And some are:
> <p>stuff
> <p>other stuff
> <p>yet more stuff
>
> The <p> tag was not required to be closed.  Kind of a pain isn't it?

For that particular reason, I'd suggest running the file through a
"tidying" tool first.  The one I usually use is HTML Tidy:

http://www.w3.org/People/Raggett/tidy/

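That cleans up the HTML, fixing misordered end tags and ensuring that
<p> and </p> are matched, as well as <li> and </li>, and so on.  Once
every paragraph is explicitly closed, picking them out with awk or sed
gets a lot easier.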
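
As a rough sketch of what the extraction step might look like once the
markup has been tidied: the function below (the name extract_p is made
up for illustration) assumes every <p> has a matching </p>, that tags
are lowercase, and that paragraphs are not nested.  It is not a general
HTML parser, just a line-oriented filter.

```shell
# extract_p: print every <p>...</p> element read from standard input.
# Assumptions (not from the original post): tags are balanced and
# lowercase, e.g. after a tidying pass, and <p> elements are not nested.
# Blank lines inside a paragraph are dropped.
extract_p() {
    awk '
    {
        line = $0
        while (line != "") {
            if (!inpara) {
                # Look for "<p>" or "<p ..." but not "<pre>" or "<param>".
                if (match(line, /<p[ >]/) == 0) break
                line = substr(line, RSTART)
                inpara = 1
            }
            fin = index(line, "</p>")
            if (fin == 0) {
                # Paragraph continues on the next input line.
                print line
                break
            }
            # Print up to and including "</p>", then keep scanning
            # the rest of the line for further paragraphs.
            print substr(line, 1, fin + 3)
            line = substr(line, fin + 4)
            inpara = 0
        }
    }'
}
```

Something like "tidy -q -asxhtml page.html | extract_p" would then
chain the two steps, where page.html is a placeholder file name.  The
exact Tidy options worth using depend on the input, so treat that
command line as a starting point rather than a recipe.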
-- 
http://linuxfinances.info/info/linuxdistributions.html
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




