parsing HTML with awk or sed

Wed Feb 25 12:46:38 UTC 2009

> On Tue, 24 Feb 2009, Giles Orr wrote:
>
>> I'd like to extract the contents of paragraph tags (<p>) from an HTML
>> file.  Don't want anything else, just that - the P tags and what's
>> inside them, all other tags and contents not printed.  Unfortunately,
>> some are single line:
>>
>> <p>data</p>
>>
>> and some are multi-line:
>>
>> <p>
>> More data
>>
>> </p>
>>
>> cat filename | sed -n '/<p>/,/<\/p>/p'   works fine on the latter but
>> not the former.  I can catch the former on a separate sweep, but I
>> need to get both in one pass.  Awk is fine too, in fact I'd probably
>> prefer it.  I have a mild aversion to perl, but would use it if
>> needed.
>>
>> Here's an example file (most will be quite simple, similar to this):
>>
>>
>> <html>
>> <head>
>> <title>photo17</title>
>> </head>
>> <body>
>>
>> <h1>Photo photo17</h1>
>>
>> <p>
>> Various discussion of what's going on in the photo.
>> </p>
>>
>> <img src=photo17.web.jpg>
>>
>> <h6>Photo #photo17</h6>
>> <p>Photo © 2001, Giles Orr</p>
>>
>> </body>
>> </html>
>
2009/2/24 ted leslie <tleslie-RBVUpeUoHUc at public.gmane.org>:
> just remove all \r and \n, i.e. bring it together on one long line,
> then do a split on <p>, i.e. <p> -> \n<p>
> and </p> -> \n</p>
> and you're good to do what you proposed to do with sed.
>
>
> -tl
>
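
Ted's suggestion might look roughly like the sketch below.  It assumes
plain <p> and </p> tags with no attributes (as in the example file),
and the function name extract_p is made up for illustration.  It also
adds a newline after each </p>, which Ted did not spell out, so that
whatever follows a closing tag on the joined line is not printed along
with it.

```shell
# Hypothetical helper: reads HTML on stdin, prints only the <p>...</p> parts.
# Assumes tags are written exactly as <p> and </p> (no attributes).
extract_p() {
    # 1. Join the whole file into one long line (CR and LF become spaces).
    tr '\r\n' '  ' |
    # 2. Re-split so every <p> starts a line and every </p> sits on its own line.
    awk '{
        gsub(/<p>/, "\n<p>")
        gsub(/<\/p>/, "\n</p>\n")
        print
    }' |
    # 3. The original sed range now works, because an opening tag and its
    #    closing tag are never on the same line.
    sed -n '/<p>/,/<\/p>/p'
}
```

Usage would be something like "extract_p < photo17.html", where the
file name is only an example.  Each paragraph comes out with its text
on the <p> line and the closing </p> on the line after it.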
2009/2/24 Chris F.A. Johnson <cfaj-uVmiyxGBW52XDw4h08c5KA at public.gmane.org>:
> file=FILE.html
> sed -e 's|<p[^>]*>|&\n|' -e 's|</p[^>]*>|\n&|' "$file" |
>  awk '/<p/,/<\/p/ && /./ { gsub(/<\/*p[^>]*>/,"")
>                           if ( length ) print } '
>

Wow, this is one of those things that's blindingly obvious _after_
you've been hit with it: re-arrange the file _before_ you parse it.  I
was blocked (without ever realizing it) by the unconscious thought
that I was already parsing/mangling the file, so why do anything to it
first?

Huge thanks to Ted and Chris.  I haven't finished writing the script
yet, but now I know I can.
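
For comparison, since the original question leaned towards awk and a
single pass, here is a sketch that keeps the P tags and does not
re-arrange the file first.  It tracks whether it is currently inside a
paragraph.  It is not what Ted or Chris posted; the function name
extract_p_awk and the variable names are invented, and it assumes tags
are written exactly as <p> and </p>.

```shell
# Hypothetical helper: reads HTML on stdin, prints <p>...</p> including the tags.
extract_p_awk() {
    awk '
    {
        line = $0
        # Blank lines never enter the loop, so they are dropped from the output.
        while (line != "") {
            if (!inside) {
                i = index(line, "<p>")
                if (i == 0) break            # no paragraph starts on this line
                line = substr(line, i)       # discard text before <p>
                inside = 1
            }
            j = index(line, "</p>")
            if (j == 0) {                    # paragraph continues on the next line
                print line
                break
            }
            print substr(line, 1, j + 3)     # up to and including </p>
            line = substr(line, j + 4)       # keep scanning the rest of the line
            inside = 0
        }
    }'
}
```

This handles single-line and multi-line paragraphs alike, and also two
paragraphs on one line, at the cost of being longer than the
sed-then-awk pipeline.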

-- 
Giles
http://www.gilesorr.com/
gilesorr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/