parsing HTML with awk or sed

Wed Feb 25 03:57:46 UTC 2009

On Tue, 24 Feb 2009, Giles Orr wrote:

> I'd like to extract the contents of paragraph tags (<p>) from an HTML
> file.  Don't want anything else, just that - the P tags and what's
> inside them, all other tags and contents not printed.  Unfortunately,
> some are single line:
>
> <p>data</p>
>
> and some are multi-line:
>
> <p>
> More data
>
> </p>
>
> cat filename | sed -n '/<p>/,/<\/p>/p'   works fine on the latter but
> not the former.  I can catch the former on a separate sweep, but I
> need to get both in one pass.  Awk is fine too, in fact I'd probably
> prefer it.  I have a mild aversion to perl, but would use it if
> needed.
>
> Here's an example file (most will be quite simple, similar to this):
>
>
> <html>
> <head>
> <title>photo17</title>
> </head>
> <body>
>
> <h1>Photo photo17</h1>
>
> <p>
> Various discussion of what's going on in the photo.
> </p>
>
> <img src=photo17.web.jpg>
>
> <h6>Photo #photo17</h6>
> <p>Photo © 2001, Giles Orr</p>
>
> </body>
> </html>

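file=FILE.html
# sed puts each opening and closing p tag on a line of its own
# (\n in the replacement is a GNU sed feature); awk then prints the
# non-empty lines between the tags, with the tags themselves removed.
sed -e 's|<p[^>]*>|&\n|' -e 's|</p[^>]*>|\n&|' "$file" |
  awk '/<p/,/<\/p/ && /./ { gsub(/<\/*p[^>]*>/,"")
                            if ( length ) print } '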
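
A quick way to check the pipeline above against the example file from the
question is sketched below. The wrapper name extract_p and the temporary
file are assumptions made for illustration only, and the \n in the sed
replacement relies on GNU sed; other seds may need a literal newline there.

```shell
#!/bin/sh
# Sketch only: extract_p is a hypothetical wrapper name, not from the post.
# Assumes GNU sed, which turns \n in the replacement into a newline.
extract_p() {
  sed -e 's|<p[^>]*>|&\n|' -e 's|</p[^>]*>|\n&|' "$1" |
    awk '/<p/,/<\/p/ && /./ { gsub(/<\/*p[^>]*>/,"")
                              if ( length ) print }'
}

# Sample input: the body of the example file from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<body>
<h1>Photo photo17</h1>

<p>
Various discussion of what's going on in the photo.
</p>

<img src=photo17.web.jpg>

<h6>Photo #photo17</h6>
<p>Photo © 2001, Giles Orr</p>

</body>
</html>
EOF

# Run the pipeline, show the result, and clean up the temporary file.
out=$(extract_p "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

With GNU sed this should print the text of both paragraphs, one per line,
with the p tags stripped. Note that the /<p/ pattern would also match tags
such as <pre>, so this is only suitable for simple files like the example.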

-- 
    Chris F.A. Johnson, webmaster         <http://woodbine-gerrard.com>
    ===================================================================
    Author:
    Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/