parsing HTML with awk or sed

Giles Orr gilesorr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Wed Feb 25 03:32:21 UTC 2009


I'd like to extract the paragraphs (<p> tags) from an HTML file: just
the <p> tags and what's inside them, with all other tags and their
contents left out.  Unfortunately, some paragraphs are single line:

<p>data</p>

and some are multi-line:

<p>
More data

</p>

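sed -n '/<p>/,/<\/p>/p' filename   works fine on the latter but not the
former: when <p> and </p> share a line, sed only starts looking for the
closing </p> on the next line, so the range runs on to the following
</p> or to the end of the file.  I can catch the former on a separate
sweep, but I need to get both in one pass.  Awk is fine too, in fact I'd
probably prefer it.  I have a mild aversion to perl, but would use it if
needed.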

Here's an example file (most will be quite simple, similar to this):


<html>
<head>
<title>photo17</title>
</head>
<body>

<h1>Photo photo17</h1>

<p>
Various discussion of what's going on in the photo.
</p>

<img src=photo17.web.jpg>

<h6>Photo #photo17</h6>
<p>Photo © 2001, Giles Orr</p>

</body>
</html>
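
For a file like the one above, one possible single-pass approach is a
small awk state machine: set a flag when a line contains <p>, print
while the flag is set, and clear it after a line containing </p>.  This
is only a sketch under some assumptions: the tags are lowercase with no
attributes, paragraphs are not nested, each paragraph starts and ends on
lines of its own (whole lines are printed, so anything else sharing a
line with a <p> tag comes along too), and the function name
extract_paragraphs is made up for illustration.

```shell
#!/bin/sh
# Sketch: print only the <p>...</p> blocks of an HTML file in one pass.
# Assumes lowercase <p> tags without attributes and no nested paragraphs.
# extract_paragraphs is a hypothetical helper name, not an existing tool.
extract_paragraphs() {
    awk '
        /<p>/   { inside = 1 }   # opening tag on this line: start printing
        inside  { print }        # print each line while inside a paragraph
        /<\/p>/ { inside = 0 }   # closing tag (maybe on the same line): stop
    ' "$1"
}
```

Because the three rules run in order for every line, a single-line
<p>data</p> sets the flag, gets printed, and clears the flag again
within the same line, while a multi-line paragraph keeps the flag set
until its closing tag.  It would be called as
extract_paragraphs photo17.html (a hypothetical filename).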


Thanks for any help offered.

-- 
Giles
http://www.gilesorr.com/
gilesorr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




