parsing HTML with awk or sed
Giles Orr
gilesorr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Wed Feb 25 03:32:21 UTC 2009
I'd like to extract the paragraph elements (<p>) from an HTML file:
just the P tags and what's inside them, with all other tags and their
contents not printed. Unfortunately, some are single line:
<p>data</p>
and some are multi-line:
<p>
More data
</p>
sed -n '/<p>/,/<\/p>/p' filename works fine on the latter but
not the former. I can catch the former in a separate sweep, but I
need to get both in one pass. Awk is fine too; in fact I'd probably
prefer it. I have a mild aversion to perl, but would use it if
needed.
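For reference, the sed range misbehaves on single-line paragraphs because
sed only starts looking for the end of a range on the line after the one
that matched the start. A line like <p>data</p> therefore opens a range
that runs on to the next </p>, or to the end of the file. One possible
single-pass alternative is sketched below in awk. It assumes lowercase
<p> tags without attributes, no nested paragraphs, and no line that
closes one paragraph and opens another; the function name extract_p is
made up for illustration. It prints whole lines, so anything else sharing
a line with a <p> tag is printed too.

```shell
# Print every line from an opening <p> through its closing </p>,
# whether both tags share a line or sit on separate lines.
# Assumes lowercase <p> without attributes and no nesting.
extract_p() {
    awk '
        /<p>/   { inside = 1 }   # opening tag seen: start printing
        inside  { print }        # print while inside a paragraph
        /<\/p>/ { inside = 0 }   # closing tag seen: stop after this line
    ' "$@"
}

# Demo on a cut-down version of the sample file, read from stdin.
extract_p <<'EOF'
<h1>Photo photo17</h1>
<p>
Various discussion of what's going on in the photo.
</p>
<img src=photo17.web.jpg>
<p>Photo © 2001, Giles Orr</p>
EOF
```

The three rules run in order on every line, so a single-line paragraph
sets the flag, gets printed, and clears the flag again before the next
line is read. With a real file the call would be "extract_p filename".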
Here's an example file (most will be quite simple, similar to this):
<html>
<head>
<title>photo17</title>
</head>
<body>
<h1>Photo photo17</h1>
<p>
Various discussion of what's going on in the photo.
</p>
<img src=photo17.web.jpg>
<h6>Photo #photo17</h6>
<p>Photo © 2001, Giles Orr</p>
</body>
</html>
Thanks for any help offered.
--
Giles
http://www.gilesorr.com/
gilesorr-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists