parsing HTML with awk or sed
Chris F.A. Johnson
cfaj-uVmiyxGBW52XDw4h08c5KA at public.gmane.org
Wed Feb 25 03:57:46 UTC 2009
On Tue, 24 Feb 2009, Giles Orr wrote:
> I'd like to extract the contents of paragraph tags (<p>) from an HTML
> file. Don't want anything else, just that - the P tags and what's
> inside them, all other tags and contents not printed. Unfortunately,
> some are single line:
>
> <p>data</p>
>
> and some are multi-line:
>
> <p>
> More data
>
> </p>
>
> cat filename | sed -n '/<p>/,/<\/p>/p' works fine on the latter but
> not the former. I can catch the former on a separate sweep, but I
> need to get both in one pass. Awk is fine too, in fact I'd probably
> prefer it. I have a mild aversion to perl, but would use it if
> needed.
>
> Here's an example file (most will be quite simple, similar to this):
>
>
> <html>
> <head>
> <title>photo17</title>
> </head>
> <body>
>
> <h1>Photo photo17</h1>
>
> <p>
> Various discussion of what's going on in the photo.
> </p>
>
> <img src=photo17.web.jpg>
>
> <h6>Photo #photo17</h6>
> <p>Photo © 2001, Giles Orr</p>
>
> </body>
> </html>
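file=FILE.html
# Break the line after each <p ...> and before each </p>, then print the
# non-empty lines in between with the tags stripped.  The \n in the sed
# replacement is a GNU sed extension.
sed -e 's|<p[^>]*>|&\n|' -e 's|</p[^>]*>|\n&|' "$file" |
  awk '/<p/,/<\/p/ && /./ { gsub(/<\/*p[^>]*>/,"")
                            if ( length ) print } '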
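
If GNU sed is not available, the same job can probably be done in awk
alone. What follows is only a sketch, not a tested drop-in: the function
name extract_p and the variable inp are made up for illustration, and it
assumes that each opening <p ...> tag fits on one line and that
paragraphs are not nested. It tracks whether it is inside a paragraph and
prints the non-blank text it finds there, so single-line and multi-line
paragraphs are handled in one pass. Unlike /<p/, the pattern does not
match <pre> or <param>.

```shell
# extract_p: hypothetical helper; prints the text found inside <p>...</p>.
extract_p() {
    awk '
    {
        line = $0
        while (line != "") {
            if (!inp) {
                # Outside a paragraph: look for <p> or <p attr...>
                if (match(line, /<[pP]([ \t][^>]*)?>/)) {
                    line = substr(line, RSTART + RLENGTH)
                    inp = 1
                } else {
                    break
                }
            } else if (match(line, /<\/[pP][ \t]*>/)) {
                # Closing tag on this line: print what precedes it
                text = substr(line, 1, RSTART - 1)
                if (text ~ /[^ \t]/) print text
                line = substr(line, RSTART + RLENGTH)
                inp = 0
            } else {
                # Still inside the paragraph: print the rest of the line
                if (line ~ /[^ \t]/) print line
                break
            }
        }
    }' "$@"
}
```

Call it as: extract_p "$file"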
--
Chris F.A. Johnson, webmaster <http://woodbine-gerrard.com>
===================================================================
Author:
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists