Spiders and crawlers
CLIFFORD ILKAY
clifford_ilkay-biY6FKoJMRdBDgjK7y7TUQ at public.gmane.org
Mon Apr 5 19:29:53 UTC 2010
On 04/01/2010 05:56 PM, Evan Leibovitch wrote:
> Hi all,
>
> I'm looking to implement a spidering system intended to look through a
> bunch of catalog websites, in order to track changes to those catalogs
> (with the help of a backend MySQL system).
>
> The Wikipedia entry for "web crawler" returns a lot of interesting
> choices; I'm wondering if anyone here has experience in either writing
> one or using an existing open source one. I'm hoping for something that
> is reasonably configurable so that one doesn't need to know a language
> like C or Java to make minor config changes.
This sort of thing is easy to do with Python, whether you use string
slicing, regular expressions, or an XML/HTML parser. I've used all
three; which one I pick depends on the particulars of the situation.
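To make the three methods concrete, here is a small side-by-side sketch.
The table-cell snippet, the href format, and the guid value are invented
for illustration; it uses only the standard library so it runs as-is.

```python
import re
from html.parser import HTMLParser

# A made-up table cell; the href format and guid are invented.
snippet = '<td><a href="/jobs/view/{3f2a}">Analyst</a></td>'

# 1. String slicing: take whatever sits between the braces
guid_slice = snippet[snippet.find('{') + 1:snippet.find('}')]

# 2. Regular expression: capture the text inside the braces
guid_re = re.search(r'\{([^}]*)\}', snippet).group(1)

# 3. HTML parser: walk the tags and read the href attribute properly
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.href = dict(attrs).get('href')

parser = LinkParser()
parser.feed(snippet)
guid_parsed = parser.href[parser.href.find('{') + 1:parser.href.find('}')]

print(guid_slice, guid_re, guid_parsed)  # 3f2a 3f2a 3f2a
```

Slicing is quickest to write but brittle; a real parser tolerates markup
changes far better.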
Here is an example from a recent project where I had to screen-scrape
(another term for "crawl" or "spider") a jobs site: it fetches the
unique identifier (guid) for each job from an HTML table of jobs using
an XPath query and puts it in a Python list (job_guids).
from lxml import html

def get_job_guids(url):
    job_guids = []
    # Omit exception handling code
    tree = html.parse(url).getroot()
    # The site renders jobs as alternating "even"/"odd" table rows
    table_rows = tree.xpath('//tr[@class="even"] | //tr[@class="odd"]')
    for the_row in table_rows:
        # The first attribute value of the row's first link is its href
        the_url = the_row.xpath('.//a')[0].values()[0]
        # The guid is the text between the braces in the href
        guid = the_url[the_url.find('{')+1:the_url.find('}')]
        job_guids.append(guid)
    return job_guids
That function returns a list of unique identifiers (job_guids), which I
then iterate over to fetch the details on each individual job into
another list (all_jobs). I then iterate over that list and save each job
with its details in a database (PostgreSQL in my case) using the Django
ORM (Object/Relational Mapper). Here is an excerpt of that code.
from dinamis_cms.models import Job

def save_jobs(all_jobs):
    for the_job in all_jobs:
        # Each entry in all_jobs is a dict of scraped values
        job = Job()
        job.guid = the_job['guid']
        job.title = the_job['title']
        job.job_type = the_job['job_type']
        job.start_date = the_job['start_date']
        # Omit a bunch of other attributes and exception handling code
        job.save()
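The step between those two functions, fetching each job's details into
all_jobs, isn't shown above. Here is a minimal sketch of the parsing
half, assuming a hypothetical detail-page layout (the span classes,
sample values, and guid are invented; the real script would fetch
page_html with lxml.html.parse() or urllib rather than use a literal
string).

```python
import re

def parse_job_detail(guid, page_html):
    # Pull one <span class="..."> value out of the detail page.
    # The markup here is hypothetical; adjust to the real site.
    def field(name):
        m = re.search(r'<span class="%s">([^<]*)</span>' % name, page_html)
        return m.group(1) if m else None
    return {
        'guid': guid,
        'title': field('title'),
        'job_type': field('job_type'),
        'start_date': field('start_date'),
    }

# Stand-in for a fetched detail page
sample = ('<span class="title">Analyst</span>'
          '<span class="job_type">Contract</span>'
          '<span class="start_date">2010-05-01</span>')
the_job = parse_job_detail('3f2a', sample)
print(the_job['title'])  # Analyst
```

Returning plain dicts keyed the same way as the Job model attributes is
what lets save_jobs() above stay a simple loop.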
Whitespace is significant in Python, so be mindful of that in case your
email client mangles the code above. Each code block above is indented
with four spaces.
I tried to use the Selenium testing framework to do this at one point
but I found it easier to just write the script than to point and click
and fiddle with the generated code.
--
Regards,
Clifford Ilkay
Dinamis
1419-3266 Yonge St.
Toronto, ON
Canada M4N 3P6
<http://dinamis.com>
+1 416-410-3326
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/