Spiders and crawlers

CLIFFORD ILKAY clifford_ilkay-biY6FKoJMRdBDgjK7y7TUQ at public.gmane.org
Mon Apr 5 19:29:53 UTC 2010


On 04/01/2010 05:56 PM, Evan Leibovitch wrote:
> Hi all,
>
> I'm looking to implement a spidering system intended to look through a
> bunch of catalog websites, in order to track changes to those catalogs
> (with the help of a backend MySQL system).
>
> The Wikipedia entry for "web crawler" returns a lot of interesting
> choices; I'm wondering if anyone here has experience in either writing
> one or using an existing open source one. I'm hoping for something that
> is reasonably configurable so that one doesn't need to know a language
> like C or Java to make minor config changes.

This sort of thing is easy to do with Python, whether you use string 
slicing, regular expressions, or an XML/HTML parser. I've used all three 
methods; which one I reach for depends on the particulars of the 
situation. Here is an example from a recent project where I had to 
screen-scrape (another term for "crawl" or "spider") a jobs site: an 
XPath query fetches the unique identifier (guid) for each job from an 
HTML table of jobs and puts it in a Python list (job_guids).

from lxml import html

def get_job_guids(url):
    job_guids = []
    # Omit exception handling code
    tree = html.parse(url).getroot()
    # Match both the "even" and the "odd" rows of the jobs table
    table_rows = tree.xpath('//tr[@class="even"] | //tr[@class="odd"]')
    for the_row in table_rows:
        # The first attribute value of the row's first <a> is the job URL
        the_url = the_row.xpath('.//a')[0].values()[0]
        # The guid sits between braces in that URL
        guid = the_url[the_url.find('{') + 1:the_url.find('}')]
        job_guids.append(guid)
    return job_guids
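
For comparison, the same guid extraction could be done with a regular 
expression instead of XPath. This is only a sketch: the braces-delimited 
guid and the href markup are assumptions carried over from the example 
above, and the pattern would need adjusting to the real page.

    import re

    # Matches a 36-character guid between braces in an href, e.g.
    # href="/jobs/view/{6fa459ea-ee8a-3ca4-894e-db77e160355e}"
    GUID_RE = re.compile(r'href="[^"]*\{([0-9a-fA-F-]{36})\}')

    def get_job_guids_re(page_source):
        # Return every guid found in the raw HTML of the jobs page
        return GUID_RE.findall(page_source)

A regex like this is fine for a stable, machine-generated page, but it 
breaks more easily than an XPath query when the markup changes.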

That function returns a list of unique identifiers (job_guids), which I 
then iterate over to fetch the details on each individual job into 
another list (all_jobs). I then iterate over that list and save each job 
with its details in a database (PostgreSQL in my case) using the Django 
ORM (Object/Relational Mapper). Here is an excerpt of that code.

from dinamis_cms.models import Job

def save_jobs(all_jobs):
    for the_job in all_jobs:
        job = Job()
        job.guid = the_job['guid']
        job.title = the_job['title']
        job.job_type = the_job['job_type']
        job.start_date = the_job['start_date']
        # Omit a bunch of other attributes and exception handling code
        job.save()
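
The step that builds all_jobs is not shown above. Here is a minimal 
sketch of what it might look like; the detail-page URL scheme and the 
XPath selectors for each field are hypothetical, since they depend 
entirely on the target site's markup.

    from lxml import html

    def parse_job_details(guid, page_source):
        # Extract one job's fields from the raw HTML of its detail page.
        # These selectors are made up for illustration; match them to
        # the real page's markup.
        tree = html.fromstring(page_source)
        return {
            'guid': guid,
            'title': tree.xpath('//h1/text()')[0],
            'job_type': tree.xpath('//span[@class="job_type"]/text()')[0],
            'start_date': tree.xpath('//span[@class="start_date"]/text()')[0],
        }

    def get_job_details(job_guids, base_url):
        all_jobs = []
        for guid in job_guids:
            # Hypothetical URL scheme: one detail page per guid
            root = html.parse('%s/job/%s' % (base_url, guid)).getroot()
            all_jobs.append(parse_job_details(guid, html.tostring(root)))
        return all_jobs

Keeping the parsing separate from the fetching also makes the parser 
easy to test against a saved copy of a page.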

Whitespace is significant in Python, so be mindful of that in case your 
email client mangles the code above. Each code block above is indented 
with four spaces.

I tried to use the Selenium testing framework to do this at one point 
but I found it easier to just write the script than to point and click 
and fiddle with the generated code.
-- 
Regards,

Clifford Ilkay
Dinamis
1419-3266 Yonge St.
Toronto, ON
Canada  M4N 3P6

<http://dinamis.com>
+1 416-410-3326
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




