Spiders and crawlers

Myles Braithwaite me-qIX3qoPyADtH8hdXm2+x1laTQe2KTcn/ at public.gmane.org
Tue Apr 6 12:50:11 UTC 2010


I would look at Solr and Nutch[1]. Yes, they are Java-based, but you can
access your data in Solr through its HTTP XML/JSON API. There are plenty of
libraries out there for Python, Ruby, etc.

[1]: http://wiki.apache.org/nutch/RunningNutchAndSolr
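
As a rough illustration of the HTTP/JSON route, here is a minimal Python
sketch that uses only the standard library. The host, port and path in
SOLR_SELECT_URL are assumptions about a default local install, and the
helper names are made up for this example; adjust them to your own setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Solr instance with the select handler at this path.
# Your host, port, and core name will likely differ.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, rows=10, base_url=SOLR_SELECT_URL):
    """Build a Solr select URL that asks for a JSON response (wt=json)."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url + "?" + urlencode(params)


def parse_docs(body):
    """Return the list of matching documents from a Solr JSON response."""
    data = json.loads(body)
    return data["response"]["docs"]


def search(query, rows=10, base_url=SOLR_SELECT_URL):
    """Run a query against Solr over HTTP and return the matching docs."""
    with urlopen(build_select_url(query, rows, base_url)) as resp:
        return parse_docs(resp.read().decode("utf-8"))
```

Calling search("title:catalog") would then return a list of dicts, one per
indexed page. The "title" field here is only a guess at the schema; use
whatever fields your Solr schema actually defines.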

On Thu, Apr 1, 2010 at 10:56 PM, Evan Leibovitch <evan-ieNeDk6JonTYtjvyW6yDsg at public.gmane.org> wrote:
> Hi all,
>
> I'm looking to implement a spidering system intended to look through a bunch
> of catalog websites, in order to track changes to those catalogs (with the
> help of a backend MySQL system).
>
> The Wikipedia entry for "web crawler" returns a lot of interesting choices;
> I'm wondering if anyone here has experience in either writing one or using
> an existing open source one. I'm hoping for something that is reasonably
> configurable so that one doesn't need to know a language like C or Java to
> make minor config changes.
>
> Any help is appreciated.
>
> --
> Evan Leibovitch
> evan-ieNeDk6JonTYtjvyW6yDsg at public.gmane.org
>



-- 
Myles Braithwaite
http://mylesbraithwaite.com | me-qIX3qoPyADtH8hdXm2+x1laTQe2KTcn/@public.gmane.org
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




