Spiders and crawlers

Mon Apr 5 16:51:34 UTC 2010

On Mon, Apr 5, 2010 at 4:41 PM, Lennart Sorensen
<lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org> wrote:
> On Thu, Apr 01, 2010 at 05:56:35PM -0400, Evan Leibovitch wrote:
>> I'm looking to implement a spidering system intended to look through a bunch
>> of catalog websites, in order to track changes to those catalogs (with the
>> help of a backend MySQL system).
>
> I always wonder: Why mysql?  Postgresql is an obviously better and more
> scalable choice.  Why do so many people just barge ahead with mysql?

If you're going down that road, why not barge ahead with APDB instead?

http://thedailywtf.com/Articles/Announcing-APDB-The-Worlds-Fastest-Database.aspx

"Relational and post-relational databases have the notion of index
look-up, which means that retrieving a piece of data involves a long,
arduous process:

- Find, on disk, where the appropriate index file is stored
- Look for a free block of RAM in which the index file can be loaded
- Load the index file into memory
- Iterate over each index entry until the desired key matches the index key
- When/if found, load the actual data location into memory
- Find, on disk, where the actual data is stored using the index locator
- Look for a free block of RAM in which the actual data can be loaded
- Load the actual data into memory

With APDB, this process is reduced to one step: go to the actual
location that matches the index key specified. No middleman, index
files, or other nonsense needed; just go directly to the data you
want."
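
For anyone who wants the contrast spelled out, here is a minimal sketch in
Python. It is not APDB's code: the names (RECORD_SIZE, build_store,
lookup_via_index, lookup_direct) are hypothetical, an in-memory buffer stands
in for the disk, and it assumes dense integer keys and fixed-width records,
which is the only case where "go directly to the data" works at all.

```python
import io

RECORD_SIZE = 16  # hypothetical fixed record width, in bytes


def build_store(values):
    """Pack values as fixed-width records; return (data file, index entries)."""
    data = io.BytesIO()  # stands in for the on-disk data file
    index = []           # (key, byte offset) pairs, standing in for an index file
    for key, value in enumerate(values):
        index.append((key, data.tell()))
        # Assumes each value is ASCII and fits in RECORD_SIZE bytes.
        data.write(value.encode("ascii").ljust(RECORD_SIZE, b" "))
    return data, index


def lookup_via_index(data, index, key):
    """Walk the index entries until the key matches, then read the record."""
    for entry_key, offset in index:
        if entry_key == key:
            data.seek(offset)
            return data.read(RECORD_SIZE).rstrip().decode("ascii")
    return None


def lookup_direct(data, key):
    """Compute the record location from the key itself; no index involved."""
    data.seek(key * RECORD_SIZE)
    record = data.read(RECORD_SIZE)
    return record.rstrip().decode("ascii") if record else None
```

Usage: after data, index = build_store(["widget", "gadget", "sprocket"]),
both lookup_via_index(data, index, 1) and lookup_direct(data, 1) return the
same record. The direct version falls apart as soon as keys are sparse,
non-numeric, or records vary in length, which is the problem indexes exist
to solve.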

Note that for those who are unhappy with how long it takes to access
(possibly cached?) data via fread(), APDB has a further performance
optimization.  Careful analysis showed that all modern disk drives use
microcontrollers from just two manufacturers, both of which allow
firmware rewrites, thereby supporting the following:

"By rewriting the harddrive’s firmware, APDB can operate in the most
performant mode possible"
-- 
http://linuxfinances.info/info/linuxdistributions.html
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/