search engine pollution

Peter L. Peres plp-ysDPMY98cNQDDBjDh4tngg at public.gmane.org
Sat Apr 24 12:05:55 UTC 2004


On Fri, 23 Apr 2004, Emma Jane Hogbin wrote:
> Assuming the pages are marked up with some kind of semantic markup
> language, you can adjust the rankings of the headings and titles of a
> document. You really need to read the documentation that goes with
> ht://dig. http://www.htdig.org/confindex.html Specifically:
> 	title_factor: http://www.htdig.org/attrs.html#title_factor
> 	heading_factor: http://www.htdig.org/attrs.html#heading_factor

The pages are really plain text and a pain to mark up (2500+ of them --
RFCs). A solution would be to write a Perl wrapper that htdig would call to
interpret such text documents and that would return a higher factor for
whatever it identifies as titles and definitions. Interestingly, I believe
that Google does something very similar. For example, typing 'ftp rfc' into
Google returns a link to w3.org as the first match, specifically:

<http://www.w3.org/Protocols/rfc959/Overview.html>

and analysis of this HTML document reveals that there are no META tags
whatsoever in it. However, the title does contain the words 'FILE TRANSFER
PROTOCOL (FTP)' and not the word 'rfc' that I entered in the query. The hit
is the first among 911,000 matches. This could hardly be a coincidence.
How do they do it? There must be a list of heuristics almost the size of
the database to manage such things, IMHO. Either that or AI tricks I know
nothing about (and that does not mean much because I know little about
AI).
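
To make the wrapper idea above a bit more concrete, here is a rough sketch
of the heading detection, written in Python rather than Perl. It turns a
plain-text document into minimal HTML so that attributes such as
title_factor and heading_factor could apply. The function names and the
heuristics (numbered section lines, short all-caps lines) are my own
assumptions, not anything htdig provides, and real RFCs would certainly
need more rules than this.

```python
import html
import re

# Assumed heuristic: a section heading starts with a number such as "3." or
# "4.1.2", followed by whitespace and a capital letter.
SECTION_RE = re.compile(r"^\d+(\.\d+)*\.?\s+[A-Z]")


def is_heading(line):
    """Guess whether a plain-text line is a title or section heading."""
    stripped = line.strip()
    if not stripped or len(stripped) > 60:
        return False
    if SECTION_RE.match(stripped):
        return True
    # Short all-caps lines, e.g. "FILE TRANSFER PROTOCOL (FTP)".
    letters = [c for c in stripped if c.isalpha()]
    return len(letters) >= 4 and all(c.isupper() for c in letters)


def text_to_html(text):
    """Wrap plain text in minimal HTML, marking guessed headings as <h2>.

    The first guessed heading is also used as the <title>.
    """
    title = None
    body = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        escaped = html.escape(stripped)
        if is_heading(line):
            if title is None:
                title = escaped
            body.append("<h2>%s</h2>" % escaped)
        else:
            body.append("<p>%s</p>" % escaped)
    return "<html><head><title>%s</title></head><body>\n%s\n</body></html>" % (
        title or "",
        "\n".join(body),
    )
```

A line that merely begins with a number and a capitalized word would be
misread as a heading, so treat this as a starting point only.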

It would be interesting to modify htdig so it scores pages by the number
of links pointing to them from other pages (in the same realm). I feel
that such an algorithm would recreate the original semantic tree
structure of the realm (e.g. leaf -> index -> master index, etc.) as far as
scoring is concerned. The user should then have the option to select in
the search form whether he wants more index-like data or more
content-like data, the first request giving more weight to the
pointed-to score, and the second more weight to the occurrence count of
the search key in a document (probably normalized to the word count of the
document).
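
A minimal sketch of the scoring described above, in Python and independent
of htdig itself (htdig has no such hook that I know of; all names here are
hypothetical). It counts inbound links within the realm, computes the
occurrence count of the search key normalized to the document's word
count, and blends the two with a weight the user would pick in the search
form. Scaling both scores to the 0..1 range before blending is my own
assumption.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count links pointing to each page from other pages in the realm.

    links maps a page name to the pages it links to. Self-links and
    duplicate links from the same source are ignored.
    """
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


def term_density(text, term):
    """Occurrences of term divided by the word count of the document."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


def rank(pages, links, term, index_weight):
    """Return page names ordered from best to worst match.

    pages maps a page name to its text. index_weight is in [0, 1]:
    1.0 asks for index-like pages (pointed-to score only), 0.0 asks for
    content-like pages (normalized occurrence count only).
    """
    inbound = inbound_link_counts(links)
    max_inbound = max(inbound.values(), default=0) or 1
    densities = {name: term_density(text, term) for name, text in pages.items()}
    max_density = max(densities.values(), default=0.0) or 1.0

    scores = {}
    for name in pages:
        link_score = inbound[name] / max_inbound
        content_score = densities[name] / max_density
        scores[name] = index_weight * link_score + (1 - index_weight) * content_score
    return sorted(scores, key=scores.get, reverse=True)
```

Word matching here is a plain whitespace split, so punctuation attached to
a word would defeat it; a real indexer would tokenize properly.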

Peter
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




