OT: AOL Releases Search Logs of 657,427 Users

Meng Cheah meng-D1t3LT1mScs at public.gmane.org
Mon Aug 7 23:56:53 UTC 2006


Meng Cheah wrote:

> Words fail me.
>
> http://www.techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/
>
>
> *"The utter stupidity of this is staggering.* AOL has released very 
> private data about its users without their permission. While the AOL 
> username has been changed to a random ID number, the ability to 
> analyze all searches by a single user will often lead people to easily 
> determine who the user is, and what they are up to. The data includes 
> personal names, addresses, social security numbers and everything else 
> someone might type into a search box.
>
> The most serious problem is the fact that many people often search on 
> their own name, or those of their friends and family, to see what 
> information is available about them on the net. Combine these ego 
> searches with porn queries and you have a serious embarrassment. 
> Combine them with “buy ecstasy” and you have evidence of a crime. 
> Combine it with an address, social security number, etc., and you have 
> an identity theft waiting to happen. The possibilities are endless."
>
> The apology from AOL:
> All –
>
>    This was a screw up, and we’re angry and upset about it. It was an
>    innocent enough attempt to reach out to the academic community with
>    new research tools, but it was obviously not appropriately vetted,
>    and if it had been, it would have been stopped in an instant.
>
>    Although there was no personally-identifiable data linked to these
>    accounts, we’re absolutely not defending this. It was a mistake, and
>    we apologize. We’ve launched an internal investigation into what
>    happened, and we are taking steps to ensure that this type of thing
>    never happens again.
>
>    Here was what was mistakenly released:
>
>    * Search data for roughly 658,000 anonymized users over a three
>    month period from March to May.
>
>    * There was no personally identifiable data provided by AOL with
>    those records, but search queries themselves can sometimes include
>    such information.
>
>    * According to comScore Media Metrix, the AOL search network had
>    42.7 million unique visitors in May, so the total data set covered
>    roughly 1.5% of May search users.
>
>    * Roughly 20 million search records over that period, so the data
>    included roughly 1/3 of one percent of the total searches conducted
>    through the AOL network over that period.
>
>    * The searches included as part of this data only included U.S.
>    searches conducted within the AOL client software.
>
>    We apologize again for the release.
>
>    Andrew Weinstein
>    AOL Spokesman

From the README file with the original release:

500k User Session Collection
----------------------------------------------
This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. 
Any application of this collection for commercial purposes is STRICTLY PROHIBITED.

Brief description:

This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged. 

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research. 

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
        AnonID - an anonymous user ID number.
        Query  - the query issued by the user, case shifted with
                 most punctuation removed.
        QueryTime - the time at which the query was submitted for search.
        ItemRank  - if the user clicked on a search result, the rank of the
                    item on which they clicked is listed. 
        ClickURL  - if the user clicked on a search result, the domain portion of 
                    the URL in the clicked result is listed.

Each line in the data represents one of two types of events:
        1. A query that was NOT followed by the user clicking on a result item.
        2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above). 
In the second case (click through), there is data in all five columns.  For click through events, the query that preceded the click through is included.  Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events.  Also note that if the user requested the next "page" of results for some query, this appears as a subsequent identical query with a later time stamp.
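
The two event types described above could be read with something like the following Python sketch. It is not part of the AOL release. The README excerpt does not name the field delimiter or the time stamp format, so the tab delimiter is an assumption and QueryTime is kept as an uninterpreted string. The names LogEvent and parse_line are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEvent:
    """One line of the log: a query-only event or a click-through event."""
    anon_id: int
    query: str
    query_time: str                   # kept as text; format not given in the README
    item_rank: Optional[int] = None   # present only for click-through events
    click_url: Optional[str] = None   # present only for click-through events

    @property
    def is_click_through(self) -> bool:
        # Per the README, click-through lines carry data in all five fields.
        return self.item_rank is not None


def parse_line(line: str, delimiter: str = "\t") -> LogEvent:
    """Parse one data line. The tab delimiter is an assumption."""
    fields = line.rstrip("\r\n").split(delimiter)
    # Query-only lines may stop after three fields; pad to five.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return LogEvent(
        anon_id=int(anon_id),
        query=query,
        query_time=query_time,
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )
```

For example, parse_line("1001\tsome query\t2006-03-01 12:00:00") would give a query-only event. The values in that call are made up for illustration and are not taken from the data.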

CAVEAT EMPTOR -- SEXUALLY EXPLICIT DATA!  Please be aware that these queries are not filtered to remove any content.  Pornography is prevalent on the Web and unfiltered search engine logs contain queries by users who are looking for pornographic material.  There are queries in this collection that use SEXUALLY EXPLICIT LANGUAGE.  This collection of data is intended for use by mature adults who are not easily offended by the use of pornographic search terms.  If you are offended by sexually explicit language you should not read through this data.  Also be aware that in some states it may be illegal to expose a minor to this data.  Please understand that the data represents REAL WORLD USERS, un-edited and randomly sampled, and that AOL is not the author of this data.

Basic Collection Statistics
Dates:
  01 March, 2006 - 31 May, 2006

Normalized queries:
  36,389,567 lines of data
  21,011,340 instances of new queries (w/ or w/o click-through)
   7,887,022 requests for "next page" of results
  19,442,629 user click-through events
  16,946,938 queries w/o user click-through
  10,154,742 unique (normalized) queries
     657,426 unique user IDs
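
As a quick sanity check on these figures, the short Python sketch below tests one reading of the README: that every line is either a click-through event or a query without a click-through, so those two counts should add up to the total number of lines. The README does not state this relation outright, so treat it as an assumption. The dictionary keys and function names are hypothetical.

```python
# Figures copied from the "Basic Collection Statistics" in the README.
README_STATS = {
    "lines_of_data": 36_389_567,
    "new_queries": 21_011_340,
    "next_page_requests": 7_887_022,
    "click_through_events": 19_442_629,
    "queries_without_click_through": 16_946_938,
    "unique_queries": 10_154_742,
    "unique_user_ids": 657_426,
}


def lines_match_event_types(stats: dict) -> bool:
    """True if click-through plus no-click counts equal the line count."""
    total = stats["click_through_events"] + stats["queries_without_click_through"]
    return total == stats["lines_of_data"]


def mean_lines_per_user(stats: dict) -> float:
    """Average number of log lines per anonymous user ID."""
    return stats["lines_of_data"] / stats["unique_user_ids"]


if __name__ == "__main__":
    print("event types account for all lines:", lines_match_event_types(README_STATS))
    print("mean lines per user ID: %.1f" % mean_lines_per_user(README_STATS))
```

The mean works out to a few dozen log lines per user ID, which gives a sense of how much search history is tied to each random ID number.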


Please reference the following publication when using this collection:

G. Pass, A. Chowdhury, C. Torgeson,  "A Picture of Search"  The First 
International Conference on Scalable Information Systems, Hong Kong, June, 
2006.

Copyright (2006) AOL



--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml




