eBay Data

This document describes the eBay data that was retrieved ("crawled") from the eBay website (http://www.ebay.com) in summer 2002.


1. The Raw Data

Following the category URLs, we downloaded the HTML pages of the items that were listed on eBay.  All the HTML pages are compressed and stored in /p/galanx/ebay_raw/; we refer to them as the raw data.

Items on eBay are sorted and listed in categories and sub-categories, so we organized the HTML pages in the same fashion.  Under /p/galanx/ebay_raw/, there are 120 sub-directories covering 20 major categories: Antiques, Books, Business, ..., Everything_else.  Because the capacity of each AFS volume was set at 2GB, most categories are split across multiple directories.

In each directory, there are several compressed tar archives (.tgz files), each of which contains 100,000 item pages.  Use the following command to extract an archive:

   tar zxf Book_0.tgz
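
For scripted processing, the same extraction can be done from Python with the standard tarfile module.  This is only a minimal sketch: the function name extract_archive and the destination directory are illustrative and are not part of the data set.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the number of regular files contained in the archive.
    The function name and the destination layout are assumptions
    made for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        tar.extractall(dest)
    # Count only regular files; directory entries are skipped.
    return sum(1 for m in members if m.isfile())
```

For example, extract_archive("Book_0.tgz", "extracted") unpacks the archive into a directory named extracted, creating it if necessary.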

The page files follow the same category hierarchy that eBay used.  Each item file is named after its item id.  For instance, 1083917035.html, whose title is "CDV Alexandre Dumas Pere By Reutlinger Paris", is stored in ../Books/Fiction/Adventure/.  Note that multiple copies of an item may have been listed on eBay; such copies may differ only in their ids and bidding information.  There are more than 80 million item pages in the raw data directory.
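
Because the category path and the item id are both encoded in the file location, they can be recovered from a path alone.  The sketch below illustrates this; it assumes that item ids are purely numeric (as in the example above) and that the archives have been extracted under some root directory.  The helper names parse_item_path and iter_items are hypothetical.

```python
from pathlib import Path


def parse_item_path(item_path, root):
    """Split an item page path into (category parts, item id).

    Assumes files are named <item id>.html with a numeric id, which
    matches the example in this document but is otherwise an assumption.
    """
    rel = Path(item_path).relative_to(root)
    if rel.suffix != ".html" or not rel.stem.isdigit():
        raise ValueError(f"not an item page: {item_path}")
    return list(rel.parent.parts), int(rel.stem)


def iter_items(root):
    """Yield (category parts, item id) for every item page under root."""
    for path in sorted(Path(root).rglob("*.html")):
        yield parse_item_path(path, root)
```

Given an extraction root of /data, the path /data/Books/Fiction/Adventure/1083917035.html is split into the category parts Books, Fiction, Adventure and the item id 1083917035.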

Note that the first directory of each major category contains two metadata files.  One lists all item ids in the category; the other lists all the sub-categories.


2. Item Description

Each item page contains bidding information and item description text, as well as links to other pages.  We retrieved only the item pages themselves and ignored the links to seller and buyer information and to item pictures.

The item description text is parsed and recorded in /p/galanx/ebay_item_description with the same hierarchical structure.
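
Since the description tree mirrors the category hierarchy, the directory holding the descriptions for a category can be derived from the category path.  The sketch below is an illustration under assumptions: the file names and extensions used inside /p/galanx/ebay_item_description are not specified here, so the hypothetical helper description_dir returns only the directory, not an individual file.

```python
from pathlib import PurePosixPath

# Root of the parsed description text, as given in this document.
DESC_ROOT = PurePosixPath("/p/galanx/ebay_item_description")


def description_dir(category_parts, desc_root=DESC_ROOT):
    """Return the description directory for a category path.

    category_parts is a sequence such as ["Books", "Fiction", "Adventure"].
    That the top level of the description tree is the major category name
    is an assumption; only the mirrored hierarchy is stated in the text.
    """
    return desc_root.joinpath(*category_parts)
```

For the category Books/Fiction/Adventure, this yields /p/galanx/ebay_item_description/Books/Fiction/Adventure.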