Stanford pages

This short document describes the web pages we retrieved from the Stanford University web sites in late 2003.

1. HTML pages

Starting at "http://www.stanford.edu", we crawled the stanford.edu domain in breadth-first fashion, following only URL links that reside within the Stanford domain. Note that we kept only pages of "text/html" type, ignoring files such as PDF documents and JPEG images: they are all dangling pages with no outgoing links, so they do not affect the PageRank computation. Note also that the crawler did not observe any robot exclusion rules. By the time we were asked to stop crawling, we had collected almost 2 million pages.

All pages are stored at /p/galanx/stanford_pages. Each file, from stanford.edu_0 through stanford.edu_1937, contains up to 1,000 pages. Two URLs are associated with each page: the first is the URL the crawler sent out to retrieve the page, and the second is the actual URL returned by the web server. The two can differ because of issues such as URL redirection.

Pages stored in stanford.edu_0 through stanford.edu_1194 were used in our experiments; if the data set is viewed as a breadth-first search tree, those pages constitute its first 8 levels.

2. Text and URLs

We used lynx to parse the HTML pages, separating URL links from plain text. The results are stored in /p/galanx/stanford_text_url/stanford_text_url_[0-4]. All pages retrieved from the same web server are stored together in one file, named after that server. Note that we use the physical server name, because one server may host multiple logical web sites; for example, the physical server behind the site http://www.stanford.edu is www.lb-a.stanford.edu.

We cleaned up the URLs by removing duplicates, aliases, and URLs that belong to domains other than stanford.edu. The cleaned data is stored in /p/galanx/stanford_test_url/stanford_domain_text_url_[0-3].
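3. Illustrative code sketches

To make the crawl procedure of Section 1 concrete, here is a minimal sketch of such a breadth-first crawl using only Python's standard library. It is an illustration under simplified assumptions (naive link extraction, no politeness delays, a small page limit), not the crawler that produced this data set:

    import urllib.parse
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        # Collect the href targets of <a> tags.
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def in_stanford(url):
        host = urllib.parse.urlparse(url).hostname or ""
        return host == "stanford.edu" or host.endswith(".stanford.edu")

    def bfs_crawl(seed="http://www.stanford.edu", limit=1000):
        # Like the original crawler, this sketch does not consult robots.txt.
        seen = {seed}
        queue = deque([seed])          # FIFO queue gives breadth-first order
        while queue and len(seen) <= limit:
            url = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    final_url = resp.geturl()  # may differ after redirection
                    # Keep only "text/html" pages; skip PDFs, JPEGs, etc.
                    if resp.headers.get_content_type() != "text/html":
                        continue
                    body = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            yield url, final_url, body  # request URL, served URL, page
            parser = LinkParser()
            parser.feed(body)
            for link in parser.links:
                absolute = urllib.parse.urljoin(final_url, link)
                # Follow only links that stay inside the stanford.edu domain.
                if in_stanford(absolute) and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)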
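The two URLs stored with each page correspond to the request URL and the post-redirect URL. With urllib, for instance, the served URL is available after the library follows any redirects:

    import urllib.request

    requested = "http://www.stanford.edu"
    with urllib.request.urlopen(requested) as resp:
        served = resp.geturl()  # the URL the server actually answered with
    # requested and served differ whenever the request was redirected.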
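Section 2's separation of links from plain text can be approximated with lynx's dump mode: -dump renders a page to plain text followed by a numbered "References" list of its links, and -listonly restricts the output to the links. A sketch (the exact lynx invocation we used is not recorded here):

    import subprocess

    def lynx_parse(path):
        # Rendered plain text (lynx appends a numbered "References" list).
        text = subprocess.run(["lynx", "-dump", path],
                              capture_output=True, text=True).stdout
        # Links only.
        links = subprocess.run(["lynx", "-dump", "-listonly", path],
                               capture_output=True, text=True).stdout
        return text, links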
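The URL clean-up step amounts to normalizing each URL and filtering by domain. Below is a minimal sketch in which alias removal is approximated by simple syntactic normalization (lower-casing the host, dropping fragments and the default port, defaulting the empty path); real alias detection, e.g. across host names that resolve to the same physical server, would require more than this:

    import urllib.parse

    def clean_urls(urls):
        kept = set()
        for url in urls:
            parts = urllib.parse.urlsplit(url.strip())
            host = (parts.hostname or "").lower()
            # Keep only URLs inside the stanford.edu domain.
            if host != "stanford.edu" and not host.endswith(".stanford.edu"):
                continue
            scheme = parts.scheme.lower() or "http"
            netloc = host if parts.port in (None, 80) else f"{host}:{parts.port}"
            # Dropping the fragment and defaulting the path collapses
            # aliases like http://host and http://host/#top into one URL.
            path = parts.path or "/"
            kept.add(urllib.parse.urlunsplit(
                (scheme, netloc, path, parts.query, "")))
        return sorted(kept)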