BlogoCenter Data Sets

Added By Infochimps

The following datasets are available:

  • Real-web dataset containing hash values of the content of 353,739 web pages, collected over a period of six months (Feb. 1999 – July 1999).
  • The same real-web dataset formatted in three columns (web_site, web_page, change_history). The change history is a sequence of bits: 1 means that the page changed between the respective visits, and 0 means that it remained the same (e.g., 10000 means that the page had changed by the second visit, i.e., in March).
  • Synthetic dataset containing change information for 300,000 pages in the same three columns (web_site, web_page, change_history) over 200 visiting cycles. The change frequency of the pages follows a normal distribution.
  • Sample collection of blogs from UCLA used in lexical networks research. This data set also includes generated cosine values and lexical networks for the data. Includes instructions for processing with Clairlib.
  • Lexical networks generated from a small sample from the 2004 Document Understanding Conference. Includes instructions for processing with Clairlib.
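
The change_history column described above is straightforward to process. The sketch below is a minimal example, assuming tab-separated records with one page per line (the exact delimiter is not specified in the listing); it parses a record and estimates a page's change frequency as the fraction of visits on which it had changed.

```python
def parse_line(line: str):
    """Split one record into (web_site, web_page, change_history).

    Assumes tab-separated fields; the listing names the columns but
    not the delimiter.
    """
    web_site, web_page, change_history = line.rstrip("\n").split("\t")
    return web_site, web_page, change_history

def change_frequency(change_history: str) -> float:
    """Fraction of visits on which the page had changed since the prior
    visit ('1' = changed between visits, '0' = unchanged)."""
    if not change_history:
        return 0.0
    return change_history.count("1") / len(change_history)

# Hypothetical record using the example bit string from the description:
site, page, history = parse_line("example.com\t/index.html\t10000")
print(site, page, change_frequency(history))  # one change in five visits -> 0.2
```

The per-page frequency computed this way is what the synthetic dataset's "change frequency" refers to: the rate at which a page changes across visiting cycles.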
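
For intuition, a dataset like the synthetic one could be generated along these lines: draw each page's change probability from a normal distribution, then simulate a 200-cycle change history. The distribution parameters and naming below are assumptions for illustration; the listing does not state the mean or standard deviation used.

```python
import random

def synth_history(cycles: int = 200, mean: float = 0.3, std: float = 0.1) -> str:
    """Simulate one page's change history over `cycles` visiting cycles.

    The page's change probability is drawn from a normal distribution
    (mean/std are illustrative assumptions) and clamped to [0, 1].
    """
    p = min(max(random.gauss(mean, std), 0.0), 1.0)
    return "".join("1" if random.random() < p else "0" for _ in range(cycles))

# Emit one hypothetical three-column record (tab-separated):
print("site_0001", "page_0001", synth_history(), sep="\t")
```

Repeating this for 300,000 pages would yield a file with the same shape as the synthetic dataset described above.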

The goal of the BlogoCenter project is to develop innovative technologies to build a system that will (1) continuously monitor, collect, and store personal Weblogs (or blogs) at a central location, (2) discover hidden structures and trends automatically from the blogs, and (3) make them easily accessible to general users. By making the new information on the blogs easy to discover and access, this project is helping blogs realize their full potential for societal change as the “grassroots media.” It is also collecting an important hypertext dataset of human interactions for further analysis by the research community.