Datasets

Blog articles which provide dataset directories - see blog comments as well

http://conflate.net/inductio/2008/02/a-meta-index-of-data-sets/ - excellent article listing available data sets in the area of machine learning and inference
http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php - Article containing a list of available dataset websites
http://www.datawrangling.com/some-datasets-available-on-the-web.html
http://www.daniel-lemire.com/blog/data-for-data-mining/ - has blog, tag cloud, wiki dataset categories
http://www.kirix.com/blog/category/data-tagssearch/
http://mobblog.cs.ucl.ac.uk/datasets/

Dataset directories

http://caw2.barcelonamedia.org/node/7 - Content Analysis for the Web 2.0 (CAW 2.0) Workshop - part of 18th International Conference of the World Wide Web. Contains training and test datasets from Twitter, MySpace, Slashdot, Ciao and Kongregate.
http://kdd.ics.uci.edu/ - has a machine learning repository
http://archive.ics.uci.edu/ml/datasets.html
http://ckan.net/ - listing of links to various datasets
http://www.ldc.upenn.edu/Obtaining/ - Linguistic data consortium catalog
http://www.swivel.com/data_sets
http://datamob.org/datasets
http://infochimps.org/
http://www.freebase.com/
http://numbrary.com/
http://theinfo.org/
http://www.trustlet.org/wiki/Repositories_of_datasets
http://del.icio.us/kirixstrata/publicdata
http://services.alphaworks.ibm.com/manyeyes/browse/data?q=null
http://googleresearch.blogspot.com/ - Google Research has stated that http://research.google.com will soon host open-source scientific datasets (see http://blog.wired.com/wiredscience/2008/01/google-to-provi.html) - watch this space.
http://data.un.org/
http://www.data360.org/index.aspx

Data sets for a specific field - most of these are related to data mining

http://theinfo.org/get/data - good list here - pay attention to web/news/blogs and Text/Language categories as well as trust network data
http://research.microsoft.com/nlp/ - look under data sets
http://nlp.stanford.edu/links/statnlp.html - look under corpora
http://trec.nist.gov/data/reuters/reuters.html - Reuters Corpora - contains large collection of news stories for use in Natural Language Processing, Information Retrieval and Machine Learning Systems (need to order CDs)

http://trec.nist.gov/data.html - Text retrieval. Has spam, web, question answering, blog and ad hoc (e.g. relevance judgement) tracks
http://plg.uwaterloo.ca/~gvcormac/treccorpus/ (300MB) - Spam Corpus 2005
http://plg.uwaterloo.ca/~gvcormac/treccorpus06/ (75MB English, 60MB Chinese) - Spam Corpus 2006
http://trec.nist.gov/data/reljudge_eng.html - Relevance Judgement
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html (25GB - costs 400 GBP) - Blog 06 data
http://trec.nist.gov/data/qamain.html - Question Answering (many tracks)
http://trec.nist.gov/data/novelty.html - Novelty (some relevance)

http://infochimps.org/tag/language/datasets - languages
http://infochimps.org/tag/lexicon/datasets - lexicon
http://infochimps.org/tag/lexical/datasets - lexical

http://wordnet.princeton.edu/ - Lexical database that is handy for computational linguistics and natural language processing
http://www.dmoz.org/Computers/Artificial_Intelligence/Machine_Learning/Datasets/ - Machine learning datasets
http://cervisia.org/machine_learning_data.php - Machine learning datasets - for benchmark data to compare different classifier algorithms, http://www.ci.tuwien.ac.at/~meyer/benchdata/ is recommended
http://mill.ucsd.edu/index.php?page=Datasets&subpage=Overview
http://www.trustlet.org/wiki/Trust_network_datasets#Released_datasets - Trust datasets - includes Epinions
http://stuff.metafilter.com/infodump/ - Metafilter - contains posts, comments, tags, favourites, contact and user data
http://an.kaist.ac.kr/traces/IMC2007.html - YouTube dataset
http://socialnetworks.mpi-sws.mpg.de/ - social network dataset
http://people.csail.mit.edu/jrennie/20Newsgroups/ - newsgroup dataset
http://www.yr-bcn.es/webspam/datasets/ - Webspam datasets

Link Analysis

http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html
http://www.cs.toronto.edu/~tsap/experiments/download/download.html

Recommender systems

http://www.grouplens.org/ - MovieLens
http://www.ieor.berkeley.edu/~goldberg/jester-data/ - Jester
http://www.netflixprize.com/ - Netflix
http://www.informatik.uni-freiburg.de/~cziegler/BX/ - Book Crossing

Forums

http://weimo.de/node/642 - Nabble.com + user ratings of posts

Blogs

http://ebiquity.umbc.edu/resource/html/id/212/Splog-Blog-Dataset - Spam blogs (splogs)
http://www.icwsm.org/data.html - 14 million posts, 3 million weblogs - apparently no longer available since Dec 8, 2006
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html - Blog 06 data (25GB), but costs 400 GBP

Wikis

http://labs.systemone.at/wikipedia3 - Wikipedia 3, providing Wikipedia datasets
http://download.wikipedia.org/ - official Wikipedia database dumps (very large)
http://download.freebase.com/wex/ - English Wikipedia articles that have been transformed into XML - all files ~ 55GB
http://dbpedia.org/About - structured information extracted from Wikipedia - available as a dataset

Webpages

http://www.archive.org/web/web.php - 85 billion webpages archived since 1996

Misc

http://opentick.com/ - Stock data
http://lib.stat.cmu.edu/datasets/ - miscellaneous datasets
http://lib.stat.cmu.edu/jasadata/ - datasets from Journal of the American Statistical Association
http://musicbrainz.org/ - music dataset
http://www.jigsaw.com/ - directory of companies and business professionals
http://www.librarything.com/ - library catalogue
http://www.imeem.com/developers - media library
http://www.scribd.com/doc/9582/integrating-wikipediawordnet - article on integrating WordNet and Wikipedia with YAGO (an extensible and light-weight ontology)
http://wiki.openstreetmap.org/index.php/Potential_Datasources - country maps
http://rdf.dmoz.org/ - open directory project dataset