I have come across many datasets in my research and thought I’d share my list with everyone. Feel free to contact me if you want your dataset(s) added to this page.
Blog articles which provide dataset directories
http://conflate.net/inductio/2008/02/a-meta-index-of-data-sets/ – excellent article listing available data sets in the area of machine learning and inference
http://www.datawrangling.com/some-datasets-available-on-the-web.html
http://www.daniel-lemire.com/blog/data-for-data-mining/ – has blog, tag cloud, wiki dataset categories
http://www.kirix.com/blog/category/data-tagssearch/
http://mobblog.cs.ucl.ac.uk/datasets/
http://www.readwriteweb.com/archives/where_to_find_open_data_on_the.php – Article containing a list of available dataset websites
Dataset directories
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public – Public datasets listed on a Quora Q&A thread.
http://caw2.barcelonamedia.org/node/7 – Content Analysis for the Web 2.0 (CAW 2.0) Workshop – part of 18th International Conference of the World Wide Web. Contains training and test datasets from Twitter, MySpace, Slashdot, Ciao and Kongregate.
http://kdd.ics.uci.edu/ – has a machine learning repository
http://archive.ics.uci.edu/ml/datasets.html
http://ckan.net/ – listing of links to various datasets
http://www.ldc.upenn.edu/Obtaining/ – Linguistic data consortium catalog
http://www.swivel.com/data_sets
http://datamob.org/datasets
http://infochimps.org/
http://www.freebase.com/
http://numbrary.com/
http://theinfo.org/
http://www.trustlet.org/wiki/Repositories_of_datasets
http://del.icio.us/kirixstrata/publicdata
http://services.alphaworks.ibm.com/manyeyes/browse/data?q=null
http://googleresearch.blogspot.com/ – Google Research has stated that http://research.google.com will soon host open-source scientific datasets (see http://blog.wired.com/wiredscience/2008/01/google-to-provi.html). Watch this space.
http://data.un.org/
http://www.data360.org/index.aspx
http://tunedit.org/search?q=arff – 800 datasets in ARFF format for different problems and application domains
http://wikiposit.com
http://gsociology.icaap.org/dataupload.html – The Global Social Change Research Project – social, political and economic datasets
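For anyone who has not met ARFF before (the format of the TunedIT datasets listed above), it is the plain-text Attribute-Relation File Format used by Weka: a header of @relation and @attribute lines, then an @data section of comma-separated rows, with % for comments and ? for missing values. Below is a minimal sketch of a reader using only the Python standard library. The function name parse_arff and the toy SAMPLE are my own illustrations, and the sketch assumes dense (non-sparse) data and attribute names without spaces, so treat it as a starting point rather than a complete parser.

```python
import csv

def parse_arff(text):
    """Parse a simple, dense ARFF document into (relation, attributes, rows).

    Assumptions (not a full ARFF implementation): attribute names contain
    no spaces, and data rows are not in the sparse {index value} form.
    """
    relation = None
    attributes = []  # list of (name, declared_type) tuples
    rows = []        # list of lists of strings; None marks a missing value
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # ARFF commonly quotes values with single quotes
            values = next(csv.reader([line], quotechar="'", skipinitialspace=True))
            rows.append([None if v.strip() == "?" else v.strip() for v in values])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name.strip("'\""), kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

# Hypothetical toy file, only for illustration
SAMPLE = """% toy example
@relation weather
@attribute outlook {sunny, rainy}
@attribute temperature numeric
@data
sunny, 85
rainy, ?
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```

All values come back as strings; converting numeric attributes is left to the caller, since the right type depends on each attribute declaration.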
Data sets for a specific field
http://kaggle.com/ – machine learning competitions with prize money, using data provided by organisations
http://theinfo.org/get/data – good list here – pay attention to web/news/blogs and Text/Language categories as well as trust network data
http://research.microsoft.com/nlp/ – look under data sets
http://nlp.stanford.edu/links/statnlp.html – look under corpora
http://trec.nist.gov/data/reuters/reuters.html – Reuters Corpora – contains a large collection of news stories for use in Natural Language Processing, Information Retrieval and Machine Learning systems (need to order CDs)
http://trec.nist.gov/data.html – Text retrieval. Has spam, web, question answering, blog and ad hoc (e.g. relevance judgement) tracks
http://plg.uwaterloo.ca/~gvcormac/treccorpus/ (300MB) – Spam Corpus 2005
http://plg.uwaterloo.ca/~gvcormac/treccorpus06/ (75MB English, 60MB Chinese) – Spam Corpus 2006
http://trec.nist.gov/data/reljudge_eng.html – Relevance Judgement
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html (25GB – costs 400 GBP) – Blog 06 data
http://trec.nist.gov/data/qamain.html – Question Answering (many tracks)
http://trec.nist.gov/data/novelty.html – Novelty (some relevance)
http://infochimps.org/tag/language/datasets – languages
http://infochimps.org/tag/lexicon/datasets – lexicon
http://infochimps.org/tag/lexical/datasets – lexical
http://wordnet.princeton.edu/ – Lexical database that is handy for computational linguistics and natural language processing
http://www.dmoz.org/Computers/Artificial_Intelligence/Machine_Learning/Datasets/ – Machine learning datasets
http://cervisia.org/machine_learning_data.php – Machine learning datasets – for benchmark data to compare different classifier algorithms, http://www.ci.tuwien.ac.at/~meyer/benchdata/ is recommended
http://mill.ucsd.edu/index.php?page=Datasets&subpage=Overview
http://www.trustlet.org/wiki/Trust_network_datasets#Released_datasets – Trust datasets – includes Epinions
http://stuff.metafilter.com/infodump/ – Metafilter – contains posts, comments, tags, favourites, contact and user data
http://an.kaist.ac.kr/traces/IMC2007.html – YouTube dataset
http://socialnetworks.mpi-sws.mpg.de/ – social network dataset
http://people.csail.mit.edu/jrennie/20Newsgroups/ – newsgroup dataset
http://www.yr-bcn.es/webspam/datasets/ – Webspam datasets
Link Analysis
http://www.cs.toronto.edu/~tsap/experiments/datasets/index.html
http://www.cs.toronto.edu/~tsap/experiments/download/download.html
Recommender systems
http://www.grouplens.org/ – MovieLens
http://www.ieor.berkeley.edu/~goldberg/jester-data/ – Jester
http://www.netflixprize.com/ – Netflix
http://www.informatik.uni-freiburg.de/~cziegler/BX/ – Book Crossing
Forums
http://weimo.de/node/642 – Nabble.com + user ratings of posts
Blogs
http://ebiquity.umbc.edu/resource/html/id/212/Splog-Blog-Dataset – Spam blogs (splogs)
http://www.icwsm.org/data.html – 14 million posts, 3 million weblogs – apparently no longer available since Dec 8, 2006
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html – Blog 06 data, an alternative that costs 400 GBP
Wikis
http://labs.systemone.at/wikipedia3 – Wikipedia 3, providing Wikipedia datasets
http://download.wikipedia.org/ – official Wikipedia database dumps (very large)
http://download.freebase.com/wex/ – English Wikipedia articles that have been transformed into XML – all files ~ 55GB
http://dbpedia.org/About – structured information extracted from Wikipedia, available as a dataset
Webpages
http://www.archive.org/web/web.php – 85 billion webpages archived since 1996
Misc
http://opentick.com/ – Stock data
http://lib.stat.cmu.edu/datasets/ – miscellaneous datasets
http://lib.stat.cmu.edu/jasadata/ – datasets from Journal of the American Statistical Association
http://musicbrainz.org/ – music dataset
http://www.jigsaw.com/ – directory of companies and business professionals
http://www.librarything.com/ – library catalogue
http://www.imeem.com/developers – media library
http://www.scribd.com/doc/9582/integrating-wikipediawordnet – article on integrating WordNet and Wikipedia with YAGO (an extensible and light-weight ontology)
http://wiki.openstreetmap.org/index.php/Potential_Datasources – country maps
http://rdf.dmoz.org/ – Open Directory Project dataset