I have come across many datasets in my research and thought I’d share my list with everyone. Feel free to contact me if you want your dataset(s) added to this page.

Blog articles which provide dataset directories
– Excellent article listing available datasets in the area of machine learning and inference
– Has blog, tag cloud and wiki dataset categories
– Article containing a list of available dataset websites
– Article describing 100+ datasets

Dataset directories
– Public datasets listed on a Quora Q&A thread
– Content Analysis for the Web 2.0 (CAW 2.0) Workshop – part of the 18th International World Wide Web Conference. Contains training and test datasets from Twitter, MySpace, Slashdot, Ciao and Kongregate.
– Has a machine learning repository
– Listing of links to various datasets
– Linguistic Data Consortium catalog
– Google Research has stated that it will soon host open-source scientific datasets – watch this space
– 800 datasets in ARFF format for different problems and application domains
– The Global Social Change Research Project – social, political and economic datasets

Datasets for a specific field
– Machine learning competitions with prize money, using data provided by organisations
– Good list here – pay attention to the web/news/blogs and Text/Language categories as well as trust network data
– Look under data sets
– Look under corpora
– Reuters Corpora – contains a large collection of news stories for use in Natural Language Processing, Information Retrieval and Machine Learning systems (need to order CDs)
– Text retrieval. Has spam, web, question answering, blog and ad hoc (e.g. relevance judgement) tracks (300MB) – Spam Corpus 2005 (75MB English, 60MB Chinese) – Spam Corpus 2006 – Relevance Judgement (25GB – costs 400 GBP) – Blog 06 data – Question Answering (many tracks) – Novelty (some relevance)
– Languages – lexicon – lexical
– Lexical database that is handy for computational linguistics and natural language processing
– Machine learning datasets
– Machine learning datasets – recommended benchmark data for comparing different classifier algorithms
– Trust datasets – includes Epinions
– Metafilter – contains posts, comments, tags, favourites, contact and user data
– YouTube dataset
– Social network dataset
– Newsgroup dataset
– Webspam datasets

Link Analysis / Social Networks
– Twitter dataset – friends network for 2009 and 2013

Natural Language Processing
– Multilingual WordNet list containing many languages

Recommender systems
– MovieLens
– Jester
– Netflix
– Book Crossing

Forums
– Includes user ratings of posts

Blogs
– Spam blogs (splogs)
– 14 million posts, 3 million weblogs – apparently no longer available since Dec 8, 2006
– but costs 400 GBP!

Wikis
– Wikipedia 3, providing Wikipedia datasets
– Official Wikipedia database dumps (very large)
– English Wikipedia articles that have been transformed into XML – all files ~55GB
– Structured information from Wikipedia – a dataset of this is available

Webpages
– 85 billion webpages archived since 1996

Misc
– Stock data
– Miscellaneous datasets
– Datasets from the Journal of the American Statistical Association
– Music dataset
– Directory of company & business professional dataset
– Library catalogue
– Media library
– Article talking about integrating WordNet and Wikipedia with YAGO (an extensible and light-weight ontology)
– Country maps
– Open Directory Project dataset
– Online personality tests data