MS MARCO


Microsoft MAchine Reading COmprehension Dataset (MS MARCO) is a new large-scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real, anonymized user queries. The context passages, from which the answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human-generated, written whenever an annotator could summarize an answer from the passages.

Academic Torrents

Academic Torrents makes 15.49 TB of research data available.

We’ve designed a distributed system for sharing enormous datasets - for researchers, by researchers. The result is a scalable, secure, and fault-tolerant repository for data, with blazing fast download speeds.

One Billion Word Benchmark

The One Billion Word Benchmark is a large English-language corpus released in 2013. The dataset contains about one billion words and has a vocabulary of about 800K words. It consists mostly of news data. Because the sentences in the training set are shuffled, models can ignore cross-sentence context and focus on sentence-level language modeling.
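Sentence-level modeling means each sentence is scored on its own, so the order of sentences does not affect the evaluation. The following is a minimal sketch of that idea, assuming a toy in-memory corpus rather than the benchmark's actual files. The helper names, the `min_count` cutoff, and the add-one smoothed unigram model are illustrative choices, not part of the benchmark.

```python
import math
from collections import Counter

UNK, EOS = "<UNK>", "</S>"

def build_vocab(sentences, min_count=2):
    """Keep words seen at least min_count times; the rest become <UNK>."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_count}

def tokenize(sentence, vocab):
    """Map out-of-vocabulary words to <UNK> and append an end-of-sentence token."""
    return [w if w in vocab else UNK for w in sentence.split()] + [EOS]

def train_unigram(sentences, vocab):
    """Return an add-one smoothed unigram probability function."""
    counts = Counter()
    for s in sentences:
        counts.update(tokenize(s, vocab))
    total = sum(counts.values())
    num_types = len(vocab) + 2  # vocabulary plus <UNK> and </S>
    return lambda w: (counts[w] + 1) / (total + num_types)

def perplexity(sentences, vocab, prob):
    """Per-token perplexity, scoring every sentence independently."""
    log_sum, n_tokens = 0.0, 0
    for s in sentences:
        for w in tokenize(s, vocab):
            log_sum += math.log(prob(w))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)

# Toy data standing in for the real corpus.
train_sentences = ["the cat sat", "the dog sat", "a cat ran"]
heldout_sentences = ["the cat ran", "a dog sat"]

toy_vocab = build_vocab(train_sentences, min_count=2)
toy_prob = train_unigram(train_sentences, toy_vocab)
print(perplexity(heldout_sentences, toy_vocab, toy_prob))
```

Because no state is carried from one sentence to the next, reordering the held-out sentences leaves the perplexity unchanged, which is the property the shuffled training set encourages.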

Annotated English Gigaword

Annotated English Gigaword is a dataset often used in summarization research. It was developed by Johns Hopkins University's Human Language Technology Center of Excellence. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader involvement in large-scale knowledge-acquisition efforts by researchers. Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition.

Google Brain robot datasets

The Google Brain team releases massive robotics datasets from two of their recent papers to further drive the field forward. Their grasping dataset contains roughly 650,000 examples of robot grasping attempts (original paper). Their push dataset contains roughly 59,000 examples of robot pushing motions, including one training set (train) and two test sets of previously seen (testseen) and unseen (testnovel) objects (original paper).
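The distinction between seen and novel test objects can be illustrated with a small sketch. This is not the Google Brain team's tooling: the record field `object_id`, the helper `split_by_object`, and the held-out fraction are all hypothetical, chosen only to show how a split of this shape might be built.

```python
import random

def split_by_object(episodes, novel_objects, test_fraction=0.2, seed=0):
    """Split episodes into train, test-seen, and test-novel sets.

    Episodes whose object is in novel_objects go only to the novel test set.
    The remaining episodes are divided at random between train and test-seen.
    """
    rng = random.Random(seed)
    train, test_seen, test_novel = [], [], []
    for episode in episodes:
        if episode["object_id"] in novel_objects:
            test_novel.append(episode)
        elif rng.random() < test_fraction:
            test_seen.append(episode)
        else:
            train.append(episode)
    return train, test_seen, test_novel

# Hypothetical records: 100 push episodes over 10 objects.
episodes = [{"episode": i, "object_id": i % 10} for i in range(100)]
train, test_seen, test_novel = split_by_object(episodes, novel_objects={8, 9})
print(len(train), len(test_seen), len(test_novel))
```

Keeping the novel objects out of training entirely is what lets the second test set measure generalization to objects the model has never encountered.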



SpaceNet

SpaceNet is a corpus of commercial satellite imagery and labeled training data, made available at no cost to the public to foster innovation in computer vision algorithms that automatically extract information from remote sensing data.


The current SpaceNet corpus includes approximately 1,900 square kilometers of full-resolution 50 cm imagery collected from DigitalGlobe’s WorldView-2 commercial satellite, including 8-band multispectral data. The dataset also includes 220,594 building footprints derived from this imagery, which can be used as training data for machine learning. The dataset is being made public to advance the development of algorithms that automatically extract geometric features such as roads, building footprints, and points of interest from satellite imagery. The first Area of Interest (AOI) to be released is Rio de Janeiro, Brazil.
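One common way to turn a building footprint into training data is to rasterize the polygon into a binary mask aligned with the image pixels. The sketch below assumes a footprint already expressed in meters relative to the tile origin and a 0.5 m pixel size to match the 50 cm imagery. The function names and the example rectangle are illustrative and are not taken from the SpaceNet tooling.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def rasterize(polygon, width_px, height_px, pixel_size=0.5):
    """Return a height_px x width_px mask; 1 where a pixel center is inside."""
    mask = []
    for row in range(height_px):
        center_y = (row + 0.5) * pixel_size
        mask.append([
            1 if point_in_polygon((col + 0.5) * pixel_size, center_y, polygon) else 0
            for col in range(width_px)
        ])
    return mask

# A hypothetical 10 m x 8 m rectangular footprint, in meters.
footprint = [(0.0, 0.0), (10.0, 0.0), (10.0, 8.0), (0.0, 8.0)]
mask = rasterize(footprint, width_px=24, height_px=20)
building_pixels = sum(sum(row) for row in mask)
print(building_pixels, building_pixels * 0.5 * 0.5)
```

Testing pixel centers keeps the sketch short. A production pipeline would more likely use a geospatial library and handle map projections, which this example leaves out.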

Written on December 12, 2016