Tools & Benchmarks
Benchmark Dataset
TAC, TREC, NTCIR, CLEF, DUC, ACE (UPenn)
State-of-the-Art Results in NLP and CV on Papers with Code: a nice collection of published state-of-the-art results on a variety of tasks such as QA, MRC, MT, LM, etc.
Academic Benchmark by Prof. Jiafeng Guo: benchmark datasets, code, and scripts for many IR/NLP domains such as recommendation, representation learning, topic modeling, community detection, learning to rank, diverse ranking, Neu-IR, etc.
The major purpose is to change the situation in CS where most published methods are difficult to compare against due to the lack of public datasets, open code, or clear experimental settings. Therefore, this website not only reports the performance of state-of-the-art algorithms in different domains, but also collects the corresponding datasets, code, and scripts that make the experimental results reproducible. When developing new methods in these domains, researchers can simply take the available baseline methods and compare against them. In this way, researchers can spend less time on duplicated effort and focus more on new ideas.
Datasets for Natural Language Processing by Karthik Narasimhan from the MIT NLP group.
Question Answering Benchmark Datasets
TREC QA: http://www.aclweb.org/aclwiki/index.php?title=Question_Answering_(State_of_the_art)
WikiQA: https://www.microsoft.com/en-us/download/details.aspx?id=52419
InsuranceQA: https://github.com/shuzi/insuranceQA
AmazonQA: http://jmcauley.ucsd.edu/data/amazon/qa/
Yahoo CQA: https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
GraphQuestions: https://github.com/ysu1989/GraphQuestions
Microsoft Research QA: a text file of 1.4K questions aimed at the text of Encarta 98, the full text of Encarta 98, and a set of human annotations identifying pieces of text in Encarta that fully or partially answer each question.
WebQA: WebQA is a large-scale Chinese human-annotated real-world QA dataset containing 42k questions and 579k evidences, where an evidence is a piece of text that may contain information for answering the question. The evidences are retrieved using a search engine with the questions as queries. The data provides 1 to 10 annotated evidences (depending on the search results) for each question.
WebAP: WebAP is a benchmark dataset for non-factoid answer passage/sentence retrieval.
QuAC: Question Answering in Context. A dataset containing 14K information-seeking QA dialogs (100K questions in total). Paper and Leaderboard
CoQA: A Conversational Question Answering Challenge. A novel dataset for building conversational question answering systems; it contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. Paper and Leaderboard
Machine Comprehension Benchmark Datasets
MS MARCO: http://www.msmarco.org/ (paper, GitHub)
DeepMind Q&A Dataset (CNN and Daily Mail): http://cs.nyu.edu/~kcho/DMQA/
MCTest: http://research.microsoft.com/en-us/um/redmond/projects/mctest/data.html
StoryCloze: http://cs.rochester.edu/nlp/rocstories/
Children's Book Test: available from the Facebook bAbI project downloads, https://research.fb.com/downloads/babi/
Dialogue Systems Benchmark Datasets
Facebook bAbI Project QA Data: https://research.fb.com/downloads/babi/
The 20 QA bAbI tasks
The 6 dialog bAbI tasks
The Children’s Book Test
The Movie Dialog dataset
The WikiMovies dataset
The Dialog-based Language Learning dataset
The SimpleQuestions dataset
HITL Dialogue Simulator
A Good Survey of Available Corpora for Building Data-Driven Dialogue Systems: https://breakend.github.io/DialogDatasets/
Ubuntu Dialogue Corpus: http://cs.mcgill.ca/~jpineau/datasets/ubuntu-corpus-1.0/
DSTC 2&3: http://camdial.org/~mh521/dstc/
Alibaba Customer Service Dialog Data
Will add more...
Document Ranking and Passage Ranking Benchmark Datasets
Gov2: http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm
Robust04: http://trec.nist.gov/data/robust/04.guidelines.html
ClueWeb 12: http://lemurproject.org/clueweb12/
ClueWeb 09: http://lemurproject.org/clueweb09.php/
MS MARCO Document Ranking & Passage Ranking data: https://microsoft.github.io/msmarco/
TREC 2019 Deep Learning Track Document Ranking & Passage Ranking data: https://github.com/microsoft/TREC-2019-Deep-Learning
ORCAS: Open Resource for Click Analysis in Search: ORCAS is a click-based dataset associated with the TREC Deep Learning Track. It covers 1.4 million of the TREC DL documents, providing 18 million connections to 10 million distinct queries.
TREC-2020-Deep-Learning on document/passage ranking: the TREC 2019 and TREC 2020 Deep Learning Tracks provide large-scale train/eval/test benchmark data for passage and document ranking.
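All of the collections above distribute relevance judgments as TREC-style qrels, and system output is exchanged as TREC-style run files. Below is a minimal parsing sketch in Python, assuming those standard formats (the file names are placeholders); the same files can be fed directly to trec_eval, listed under tools further down.

```python
from collections import defaultdict

def load_qrels(path):
    """Parse a TREC-style qrels file: `qid iteration docid relevance` per line."""
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return qrels

def load_run(path):
    """Parse a TREC-style run file: `qid Q0 docid rank score tag` per line."""
    run = defaultdict(list)
    with open(path) as f:
        for line in f:
            qid, _, docid, rank, score, _tag = line.split()
            run[qid].append((docid, int(rank), float(score)))
    return run

# Hypothetical file names, for illustration only.
qrels = load_qrels("qrels.dev.txt")
run = load_run("run.dev.txt")
```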
Learning to Rank Document Retrieval Benchmark Datasets
Microsoft LETOR 3.0 (575 Gov queries + 106 Ohsumed queries): LETOR is a package of benchmark data sets for research on LEarning TO Rank, which contains standard features, relevance judgments, data partitioning, evaluation tools, and several baselines.
Microsoft LETOR 4.0 (MQ2007 with 1,700 queries & MQ2008 with 800 queries, Gov2 collection): unlike previous versions (V3.0 was an update of V2.0, which was an update of V1.0), LETOR 4.0 is a totally new release. It uses the Gov2 web page collection (~25M pages) and two query sets from the Million Query Track of TREC 2007 and TREC 2008, called MQ2007 and MQ2008 for short. MQ2007 contains about 1,700 queries with labeled documents and MQ2008 about 800.
Microsoft Learning to Rank Datasets: MSLR-WEB30K with more than 30,000 queries, and MSLR-WEB10K, a random sample of it with 10,000 queries.
Yahoo! Learning to Rank Challenge Datasets: features extracted from (query, url) pairs along with relevance judgments. The queries, urls, and feature descriptions are not given; only the feature values are.
Istella Learning to Rank dataset: The Istella LETOR full dataset is composed of 33,018 queries and 220 features representing each query-document pair. It consists of 10,454,629 examples labeled with relevance judgments ranging from 0 (irrelevant) to 4 (perfectly relevant).
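The learning-to-rank datasets above share the same SVM-light-style line format, `<label> qid:<qid> <feature_id>:<value> ... [# comment]`. A minimal loading sketch in Python, assuming that format (the file path is a placeholder); scikit-learn, listed under tools further down, can read it directly.

```python
from sklearn.datasets import load_svmlight_file

# query_id=True additionally returns the qid of every row, which is needed
# to group documents per query when training pairwise/listwise rankers.
X, y, qid = load_svmlight_file("train.txt", query_id=True)  # hypothetical path
print(X.shape, y[:5], qid[:5])
```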
Web Search Query Log Downloads from Prof. Jeff Huang
http://jeffhuang.com/search_query_logs.html (AOL Query Logs, MSN Query Logs, Sogou Query Logs, Yandex Query Logs, Yahoo Logs)
More useful web search datasets from Sogou: http://www.sogou.com/labs/resource/list_yuliao.php
State-of-the-art results on core NLP/IR tasks
Recommender Systems Dataset List by Prof. Julian McAuley at UCSD
Good Open Source Code / Libraries / Infra
ParlAI: ParlAI (pronounced “par-lay”) is a framework for dialog AI research, implemented in Python. Its goal is to provide researchers with a unified framework for training and testing dialog models, multi-task training over many datasets at once, and seamless integration of Amazon Mechanical Turk for data collection and human evaluation.
Huggingface Transformers: Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, T5, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with thousands of pretrained models in 100+ languages and deep interoperability between PyTorch & TensorFlow 2.0.
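A minimal usage sketch of the Transformers API described above, assuming PyTorch is installed; the checkpoint name is just one of the many publicly available pretrained models.

```python
from transformers import AutoTokenizer, AutoModel

# Load a pretrained checkpoint (weights are downloaded on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Encode a query and run a forward pass to get contextual token embeddings.
inputs = tokenizer("what is information retrieval?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```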
Good Open Source Knowledge Base
YAGO: YAGO is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.
Freebase: a large collaborative open knowledge base (now retired; data dumps remain available and much of its content has been migrated to Wikidata).
WikiData: Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
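As a small illustration of reading a knowledge base "by machine", the sketch below queries Wikidata's public SPARQL endpoint from Python; the endpoint URL is the documented one, while the query itself (five entities that are instances of human, wd:Q5) is just an example.

```python
import requests

# Fetch five entities that are instances of "human" (wd:Q5) with English labels.
query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "ir-resources-example/0.1"},  # a polite, identifying UA
)
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```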
Useful Tools and Packages for IR/NLP/DM/ML
Mallet: a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
Weka: a Java-based package for data mining algorithms.
trec_eval: the standard tool used by the TREC community for evaluating an ad hoc retrieval run, given the results file and a standard set of judged results.
indri: a search engine in C++ from the Lemur project. http://sourceforge.net/p/lemur/wiki/Home/
Galago: a search engine in Java from the Lemur project. http://sourceforge.net/p/lemur/wiki/Galago/ Additional resources: docs by Laura and the Galago Hackathon doc (CIIR internal access).
Lucene: a search engine in Java from the Apache Lucene project. https://lucene.apache.org/
ranklib: a library of learning to rank models. http://sourceforge.net/p/lemur/wiki/RankLib/
libsvm: an integrated tool for support vector classification and regression.
svm-rank: SVM-rank is an instance of SVM-struct for efficiently training Ranking SVMs.
perl: a highly capable, feature-rich programming language.
awk: an interpreted programming language designed for text processing, typically used as a data extraction and reporting tool on Linux/Unix systems.
qsub/qstat: submit and monitor jobs on computer clusters.
splitta: a tool for sentence boundary detection in Python.
scikit-learn: a machine learning package in Python with implementations of many learning algorithms, including random forests and GBDT (see the ranking sketch after this list).
jforests: a Java library that implements many tree-based learning algorithms including LambdaMART.
StanfordNLP: Java-based NLP tools and packages from the Stanford NLP group.
OpenNLP: Java-based NLP tools and packages from the Apache OpenNLP project.
Theano: a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Deep learning library.
TensorFlow: an open source software library for numerical computation using data flow graphs. Deep learning/machine learning library from Google.
Caffe: a deep learning framework made with expression, speed, and modularity in mind.
CNTK: the Computational Network Toolkit by Microsoft Research, a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph.
Torch: a scientific computing framework with wide support for machine learning algorithms that puts GPUs first.
MXNet/Gluon: a deep learning framework designed for both efficiency and flexibility. Video Tutorial on YouTube
Keras: a high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.
Lasagne: a lightweight library to build and train neural networks in Theano.
will add more...
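Several of the tools above (scikit-learn, jforests, RankLib, svm-rank) are commonly paired with the learning-to-rank datasets listed earlier. Below is a minimal pointwise ranking sketch with scikit-learn's GBDT; the data is synthetic and merely stands in for features loaded from a LETOR-style file (46 features, as in LETOR 4.0).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for LETOR-style features and graded relevance labels.
rng = np.random.RandomState(0)
X = rng.rand(1000, 46)               # 46 features per query-document pair
y = rng.randint(0, 3, size=1000)     # graded relevance labels in {0, 1, 2}

# Pointwise approach: regress on the relevance grade, then sort by score.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X, y)

# Rank the (hypothetical) candidate documents of one query by predicted score.
scores = model.predict(X[:20])
ranking = np.argsort(-scores)
print(ranking)
```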