Tools & Benchmarks

Benchmark Datasets


TAC, TREC, NTCIR, CLEF, DUC, ACE_upenn

Good Open Source Code/Library/Infra

  • ParlAI: ParlAI (pronounced “par-lay”) is a framework for dialog AI research, implemented in Python. Its goal is to provide researchers with a unified framework for training and testing dialog models, multi-task training over many datasets at once, and seamless integration of Amazon Mechanical Turk for data collection and human evaluation.

  • Huggingface Transformers: Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, T5, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with thousands of pretrained models in 100+ languages and deep interoperability between PyTorch and TensorFlow 2.0.

Good Open Source Knowledge Base

  • YAGO: YAGO is a huge semantic knowledge base derived from Wikipedia, WordNet, and GeoNames. Currently, YAGO has knowledge of more than 10 million entities (such as persons, organizations, and cities) and contains more than 120 million facts about these entities.

  • Freebase: An open source knowledge base.

  • WikiData: Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
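
To make "facts about entities" concrete, here is a minimal sketch of how a knowledge base can be viewed as a set of (subject, predicate, object) triples with simple pattern lookup. The class name `TripleStore`, the predicate names, and the three sample facts are illustrative assumptions; they are not taken from the YAGO, Freebase, or Wikidata schemas or APIs.

```python
from collections import defaultdict


class TripleStore:
    """Toy in-memory store of (subject, predicate, object) facts (hypothetical helper)."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)  # index to speed up subject lookups

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        if subject is not None:
            candidates = self._by_subject.get(subject, set())
        else:
            candidates = self._triples
        return sorted(
            t for t in candidates
            if (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


# Sample facts for illustration only; predicate names are made up.
store = TripleStore()
store.add("Albert_Einstein", "wasBornIn", "Ulm")
store.add("Albert_Einstein", "type", "person")
store.add("Ulm", "isLocatedIn", "Germany")

born_in = store.match(subject="Albert_Einstein", predicate="wasBornIn")
print(born_in)
```

Real knowledge bases add identifiers, types, provenance, and query languages on top of this idea; the sketch only shows the basic shape of the data.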


Useful Tools and Packages for IR/NLP/DM/ML

  • Mallet: a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

  • Weka: a Java-based package for data mining algorithms.

  • trec_eval: the standard tool used by the TREC community for evaluating an ad hoc retrieval run, given the results file and a standard set of judged results.

  • indri: a search engine in C++ from the Lemur project. http://sourceforge.net/p/lemur/wiki/Home/

  • Galago: a search engine in Java from the Lemur project. http://sourceforge.net/p/lemur/wiki/Galago/ See also: Docs by Laura; Galago Hackathon Doc (CIIR internal access).

  • Lucene: a search engine in Java from the Apache Lucene project. https://lucene.apache.org/

  • ranklib: a library of learning to rank models. http://sourceforge.net/p/lemur/wiki/RankLib/

  • libsvm: an integrated tool for support vector classification and regression.

  • svm-rank: SVM-rank is an instance of SVM-struct for efficiently training Ranking SVMs.

  • perl: a highly capable, feature-rich programming language.

  • awk: an interpreted programming language designed for text processing and typically used as a data extraction and reporting tool on Linux/Unix systems.

  • qsub/qstat: commands for submitting jobs to a computer cluster (qsub) and checking their status (qstat).

  • splitta: a tool for sentence boundary detection in Python.

  • scikit-learn: a machine learning package in Python with implementations of various learning algorithms, including random forests and gradient-boosted decision trees (GBDT).

  • jforests: a Java library that implements many tree-based learning algorithms including LambdaMART.

  • StanfordNLP: Java-based NLP tools and packages from the Stanford NLP Group.

  • OpenNLP: Java-based NLP tools and packages from the Apache OpenNLP project.

  • Theano: a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Commonly used as a deep learning library.

  • TensorFlow: an open source software library for numerical computation using data flow graphs. Deep learning/machine learning library from Google.

  • Caffe: a deep learning framework made with expression, speed, and modularity in mind.

  • CNTK: the Computational Network Toolkit by Microsoft Research, a unified deep-learning toolkit that describes neural networks as a series of computational steps via a directed graph.

  • Torch: a scientific computing framework with wide support for machine learning algorithms that puts GPUs first.

  • MXNet/Gluon: a deep learning framework designed for both efficiency and flexibility. A video tutorial is available on YouTube.

  • Keras: a high-level neural networks library, written in Python and capable of running on top of either TensorFlow or Theano.

  • Lasagne: a lightweight library to build and train neural networks in Theano.

  • will add more...
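
As an illustration of what evaluating an ad hoc retrieval run involves (the task trec_eval is described as handling above), the sketch below computes mean average precision from a results file and a set of relevance judgments. It assumes the usual TREC line formats (`qid Q0 docno rank score tag` for runs, `qid iteration docno relevance` for judgments). All function names and the sample data are hypothetical, and this is not trec_eval's own implementation: tie-breaking and other details may differ, so use trec_eval itself for reported numbers.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Collect relevant docnos per query from 'qid iteration docno relevance' lines."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docno, rel = parts
        if int(rel) > 0:
            relevant[qid].add(docno)
    return relevant


def parse_run(lines):
    """Build a ranked docno list per query from 'qid Q0 docno rank score tag' lines."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        qid, _q0, docno, _rank, score, _tag = parts
        scored[qid].append((float(score), docno))
    # Rank by descending score; ties keep input order (an assumption of this sketch).
    return {
        qid: [docno for _, docno in sorted(docs, key=lambda pair: -pair[0])]
        for qid, docs in scored.items()
    }


def average_precision(ranking, relevant_docs):
    """Average of precision values at the rank of each relevant document retrieved."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docs)


def mean_average_precision(run, qrels):
    """Mean of average precision over queries with at least one relevant document."""
    qids = [qid for qid in qrels if qrels[qid]]
    if not qids:
        return 0.0
    return sum(average_precision(run.get(qid, []), qrels[qid]) for qid in qids) / len(qids)


# Made-up sample data, for illustration only.
QRELS_LINES = [
    "q1 0 d1 1",
    "q1 0 d2 0",
    "q1 0 d3 1",
    "q2 0 d4 1",
]
RUN_LINES = [
    "q1 Q0 d1 1 3.0 demo",
    "q1 Q0 d2 2 2.0 demo",
    "q1 Q0 d3 3 1.0 demo",
    "q2 Q0 d5 1 2.0 demo",
    "q2 Q0 d4 2 1.0 demo",
]

qrels = parse_qrels(QRELS_LINES)
run = parse_run(RUN_LINES)
map_score = mean_average_precision(run, qrels)
print(f"MAP = {map_score:.4f}")
```

In practice the two line lists would be read from the run file and the judgments file on disk; they are inlined here so the sketch is self-contained.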