Data & Code

Data Sets Developed in Our Research

MSDialog Download
- MSDialog is a labeled dialog dataset of question answering (QA) interactions between information seekers and answer providers on Microsoft Community, an online forum on Microsoft products. The annotated portion contains more than 2,000 multi-turn information-seeking conversations with 10,000 utterances, each annotated with user intent at the utterance level; the complete data set contains about 35K information-seeking conversations. Annotations were crowdsourced through Amazon Mechanical Turk. MSDialog has several versions, including the complete set (MSDialog-Complete) and a labeled subset (MSDialog-Intent). We also preprocessed the data to produce MSDialog-ResponseRank for conversation response ranking.
- The data set was developed in our SIGIR'18 paper Analyzing and Characterizing User Intent in Information-seeking Conversations and is used in the response ranking experiments of our SIGIR'18 paper Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems.
WikiPassageQA Download
- A benchmark collection of thousands of questions with annotated non-factoid answers, intended for research on non-factoid answer passage retrieval. It was developed in our SIGIR'18 paper A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval.
The email ID subset of the Avocado email collection Download
- The email IDs of the filtered subset of the Avocado email collection used in our SIGIR'17 paper Characterizing and Predicting Enterprise Email Reply Behavior.
WebAP: a benchmark data set for answer passage/sentence retrieval for non-factoid questions from Web queries. Download
- The benchmark data set for answer passage/sentence retrieval used in our ECIR'16 paper Beyond Factoid QA: Effective Methods for Non-factoid Answer Sentence Retrieval.
Travel CQA summary data sets Download
- Summary data sets of Yahoo! Answers CQA posts, annotated and used in our COLING'14 paper Generating Supplementary Travel Guides from Social Media.
Create Debate data sets Download
- Six data sets used in our NAACL'13 paper Mining User Relations from Online Discussions using Sentiment Analysis and Probabilistic Matrix Factorization.
CQARank data sets Download
- Data sets used in our CIKM'13 paper CQARank: Jointly Model Topics and Expertise in Community Question Answering.

Open Source Project Code & Software

Gibbs Sampling of LDA (Github)
- An open source package for Gibbs sampling of LDA.

EM Inference of PLSA (Github)

Attention-Based Neural Matching Model Java Code (Github aNMM-CIKM16)
- This package implements the Attention-based Neural Matching Model (aNMM) for question answering with the TREC QA data.
- aNMM was proposed in our CIKM'16 paper aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model.

Topic Expertise Model Java Code (Github TEM)
- This package implements Gibbs sampling for the Topic Expertise Model (TEM), which jointly models topics and expertise in question answering communities.
- TEM was proposed in our CIKM'13 paper CQARank: Jointly Model Topics and Expertise in Community Question Answering.

Sentiment analysis for online discussion forums Java Code (Github NLPForumPostOTE)
- This package implements the construction of opinion matrices, which are the input to the PMF model. The main features include aspect identification, opinion expression identification, and opinion relation extraction based on dependency path rules.

Twitter-LDA Java Code (Github Twitter-LDA)
- In the original setting of Latent Dirichlet Allocation (LDA), each word has its own topic label. This may not work well on Twitter, because tweets are short and a single tweet is more likely to be about one topic. Twitter-LDA (T-LDA) was proposed to address this issue in the paper "Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan and Xiaoming Li. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR'11)". T-LDA also addresses the noisy nature of tweets by capturing background words in tweets.
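
The one-topic-per-tweet idea with background words can be illustrated by sampling from a simplified generative story. This is only a rough sketch, not the code in the Twitter-LDA package (which performs inference, not generation): the function name `generate_tweet`, the parameter `topic_word_ratio`, and the toy distributions are all hypothetical and hand-picked for illustration.

```python
import random


def generate_tweet(user_topic_probs, topic_word_probs, background_word_probs,
                   topic_word_ratio, length, rng):
    """Sample one tweet: a single topic for the whole tweet, then its words."""
    # One topic per tweet, drawn from the author's topic distribution
    # (plain LDA would instead draw a topic for every word).
    topics = list(user_topic_probs)
    topic = rng.choices(topics, weights=[user_topic_probs[t] for t in topics])[0]

    words = []
    for _ in range(length):
        # Per-word switch: emit a topic word or a background (noise) word.
        if rng.random() < topic_word_ratio:
            dist = topic_word_probs[topic]
        else:
            dist = background_word_probs
        vocab = list(dist)
        words.append(rng.choices(vocab, weights=[dist[w] for w in vocab])[0])
    return topic, words


# Toy distributions, assumed for illustration only.
user_topic_probs = {"sports": 0.7, "politics": 0.3}
topic_word_probs = {
    "sports": {"game": 0.5, "team": 0.3, "score": 0.2},
    "politics": {"vote": 0.5, "senate": 0.3, "policy": 0.2},
}
background_word_probs = {"lol": 0.4, "today": 0.3, "rt": 0.3}

rng = random.Random(0)
topic, words = generate_tweet(user_topic_probs, topic_word_probs,
                              background_word_probs, topic_word_ratio=0.7,
                              length=6, rng=rng)
print(topic, words)
```

Every non-background word in a sampled tweet comes from the same topic, which is the property that distinguishes this setting from per-word topic assignment.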

MatchZoo: A toolkit for deep neural text matching, developed to facilitate designing, comparing, and sharing deep text matching models. The implemented models include ARC-I/ARC-II, DSSM, CDSSM, MatchPyramid, DRMM, aNMM, MV-LSTM, Duet, etc.
NeuralResponseRanking: An open source package implementing several neural matching models for response ranking in information-seeking conversations.

Technical Notes
