DATA SET FOR QUERY AUTO-COMPLETION (SIGIR 2017)

This is a data set used in the following paper:

Park, Dae Hoon, and Rikio Chiba. "A neural language model for query auto-completion." Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017. [ pdf ]

Link for data set [ zip ]

The data set includes files for four subsets:

background (6,773,535 queries; used for training our neural language model; queries that appear fewer than three times are removed to filter out noise.)
training (75,198 prefixes; used for training LambdaMART.)
validation (30,993 prefixes)
test (32,559 prefixes; the paper mistakenly reports this number as "32,044".)
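
The frequency filter described for the background data can be sketched as follows. This is a minimal illustration, not the script used to build the data set: the function name filter_rare_queries and the parameter min_count are hypothetical, and whether the released file keeps repeated occurrences of a query is an assumption.

```python
from collections import Counter


def filter_rare_queries(queries, min_count=3):
    """Drop queries that occur fewer than min_count times.

    queries is a list of query strings, one entry per occurrence in the log.
    Repeated occurrences of a surviving query are kept (an assumption).
    """
    counts = Counter(queries)
    return [q for q in queries if counts[q] >= min_count]


if __name__ == "__main__":
    log = ["my space", "my space", "my space", "mapquest", "mapquest", "zzqx"]
    print(filter_rare_queries(log))
```

With min_count=3, only the query that occurs three times survives in this toy log.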

In each of training, validation, and test data, there are only two columns: a prefix and the corresponding query.
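
A two-column file of this kind could be read as sketched below. The column separator is not stated on this page, so the tab default is an assumption, as are the function name parse_pairs and the choice to strip the trailing line break from the query column.

```python
def parse_pairs(lines, sep="\t"):
    """Parse prefix/query lines into a list of (prefix, query) tuples.

    lines is any iterable of strings, for example an open text file.
    sep is the column separator; a tab is assumed here.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")  # drop only the line break, keep spaces
        if not line:
            continue  # skip empty lines
        prefix, query = line.split(sep, 1)
        pairs.append((prefix, query))
    return pairs


if __name__ == "__main__":
    sample = ["my s\tmy space\n", "my sp\tmy space\n"]
    for prefix, query in parse_pairs(sample):
        print(repr(prefix), "->", repr(query))
```

Passing an open file handle instead of the sample list works the same way, since the function only iterates over lines.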

For example, the query "my space\n", where \n is the newline character, generates the following prefix-query pairs:

prefix      query
my          my space\n
my s        my space\n
my sp       my space\n
my spa      my space\n
my spac     my space\n
my space    my space\n
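
The pairs listed above can be reproduced with a short sketch. The generation rule is not spelled out on this page, so the function below is inferred from this single example: the name prefix_pairs, the min_len parameter, and the choice to skip prefixes that end in a space are all assumptions.

```python
def prefix_pairs(query, min_len=2):
    """Yield (prefix, target) pairs for one query.

    The target is the query followed by a newline, as in the example.
    Prefixes shorter than min_len, and prefixes ending in whitespace,
    are skipped (both rules are assumptions inferred from the example).
    """
    target = query + "\n"
    for end in range(min_len, len(query) + 1):
        prefix = query[:end]
        if prefix[-1].isspace():
            continue
        yield prefix, target


if __name__ == "__main__":
    for prefix, target in prefix_pairs("my space"):
        print(repr(prefix), repr(target))
```

If the real rule differs, for instance if prefixes ending in a space are kept, only the skip condition needs to change.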

Please cite the following paper if you use the data set:

@inproceedings{park2017neural,
  title={A neural language model for query auto-completion},
  author={Park, Dae Hoon and Chiba, Rikio},
  booktitle={Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages={1189--1192},
  year={2017},
  organization={ACM}
}
