Passage-Level Label Transfer for Contextual Document Ranking

Koustav Rudra, Indian Institute of Technology Kharagpur

Zeon Trevor Fernando, ImmobilienScout GmbH, Germany

Avishek Anand, Delft University of Technology, Netherlands

Pre-trained contextual language models such as BERT, GPT, and XLNet work quite well for document retrieval tasks but are limited by a maximum input length in tokens. Prior approaches therefore truncate documents or split them into short passages. The challenge then lies in transferring relevance labels from query-document pairs to query-passage pairs. We find that directly transferring a document's relevance label to all of its passages introduces label noise that strongly degrades retrieval effectiveness on large training datasets. We propose a careful passage-level labelling scheme based on weak supervision that improves performance, and we conduct a detailed study of how design decisions about splitting and label transfer affect retrieval effectiveness and efficiency.
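
As an illustration of the splitting and naive label-transfer step, the sketch below builds overlapping token windows and copies the document label to every window (the DOCLABELLED-style baseline). The window size, stride, and function names are illustrative assumptions, not the exact configuration used in the experiments.

from typing import Iterator

def split_into_passages(tokens: list[str], size: int = 150,
                        stride: int = 75) -> Iterator[list[str]]:
    # Overlapping fixed-length token windows over the document.
    for start in range(0, max(len(tokens) - size, 0) + 1, stride):
        yield tokens[start:start + size]

def doclabelled_pairs(query: str, doc_tokens: list[str], doc_label: int):
    # Naive transfer: every passage inherits the document's label.
    # In a long relevant document many windows never touch the query
    # topic, so these pairs become label noise during fine-tuning.
    return [(query, " ".join(p), doc_label)
            for p in split_into_passages(doc_tokens)]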

Finding 1: Careful weak-supervision-based learning of query-passage labels works better than direct document-to-passage label transfer.
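
One way to realize the weak-supervision idea is to let an off-the-shelf query-passage relevance scorer decide which passages of a relevant document keep the positive label. This is a minimal sketch: the checkpoint, the threshold tau, and the helper name are illustrative assumptions, using a generic MS MARCO cross-encoder rather than the exact QA model behind QA-DOCRANK.

from sentence_transformers import CrossEncoder

# Off-the-shelf relevance scorer used as a weak labeller (assumed
# checkpoint; any query-passage scorer could stand in here).
scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def weak_passage_labels(query: str, passages: list[str],
                        doc_label: int, tau: float = 0.5):
    # Keep the document's positive label only for passages the weak
    # scorer deems relevant; the rest become negatives. This filters
    # out the off-topic windows that naive transfer would mislabel.
    # tau is an assumed cut-off and should be tuned to the scorer.
    scores = scorer.predict([(query, p) for p in passages])
    return [(query, p, doc_label if s >= tau else 0)
            for p, s in zip(passages, scores)]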

Finding 2: Weak-supervision-based query-passage label learning (QA-DOCRANK) is more robust than the DOCLABELLED approach across different kinds of passage generation.
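
"Different kinds of passage generation" contrasts strategies such as the fixed-length token windows sketched above with sentence-boundary chunks. A minimal sketch of the latter follows; the naive sentence splitter and group size are assumptions.

import re

def sentence_passages(text: str, max_sents: int = 5):
    # Group consecutive sentences into passages instead of slicing
    # fixed token windows; the regex splitter is deliberately simple.
    sents = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sents[i:i + max_sents])
            for i in range(0, len(sents), max_sents)]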

[Figures: per-dataset results on the CORE and ROBUST datasets]

Finding 3: Weak-supervision-based passage-level learning (QA-DOCRANK) is computationally more efficient than sentence-level zero-shot ranking (BERT-3S).
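
A back-of-envelope intuition for this finding: a sentence-level zero-shot ranker runs the model once per sentence, while passage-level ranking runs it once per (much longer) passage. All counts below are assumed for illustration, not measurements from the study.

# Assumed corpus statistics, purely illustrative.
docs_per_query = 100        # re-ranking depth
passages_per_doc = 4        # overlapping windows per document
sentences_per_doc = 40      # sentences a sentence-level ranker scores

passage_level_calls = docs_per_query * passages_per_doc    # 400
sentence_level_calls = docs_per_query * sentences_per_doc  # 4000

print(f"passage-level:  {passage_level_calls} forward passes per query")
print(f"sentence-level: {sentence_level_calls} forward passes per query")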

Finding 4: Careful selection of training documents and passages may help in directly applying document-level fine-tuned models to new collections.