As of the second year (2014) of the track, a pre-filtered version of the corpus was provided to reduce the filtering load on participants. For the 2015 track, a similar filtered corpus was provided, along with a pre-filtered version of the corpus for the original 2013 track. This page summarizes how these filtered corpora were created.
TREC-TS-2014F
The TREC-TS-2014F dataset is a filtered version of the KBA 2014 corpus. It is stored in the same format, follows the same file structure (ordered into per-hour folders) and is encrypted with the same GPG key (see above). To create this corpus, two levels of filtering were performed. First, any documents that were published outside the time periods of the 15 events from the TREC-TS 2014 track topics were removed, i.e. only documents with timestamps between the start and end tags of one or more TREC-TS 2014 topics were kept. Second, we filtered the remaining documents, keeping only those that were likely to contain one or more sentences relevant to an event. This filtering was performed as follows:
TREC-TS-2015F
The TREC-TS-2015F dataset is a filtered version of the KBA 2014 corpus for the TREC-TS 2015 topics. The filtering methodology is identical to that of the TREC-TS-2014F dataset, except that the rank cutoff used was 100 rather than 1000. This smaller rank cutoff was chosen because it was observed that most of the relevant content was available in the top-ranked documents. As a result of this change, the 2015 dataset is smaller than the 2014 dataset.
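As a rough illustration of what the rank cutoff does, the sketch below truncates a per-topic ranked list of documents at 1000 or at 100. It assumes each topic already has a list of document identifiers ordered from most to least likely relevant. The ranking itself is not shown, and the function name `apply_rank_cutoff`, the dictionary layout and the identifiers are hypothetical, not part of the track's tooling.

```python
from typing import Dict, List


def apply_rank_cutoff(ranked: Dict[str, List[str]], cutoff: int) -> Dict[str, List[str]]:
    """Keep only the top `cutoff` documents of each topic's ranked list.

    `ranked` maps a topic id to document ids ordered best-first
    (an assumed layout, used here only for illustration).
    """
    if cutoff < 0:
        raise ValueError("cutoff must be non-negative")
    return {topic: docs[:cutoff] for topic, docs in ranked.items()}


# Hypothetical ranked list of 1500 documents for a single topic.
ranked = {"topic-A": [f"doc-{i}" for i in range(1500)]}

kept_at_1000 = apply_rank_cutoff(ranked, 1000)  # cutoff stated for TREC-TS-2014F
kept_at_100 = apply_rank_cutoff(ranked, 100)    # cutoff stated for TREC-TS-2015F

print(len(kept_at_1000["topic-A"]), len(kept_at_100["topic-A"]))
```

With the same underlying ranking, the smaller cutoff keeps a strict prefix of what the larger one keeps, which is why the 2015 corpus comes out smaller.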
TREC-TS-2013F
The TREC-TS-2013F dataset was released in 2015 for participants who wanted to train their systems using the 2013 topics. Importantly, the filtering methodology used to create this dataset is not the same as that used for the other filtered versions. In particular, TREC-TS-2013F is a filtered version of the KBA 2013 corpus for the TREC-TS 2013 topics that was originally created by a participant in the 2014 TREC track. To create this corpus, two levels of filtering were performed. First, any documents that were published outside the time periods of the 9 events from the TREC-TS 2013 track topics were removed, and only documents from the 'news' subset were considered. Second, the remaining documents were passed through a machine-learned document classifier trained on hand-annotated documents collected from the Reuters news agency for other events. This classifier uses basic distance metrics between the document and the initial event representation (query).
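The first filtering level described above (keep documents whose timestamp falls within at least one topic's time period, optionally restricted to one source subset) can be sketched as follows. This is a minimal illustration, not the code used to build the corpora: the `Doc` record, its field names, the use of epoch seconds, and the inclusive treatment of the window boundaries are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

Window = Tuple[int, int]  # (start, end) of a topic, in epoch seconds (assumed)


@dataclass
class Doc:
    """Hypothetical stand-in for a corpus document."""
    doc_id: str
    timestamp: int  # publication time in epoch seconds (assumed)
    source: str     # e.g. "news" or "social" (assumed labels)


def in_any_window(timestamp: int, windows: Iterable[Window]) -> bool:
    """True if the timestamp lies inside at least one topic window.

    Boundaries are treated as inclusive, which is an assumption.
    """
    return any(start <= timestamp <= end for start, end in windows)


def first_level_filter(
    docs: Iterable[Doc],
    windows: List[Window],
    source: Optional[str] = None,
) -> List[Doc]:
    """Keep documents inside some topic window; if `source` is given,
    also require the document to come from that source subset."""
    return [
        d for d in docs
        if in_any_window(d.timestamp, windows)
        and (source is None or d.source == source)
    ]


# Toy data: two topic windows and four documents.
windows = [(100, 200), (500, 600)]
docs = [
    Doc("a", 150, "news"),
    Doc("b", 300, "news"),    # outside both windows
    Doc("c", 550, "social"),
    Doc("d", 600, "news"),    # on a window boundary
]

time_only = first_level_filter(docs, windows)
news_only = first_level_filter(docs, windows, source="news")

print([d.doc_id for d in time_only])
print([d.doc_id for d in news_only])
```

Passing `source="news"` mirrors the extra restriction described for TREC-TS-2013F; leaving it as `None` corresponds to a time-window filter alone.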