Welcome to the website of the Federated Web Search track, part of NIST's Text REtrieval Conference (TREC) 2014. The track investigates techniques for the selection and merging of search results from a large number of real online web search services. Make sure you join the FedWeb mailing list to participate in discussions and receive the latest information about the track.
The FedWeb track will continue in 2014. Next to the resource selection and results merging tasks, FedWeb 2014 will feature a vertical selection task, in which systems have to rank the best vertical type for a query.
Below is a tentative timeline for this year's track:
TREC2014-FedWeb-0_Overview.pdf (Slides of the overview session)
We released the raw FedWeb14 relevance judgments:
singleJudgmentsFW14_50official.txt contains the complete page judgments (34,003 in total) for the 50 official topics, and singleJudgmentsFW14_10extra.txt contains the judgments for the 10 additional topics (6,451 judgments) used in the online evaluation system during the TREC preparation phase.
The parameters included in these files are:
The 2014 FedWeb track promotes research on federated search with realistic web data. Federated search is the approach of querying multiple search engines simultaneously, and combining their results into one coherent search engine result page. The goal of the Federated Web Search (FedWeb) track is to evaluate approaches to federated search at very large scale in a realistic setting, by combining the search results of existing web search engines (also see the 2013 guidelines). We introduce one new task and a number of important modifications to the 2013 tasks, in order to make these tasks more realistic.
This year’s new challenge is the Vertical Selection task, where the participants have to predict the quality of the different verticals for a particular query (for instance sports, news, or images). A set of relevant verticals should be selected for each test topic. Note that a vertical could contain multiple resources (search engines).
The second task, Resource Selection, is about predicting the quality of the individual resources on the test topics, where the participants are required to rank all resources.
The third and final task, Results Merging, aims at creating ranked lists of result snippets, by merging results from a limited number of resources. Not only relevance of individual results, but also diversity in terms of verticals should be taken into account.
The figure below provides an overview of the different tasks, with the required output and the evaluation metrics that will be used. Note the logical connection between the different tasks. Participants are free to participate in either or both of the Vertical Selection and Resource Selection tasks. Groups participating in the Results Merging task should submit at least one run based on a resource selection baseline that will be provided.
The FedWeb 2014 dataset consists of sampled search results of 149 web search engines crawled between April and May 2014. These 149 engines are a subset of the FedWeb 2013 engines (some could not be crawled). However, a larger set of samples (4000 queries) is available for building resource descriptions.
The dataset is available after signing a (new) license agreement available here. Once your application is approved, you will receive details on how to obtain the collection as soon as possible.
An overview of the engines can be found here.
We will provide 75 test topics, of which 50 will be used for system evaluations. The query terms for these test topics are available here. We do not announce prior to submission which 50 topics will be chosen for evaluation, so we expect submitted runs to cover all 75 topics.
For resource selection and vertical selection, you can only use (and download) the samples (query snippets and corresponding documents). Since August 18th, the query snippets of the topics have been available from the download site.
In web search, a vertical is associated with content dedicated to either a topic (e.g. “finance”), a media type (e.g. “images”) or a genre (e.g. “news”). For example, an “image” vertical contains resources such as Flickr and Picasa. For a given user information need, only a subset of verticals will provide the most relevant results. For example, relevant verticals for a query such as “flowers” might include “image” and “encyclopedia” verticals. Therefore, the system should select a subset of verticals to retrieve from. Vertical selection improves effectiveness while reducing the load of querying multiple verticals. With this task, we aim to encourage vertical (domain) modeling from the participants.
Input: a query
Below is an overview of the verticals in the Fedweb 2014 dataset:
The specific mapping from resource to vertical can be found here.
The list of verticals can be found here.
To make the vertical selection decision, participants are allowed to use the provided samples of each vertical and other external resources (e.g. Wikipedia, WordNet or query logs). However, participants are not allowed to sample the online verticals themselves.
The submission file contains a set of relevant verticals for each query.
7001 FW14-v002 univXVS1
7001 FW14-v007 univXVS1
7001 FW14-v001 univXVS1
For example, for topic 7001 there are three selected verticals (FW14-v002, FW14-v007, FW14-v001).
Participants can submit up to 7 vertical selection runs.
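As an illustration of the submission format, the sketch below writes one line per selected vertical. It assumes the three whitespace-separated columns visible in the example line above (topic ID, vertical ID, runtag); the function name `format_vertical_run` is hypothetical and not part of any official tooling.

```python
from typing import Dict, Iterable, List


def format_vertical_run(selected: Dict[str, Iterable[str]], runtag: str) -> List[str]:
    """Return one 'topicID verticalID runtag' line per selected vertical.

    Assumption: columns are separated by single spaces, as in the example line.
    """
    lines: List[str] = []
    for topic_id in sorted(selected):
        seen = set()
        for vertical_id in selected[topic_id]:
            if vertical_id in seen:  # the submission is a set, so skip repeats
                continue
            seen.add(vertical_id)
            lines.append(f"{topic_id} {vertical_id} {runtag}")
    return lines


# The three verticals mentioned for topic 7001 in the text above.
run_lines = format_vertical_run(
    {"7001": ["FW14-v002", "FW14-v007", "FW14-v001"]}, "univXVS1"
)
print("\n".join(run_lines))
```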
This selected set of verticals will be evaluated by standard classification metrics: F-measure (main metric), precision and recall. The set of relevant verticals will be based on the relevance of the individual search results provided by the resources in that vertical. More details are provided here.
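To make the set-based metrics concrete, here is a minimal sketch for a single topic. It assumes the balanced F-measure (F1), which the text does not state explicitly, and the set of relevant verticals in the example is invented for illustration. How scores are aggregated over topics is not covered here.

```python
from typing import Set, Tuple


def set_prf(selected: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of a selected set against a relevant set."""
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Assumption: balanced F-measure (harmonic mean of precision and recall).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical ground truth: only FW14-v002 and FW14-v005 are relevant.
precision, recall, f1 = set_prf(
    {"FW14-v002", "FW14-v007", "FW14-v001"},
    {"FW14-v002", "FW14-v005"},
)
print(precision, recall, f1)
```

In this example one of the three selected verticals is relevant and one of the two relevant verticals was found, so selecting fewer, better-targeted verticals would raise precision without lowering recall.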
For practical reasons, it is not possible to query all available resources (search engines) when a query is issued to a federated search system. Therefore, the system first needs to select the appropriate search engines for the given query. This task is called resource selection. More specifically, the task expects the following input and output:
Input: a query
For example, suitable resources for a query such as pittsburgh steelers news might be ESPN, Fox Sports, etc. To simulate a realistic setting, the participants are not allowed to sample or retrieve results from the resources themselves. Participants can only use the provided samples or external resources (e.g. Wikipedia or WordNet).
The submission is a standard TREC format and has 6 columns: QueryID, Q0 (unused), resourceID, rank, score and runtag (more details here). See the example below (in this example, the resource FW14-e002 is ranked highest for query 7001):
7001 Q0 FW14-e002 1 29.34 univXRS1
Participants can submit up to 7 resource selection runs.
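The following sketch shows one way to turn per-resource scores into lines in the six-column format described above. The function name, the two-decimal score formatting, the tie-breaking by resource ID and the scores for the resources other than FW14-e002 are all assumptions made for illustration.

```python
from typing import Dict, List


def format_resource_run(scores: Dict[str, float], topic_id: str, runtag: str) -> List[str]:
    """Rank resources by descending score and emit TREC-style run lines."""
    # Ties are broken by resource ID so the output is deterministic (assumption).
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        f"{topic_id} Q0 {resource_id} {rank} {score:.2f} {runtag}"
        for rank, (resource_id, score) in enumerate(ranked, start=1)
    ]


# Hypothetical scores for three resources on topic 7001.
resource_lines = format_resource_run(
    {"FW14-e002": 29.34, "FW14-e001": 12.5, "FW14-e003": 20.0},
    "7001",
    "univXRS1",
)
print("\n".join(resource_lines))
```

Since the task requires ranking all resources, `scores` would in practice hold an entry for every resource in the collection.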
The ranking of resources will be evaluated by normalized discounted cumulative gain (nDCG), the variant introduced by Christopher Burges et al. (Learning to rank using gradient descent. ICML 2005). See here for more information about how the relevance of resources is determined.
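For readers unfamiliar with this nDCG variant, the sketch below uses the gain 2^grade - 1 and the discount log2(rank + 1) from Burges et al. It is not the official evaluation script: the optional cutoff `k`, the treatment of unjudged items as grade 0 and the grades in the example are assumptions, and the way resource relevance is mapped to grades is not specified here.

```python
import math
from typing import Dict, List, Optional


def dcg(grades: List[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain with gain 2^grade - 1 and log2(rank + 1) discount."""
    top = grades if k is None else grades[:k]
    return sum(
        (2.0 ** grade - 1.0) / math.log2(rank + 1)
        for rank, grade in enumerate(top, start=1)
    )


def ndcg(ranking: List[str], grade_of: Dict[str, float], k: Optional[int] = None) -> float:
    """DCG of the submitted ranking divided by the DCG of the ideal ranking."""
    gains = [grade_of.get(item, 0.0) for item in ranking]  # unjudged items count as 0
    ideal_dcg = dcg(sorted(grade_of.values(), reverse=True), k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical grades for three resources on one topic.
grades = {"FW14-e001": 2.0, "FW14-e002": 1.0, "FW14-e003": 0.0}
perfect = ndcg(["FW14-e001", "FW14-e002", "FW14-e003"], grades)
swapped = ndcg(["FW14-e002", "FW14-e001", "FW14-e003"], grades)
print(perfect, swapped)
```

Swapping the two best resources lowers the score because the exponential gain rewards placing the highest grade first.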
The goal of results merging is to merge the search result snippets from previously selected resources in a single ranked list. This year, each results merging run must be explicitly based on a resource selection run: The merging run can only use, and therefore should only contain, search result snippets from the top 20 resources selected during resource selection for each query. In contrast to last year, pages will not be provided for the results merging task. At least one submission must be based on the baseline resource selection run (to be provided by TREC). Participants can submit up to 7 results merging runs.
The submission is a standard TREC format and has 6 columns (see here): QueryID, Q0 (unused), snippetID, rank, score and runtag, as follows:
7001 Q0 FW14-e001-7001-01 1 12.34 univXRM1
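The restriction to the top 20 selected resources can be checked before submission with a sketch like the one below. It assumes that a snippet ID has the form resourceID-topicID-rank, which is inferred from the single example above and not confirmed elsewhere in these guidelines; the function names are hypothetical.

```python
from typing import List, Set


def resource_of(snippet_id: str) -> str:
    """Extract the resource ID from a snippet ID.

    Assumption: IDs look like '<resourceID>-<topicID>-<rank>',
    e.g. 'FW14-e001-7001-01' belongs to resource 'FW14-e001'.
    """
    return snippet_id.rsplit("-", 2)[0]


def keep_allowed(snippets: List[str], selected_resources: List[str], top_n: int = 20) -> List[str]:
    """Keep only snippets from the top_n resources of a resource selection run."""
    allowed: Set[str] = set(selected_resources[:top_n])
    return [s for s in snippets if resource_of(s) in allowed]


# Hypothetical merged list and resource ranking for topic 7001;
# top_n is set to 2 only to keep the example small.
merged = ["FW14-e001-7001-01", "FW14-e009-7001-02", "FW14-e002-7001-01"]
ranking = ["FW14-e002", "FW14-e001", "FW14-e009"]
filtered = keep_allowed(merged, ranking, top_n=2)
print(filtered)
```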
Each run will be evaluated using two metrics: nDCG to measure topical relevance, and nDCG-IA to measure diversity between verticals in addition to topical relevance. The main metric will be nDCG. nDCG-IA is intended to stimulate participants to provide merged lists that include relevant results from several verticals. Details of nDCG-IA can be found in Agrawal, Rakesh, et al. "Diversifying search results." WSDM 2009. We will use the variant of nDCG introduced by Christopher Burges et al. (Learning to rank using gradient descent. ICML 2005). We will provide a qrels file for the participants that assigns each snippet a relevance grade for a given query (see here for more details).
Results merging may only use the top 20 selected resources for a query
Duplicate documents will be irrelevant
In this track, relevance is defined on multiple levels (documents, resources and verticals)
Documents can be judged as Nav, Key, Hrel, Rel or Non relevant (taken from the TREC Web track).
The relevance of each resource is determined by calculating the graded precision (see Using graded relevance assessments in IR evaluation, J. Kekäläinen and K. Järvelin, JASIST 53(13), 2002) on its top 10 results. This takes the graded relevance levels of the documents in the top 10 into account, but not the ranking.
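A minimal sketch of graded precision at cutoff 10 follows. The gain assigned to each judgment level is not given in this section, so the values in `EXAMPLE_GAINS` are placeholders chosen for illustration only; dividing by the cutoff `k` rather than by the number of returned results is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical gain per judgment level; the actual weights are not stated here.
EXAMPLE_GAINS: Dict[str, float] = {
    "Non": 0.0, "Rel": 0.25, "Hrel": 0.5, "Key": 1.0, "Nav": 1.0,
}


def graded_precision(labels: List[str], gains: Dict[str, float], k: int = 10) -> float:
    """Sum of the gains of the top-k results divided by k; order within the top k is ignored."""
    return sum(gains[label] for label in labels[:k]) / k


# Hypothetical judgments for the top 10 results of one resource.
top10 = ["Key", "Rel", "Non", "Hrel", "Rel"] + ["Non"] * 5
gp = graded_precision(top10, EXAMPLE_GAINS)
gp_reversed = graded_precision(list(reversed(top10)), EXAMPLE_GAINS)
print(gp, gp_reversed)
```

Reversing the top 10 leaves the value unchanged, which reflects the statement above that the ranking within the top 10 is not taken into account.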
The relevance of a vertical for a given query is determined by the best performing resource (search engine) within this vertical. More specifically, the relevance is represented by the maximum graded precision of its resources. For the final evaluation, the binary relevance of a vertical is determined by a threshold: a vertical for which the maximum graded precision is 0.5 is considered relevant. This threshold was determined based on data analyses, such that for most queries there is a small set of relevant verticals. If no vertical exceeds this threshold for a given query, we use the top-1 vertical with the maximal relevance as the relevant vertical.
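The rule above can be sketched as follows. The comparison is written as "at least 0.5", which is an assumption because the text does not say whether a value of exactly 0.5 counts; the resource-to-vertical mapping and the graded precision values in the example are invented.

```python
from typing import Dict, Set


def relevant_verticals(
    resource_gp: Dict[str, float],
    vertical_of: Dict[str, str],
    threshold: float = 0.5,
) -> Set[str]:
    """Binary vertical relevance from per-resource graded precision for one query."""
    best: Dict[str, float] = {}
    for resource_id, gp in resource_gp.items():
        vertical = vertical_of[resource_id]
        best[vertical] = max(best.get(vertical, 0.0), gp)
    # Assumption: a vertical whose best resource reaches the threshold is relevant.
    relevant = {v for v, score in best.items() if score >= threshold}
    if not relevant and best:
        # Fallback: the single vertical with the highest maximum graded precision.
        relevant = {max(sorted(best), key=lambda v: best[v])}
    return relevant


# Hypothetical mapping of three resources to two verticals.
mapping = {"FW14-e001": "FW14-v001", "FW14-e002": "FW14-v001", "FW14-e003": "FW14-v002"}
above = relevant_verticals({"FW14-e001": 0.6, "FW14-e002": 0.2, "FW14-e003": 0.1}, mapping)
fallback = relevant_verticals({"FW14-e001": 0.3, "FW14-e002": 0.2, "FW14-e003": 0.1}, mapping)
print(above, fallback)
```

In the second call no vertical reaches the threshold, so the fallback returns the single best vertical and every query ends up with at least one relevant vertical.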
The submission formats of the resource selection and results merging tasks follow the traditional trec_eval format:
Information about the 2013 track can be found here.