Welcome to the website of the Federated Web Search track, part of NIST's Text REtrieval Conference (TREC) 2014. The track investigates techniques for the selection and merging of search results from a large number of real online web search services. Join the FedWeb mailing list to participate in discussions and receive the latest information about the track.


2014 Track

The FedWeb track will continue in 2014. In addition to the resource selection and results merging tasks, FedWeb 2014 will feature a vertical selection task in which the systems have to rank the best vertical type for a query.
Below is a tentative timeline for this year's track:

 now                   Training data available (FedWeb 2013, CIKM 2012)
 May 2014              New sample data released (FedWeb 2014)
 June 2014             Test queries released (here)
 August 18, 2014       Vertical selection and resource selection runs due;
                       official RS baseline available (here);
                       snippets released for each topic (on the collection download sites)
 September 15, 2014    Results merging runs due
 18-21 November, 2014  TREC 2014 workshop
We released the raw FedWeb14 relevance judgments:
singleJudgmentsFW14_50official.txt contains the complete page judgments (34,003 in total) for the 50 official topics, and singleJudgmentsFW14_10extra.txt contains the judgments for the 10 additional topics (6,451 judgments) used in the online evaluation system during the TREC preparation phase.
The parameters included in these files are:
  • snippetID: the page ID
  • userID: the anonymized assessor ID
  • pagelabel: the assigned relevance level
  • pageproblem: whether the assessor had a problem with visualizing the original page
    (0 = no problem, confident in the assigned label; 1 = potential problem, less confident)
  • watchedvideo: whether the assessor effectively watched (part of) the video contained in the original page (0 = did not watch, and 1 = watched) 
Only snippetID and pagelabel were used for the FedWeb14 evaluation, but the others might be useful for further analysis.


The 2014 FedWeb track promotes research on federated search with realistic web data. Federated search is the approach of querying multiple search engines simultaneously, and combining their results into one coherent search engine result page. The goal of the Federated Web Search (FedWeb) track is to evaluate approaches to federated search at very large scale in a realistic setting, by combining the search results of existing web search engines (also see the 2013 guidelines). We introduce one new task and a number of important modifications to the 2013 tasks, in order to make these tasks more realistic.

This year’s new challenge is the Vertical Selection task, where the participants have to predict the quality of the different verticals for a particular query (for instance sports, news, or images). A set of relevant verticals should be selected for each test topic. Note that a vertical could contain multiple resources (search engines).

The second task, Resource Selection, is about predicting the quality of the individual resources on the test topics, where the participants are required to rank all resources.

The third and final task, Results Merging, aims at creating ranked lists of result snippets, by merging results from a limited number of resources. Not only relevance of individual results, but also diversity in terms of verticals should be taken into account.

The figure below provides an overview of the different tasks, with the required output and the evaluation metrics that will be used. Note the logical connection between the different tasks. Participants are free to take part in either or both of the Vertical Selection and Resource Selection tasks. Groups participating in the Results Merging task should submit at least one run based on a resource selection baseline that will be provided.


FedWeb 2014 Dataset

Please note that a new, more comprehensive collection has been released: FedWeb Greatest Hits. More information can be found here.

The FedWeb 2014 dataset consists of sampled search results of 149 web search engines crawled between April and May 2014. These 149 engines are a subset of the FedWeb 2013 engines (some could not be crawled). However, a larger set of samples (4000 queries) is available for building resource descriptions. 
The dataset is available after signing a (new) license agreement, available here. Once your application is approved, you will receive details on how to obtain the collection as soon as possible.
An overview of the engines can be found here.
We will provide 75 test topics, of which 50 will be used for system evaluations. The query terms for these test topics are available here. We do not announce prior to submission which 50 topics will be chosen for evaluation, so submitted runs are expected to cover all 75 topics.

For resource selection and vertical selection, you can only use (and download) the samples (query snippets and corresponding documents). Since August 18th, the query snippets of the topics are available from the download site.

Task 1: Vertical Selection


In web search, a vertical is associated with content dedicated to either a topic (e.g. “finance”), a media type (e.g. “images”) or a genre (e.g. “news”). For example, an “image” vertical contains resources such as Flickr and Picasa. For a given user information need, only a subset of verticals will provide the most relevant results. For example, relevant verticals for a query such as “flowers” might include “image” and “encyclopedia” verticals. Therefore, the system should select a subset of verticals to retrieve from. Vertical selection improves effectiveness while reducing the load caused by querying multiple verticals. With this task, we aim to encourage vertical (domain) modeling from the participants.

Input: a query
Output: A set of relevant verticals.

Below is an overview of the verticals in the FedWeb 2014 dataset:

The specific mapping from resource to vertical can be found here.

The list of verticals can be found here.


To make the vertical selection decision, participants are allowed to use the provided samples of each vertical and other external resources (e.g. Wikipedia, Wordnet or query-logs). However, participants are not allowed to sample the online verticals themselves. 

Submission format

The submission file contains a set of relevant verticals for each query. 
[Column 1: the topic number]
[Column 2: the official identifier of the selected vertical]
[Column 3: the "run tag". It should be a unique identifier for your group AND for the method, containing at most 12 characters and no punctuation]
An example is below:

7001 FW14-v002 univXVS1
7001 FW14-v007 univXVS1
7001 FW14-v001 univXVS1
7002 FW14-v004 univXVS1

For example, for topic 7001 there are three selected verticals (FW14-v002, FW14-v007, FW14-v001).

Participants can submit up to 7 vertical selection runs.
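
As an illustration of the three-column format above, the following minimal Python sketch writes submission lines from a per-topic selection. The function name format_vertical_run and the input dictionary are hypothetical, and the sketch assumes that every line carries the run tag in column 3.

```python
def format_vertical_run(selections, run_tag):
    """Return one 'topic vertical run_tag' line per selected vertical."""
    # Run tag rule from the guidelines: at most 12 characters, no punctuation.
    if not (run_tag.isascii() and run_tag.isalnum() and len(run_tag) <= 12):
        raise ValueError("run tag must be at most 12 letters or digits")
    lines = []
    for topic in sorted(selections):
        for vertical in selections[topic]:
            lines.append(f"{topic} {vertical} {run_tag}")
    return lines


# Hypothetical selection for two topics, mirroring the example above.
selections = {
    7001: ["FW14-v002", "FW14-v007", "FW14-v001"],
    7002: ["FW14-v004"],
}
run_lines = format_vertical_run(selections, "univXVS1")
print("\n".join(run_lines))
```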


This selected set of verticals will be evaluated by standard classification metrics: F-measure (main metric), precision and recall. The set of relevant verticals will be based on the relevance of the individual search results provided by the resources in that vertical. More details are provided here.
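
To make the set-based metrics concrete, here is a small Python sketch for a single topic. It assumes the balanced F-measure (F1), since the weighting is not specified here; set_prf is a hypothetical helper, not the official evaluation script.

```python
def set_prf(selected, relevant):
    """Precision, recall and balanced F-measure of a selected set."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical topic: three verticals selected, two verticals actually relevant.
p, r, f = set_prf(
    ["FW14-v002", "FW14-v007", "FW14-v001"],
    ["FW14-v002", "FW14-v004"],
)
print(p, r, f)
```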

Task 2: Resource Selection


For practical reasons, it is not possible to query all available resources (search engines) when a query is issued to a federated search system. Therefore, the system first needs to select the appropriate search engines for the given query. This task is called resource selection. More specifically, the task expects the following input and output:

Input: a query
Output: A ranking of resources (the most appropriate resources are ranked highest)

For example, suitable resources for a query such as pittsburgh steelers news might be ESPN, Fox Sports, etc. To simulate a realistic setting, the participants are not allowed to sample or retrieve results from the resources themselves. Participants can only use the provided samples or external resources (e.g. Wikipedia or Wordnet).

Submission format

The submission uses the standard TREC format and has 6 columns: QueryID, Q0 (unused), resourceID, rank, score and runtag (more details here). See the example below (in this example, the resource FW14-e002 is ranked highest for query 7001):

7001 Q0 FW14-e002 1 29.34 univXRS1
7001 Q0 FW14-e007 2 21.67 univXRS1
7001 Q0 FW14-e001 3 19.97 univXRS1
7001 Q0 FW14-e004 4 19.21 univXRS1

Participants can submit up to 7 resource selection runs.
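
A minimal Python sketch of producing such a run from per-resource scores is shown below. The function name format_trec_run and the score dictionary are hypothetical; the two-decimal score formatting is only a choice made for this example.

```python
def format_trec_run(topic, scores, run_tag):
    """Rank items by descending score and emit 6-column TREC run lines."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{topic} Q0 {item_id} {rank} {score:.2f} {run_tag}"
        for rank, (item_id, score) in enumerate(ranked, start=1)
    ]


# Hypothetical scores for topic 7001, matching the example above.
scores = {
    "FW14-e001": 19.97,
    "FW14-e002": 29.34,
    "FW14-e004": 19.21,
    "FW14-e007": 21.67,
}
lines = format_trec_run(7001, scores, "univXRS1")
print("\n".join(lines))
```

Deriving the rank from the sorted scores keeps columns 4 and 5 consistent with each other.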


The ranking of resources will be evaluated by normalized discounted cumulative gain (nDCG), using the variant introduced by Christopher Burges et al. (Learning to rank using gradient descent. ICML 2005). See here for more information about how the relevance of resources is determined.
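
For readers unfamiliar with this nDCG variant, the sketch below uses the exponential gain 2^rel - 1 and a log2(1 + rank) discount. The cutoff k and the input gains are assumptions for illustration; this is not the official evaluation code.

```python
import math


def dcg(gains, k):
    """DCG with gain 2**g - 1 and discount log2(1 + rank), ranks starting at 1."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg(ranked_gains, all_gains, k=20):
    """Normalize the DCG of a ranking by the DCG of the ideal ordering."""
    ideal = dcg(sorted(all_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical topic with two items, the relevant one ranked second.
score = ndcg([0, 1], [1, 0], k=2)
print(score)
```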

Task 3: Results Merging

The goal of results merging is to merge the search result snippets from previously selected resources in a single ranked list. This year, each results merging run must be explicitly based on a resource selection run: The merging run can only use, and therefore should only contain, search result snippets from the top 20 resources selected during resource selection for each query. In contrast to last year, pages will not be provided for the results merging task. At least one submission must be based on the baseline resource selection run (to be provided by TREC). Participants can submit up to 7 results merging runs.

Submission format

The submission is a standard TREC format and has 6 columns (see here): QueryID, Q0 (unused), snippetID, rank, score and runtag, as follows:

7001 Q0 FW14-e001-7001-01 1 12.34 univXRM1
7001 Q0 FW14-e001-7001-02 2 11.67 univXRM1
7001 Q0 FW14-e001-7001-03 3 10.97 univXRM1


Each run will be evaluated using two metrics: nDCG to measure topical relevance, and nDCG-IA to measure diversity between verticals in addition to topical relevance. The main metric will be nDCG. nDCG-IA is intended to stimulate participants to provide merged lists that include relevant results from several verticals. Details of nDCG-IA can be found in Agrawal, Rakesh, et al. "Diversifying search results." WSDM 2009. We will use the variant of nDCG introduced by Christopher Burges et al. (Learning to rank using gradient descent. ICML 2005). We will provide a qrels file for the participants that assigns each snippet a relevance grade for a given query (see here for more details).

Results merging may only use the top 20 selected resources for a query
Each results merging run must be explicitly based on a resource selection run. The snippets from the other search engines should be treated as unseen (for instance, you cannot use the fact that a document is also retrieved by an unseen resource). TREC will check runs for these constraints.
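
Participants may want to check this constraint before submitting. The Python sketch below assumes that the resource ID is the first two hyphen-separated parts of a snippet ID (as the examples on this page suggest); the function names and the input dictionaries are hypothetical.

```python
def resource_of(snippet_id):
    """'FW14-e001-7001-01' -> 'FW14-e001' (assumed snippet ID layout)."""
    return "-".join(snippet_id.split("-")[:2])


def violations(merge_run, selection_run, top_n=20):
    """List (topic, snippet) pairs whose resource is outside the top_n selection.

    merge_run: {topic: [snippet IDs]}
    selection_run: {topic: [resource IDs in rank order]}
    """
    bad = []
    for topic, snippets in merge_run.items():
        allowed = set(selection_run.get(topic, [])[:top_n])
        bad.extend((topic, s) for s in snippets if resource_of(s) not in allowed)
    return bad


# Hypothetical runs; top_n is lowered to 2 to keep the example small.
selection_run = {7001: ["FW14-e002", "FW14-e007", "FW14-e001"]}
merge_run = {7001: ["FW14-e002-7001-01", "FW14-e001-7001-03"]}
found = violations(merge_run, selection_run, top_n=2)
print(found)
```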

Duplicate documents will be irrelevant
Duplicate documents (based on URL and content) further down the list are considered non-relevant when calculating nDCG on the merged results.
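
The effect of this rule on a merged list can be sketched as follows. How duplicates are actually detected is not specified beyond "URL and content", so the document key used here is a hypothetical stand-in (for example, a normalized URL).

```python
def zero_duplicate_gains(ranked):
    """Keep the gain of the first occurrence of each document key; zero the rest.

    ranked: list of (document_key, gain) pairs in rank order.
    """
    seen = set()
    gains = []
    for key, gain in ranked:
        gains.append(0 if key in seen else gain)
        seen.add(key)
    return gains


# Hypothetical merged list in which the first document appears again at rank 3.
merged = [("doc-a", 2), ("doc-b", 1), ("doc-a", 2)]
adjusted = zero_duplicate_gains(merged)
print(adjusted)
```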


In this track, relevance is defined on multiple levels (documents, resources and verticals).


Documents can be judged as Nav, Key, Hrel, Rel or Non relevant (taken from the TREC Web track).


The relevance of each resource is determined by calculating the graded precision (see Using graded relevance assessments in IR evaluation, J. Kekäläinen and K. Järvelin, JASIST 53(13), 2002) on its top 10 results. This takes the graded relevance levels of the documents in the top 10 into account, but not the ranking.
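
As a rough illustration, graded precision can be computed as the average relevance weight over the top 10 results. The weights per relevance level below are illustrative placeholders and are not taken from this page; dividing by 10 even when fewer results are returned is also an assumption.

```python
# Hypothetical weights per relevance level (placeholders, not official values).
WEIGHTS = {"Non": 0.0, "Rel": 0.25, "Hrel": 0.5, "Key": 1.0, "Nav": 1.0}


def graded_precision(labels, k=10, weights=WEIGHTS):
    """Average weight of the top-k labels; the order within the top k is ignored."""
    return sum(weights[label] for label in labels[:k]) / k


# Hypothetical resource that returned only four results for a topic.
gp = graded_precision(["Key", "Rel", "Non", "Hrel"])
print(gp)
```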


The relevance of a vertical for a given query is determined by the best performing resource (search engine) within this vertical. More specifically, the relevance is represented by the maximum graded precision of its resources. For the final evaluation, the binary relevance of a vertical is determined by a threshold: a vertical whose maximum graded precision is 0.5 or higher is considered relevant. This threshold was determined based on data analyses, such that for most queries there is a small set of relevant verticals. If no vertical reaches this threshold for a given query, we use the single vertical with the maximal relevance as the relevant vertical.
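
The thresholding rule can be sketched in Python as follows. The comparison ">= threshold" is an assumption about how the boundary case is treated, and the resource-to-vertical mapping shown is hypothetical.

```python
def relevant_verticals(resource_gp, resource_to_vertical, threshold=0.5):
    """Return the set of relevant verticals for one query.

    resource_gp: {resource ID: graded precision for this query}
    resource_to_vertical: {resource ID: vertical ID}
    """
    best = {}
    for resource, gp in resource_gp.items():
        vertical = resource_to_vertical[resource]
        # A vertical is as relevant as its best performing resource.
        best[vertical] = max(best.get(vertical, 0.0), gp)
    selected = {v for v, gp in best.items() if gp >= threshold}
    if not selected and best:
        # Fallback: no vertical reaches the threshold, keep the top-1 vertical.
        selected = {max(best, key=best.get)}
    return selected


# Hypothetical mapping and graded precision values.
mapping = {"FW14-e001": "FW14-v001", "FW14-e002": "FW14-v001", "FW14-e003": "FW14-v002"}
above = relevant_verticals({"FW14-e001": 0.2, "FW14-e002": 0.6, "FW14-e003": 0.1}, mapping)
fallback = relevant_verticals({"FW14-e001": 0.2, "FW14-e003": 0.1}, mapping)
print(above, fallback)
```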

Submission format

The submission formats of the resource selection and results merging tasks follow the traditional trec_eval format:

  • Column 1: the topic number.
  • Column 2: currently unused and should always be "Q0".
  • Column 3: the official identifier of the resource/snippet.
  • Column 4: the rank.
  • Column 5: the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order. The evaluation program ranks the resources or snippets from these scores, not from your ranks. If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  • Column 6: the "run tag". It should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also run tags must contain 12 or fewer letters and numbers, with no punctuation, to facilitate labeling graphs with the tags.
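
A quick pre-submission check of these column rules might look like the Python sketch below. The function check_run is hypothetical and covers only the rules listed above (6 columns, "Q0" in column 2, non-increasing scores per topic, run tag of at most 12 letters and digits); it is not the official validation script.

```python
def check_run(lines):
    """Return a list of problems found in 6-column run lines (empty if none)."""
    errors = []
    previous = {}  # topic -> score on the preceding line for that topic
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) != 6:
            errors.append(f"line {number}: expected 6 columns, got {len(parts)}")
            continue
        topic, q0, _item, _rank, score, tag = parts
        if q0 != "Q0":
            errors.append(f"line {number}: column 2 must be Q0")
        if not (tag.isascii() and tag.isalnum() and len(tag) <= 12):
            errors.append(f"line {number}: invalid run tag")
        try:
            value = float(score)
        except ValueError:
            errors.append(f"line {number}: score is not a number")
            continue
        if topic in previous and value > previous[topic]:
            errors.append(f"line {number}: score increases within topic {topic}")
        previous[topic] = value
    return errors


good = [
    "7001 Q0 FW14-e002 1 29.34 univXRS1",
    "7001 Q0 FW14-e007 2 21.67 univXRS1",
]
bad = [
    "7001 Q0 FW14-e002 1 19.97 univ-X",
    "7001 Q0 FW14-e007 2 21.67 univ-X",
]
print(check_run(good))
print(check_run(bad))
```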

NEW: 2014 online evaluation

You can check your runs, and get preliminary evaluation results at FedWeb Circus.  Get notifications of new runs by following @TRECFedWeb on Twitter.

2013 Track

Information about the 2013 track can be found here.

Track coordinators