2013 TREC Crowdsourcing Track

The Crowdsourcing Track investigates the use of crowdsourcing techniques to better evaluate information retrieval (IR) systems. The track is run as part of the National Institute of Standards and Technology (NIST)'s annual Text REtrieval Conference (TREC).


Track Overview
The 2013 Crowdsourcing Track will require collecting relevance judgments for Web pages and search topics.  There will be 50 search topics - some of them with multiple facets/sub-topics - taken from the TREC Web Track. Web pages to be judged for relevance will be drawn from the ClueWeb12 collection.  After the Web Track participants submit their retrieval runs, NIST will identify a subset of the submitted documents to be judged for relevance.  In parallel with NIST assessment, the Crowdsourcing Track participants will judge the same documents as the NIST judges. Evaluation will measure accuracy of crowd judgments, using NIST judgments as a gold standard for comparison, with secondary evaluation of self-reported time, cost, and effort required to crowdsource the judgments.

Although this is the "Crowdsourcing" track, participants are free to use or not use crowdsourcing techniques as they wish. Judgments may be obtained via a fully automated system, traditional relevance assessment practices, a purely crowdsourcing-based approach, or a hybrid combining automated systems, crowds, trained judges, and other resources and techniques. Crowdsourcing may be paid or unpaid. It is entirely up to each team to find the best way to obtain accurate judgments, in a reliable and scalable manner, while minimizing the time, cost, and effort required.

Track Tasks
We welcome broad participation, and the track is designed to make this as easy as possible. We offer two entry levels, and participants can choose to enter at either or both:
  • Basic: ~3.5k documents (subset of the NIST pool, 10 topics)
  • Standard: ~20k documents (entire NIST pool, 50 topics)
The task in both cases is to obtain relevance labels for the documents and search topics included in the entry level set. As mentioned in the overview, any crowdsourcing techniques may be used as well as automated systems.
Training data. No training data (relevance judgments) will be provided. To evaluate the quality of their judgments prior to the official evaluation, participants are welcome to use any NIST judgments from prior years (e.g., previous Web Tracks using ClueWeb09) or to create their own judgments for ClueWeb12.
Test data. Participants are provided with details of the search topics and a list of (topic-ID, document-ID) pairs to be judged - this data is now available from the active participants section of the TREC website.  The document-IDs identify the documents to be judged from the ClueWeb12 collection. This collection is already available:
  • To purchase ClueWeb12, please follow the instructions at http://boston.lti.cs.cmu.edu/clueweb12/.
  • Thanks to Jamie Callan at CMU, it is possible to participate in the track without purchasing ClueWeb12.  In order to obtain the data, you must still sign the ClueWeb12 agreement and return it to CMU.  You will then receive username/password credentials for downloading the data from CMU at: http://boston.lti.cs.cmu.edu/clueweb12/TRECcrowdsourcing2013/. Processing these agreements takes 1-2 weeks, so plan accordingly. If you require expedited processing, contact the Crowdsourcing Track organizers and we'll see if we can help.
For example, this topic from last year's Web Track is a multi-faceted topic:

<topic number="154" type="faceted">
Find information on nutritional or health benefits of figs.
<subtopic number="1" type="inf">
Find information on nutritional or health benefits of figs.
</subtopic>
<subtopic number="2" type="nav">Find recipes that use figs.</subtopic>
<subtopic number="3" type="inf">
Find information on the different varieties of figs.
</subtopic>
<subtopic number="4" type="inf">Find information on growing figs.</subtopic>
</topic>

If we were to judge relevance for this topic, judging would be only with respect to the description, "Find information on nutritional or health benefits of figs.", and not the individual subtopics. Note that the description is always repeated as subtopic number 1.
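For programmatic use, a topic can be parsed with Python's standard ElementTree module. The XML string below is a well-formed, abridged rendering of the example topic above; the official topic files from the TREC website are authoritative.

```python
import xml.etree.ElementTree as ET

# Well-formed, abridged rendering of the example topic above.
TOPIC_XML = """\
<topic number="154" type="faceted">
Find information on nutritional or health benefits of figs.
<subtopic number="1" type="inf">
Find information on nutritional or health benefits of figs.
</subtopic>
<subtopic number="2" type="nav">Find recipes that use figs.</subtopic>
</topic>
"""

topic = ET.fromstring(TOPIC_XML)
description = topic.text.strip()  # topic text before the first subtopic
subtopics = {s.get("number"): s.text.strip()
             for s in topic.findall("subtopic")}
# For this track, only the description (repeated as subtopic 1) is judged.
```

Note that the description is mixed content directly inside the `<topic>` element, so it is read from `topic.text` rather than from a child element.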

Each topic-document pair needs to be judged on a six-point scale, as follows:
    4 = Nav This page represents a home page of an entity directly named by the query; the user may be searching for this specific page or site.
    3 = Key This page or site is dedicated to the topic; authoritative and comprehensive, it is worthy of being a top result in a web search engine.
    2 = HRel The content of this page provides substantial information on the topic.
    1 = Rel The content of this page provides some information on the topic, which may be minimal; the relevant information must be on that page, not just promising-looking anchor text pointing to a possibly useful page.
    0 = Non The content of this page does not provide useful information on the topic, but may provide useful information on other topics, including other interpretations of the same query.
    -2 = Junk This page does not appear to be useful for any reasonable purpose; it may be spam or junk.
Note that all topics are expressed in English. Non-English documents will be judged non-relevant by NIST assessors, even if the assessor understands the language of the document and the document would be relevant in that language. If the location of the user matters, the assessor will assume that the user is located in Gaithersburg, Maryland.

This year's topics are available in the active participants section of the TREC website along with the two sets of topic-docno pairs.  The topics for the basic set are 202, 214, 216, 221, 227, 230, 234, 243, 246, and 250.

In addition, the NIST assessor guidelines are posted in the active participants section.  These guidelines allow you to see the instructions that the NIST assessors have been given for judging the 2013 web track.

Submission Format
Participating groups may submit as many runs as they like, but must ask permission before submitting more than 10 runs. One run must be designated as the primary run, which will be used for group majority voting, i.e., only one submission per participating group will contribute to the majority relevance label across all groups.
Each run must be submitted as an ASCII text file, with one judgment per line. A run must contain only one relevance label per topic-document pair.
Each line of a run must consist of the following whitespace-separated columns:
  1. Topic-id
  2. Document-id
  3. Relevance label.  Allowed values = {4, 3, 2, 1, 0, -2} as per the above mapping where 4 = Nav, etc.
  4. A score or probability of relevance. The greater the score or probability, the more likely the document is relevant. We will use these scores to compute label quality measures. If you do not produce a score or probability of relevance, copy the relevance label into this column.
  5. Run-tag: an identifier for the run that is unique among all the runs from your group. Run tags must be 12 or fewer characters, made up of letters and numbers only.
All columns are required.
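As a rough illustration of the format, here is a minimal Python sketch of a per-line checker. The function name and error messages are our own, and the official Perl script in the active participants section remains the authoritative validator.

```python
import re

VALID_LABELS = {"4", "3", "2", "1", "0", "-2"}

def check_run_line(line):
    """Return a list of problems found in one run line of the form:
    topic-id document-id relevance-label score run-tag."""
    cols = line.split()
    if len(cols) != 5:
        return ["expected 5 columns, got %d" % len(cols)]
    topic_id, doc_id, label, score, tag = cols
    problems = []
    if label not in VALID_LABELS:
        problems.append("bad relevance label: " + label)
    try:
        float(score)
    except ValueError:
        problems.append("score is not numeric: " + score)
    if not re.fullmatch(r"[A-Za-z0-9]{1,12}", tag):
        problems.append("run tag must be 1-12 letters/digits: " + tag)
    return problems
```

For example, `check_run_line("202 clueweb12-0000tw-05-12345 2 0.8 myrun1")` returns an empty list, while a line with a missing column or an out-of-range label returns a description of the problem.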

A perl script is available in the active participants portion of the TREC website to verify that your run submission is in the correct format.  
Evaluation
Relevance labels submitted to the track will be evaluated against NIST judgments. We currently expect to use three metrics:
  • Rank Correlation: The Web Track participants' IR systems will be scored based on NIST judgments according to the primary Web Track metric, inducing a ranking of IR systems. A similar ranking of IR systems will be induced from each Crowdsourcing Track participant's submitted judgments. Rank correlation indicates how accurately crowd judgments can be used to predict the NIST ranking of IR systems. The primary measure we will use for rank correlation is Yilmaz et al.'s AP Correlation ( http://doi.acm.org/10.1145/1390334.1390435 ).

  • Score Accuracy: In addition to correctly ranking systems, it is important that the evaluation scores be as accurate as possible.  We will use root mean square error (RMSE) for this measure.

  • Label Quality: Direct comparison of each participant's submitted judgments against NIST judgments (no evaluation of Web Track IR systems). Label quality provides the simplest evaluation metric and can be correlated with the other measures predicting performance of IR systems. We will use graded average precision (GAP) for this measure ( http://dx.doi.org/10.1145/1835449.1835550 ). GAP will be computed by ordering the documents by their submitted scores and then using the qrels provided by NIST.
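As a rough sketch of the first two metrics, the following Python implements tau_AP (Yilmaz et al.'s AP correlation) and RMSE over per-system scores. The function names and the shape of the inputs are our own illustrative choices; the official evaluation code is authoritative.

```python
import math

def ap_correlation(true_order, est_order):
    """tau_AP (Yilmaz et al., SIGIR 2008): like Kendall's tau, but
    errors near the top of the estimated ranking are penalized more.
    Both arguments list the same system ids, best system first."""
    n = len(est_order)
    true_rank = {sys_id: r for r, sys_id in enumerate(true_order)}
    total = 0.0
    for i in range(1, n):  # ranks 2..n of the estimated ordering
        sys_id = est_order[i]
        # systems above position i that truth also ranks above sys_id
        correct = sum(1 for above in est_order[:i]
                      if true_rank[above] < true_rank[sys_id])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

def rmse(nist_scores, crowd_scores):
    """Root mean square error between per-system evaluation scores
    computed from NIST qrels and from crowd-derived qrels.
    Both arguments map system id -> score."""
    sq = [(nist_scores[s] - crowd_scores[s]) ** 2 for s in nist_scores]
    return math.sqrt(sum(sq) / len(sq))
```

An identical ranking yields tau_AP = 1 and a fully reversed ranking yields -1; RMSE is 0 when the crowd-based scores exactly match the NIST-based scores.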

Sponsorship
All sponsorships below are available strictly to track participants and only for the purpose of collecting relevance labels for the tasks set out above. Teams accepting sponsorship support must acknowledge it in any derivative works (e.g., published papers and public presentations).
  • Amazon: up to $300 per team for Mechanical Turk use (first 10 academic teams who email Matt Lease to commit to participating; non-academic teams will be considered on a case-by-case basis). Funds will be distributed from UT Austin as a cash reimbursement for attending the TREC conference, payable after the conference; reimbursement requires completing some paperwork and providing government-issued proof of identity. The amount requested should be less than or equal to the total Mechanical Turk cost, which must be documented in the team's TREC Conference Notebook paper.

  • Crowd Computing Systems: $100 each for 2 teams, plus use of their platform. There is an application process; interested groups should contact Jaimie Lang at jlang@crowdcomputingsystems.com.


Timeline
  • Aug 9: Topics and document sets released
  • Sep 6: Submissions due
  • Oct 11: Preliminary results released
  • Oct 27: Conference notebook papers due
  • Nov 19-22: TREC 2013 conference at NIST, Gaithersburg, MD, USA
  • Dec 16: Final results released
  • Jan 2014: Final papers due

Related Events

Crowdsourcing Track Organizers