2011 Track Guidelines


These are the draft guidelines for the 2011 edition of the TREC Microblog track. The authors are Craig Macdonald, Iadh Ounis, Jimmy Lin, Abdur Chowdhury and Ian Soboroff.

Data

For TREC 2011, we are using the Tweets2011 corpus. The corpus comprises two weeks of tweets sampled courtesy of Twitter. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included. As the reusability of a test collection is paramount in a TREC track, the sample is defined so that it can be obtained at any point in time.

Size and Format of the Tweets2011 Corpus

The size of the corpus is approximately 16 million tweets over a period of two weeks (24th January 2011 until 8th February 2011, inclusive), which covers both the time period of the Egyptian revolution and the US Super Bowl. Different types of tweets are present, including replies and retweets.

Each day of the corpus is split into files called "blocks", each of which contains about 10,000 tweets compressed using gzip. Each tweet is in JSON format, similar (but not identical) to the format used by the Twitter Gardenhose. Within the corpus, tweets are ordered by tweet id; for our purposes, tweet ids are chronologically ordered.
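
For illustration, here is a minimal Python sketch of reading one block, assuming each block is a gzip-compressed file containing one JSON tweet per line (the file name is hypothetical):

import gzip
import json

# Read one block of the corpus; the path is hypothetical.
with gzip.open("20110124/block-000.json.gz", "rt", encoding="utf-8") as block:
    for line in block:
        tweet = json.loads(line)
        # Field names follow Twitter's JSON conventions; adjust them to the
        # keys actually present in your copy of the corpus.
        print(tweet.get("id"), tweet.get("text"))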

We hope to later distribute a corpus of the Web documents that are linked to by the tweets in the corpus, with the shortened URLs resolved.

Obtaining the Tweets2011 Corpus

The homepage for the Tweets2011 corpus is at http://trec.nist.gov/data/tweets/.

The Tweets2011 corpus is unusual in that the actual tweets are downloaded directly from Twitter, using a provided tool. However, to obtain the lists of particular tweets in each block to be downloaded (i.e. the "tweet lists"), a license agreement must be signed. Once signed, the agreement should be emailed back to NIST, who will provide you with a username/password to download the tweet lists (in the form of a .tar.gz file).

Once you have downloaded and decompressed the tweet lists from NIST, you should obtain and run the twitter-corpus-tools corpus downloader, which fetches the listed tweets directly from Twitter. Further instructions on downloading and using the twitter-corpus-tools corpus downloader are provided with the tool.

You MUST NOT re-distribute the tweet lists or the corpus obtained by using them, as doing so violates both the Tweets2011 corpus license agreement and the Twitter Terms of Use. Note that it can take several days to download your copy of the Tweets2011 corpus.

Finally, by signing the corpus license agreement, you agree to remove, when instructed, tweets that have been deleted from Twitter. twitter-corpus-tools will provide a tool for removing deleted tweets from your copy of the corpus.

Realtime Adhoc Task

In the first run of the Microblog track, we address a search task in which a user's information need is represented by a query issued at a specific time. In particular, we address a realtime search task, where the user wishes to see the most recent relevant information for the query. Hence, the system should answer a query by providing a list of relevant tweets ordered from newest to oldest, starting from the time the query was issued. When selecting tweets to include in the list, systems should favour relevant tweets that are both interesting and recent. Interestingness is subjective, but the issuer of a query might interpret it as whether a tweet provides some added value with respect to the query topic. For this year, the "novelty" between tweets will not be considered.

Topics will be developed to represent an information need at a specific point in time. An example topic could be:

<top>
<num> Number: MB01 </num>
<title> Wael Ghonim </title>
<querytime> 25th February 2011 04:00:00 +0000 </querytime>
<querytweettime> 3857291841983981 </querytweettime>
</top>

where:
  •  the num tag contains the topic number.
  •  the title tag contains the user's query representation.
  •  the querytime tag contains the timestamp of the query in a human- and machine-readable ISO standard form.
  •  the querytweettime tag contains the timestamp of the query in terms of the chronologically nearest tweet id within the corpus.

NIST will create 50 topics for the purposes of this task. Moreover, while no narrative or description tags are provided, the topic developer/assessor will have a clearly defined information need.
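
As an illustration, here is a minimal Python sketch that parses a topics file in the above format; the file name and function are hypothetical, not official tooling:

import re

def parse_topics(text):
    """Extract the num, title, querytime and querytweettime fields
    from each <top>...</top> block of a topics file."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, re.DOTALL):
        fields = {}
        for tag in ("num", "title", "querytime", "querytweettime"):
            match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), block, re.DOTALL)
            fields[tag] = match.group(1).strip() if match else None
        # The num field carries a "Number:" prefix in the example above.
        fields["num"] = fields["num"].replace("Number:", "").strip()
        topics.append(fields)
    return topics

with open("topics.txt") as f:  # hypothetical file name
    for topic in parse_topics(f.read()):
        print(topic["num"], topic["title"])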

Submission Guidelines

Participating groups may submit up to four runs to the Realtime Adhoc task. At least one run should not use any external or future source of evidence (see below for a description of external and future sources of evidence). Runs that use external or future evidence will be ranked separately in the track overview paper. 

Submitted runs must follow the standard TREC format:

MB01 Q0 3857291841983309 1 0.999 myRun
MB01 Q0 3857291841983302 2 0.878 myRun
MB01 Q0 3857291841983301 3 0.314 myRun
...
MB02 Q0 3857291214283390 1000 0.000001 myRun
...

The fields are the topic number, a literal "Q0", a tweet id, the rank at which the tweet was retrieved, the score, and the identifier for the run (the "run tag"). Note that for the primary task evaluation measures, the run will be sorted by topic number and descending tweet id. The rank and score fields are retained to compute other measures and to preserve information about the system's ordering of the tweets. All retrieved tweet ids must be no greater than the topic's querytweettime id.

For each query, the system should provide up to 1000 tweets; however, the 30 most recent tweets will be the target of the evaluation measure(s).
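
As an illustration, here is a minimal Python sketch of a sanity check for a run file in the above format; the function and the query_tweet_times mapping (topic number to querytweettime id) are hypothetical, not an official checker:

def check_run(run_path, query_tweet_times, max_results=1000):
    """Basic sanity checks for a TREC-format run file."""
    counts = {}
    with open(run_path) as run:
        for line in run:
            topic, q0, tweet_id, rank, score, tag = line.split()
            assert q0 == "Q0", "second field must be the literal Q0"
            # No retrieved tweet may be newer than the query.
            assert int(tweet_id) <= int(query_tweet_times[topic]), \
                "tweet %s is newer than topic %s" % (tweet_id, topic)
            counts[topic] = counts.get(topic, 0) + 1
            assert counts[topic] <= max_results, \
                "more than %d results for topic %s" % (max_results, topic)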

External and Future Evidence

The use of external or future evidence should be acknowledged for every submitted run. In particular, we define external and future evidence as follows:

  1. External Evidence: Evidence outwith the Tweets2011 corpus - for instance, this encompasses other tweets (gardenhose/firehose) or other information from Twitter, as well as other corpora, e.g. Wikipedia or the Web.
  2. Future Evidence: Information that would not have been available to the system at the timestamp of the query.

For example, a Wikipedia snapshot from April 2011 (i.e. after the corpus) constitutes both external and future evidence, while a Wikipedia snapshot from December 2010 is considered external but not future evidence.

Assessment & Evaluation

NIST assessors will judge the relevance of tweets to the specified information need, on a graded scale of "interestingness". As all topics are expressed in English, non-English tweets will be judged non-relevant, even if the topic's assessor understands the language of the tweet and the tweet would be relevant in that language.

Evaluation of the submitted runs will be primarily set-based, with a rank cutoff of 30. For instance, we will measure precision at 30 (P@30). Moreover, a new evaluation measure will be proposed, which will balance the recency and interestingness of tweets.
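
For concreteness, here is a minimal Python sketch of P@30 for a single topic, assuming the evaluated set is the 30 most recent (highest-id) retrieved tweets and that the relevant tweet ids are known from the assessments (names are illustrative):

def precision_at_30(retrieved_ids, relevant_ids):
    """P@30 over the 30 most recent (highest-id) retrieved tweets."""
    newest_30 = sorted(retrieved_ids, key=int, reverse=True)[:30]
    hits = sum(1 for tweet_id in newest_30 if tweet_id in relevant_ids)
    return hits / 30.0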

Systems will be compared to two standard baselines created by a given retrieval system, namely:
  1. Conjunctive: the most recent 1000 tweets that contain all of the query terms.
  2. Disjunctive: the most recent 1000 tweets that contain any of the query terms.
These baselines will be used to measure how well systems identify interesting tweets; a sketch of both is given below.
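
For concreteness, here is a minimal Python sketch of the two baselines, assuming the corpus is available as (tweet id, text) pairs and using simple whitespace term matching; the official baselines will be produced by a given retrieval system and may differ in detail:

def baseline(tweets, query, query_tweet_time, conjunctive=True, k=1000):
    """The most recent k tweets, no newer than the query's tweet id, that
    contain all (conjunctive) or any (disjunctive) of the query terms."""
    terms = query.lower().split()
    matches = []
    for tweet_id, text in tweets:
        if int(tweet_id) > int(query_tweet_time):
            continue  # tweets newer than the query are not retrievable
        words = set(text.lower().split())
        hit = (all(t in words for t in terms) if conjunctive
               else any(t in words for t in terms))
        if hit:
            matches.append((tweet_id, text))
    # Most recent first, i.e. descending tweet id.
    matches.sort(key=lambda pair: int(pair[0]), reverse=True)
    return matches[:k]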

Timeline
 * 16th May 2011: Details of corpus released
 * 20th June 2011 (approx): Topics released
 * 11th August 2011: Runs due
 * late September/early October 2011: Relevance assessments released
 * 15th-18th November 2011: TREC conference in Gaithersburg, MD, USA
