2012 Track Guidelines

These are the draft guidelines for the 2012 edition of the TREC Microblog track. The authors are Craig Macdonald, Iadh Ounis, Jimmy Lin, and Ian Soboroff. The Microblog track first ran in 2011. In the second year of the track, we will address two search tasks, namely Real-time Adhoc and Real-time Filtering, in both of which a user's information need is represented by a query issued at a specific time. In the following, we describe the data used by the Microblog track as well as the two search tasks.

Note that to participate in the track you need to be a registered participant in TREC 2012.  See http://trec.nist.gov/pubs/call2012.html for details.  Note that TREC becomes closed to new participants in late May or early June.

Data

For TREC 2012, we will again use the Tweets2011 corpus. The corpus comprises two weeks of tweets sampled courtesy of Twitter. The corpus is designed to be a reusable, representative sample of the twittersphere, i.e. both important and non-English tweets are included. As the reusability of a test collection is paramount in a TREC track, the sample can be obtained at any point in time.

Size and Format of the Tweets2011 Corpus

The corpus contains approximately 16 million tweets over a period of two weeks (24th January 2011 until 8th February 2011, inclusive), which covers both the time period of the Egyptian revolution and the US Super Bowl. Different types of tweets are present, including replies and retweets.

Each day of the corpus is split into files called "blocks", each of which contains about 10,000 tweets compressed using gzip. Each tweet is in JSON format, similar (but not identical) to the format used by the Twitter Gardenhose. Within the corpus, tweets are ordered by tweet id, which are chronologically ordered for our purposes.

Obtaining the Tweets2011 Corpus

The homepage for the Tweets2011 corpus is at http://trec.nist.gov/data/tweets/ (further information about the corpus can be found in [1]).

The Tweets2011 corpus is unusual in that the actual tweets are downloaded directly from Twitter, using a provided tool. However, to obtain the lists of particular tweets in each block to be downloaded (i.e. the "tweet lists"), a usage agreement must be signed. Once signed, the agreement should be emailed back to NIST, who will provide you with a username/password to download the tweet lists (in the form of a .tar.gz file).

Once you have downloaded and decompressed the tweet lists from NIST, you should obtain and run the twitter-corpus-tools corpus downloader. Instructions for downloading and using twitter-corpus-tools are provided with the tool.

You MUST NOT re-distribute the tweet lists or the corpus obtained by using the tweet lists, as this breaks both the Tweets2011 corpus license agreement and the Twitter Terms of Use. Note that it can take several days to download your copy of Tweets2011.

Finally, by signing the corpus license agreement, you agree to remove, when instructed, tweets that have been deleted from Twitter. If your copy of Tweets2011 is old, we recommend that you re-run the corpus tool; twitter-corpus-tools will also provide a tool for removing deleted tweets from your copy of the corpus.

Robustness to Tweet Deletions

As tweets can be deleted or protected after posting, tweets that have since been deleted will not appear in the copies of participants who download the corpus at a later date. In our experience in TREC 2011, the deletion of tweets affected neither the reusability of the corpus nor the relative ranking of systems [3]. However, for TREC 2012, we will mark as de facto irrelevant any tweets that are not fetchable as of May 7th, 2012.

Real-time Adhoc Task

In the real-time search task, the user wishes to see the most recent relevant information for the query. Hence, the system should answer a query by providing a list of relevant tweets ordered from newest to oldest, starting from the time the query was issued. However, unlike last year, we will evaluate this by having the systems of participating groups return their top 10,000 scoring tweets posted prior to and including the query time defined by the topic. Evaluation will then be conducted by sweeping a retrieval score threshold and evaluating the most recent tweets above that threshold, reporting a score as the area under the ROC curve (true positives vs. false positives, over all possible score thresholds).

When scoring tweets, systems should favor relevant and highly informative tweets about the query topic. For this year, the "novelty" between tweets will again not be considered. Basically, systems should not consider the presentation order of the tweets, but should instead issue a retrieval score representing the probability (or degree, or confidence) of relevance to the query. In addition to the ROC measure defined above, we can score runs based on score and tweet-id orderings using standard IR measures, including MAP.
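As an illustration of the threshold sweep (a sketch of the idea, not the official evaluation script), the area under the ROC curve can be computed from per-tweet scores and binary relevance labels via the rank-sum identity: the AUC equals the probability that a randomly chosen relevant tweet outscores a randomly chosen non-relevant one.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity, equivalent to
    sweeping all score thresholds on the TPR/FPR curve.  Ties count half."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one non-relevant tweet")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```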

Topics will be developed to represent an information need at a specific point in time. An example topic could be:

<top>
<num> Number: MB01 </num>
<query> Wael Ghonim </query>
<querytime> 25th February 2011 04:00:00 +0000 </querytime>
<querytweettime> 3857291841983981 </querytweettime>
</top>

where:

  • the num tag contains the topic number.
  • the query tag contains the user's query representation.
  • the querytime tag contains the timestamp of the query in a human- and machine-readable form.
  • the querytweettime tag contains the timestamp of the query in terms of the chronologically nearest tweet id within the corpus.

NIST will create 60 new topics for the purposes of this task. Moreover, while no narrative and description tags are provided, the topic developer/assessor will have a clearly defined information need, recorded when the topic was created. For each topic, systems should score each tweet in the corpus with id less than or equal to querytweettime by its relevance to the query. Note that while tweet ids are not strictly chronologically ordered, we consider querytweettime to be definitive in preference to querytime.
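For illustration, topics in the format above can be read with a small parser; since the format is not strict XML, a per-tag regular expression suffices. This is a sketch, not an official tool:

```python
import re

TAGS = ("num", "query", "querytime", "querytweettime")

def parse_topics(text):
    """Parse topics in the <top>...</top> format shown above."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, re.DOTALL):
        topic = {}
        for tag in TAGS:
            m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), block, re.DOTALL)
            if m:
                topic[tag] = m.group(1).strip()
        if "num" in topic:
            # Strip the "Number:" prefix used inside the num tag.
            topic["num"] = topic["num"].replace("Number:", "").strip()
        topics.append(topic)
    return topics
```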


Submission Guidelines

Participating groups may submit up to four runs to the real-time adhoc task. At least one run should not use any external or future source of evidence (see below for a description of external and future sources of evidence). Given the nature of real-time search, the use of future evidence is discouraged; however, the use of timely external resources is encouraged.

Unlike last year, submitted runs must follow a modified TREC format:

MB01 3857291841983981 1.999 myRun
MB01 3857291841983302 3.878 myRun
MB01 3857291841983301 0.314 myRun
...
MB02 3857291214283390 0.000001 myRun
...

The fields are the topic number, a tweet id, the score of the tweet by your system, and the identifier for the run (the "run tag"). Note that compared to the standard TREC run format, the Q0 and rank fields have been removed. There is no ordering of tweets defined for a query within the run file, other than that implicitly defined by the score field.

For each query, the system may return up to 10,000 tweets prior to the querytweettime, that is, whose tweet ids are less than or equal to the querytweettime.  Tweets not scored will be assumed to have a minimal score (e.g. negative infinity). 

Note that for the primary task evaluation measure, the run will first be sorted by topic number; thereafter, measures such as the area under the ROC curve will be used to decide how well systems perform. This evaluation permits both chronological orderings and most-relevant-first orderings to be evaluated.

External and Future Evidence

The use of external or future evidence should be acknowledged for every submitted run. In particular, we define external and future evidence as follows:

  • External Evidence: Evidence outside the Tweets2011 corpus - for instance, this encompasses other tweets (gardenhose/firehose) or information from Twitter, as well as other corpora e.g. Wikipedia or the Web.
  • Future evidence: Information that would not have been available to the system at the timestamp of the query.

For example, if you make use of a Wikipedia snapshot from April 2011 (i.e. after the corpus), then it counts as both external and future evidence, while a Wikipedia snapshot from December 2010 is considered external but not future evidence.

Moreover, if your system does not allow you to adapt your inverted file statistics (e.g. IDF) to discard all tweets posted after the query timestamp, then your system must be labelled as using future evidence.
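For example, collection statistics free of future evidence can be computed by restricting the index to tweets with ids at or before querytweettime. The following is a minimal IDF sketch; the tokenization and the exact IDF formula are illustrative choices, not part of the guidelines:

```python
import math

def idf_at(docs, querytweettime):
    """Compute per-term IDF using only tweets posted at or before the query
    tweet id, so the statistics contain no future evidence.
    `docs` is a list of (tweet_id, list_of_terms) pairs."""
    past = [terms for tid, terms in docs if tid <= querytweettime]
    n = len(past)
    df = {}
    for terms in past:
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / d) for term, d in df.items()}
```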

Assessment & Evaluation

NIST assessors will judge the relevance of tweets to the specified information need, on a graded scale of "informativeness". As all topics are expressed in English, non-English tweets will be judged non-relevant, even if the topic's assessor understands the language of the tweet and the tweet would be relevant in that language. As last year, retweets will be automatically judged irrelevant.

As the primary measure, we will report the area under the ROC curve, along with precision and recall.

Real-time Filtering Pilot Task

In the real-time filtering task, the aim is to decide whether subsequently posted tweets are relevant to a query entered at a particular point in time. The real-time filtering task can be thought of as orthogonal to the real-time adhoc task: in the adhoc task, the user is interested in tweets before the querytweettime; in the filtering task, the user has seen the tweets before the querytweettime and is now interested in new relevant tweets. This allows a user to keep up to date with a developing topic on Twitter. Indeed, this can be seen in the “20 new tweets” aspect of the Twitter search page.

The filtering task will use the same run format and a similar topic format to the adhoc task. The difference is that the time range for a topic falls after, rather than before, the querytweettime.

<top>
<num> Number: MB01 </num>
<title> Wael Ghonim </title>
<querytime> 25th February 2011 04:00:00 +0000 </querytime>
<querytweettime> 3857291841983981 </querytweettime>
<querynewesttweet> 3857291841993981 </querynewesttweet>
</top>

The topics used for the real-time filtering task will be the 2011 Microblog track topics, and the relevance judgments will be those from last year.  The topic querytweettime will point to the oldest known relevant tweet, and querynewesttweet will be the original 2011 querytweettime.

The task will follow the adaptive filtering methodology from the TREC filtering track (see [2]).  At the starting point for a given topic, systems are allowed to use the entire corpus prior to querytweettime as background training data, and the query and querytweet as a positive training example.  (Note that querytweets are not guaranteed to be good examples!)

Following this, the system must emit a score as well as a retrieval decision for each tweet from querytweettime to querynewesttweet.  Any unreported tweet will be assumed to have a score of negative infinity with a negative retrieval decision. Systems must process the tweets in tweet ID order and output a decision before processing any further tweets.

If, for a specific tweet, the system emits a positive retrieval decision, the system is allowed to know the relevance value of that tweet (or 0 for an unjudged tweet). This simulates (exceedingly prompt) feedback, which the system is allowed to incorporate into future retrieval decisions. If the system emits a negative retrieval decision for a tweet, it does not get access to any relevance information for that tweet, although the tweet may be used otherwise (for example, for background models or IDF distributions).
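The feedback protocol can be sketched as follows. The term-overlap scoring and the threshold-adaptation rule are toy choices for illustration only, not part of the task definition; only the structure of the loop (one decision per tweet in id order, feedback revealed solely after positive decisions) reflects the guidelines:

```python
def filter_stream(tweets, query_terms, qrels, threshold=0.5, step=0.1):
    """Minimal adaptive-filtering sketch.  `tweets` is a list of
    (tweet_id, list_of_terms) pairs; `qrels` maps tweet_id -> relevance (0/1)."""
    decisions = []
    for tweet_id, terms in sorted(tweets):  # process in tweet id order
        overlap = len(set(terms) & set(query_terms)) / max(len(query_terms), 1)
        decision = overlap >= threshold
        decisions.append((tweet_id, overlap, "yes" if decision else "no"))
        if decision:
            # Feedback is only revealed after a positive decision.
            relevant = qrels.get(tweet_id, 0)
            # Toy adaptation rule: loosen after a hit, tighten after a miss.
            threshold += -step if relevant else step
    return decisions
```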

Participants may use topics {MBx | x mod 5 == 1}, that is, MB1, MB6, MB11, MB16, MB21, MB26, MB31, MB36, MB41, and MB46, along with all their relevance data, as development topics to tune their systems.  Results should only be submitted for the remaining 39 topics.
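The development topic set can be generated directly from the definition above (assuming, per the 2011 track, that topic numbers run from MB1 to MB50):

```python
# Development topics: MBx where x mod 5 == 1, over the 2011 topic numbers 1..50.
dev_topics = ["MB%d" % x for x in range(1, 51) if x % 5 == 1]
```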

Submission Guidelines

Participating groups may submit up to four runs to the real-time filtering task. At least one run should not use any external or future source of evidence (see below for a description of external and future sources of evidence). As above, runs using external timely resources are encouraged.

Run format for the real-time filtering task is similar to the real-time adhoc task:

MB01 3857291841983981 1.999 no myRun
MB01 3857291841983302 3.878 yes myRun
MB01 3857291841983301 0.314 no myRun
...
MB02 3857291214283390 0.000001 no myRun
...

The fields are the topic number, a tweet id, the score of the tweet by your system, a retrieval decision (a literal lowercase 'yes' or 'no') and the identifier for the run (the "run tag").

External and Future Evidence

As for the real-time adhoc task, the use of external or future evidence should be acknowledged for every submitted run. In particular, we define external and future evidence as follows:
  • External Evidence: Evidence outside the Tweets2011 corpus - for instance, this encompasses other tweets (gardenhose/firehose) or information from Twitter, as well as other corpora, e.g. Wikipedia or the Web.
  • Future evidence: Information that would not have been available to the system at the timestamp of the retrieved tweet. Note that this definition is more relaxed than the real-time adhoc task - in particular, evidence that occurs after the querytweettime and before the id of the tweet being considered is allowable.

Assessment & Evaluation

No new relevance judgments will be made for this task; rather, we will use the relevance judgments from last year's track.

Timeline
  • 20th June 2012 (approx): Topics released
  • 10th July 2012: Adhoc runs due
  • 31st July 2012: Filtering runs due
  • late August 2012: Relevance assessments released
  • 6th-9th November 2012: TREC conference in Gaithersburg, MD, USA

References

[1] Ian Soboroff, Dean McCullough, Jimmy Lin, Craig Macdonald, Iadh Ounis, Richard McCreadie. Evaluating Real-Time Search over Tweets. In Proceedings of ICWSM 2012.

[2] Ian Soboroff, Stephen E. Robertson: Building a filtering test collection for TREC 2002. In Proceedings of SIGIR 2003.

[3] Richard McCreadie, Ian Soboroff, Jimmy Lin, Craig Macdonald, Iadh Ounis, and Dean McCullough. On Building a Reusable Twitter Corpus. In Proceedings of SIGIR 2012.
