Information Extraction from Microblogs Posted during Disasters

Track description

User-generated content on microblogging sites like Twitter is known to be an important source of real-time information on various events, including disaster events like floods, earthquakes, and terrorist attacks. In this track, our aim is to develop IR methodologies for extracting important information from microblogs posted during disasters.

A large set of microblogs (tweets) posted during a recent disaster event will be made available, along with a set of topics (in TREC format). Each ‘topic’ will identify a broad information need during a disaster, such as: what resources are needed by the population in the disaster-affected area, what resources are available, what resources are required or available in which geographical region, and so on. Specifically, each topic will contain a title, a brief description, and a more detailed narrative on what type of tweets will be considered relevant to the topic. The participants are required to develop methodologies for extracting tweets that are relevant to each topic with high precision (i.e., ideally, only the relevant tweets should be identified) as well as high recall (i.e., ideally, all relevant tweets should be identified).
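Since the topics follow the TREC convention, a small topic parser is often the first thing participants write. The Python sketch below is illustrative only: it assumes the classic layout in which each topic sits inside <top>...</top> with <num>, <title>, <desc>, and <narr> fields, and the exact layout of the released file may differ.

    import re

    # Assumed classic TREC layout; adjust the tag names if the released
    # topic file differs.
    TOPIC_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
    FIELD_RE = re.compile(
        r"<(num|title|desc|narr)>(.*?)(?=</?(?:num|title|desc|narr)>|\Z)",
        re.DOTALL)

    def parse_topics(path):
        """Return one {num, title, desc, narr} dict per topic in the file."""
        with open(path, encoding="utf-8") as f:
            text = f.read()
        topics = []
        for block in TOPIC_RE.findall(text):
            fields = {}
            for tag, value in FIELD_RE.findall(block):
                # Drop label prefixes such as "Number:" / "Description:"
                fields[tag] = re.sub(
                    r"^\s*(Number|Description|Narrative):", "", value).strip()
            topics.append(fields)
        return topics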

This is essentially an ad hoc search task, where the main challenges are:
  1. Dealing with the noisy nature of microblogs, which are very short (at most 140 characters) and often written informally, using abbreviations, colloquial terms, etc., and
  2. Identifying specific keywords relevant to each broad topic. Note that each individual microblog contains only a few words, and might not contain most of the specific keywords even though the tweet is relevant to a topic. (A minimal baseline illustrating these challenges is sketched after this list.)
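To make these challenges concrete, the following standard-library Python sketch shows the kind of minimal BM25 baseline such a task invites. It is a sketch, not a competitive system; in particular, it does nothing about the vocabulary-mismatch problem in point 2, which would call for techniques such as query expansion.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        """Crude tweet normalisation: lowercase, drop URLs and @-mentions,
        and split hashtags off their '#' so the bare word can match."""
        text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        return re.findall(r"[a-z][a-z']*", text.replace("#", " "))

    def build_index(tweets):
        """tweets: dict mapping tweet ID -> raw text.
        Returns tokenised docs and document-frequency counts."""
        docs = {tid: tokenize(txt) for tid, txt in tweets.items()}
        df = Counter()
        for toks in docs.values():
            df.update(set(toks))
        return docs, df

    def bm25(query, doc, df, n_docs, avgdl, k1=1.2, b=0.75):
        """Standard BM25 score of one tokenised tweet against a query."""
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        return score

    # Usage sketch (tweets and topic are assumed to come from the data):
    #   docs, df = build_index(tweets)
    #   avgdl = sum(map(len, docs.values())) / len(docs)
    #   query = tokenize(topic["title"] + " " + topic["desc"])
    #   ranked = sorted(docs, reverse=True,
    #                   key=lambda t: bm25(query, docs[t], df, len(docs), avgdl))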

Data

The data will contain:
  1. Around 50,000 microblogs (tweets) from Twitter that were posted during the Nepal earthquake in April 2015. Since the Twitter terms of service do not allow public sharing of tweets, only the tweet IDs will be provided, along with a script that can be used to download the tweets via the Twitter API (an illustrative sketch of such a script follows this list).
  2. A set of 5–8 topics in TREC format, each containing a title, a brief description, and a more detailed narrative on what type of tweets will be considered relevant to the topic.
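The official download script will be supplied with the data; purely to illustrate how such tweet “hydration” typically works, here is a rough sketch using the third-party tweepy library (3.x API). The credential placeholders and file names are assumptions.

    import json
    import tweepy  # third-party: pip install "tweepy<4" (sketch uses the 3.x API)

    # Placeholder credentials -- replace with keys from your own Twitter app.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    def hydrate(id_file, out_file, batch=100):
        """Fetch tweets in batches of up to 100 IDs (the statuses/lookup
        limit), writing one JSON object per line."""
        with open(id_file) as f:
            ids = [line.strip() for line in f if line.strip()]
        with open(out_file, "w", encoding="utf-8") as out:
            for i in range(0, len(ids), batch):
                for status in api.statuses_lookup(ids[i:i + batch]):
                    out.write(json.dumps(status._json) + "\n")

    hydrate("tweetids.txt", "tweets.jsonl")  # file names are assumptions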

Evaluation plan

Since the aim of this track is to extract a set of tweets that are relevant to each topic, set-based evaluation metrics like precision, recall, and F-score will be used. The gold standard, against which the set of tweets identified by the participants will be matched, will be generated by a “manual run”: human volunteers (assessors) will be given the same set of tweets and topics, and asked to identify all possible relevant tweets using a search engine (Indri).
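For concreteness, these set-based metrics reduce to a few lines of Python. The balanced F1 shown here is one common choice; the exact weighting of precision and recall used in the evaluation may differ.

    def set_metrics(retrieved, relevant):
        """Set-based precision, recall, and balanced F-score over tweet IDs.
        The balanced F1 is an assumption; P and R may be weighted differently."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        p = hits / len(retrieved) if retrieved else 0.0
        r = hits / len(relevant) if relevant else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f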
 
While judging the participants’ runs, we will also arrange a second round of assessments, if necessary, to judge the relevance of tweets that were identified by the participants but missed during the first round of human assessment.

Timeline

  • July 1, 2016: Data and topics released. To get the dataset, please send a scanned copy of the duly filled-in "organizational-access form" found here to fire.microblog2016@gmail.com, mentioning "FIRE 2016 Microblog Track" somewhere in the form. We will then email you the topics, the tweet IDs, and a script to download the tweets.
  • August 25, 2016: Run submission deadline.  [Run submission instructions (.pdf)]   [Sample run file to be submitted (.txt)]
  • September 21, 2016: Results declared.
  • October 30, 2016 (extended from October 22, 2016): Working notes due.

Organizers