Practical Annotation and Processing of Social Media with GATE


Social media is fast becoming a crucial part of our everyday lives, not only as a fun and practical way to share our interests and activities with geographically distributed networks of friends, but also as an important part of our business lives. Processing social media is particularly problematic for automated tools, both because it departs strongly from the newswire tradition that many tools were developed with and evaluated against, and because its language is typically terse and low in context.

This tutorial will address these issues in the context of the semantic web by introducing some of the problems faced when applying NLP tools to social media, and solutions to these problems, including those specifically implemented in GATE and recently made publicly available. It will demonstrate techniques for extracting relevant information from unstructured text in social media, so that participants will be equipped with the building blocks of knowledge needed to build their own tools and tackle complex issues. Since all of the NLP tools to be presented (e.g. GATE) are open source, the tutorial will provide attendees with skills that are easy to apply and do not require special software or licenses.


From a business and government point of view, there is an increasing need to interpret and act upon information from large-volume social media streams, such as Twitter, Facebook, and forum posts. While natural language processing of newswire has been very well studied over the past two decades, understanding social media content has only recently been addressed in NLP research.

Social media poses three major computational challenges, dubbed by Gartner the 3Vs of big data: volume, velocity, and variety. NLP methods, in particular, face further difficulties arising from the short, noisy, and strongly contextualised nature of social media. To address the 3Vs of social media, novel language technologies have emerged, e.g. using locality sensitive hashing to detect new stories in media streams (volume), predicting stock market movements from tweet sentiment (velocity), and recommending blogs and news articles based on users' own comments (variety).

The proposed tutorial takes a detailed view of key NLP tasks for social media content: corpus annotation, linguistic pre-processing, information extraction and opinion mining. After a short introduction to the challenges of processing social media, we will cover key NLP algorithms adapted to processing such content, discuss available evaluation datasets and outline remaining challenges. The goal is to make attendees familiar with the issues involved in processing social media, and to give them practical tools for handling unstructured data of this kind.

Tutorial Length

This is a half-day tutorial.

The speaker has extensive experience with similar material, having developed and delivered two four-hour courses on NLP for social media, a week-long PhD-level course involving hands-on exercises, and various tutorial seminars. The tutorial is structured as follows:

Data gathering: In this 25-minute section we discuss access to social network content. First, we introduce APIs for three popular networks: Twitter, LinkedIn and Facebook. We cover the most relevant points of the terms of service, and discuss the practical legal and ethical considerations of collecting, retaining and analysing this data. Both private and public APIs are discussed, as well as rate limiting and good practice for handling data retrieved via an API. We also present the strengths and weaknesses of streaming versus query-based search, in the context of task and corpus requirements, and demonstrate GATE's Twitter corpus support.
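As an illustration of the rate-limiting point above, the following is a minimal Python sketch of a client-side sliding-window limiter. It is not taken from GATE or from any network's client library; the class name `SlidingWindowLimiter`, its parameters and the injectable `clock` are assumptions made for this example, and real limits must be taken from each network's own API documentation.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Hypothetical helper: allow at most max_calls requests per window_seconds."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock          # injectable so the limiter can be tested
        self.calls = deque()        # timestamps of recent requests, oldest first

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self.clock()
        # Forget requests that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        # The oldest remembered request must expire before we may call again.
        return self.window - (now - self.calls[0])

    def record(self):
        """Note that a request has just been sent."""
        self.calls.append(self.clock())
```

A collection loop would call `wait_time()`, sleep for the returned number of seconds, issue the API request, and then call `record()`. The limit values passed to the constructor are placeholders chosen by the user, not figures from any particular network.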

Linguistic pre-processing of social media content: This section will start by introducing the challenges of processing social media with state-of-the-art NLP tools that have not been trained on or adapted to this kind of content. Next we will focus on language identification, normalisation, tokenisation, and part-of-speech tagging. Common difficulties will be examined and key new NLP methods for tackling these challenges will be introduced, with a particular focus on Twitter. This includes an introduction to GATE's TwitIE pipeline, which addresses many of these challenges.
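To make the tokenisation and normalisation difficulties concrete, here is a minimal Python sketch of a tweet-aware tokeniser and a character-lengthening normaliser. It is an illustrative assumption, not TwitIE's implementation: the pattern, the function names `tokenise` and `squeeze_lengthening`, and the small emoticon set are all invented for this example.

```python
import re

# Alternatives are tried left to right, so URLs are matched before plain words.
TOKEN_RE = re.compile(
    r"""
    https?://\S+              # URLs, kept as a single token
    | [@#]\w+                 # @mentions and #hashtags
    | [:;=][-o]?[)(DPp/\\|]   # a few western-style emoticons
    | \w+(?:'\w+)*            # words, including contractions such as don't
    | [^\w\s]                 # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Split a tweet into tokens without breaking URLs, mentions or hashtags."""
    return TOKEN_RE.findall(text)


def squeeze_lengthening(text):
    """Reduce runs of three or more identical characters to two (sooooo -> soo)."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


if __name__ == "__main__":
    print(tokenise("@bob omg check http://t.co/abc :) #nlp!!"))
    print(squeeze_lengthening("sooooo goood"))
```

A whitespace- or punctuation-based tokeniser built for newswire would typically split `#nlp` and `:)` into separate symbols. This sketch keeps them whole, but it still has limitations: for example, punctuation directly following a URL is absorbed into the URL token.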

Crowdsourcing social media corpora: This section will briefly cover key crowdsourced social media corpora, provide a step-by-step example of using GATE's CrowdFlower plugin to collect named-entity-annotated tweets, and briefly discuss different task designs, quality assurance approaches, and annotation adjudication methods.
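One simple adjudication method is majority voting over the labels that several workers assign to the same item. The Python sketch below illustrates that idea only; it does not reflect how GATE's CrowdFlower plugin adjudicates, and the function name `adjudicate` and the `min_agreement` threshold are assumptions made for this example.

```python
from collections import Counter


def adjudicate(judgements, min_agreement=0.5):
    """Pick the majority label for one item from a list of worker labels.

    Returns (label, agreement), where agreement is the share of workers who
    chose the winning label. The label is None when agreement does not exceed
    min_agreement, signalling that the item needs expert review.
    """
    if not judgements:
        return None, 0.0
    label, votes = Counter(judgements).most_common(1)[0]
    agreement = votes / len(judgements)
    if agreement > min_agreement:
        return label, agreement
    return None, agreement


if __name__ == "__main__":
    # Three hypothetical workers label the same token.
    print(adjudicate(["PER", "PER", "LOC"]))
    # An even split is left unresolved.
    print(adjudicate(["PER", "LOC"]))
```

Items returned with a `None` label can be routed to an expert annotator or sent out for further judgements, which is one way to combine adjudication with quality assurance.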

Entity recognition and disambiguation in Twitter: This section will cover two more challenging tasks, entity recognition and entity disambiguation (linking). It again starts with the common mistakes made by state-of-the-art, news-trained entity recognition and linking algorithms, together with the challenges of opinion mining. Algorithms developed specifically for social media will then be introduced, including YODIE, GATE's new entity disambiguation tool, which has been developed with social media specifically in mind.

Tutoring team

Dr. Leon Derczynski, University of Sheffield

Leon Derczynski is a post-doctoral Research Associate who completed a PhD in Temporal Information Extraction at the University of Sheffield in 2012 under an enhanced EPSRC doctoral training grant. His main interests are in data-intensive approaches to computational linguistics, specialising in information extraction, spatio-temporal semantics, semi-supervised learning, and handling noisy linguistic data, especially social media.

Dr. Derczynski has been working in commercial and academic research on NLP and IR since 2003, with a focus on temporal relations, temporal and spatial information extraction, semantic annotation, usability and social media. His commercial work included the early introduction of Mechanical Turk for scaling marketing and linguistic discrimination tasks. Leon co-organises TempEval, the established evaluation challenge steering the state of the art in temporal information extraction technology. His current work focuses on handling social media as big data.