
Rich information extraction from noisy user generated text


1. DAWT: Densely Annotated Wikipedia Texts across multiple languages

Contributors: Nemanja Spasojevic, Preeti Bhargava, Guoning Hu


In this work, we open up the DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor-text-to-entity links than originally present in the Wikipedia markup. Moreover, it spans several languages including English, Spanish, Italian, German, French and Arabic. We also present the methodology used to generate the dataset, which enriches Wikipedia markup in order to increase the number of links. In addition to the main dataset, we open up several derived datasets including mention-entity co-occurrence counts and entity embeddings, as well as mappings between Freebase ids and Wikidata item ids. We also discuss two applications of these datasets and hope that opening them up will prove useful for the Natural Language Processing and Information Retrieval communities, as well as facilitate multi-lingual research.
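To make the shape of the annotations concrete, here is a minimal sketch of reading one densely annotated document. The JSON layout, field names, and Freebase mids below are illustrative assumptions, not the actual DAWT file format:

```python
import json

# One hypothetical annotated record: token list plus labeled mention spans,
# each mapped to a Freebase machine id and an entity type.
record = json.loads("""{
  "id": "enwiki:12",
  "tokens": ["Anarchism", "is", "a", "political", "philosophy"],
  "mentions": [
    {"start": 0, "end": 1, "entity_mid": "m/0dl7c", "type": "IDEOLOGY"},
    {"start": 3, "end": 5, "entity_mid": "m/05qt0", "type": "FIELD_OF_STUDY"}
  ]
}""")

def mention_surface(record, mention):
    """Recover the surface text of a labeled mention from its token span."""
    return " ".join(record["tokens"][mention["start"]:mention["end"]])

# Mention-entity pairs like these are what the derived co-occurrence
# counts in the dataset are aggregated over.
pairs = [(mention_surface(record, m), m["entity_mid"]) for m in record["mentions"]]
```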

Related publications:

a. Nemanja Spasojevic, Preeti Bhargava, Guoning Hu, DAWT: Densely Annotated Wikipedia Texts across multiple languages, Wiki workshop 2017 colocated with WWW'2017 (Wiki'17) [PDF]

2. High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data

Contributors: Preeti Bhargava, Nemanja Spasojevic, Guoning Hu


The Entity Disambiguation and Linking (EDL) task matches entity mentions in text to a unique Knowledge Base (KB) identifier such as a Wikipedia or Freebase id. It plays a critical role in the construction of a high quality information network, and can be further leveraged for a variety of information retrieval and NLP tasks such as text categorization and document tagging. EDL is a complex and challenging problem due to the ambiguity of mentions and the multi-lingual nature of real-world text. Moreover, EDL systems need to have high throughput and should be lightweight in order to scale to large datasets and run on off-the-shelf machines. More importantly, these systems need to be able to extract and disambiguate dense annotations from the data in order to enable an Information Retrieval or Extraction task running on the data to be more efficient and accurate. In order to address all these challenges, we present the Lithium EDL system and algorithm - a high-throughput, lightweight, language-agnostic EDL system that extracts and correctly disambiguates 75% more entities than state-of-the-art EDL systems and is significantly faster than them.
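A toy illustration of the core disambiguation idea (not the Lithium EDL algorithm itself): score candidate entities for an ambiguous mention by how often the mention-entity pair co-occurs in annotated text, which is exactly the kind of signal the DAWT co-occurrence counts above provide. The counts and mids below are invented:

```python
from collections import defaultdict

# Hypothetical mention -> {candidate entity mid: co-occurrence count}.
# "jaguar" could refer to the animal, the car brand, or an OS release.
cooccurrence = defaultdict(dict)
cooccurrence["jaguar"] = {"m/0449p": 80, "m/012x34": 15, "m/027tk": 5}

def disambiguate(mention, counts):
    """Pick the candidate with the highest co-occurrence prior."""
    candidates = counts.get(mention.lower())
    if not candidates:
        return None
    total = sum(candidates.values())
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / total

entity, prior = disambiguate("Jaguar", cooccurrence)
```

Real EDL systems combine such priors with contextual evidence; this sketch shows only the prior term.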

Related publications:

a. Preeti Bhargava, Nemanja Spasojevic, Guoning Hu, High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data, LDOW workshop 2017 colocated with WWW'2017 (LDOW'17) [PDF]

3. Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text on Social Media

Contributors: Nemanja Spasojevic, Preeti Bhargava, Guoning Hu, Sarah Ellinger, Prantik Bhattacharyya


In this paper, we describe the Lithium Natural Language Processing (NLP) system - a resource-constrained, high-throughput and language-agnostic system for information extraction from noisy user generated text on social media. Lithium NLP extracts a rich set of information including entities, topics, hashtags and sentiment from text. We discuss several real world applications of the system currently incorporated in Lithium products. We also compare our system with existing commercial and academic NLP systems in terms of performance, information extracted and languages supported. We show that Lithium NLP is on par with, and in some cases outperforms, state-of-the-art commercial NLP systems.
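The system described above runs several extraction stages over each document. A schematic sketch of that multi-stage annotation pattern follows; the stage names and the trivial lexicon-based sentiment logic are purely illustrative, not the actual Lithium NLP implementation:

```python
import re

def extract_hashtags(doc):
    """Annotate the document with any #hashtags found in its text."""
    doc["hashtags"] = re.findall(r"#(\w+)", doc["text"])
    return doc

def extract_sentiment(doc):
    """Trivial lexicon-based polarity score, for illustration only."""
    pos, neg = {"love", "great"}, {"hate", "awful"}
    words = set(doc["text"].lower().split())
    doc["sentiment"] = len(words & pos) - len(words & neg)
    return doc

def run_pipeline(text, stages):
    """Pass a shared document object through each stage in order."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

doc = run_pipeline("I love the new #Galaxy phone", [extract_hashtags, extract_sentiment])
```

The shared-document design lets later stages (e.g. topic assignment) reuse annotations produced by earlier ones (e.g. linked entities).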

Related publications:

a. Preeti Bhargava, Nemanja Spasojevic, Guoning Hu, Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text, EMNLP Workshop on Noisy User Generated Text (WNUT'17) [PDF]

b. Lithium Engineering blogpost - Natural Language Processing: Our take

c. Sarah Ellinger, Prantik Bhattacharyya, Preeti Bhargava, Nemanja Spasojevic, Klout Topics for Modeling Interests and Expertise of Users Across Social Networks, arXiv preprint [PDF] 

4. Analyzing users' sentiment towards popular consumer industries and brands on Twitter

Contributors: Nemanja Spasojevic, Preeti Bhargava, Guoning Hu, Sarah Ellinger, Saul Fuhrmann


Social media serves as a unified platform for users to express their thoughts on subjects ranging from their daily lives to their opinion on consumer brands and products. These users wield enormous influence in shaping the opinions of other consumers, affecting brand perception, brand loyalty and brand advocacy. In this paper, we analyze the opinion of 19M Twitter users towards 62 popular industries, encompassing 12,898 enterprise and consumer brands, as well as associated subject matter topics, via sentiment analysis of 330M tweets over a period spanning a month. We find that users tend to be most positive towards manufacturing and most negative towards service industries. In addition, they tend to be more positive or negative when interacting with brands than generally on Twitter. We also find that sentiment towards brands within an industry varies greatly and we demonstrate this using two industries as use cases. In addition, we discover that there is no strong correlation between topic sentiments of different industries, demonstrating that topic sentiments are highly dependent on the context of the industry that they are mentioned in. We demonstrate the value of such an analysis in order to assess the impact of brands on social media. We hope that this initial study will prove valuable for both researchers and companies in understanding users' perception of industries, brands and associated topics and encourage more research in this field.
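A minimal sketch of the aggregation step implied by the analysis above: averaging per-tweet sentiment scores by industry (and analogously by brand). The tweet records, brand names, and scores below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical per-tweet sentiment scores in [-1, 1], already attributed
# to a brand and its industry by upstream entity linking.
tweets = [
    {"brand": "AcmeCars", "industry": "Automotive", "sentiment": 0.6},
    {"brand": "AcmeCars", "industry": "Automotive", "sentiment": 0.2},
    {"brand": "ZenAir",   "industry": "Airlines",   "sentiment": -0.4},
]

def mean_sentiment_by(key, tweets):
    """Average sentiment grouped by an arbitrary field (brand, industry, ...)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for t in tweets:
        sums[t[key]] += t["sentiment"]
        counts[t[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

by_industry = mean_sentiment_by("industry", tweets)
```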

Related publications:

a. Guoning Hu, Preeti Bhargava, Saul Fuhrmann, Sarah Ellinger, Nemanja Spasojevic, Analyzing users' sentiment towards popular consumer industries and brands on Twitter, ICDM Workshop on Sentiment Elicitation from Natural Text for Information Retrieval and Extraction (SENTIRE 2017) [PDF]

5. Learning to Map Wikidata Entities to Predefined Topics

Contributors: Preeti Bhargava, Nemanja Spasojevic, Sarah Ellinger, Adithya Rao, Abhinand Menon, Saul Fuhrmann, Guoning Hu


Recently much progress has been made in entity disambiguation and linking systems (EDL). Given a piece of text, EDL links words and phrases to entities in a knowledge base, where each entity defines a specific concept. Although extracted entities are informative, they are often too specific to be used directly by many applications. These applications usually require text content to be represented with a smaller set of predefined concepts or topics, belonging to a topical taxonomy, that matches their exact needs. In this study, we aim to build a system that maps Wikidata entities to such predefined topics. We explore a wide range of methods that map entities to topics, including GloVe similarity, Wikidata predicates, Wikipedia entity definitions, and entity-topic co-occurrences. These methods often predict entity-topic mappings that are reliable, i.e., have high precision, but tend to miss most of the mappings, i.e., have low recall. Therefore, we propose an ensemble system that effectively combines individual methods and yields much better performance, comparable with human annotators.
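A simplified sketch of the ensemble idea described above: each individual mapper is high-precision but low-recall, so pooling their predictions (here, a plain vote count with a configurable threshold) recovers mappings any single method would miss. The mapper outputs, topic names, and the use of Wikidata id "Q2539" below are invented for illustration, and the real system combines methods far less naively:

```python
def ensemble(entity, mappers, min_votes=1):
    """Collect topic predictions from all mappers; keep topics that
    receive at least `min_votes` votes."""
    votes = {}
    for mapper in mappers:
        for topic in mapper(entity):
            votes[topic] = votes.get(topic, 0) + 1
    return {t for t, v in votes.items() if v >= min_votes}

# Two stand-in mappers: one embedding-similarity based, one predicate based.
glove_mapper = lambda e: {"machine-learning"} if e == "Q2539" else set()
predicate_mapper = lambda e: {"machine-learning", "statistics"} if e == "Q2539" else set()

# With min_votes=1 the ensemble is a union, trading some precision for recall;
# raising min_votes moves the trade-off back toward precision.
topics = ensemble("Q2539", [glove_mapper, predicate_mapper], min_votes=1)
```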

Related publications:

a. Preeti Bhargava, Nemanja Spasojevic, Sarah Ellinger, Adithya Rao, Abhinand Menon, Saul Fuhrmann, Guoning Hu, Learning to Map Wikidata Entities to Predefined Topics, Wiki workshop 2019 colocated with WWW'2019 (Wiki'19)