Rich information extraction from noisy user generated text

Projects:

1. DAWT: Densely Annotated Wikipedia Texts across multiple languages

Contributors: Nemanja Spasojevic, Preeti Bhargava, Guoning Hu

Abstract:

In this work, we open up the DAWT dataset - Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of the entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor-text-to-entity links than originally present in the Wikipedia markup. Moreover, it spans several languages including English, Spanish, Italian, German, French and Arabic. We also present the methodology used to generate the dataset, which enriches Wikipedia markup in order to increase the number of links. In addition to the main dataset, we open up several derived datasets including mention-entity co-occurrence counts and entity embeddings, as well as mappings between Freebase ids and Wikidata item ids. We also discuss two applications of these datasets and hope that opening them up will prove useful for the Natural Language Processing and Information Retrieval communities, as well as facilitate multi-lingual research.

Related publications:

a. Nemanja Spasojevic, Preeti Bhargava, Guoning Hu, DAWT: Densely Annotated Wikipedia Texts across multiple languages, Wiki workshop 2017 colocated with WWW'2017 (Wiki'17) [PDF]

2. High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data

Contributors: Preeti Bhargava, Nemanja Spasojevic, Guoning Hu

Abstract:

The Entity Disambiguation and Linking (EDL) task matches entity mentions in text to a unique Knowledge Base (KB) identifier such as a Wikipedia or Freebase id. It plays a critical role in the construction of a high quality information network, and can be further leveraged for a variety of information retrieval and NLP tasks such as text categorization and document tagging. EDL is a complex and challenging problem due to ambiguity of the mentions and real world text being multi-lingual. Moreover, EDL systems need to have high throughput and should be lightweight in order to scale to large datasets and run on off-the-shelf machines. More importantly, these systems need to be able to extract and disambiguate dense annotations from the data in order to enable an Information Retrieval or Extraction task running on the data to be more efficient and accurate. In order to address all these challenges, we present the Lithium EDL system and algorithm - a high-throughput, lightweight, language-agnostic EDL system that extracts and correctly disambiguates 75% more entities than state-of-the-art EDL systems and is significantly faster than them.

Related publications:

a. Preeti Bhargava, Nemanja Spasojevic, Guoning Hu, High-Throughput and Language-Agnostic Entity Disambiguation and Linking on User Generated Data, LDOW workshop 2017 colocated with WWW'2017 (LDOW'17) [PDF]

3. Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text on Social Media

Contributors: Nemanja Spasojevic, Preeti Bhargava, Guoning Hu

Abstract:

In this paper, we describe the Lithium Natural Language Processing (NLP) system - a resource-constrained, high-throughput and language-agnostic system for information extraction from noisy user generated text on social media. Lithium NLP extracts a rich set of information including entities, topics, hashtags and sentiment from text. We discuss several real world applications of the system currently incorporated in Lithium products. We also compare our system with existing commercial and academic NLP systems in terms of performance, information extracted and languages supported. We show that Lithium NLP is on par with, and in some cases outperforms, state-of-the-art commercial NLP systems.

Related publications:

a. Preeti Bhargava, Nemanja Spasojevic, Guoning Hu, Lithium NLP: A System for Rich Information Extraction from Noisy User Generated Text, EMNLP Workshop on Noisy User Generated Text (WNUT'17) [PDF]

4. Analyzing users' sentiment towards popular consumer industries and brands on Twitter

Contributors: Nemanja Spasojevic, Preeti Bhargava, Guoning Hu

Abstract:

Social media serves as a unified platform for users to express their thoughts on subjects ranging from their daily lives to their opinion on consumer brands and products. These users wield an enormous influence in shaping the opinions of other consumers and influence brand perception, brand loyalty and brand advocacy. In this paper, we analyze the opinion of 19M Twitter users towards 62 popular industries, encompassing 12,898 enterprise and consumer brands, as well as associated subject matter topics, via sentiment analysis of 330M tweets over a period spanning a month. We find that users tend to be most positive towards manufacturing and most negative towards service industries. In addition, they tend to be more positive or negative when interacting with brands than generally on Twitter. We also find that sentiment towards brands within an industry varies greatly and we demonstrate this using two industries as use cases. In addition, we discover that there is no strong correlation between topic sentiments of different industries, demonstrating that topic sentiments are highly dependent on the context of the industry that they are mentioned in. We demonstrate the value of such an analysis in order to assess the impact of brands on social media. We hope that this initial study will prove valuable for both researchers and companies in understanding users’ perception of industries, brands and associated topics and encourage more research in this field.

Related publications:

a. Guoning Hu, Preeti Bhargava, Saul Fuhrmann, Sarah Ellinger, Nemanja Spasojevic, Analyzing users’ sentiment towards popular consumer industries and brands on Twitter, ICDM Workshop on Sentiment Elicitation from Natural Text for Information Retrieval and Extraction (SENTIRE 2017) [PDF]

