JoBimText Tutorial

The JoBimText Tutorial took place at the NLDB 2015 in Passau, on June 16th.
Many thanks to all the participants for their interest, collaboration and interesting questions!
If you have any questions, you can get in touch with us via the Language Technology group's people page.



Motivation

With the ever-increasing amount of available data, NLP methods have to be adapted to cope with it. But even with large amounts of data, many NLP applications suffer from sparsity problems due to the long-tail distribution of linguistic elements. This is one of the problems where semantic methods can provide a solution.

Semantic methods do not consider a text to be simply a stream of words; they try to identify and use the underlying semantic structure. This structure between language elements such as words can take the form of semantic relatedness or hyponymy relations. A graph of word relations – a distributional thesaurus – can help to identify this structure in text [1], which allows analyzing text more accurately, e.g. discovering the correct word senses (word sense disambiguation).

Therefore, we will introduce the JoBimText framework [2], a collection of methods for Distributional Semantics. The framework is designed to handle web-scale data and makes use of distributed computing with Apache Hadoop and Apache Pig. It provides tools to compute lexical resources for specific domains, where a general-domain distributional thesaurus would not bring improvements.

The framework is flexible: it allows arbitrary relations between words to be used for the similarity computation (word co-occurrence, dependency graph, etc.). Furthermore, it keeps the evidence for the similarity scores in the resulting JoBimText model, so that the score computation stays transparent. Such a model can be used for cognitive computing applications, e.g. semantic parsing. The models our framework computes consist of similarities between terms as well as context features, resulting in a distributional thesaurus and sense clusters that are labeled with isa-patterns.
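To give an impression, a model entry can be pictured as follows; the terms, clusters and scores shown here are purely illustrative:

    jaguar#NN  similar terms:  leopard#NN (253), tiger#NN (241), convertible#NN (102), ...
    jaguar#NN  sense clusters: {leopard#NN, tiger#NN, ...} isa: cat, animal
                               {convertible#NN, roadster#NN, ...} isa: car, vehicle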

Researchers without the computational means to process large data collections can use a variety of existing resources. A number of pre-computed models are available for download or for usage through JoBimViz, a showcase application to aid semantic parsing and lexical expansion. We will present applications that utilize the web API for quick development without having to deal with data storage and access.
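For example, a first query against the web API can be as simple as the following Java sketch. The base URL and path scheme used here are assumptions for illustration only; please consult the JoBimViz page for the actual endpoint of the deployed service.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Minimal sketch: querying the JoBimViz web API for terms similar to "mouse#NN".
    public class SimilarTermsDemo {
        public static void main(String[] args) throws Exception {
            // Assumed endpoint for illustration; "#" in "mouse#NN" is URL-encoded as %23.
            URL url = new URL("http://maggie.lt.informatik.tu-darmstadt.de/jobimviz/ws/api"
                    + "/stanford/jo/similar/mouse%23NN?numberOfEntries=10&format=json");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON response: similar terms with their scores
            }
            in.close();
        }
    }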

Topics

Below is a tentative Table of Contents for the tutorial:

  • Motivation
  • What is Distributional Semantics?
  • Why do we need Distributional Semantics?
  • Why do we need large scaling methods?
  • Distributional Semantics and the JoBimText approach
    • Computation of distributional thesauri on large data
    • Computation of sense clusters
    • Extraction of Hearst patterns
    • Labeling of sense clusters with hypernym terms (ISA labeling)
  • How to use JoBimText
    • Using JoBimViz to get an impression of the models
    • How to compute a distributional thesaurus, sense clusters and Hearst patterns from scratch
    • Using the JoBimText API to access models and use them for prototyping
  • Applications for JoBimText
    • Usage for out-of-vocabulary problems
    • Usage for word sense disambiguation
    • Usage for contextualization

Target Audience

This tutorial targets researchers who want a broad introduction to the field of Distributional Semantics. As JoBimText is a Hadoop application, the tutorial will also provide an opportunity to learn about tools for large-scale data processing. Furthermore, participants will use the framework hands-on, which includes computing new models with JoBimText. A main focus will also be on using the JoBimText API to access models. Thus, the tutorial is aimed at researchers who want to learn about Distributional Semantics and integrate semantic methods into their applications via an API.
The audience should bring laptops with Java installed, so that they can try out the methods in practice. Those who also want to compute models should have the provided virtual machine (see Resources) installed. For accessing JoBimText models, we will provide an Eclipse project with the necessary libraries included.
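As a preview, accessing a model through the API follows the pattern below. This is a minimal sketch along the lines of the examples shipped with JoBimText; the configuration file name and the exact class and method signatures are assumptions and may differ in the version included in the Eclipse project.

    import java.util.List;
    import org.jobimtext.api.struct.IThesaurusDatastructure;
    import org.jobimtext.api.struct.Order2;
    import org.jobimtext.api.struct.WebThesaurusDatastructure;

    public class DtLookup {
        public static void main(String[] args) {
            // Configuration file name is an assumption; use the one from the Eclipse project.
            IThesaurusDatastructure<String, String> dt =
                    new WebThesaurusDatastructure("conf_web_stanford.xml");
            dt.connect();
            // Retrieve the ten most similar terms for "mouse#NN" from the model.
            List<Order2> similar = dt.getSimilarTerms("mouse#NN", 10);
            for (Order2 entry : similar) {
                System.out.println(entry.key + "\t" + entry.score);
            }
            dt.destroy();
        }
    }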

Presenter

Martin Riedl: Martin Riedl is a PhD candidate at the Language Technology group at TU Darmstadt. He holds a diploma and an MSc in Computer Science from Hochschule Mannheim, Germany. His main research focuses on Distributional Semantics in the field of Natural Language Processing. Furthermore, he is interested in Machine Learning, especially in unsupervised methods. He is one of the main developers of the open-source framework JoBimText, which is developed in cooperation with IBM Watson DeepQA.

  • Homepage: http://www.lt.tu-darmstadt.de/people/martin-riedl/
  • Email: riedl@cs.tu-darmstadt.de

Eugen Ruppert: Eugen Ruppert is a PhD candidate at the Language Technology group at TU Darmstadt. He holds an MSc in Computational Linguistics from Heidelberg University, Germany. His research focuses on creating ontologies from text with Distributional Semantics and projecting this knowledge back into the text. Eugen also maintains the JoBimText project and is responsible for the project documentation.

  • Homepage: http://www.lt.tu-darmstadt.de/people/eugen-ruppert/
  • Email: ruppert@lt.informatik.tu-darmstadt.de

Length

The tutorial is organized as a half-day tutorial (3 hours): half an hour is devoted to motivation and theory, two hours let the audience use the technology and learn how to compute new models, and the last half hour is used for showing applications where the models are used and for a discussion.

References

[1] Harris, Z.: Distributional Structure. Word 10(2-3):146–162 (1954).
[2] Biemann, C., Riedl, M.: Text: Now in 2D! A Framework for Lexical Expansion with Contextual Similarity. Journal of Language Modelling 1(1):55–95 (2013).