Workshop on Linguistics and Computation

Date: Tuesday 4 June 2013

Place: 8103 (next to CSE lunch room, 7th floor, Rännvägen 6b), Chalmers, Gothenburg

Structure: in the morning, 40 min talk + 10 min discussion + 10 min for coffee or tea (available all day in the lunch room); in the afternoon, 30 + 5 + 10.

This workshop is organized in connection with the PhD defence of Shafqat Virk, which takes place on the previous day at 10:00 in lecture hall HC2. It will bring together the guests participating in the defence as examiners, as well as some local speakers who have worked on related topics. See http://publications.lib.chalmers.se/publication/176382 for information about the defence.

Everyone is welcome - no registration is needed!

Programme

9:00 Pushpak Bhattacharyya (IIT Bombay). 
Together We Can: Cooperative Natural Language Processing

10:00 Gérard Huet (INRIA Paris-Rocquencourt). 
Design of a lean interface for Sanskrit corpus annotation.

11:00 Hans Leiss (LMU Munich).
Learning context-free grammars having the finite context property.

--------------------------
12:00 Lunch at restaurant Konkanok
--------------------------

13:30 Devdatt Dubhashi (Chalmers University of Technology, joint work with Gabriele Capannini and Svetoslav Marinov).
Document summarization: a knowledge-based approach using multiple kernels

14:15 Markus Forsberg (University of Gothenburg).
A Swedish LT resource network.

15:00 Krasimir Angelov (University of Gothenburg). 
Robust and statistical parsing in GF.

15:45 Aarne Ranta (University of Gothenburg). 
Abstract syntax, Finnish, and the languages of the world.




Abstracts

Pushpak Bhattacharyya.

Title: Together We Can: Cooperative Natural Language Processing

Abstract: 
NLP today is predominantly data driven. Machine learning applied to annotated language data is almost the norm for performing interesting and important NLP tasks at various levels of complexity, from part-of-speech (POS) tagging to semantic role labeling and sentiment analysis. Annotation, however, is usually an expensive proposition. In this presentation, we make a case for resource-shared multilingual computation, with examples of language adaptation in NLP. We focus on word sense disambiguation (WSD), describing our work on "projection of parameters from one language to another" in three settings of "complete", "some" and "no" annotation. This helps perform WSD with reduced language resources. The last scenario of "no annotation", i.e., the unsupervised setting, is tackled by an interesting expectation-maximization (EM) formulation.

Besides resource reuse, language-adapted NLP helps collate evidence from multiple languages for better performance, for example in search. Multilingual pseudo-relevance feedback (PRF) has been shown to be better than monolingual PRF in our recent work. We will touch upon this.

Finally, cross-lingual techniques prove effective for resource reuse in NLP. We will end the presentation with a discussion of the progress we have made in cross-lingual sentiment analysis.

The presentation is based on work done with PhD and Master's students and researchers: Rajat, Mitesh, Salil, Manoj, Karthik, Bala, Aditya and many others, and published in fora such as ACL, COLING, EMNLP and SIGIR.

---------------------------------

Gérard Huet.
Title: Design of a lean interface for Sanskrit corpus annotation.

Abstract
We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for the proper tagging of Sanskrit corpus texts. This interface has been implemented and is being applied to the annotation of the Sanskrit Library corpus.




---------------------------------

Hans Leiss.
Title: Learning context-free grammars having the finite context property.

Abstract: A main goal of structural linguists, like Z. Harris, was to develop methods for analysing and describing languages in terms of the distribution of words and expressions within sentences. But the regular languages are the only ones that have a grammar based on finitely many distribution classes as its primitive linguistic notions. Recently, A. Clark has suggested exploiting the Galois correspondence between string sets and sets of string contexts and describing languages in terms of 'syntactic concepts', a generalization of distribution classes. He proposed an algorithm to learn a context-free grammar for a language L from an enumeration of L and finitely many membership queries, provided there exists a context-free grammar for L whose nonterminals are syntactic concepts defined by finitely many contexts. We review the theory of syntactic concepts, demonstrate that Clark's algorithm does not work, and discuss possibilities for a correction.
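The Galois correspondence mentioned above can be made concrete on a toy example. The sketch below (our own illustration, not Clark's learning algorithm; all names are hypothetical) computes, over a small finite sample of the language {a^n b^n}, the two maps of the correspondence: from a set of strings to the contexts they all share, and back from a set of contexts to the strings that fit them all. A set of strings closed under this round trip is a 'syntactic concept'.

```python
# Illustrative sketch of the Galois connection between string sets and
# context sets, over a small finite language sample.

SAMPLE = {"ab", "aabb", "aaabbb"}  # a fragment of {a^n b^n}

def contexts(strs, sample=SAMPLE):
    """All contexts (l, r) occurring in the sample such that l+s+r is in
    the sample for every string s in strs."""
    ctxs = {(w[:i], w[j:]) for w in sample
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    return {(l, r) for (l, r) in ctxs
            if all(l + s + r in sample for s in strs)}

def strings(ctxs, sample=SAMPLE):
    """All substrings s of the sample such that l+s+r is in the sample
    for every context (l, r) in ctxs."""
    subs = {w[i:j] for w in sample
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    return {s for s in subs
            if all(l + s + r in sample for (l, r) in ctxs)}

# A syntactic concept is a Galois-closed string set: strings(contexts(S)) == S.
S = {"ab"}
closure = strings(contexts(S))  # here the closure is again {"ab"}
```

Here contexts({"ab"}) yields ("", ""), ("a", "b") and ("aa", "bb"), and "ab" is the only substring of the sample that fits all three, so {"ab"} is closed and thus a concept of this finite sample.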

----------------------------------

Markus Forsberg.
A Swedish LT resource network

We will give a brief background and overview of our work on creating an interconnected LT resource network for Swedish at Språkbanken. The network is huge: it links text material of a couple of billion words to 22 lexical resources consisting of around 700k entries. The lexical resources contain varying kinds of linguistic information. For example, some contain only morphological descriptions, while others contain syntactic and semantic information. Some are digitized historical paper dictionaries, while others are pure LT resources.

---------------------------------

Krasimir Angelov.
Robust and statistical parsing in GF

Linguists have developed a number of sophisticated frameworks for describing natural languages, but, despite all this effort, there is no complete description for any of the languages of the world. Because of this problem, the traditional linguistic frameworks have slowly been displaced by statistical methods, which allow processing algorithms to be learned from plain data. The main advantage of these methods is that they are more robust and that, given enough data, they can learn things that grammarians might miss.

Grammatical Framework (GF) is one of these sophisticated frameworks, and we have used it to accumulate a lot of linguistic knowledge for more than 20 languages. The question is whether it can be made robust enough that we can analyze naturally occurring text. We will go through the current state of GF as a hybrid system which combines the manually accumulated linguistic knowledge with automatically acquired statistical evidence.

---------------------------------

Aarne Ranta.
Abstract Syntax, Finnish, and the Languages of the World

Computational linguistics has for the last two decades been dominated by statistical methods. Such methods are expected to make linguistic rule writing unnecessary, since language processing can be performed by means of statistical models learnt from raw language data. These expectations have been confirmed for many processing tasks - in particular, for languages like English, which has simple morphology, strict word order, and a lot of data available. However, languages like Finnish, which lack these advantages, suffer from serious data sparseness: the available data cannot cover all the variation in the language. Hence tasks like machine translation into Finnish are very hard with standard statistical methods.

To cope with sparse data, abstractions are needed. For instance, Finnish words, which appear in thousands of surface forms (when counting all inflections and suffixes), can be seen in a more abstract way as pairs of lemmas and morphological descriptions. Thus when building a statistical model, there is no need to find data separately for "yö", "yön", "öitä", "öinämme" (different forms of "night"), but any of the word forms can represent all the others, and the descriptions (singular nominative, plural essive possessive, etc.) can be treated separately. In machine translation, this leads to so-called factored models, which combine statistics with linguistic knowledge in an efficient way.
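The factoring idea can be sketched in a few lines. In the toy example below, the lexicon and morphological tags are hand-written for illustration only (not an actual analyser): each surface form of "yö" maps to a (lemma, description) pair, so statistics gathered for any of the forms can be pooled under the single lemma while the descriptions are modelled separately.

```python
# Toy illustration of the lemma + morphological-description abstraction.
# The lexicon below is hand-written for this example; a real system would
# use a morphological analyser covering the full inflection tables.

LEXICON = {
    "yö":      ("yö", "singular nominative"),
    "yön":     ("yö", "singular genitive"),
    "öitä":    ("yö", "plural partitive"),
    "öinämme": ("yö", "plural essive + 1st-plural possessive"),
}

def analyse(form):
    """Factor a surface form into its lemma and morphological description."""
    return LEXICON[form]

# All four surface forms share one lemma, so counts for any of them can
# represent all the others in a factored statistical model.
lemmas = {analyse(f)[0] for f in LEXICON}
```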

Abstract syntax is a method that can likewise reduce the number of syntactic structures, by abstracting away from variations due to tense, word order, etc., which can again lead to thousands of variations in Finnish (e.g. "Matti joi piimää", "piimääkö Matti on juonut", which are different forms of "Matti drinks buttermilk"). This makes it possible to capture a large number of syntactic constructions with a small number of rules and, when statistical models are built, extract them from minimal data. The linguist's knowledge is crucially involved when defining the abstract structures and their realizations.
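As a rough picture of what such abstraction buys (a hand-written toy, not GF's actual linearization machinery; the verb forms and parameters are our own), one abstract predication tree can be realized as several of the Finnish surface variants mentioned above by varying tense and question focus:

```python
# Toy sketch: one abstract syntax tree, several surface realizations.
# Verb forms and parameter names are hand-written for this example only.

def linearize(tree, tense="past", question_focus=None):
    """Realize an abstract (subject, verb-forms, object) tree in Finnish."""
    subj, verb_forms, obj = tree
    verb = verb_forms[tense]
    if question_focus == "object":
        # Fronted, questioned object with the clitic -kö.
        return f"{obj}kö {subj} {verb}"
    return f"{subj} {verb} {obj}"

# Abstract tree for "Matti drinks buttermilk".
TREE = ("Matti", {"past": "joi", "perfect": "on juonut"}, "piimää")
```

Both "Matti joi piimää" and "piimääkö Matti on juonut" then come out of the same tree, so a statistical model built over the abstract trees needs far less data than one built over the surface strings.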

In the talk we will show how an abstract syntax deals with some of the complexities of Finnish and how it also enables cross-linguistic generalizations. The work belongs to a long-term research programme of GF (Grammatical Framework) Resource Grammar Library, which has developed computational syntax and morphology resources for 26 languages (http://www.grammaticalframework.org/lib/doc/synopsis.html).