
Self Organizing Notes: 9-15

Kate, R.J., Wong, Y. W., and Mooney, R.J. Learning to Transform Natural to Formal Languages. Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), Pittsburgh, PA, pp. 1062--1068, July 2005.

Introduction

This paper presents an inductive method that learns rules to translate natural language sentences to a formal language.

The authors present tree-based and string-based approaches that can work bottom-up or top-down. Their algorithm (SILT) was applied in two domains: RoboCup (CLANG) and GeoQuery.

This approach improves on slow and complex inductive logic programming approaches by mapping substrings or subtrees from the natural language directly to the target language, without requiring a much deeper understanding of the languages involved. This is accomplished by using the structure of the non-terminal symbols in the formal grammar and comparing these to patterns in the naturally occurring forms.

This approach assumes a formal grammar for the target representation. Specifically, it assumes a deterministic context-free grammar (an LR grammar), so that every formal expression has a unique parse tree.

The authors mention that there has been a lot of work in shallow thematic (case-role) analysis, and present their own approach as a point of contrast. Case roles can be understood as the distinct elements of a process, and can be used to more precisely specify elements in a text, e.g.

'you gave birth to Fred.'

you => 'subject' => 'agent' => 'mother' => 'Fred's mother'

Algorithm

To learn the rules, SILT pairs a training set (T) of natural language sentences with a set of productions (PI) from the formal grammar. The production rules lead to positive and negative examples, which are counted. While there are still uncovered positive examples, SILT keeps trying to find rules and adds them to the rule base (L).

Rules are applied by matching a rule's pattern against the sentence. The matched portion is replaced by the rule's left-hand-side non-terminal, and the corresponding right-hand side is remembered as an instantiation. Once replacement occurs, the pattern matching is run again against the rules and the LHS/RHS translation is applied again.

FindBestRules takes the production rules (PI) along with the positive and negative examples and tries to find the best, most specific rules (R) to add to the rule base (L). It does this by generalizing from a thousand random pairings of rules and adding the best of these to the set of specific rules (R).
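To fix my own understanding, here is a toy sketch of the covering loop in Python. This is my reconstruction, not the authors' code: rules are reduced to bare string patterns, covers() is just substring matching, and find_best_rule() is a stand-in for FindBestRules (no random pairing or generalization here).

```python
def goodness(pos, neg):
    """The paper's rule-scoring measure: pos(r)^2 / (pos(r) + neg(r))."""
    return pos ** 2 / (pos + neg) if pos + neg else 0.0

def covers(pattern, sentence):
    # toy notion of coverage: the rule's pattern occurs in the sentence
    return pattern in sentence

def find_best_rule(candidates, positives, negatives):
    # stand-in for FindBestRules: score every candidate pattern on the
    # current examples and keep the highest-scoring one
    scored = [(goodness(sum(covers(c, s) for s in positives),
                        sum(covers(c, s) for s in negatives)), c)
              for c in candidates]
    best_score, best = max(scored)
    return best if best_score > 0 else None

def covering_loop(candidates, positives, negatives):
    # SILT's outer loop: keep learning rules while positive examples
    # remain uncovered, removing the ones each new rule covers
    rule_base, uncovered = [], list(positives)
    while uncovered:
        best = find_best_rule(candidates, uncovered, negatives)
        if best is None:
            break  # no remaining candidate covers anything useful
        rule_base.append(best)
        uncovered = [s for s in uncovered if not covers(best, s)]
    return rule_base
```

With GeoQuery-flavored toy data, `covering_loop(["river", "texas", "capital"], positives, negatives)` picks the pattern that covers the most positives with the fewest negatives first.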

The string-based Generalize works by making the longest matched strings possible with the smallest gaps (to handle non-compositionality, etc.). The tree-based version returns the largest common subgraphs, without allowing gaps. The resulting generalized rules are evaluated using a goodness measure:

pos(r)^2 / (pos(r) + neg(r))
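A crude illustration of the string-based Generalize, using Python's difflib to keep the maximal common word runs shared by two sentences and mark the differing stretch as a gap. The "&lt;GAP&gt;" token and the use of difflib are my simplifications, not the paper's algorithm:

```python
from difflib import SequenceMatcher

def generalize(a, b):
    # keep the longest common word runs shared by both sentences,
    # inserting a gap marker wherever they differ
    wa, wb = a.split(), b.split()
    matches = SequenceMatcher(None, wa, wb).get_matching_blocks()
    parts = []
    for m in matches:
        if m.size == 0:
            continue  # skip the final zero-length sentinel block
        if parts:
            parts.append("<GAP>")
        parts.append(" ".join(wa[m.a:m.a + m.size]))
    return " ".join(parts)
```

So two CLANG-style sentences differing only in the player number generalize to a single gapped pattern, e.g. "our player &lt;GAP&gt; has the ball".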

Complications: Non-Compositionality

"...non-compositionality, in which the phrases in a sentence do not have a one-to-one correspondence with subexpressions of the formal representation."

Which occurs in cases like:

"player 4 has the ball in REGION"

Here a parser has to recover both the ball's owner and the ball's location from the single phrase "the ball."

SILT handles these using context clues or multiple production rules (described at the bottom left of page 3). But as I understood it, this could also be a problem when naturally constructed sentences do not precisely follow any of the rules in the formal language. I'd imagine this would be more of a problem than the authors allow for. Their techniques for controlling these kinds of problems do seem to handle filler words, but what about honest differences in structure?

Complications: Completeness of Translation

"The second one is finding a set of rules that cooperate with each other to produce complete, correct formal language translations. This is not trivial because rules are learned in a particular order."

SILT approaches this problem by overproducing sets of rules and then pruning them to include only those that cooperate well. The other approach is to include only fairly case-specific rules (nothing overly general), so there is less confusion about which rule is appropriate.

Methodology, or The Impenetrable Paragraph

SILT was evaluated using standard 10-fold cross validation. Syntactic parses required by SILT's tree-based version were generated by Collins' parser (Bikel 2004), trained with the WSJ treebank and gold-standard parse trees of the training sentences.
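For my own reference, 10-fold cross validation just partitions the data into ten folds and holds each one out in turn as the test set. A generic sketch, nothing SILT-specific:

```python
def k_fold_splits(n_examples, k=10):
    # partition example indices into k folds; each fold is held out
    # once as the test set while the remaining folds form the training set
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    return [(sorted(set(range(n_examples)) - set(test)), test)
            for test in folds]
```

Reported precision/recall figures are then averages over the ten held-out folds.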

Why is recall so low for the CLANG corpus?

Why is the tree version so much slower - especially for CLANG?

Things I had to look up...

Case-role analysis

10-fold cross validation

Collins' Parser

WSJ treebank

gold-standard parse trees