Here is a list of Ph.D. students for whom I was the de facto dissertation adviser and played a significant role in the technical development of the ideas. I like co-advising. Co-advisors on these dissertations include Eric Fosler-Lussier (Rytting, Weale and Heintz), Mary Beckman (Yoon), Richard Shillcock (McDonald), Michael White (Hovermale, Mehay), Alex Lascarides (Lapata), William Schuler (Mehay) and Hans Kamp (Schulte im Walde).
Dennis Mehay
Bean Soup Translation: Flexible, Linguistically-motivated Syntax for Machine Translation
The main contribution of this thesis is to use the flexible syntax of Combinatory Categorial Grammar [CCG, Steedman, 2000] as the basis for deriving syntactic constituent labels for target strings in phrase-based systems, providing CCG labels for many target strings that traditional syntactic theories struggle to describe. These CCG labels are used to train novel syntax-based reordering and language models, which efficiently describe translation reordering patterns and assess the grammaticality of target translations. The models are easily incorporated into phrase-based systems with minimal disruption to existing technology, and they achieve superior automatic metric scores and human evaluation ratings over a strong phrase-based baseline, as well as over syntax-based techniques that do not use CCG.
DJ Hovermale
Erron: A Phrase-Based Machine Translation Approach to Customized Spelling Correction
Spellcheckers such as ASPELL and Microsoft Word 2007 were designed to correct the spelling errors of native writers of English (NWEs). While they are widely used and fairly effective at this task, they do not perform well when used to correct the spelling errors in English text written by Japanese writers of English as a foreign language (JWEFLs). The first contribution of this thesis is a comprehensive analysis of JWEFL spelling errors, with the goal of finding differences from NWE spelling errors that would explain such poor spellchecker performance. In addition to describing the patterns of characteristic JWEFL errors, I also provide hypotheses for why each error is observed. I find that JWEFL errors are 2-3 times more likely to have a mistake in the first letter, and 4-5 times more likely to contain more than one mistake, than NWE spelling errors. These facts make it very difficult for widely-used English spellcheckers to correct JWEFL errors, because they are designed to exploit patterns in NWE spelling errors which simply are not present in Japanese learner text. I assert, however, that while JWEFL spelling errors do not follow the same patterns as NWE errors, they do display patterns that can be exploited to build a custom spellchecker for JWEFL users. This thesis describes the creation of a spellchecker customized for Japanese writers of English as a foreign language.
Stephen Boxwell (Battelle)
A CCG-Based Method for Training a Semantic Role Labeler in the Absence of Explicit Syntactic Training Data
Treebanks are a necessary prerequisite for many NLP tasks, including, but not limited to, semantic role labeling. For many languages, however, treebanks are either nonexistent or too small to be useful. Time-critical applications may require rapid deployment of natural language software for a new critical language, much faster than the development time of a traditional treebank allows. This dissertation describes a method for generating a treebank and training syntactic and semantic models using only semantic training information; that is, no human-annotated syntactic training data whatsoever. This will greatly increase the speed of development of natural language tools for new critical languages in exchange for a modest drop in overall accuracy. Using Combinatory Categorial Grammar (CCG) in concert with Propbank semantic role annotations, in combination with a partially hidden Markov model, allows us to accurately predict lexical categories. By training the Berkeley parser on our generated syntactic data, we can achieve SRL performance of 65.5% without using a treebank, as opposed to 74% using the same feature set with gold-standard data.
Tim Weale (DoD, Fort Meade)
Term Relatedness from Wiki-Based Resources Using Sourced PageRank
This dissertation concerns itself with creating a new algorithm for automatically measuring the amount of relatedness between a given pair of terms. Research into term relatedness is important because it has been empirically demonstrated that using relatedness metrics can improve the performance of tasks in Natural Language Processing and Information Retrieval by expanding the usable vocabulary. Previous relatedness metrics have used a variety of sources of semantic data to judge term relatedness, including text corpora, expertly-constructed resources and, most recently, Wikipedia and Wiktionary. The primary focus of this dissertation is the creation of a new metric for deriving term relatedness from the graph structure of Wikipedia and Wiktionary using Sourced PageRank, a modified version of the PageRank algorithm, to generate the relatedness values.
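A rough illustration of the flavor of this kind of metric (not the dissertation's exact algorithm): a personalized PageRank run that concentrates the teleportation mass on a single source term yields a score for every other term in the link graph, which can be read as relatedness to the source. The graph below is a made-up stand-in for the Wikipedia/Wiktionary link structure.

```python
# Rough sketch of a "sourced" (personalized) PageRank over a term link graph.
# Illustrative only; the actual Sourced PageRank modification is defined in
# the dissertation.
import networkx as nx

def sourced_pagerank(graph, source, damping=0.85):
    """Rank all terms by relatedness to `source` by concentrating
    the teleportation mass on the source node."""
    personalization = {node: 0.0 for node in graph}
    personalization[source] = 1.0
    return nx.pagerank(graph, alpha=damping, personalization=personalization)

# Hypothetical toy link graph between article titles.
g = nx.DiGraph()
g.add_edges_from([
    ("dog", "wolf"), ("wolf", "dog"), ("dog", "pet"),
    ("cat", "pet"), ("pet", "dog"), ("pet", "cat"),
])
scores = sourced_pagerank(g, "dog")
print(scores["cat"])   # relatedness of "cat" to the source term "dog"
```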
Ilana Heintz
Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition
The goal of this dissertation is to introduce a method for deriving morphemes from Arabic words using stem patterns, a feature of Arabic morphology. The motivations are three-fold: modeling with morphemes rather than words should help address the out-of-vocabulary problem; working with stem patterns should prove to be a cross-dialectally valid method for deriving morphemes using a small amount of linguistic knowledge; and the stem patterns should allow for the prediction of short vowel sequences that are missing from the text. The out-of-vocabulary problem is acute in Modern Standard Arabic due to its rich morphology, including a large inventory of inflectional affixes and clitics that combine in many ways to increase the rate of vocabulary growth. The problem of creating tools that work across dialects is challenging due to the many differences between regional dialects and formal Arabic, and because of the lack of text resources on which to train natural language processing (NLP) tools. The short vowels, while missing from standard orthography, provide information that is crucial to both acoustic modeling and grammatical inference, and therefore must be inserted into the text to train the most predictive NLP models. While other morpheme derivation methods exist that address one or two of the above challenges, none addresses all three with a single solution.
Kirk Baker (iCubed analytics)
Multilingual Distributional Lexical Similarity
This dissertation addresses the problem of learning an accurate and scalable lexical classifier in the absence of large amounts of hand-labeled training data. One approach to this problem involves using a rule-based system to generate large amounts of data that serve as training examples for a secondary lexical classifier. The viability of this approach is demonstrated for the task of automatically identifying English loanwords in Korean. A set of rules describing changes English words undergo when they are borrowed into Korean is used to generate training data for an etymological classification task. Although the quality of the rule-based output is low, on a sufficient scale it is reliable enough to train a classifier that is robust to the deficiencies of the original rule-based output and reaches a level of performance that has previously been obtained only with access to substantial hand-labeled training data. The second approach to the problem of obtaining labeled training data uses the output of a statistical parser to automatically generate lexical-syntactic co-occurrence features. These features are used to partition English verbs into lexical semantic classes, producing results on a substantially larger scale than any previously reported and yielding new insights into the properties of verbs that are responsible for their lexical categorization. The work here is geared towards automatically extending the coverage of verb classification schemes such as Levin, VerbNet, and FrameNet to other verbs that occur in a large text corpus.
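As a minimal sketch of the first approach, one can generate noisy labeled examples with crude hand-written rules and train a character n-gram classifier on them. The rules, seed words, and features below are invented placeholders (working on romanized strings), not those used in the dissertation.

```python
# Sketch: train a loanword classifier on examples produced by crude rules.
# The transliteration-style rule and the feature set are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def rule_based_positive(english_word):
    """Very crude stand-in for the borrowing rules: produce a
    loanword-like form from an English word."""
    return english_word.replace("th", "s").replace("f", "p") + "eu"

english_seed = ["computer", "coffee", "taxi", "internet", "orange"]
native_seed = ["hangang", "kimchi", "seoul", "arirang", "bibimbap"]

# Noisy, automatically generated training data.
examples = [rule_based_positive(w) for w in english_seed] + native_seed
labels = [1] * len(english_seed) + [0] * len(native_seed)

# Character n-gram features are robust to noise in the generated forms.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(examples, labels)
print(clf.predict(["peseutibal"]))  # hypothetical loanword-like string
```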
Jianguo Li (Motorola)
Hybrid Methods for Acquisition of Lexical Information: The Case for Verbs
Improved automatic text understanding requires detailed linguistic information about the words that comprise the text. Particularly crucial is the knowledge about predicates, typically verbs, which communicate both the event being expressed and how participants are related to the event. ... First, deriving Levin-style verb classifications from text corpora helps avoid the expensive hand-coding of such information, but appropriate features must be identified and demonstrated to be effective. One of our primary goals is to assess the linguistic conditions which are crucial for lexical classification of verbs. In particular, we experiment with different ways of mixing syntactic and lexical information for improved verb classification. Second, Levin verb classification provides a systematic account of verb polysemy. We propose a class-based method for disambiguating Levin verbs using only untagged data. The basic working hypothesis is that verbs in the same Levin class tend to share their subcategorization patterns as well as neighboring words. In practice, information about unambiguous verbs in a particular Levin class is employed to disambiguate the ambiguous ones in the same class. Last, automatically created verb classifications are likely to deviate from manually crafted ones, therefore it is of great importance to understand whether automatically created verb classifications can benefit the wider NLP community. We propose to integrate verb class information, automatically learned from text corpora, into a particular parsing task, PP-attachment disambiguation.
Anton Rytting (CASL)
Preserving subsegmental variation in modeling word segmentation (or, the raising of baby Mondegreen)
This dissertation addresses two issues in distributional models of infant word segmentation: whether the statistical cues proposed for English generalize to languages with different phonotactic structure, and whether such models can cope with the subsegmental variation present in real speech. It addresses the first issue by comparing the performance of two classes of distribution-based statistical cues on a corpus of Modern Greek, a language with a phonotactic structure significantly different from that of English, and shows how these differences change the relative effectiveness of the two classes of statistical heuristics compared to their performance in English. To address the second issue, this dissertation proposes an improved representation of the input that preserves the subsegmental variation inherently present in natural speech while maintaining sufficient similarity with previous models to allow for straightforward, meaningful comparisons of performance. The proposed input representation uses an automatic phone classifier to replace the transcription-based phone labels in a corpus of English child-directed speech with real-valued phone probability vectors. These vectors are then used to provide input for a previously proposed connectionist model of word segmentation, in place of the invariant, transcription-based binary input vectors used in the original model. The performance of the connectionist model as reimplemented here suggests that real-valued inputs present a harder learning task than idealized inputs. In other words, the subsegmental variation hinders the model more than it helps. This may help explain why English-learning infants soon gravitate toward other, potentially more salient cues, such as lexical stress. However, the model still performs above chance even with very noisy input, consistent with studies showing that children can learn from distributional segmental cues alone.
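The change of input representation can be pictured schematically as follows; the phone inventory and classifier scores are made up, and the connectionist segmentation model itself is the previously proposed one reimplemented in the dissertation.

```python
# Schematic contrast between the original transcription-based input
# (one-hot, invariant) and the proposed classifier-based input
# (real-valued phone posteriors). Inventory and numbers are illustrative.
import numpy as np

PHONES = ["p", "t", "k", "a", "i", "u"]

def one_hot(phone):
    """Idealized, transcription-based input: exactly one active unit."""
    vec = np.zeros(len(PHONES))
    vec[PHONES.index(phone)] = 1.0
    return vec

def classifier_posterior(frame_scores):
    """Proposed input: the automatic phone classifier's score vector for
    the same segment, normalized to a probability distribution."""
    scores = np.asarray(frame_scores, dtype=float)
    return scores / scores.sum()

print(one_hot("t"))                                           # [0. 1. 0. 0. 0. 0.]
print(classifier_posterior([0.1, 2.5, 0.4, 0.1, 0.1, 0.1]))   # noisy, but peaked at "t"
```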
Anna Feldman (Montclair State University)
Portable Language Technology: A Resource-Light Approach to Morpho-Syntactic Tagging
Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus. Morpho-syntactic tagging is an important step in natural language processing. Corpora that have been morphologically tagged are very useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in large corpora, and for further computational processing, such as syntactic parsing, speech recognition, stemming, and word-sense disambiguation, among others. Despite the importance of morphological tagging, there are many languages that lack annotated resources. This is almost inevitable because these resources are costly to create. But, as described in this thesis, it is possible to avoid this expense.
This thesis describes a method for transferring annotation from a morphologically annotated corpus of a source language to a corpus of a related target language. Unlike unsupervised approaches that do not require annotated data at all and, as a consequence, lack precision, the approach proposed in this dissertation relies on linguistic knowledge, but avoids large-scale grammar engineering. The approach needs neither a parallel corpus nor a bilingual lexicon, and requires much less linguistic labor than the standard technology.
This dissertation describes experiments with Russian, Czech, Polish, Spanish, Portuguese, and Catalan. However, the general method proposed can be applied to any fusional language.
Nathan Vaillette (MTM LinguaSoft)
Logical Specification of Finite-State Transductions for Natural Language Processing
This thesis is concerned with the use of a logical language for specifying mappings between strings of symbols; specifically, the regular relations, those which can be computed by finite-state transducers. Because of their efficiency and flexibility, regular relations and finite-state transducers are widely used in Natural Language Processing (NLP) for tasks such as grapheme-to-phoneme conversion, morphological analysis and generation, and shallow syntactic parsing. By exploiting logical representations for finite-state transductions, the technique advocated in this thesis combines efficient processing with the advantages of declarative specification, thus taking a step in the direction of providing finite-state NLP with the best of both worlds.
Previous work has demonstrated how all sets of strings recognized by finite-state automata can be described in monadic second-order logic. A formula of this logic describing a set can be automatically compiled into the finite-state automaton recognizing that set. This technique unfortunately does not carry over to relations on strings without further restrictions, since the class of regular relations lacks certain crucial closure properties. In this thesis we introduce the logical language MSO(SLR), a language for same-length relations, a proper subset of the regular relations which has the necessary closure properties. We discuss how a formula of MSO(SLR) describing a relation can be automatically compiled into the finite-state transducer implementing that relation. Although there are many regular relations which MSO(SLR) cannot describe directly, we show how MSO(SLR) can characterize such relations indirectly by describing aligned representations of them.
To demonstrate the usefulness of MSO(SLR), we use it to define the finite-state conditional replace operator A -> B / C _ D in a declarative fashion. We argue that this approach improves on previous definitions in terms of clarity, maintainability, extensibility, and formal verifiability. We justify these claims by discussing several extensions and variations of the operator and providing rigorous proofs of correctness for our definitions.
A further demonstration of MSO(SLR)'s usefulness is given in the form of definitions of the rule formalisms used in two-level morphology. As with the replace operator definition, our declarative definitions give us a compiler automatically and make extensions and formal verification easy.
Sabine Schulte im Walde (Stuttgart)
Experiments on the Automatic Induction of German Semantic Verb Classes
This thesis investigates the potential and the limits of an automatic acquisition of semantic classes for German verbs. Semantic verb classes are an artificial construct of natural language which generalises over verbs according to their semantic properties; the class labels refer to the common semantic properties of the verbs in a class at a general conceptual level, and the idiosyncratic lexical semantic properties of the verbs are either added to the class description or left underspecified. Examples of conceptual structures are 'Position' verbs such as 'liegen' (to lie), 'sitzen' (to sit), 'stehen' (to stand). On the one hand, verb classes reduce redundancy in verb descriptions, since they encode the common properties of verbs. On the other hand, verb classes can predict and refine properties of a verb that received insufficient empirical evidence, with reference to verbs in the same class; under this aspect, a verb classification is especially useful for the pervasive problem of data sparseness in NLP, where little or no knowledge is provided for rare events. To my knowledge, no German verb classification is available for NLP applications. Such a classification would therefore provide a principled basis for filling a gap in available lexical knowledge.
...
The automatic induction of the German verb classes is performed by the k-Means algorithm, a standard unsupervised clustering technique as proposed by Forgy (1965). The algorithm uses the syntactico-semantic descriptions of the verbs as empirical verb properties and learns to induce a semantic classification from this input data. The clustering outcome cannot be a perfect semantic verb classification, since (i) the meaning-behaviour relationship on which we rely for the clustering is not perfect, and (ii) the clustering method is not perfect for the ambiguous verb data. The goal of this thesis, however, is not necessarily to obtain the optimal clustering result, but to understand the potential and the restrictions of the natural language clustering approach. Only in this way can we develop a methodology which can be applied to large-scale data. Key issues of the clustering methodology concern linguistic aspects on the one hand and technical aspects on the other.
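As a minimal sketch of the clustering step, with toy verbs and invented feature values standing in for the syntactico-semantic distributions used in the thesis:

```python
# Minimal sketch: k-Means over verbs represented by syntactico-semantic
# feature vectors. Verbs, features, and values are toy stand-ins.
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
import numpy as np

verbs = ["liegen", "sitzen", "stehen", "essen", "trinken", "fahren"]
# Columns: hypothetical features, e.g. relative frequencies of
# subcategorization frames and selectional preferences.
features = np.array([
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.1, 0.1, 0.6, 0.2],
    [0.1, 0.2, 0.5, 0.2],
    [0.2, 0.1, 0.2, 0.5],
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(normalize(features))
for verb, label in zip(verbs, labels):
    print(verb, label)   # verbs grouped into induced classes
```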
Martin Jansche (Google, New York)
The present work takes its examples from speech synthesis, and is in particular concerned with the task of predicting the pronunciation of words from their spelling. When applied to this task, deterministic mappings are also known as letter-to-sound rules. The three most commonly used metrics for evaluating letter-to-sound rules are prediction error, which is not generally applicable; string error, which can only distinguish between perfect and flawed pronunciations and is therefore too coarse; and symbol error, which is based on string edit distance and subsumes string error. These three performance measures are independent in the sense that they may prefer different models for the same data set. The use of an evaluation measure based on some version of string edit distance is recommended. Existing proposals for learning deterministic letter-to-sound rules are systematized and formalized. Most formal problems underlying the learning task are shown to be intractable, even when they are severely restricted. The traditional approaches based on aligned data and prediction error are tractable, but have other undesirable properties. Approximate and heuristic methods are recommended. The formalization of learning problems also reveals a number of new open problems. Recent probabilistic approaches based on stochastic transducers are discussed and extended. A simple proposal due to Ristad and Yianilos is reviewed and recast in an algebraic framework for weighted transducers. Simple models based on memoryless transducers are generalized to stochastic finite transducers without any restrictions on their state graphs. Four fundamental problems for stochastic transducers (evaluation, parameter estimation, derivation of marginal and conditional models, and decoding) are identified and discussed for memoryless and unrestricted machines. An empirical evaluation demonstrates that stochastic transducers perform better on a letter-to-sound conversion task than deterministic mappings.
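The recommended symbol-error measure rests on string edit distance between predicted and reference pronunciations; a standard Levenshtein computation, sketched here with made-up phone strings, is one way to realize it.

```python
# Standard Levenshtein (string edit) distance between a predicted and a
# reference phone string; symbol error rate is the distance divided by the
# reference length. Phone strings below are made up for illustration.

def edit_distance(pred, ref):
    d = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i in range(len(pred) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(pred) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(pred)][len(ref)]

predicted = ["k", "ae", "t", "s"]
reference = ["k", "ae", "t"]
dist = edit_distance(predicted, reference)
print(dist, dist / len(reference))  # symbol errors and symbol error rate
```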
Kyuchul Yoon (English Division, Kyungnam University)
Building a prosodically sensitive diphone database for a Korean text-to-speech synthesis system
This dissertation describes the design and evaluation of a prosodically sensitive concatenative text-to-speech (TTS) synthesis system for Korean within the Festival TTS framework (Taylor et al., 1998). The primary task that this dissertation undertakes is to build a synthesis system that can test the idea that a speech segment is affected by its prosodic context and is subject to continuous allophonic and categorical allomorphic variation. There are three subtasks to the primary task. The first subtask is to model the allomorphic variation of Korean and to investigate the validity of using hand-written linguistically motivated morphophonological rules in the form of grapheme-to-phoneme (GTP) conversion rules. The evaluation of the implemented GTP module showed that taking advantage of linguistic knowledge could greatly reduce the amount of training material required by any machine-learning approach and that the error analysis is more informative and straightforward. The second subtask is to model positionally-conditioned allophonic variation and to motivate segmental correlates of prosodic categories with a view to designing a prosodically sensitive diphone database. From a corpus of prosodically labeled read speech, we created a prosodically sensitive diphone database, selecting four different prosodic versions of the same diphone. The last subtask is to build a model of Korean prosody, i.e., a model of phrasing, fundamental frequency contour, and duration, using a corpus that has been morpho-syntactically parsed and prosodically labeled following the K-ToBI labeling conventions (Jun, 2000, 1998 & 1993). Only the model of phrasing was implemented, trained from a set of morphosyntactic and textual distance features, and it can predict the location of accentual and intonational phrase breaks. The results of these subtasks were incorporated into the TTS system and the naturalness of the output from the system was evaluated. A listening experiment performed on eighty native speakers of Korean with stimuli synthesized from the TTS system showed that listeners preferred stimuli that were composed of prosodically appropriate diphones. We interpret this as evidence for the idea that the prosodically conditioned allophonic variation is a perceptible marker to the segmental encoding of prosodic domains.
Paul C. Davis (Motorola Research)
Stone Soup Translation: The Linked Automata Model
This dissertation introduces and begins an investigation of an MT model consisting of a novel combination of finite-state devices. The model proposed is more flexible than transducer models, giving increased ability to handle word order differences between languages, as well as crossing and discontinuous alignments between words. The linked automata MT model consists of a source language automaton, a target language automaton, and an alignment table: a function which probabilistically links sequences of source and target language transitions. It is this augmentation to the finite-state base which gives the linked automata model its flexibility.
The dissertation describes the linked automata model from the ground up, beginning with a description of some of the relevant MT history and empirical MT literature, and the preparatory steps for building the model, including a detailed discussion of word alignment and the introduction of a new technique for word alignment evaluation. Discussion then centers on the description of the model and its use of probabilities, including algorithms for its construction from word-aligned bitexts and for the translation process. The focus next moves to expanding the linked automata approach, first through generalization and techniques for extracting partial results, and then by increasing the coverage, both in terms of using additional linguistic information and using more complex alignments. The dissertation presents preliminary results for a test corpus of English to Spanish translations, and suggests ways in which the model can be further expanded as the foundation of a more powerful MT system.
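Schematically, the heart of the model can be pictured as two automata plus a table keyed by transition sequences; the identifiers and probabilities below are invented for illustration and are not taken from the dissertation.

```python
# Schematic data structure for the linked automata model: two finite-state
# automata plus an alignment table that probabilistically links sequences of
# source transitions to sequences of target transitions. All identifiers and
# probabilities are hypothetical.
from collections import defaultdict

# Transitions named as (state_from, symbol, state_to).
source_transitions = {
    "s1": ("q0", "the", "q1"),
    "s2": ("q1", "house", "q2"),
}
target_transitions = {
    "t1": ("r0", "la", "r1"),
    "t2": ("r1", "casa", "r2"),
}

# Alignment table: P(target transition sequence | source transition sequence).
alignment_table = defaultdict(dict)
alignment_table[("s1",)][("t1",)] = 0.9
alignment_table[("s2",)][("t2",)] = 0.8
alignment_table[("s1", "s2")][("t1", "t2")] = 0.7  # sequences linked as a unit

def candidate_targets(source_seq):
    """Look up the probabilistically linked target transition sequences."""
    return sorted(alignment_table[source_seq].items(), key=lambda kv: -kv[1])

print(candidate_targets(("s1", "s2")))
```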
Mirella Lapata (Faculty, Informatics, Edinburgh)
The Acquisition and Modeling of Lexical Knowledge: A Corpus-based Investigation of Systematic Polysemy
This thesis deals with the acquisition and probabilistic modeling of lexical knowledge. A considerable body of work in lexical semantics concentrates on describing and representing systematic polysemy, i.e., the regular and predictable meaning alternations certain classes of words are subject to. Although the prevalence of the phenomenon has been long recognized, systematic empirical studies of regular polysemy are largely absent, both with respect to the acquisition of systematic polysemous lexical units and the disambiguation of their meaning.
The present thesis addresses both tasks. First, we use insights from linguistic theory to guide and structure the acquisition of systematically polysemous units from domain independent wide-coverage text. Second, we constrain ambiguity by developing a probabilistic framework which provides a ranking on the range of meanings for systematically polysemous words in the absence of discourse context.
We focus on meaning alternations with syntactic effects and exploit the correspondence between meaning and syntax to inform the acquisition process. The acquired information is useful for empirically testing and validating linguistic generalizations, extending their coverage and quantifying the degree to which they are productive. We acquire lexical semantic information automatically using partial parsing and a heuristic approach which exploits fixed correspondences between surface syntactic cues and lexical meaning. We demonstrate the generality of our proposal by applying it to verbs and their complements, adjective-noun combinations, and noun-noun compounds. For each phenomenon we rely on insights from linguistic theory: for verbs we exploit Levin's (1993) influential classification of verbs on the basis of their meaning and syntactic behavior; for compound nouns we make use of Levi's (1978) classification of semantic relations, and finally we look at Vendler's (1968) and Pustejovsky's (1995) generalizations about adjectival meaning.
We present a simple probabilistic model that uses the acquired distributions to select the dominant meaning from a set of meanings arising from syntactically related word combinations. Default meaning, the dominant meaning of polysemous words in the absence of explicit contextual information to the contrary, is modeled probabilistically in a Bayesian framework which combines observed linguistic dependencies (in the form of conditional probabilities) with linguistic generalizations (in the form of prior probabilities derived from classifications such as Levin's (1993)). Our studies explore a range of model properties: (a) its generality, (b) the representation of the phenomenon under consideration (i.e., the choice of the model variables), (c) the simplification of its parameter space through independence assumptions, and (d) the estimation of the model parameters. Our findings show that the model is general enough to account for different types of lexical units (verbs and their complements, adjective-noun combinations, and noun-noun compounds) under varying assumptions about data requirements (sufficient versus sparse data) and meaning representations (corpus internal or corpus external).
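A minimal sketch of the ranking idea, with invented probabilities: combine a prior over meanings (from class-level generalizations) with conditional probabilities of the observed word combination, and take the highest-scoring meaning as the default.

```python
# Sketch of the Bayesian ranking: P(meaning) * P(observed combination | meaning),
# highest first. All probabilities below are invented for illustration.

def default_meaning(priors, likelihoods):
    """Return meanings ranked by prior times likelihood."""
    scored = {m: priors[m] * likelihoods.get(m, 0.0) for m in priors}
    return sorted(scored.items(), key=lambda kv: -kv[1])

# Hypothetical meanings for a combination like "begin the book"
# (e.g. a reading event versus a writing event).
priors = {"read": 0.6, "write": 0.3, "other": 0.1}           # class generalizations
likelihoods = {"read": 0.05, "write": 0.02, "other": 0.001}  # corpus dependencies

print(default_meaning(priors, likelihoods))  # most probable default meaning first
```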
Scott McDonald (University of Utrecht)
Environmental Determinants of Lexical Processing Effort
A central concern of psycholinguistic research is explaining the relative ease or difficulty involved in processing words. In this thesis, we explore the connection between lexical processing effort and measurable properties of the linguistic environment. Distributional information (information about a word’s contexts of use) is easily extracted from large language corpora in the form of co-occurrence statistics. We claim that such simple distributional statistics can form the basis of a parsimonious model of lexical processing effort. Adopting the purposive style of explanation advocated by the recent rational analysis approach to understanding cognition, we propose that the primary function of the human language processor is to recover meaning from an utterance. We assume that for this task to be efficient, a useful processing strategy is to use prior knowledge in order to build expectations about the meaning of upcoming words. Processing effort can then be seen as reflecting the difference between ‘expected’ meaning and ‘actual’ meaning. Applying the tools of information theory to lexical representations constructed from simple distributional statistics, we show how this quantity can be estimated as the amount of information conveyed by a word about its contexts of use.
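One way to picture this quantity, under the assumption that it behaves like a relative entropy between a word's contextual distribution and the prior distribution over contexts (the context inventory and probabilities below are invented for illustration):

```python
# Sketch: how much a word tells us about its contexts of use, measured here
# as the relative entropy (KL divergence) between P(context | word) and the
# prior P(context). Contexts and probabilities are made up.
import math

def information_conveyed(p_context_given_word, p_context):
    """Relative entropy D( P(c|w) || P(c) ), in bits."""
    total = 0.0
    for c, p_cw in p_context_given_word.items():
        if p_cw > 0:
            total += p_cw * math.log2(p_cw / p_context[c])
    return total

p_context = {"animal": 0.25, "food": 0.25, "vehicle": 0.25, "sport": 0.25}
p_context_given_dog = {"animal": 0.7, "food": 0.1, "vehicle": 0.1, "sport": 0.1}

print(information_conveyed(p_context_given_dog, p_context))  # bits of information
```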
The hypothesis that properties of the linguistic environment are relevant to lexical processing effort is evaluated against a wide range of empirical data, including both new experimental studies and computational reanalyses of published behavioural data. Phenomena accounted for using the current approach include: both single-word and multiple-word lexical priming, isolated word recognition, the effect of contextual constraint on eye movements during reading, sentence and ‘feature’ priming, and picture naming performance by Alzheimer’s patients.
Besides explaining a broad range of empirical findings, our model provides an integrated account of both context-dependent and context-independent processing behaviour, offers an objective alternative to the influential spreading activation model of contextual facilitation, and invites reinterpretation of a number of controversial issues in the literature, such as the word frequency effect and the need for distinct mechanisms to explain semantic and associative priming.
We conclude by emphasising the important role of distributional information in explanations of lexical processing effort, and suggest that environmental factors in general should be given a more prominent place in theories of human language processing.
David Tugwell (ITRI, Brighton)
Dynamic Syntax
In this thesis, I shall argue that dynamic "left-to-right" grammars have been undeservedly neglected as models of natural language syntax, and that they allow more general and elegant syntactic descriptions than has previously been appreciated. ... I propose a concrete example of a left-to-right model of syntax.