Unsupervised Linguistic Structure Induction

14.00 - 16.00 

Goal and Content of the Seminar:

Successful applications of unsupervised learning to natural language processing go back at least to the papers of Brown and colleagues, in which word clusters were used to reduce the size of language models*, and there are a number of earlier investigations. Since 2000, however, the pace of advancement in this area has increased, and unsupervised approaches are becoming more and more useful for exploiting large amounts of unannotated data to supplement smaller annotated training sets. In this seminar we will examine four types of 'annotation' that have been more or less successfully generated in an unsupervised way. The goal is to understand the connections between linguistic concepts and corpus statistics that the different algorithms exploit, and to understand how the resulting information could be used in other tasks. At the end of the seminar you should have a clear idea of some of the information that can be extracted from unannotated linguistic data.

* Peter F. Brown; Peter V. deSouza; Robert L. Mercer; Vincent J. Della Pietra & Jenifer C. Lai (1992)
Class-based n-gram models of natural language


In order to understand the papers we will discuss during the seminar, a background in computational linguistics equivalent to the content of the 'Advanced Natural Language Processing' lecture is required. Some background in machine learning is helpful but not necessary. We will try to work out everything else during the sessions.


Grading is based on a short introductory talk about one of the papers, given during one of the sessions, and on a short (max. 15 pages) seminar paper written after the sessions. This paper will summarize the results of two experiments done during the semester using existing tools for linguistic structure induction. The paper additionally puts the experiments in the context of further research and/or explains some additional investigations you have made.

You can sign up to present something here (see below for the papers corresponding to dates).

You have to hand in your paper two weeks before the end of the semester if you want to pass. If I have not received your paper (sent to me electronically as a PDF(!), or handed to me or the secretary for theoretical CL on paper) by midnight on 16.03.2016, then you fail. There will be no exceptions.

Questions for each Paper:

For every paper we read there are a number of questions that highlight some of the points I consider important for understanding it. Some of these questions are the same for multiple papers, because they concern something that is of interest for unsupervised linguistic structure induction in general. If you attend a session of the seminar, then I expect you to be able to give some insight concerning these questions. This does not necessarily mean that you need to answer them, but if you cannot, then you should at least be able to explain why the question (or the answer the paper gives) is unclear. Sometimes there is no good answer (that I am aware of) and the intention is to highlight something that is an ongoing research topic. The goal is to have a number of starting points for a discussion, so keep that in mind when thinking about the questions.

Your Questions:

If you intend to attend one of the sessions in which a paper is discussed, please also send me at least one question concerning the paper two days before the start of the session. The question can be about anything in the paper, e.g. a technique used that you are not familiar with. This will hopefully allow us to concentrate on the things that are most important, so that everyone gets the most out of each discussion.


Over the course of the seminar you are expected to conduct two experiments, on which you will then expand for your final paper. The experiments are described below. Two sessions have been set aside specifically for discussing these experiments and any problems you encountered. The basic experiments should be finished by this point and you should be able to give a summary of your findings. If you need any help or advice for implementing or evaluating the experiments, then make an appointment or talk to me after the seminar. You are expected to put the code for your experiments (only what you wrote) in a repository that is accessible to the other members of the seminar.

         1. Introduction to Unsupervised Linguistic Structure Induction and Seminar Planning.

            Part-Of-Speech Tagging:

            2. Introduction
                Christodoulopoulos, C.; Goldwater, S. & Steedman, M. (2010)
                Two Decades of Unsupervised POS induction: How far have we come?

                -What is the difference between POS induction and POS disambiguation (according
                  to the paper)?
                -What are the problems with evaluating POS induction?
                    +What are the problems of the different evaluation measures?
                -What are the benefits / disadvantages of inducing the number of classes automatically?
                -What is the advantage of settling on a single class for every word type?
                -What feature sets the "winning" systems apart?
                -The two winning systems show little or no improvement from the prototype based
                  technique - why could this be the case?

                Further Reading:
                Gelling, D.; Cohn, T.; Blunsom, P. & Graça, J. (2012)
                The PASCAL Challenge on Grammar Induction

        3. Example Approach

                Christodoulopoulos, C.; Goldwater, S. & Steedman, M. (2011)
                A Bayesian Mixture Model for Part-of-Speech Induction Using Multiple Features

                -What type of features could be useful for POS Induction?
                        +Why would these features be related to POS classes?
                -What is (according to the paper) the difference between clustering and sequence models?
                -How would the learning algorithm look if we used MAP-EM?
                -How are final results generated from the sampling process?
                -What exactly makes the presented model deficient?
                -What is a type level feature and why is it more important for rare words?
                -Can you come up with a potentially better initialization strategy than those presented in
                  the paper?
                -Euclidean distance is not very suitable for clustering word counts - why? What might work better?
                -How would the algorithm scale to larger datasets?
                -How, if at all, could the algorithm benefit from more data?
                -Are there other ways that alignments could be incorporated besides those presented in
                  the paper?

            4. Using Multiple Languages
                Snyder, B.; Naseem, T.; Eisenstein, J. & Barzilay, R. (2008)
                Unsupervised Multilingual Learning for POS Tagging


                -How would you expect changes in translation "faithfulness" to impact performance
                 of the presented algorithm?
                -Why is it important that tags of translation pairs are correlated but not identical?
                -How does the monolingual sequence model in this paper differ from that in
                  Christodoulopoulos et al. 2011?
                -Why is it important that all alignments are 1 to 1?
                -Is it important that the coupling weights form a probability distribution?
                -What would be the problems with jointly finding tags and alignments in the given
                  model?
                -How is a final result generated?
                -How would the algorithm scale to larger datasets?
                -How, if at all, could the algorithm benefit from more data?
                -Why is it hard to integrate out the transition and emission parameters?
                -What explains the good performance of the random baseline (errors only on every
                  5th word, except for English)?

            Word Representations / Word Embeddings:

            5. Introduction
                Turney, P. D. & Pantel, P. (2010)
                From Frequency to Meaning: Vector Space Models of Semantics

                -What are term-document, word-context and pair-pattern matrices?
                -How do the presented algorithms scale?
                -How do they benefit from more data?
                -Why do (certain types of) word count based matrices encode word meaning?
                -What defines (according to the paper) vector space models?
                -How could VSMs benefit from other linguistic pre-processing steps besides those presented
                  in the paper?
                -When are two vectors similar?
                -Are paradigmatic parallels in a corpus semantically associated?
                -How are weighting and the need for stopword lists related?
                -Why is cosine similarity the most popular way to measure word similarity?
                -Which of the presented applications of VSMs are unsupervised?
                -How do context windows for word-context matrices in VSMs for semantic similarity
                 differ from context windows in POS induction?
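Several of these questions turn on how similarity between count vectors is measured. As a minimal sketch (with made-up count vectors, not data from the paper), cosine similarity normalises away the raw frequency differences that dominate Euclidean distance:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two words with the same context distribution, but one is 10x more frequent.
rare     = [1, 2, 0, 1]
frequent = [10, 20, 0, 10]
other    = [0, 1, 3, 0]

# Cosine ignores the frequency difference ...
assert abs(cosine(rare, frequent) - 1.0) < 1e-9
# ... while Euclidean distance is dominated by it:
assert euclidean(rare, frequent) > euclidean(rare, other)
```

This is one intuition behind both the popularity of cosine in VSMs and the Euclidean-distance question from session 3.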

            6.  Alternative strategies for generating vectors:
                Baroni, M.; Dinu, G. & Kruszewski, G. (2014)
                Don’t count, predict! A systematic comparison of context-counting vs.
                context-predicting semantic vectors

                 -What is the difference between predict and count models - do you think
                    one could be derived from the other?
                 -How do the two approaches scale?
                 -How do DSMs relate to VSMs (last session)?
                 -Is there a difference in the information used for both approaches?
                 -Why are the techniques presented here unsupervised?
                 -Which of the two approaches allows for easier inclusion of linguistic knowledge?
                 -What could be the reason for the greater robustness of predict models?
                 -How much data needs to be stored for both models?

            7. Example Approach

                Mikolov, T.; Chen, K.; Corrado, G. & Dean, J. (2013)
                Efficient Estimation of Word Representations in Vector Space

                Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. & Dean, J. (2013)
                Distributed Representations of Words and Phrases and their Compositionality

                Goldberg, Y. & Levy, O. (2014)
                word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling
                Word-Embedding Method

                -What is the primary motivation for the presented techniques compared to other
                   Neural Network based techniques?
                -What are the most expensive operations for the presented models?
                -Why is it "bad" to compute probabilities for all words in the vocabulary?
                -What are the parameters the CBOW model estimates?
                -How do different contexts "share information" during the training of their parameters?
                -Why does skip gram perform so much better than CBOW on semantic analogy tasks?
                -Why is it necessary to add noise in the NEG model - why is it not enough to maximize
                   inner products?
                -Why can the number of negative samples be decreased as the corpus size increases?
                -Why does the formula for subsampling preserve ranking?
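The last question can be checked numerically. Below is a small sketch of the subsampling rule from the second Mikolov et al. paper, where t is the threshold and f(w) the relative frequency of w (the frequency values here are made up for illustration):

```python
import math

t = 1e-5  # subsampling threshold used in the paper

def keep_prob(freq):
    # A word is discarded with probability 1 - sqrt(t / f(w)),
    # i.e. kept with probability sqrt(t / f(w)), clipped to 1.
    return min(1.0, math.sqrt(t / freq))

def expected_freq_after(freq):
    # Expected relative frequency after subsampling (up to normalisation);
    # for freq > t this equals sqrt(t * freq), which is monotone in freq.
    return freq * keep_prob(freq)

freqs = [0.05, 0.01, 0.001, 0.0001]
after = [expected_freq_after(f) for f in freqs]
# Frequent words are aggressively down-sampled ...
assert keep_prob(0.05) < keep_prob(0.001)
# ... but the frequency ranking is preserved:
assert after == sorted(after, reverse=True)
```

Since sqrt(t * f) is monotone increasing in f, a more frequent word stays (in expectation) more frequent after subsampling, which is one way to read the "preserves ranking" question.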

    8. Experimentation
                You will try out the BMM, the Brown Clustering Model or word2vec, do a small
                experiment and we will discuss your results / any problems you had:

                The goal will be to use word representations to build a simple tagger that just reads
                a word (and its neighbours) and tags it, where you use the representation of the
                word and its neighbourhood to create features for each tag. The corpus for training and
                testing will be the Universal Dependencies treebank (download here); you can use any
                language(s) you want.
                Pick one of the following tools for learning word representations:

                BMM Tagger (by Christos Christodoulopoulos and colleagues)
                Brown Tagging (by Percy Liang)
                word2vec tool (by Tomas Mikolov and colleagues)

                Train it on the raw (i.e. without tags) sentences of the complete corpus (plus
                however much additional language data you think is necessary / feasible). Then use the
                resulting tags or word embeddings as features for a supervised learner that predicts the
                tag of a word (for learning you could use something from scikit-learn or Weka; remember to
                only train this supervised part on the train portion of the data). Does this perform better on
                the test section than constructing features just from words and affixes?
                How exactly you define the features is up to you.
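As a rough sketch of the supervised part of this experiment, the following uses scikit-learn's LogisticRegression with a toy embedding table standing in for real word2vec/BMM/Brown output; all data, dimensions and names here are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 50  # embedding dimensionality (whatever your representation tool produced)

def window_features(sentence, i, embeddings):
    """Concatenate the embeddings of a word and its two neighbours."""
    parts = []
    for j in (i - 1, i, i + 1):
        if 0 <= j < len(sentence):
            parts.append(embeddings.get(sentence[j], np.zeros(DIM)))
        else:
            parts.append(np.zeros(DIM))  # padding at sentence boundaries
    return np.concatenate(parts)

# Toy embeddings standing in for the output of the representation learner.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=DIM) for w in ["the", "dog", "barks", "cat"]}

# Toy tagged training data (in the experiment: the UD train portion).
train_sents = [(["the", "dog", "barks"], ["DET", "NOUN", "VERB"])]
X = [window_features(s, i, embeddings) for s, _ in train_sents for i in range(len(s))]
y = [t for _, tags in train_sents for t in tags]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([window_features(["the", "cat", "barks"], 1, embeddings)]))
```

The same feature function can be reused for the word/affix baseline by swapping the embedding lookup for one-hot or character-suffix features.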


            9. Constituency (there is no good summary here):

               Hänig, C.; Bordag, S. & Quasthoff, U. (2008)
               UnsuParse: Unsupervised Parsing with unsupervised Part-of-Speech tagging

                -How would the algorithm scale with more data and/or longer sentences?
                    +How much would it benefit from additional data?
                 -What is the advantage of manually written POS tags over automatically induced ones?
                 -Why is it important how often a word occurs at the beginning/end of a sentence?
                 -Is there another type of information POS tags could be replaced with?
                 -What happens when the size of the "window" of pairs that are candidates for a
                   constituent is increased?
                 -Why is it beneficial to "flatten" learned constituents?
                 -Why are prepositions a problem?

           10. Dependency
                   Klein, D. & Manning, C. (2004)
                   Corpus-based induction of syntactic structure: models of dependency and constituency
                    -How does the approach scale with the amount of data given to it?
                    -Would it benefit from substantially more data?
                    -What are problems in the evaluation of unsupervised dependency parsing?
                            +How does measuring only undirected attachments deal with this problem?
                    -Why is it important to assume that the dependency trees are planar?
                    -Why is it helpful to assume that ROOT has only 1 child?
                    -Why is the expectation maximization algorithm initialized with a guess at about the
                      expectations instead of one about the model parameters?
                    -Why would a dependency model benefit from constituency information?
                    -Why are induced POS tags not as helpful as manually written ones?

           11. The Usefulness of Punctuation
                  Seginer, Y. (2007)
                  Fast Unsupervised Incremental Parsing


                  Spitkovsky, V.; Alshawi, H. & Jurafsky, D. (2011)
                  Punctuation: Making a Point in Unsupervised Dependency Parsing

                  -How do the presented approaches scale?
                  -Would they benefit from more data?
                  -How are punctuation and dependency/constituency connected?
                  -What are other indicators of dependency/constituency that could be exploited?
                  -How does Seginer's parser represent word classes?
                  -Why would sentences have "skewed" parse trees?
                  -Does Seginer's parser make use of a VSM?
                  -What advantages does punctuation have over mark-up?
                  -Why do Spitkovsky et al. use a harder constraint for training compared to parsing?


           12. Overview

                  Hammarström, H. & Borin, L. (2011)
                  Unsupervised Learning of Morphology

                  -How much of a role has non-concatenative morphology played in unsupervised
                    learning of morphology?
                  -What is typical for letter entropy at morpheme boundaries?
                  -How could unsupervised morphology benefit from unsupervised POS tagging /
                    parsing?
                   -Which level of morphological analysis is necessary for modern supervised parsers and
                      POS taggers - could unsupervised methods achieve this level?
                   -When using letter variety / entropy / max-drop, how could a segmentation threshold
                      be derived?
                   -How are MDL- and letter-based approaches related?
                   -How could group-and-abstract approaches be combined with border based approaches?
                   -What is the advantage of a generative probabilistic model of morphology?
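The letter-entropy question can be made concrete with a small sketch: the entropy of the next letter after a prefix, computed over a toy lexicon (the lexicon and function name are made up for illustration):

```python
import math
from collections import Counter

def successor_entropy(words, prefix):
    """Entropy (in bits) of the letter following `prefix` in the lexicon."""
    successors = Counter(w[len(prefix)] for w in words
                         if w.startswith(prefix) and len(w) > len(prefix))
    total = sum(successors.values())
    return -sum(c / total * math.log2(c / total) for c in successors.values())

# Toy lexicon sharing the stem "play".
words = ["plays", "played", "playing", "player", "playful"]

# Inside the stem the next letter is fully predictable (entropy 0) ...
inside = successor_entropy(words, "pla")
# ... while at the morpheme boundary several suffixes compete (high entropy):
boundary = successor_entropy(words, "play")
assert inside < boundary
```

This spike in entropy at morpheme boundaries is the signal that letter-based segmentation approaches (and thresholds on letter variety / entropy / max-drop) try to exploit.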

           13. Example Approach

                  Creutz, M. & Lagus, K. (2002)
                  Unsupervised Discovery of Morphemes

                  -How would the presented methods scale with more input data?
                        +How would the method behave as more input data is presented?
                  -Would it be possible to (efficiently) find a segmentation for the current word that
                     is exactly best with respect to the MDL criterion?
                         +Would this be beneficial?
                  -Does the ML criterion have any trivial optima?
                  -What makes the evaluation difficult?
                  -Why is it important to reanalyse words that have already been seen?
                  -Why does the MDL based method leave many common words undersegmented?
                  -How could the presented approach benefit from POS tags?

           14. Experimentation
                  Second round of experiments; this time you will use Seginer's
                  parser or the DMV:
                  Seginer's Parser (by Yoav Seginer)
                  DMV and CCM (by Franco Luque)
                  Download one tool and run it on the data for GermEval 2014 (you can just merge
                  train, dev and test set, since you are not going to train on any labels). Then use the data
                  (again possibly with extra data) to train the tool you chose. Finally, evaluate for every
                  named entity that consists of more than one word:

                  -If you choose Seginer's parser: how large the average difference in length is between
                    the named entities and the smallest constituent that contains them, and how many
                    constituents on average are inconsistent with a named entity (i.e. cover some, but
                    not all, words of the named entity and also cover some words outside of it).
                  -For the CCM you do the same as for Seginer's parser.

                  -If you chose the DMV: evaluate the average distance in the dependency tree between
                    words in the same named entity. Also evaluate how often the words in a named entity
                    are all connected without any other word in between.
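For the constituency variants, the "inconsistent constituent" check can be sketched as follows (the span representation and function name are my own; spans are half-open token intervals):

```python
def is_inconsistent(constituent, entity):
    """A constituent span is inconsistent with a named-entity span if it
    covers some, but not all, of the entity's words and also covers words
    outside of it. Spans are (start, end) token indices, end exclusive."""
    c0, c1 = constituent
    e0, e1 = entity
    overlap = max(0, min(c1, e1) - max(c0, e0))
    # partial overlap with the entity, plus material outside of it
    return 0 < overlap < (e1 - e0) and (c1 - c0) > overlap

# Entity spanning tokens 2..4 (e.g. a three-word name).
entity = (2, 5)
assert is_inconsistent((3, 7), entity)      # splits the entity
assert not is_inconsistent((2, 5), entity)  # exact match
assert not is_inconsistent((0, 7), entity)  # properly contains it
assert not is_inconsistent((5, 8), entity)  # disjoint
```

Counting these over all induced constituents and multi-word entities gives the second number asked for above; the length difference is just the size of the smallest covering constituent minus the entity size.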