Add joke here

Christoph Teichmann

I work as a researcher in Computational Linguistics / Natural Language Processing in the group of Prof. Alexander Koller at the Computational Linguistics Department in Saarbrücken. I used to work with Prof. Koller in the Linguistics Department of the University of Potsdam.  Before that I got a doctorate from Leipzig University, where I worked in the Natural Language Processing Group. My research focuses on efficient algorithms for inference in the structured spaces one often encounters in natural language processing.

Google Scholar


E-Mail: name - cteichmann / domain - coli.uni-saarland.de

Word Classes from Suffixes

posted Oct 19, 2015, 6:26 AM by Christoph Teichmann   [ updated Oct 19, 2015, 6:26 AM ]

Since I am currently generating examples for my seminar on unsupervised techniques, I have another obvious but very illustrative data set. I used the same basic data as in the older post - this time I did not even bother to drop the leading numbers or to lowercase any inputs. All I did was put all words that have the same final three letters into a cluster. As you would expect, most of these clusters are noisy, but they have a clear tendency towards well-defined linguistic classes of words, and with this approach you only need to see a word once to define its cluster membership.
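The clustering itself fits in a few lines. The following is a minimal sketch and not the code used for the data set; the function name `cluster_by_suffix` and the decision to skip words shorter than three characters are assumptions made for illustration.

```python
from collections import defaultdict

def cluster_by_suffix(words, n=3):
    """Group words by their final n characters, without lowercasing."""
    clusters = defaultdict(set)
    for word in words:
        # Assumption: words shorter than n characters are skipped.
        if len(word) >= n:
            clusters[word[-n:]].add(word)
    return clusters

words = ["See-Saw", "Saw", "seesaw", "Warsaw", "saw", "jigsaw"]
clusters = cluster_by_suffix(words)

# Print in the "suffix : {words with that suffix}" format.
for suffix, members in sorted(clusters.items()):
    print(suffix + " : {" + ", ".join(sorted(members)) + "}")
```

Because nothing is lowercased, 'Saw' and 'saw' end up in different clusters.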

the data
(text file best viewed with the 'less' command)

The format is:
suffix : {words with that suffix}

my favourite cluster is:
Saw : {See-Saw, Rip-Saw, m-Calling-This-Number-To-Report-What-I-Saw, Saw, Buzz-Saw}

The lowercase 'saw' cluster is much noisier:

saw : {Kennesaw, Alingasaw, Nasaw, foresaw, Ripsaw, seesaw, buzzsaw, Lvóv-Warsaw, sea-saw, oversaw, chainsaw, Poznań-Warsaw, Chopsaw, Guudsaw, jigsaw, handsaw, buzz-saw, Dine-saw, Nassaw, Jigsaw, sightsaw, Chickasaw, see-saw, Jasaw, ricksaw, Buzzsaw, warsaw, Kenesaw, Chainsaw, Jisaw, hacksaw, Warsaw, saw, whipsaw}

and it illustrates nicely that 'saw' has at least three very different uses in English: 'I saw Tim buying a saw in Warsaw'.

Distributionally Similar Words

posted Oct 19, 2015, 5:46 AM by Christoph Teichmann   [ updated Oct 12, 2016, 4:36 AM ]

As an illustrative example for the seminar on Unsupervised Linguistic Structure Induction, I recently wanted to compute sets of distributionally similar words. First I needed a corpus, so I went to the download page of the Leipzig Corpora Collection, passed the captcha, and then looked for one of the plain text downloads for equally 'plain' English (eng, not e.g. eng-za or something) - I don't remember how my old colleagues defined plain English, though I'm sure they explained it to me at some point. I settled on the 1 million sentences from Wikipedia from the year 2010.

Then the data needed to be tokenized. (Note that there is a number at the start of every sentence. I ignored those, since as far as this experiment is concerned they are completely arbitrary running numbers.) I tokenized by simply assuming that a new word starts at every whitespace and that each punctuation symbol is also its own word, with the exception of '-'.
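A tokenizer along these lines could look as follows. This is a sketch under assumptions, not the original code: "punctuation" is taken to mean any character that is neither a word character nor whitespace, and the running number is assumed to be the first token of each line.

```python
import re

# Runs of word characters and hyphens form one token; any other
# non-whitespace character becomes a token of its own.
# Assumption: this is what "punctuation symbol" means here.
TOKEN = re.compile(r"[\w-]+|[^\w\s]")

def tokenize(line):
    tokens = TOKEN.findall(line)
    # Assumption: each line starts with an arbitrary running number,
    # which is dropped.
    if tokens and tokens[0].isdigit():
        tokens = tokens[1:]
    return tokens

print(tokenize("12\tI saw a see-saw, didn't I?"))
```

Note that this splits "didn't" into three tokens, which follows from treating every punctuation symbol except '-' as its own word.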

Once that is done, you can compute distributional properties of the data. First I computed (well, my computer did) the 300 most frequent words; these became my 'contexts'. Then I counted for every word w how often each of the 300 occurred directly before and directly after it. This gives me some nice vectors that might correlate with some deeper knowledge about w (an assumption somewhat supported by the distributional hypothesis). So I then did the following: for each pair of distinct words that both occurred at least 100 times (so I have some reliable information), compute the cosine similarity (it is really important not to use something based on Euclidean distance), and then create a file that lists for each word the 10 most similar other words.
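The steps in the previous paragraph can be sketched as below. This is an illustration under assumptions, not the code behind the published files: the function names are hypothetical, and 'before' and 'after' counts are kept as separate dimensions, which the description suggests but does not state outright.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, n_contexts=300):
    """Count, for every word, which frequent words occur directly before/after it."""
    freq = Counter(tok for sent in sentences for tok in sent)
    contexts = {w for w, _ in freq.most_common(n_contexts)}
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            # Assumption: position -1 and +1 are separate dimensions.
            if i > 0 and sent[i - 1] in contexts:
                vectors[word][(-1, sent[i - 1])] += 1
            if i + 1 < len(sent) and sent[i + 1] in contexts:
                vectors[word][(+1, sent[i + 1])] += 1
    return freq, vectors

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(c * v[k] for k, c in u.items() if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest(freq, vectors, min_count=100, k=10):
    """For each sufficiently frequent word, list the k most similar other words."""
    frequent = [w for w in vectors if freq[w] >= min_count]
    result = {}
    for w in frequent:
        scored = [(cosine(vectors[w], vectors[o]), o) for o in frequent if o != w]
        scored.sort(reverse=True)
        result[w] = [o for _, o in scored[:k]]
    return result

# Toy data; min_count is lowered because the corpus is tiny.
sentences = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["a", "cat", "runs"]]
freq, vectors = context_vectors(sentences)
neighbours = nearest(freq, vectors, min_count=1)
print("dog", "-", neighbours["dog"])
```

Cosine similarity only compares the direction of the count vectors, so a rare word and a frequent word with the same relative context profile come out as similar; a Euclidean distance on raw counts would mostly separate them by frequency. The pairwise loop is quadratic in the number of frequent words, which is acceptable for a sketch but slow on a large vocabulary.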

the result

The 'neighbours.txt' file in that archive is the most interesting one. Each line has the format:

word    -    [distributionally closest word, distributionally second closest word, ....]

Then there is a file with some interesting examples (most with a note from me on why I found them interesting). Finally there is a file with the cooccurrence statistics in the format:

word    :    number of occurrences
+++++
position (relative to the word)
cooccurrence in that position - count
+++++
other position (relative to the word)
cooccurrence in that position - count
-----

If there is interest I could publish the code I used to do this (I might also publish it if there is no interest, if I find the time to document it properly).

The important point:

I think most of the nearest words have a very intuitive relation - often it is that they have the same word class - and this is already visible with 'only' 1 million sentences. While there are also some combinations that make no sense (to me), this indicates that the distributional hypothesis works really well. The distinctions made are also much, much more fine-grained than what you see in a conventional POS tagset, which could be good or bad, depending on how you want to use this information.

I am by no means the first person to observe this. Chris Biemann's Unsupos is a far more worked-out approach to a similar problem, which you can read about in:



Something I would like to ask the reader:

If you have read up to this point, maybe you can help me understand something about the data: the difference between 1861 and 1852. I know that one is a civil war year and the other is not, but how is this distinguished in the words immediately to the left and to the right? The distributional similarity 'clearly' picks up on something:

1861    [1944, 1943, 1941, 1864, 1776, 1942, 1863, 1918, 1862, 1897]
most of those are years with 'important' wars going on (from the perspective of the 'standard' - Commonwealth or U.S. American - English speaker) but if we look at 1852:

1852    [1932, 1931, 1928, 1935, 1955, 1911, 1936, 1956, 1921, 1965]
then that is mostly pre- and post-war years (1950-1953 - Korean War) except for 1965 (Cuba Crisis). Most likely I'm putting meaning into these things after the fact, but I wonder if this tells me something about the very local constructions in which these words are used. The co-occurrence vectors are different, but nothing about them - to me - explains the semantic difference. Maybe it tells you something, or maybe I'm imagining things.

Nikos Engonopoulos pointed out to me: "... 5 out of the 11 chronologies (1852, 1928, 1932, 1936, 1956) are election years, which is more than expected considering that the election is only once every four years. Also, 4 more (1931, 1935, 1955, 1911) are pre-election years ...". Thanks to Nikos!
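Nikos's count is easy to check. The sketch below assumes that "election" refers to U.S. presidential elections, which fall in years divisible by four; the quote itself does not say which elections are meant.

```python
# 1852 together with its ten nearest neighbours from the list above.
years = [1852, 1932, 1931, 1928, 1935, 1955, 1911, 1936, 1956, 1921, 1965]

# Assumption: U.S. presidential election years, i.e. years divisible by 4.
election = [y for y in years if y % 4 == 0]
pre_election = [y for y in years if y % 4 == 3]

print(election)        # [1852, 1932, 1928, 1936, 1956]
print(pre_election)    # [1931, 1935, 1955, 1911]
print(len(years) / 4)  # 2.75 election years expected if years were spread evenly
```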
