Distributionally Similar Words

Post date: Oct 19, 2015 12:46:36 PM

As an illustrative example for the seminar on Unsupervised Linguistic Structure Induction, I recently wanted to compute sets of distributionally similar words. First I needed a corpus, so I went to the download page of the Leipzig Corpora Collection, scrolled past the caption, and looked for one of the plain text downloads for equally 'plain' English (eng, not e.g. eng-za or something) - I don't remember how my old colleagues defined plain English, though I'm sure they explained it to me at some point. I settled on the 1 million sentences from Wikipedia from the year 2010.

Then the data needed to be tokenized (note also that there is a number at the start of every sentence; I ignored those, since they are completely arbitrary running numbers as far as this experiment is concerned). I did this by simply assuming that a new word starts at every whitespace and that each punctuation symbol is also its own word, with the exception of '-'.
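This tokenization rule is simple enough to sketch in a few lines of Python. This is not the code I actually used, just a minimal illustration of the rule described above (split at whitespace, then split off every punctuation symbol as its own token, keeping '-' inside words):

```python
import re

def tokenize(line):
    """Whitespace-based tokenizer: every punctuation symbol becomes
    its own token, except '-', which stays inside words."""
    tokens = []
    for chunk in line.split():
        # runs of word characters (hyphen included) or a single
        # non-word, non-space character (i.e. one punctuation mark)
        tokens.extend(re.findall(r"[\w-]+|[^\w\s]", chunk))
    return tokens
```

For example, tokenize("A co-occurrence test, done.") keeps "co-occurrence" as one token but splits off the comma and the full stop.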

Once that is done, you can compute distributional properties of the data. First I computed (well, my computer did) the 300 most frequent words; these became my 'contexts'. Then I counted, for every word w, how often each of the 300 occurred directly before and after it. This gives me some nice vectors that might correlate with some deeper knowledge about w (an assumption somewhat supported by the distributional hypothesis). So I then did the following: for each word that occurred at least 100 times (so I have some reliable information) and every other word that occurred at least 100 times, compute the cosine similarity (it is really important not to use something based on Euclidean distance, since raw co-occurrence counts scale with word frequency and cosine is invariant to that) and then create a file that lists, for each word, the 10 most similar other words.
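The whole pipeline above fits in a short sketch. Again, this is not my original code; the function names are mine, and for readability the context vectors are plain lists rather than anything sparse (which you would want for a real 1-million-sentence corpus):

```python
from collections import Counter
import math

def context_vectors(sentences, num_contexts=300, min_count=100):
    """For every word occurring at least min_count times, count how
    often each of the num_contexts most frequent words appears
    directly before (first half of the vector) and directly after it
    (second half)."""
    freq = Counter(w for s in sentences for w in s)
    contexts = [w for w, _ in freq.most_common(num_contexts)]
    idx = {c: i for i, c in enumerate(contexts)}
    vecs = {w: [0] * (2 * num_contexts)
            for w, n in freq.items() if n >= min_count}
    for s in sentences:
        for i, w in enumerate(s):
            if w not in vecs:
                continue
            if i > 0 and s[i - 1] in idx:          # left neighbour
                vecs[w][idx[s[i - 1]]] += 1
            if i + 1 < len(s) and s[i + 1] in idx:  # right neighbour
                vecs[w][num_contexts + idx[s[i + 1]]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity; unlike Euclidean distance, it ignores
    vector length and so does not penalize frequent words."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

The 10 nearest neighbours of a word are then just the other words sorted by cosine similarity to its vector, keeping the top 10.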

the result

The 'neighbours.txt' file in that archive is the most interesting one. Each line has the format:

word - [distributionally closest word, distributionally second closest word, ....]

Then there is a file with some interesting examples (most with a note by me; why I found it interesting). Finally there is a file with the cooccurrence statistics in the format:

word : number of occurrences

+++++

position (relative to the word)

cooccurrence in that position - count

+++++

other position (relative to the word)

cooccurrence in that position - count

-----
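If you want to read that statistics file programmatically, a parser could look like the sketch below. The exact separators and spacing are assumptions based on the description above, so treat this as a starting point rather than a guaranteed match for the published file:

```python
def parse_cooccurrence(text):
    """Parse records of the form
        word : count / '+++++' / position / 'context - count' lines,
    terminated by '-----'.  Returns
    {word: (count, {position: {context: count}})}."""
    records = {}
    for block in text.strip().split("-----"):
        block = block.strip()
        if not block:
            continue
        header, *sections = [p.strip() for p in block.split("+++++")]
        word, _, n = header.partition(" : ")
        positions = {}
        for sec in sections:
            lines = sec.splitlines()
            pos = lines[0].strip()
            counts = {}
            for line in lines[1:]:
                if " - " not in line:
                    continue
                # rpartition, in case the context token itself
                # contains ' - '
                ctx, _, c = line.rpartition(" - ")
                counts[ctx.strip()] = int(c)
            positions[pos] = counts
        records[word.strip()] = (int(n), positions)
    return records
```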

If there is interest I could publish the code I used to do this (I might also publish it if there is no interest, if I find the time to document it properly).

The important point:

I think most of the nearest words have a very intuitive relation - often they share the same word class - and this is already visible with 'only' 1 million sentences. While there are also some combinations that make no sense (to me), this indicates that the distributional hypothesis works really well. The distinctions made are also much, much more fine-grained than what you see in a conventional POS tagset, which could be good or bad, depending on how you want to use this information.

I am by no means the first person to observe this; Chris Biemann's Unsupos is a far more worked-out approach to a similar problem, which you can read about in:

Chris Biemann (2006)

Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering

Something I would like to ask the reader:

If you have read up to this point, maybe you can help me understand something about the data: the difference between 1861 and 1852. I know that one is a civil war year and the other is not, but how is this distinguished in the words immediately to the left and to the right? The distributional similarity 'clearly' picks up on something:

1861 [1944, 1943, 1941, 1864, 1776, 1942, 1863, 1918, 1862, 1897]

most of those are years with 'important' wars going on (from the perspective of the 'standard' - Commonwealth or U.S. American - English speaker), but if we look at 1852:

1852 [1932, 1931, 1928, 1935, 1955, 1911, 1936, 1956, 1921, 1965]

then those are mostly pre- and post-war years (1950-1953 - Korean War), except for 1965. Most likely I'm putting meaning into these things after the fact, but I wonder if this tells me something about the very local constructions in which these words are used. The co-occurrence vectors are different, but nothing about them - to me - explains the semantic difference. Maybe it tells you something, or maybe I'm imagining things.

Nikos Engonopoulos pointed out to me: "... 5 out of the 11 chronologies (1852, 1928, 1932, 1936, 1956) are election years, which is more than expected considering that the election is only once every four years. Also, 4 more (1931, 1935, 1955, 1911) are pre-election years ...". Thanks to Nikos!