Word Classes from Suffixes

Post date: Oct 19, 2015 1:26:13 PM

Since I am currently generating examples for my seminar on unsupervised techniques, I have another obvious but very illustrative data set. I used the same basic data as in the older post; this time I did not even bother to drop the leading numbers or to lowercase any inputs. All I did was put all words that share the same final three letters into a cluster. As you would expect, most of these clusters are noisy, but they show a clear tendency towards well-defined linguistic classes of words. To use this approach, you only need to see a word once to define its cluster membership.
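
The clustering step can be sketched in a few lines of Python. This is a minimal illustration and not the original script: the function name cluster_by_suffix and the sample word list are made up for the example, and treating words shorter than three letters as their own suffix is an assumption.

```python
from collections import defaultdict


def cluster_by_suffix(words, n=3):
    """Group words by their final n characters (case-sensitive, no cleanup)."""
    clusters = defaultdict(set)
    for word in words:
        # Assumption: a word shorter than n characters is used whole as its key.
        clusters[word[-n:]].add(word)
    return clusters


# Hypothetical sample input; the real data set is much larger.
sample = ["See-Saw", "Buzz-Saw", "Saw", "seesaw", "Warsaw", "jigsaw", "saw"]
clusters = cluster_by_suffix(sample)

for suffix in sorted(clusters):
    print(suffix, ":", sorted(clusters[suffix]))
```

Because the key depends only on the word itself, a single occurrence is enough to place a word in its cluster, and 'Saw' and 'saw' end up in separate clusters since nothing is lowercased.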

the data

(text file best viewed with the 'less' command)

The format is:

suffix : {words with that suffix}
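
A line in this format could be read back roughly as follows. This is only a sketch: the helper names parse_cluster and format_cluster are hypothetical, and it assumes that words are separated by a comma plus a space and that no word contains that separator itself.

```python
def parse_cluster(line):
    """Split one 'suffix : {w1, w2, ...}' line into (suffix, set of words)."""
    suffix, _, body = line.partition(" : ")
    body = body.strip()
    # Drop the surrounding braces if both are present.
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    words = set(body.split(", ")) if body else set()
    return suffix, words


def format_cluster(suffix, words):
    """Write a cluster back out; words are sorted only to make output stable."""
    return suffix + " : {" + ", ".join(sorted(words)) + "}"


suffix, words = parse_cluster("Saw : {See-Saw, Rip-Saw, Saw, Buzz-Saw}")
print(suffix, len(words))
print(format_cluster(suffix, words))
```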

My favourite cluster is:

Saw : {See-Saw, Rip-Saw, m-Calling-This-Number-To-Report-What-I-Saw, Saw, Buzz-Saw}

The lowercase 'saw' cluster is much noisier:

saw : {Kennesaw, Alingasaw, Nasaw, foresaw, Ripsaw, seesaw, buzzsaw, Lvóv-Warsaw, sea-saw, oversaw, chainsaw, Poznań-Warsaw, Chopsaw, Guudsaw, jigsaw, handsaw, buzz-saw, Dine-saw, Nassaw, Jigsaw, sightsaw, Chickasaw, see-saw, Jasaw, ricksaw, Buzzsaw, warsaw, Kenesaw, Chainsaw, Jisaw, hacksaw, Warsaw, saw, whipsaw}

and it illustrates nicely that 'saw' has at least three very different uses in English: 'I saw Tim buying a saw in Warsaw'.