There are several ways to get a measure of how related one sentence in a document is related to another. Very often, the cosine similarity is used. This is a measure from information retrieval. I'll spare details here, but read http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html for a very nice overview. The method is nice, and once you've calculated the IDF*TF values, a simple matrix multiplied by its transpose results in the similarites. Here's the general equation for cosine similarity: All well and good, and not really that slow, but I wanted faster. Building on the realization that summarizing a document of sentences (i.e., finding the most related sentences to the whole and doing some voodoo) is not really the same as trying to use IR methods to retrieve documents related to a query (like Google). We are tasked with finding the main ideas of a document and the most important sentences related to those ideas. So, why again with this cosine similarity business? Well, let's take a quick example; Given the 3 sentences: 1.) Nik eats cake. 2.) Bill eats ice cream. 3.) Bill likes pie as well. We have a situation where they are obviously related, and that will come up in the cosine similarity. Two sentences share the word "eats", and two share the word "Bill". A graph of the sentences would looks something like: (1) ----- (2) ----- (3) where the ---- indicates there was a similarity above 0 linking each sentence. Although, not a very exciting example, the point is to recognize what the cosine similarity is doing. We're saying that Sentence 1 is some angle in multi-dimensional space from sentences 2 and 3. Turns out that angle is 90 degrees for 3 and something less for 2. This is how we can say that the sentences are connected in that way. If there were many sentences, groups like the above begin to appear, possibly indicating sub-topics from the whole document. So, enough chat, here's the punch line... certain words occur most often, they tend to involve the primary topics of the document (given a large enough document). This was noted way back in the late 1950's. What I propose is that these words seem to make a good estimator of similarity between sentences. Calculating Cosine Similarity the Normal Way:Here's the standard way of getting the cosine similarity:
The idf*tf vectors
Cosine Similarity = A (dot) B = (.585*0)+(.585*.585) = 0.342225 //terms with all zeros omitted |A||B| = sqrt(2*(.585^2) * (.585^2)) = 1.01324972 = .33775 How to calculate a fast estimate:But how do we use that to make similarities you ask? Well, I'm glad you did.Back to our example, here are the word frequencies: eats -> 2 Bill -> 2 ... don't care if the word only appears once, it gives no information. Make a representation for each sentence like this: 1.) 10 2.) 11 3.) 01 The place values are defined to be the top n words in the document (I suggest 32 or 64 depending on your processor design for maximum speed). In our case the first position is for eats and the second is for Bill. Remember, these positions are defined by using a list of words from the entire document, sorted by frequency in decreasing order. Now, if you're using a language that has fast binary operations, do a bitwise and between the sentences and a quick strip of the 0's and sum the remaining ones. Take that sum over the maximum possible (e.g., sentence 1 has 1 "1" and 2 and 2 "1's"). This result is your similarity estimate between those sentences. This makes it quick and easy to get similarities of groups of sentences as well. Again in a more clear format - here's the similarity between sentence 1 and 2:
There you have it. Not difficult. This method ignores the number of times a word is repeated in a document, which isn't true in the pure IDF*TF. For fear of repeating myself, we are not doing document retrieval, and looking at the sentence level is different than the document level, so how often would that TF even come into play? How do I say that these values are related if they are so different? Some quick experiments: The squared error of the similarities output by this method should decrease as more words are used. As we approach the total number of words in the document, the similarities predicted should match the similarities found using the vector space model. Below I present a table showing the result of increasing the number of words:
Clustering The Results:Clusters can be forced out of the data easier using this approach. Although this may be a bit extreme, below is a graph demonstrating similarities done using 16-bits (top 16 words). The text that was used to generate the above graphs come from http://en.wikipedia.org/wiki/Moon_landing Each sentence is a node, and the edges represent connections by cosine similarity The clusters in the first graph, "16-bit sim", are: A.) 2,23,30,32,75,135,154,210 B.) 148,150,198,200,202,204,206 C.) 18,122,147,213,226,229,231 D.) 43,45,56,157,166,167 E.) 93,95,96,191,195 None of those clusters hold much relevance to each other (good), nor do the sentences in each cluster (bad). The far end of the spectrum is taking every word. Here is a graph for the top 1024 words in the document: The cluster are still fairly obvious. The first one contains the following sentences: A.) 177,198,200,202,204,206 -- why this is nearly the same as from B. in the graph above!?! Since it keeps coming up I'll list the sentences out. Again, this is a copy and paste from that wikipedia article with sentence numbers added in "[]:" [177]:Over 80 mission milestones had to be accomplished for Apollo 11 to land on the moon and return safely to the Earth. [198]:First precise manned landing on the Moon, within walking distance of Surveyor 3. [200]:Commanded by Alan Shepard, the only one of the original Mercury Seven astronauts to walk (and golf) on the Moon. [202]:First mission with the Lunar Rover vehicle. [204]:First landing in the lunar highlands. [206]:Final Apollo lunar mission, first night launch, only mission with a professional geologist. Well, those seem a bit more related than did 148 and 150 to the rest of the group. Those two were: [148]:This superficially indicated action to alleviate the fictional "missile gap" between the U.S. and USSR, a campaign promise of Kennedy's in the 1960 election. [150]:Johnson also advised that for anything less than a lunar landing the USSR had a good chance of beating the U.S. For these reasons, Kennedy seized on Apollo as the ideal focus for American efforts in space. One of the other major clusters in the above graph is: B.) 10,130,133,137,196,197,199, 209,211,229,230 [10]:The Soviet Union later achieved sample returns via the unmanned moon landings Luna 16, Luna 20 and Luna 24. [130]:This mission stopped short at ten miles altitude above the lunar surface, performing necessary low-altitude mapping of trajectory-altering mascons using a factory prototype lunar module that was too overweight to allow a successful landing. [133]:American strategy [137]:Besides guidance and weight management, atmospheric re-entry without ablative overheating was a major hurdle. [196]:First manned landing on the Moon, July 20. [197]:• Apollo 12 - November 14, 1969. [199]:• Apollo 14 - January 31, 1971. ... as you can probably tell, the document's formatting threw off the sentence identification a bit! [209]:Apollo 7 and Apollo 9 never left Earth orbit. [211]:Unlike other international rivalries, the Space Race has remained unaffected in a direct way regarding the desire for territorial expansion. [229]:Many conspiracy theorists insist that the Apollo moon landings were a hoax. [230]:These accusations flourish in part because predictions by enthusiasts that Moon landings would become commonplace have not yet come to pass. Well, maybe this whole clustering thing isn't working out? I'll have to fiddle with this part more. Stay tuned for a follow-up post on clustering. Although a bit apples-to-oranges, here is the version using standard cosine similarity. In all 3 images I've left the nodes with a low degree of connectivity out of the picture, so don't fret if there seems to be a different number of sentences in each image. Also, there is an arbitrary threshold set as to rather to connect two or not. Otherwise, many more edges exist in the graph. Even in the above graph, there is relation between 177 and many of the sentences clustered with it by the fast method. although the links are sparser in this approach. So, what did we learn from all this rambling on? Well, a few things: First, there is a very fast way to estimate the cosine similarity between sentences.
Second, clusters on similarity don't necessarily make good summaries by themselves. I suspected that topics would group together, and that happened to an extent, but not in a dramatic fashion.
Perhaps I just implemented the cosine similarity poorly, but it took too long, even using the matrix times its transpose trick. On the other hand, this approach is fast. I've seen a speed increase of greater than an order of magnitude. A follow-up post will come when I have more answers. Thanks for reading. -erik . |