TREADS

437days since
Fall 2008 Classes Start

Unsorted Thoughts

SAGA3 - Anaphora Resolution Take 2

posted ‎‎Mar 16, 2008 9:37 PM‎‎ by Erik Schweller   [ updated ‎‎Mar 18, 2008 10:35 AM‎‎ ]

Anaphora Resolution, Part 2

 

    In the universe great acts are made up of small deeds /

    The sage does not attempt anything very big, /

    And thus achieves greatness.

                            --- Lau Tsu, Tao Te Ching

 

 

In the first attempt at anaphora resolution, the results were mixed.  Two algorithms were implemented and lightly tested. The first algorithm used a few simple rules to attempt to guess the correct sentence the current sentence under analysis depended upon.  The second algorithm was a slightly more complicated in that it attempted to use co-occurrences of words within a window to better guess the proper dependency of the current sentence to itself or a previous one.  Keeping in the theme, the algorithm presented is this section is also a guesser called the “Schweller Anaphora Guesser Algorithm, Version 3” or SAGA3 for short.  The goal of SAGA3 is to decide to which sentence each anaphoric reference points.  Once this link is determined, the proper noun phrase identified as most likely resolving the current pronoun is used to replace the pronoun in a new collection of sentences.  These sentences will be used when calculating the values used to generate a summary of the text.

 

Anaphora resolution is not an easy problem, which implies that a solution built for speed is not likely to find the best possible solution.  Particularly in automatic anaphora resolution, (i.e., no human intervention or feedback and no text annotation before processing), the best results to date show that there is much work left to be done on this topic.  For a review of how the most cited anaphora resolution algorithms perform in an automatic fashion, please read the article Comparing Pronoun Resolution Algorithms by Ruslan Mitkov.

 

The approach taken in SAGA3, as with the previous implementations, is often called “knowledge-poor”.  The deep semantic meaning of the text is not considered.  Instead, surface level analysis collects indicators that inform the resolution process.  There are many “knowledge-poor” algorithms for anaphora resolution.  See Kennedy and Boguraev’s (1996) parser-free algorithm, Baldwin’s (1997) CogNiac, and Mitkov’s (1998b) knowledge-poor approach for a few examples. 

 

Resolution Clues

 

1. Location.

            An antecedent is more likely to be close to the anaphora.  An antecedent cannot occur after the anaphora.  Subject pronouns in the object, my reference the current’s sentence’s subject.

 

2. Gender and Number

Common to many approaches in anaphor resolution is checking the agreement of gender and number.  For example, the pronoun “he” will rarely resolve to the name “Betty” just as the pronoun “they” will never refer to “Sherry” since it is a singular noun.  SAGA3 incorporates this information for known names. 

 

3. Subject and Object (Not yet implemented)

            English sentences are composed of a subject and a verb or a subject, verb, and object.  The subject pronouns are I, you, he, she, it, we, and they.  The common object pronouns are me, you, him, her, it, us, and them.  In the example, “Bill was not looking.  Erik threw the ball to him.” it would be incorrect to resolve “him” to “Erik”.  In general, object pronouns will usually refer to the object of a sentence if they need resolved (i.e., they refer to a previous sentence) and subject pronouns will refer to the subject if they need resolved.

  NOTE: Currently, the subject and object are guessed . The subject is taken to be the first 1/3 of the sentence, and the object is the rest.  This information is not currently utilized to any  notable extent.

 

SAGA3, like CogNiac, will only resolve a pronoun if the pronoun appears to fit into a situation where general purpose reasoning or real understanding of the text is not required.  A series of rules are sequentially tested, and if none match, the pronoun is not resolved.  


SAGA3 does attempt to resolve reflexive pronouns (e.g., “himself”), as they most often do not point to a separate sentence.

 


 

Processing steps:

Assuming that part of speech tags have been applied:

  1. Identify all noun phrases that do not contain a nested pronoun.
    1. If the noun phrase is entirely contained in the list of words that always appear in the document as capital words, treat the phrase as a name. 

                                                               i.      If any or all parts of the name agree in gender, treat the name as a gendered name.

  1. Identify the gender of the names using a lookup table compiled from various sources.
  2. Identify the number (plural or singular). 
  3. For each sentence, look up to 2 sentences previous and find all possible antecedents.
    1. Tag them with as much information as can be determined;

                                                               i.      Gender if the current pronoun is personal

                                                             ii.      Number if the current pronoun is impersonal

    1. Choose the antecedent that is satisfies the most conditions:

                                                               i.      For Personal Pronouns

1.      Has a known gender

2.       Closest to the pronoun in the text

3.      If the pronoun is in the object, there is no potential match in the subject of the current sentence

4.      The resolving noun appears to be a proper noun.

                                                             ii.      For Impersonal Pronouns

1.      If it is in the object of the sentence, there is no potential subject noun phrase to resolve it.

2.      The number of the pronoun matches the number of the noun phrase.

3.      If the pronoun is in the subject, there is either an appropriate noun phrase in the object of the previous sentence or the previous sentence also contained an impersonal pronoun in the subject.  (In Progress)

  1. When multiple antecedents exist in the list of possible resolutions, the system currently accepts the lexically closer noun phrase for resolution.
    1. Preference is given in the following order:

                                                               i.      Fully identified (known number, gender, and is a proper noun)

                                                             ii.      Partially identified, gender unknown but is proper. May be a noun phrase.

                                                            iii.      Generic noun of unknown type.  May be a noun phrase.

  1. When no antecedents remain, the pronoun is not resolved under the assumption further information is required to make an appropriate decision.  (This has the effect of ignoring pleonastic anaphora such as “it” does not refer to any antecedent in the text).

 


Results

 

This implementation is very careful when resolving.  Many anaphora that a human can easily resolve, are missed by SAGA3, but unlike previous versions, of the attempted resolutions, the results are promising.  In a future update, look for hard numbers describing the success rate.    


The rest of this discussion will center around the same sample as used previously:

“[0]:This is a simple example of anaphora resolution. [1]:This is by no means a complete test and should not be taken as one. [2]:To meet some requirements, this sample will be somewhat lengthy, but that will help its value as a basic test.

[3]:Some proper nouns, such as Bill and Nik need to be introduced. [4]:For them to be identified as proper nouns, they need to be mentioned more than once. [5]:For example, Bill is now a proper noun according to the simple guessing system. [6]:He is also a noun, as was Nik before the second mention of his name.

[7]:Bill traveled around the German town. [8]:He thought it was very beautiful. [9]:He also saw several birds. [10]:They chirped, “hello.” [11]:He had a nice day.

[12]:Most of the pronouns in this test link directly to the previous sentence. [13]:Here is an example of a sentence splitting the discussion. [14]:Nik found a kite. [15]:The find was fortunate. [16]:He was overjoyed. [17]:Another example of a difficult split is when the sentence references itself, but has many pronouns. [18]:This sentence is an example of itself, and it is a good sentence because of its confusing prose.”

 

 

Sentences 4, 6, 8, 9, and 10 were altered.  In order, they have been edited to read as:

4:for nouns to be identified as proper nouns they need to be mentioned more than once

6:bill is also a noun and so is nik before the second mention of his name

8:bill thought town was very beautiful

9:bill also saw several birds

10:birds chirped hello


Notice that "the" was left off of the noun phrases in the case of 8 and 10.   Also notice that sentence chaining is currently not implemented.


Recursively resolving results in sentence 11 being included as well.  Recursion is allowed until no more pronouns are resolved.  Since the "he" in sentence 11 is too far away (i.e., more than 2 sentence) from the "Bill" in sentence 7, the "he" in sentence 8 and 9 first had to be resolved.   The recursion adds little extra to time to the overall algorithm, as only the unresolved pronouns are inspected during each pass.


 

Future work

 

There are numerous extensions to the current status of SAGA3. The next few may include the following.


The simplest of all is to implement sentence chaining.  Whenever the anaphoric references of the previous sentence are resolved, include that knowledge in when resolving for the next. 


It seems the system would benefit from a continually growing database of names to their gender.  The system could learn gender as text is processed, relating occurrence of “he” or “she” with the proper nouns that are currently unknown.  

 

Further world knowledge, such as corporations, and famous people with their associations to events may help resolve "them" and "they". 

 

The resolution of split antecedents (e.g., “Bill and Nik went fishing.  They both wore shorts”) are not currently handled.  When “they” is encountered in the text, previous sentences should be inspected for the form <proper noun> <non-pronoun and non-verb> or <conjunction> <proper noun>.  IFF this appears in one of the previous 2 sentences, and no other pronouns were identified, is they can be resolved.


The noun phrase identification is quite rough around the edges in its current rendetion.  Deciding when to keep or reject a determinate is an open question currently.

 

Stay tuned, this saga is not yet concluded.

-erik

Web-Based Interface to TREADS

posted ‎‎Feb 29, 2008 4:45 PM‎‎ by Erik Schweller   [ updated ‎‎Mar 28, 2008 5:18 PM‎‎ ]

Introduction

 

 

This document describes various projects for extending the functionality of the current TREADS application. 

 

TREADS is a web-based system that reads an input text and generates several forms of output.  The system, written in Perl, runs on a lightweight server.   See http://treads.emich.edu/ for a working demo.

 

We desire web-based application(s) that will read files specified and give an interactive, well formatted, output.  The application(s) should be lightweight, well documented, easy to maintain, and intuitive.  It is preferable that the application(s) operates independently of the existing TREADS application before integration.  The application(s) will need to have a flexible and easily updateable interface for reading data files from the TREADS system.

 

These features are requested:

  1. Easily adjustable interface for parsing TREADS output.  Data returned by TREADS should be displayed in a user interface that:
    1. Has large input elements and is simplistic.
    2. Has a minimal number of interface elements.
    3. Has large fonts.
    4. Includes any further HCI elements appropriate for special education use.
  2. Text to Speech. Words, paragraphs, or the entire document may be read as the user’s request.
    1. Example Link (highlights words as they are read in MS Word):

http://www.wordtalk.org.uk/

  1. Dictionary and Thesaurus lookup for any word in the output section.
    1. Example Link (double click on words in the article): http://www.cbsnews.com/stories/2008/03/22/sunday/main3960219.shtml
  2. User accounts that track preferences, current progress though documents, and allow for restoring to the last location in a summary from which the reader was previously viewing. 
  3. Database of previously summarized and input texts and RSS feeds from a specified time.  Articles are sortable and searchable.
  4. Ability to accept multiple input formats including, RSS feeds, Websites, text documents, and pasted form information.  The raw text from these input forms, with white space preserved, should be passed on to TREADS for summarization and further processing.

 

 

Overview

 

The current process is as follows:

  1. A user accesses TREADS though a web-browser. 
  2. The user selects process options or a user profile, which is a collection of process options (not implemented in the web demo), and enters input text to be processed.
    1. The process options are optional inputs.
  3. TREADS summarizes the input text in the requested method. 
    1. TREADS generates multiple output files. 

                                                               i.      NOTE: TREADS will use a database for all storage in the future, but currently all data is stored on the file system. 

  1. Once all output is ready, the output files are made available to the user via a collection of web pages.

 

Instead of treads generating a set of web pages (steps 1 and 4), a richer user experience is desired.  This requires a robust user interface that can facilitate the various activities of the system (e.g., reading text and highlighting).   

 

The user interface will handle logging in a user, basic preparation of the text for processing, and starting the TREADS application with the proper options for the specific user.  If no user is logged in, treads should function in demo (guest) mode, with a set of defaults.    Once summarization is complete, the user will be presented with the available output types and options as specified by the user’s profile.

 

TREADS will handle the summarization and xml-like tagging of the documents that will be returned to the interface for display to the user. 

 

 

Interfacing to TREADS

 

The software interface is the connection between the display software and TREADS.  TREADS output may be returned as a reference to an array of strings, or as is currently done, written as a set of text documents to the file system for further processing.  

 

Each document on the file system will be named using a run number (a unique identifier for the current summarization), and any other necessary data, such as type of output. 

 

The order and description of the files returned as an array reference will be decided upon with input from the development group.


Files Generated by TREADS

A multitude type of output files are generated.  They will all be in a tagged text format for the system. 

·        Untagged Text Files

o       Any file not tagged will be displayed as a standard text file.

·        Tagged Files

o       A tagging system is not yet defined. 

o       XML –like tagging is preferred. 

o       Tags will include (but not limited to):

§         Font Size

§         Font Color

§         Sentence Score

§         Sentence Number

§         Paragraph

§         Header

§         Bold

§         Italics

o       Not all tags must be included by the summarizer, system defaults will be used to described data missing tags.

o       Tag names, requirements for closure, nesting, etc., are still mostly undefined, --- input is welcome. 

o       Example:

§         <Paragraph><SentenceNumber=’4’><Font Color = ‘Red’><Score=’.432’>The brown fox <Color=’blue’>jumped</Color></Sentence><SentenceNumber=’8’>Example text here. </Sentence>

 

Requested Features in Detail

           

1. User Interface

Keep in mind that students with learning disabilities are the intended target for this system. 

·        Any output view should have the ability to send to the printer. 

·        Readability

o       Large San-Serif Fonts

o       Simple Color Pallets

§         No specification, but theme it similar to EMU’s colors with more colorful elements when appropriate

o       Graphical icons / buttons

o       Few Options to parse at each level

·        Types of Output (Specific Views for the User)

o       Summarized text (May be of the following types):

§         Slider filtered

·        The user can adjust the amount of sentences seen by moving a slider back and forth

o       Sentences are tagged with a threshold value for deciding if they are to display or not

§         Sized All

·        All sentences from the input are included, just some are sized by various criteria (defined through the XML from TREADS)

§         Colored

·        All sentences may or may not be included from the input, just some are colored by various criteria (defined through the XML from TREADS)

§         User Default – Primary output, which may be any combination of the above.

 

o       Raw statistics (if enabled -- the current user is “Privileged”)

o       Input Text

o       Additional files may be included

§         HTML input to be passed though to the output screen directly

§         Image files (possibly scanned texts, or images related to the text)

§         Multiple columns to a single column and vice versa.

 

  • Performance
    • The system should respond promptly when running on a modern desktop.
      • Any performance issues that will lead to user confusion are unacceptable.  
      • Multiple users may be logged into and using the system at the same time.  The system should maintain its performance.

 

  • Reliability
    • Expect unusual inputs and handle them gracefully.
    • Handle multiple users simultaneously.

 

  • System Errors
    • Report errors in an un-intrusive fashion and write them to a log.

 

 

2. Text-to-Speech

 

The software should be able to generate computer-simulated speech for the text from TREADS.  Open source projects such as http://mary.dfki.de/ and http://www.xenocafe.com/tutorials/php/festival_text_to_speech/index.php exist.  As long as the license of the software is acceptable, a third party package may be integrated into the project, as developing a TTS engine is not the goal.

·        When text-to-speech is enabled for the user

o       Clicking on the word will lead to a method for saying the word.

§         Light box?

§         Popup window?

§         Side bar?

§         Automatically say the word without any interruption? 

o       The currently open output should have the option for being read in its entirety.

§         When reading the text to the user in a continuous fashion, the word currently being spoken should be highlighted on the screen.  

§         Speed of playback should be adjustable in the user’s preferences. 

 

 

3. Dictionary and Thesaurus Lookup

Just like the text to speech, the user should be able to click on a word and get it’s definition and synonyms and antonyms.

·        When Dictionary and Thesaurus are enabled for the user (may always have this option enabled):

o       Clicking on a word will lead to a method for displaying the information

·        The output of this information should make clear:

o       The word for which this information is being presented

o       What type of information it is (i.e., “this is a similar list of words”)

o       How to return to the previous view

 

 

4. User Accounts

            The default user account will be “Guest” with no password.  Further accounts should be creatable for users.  These accounts will be used to configure the specific options for the user including the reading speed, user type (Privileged/Normal), etc.  Only a “Privileged” user may add a new user to the system.  Additionally, “Privileged” users will be the only ones able to set certain options available to the “Normal” users.  Privileged users may access any other user’s account to configure options.

 

·        Profile Information – these are options that apply to a user’s experience with the software.

o        A “P” indicates that only a Privileged user may configure this option.  The possible values are described in parenthesis.

o       Parameters describing the types of features to enable for the user

·        Enable Dictionary and Thesaurus      (True / False)            “P”

·        Enable Text-to-Speech             (True / False)            “P”

·        Enable Colored Paragraphs      (True / False)

·        Enable Varied Fonts            (True / False)

·        User Default View            (Text)                  “P”

·        User Type                        (Privileged / Normal ) “P”

·        …More to be described

·        Documents summarized, read, etc. should be saved by the system and available for review.

·        Any document the user has not completed shall be available for continuation.  That is, if a user exits the program or navigates to a different view, the current location the user is at in the text should be saved and restored once the user returns to this document and view. 

 

5. Summarization Database

·        All summaries generated should be accessible to any user, and all documents summarized should be stored.  

·        A simple option to search previously summarized documents is requested.

·        Documents should be sortable and filterable by category, title, date, user, etc.

·        RSS Feeds from sites may be included and kept for a selectable time period.   For example, CNN’s articles for the last month may automatically be stored to the database for quick retrieval and summarization.

           

 

Additional Notes

  • This project will move to open source in time.
  • This project revolves around a developing research base, and such it is not possible to describe every option the user will be presented with.  This implies that an easily adjustable interface is required.
  • Suggestions are welcome.
  • For HTML displays, use CSS and valid HTML (preferably XHTML Transitional) and JavaScript where possible.
  • Suggestions are welcome.
  • Contact Erik any time (othererik at gmail) with questions.
  • Suggestions are welcome. :)
-erik

Heuristic Driven Anaphora Resolution

posted ‎‎Feb 29, 2008 12:16 PM‎‎ by Erik Schweller   [ updated ‎‎Feb 29, 2008 12:58 PM‎‎ ]

.

Introduction

 

What is anaphora resolution?  The best way to describe the process is to give an example in four sentences.

 

1.)    Bill traveled around the German town. 

2.)    He thought it was very beautiful.

3.)    He also saw several birds.

4.)    They chirped, “hello.”

 

When a human reader encounters those sentences, it is obvious that they really read (mentally) as follows:

 

5.)    Bill traveled around the German town. 

6.)    Bill thought the German town was very beautiful.

7.)    Bill also saw several birds.

8.)    Several birds chirped, “hello.”

 

Now, the above feat is simple for humans to do, when enough context is given.   I have yet to find a clear-cut algorithm that does as well as a human at the above task. I will describe the current attempts and results in the rest of this post.  Please, keep in mind that there are many solutions to the anaphora resolution problem, but the goal here is speed, not complete correctness (although a high level of correctness is desired). 

 

Assume that the environment is established.  A sentence pre-processor has cleaned the sentence text and a summarization of the text is ready for generation.  The text is treated as an array where each element is a sentence and the properties associated with it. 

  • One of those properties describes rather the sentence should be kept in a summarization. 
  • Other properties list the guessed proper nouns for the sentence. 
    • This lists is generated by keeping only words that are always lead with a capital letter in the entire document.  Additionally, these words must appear more than once.
  • Another critical property is the parts of speech associated with each of the words in the sentence. 
    • The part of speech tags were created using “a probability based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values” from the lingua::tagger module.
  • The location in the sentence of the first noun or noun phrase.
    • This indicates a potential subject of the sentence, particularly if a pronoun does not come before the first occurrence.
  • Average sentence length.

 

Those properties are fed into the resolving module, which supports two approaches:

 

Approach 1: Simple Anaphora Resolution

 

Look at each sentence in order. 

  • For each word in the sentence, again in order, see if we have a noun or a pronoun.
    • If the current word is a pronoun, decide if the pronoun is a personal pronoun or a object pronoun.
      • If it is a personal pronoun, look up to 5 sentences previous to the current sentence for a guessed proper noun.  If none is found, look in the first sentence, up to the current word.
        • On success, store the link of the current sentence to the one where the proper noun was found.

 

      • If the current word is an object pronoun, (e.g., “it”), look from the current point, back to find any noun as tagged by the tagger.
        • If the sentence is not linking to itself or “-1”
          • If the sentence is greater than or equal any other values assigned to this type of link (default value is “–1”)
            • Assign a link between the two sentences.
              • If linking to a sentence that already links to another, link to its target.

 

For all sentences selected to keep in the summarized version of the text, also keep the sentences linked as well.

Results of approach 1.

 

Sample input:

“[0]:This is a simple example of anaphora resolution. [1]:This is by no means a complete test and should not be taken as one. [2]:To meet some requirements, this sample will be somewhat lengthy, but that will help it’s value as a basic test.

[3]:Some proper nouns, such as Bill and Nik need to be introduced. [4]:For them to be identified as proper nouns, they need to be mentioned more than once. [5]:For example, Bill is now a proper noun according to the simple guessing system. [6]:He is also a noun, as was Nik before the second mention of his name.

[7]:Bill traveled around the German town. [8]:He thought it was very beautiful. [9]:He also saw several birds. [10]:They chirped, “hello.” [11]:He had a nice day.

[12]:Most of the pronouns in this test link directly to the previous sentence. [13]:Here is an example of a sentence splitting the discussion. [14]:Nik found a kite. [15]:The find was fortunate. [16]:He was overjoyed. [17]:Another example of a difficult split is when the sentence references itself, but has many pronouns. [18]:This sentence is an example of itself, and it is a good sentence because of its confusing prose.”

 

Links created (ignoring summarization results):

4 -> 3 twice
6 -> 5 twice
8 -> 7
9 -> 7
10 -> 7 (Incorrect)
11 -> 7
16 -> 14

 

 


Approach 2: A Voting Algorithm for Anaphora Resolution

           

Look at each sentence in order. 

  • For each word in the sentence, again in order, see if we have a noun or a pronoun.
    • Initialize a vote value to twice the average length of the sentence.
    • If the current word is a noun, add this noun to the FIFO queue.  This queue is twice the average sentence length meaning that if the average sentence length is 10, then 20 of the most recently seen nouns exist in this queue.
      • Only keep the first subject instance if the noun is in the list of proper nouns.
    • If the current word is a pronoun, loop thought the current noun queue.
      • For each noun from the queue, decrease the vote value by one.
      • If the current noun from the queue has already been in the queue on a previous occurrence of this pronoun, add the current vote to the nouns previous vote total for this pronoun.
      • If the current noun is a proper noun from the list guessed, give it an additional vote of the average sentence length.

 

Look though the sentences in order.

  • For each word in the sentence, in order, check for noun or pronoun.
    • If the word is a pronoun
      • Allow the first sentence to be marked as the sentence to link to if the pronoun is an object pronoun. 
      • If more than a third of the sentence has already been seen, allow the current to be linked to.
      • If the subject items guessed has already been passed, allow the current sentence to be linked to.
      • Identify the closest noun to the current sentence.
      • Identify the closest proper noun to the current sentence.
      • Identify the noun that received the most votes for the current pronoun that is less than three times the average sentence length away.

 

Look though the sentences in order.

  • Calculate a “confidence” that the current sentence should be linked to a previous sentence. 
    • This value is a combined measure of:
      • The summed distances between each of the three types of sentence selections (closest noun, closest proper noun, most voted noun) each scaled by an arbitrary factor.
      • The current sentence subtract the average position of the three measures, scaled by an arbitrary factor.
      • The number of votes the voted noun received over the most votes a noun received, scaled by an arbitrary factor.
      • The position in the sentence over the length of the sentence, scaled by an arbitrary factor
      • An additional bonus if the current position is 3 or less words into the sentence.
      • If these values are already defined for the current sentence and another word is encountered a bonus is given to the confidence value.
      • If the nouns selected across the measures match, a bonus is given.
    • The sentence the current one is to be linked to is then decided by:
      • Preference is given to the nearest sentence to the current one when linking, but if two of the measures share the same noun, then preference is given to the closest of the two.

 

    • If the confidence factor is greater than .8:
      • If the sentence is not linking to itself
        • If the sentence is greater than or equal any other values assigned to this type of link (default value is –1)
          • Assign a link between the two sentences.
            • If linking to a sentence that already links to another, link to its target.

 

For all sentences selected to keep in the summarized version of the text, also keep the sentences linked as well.

 

Results of approach 2:

 

The same input data is used as with the first approach.

 

Links created (ignoring summarization results):

4 -> 3     Confidence: 1.45
6 -> 5     Confidence: 1.42
8 -> 7     Confidence: 1.42
9 -> 7     Confidence: 1.20
10 -> 7   Confidence: 0.96     (Incorrect choice)
11 -> 9   Confidence: 0.98
16 ->14  Confidence: 0.84

 

 

Future Work

 

The reality is that both of these approaches are limited by their “understanding” of the text.  When trying to get accurate results, natural language processing (NLP) is the next step for most.  Most aspect of NLP tend to be computationally intensive, for this reason, it has been avoided in this exploration.

 

There are ways to mitigate the exclusion of NLP from the solution.  The most obvious extension to try is to add tables for names, places and organizations.  If the pronoun in question is “he”, look before that point for male names.  If the pronoun is “it”, exclude names from the list of possible resolution points.

 

For the simple sample above, most of the selections worked correctly for both algorithms.  For the one incorrect choice, the problem lies in the linking of sentence 9 to sentence 7.  The system does not yet understand that a link off the pronoun “they” should not be linking to a singular name “Bill” and instead should link to the plural noun.  This is another step in the direction of NLP, but again, should be able to be improved by a set of basic rules.

 

Look for further postings on this topic in the future.

-erik

Fast Sentence to Sentence Similarities

posted ‎‎Feb 29, 2008 5:30 AM‎‎ by Erik Schweller   [ updated ‎‎Feb 29, 2008 12:59 PM‎‎ ]


There are several ways to get a measure of how related one sentence in a document is related to another.  Very often, the cosine similarity is used.  This is a measure from information retrieval.  I'll spare details here, but read http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html for a very nice overview.   The method is nice, and once you've calculated the IDF*TF values, a simple matrix multiplied by its transpose results in the similarites.  

Here's the general equation for cosine similarity:


 All well and good, and not really that slow, but I wanted faster.


Building on the realization that summarizing a document of sentences (i.e., finding the most related sentences to the whole and doing some voodoo) is not really the same as trying to use IR methods to retrieve documents related to a query (like Google). We are tasked with finding the main ideas of a document and the most important sentences related to those ideas.  So, why again with this cosine similarity business?  Well, let's take a quick example;

Given the 3 sentences:
1.)  Nik eats cake.
2.)  Bill eats ice cream.
3.)  Bill likes pie as well.

We have a situation where they are obviously related, and that will come up in the cosine similarity.  Two sentences share the word "eats", and two share the word "Bill".  A graph of the sentences would looks something like:

(1)  -----   (2)   -----  (3)    where the ---- indicates there was a similarity above 0 linking each sentence.  


Although, not a very exciting example, the point is to recognize what the cosine similarity is doing.   We're saying that Sentence 1 is some angle in multi-dimensional space from sentences 2 and 3.  Turns out that angle is 90 degrees for 3 and something less for 2.   This is how we can say that the sentences are connected in that way. If there were many sentences, groups like the above begin to appear, possibly indicating sub-topics from the whole document.


So, enough chat, here's the punch line... certain words occur most often, they tend to involve the primary topics of the document (given a large enough document). This was noted way back in the late 1950's.   What I propose is that these words seem to make a good estimator of similarity between sentences.


Calculating Cosine Similarity the Normal Way:


Here's the standard way of getting the cosine similarity:

   
 Word IDF   TF  TF*IDF
Nik
 0  1
eats
 .585  1  .585
cake
 0  1
Bill
 .585  1  .585
ice
 0  1
cream
 0  1
like
 0  1
pie
 0  1
as   0  1 0
well
 0  1


The idf*tf vectors
Sentence
 Nik  eats cake
Bill
ice  cream
like
pie
as
well
 1  0 .585
0
.585
0
0
0
0
0
0
2
0
0
0
.585
0
0
0
0
0
0
3
0
0
0
.585
0
 0 0
0
0
0

Cosine Similarity = 

        A (dot) B = (.585*0)+(.585*.585)     =     0.342225     //terms with all zeros omitted
        |A||B|       = sqrt(2*(.585^2) * (.585^2))    =     1.01324972
   
       = .33775   

 

How to calculate a fast estimate:

But how do we use that to make similarities you ask?   Well, I'm glad you did. 


Back to our example,  here are the word frequencies:

eats  ->     2
Bill    ->     2
... don't care if the word only appears once, it gives no information.

Make a representation for each sentence like this:

1.)  10
2.)  11
3.)  01    
The place values are defined to be the top n words in the document (I suggest 32 or 64 depending on your processor design for maximum speed).  In our case the first position is for eats and the second is for Bill.  Remember, these positions are defined by using a list of words from the entire document, sorted by frequency in decreasing order.  Now, if you're using a language that has fast binary operations, do a bitwise and between the sentences and a quick strip of the 0's and sum the remaining ones.  Take that sum over the maximum possible (e.g.,  sentence 1 has 1 "1" and 2 and 2 "1's").  This result is your similarity estimate between those sentences.   This makes it quick and easy to get similarities of groups of sentences as well.  

Again in a more clear format - here's the similarity between sentence 1 and 2:
  1. And the representations: 10 & 11 = 10
  2. Get the maximum possible number of '1's'    (in this case, 2)
  3. Take the result of the and over the max possible (in this case .5)
    You'll notice, that indeed, the answer is not .33775, but that we indicate the sentences have something in common.


There you have it. Not difficult.  This method ignores the number of times a word is repeated in a document, which isn't true in the pure IDF*TF.  For fear of repeating myself, we are not doing document retrieval, and looking at the sentence level is different than the document level, so how often would that TF even come into play?


How do I say that these values are related if they are so different?


Some quick experiments:

The squared error of the similarities output by this method should decrease as more words are used.   As we approach the total number of words in the document, the similarities predicted should match the similarities found using the vector space model.  Below I present a table showing the result of increasing the number of words:



Moon
Cuba
Lincoln*
Mike*
Crimean
8 Words
 0.0813435  0.0629873  0.0618881  0.0418210  0.0538866
64 Words
 0.0216933  0.0138334  0.0152463  0.0106887  0.0112652
512 Words
 0.0049410  0.0028671  0.0021752  0.0014661  0.0014349
4096 Words
 0.0030707  0.0021179  0.0008323  0.0006371  0.0008352
*Consists of a compilation of multiple sources


Clustering The Results:


Clusters can be forced out of the data easier using this approach.

 Although this may be a bit extreme, below is a graph demonstrating similarities done using 16-bits (top 16 words).
The text that was used to generate the above graphs come from http://en.wikipedia.org/wiki/Moon_landing
Each sentence is a node, and the edges represent connections by cosine similarity


The clusters in the first graph, "16-bit sim", are:
A.) 2,23,30,32,75,135,154,210
B.) 148,150,198,200,202,204,206
C.) 18,122,147,213,226,229,231
D.) 43,45,56,157,166,167
E.) 93,95,96,191,195

None of those clusters hold much relevance to each other (good), nor do the sentences in each cluster (bad).



The far end of the spectrum is taking every word. Here is a graph for the top 1024 words in the document:

The cluster are still fairly obvious.  The first one contains the following sentences:
A.) 177,198,200,202,204,206 -- why this is nearly the same as from B. in the graph above!?!  

Since it keeps coming up I'll list the sentences out. Again, this is a copy and paste from that wikipedia article with sentence numbers added in "[]:"
[177]:Over 80 mission milestones had to be accomplished for Apollo 11 to land on the moon and return safely to the Earth.
[198]:First precise manned landing on the Moon, within walking distance of Surveyor 3.
[200]:Commanded by Alan Shepard, the only one of the original Mercury Seven astronauts to walk (and golf) on the Moon.
[202]:First mission with the Lunar Rover vehicle.
[204]:First landing in the lunar highlands.
[206]:Final Apollo lunar mission, first night launch, only mission with a professional geologist.

Well, those seem a bit more related than did 148 and 150 to the rest of the group. Those two were:
[148]:This superficially indicated action to alleviate the fictional "missile gap" between the U.S. and USSR, a campaign promise of Kennedy's in the 1960 election.
[150]:Johnson also advised that for anything less than a lunar landing the USSR had a good chance of beating the U.S. For these reasons, Kennedy seized on Apollo as the ideal focus for American efforts in space.

One of the other major clusters in the above graph is:
B.) 10,130,133,137,196,197,199, 209,211,229,230

[10]:The Soviet Union later achieved sample returns via the unmanned moon landings Luna 16, Luna 20 and Luna 24.
[130]:This mission stopped short at ten miles altitude above the lunar surface, performing necessary low-altitude mapping of trajectory-altering mascons using a factory prototype lunar module that was too overweight to allow a successful landing.
[133]:American strategy
[137]:Besides guidance and weight management, atmospheric re-entry without ablative overheating was a major hurdle.
[196]:First manned landing on the Moon, July 20.
[197]:•    Apollo 12 - November 14, 1969.
[199]:•    Apollo 14 - January 31, 1971.   ... as you can probably tell, the document's formatting threw off the sentence identification a bit!
[209]:Apollo 7 and Apollo 9 never left Earth orbit.
[211]:Unlike other international rivalries, the Space Race has remained unaffected in a direct way regarding the desire for territorial expansion.
[229]:Many conspiracy theorists insist that the Apollo moon landings were a hoax.
[230]:These accusations flourish in part because predictions by enthusiasts that Moon landings would become commonplace have not yet come to pass.


Well, maybe this whole clustering thing isn't working out?   I'll have to fiddle with this part more.  Stay tuned for a follow-up post on clustering.


Although a bit apples-to-oranges, here is the version using standard cosine similarity.   In all 3 images I've left the nodes with a low degree of connectivity out of the picture, so don't fret if there seems to be a different number of sentences in each image. Also, there is an arbitrary threshold set as to rather to connect two or not. Otherwise, many more edges exist in the graph.



Even in the above graph, there is relation between 177 and many of the sentences clustered with it by the fast method. although the links are sparser in this approach.


So, what did we learn from all this rambling on?   Well, a few things:

     First, there is a very fast way to estimate the cosine similarity between sentences.
    • The results can easily be tuned to give different sized groupings.
    • The results are very 'similar' to those of the normal cosine similarity (though, not exactly the same)
      • They tend to more tightly group into clusters of similar similarity.

    Second, clusters on similarity don't necessarily make good summaries by themselves. I suspected that topics would group together, and that happened to an extent, but not in a dramatic fashion.
    • The next step here, perhaps, is to summarize each cluster as a new document. 
      • Oh, what do these tantalizing clusters hold for us?
 

Perhaps I just implemented the cosine similarity poorly, but it took too long, even using the matrix times its transpose trick.  On the other hand, this approach is fast.  I've seen a speed increase of greater than an order of magnitude.

A follow-up post will come when I have more answers.  Thanks for reading.
-erik









‹ Prev    1-4 of 4    Next ›