TREADS

453days since
Fall 2008 Classes Start

Heuristic Driven Anaphora Resolution

posted ‎‎Feb 29, 2008 12:16 PM‎‎ by Erik Schweller   [ updated ‎‎Feb 29, 2008 12:58 PM‎‎ ]
.

Introduction

 

What is anaphora resolution?  The best way to describe the process is to give an example in four sentences.

 

1.)    Bill traveled around the German town. 

2.)    He thought it was very beautiful.

3.)    He also saw several birds.

4.)    They chirped, “hello.”

 

When a human reader encounters those sentences, it is obvious that they really read (mentally) as follows:

 

5.)    Bill traveled around the German town. 

6.)    Bill thought the German town was very beautiful.

7.)    Bill also saw several birds.

8.)    Several birds chirped, “hello.”

 

Now, the above feat is simple for humans to do, when enough context is given.   I have yet to find a clear-cut algorithm that does as well as a human at the above task. I will describe the current attempts and results in the rest of this post.  Please, keep in mind that there are many solutions to the anaphora resolution problem, but the goal here is speed, not complete correctness (although a high level of correctness is desired). 

 

Assume that the environment is established.  A sentence pre-processor has cleaned the sentence text and a summarization of the text is ready for generation.  The text is treated as an array where each element is a sentence and the properties associated with it. 

  • One of those properties describes rather the sentence should be kept in a summarization. 
  • Other properties list the guessed proper nouns for the sentence. 
    • This lists is generated by keeping only words that are always lead with a capital letter in the entire document.  Additionally, these words must appear more than once.
  • Another critical property is the parts of speech associated with each of the words in the sentence. 
    • The part of speech tags were created using “a probability based, corpus-trained tagger that assigns POS tags to English text based on a lookup dictionary and a set of probability values” from the lingua::tagger module.
  • The location in the sentence of the first noun or noun phrase.
    • This indicates a potential subject of the sentence, particularly if a pronoun does not come before the first occurrence.
  • Average sentence length.

 

Those properties are fed into the resolving module, which supports two approaches:

 

Approach 1: Simple Anaphora Resolution

 

Look at each sentence in order. 

  • For each word in the sentence, again in order, see if we have a noun or a pronoun.
    • If the current word is a pronoun, decide if the pronoun is a personal pronoun or a object pronoun.
      • If it is a personal pronoun, look up to 5 sentences previous to the current sentence for a guessed proper noun.  If none is found, look in the first sentence, up to the current word.
        • On success, store the link of the current sentence to the one where the proper noun was found.

 

      • If the current word is an object pronoun, (e.g., “it”), look from the current point, back to find any noun as tagged by the tagger.
        • If the sentence is not linking to itself or “-1”
          • If the sentence is greater than or equal any other values assigned to this type of link (default value is “–1”)
            • Assign a link between the two sentences.
              • If linking to a sentence that already links to another, link to its target.

 

For all sentences selected to keep in the summarized version of the text, also keep the sentences linked as well.

Results of approach 1.

 

Sample input:

“[0]:This is a simple example of anaphora resolution. [1]:This is by no means a complete test and should not be taken as one. [2]:To meet some requirements, this sample will be somewhat lengthy, but that will help it’s value as a basic test.

[3]:Some proper nouns, such as Bill and Nik need to be introduced. [4]:For them to be identified as proper nouns, they need to be mentioned more than once. [5]:For example, Bill is now a proper noun according to the simple guessing system. [6]:He is also a noun, as was Nik before the second mention of his name.

[7]:Bill traveled around the German town. [8]:He thought it was very beautiful. [9]:He also saw several birds. [10]:They chirped, “hello.” [11]:He had a nice day.

[12]:Most of the pronouns in this test link directly to the previous sentence. [13]:Here is an example of a sentence splitting the discussion. [14]:Nik found a kite. [15]:The find was fortunate. [16]:He was overjoyed. [17]:Another example of a difficult split is when the sentence references itself, but has many pronouns. [18]:This sentence is an example of itself, and it is a good sentence because of its confusing prose.”

 

Links created (ignoring summarization results):

4 -> 3 twice
6 -> 5 twice
8 -> 7
9 -> 7
10 -> 7 (Incorrect)
11 -> 7
16 -> 14

 

 


Approach 2: A Voting Algorithm for Anaphora Resolution

           

Look at each sentence in order. 

  • For each word in the sentence, again in order, see if we have a noun or a pronoun.
    • Initialize a vote value to twice the average length of the sentence.
    • If the current word is a noun, add this noun to the FIFO queue.  This queue is twice the average sentence length meaning that if the average sentence length is 10, then 20 of the most recently seen nouns exist in this queue.
      • Only keep the first subject instance if the noun is in the list of proper nouns.
    • If the current word is a pronoun, loop thought the current noun queue.
      • For each noun from the queue, decrease the vote value by one.
      • If the current noun from the queue has already been in the queue on a previous occurrence of this pronoun, add the current vote to the nouns previous vote total for this pronoun.
      • If the current noun is a proper noun from the list guessed, give it an additional vote of the average sentence length.

 

Look though the sentences in order.

  • For each word in the sentence, in order, check for noun or pronoun.
    • If the word is a pronoun
      • Allow the first sentence to be marked as the sentence to link to if the pronoun is an object pronoun. 
      • If more than a third of the sentence has already been seen, allow the current to be linked to.
      • If the subject items guessed has already been passed, allow the current sentence to be linked to.
      • Identify the closest noun to the current sentence.
      • Identify the closest proper noun to the current sentence.
      • Identify the noun that received the most votes for the current pronoun that is less than three times the average sentence length away.

 

Look though the sentences in order.

  • Calculate a “confidence” that the current sentence should be linked to a previous sentence. 
    • This value is a combined measure of:
      • The summed distances between each of the three types of sentence selections (closest noun, closest proper noun, most voted noun) each scaled by an arbitrary factor.
      • The current sentence subtract the average position of the three measures, scaled by an arbitrary factor.
      • The number of votes the voted noun received over the most votes a noun received, scaled by an arbitrary factor.
      • The position in the sentence over the length of the sentence, scaled by an arbitrary factor
      • An additional bonus if the current position is 3 or less words into the sentence.
      • If these values are already defined for the current sentence and another word is encountered a bonus is given to the confidence value.
      • If the nouns selected across the measures match, a bonus is given.
    • The sentence the current one is to be linked to is then decided by:
      • Preference is given to the nearest sentence to the current one when linking, but if two of the measures share the same noun, then preference is given to the closest of the two.

 

    • If the confidence factor is greater than .8:
      • If the sentence is not linking to itself
        • If the sentence is greater than or equal any other values assigned to this type of link (default value is –1)
          • Assign a link between the two sentences.
            • If linking to a sentence that already links to another, link to its target.

 

For all sentences selected to keep in the summarized version of the text, also keep the sentences linked as well.

 

Results of approach 2:

 

The same input data is used as with the first approach.

 

Links created (ignoring summarization results):

4 -> 3     Confidence: 1.45
6 -> 5     Confidence: 1.42
8 -> 7     Confidence: 1.42
9 -> 7     Confidence: 1.20
10 -> 7   Confidence: 0.96     (Incorrect choice)
11 -> 9   Confidence: 0.98
16 ->14  Confidence: 0.84

 

 

Future Work

 

The reality is that both of these approaches are limited by their “understanding” of the text.  When trying to get accurate results, natural language processing (NLP) is the next step for most.  Most aspect of NLP tend to be computationally intensive, for this reason, it has been avoided in this exploration.

 

There are ways to mitigate the exclusion of NLP from the solution.  The most obvious extension to try is to add tables for names, places and organizations.  If the pronoun in question is “he”, look before that point for male names.  If the pronoun is “it”, exclude names from the list of possible resolution points.

 

For the simple sample above, most of the selections worked correctly for both algorithms.  For the one incorrect choice, the problem lies in the linking of sentence 9 to sentence 7.  The system does not yet understand that a link off the pronoun “they” should not be linking to a singular name “Bill” and instead should link to the plural noun.  This is another step in the direction of NLP, but again, should be able to be improved by a set of basic rules.

 

Look for further postings on this topic in the future.

-erik