
SAGA3 - Anaphora Resolution Take 2

posted ‎‎Mar 16, 2008 9:37 PM‎‎ by Erik Schweller   [ updated ‎‎Mar 18, 2008 10:35 AM‎‎ ]

Anaphora Resolution, Part 2

 

    In the universe great acts are made up of small deeds /

    The sage does not attempt anything very big, /

    And thus achieves greatness.

                            --- Lao Tsu, Tao Te Ching

 

 

In the first attempt at anaphora resolution, the results were mixed.  Two algorithms were implemented and lightly tested.  The first algorithm used a few simple rules to guess which sentence the sentence under analysis depended upon.  The second algorithm was slightly more complicated: it used co-occurrences of words within a window to better guess whether the current sentence depended on itself or on a previous one.  Keeping with the theme, the algorithm presented in this section is also a guesser, called the “Schweller Anaphora Guesser Algorithm, Version 3”, or SAGA3 for short.  The goal of SAGA3 is to decide to which sentence each anaphoric reference points.  Once this link is determined, the noun phrase identified as most likely to resolve the current pronoun replaces the pronoun in a new collection of sentences.  These sentences will be used when calculating the values used to generate a summary of the text.

 

Anaphora resolution is not an easy problem, which implies that a solution built for speed is not likely to find the best possible answer.  Particularly in automatic anaphora resolution (i.e., no human intervention or feedback and no text annotation before processing), the best results to date show that there is much work left to be done on this topic.  For a review of how the most cited anaphora resolution algorithms perform in an automatic fashion, please read the article Comparing Pronoun Resolution Algorithms by Ruslan Mitkov.

 

The approach taken in SAGA3, as with the previous implementations, is often called “knowledge-poor”.  The deep semantic meaning of the text is not considered.  Instead, surface-level analysis collects indicators that inform the resolution process.  There are many “knowledge-poor” algorithms for anaphora resolution.  See Kennedy and Boguraev’s (1996) parser-free algorithm, Baldwin’s (1997) CogNIAC, and Mitkov’s (1998b) knowledge-poor approach for a few examples.

 

Resolution Clues

 

1. Location.

            An antecedent is more likely to be close to the anaphor.  An antecedent cannot occur after the anaphor.  Subject pronouns in the object may reference the current sentence’s subject.

 

2. Gender and Number

Common to many approaches in anaphora resolution is checking the agreement of gender and number.  For example, the pronoun “he” will rarely resolve to the name “Betty”, just as the pronoun “they” will never refer to “Sherry”, since it is a singular noun.  SAGA3 incorporates this information for known names.
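
A minimal Python sketch of such an agreement check follows.  It assumes a tiny hand-made lookup table, since the actual SAGA3 table is not shown in this post; the names NAME_GENDER, PRONOUN_FEATURES, and agrees are hypothetical.  It also treats a gender clash as a hard "no", which is stricter than the "rarely" above.

```python
# Hypothetical lookup table; the real SAGA3 gender table is not reproduced here.
NAME_GENDER = {"bill": "male", "nik": "male", "betty": "female", "sherry": "female"}

# (gender, number) for a few pronouns; None means "no gender constraint".
PRONOUN_FEATURES = {
    "he": ("male", "singular"),
    "him": ("male", "singular"),
    "she": ("female", "singular"),
    "her": ("female", "singular"),
    "they": (None, "plural"),
    "them": (None, "plural"),
}


def agrees(pronoun, name, name_number="singular"):
    """Return True unless a known gender or the number clashes."""
    gender, number = PRONOUN_FEATURES[pronoun.lower()]
    known_gender = NAME_GENDER.get(name.lower())  # None if the name is unknown
    if gender and known_gender and gender != known_gender:
        return False  # e.g. "he" against "Betty"
    if number != name_number:
        return False  # e.g. "they" against the singular "Sherry"
    return True
```

An unknown name passes the gender test here, so only information that is actually known can rule a candidate out.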

 

3. Subject and Object (Not yet implemented)

            English sentences are composed of a subject and a verb or a subject, verb, and object.  The subject pronouns are I, you, he, she, it, we, and they.  The common object pronouns are me, you, him, her, it, us, and them.  In the example, “Bill was not looking.  Erik threw the ball to him.” it would be incorrect to resolve “him” to “Erik”.  In general, object pronouns will usually refer to the object of a sentence if they need to be resolved (i.e., they refer to a previous sentence), and subject pronouns will refer to the subject if they need to be resolved.

  NOTE: Currently, the subject and object are guessed.  The subject is taken to be the first 1/3 of the sentence, and the object is the rest.  This information is not currently utilized to any notable extent.
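
The one-third guess can be sketched in a few lines of Python.  This is only an illustration: the function name guess_subject_object is hypothetical, and splitting on whitespace and rounding down (with a minimum of one word) are assumptions the post does not specify.

```python
def guess_subject_object(sentence):
    """Guess the subject as the first third of the words, the object as the rest."""
    words = sentence.split()
    cut = max(1, len(words) // 3)  # assumption: round down, keep at least one word
    return words[:cut], words[cut:]


subject, obj = guess_subject_object("Erik threw the ball to him.")
# subject is ["Erik", "threw"]; "him." lands in the guessed object
```

The verb ends up in the guessed subject here, which shows how rough the split is.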

 

SAGA3, like CogNIAC, will only resolve a pronoun if the pronoun appears to fit into a situation where general purpose reasoning or real understanding of the text is not required.  A series of rules is sequentially tested, and if none match, the pronoun is not resolved.


SAGA3 does attempt to resolve reflexive pronouns (e.g., “himself”), as they most often do not point to a separate sentence.

 


 

Processing steps:

Assuming that part of speech tags have been applied:

  1. Identify all noun phrases that do not contain a nested pronoun.
    1. If the noun phrase is entirely contained in the list of words that always appear in the document as capitalized words, treat the phrase as a name.
      i. If any or all parts of the name agree in gender, treat the name as a gendered name.
  2. Identify the gender of the names using a lookup table compiled from various sources.
  3. Identify the number (plural or singular).
  4. For each sentence, look up to 2 sentences previous and find all possible antecedents.
    1. Tag them with as much information as can be determined:
      i. Gender if the current pronoun is personal
      ii. Number if the current pronoun is impersonal
    2. Choose the antecedent that satisfies the most conditions:
      i. For Personal Pronouns
        1. Has a known gender
        2. Closest to the pronoun in the text
        3. If the pronoun is in the object, there is no potential match in the subject of the current sentence
        4. The resolving noun appears to be a proper noun.
      ii. For Impersonal Pronouns
        1. If it is in the object of the sentence, there is no potential subject noun phrase to resolve it.
        2. The number of the pronoun matches the number of the noun phrase.
        3. If the pronoun is in the subject, there is either an appropriate noun phrase in the object of the previous sentence or the previous sentence also contained an impersonal pronoun in the subject.  (In Progress)
  5. When multiple antecedents exist in the list of possible resolutions, the system currently accepts the lexically closer noun phrase for resolution.
    1. Preference is given in the following order:
      i. Fully identified (known number, gender, and is a proper noun)
      ii. Partially identified, gender unknown but is proper.  May be a noun phrase.
      iii. Generic noun of unknown type.  May be a noun phrase.
  6. When no antecedents remain, the pronoun is not resolved, under the assumption that further information is required to make an appropriate decision.  (This has the effect of ignoring pleonastic anaphora, such as an “it” that does not refer to any antecedent in the text.)
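
The selection part of these steps might look roughly like the Python sketch below.  It is not the SAGA3 code: the Candidate fields, the tier and choose_antecedent names, applying the preference order before closeness, and skipping gendered names for impersonal pronouns are all assumptions made for illustration.  The subject/object conditions are left out.

```python
from dataclasses import dataclass
from typing import List, Optional

WINDOW = 2  # look back at most this many sentences


@dataclass
class Candidate:
    text: str
    sentence: int                  # index of the sentence holding the phrase
    position: int                  # word offset inside that sentence
    is_proper: bool = False
    gender: Optional[str] = None   # "male", "female", or None if unknown
    number: Optional[str] = None   # "singular", "plural", or None if unknown


def tier(c: Candidate) -> int:
    """0 = fully identified, 1 = proper but not fully identified, 2 = generic."""
    if c.is_proper and c.gender and c.number:
        return 0
    if c.is_proper:
        return 1
    return 2


def choose_antecedent(cands: List[Candidate], sentence: int, position: int,
                      personal: bool, gender: Optional[str] = None,
                      number: Optional[str] = None) -> Optional[Candidate]:
    """Pick an antecedent for one pronoun, or None to leave it unresolved."""
    viable = []
    for c in cands:
        distance = sentence - c.sentence
        if distance < 0 or distance > WINDOW:
            continue  # in a later sentence, or too far back
        if distance == 0 and c.position >= position:
            continue  # an antecedent cannot occur after the pronoun
        if personal and gender and c.gender and c.gender != gender:
            continue  # known genders clash
        if not personal and c.gender:
            continue  # assumption: a gendered name is a person, so "it" skips it
        if number and c.number and c.number != number:
            continue  # known numbers clash
        viable.append(c)
    if not viable:
        return None
    # Assumed ordering: best tier first, then nearest sentence, then the
    # rightmost phrase within that sentence.
    return min(viable, key=lambda c: (tier(c), sentence - c.sentence, -c.position))
```

Returning None when nothing survives the filters mirrors the cautious behavior described in the last step.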

 


Results

 

This implementation is very careful when resolving.  Many anaphora that a human can easily resolve are missed by SAGA3 but, unlike in previous versions, the resolutions it does attempt look promising.  In a future update, look for hard numbers describing the success rate.


The rest of this discussion will center around the same sample as used previously:

“[0]:This is a simple example of anaphora resolution. [1]:This is by no means a complete test and should not be taken as one. [2]:To meet some requirements, this sample will be somewhat lengthy, but that will help its value as a basic test.

[3]:Some proper nouns, such as Bill and Nik need to be introduced. [4]:For them to be identified as proper nouns, they need to be mentioned more than once. [5]:For example, Bill is now a proper noun according to the simple guessing system. [6]:He is also a noun, as was Nik before the second mention of his name.

[7]:Bill traveled around the German town. [8]:He thought it was very beautiful. [9]:He also saw several birds. [10]:They chirped, “hello.” [11]:He had a nice day.

[12]:Most of the pronouns in this test link directly to the previous sentence. [13]:Here is an example of a sentence splitting the discussion. [14]:Nik found a kite. [15]:The find was fortunate. [16]:He was overjoyed. [17]:Another example of a difficult split is when the sentence references itself, but has many pronouns. [18]:This sentence is an example of itself, and it is a good sentence because of its confusing prose.”

 

 

Sentences 4, 6, 8, 9, and 10 were altered.  In order, they have been edited to read as:

4:for nouns to be identified as proper nouns they need to be mentioned more than once

6:bill is also a noun and so is nik before the second mention of his name

8:bill thought town was very beautiful

9:bill also saw several birds

10:birds chirped hello


Notice that "the" was left off of the noun phrases in the case of 8 and 10.   Also notice that sentence chaining is currently not implemented.


Recursively resolving results in sentence 11 being included as well.  Recursion is allowed until no more pronouns are resolved.  Since the "he" in sentence 11 is too far away (i.e., more than 2 sentences) from the "Bill" in sentence 7, the "he" in sentences 8 and 9 first had to be resolved.  The recursion adds little extra time to the overall algorithm, as only the unresolved pronouns are inspected during each pass.
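
The repeated passes can be sketched as a loop that stops once a pass resolves nothing.  The Python below is a toy stand-in rather than the SAGA3 code: toy_resolve only handles "he", KNOWN_MALE_NAMES is a hypothetical substitute for the gender table, and the sample is a lowercased version of sentences 7 through 11.  Each pass decides its replacements against the text as it stood at the start of the pass, which is an assumption made so that a resolution only helps later passes (there is no chaining within a pass).

```python
PRONOUNS = {"he", "it", "they"}      # toy list for the demo
KNOWN_MALE_NAMES = {"bill", "nik"}   # hypothetical stand-in for the gender table
WINDOW = 2


def toy_resolve(sentences, s, w):
    """Resolve "he" to the nearest known male name in the previous WINDOW
    sentences; anything else stays unresolved (None)."""
    if sentences[s][w] != "he":
        return None
    for prev in range(s - 1, max(s - WINDOW, 0) - 1, -1):
        for tok in reversed(sentences[prev]):
            if tok in KNOWN_MALE_NAMES:
                return tok
    return None


def resolve_until_stable(sentences, resolve_one):
    """Run passes over the unresolved pronouns until a pass resolves nothing.
    Edits the token lists in place and returns the number of useful passes."""
    pending = [(s, w) for s, toks in enumerate(sentences)
               for w, tok in enumerate(toks) if tok in PRONOUNS]
    passes = 0
    while pending:
        # Decide all replacements first, then apply them together.
        found = {(s, w): resolve_one(sentences, s, w) for s, w in pending}
        resolved = {pos: tok for pos, tok in found.items() if tok is not None}
        if not resolved:
            break
        for (s, w), tok in resolved.items():
            sentences[s][w] = tok
        pending = [pos for pos in pending if pos not in resolved]
        passes += 1
    return passes


sample = [
    "bill traveled around the german town".split(),
    "he thought it was very beautiful".split(),
    "he also saw several birds".split(),
    "they chirped hello".split(),
    "he had a nice day".split(),
]
passes = resolve_until_stable(sample, toy_resolve)
```

In this toy run the first pass fills in the "he" of the second and third sentences, and only then can a second pass reach the "he" in the last one.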


 

Future work

 

There are numerous possible extensions to SAGA3 as it stands.  The next few may include the following.


The simplest of all is to implement sentence chaining.  Whenever the anaphoric references of the previous sentence are resolved, include that knowledge when resolving the next.


It seems the system would benefit from a continually growing database mapping names to their gender.  The system could learn gender as text is processed, relating occurrences of “he” or “she” to the proper nouns whose gender is currently unknown.

 

Further world knowledge, such as corporations and famous people along with their associations to events, may help resolve "them" and "they".

 

The resolution of split antecedents (e.g., “Bill and Nik went fishing.  They both wore shorts”) is not currently handled.  When “they” is encountered in the text, previous sentences should be inspected for the form <proper noun> <non-pronoun and non-verb> or <conjunction> <proper noun>.  If and only if this form appears in one of the previous 2 sentences, and no other pronouns were identified, “they” can be resolved.
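
A first cut at spotting the simplest case, <proper noun> <conjunction> <proper noun>, might look like the Python sketch below.  The function name find_split_antecedent is hypothetical, the set of proper nouns is assumed to be already known, and the <non-pronoun and non-verb> variant of the form is not covered.

```python
CONJUNCTIONS = {"and"}


def find_split_antecedent(tokens, proper_nouns):
    """Return (first, second) for the first "<proper noun> <conjunction>
    <proper noun>" run in tokens, or None if there is none."""
    for first, middle, second in zip(tokens, tokens[1:], tokens[2:]):
        if (first in proper_nouns and middle.lower() in CONJUNCTIONS
                and second in proper_nouns):
            return first, second
    return None


pair = find_split_antecedent("Bill and Nik went fishing".split(), {"Bill", "Nik"})
# pair is ("Bill", "Nik"), the candidate resolution for a following "they"
```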


The noun phrase identification is quite rough around the edges in its current rendition.  Deciding when to keep or reject a determiner is currently an open question.

 

Stay tuned, this saga is not yet concluded.

-erik