Anaphora Resolution, Part 2
In the universe great acts are made up of small deeds / The sage does not attempt anything very big, / And thus achieves greatness. --- Lao Tsu, Tao Te Ching
In the first attempt at anaphora resolution, the results were mixed. Two algorithms were implemented and lightly tested. The first used a few simple rules to guess which sentence the sentence under analysis depended upon. The second was slightly more complicated: it used co-occurrences of words within a window to better guess whether the current sentence depended on itself or on a previous one. Keeping with the theme, the algorithm presented in this section is also a guesser, called the “Schweller Anaphora Guesser Algorithm, Version 3”, or SAGA3 for short. The goal of SAGA3 is to decide which sentence each anaphoric reference points to. Once this link is determined, the proper noun phrase identified as most likely resolving the current pronoun replaces the pronoun in a new collection of sentences. These sentences will be used when calculating the values used to generate a summary of the text.
Anaphora resolution is not an easy problem, which implies that an approach built for speed is not likely to find the best possible solution. Particularly in automatic anaphora resolution (i.e., no human intervention or feedback and no text annotation before processing), the best results to date show that there is much work left to be done on this topic. For a review of how the most cited anaphora resolution algorithms perform in an automatic fashion, please read the article “Comparing Pronoun Resolution Algorithms” by Ruslan Mitkov.
The approach taken in SAGA3, as with the previous implementations, is often called “knowledge-poor”. The deep semantic meaning of the text is not considered. Instead, surface level analysis collects indicators that inform the resolution process. There are many “knowledge-poor” algorithms for anaphora resolution. See Kennedy and Boguraev’s (1996) parser-free algorithm, Baldwin’s (1997) CogNiac, and Mitkov’s (1998b) knowledge-poor approach for a few examples.
Resolution Clues
1. Location. An antecedent is more likely to be close to the anaphor. An antecedent cannot occur after the anaphor. Pronouns in the object may reference the current sentence’s subject.
2. Gender and Number. Checking the agreement of gender and number is common to many approaches in anaphora resolution. For example, the pronoun “he” will rarely resolve to the name “Betty”, just as the pronoun “they” will never refer to “Sherry”, since it is a singular noun. SAGA3 incorporates this information for known names.
3. Subject and Object (Not yet implemented). English sentences are composed of a subject and a verb, or a subject, verb, and object. The subject pronouns are I, you, he, she, it, we, and they. The common object pronouns are me, you, him, her, it, us, and them. In the example “Bill was not looking. Erik threw the ball to him.”, it would be incorrect to resolve “him” to “Erik”. In general, object pronouns that need to be resolved (i.e., that refer to a previous sentence) will usually refer to the object of a sentence, and subject pronouns that need to be resolved will refer to the subject. NOTE: Currently, the subject and object are guessed. The subject is taken to be the first 1/3 of the sentence, and the object is the rest. This information is not currently utilized to any notable extent.
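To make the second and third clues concrete, here is a minimal sketch in Python. It is not the SAGA3 source; the lookup tables and the function names `agrees` and `split_subject_object` are made up for illustration, and a real system would need a far larger name list.

```python
# Hypothetical lookup tables (assumptions for this sketch, not SAGA3 data).
KNOWN_NAMES = {
    "betty": ("female", "singular"),
    "sherry": ("female", "singular"),
    "bill": ("male", "singular"),
    "erik": ("male", "singular"),
}

PRONOUNS = {
    "he": ("male", "singular"), "him": ("male", "singular"),
    "she": ("female", "singular"), "her": ("female", "singular"),
    "they": (None, "plural"), "them": (None, "plural"),
}


def agrees(pronoun, name):
    """Return True if a known name matches the pronoun in gender and number."""
    p = PRONOUNS.get(pronoun.lower())
    n = KNOWN_NAMES.get(name.lower())
    if p is None or n is None:
        return False  # unknown pronoun or unknown name: do not guess
    p_gender, p_number = p
    n_gender, n_number = n
    gender_ok = p_gender is None or p_gender == n_gender
    return gender_ok and p_number == n_number


def split_subject_object(tokens):
    """Guess the subject as the first third of the tokens, the object as the rest."""
    cut = max(1, len(tokens) // 3)
    return tokens[:cut], tokens[cut:]
```

With these tables, `agrees("he", "Betty")` and `agrees("they", "Sherry")` are both rejected, while `agrees("him", "Bill")` passes. The one-third split is as crude as it sounds: for “Erik threw the ball to him” it puts “Erik threw” in the subject.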
SAGA3, like CogNiac, will only resolve a pronoun if the pronoun appears to fit into a situation where general purpose reasoning or real understanding of the text is not required. A series of rules is tested sequentially, and if none match, the pronoun is not resolved.
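The "test rules in order, otherwise give up" idea can be sketched in a few lines of Python. The two rules here (`unique_candidate` and `only_proper_noun`) are placeholders invented for this example, not the actual SAGA3 or CogNiac rules; candidates are assumed to be plain strings.

```python
def unique_candidate(pronoun, candidates):
    """Hypothetical rule: resolve only when there is exactly one candidate."""
    return candidates[0] if len(candidates) == 1 else None


def only_proper_noun(pronoun, candidates):
    """Hypothetical rule: resolve when exactly one candidate is capitalized."""
    proper = [c for c in candidates if c[:1].isupper()]
    return proper[0] if len(proper) == 1 else None


RULES = [unique_candidate, only_proper_noun]


def resolve(pronoun, candidates, rules=RULES):
    """Try each rule in order; return the first match, or None if no rule fires."""
    for rule in rules:
        match = rule(pronoun, candidates)
        if match is not None:
            return match
    return None  # no rule matched: leave the pronoun unresolved
```

Returning `None` rather than a best guess is the point: a pronoun with two equally plausible names, as in `resolve("he", ["Bill", "Erik"])`, is simply left alone.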
Processing steps (assuming that part-of-speech tags have been applied):
i. If any or all parts of the name agree in gender, treat the name as a gendered name.
i. Gender if the current pronoun is personal
ii. Number if the current pronoun is impersonal
i. For Personal Pronouns
   1. Has a known gender
   2. Closest to the pronoun in the text
   3. If the pronoun is in the object, there is no potential match in the subject of the current sentence
   4. The resolving noun appears to be a proper noun.
ii. For Impersonal Pronouns
   1. If it is in the object of the sentence, there is no potential subject noun phrase to resolve it.
   2. The number of the pronoun matches the number of the noun phrase.
   3. If the pronoun is in the subject, there is either an appropriate noun phrase in the object of the previous sentence or the previous sentence also contained an impersonal pronoun in the subject. (In Progress)
i. Fully identified (known number, gender, and is a proper noun)
ii. Partially identified, gender unknown but is proper. May be a noun phrase.
iii. Generic noun of unknown type. May be a noun phrase.
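One way to read the three identification levels above is as a preference order among candidate antecedents. That reading is an assumption, as is everything named in the sketch below (`Candidate`, `tier`, `best_candidate`, and using token distance as the tie-breaker); it is only meant to show how such a ranking might look.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    text: str
    is_proper: bool
    gender: Optional[str]   # "male", "female", or None when unknown
    number: Optional[str]   # "singular", "plural", or None when unknown
    distance: int           # tokens between the candidate and the pronoun


def tier(c: Candidate) -> int:
    """Map a candidate to an identification level; lower is better."""
    if c.is_proper and c.gender is not None and c.number is not None:
        return 0  # fully identified
    if c.is_proper:
        return 1  # partially identified: proper, gender unknown
    return 2      # generic noun of unknown type


def best_candidate(cands: List[Candidate]) -> Optional[Candidate]:
    """Prefer the best-identified candidate, then the one closest to the pronoun."""
    if not cands:
        return None
    return min(cands, key=lambda c: (tier(c), c.distance))
```

Under this ordering a fully identified name that is five tokens away still beats a generic noun phrase sitting right next to the pronoun.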
This implementation is very careful when resolving. Many anaphora that a human can easily resolve are missed by SAGA3 but, unlike in previous versions, the results for the resolutions it does attempt are promising. In a future update, look for hard numbers describing the success rate.
“[0]:This is a simple example of anaphora resolution.
[1]:This is by no means a complete test and should not be taken as one. [2]:To
meet some requirements, this sample will be somewhat lengthy, but that will
help its value as a basic test.
Sentences 4, 6, 8, 9, and 10 were altered. In order, they have been edited to read as:
4: for nouns to be identified as proper nouns they need to be mentioned more than once
Notice that "the" was left off of the noun phrases in the case of 8 and 10. Also notice that sentence chaining is currently not implemented.
Future work
There are numerous possible extensions to SAGA3 in its current state. The next few may include the following. The simplest of all is to implement sentence chaining: whenever the anaphoric references of the previous sentence are resolved, include that knowledge when resolving the next. The system would also likely benefit from a continually growing database mapping names to their gender. The system could learn gender as text is processed, relating occurrences of “he” or “she” to the proper nouns whose gender is currently unknown.
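As a rough idea of what the gender-learning extension could look like, here is a small Python sketch. None of it exists in SAGA3 yet; the class name `GenderLearner`, the `min_votes` threshold, and the choice to count pronouns in the sentence that follows a name are all assumptions.

```python
from collections import Counter, defaultdict

# Only "he" and "she" are counted, as described in the text.
GENDERED = {"he": "male", "she": "female"}


class GenderLearner:
    """Accumulate he/she votes for proper nouns whose gender is unknown."""

    def __init__(self, min_votes=2):
        self.votes = defaultdict(Counter)
        self.min_votes = min_votes  # hypothetical confidence threshold

    def observe(self, proper_noun, following_tokens):
        """Count gendered pronouns in the sentence that follows the name."""
        for tok in following_tokens:
            gender = GENDERED.get(tok.lower())
            if gender is not None:
                self.votes[proper_noun][gender] += 1

    def gender_of(self, proper_noun):
        """Return the learned gender, or None if evidence is thin or tied."""
        counts = self.votes.get(proper_noun)
        if not counts:
            return None
        ranked = counts.most_common(2)
        best, n = ranked[0]
        if n < self.min_votes:
            return None
        if len(ranked) > 1 and ranked[1][1] == n:
            return None  # tie between male and female: stay undecided
        return best
```

Requiring a couple of votes before committing keeps a single stray “he” from mislabeling a name; the right threshold is something that would need testing on real text.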
Further world knowledge, such as corporations and famous people with their associations to events, may help resolve "them" and "they".
The resolution of split antecedents (e.g., “Bill and Nik went fishing. They both wore shorts”) is not currently handled. When “they” is encountered in the text, previous sentences should be inspected for the form <proper noun> <non-pronoun and non-verb> or <conjunction> <proper noun>. If and only if this appears in one of the previous 2 sentences, and no other pronouns were identified, “they” can be resolved. The noun phrase identification is quite rough around the edges in its current rendition. Deciding when to keep or reject a determiner is currently an open question.
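A first cut at the split-antecedent check might look like the Python sketch below. It handles only the simplest case, <proper noun> <conjunction> <proper noun>, and it makes several assumptions: sentences arrive as lists of (word, tag) pairs with Penn Treebank style tags (NNP, CC, PRP), "no other pronouns were identified" is read as "the candidate sentence contains no pronouns", and the function names are invented for the example.

```python
def find_split_antecedent(tagged):
    """Return 'A and B' for the first <proper noun> <conjunction> <proper noun> run."""
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        if t1 == "NNP" and t2 == "CC" and t3 == "NNP":
            return f"{w1} {w2} {w3}"
    return None


def resolve_they(previous_sentences, window=2):
    """Search up to `window` previous tagged sentences, nearest first."""
    for tagged in reversed(previous_sentences[-window:]):
        if any(tag in ("PRP", "PRP$") for _, tag in tagged):
            continue  # sentence contains a pronoun: too risky, skip it
        found = find_split_antecedent(tagged)
        if found is not None:
            return found
    return None
```

For the example in the text, the tagged sentence “Bill and Nik went fishing” yields “Bill and Nik” as the replacement for “they”.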
Stay tuned, this saga is not yet concluded. -erik