.Introduction
What is anaphora resolution? The best way to describe the process is to give an example in
four sentences.
1.)
Bill traveled around the German town.
2.)
He thought it was very beautiful.
3.)
He also saw several birds.
4.)
They chirped, “hello.”
When a human reader encounters those sentences, it is
obvious that they really read (mentally) as follows:
5.)
Bill traveled around the German town.
6.)
Bill thought the German town was very beautiful.
7.)
Bill also saw several birds.
8.)
Several birds chirped, “hello.”
Now, the above feat is simple for humans to do, when enough
context is given. I have yet to find a
clear-cut algorithm that does as well as a human at the above task. I will
describe the current attempts and results in the rest of this post. Please, keep in mind that there are many
solutions to the anaphora resolution problem, but the goal here is speed,
not complete correctness (although a high level of correctness is desired).
Assume that the environment is established. A sentence pre-processor has cleaned the
sentence text and a summarization of the text is ready for generation. The text is treated as an array where each
element is a sentence and the properties associated with it.
- One of
those properties describes rather the sentence should be kept in a
summarization.
- Other
properties list the guessed proper nouns for the sentence.
- This
lists is generated by keeping only words that are always lead with a
capital letter in the entire document.
Additionally, these words must appear more than once.
- Another
critical property is the parts of speech associated with each of the words
in the sentence.
- The
part of speech tags were created using “a probability based,
corpus-trained tagger that assigns POS tags to English text based on a
lookup dictionary and a set of probability values” from the lingua::tagger
module.
- The
location in the sentence of the first noun or noun phrase.
- This
indicates a potential subject of the sentence, particularly if a pronoun
does not come before the first occurrence.
- Average
sentence length.
Those properties are fed into the resolving module, which
supports two approaches:
Approach 1: Simple Anaphora Resolution
Look at each sentence in order.
- For
each word in the sentence, again in order, see if we have a noun or a
pronoun.
- If
the current word is a pronoun, decide if the pronoun is a personal
pronoun or a object pronoun.
- If
it is a personal pronoun, look up to 5 sentences previous to the current
sentence for a guessed proper noun.
If none is found, look in the first sentence, up to the current
word.
- On
success, store the link of the current sentence to the one where the
proper noun was found.
- If
the current word is an object pronoun, (e.g., “it”), look from the
current point, back to find any noun as tagged by the tagger.
- If
the sentence is not linking to itself or “-1”
- If
the sentence is greater than or equal any other values assigned to
this type of link (default value is “–1”)
- Assign
a link between the two sentences.
- If
linking to a sentence that already links to another, link to its
target.
For all sentences selected to keep in the summarized version
of the text, also keep the sentences linked as well.
Results of approach 1.
Sample input: “[0]:This is a simple example of anaphora resolution. [1]:This is by no means a complete test and should not be taken as one. [2]:To meet some requirements, this sample will be somewhat lengthy, but that will help it’s value as a basic test.
[3]:Some proper nouns, such as Bill and Nik need to be introduced. [4]:For them to be identified as proper nouns, they need to be mentioned more than once. [5]:For example, Bill is now a proper noun according to the simple guessing system. [6]:He is also a noun, as was Nik before the second mention of his name.
[7]:Bill traveled around the German town. [8]:He thought it was very beautiful. [9]:He also saw several birds. [10]:They chirped, “hello.” [11]:He had a nice day.
[12]:Most of the pronouns in this test link directly to the previous sentence. [13]:Here is an example of a sentence splitting the discussion. [14]:Nik found a kite. [15]:The find was fortunate. [16]:He was overjoyed. [17]:Another example of a difficult split is when the sentence references itself, but has many pronouns. [18]:This sentence is an example of itself, and it is a good sentence because of its confusing prose.”
Links created (ignoring summarization
results): 4 -> 3 twice 6 -> 5 twice 8 -> 7 9 -> 7 10 -> 7 (Incorrect) 11 -> 7 16 -> 14
Approach 2: A Voting Algorithm for Anaphora Resolution
Look at each sentence in order.
- For
each word in the sentence, again in order, see if we have a noun or a
pronoun.
- Initialize
a vote value to twice the average length of the sentence.
- If
the current word is a noun, add this noun to the FIFO queue. This queue is twice the average
sentence length meaning that if the average sentence length is 10, then
20 of the most recently seen nouns exist in this queue.
- Only
keep the first subject instance if the noun is in the list of proper
nouns.
- If
the current word is a pronoun, loop thought the current noun queue.
- For
each noun from the queue, decrease the vote value by one.
- If
the current noun from the queue has already been in the queue on a
previous occurrence of this pronoun, add the current vote to the nouns
previous vote total for this pronoun.
- If
the current noun is a proper noun from the list guessed, give it an
additional vote of the average sentence length.
Look though the sentences in order.
- For
each word in the sentence, in order, check for noun or pronoun.
- If
the word is a pronoun
- Allow
the first sentence to be marked as the sentence to link to if the
pronoun is an object pronoun.
- If
more than a third of the sentence has already been seen, allow the
current to be linked to.
- If
the subject items guessed has already been passed, allow the current
sentence to be linked to.
- Identify
the closest noun to the current sentence.
- Identify
the closest proper noun to the current sentence.
- Identify
the noun that received the most votes for the current pronoun that is
less than three times the average sentence length away.
Look though the sentences in order.
- Calculate
a “confidence” that the current sentence should be linked to a previous
sentence.
- This
value is a combined measure of:
- The
summed distances between each of the three types of sentence selections
(closest noun, closest proper noun, most voted noun) each scaled by an
arbitrary factor.
- The
current sentence subtract the average position of the three measures,
scaled by an arbitrary factor.
- The
number of votes the voted noun received over the most votes a noun
received, scaled by an arbitrary factor.
- The
position in the sentence over the length of the sentence, scaled by an
arbitrary factor
- An
additional bonus if the current position is 3 or less words into the
sentence.
- If
these values are already defined for the current sentence and another
word is encountered a bonus is given to the confidence value.
- If
the nouns selected across the measures match, a bonus is given.
- The
sentence the current one is to be linked to is then decided by:
- Preference
is given to the nearest sentence to the current one when linking, but if
two of the measures share the same noun, then preference is given to the
closest of the two.
- If
the confidence factor is greater than .8:
- If
the sentence is not linking to itself
- If
the sentence is greater than or equal any other values assigned to this
type of link (default value is –1)
- Assign
a link between the two sentences.
- If
linking to a sentence that already links to another, link to its
target.
For all sentences selected to keep in the summarized version
of the text, also keep the sentences linked as well.
Results of approach 2:
The same input data is used as with the first approach.
Links created (ignoring summarization results): 4 -> 3 Confidence: 1.45 6 -> 5 Confidence: 1.42 8 -> 7 Confidence: 1.42 9 -> 7 Confidence: 1.20 10 -> 7 Confidence: 0.96 (Incorrect choice) 11 -> 9 Confidence: 0.98 16 ->14 Confidence: 0.84
Future Work
The reality is that both of these approaches are limited by
their “understanding” of the text. When
trying to get accurate results, natural
language processing (NLP) is the next step for most. Most aspect of NLP tend to be
computationally intensive, for this reason, it has been avoided in this
exploration.
There are ways to mitigate the exclusion of NLP from the
solution. The most obvious extension to
try is to add tables for names, places and organizations. If the pronoun in question is “he”, look
before that point for male names. If
the pronoun is “it”, exclude names from the list of possible resolution points.
For the simple sample above, most of the selections worked
correctly for both algorithms. For the
one incorrect choice, the problem lies in the linking of sentence 9 to sentence
7. The system does not yet understand
that a link off the pronoun “they” should not be linking to a singular name
“Bill” and instead should link to the plural noun. This is another step in the direction of NLP, but again, should
be able to be improved by a set of basic rules.
Look for further postings on this topic in the future.
-erik |