Two Authors
Text Categories
Consider the following scenarios regarding two pieces of text on the same basic subject, A and B, where A is written first and B is created later (Note: Referring to B as being created rather than written allows for copying or editing from A, and/or independent creative writing). For the purpose of this exercise, let us make the basic assumption that the two authors (call them aA and aB) have natural profiles pA and pB that are sufficiently different that one can be distinguished from the other. There are three cases to consider, depending on the actions of aB:
aB copies word for word from A. Barring mistakes in transcription, the words in B will be identical to those in A, and the profile of B will therefore be the same as pA.
B is created by editing A. Depending on how many of the words of A are copied or changed by aB in the editing process, the result will have a profile that may or may not be similar to either pA or pB.
B is written independently from A. The profiles of A and B will depend purely on their respective authors (pA and pB), with no influence from each other.
We can therefore see that when two people author text on the same subject, the similarities between the profiles of the texts will vary, depending on how much one person copies from or edits the text of the other. Now suppose that A and B each contain several passages (i.e. strings of words forming whole or part of a description, an event, a story, etc.), and that, as before, the whole of A is written before B. A includes complete passages not included in B, and vice versa. In addition, the two texts include common passages, where B is copied or edited from A. The text in A and B can then be split into the following five categories:
Category 1: Text in A: Passages not in B. This text is not influenced by B, and so has profile pA.
Category 2: Text in A: Passages also included in B, but where aB does not use A’s actual words (i.e. where B was edited from A).
Category 3: Text in both A and B: Complete or partial passages in B that were copied word-for-word from A. This text has a profile that is a ‘combined’ form of pA and pB, since it contains just that subset of the words from A that aB decided (for whatever reason) to also use.
Category 4: Text in B: Passages also included in A, but where aB does not use A’s actual words (i.e. where B was edited from A).
Category 5: Text in B: Passages not in A. This text is not influenced by A, and so has profile pB.
A New Notation
At this point it is helpful to introduce a new notation for each of the above five categories. Each category is identified by a two-digit number, with the first digit representing text in A, and the second digit representing text in B, where:
A ‘2’ identifies that the words in this category are used in this text (either A or B).
A ‘1’ identifies words not used in a passage in this text (A or B) that are used in a parallel in the other text (B or A respectively).
A ‘0’ identifies a passage in the other text not present in this text.
Therefore:
Category 1 is denoted by c20 (Words in A in passages that have no parallel in B);
Category 2 is denoted by c21 (Words in A in passages with parallels in B that use different words to those in A);
Category 3 is denoted by c22 (Words in A with identical words in parallels in B);
Category 4 is denoted by c12 (Words in B in passages with parallels in A that use different words to those in B);
Category 5 is denoted by c02 (Words in B in passages that have no parallel in A).
Categories c20, c21, and c22 together contain all the text in A, and for convenience we can identify the combination of these three categories as c2X, where the ‘X’ indicates any value (0, 1, or 2) for the text in B. Similarly, c22, c12, and c02 contain all the text in B, and this combination can be identified as cX2. These relationships are shown diagrammatically in this Two Person Venn Diagram.
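The notation can be checked mechanically. The following is a minimal sketch, not part of the original analysis: the names CATEGORIES, c2X, and cX2 as Python variables are illustrative assumptions. It derives the two combinations simply by testing which digit of each code is a ‘2’.

```python
# Category number -> two-digit code (first digit: text A, second digit: text B).
# The dictionary and variable names are illustrative, not from the source.
CATEGORIES = {1: "c20", 2: "c21", 3: "c22", 4: "c12", 5: "c02"}

# c2X: every category whose first digit is 2, i.e. all the words in A.
c2X = [code for code in CATEGORIES.values() if code[1] == "2"]

# cX2: every category whose second digit is 2, i.e. all the words in B.
cX2 = [code for code in CATEGORIES.values() if code[2] == "2"]

print(c2X)  # ['c20', 'c21', 'c22']
print(cX2)  # ['c22', 'c12', 'c02']
```

Only c22 appears in both lists, which reflects the fact that it is the one category whose words are present in both texts.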
Common Passages: Categories c21, c22, and c12
Categories c21, c22, and c12 are all created by the action of aB deciding to copy, edit, add to, or simply not use words or sentences from passages in A. For example, suppose that A contains a passage including the sentence: “The brown fox jumps over the lazy dog,” and that aB includes a variant of it in B:
A contains: “The brown fox jumps over the lazy dog,” and
B contains: “The very quick brown fox leaps over the dog who was asleep.”
Then, all the words in A and B can be assigned to one of the categories c21, c22, and c12:
c21 contains “jumps” and “lazy;”
c22 contains “The,” “brown,” “fox,” “over,” “the,” and “dog;”
c12 contains “very quick,” “leaps,” and “who was asleep.”
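The assignment above can be reproduced with a simple word-level alignment. The sketch below is one possible approach, not necessarily the method used in this analysis: it assumes that aligning the two word sequences with Python's standard difflib.SequenceMatcher is an acceptable way to decide which words are shared, and the function name categorize is hypothetical. Matching is case-sensitive and punctuation is discarded.

```python
import re
from difflib import SequenceMatcher

def categorize(a_text, b_text):
    """Split the words of two parallel passages into c21, c22, and c12.

    Assumption: a difflib word alignment stands in for the editor's
    judgement about which of A's words were re-used in B.
    """
    a_words = re.findall(r"\w+", a_text)
    b_words = re.findall(r"\w+", b_text)
    matcher = SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    cats = {"c21": [], "c22": [], "c12": []}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Words of A that reappear unchanged in B.
            cats["c22"].extend(a_words[i1:i2])
        else:
            # Words of A that aB dropped or replaced ...
            cats["c21"].extend(a_words[i1:i2])
            # ... and words that aB added or substituted.
            cats["c12"].extend(b_words[j1:j2])
    return cats

result = categorize(
    "The brown fox jumps over the lazy dog",
    "The very quick brown fox leaps over the dog who was asleep",
)
print(result["c21"])  # ['jumps', 'lazy']
print(result["c22"])  # ['The', 'brown', 'fox', 'over', 'the', 'dog']
print(result["c12"])  # ['very', 'quick', 'leaps', 'who', 'was', 'asleep']
```

A sequence alignment keeps word order, so a word that aB moved to a different position may be counted as dropped from A and added to B. Whether that is the desired behaviour depends on how re-use is defined.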
It is important to realize that although all the words in c21 and c22 were originally written by aA, aA has no influence over how the words are distributed across these two categories. The distribution of these words, and the whole content of c12, is determined solely by the actions of aB. The categories only exist because aB decided to include in B (for whatever reason) edited versions of passages that were already in A. It is worth noting that:
If aB simply copies complete passages from A and does not change any of the words in them, then c22 will contain the copied passages, and c21 and c12 will be empty;
If aB edits and attempts to never copy aA’s actual words, then c22 will probably be empty, although it is still possible that c22 will contain at least a few names, subject-specific words, etc. purely because aB happened to independently use some essential words that aA also used.
It is perhaps more likely that aB would choose not to use some sentences from A, re-use some with no change, edit others, and add some of his own. In this case c21 will contain a mixture of complete sentences and individual words from A not used by aB, c22 will contain a mixture of complete sentences and individual words from A re-used by aB, and c12 will contain complete sentences and individual words added by aB.