Correlations between Profiles

Positive Correlations

Depending on how heavily aB edited A, p21, p22, and p12 (the frequency profiles of the categories c21, c22, and c12) may or may not be similar to either p20 (which, barring subject dependencies, we assume is the same as, or very similar to, pA) or p02 (again barring subject dependencies, the same as or very similar to pB), and also may or may not be similar to each other. By ‘similar’ we mean that the frequencies with which words occur in the categories being compared are similar. By creating a scatter graph of the frequencies with which each word appears in a pair of categories we can determine how similar the frequencies are, i.e. how closely the points on the graph approximate a straight line. Although such a graph can be created for each pair of categories being compared, it is often not easy to see exactly how similar any pair of profiles actually is. However, there are various statistical methods that give a numerical value representing the degree of similarity, and one of these is correlation:

In probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables… In general statistical usage, correlation or co-relation refers to the departure of two random variables from independence. (Wikipedia)

Another way of saying this is that the more similar two profiles are (the stronger the correlation), the less the likelihood that the similarity is random, and the greater the likelihood that there is a causative factor. As determined above, depending on the actions of aB, different similarities (correlations) may exist:

In other words, if A precedes B then p21 and p22 may have strong correlations with p20 (but not with p02), and p12 may have a strong correlation with p02 but not with p20. Note that we do not need the whole of A to precede B. Instead, it is enough that aA wrote the sentences that end up in c21 and c22 before aB performed his or her edits. Essentially, there are four ‘events’:

1. aA writes the text that forms c20 (so categorised because aB does not use it later).

2. aA writes text that will later become c21 or c22 as a result of the actions of aB.

3. aB creates c21 and c22 by copying some of aA’s words to form c22, and adds the words that form c12.

4. aB writes c02.

The only time dependency here is that ‘event’ 2 must precede ‘event’ 3, and so all we actually need is that the words in c21 and c22 were written before those in c12. The asymmetry is important: if c21 and c22 precede c12, then p22 (text originally written by aA) may have a strong correlation with p20, but not with p02. Conversely, if c22 and c12 precede c21, then p22 (text originally written by aB) may have a strong correlation with p02, but not with p20.
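To make the notion of correlation between two profiles concrete, the following Python sketch computes a Pearson correlation coefficient for a pair of categories. It is only an illustration: the function names profile and pearson, the sample texts, and the decision to treat a word missing from one category as having frequency zero there are all assumptions made for this example, not details taken from the analysis described here.

```python
from collections import Counter
from math import sqrt

def profile(text):
    """Map each word in a category's text to its relative frequency."""
    words = text.lower().split()
    total = len(words)
    return {w: n / total for w, n in Counter(words).items()}

def pearson(p, q):
    """Pearson correlation of two profiles over the union of their words."""
    vocab = sorted(set(p) | set(q))
    if not vocab:
        return float("nan")  # nothing to compare
    # Assumption: a word absent from a category has frequency 0 there.
    xs = [p.get(w, 0.0) for w in vocab]
    ys = [q.get(w, 0.0) for w in vocab]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one profile is completely flat
    return cov / sqrt(vx * vy)

# Invented sample texts standing in for two categories.
c20_text = "the quick brown fox jumps over the lazy dog and the cat"
c22_text = "the brown fox and the lazy dog over the hill"

r = pearson(profile(c20_text), profile(c22_text))
print(f"r(p20, p22) = {r:.3f}")
```

A value of r close to +1 means the points of the scatter graph lie close to a rising straight line, a value close to -1 means they lie close to a falling one, and a value near 0 means there is no linear relationship.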

Different Relationships (Negative Correlations)

Strictly speaking, in discussing the possible correlations mentioned above we should be referring to positive correlations. This is because we are talking about situations in which high frequencies of use of a word in one category match high frequencies in another, and low frequencies of use in one match low frequencies in the other. However, if instead of editing by replacing complete sentences aB edits by selecting or omitting individual words from A, then p21, p22, and p12 may not have any positive correlations with either p20 or p02. Nevertheless, there may be a different form of linear relationship between the profiles, where high frequencies of use in one category match low frequencies of use in another, and low frequencies in one match high frequencies in the other. This is termed a negative correlation.

To explain how negative correlations can arise in this situation it will be helpful to go back to the example given earlier, where:

Here aB used “leaps” in place of “jumps.” If aB changed every use of “jumps” in A to “leaps,” then all of the uses of “jumps” in A that were in passages shared with B would be located in c21, and none in c22. Conversely, if “brown fox” was consistently kept by aB wherever it existed in A, then we would find both “brown” and “fox” in c22, but not in c21. Thus, if aB consistently chose not to use many of the words used by aA and consistently kept others, we would see that words common in c21 would be rare in c22, and vice versa. A scatter graph of the two profiles would show that high frequencies in one corresponded to low frequencies in the other.

Another way of looking at this is that c21 and c22 together constitute a fixed ‘pool’ of words in A, and the choices made by aB regarding which of those words to keep in B determine how many instances of each of aA’s words (e.g. how many uses of “jumps”) are in either c21 or c22. Some of the words will be more frequent in c21 and less frequent in c22, while others will be more frequent in c22 and less frequent in c21. Although c21 and c22 are not similar, there is nevertheless still a strong linear relationship between their profiles, caused directly by the actions of aB, i.e. a negative correlation.
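The ‘pool’ argument can be illustrated with a small Python sketch. The word counts below are invented, and the names pool, kept, rel_freq, and pearson are hypothetical. The sketch simply splits each word's count in the pool between c21 and c22 according to how many instances aB is imagined to have kept, and then correlates the two resulting profiles.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def rel_freq(counts, vocab):
    """Relative frequency of each word in vocab within one category."""
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

# Invented counts: how often each word occurs in aA's shared passages.
pool = {"jumps": 5, "quick": 4, "over": 3, "brown": 5, "fox": 5, "lazy": 3}
# Invented choices: how many instances of each word aB kept unchanged.
kept = {"jumps": 0, "quick": 1, "over": 0, "brown": 5, "fox": 5, "lazy": 3}

c22 = dict(kept)                             # words aB kept
c21 = {w: pool[w] - kept[w] for w in pool}   # words aB did not keep

vocab = sorted(pool)
r = pearson(rel_freq(c21, vocab), rel_freq(c22, vocab))
print(f"r(p21, p22) = {r:.3f}")  # negative for these invented counts
```

Because every instance in the pool must land in exactly one of the two categories, a word that aB mostly keeps is necessarily scarce in c21, which is what drives the coefficient below zero in this example.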

In addition to keeping the words in c22, aB modifies what aA wrote by adding or changing words and/or sentences that then form c12 (and of course aB can add new material as c02), and c22 and c12 together constitute aB’s version of the passages that aA has in c21 and c22. As before, we can view c21 and c22 as a ‘pool’ of words written by aA, with the choices made by aB determining how many instances of each word appear in either c21 or c22. If c22 contains just a selection of words from A then c12 could contain just words added to c21 to form complete sentences. Under these circumstances the words in c21 might possibly not appear at all in c12, and vice versa, again giving rise to a negative correlation. If instead both c22 and c12 contain complete sentences (from aA and aB respectively) then there is unlikely to be a correlation between them. The overall effect is as follows:

Other Possible Relationships

Category 21: Although c21 contains only words from A, p21 is likely to be a composite of words from A and B, since the choices made by aB have split aA’s words between c21 and c22. However, because aB ‘rejected’ the words in c21, there is unlikely to be a positive correlation between p21 and p02, but for the same reason there may be a negative correlation between these two profiles.

Category 22: Because c22 contains just those words from A that aB also chose to use, p22 may or may not positively correlate with p20. However, even though aB chose these words it is unlikely that p22 is similar to p02.

Category 12: c12 contains words that aB wrote, but omits those words that aB copied directly from A (which are in category 22). Therefore, depending on how B was copied or edited from A, p12 may or may not positively correlate with p02. Because c12 contains words that aB used instead of aA’s words, there may be a negative correlation between p12 and p20.

Category c20 contains that part of A that is not shared in some way with B. Similarly, c02 contains that part of B that is not shared with A. We know that p20 is likely to differ from p02 (they were written by different people), but in addition, c20 refers to events, names, places, etc. not described in c02, because by definition they contain no parallels. As a result, we may find many words in either c20 or c02 that are not used in other categories simply because the subjects referred to are different. In this case we may see negative correlations between p20, p02 and other profiles that are due to the subject matter, even if the style of writing is otherwise the same. For example, p20 and (p21 + p22) could show a negative correlation because they address different subjects, even if they were both written by aA.
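The effect of subject matter can be sketched in Python as follows. All of the counts are invented and the names are hypothetical: the two categories are imagined to share the same function words but to describe different subjects. When only the subject-specific words are compared, each word occurs in one category and not the other, and the coefficient is negative. In this particular example, adding the shared function words makes the overall coefficient positive again.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def correlate(a, b):
    """Correlate two count tables over the union of their words."""
    vocab = sorted(set(a) | set(b))
    return pearson([a.get(w, 0) for w in vocab],
                   [b.get(w, 0) for w in vocab])

# Invented subject-specific counts: one passage about a voyage, one about a trial.
c20_content = {"ship": 4, "storm": 3, "harbour": 2}
c2n_content = {"judge": 4, "witness": 3, "verdict": 2}
# Invented function-word counts, identical in both (same style of writing).
shared = {"the": 20, "and": 12, "of": 10}

r_content = correlate(c20_content, c2n_content)
r_all = correlate({**c20_content, **shared}, {**c2n_content, **shared})
print(f"content words only: {r_content:.3f}")
print(f"with shared function words: {r_all:.3f}")
```

Raw counts are used here instead of relative frequencies because both categories have the same total, and in any case rescaling a whole profile by a positive constant does not change a Pearson coefficient.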

Asymmetry of Correlations

In the possible correlations discussed above there are several asymmetries. These occur because the roles of aA and aB are not the same, and also because, of the five categories discussed, three contain words from aA but only two contain words from aB. In particular, the asymmetry regarding c22 is crucial: if the words in c21 and c22 precede those in c12, then p22 (originally written by aA) may have strong correlations with p20 and/or p21, but not with p02 and p12. Conversely, if c22 and c12 precede c21, then p22 (now originally written by aB) may have a strong correlation with p02 and/or p12, but not with p20 and p21.

In this case, and in the other cases discussed above, the asymmetry can be ‘reversed’ simply by assuming that B was written before A. Therefore, by examining the details of any correlations between two texts that include shared passages, we may be able to determine which one was written first. In addition, the correlations between p21, p22, and p12 vary according to the copying/editing/replacing choices made by the second author, and an examination of the different correlations may provide some insight into these choices.

Homogeneity

c20, c21, and c22 together constitute the whole of A, here denoted by c2X. c20 is defined as containing passages from A that have no parallels in B, and therefore the text that is left (the combination of c21 and c22) consists of all the passages from A with identical passages in B (c22) or with non-identical parallels in B (c21). If this combination is denoted by c2N, then c2N = c21 + c22 and c2X = c20 + c2N.

In total c20 and c2N contain all of A, and we can compare their profiles. c20 contains words only from A, while c2N includes c22, which contains words that might have come from either A or B. Therefore, the more similar p20 and p2N are, the more likely it is that c20 and c2N came from the same source, i.e. from A. We can also combine c20 and c21 (which will be termed c2A) and in the same way compare their resulting profile with that of c22:

Because c20, c21, and c22 together contain all the text in A, p2X is pA. Using our initial assumption that A was created before B (and therefore c2N contains only words written by aA) we can therefore predict the following:

Because p21 varies according to how aB copies or edits A, p21 may or may not be similar to p20, and so:

Similarly, c02, c12, and c22 together contain all the text in B. Using the initial assumption, some of these words (the words in c22) were copied from A. Therefore:

If the whole of A (2X) is written by the same person (and assuming similar subject matter in c20 and c2N), we should expect to see a strong correlation between p20 and p2N. Similarly, if the whole of B (X2) is written by the same person, we should expect to see a strong correlation between p02 and pN2.
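A homogeneity comparison of this kind might be sketched in Python as below. The word counts are invented and the helper names are hypothetical. The sketch assumes that combining categories simply means adding their word counts, which is how c2N and c2X are built here from c20, c21, and c22 before p20 is correlated with p2N.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def profile(counts, vocab):
    """Relative frequencies over vocab; a Counter returns 0 for absent words."""
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

# Invented word counts for the three categories that make up A.
c20 = Counter({"the": 30, "and": 18, "fox": 6, "dog": 5, "river": 4})
c21 = Counter({"the": 9, "and": 5, "jumps": 4, "quick": 3})
c22 = Counter({"the": 12, "and": 8, "fox": 5, "brown": 5, "dog": 2})

c2n = c21 + c22   # all passages of A that have some parallel in B
c2x = c20 + c2n   # the whole of A

vocab = sorted(c2x)
r = pearson(profile(c20, vocab), profile(c2n, vocab))
print(f"r(p20, p2N) = {r:.3f}")
```

The same pattern would apply to B by building cN2 from c12 and c22 and comparing its profile with p02.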

Note that p20 and p2N are each likely to show a strong correlation with p2X irrespective of whether A or B came first, because c2X contains both c20 and c2N. The same applies to comparing p02 and pN2 with pX2. This highlights a general rule for comparing combinations of categories:
