Stylometrics And The Synoptic Problem
Application to the Synoptic Problem
The Stylometric Analysis page describes what might be achieved by comparing the styles (the frequencies with which different words are used) of parallel portions of text created by either Two Authors or Three Authors. How can this be applied to the synoptic problem? Because there are three texts, the gospels of Matthew (Mt), Mark (Mk), and Luke (Lk), we can perform three sets of pair-wise comparisons:
Mt vs. Mk
Mt vs. Lk
Mk vs. Lk
Using the reasoning given in Stylometric Analysis, we may be able to find correlations between the styles of the text in the categories already identified, and if so, then we may be able to say something about the dependencies of one text upon another. However, this depends on having the text of the synoptic gospels split up into the previously described categories. For this we can turn to the:
Synoptic Concordance of the First Three Gospels: A Research Project of the Department of New Testament Studies of the University of Bamberg, Germany.
Quoting from the web-site of Prof. Dr. Thomas Hieke on this concordance:
Under the leadership of Prof. Dr. Paul Hoffmann, this research project was undertaken by Dr. Thomas Hieke, and by Dr. Ulrich Bauer, who developed the necessary computer programs. After preliminary planning and experimentation, financed by the University of Bamberg, the Deutsche Forschungsgemeinschaft has supported the project since 1996.
The Synoptic Concordance is a new research tool for the analysis of the first three Gospels, in that it presents an extensive mass of data that facilitates in a major way their literary and linguistic analysis.
The advantages of a concordance are combined with those of a synopsis: Each occurrence of a word in the synoptic Gospels, along with a swath of text that provides its context, is displayed in three columns. The effect is that one not only sees the occurrences of a certain word in one Gospel, but also the parallels in the other two Gospels.
The HHB Concordance
The Synoptic Concordance from Hoffmann, Hieke, Bauer (the HHB Concordance, or HHBC) splits the words of the synoptic gospels (Matthew, Mark, and Luke) into nineteen categories, in exactly the same way as discussed in Stylometric Analysis when considering Three Authors. Using the ordering in the HHBC, Matthew is text A, Mark is text B, and Luke is text C, so we have:
Mt = (c200+c201+c202) + (c210+c211+c212) + (c220+c221+c222) (nine categories with ‘2’ in the 1st place);
Mk = (c020+c021+c022) + (c120+c121+c122) + (c220+c221+c222) (nine categories with ‘2’ in the 2nd place);
Lk = (c002+c012+c022) + (c102+c112+c122) + (c202+c212+c222) (nine categories with ‘2’ in the 3rd place).
As before, because six of the categories shown above exist in two texts, and one in all three, there are only nineteen unique categories, as indicated in the Synoptic Venn Diagram to the right, in which the degree of overlap of the texts increases towards the center, with 200, 020 and 002 representing words unique to one gospel, and 222 representing words common to all three. This can also be seen in the HHB Concordance example page, which identifies the nineteen different categories, and also shows that the notation used above to denote the different categories (using 0s, 1s, and 2s) is taken directly from the HHB Concordance.
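The count of nineteen can be checked with a short Python sketch. It assumes only what is stated above: each category code has three digits in the order Matthew, Mark, Luke, and a category belongs to a gospel when that gospel's digit is '2'. The names CATEGORIES and gospel_categories are illustrative and are not taken from the HHB Concordance.

```python
from itertools import product

# Category codes are three digits in the order Matthew, Mark, Luke.
# A category contributes words to a gospel when that gospel's digit is '2',
# so every usable category must contain at least one '2'.
ALL_CODES = ["".join(d) for d in product("012", repeat=3)]
CATEGORIES = [c for c in ALL_CODES if "2" in c]


def gospel_categories(position):
    """Categories with '2' at the given position (0=Mt, 1=Mk, 2=Lk)."""
    return [c for c in CATEGORIES if c[position] == "2"]


MT = gospel_categories(0)
MK = gospel_categories(1)
LK = gospel_categories(2)

# Categories shared by exactly two gospels, and by all three.
IN_TWO = [c for c in CATEGORIES if c.count("2") == 2]
IN_THREE = [c for c in CATEGORIES if c.count("2") == 3]

print(len(CATEGORIES))            # number of unique categories
print(len(MT), len(MK), len(LK))  # categories per gospel
print(len(IN_TWO), len(IN_THREE))
```

Three gospels of nine categories each give 27 entries, but the six two-gospel categories are each counted twice and c222 three times, which leaves 27 - 6 - 2 = 19 unique categories.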
The HHB Concordance data used here was entered into an Excel spreadsheet by Dave Gentile, and used in his own analysis of the synoptic problem. As described by Gentile:
In the HHB Synoptic Concordance, 807 of the most common words were completely tabulated. The less common words were covered with less complete tabular information. The 807 tabulated words were used in this study. Therefore the raw data for the study was a table with 807 rows, 1 for each vocabulary item, and 19 columns, 1 for each synoptic category. Each cell in the data shows the number of times the particular vocabulary item occurs in the particular synoptic category.
Using the HHB Concordance data
Gentile’s statistical analysis of the HHBC data looks for similarities in the profiles of all the different pairings of two of the nineteen HHBC categories. However, it is also worth comparing only two of the synoptic gospels at a time, thus reducing the number of categories involved in each set of comparisons. In order to do this we need to identify which of the HHBC categories need to be combined in order to be able to perform these pair-wise comparisons, but it is useful at this point to associate particular HHBC categories with the traditional names often used for specific overlaps between the synoptics, or where there is unique material:
"The triple tradition is material that is common to all three of the synoptics. Almost all of Mark's content is found in Matthew, and about two-thirds of Mark is found in Luke. The triple tradition largely consists of narrative material (miracles, healings, and the passion) but also contains some sayings material" (Carlson). The triple tradition text is contained in the seven HHBC categories that do not contain a zero: (c211 + c221 + c121 + c122 + c112 + c212 + c222)
"The double tradition is the substantial amount of material (about 200 verses) that is shared between Matthew and Luke but is not found in Mark. Its content is mainly saying material (mostly of Jesus, but some by John the Baptist) but includes at least one miracle story (the Centurion's Servant) as well. The double tradition exhibits some of the most striking verbatim agreements in some passages but quite divergent versions in other passages" (Carlson). This material is contained in the three HHBC categories with a zero only in Mark: (c201 + c202 + c102)
The Great Omission - This refers to the large number of consecutive verses from (approximately) Mk 6:47a - 8:27b that are not included in Luke. This text could be considered to be part of a Matthew-Mark Double Tradition, although it is not referred to as such. Three categories: (c120 + c220 + c210)
Verbatim Agreements - Triple tradition words common to Matthew and Luke, but not Mark. These are all contained in c212 (The HHBC data is not structured in such a way as to be able to distinguish between the Minor and Major Agreements). Note: Where Mark differs from both Matthew and Luke, the Markan words are contained in c121.
The HHBC data does not capture any information regarding order, therefore it is not possible to distinguish agreements or disagreements in order.
Sondergut (or Special) Luke - Words unique to Luke: c002
Sondergut (or Special) Matthew - Words unique to Matthew: c200
Note that terms such as 'Double Tradition' and 'Minor Agreements' are only used when comparing Matthew and Luke against Mark, and the equivalent terms for referring to words or passages in Matthew and Mark but not Luke, or Mark and Luke but not Matthew, are not used. Similarly, it is not usual to refer to Sondergut (or Special) Mark. However, in this analysis terms such as 'Matthew-Mark Double Tradition' or 'Mark-Luke Agreements' will be used as appropriate to aid understanding.
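The groupings listed above follow a simple rule on the position of the zeros in each category code, which the following Python sketch makes explicit. The constant and function names are illustrative only, and the rule is inferred from the category lists given above rather than taken from the HHB Concordance itself.

```python
from itertools import product

# The nineteen HHBC category codes (digits in the order Mt, Mk, Lk).
CATEGORIES = ["".join(d) for d in product("012", repeat=3) if "2" in d]


def zero_only_at(position):
    """Categories whose only '0' is at the given position (0=Mt, 1=Mk, 2=Lk)."""
    return sorted(c for c in CATEGORIES
                  if c[position] == "0" and c.count("0") == 1)


# Triple tradition: no zero in any position.
TRIPLE_TRADITION = sorted(c for c in CATEGORIES if "0" not in c)
# Double tradition: zero only in Mark.
DOUBLE_TRADITION = zero_only_at(1)
# 'Matthew-Mark Double Tradition' (includes the Great Omission): zero only in Luke.
MT_MK_DOUBLE_TRADITION = zero_only_at(2)
# Words unique to a single gospel.
SPECIAL_MATTHEW = "200"
SPECIAL_LUKE = "002"

print(TRIPLE_TRADITION)
print(DOUBLE_TRADITION)
print(MT_MK_DOUBLE_TRADITION)
```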
For the 807 different words included in the HHB Concordance each of the three synoptic gospels can be ‘recreated’ by combining nine of the HHBC categories. Using the categories shown above we have:
Matthew = (c200 + c201 + c202) + (c210 + c211 + c212) + (c220 + c221 + c222) = c2XX
Mark = (c020 + c021 + c022) + (c120 + c121 + c122) + (c220 + c221 + c222) = cX2X
Luke = (c002 + c012 + c022) + (c102 + c112 + c122) + (c202 + c212 + c222) = cXX2
In comparisons involving only two synoptic gospels the third gospel is irrelevant and is ignored. For example, when comparing Matthew with Mark, Luke is ignored. Using ‘X’ to represent the third synoptic gospel in each case, we have:
Matthew – Mark comparisons: c20X, c21X, c22X, c12X, c02X
Mark – Luke comparisons: cX20, cX21, cX22, cX12, cX02
Matthew – Luke comparisons: c2X0, c2X1, c2X2, c1X2, c0X2
As is perhaps more easily shown in the diagram above, each of these categories can be created by combining three categories from the HHB Concordance. For example, for Matthew – Mark comparisons we have:
c20X = c200 + c201 + c202
c21X = c210 + c211 + c212
c22X = c220 + c221 + c222
c12X = c120 + c121 + c122
c02X = c020 + c021 + c022
Also:
c2XX = c20X + c21X + c22X
cX2X = c02X + c12X + c22X
So, by combining HHB categories, we can calculate the counts of each of the 807 different words in the concordance in the five Matthew – Mark categories, and can then use this data to look for correlations between pairs of the categories.
For Mark – Luke comparisons the equivalent categories are:
cX20 = c020 + c120 + c220
cX21 = c021 + c121 + c221
cX22 = c022 + c122 + c222
cX12 = c012 + c112 + c212
cX02 = c002 + c102 + c202
cX2X = cX20 + cX21 + cX22
cXX2 = cX02 + cX12 + cX22
For Matthew – Luke comparisons they are:
c2X0 = c200 + c210 + c220
c2X1 = c201 + c211 + c221
c2X2 = c202 + c212 + c222
c1X2 = c102 + c112 + c122
c0X2 = c002 + c012 + c022
c2XX = c2X0 + c2X1 + c2X2
cXX2 = c0X2 + c1X2 + c2X2
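The three sets of sums above all follow one pattern: replace the digit of the ignored gospel with 'X' and add together the categories that then share a code. A minimal Python sketch of that pattern is given below. The function name collapse and the example counts are invented for illustration, and the counts are not HHBC data.

```python
from collections import defaultdict
from itertools import product

# The nineteen HHBC category codes (digits in the order Mt, Mk, Lk).
CATEGORIES = ["".join(d) for d in product("012", repeat=3) if "2" in d]


def collapse(counts, ignored):
    """Sum one word's counts over the ignored gospel (0=Mt, 1=Mk, 2=Lk).

    `counts` maps a three-digit HHBC code to a count. The result maps codes
    such as '20X' to the summed count. Combinations with no '2' in the two
    remaining positions are dropped, because they hold no words of either
    of the two gospels being compared.
    """
    merged = defaultdict(int)
    for code, n in counts.items():
        key = code[:ignored] + "X" + code[ignored + 1:]
        if "2" in key:
            merged[key] += n
    return dict(merged)


# Made-up counts for a single hypothetical word, purely for illustration.
example = {code: i for i, code in enumerate(CATEGORIES, start=1)}

mt_mk = collapse(example, ignored=2)  # Luke ignored
mk_lk = collapse(example, ignored=0)  # Matthew ignored
mt_lk = collapse(example, ignored=1)  # Mark ignored

print(sorted(mt_mk))
print(sorted(mk_lk))
print(sorted(mt_lk))
```

Applying collapse to each of the 807 rows of the spreadsheet would give the five-column table needed for each pair-wise comparison.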
Determining the Values to Use When Looking For Correlations
In any pair of HHBC categories we need to look for words that appear at a similar frequency in both categories (a positive correlation) or where words have a high frequency in one category and a low frequency in the other (a negative correlation). Because each of the categories contains a different number of words in total, we cannot simply compare the actual number of each different word in a pair of categories when looking for a correlation. Instead, we need to look at the frequency with which each word appears in each category in comparison with a baseline, i.e. all the profiles combined.
Step 1: Determine the frequency. For all categories under comparison, for each of the different words in the category divide the number of times the word appears by the total number of words in that category. The result can then be expressed as a percentage.
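Step 1 can be sketched in Python as follows. The layout of the input (a dictionary of words, each holding a dictionary of category counts) and the sample numbers are assumptions made for illustration. They are not HHBC data, and the sketch assumes that no category has a total of zero.

```python
def category_frequencies(counts):
    """Convert counts to percentages of each category's total word count.

    `counts` maps word -> {category: count}. Every word is assumed to have
    an entry for every category being compared.
    """
    totals = {}
    for row in counts.values():
        for category, n in row.items():
            totals[category] = totals.get(category, 0) + n
    return {
        word: {category: 100.0 * n / totals[category]
               for category, n in row.items()}
        for word, row in counts.items()
    }


# Invented counts for two placeholder words in two categories.
sample = {
    "word_a": {"200": 30, "222": 50},
    "word_b": {"200": 10, "222": 50},
}
freqs = category_frequencies(sample)
print(freqs["word_a"])
print(freqs["word_b"])
```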
Step 2: Remove the ‘language’ factor. Some words will be more frequent than others simply because of the language the texts are written in (Greek), irrespective of who created them. Therefore, there will be an inherent degree of positive correlation between any pair of categories however different the natural ‘styles’ of the people who created the texts. To remove this factor the frequencies calculated in Step 1 have to be normalized, and there are various ways in which this could be done. First, the maximum, mean (average), and minimum frequencies of each of the words across all 19 categories must be calculated. Then, using the actual frequency values calculated in step 1 for each word in each category in turn:
Method 1: Subtract the mean frequency from the actual. The difference between the maximum and minimum frequencies (the range) is preserved, but the absolute values are not. For example, a difference of 0.1% (from 0.1% to 0.2%) could appear the same after adjustment as a difference of 0.1% (from 10.1% to 10.2%). While this method does remove the ‘language’ factor, it introduces a potential complication because frequencies above the mean will be positive after adjustment, and those below will be negative.
Method 2: Divide the actual frequency by the maximum. This method gives a value between zero and one in all cases, with zero indicating that the word does not appear in the category, and one indicating that the category contains the word at its maximum relative frequency. Where the frequency of a word is similar in all categories, this method will give adjusted frequencies close to one. Unlike in method 1 the range is not preserved, and the higher the maximum frequency, the smaller the range of values after adjustment. Using the same values as above, values ranging from 0.1% to 0.2% will range from 0.5 to 1 after adjustment, and values ranging from 10.1% to 10.2% will range from 0.99 to 1.
Method 3: Divide the difference between the actual and the minimum frequencies by the difference between the maximum and the minimum. This method also calculates a relative difference. As with Method 2, this gives an adjusted value varying between zero and one for each word. However, unlike Method 2, an adjusted value of zero represents the minimum frequency whether the actual minimum is zero or not. For all words the adjusted values will range from zero to one, with zero representing the minimum frequency, and one representing the maximum.
Method 3 can be dismissed quite simply, since here even for common words on the same subject and written by the same person, tiny frequency variations across the categories are greatly exaggerated, making categories appear to have very different profiles even when they are actually very similar.
Method 2 has the disadvantage that variations in the frequencies of infrequent words appear more significant than in frequent ones, i.e. the method is more ‘sensitive’ to variations in words that are likely to be subject dependent, and less so for words likely to be in common use by all authors.
Method 1 has the apparent disadvantage of creating negative values. However, this is actually not a problem, since pairs of negative values and pairs of the same numerical (but positive) values are exactly the same so far as the degree of correlation is concerned. This method ensures that pairs of frequencies that are both above or both below the mean contribute to a positive correlation, and pairs of frequencies where one is above and one below the mean contribute to a negative correlation. Overall, Method 1 is the simplest, and avoids the disadvantages of Methods 2 and 3. This is actually a common transformation, and data resulting from subtracting the mean is called mean-deviation data, the mean of which is always zero.
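The behaviour of the three methods can be compared using the two example ranges given above (0.1% to 0.2%, and 10.1% to 10.2%). The Python sketch below is an illustration only: the function names are invented, and for brevity each example 'word' has just two category frequencies rather than nineteen.

```python
def subtract_mean(freqs):
    """Method 1: mean-deviation values (actual minus mean)."""
    mean = sum(freqs) / len(freqs)
    return [f - mean for f in freqs]


def divide_by_max(freqs):
    """Method 2: actual divided by maximum (assumes a non-zero maximum)."""
    top = max(freqs)
    return [f / top for f in freqs]


def min_max_scale(freqs):
    """Method 3: (actual - min) / (max - min); assumes max differs from min."""
    low, high = min(freqs), max(freqs)
    return [(f - low) / (high - low) for f in freqs]


# The two example ranges from the text, in percent.
rare = [0.1, 0.2]
common = [10.1, 10.2]

for name, method in [("Method 1", subtract_mean),
                     ("Method 2", divide_by_max),
                     ("Method 3", min_max_scale)]:
    print(name,
          [round(v, 3) for v in method(rare)],
          [round(v, 3) for v in method(common)])
```

Method 1 gives the same pair of values for both words, Method 2 compresses the range of the more frequent word, and Method 3 stretches both to the full range from zero to one, however small the original variation.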
Step 3: Select the words to use when determining the correlation. The paper quoted at the beginning of this analysis stated:
Discard items which are linked in any literal sense to the text topic. Ignore very rare items (eg with frequency less than 5 over 20000 words). To save yourself time, and to maximise the sensitivity of your tests, look at only the 10 or so items with the largest differential frequency.
Using computers and spreadsheet or statistics programs the issue of time is moot, but how do rare words, or words with a similar relative frequency across several categories, affect the result?
Rare Words
Consider a word that appears only once, in one of the categories. This would perhaps not be expected to affect the result significantly. However, in comparisons of two categories not containing the word at all it nevertheless contributes to a positive correlation: The frequency is slightly below average, and hence the adjusted frequency has the same slightly negative value in both categories, whereas in comparison with the one category that contains the word there is a negative correlation.
The overall effect of including many rare words is to bias all the correlations towards being positive, and for this reason rare words should in most cases be excluded. However, words that appear multiple times in a small number of categories but not at all in the others may be significant, and should not necessarily be excluded, even where in total they are still rare.
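The effect of a single rare word can be illustrated with a small Python sketch. It assumes a hypothetical word found in only one of the nineteen categories, uses an invented frequency, and takes the sign of the product of two mean-deviation values as the word's contribution to the correlation between two categories, which is the sense used in the reasoning above.

```python
N_CATEGORIES = 19
FREQ = 0.5  # hypothetical percentage in the one category containing the word

# Raw frequencies: present in the first category, absent from the rest.
raw = [FREQ] + [0.0] * (N_CATEGORIES - 1)
mean = sum(raw) / N_CATEGORIES
adjusted = [f - mean for f in raw]  # mean-deviation values

with_word = adjusted[0]
without_a = adjusted[1]
without_b = adjusted[2]

# Two categories lacking the word: same small negative value, positive product.
print(without_a, without_b, without_a * without_b > 0)
# The category with the word against one without it: negative product.
print(with_word, with_word * without_a < 0)
```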
The HHBC data contains a total of 25856 words, so using the ‘cut off’ suggested above (less than 5 over 20000 words, or 0.025%, which is approximately 6.5 instances in 25856 words) we should ignore all words with fewer than 7 instances in total. It should be noted that the HHBC data already excludes words with only one instance, and so we would be additionally excluding all words with just 2-6 instances. This excludes the 200 least frequent words out of the 807 in the HHBC data.
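The arithmetic behind the cut-off of 7 instances can be checked as follows. This Python sketch uses only the figures quoted above, and the variable and function names are illustrative.

```python
import math

TOTAL_WORDS = 25856       # total words in the HHBC data, as stated above
CUTOFF_COUNT = 5          # 'less than 5 ...
CUTOFF_PER_WORDS = 20000  # ... over 20000 words'

# Scale the cut-off rate up to the size of the HHBC data.
threshold = TOTAL_WORDS * CUTOFF_COUNT / CUTOFF_PER_WORDS
# Smallest whole number of instances that is not below the threshold.
min_instances_kept = math.ceil(threshold)


def is_too_rare(instances):
    """True if a word's overall frequency is below 5 per 20000 words."""
    return instances * CUTOFF_PER_WORDS < CUTOFF_COUNT * TOTAL_WORDS


print(threshold)
print(min_instances_kept)
print(is_too_rare(6), is_too_rare(7))
```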
Subject Specific Words
If we are looking just for similarities in writing styles then, as far as possible, subject-specific words should normally be excluded. However, although the various names used for Jesus, God, Father, and various prophets, angels, and demons/devils may be considered subject-specific (and are clearly linked to the text topic), in this analysis all the authors are basing their text on essentially the same subject, and hence we must be careful not to exclude variations in names, etc. where these are valid elements of the styles in which the authors tell essentially the same story.
Even though the very design of the HHB concordance may have already largely placed the subject-specific words into different categories, we cannot assume that before the analysis. For example, categories 200 and 002 by definition contain the material unique to Matthew and Luke respectively, which quite likely contains subject-specific words. However, we cannot in advance of the study simply exclude words that only appear in one category as being subject specific. Nevertheless, it may turn out after initial analysis that some of these words can in fact be excluded, possibly simply by excluding more of the less frequent words.
Words with similar relative frequencies
Consider a word that appears at roughly the same frequency in all categories, i.e. where the range of frequencies is small. As indicated above, after removing the ‘language’ factor the adjusted (mean-deviation) frequencies will lie on either side of the zero mark, with positive values representing frequencies above the average, and negative values representing frequencies below the average, with a small range between the maximum and minimum values. Although comparisons of these small values will contribute towards an overall correlation value, their effect will be individually small.
Excluding them will therefore tend to make the analysis more ‘sensitive,’ by focusing on words where the mean-deviation frequencies deviate from zero by a greater degree. However, there is a danger that by excluding too many ‘low range’ words a small number of ‘high range’ words may change the result (e.g. from a positive to a negative). Therefore, excluding ‘low range’ words should only be undertaken when the ‘sense’ of the result is unchanged, and where the result becomes ‘sharper,’ i.e. a positive correlation becomes more positive, or a negative one more negative.
Calculating The Correlations
A simple way to compare the mean-deviation frequencies of these words in the different categories is to use the CORREL function of Microsoft Excel, which returns a correlation coefficient value ranging from -1 to +1. More specifically, the CORREL function computes the Pearson Product Moment Correlation (Pearson's correlation for short), which reflects the degree of linear relationship between two variables, which in this case are the adjusted frequencies in two of the HHBC categories (or meta-categories created by combining two or more HHBC categories).
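For readers not using Excel, Pearson's correlation can be computed directly from its definition. The following Python sketch is an equivalent calculation rather than Excel's own implementation, and the two short lists are invented test values, not HHBC frequencies. It assumes both lists have the same length and that neither is constant.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented values: one perfectly positive and one perfectly negative pairing.
rising = [1.0, 2.0, 3.0, 4.0]
also_rising = [2.0, 4.0, 6.0, 8.0]
falling = [4.0, 3.0, 2.0, 1.0]

print(pearson(rising, also_rising))
print(pearson(rising, falling))
```

In use, xs and ys would be the lists of adjusted frequencies of the selected words in the two categories being compared.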
A limitation of this measure of correlation is that it only produces valid results where the input variables are normally distributed, i.e. in this case where the lists of mean-deviation frequencies are normally distributed. Therefore, we need to test the data. A simple way of checking the distribution (in Excel) is to use the SKEW and KURT functions on each list of mean-deviation frequencies. SKEW determines how symmetrical the values are about their mean (which in this case is zero), and KURT measures the excess kurtosis, which is how high the central peak in the data is in comparison to the shape of a normal distribution (which has a kurtosis of 3, and therefore an excess kurtosis of zero).
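The same checks can be sketched in Python. The functions below assume the sample-corrected formulas given in Excel's documentation for SKEW and KURT. They are a re-implementation for illustration rather than Excel's own code, and the test lists are invented. At least three values are needed for the skewness and four for the kurtosis, and the values must not all be equal.

```python
import math


def _mean_and_sample_std(xs):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, std


def skew(xs):
    """Sample skewness, following the formula documented for Excel's SKEW."""
    n = len(xs)
    mean, std = _mean_and_sample_std(xs)
    total = sum(((x - mean) / std) ** 3 for x in xs)
    return n / ((n - 1) * (n - 2)) * total


def excess_kurtosis(xs):
    """Sample excess kurtosis, following the formula documented for KURT."""
    n = len(xs)
    mean, std = _mean_and_sample_std(xs)
    total = sum(((x - mean) / std) ** 4 for x in xs)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * total - correction


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]

print(skew(symmetric), excess_kurtosis(symmetric))
print(skew(right_skewed), excess_kurtosis(right_skewed))
```

Values of both measures close to zero indicate a list that is close to a normal distribution in these two respects.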
When using all of the 807 words in the HHBC data, most of the lists of mean-deviation frequencies are significantly non-normal, with positive skewness, and very high kurtosis values. In the main this is due to the large number of small negative mean-deviation values, caused in turn by many HHBC categories having none of the low frequency words. However, the more of the lower frequency words that are excluded, the closer the lists of frequencies get to having normal distributions, until with just the 29 most frequent words all the lists are acceptably close to normal. At the same time this reduction in the words used has ‘sharpened’ the results, without altering the ‘sense’ of the correlations. However, reducing the number of words still further adversely affects both the data distribution and the sense of the correlations, so in the results discussed on the next page (Stylometric - Synoptic Results) it should be assumed (unless otherwise specified) that the correlation values are based on just the 29 most frequent words in the HHBC data.
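Restricting the analysis to the most frequent words amounts to ranking the words by their total count across all categories and keeping the top of the list. A minimal Python sketch is given below. The data layout, the function name, and the sample counts are assumptions made for illustration, and ties are broken alphabetically, which is an arbitrary choice.

```python
def most_frequent(counts, n):
    """Return the n words with the highest total count over all categories.

    `counts` maps word -> {category: count}. Ties are broken alphabetically.
    """
    totals = {word: sum(row.values()) for word, row in counts.items()}
    ranked = sorted(totals, key=lambda word: (-totals[word], word))
    return ranked[:n]


# Invented counts for four placeholder words in two categories.
sample = {
    "word_a": {"200": 30, "222": 50},
    "word_b": {"200": 10, "222": 50},
    "word_c": {"200": 2, "222": 1},
    "word_d": {"200": 0, "222": 3},
}

print(most_frequent(sample, 2))
```

With the HHBC data, calling the function with n set to 29 would select the words on which the correlation values are based.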