Stylometrics And The Synoptic Problem

Application to the Synoptic Problem

The Stylometric Analysis page describes what might be achieved by comparing the styles (the frequencies with which different words are used) of parallel portions of text created by either Two Authors or Three Authors. How can this be applied to the synoptic problem? Because there are three texts, the gospels of Matthew (Mt), Mark (Mk), and Luke (Lk), we can perform three sets of pair-wise comparisons: Matthew with Mark, Mark with Luke, and Matthew with Luke.

Using the reasoning given in Stylometric Analysis, we may be able to find correlations between the styles of the text in the categories already identified, and if so, then we may be able to say something about the dependencies of one text upon another. However, this depends on having the text of the synoptic gospels split up into the previously described categories. For this we can turn to the: 

Synoptic Concordance of the First Three Gospels: A Research Project of the Department of New Testament Studies of the University of Bamberg, Germany.

Quoting from the web-site of Prof. Dr. Thomas Hieke on this concordance:

Under the leadership of Prof. Dr. Paul Hoffmann, this research project was undertaken by Dr. Thomas Hieke, and by Dr. Ulrich Bauer, who developed the necessary computer programs. After preliminary planning and experimentation, financed by the University of Bamberg, the Deutsche Forschungsgemeinschaft has supported the project since 1996.

 The Synoptic Concordance is a new research tool for the analysis of the first three Gospels, in that it presents an extensive mass of data that facilitates in a major way their literary and linguistic analysis.

 The advantages of a concordance are combined with those of a synopsis: Each occurrence of a word in the synoptic Gospels, along with a swath of text that provides its context, is displayed in three columns. The effect is that one not only sees the occurrences of a certain word in one Gospel, but also the parallels in the other two Gospels.

The HHB Concordance

The Synoptic Concordance from Hoffmann, Hieke, Bauer (the HHB Concordance, or HHBC) splits the words of the synoptic gospels (Matthew, Mark, and Luke) into nineteen categories, in exactly the same way as discussed in Stylometric Analysis when considering Three Authors. Using the ordering in the HHBC, Matthew is text A, Mark is text B, and Luke is text C, so we have:

As before, because six of the categories shown above exist in two texts, and one in all three, there are only nineteen unique categories. This is indicated in the Synoptic Venn Diagram to the right, in which the degree of overlap of the texts increases towards the center, with 200, 020 and 002 representing words unique to one gospel, and 222 representing words common to all three. This can also be seen in the HHB Concordance example page, which identifies the nineteen different categories, and also shows that the notation used above to denote the different categories (using 0s, 1s, and 2s) is taken directly from the HHB Concordance.
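The count of nineteen can be checked with a short sketch. It rests on an interpretation that is not spelled out above: that each HHBC category is a three-digit code over 0, 1 and 2 (Matthew, Mark, Luke in that order), that a 2 marks a gospel containing the word, and that every category has at least one 2. The names `categories`, `by_gospel`, `in_two` and `in_three` are purely illustrative.

```python
from itertools import product

# Assumption: a category is a 3-digit code (Matthew, Mark, Luke), each digit
# 0, 1 or 2, and a code only counts as a category if at least one digit is 2.
all_codes = ["".join(digits) for digits in product("012", repeat=3)]
categories = [code for code in all_codes if "2" in code]

# Categories belonging to each gospel: those with a 2 in that gospel's position.
by_gospel = {
    name: [code for code in categories if code[pos] == "2"]
    for pos, name in enumerate(["Matthew", "Mark", "Luke"])
}

# Categories shared by exactly two gospels, and by all three.
in_two = [code for code in categories if code.count("2") == 2]
in_three = [code for code in categories if code.count("2") == 3]

print(len(categories))
print({name: len(codes) for name, codes in by_gospel.items()})
print(in_two, in_three)
```

Under this reading, listing the categories gospel by gospel counts each `in_two` category twice and the `in_three` category three times, which is why the per-gospel lists together are longer than the list of unique categories.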

The HHB Concordance data used here was entered into an Excel spreadsheet by Dave Gentile, and used in his own analysis of the synoptic problem. As described by Gentile:

In the HHB Synoptic Concordance, 807 of the most common words were completely tabulated. The less common words were covered with less complete tabular information. The 807 tabulated words were used in this study. Therefore the raw data for the study was a table with 807 rows, 1 for each vocabulary item, and 19 columns, 1 for each synoptic category. Each cell in the data shows the number of times the particular vocabulary item occurs in the particular synoptic category.

Using the HHB Concordance data

Gentile’s statistical analysis of the HHBC data looks for similarities in the profiles of every pairing of two of the nineteen HHBC categories. However, it is also worth comparing only two of the synoptic gospels at a time, which reduces the number of categories involved in each set of comparisons. To do this we need to identify which HHBC categories must be combined for each pair-wise comparison. First, though, it is useful to associate particular HHBC categories with the traditional names often used for specific overlaps between the synoptics, or for unique material:

Note that terms such as 'Double Tradition' and 'Minor Agreements' are only used when comparing Matthew and Luke against Mark, and the equivalent terms for referring to words or passages in Matthew and Mark but not Luke, or Mark and Luke but not Matthew, are not used. Similarly, it is not usual to refer to Sondergut (or Special) Mark. However, in this analysis terms such as 'Matthew-Mark Double Tradition' or 'Mark-Luke Agreements' will be used as appropriate to aid understanding.

For the 807 different words included in the HHB Concordance each of the three synoptic gospels can be ‘recreated’ by combining nine of the HHBC categories. Using the categories shown above we have:

In comparisons involving only two synoptic gospels the third gospel is irrelevant and is ignored. For example, when comparing Matthew with Mark, Luke is ignored. Using ‘X’ to represent the third synoptic gospel in each case, we have:

As is perhaps more easily shown in the diagram above, each of these categories can be created by combining three categories from the HHB Concordance. For example, for Matthew – Mark comparisons we have:

Also:

So, by combining HHB categories, we can calculate the counts of each of the 807 different words in the concordance in the five Matthew – Mark categories, and can then use this data to look for correlations between pairs of the categories.
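The combining step might be sketched as follows. This is an illustration under stated assumptions, not the spreadsheet actually used: it assumes the HHBC categories are three-digit codes over 0, 1 and 2 with at least one 2, that the five Matthew – Mark categories are 22X, 21X, 12X, 20X and 02X, and that the counts for one word are held in a dictionary keyed by category. The helper names `members` and `combine`, and the example counts, are hypothetical.

```python
from itertools import product

# Assumption: HHBC categories are 3-digit codes (Matthew, Mark, Luke) over
# 0/1/2 containing at least one 2; 'X' means "ignore this gospel".
CATEGORIES = ["".join(d) for d in product("012", repeat=3) if "2" in d]

def members(pattern):
    """Return the HHBC categories matched by a pattern such as '22X'."""
    return [c for c in CATEGORIES
            if all(p == "X" or p == ch for p, ch in zip(pattern, c))]

def combine(word_counts, pattern):
    """Sum one word's counts over the HHBC categories matched by pattern."""
    return sum(word_counts.get(c, 0) for c in members(pattern))

# Assumed Matthew-Mark patterns: at least one of the first two digits is 2.
MT_MK = ["22X", "21X", "12X", "20X", "02X"]

# Hypothetical counts for a single word (not taken from the HHBC data).
example = {"222": 4, "220": 1, "221": 2, "200": 3, "022": 1, "002": 5}
for pattern in MT_MK:
    print(pattern, members(pattern), combine(example, pattern))
```

In this sketch the count in category 002 is never used for a Matthew – Mark comparison, because that category contains no Matthew or Mark words. The same `members` helper would serve the other two pairings by moving the X to the first or second position.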

For Mark – Luke comparisons the equivalent categories are:

For Matthew – Luke comparisons they are:

Determining the Values to Use When Looking For Correlations

In any pair of HHBC categories we need to look for words that appear at a similar frequency in both categories (contributing to a positive correlation) or that have a high frequency in one category and a low frequency in the other (contributing to a negative correlation). Because each of the categories contains a different total number of words, we cannot simply compare the raw counts of each word in a pair of categories when looking for a correlation. Instead, we need to look at the frequency with which each word appears in each category in comparison with a baseline, i.e. all the profiles combined.

Step 1: Determine the frequency. For all categories under comparison, for each of the different words in the category divide the number of times the word appears by the total number of words in that category. The result can then be expressed as a percentage.
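As a minimal sketch of Step 1, the function below converts raw counts into percentages of each category's total. The function name `frequencies`, the nested-dictionary layout, and the counts themselves are assumptions made for illustration, and are not HHBC data.

```python
def frequencies(counts_by_category):
    """Return each word's share of its category's total, as a percentage."""
    result = {}
    for category, counts in counts_by_category.items():
        total = sum(counts.values())
        if total == 0:
            raise ValueError(f"category {category} contains no words")
        result[category] = {w: 100.0 * n / total for w, n in counts.items()}
    return result

# Hypothetical counts for three words in two categories (not HHBC data).
counts = {
    "200": {"kai": 30, "de": 15, "idou": 5},
    "020": {"kai": 70, "de": 20, "idou": 10},
}
freqs = frequencies(counts)
print(freqs["200"]["kai"], freqs["020"]["kai"])
```

In this made-up example the raw counts of the first word differ by more than a factor of two, while its frequencies in the two categories are much closer, which is the reason for working with frequencies rather than counts.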

Step 2: Remove the ‘language’ factor. Some words will be more frequent than others simply because of the language the texts are written in (Greek), irrespective of who created them. Therefore, there will be an inherent degree of positive correlation between any pair of categories, however different the natural ‘styles’ of the people who created the texts. To remove this factor the frequencies calculated in Step 1 have to be normalized, and there are various ways in which this could be done. First, the maximum, mean (average), and minimum frequencies of each of the words across all 19 categories must be calculated. Then, using the actual frequency values calculated in Step 1 for each word in each category in turn:

Method 3 can be dismissed quite simply, since here even for common words on the same subject and written by the same person, tiny frequency variations across the categories are greatly exaggerated, making categories appear to have very different profiles even when they are actually very similar.

Method 2 has the disadvantage that variations in the frequencies of infrequent words appear more significant than in frequent ones, i.e. the method is more ‘sensitive’ to variations in words that are likely to be subject dependent, and less so for words likely to be in common use by all authors.

Method 1 has the apparent disadvantage of creating negative values. However, this is actually not a problem, since a pair of negative values and a pair of positive values of the same size contribute in exactly the same way to the degree of correlation. This method ensures that pairs of frequencies that are both above or both below the mean contribute to a positive correlation, and pairs of frequencies where one is above and one below the mean contribute to a negative correlation. Overall, Method 1 is the simplest, and avoids the disadvantages of Methods 2 and 3. This is actually a common transformation, and data resulting from subtracting the mean is called mean-deviation data, the mean of which is always zero.
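Method 1 can be sketched as below. It assumes one reading of "the mean": the simple average of a word's frequencies across the categories, with a category that lacks the word counted as zero. The function name `mean_deviation`, the dictionary layout, and the example frequencies are illustrative assumptions.

```python
def mean_deviation(freqs_by_category):
    """Subtract each word's mean frequency (across categories) from its
    frequency in every category. A missing word counts as frequency 0."""
    categories = list(freqs_by_category)
    words = set().union(*(freqs_by_category[c] for c in categories))
    adjusted = {c: {} for c in categories}
    for word in words:
        values = [freqs_by_category[c].get(word, 0.0) for c in categories]
        mean = sum(values) / len(values)
        for c, value in zip(categories, values):
            adjusted[c][word] = value - mean
    return adjusted

# Hypothetical percentage frequencies (not HHBC data).
freqs = {
    "200": {"kai": 6.0, "de": 3.0},
    "020": {"kai": 8.0, "de": 1.0},
    "002": {"kai": 7.0, "de": 2.0},
}
adjusted = mean_deviation(freqs)
for category, values in adjusted.items():
    print(category, values)
```

For each word the adjusted values sum to zero across the categories, so a word that is simply common in the language no longer produces large values everywhere; only its departures from its own average remain.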

Step 3: Select the words to use when determining the correlation. The paper quoted at the beginning of this analysis stated:

Discard items which are linked in any literal sense to the text topic. Ignore very rare items (eg with frequency less than 5 over 20000 words). To save yourself time, and to maximise the sensitivity of your tests, look at only the 10 or so items with the largest differential frequency.

Using computers and spreadsheet or statistics programs the issue of time is moot, but how do rare words, or words with a similar relative frequency across several categories, affect the result?

Rare Words

Consider a word that appears only once, in one of the categories. This would perhaps not be expected to affect the result significantly. However, in comparisons of two categories that do not contain the word at all it nevertheless contributes to a positive correlation: the frequency is slightly below average in both, and hence the adjusted frequency has the same slightly negative value in both categories. In comparisons with the one category that does contain the word, by contrast, it contributes to a negative correlation.
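A small numerical illustration may help here. It uses four hypothetical categories of 1000 words each, with the word occurring once in category "A" only, and takes the sign of the product of two adjusted values as a rough guide to the direction in which the word pushes a correlation. That shortcut ignores the further centring that Pearson's correlation applies to each list, so it is an approximation only.

```python
# Hypothetical: four categories of 1000 words each; the word occurs once, in "A".
category_totals = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}
word_counts = {"A": 1, "B": 0, "C": 0, "D": 0}

# Step 1: percentage frequency of the word in each category.
freq = {c: 100.0 * word_counts[c] / category_totals[c] for c in category_totals}

# Step 2 (mean-deviation): subtract the word's mean frequency across categories.
mean = sum(freq.values()) / len(freq)
adjusted = {c: freq[c] - mean for c in freq}

print(adjusted)
# Two categories lacking the word share the same small negative value.
print(adjusted["B"] * adjusted["C"] > 0)
# The category containing the word has the opposite sign to one without it.
print(adjusted["A"] * adjusted["B"] < 0)
```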

The overall effect of including many rare words is to bias all the correlations towards being positive, and for this reason rare words should in most cases be excluded. However, words that appear multiple times in a small number of categories but not at all in the others may be significant, and should not necessarily be excluded, even where in total they are still rare.

The HHBC data contains a total of 25856 words, so the ‘cut off’ suggested above (less than 5 over 20000 words, or 0.025%) corresponds to about 6.5 instances, and we should therefore ignore all words with fewer than 7 instances in total. It should be noted that the HHBC data already excludes words with only one instance, so in practice this additionally excludes all words with just 2-6 instances. This removes the 200 least frequent words out of the 807 in the HHBC data.
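The arithmetic behind this threshold can be written out as a short sketch. The figures 25856 and 5 over 20000 come from the text; the variable names and the `keep` helper are illustrative.

```python
import math

total_words = 25856        # total word instances in the HHBC data (from the text)
cutoff_rate = 5 / 20000    # "frequency less than 5 over 20000 words"

# Number of instances that the cut-off rate corresponds to in this data set.
threshold = cutoff_rate * total_words
# Smallest whole number of instances that is not below the threshold.
min_instances = math.ceil(threshold)

def keep(word_total):
    """Return True if a word with this many instances survives the cut-off."""
    return word_total >= min_instances

print(threshold, min_instances, keep(6), keep(7))
```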

Subject Specific Words

If we are looking just for similarities in writing styles then, as far as possible, subject-specific words should normally be excluded. However, although the various names used for Jesus, God, Father, and various prophets, angels, and demons/devils may be considered subject-specific (and are clearly linked to the text topic), in this analysis all the authors are basing their text on essentially the same subject, and hence we must be careful not to exclude variations in names, etc. where these are valid elements of the styles in which the authors tell essentially the same story.

Even though the very design of the HHB concordance may have already largely placed the subject-specific words into different categories, we cannot assume that before the analysis. For example, categories 200 and 002 by definition contain the material unique to Matthew and Luke respectively, which quite likely contains subject-specific words. However, we cannot in advance of the study simply exclude words that only appear in one category as being subject specific. Nevertheless, it may turn out after initial analysis that some of these words can in fact be excluded, possibly simply by excluding more of the less frequent words.

Words with similar relative frequencies

Consider a word that appears at roughly the same frequency in all categories, i.e. where the range of frequencies is small. As indicated above, after removing the ‘language’ factor the adjusted (mean-deviation) frequencies will lie on either side of the zero mark, with positive values representing frequencies above the average, and negative values representing frequencies below the average, with a small range between the maximum and minimum values. Although comparisons of these small values will contribute towards an overall correlation value, their effect will be individually small.

Excluding them will therefore tend to make the analysis more ‘sensitive,’ by focusing on words where the mean-deviation frequencies deviate from zero by a greater degree. However, there is a danger that by excluding too many ‘low range’ words a small number of ‘high range’ words may change the result (e.g. from a positive to a negative). Therefore, excluding ‘low range’ words should only be undertaken when the ‘sense’ of the result is unchanged, and where the result becomes ‘sharper,’ i.e. a positive correlation becomes more positive, or a negative one more negative.

Calculating The Correlations

A simple way to compare the mean-deviation frequencies of these words in the different categories is to use the CORREL function of Microsoft Excel, which returns a correlation coefficient value ranging from -1 to +1. More specifically, the CORREL function computes the Pearson Product Moment Correlation (Pearson's correlation for short), which reflects the degree of linear relationship between two variables, which in this case are the adjusted frequencies in two of the HHBC categories (or meta-categories created by combining two or more HHBC categories).
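For readers working outside Excel, Pearson's correlation can be computed directly from its definition. The sketch below is a plain implementation of the standard formula, not a reproduction of any Excel internals; the function name `pearson` and the two example lists are illustrative, and the lists are invented rather than taken from the HHBC data.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical mean-deviation frequencies of five words in two categories.
category_a = [-1.0, 0.5, 0.5, 0.2, -0.2]
category_b = [-0.8, 0.2, 0.6, 0.1, -0.1]
print(round(pearson(category_a, category_b), 3))
```

Each position in the two lists must refer to the same word, since the coefficient is built from word-by-word products of the deviations.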

A limitation of this measure of correlation is that it is only reliable where the input variables are approximately normally distributed, i.e. in this case where the lists of mean-deviation frequencies are approximately normally distributed, since heavily skewed data or a few extreme values can dominate the result. Therefore, we need to test the data. A simple way of checking the distribution (in Excel) is to use the SKEW and KURT functions on each list of mean-deviation frequencies. SKEW determines how symmetrical the values are about their mean (which in this case is zero), with a value of zero indicating symmetry. KURT measures the excess kurtosis, i.e. how peaked and heavy-tailed the data is in comparison with a normal distribution: a normal distribution has a kurtosis of 3, and hence an excess kurtosis of zero.
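The same checks can be sketched outside Excel. The functions below use what are understood to be the sample formulas behind Excel's SKEW and KURT (both based on the sample standard deviation, with small-sample correction factors); the function names `skew` and `kurt` and the helper `_standardised` are illustrative, and the sample values are arbitrary rather than HHBC data.

```python
import math

def _standardised(values):
    """Return n and the values expressed in sample standard deviations."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if s == 0:
        raise ValueError("all values are identical")
    return n, [(v - mean) / s for v in values]

def skew(values):
    """Sample skewness; 0 for perfectly symmetrical data."""
    if len(values) < 3:
        raise ValueError("need at least 3 values")
    n, z = _standardised(values)
    return n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)

def kurt(values):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    if len(values) < 4:
        raise ValueError("need at least 4 values")
    n, z = _standardised(values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

sample = [3, 4, 5, 2, 3, 4, 5, 6, 4, 7]
print(skew(sample), kurt(sample))
```

Values of both measures near zero suggest a list is close to normal; large positive values, as reported for the full word list, indicate a long right-hand tail and a sharp central peak.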

When using all of the 807 words in the HHBC data, most of the lists of mean-deviation frequencies are significantly non-normal, with positive skewness and very high kurtosis values. In the main this is due to the large number of small negative mean-deviation values, caused in turn by many HHBC categories having none of the low frequency words. However, the more of the lower frequency words that are excluded, the closer the lists of frequencies get to having normal distributions, until with just the 29 most frequent words all the lists are acceptably close to normal. At the same time this reduction in the words used has ‘sharpened’ the results, without altering the ‘sense’ of the correlations. However, reducing the number of words still further adversely affects both the data distribution and the sense of the correlations, so in the results discussed on the next page (Stylometric - Synoptic Results) it should be assumed (unless otherwise specified) that the correlation values are based on just the 29 most frequent words in the HHBC data.
