Stylometric Analysis describes what might be achieved by comparing the styles of parallel portions of text created by either Two Authors or Three Authors. How can this be applied to the synoptic problem? Because there are three texts, the gospels of Matthew, Mark, and Luke (or GMt, GMk, and GLk), we can perform three sets of pair-wise comparisons: GMt – GMk, GMk – GLk, and GMt – GLk. The data for these comparisons comes from:
“Synoptic Concordance of the First Three Gospels: A Research Project of the Department of New Testament Studies of the University of Bamberg, Germany.”
Quoting from the web-site of Prof. Dr. Thomas Hieke on this concordance:
“Under the leadership of Prof. Dr. Paul Hoffmann, this research project was undertaken by Dr. Thomas Hieke, and by Dr. Ulrich Bauer, who developed the necessary computer programs. After preliminary planning and experimentation, financed by the University of Bamberg, the Deutsche Forschungsgemeinschaft has supported the project since 1996.
The Synoptic Concordance is a new research tool for the analysis of the first three Gospels, in that it presents an extensive mass of data that facilitates in a major way their literary and linguistic analysis.
The advantages of a concordance are combined with those of a synopsis: Each occurrence of a word in the synoptic Gospels, along with a swath of text that provides its context, is displayed in three columns. The effect is that one not only sees the occurrences of a certain word in one Gospel, but also the parallels in the other two Gospels”.
The Synoptic Concordance from Hoffmann, Hieke, and Bauer (the HHB Concordance, or HHBC) splits the words of the synoptic gospels (GMt, GMk, and GLk) into nineteen categories, in exactly the same way as discussed in Stylometric Analysis when considering Three Authors. Using the ordering in the HHBC, GMt is text A, GMk is text B, and GLk is text C, so we have:
This can also be seen in the HHB Concordance example page, which shows the nineteen different categories. This page also shows that the notation used above to denote the different categories (using 0s, 1s, and 2s) is taken directly from the HHB Concordance.
The HHB Concordance data has been entered into an Excel spreadsheet by Dave Gentile, and used in his own analysis of the synoptic problem. As described by Gentile:
“In the HHB Synoptic Concordance, 807 of the most common words were completely tabulated. The less common words were covered with less complete tabular information. The 807 tabulated words were used in this study. Therefore the raw data for the study was a table with 807 rows, 1 for each vocabulary item, and 19 columns, 1 for each synoptic category. Each cell in the data shows the number of times the particular vocabulary item, occurs in the particular synoptic category.”
Gentile’s statistical analysis of the HHBC data looks for similarities in the profiles of all the different pairings of two of the nineteen HHBC categories. However, it is also worth comparing only two of the synoptic gospels at a time, thus reducing the number of categories involved in each set of comparisons. To do this we need to identify which of the HHBC categories must be combined to perform these pair-wise comparisons.
It is useful at this point to associate particular HHBC categories with the traditional names often used for specific overlaps between the synoptics, and for material unique to one gospel:
Each of the three synoptic gospels can (for the 807 different words included in the HHB Concordance) be ‘recreated’ by combining nine of the HHB categories. As shown above we have:
For GMk – GLk comparisons the equivalent categories are:
For GMt – GLk comparisons they are:
In any pair of HHBC categories we need to look for words that appear at a similar frequency in both categories (a positive correlation) or where words have a high frequency in one category and a low frequency in the other (a negative correlation).
Because each of the categories contains a different number of words in total, we cannot simply compare the actual number of each different word in a pair of categories when looking for a correlation. Instead, we need to look at the frequency with which each word appears in each category in comparison with a baseline, i.e. the profiles of all the categories taken together.
Step 1: Determine the frequency. For every category under comparison, and for each different word in that category, divide the number of times the word appears by the total number of words in the category. The result can then be expressed as a percentage.
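Step 1 can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the words and counts are placeholders rather than HHBC data.

```python
def relative_frequencies(counts):
    """Convert raw word counts for one category into percentages
    of the total number of words in that category."""
    total = sum(counts.values())
    if total == 0:
        # An empty category has no meaningful frequencies.
        return {word: 0.0 for word in counts}
    return {word: 100.0 * n / total for word, n in counts.items()}


# Hypothetical counts for three words in two categories of very different size.
category_a = {"kai": 50, "de": 30, "gar": 20}
category_b = {"kai": 5, "de": 3, "gar": 2}

freq_a = relative_frequencies(category_a)
freq_b = relative_frequencies(category_b)
```

Although the raw counts differ by a factor of ten, the two categories end up with the same percentage profile, which is why frequencies rather than counts are compared.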
Step 2: Remove the ‘language’ factor. Some words will be more frequent than others simply because of the language (Greek) the texts are written in, irrespective of who created them. Therefore, there will be an inherent degree of positive correlation between any pair of categories however different the natural ‘styles’ of the people who created the texts. To remove this factor the frequencies calculated in Step 1 have to be normalized, and there are various ways in which this could be done. First, the maximum, mean (average), and minimum frequencies of each of the words across all 19 categories must be calculated. Then, using the actual frequency values calculated in step 1 for each word in each category in turn:
Method 3 can be dismissed quite simply: even for common words on the same subject and written by the same person, it greatly exaggerates tiny frequency variations across the categories, making categories appear to have very different profiles even when they are actually very similar.
Method 2 has the disadvantage that variations in the frequencies of infrequent words appear more significant than variations in frequent ones, i.e. the method is more ‘sensitive’ to variations in words that are likely to be subject dependent, and less so for words likely to be in common use by all authors.
Method 1 has the apparent disadvantage of creating negative values. However, this is actually not a problem, since pairs of negative values and pairs of the same numerical (but positive) values are exactly the same so far as the degree of correlation is concerned. This method ensures that pairs of frequencies that are both above or both below the mean contribute to a positive correlation, and pairs of frequencies where one is above and one below the mean contribute to a negative correlation. Overall, Method 1 is simplest, and avoids the disadvantages of Methods 2 and 3. This is actually a common transformation: data resulting from subtracting the mean is called mean-deviation data, and for each word the mean of these values across the categories is always zero.
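As a minimal sketch of Method 1, assuming that the mean is the simple (unweighted) average of a word's frequency over the categories, the subtraction could look like this. The function name, category labels, and figures are hypothetical.

```python
def mean_deviation(freqs_by_category):
    """Subtract, for each word, its mean frequency across all categories.

    freqs_by_category maps category -> {word: frequency}; every category
    is assumed to list the same words (with 0.0 where a word is absent).
    """
    categories = list(freqs_by_category)
    words = list(freqs_by_category[categories[0]])
    adjusted = {c: {} for c in categories}
    for word in words:
        mean = sum(freqs_by_category[c][word] for c in categories) / len(categories)
        for c in categories:
            adjusted[c][word] = freqs_by_category[c][word] - mean
    return adjusted


# Hypothetical percentage frequencies in three categories.
frequencies = {
    "A": {"kai": 10.0, "de": 3.0},
    "B": {"kai": 12.0, "de": 3.0},
    "C": {"kai": 14.0, "de": 6.0},
}
adjusted = mean_deviation(frequencies)
```

For each word the adjusted values sum to zero across the categories, with negative values marking frequencies below that word's average.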
Step 3: Select the words to use when determining the correlation. The paper quoted at the beginning of this analysis stated:
“Discard items which are linked in any literal sense to the text topic. Ignore very rare items (eg with frequency less than 5 over 20000 words). To save yourself time, and to maximise the sensitivity of your tests, look at only the 10 or so items with the largest differential frequency.”
With modern computers and spreadsheet or statistics programs the issue of time is moot, but how do rare words, or words with a similar relative frequency across several categories, affect the result?
Consider a word that appears only once, in just one of the categories. This would perhaps not be expected to affect the result significantly. However, in a comparison of two categories that do not contain the word at all, it nevertheless contributes to a positive correlation: in both categories the frequency (zero) is slightly below the average, and hence the adjusted frequency has the same slightly negative value. In a comparison with the one category that does contain the word, it instead contributes to a negative correlation.
The overall effect of including many rare words is to bias all the correlations towards being positive, and for this reason rare words should in most cases be excluded. However, words that appear multiple times in a small number of categories but not at all in the others may be significant, and should not necessarily be excluded, even where in total they are still rare.
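The effect of a rare word can be illustrated with made-up numbers. The sketch below treats the product of two adjusted values as a rough indicator of the sign of the word's contribution; this is a simplification, since Pearson's formula also centres each list of values on its own mean. The frequencies are hypothetical.

```python
# Hypothetical percentage frequencies of one rare word in five categories:
# it is absent from four of them and present in the last.
freqs = [0.0, 0.0, 0.0, 0.0, 0.5]

mean = sum(freqs) / len(freqs)
adjusted = [f - mean for f in freqs]

# Two categories that both lack the word share the same small negative value,
# so the product of their adjusted values is positive.
both_absent = adjusted[0] * adjusted[1]

# A category lacking the word compared with the one containing it gives
# values of opposite sign, so the product is negative.
one_present = adjusted[0] * adjusted[4]
```

With many such words, most pairs of categories fall into the "both absent" case, which is the source of the positive bias described above.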
Because the HHBC data contains a total of 25856 words, applying the ‘cut off’ suggested above (less than 5 over 20000 words, or 0.025%) means ignoring all words with fewer than 7 instances in total. It should be noted that the HHBC data already excludes words with only one instance, and so in practice we would be excluding all words with just 2-6 instances. This excludes the 200 least frequent words out of the 807 in the HHBC data.
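The arithmetic behind this cut-off can be checked directly. The variable names below are illustrative only; the figures are those quoted in the text.

```python
import math

TOTAL_WORDS = 25856          # total words in the HHBC data, as stated above
RARE_RATE = 5 / 20000        # "frequency less than 5 over 20000 words"

# Number of instances corresponding to the cut-off rate in a text of this size.
threshold = TOTAL_WORDS * RARE_RATE

# Words with a count below the threshold are ignored, so the smallest
# whole-number count that is kept is the threshold rounded up.
min_instances_kept = math.ceil(threshold)
```

The threshold works out at about 6.46 instances, so a word needs at least 7 instances to be retained.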
Subject Specific Words
If we are looking just for similarities in writing styles then, as far as possible, subject-specific words should normally be excluded. However, although the various names used for Jesus, God, Father, and various prophets, angels, and demons/devils may be considered subject-specific (and are clearly linked to the text topic), in this analysis all the authors are basing their text on essentially the same subject, and hence we must be careful not to exclude variations in names, etc. where these are valid elements of the styles in which the authors tell essentially the same story.
Even though the very design of the HHB Concordance may have already largely placed the subject-specific words into different categories, we cannot assume that before the analysis. For example, categories 200 and 002 by definition contain the material unique to GMt and GLk respectively, which quite likely contains subject-specific words. However, we cannot in advance of the study simply exclude words that only appear in one category as being subject specific. Nevertheless, it may turn out after initial analysis that some of these words can in fact be excluded, possibly simply by excluding more of the less frequent words.
Words with similar relative frequencies
Consider a word that appears at roughly the same frequency in all categories, i.e. where the range of frequencies is small. As indicated above, after removing the ‘language’ factor the adjusted (mean-deviation) frequencies will lie on either side of the zero mark, with positive values representing frequencies above the average, and negative values representing frequencies below the average, with a small range between the maximum and minimum values. Although comparisons of these small values will contribute towards an overall correlation value, their effect will be individually small.
Excluding them will therefore tend to make the analysis more ‘sensitive,’ by focusing on words where the mean-deviation frequencies deviate from zero by a greater degree. However, there is a danger that by excluding too many ‘low range’ words a small number of ‘high range’ words may change the result (e.g. from a positive to a negative). Therefore, excluding ‘low range’ words should only be undertaken when the ‘sense’ of the result is unchanged, and where the result becomes ‘sharper,’ i.e. a positive correlation becomes more positive, or a negative one more negative.
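The rule for excluding ‘low range’ words can be expressed as a small check. This is a sketch under stated assumptions: the function names, the range threshold, and the example values are all hypothetical, and the correlation values before and after exclusion are assumed to have been calculated separately.

```python
def frequency_range(adjusted_values):
    """Range (maximum minus minimum) of one word's adjusted frequencies."""
    return max(adjusted_values) - min(adjusted_values)


def drop_low_range_words(adjusted_by_word, min_range):
    """Keep only words whose adjusted frequencies span at least min_range."""
    return {
        word: values
        for word, values in adjusted_by_word.items()
        if frequency_range(values) >= min_range
    }


def is_sharper_with_same_sense(before, after):
    """True if the correlation keeps its sign and moves further from zero."""
    same_sense = (before > 0 and after > 0) or (before < 0 and after < 0)
    return same_sense and abs(after) > abs(before)


# Hypothetical adjusted frequencies for two words across three categories.
words = {"kai": [-0.1, 0.0, 0.1], "de": [-1.5, 0.2, 1.3]}
kept = drop_low_range_words(words, min_range=1.0)
```

An exclusion would be accepted only when `is_sharper_with_same_sense` returns True for the correlations of interest.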
With the computing power available today the number of different words used in the analysis is likely to have little effect on the speed with which the results can be calculated. Therefore, there is no need to consider reducing the number of words to speed up the calculation.
A simple way to compare the mean-deviation frequencies of these words in the different categories is to use the CORREL function of Microsoft Excel, which returns a correlation coefficient value ranging from -1 to +1. More specifically, CORREL computes the Pearson product-moment correlation (Pearson's correlation for short), which reflects the degree of linear relationship between two variables. In this case the variables are the adjusted frequencies in two of the HHBC categories (or meta-categories created by combining two or more HHBC categories).
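For readers without Excel, Pearson's correlation can be computed directly from its definition. The following is a plain-Python sketch of the standard formula, not a reproduction of Excel's internals; the function name and the example lists are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # The correlation is undefined when either list is constant.
        raise ValueError("correlation undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical adjusted frequencies of four words in two categories.
category_x = [-0.4, -0.1, 0.2, 0.3]
category_y = [-0.8, -0.2, 0.4, 0.6]
r = pearson(category_x, category_y)
```

Here the second list is exactly twice the first, so the relationship is perfectly linear and the coefficient is at the upper end of the range.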
A limitation of this measure of correlation is that it is only reliable where the input variables are approximately normally distributed, i.e. in this case where the lists of mean-deviation frequencies are approximately normally distributed. Therefore, we need to test the data. A simple way of checking the distribution (in Excel) is to use the SKEW and KURT functions on each list of mean-deviation frequencies. SKEW determines how symmetrical the values are about their mean, returning zero for a perfectly symmetrical list. KURT measures the excess kurtosis, which indicates how high the central peak in the data is in comparison to the shape of a normal distribution; a normal distribution has a kurtosis of 3, and therefore an excess kurtosis of zero.
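The same checks can be made outside Excel. The sketch below uses the bias-corrected sample formulas that are documented for Excel's SKEW and KURT functions; the function names and the example data are hypothetical.

```python
import math


def _mean_and_sample_sd(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)


def sample_skewness(values):
    """Bias-corrected sample skewness; needs at least 3 values."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean, sd = _mean_and_sample_sd(values)
    if sd == 0:
        raise ValueError("skewness undefined for a constant list")
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)


def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis; needs at least 4 values."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean, sd = _mean_and_sample_sd(values)
    if sd == 0:
        raise ValueError("kurtosis undefined for a constant list")
    fourth = sum(((v - mean) / sd) ** 4 for v in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical list with many small values and one large one.
lopsided = [0, 0, 0, 0, 1, 10]
skew = sample_skewness(lopsided)
kurt = sample_excess_kurtosis(lopsided)
```

A list dominated by small values with a few large ones, like the example, gives a positive skewness, which is the pattern reported for the full 807-word data.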
When using all of the 807 words in the HHBC data, most of the lists of mean-deviation frequencies are significantly non-normal, with positive skewness, and very high kurtosis values. In the main this is due to the large number of small negative mean-deviation values, caused in turn by many HHBC categories having none of the low frequency words. However, the more of the lower frequency words that are excluded, the closer the lists of frequencies get to having normal distributions, until with just the 29 most frequent words all the lists are acceptably close to normal. At the same time this reduction in the words used has ‘sharpened’ the results, without altering the ‘sense’ of the correlations. However, reducing the number of words still further adversely affects both the data distribution and the sense of the correlations, so in the results shown below it should be assumed (unless otherwise specified) that the correlation values are based on just the 29 most frequent words in the HHBC data.