Stylometric Analysis

What is Stylometric Analysis?

On 4 June 1994, Robert Sigley, PhD, posted a message to the b-greek discussion board. Portions of this posting (re-formatted for readability) are reproduced below:

I've just finished reading… a collection of papers which I can thoroughly recommend to anyone trying to identify an author by style… It's called ‘Statistics and Style’ and appeared in 1968... Analyses covered include comparisons of:

The final paper in the collection is perhaps the most important, as it deals with the general question of reliability. Overall, it has to be said that the crude general statistics above are not useful for deciding questions of authorship unless:

In short, the rewards of such analysis are mostly not worth the considerable time it used to take to compute the statistics. The results are, I'm afraid, especially indecisive in answering questions in classics (where this volume of work, and the historical information about potential alternative authors, is often lacking). But if there is only one candidate author, with a large known corpus, and the exercise is simply to determine how similar to that author's style the unknown text is, then it can still be attempted.

 (1) and (2) above are now relatively quick and easy to calculate with most concordance programs - providing the text is in machine-readable form to start with! But they are the least author-specific methods.

 (4) and (5) could be useful, but are still very time-consuming to calculate, and require a whole lotta manual tagging of the texts. Best avoided.

 So (3) is probably going to be of most use in identifying a specific author. The best approach I can think of would be to construct a concordance…for a large corpus (20000 words minimum) of the candidate author, and then do the same for a similar-sized matched-genre corpus from the author's contemporaries. (If the text's general *date* is in doubt, you may as well give up now.)

Then you compare the frequency ratios of common vocabulary items (i.e. frequency in candidate corpus/ frequency in mixed-contemporary corpus).

This will identify a number of vocabulary items which are used proportionately much more or much less by the candidate author, and so can be used as ‘characteristic’ of that author. Discard items which are linked in any literal sense to the text topic. Ignore very rare items (e.g. with frequency less than 5 over 20000 words). To save yourself time, and to maximise the sensitivity of your tests, look at only the 10 or so items with the largest differential frequency.

Now calculate the frequencies of the remaining items in the contested text. Compare these with both the candidate-author and contemporary corpora frequencies.

Finally, conduct a series of statistical tests to determine whether any differences you find can reasonably be attributed to chance. The best method will depend on the frequencies you get at the end of all this; ask a friendly statistician.
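The selection steps described in the posting can be sketched in Python. This is only an illustration under stated assumptions, not Sigley's own procedure: the function names, the simple letter-run tokenizer, the 0.5 smoothing constant, and the reading of "very rare" as "fewer than `min_count` occurrences in both corpora" are all choices made here for the example. The final statistical tests are deliberately left out, since the posting itself defers those to a statistician.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def characteristic_items(candidate, reference, min_count=5, top_n=10, topical=()):
    """Rank words by how differently the candidate author uses them.

    candidate and reference are non-empty lists of tokens. Returns up to
    top_n (word, ratio) pairs, where ratio is the word's relative frequency
    in the candidate corpus divided by its relative frequency in the
    reference (mixed-contemporary) corpus.
    """
    cand, ref = Counter(candidate), Counter(reference)
    scored = []
    for word in set(cand) | set(ref):
        if word in topical:
            continue  # discard items tied to the subject matter
        if max(cand[word], ref[word]) < min_count:
            continue  # ignore items that are rare in both corpora
        # Assumption: add 0.5 to each count so that a word absent from one
        # corpus still yields a finite, non-zero ratio.
        ratio = ((cand[word] + 0.5) / len(candidate)) / (
            (ref[word] + 0.5) / len(reference)
        )
        scored.append((word, ratio))
    # The largest differences are the ratios furthest from 1 in either
    # direction, so rank by the absolute value of the log ratio.
    scored.sort(key=lambda item: (-abs(math.log(item[1])), item[0]))
    return scored[:top_n]


def rates_per_thousand(tokens, words):
    """Frequency of each listed word per 1,000 tokens of the given text."""
    counts = Counter(tokens)
    return {w: 1000 * counts[w] / len(tokens) for w in words}
```

In use, `characteristic_items` would be run once on the two large corpora, and `rates_per_thousand` would then be applied to the contested text and to both corpora for the selected words, giving the three sets of frequencies that the posting says to compare.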

Might this form of analysis help with the synoptic problem? The posting suggests that if we restrict ourselves to:

then, because we only have a small number of candidate authors, there is a good possibility that similarities in style that are not attributable to chance can be identified.

Word Frequency Profiles

As indicated above, perhaps the best stylometric technique for investigating the synoptic problem is to look at word frequency profiles. Every person has a different way of using the language in which he or she writes, and the frequencies with which a person uses words in that language together form a characteristic ‘fingerprint’ or profile.

For example, I tend to use “however” quite frequently. However, another person might use “but” instead. And (many sentences in the Gospel according to Mark begin with ‘and’) yet another person might avoid using a conjunction in this way altogether. Because of these differences, it is possible to look at how frequently words are used in different pieces of text, and to use similarities and differences in those frequencies as an indication of authorship.

Of course, this does not mean that every piece of text written by an individual will have exactly the same word frequency profile, or that every piece of text written by one person will have a different word frequency profile to those written by another. For example, the subject matter clearly plays a part. Some words used when writing about a particular subject will not be used when writing about another. Conversely, some words may be so intrinsic to the language that they are used at roughly the same frequency whatever the subject, and whoever wrote them.

Other factors also have an effect: whether the person was rushed at the time, the person's age, whether the text was dictated to someone who had some latitude to change the words, and a host of other unknown things that affected the person at the time of writing. Nevertheless, it is reasonable to suppose that different people may each have different typical word frequency profiles (at least so far as non-subject-specific or ‘function’ words are concerned) that can be used to distinguish text authored by one person from text authored by another. In the rest of this analysis I will refer to these simply as profiles.
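As a rough illustration of what such a profile might look like in practice, the Python sketch below computes the relative frequency of a few function words in a text and measures how far apart two profiles are. The word list, the function names, and the sum-of-absolute-differences distance are all assumptions made for this example; they are not the measures used in the rest of this analysis.

```python
from collections import Counter

# Hypothetical function-word list, chosen only for illustration.
FUNCTION_WORDS = ("and", "but", "however", "the", "of")


def profile(tokens, words=FUNCTION_WORDS):
    """Relative frequency of each function word in a non-empty token list."""
    counts = Counter(t.lower() for t in tokens)
    return {w: counts[w] / len(tokens) for w in words}


def profile_distance(p, q):
    """Sum of absolute differences between two profiles over the same words."""
    return sum(abs(p[w] - q[w]) for w in p)


text_a = "And he went and he saw the sea".split()
text_b = "However he went but he saw the sea".split()
distance = profile_distance(profile(text_a), profile(text_b))
```

A distance of zero would mean the two texts use the listed words at identical rates. With passages this short the figure means very little, which is one reason the posting quoted above asks for corpora of 20000 words or more.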
