Cipher ID tests

Notes on cipher ID tests.

Sometimes the Cryptogram magazine will have a cipher whose type is unknown. Before you can solve it you have to identify the method that was used to encrypt it. I've tried three methods of identifying ciphers of an unknown type:

Comparison to Statistical Averages

One simple method of trying to identify a cipher is to calculate various statistical measures, such as Index of Coincidence , sum of logarithms of digraph frequencies [1], etc, for the unknown cipher and then compare the results to the averages for various known cipher types. The most likely type for the unknown cipher is the one whose differences from the type's averages is smallest. For example this cipher,

RWASD PCRWP DPALD CIPBD TERPA LTIAL PRQSC YMRWQ TPLIP IMLPC RTYRK

LBPXK MSRPV TVTDC KLBDT PDKVV CHBPT ALSWK VKEEK BMTUM UTPRD KICTD

RWPSK IPRTI PCIME DKUUK RPLBP YICAK LLTAL MTHPA LRPBR KKUCI PDEVU

VRKWI QITMK LPULT LTHRA LC.

has an index of coincidence (scaled up by 1000) of 64, which is about the same as the average for standard English. The sum of the logarithms of its digraphs, suitably scaled, is 518, which is much lower than the average sum of 750 for standard English, or even the average sum for transposition ciphers, which is in the 600's. We might deduce therefore that the cipher is not a plain transposition, but is a simple substitution or perhaps simple substitution combined with some kind of transposition.

This method of comparison to averages gives the correct result for the above cipher. It is a Bazeries cipher, which is a simple substitution combined with transposition by reversing the order of its letters according to a fixed pattern.

But the comparison to averages method does not always work. For example, in the CE column for the Nov-Dec 2012 issue of the Cryptogram, out of 26 ciphers only 5 were correctly placed at the top of the comparison program's list.

Decision Trees

One attempt at a better identification method is to construct a decision tree . A decision tree is a collection of questions with "yes-no" answers that keeps narrowing down the list of possible cipher types until only one type, hopefully the correct one, remains. Given some example ciphers for training, a computer can construct a decision tree by choosing the question that best divides the cipher types into two clusters. The question might be, "Is the index of coincidence greater than 55", or "Does the ciphertext contain numbers?". The computer proceeds to choose questions that give the best subdivisions of the data until, in the ideal case, only one cipher type remains in each of the final subdivisions.

How does the computer judge which is the "best" of the possible subdivisions? One method is to sum the squares of the cipher-type frequencies in each subdivision. If there is just one cipher type in the subdivision -- the best possible result -- then the sum of the squares is 1. If there were say, 5 cipher-types in equal numbers in the subdivision, then the sum of squares would be 1/5, a less desirable result [2].

In real life decision trees don't always work out smoothly. There is a problem known as overfitting , where the computer is assigning cipher types based on statistical "noise" that may have no real meaning. For example suppose there were just two ciphers remaining in a subdivision, one a columnar transposition with an index of coincidence of 64 and the other a swagman transposition with an index of coincidence of 68. The computer might choose between them with a ridiculous question like "is the index of coincidence greater than 66?". Unfortunately the index of coincidence for transposition ciphers is the same as the index of coincidence of the original plaintext and therefore is no help at all in distinguishing between transposition types. But the computer doesn't know this and only has two example ciphers to work with.

To help guard against overfitting I decided to limit the computer to a maximum of 12 questions per unknown cipher. There were 16,800 ciphers in the set of training examples. If each question divided the examples in half then after 12 questions each subdivision would contain only about 4 ciphers; any further questions would probably cause overfitting. If a data subdivision still contains more than one cipher type after 12 questions, the computer chooses the most common type among the ciphers remaining.

Using the 12 question limit the computer produced a decision tree based on the training set of 16,800 ciphers, choosing questions from a set of 36 statistical measures. The measures included obvious questions like "Does the cipher have a J?, and "Is the length of the cipher even?". Also included were the 9 reference stats from the MJ 2005 Cryptogram, some of AAHJU's cipher tests from his computer columns [1][3], THE RAT's NORMOR test [4], GIZMO's Bifid period test[5], plus a couple of stats invented on the spot [6].

Decision Tree Example.

Here's an example of using that decision tree. The cipher is:

VUGNO OXOWI NBFWC SPYDL CBUWA OVCIR GACFD DILGI YFNSH DYNXB EFSIF

FNSNA JEYGX KYDLT DMEDO AXNME DGECG XLFIE PGEEM GVXBR SGOLX HBCAA

HIFWJ EWNWO RJJFF XZSJE ELCWV QVJDL FEQME DYEXW NOUYD SSAGV BDNQW

FBXBW QD.

(1) First question: Is the sum of the logarithms of the 2-letter (digraph) frequencies (LDI) greater then 28? Answer: LDI for this cipher is 474 so answer is yes. The only cipher types with LDI less than 28 are ciphertexts made up of numbers, which have an LDI of 0. This question really seems a way of dividing numerical ciphers from other types.

(2) Next question: Does the cipher contain any numbers? Answer: No. This question eliminates ciphers based on a 6x6 keysquare. 6x6 ciphers can contain both numbers and letters.

(3) Next question: Is the LDI greater than 600?. Answer: No -- the LDI for this cipher is 474. An LDI of less than 600 means the cipher is probably not a transposition type. (Note that this is the second question about LDI. Numerical variables can be used several times with different cutoff points.)

(4) Next question: Does the ciphertext contain a J ? Answer: Yes. This question eliminates playfair, two-square, and all other cipher types based on a 5x5 keysquare. The 5x5 keysquare replaces all J's by I's so those ciphers never have a J.

(5) Next question: Does the ciphertext contain a # symbol ? Answer: No. This eliminates most trifid and digrafid ciphers.

(6) Is the Portax digraph index (PTX) greater than 600? [3]. Answer: PTX for this cipher is 491, so answer is no. The PTX is a specialized statistic, a sort of index of coincidence for pairs of letters that have fixed encryptions in the Portax cipher. This question eliminates the Portax type.

(7) Is the index of coincidence greater than 54?. Answer: IC for this cipher is 44, so answer is no. This means the cipher is unlikely to be a simple substitution.

(8) Is the maximum periodic index of coincidence (MIC) greater than 56 ? Answer: The MIC is 76 for this cipher, so answer is yes. This indicates the cipher is likely to be a periodic type. The periodic method of encryption is simple substitution using several substitution alphabets, changing the alphabet with every letter until the end of the "period" and then starting over with the first alphabet, and so on.

(9-12) The last four questions are grouped together because they all ask about the Vigenere family of periodic ciphers. In the Vigenere family the substitutions for each alphabet are determined by a rigid formula. Using the formula the logarithms of the resulting digraphs can be calculated . In this cipher the result of each of the 4 calculations is too low for the cipher to be in the Vigenere family.

After these 12 questions, the decision tree outputs the most common cipher type among those types that have not been eliminated. The resulting type: QUAGMIRE CIPHER.

And indeed, the example cipher is a quagmire.

The decision tree method does represent an improvement over simple comparison with statistical averages. In the CE column for the Nov-Dec 1012 Cryptogram, it was correct on 15 of the 26 ciphers, compared with only 5 correct for the averages method.

Random Forests of Decision Trees.

We can try improving our decision tree results by using more than one decision tree. The idea behind the Random Forest method is to use many randomly generated decision trees and for the final answer choose the cipher type that is the most common selection among the trees. To generate a random decision tree we choose, for each question, a small random selection, say 6, from among the 36 total statistics available, and then choose the best question from among those 6. Any one of these random decision trees will probably not be as accurate as the tree made by always choosing the best statistic from among all 36 stats available. But the entire collection of trees taken together may be more accurate than any one decision tree.

The random forest method does seem to result in an improvement over our single decision tree. For example, this cipher:

NDJPZ CSLCQ LYBYE LDVDD YYJFI DKVVD HXKCO EWVRG DOZSK UDVDN VAOPY

BGJJP VDSGW XZZRJ FNDJG IURYA WGLCK XMXRO YJPYU JOHXJ IDCUO EJYGQ

DOYSG QLVOG OMXUU FEHDO PVONA DHYQV JWZRW ZREJH.

has somewhat similar statistics to our first decision tree example. In fact, it gives exactly the same yes-no answers to the single decision tree questions. And so the decision tree identifies this cipher as a Quagmire. But that is not correct. It's a Fractionated Morse cipher.

A random forest of 30 decision trees did choose "Fractionated Morse" as the consensus type for this cipher as well as correctly choosing "Quagmire" for the first decision tree example.

In the CE column for the Nov-Dec 1012 Cryptogram, the 30 tree random forest was correct on 17.5 (there was one tie vote) of the 26 ciphers, compared with 15 correct for the single decision tree.

This accuracy ranking has been the same for all collections of ciphers that I've tested so far -- admittedly I need to do more testing -- but comparison with averages is less accurate than the single decision tree, which in turn, is not as accurate as the 30 tree random forest. But there is still a long way to go. Even the best results have been only about 70 percent correct.

You can try the 30 tree random forest ID test at this link.

footnotes:

[1] See computer column by AAHJU in the SO 2001 Cryptogram (or here).

[2] To calculate the "best" of the subdivisions I used the "gini impurity" formula..

[3] See AAHJU's computer column in the MJ 2001 Cryptogram (or here)

[4] See SO 2009 Cryptogram

[5] See JF 1979 Cryptogram

[6] For one of the new stats, see the computer column in the JA 2006 Cryptogram ( or here ).