PH Hill-climbing

A Strange but Effective Hill-Climbing Algorithm.

In what follows I'm assuming you have had some experience in solving ciphers by hill-climbing. If not, read the computer column in the SO 2006 Cryptogram, or the ND 1995 Cryptogram.

I was trying to understand why simulated annealing is so good at solving Playfair and other ciphers (see Anchises' article in the MJ 2007 Cryptogram). One way simulated annealing differs from ordinary hill-climbing is that it sometimes allows "downslope" moves: when you swap a pair of letters to get a new key, you sometimes replace the old key with the new one even if the new key has a slightly lower score. As an aid to understanding how accepting lower scores could result in solving ciphers, I wrote a program to generate statistics about score differences after swapping a pair of letters in the key. But to my surprise, the program did more than generate statistics: it also solved ciphers!

The basic idea is to allow a "fudge factor" in comparing the new key's score with the old key's score. In ordinary hill-climbing the basic step is:

IF new_score >= old_score THEN
    keep new key
    set old_score = new_score
ELSE
    discard new key, restore old key

I changed this to:

IF new_score >= old_score - fudge_factor THEN
    keep new key
    set old_score = new_score
ELSE
    discard new key, restore old key

To calculate a suitable fudge factor, I looked at some statistics from my hill-climbing program. In a typical run, with scoring by summing logarithms of tetragraph frequencies, the biggest negative difference between new_score and old_score was about 2.5 times the ciphertext length. I decided to take ten percent of that, 0.25, rounded off to 0.2, times the text length as the initial fudge factor.

To generate statistics about different-sized fudge factors, I tried dividing the initial fudge factor by various numbers and recording the resulting new_score - old_score differences. Now the basic step was:

IF new_score >= old_score - (fudge_factor/divisor) THEN
    keep new key
    set old_score = new_score
ELSE
    discard new key, restore old key

To my surprise, when I started cycling through various divisors, this algorithm started churning out solutions to the ciphers!

Here's the algorithm in a little more detail:

Fix 4 numbers: fudge_factor, initial_divisor, divisor_increment, cycle_length. Then:

--------

start with random key and score the corresponding trial decrypt

DO "FOREVER":
    reset divisor to initial_divisor
    FOR i = 1 TO cycle_length:
        get new key and its corresponding score
        IF new score is an all-time high, print out the trial decrypt. Maybe it's the solution!
        IF new_score >= old_score - (fudge_factor/divisor) THEN
            accept new key
            old_score = new_score
        ELSE
            discard new key, restore old key
        increment divisor by divisor_increment

--------
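Sketched in Python, the loop above might look like the following. The scoring and key-perturbing routines are left as caller-supplied stand-ins (the names score, random_key, and neighbor are my own, not from the article), and the all-time high is recorded rather than printed:

```python
def probability_histogram_climb(score, random_key, neighbor,
                                fudge_factor, initial_divisor,
                                divisor_increment, cycle_length,
                                max_trials=1_000_000):
    """Hill-climb with a cycling "fudge factor" acceptance threshold.

    score(key) and neighbor(key) are assumed to be supplied by the
    caller (e.g. tetragraph log-frequency scoring and a random swap
    of two letters in the key).
    """
    key = random_key()
    old_score = score(key)
    best_key, best_score = key, old_score
    trials = 0
    while trials < max_trials:          # the article's DO "FOREVER"
        divisor = initial_divisor       # reset divisor each cycle
        for _ in range(cycle_length):
            trials += 1
            new_key = neighbor(key)
            new_score = score(new_key)
            if new_score > best_score:  # all-time high: maybe the solution
                best_key, best_score = new_key, new_score
            # accept the new key even on a bounded downslope step
            if new_score >= old_score - fudge_factor / divisor:
                key, old_score = new_key, new_score
            # else: discard new_key, keep the old key
            divisor += divisor_increment
    return best_key, best_score
```

For a real cipher, score would sum tetragraph log frequencies over the trial decrypt and neighbor would swap a random pair of letters in the key square.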

As I mentioned, for fudge_factor I use (0.2 * ciphertext_length), but this number depends, of course, on the magnitude of the numbers in the tetragraph table you use for scoring. For Playfair ciphers I've had good luck with initial_divisor = 1.0, divisor_increment = 1.5, and cycle_length = 20. For Phillips ciphers I often use initial_divisor = 1.5, divisor_increment = 2.0, and cycle_length = 11.

This algorithm solved the Playfair cipher (Challenge cipher) on this web site using less than 4 million trial decrypts. On that solving run I set initial_divisor = 1.0, divisor_increment = 1.0, and cycle_length = 40.

Why does this algorithm work?

First a warning. It was only by accident that I discovered that this algorithm solves ciphers. I've tried to figure out why it works, but since my reasoning is "after the fact" it might just be bs.

This algorithm seems to be based on a "probability histogram". That is, if you graph the probability of accepting a given negative score difference between the new key and the old key, with the score difference along the horizontal axis and the probability of accepting the new key along the vertical axis, you get a histogram: a collection of horizontal segments joined by vertical jumps where one horizontal ends and the next begins. See the first graph below. This is in contrast to simulated annealing, where, at a fixed temperature, the corresponding probability graph is a nice, smooth exponential function. (For negative score differences, that is. A new key with a non-negative score difference is always accepted, so along the positive horizontal axis the graph is a horizontal line at height 1.0.)

To see why we get a histogram let's take some specific numbers. Suppose the fudge factor is 0.2 times the text length, the initial_divisor is 1.0, the divisor_increment is 1.5, and the cycle_length is 20. Also let's "normalize" the score differences by dividing them by the text length.

When we start the cycle, -fudge_factor/divisor equals -0.2, and it never gets more negative than that. If a score difference is more negative than -0.2, we always reject the new key, so the graph has a segment at height 0.0 stretching from -0.2 to negative infinity. A score difference of exactly -0.2 is accepted only when the divisor is at its initial value of 1.0, which happens once per cycle, or 1/20 = 0.05 of the time (since the cycle length is 20). So the graph contains the point (-0.2, 0.05). When the divisor is incremented once, its new value is 2.5, and -fudge_factor/divisor is now -0.2/2.5 = -0.08. A score difference of -0.08 is accepted only when the divisor is at its first or second value, or 2/20 = 0.1 of the time, so the graph contains the point (-0.08, 0.1). Score differences between -0.2 and -0.08 are accepted only when the divisor is 1.0, so the graph has a horizontal segment at height 0.05 between -0.2 and -0.08.
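The acceptance probability for any normalized score difference can be computed directly. This short Python sketch (the function name is mine; it assumes, as in the article, that each divisor value in the cycle is used exactly once per cycle) reproduces the corner points worked out above:

```python
def acceptance_probability(diff, fudge=0.2, initial_divisor=1.0,
                           divisor_increment=1.5, cycle_length=20):
    """Fraction of the cycle during which a normalized score
    difference `diff` would be accepted, i.e. the height of the
    probability histogram at `diff`."""
    divisors = [initial_divisor + i * divisor_increment
                for i in range(cycle_length)]
    # Count the divisor values whose threshold -fudge/d is met.
    accepted = sum(1 for d in divisors if diff >= -fudge / d)
    return accepted / cycle_length
```

With the default parameters this gives 0.05 at a difference of -0.2, 0.1 at -0.08, 0.0 for anything more negative than -0.2, and 1.0 for any non-negative difference, matching the points derived above.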

I hope it's now clear how to draw the graph for the rest of the divisors in the cycle. The complete graph is below.

[Graph: the probability histogram. Horizontal units are (new_score - old_score)/text_length.]

We can compare this probability distribution with the simulated annealing distribution you get by choosing the simulated annealing "temperature" so that the leftmost point (-0.2,0.05) lies on both graphs. With this fixed temperature the simulated annealing probabilities for the points where the divisor changes are shown as hollow diamonds on the graph below. You can get the complete simulated annealing graph by imagining a smooth curve drawn through the hollow diamonds.
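The matched simulated-annealing curve can be sketched the same way. Here the temperature is solved from exp(-0.2/T) = 0.05 so the exponential passes through the histogram's leftmost corner; the function name is mine, and the exp(diff/T) form is the standard Metropolis acceptance rule, assumed rather than quoted from the article:

```python
import math

def sa_probability(diff, match_point=(-0.2, 0.05)):
    """Simulated-annealing acceptance probability exp(diff/T), with
    the temperature T chosen so the curve passes through match_point
    (here the histogram's leftmost corner, (-0.2, 0.05))."""
    d0, p0 = match_point
    temperature = d0 / math.log(p0)   # about 0.067 for (-0.2, 0.05)
    return 1.0 if diff >= 0 else math.exp(diff / temperature)
```

At a normalized difference of -0.08, for instance, this matched exponential accepts with probability roughly 0.30, versus the histogram's 0.10, so the annealing curve is considerably more permissive for mid-range downslope moves.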

This "probability histogram" algorithm (and simulated annealing too) seems to work by finding a good balance between accepting enough downslope steps to avoid getting stuck at a local maximum, and not accepting so many that you are prevented from making the run of consecutive upslope steps needed to reach the highest peak.

A few statistics from a "probability histogram" run might throw some light on the kind of balance that works. On one run to solve the Playfair challenge cipher, it took about 22 million trial decrypts to find the solution. Of these, about 21 million were downslope steps, and only 1 million were upslope steps. Of the 21 million downslope steps, only about 300,000 were accepted as current keys. This run had a cycle length of 21 steps. When the divisor was at its minimum of 1.0, there were about 850,000 downslope steps rejected, 140,000 downslope steps accepted, and 62,000 upslope steps. When the divisor was at its maximum of 31.0, there were 990,000 downslope steps rejected, 1,800 downslope steps accepted, and 62,000 upslope steps.

We might speculate that most ciphers have a sort of "resonance histogram" that, if found, allows our algorithm to solve them with a minimum number of trial decrypts.