## Guaranteeing proof termination

### Dealing with infinite proof search in reinforcement-learning automated proofs

October 2019

In my previous experiment with reinforcement learning for theorem proving, the agent's proof search always terminated in success by brute force.

Now after I added in another proof tactic for the agent to be able to use -- namely, apply transitivity -- it ran into a problem: a proof search that never ends.

In particular, a proof state that started innocently enough with 2 subgoals:

Theorem meet_associative : forall a b c:A, (a ⊓ b) ⊓ c = a ⊓ (b ⊓ c).
2 subgoals A : Set H : Poset A L : Lattice A H a, b, c : A ============================ (a ⊓ b) ⊓ c ≤ a ⊓ (b ⊓ c)subgoal 2 is: a ⊓ (b ⊓ c) ≤ (a ⊓ b) ⊓ c

...eventually became this monstrosity with 109 subgoals and counting:

Theorem meet_associative : forall a b c:A, (a ⊓ b) ⊓ c = a ⊓ (b ⊓ c).
109 subgoals A : Set H : Poset A L : Lattice A H a, b, c : A ============================ (a ⊓ b) ⊓ c ≤ b /\ (a ⊓ b) ⊓ c ≤ c
subgoal 2 is: b ⊓ c ≤ bsubgoal 3 is: a ⊓ b ⊓ c ≤ csubgoal 4 is: b ⊓ c ≤ bsubgoal 5 is: a ⊓ b ⊓ c ≤ csubgoal 6 is: b ⊓ c ≤ bsubgoal 7 is: a ⊓ b ⊓ c ≤ csubgoal 8 is: b ⊓ c ≤ b...

## The Problem

The issue? Every tactic I used in the previous project, for example apply associativity, when applied incorrectly always produced a Coq error and left the state of the Coq proof intact. The apply transitivity tactic, however, always works.

And that means the proof can go on and on and on, without terminating.

...............................................104 subgoals A : Set H : Poset A L : Lattice A H a, b, c : A ============================ (a ⊓ b) ⊓ c ≤ bsubgoal 2 is: (a ⊓ b) ⊓ c ≤ csubgoal 3 is: b ⊓ c ≤ b...............................................

Agent chooses action: Apply Transitivity. Breaking up subgoal... (a ⊓ b) ⊓ c ≤ b ...into 2 subgoals (a ⊓ b) ⊓ c ≤ b ⊓ c and b ⊓ c ≤ b.

...............................................105 subgoals A : Set H : Poset A L : Lattice A H a, b, c : A ============================ (a ⊓ b) ⊓ c ≤ b ⊓ csubgoal 2 is: b ⊓ c ≤ bsubgoal 3 is: a ⊓ b ⊓ c ≤ c...............................................

The solution seems obvious -- to somehow limit or disincentivize actions that increase the number of subgoals created in the proof, or to somehow limit or disincentivize ridiculously long proofs.

But the issue is deeper. The apply transitivity tactic often generates subgoals that are impossible to prove. And once it has done that, there is no way for the agent to know that the impossible-to-prove inequality a ⊓ b ≤ c is not going to become easier to prove after rewriting it as a ⊓ b ≤ d and d ≤ c. And because the agent doesn't know this technique isn't working, it doesn't know it should backtrack to give other proof tactics a chance.

## What Does Not Work

So how to fix it? First, a briefing on the most obvious solutions, and why they don't work.

Non-solution #1: Give small negative rewards to the reinforcement learning agent when it takes a step or increases the number of subgoals.

• Why it doesn't work: As mentioned above, all it takes is one misstep for the agent to generate an impossible-to-solve subgoal. At that point, every Coq tactic will fail except apply transitivity. And so, no matter how negative the expected reward, the agent will apply transitivity. As Sherlock Holmes says, once you eliminate the impossible, whatever remains, no matter how [negatively rewarded in the reinforcement learning q-table], must be the [next action chosen by the agent].

Non-solution #2: Prune off proof paths that will fail from the proof search tree. That is, once the proof search yields a subgoal that is impossible to prove, backtrack.

• Why it doesn't work: Pruning failing nodes of a tree requires knowing where the failing nodes are. But this proof tree has no failing nodes -- the apply transitivity tactic never fails. Even if the code gets stuck in a loop cycling back and forth between the same proof state, the agent doesn't know which of the many tactics it has applied, or inequalities it is trying to solve, is causing the problem.

Non-solution #3: Set a limit for how long a proof can be.

• Why it doesn't work: This theorem requires 23 steps to prove, and there are about 10 actions the agent could take at each step. So, even with the shallowest depth limit allowable, the proof search may take up to 10^23 tries..so while this technique guarantees proof convergence, it doesn't do it nearly quickly enough.

Non-solution #4: Set a limit for how many subgoals the proof state can have.

• Why it doesn't work: The issue is the same as above -- even setting a limit of 4 maximum subgoals (which is how many subgoals the proof eventually needs), there are so many different options for what each subgoal could be, the proof search is still unwieldy.

Non-solution #5: Once the proof search visits an already-visited state (that is, an identical set of subgoals), backtrack one step of the proof.

• Why it doesn't work:, Just because the set of subgoals was already visited, doesn't mean any future actions are doomed. Suppose you are the agent, and you reach a point in the proof such that the only subgoal left to prove a=a. But because you are a silly machine, instead of applying reflexivity to finish the proof, you apply transitivity. It doesn't work, so you backtrack. But now you're at the same state you've visited before -- where the subgoal is a=a! So you suppose you're caught in a cycle of some sort, and you backtrack to some other part of the proof where you have to prove a*1=a, never to return to the state where your only subgoal was a=a. And now that you've stopped yourself from returning to that state, you can never finish the proof.

Now...what finally did cause the proof to terminate in a reasonable amount of time?

## What Does Work

It was a combination of non-solution #3 and #4 (limiting proof depth) in addition to a variation of non-solution #5 (not visiting already-visited states) that did the trick.

Solution:

First, make sure to cap on the number of past actions and subgoals, and never visit the same set of actions twice...

class State: ... def get_available_actions(self): ... if len(self.past_actions) > 25: # max out at more than 25 steps return [] if self.num_subgoals > 4: # max out at more than 4 subgoals return [] ... return self.possible_actions_given_past_actions[self.past_actions] ... def add_attempted_action(self, action): ... self.possible_actions_given_past_actions[self.past_actions].remove(action)

...and then, once the proof search visits an already-visited state (that is, an identical set of subgoals) in the list of past actions, then backtrack one step of the proof.

class Environment: ... def step(self, action): response = self.coqinterface.sendone(action.to_coq()) ... if self.state.all_subgoals in self.state.visited_subgoals: to_undo = True

The issue with non-solution #5 was that it didn't correctly detect when the agent was caught in a loop. Non-solution #5 said the agent was in a loop if the current node shared its proof state with any other visited node. In fact, the agent is in a loop only if the current node shares its proof state with an its own ancestor node. This new solution correctly catches those loops.

Why it works: This technique saves the agent from getting caught in loops, which greatly speeds up the proof search, and causes rather fast convergence on a solution. Proof convergence is guaranteed by limiting the depth of the proof search tree to 25 steps (assuming of course that the agent is given all of the actions it needs and only needs 25 steps to solve the proof), and adding this sort of loop-pruning gets rid of sufficiently many options that the proof can converge quickly.

## Further Speed-Ups

Of course, there were other factors at play.

The proof would not have converged quite as quickly without a reward function that penalizes creating errors and entering loops, and rewards solving subgoals.

class Environment: ... def step(self, action): ... if "Error" not in response: reward = 10 # slight positive reward for not producing an error ... if self.state.done: reward += 100 # make reward even bigger if it was the last step to proving the theorem ... else: if self.state.num_subgoals < old_num_subgoals: reward += 10 # make reward bigger if number of subgoals was reduced. ... if self.state.all_subgoals in self.state.visited_subgoals: reward = -20 # make reward smaller for getting us into a loop ... else: reward = -1000 # big negative reward for error

...which served to allow the proof to terminate after 416 steps of proof search...

# --------------------------------# Training...# --------------------------------
The proof of meetIsAssociative...took 416 steps.
Proof generated: [ApplyAntisymmetric, ApplyMeetDefinition, SplitTheAndStatement, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyMeetDefinition, SplitTheAndStatement, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyMeetDefinition, SplitTheAndStatement, ApplyMeetDefinition, SplitTheAndStatement, ApplyMeetIsLowerBound, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound]

....and which updated the q-table such that after just this one round of training, the agent completed its next attempt at the proof in the minimal number of steps (choosing the optimal action at each step).

# --------------------------------# Evaluating...# --------------------------------
The proof of meetIsAssociative...took 23 steps.
Proof generated: [ApplyAntisymmetric, ApplyMeetDefinition, SplitTheAndStatement, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyMeetDefinition, SplitTheAndStatement, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyMeetDefinition, SplitTheAndStatement, ApplyMeetDefinition, SplitTheAndStatement, ApplyMeetIsLowerBound, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound, ApplyTransitivity, ApplyMeetIsLowerBound, ApplyMeetIsLowerBound]

## Takeaways

So yes, running a proof-search by (1) limiting proof depth and then (2) backing up the tree when about to enter a loop does allow the proof search to converge fairly rapidly.

But there is still something heartbreaking about the naivete of this agent, always hopeful, even when a particular subgoal is impossible to prove, that applying transitivity will make it work out...

Really, the most aesthetic solution seems to be pruning the tree when a subgoal has proven impossible to resolve given the tactics at hand. But proving "impossibility" in a proof search that has no dead ends (because some tactics such as apply transitivity will always work to change the proof state) seems to be a challenge...and possibly undecidable. If I'm not completely sure how to describe why I as a human have the intuition that a ⊓ b ≤ c is not going to be proven with transitivity, how can I teach it to a machine? Still, training a reinforcement learning agent to know at what point a proof is going to be impossible seems to be an interesting (although possibly far-fetched) question to guide future research. Another interesting question would be determining the limitations of theorem-proving agents, given that proving impossibility may be undecidable.

Code Repository: The code for this project is available on https://github.com/anshula/meet-associativity-theorem-prover. It consists of a Python reinforcement learning framework that guarantees termination for a potentially-endless Coq proof search, using the particular example of proving associativity of the meet operation for lattice theory.
Acknowledgements: The reinforcement learning part of this project builds on the theorem-proving reinforcement learning framework of a team I worked on while at the Center for Brains, Minds, and Machines at MIT, and the Coq code on lattice theory builds on the lattice theory code of other researchers. I did this project as part of my ongoing research at UNAM under the guidance of Prof. Favio Ezequiel Miranda Perea and Prof. Lourdes del Carmen González Huesca.