PhD project

I am interested in speeding up Cylindrical Algebraic Decomposition, an algorithm with multiple applications in applied and pure mathematics. To speed it up, I am both improving the theory behind it and using Machine Learning to choose the variable ordering, a parameter that has a huge impact on the resources needed to run the CAD algorithm.

My PhD project is titled Heuristics and Machine Learning to Improve Symbolic Computation Algorithms: Speeding Up Quantifier Elimination, and consists of developing strategies to make choices that speed up Cylindrical Algebraic Decomposition (CAD) without compromising the validity of its output.

It is common for the people developing these procedures to design heuristics based on their own expertise. Nevertheless, in recent years Machine Learning has been shown to outperform human experts in a wide range of fields.

The main focus of my PhD will be the choice of variable ordering for Cylindrical Algebraic Decomposition. Dorian Florescu already used Machine Learning to make this choice; I will extend his work and try alternative approaches, aiming to develop the best existing methodology for training machine learning models that receive sets of polynomials as inputs.

Object of Study: CAD

Uses

Given a set of polynomials with real coefficients in n variables, the CAD algorithm breaks n-dimensional space into connected regions in which each polynomial is sign-invariant.

It can be used to solve Quantifier Elimination problems, which have multiple applications in applied mathematics, and, if one can determine which cells are "neighbors", it can be used to find the number of connected components of any algebraic variety.

Conclusions reached with CAD are certain, unlike those obtained with numerical algorithms; CAD was recently used to disprove a biological conjecture that numerical algorithms appeared to verify.

CAD of the polynomial set {xy + y - 1, x^2 - y^2 - y^3}. The polynomials are sign-invariant at every red point, along every blue, green or grey line, and in every white region.
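What sign-invariance means can be checked numerically. Below is a plain-Python sketch (not part of any CAD implementation) that evaluates the two example polynomials at a few sample points chosen by hand inside one open region; every point of a cell yields the same sign vector.

```python
def sign(v):
    return (v > 0) - (v < 0)

# The two example polynomials, as plain Python functions.
polys = [lambda x, y: x*y + y - 1,
         lambda x, y: x**2 - y**2 - y**3]

def sign_vector(x, y):
    """Signs (-1, 0, 1) of every polynomial at the point (x, y)."""
    return tuple(sign(p(x, y)) for p in polys)

# Sample points chosen by hand inside one open region:
samples = [(10, 10), (11, 12), (9, 15)]
print({sign_vector(x, y) for x, y in samples})  # a single sign vector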

Comparing exponential and doubly exponential growth.
2^n: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, ...
2^(2^n): 2, 4, 16, 256, 65536, 4294967296, 18446744073709551616 (not in the image), ...
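The two sequences above are 2^n and 2^(2^n); a few lines of Python make the gap tangible:

```python
# Tabulate 2^n (exponential) against 2^(2^n) (doubly exponential).
for n in range(7):
    print(n, 2**n, 2**(2**n))
```

Already at n = 6 the doubly exponential column exceeds 10^19, while the exponential one is still at 64.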

Complexity

CAD's potential is strongly limited by its huge computational complexity: the cost grows doubly exponentially in the number of variables.

Everyone has heard about the fast growth of the exponential functions that determined the spread of COVID-19; this growth is illustrated by the piles on the left of the image (excluding the biggest). Doubly exponential growth is illustrated by the piles on the right of the image, including the last pile!

Variable ordering

However, not everything is lost!
Researchers realised that the complexity of this algorithm depends strongly on the chosen variable ordering (the order in which the variables are studied).

In fact, they discovered some polynomials for which a bad variable ordering resulted in doubly exponential computational complexity while a good variable ordering had constant complexity, that is, the cost remained the same regardless of the number of variables.

CADs of the polynomial set {x^5 + 5x^4 + 5x^3 - 5x^2 - 6x - 2y}.

Studying y first, we obtain a CAD with 57 cells (18 shaded areas, 27 curve segments and 12 points).
Studying x first generates only 3 cells (2 areas and one curve).
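The contrast between the two orderings can be glimpsed numerically. Studying y first, the equation solves to a single function y = (x^5 + 5x^4 + 5x^3 - 5x^2 - 6x)/2, giving one curve; studying x first, each fixed value of y leaves a quintic in x, which can have up to five real roots. The sketch below (illustrative only, counting sign changes on a grid) confirms the five roots at y = 0, where the quintic factors as x(x-1)(x+1)(x+2)(x+3):

```python
def p(x, y):
    return x**5 + 5*x**4 + 5*x**3 - 5*x**2 - 6*x - 2*y

def sign_changes(f, lo, hi, steps=601):
    """Count sign changes of f on a grid: a lower bound on its real roots."""
    vals = [f(lo + i*(hi - lo)/steps) for i in range(steps + 1)]
    return sum(1 for a, b in zip(vals, vals[1:]) if a*b < 0)

# At y = 0 the quintic in x factors as x(x-1)(x+1)(x+2)(x+3):
# five real roots, so studying x first splits the x-axis into many cells.
print(sign_changes(lambda t: p(t, 0), -4, 2))  # 5
```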

Contributions

Dataset

Number of sets for which each variable ordering is optimal before and after augmenting.

The SMT-LIB library is an online repository containing a very useful collection of sets of polynomials (QF_NRA) that has been used by previous research to evaluate strategies for choosing the variable ordering and to train machine learning models for that task.

During an exploration of this collection we discovered that the dataset presented some issues: it contained very similar sets of polynomials, and even repeated ones; moreover, it was imbalanced, meaning that the optimal variable ordering was the same for many of these sets. Using this collection for machine learning could therefore result in models being tested on the same examples they were trained on; worse, the models could simply learn to choose the most common optimal variable ordering for all sets.

We presented a methodology to overcome these issues that included augmenting the training dataset of sets of polynomials in a similar way to how computer-vision datasets are augmented (by rotating the algebraic varieties of the sets of polynomials).
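One natural way to realise this augmentation is to permute the variables of each set of polynomials: every permutation yields a new, equally valid sample whose optimal ordering is the correspondingly relabelled one. The representation and helper names below are illustrative, not the thesis code:

```python
from itertools import permutations

# One sample: a set with a single polynomial x1^2*x2 - x3, stored as a list
# of (coefficient, (e1, e2, e3)) monomials, plus its optimal ordering.
sample = [[(1, (2, 1, 0)), (-1, (0, 0, 1))]]
optimal = (0, 1, 2)

def permute_sample(polys, ordering, perm):
    """New variable j takes the role of old variable perm[j]."""
    new_polys = [[(c, tuple(exps[p] for p in perm)) for c, exps in poly]
                 for poly in polys]
    new_ordering = tuple(perm.index(v) for v in ordering)
    return new_polys, new_ordering

augmented = [permute_sample(sample, optimal, perm)
             for perm in permutations(range(3))]
print(len(augmented))  # 6 samples, one per permutation
```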

Heuristics

Various researchers have used their expertise in the field to propose heuristics for choosing the variable ordering, but the features taken into account appeared quite arbitrary to me.

During my PhD I presented a methodology to derive a heuristic (gmods) from an existing complexity analysis. This methodology can easily be replicated for other choices in symbolic computation, and in our case study it became the new state-of-the-art heuristic.
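To give a flavour of what a degree-based heuristic looks like, here is a toy greedy sketch that repeatedly projects the variable with the smallest total degree. It is only illustrative and in the spirit of complexity-driven heuristics; it is not the exact gmods formula:

```python
# Polynomials as lists of (coefficient, exponent-tuple) monomials.
def degree_in(poly, v):
    """Max exponent of variable v over the monomials of poly."""
    return max((exps[v] for _, exps in poly), default=0)

def greedy_ordering(polys, n_vars):
    """Repeatedly pick the variable with the smallest total degree."""
    remaining = list(range(n_vars))
    order = []
    while remaining:
        v = min(remaining,
                key=lambda v: sum(degree_in(p, v) for p in polys))
        order.append(v)
        remaining.remove(v)
    return order

polys = [[(1, (2, 1, 0)), (-1, (0, 0, 1))],   # x1^2*x2 - x3
         [(1, (0, 3, 1))]]                    # x2^3*x3
print(greedy_ordering(polys, 3))  # [0, 2, 1]
```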

The first image compares the existing heuristics; the second shows how the best existing heuristic compares to the new one.

Machine learning

Regarding machine learning, simply by being trained on our improved dataset, existing models achieved an average reduction of 38% in the total time needed to build a CAD.

One of the contributions I'm most proud of is creating a categorisation of machine learning paradigms that helped me understand the field better and that I hope can be useful for other researchers.

Difficulties

In classical Machine Learning problems each sample has a fixed number of features; for example, to predict someone's salary we might be given their age, level of education, field of expertise and years in the company. This is why Machine Learning models are designed to be fed a fixed number of features.

However, a set of polynomials has exactly the opposite shape: it is made of an arbitrary number of polynomials, each polynomial has an arbitrary number of monomials, and each monomial involves an arbitrary number of variables.

This makes polynomials far from ideal inputs for machine learning models. However, it is possible to extract a fixed set of features from each set of polynomials and input those into the model. Other alternatives are feeding the description of the polynomials as natural text to a Recurrent Neural Network, or feeding a multidimensional "image" of the points where the polynomials equal 0 to a multidimensional Convolutional Neural Network.
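The first of those alternatives can be sketched in a few lines. The particular features below (number of polynomials, maximum degree of each variable, and the proportion of polynomials in which each variable occurs) are illustrative choices, not the exact feature set used in the thesis:

```python
# Polynomials as lists of (coefficient, exponent-tuple) monomials.
def features(polys, n_vars):
    """Map a variable-size set of polynomials to a fixed-size vector."""
    feats = [len(polys)]                      # number of polynomials
    for v in range(n_vars):
        degs = [max((e[v] for _, e in p), default=0) for p in polys]
        feats.append(max(degs))               # max degree of variable v
        feats.append(sum(d > 0 for d in degs) / len(polys))  # occurrence ratio
    return feats

polys = [[(1, (2, 1, 0)), (-1, (0, 0, 1))],   # x1^2*x2 - x3
         [(1, (0, 3, 1))]]                    # x2^3*x3
print(features(polys, 3))  # [2, 2, 0.5, 3, 1.0, 1, 1.0]
```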

We ultimately overcame this difficulty by imitating the methodology presented by Dorian Florescu and Matthew England.

Shifting from classification to regression

This categorisation motivated a shift of paradigm from classification to regression, to increase the information provided to the models. The change did not substantially improve performance on average, but it generated the best-performing model in this thesis.
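The difference in the information given to the models can be sketched with hypothetical timing data: classification keeps only the index of the fastest ordering, while regression keeps one runtime target per ordering and picks the ordering with the smallest prediction at test time. The sample names and runtimes below are made up for illustration:

```python
import math

timings = {  # hypothetical CAD runtimes (seconds) per variable ordering
    "sample_1": [1.2, 30.0, 0.9],
    "sample_2": [5.0, 4.5, 60.0],
}

# Classification: one label per sample, the index of the fastest ordering.
classification_labels = {s: min(range(len(t)), key=t.__getitem__)
                         for s, t in timings.items()}

# Regression: one target per ordering (log-times, to tame the spread).
regression_targets = {s: [math.log(v) for v in t]
                      for s, t in timings.items()}

# At test time a regressor picks the ordering with the smallest prediction:
predicted = [0.3, 2.1, 0.1]
print(min(range(len(predicted)), key=predicted.__getitem__))  # 2
```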

Moreover, even though for choosing the variable ordering in CAD this methodology is still limited to a fixed number of variables, it has further potential applications to other choices in symbolic computation.

Making a series of choices

Our last contribution regarding machine learning was proposing a methodology for training models capable of making a series of choices rather than a single choice. 

This methodology has even broader potential for making choices in symbolic computation; in particular, for our case study it can choose a variable ordering for sets of polynomials in an arbitrary number of variables without losing performance on our three-variable testing dataset.
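The idea of a series of choices can be sketched as follows: instead of one model that outputs a whole ordering (which fixes the number of variables), a scoring model repeatedly picks the next variable to project, so any number of variables is supported. The scoring function below is a hand-made stand-in for a trained model, not the thesis model:

```python
# Polynomials as lists of (coefficient, exponent-tuple) monomials.
def score(polys, v):
    """Stand-in model score: total degree of v (lower = project first)."""
    return sum(max((e[v] for _, e in p), default=0) for p in polys)

def choose_ordering(polys, n_vars):
    """Build the ordering one choice at a time."""
    remaining = list(range(n_vars))
    ordering = []
    while remaining:
        v = min(remaining, key=lambda v: score(polys, v))
        ordering.append(v)
        remaining.remove(v)
        # A full implementation would project `polys` here before rescoring.
    return ordering

polys = [[(1, (1, 4)), (1, (0, 2))]]          # x1*x2^4 + x2^2
print(choose_ordering(polys, 2))  # [0, 1]
```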

Machine learning results

Legend

A: Trained and tested on the imbalanced dataset (cheating). The rest are tested on the balanced dataset.

B: Classification. Trained on the imbalanced dataset.

C: Classification. Trained on the balanced dataset.

D: Classification. Trained on the augmented dataset.

E: Shifting to regression.

F: Making a series of choices.

Explainable AI

Finally, we close the cycle by coming back to heuristics: we were the first researchers in symbolic computation to use Explainable AI to extract insights from trained machine learning models, which often behave as black boxes, and to deduce from them heuristics that can be understood by humans.

Thanks to this pioneering effort, we were able to uncover a new state-of-the-art heuristic, highlighting the potential of Explainable AI to reveal deeper understanding within symbolic computation.


A plot illustrating the impact that each feature had on a decision made by an MLP model. The model was deciding whether an ordering was optimal; the features were interpreted as contributing towards that ordering being optimal, which was ultimately the decision the model made.
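To make the idea of feature attribution concrete, here is a toy sketch of permutation importance, one simple explainability technique: shuffle one feature across samples and measure how much the model's accuracy drops. The model here is a hand-made stand-in, not the trained MLP from the thesis:

```python
import random

random.seed(0)

def model(feats):
    """Stand-in classifier: only feature 0 matters."""
    return int(feats[0] > 0.5)

# Synthetic dataset labelled by the model itself (base accuracy is 1.0).
data = [[random.random(), random.random()] for _ in range(200)]
data = [(f, model(f)) for f in data]

def accuracy(predict, dataset):
    return sum(predict(f) == y for f, y in dataset) / len(dataset)

base = accuracy(model, data)
drops = []
for j in range(2):
    shuffled = [f[j] for f, _ in data]
    random.shuffle(shuffled)
    permuted = []
    for s, (f, y) in zip(shuffled, data):
        g = list(f)
        g[j] = s
        permuted.append((g, y))
    drops.append(base - accuracy(model, permuted))

print(drops)  # feature 0 matters, feature 1 does not
```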

Results

Here you can play with the various strategies and see how they compare with each other using a box plot and a survival plot.