Note: the recorded Q&A session of Jan 13 on these presentations can be found at the bottom of this page.
Review on formal methods for AI robustness
The problem of adversarial robustness has been studied extensively for neural networks. However, for boosted decision trees and decision stumps there are almost no results, even though they are widely used in practice (e.g. XGBoost) due to their accuracy, interpretability, and efficiency. We show in this paper that for boosted decision stumps the exact min-max robust loss and test error for an L∞ attack can be computed in O(T log T) time per input, where T is the number of decision stumps, and that the optimal update step of the ensemble can be done in O(n^2 T log T) time, where n is the number of data points. For boosted trees we show how to efficiently calculate and optimize an upper bound on the robust loss, which leads to state-of-the-art robust test error for boosted trees on MNIST, FMNIST, and CIFAR-10. Moreover, the robust test error rates we achieve are competitive with those of provably robust convolutional networks. The code for training our provably robust models is available at https://github.com/max-andr/provably-robust-boosting.
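To illustrate why the exact robust loss is tractable for stump ensembles: stumps acting on different coordinates can be attacked independently, and per coordinate the ensemble output is piecewise constant between the stump thresholds, so only the regions intersecting [x_i - eps, x_i + eps] need to be checked. The following minimal sketch (not the paper's implementation; it enumerates regions by brute force rather than using the O(T log T) sort-and-sweep) computes the exact worst-case margin of a toy ensemble under an L∞ attack:

```python
from collections import defaultdict

def worst_case_margin(stumps, x, y, eps):
    """stumps: list of (feature, threshold, w_left, w_right); label y in {-1, +1}.
    Returns min over ||x' - x||_inf <= eps of y * F(x'), where
    F(x') = sum over stumps of (w_left if x'[feature] < threshold else w_right)."""
    by_feature = defaultdict(list)
    for f, thr, wl, wr in stumps:
        by_feature[f].append((thr, wl, wr))
    margin = 0.0
    for f, group in by_feature.items():
        lo, hi = x[f] - eps, x[f] + eps
        # The per-feature sum is piecewise constant, so it suffices to test one
        # point inside each region between consecutive reachable thresholds.
        ts = sorted(thr for thr, _, _ in group if lo < thr <= hi)
        pts = [lo] + ts + [hi]
        candidates = {lo, hi} | {(pts[k] + pts[k + 1]) / 2 for k in range(len(pts) - 1)}
        margin += min(y * sum(wl if v < thr else wr for thr, wl, wr in group)
                      for v in candidates)
    return margin

# Toy ensemble on two features.
stumps = [(0, 0.5, -1.0, 1.0), (0, 0.8, -0.5, 0.5), (1, 0.3, 1.0, -1.0)]
print(worst_case_margin(stumps, x=[0.6, 0.2], y=+1, eps=0.2))
```

A negative worst-case margin means an adversarial example exists inside the L∞ ball; the robust test error is then simply the fraction of test inputs with a negative margin.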
Do you have any illustration of adversarial examples on tabular data?
In our paper (https://arxiv.org/pdf/1906.03526.pdf), we have illustrations only for image data (MNIST, German Traffic Sign Dataset). The reason is that it's harder to illustrate adversarial examples for tabular datasets. In particular, this would require some domain knowledge of the particular dataset and its features. We tried to provide a bit more insight on this, e.g., in Fig. 5, by analyzing which features are used, and how often, with normal and robust training, and we already see a substantial difference there. However, in order to conclude whether this difference makes sense, we would need a domain expert with a deeper knowledge of the problem. Perhaps that is the reason why research on adversarial robustness focuses mostly on image data: no special domain expertise is needed to interpret the classifier and its adversarial perturbations.
Can you provide more details on your last experiments ('We outperform almost all provable defenses for CNNs')? Did you compare with SOTA verification methods such as Reluplex or DeepPoly?
To the best of my knowledge, Reluplex hasn't been shown to scale to large neural networks that achieve competitive verified error on MNIST for eps=0.3. We based our comparison primarily on the numbers reported in the Interval Bound Propagation (IBP) paper (https://arxiv.org/pdf/1810.12715.pdf). Note that they also compared to "Differentiable abstract interpretation for provably robust neural networks" (one of the papers from Martin Vechev's group preceding DeepPoly) in the appendix, where they found that those results are suboptimal compared to IBP: 18.0% vs 8.1% verified error (note that our robust trees achieve 12.5%). However, as a reminder, our paper was published in 2019, so it's likely that during the last 1-2 years new approaches have appeared that achieve tighter verification.
Is there a generally accepted approach on how to scale and distribute the attacker budget on heterogeneous tabular data, given that the different variables are generally incomparable? (e.g. apply the attack to standardized data)
In our paper, we used an L∞-ball constraint, which is uniform in each dimension (but note that we first normalized all features to lie in the range [0, 1]). Of course, this is likely not a practically relevant perturbation set. On the other hand, to make it more relevant, one can use non-uniform bounds for each dimension with our method, but then a domain expert would be needed to set them. This would be interesting to explore in future work in collaboration with domain experts.
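For illustration (not from the paper), after min-max normalisation a non-uniform perturbation set is just a per-feature box constraint; the feature names and budgets below are hypothetical and would in practice be set by a domain expert:

```python
import numpy as np

# Hypothetical tabular data: two features, e.g. age and income.
X = np.array([[30., 50_000.], [45., 72_000.], [52., 61_000.]])
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)      # each feature normalised to [0, 1]

eps = np.array([0.05, 0.10])                # hypothetical per-feature budgets
lower = np.clip(X_norm - eps, 0.0, 1.0)     # per-sample box constraints defining
upper = np.clip(X_norm + eps, 0.0, 1.0)     # the non-uniform perturbation set
```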
Did you experiment with using the pre-activation variables of the first layer of a CNN, or equivalently applying a linear projection to the inputs, to obtain more meaningful inputs?
We haven't experimented with this, but it's a natural future direction. I can imagine that there can be benefits of combining the best of both worlds: the expressivity of CNNs (and suitability for computer vision) and the ease of certification of tree-based models.
We showcase two recent advances in robustness verification. First, we show how randomized smoothing, which allows statistical robustness verification of large models and complex datasets, can be extended to the verification of geometric perturbations (e.g., rotations and translations). Second, we show how formal methods developed for the robustness verification of neural networks can be used to certify individual fairness, that is, to certify that similar individuals are treated similarly.
When you learn a fair representation, how do you guarantee that the data producer, which is a neural network, can accurately encode the similarity function? In particular, is it possible to guarantee that the producer correctly produces the set of representations of all similar individuals for all possible inputs, including those not included in the training dataset?
We use the standard empirical risk minimisation setting for training, similar to adversarial training. In order to verify the similarity property, we need to perform verification (using MILP) for each individual input. Since we verify individual inputs, it does not matter whether these inputs come from the training dataset or are unseen.
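As a rough illustration of such a per-input MILP check (a hedged sketch, not the authors' tool; the toy network, weights, and similarity set are all made up), one can encode the ReLUs with a standard big-M formulation and ask whether any "similar" input can flip the decision for a given individual:

```python
import numpy as np
import pulp

# Toy network: score(x) = w2 . relu(W1 x + b1) + b2, decision = sign(score).
W1 = np.array([[1.0, -2.0], [0.5, 1.0]]); b1 = np.array([0.2, -0.1])
w2 = np.array([1.0, 0.5]);                b2 = -0.3

x0 = np.array([0.7, 0.4])           # x0[0]: non-sensitive feature, x0[1]: sensitive feature
eps = 0.05                          # "similar" = within eps on the non-sensitive feature,
lo = np.array([x0[0] - eps, 0.0])   # any value of the sensitive feature in [0, 1]
hi = np.array([x0[0] + eps, 1.0])

# Pre-activation interval bounds, used as big-M constants.
zl = np.minimum(W1 * lo, W1 * hi).sum(axis=1) + b1
zu = np.maximum(W1 * lo, W1 * hi).sum(axis=1) + b1

prob = pulp.LpProblem("individual_fairness", pulp.LpMinimize)
x = [pulp.LpVariable(f"x{i}", lowBound=float(lo[i]), upBound=float(hi[i])) for i in range(2)]
y = []
for j in range(2):
    z = pulp.lpSum(float(W1[j, i]) * x[i] for i in range(2)) + float(b1[j])
    yj = pulp.LpVariable(f"y{j}", lowBound=0)          # y_j = relu(z_j), big-M encoding
    dj = pulp.LpVariable(f"d{j}", cat="Binary")
    prob += yj >= z
    prob += yj <= z - float(zl[j]) * (1 - dj)
    prob += yj <= float(zu[j]) * dj
    y.append(yj)
    # (Stably active/inactive neurons would normally be fixed to their linear phase.)

score = pulp.lpSum(float(w2[j]) * y[j] for j in range(2)) + b2
prob += score                                          # minimise score over all similar inputs
prob.solve(pulp.PULP_CBC_CMD(msg=0))

# x0 itself is classified positive here; if the worst-case score stays >= 0, no similar
# individual can receive a different decision, so fairness is certified for x0.
# Otherwise the optimal x is a concrete counterexample.
print("worst-case score:", pulp.value(score))
```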
Can you give an intuition of by how much you usually abstract your models? E.g., what is the ratio of final to initial size?
In our paper we reported a wide range of final sizes - from a 25% increase to a 66% decrease. Interestingly, even for the cases where the size increased, solving times were significantly improved, presumably because the resulting network was still much simpler (due to the splitting of neurons into the 4 categories).
Any idea on how to generalize your abstraction to other neural networks (i.e., neural networks using other activation functions than ReLUs)?
Answered orally in the recorded Q&A session.
In your opinion, in order to reach the “right” abstracted network (one able to prove the property concerned but with fewer neurons), is it better to start from the original network and merge a few neurons per iteration, or to start from the most abstracted one and go backward?
Coming up with the "right" initial abstraction is key here - it can have a very significant impact on performance. This is an active research direction that we're currently looking into. I don't have much intuition here, except that, e.g., for local adversarial robustness queries it seems useful to start from the most abstract network and work your way backwards. Presumably, this is because these queries tend to be simple and local, and thus can still be answered even when a great deal of information about the DNN is abstracted away.
In your experience, is the scalability issue of current reachability-based methods such as Reluplex more linked to the width or the depth of the network?
In general, deeper networks are easier to verify than wider networks. The intuition is that in deeper networks, case splits in earlier layers can fix many subsequent activation functions into one of their linear phases. Of course, this is just a rule of thumb, and things vary between networks and queries.
Can you give more detail on how your method splits neurons according to increasing and decreasing incoming edges? Can it be applied to any hidden layer, or does it have to be a backward approach? Do you have any empirical analysis of the size of the surrogate network? Have you extended it to other activation functions?
Currently, the approach proceeds backwards, from the output layer to the input layer. There is no way to "skip" layers for now - this is an interesting research question. In the worst case, each neuron is split in 4, and each of these new neurons is classified into one category. Thus, in the worst case, the surrogate network is 4 times larger (before abstraction is applied). Currently we have only applied this approach to ReLUs.
From a development point of view, how do you manage to recursively modify the graph to obtain your surrogate?
We just iterate over the layers backwards, from the output layer towards the input layer; at each layer we split each neuron in 4 and adjust the edges accordingly.
Can this approach be extended to more complex architectures such as ResNets and especially DenseNets? It seems like the minimum number of condensed neurons would be exponential in the number of direct "child" layers.
Answered orally during the recorded Q&A session.
Deep neural networks are widely used for nonlinear function approximation with applications ranging from computer vision to control. Although these networks involve the composition of simple arithmetic operations, it can be very challenging to verify whether a particular network satisfies certain input-output properties for safety and robustness. In this talk, we will first provide an overview of existing methods for soundly verifying such properties. These methods borrow insights from reachability analysis, optimization, and search. On the other hand, in certain applications, the neural networks need to be adapted online to meet performance requirements. To perform adaptation safely, we may either constrain the network parameters with respect to the safety specification; or perform run-time verification together with the adaptation. We will discuss potential approaches to apply existing verification methods online to perform run-time verification.
How did you test the Julia implementation of existing methods against the code provided (or not) by the authors? Did you encounter any mismatch between your results and those reported in the literature? Any plans to add more abstract-interpretation-based methods to your framework?
Answered orally during the recorded Q&A session.
Could you expand on what is considered an unaffected branch in the adaptation setting? Have you experimented with applying special, additional regularization to the updates to (heuristically) increase the portion of unaffected branches?
Answered orally during the Q&A session.
We present two recent advances towards precise and scalable robustness certification. Firstly, we show how compositional architectures can combine a state-of-the-art deep network with (smaller) certified ones, deciding at inference time which network processes the input sample, to boost the robustness of the deep network significantly. Secondly, we introduce a precise and fast certification framework based on multi-neuron relaxations, leveraging novel approximation algorithms for the convex hull computation to improve the runtime and precision of state-of-the-art incomplete verification methods.
In your talk, you successfully break the single-neuron convex relaxation barrier by relaxing a group of neurons jointly via their convex hull. How does the complexity of the algorithm grow with k, the number of neurons in a group? In the extreme case, supposing we could form a convex hull over all neurons, how large would the relaxation barrier be compared to complete verifiers such as MIP?
Answered orally in the recorded Q&A session.
In your ACE paper you use an additional network to decide whether to use a provably robust network or a standard network to control the accuracy-robustness trade-off. How exactly is this network trained and how did you evaluate robustness? Does the attacker know of this additional network and is it taken into account for the attack?
Answered orally in the recorded Q&A session.
Machine Learning (ML) is now used in a range of systems with results that are reported to exceed, under certain conditions, human performance. Many of these systems, in domains such as healthcare, automotive and manufacturing, exhibit high degrees of autonomy and are safety critical. Establishing justified confidence in ML forms a core part of the safety case for these systems. We introduce a methodology for the Assurance of Machine Learning for use in Autonomous Systems (AMLAS). AMLAS comprises a set of safety case patterns and a process for (1) systematically integrating safety assurance into the development of ML components and (2) generating the evidence base for explicitly justifying the acceptable safety of these components when integrated into autonomous system applications. A preliminary evaluation of AMLAS is being performed through a number of industrial case studies in which ML implements safety-critical functionality.
Since AMLAS is currently under evaluation with some industrial use cases, does it mean that the current tools and algorithms for the robustness and certification of ML systems are already sufficient for some assurance cases? In your opinion, what are the main technological building blocks that are still missing for AMLAS?
To my knowledge, robustness algorithms are limited to certain applications, and formal verification is possible only in limited cases. Methods for the certification of ML systems are not clearly defined, and AMLAS tries to address this problem by defining a process for the development of a safety case containing enough evidence to be accepted by regulators. Currently, some of the main open problems are defining better methods for formal or test-based verification, out-of-distribution analysis of the data used to establish confidence in the model, and understanding during operation whether we can trust the model in a particular situation. The scope of AMLAS is to define a process and assurance patterns useful for practitioners who want to build a safety case for a complex system containing an ML component.
Do you have examples of data requirements? In the case of images, how do you specify the data requirements?
ML data requirements shall include consideration of the relevance, completeness, accuracy and balance of the data.
Relevance should consider whether the data are appropriate for the operational domain.
EXAMPLE
For an ML component used for object detection on a vehicle the following may be defined as an ML data requirement for relevance: “Each data sample shall assume sensor positioning which is representative of that to be used on the vehicle”. This requirement is defined to ensure that images that provide a very low or very high viewpoint of the road (such as an aerial view) are not used in development.
Completeness should consider whether the development data are complete with respect to a set of measurable dimensions of the operating domain.
EXAMPLE
The operational domain for an autonomous vehicle indicates that the vehicle is to operate at all times of day and that the ML component should be robust to changing light levels. An ML data requirement for completeness may state: “Data samples should be gathered at all times of day and under the following light conditions: sunlight, cloud, rural with headlights and urban street lighting”.
ML data requirements shall include requirements that specify the required accuracy of the development data.
EXAMPLE
Consider an ML safety requirement that all pedestrians should be identified within 50cm of their true position. Given that the pedestrians are not point masses but instead represented as coloured pixels in the image, an accuracy requirement must clearly specify the required position of the label including the positioning of labels for partially occluded objects. An example accuracy requirement may state that: “When labelling data samples, the position of all pedestrians shall be recorded as their extremity closest to the roadway”.
ML data requirements relating to balance shall specify the required distribution of samples in the data sets.
EXAMPLE
DeepMind’s ML model for detecting acute kidney failure reports incredible accuracy and predictive power. However, analysis [16] shows that the data used to train the model was overwhelmingly from male patients (93.6%). In this case, an ML data requirement for balance in the gender of the data sources should have been explicitly specified, since this feature is relevant to the operating context of the model (which will be used for both male and female patients). Similarly, the data was collected from a set of individuals that lacked other forms of diversity. This could lead to the results in operation falling far short of those promised for the affected groups of patients.
In AMLAS, are you assuming that the test distribution (in operation) is the same as the training distribution? If yes, how do you deal with safety for users with different usage distributions? Is a global distribution enough to ensure safety for every usage? Is AMLAS applicable in an open-world environment?
No, we don't assume that the distribution in operation is equal to the training distribution. The internal test set is used by the developers to test their models on new data, ensuring they meet the ML safety requirements defined previously. In addition, once the developers are happy with the model, during the verification phase the model is tested with the verification data, which are unknown to the developers and represent edge cases or, in general, very difficult cases (e.g. identification of a pedestrian in very poor visibility conditions). The model shall meet the ML safety requirements also when tested with the verification data. During the deployment stage, all the assumptions and the input data should be monitored to assure we have enough confidence in the model output. We believe AMLAS is suitable for an open-world environment but not for online/continuous learning. AMLAS is particularly well suited to supervised learning but needs to be tested for unsupervised and reinforcement learning.
Who is supposed to generate the verification data? Authorities? System developer?
Verification data should be generated by a team that is totally independent from the system developers. It is really important that these data are completely unknown to the developers. The team generating the verification data can be authorities or other developers who know the problem, but evidence of the independence between the development data and the verification data shall be provided.
Did I understand correctly that you suggested the validation dataset should include, or even consist only of, edge cases or specifically hard samples? If so, how do we define what these are, and do we / should we consider this an evaluation on out-of-distribution samples during model design?
Yes, the verification data shall include edge cases and hard samples. These are defined from knowledge of the operational domain combined with the performance and robustness requirements that have been defined. The robustness requirements define regions of the domain where a certain accuracy is required; these regions can be analysed and edge cases/hard samples can be determined. For example, if the car should identify pedestrians in all atmospheric conditions and also when they are partially occluded, a hard sample could be a pedestrian on a foggy day partially occluded by a parked car. Rather than an evaluation on out-of-distribution samples, I would see this as an evaluation of model robustness. We are aware that, especially in an open world, we can't anticipate all possible hard samples; this is why monitoring the system in operation is necessary.
Recently, there is growing concern that machine-learned software, which currently assists or even automates decision making, reproduces, and in the worst case reinforces, bias present in the training data. The development of tools and techniques for certifying fairness of this software or describing its biases is, therefore, critical. In this paper, we propose a perfectly parallel static analysis for certifying fairness of feed-forward neural networks used for classification of tabular data. When certification succeeds, our approach provides definite guarantees, otherwise, it describes and quantifies the biased input space regions. We design the analysis to be sound, in practice also exact, and configurable in terms of scalability and precision, thereby enabling pay-as-you-go certification. We implement our approach in an open-source tool called Libra and demonstrate its effectiveness on neural networks trained on popular datasets.
Would it be possible to apply your method by downscaling the models, e.g. using a method similar to that in Guy Katz's talk?
Yes, absolutely. One thing to keep in mind, though, is that doing so will make the approach sound but not complete (while it is currently complete in practice since we use polyhedra for the backward analysis).
In your paper, you write that deleting a sensitive feature altogether from the training set is not a satisfying solution because there can exist another feature that can serve as a proxy for the deleted one. Is there any way to take this into account within your certification framework? E.g. you can certify that the classifier doesn't take into account the sensitive feature "age", but maybe it still relies on some related proxy variables (it can be also a group of variables, not necessarily a single one). Is there a way to prevent this, ideally with provable guarantees?
Our approach is parametric in the choice of the sensitive feature(s) so if one has knowledge about proxies they can use it to guide the choice of which features should be considered sensitive by the analysis. Otherwise no, the approach is not currently able to infer proxies for sensitive features. This is something that has to be done separately in some other way.
Why do you not apply any splits in the input space along the sensitive input features?
If I did, I would have separate partitions for different values of the sensitive features, say, one partition for males and one partition for females. At that point I would have no way of finding bias within a given partition between people of different genders.
Do you leverage the output abstract elements of the forward pass (or their convex hull after grouping by activation pattern) during the backward pass? If not, it seems like the fixed (part of the) activation pattern might be infeasible for large parts of the considered search volume, potentially reducing precision.
I use the output abstract element to figure out whether in some partition there is a unique output, meaning that I can already conclude in the forward analysis that there is no bias. If that is not the case, the partition has to be analyzed by the backward analysis. So it goes into a pool of (partition, activation pattern) pairs, which is then reorganized to group together multiple partitions with a common activation pattern. The backward analysis then runs for each of these common patterns, and the bias check is done for each input partition associated with a common pattern.
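A hedged sketch of the forward check described above, using a plain box (interval) domain as a stand-in for Libra's abstract domains; the toy network and partition are illustrative. Inside one input partition the sensitive feature ranges over its whole domain; if a single class provably wins everywhere, the partition cannot exhibit bias and needs no backward analysis:

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    # Exact interval propagation through an affine layer.
    c, r = (lo + hi) / 2, (hi - lo) / 2
    center, radius = W @ c + b, np.abs(W) @ r
    return center - radius, center + radius

def unique_class_in_partition(layers, lo, hi):
    """layers: list of (W, b) with ReLU between layers and none after the last."""
    for i, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, W, b)
        if i < len(layers) - 1:
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    winners = [k for k in range(len(lo))
               if all(lo[k] > hi[j] for j in range(len(hi)) if j != k)]
    return winners[0] if winners else None   # class index, or None if inconclusive

# Toy partition: feature 0 (non-sensitive) pinned to a small box, feature 1
# (the normalised sensitive attribute) left free over [0, 1].
layers = [(np.array([[1.0, 0.1], [-1.0, 0.2]]), np.zeros(2)),
          (np.array([[2.0, -0.5], [-1.5, 1.0]]), np.array([0.1, -0.1]))]
lo, hi = np.array([0.8, 0.0]), np.array([0.9, 1.0])
print(unique_class_in_partition(layers, lo, hi))  # a class index => no bias in this partition
```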
Currently, the machine learning models we can verify are limited to very small ones. However, modern deep neural networks are becoming increasingly larger to achieve close-to-human performance on complex tasks. For example, OpenAI GPT-3 contains 175 billion parameters. Do you think it makes sense to use verification techniques on these very large models? As an analogy, the human brain contains 100 billion nerve cells, and if we try to "verify" human behaviour at the neuron level, it can be very difficult. Do you think we need alternative formulations or definitions of the verification problem for the increasingly larger networks of the future?
Answered orally during the recorded Q&A session.
When you talk about unseen data or the data on which we want to prove properties, do you think it is reasonable to use a global data model that covers all users, or should the properties be proved for each specific usage?
Answered orally during the recorded Q&A session.
In this talk, we discuss some challenges of neural network verification and present some of our ongoing research to improve Marabou, a solver for verification of deep neural networks. In particular, we present Split-and-Conquer - a highly-parallelizable algorithm, implemented in Marabou. The algorithm has a number of parameters to tailor the search to a given architecture. We discuss the performance benefits of parallelization, some heuristics involved and show results of deploying Split-and-Conquer algorithm in the cloud.
From my understanding, Marabou is compatible with TensorFlow, but the network has to be converted into an nnet object. Is there any way to update the weights of an nnet object given a TensorFlow neural network? This would be highly useful for using Reluplex during the training pipeline. Thank you for the amazing work you put into this library.
Currently there is no such feature available, although it is possible to update the nnet object directly. Marabou would have to start the analysis from scratch since the constraints will change. So one way is to regenerate the nnet network object from the TensorFlow representation. If one knows the updates to the weights, one can directly update the nnet coefficients, but the analysis has to start from scratch regardless.
In his presentation, Guy Katz mentioned that the downsampling (abstraction) network method was built on top of Marabou, but is there any plan to release a new version with this method inside Marabou?
Eventually, we want to integrate as many techniques into Marabou, but for now there is no timeline for integration of the abstraction method.
Is there any plan to make Marabou PyTorch-compatible?
There are no concrete plans for integration, beyond the already present Python API, which should allow use of Marabou within a PyTorch workflow. Internally, Marabou will continue to operate on its own neural network representation, which means that translation steps will be necessary. We welcome use cases from the community and will gladly do what we can to support through the API.
We develop an automatic framework to enable neural network verification on general network structures using linear relaxation based perturbation analysis (LiRPA). Our framework generalizes existing LiRPA algorithms such as CROWN and DeepPoly to operate on general computational graphs. The flexibility, differentiability and ease of use of our framework allow us to obtain state-of-the-art certified defense on fairly complicated networks like DenseNet, ResNeXt, LSTM and Transformer that are not supported by prior works. Our framework also enables loss fusion, a technique that significantly reduces the computational complexity of LiRPA for certified defense. For the first time, we demonstrate LiRPA-based certified defense on Tiny ImageNet and Downscaled ImageNet, to which previous approaches (e.g., CROWN-IBP) cannot scale due to the relatively large number of label classes. Our work yields an open source library for the community to apply LiRPA to areas beyond robustness verification without much LiRPA expertise, e.g., we create a neural network with a provably flat optimization landscape by applying LiRPA to network parameters (model weights). Our open source library is available at https://github.com/KaidiXu/auto_LiRPA.
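For reference, a minimal usage sketch along the lines of the library's README (exact import paths and argument names may vary between auto_LiRPA versions; the toy model and epsilon here are illustrative):

```python
import torch
import torch.nn as nn
from auto_LiRPA import BoundedModule, BoundedTensor, PerturbationLpNorm

# Toy classifier; any nn.Module-based computational graph can be wrapped.
net = nn.Sequential(nn.Flatten(), nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.rand(1, 1, 28, 28)

model = BoundedModule(net, torch.empty_like(x))          # trace the computational graph
ptb = PerturbationLpNorm(norm=float("inf"), eps=0.03)    # L-infinity input perturbation
lb, ub = model.compute_bounds(x=(BoundedTensor(x, ptb),), method="backward")  # CROWN-style bounds
print(lb, ub)
```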
How do you handle layer behaviour that depends on the full batch (e.g. BatchNorm) during training with LiRPA?
During test time, BatchNorm is just a linear transformation which needs no relaxation. During training time, since the mean and standard deviation are also functions of input x, it can be hard to fully bound the correlation within a batch of elements. In our auto_LiRPA library (https://github.com/KaidiXu/auto_LiRPA.git) we compute per-batch mean and std and treat them as constants during a batch when computing the bounds, so BatchNorm can still be treated as a linear transformation and bounded efficiently. This does not affect the correctness during test time, and we find it works pretty well during training. Importantly, just as in regular neural network training, we backprop the gradients through the mean and std (so they are not treated as constants for computing gradients).
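A hedged sketch of the idea (illustrative, not the auto_LiRPA implementation): once the batch statistics are frozen, BatchNorm is a per-channel affine map y = gamma * (x - mu) / sqrt(var + eps) + beta, so interval bounds propagate through it exactly as through any other linear layer:

```python
import torch

def batchnorm_interval(x_lo, x_hi, gamma, beta, mu, var, eps=1e-5):
    w = gamma / torch.sqrt(var + eps)      # per-channel scale (statistics treated as constants)
    c = beta - w * mu                      # per-channel shift
    lo = torch.where(w >= 0, w * x_lo, w * x_hi) + c   # flip bounds where the scale is negative
    hi = torch.where(w >= 0, w * x_hi, w * x_lo) + c
    return lo, hi

# Example: 3 channels with elementwise input bounds.
x_lo, x_hi = torch.tensor([-1., 0., 2.]), torch.tensor([1., 3., 4.])
gamma, beta = torch.tensor([2., -1., 0.5]), torch.zeros(3)
mu, var = torch.tensor([0., 1., 3.]), torch.ones(3)
print(batchnorm_interval(x_lo, x_hi, gamma, beta, mu, var))
```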
How do you compute/refine the neuron-wise bounds during the optimization of \alpha for the work in "Fast and Complete..."? Do you directly optimize the bounds on the specification, or refine the neuron-wise bounds using different optimization targets beforehand?
In our "Fast and Complete..." paper (https://arxiv.org/pdf/2011.13824.pdf) we define the lower bound on the specification (denoted as L) as the objective, and optimize the bound by changing the lower-bound slope \alpha for unstable ReLUs. The objective L is a function of \alpha, as well as of the intermediate layer neuron bounds, which determine the tightness of the relaxation. The intermediate layer neuron bounds themselves are not fixed during relaxation and are also functions of \alpha. Overall, L is just a complicated and non-convex function of \alpha, and we optimize \alpha using gradient descent. This works very well in practice, and we can outperform linear programming (LP) with fixed intermediate layer bounds. Additionally, our optimization process can be done efficiently on GPUs. Combining these benefits allows us to outperform all existing complete verifiers and achieve an over 50x speedup compared to classic Planet-relaxation-based complete verification.
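As a minimal sketch of this idea (not the authors' code): for a one-hidden-layer ReLU network with fixed interval bounds on the pre-activations, a CROWN-style lower bound on c^T f(x) over an L∞ input box is a differentiable function of the per-neuron lower-bound slopes \alpha, and projected gradient ascent tightens it. In the full method the intermediate bounds are themselves functions of \alpha as well; all sizes and weights below are illustrative.

```python
import torch

torch.manual_seed(0)
d_in, d_hid = 4, 8
W1, b1 = torch.randn(d_hid, d_in), torch.randn(d_hid)
W2, b2 = torch.randn(1, d_hid), torch.randn(1)
c = torch.ones(1)                       # specification: lower-bound c^T f(x)
x0, eps_vec = torch.zeros(d_in), torch.full((d_in,), 0.5)   # box [x0 - eps, x0 + eps]

# Interval bounds on pre-activations z = W1 x + b1 (kept fixed in this sketch).
z_c, z_r = W1 @ x0 + b1, W1.abs() @ eps_vec
l, u = z_c - z_r, z_c + z_r

a = c @ W2                              # objective weights on relu(z)
unstable = (l < 0) & (u > 0)
s = torch.where(unstable, u / (u - l), torch.zeros_like(u))  # chord slope (upper relaxation)

alpha = torch.full((d_hid,), 0.5, requires_grad=True)        # one slope per unstable ReLU
opt = torch.optim.Adam([alpha], lr=0.1)

def lower_bound(alpha):
    # Per-neuron linear lower bound on a_j * relu(z_j):
    #   stable active: a_j * z_j;  stable inactive: 0;
    #   unstable, a_j >= 0: a_j * alpha_j * z_j (lower relaxation);
    #   unstable, a_j <  0: a_j * s_j * (z_j - l_j) (upper relaxation).
    d = a * (l >= 0).float()
    e = torch.zeros_like(a)
    pos, neg = unstable & (a >= 0), unstable & (a < 0)
    d = torch.where(pos, a * alpha, d)
    d = torch.where(neg, a * s, d)
    e = torch.where(neg, -a * s * l, e)
    # Minimise the resulting linear function over the input box in closed form.
    w = d @ W1
    return w @ x0 - w.abs() @ eps_vec + d @ b1 + e.sum() + c @ b2

for _ in range(50):
    opt.zero_grad()
    (-lower_bound(alpha)).backward()    # gradient ascent on the bound
    opt.step()
    with torch.no_grad():
        alpha.clamp_(0.0, 1.0)          # keep the relaxation sound

print("optimised lower bound:", lower_bound(alpha).item())
```

Because the bound is sound for any \alpha in [0, 1], every step of the optimization yields a valid lower bound; only its tightness changes.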
Session moderated by Mélanie Ducoffe (Airbus)