Pre-recorded talks:

Empirical Robustness

Note: Jan 12 recorded Q&A session on these presentations can be found at the bottom of this page.

Adam Dziedzic

Review on adversarial ML

Questions & Answers

Concerning the slide with the smooth activation functions providing better security: how reproducible is this on other datasets?

The dataset in slide 50 is CIFAR-10. This work https://arxiv.org/pdf/2006.14536.pdf showed an improvement for the ImageNet dataset trained on the ResNet-50 architecture (Fig. 3).
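For readers who want to probe reproducibility on another dataset, here is a minimal sketch (assuming PyTorch/torchvision; SiLU is one of the smooth activations studied in the linked paper) that swaps every ReLU in a ResNet-50 for a smooth activation before adversarial training:

```python
import torch.nn as nn
import torchvision.models as models

def replace_relu_with_silu(module: nn.Module) -> None:
    """Recursively replace every ReLU with the smooth SiLU activation."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.SiLU(inplace=True))
        else:
            replace_relu_with_silu(child)

# Train this model adversarially on a dataset of your choice and compare
# against the ReLU baseline to check how well the effect reproduces.
model = models.resnet50(weights=None)  # torchvision >= 0.13 API
replace_relu_with_silu(model)
```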


When you talk about the trade-off between labelled and unlabeled data in adversarial training and find the best ratio to be 3:7, does this hold across datasets?

The ratio of labeled-to-unlabeled data per batch should be tuned per dataset and model. As a rule of thumb, we usually have more unlabeled data and should give proportionally more weight to this data.
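Purely as an illustration of "tune the per-batch ratio and the weight on the unlabeled data" (the loss names, the pseudo-labeling step, and the 7/3 weight below are assumptions for this sketch, not the recipe from any specific paper):

```python
import torch

def mixed_batch_loss(model, x_lab, y_lab, x_unlab,
                     adversarial_loss, pseudo_label, w_unlab=7.0 / 3.0):
    """Combine labeled and unlabeled adversarial losses for one mixed batch.

    adversarial_loss(model, x, y): any adversarial surrogate loss (assumed).
    pseudo_label(model, x): produces pseudo-labels for unlabeled data (assumed).
    w_unlab: relative weight on the unlabeled part; tune per dataset and model.
    """
    loss_lab = adversarial_loss(model, x_lab, y_lab)
    with torch.no_grad():
        y_pseudo = pseudo_label(model, x_unlab)
    loss_unlab = adversarial_loss(model, x_unlab, y_pseudo)
    return loss_lab + w_unlab * loss_unlab
```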


Can you provide more details on 3D printing for adversarial attacks? You mentioned the EOT technique, but it seems unclear how an attack on an image is used to modify the 3D model to be printed.

By introducing EOT, a general-purpose algorithm for creating robust adversarial examples, and by modeling 3D rendering and printing as a transformation within the EOT framework, the authors succeeded in fabricating three-dimensional adversarial objects. The key point is to differentiate through the rendering process. Please check Section 2.2 in http://proceedings.mlr.press/v80/athalye18b/athalye18b.pdf
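A hedged, simplified sketch of the EOT idea in PyTorch: the transformation distribution is reduced to a user-supplied differentiable 2D transform (a real 3D-printing pipeline would plug a differentiable renderer in at that point), and `sample_transform` is an assumed helper, not from the paper:

```python
import torch

def eot_attack(model, x, y_target, sample_transform,
               steps=100, lr=0.01, eps=0.1, n_samples=10):
    """Expectation Over Transformation (simplified 2D sketch).

    Optimizes a perturbation so the input is classified as y_target in
    expectation over a distribution of differentiable transformations
    t ~ T, i.e. the attack survives rendering/printing-like transforms.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for _ in range(n_samples):
            t = sample_transform()            # differentiable transform t(.)
            logits = model(t(torch.clamp(x + delta, 0, 1)))
            # Minimizing cross-entropy to the target class over sampled
            # transforms approximates the expectation in EOT.
            loss = loss + torch.nn.functional.cross_entropy(logits, y_target)
        (loss / n_samples).backward()
        opt.step()
        opt.zero_grad()
        with torch.no_grad():
            delta.clamp_(-eps, eps)           # keep the perturbation bounded
    return (x + delta).detach()
```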


(from [GA21] 'The closer to boundary, the larger weight'). As you mentioned, some methods for adversarial defense use importance sampling on adversarial data. Are you aware of any study that measures the impact of these methods on the uncertainty of the model on rare cases?

I am not aware of any study that measures the impact of importance sampling on the uncertainty of the model on rare cases. It sounds like a good idea that is worth trying.


(from [GQ20]). Given those two claims, "larger models are more robust" but "robust generalization [with larger models] requires more data", what is your feeling towards the best practice data scientists should adopt for safety-critical ML systems?

I would recommend: (1) training high-capacity models from scratch, (2) curating the data before training, and (3) using state-of-the-art defense techniques.


(from "Channel-wise activation magnitude"). The correlation of the magnitude of some channels and adversarial attacks have been observed for PGD attacks. Has it been confirmed for other attacks as well ? Could it be used as an efficient safeguard to detect adversarial attacks ?

In this work [IA21], robustness (accuracy on adversarial examples) is evaluated under the following attacks: FGSM [GS15], PGD-20 [MM18], and CW [https://arxiv.org/abs/1608.04644] optimized by PGD (Section 5.1 in [IA21]). In my opinion, the method is yet to be tested as a robust defense. However, it gives us insight into the internal mechanisms of neural networks that are activated during attacks.


Is the robustness metric always computed with PGD? Is there any work using formal deterministic methods or the CLEVER score to measure robustness to adversarial attacks? It seems to me that the robustness score should not be biased by the choice of heuristics used to create adversarial attacks.

No, PGD is a non-adaptive white-box attack and is not strong enough to assess robustness. Adaptive white-box attacks, in the way described in https://arxiv.org/pdf/2002.08347.pdf, are the state-of-the-art methods.


(from [YG19] high accuracy on high frequency) On what data (low/high frequency) was the model trained?

The results in my talk were presented for the ImageNet dataset. The authors of the paper also used the CIFAR-10 dataset as well as datasets with common corruptions, e.g. ImageNet-C https://arxiv.org/abs/1807.01697 (images with fog, snow, etc.).


Regarding the open questions at the end of the talk: many of the listed open questions are problems that people have worked on for the last 3-5 years (e.g., why adversarial examples exist). In your talk it sounded a bit like these problems are still as "open" as before. From reviewing all this literature, in your opinion, what are the interesting/promising directions to continue working on to address these problems? I would be particularly interested in the problem of why adversarial examples exist and, in terms of robustness, whether adversarial-training-based defenses will solve robustness or whether we need to rethink our strategy.

There are open problems and progress has been made; however, we do not have fully satisfying answers yet. The community actually lacks consensus on this point: Szegedy et al. [BS14] suggest that neural networks have blind spots, Goodfellow et al. [GS15] posit that linearity is the main culprit and that quantized inputs can break up linearity, and Xu et al. [https://arxiv.org/abs/1704.01155] claim that quantization makes the adversarial search space smaller. Personally, I like the visualization presented by Roth et al. [RK19]: it intuitively shows that adversarial examples are caused by small-measure regions of an adversarial class "jutting" into a correct decision region. We can imagine that such a situation might occur in a high-dimensional space. This model is commonly accepted because the inputs remain correctly classified when perturbed with small random noise (e.g., uniform or Gaussian), yet we can also easily find an attack vector that moves the inputs to decision regions where they become misclassified. Up to this point, we have focused on models as the main culprit for why adversarial examples exist. Mądry et al. shifted gears and analyzed adversarial examples from the dataset perspective: they defined the non-robust features that neural networks exploit to achieve high-accuracy classification. In my opinion, adversarial training is currently the best empirical defense, but the robustness it provides is far from the desired level (for instance, up to 60% while the accuracy on clean data is already about 99%). I suspect the promising directions are to design new model architectures and to control the features that neural networks utilize during training.

Ajaya Adhikari


Detection of military assets on the ground can be performed by applying deep learning-based object detectors on drone surveillance footage. The traditional way of hiding military assets from sight is camouflage, for example by using camouflage nets. However, large assets like planes or vessels are difficult to conceal by means of traditional camouflage nets. An alternative type of camouflage is the direct misleading of automatic object detectors. Recently, it has been observed that small adversarial changes applied to images of the object can produce erroneous output by deep learning based detectors. In particular, adversarial attacks have been successfully demonstrated to prohibit person detections in images, requiring a patch with a specific pattern held up in front of the person, thereby essentially camouflaging the person for the detector. Research into this type of patch attacks is still limited and several questions related to the optimal patch configuration remain open. This work makes two contributions. First, we apply patch-based adversarial attacks for the use case of unmanned aerial surveillance, where the patch is laid on top of large military assets, camouflaging them from automatic detectors running over the imagery. The patch can prevent automatic detection of the whole object while only covering a small part of it. Second, we perform several experiments with different patch configurations, varying their size, position, number and saliency. Our results show that adversarial patch attacks form a realistic alternative to traditional camouflage activities, and should therefore be considered in the automated analysis of aerial surveillance imagery.

Questions & Answers

It is really interesting that colourful patches seem to be more successful. Do you have any intuition if this is related to the encoding or some parts of the model?

Answered orally during the Q&A session.


Is there any data on how well the applied object recognition methods function against more traditional camouflage (targeted against humans)?

Answered orally during the Q&A session.

Pin-Yu Chen

Holistic Adversarial Robustness of Machine Learning Models


In this talk, I will provide a holistic view of adversarial robustness for modern machine learning models, especially for neural networks. The talk will cover a comprehensive overview on attacks, defenses, robustness certification and evaluation methods, and novel applications. I will also share my thoughts in terms of the roadmap toward holistic adversarial robustness, penetration testing, and model hardening. The talk will be concluded with current research insights, open challenges, and online resources. Please visit www.pinyuchen.com for more details.

Questions & Answers

You mentioned that the ideal case of fixing a non-robust model is to add some kind of patches to it. Could you elaborate in more detail on what you mean by patches here? Would techniques like adversarial training qualify as patches? I'm asking because it seems to me that by requiring adversarial robustness (say, with respect to small Lp-bounded perturbations), we change the classifier quite dramatically, in particular its standard accuracy, privacy properties, local smoothness, etc. In other words, by changing the objective from standard loss minimization to robust loss minimization, we are essentially looking for a completely different classifier, which wouldn't qualify as a patch from this point of view.

By patching I mean the principle of strengthening a trained classifier for improved adversarial robustness, instead of training the model again from scratch. You are right that if one uses techniques like adversarial training on a standard trained model, their loss landscapes are too different to be meaningful. However, in our recent AAAI paper <https://arxiv.org/abs/2012.11769>, we developed an efficient and attack-agnostic training method to achieve this goal and make patching more practical. Similarly, for mitigating backdoor effects, we used a mode connectivity-based principle to "sanitize" the model with a small amount of clean data. See our ICLR'20 paper <https://openreview.net/forum?id=SJgwzCEKwH>.


One of the points of your roadmap is certification. What would be your opinion on what are the right perturbation sets to certify in the first place, and moreover how to determine their magnitude which is of practical interest? E.g. even if we are interested in being robust to small Linfinity perturbations, it's usually unclear what is the right Linfinity radius that we really want to be robust in. Of course, this aspect is very application-dependent, but I'm curious to hear your thoughts on this, maybe based on your experience with practical deployment of robust models.

This is a great question. I'd use the analogy of collision tests for car models to motivate the answer. In car model development, we care a lot about safety, and there are many safety reports, collision tests, and "guarantees" being reported. If we look closely, those numbers come with pre-assumptions (the car driving at a certain speed, environmental conditions, etc.). So I'd say that instead of setting one robustness threshold and engaging in endless debate over this value, we should do more to educate AI model developers and users and prepare them with the mindset of how to use the model in a robust manner, just like a driver would not expect any collision guarantee at a speed of 300 km/h but would expect the car to be "safe" at 60 km/h. Similarly, an AI model cannot be robust to unbounded perturbations. We have to standardize the notion of a "normal mode" for model use.


How do you compute your CLEVER score? Is it based on sampling around the data point or is it computed formally? Also, is this score computed for each data sample or are you able to derive a global score for the model?

Similar to randomized smoothing, we compute the gradient norms of data samples nearby a given data input, and use those sampled values to evaluate the local cross Lipschitz constant associated with our derived lower bound on minimal perturbation. So the CLEVER score is a local robustness metric estimating the distance to the closest decision boundary. Extending to global robustness, one can take data samples from different classes, compute their CLEVER scores, and use the statistics to indicate global robustness. A generic CLEVER score computation wrapper is provided by ART <https://adversarial-robustness-toolbox.readthedocs.io/en/stable/modules/metrics.html#clever>
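A heavily simplified sketch of the sampling idea described above (this is not the full CLEVER estimator, which fits an extreme-value distribution to batch-wise maxima of the gradient norms; ART's wrapper linked above does the complete computation):

```python
import torch

def clever_like_score(model, x, norm=2, radius=0.5, n_samples=256):
    """Rough local robustness score: margin / estimated local Lipschitz constant.

    Samples points near a single input x, measures the gradient norm of the
    margin g(x) = f_top(x) - f_runner_up(x), and divides the clean margin by
    the largest observed gradient norm. The real CLEVER estimator additionally
    fits a reverse-Weibull distribution to these sampled maxima.
    """
    x = x.unsqueeze(0)
    with torch.no_grad():
        top2 = model(x).topk(2, dim=1)
        true_cls, runner_up = top2.indices[0, 0], top2.indices[0, 1]
        clean_margin = (top2.values[0, 0] - top2.values[0, 1]).item()
    lipschitz_est = 1e-12
    for _ in range(n_samples):
        # Sample a nearby point and measure the gradient norm of the margin.
        xp = (x + radius * torch.randn_like(x)).requires_grad_(True)
        out = model(xp)
        margin = out[0, true_cls] - out[0, runner_up]
        grad, = torch.autograd.grad(margin, xp)
        lipschitz_est = max(lipschitz_est, grad.norm(p=norm).item())
    return clean_margin / lipschitz_est
```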


AI is worth using if it performs better than other alternatives. Do you think that reducing accuracy to increase robustness might make the AI not worth using anymore?

From the standpoint of practical deployment, if one cannot guarantee robust generalization from in-house model development and testing to real-life problem solving in the wild, I am concerned that continuous improvement in accuracy on a pre-defined test set may give us a false sense of technological breakthrough, especially when one needs to trade robustness for improved accuracy. That being said, there are low-risk, high error-tolerance use cases where AI models without rigorous guarantees and validation are still valuable, such as automated question answering, recommendation systems, speech processing, and so on. I'd say that for high-stakes decision-making tasks we need to be more cautious about the AI model, but there are certainly many other scenarios where introducing AI models is beneficial.


Can you tell us more about the AI incident database and its future goals and derived applications?

There is a very nice article on it: https://bdtechtalks.com/2021/01/14/ai-incident-database/


You talked about robustness in the context of larger systems and also mentioned model patching on the slides. I am wondering how important the ability to patch an existing model is in practice/in industry and whether we as a research community are somehow missing this application by mostly providing adversarial training based defenses (which usually need to be trained from scratch)?

In some cases clients want the service provider to find and train the best model on their behalf (e.g. AutoAI); then it's reasonable to invoke adversarial training and train from scratch. However, we also encounter scenarios where clients already have a preferred, trained model but want to understand its properties better, such as fairness, explainability, robustness, and so on. While inspecting and strengthening their model, the client may not want to change the model drastically, so as to retain its utility (though, granted, training from scratch with a different approach could be a solution if the utility would not change much). I do feel this "on-demand" model hardening is inspired by practical needs and has not received much attention in the research community. In our recent AAAI'21 paper <https://arxiv.org/abs/2012.11769>, we proposed a new on-demand training method that can bridge this gap where standard adversarial training methods fail to do so.

Kathrin Grosse

Why is ML security so hard?


In many areas of AML, we observe ongoing arms-races. In my talk, I pick three (of the many) aspects that make security in machine learning such a hard task. One highlights the importance of the user, who needs to be aware of possible threats, as we are currently not able to simply solve them. The second one highlights the relationships between different attacks, and shows that even in cases where we might be able to increase robustness towards one attack, this can still result in higher vulnerability towards another attack. Finally, I show an example where the attack itself seems rather undefined, and it is not possible to clearly and soundly define the problem.

Questions & Answers

Do you think that the poisoning technique could also be encoded as formal robustness (not only empirical), where the "external" pattern could be seen as the input perturbation?

Given the dynamics and stochasticity of DNN training, this would be really difficult. I wouldn't exclude the possibility that we find a formalization later on in research, though.


How do you imagine that it is possible to change the initialization of network weights?

You could either think about the attacker having access to the machine of the victim, or a drive-by download, for example.


Can you detail more how adversarial initialization would be a threat to safety? Maybe with some concrete use cases?

One example could be two companies competing on an ML-based product. Both companies, when collecting data, would not know the correct performance on that data. One competitor could then trick the other into believing that more data is needed, and gain an advantage.


When you compare a short length scale and a long length scale, since you do a relative shift (+ or - l/2), you don't compare exactly the same thing; maybe with a larger range for the short length scale you would be able to get the same localization effect, and hence a good idea of the length scale used for the GP. What do you think of this comment?

An interesting idea! However, we are looking for the compared length scale with minimal distance to the original GP. In the case of a short length scale, there is a plateau around the true length scale, making it hard to detect.


Did you evaluate the robustness of the introduced W-measure against adaptive attacks that generate slight perturbations of the patterns used during the poisoning attack to increase the entropy?

We did not test this particular case, no. However, I'd expect that an increase in entropy corresponds to a less stable decision surface, e.g. the backdoor trigger would not work as reliably.


Do you know or expect a similar trade-off in hyper-parameters in DNNs as for Gaussian Processes? Like could there be hyper-parameters that make membership inference harder while being easier to re-engineer or vice-versa?

Although the settings we studied are indeed very specific to GP, there are papers pointing in that direction for DNN as well. They outline that adversarial training increases privacy risks, for example (https://ieeexplore.ieee.org/abstract/document/8844607?casa_token=3D_OmgwnRwwAAAAA:D83SXE0NQs1GdyOV6p2LQ_CjEj2wRFGKjsbYikTSNDr_eOKkdC0vqnNjZ_mxMiuEnlYHekia)


This approach assumes that the pattern for detection is known, right? And does it also work for less perceivable patterns (adversarial-example-like)?

We did test larger and smaller triggers (where the smallest trigger was 4 pixels) and saw no differences there. However, we did not test other, less visible perturbations. As the measure relies on decision surface stability, I would not expect the perceptibility of the trigger to have a huge influence on it.

Nils Jansen

Towards Dependable and Robust Planning: Learning and Verification


The subject of this talk is planning problems where agents operate inside environments that are subject to uncertainties and partial observability. Such problems are naturally modelled by partially observable MDPs (POMDPs). The goal is to compute a policy for an agent that is guaranteed to satisfy certain safety or performance specifications. We show how to leverage machine learning, in particular recurrent neural networks, to efficiently synthesize provably correct strategies. We present approaches to render such policies verifiable and understandable to humans. We also discuss settings where an agent operates under information limitation due to imperfect knowledge about the accuracy of its sensors. In the underlying models, probabilities are not exactly known but are part of so-called uncertainty sets. We showcase the applicability of our methods by means of an aircraft collision avoidance system and robust spacecraft navigation.

Questions & Answers

Can you provide more details on your 2020 IJCAI paper? In particular, why use FSCs to explain RNNs? Could you have used other explanation heuristics? How do you force the model to have understandable explanations?

The idea was that FSCs provide a good formalism to 'trade off' the performance of the policy (originally given by an RNN) against the complexity of the controller. Through the integrated model checking, we have the means to assess the performance of the extracted controller and can likewise control the number of states in the FSC.


The examples shown in this talk are simple grid environments. However, in practice the environment model can be much more complicated, e.g., with an infinite number of (continuous) states or actions. Is it possible to handle these cases?

I used these examples to explain the concepts; in our numerical evaluation, we have a large range of practical examples, for instance within aircraft collision avoidance and spacecraft motion planning. Moreover, there are multiple industrial partners we currently work with, for instance in the area of predictive maintenance.

Mykel Kochenderfer

Validation using sampling-based approaches and formal methods


Building reliable decision making systems for safety-critical applications requires making them robust to various sources of uncertainty. Recent advances have allowed for the systematic optimization of systems like the ACAS X collision avoidance system. In such applications, the optimizer can reason about low-probability events and arrive at decisions in a way that can be superior to that of human experts. However, such systems need to be rigorously validated. We will discuss both sampling-based methods, which tend to scale to more complex domains but do not provide formal guarantees, and formal methods, which do provide guarantees but are limited in their scalability. The talk will conclude with some new results on combining these categories of approaches to find failures in a neural network system designed for steering on a taxiway.

Questions & Answers

How long does it take to run those simulations, and how adaptive are they? E.g., having a similar task (say, for navigating robots in a car factory), could I simply use your procedure? How much time would I need to set this up?

The Carla simulator and the X-Plane simulator run in real time. The idea of adaptive stress testing can be used in conjunction with many different simulators. We have open-sourced both Julia and Python implementations.


For the first two methods you present (adaptive stress testing and formal proofs on NNs), you rely on sampling over a state space. For the first method, do you have any suggestions on how to choose the correct state space that will contain your most critical failure? For the second, the state space is set by your initial model/table. Is your method able to scale to a high-dimensional state space, given that you need to discretise it? Any ideas for dealing with a large-dimensional state space?

Adaptive stress testing is not sensitive to the state space representation; it is completely black box. Some of our reachability work involves discretizing the state space into cells. This is generally not scalable beyond a half-dozen dimensions. Of course, we can do dimensionality reduction, but when we do that, we tend to lose our formal guarantees. However, we have other work, covered in my IJCAI keynote, that does not require discretization and might be more scalable.


Can you provide more details on the TaxiNet use case (input dimension before and after downsampling, how large is the network)?

It starts with a 200 x 360 RGB image that is then downsampled to 8 x 16 grayscale. There are 3 hidden layers with 32 ReLUs in total. More details in Sec. IV.A here: https://arxiv.org/pdf/2003.02381.pdf
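For a mental picture of the scale involved, here is a hedged reconstruction of that network in PyTorch; the 16/8/8 split of the 32 ReLUs and the two regression outputs are assumptions made for this sketch, and the exact architecture is in Sec. IV.A of the linked paper:

```python
import torch.nn as nn

# Illustrative only: the 8 x 16 grayscale input and "3 hidden layers with
# 32 ReLUs in total" are quoted from the answer above; the 16/8/8 split and
# the number of outputs are assumptions for this sketch.
class TaxiNetSketch(nn.Module):
    def __init__(self, n_outputs=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),               # 8 x 16 grayscale -> 128 features
            nn.Linear(128, 16), nn.ReLU(),
            nn.Linear(16, 8), nn.ReLU(),
            nn.Linear(8, 8), nn.ReLU(),
            nn.Linear(8, n_outputs),    # e.g. position/heading estimates
        )

    def forward(self, x):
        return self.net(x)
```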


When verifying the neural network controller, you used reachability analysis to predict the possible actions taken by the neural network and then move the environment to all possible next states. When the system dynamics are more complicated (e.g., involving a large number of actions or continuous actions), or when the trajectory is very long, the number of states visited can be quite large. Do you have any thoughts on how to scale reachability analysis to that setting?

Great question. You might be interested in some of our work that looks at symbolic reachability, which is briefly covered in my IJCAI keynote: https://youtu.be/9b4jryW1JtA

Bo Li

End-to-end robustness for sensing-reasoning pipeline in adversarial environment



Questions & Answers

What kind of input perturbation are you using? Is the set of sensors represented by the output probabilities of a single neural network or by an ensemble of neural networks? How did you design the reasoning model (is it based on an ontology)?

The perturbation can be general, i.e., L_p-bounded or unrestricted; take images as an example. Basically, the perturbations are vectors drawn from the same domain as the true data that try to mislead the prediction of machine learning models given the perturbed data.

It is a set of models (neural networks) for different tasks. For instance, taking road sign recognition as an example, some sensors are in charge of identifying the shapes, some the contents, etc. The design of the reasoning model is based on commonsense knowledge represented as first-order logic rules, embedded using graphical models such as Markov logic networks or Bayesian networks.
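As a toy illustration of the idea (this is not the Markov-logic-network inference from the paper; the rule, weights, and sensor names are made up for the example), one can score how strongly the sensor outputs violate a commonsense rule such as "a stop sign must be octagonal":

```python
# Toy sketch of a sensing-reasoning consistency check (assumed names and
# rule weights; the real pipeline performs inference in a graphical model
# built from first-order logic rules).
def rule_violation_score(sensor_probs, rules):
    """sensor_probs: marginal probabilities from the sensing models,
    e.g. {"content_stop": 0.95, "shape_octagon": 0.10}.
    rules: list of (antecedent, consequent, weight) soft implications."""
    score = 0.0
    for antecedent, consequent, weight in rules:
        # Soft penalty when the antecedent is likely but the consequent is not.
        violation = max(0.0, sensor_probs[antecedent] - sensor_probs[consequent])
        score += weight * violation
    return score

rules = [("content_stop", "shape_octagon", 2.0)]
# A perturbed input that makes the content sensor say "stop" while the shape
# sensor still sees a circle gets a high violation score and can be flagged.
print(rule_violation_score({"content_stop": 0.95, "shape_octagon": 0.10}, rules))
```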


It seems that the more structure you put in your knowledge (hierarchies, relationships, etc.) the more useful it will be to eliminate adversarial examples (that will not match the structure). Is it correct?

Yes. If the structure of the knowledge is clear, the first-order logic rules will be clear, and therefore the graphical model can be constructed and we can perform inference based on it.


You mentioned unrobust assignments and the need for more robust knowledge. How do you discover consistent (e.g. truck and not-animal) but incorrect predictions? Are you using some regularization encouraging diversity between the different underlying networks (e.g. similar to DVERGE https://arxiv.org/pdf/2009.14720.pdf)?

The sensing-reasoning pipeline is different from ensemble models such as DVERGE, which leverages statistical features to construct the ensemble. The knowledge-integrated ML pipeline aims to use commonsense knowledge to help identify conflicts and improve the robustness of ML predictions. Indeed, if there were an attack that could attack every sensor such that all the rules are satisfied, the pipeline would be attacked. However, we believe that is very hard, and if it can be done, maybe that instance is already a "true" sample instead of an adversarial instance.

David Stutz

Confidence-Calibrated Adversarial Training and Bit Error Robustness of DNNs


In this talk, I want to highlight two recent research directions regarding robustness. First, with our confidence-calibrated adversarial training (CCAT), we address the problem of robustness against various types of adversarial examples, even those unseen during training. CCAT biases the deep neural network (DNN) towards low-confidence predictions on adversarial examples. By allowing those low-confidence adversarial examples to be rejected at test time, robustness generalizes beyond the threat model employed during training. Second, more recently, we tackle the problem of bit error robustness in (quantized) DNN weights. In the context of DNN accelerators, robustness to such bit errors allows the operating voltage to be reduced, thereby improving energy efficiency significantly. This way, we enable energy savings of up to 30% with only a small increase in test error, even for 4-bit quantized DNNs.

Questions & Answers

You talked peripherally about saving energy with your approach (using four-bit, robust models at lower voltage). Could you give an intuition for how much energy you could save this way?

As the bit error rate increases exponentially when reducing voltage (and energy scales quadratically with voltage), there is already a big gain in efficiency for comparably small voltage reductions. Based on our data corresponding to the DANTE chip (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8675205), enduring a bit error rate of 0.1% can reduce energy consumption by roughly 20%, independent of the quantization used. Going up to a 1% bit error rate would result in roughly 30% energy savings. Any energy savings obtained by lowering the number of bits used per weight value are in addition to these low-voltage savings; energy (at least in terms of memory access, but to some extent also computation) scales more or less linearly with the number of bits used.
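The quadratic scaling can be made concrete with a two-line back-of-the-envelope calculation (the 10% voltage reduction below is an arbitrary illustrative figure; the actual voltage-to-bit-error-rate mapping is chip-specific, see the DANTE reference above):

```python
# Dynamic energy scales roughly with V^2, so a modest voltage reduction
# already gives a sizeable energy saving (at the cost of a higher bit error rate).
v_ratio = 0.90                   # e.g. operate at 90% of nominal voltage
energy_ratio = v_ratio ** 2      # ~0.81 -> roughly 19% energy saved
print(f"energy saved: {1 - energy_ratio:.0%}")
```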


In your opinion, is there a possibility that an adversary uses voltage differences to attack neural networks?

Yes, this is definitely possible and has, to the best of my knowledge, already been tried to circumvent different security measures - however, I am not entirely up-to-date on the literature in this regard. In our case, a voltage attack is similar to intentional voltage scaling. This means that we are able to provide security against limited voltage drops (i.e., any voltage drops resulting in a bit error rate less than X%) for a chip operated at nominal voltage. When operating the chip at lower voltage, this "security margin" is reduced. Besides, in practice, there is not only voltage to be considered but also frequency, both of which could be manipulated by an attacker. However, in our paper so far, we focus purely on voltage at fixed frequency.


In the second step of the transition to low confidence, can you confirm that the variable K is the number of classes? Could you provide more explanation of your representation of the target distribution?

Yes, good catch, K is the number of classes. The representation of the target distribution is essentially a categorical distribution over the K classes. At the training example, this is a one-hot encoding of the target class. It then transitions to a uniform distribution, i.e., (1/K, ..., 1/K), depending on the perturbation size |\delta|.
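A hedged sketch of such a transition in PyTorch; the power-law schedule with exponent rho is an assumption made for this illustration and may differ from the exact schedule used in the paper:

```python
import torch

def ccat_like_target(y, delta, eps, num_classes, rho=10.0):
    """Target distribution that transitions from one-hot to uniform.

    lam = (1 - min(1, ||delta||_inf / eps)) ** rho interpolates between the
    one-hot label (lam = 1 at the clean example) and the uniform distribution
    1/K (lam = 0 for perturbations at the boundary of the eps-ball).
    The power-law schedule is an assumption for this sketch.
    """
    batch = y.shape[0]
    one_hot = torch.zeros(batch, num_classes).scatter_(1, y.view(-1, 1), 1.0)
    uniform = torch.full((batch, num_classes), 1.0 / num_classes)
    norm = delta.view(batch, -1).abs().max(dim=1).values  # L_inf norm per sample
    lam = (1.0 - torch.clamp(norm / eps, max=1.0)) ** rho
    lam = lam.view(-1, 1)
    return lam * one_hot + (1.0 - lam) * uniform
```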


How high is the availability of the network against the different adversaries? Have you evaluated robustness against adaptive attacks aiming to make your model unavailable?

Can you specify how exactly you define availability in the context of adversarial examples? We consider correctly classified CLEAN test examples as positives, so I assume you are interested in the false and true positive rates (FPR and TPR). On SVHN, for example, the FPR (adversarial examples that are not rejected and are mis-classified) can be up to 49% against large L_inf perturbations. In this case we get a robust test error (thresholded) of roughly 52%, which subsumes the FPR. No, we did not explicitly consider this case. In fact, my thoughts are as follows: the easiest way to make the service unavailable is to craft "easy" adversarial examples that might not be mis-classified but are rejected. In this case, the model could give you the correct answer, but it doesn't, based on low confidence. I am convinced that this is the behavior we actually want in many applications. I do not want a system to make a decision because it "thinks" it "could" still be right; I want the system to refuse to make a decision as soon as it has evidence that the input might be manipulated.


Have you experimented with using (significantly) larger perturbations during training to induce larger regions of high confidence, while maintaining the generalization to low confidence?

No, we did not explicitly try training on larger perturbations. Part of the reason is obviously that we want to compare to related work, and the epsilon-balls used are pretty standard. However, it is also argued that significantly larger epsilon-balls cannot ensure label constancy; there is an ongoing discussion in the literature, and also within workshops etc., that significantly larger epsilon-balls may contain valid examples of a different class. For the data at hand (training or test examples), larger epsilon-balls might still be fine (see, e.g., https://davidstutz.de/what-lp-adversarial-examples-make-sense-on-common-vision-datasets/), but it is unclear whether this reflects the true underlying data distribution. Apart from that, we tried various transitions, including an exponential transition where the model does not resort to a completely uniform distribution but just lowers confidence while still being forced to predict the correct class. This allows interpolating between the behavior of our approach and regular adversarial training. In general, however, we found that robustness benefits from resorting to a uniform distribution as quickly as possible and avoiding larger regions of high confidence. Nevertheless, it is important to note that within the data distribution, the model usually predicts with high confidence (e.g., when interpolating between two test examples).

Huan Zhang

Robust reinforcement learning against adversarial perturbations on state observations


A reinforcement learning (RL) agent observes its states through observations, which may contain natural measurement errors or adversarial noise and can mislead the agent into taking suboptimal actions. Several works have shown this vulnerability via adversarial attacks, but existing approaches to improving robustness under adversarial perturbations on state observations have limited success and lack theoretical principles. We propose the state-adversarial Markov decision process (SA-MDP) to study the fundamental properties of this problem, and develop a theoretically principled robust policy regularization which can be applied to a large family of deep RL algorithms, including proximal policy optimization (PPO), deep deterministic policy gradient (DDPG) and deep Q networks (DQN). Additionally, we show that under the SA-MDP framework, we can solve for an optimal adversary which is significantly stronger than existing adversarial attacks, and we can alternately train the agent with a learned optimal adversary to improve the robustness of RL agents under strong attacks. We significantly improve the robustness of PPO, DDPG and DQN agents under a suite of strong white-box adversarial attacks, including new attacks of our own. Additionally, we find that a robust policy noticeably improves DRL performance even without an adversary in a number of environments.

Questions & Answers

Is it possible to run adaptive attacks on your robust models?

Adaptive attacks are mostly used for adversarial defenses based on certain heuristics. Our approach is based on a maximin robust regularizer, which works in a manner similar to adversarial training. Thus there is no obvious adaptive attack in this setting. Additionally, in our recent paper (https://arxiv.org/pdf/2101.08452.pdf) we propose the optimal adversarial attack, which is significantly stronger than other attacks and can theoretically find the worst case adversary. In that setting, our robustly trained agents still remain robust under this attack, showing that they are truly robust.
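To make the flavor of such a maximin regularizer concrete, here is a hedged sketch: it penalizes the worst-case change of the policy within an epsilon-ball of state perturbations, approximated with a few gradient-ascent steps on a KL term. The names, step counts, and the detached clean policy are illustrative simplifications, not the exact SA-PPO implementation:

```python
import torch
import torch.nn.functional as F

def state_adversarial_reg(policy_logits_fn, states, eps, steps=5, lr=0.01):
    """Approximate max over ||s' - s||_inf <= eps of KL(pi(.|s) || pi(.|s')).

    policy_logits_fn(s) returns action logits; the inner maximization is done
    by signed gradient ascent on the KL divergence, and the resulting value is
    returned as a regularizer to be added to the RL training loss.
    """
    with torch.no_grad():
        clean_log_probs = F.log_softmax(policy_logits_fn(states), dim=-1)
    delta = torch.zeros_like(states, requires_grad=True)
    for _ in range(steps):
        pert_log_probs = F.log_softmax(policy_logits_fn(states + delta), dim=-1)
        kl = F.kl_div(pert_log_probs, clean_log_probs,
                      log_target=True, reduction="batchmean")
        grad, = torch.autograd.grad(kl, delta)
        with torch.no_grad():
            delta += lr * grad.sign()     # ascend on the KL divergence
            delta.clamp_(-eps, eps)       # stay inside the eps-ball
    pert_log_probs = F.log_softmax(policy_logits_fn(states + delta.detach()), dim=-1)
    return F.kl_div(pert_log_probs, clean_log_probs,
                    log_target=True, reduction="batchmean")
```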


Could you give some details on the training time of your models? What is the price to pay to learn your most complex policy compared to other methods you compare with?

The training times for state-adversarial regularized agents (SA-PPO, SA-DDPG and SA-DQN) are 3x to 5x slower than training a vanilla agent. When alternating training with a learned adversary (ATLA) is used, it requires more iterations to converge due to the presence of an adversary, and the training time is 5x to 20x slower than training a vanilla agent. However, even in the classification setting, PGD-based adversarial training needs an order of magnitude more time to train (PGD steps are slow, and more epochs are needed). There is always a cost to pay to make an ML system robust, and our robust RL training procedure has a cost similar to those classification settings. We believe there is still a lot of room for improving the training efficiency of robust reinforcement learning.

Questions & Answers to presentations on empirical methods

Animated by MĂ©lanie Ducoffe (Airbus)