Researchers have discovered that "adversarial examples" can be used to fool machine learning models. An adversarial example is a natural example plus a carefully designed perturbation. For example, by adding a small amount of noise to the original ostrich image on the left-hand side, we can turn the predicted label into "safe", "shoe shop", or "vacuum", while the added noise is too small to be noticed by human eyes. This leads to security issues when deploying ML models in the real world---for example, if an ML model is used in a self-driving car system, an attacker can slightly alter a traffic sign to fool the system.
We are trying to answer the following questions: How can we generate adversarial examples to attack an ML model? How can we verify the safety and robustness of a model? How can we make deep neural networks more robust to adversarial attacks? See the introductory slides here. We are pursuing the following directions:
We attack image-captioning systems. The main challenge here is that the output is a sequence instead of a class label, so the attack objective needs to be redefined. We successfully conduct targeted attacks (turning the output caption into a specific target sentence) and targeted keyword attacks (injecting malicious keywords into the output caption); a sketch of the targeted objective is shown below. Also check out our code.
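As a rough illustration (not the exact formulation in the paper), the targeted attack can be cast as a C&W-style optimization: make the captioning model assign high likelihood to the target sentence while keeping the perturbation small. The model(image, prev_tokens) interface and the hyper-parameters below are assumptions for the sketch.

```python
# Hedged sketch of a targeted caption attack; the model(image, prev_tokens) ->
# per-token logits interface and the hyper-parameters are assumptions.
import torch
import torch.nn.functional as F

def targeted_caption_attack(model, image, target_tokens, c=1.0, lr=0.01, steps=500):
    """Search for a small perturbation that makes the generated caption equal target_tokens."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Teacher-forced logits for the target sentence on the perturbed image.
        logits = model(image + delta, target_tokens[:-1])          # shape (T, vocab), assumed
        caption_loss = F.cross_entropy(logits, target_tokens[1:])  # push the target caption
        loss = c * caption_loss + (delta ** 2).sum()               # trade off against L2 perturbation size
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (image + delta).detach()
```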
We proposed the first optimization-based algorithm for the score-based black-box attack setting (where the output probabilities can be observed). In this case, although the gradient of the C&W objective function is not available, the objective value can still be computed for each query. Therefore, the same objective can be solved, not via gradient descent but via zeroth-order optimization. With this simple idea, score-based black-box attacks can produce adversarial examples of the same quality as white-box attacks; a minimal sketch of one zeroth-order step is shown below.
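A minimal sketch of the zeroth-order idea, assuming loss_fn is the black-box attack objective (e.g. the C&W loss computed from output probabilities) that can be queried but not differentiated; the coordinate batch size, step sizes, and plain gradient-descent update are illustrative rather than the exact ZOO settings.

```python
# One zeroth-order step in the spirit of ZOO: estimate coordinate-wise gradients
# of a black-box loss with symmetric finite differences, then take a descent step.
import numpy as np

def zoo_step(loss_fn, x, h=1e-4, lr=0.01, coords=128, rng=np.random.default_rng(0)):
    grad = np.zeros(x.size)
    flat = x.reshape(-1)
    for i in rng.choice(x.size, size=min(coords, x.size), replace=False):
        e = np.zeros(x.size)
        e[i] = h
        # Two queries per coordinate: symmetric finite difference along e_i.
        grad[i] = (loss_fn((flat + e).reshape(x.shape)) -
                   loss_fn((flat - e).reshape(x.shape))) / (2 * h)
    return x - lr * grad.reshape(x.shape)   # ZOO itself uses ADAM/Newton-style coordinate updates
```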
ZOO can produce very good adversarial examples but requires a lot of queries. We improve the query efficiency of ZOO by switching to Nesterov's gradient-free optimization and a novel auto-encoder-based technique; a sketch of the latent-space trick is shown below.
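A hedged sketch of the dimension-reduction part: the perturbation is optimized in the low-dimensional latent space of a pre-trained autoencoder and mapped back to image space by its decoder, so each gradient estimate needs far fewer queries than per-pixel finite differences. The decoder, loss_fn, and smoothing/step parameters here are assumptions for illustration.

```python
# Hedged sketch: optimize a low-dimensional latent perturbation z and let an
# (assumed) pre-trained autoencoder decoder map it back to image space.
import numpy as np

def latent_zo_step(loss_fn, decoder, x, z, beta=0.01, lr=0.1, rng=np.random.default_rng(0)):
    u = rng.standard_normal(z.shape)
    u /= np.linalg.norm(u)                   # random unit direction in latent space
    # Single random-direction estimate of the gradient with respect to z.
    g = (loss_fn(x + decoder(z + beta * u)) - loss_fn(x + decoder(z))) / beta * u
    return z - lr * g
```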
ZOO and AutoZOOM (and many other black-box attacks) assume the output probabilities of the target classifier are available, which is still unrealistic in real applications. Let's consider the most difficult setting---the hard-label black-box setting---where for each query the attacker can only observe the output label instead of the probabilities. Surprisingly, we show that the hard-label attack can still be formulated as a different optimization problem, and this objective can be solved by any zeroth-order optimization technique! A sketch of the reformulated objective is shown below.
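The key quantity is, for a search direction theta, the distance from x to the decision boundary along theta. This is a real-valued function of theta that can be evaluated using only predicted labels and then minimized with any zeroth-order optimizer. A minimal sketch, with illustrative tolerances and doubling schedule:

```python
# g(theta): distance from x to the decision boundary along direction theta,
# evaluated with label queries only (coarse doubling search plus binary search).
import numpy as np

def boundary_distance(predict_label, x, y_true, theta, tol=1e-3):
    theta = theta / np.linalg.norm(theta)
    hi = 1.0
    while predict_label(x + hi * theta) == y_true:   # grow until the label flips
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:                             # binary search for the boundary
        mid = (lo + hi) / 2.0
        if predict_label(x + mid * theta) == y_true:
            lo = mid
        else:
            hi = mid
    return hi   # minimizing this over theta with a zeroth-order method gives the attack
```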
Given a neural network, how can we verify that the prediction won't be changed by a small perturbation? This is known as the verification problem of deep neural networks and is important for applying ML models in the real world. The problem can be stated simply: given a model f(.) and a data point x, how do we find a region Ball(x, r) such that no perturbation within this ball changes the prediction? This is illustrated in the following picture:
It is known that the Lipschitz constant can be used to compute a lower bound on the "safe region"---intuitively, the Lipschitz constant bounds how much the output can change under a small input perturbation. In the following paper, we propose a sampling-based approach to estimate the Lipschitz constant, which implies an estimate of r (the radius of the safe region); a simplified sketch is shown below.
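A simplified sketch of the sampling idea (not the exact estimator in the paper, which fits an extreme-value distribution to the sampled gradient norms): sample points around x, record gradient norms of the class margin, and divide the margin at x by the largest observed norm. The batch layout, sampling ball, and sample count are assumptions.

```python
# Assumes model(x) returns logits of shape (1, num_classes) and y is the true class.
import torch

def estimate_safe_radius(model, x, y, radius=0.5, n_samples=200):
    def margin(inp):
        out = model(inp)
        others = out.clone()
        others[0, y] = -float("inf")
        return out[0, y] - others.max()          # positive margin means the prediction is y
    grad_norms = []
    for _ in range(n_samples):
        z = (x + radius * (2 * torch.rand_like(x) - 1)).requires_grad_(True)
        margin(z).backward()
        grad_norms.append(z.grad.norm().item())
    lipschitz_est = max(grad_norms)              # a sampled estimate, not a certified constant
    with torch.no_grad():
        return (margin(x) / lipschitz_est).item()
```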
Unfortunately, the statistical estimator proposed in our ICLR paper is not a certified bound---it is possible to over-estimate the safe region due to sampling error. To compute a certified safe region, we propose a novel layer-by-layer computation. The idea is very simple: first, compute upper/lower bounds for the neurons in the first layer. Then, using the weights and a bound on the activation function, we can compute upper/lower bounds for the neurons in the second layer. Repeating this up to the output neurons, we obtain the maximum perturbation of the output under a given input perturbation. We first derive this algorithm for ReLU networks in the following ICML paper, and then extend the idea to networks with general activation functions in the NIPS paper. This is the first algorithm that solves the verification problem for general neural networks! A simplified sketch of the layer-by-layer bounding is shown below.
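A simplified interval-arithmetic sketch of the layer-by-layer idea for a fully connected ReLU network; this is looser than the linear relaxations used in the papers, but has the same structure: push element-wise lower/upper bounds through each affine layer, then through the ReLU.

```python
# weights/biases are lists of numpy arrays; eps is the L_inf input perturbation.
import numpy as np

def propagate_bounds(weights, biases, x, eps):
    lb, ub = x - eps, x + eps                         # bounds on the input ball
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        new_lb = W_pos @ lb + W_neg @ ub + b          # smallest possible pre-activation
        new_ub = W_pos @ ub + W_neg @ lb + b          # largest possible pre-activation
        if i < len(weights) - 1:                      # ReLU on hidden layers only
            new_lb, new_ub = np.maximum(new_lb, 0), np.maximum(new_ub, 0)
        lb, ub = new_lb, new_ub
    return lb, ub                                     # certified bounds on the output logits
```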
Eventually, we don't want to be the bad guys; our ultimate goal is to make machine learning models more robust and make attacks impossible. This is an extremely hard problem, especially for very deep neural networks. We propose the following algorithms, which achieve state-of-the-art performance (competitive with or better than Madry's model).
All the attacks above are based on computing or estimating gradients, so let's try to make the attacker unable to do that! The main idea is to add randomization into the model, so the gradient is hard to compute and almost impossible to estimate in the black-box setting. In the white-box setting, we perform similarly to Madry's adversarial training model, but our training is much faster! A sketch of the randomization idea is shown below.
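A hedged sketch of the randomization idea: Gaussian noise is injected in front of a layer at both training and test time, and predictions are averaged over several stochastic forward passes so that a single gradient query is not informative. The noise level and ensemble size below are illustrative, not the settings from the paper.

```python
import torch
import torch.nn as nn

class NoisyLayer(nn.Module):
    """Wrap an existing layer and add Gaussian noise to its input on every forward pass."""
    def __init__(self, layer, sigma=0.1):
        super().__init__()
        self.layer, self.sigma = layer, sigma

    def forward(self, x):
        return self.layer(x + self.sigma * torch.randn_like(x))

def ensemble_predict(model, x, n=10):
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n)])
    return probs.mean(dim=0).argmax(dim=-1)   # average over the random ensemble
```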
We also try to improve adversarial training with a Generative Adversarial Network (GAN). In our system, a generator, a discriminator, and an attacker play a three-player game. Surprisingly, this not only improves the robustness of the discriminator (so our system achieves state-of-the-art defense results), but the generator also converges to a better solution! A loose sketch of one training step is shown below.
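A loose sketch of one discriminator update in the three-player game; the D(x) -> (class_logits, real_logit) interface, the attack() routine, and the equal loss weights are assumptions for illustration, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, attack, x_real, y, z, opt_D):
    x_adv = attack(D, x_real, y)               # attacker's move, e.g. PGD on D's classification loss
    x_fake = G(z).detach()                     # generator's move (G is updated in its own step)
    cls_clean, real_clean = D(x_real)
    cls_adv, _ = D(x_adv)
    _, real_fake = D(x_fake)
    loss = (F.cross_entropy(cls_clean, y)      # classify clean data correctly
            + F.cross_entropy(cls_adv, y)      # stay correct on adversarial data
            + F.binary_cross_entropy_with_logits(real_clean, torch.ones_like(real_clean))
            + F.binary_cross_entropy_with_logits(real_fake, torch.zeros_like(real_fake)))
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()
    return loss.item()
```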
We are still exploring all of these directions, as well as other directions in adversarial deep learning. Feel free to contact us or talk with us if you are interested.
Thanks to my students (Huan Zhang, Minhao Cheng, Xuanqing Liu, Puyudi Yang) and collaborators (Pin-Yu Chen, Jinfeng Yi, Lily Weng, Hongge Chen, Thong Le, Yash Sharma, Dong Su, Yupeng Gao, Luca Daniel, Inderjit Dhillon, Duane Boning, Zhao Song) for all of this interesting work.