Researchers have discovered that "adversarial examples" can be used to fool machine learning models. An adversarial example is a natural example plus a carefully designed perturbation. For example, by adding a small amount of noise to the original ostrich image on the left-hand side, we can turn the predicted label into "safe", "shoe shop", or "vacuum", while the added noise is too small to be noticed by human eyes. This leads to security issues when deploying ML models in the real world---for example, if an ML model is used in a self-driving car system, an attacker can slightly alter a traffic sign to fool the system.
We are trying to answer the following questions: How can we generate adversarial examples to attack an ML model? How can we verify the safety and robustness of a model? How can we make deep neural networks more robust to adversarial attacks? See the introductory slides here. We are pursuing the following directions:
We attack image-captioning systems. The main challenge here is that the output is a sequence instead of a class label, so the attack objective needs to be redefined. We successfully conduct targeted attacks (turning the output caption into a specific target sentence) and targeted keyword attacks (injecting malicious keywords into the output caption); a sketch of the targeted objective is shown below. Also check out our code.
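As a rough illustration (not the exact formulation in the paper), the targeted attack can be cast as a C&W-style optimization: make the captioning model assign high likelihood to the target sentence while keeping the perturbation small. The model(image, prev_tokens) interface and the hyper-parameters below are assumptions for the sketch.

```python
# Hedged sketch of a targeted caption attack; the model(image, prev_tokens) ->
# per-token logits interface and the hyper-parameters are assumptions.
import torch
import torch.nn.functional as F

def targeted_caption_attack(model, image, target_tokens, c=1.0, lr=0.01, steps=500):
    """Search for a small perturbation that makes the generated caption equal target_tokens."""
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Teacher-forced logits for the target sentence on the perturbed image.
        logits = model(image + delta, target_tokens[:-1])          # shape (T, vocab), assumed
        caption_loss = F.cross_entropy(logits, target_tokens[1:])  # push the target caption
        loss = c * caption_loss + (delta ** 2).sum()               # trade off against L2 perturbation size
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (image + delta).detach()
```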
We proposed the first optimization-based algorithm for the score-based black-box attack setting (where the output probabilities can be observed). In this case, although the gradient of the C&W objective function is not available, the objective value can still be computed for each query. Therefore, the same objective can be solved, not via gradient descent but via zeroth-order optimization. With this simple idea, score-based black-box attacks can produce adversarial examples of the same quality as white-box attacks; a minimal sketch of one zeroth-order step is shown below.
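A minimal sketch of the zeroth-order idea, assuming loss_fn is the black-box attack objective (e.g. the C&W loss computed from output probabilities) that can be queried but not differentiated; the coordinate batch size, step sizes, and plain gradient-descent update are illustrative rather than the exact ZOO settings.

```python
# One zeroth-order step in the spirit of ZOO: estimate coordinate-wise gradients
# of a black-box loss with symmetric finite differences, then take a descent step.
import numpy as np

def zoo_step(loss_fn, x, h=1e-4, lr=0.01, coords=128, rng=np.random.default_rng(0)):
    grad = np.zeros(x.size)
    flat = x.reshape(-1)
    for i in rng.choice(x.size, size=min(coords, x.size), replace=False):
        e = np.zeros(x.size)
        e[i] = h
        # Two queries per coordinate: symmetric finite difference along e_i.
        grad[i] = (loss_fn((flat + e).reshape(x.shape)) -
                   loss_fn((flat - e).reshape(x.shape))) / (2 * h)
    return x - lr * grad.reshape(x.shape)   # ZOO itself uses ADAM/Newton-style coordinate updates
```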
ZOO can produce very good adversarial examples but requires a lot of queries. We improve the query efficiency of ZOO by switching to Nesterov's gradient-free optimization and a novel auto-encoder-based technique; a sketch of the latent-space trick is shown below.
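A hedged sketch of the dimension-reduction part: the perturbation is optimized in the low-dimensional latent space of a pre-trained autoencoder and mapped back to image space by its decoder, so each gradient estimate needs far fewer queries than per-pixel finite differences. The decoder, loss_fn, and smoothing/step parameters here are assumptions for illustration.

```python
# Hedged sketch: optimize a low-dimensional latent perturbation z and let an
# (assumed) pre-trained autoencoder decoder map it back to image space.
import numpy as np

def latent_zo_step(loss_fn, decoder, x, z, beta=0.01, lr=0.1, rng=np.random.default_rng(0)):
    u = rng.standard_normal(z.shape)
    u /= np.linalg.norm(u)                   # random unit direction in latent space
    # Single random-direction estimate of the gradient with respect to z.
    g = (loss_fn(x + decoder(z + beta * u)) - loss_fn(x + decoder(z))) / beta * u
    return z - lr * g
```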
ZOO and AutoZOOM (and many other black-box attacks) assume the output probabilities of the target classifier are available, which is still unrealistic in real applications. Let's consider the most difficult setting---the hard-label black-box setting---where for each query the attacker can only observe the output label instead of the probabilities. Surprisingly, we show that the hard-label attack can still be formulated as a different optimization problem, and this objective can be solved by any zeroth-order optimization technique! A sketch of the reformulated objective is shown below.
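The key quantity is, for a search direction theta, the distance from x to the decision boundary along theta. This is a real-valued function of theta that can be evaluated using only predicted labels and then minimized with any zeroth-order optimizer. A minimal sketch, with illustrative tolerances and doubling schedule:

```python
# g(theta): distance from x to the decision boundary along direction theta,
# evaluated with label queries only (coarse doubling search plus binary search).
import numpy as np

def boundary_distance(predict_label, x, y_true, theta, tol=1e-3):
    theta = theta / np.linalg.norm(theta)
    hi = 1.0
    while predict_label(x + hi * theta) == y_true:   # grow until the label flips
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:                             # binary search for the boundary
        mid = (lo + hi) / 2.0
        if predict_label(x + mid * theta) == y_true:
            lo = mid
        else:
            hi = mid
    return hi   # minimizing this over theta with a zeroth-order method gives the attack
```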
Given a neural network, how can we verify that the prediction won't be changed by a small perturbation? This is known as the verification problem of deep neural networks and is important for applying ML models in the real world. The problem can be stated simply: given a model f(.) and a data point x, how do we find a region Ball(x, r) such that no perturbation within this ball changes the prediction? This is illustrated in the following picture:
It is known that the Lipschitz constant can be used to compute a lower bound on the "safe region"---intuitively, the Lipschitz constant bounds how much the output can change under a small input perturbation. In the following paper, we propose a sampling-based approach to estimate the Lipschitz constant, which implies an estimate of r (the radius of the safe region); a simplified sketch is shown below.
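A simplified sketch of the sampling idea (not the exact estimator in the paper, which fits an extreme-value distribution to the sampled gradient norms): sample points around x, record gradient norms of the class margin, and divide the margin at x by the largest observed norm. The batch layout, sampling ball, and sample count are assumptions.

```python
# Assumes model(x) returns logits of shape (1, num_classes) and y is the true class.
import torch

def estimate_safe_radius(model, x, y, radius=0.5, n_samples=200):
    def margin(inp):
        out = model(inp)
        others = out.clone()
        others[0, y] = -float("inf")
        return out[0, y] - others.max()          # positive margin means the prediction is y
    grad_norms = []
    for _ in range(n_samples):
        z = (x + radius * (2 * torch.rand_like(x) - 1)).requires_grad_(True)
        margin(z).backward()
        grad_norms.append(z.grad.norm().item())
    lipschitz_est = max(grad_norms)              # a sampled estimate, not a certified constant
    with torch.no_grad():
        return (margin(x) / lipschitz_est).item()
```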
Unfortunately, the statistical estimator proposed in our ICLR paper is not a certified bound---it is possible to over-estimate the safe region due to sampling error. To compute a certified safe region, we propose a novel layer-by-layer computation. The idea is very simple: first, compute upper/lower bounds for the neurons in the first layer. Then, using the weights and a bound on the activation function, we can compute upper/lower bounds for the neurons in the second layer. Repeating this up to the output neurons, we obtain the maximum perturbation of the output under a given input perturbation. We first derive this algorithm for ReLU networks in the following ICML paper, and then extend the idea to networks with general activation functions in the NIPS paper. This is the first algorithm that solves the verification problem for general neural networks! A simplified sketch of the layer-by-layer bounding is shown below.
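A simplified interval-arithmetic sketch of the layer-by-layer idea for a fully connected ReLU network; this is looser than the linear relaxations used in the papers, but has the same structure: push element-wise lower/upper bounds through each affine layer, then through the ReLU.

```python
# weights/biases are lists of numpy arrays; eps is the L_inf input perturbation.
import numpy as np

def propagate_bounds(weights, biases, x, eps):
    lb, ub = x - eps, x + eps                         # bounds on the input ball
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        new_lb = W_pos @ lb + W_neg @ ub + b          # smallest possible pre-activation
        new_ub = W_pos @ ub + W_neg @ lb + b          # largest possible pre-activation
        if i < len(weights) - 1:                      # ReLU on hidden layers only
            new_lb, new_ub = np.maximum(new_lb, 0), np.maximum(new_ub, 0)
        lb, ub = new_lb, new_ub
    return lb, ub                                     # certified bounds on the output logits
```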
Eventually, we don't want to be the bad guys; our ultimate goal is to make machine learning models more robust and make attacks impossible. This is an extremely hard problem, especially for very deep neural networks. We propose the following algorithms, which achieve state-of-the-art performance (competitive with or better than Madry's model).
All the attacks above are based on computing or estimating gradients, so let's try to make the attacker unable to do that! The main idea is to add randomization into the model, so the gradient is hard to compute and almost impossible to estimate in the black-box setting. In the white-box setting, we perform similarly to Madry's adversarial training model, but our training is much faster! A sketch of the randomization idea is shown below.
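A hedged sketch of the randomization idea: Gaussian noise is injected in front of a layer at both training and test time, and predictions are averaged over several stochastic forward passes so that a single gradient query is not informative. The noise level and ensemble size below are illustrative, not the settings from the paper.

```python
import torch
import torch.nn as nn

class NoisyLayer(nn.Module):
    """Wrap an existing layer and add Gaussian noise to its input on every forward pass."""
    def __init__(self, layer, sigma=0.1):
        super().__init__()
        self.layer, self.sigma = layer, sigma

    def forward(self, x):
        return self.layer(x + self.sigma * torch.randn_like(x))

def ensemble_predict(model, x, n=10):
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n)])
    return probs.mean(dim=0).argmax(dim=-1)   # average over the random ensemble
```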
We also try to improve adversarial training with a Generative Adversarial Network (GAN). In our system, a generator, a discriminator, and an attacker play a three-player game. Surprisingly, this not only improves the robustness of the discriminator (so our system achieves state-of-the-art defense results), but the generator also converges to a better solution! A loose sketch of one training step is shown below.
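A loose sketch of one discriminator update in the three-player game; the D(x) -> (class_logits, real_logit) interface, the attack() routine, and the equal loss weights are assumptions for illustration, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, attack, x_real, y, z, opt_D):
    x_adv = attack(D, x_real, y)               # attacker's move, e.g. PGD on D's classification loss
    x_fake = G(z).detach()                     # generator's move (G is updated in its own step)
    cls_clean, real_clean = D(x_real)
    cls_adv, _ = D(x_adv)
    _, real_fake = D(x_fake)
    loss = (F.cross_entropy(cls_clean, y)      # classify clean data correctly
            + F.cross_entropy(cls_adv, y)      # stay correct on adversarial data
            + F.binary_cross_entropy_with_logits(real_clean, torch.ones_like(real_clean))
            + F.binary_cross_entropy_with_logits(real_fake, torch.zeros_like(real_fake)))
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()
    return loss.item()
```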
We are still exploring all of these directions, as well as other directions in adversarial deep learning. Feel free to contact us or talk with us if you are interested.
Thanks to my students (Huan Zhang, Minhao Cheng, Xuanqing Liu, Puyudi Yang) and collaborators (Pin-Yu Chen, Jinfeng Yi, Lily Weng, Hongge Chen, Thong Le, Yash Sharma, Dong Su, Yupeng Gao, Luca Daniel, Inderjit Dhillon, Duane Boning, Zhao Song) for all of this interesting work.