Comparison of the true boundary of the problem and the boundary curve acquired from training (top), and how a slight perturbation of the input leads to a completely different classification (bottom)
In simplified terms, training a machine learning model such as a deep neural network can be viewed as a curve-fitting process. We cannot know the true decision curve of the problem when we train a model (if we knew the true curve, we would not need complex machine learning in the first place). All we have is individual data points, which are labeled in the case of supervised learning. Given this restriction, training amounts to deriving a decision curve that fits the set of training samples as well as possible, while hoping that the acquired curve lies close to the true decision curve.
However, no matter how large the training set is, there are many more ways to draw a curve that fits it well, so there is generally a considerable discrepancy between the true curve and the curve obtained from training. The nature of this discrepancy depends on how flexible the model is. Stiffer models, represented by various traditional machine learning schemes (e.g. linear regression), tend to be unable to fully reflect the complexity of the true curve, whereas more flexible models, represented by deep learning, tend to reflect it excessively. Although a lack of complexity often lowers the performance of the trained model, overly complicated models still perform well in most cases. Under adversarial circumstances, however, an attacker can intentionally exploit such an overly flexible decision curve, and this becomes a security problem. This is ironic, considering that the abundant flexibility is one of the main sources of the powerful performance of deep learning models.
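To make this concrete, here is a minimal sketch of the curve-fitting view, assuming a toy 1-D regression task with NumPy; the "true curve", the noise level, and the polynomial degrees are all arbitrary choices of mine for illustration. A stiff model (low degree) cannot follow the true curve, while an overly flexible one (high degree) fits the training samples almost perfectly yet can drift far from the truth in between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true decision curve" that we normally never get to see.
def true_curve(x):
    return np.sin(2 * np.pi * x)

# All we observe in practice: a finite set of noisy, labeled samples.
x_train = rng.uniform(0.0, 1.0, 20)
y_train = true_curve(x_train) + rng.normal(0.0, 0.15, x_train.shape)

x_dense = np.linspace(0.0, 1.0, 200)   # where we compare against the truth

for degree in (1, 3, 12):              # stiff, moderate, overly flexible
    fit = np.poly1d(np.polyfit(x_train, y_train, degree))
    train_mse = np.mean((fit(x_train) - y_train) ** 2)
    true_mse = np.mean((fit(x_dense) - true_curve(x_dense)) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, "
          f"gap to true curve {true_mse:.4f}")
```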
Adversarial machine learning schemes exploit this flexibility and generally work as follows. Given a benign data point and the trained victim model itself, the attacker can compute the gradient at that point, i.e. the direction along which the point most quickly deviates from the model's original output (decision). With a convoluted decision curve like the one in the figure above, the attacker can completely alter the decision by only slightly perturbing the point. In contrast, under the true decision curve, such a slight perturbation does not change the decision at all. Since mimicking human intelligence and perception is one of the most notable goals of machine learning, the unknown true decision curve is, for the purpose of generating adversarial examples, usually that of humans. With humans as the reference, the perceptual similarity between the original input and the perturbed one is the key factor in determining whether an attack succeeds: once humans can notice the perturbation, or are even deceived themselves, adversarial samples become pointless. Because how human perception works has not been fully revealed yet, various distance metrics (such as the L0, L2, and L∞ norms) are adopted as proxies for perceptual similarity.
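As a small illustration of those metrics (the array shape and the perturbation scale below are arbitrary stand-ins of mine), here is how the same dense, low-amplitude perturbation scores under L0, L2, and L∞; which of the three best tracks what a human would actually notice is exactly the open question.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (224, 224, 3))          # stand-in for a benign image
delta = rng.uniform(-1, 1, x.shape) * 0.007   # small, dense perturbation

# The three metrics the text mentions, applied to the perturbation.
l0 = np.count_nonzero(delta)          # how many pixels were touched at all
l2 = np.linalg.norm(delta.ravel())    # overall energy of the change
linf = np.abs(delta).max()            # largest per-pixel change

print(f"L0 = {l0}, L2 = {l2:.3f}, Linf = {linf:.4f}")
# A dense but tiny perturbation looks huge under L0 yet negligible under Linf,
# which is why the choice of metric shapes what "imperceptible" means.
```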
Let me introduce adversarial machine learning more technically with the Fast Gradient Sign Method (FGSM), one of the earliest and simplest adversarial attack schemes, introduced by Ian Goodfellow. Let x be the data point to be perturbed and f the victim model. Under benign circumstances, f produces the correct output y.
y = f(x)
The attacker then derives the required adversarial perturbation Δx as follows:
Δx = -ε·sign[∇ₓJ(f, x, y')]
Here, ε is the magnitude of the perturbation (a larger magnitude works better but is less stealthy because it modifies the input data more noticeably), and y' is the output the attacker wants the model to produce. sign is the sign function and J is a scalar loss function based on the model f, which measures how far the model's output on the current input x is from the desired output y'. The gradient is taken with respect to the input x, and the minus sign moves x in the direction that decreases this loss, i.e. toward the attacker's desired output.
Finally, the acquired adversarial perturbation Δx is added to the original data point x to produce an adversarial point x', as below. With an adequate choice of the perturbation magnitude ε, this adversarial input will alter the decision of the victim model. And as noted earlier, a small ε suffices when the victim model is a flexible one like a deep neural network.
x' = x + Δx
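Putting the three formulae together, here is a minimal PyTorch sketch of the targeted one-step attack described above. The function name fgsm_perturb is mine, cross-entropy stands in for the loss J, and inputs are assumed to be images scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y_target, eps):
    """One-step targeted FGSM following the formulae above.

    model    : the victim classifier f
    x        : benign input batch of shape (N, C, H, W), values in [0, 1]
    y_target : class indices y' the attacker wants the model to output
    eps      : perturbation magnitude epsilon
    """
    x = x.clone().detach().requires_grad_(True)

    # J(f, x, y'): how far the current prediction is from the desired output.
    loss = F.cross_entropy(model(x), y_target)
    loss.backward()

    # Delta x = -eps * sign(grad_x J): step toward the attacker's target.
    delta = -eps * x.grad.sign()

    # x' = x + Delta x, clipped back to a valid pixel range.
    return (x + delta).clamp(0.0, 1.0).detach()
```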
The figure below shows how a benign input (an image of a panda) is corrupted by FGSM so that the victim model erroneously classifies it as a gibbon. Although the perturbation is so subtle that humans cannot recognize it, it makes the victim model 99.3% confident in its wrong decision, which is much higher than the confidence the model had in its original, correct classification.
The process of deriving an adversarial input from an image. N.B. θ represents the model parameters and plays the role of the model f in my formulae. (image from Goodfellow's original paper)
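To show how the sketch above would be used, here is a hypothetical end-to-end example against a pretrained torchvision classifier; the random tensor stands in for a properly preprocessed photograph, and the target class index is arbitrary.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

x = torch.rand(1, 3, 224, 224)      # stand-in for a preprocessed input image
y_target = torch.tensor([368])      # arbitrary target class for illustration

x_adv = fgsm_perturb(model, x, y_target, eps=0.007)

with torch.no_grad():
    probs = model(x_adv).softmax(dim=1)
print("confidence in the attacker's target class:",
      probs[0, y_target].item())
```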
The problem of physical adversarial examples
Originally, the field of adversarial machine learning did not pay much attention to how the attacker could inject adversarial inputs into the victim model. Adversarial inputs were assumed to be injected without the loss or corruption of a single bit. Except for some special cases, like typing text into Google Assistant, this assumption rarely holds; moreover, an attacker who already has the privilege of feeding arbitrary inputs directly to the victim model can drive it however he wants without using any adversarial machine learning techniques at all. Practically, it is more realistic to assume that the victim model takes inputs only from trusted sources such as its own sensors (e.g. microphones and cameras), which the attacker cannot directly tamper with.
One may think the attacker is still able to inject adversarial inputs through the victim's sensors. Here, however, a new problem arises: the delicate adversarial perturbation must survive the sensing procedure. Because sensors are devices with limited fidelity, subtle details of the adversarial input, especially the adversarial perturbation, are highly likely to be lost, and with them the adversarial effect. The legitimate content of the input, in contrast, tends to be much more salient and thus survives the sensing procedure. The field of physical adversarial machine learning tries to overcome this restriction by preserving the adversarial effect even when adversarial inputs are filmed from a distance or played back over the air.
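A minimal sketch of why such delicate perturbations tend not to survive sensing: below, a downsample-and-upsample step stands in for a camera's limited fidelity (a crude assumption of mine, not a real sensor model), and the sign-noise perturbation mimics the FGSM output above. The perturbation that reaches the model after "sensing" is typically much weaker than the one that was crafted.

```python
import torch
import torch.nn.functional as F

def simulated_sensing(img, scale=0.5):
    """Crude stand-in for a camera: lose detail by downsampling, then resize back."""
    h, w = img.shape[-2:]
    small = F.interpolate(img, scale_factor=scale, mode="bilinear",
                          align_corners=False)
    return F.interpolate(small, size=(h, w), mode="bilinear",
                         align_corners=False)

x = torch.rand(1, 3, 224, 224)                  # benign content
delta = 0.007 * torch.randn_like(x).sign()      # delicate FGSM-style noise
x_adv = (x + delta).clamp(0.0, 1.0)

# How much of the crafted perturbation remains after the sensing step?
residual = (simulated_sensing(x_adv) - simulated_sensing(x)).abs().mean()
print(f"mean |perturbation| before sensing: {delta.abs().mean().item():.4f}")
print(f"mean |perturbation| after  sensing: {residual.item():.4f}")
```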
Existing studies (e.g. adversarial eyeglass frames and adversarial road signs) look for the answer in perturbations that are obvious yet not peculiar. Although perceptual similarity is often replaced by various distance metrics, the relationship is not that simple: a large distance from the original input does not necessarily mean the perturbed input is perceptually different. Because obvious perturbations can survive the sensing procedure, devising obvious perturbations that people do not find strange can be a breakthrough for the field of physical adversarial examples.
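The point about distance metrics can be demonstrated with a toy comparison (the image, the brightness shift, and the speckle pattern below are all arbitrary stand-ins of mine): a uniform brightening typically has a much larger L2 distance than a handful of conspicuous speckles, yet it is the speckles that a human is more likely to find strange.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, (224, 224, 3))   # stand-in for a benign image

# Perceptually mild change: brighten the whole image slightly and uniformly.
brightened = np.clip(x + 0.1, 0.0, 1.0)

# Perceptually odd change: overwrite a few hundred pixels with random values.
speckled = x.copy()
idx = rng.choice(x.size, size=200, replace=False)
speckled.flat[idx] = rng.uniform(0.0, 1.0, 200)

for name, x_p in (("uniform brightening", brightened),
                  ("speckle noise", speckled)):
    l2 = np.linalg.norm((x_p - x).ravel())
    print(f"{name:19s}: L2 distance from the original = {l2:.2f}")
```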
The field of physical adversarial examples can be thought of as an intersection between CPS security and adversarial machine learning, because systems with their own sensors tend to be cyber-physical systems. Alternatively, viewing the whole pipeline (sensors plus the back-end machine learning model) as an advanced sensor module, physical adversarial examples can be seen as a special form of sensor attack. This paper shows how two fields from completely different domains can be combined to envision future directions for both.