Over the past few years, one of the most frequently discussed and critiqued aspects of social media platforms has been fake news and misinformation campaigns, which leads us to wonder: why has this problem become so pronounced in the recent past? One possible reason is the increase in content consumption through social media platforms. Another interesting point, highlighted by Vosoughi et al., is the observation that "falsehoods diffuse significantly farther, faster, deeper, and more broadly than the truth in all categories of information."
Social media platforms have tried to address this problem with differing levels of priority. With this project, we target one such platform: Instagram's fact-checking and misinformation classification pipeline. We aim to demonstrate that the current measures can be tricked by introducing an adversarial perturbation to an image that would otherwise have been flagged as fake. Our goal is to highlight and demonstrate the need for stronger checks in the existing processes.
All the original images, the perturbed images, and the code can be found in our GitHub repository.
OCR systems transform a two-dimensional image of text, which may contain machine-printed or handwritten characters, into machine-readable text. Formally, Optical Character Recognition (OCR) enables us to translate documents and images into analyzable, editable, and searchable data. Present-day platforms like Instagram rely heavily on OCR to classify and flag images, and even to limit the spread of images with flagged content on them. One such example is the flagging of images that contain the word "Covid" or "vaccines": in an attempt to limit the spread of misinformation, all such images are currently tagged with an additional banner that redirects users to reliable information sources. A minimal sketch of this kind of keyword flagging is shown below.
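To make the mechanism concrete, here is a hypothetical sketch of OCR-driven keyword flagging. The watchlist and file name are our own placeholders, not Instagram's actual rules, and the text extraction uses the pytesseract wrapper around the OCR engine we introduce in the next section.

```python
# A hypothetical sketch of OCR-driven keyword flagging. The watchlist and
# file name are illustrative placeholders, not Instagram's actual rules.
import pytesseract
from PIL import Image

FLAGGED_KEYWORDS = {"covid", "vaccine", "vaccines"}  # assumed watchlist

# Extract the text printed on the image, then scan it for flagged terms.
text = pytesseract.image_to_string(Image.open("post.png")).lower()
if any(keyword in text for keyword in FLAGGED_KEYWORDS):
    print("Flagged terms found; attach an information banner to the post.")
```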
In the current phase of our implementation, we use Tesseract, a deep learning-based OCR engine sponsored by Google, to extract text from images. Tesseract operates as a step-by-step pipeline: the first step is a connected-component analysis that identifies "blobs". These blobs are then organized into text lines, and the lines are broken into words based on character spacing. This is followed by a two-pass recognition process: in the first pass, the engine attempts to recognize each word, and satisfactorily recognized words are passed to an adaptive classifier as training data; the second pass refines the recognition using the knowledge gained from the first.
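The word-level output of this pipeline can be inspected through pytesseract's `image_to_data`, which exposes the per-word boxes and confidences produced by the segmentation and recognition stages described above; "post.png" is a placeholder file name.

```python
# Inspect Tesseract's word-level segmentation and recognition output.
import pytesseract
from PIL import Image

data = pytesseract.image_to_data(
    Image.open("post.png"), output_type=pytesseract.Output.DICT)

for word, conf, left, top in zip(
        data["text"], data["conf"], data["left"], data["top"]):
    if word.strip():  # skip empty segmentation entries
        print(f"{word!r} at ({left}, {top}) with confidence {conf}")
```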
Successfully evading Tesseract is one of the milestones of our project. Since Instagram is essentially a black box and we do not know which OCR engine it uses, we implement Tesseract as a baseline for evaluating our adversarial examples. Once we achieve satisfactory results against Tesseract, we will extend our attack to Instagram.
Adversarial examples have evolved over the years, with masked perturbations to images becoming barely detectable to the human eye. Several models have been developed to generate adversarial examples that can evade OCR detection. One interesting work is the Fast Adversarial Watermark Attack (FAWA), which disguises perturbations as watermarks so as to evade detection by the human eye. While this is a really interesting approach, the algorithm operates in a white-box setting, which is not feasible for our problem statement.
We therefore plan to extend HopSkipJump for our project. The HopSkipJump attack estimates the gradient direction using only binary (decision-based) information at the decision boundary. Experimentally, HopSkipJump requires fewer queries than comparable models and can be run as either a targeted or an untargeted attack. It is also effective against several defence mechanisms, such as defensive distillation, region-based classification, and adversarial training. We have implemented the HopSkipJump attack to misclassify certain words in images as their antonyms using a targeted attack; a sketch of this setup is shown below.
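The sketch below shows a minimal version of this setup using the Adversarial Robustness Toolbox (ART), with Tesseract wrapped as a black-box classifier. The two-word label set, file names, and image size are illustrative assumptions; the real pipeline maps many words to their antonyms.

```python
# Targeted HopSkipJump against Tesseract wrapped as an ART black-box
# classifier. Label set, file names, and image size are assumptions.
import numpy as np
import pytesseract
from PIL import Image
from art.attacks.evasion import HopSkipJump
from art.estimators.classification import BlackBoxClassifier

WORDS = ["real", "fake"]  # hypothetical word -> class-index mapping

def tesseract_predict(x: np.ndarray) -> np.ndarray:
    """Run Tesseract on each image and return one-hot word predictions."""
    preds = np.zeros((x.shape[0], len(WORDS)), dtype=np.float32)
    for i, img in enumerate(x):
        text = pytesseract.image_to_string(
            Image.fromarray(img.astype(np.uint8))).strip().lower()
        # Default to class 0 when the recognized text is outside the label set.
        preds[i, WORDS.index(text) if text in WORDS else 0] = 1.0
    return preds

# ART only ever calls predict(), so Tesseract remains a pure black box.
classifier = BlackBoxClassifier(
    tesseract_predict,
    input_shape=(64, 256, 3),  # assumed size all images are resized to
    nb_classes=len(WORDS),
    clip_values=(0, 255),
)

x = np.array([np.asarray(Image.open("real.png"))], dtype=np.float32)
x_init = np.array([np.asarray(Image.open("fake.png"))], dtype=np.float32)

y_target = np.zeros((1, len(WORDS)), dtype=np.float32)
y_target[0, WORDS.index("fake")] = 1.0  # steer recognition toward the antonym

# Each query runs Tesseract, so expect the attack to be slow in practice.
attack = HopSkipJump(classifier=classifier, targeted=True, max_iter=40)
x_adv = attack.generate(x=x, y=y_target, x_adv_init=x_init)
```

Starting from an image that already reads as the target word (`x_adv_init`) lets the decision-based search walk toward the original image while staying on the adversarial side of the boundary.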
On December 16, 2019, Instagram published a blog post discussing its efforts to combat misinformation on the platform. The post described the workings of the new system as follows:
Any images reported as fake/misinformation are sent to third-party fact-checkers for verification.
Once an image is fact-checked and found to be false or partly false, the platform "reduces" its distribution by limiting its visibility. In addition, a banner covering the image is displayed, as shown in Figure 1.
Interestingly, Instagram further uses image-matching algorithms to identify similar images and flag them as false as well.
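Instagram has not disclosed its matching algorithm, but perceptual hashing is one common way such near-duplicate detection is implemented. The sketch below uses the open-source imagehash library; the file names and distance threshold are our own assumptions.

```python
# A hedged illustration of near-duplicate image matching via perceptual
# hashing. This is a sketch of the general technique, not Instagram's
# actual (undisclosed) matching algorithm.
from PIL import Image
import imagehash

original = imagehash.phash(Image.open("flagged.png"))   # placeholder file
candidate = imagehash.phash(Image.open("repost.png"))   # placeholder file

# A small Hamming distance between perceptual hashes indicates visually
# similar images, even after resizing or mild edits.
if original - candidate <= 8:  # threshold chosen for illustration
    print("Candidate matches a flagged image; apply the same label.")
```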