[W1] All You Need is "Love": Evading Hate Speech Detection.
T. Gröndahl, L. Pajola, M. Juuti, M. Conti, N. Asokan.
AISec '18: Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security.
Performance: evaluate the performance of different architectures on different datasets
Transferability: evaluate the original models (models trained in their original settings) on different datasets
Adversarial resistance: evaluate the resistance of the models in the presence of adversaries
All the models show similar results when trained and tested on the same corpora. The choice of model seems irrelevant at this stage. By this result we do not imply that further tests should not focus on deep neural networks, but rather that all of the architectures perform poorly in the same way.
In the cross-domain test we highlight the poor performance of all of the models on domains different from the one they were trained on. We believe this is the biggest open issue in this area so far.
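As an illustration of this transferability test, the sketch below trains a classifier on one corpus and scores it on another. The corpora, features (TF-IDF), and model (logistic regression) are placeholders chosen for brevity, not the architectures evaluated in the paper.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    def evaluate_transfer(train_texts, train_labels, test_texts, test_labels):
        """Train on one corpus, evaluate on another (possibly different) one."""
        vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
        X_train = vectorizer.fit_transform(train_texts)
        X_test = vectorizer.transform(test_texts)  # reuse training vocabulary

        model = LogisticRegression(max_iter=1000)
        model.fit(X_train, train_labels)
        return f1_score(test_labels, model.predict(X_test))

    # In-domain: train and test splits drawn from the same corpus.
    # in_domain = evaluate_transfer(a_train, a_train_y, a_test, a_test_y)
    # Cross-domain: trained on corpus A, tested on corpus B.
    # cross_domain = evaluate_transfer(a_train, a_train_y, b_test, b_test_y)

Comparing the two scores makes the transferability gap explicit: a model that looks strong in-domain can degrade sharply when the test corpus changes.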
We implemented simple adversarial techniques, e.g., inserting typos into hateful (negative) words. The approach is simple but effective, exposing strong vulnerabilities in most of the models. Character-based models show the best resistance. We also tested these attacks on an industrial implementation, Google Perspective, achieving good evasion results there as well.
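A minimal sketch of this style of attack: perturbing flagged words with a single typo and appending a benign word such as "love" (as in the paper's title). The set of hateful words and the typo placement are simplifications for illustration, not the exact attack variants evaluated.

    import random

    def insert_typo(word: str) -> str:
        """Swap two adjacent inner characters to produce a single typo."""
        if len(word) < 4:
            return word
        i = random.randrange(1, len(word) - 2)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]

    def evade(text: str, hateful_words: set[str]) -> str:
        """Perturb hateful words with typos, then append a benign word."""
        perturbed = [insert_typo(w) if w.lower() in hateful_words else w
                     for w in text.split()]
        return " ".join(perturbed) + " love"

    print(evade("I hate you", {"hate"}))  # e.g. "I htae you love"

Such perturbations break word-level tokenization, which is why character-based models, operating below the word level, tend to resist them better.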
Media coverage: Wired, Aalto University press release, New Scientist, PC Magazine, The Next Web, Digital Trends, Indian Express, Science Daily, Electronics 360, NDLYSS, Helsingin Sanomat (Finnish), Repubblica (Italian), Corriere Comunicazioni (Italian), Focus (German), Scinexx (German), Slashdot, The Register