With three classmates from Télécom, in partnership with researchers from Thalès (a French tech company), we reviewed and tested several neural network watermarking techniques.
These techniques aim to "mark" a neural network's weights so that its owner can prove ownership of the model simply by running inferences with it. For example, if the model were stolen and exposed on the internet, a few well-chosen prompts would reveal that it is indeed the stolen model.
Following this work, Erwan Fagnou and I decided to test an attack on current watermarking algorithms, to show that they are not as robust as previously thought. We designed the attack to be realistic.
The attack is called the "Ensemble attack", and is explained in detail here (in French). I will sketch the main ideas here:
Neural network watermarking consists in modifying your network so that it misclassifies, on purpose, some of its inputs. You (the owner) know which inputs are misclassified, but if someone stole your network, it would be hard for them to find out which inputs are the watermarks, or to modify the network to remove them. (More explanations here.)
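As a rough illustration, here is a minimal sketch in Python of what the owner's verification could look like. The `model.predict` API, the trigger set names, and the 0.9 threshold are all illustrative assumptions, not part of our actual setup:

```python
import numpy as np

# Hypothetical model API: model.predict(inputs) returns one class label
# per input. trigger_inputs / trigger_labels are the owner's secret
# watermark set; the labels are the *wrong* classes planted on purpose
# during watermarking.
def watermark_detection_rate(model, trigger_inputs, trigger_labels):
    predictions = model.predict(trigger_inputs)
    return float(np.mean(predictions == np.asarray(trigger_labels)))

# Ownership is claimed when the detection rate is far above chance;
# the 0.9 threshold here is an arbitrary illustrative value.
def is_watermarked(model, trigger_inputs, trigger_labels, threshold=0.9):
    return watermark_detection_rate(model, trigger_inputs, trigger_labels) >= threshold
```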
But since watermarks are misclassified inputs, a thief who has stolen multiple models can run each input through all of them and pick the answer by majority voting. In that case, your watermarked model will probably give a different answer from the other models, and its output will be outvoted. The thief has effectively escaped watermark detection.
This method works even if all of the stolen models have been watermarked, as long as they were watermarked with different inputs (which is the case in practice). A minimal sketch of the voting idea follows.
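This sketch again assumes a hypothetical `predict` method on each stolen model; it illustrates the majority-voting idea, not our actual implementation:

```python
from collections import Counter

def ensemble_predict(models, inputs):
    """Majority vote over the class predictions of several stolen models."""
    # Each (hypothetical) model returns one class label per input.
    all_preds = [model.predict(inputs) for model in models]
    voted = []
    for labels_for_one_input in zip(*all_preds):
        # On a watermark trigger, only the model that was watermarked with
        # this input gives the planted wrong label; the other models agree
        # on the (correct) class and win the vote.
        voted.append(Counter(labels_for_one_input).most_common(1)[0][0])
    return voted
```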
See our work for more details.
Our method reduces the watermark detection rate from an initial 98-100% to less than 20%, while maintaining accuracy (and even improving it, thanks to the ensembling).