In 2023, I did a six-month AI research internship at Continental's AI research center in Berlin. My research question was: how can self-supervised learning (SSL) improve object detection in the specific context of the automotive industry?
SSL consists of pre-training a neural network on unlabeled data using a pretext task (for example, comparing augmented views of images, or predicting missing parts of them). The hope is that, to succeed at this pretext task, the network has to learn image features that are also useful for the downstream task (in our case, object detection).
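To make the "comparing images" flavor concrete, here is a minimal sketch of a contrastive pretext objective (an InfoNCE-style loss, as used by SimCLR and MoCo). It is illustrative only; the function name and temperature value are my own choices:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Simplified InfoNCE loss: z1[i] and z2[i] are embeddings of two
    augmented views of the same image (a positive pair); all other rows
    in the batch act as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    # Similarity of every view in z1 against every view in z2.
    logits = z1 @ z2.T / temperature                      # (batch, batch)
    # For row i, the correct "class" is column i (its positive pair).
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```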
First, I did a literature review and looked for sub-questions targeting the specificities of automotive imagery that matter for self-supervised learning. I found that:
- Automotive images contain a lot of "useless pixels" with no objects of interest in them (road, sky, trees, ...).
- There are many small objects in the images (many cars ahead of the camera, many pedestrians around, ...).
- There is a lot of occlusion: cars (partially) hiding other cars or pedestrians, ...
- The class distribution has a long tail: some classes like cars or pedestrians are very common, while others like emergency vehicles are very rare in the datasets.
I also took into account business constraints from Continental:
- Our neural network models needed to be lightweight.
- I should mainly focus on CNN models, for easy integration with Continental's explainability framework.
- ...
During the first 2 months:
- I tested many of the existing methods for image self-supervised learning (SimCLR, MoCo, VICReg, VICRegL, ...).
- I tried to interpret the results by comparing performance on small vs. large objects and across classes. I also studied the impact of parameters that matter for resource consumption, such as image size and batch size (one plausible way to compute the object-size breakdown is sketched just below).
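The exact evaluation protocol is internal, but one common way to produce such an object-size breakdown is to bucket ground-truth boxes by area using the usual COCO thresholds (32² and 96² pixels). The helper below is hypothetical:

```python
def size_bucket(box, small_max=32**2, medium_max=96**2):
    """Classify a (x1, y1, x2, y2) box by pixel area, COCO-style."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"

# Example: group annotations so metrics can be reported per size bucket.
boxes = [(0, 0, 20, 20), (0, 0, 50, 50), (0, 0, 200, 150)]
print([size_bucket(b) for b in boxes])  # ['small', 'medium', 'large']
```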
During the remaining 4 months, I tested some of my own ideas:
- Can we adapt methods that work well for Transformers, especially the mask-and-predict reconstruction methods, so that they also work well for CNNs? (See the first sketch after this list.)
- Can we simplify the VICRegL algorithm by cropping the images as a pre-processing step (input of the network), instead of cropping the feature maps (output of the network)? (See the second sketch after this list.)
- Can we oversample some parts of the image (based on some heuristic) to avoid putting too much emphasis on "useless pixels"?
- Can we train the CNN with varying input image sizes, so that objects of different scales are better represented?
- In the cropping pre-processing step of the SSL algorithms, how can we make sure we don't crop out entire objects in cases of heavy occlusion?
- ...
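To make the first question concrete: the naive adaptation of mask-and-predict methods to CNNs masks square patches of the input image and reconstructs them with an encoder-decoder, computing the loss on the masked pixels only. This is a minimal sketch of that baseline, not the internal implementation; the patch size and masking ratio are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_patch_mask(images: torch.Tensor, patch: int = 32, ratio: float = 0.5) -> torch.Tensor:
    """Binary (B, 1, H, W) mask where 1 marks masked-out patches.
    Assumes H and W are multiples of `patch`."""
    b, _, h, w = images.shape
    grid = torch.rand(b, 1, h // patch, w // patch, device=images.device) < ratio
    return grid.float().repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)

def masked_reconstruction_loss(encoder_decoder: nn.Module, images: torch.Tensor) -> torch.Tensor:
    mask = random_patch_mask(images).expand_as(images)
    corrupted = images * (1.0 - mask)        # zero out the masked patches
    reconstruction = encoder_decoder(corrupted)
    # MSE on the masked pixels only, as in mask-and-predict methods.
    per_pixel = F.mse_loss(reconstruction, images, reduction="none")
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)
```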
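For the second question: VICRegL adds a matching loss between local feature-map vectors, while the simplification keeps the plain, global VICReg criterion (from Bardes et al.) and moves the locality into the input cropping instead. The loss below follows the published VICReg formulation; the cropping pipeline itself is omitted:

```python
import torch
import torch.nn.functional as F

def off_diagonal(m: torch.Tensor) -> torch.Tensor:
    """All off-diagonal entries of a square matrix."""
    n = m.size(0)
    return m.flatten()[:-1].view(n - 1, n + 1)[:, 1:].flatten()

def vicreg_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """Global VICReg criterion: invariance + variance + covariance terms."""
    n, d = z1.shape
    sim = F.mse_loss(z1, z2)                     # invariance: match the two views
    std1 = torch.sqrt(z1.var(dim=0) + 1e-4)      # variance: keep each embedding
    std2 = torch.sqrt(z2.var(dim=0) + 1e-4)      # dimension's std above 1
    var = F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean()
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
    cov1 = (z1c.T @ z1c) / (n - 1)               # covariance: decorrelate
    cov2 = (z2c.T @ z2c) / (n - 1)               # embedding dimensions
    cov = off_diagonal(cov1).pow(2).sum() / d + off_diagonal(cov2).pow(2).sum() / d
    return sim_w * sim + var_w * var + cov_w * cov
```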
I cannot give all the details about my results, as I signed an NDA and the results are not publicly available, but in short:
- I found new and better ways to adapt reconstruction methods to CNNs: they beat the naive adaptation, but still lagged behind contrastive methods. I concluded that CNNs are inherently a poor match for masked-prediction SSL.
- I showed that it is possible to get results similar to VICRegL's by cropping the images in a pre-processing step, instead of using its complex and costly matching loss. Unfortunately, even with long training runs, my method never exceeded VICRegL.
- Counter-intuitively, oversampling some parts of the image was useless: at best neutral, at worst detrimental to the model's performance.
- Training a CNN with varying input sizes is possible if we use a global pooling layer (see the first sketch after this list). Unfortunately, there was little to no performance benefit. I suspect the advantages of varying input sizes are cancelled out by the global pooling layer, which loses a lot of information.
- Avoiding cropping out entire objects in cases of occlusion boosts performance. I showed this with an oracle that avoids cropping through objects, i.e., by pre-training on a dataset that actually contained annotations, used here only to steer the cropping (see the second sketch after this list). But I did not manage to find a heuristic (the image gradient? edge detection?) that predicts object boundaries well enough to match the oracle.
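On varying input sizes (first sketch): what makes this possible at all is that a fully convolutional backbone followed by global average pooling yields a fixed-length vector regardless of input resolution. A self-contained illustration with a stock torchvision ResNet, not the internal lightweight models:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Drop the classifier and built-in pooling, keep the convolutional trunk.
backbone = nn.Sequential(*list(resnet18(weights=None).children())[:-2])
pool = nn.AdaptiveAvgPool2d(1)  # collapses any (512, H', W') map to (512, 1, 1)

for size in (160, 224, 320):                  # vary the input size per batch
    x = torch.randn(4, 3, size, size)
    features = pool(backbone(x)).flatten(1)   # always shape (4, 512)
    print(size, tuple(features.shape))
```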
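And the occlusion oracle (second sketch) can be pictured as rejection sampling. This is a hypothetical re-implementation, not the internal code: the ground-truth boxes, which SSL pre-training normally never sees, are used to reject crops that would truncate an object:

```python
import random

def crop_cuts_object(crop, boxes):
    """crop and boxes are (x1, y1, x2, y2) tuples; True if the crop
    partially intersects a box (i.e. would truncate that object)."""
    cx1, cy1, cx2, cy2 = crop
    for x1, y1, x2, y2 in boxes:
        overlaps = x1 < cx2 and x2 > cx1 and y1 < cy2 and y2 > cy1
        contained = x1 >= cx1 and y1 >= cy1 and x2 <= cx2 and y2 <= cy2
        if overlaps and not contained:
            return True
    return False

def oracle_crop(img_w, img_h, boxes, crop_size, max_tries=50):
    """Rejection-sample a square crop that never truncates an annotated object."""
    for _ in range(max_tries):
        x = random.randint(0, img_w - crop_size)
        y = random.randint(0, img_h - crop_size)
        crop = (x, y, x + crop_size, y + crop_size)
        if not crop_cuts_object(crop, boxes):
            return crop
    return crop  # fall back to the last candidate
```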
I chose to work with PyTorch, timm, and MMSelfSup for fast prototyping. Within the 6 months, learning a framework like MMSelfSup paid off: it saved me a lot of implementation time, letting me benchmark many existing methods quickly and then adapt their code to my own ideas.
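As an example of the prototyping speed these tools give (an illustrative snippet, not the internal setup), timm produces a headless CNN backbone in one line, ready to plug under different SSL heads:

```python
import timm
import torch

# num_classes=0 strips the classifier head, so the model returns the
# pooled backbone features instead of class logits.
backbone = timm.create_model("resnet18", pretrained=False, num_classes=0)
features = backbone(torch.randn(2, 3, 224, 224))
print(tuple(features.shape))  # (2, 512)
```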
My initial objective was to publish my results in a scientific paper. Unfortunately, we judged that most of my results were "negative results" (i.e., the methods did not work as well as hoped) and that the work would probably not be publishable. So at the end of my internship, I focused on testing new ideas and on helping my colleagues understand and reuse my work, rather than on writing a paper.