Self-Supervised Geometric Features Discovery via Interpretable Attention for Vehicle Re-Identification (Jan 2022 – Present, not yet completed)
Introduction: Vehicle ReID is a fundamental yet challenging problem in video surveillance, due to the subtle discrepancies among vehicles of identical make and the large variation across viewpoints of the same instance. To address this problem, we learn informative geometric features for vehicle ReID without supervision from fine-grained annotations. An interpretable attention module, whose design is easy to explain and whose attention concentrates on physically meaningful locations, is implemented to automatically highlight regions of interest.
Overview of the framework: Self-supervised geometric features discovery via interpretable attention consists of a Global Branch (GB), a Self-supervised Learning Branch (SLB), and a Geometric Features Branch (GFB). Key components include the Interpretable Attention Module (IAM), Batch Normalization Neck (BNNeck), Cosine Classifier (CC), Global Average Pooling (GAP), Hard Mining Triplet Loss (Tri), and Smoothed Cross Entropy Loss (SCE). The framework is shown in the figure below:
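To make the head design concrete, here is a minimal sketch of how a BNNeck combined with a cosine classifier is commonly implemented: the pre-BN feature feeds the triplet loss while the post-BN, L2-normalized feature feeds the classification loss. The dimensions, the frozen BN bias, and the `scale` temperature are assumptions for illustration, not the project's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BNNeckHead(nn.Module):
    """Sketch of BNNeck + cosine classifier (hypothetical dimensions)."""
    def __init__(self, feat_dim=2048, num_classes=576):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.bn.bias.requires_grad_(False)   # common BNNeck convention
        # Cosine classifier: class weights compared to features by cosine similarity
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = 16.0                    # assumed temperature

    def forward(self, feat):
        feat_bn = self.bn(feat)
        # cosine logits = scale * <normalized feature, normalized class weight>
        logits = self.scale * F.linear(F.normalize(feat_bn), F.normalize(self.weight))
        return feat, logits                  # feat -> triplet loss, logits -> SCE loss

head = BNNeckHead()
x = torch.randn(8, 2048)
feat, logits = head(x)
print(feat.shape, logits.shape)  # torch.Size([8, 2048]) torch.Size([8, 576])
```

The split is the point of BNNeck: the triplet loss prefers features in the original (unnormalized) metric space, while the classifier works better on batch-normalized, cosine-compared features.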
Framework
We give the architecture configurations above; each color represents a subnetwork. Following prior research, we choose ResNet50, with the stride in conv5_x changed from 2 to 1, as the backbone of GB. It is divided into two subnetworks, i.e., the first (conv1, conv2_x, conv3_x) and the second (conv4_x, conv5_x), denoted by green and red respectively. The shared encoder between SLB and GFB is implemented with ResNet18 (orange), whose stride in conv4_x and conv5_x is set to 1. In SLB, another subnetwork (purple), consisting of two basic ResNet blocks with stride = 2, is appended to the encoder to further condense the features. In GFB, each image is first downsampled by a factor of 8 by the attention encoder, and the resulting tensor is processed by IAM to produce the attention map. This map is broadcast to every channel of the features from the first subnetwork of the GB backbone via element-wise multiplication, followed by another subnetwork (blue) composed of conv4_x′ and conv5_x′.
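The broadcast step above reduces to a single element-wise multiplication thanks to PyTorch's broadcasting rules. The shapes below are illustrative assumptions (a 256×256 input downsampled 8× gives 32×32 spatial maps; 512 channels is a guess for the conv3_x output), not values confirmed by the text:

```python
import torch

# Hypothetical shapes: features from the first GB subnetwork and a
# single-channel attention map produced by the IAM.
feats = torch.randn(4, 512, 32, 32)               # (B, C, H, W)
attn = torch.sigmoid(torch.randn(4, 1, 32, 32))   # (B, 1, H, W)

# The size-1 channel dimension of `attn` broadcasts across all C channels.
attended = feats * attn
print(attended.shape)  # torch.Size([4, 512, 32, 32])
```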
Experimentation:
Dataset: VeRi-776 contains 49,357 images of 776 vehicles; its training set comprises 37,778 images of 576 identities.
Implementation: The framework is implemented in PyTorch. We use the Adam optimizer (β1 = 0.9, β2 = 0.999) with weight decay = 5e-4, images resized to 256 × 256, mini-batch size = 28, learning rate = 1e-4, and 20 training epochs.
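The optimizer configuration listed above maps directly onto PyTorch's Adam constructor. A small stand-in module is used here since the full framework is not reproduced:

```python
import torch

# Stand-in module; in practice the full framework's parameters are passed.
model = torch.nn.Linear(8, 4)

# Hyperparameters as listed in the implementation details
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=5e-4,
)
print(optimizer.param_groups[0]["lr"])  # 0.0001
```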
Evaluation: The approach is evaluated with four metrics widely used in the ReID literature, i.e., image-to-track retrieval mean Average Precision (tmAP), image-to-image retrieval mAP (imAP), and Top-1 / Top-5 accuracy.
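As a rough illustration of the Top-k metric, the sketch below counts a query as correct if any of its k nearest gallery images shares its identity. This is a simplified toy version; the standard ReID protocol additionally excludes gallery images from the same camera as the query, which is omitted here.

```python
import torch

def topk_accuracy(dist, query_ids, gallery_ids, k=1):
    """Simplified Top-k retrieval accuracy over a query-gallery distance matrix."""
    idx = dist.argsort(dim=1)[:, :k]                            # k nearest gallery indices
    hit = (gallery_ids[idx] == query_ids.unsqueeze(1)).any(dim=1)
    return hit.float().mean().item()

# Toy example: 2 queries vs. 4 gallery images with identities 0..3
dist = torch.tensor([[0.1, 0.9, 0.5, 0.8],
                     [0.7, 0.2, 0.6, 0.9]])
q_ids = torch.tensor([0, 1])
g_ids = torch.tensor([0, 1, 2, 3])
print(topk_accuracy(dist, q_ids, g_ids, k=1))  # 1.0
```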