Abstract
In recent years, with the rapid development of intelligent surveillance devices and the growing demand for public safety, large numbers of cameras have been deployed in public places such as airports, communities, streets, and campuses. These camera networks typically span large geographic areas with non-overlapping coverage and generate massive amounts of surveillance video every day. This video data can be used to analyze the activity patterns and behavioral characteristics of pedestrians in the real world, supporting applications such as target detection, multi-camera target tracking, and crowd behavior analysis.
Introduction
Person Re-ID can be traced back to the problem of multi-target multi-camera tracking (MTMCT), and aims to determine whether pedestrians captured by different cameras, or pedestrian images from different video clips of the same camera, are the same person. Fig-1 illustrates an example of a surveillance area monitored by multiple cameras with non-overlapping fields of view.
A person Re-ID system mainly consists of two stages:
Pedestrian Detection: For pedestrian detection, many algorithms with high detection accuracy have emerged, such as YOLO, SSD and Fast R-CNN.
Person Re-ID: Person Re-ID builds a large image set (the gallery) from the detected pedestrian images and retrieves matching images from it for a given query (probe) image, so person Re-ID can also be regarded as an image retrieval task.
The key of person Re-ID is to learn discriminative features of pedestrians to distinguish between pedestrian images with the same identity and those with different identities. Fig-2 shows the complete flow of the Person Re-ID System.
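Viewed as retrieval, the matching step reduces to ranking gallery embeddings by their distance to a probe embedding. A minimal sketch, assuming features have already been extracted by some network (all names and sizes here are illustrative):

```python
import numpy as np

def rank_gallery(probe_emb, gallery_embs):
    """Rank gallery images by cosine similarity to a probe embedding."""
    # L2-normalize so that a dot product equals cosine similarity
    probe = probe_emb / np.linalg.norm(probe_emb)
    gallery = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    similarity = gallery @ probe           # (num_gallery,)
    return np.argsort(-similarity)         # indices, most similar first

# Random embeddings stand in for real network outputs
probe = np.random.randn(128)
gallery = np.random.randn(1000, 128)
ranking = rank_gallery(probe, gallery)
print(ranking[:10])                        # top-10 gallery candidates for the probe
```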
Traditional person Re-ID methods mainly relied on hand-crafted, fixed discriminative features or on learning better similarity measures; both approaches were error-prone and time-consuming, which greatly limited the accuracy and real-time performance of person Re-ID. In 2014, deep learning was first applied to the person Re-ID field.
[1] classified deep learning-based person Re-ID methods into four categories, comprising methods based on:
deep metric learning
local feature learning
generative adversarial learning
sequence feature learning
In addition, they subdivided the above four categories according to their methodologies and motivations, discussing and comparing the advantages and limitations of several subcategories. Fig-4 shows the classification structure of deep learning-based person Re-ID methods.
Deep Metric Learning:
Deep metric learning (DML) is a subfield of machine learning and computer vision that focuses on learning a feature representation (embedding) of data points such that the similarity between data points in the embedding space reflects their semantic similarity in the original data space.
Classification Loss: Classification loss functions such as softmax cross-entropy are used in tasks where the goal is to assign input samples to predefined classes or categories. It penalizes the difference between the predicted class probabilities and the true class labels.
Verification Loss: Verification loss treats Re-ID as a binary classification problem over image pairs, predicting whether two pedestrian images belong to the same identity. It penalizes incorrect same/different decisions and is often combined with a classification loss.
Contrastive Loss: Contrastive loss is used in siamese or triplet networks for learning embeddings such that similar samples are brought closer together while dissimilar samples are pushed farther apart. It encourages the embeddings of similar pairs to be close in the embedding space and dissimilar pairs to be separated by a margin.
Triplet Loss: Triplet loss is another loss function used in siamese or triplet networks for learning embeddings. It involves selecting triplets of anchor, positive, and negative samples and penalizes the distance between the anchor and positive samples while pushing the anchor and negative samples apart by a margin.
Quadruplet Loss: Quadruplet loss is an extension of triplet loss in which each training sample consists of an anchor, a positive, and two different negative samples (a quadruplet). By adding a constraint involving the second negative, it aims to further improve the discriminative power of the learned embeddings, leading to better separation between classes.
An illustration of the most commonly used loss functions is shown in Fig-5. These deep metric learning methods enable models to learn discriminative features automatically, removing the heavy labor cost of hand-designing features.
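As a concrete example, here is a minimal PyTorch sketch of the triplet loss described above (equivalent in spirit to torch.nn.TripletMarginLoss; batch and embedding sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Pull anchor-positive pairs together, push anchor-negative apart by a margin."""
    d_ap = F.pairwise_distance(anchor, positive)   # distances of same-identity pairs
    d_an = F.pairwise_distance(anchor, negative)   # distances of different-identity pairs
    return F.relu(d_ap - d_an + margin).mean()     # hinge: zero once d_an > d_ap + margin

# Toy batch of 32 embeddings of dimension 128
a, p, n = (torch.randn(32, 128) for _ in range(3))
print(triplet_loss(a, p, n))
```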
Local Feature Learning:
Local feature learning involves extracting features from specific regions or patches within an input data sample, such as an image or a video frame. Instead of considering the entire input as a whole, local feature learning focuses on capturing information from localized regions, which can be beneficial for tasks like object detection, where objects may appear at different locations within an image.
Predefined Stripe Segmentation: Predefined stripe segmentation refers to dividing an input image or data sample into predefined horizontal or vertical stripes or segments. This segmentation approach can be useful for processing data in a structured manner, especially when the input data has a specific layout or organization.
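A minimal PyTorch sketch of this idea, in the spirit of part-based models that pool horizontal stripes of a CNN feature map (the stripe count and feature-map shape are assumptions):

```python
import torch

def horizontal_stripe_features(feature_map, num_stripes=6):
    """Split a CNN feature map into horizontal stripes and pool each one."""
    stripes = feature_map.chunk(num_stripes, dim=2)   # split along the height axis
    # Average-pool each stripe into a per-part feature vector
    return [s.mean(dim=(2, 3)) for s in stripes]      # num_stripes tensors of (n, c)

fmap = torch.randn(8, 2048, 24, 8)    # e.g. a ResNet-50 output for 8 pedestrian images
parts = horizontal_stripe_features(fmap)
print(len(parts), parts[0].shape)     # 6 part features of shape (8, 2048)
```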
Multi-scale Fusion: Multi-scale fusion involves combining information from multiple scales or resolutions of the input data. By considering features at different scales, a model can capture both fine-grained details and global context, leading to more robust representations.
Soft Attention: Soft attention is a mechanism used in neural networks to dynamically weight the importance of different parts of the input data. Unlike hard attention, where the model focuses on a single part of the input, soft attention computes a soft distribution over the entire input, allowing the model to attend to multiple parts simultaneously.
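A minimal PyTorch sketch of soft spatial attention, using a single learned 1x1 convolution to score locations (purely illustrative, not any specific published module):

```python
import torch
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """Weight every spatial location of a feature map by a learned score."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one score per location

    def forward(self, x):                      # x: (n, c, h, w)
        n, c, h, w = x.shape
        # Softmax over all h*w locations gives a soft distribution, not a hard pick
        weights = torch.softmax(self.score(x).view(n, 1, -1), dim=-1).view(n, 1, h, w)
        return (x * weights).sum(dim=(2, 3))   # attention-pooled feature, (n, c)

attn = SoftSpatialAttention(256)
print(attn(torch.randn(4, 256, 16, 8)).shape)  # torch.Size([4, 256])
```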
Semantic Extraction: Semantic extraction refers to the process of extracting high-level semantic information from raw data. In the context of computer vision, semantic extraction may involve identifying objects, regions of interest, or meaningful patterns within images or videos.
Global-Local Feature Learning: Global-local feature learning involves jointly learning global and local features from the input data. Global features capture the overall structure or context of the data, while local features focus on specific details or regions. By combining both types of features, a model can achieve a better understanding of the input data and perform more effectively on tasks like object recognition or scene understanding.
Generative Adversarial Learning:
Generative Adversarial Learning, commonly referred to as GAN, is a framework in machine learning where two neural networks, called the generator and the discriminator, are trained simultaneously through adversarial training. The generator network generates fake samples (e.g., images) from random noise, while the discriminator network tries to distinguish between real and fake samples. Through this adversarial process, the generator learns to produce increasingly realistic samples, while the discriminator learns to better differentiate between real and fake samples. GANs have been widely used for various tasks, including image generation, image-to-image translation, super-resolution, and style transfer.
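A minimal sketch of the adversarial training loop on flattened vectors (architectures, sizes, and the random "real" batch are placeholders):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    real = torch.randn(32, 784)            # stand-in for a batch of real images
    fake = G(torch.randn(32, 64))          # generator maps noise to fake samples

    # Discriminator step: label real -> 1, fake -> 0
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator call fakes real
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```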
Image-to-Image Style Transfer: Image-to-image style transfer is a task in computer vision where the style or appearance of an input image is transferred to another image while preserving its content. Unlike traditional image processing techniques, which may involve manual manipulation or filtering, image-to-image style transfer algorithms typically leverage deep learning models trained on large datasets to automatically learn the mapping between input and output images. Style transfer can be used for various creative applications, such as artistic rendering, photo enhancement, and visual effects.
Data Augmentation (Data Enhancement): Data augmentation, also known as data enhancement, is a technique used to artificially increase the size and diversity of a dataset by applying various transformations to the existing data samples. Common data augmentation techniques include random rotations, translations, flips, scaling, cropping, color jittering, and adding noise. Data augmentation is widely used in machine learning, especially in tasks like image classification, object detection, and natural language processing, where having a large and diverse dataset is crucial for training robust models.
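A minimal torchvision example of such an augmentation pipeline as commonly used for pedestrian images (the specific transforms and parameters are illustrative choices, not a prescribed recipe):

```python
from torchvision import transforms

# Augmentations applied on the fly to each training image
train_transform = transforms.Compose([
    transforms.Resize((256, 128)),                 # typical pedestrian aspect ratio
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomCrop((256, 128), padding=10),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5),               # occlusion-style noise on the tensor
])
```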
Invariant Feature Learning: Invariant feature learning is the process of extracting features from data that are invariant or robust to certain transformations or variations. For example, in computer vision, invariant feature learning aims to extract features from images that remain consistent across changes in lighting, viewpoint, scale, or other factors. Learning invariant features can improve the generalization and robustness of machine learning models, making them less sensitive to irrelevant variations in the input data.
Sequence Feature Learning:
Sequence Feature Learning refers to the process of extracting meaningful features from sequential data, such as time-series data, text sequences, or sequential images. In the context of deep learning, sequence feature learning often involves using recurrent neural networks (RNNs), convolutional neural networks (CNNs), or a combination of both to capture temporal dependencies and spatial patterns in the sequential data. Sequence feature learning is widely used in various tasks, including natural language processing (NLP), speech recognition, time-series forecasting, and action recognition in videos.
Optical Flow: Optical flow is a computer vision technique used to estimate the motion of objects in a sequence of images or video frames. It works by tracking the movement of pixels between consecutive frames and estimating the velocity or displacement of each pixel. Optical flow can be used for various applications, including object tracking, motion analysis, video stabilization, and visual odometry in robotics and autonomous vehicles.
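A minimal OpenCV sketch of dense optical flow with Farnebäck's method, run here on two synthetic frames so the example is self-contained:

```python
import cv2
import numpy as np

# Two synthetic grayscale frames: a bright square shifted a few pixels
prev = np.zeros((120, 160), dtype=np.uint8)
curr = np.zeros((120, 160), dtype=np.uint8)
prev[40:70, 50:80] = 255
curr[40:70, 55:85] = 255          # same square moved 5 px to the right

# Dense flow: one (dx, dy) displacement vector per pixel
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(magnitude.mean())           # average motion strength between the two frames
```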
3D Convolutional Neural Network (3D CNN): A 3D convolutional neural network (3D CNN) is a type of deep learning architecture designed to process spatiotemporal data, such as video sequences or volumetric medical images. Unlike 2D CNNs, which operate on 2D spatial grids, 3D CNNs operate on 3D spatiotemporal volumes, allowing them to capture both spatial and temporal features directly from the input data. 3D CNNs have been successfully applied to tasks such as action recognition, video classification, medical image analysis, and dynamic scene understanding.
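A minimal PyTorch sketch showing how a 3D convolution slides over time as well as space (clip shape and channel counts are illustrative):

```python
import torch
import torch.nn as nn

# A batch of 2 clips: 3 channels, 16 frames of 112x112 pixels each
clip = torch.randn(2, 3, 16, 112, 112)

conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), padding=1)   # kernel spans time and space
out = conv3d(clip)
print(out.shape)   # torch.Size([2, 64, 16, 112, 112])
```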
Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM): Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are types of neural network architectures designed to process sequential data by capturing dependencies over time. RNNs have recurrent connections that allow information to persist and flow through the network across time steps, making them suitable for tasks involving sequential data, such as natural language processing (NLP), speech recognition, and time-series prediction. Long Short-Term Memory (LSTM) networks are a specialized type of RNNs that address the vanishing gradient problem by introducing gated units, which control the flow of information through the network and allow it to learn long-term dependencies more effectively. RNNs and LSTMs have been widely used in various sequential modeling tasks, including language modeling, machine translation, sentiment analysis, and speech synthesis.
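A minimal PyTorch sketch of the common pattern of feeding per-frame CNN features through an LSTM to obtain one feature per tracklet (the feature extractor is assumed to exist; dimensions are placeholders):

```python
import torch
import torch.nn as nn

class FrameSequenceEncoder(nn.Module):
    """Encode a sequence of per-frame feature vectors into one clip-level feature."""
    def __init__(self, frame_dim=2048, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(frame_dim, hidden, batch_first=True)

    def forward(self, frame_feats):             # (batch, time, frame_dim)
        outputs, _ = self.lstm(frame_feats)
        return outputs[:, -1]                   # last hidden state as the clip feature

enc = FrameSequenceEncoder()
feats = torch.randn(4, 8, 2048)   # 4 tracklets, 8 frames each, CNN feature per frame
print(enc(feats).shape)           # torch.Size([4, 512])
```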
Spatial-Temporal Attention: Spatial-temporal attention mechanisms extend the concept of attention mechanisms in neural networks to spatiotemporal data, such as videos or sequential images. These mechanisms dynamically focus on different spatial and temporal regions of the input data, allowing the model to selectively attend to relevant features at different points in time and space. Spatial-temporal attention can enhance the performance of deep learning models for tasks such as action recognition, video captioning, and video generation by enabling them to capture long-range dependencies and attend to informative regions in the input sequence.
Graph Convolutional Networks (GCNs): Graph convolutional networks (GCNs) are a type of neural network architecture designed to operate on graph-structured data, where entities (nodes) are connected by edges that represent relationships or interactions between them. GCNs generalize the concept of convolutional neural networks (CNNs) to graphs, allowing them to learn feature representations from graph-structured data by aggregating information from neighboring nodes. GCNs have been applied to various tasks, including node classification, link prediction, community detection, recommendation systems, and molecular property prediction, where the data can be naturally represented as graphs.
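A minimal PyTorch sketch of one graph convolution layer with symmetric normalization, in the style of Kipf and Welling (dense adjacency for simplicity):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: aggregate normalized neighbor features, then project."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                         # x: (n, in_dim), adj: (n, n)
        a_hat = adj + torch.eye(adj.size(0))           # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt     # D^-1/2 (A+I) D^-1/2
        return torch.relu(self.linear(norm_adj @ x))

layer = GCNLayer(16, 32)
x = torch.randn(5, 16)                                 # 5 nodes with 16-d features
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()                    # make the graph undirected
print(layer(x, adj).shape)                             # torch.Size([5, 32])
```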
Applications
Surveillance and Security
Law Enforcement
Retail Analytics
Smart Cities
Border Control and Immigration
Access Control and Authentication
Marketing and Advertising
Human-Computer Interaction (HCI)
Fashion and Retail Intelligence
Event Management
Multi-Camera Person Tracking
Activity Recognition
Human Behavior Analysis
Public Safety in Crowded and Sensitive Places
Related Datasets
Image Based:
ViPeR
CUHK01
CUHK02
CUHK03
Market-1501
DukeMTMC-ReID
MSMT17
Airport
Occluded-DukeMTMC
ImageNet
Video Based:
PRID 2011
iLIDS-VID
MARS
DukeMTMC-VID
LPW
AI City Challenge 2023 Track-1
MMPTRACK
Evaluation Measures
Cumulative Matching Characteristics (CMC): CMC curves are the most popular evaluation metrics for person re-identification methods. Consider a simple single-gallery-shot setting, where each gallery identity has only one instance. For each query, an algorithm ranks all the gallery samples according to their distances to the query from small to large, and the CMC top-k accuracy is

$$\mathrm{Acc}_k = \begin{cases} 1 & \text{if the top-}k \text{ ranked gallery samples contain the query identity,} \\ 0 & \text{otherwise,} \end{cases}$$

which is a shifted step function. The final CMC curve is computed by averaging the shifted step functions over all the queries.
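A minimal single-gallery-shot CMC computation from a precomputed distance matrix (simplified: real protocols additionally filter out same-camera gallery entries):

```python
import numpy as np

def cmc(dist, query_ids, gallery_ids, topk=10):
    """CMC top-k accuracy from a (num_query, num_gallery) distance matrix."""
    hits = np.zeros(topk)
    for q in range(dist.shape[0]):
        ranked = gallery_ids[np.argsort(dist[q])]        # gallery sorted near-to-far
        first_hit = np.where(ranked == query_ids[q])[0][0]  # assumes a match exists
        if first_hit < topk:
            hits[first_hit:] += 1                        # the shifted step function
    return hits / dist.shape[0]                          # averaged over all queries

dist = np.random.rand(100, 500)
q_ids = np.random.randint(0, 100, size=100)
g_ids = np.arange(500) % 100                             # every identity appears
print(cmc(dist, q_ids, g_ids)[:5])                       # top-1..top-5 accuracy
```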
mAP (mean Average Precision): mAP is the average precision calculated across multiple queries. It measures the quality of retrieval results by considering both precision and recall, where precision is the ratio of relevant instances retrieved to the total number of retrieved instances, and recall is the ratio of relevant instances retrieved to the total number of relevant instances in the dataset:

$$\mathrm{mAP} = \frac{\sum_{q=1}^{Q} \mathrm{AveP}(q)}{Q}$$

where Q is the number of queries in the set and AveP(q) is the average precision (AP) for a given query q.
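A minimal sketch of AP and mAP over ranked retrieval results, where each ranked gallery item is marked 1 if it matches the query identity and 0 otherwise (toy data only):

```python
import numpy as np

def average_precision(ranked_matches):
    """AP for one query: ranked_matches holds 1/0 per ranked gallery item."""
    ranked_matches = np.asarray(ranked_matches)
    if ranked_matches.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(ranked_matches)
    precision_at_k = cum_hits / (np.arange(len(ranked_matches)) + 1)
    # Average the precision values at the positions of the relevant items only
    return (precision_at_k * ranked_matches).sum() / ranked_matches.sum()

queries = [[1, 0, 1, 0], [0, 1, 1, 0]]                   # two toy ranked result lists
print(np.mean([average_precision(q) for q in queries]))  # mAP
```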
IDF1 (ID F1 Score): IDF1 is the F1 score for identity tracking, which evaluates the overall performance of multi-object tracking systems. It considers both the detection and association aspects of tracking, and is the harmonic mean of identity precision (IDP) and identity recall (IDR).
IDP (ID Precision): IDP measures the precision of identity tracking, indicating the ratio of correctly associated identities to the total number of associations made by the tracker.
IDR (ID Recall): IDR measures the recall of identity tracking, indicating the ratio of correctly associated identities to the total number of ground truth identities.
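For reference, writing IDTP, IDFP, and IDFN for identity-level true positives, false positives, and false negatives, the three identity metrics are:

$$\mathrm{IDP} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFP}}, \qquad \mathrm{IDR} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFN}}, \qquad \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}.$$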
Precision and Recall: Precision is the ratio of true positive results to the sum of true positive and false positive results. It measures the accuracy of positive predictions made by a model. Recall is the ratio of true positive results to the sum of true positive and false negative results. It measures the ability of a model to capture all positive instances in the dataset.
MOT (Multi-Object Tracking): MOT metrics evaluate the performance of algorithms that track multiple objects over time in video sequences. They consider factors such as detection precision, detection recall, and association accuracy to assess the overall tracking performance.
MOTA (Multi-Object Tracking Accuracy): MOTA is a comprehensive metric for evaluating MOT systems, considering various factors such as false positives, false negatives, and mismatches in association. It provides a single measure of tracking accuracy by accounting for both detection and tracking errors.
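In its standard form, MOTA aggregates errors over all frames t:

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},$$

where FN_t, FP_t, IDSW_t, and GT_t are the false negatives, false positives, identity switches, and ground-truth objects in frame t.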
Challenges / Research Gaps
Cross-modal person Re-ID.
High-performance semi-supervised and unsupervised person Re-ID.
Domain adaptation person Re-ID.
Person Re-ID in the 3D space.
Fast person Re-ID.
Decentralized learning person Re-ID.
End-to-end person Re-ID system.
Survey Papers
Deep learning-based person re-identification methods: A survey and outlook of recent works
Team Members
Dr Usama Ijaz Bajwa
usamabajwa@cuilahore.edu.pk
Co-PI, Video Analytics Lab, National Centre in Big Data and Cloud Computing,
Program Chair (FIT 2019),
HEC Approved PhD Supervisor,
Assistant Professor & Associate Head of Department,
Department of Computer Science,
COMSATS University Islamabad, Lahore Campus, Pakistan

Mian Muhammad Abu Bakar
mianabubakarmughees@gmail.com
Research Scholar (RCS),
Department of Computer Science,
COMSATS University Islamabad, Lahore Campus