This study proposes a three-stage synthetic data generation framework for face recognition that mitigates reliance on large-scale real-world datasets
while enhancing attribute controllability and recognition accuracy. In the first stage, the intra-class distribution of an existing dataset is optimized to
construct RepSet-DC, a compact yet recognition-effective dataset used to train a baseline model. In the second stage, a source face generator and a
Dual-Modal Diffusion Model (DMD) are employed to simultaneously control age and pose variations. A baseline-guided sample selection mechanism
identifies the most recognition-beneficial synthetic images, which are then merged with RepSet-DC to form the expanded RepSet-X. In the third stage,
knowledge distillation is applied to train a lightweight student model on RepSet-X, improving the recognition performance achievable with synthetic training data and narrowing the gap to models trained on real data. This framework addresses key limitations of existing synthesis methods, including insufficient intra-class diversity, the lack of effective sample selection, and the performance disparity between synthetic and real training data, achieving results on par with state-of-the-art approaches.
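For concreteness, the three stages can be summarized in pseudocode. The sketch below is purely illustrative: every name it uses (optimize_intra_class, train_recognition_model, SourceFaceGenerator, DMD, sample_pose_age_grid, select_by_baseline, distill) is a hypothetical placeholder standing in for the components described above, not a released implementation.

```python
# Illustrative pseudocode for the three-stage pipeline (all helpers hypothetical).

def build_repset_x(real_dataset, n_synthetic_ids, samples_per_id):
    # Stage 1: optimize the intra-class distribution of an existing dataset
    # to obtain the compact RepSet-DC and train a baseline model on it.
    repset_dc = optimize_intra_class(real_dataset)
    baseline = train_recognition_model(repset_dc)

    # Stage 2: generate source identities, expand each with pose/age variations
    # from the DMD, and keep only the samples the baseline model deems
    # beneficial for recognition.
    generator, dmd = SourceFaceGenerator(), DMD()
    selected = []
    for _ in range(n_synthetic_ids):
        source = generator.sample()                    # synthetic identity I_s
        variants = [dmd.transform(source, pose, age)   # pose/age-varied copies
                    for pose, age in sample_pose_age_grid(samples_per_id)]
        selected += select_by_baseline(baseline, source, variants)
    repset_x = list(repset_dc) + selected              # merge RepSet-DC with selections

    # Stage 3: distill the baseline (teacher) into a lightweight student model
    # trained on the expanded RepSet-X.
    student = distill(teacher=baseline, dataset=repset_x)
    return repset_x, student
```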
The core of this research is the proposed Dual-Modal Diffusion Model (DMD). Its primary objective is to transform synthetic identity images, denoted I_s, produced by a source face generator, into images that exhibit variations in pose and age while preserving the original identity features. Through controllable pose and age transformation, the DMD significantly enhances the intra-class variation of a single identity in terms of pose and age, thereby improving overall face recognition performance. Unlike previous face generation models that are limited to manipulating a single attribute, the proposed model can simultaneously control both pose and age.
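To make this transformation concrete, the following minimal PyTorch-style sketch shows one plausible interface for such a conditional diffusion model: a denoising network conditioned jointly on an identity embedding of I_s and on target pose and age codes. The class structure, the denoiser and id_encoder components, and the denoise_step helper are assumptions introduced for illustration and do not describe the actual DMD architecture.

```python
import torch

class DMD(torch.nn.Module):
    """Sketch of a dual-modal conditional diffusion interface (illustrative only)."""

    def __init__(self, denoiser, id_encoder, num_steps=50):
        super().__init__()
        self.denoiser = denoiser        # noise-prediction network (assumed component)
        self.id_encoder = id_encoder    # frozen face-recognition encoder (assumed component)
        self.num_steps = num_steps

    @torch.no_grad()
    def transform(self, source_img, target_pose, target_age):
        # Identity condition: an embedding of the source face I_s, held fixed
        # so that the generated variant preserves the original identity.
        id_emb = self.id_encoder(source_img)

        # Start from Gaussian noise and iteratively denoise, conditioning every
        # step on the identity embedding plus the desired pose and age codes.
        x = torch.randn_like(source_img)
        for t in reversed(range(self.num_steps)):
            eps = self.denoiser(x, t, id_emb, target_pose, target_age)
            x = denoise_step(x, eps, t)  # standard DDPM/DDIM update (assumed helper)
        return x
```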
The DMD operates in two distinct modes. The first mode, shown in Fig. (b) above, allows flexible control over age variations, readily achieving both age progression and age regression. The second mode, shown in Fig. (c) above, enables free control over the facial angle. By combining these two modes, the model increases the intra-class variation of an identity, ultimately leading to enhanced recognition performance.
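As a hypothetical usage example of the two modes on one source identity, following the sketch interface above (the age values and yaw angles are illustrative only):

```python
# Mode 1: fix the pose and sweep the age code (age progression / regression).
source = generator.sample()
aged = [dmd.transform(source, target_pose=0.0, target_age=a)
        for a in (10, 30, 50, 70)]

# Mode 2: fix the age and sweep the yaw angle (frontal to profile views).
posed = [dmd.transform(source, target_pose=yaw, target_age=30)
         for yaw in (-60, -30, 0, 30, 60)]

# Together the two modes enlarge the intra-class variation for this identity.
intra_class_samples = aged + posed
```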