This study presents the Style-Preserving Diffusion Generator (SPDG), a lightweight diffusion-based framework for scene text synthesis and editing that
preserves the original font style and background while enabling flexible, character-level text replacement. SPDG addresses data imbalance and reduces
annotation costs in scene text recognition (STR) through controllable editing and user-specified style generation.
The central technological innovation that enables SPDG's combination of high performance and efficiency is Knowledge Distillation (KD). KD is a model
compression technique, famously proposed by Hinton et al., designed to transfer the knowledge from a large, complex "teacher" model to a smaller,
more efficient "student" model. The core principle is to train the student not just on the ground-truth labels (hard targets), but also on the probability
distributions produced by the teacher model (soft targets). By learning to mimic these nuanced outputs, the student model can inherit the teacher's
powerful capabilities while remaining compact and fast, making it suitable for deployment in resource-constrained environments.
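To make the soft-target idea concrete, the following is a minimal sketch of a Hinton-style distillation loss. The function name, the temperature, and the weighting factor alpha are illustrative assumptions, not values taken from this work, and the example uses classification logits purely to demonstrate the soft-versus-hard target combination described above; SPDG applies the same principle to its generator outputs.

```python
# Minimal sketch of Hinton-style knowledge distillation.
# Hypothetical names and hyperparameters; the paper does not specify
# the exact loss weighting or temperature.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Combine soft-target KL divergence with hard-label cross-entropy."""
    # Soft targets: teacher probabilities softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Hard targets: standard cross-entropy against ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```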
The overall network architecture of the Style-Preserving Diffusion Generator (SPDG) is illustrated in the figures above. It is divided into two stages. In the first stage, the core training of SPDG is built upon a knowledge distillation mechanism: SPDG acts as the student model (Student Generator), and its learning objective is provided by a pre-trained teacher model, TxtCtrl, whose parameters are fixed. SPDG comprises four
key modules: a Text Encoder (T), a Conditional Image Encoder (C), a Style Feature Extraction Module (S), and a Diffusion Generator (G).
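The sketch below illustrates one plausible way the four named modules could be composed into the student generator. The class name, constructor arguments, and forward signature are assumptions for illustration; the paper names the modules T, C, S, and G but does not specify their interfaces.

```python
# Illustrative composition of the student generator's four modules.
# Module internals and the forward signature are assumptions; only the
# module roles (T, C, S, G) come from the text above.
import torch
import torch.nn as nn

class StudentGenerator(nn.Module):
    def __init__(self, text_encoder, cond_image_encoder,
                 style_extractor, diffusion_generator):
        super().__init__()
        self.text_encoder = text_encoder                 # T: encodes the target text
        self.cond_image_encoder = cond_image_encoder     # C: encodes the conditional image
        self.style_extractor = style_extractor           # S: extracts font/background style
        self.diffusion_generator = diffusion_generator   # G: conditional diffusion generator

    def forward(self, noisy_image, timestep, target_text, cond_image, style_image):
        # Encode each condition, then let the diffusion generator predict
        # the denoised output (or noise) under those conditions.
        text_emb = self.text_encoder(target_text)
        cond_emb = self.cond_image_encoder(cond_image)
        style_emb = self.style_extractor(style_image)
        return self.diffusion_generator(noisy_image, timestep,
                                        text_emb, cond_emb, style_emb)
```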