This study presents the Style-Preserving Diffusion Generator (SPDG), a lightweight diffusion-based framework for scene text synthesis and editing that
preserves the original font style and background while enabling flexible, character-level text replacement. SPDG addresses data imbalance and reduces
annotation costs in scene text recognition (STR) through controllable editing and user-specified style generation.
The central technological innovation that enables SPDG's combination of high performance and efficiency is Knowledge Distillation (KD). KD is a model
compression technique, famously proposed by Hinton et al., designed to transfer the knowledge from a large, complex "teacher" model to a smaller,
more efficient "student" model. The core principle is to train the student not just on the ground-truth labels (hard targets), but also on the probability
distributions produced by the teacher model (soft targets). By learning to mimic these nuanced outputs, the student model can inherit the teacher's
powerful capabilities while remaining compact and fast, making it suitable for deployment in resource-constrained environments.
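To make the soft-target idea concrete, the following is a minimal sketch of a Hinton-style distillation loss. The function name, the temperature, and the weighting factor alpha are illustrative assumptions, not values taken from this work, and the example uses classification logits purely to demonstrate the soft-versus-hard target combination described above; SPDG applies the same principle to its generator outputs.

```python
# Minimal sketch of Hinton-style knowledge distillation.
# Hypothetical names and hyperparameters; the paper does not specify
# the exact loss weighting or temperature.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Combine soft-target KL divergence with hard-label cross-entropy."""
    # Soft targets: teacher probabilities softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    # Hard targets: standard cross-entropy against ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```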
The overall network architecture of the Style-Preserving Diffusion Generator (SPDG) is illustrated in the figures above. It is divided into two stages. In the first stage, the core training of SPDG is built upon a knowledge distillation mechanism: SPDG acts as the student model (Student Generator), and its learning objective is provided by a pre-trained teacher model, TxtCtrl, whose parameters are fixed. SPDG comprises four
key modules: a Text Encoder (T), a Conditional Image Encoder (C), a Style Feature Extraction Module (S), and a Diffusion Generator (G).
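The sketch below illustrates one plausible way the four named modules could be composed into the student generator. The class name, constructor arguments, and forward signature are assumptions for illustration; the paper names the modules T, C, S, and G but does not specify their interfaces.

```python
# Illustrative composition of the student generator's four modules.
# Module internals and the forward signature are assumptions; only the
# module roles (T, C, S, G) come from the text above.
import torch
import torch.nn as nn

class StudentGenerator(nn.Module):
    def __init__(self, text_encoder, cond_image_encoder,
                 style_extractor, diffusion_generator):
        super().__init__()
        self.text_encoder = text_encoder                 # T: encodes the target text
        self.cond_image_encoder = cond_image_encoder     # C: encodes the conditional image
        self.style_extractor = style_extractor           # S: extracts font/background style
        self.diffusion_generator = diffusion_generator   # G: conditional diffusion generator

    def forward(self, noisy_image, timestep, target_text, cond_image, style_image):
        # Encode each condition, then let the diffusion generator predict
        # the denoised output (or noise) under those conditions.
        text_emb = self.text_encoder(target_text)
        cond_emb = self.cond_image_encoder(cond_image)
        style_emb = self.style_extractor(style_image)
        return self.diffusion_generator(noisy_image, timestep,
                                        text_emb, cond_emb, style_emb)
```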