Text rendering in complex 3D scenes is challenging due to the need for accurate geometric alignment, realistic lighting adaptation, and preservation of fine font details. Existing 2D datasets lack the spatial cues and rich annotations required for seamless integration into real-world imagery. With growing demand for generative models capable of both visual realism and precise control, this work addresses the gap by combining multi-task learning with fine-grained conditioning, supported by a purpose-built dataset that bridges synthetic training and real-world application.
This figure illustrates the complete workflow of the proposed method, which consists of:
(a) a pre-trained encoder that maps an image into a binary latent representation;
(b) a GPT-2-based multi-task Transformer that jointly optimizes multiple tasks to strengthen its predictive capability;
(c) an inference stage that recursively generates multi-scale residuals, which are cumulatively summed to reconstruct the final high-resolution image (a minimal sketch of this accumulation follows below).
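For concreteness, the sketch below shows one way the cumulative summation in (c) could look: residual feature maps predicted at increasing scales are upsampled to a common resolution and added together before decoding. The tensor shapes, the bilinear upsampling choice, and the function name are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def reconstruct_from_residuals(residuals, target_size):
    """Accumulate multi-scale residual feature maps into a single
    high-resolution feature map (shapes and names are illustrative).

    residuals: list of tensors [B, C, h_k, w_k], ordered coarse -> fine.
    target_size: (H, W) of the final feature map fed to the decoder.
    """
    B, C = residuals[0].shape[0], residuals[0].shape[1]
    out = torch.zeros(B, C, *target_size)
    for r in residuals:
        # Upsample each scale's residual to the target resolution and add it;
        # a decoder would then map the accumulated features back to pixels.
        out = out + F.interpolate(r, size=target_size, mode="bilinear", align_corners=False)
    return out
```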
To fully exploit discrete visual tokenizers and improve reconstruction and generation quality, expanding the vocabulary has proven effective. However, directly enlarging the vocabulary in traditional tokenizers, such as those used in VAR and MAGVIT, incurs a substantial increase in memory and computational overhead. To address this limitation, we adopt the Bitwise Multi-scale Residual Quantizer proposed and pre-trained in Infinity, which uses Binary Spherical Quantization (BSQ). Because BSQ requires no explicit codebook, it significantly reduces resource consumption and makes training with an extremely large vocabulary feasible.
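As a rough, self-contained illustration of the codebook-free design (our simplification, not Infinity's released implementation): each latent vector is projected onto the unit hypersphere and binarized per dimension, so the vocabulary is the implicit set of 2^L sign patterns and never has to be materialized.

```python
import torch
import torch.nn.functional as F

def bsq_quantize(z):
    """Binary Spherical Quantization, simplified sketch: project the latent
    vector onto the unit hypersphere, then binarize each of its L dimensions
    by sign. The 2^L possible codes form an implicit, never-stored codebook.
    (Training details such as the straight-through estimator are omitted.)"""
    L = z.shape[-1]                                  # e.g. L = 32 in this work
    u = F.normalize(z, dim=-1)                       # point on the unit sphere
    bits = (u > 0).to(torch.uint8)                   # L sign bits per vector
    q = (bits.float() * 2.0 - 1.0) / (L ** 0.5)      # de-quantized vector, unit norm
    return bits, q
```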
In this study, we set the vector dimension of the Binary Spherical Quantization (BSQ) to L = 32. This configuration corresponds to an implicit vocabulary size of 2^32 ≈ 4.3×10^9, which covers an extremely rich combination of visual features and improves detail reconstruction. Meanwhile, the complexity of BSQ's quantization and de-quantization is only O(L): each vector requires just 32 sign and bitwise operations, keeping the computational burden during inference very low.
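To make the vocabulary and complexity figures concrete, the hypothetical helper below packs the 32 sign bits of one code into an integer token index using only bit operations; it is meant to illustrate the arithmetic, not the actual tokenizer interface.

```python
def bits_to_token_id(bits):
    """Pack the 32 sign bits of one BSQ code into a single integer token id.
    The implicit vocabulary spans all 2^32 ≈ 4.3e9 ids, yet quantization and
    de-quantization remain O(L): 32 sign tests / bit shifts per vector.
    (Hypothetical helper for illustration only.)"""
    token = 0
    for i, b in enumerate(bits):
        token |= int(b) << i
    return token

# A 32-dimensional latent therefore maps to one of 2**32 = 4_294_967_296 ids.
```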