Emerging Property of Masked Token for Effective Pre-training
#9774
Masked Language Modeling
Masked Language Modeling (MLM) is a dominant self-supervised learning approach in natural language processing (NLP) that predicts hidden tokens given the visible ones, enabling the training of large language models such as BERT and, in its autoregressive form, GPT. These models remove a portion of the data and learn to predict the removed content, and they have been shown to scale and generalize well on downstream tasks. This line of research has revolutionized NLP and holds promise for further advances in the field. However, naive MLM still requires long training times and enormous computation, which has motivated more efficient self-supervised pre-training.
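For concreteness, the following is a minimal PyTorch sketch of the masking step described above; the 15% mask ratio, the [MASK] token id of 103, and the use of -100 as the ignore index are conventional BERT-style choices rather than specifics of this work.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_ratio=0.15):
    """BERT-style masking: hide a random subset of tokens and keep
    labels only at the hidden positions."""
    labels = input_ids.clone()
    # Sample masked positions uniformly at random.
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_ratio)).bool()
    labels[~masked] = -100             # visible positions are ignored by the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id  # hidden tokens are replaced with [MASK]
    return corrupted, labels

# Usage: the model is trained to predict the original ids at masked positions only.
ids = torch.randint(0, 30522, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=103)
```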
Masked Image Modeling
Masked Image Modeling (MIM) is a relatively new technique that has gained popularity in computer vision and machine learning in recent years. The basic idea behind MIM is to predict missing or occluded parts of an image using a neural network trained on partially masked images. Despite their impressive performance, masked autoencoder approaches require a large amount of computation on large-scale training datasets. Researchers have explored hierarchical Vision Transformers (ViTs) to improve pre-training efficiency for masked image modeling by allowing the ViT to discard masked patches and operate only on the visible ones. In contrast to prior methodologies, the proposed method treats the inherent properties of the tokens employed by MIM as a fundamental route to effective pre-training.
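The efficiency gain from discarding masked patches can be illustrated with a minimal MAE-style sketch; the 75% mask ratio and the function name below are illustrative assumptions, not details of the proposed method.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens; the encoder only sees these.
    patches: (B, N, D) patch embeddings."""
    B, N, D = patches.shape
    len_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                   # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)  # lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    # Binary mask (1 = masked) used later when reconstructing.
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_shuffle

x = torch.randn(4, 196, 768)          # e.g., 14x14 patches from a ViT
visible, mask, _ = random_masking(x)  # encoder input shrinks to 4x49x768
```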
Proposed Masked Token Optimization (MTO)
Motivation
Despite the numerous successes of MIM in diverse downstream tasks, the long pre-training phase it entails tends to impede its efficiency. Concretely, attaining the convergence of the Transformer for transfer learning requires a substantial amount of pre-training, typically 800 to 1600 epochs. In this paper, as a fundamental approach, we cast this problem as one of masked token optimization, an issue that arises from the modality gap with NLP systems.
Properties of Masked Token
(1) Spatial randomness: Masked tokens must be selected uniformly at random from the set of input patches so that the model learns to predict tokens at various locations and of various types.
(2) Substitutional consistency: During masking at the initial embedding, every masked token should consistently be replaced with the same learnable parameter.
(3) Data singularity: Masked tokens in the initial embedding should be unique tokens that have a low likelihood of appearing in the training data. A sketch illustrating all three properties follows.
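A minimal sketch of how these three properties are commonly realized at the initial embedding of a ViT (the class name and mask ratio below are illustrative assumptions): one shared learnable mask token (substitutional consistency) is substituted at uniformly random positions (spatial randomness), and since it is a free parameter rather than a projection of image content, it is unlikely to coincide with any real patch embedding (data singularity).

```python
import torch
import torch.nn as nn

class MaskedEmbedding(nn.Module):
    """Replace a random subset of patch embeddings with one shared
    learnable mask token at the initial embedding stage."""
    def __init__(self, dim):
        super().__init__()
        # One parameter shared by every masked position (substitutional consistency).
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        nn.init.normal_(self.mask_token, std=0.02)

    def forward(self, patches, mask_ratio=0.6):
        B, N, D = patches.shape
        # Uniformly random selection of positions (spatial randomness).
        mask = torch.rand(B, N, device=patches.device) < mask_ratio
        mask_tokens = self.mask_token.expand(B, N, D)
        # Masked positions get the shared token, visible ones keep their patch embedding.
        return torch.where(mask.unsqueeze(-1), mask_tokens, patches), mask

emb = MaskedEmbedding(dim=768)
out, mask = emb(torch.randn(2, 196, 768))
```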
Heterogeneity Analysis
As an initial step, we conducted a heterogeneity analysis of masked tokens against visible tokens to demonstrate that the masked token's data singularity characteristic manifests within the model upon reaching convergence.
Panel (a) shows that the heterogeneity between the two types of tokens is highest at the initial embedding for both approaches and gradually decreases in subsequent layers. In contrast to the pre-trained model, the heterogeneity of the non-converged model shown in panel (b) displays an erratic trend, indicating that this tendency is acquired through model convergence.
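The paper's exact heterogeneity measure is not reproduced here; as an illustrative assumption, the sketch below scores per-layer heterogeneity as one minus the cosine similarity between the mean masked-token and mean visible-token features, which suffices to trace the qualitative layer-wise trend described above.

```python
import torch
import torch.nn.functional as F

def layerwise_heterogeneity(features_per_layer, mask):
    """features_per_layer: list of (B, N, D) tensors, one per Transformer layer.
    mask: (B, N) bool, True where the token was masked.
    Returns one heterogeneity score per layer (higher = more dissimilar)."""
    scores = []
    for feats in features_per_layer:
        masked_mean = feats[mask].mean(dim=0)    # mean masked-token feature
        visible_mean = feats[~mask].mean(dim=0)  # mean visible-token feature
        sim = F.cosine_similarity(masked_mean, visible_mean, dim=0)
        scores.append(1.0 - sim.item())
    return scores
```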
Proposed Optimization
The proposed Masked Token Optimization (MTO) approach selectively excludes semantically inconsequential masked tokens from the weight aggregation over visible tokens and, at the same time, enforces a layer-depth-dependent data singularity constraint, enhancing the model's ability to accurately identify regions that require semantic restoration.
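The precise MTO formulation is given in the paper; the sketch below is only a rough illustration under two assumptions of ours: that the exclusion is realized as an additive attention bias which prevents aggregation of low-relevance masked keys, and that the data singularity constraint is a depth-weighted penalty on masked/visible similarity. The relevance scores are assumed to be supplied by the method.

```python
import torch
import torch.nn.functional as F

def mto_attention_bias(relevance, token_mask, threshold=0.1):
    """Additive attention bias that stops visible queries from aggregating
    semantically inconsequential masked keys.
    relevance: (B, N) estimated relevance of each token (assumed given).
    token_mask: (B, N) bool, True where the token is a masked token."""
    drop = token_mask & (relevance < threshold)   # inconsequential masked keys
    bias = torch.zeros_like(relevance, dtype=torch.float)
    bias[drop] = float("-inf")                    # excluded from weight aggregation
    # Broadcast over queries: (B, 1, N) added to attention logits of shape (B, N, N).
    return bias.unsqueeze(1)

def singularity_loss(feats, token_mask, layer_idx, num_layers):
    """Depth-weighted penalty keeping masked tokens distinct from visible
    tokens, strongest at shallow layers (illustrative, not the paper's loss)."""
    masked_mean = feats[token_mask].mean(dim=0)
    visible_mean = feats[~token_mask].mean(dim=0)
    weight = 1.0 - layer_idx / max(num_layers - 1, 1)
    return weight * F.cosine_similarity(masked_mean, visible_mean, dim=0).clamp(min=0)
```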
Experiments
We report comprehensive results of applying MTO to various baselines. MTO substantially improves pre-training efficiency, reaching each baseline's reference performance within approximately 400 epochs across all methods. This indicates that a remarkable gain in efficiency is achievable for any MIM method through the application of MTO, making it a broadly applicable treatment of masked tokens. In terms of overall performance, it is noteworthy that MTO consistently yields higher fine-tuning accuracy at most epochs, which further underscores its positive impact on representation learning during pre-training.
Download code and model parameters: MTO