Since 2021, a paradigm shift has taken place in AI research, characterized by a few keywords:
Multimodal: the boundaries between modalities, e.g. vision and language, are being broken
Generative AI: led by GPTs and diffusion models, generative AI is now at the forefront of AI research
Transformers: network design has converged on the transformer architecture and its attention mechanism
Foundation Model: large foundation models form the basis of problem solving, e.g. SAM, CLIP, GPT, SD
Multimodal foundation models differ fundamentally from single-modality, task-specific models.
Data: billion-scale noisy data from the internet vs (sub-)million-scale curated/annotated task data
Model: a largely fixed network design and training recipe vs various backbone/module/loss choices
GPUs: hundreds of high-end GPUs on a cloud platform vs 1-8 ordinary GPUs in a local server
Cycle: a model takes a month to train on hundreds of GPUs vs a model takes days to train on 1-10 GPUs
Eval: lack of consensus on evaluation data/metrics vs task-specific evaluation protocols
As a result, the workload in network design is reduced, while dataset collection, model training, and evaluation become much more demanding. In terms of project execution, the overall risk is much higher, due to:
Limited training rounds: training is too expensive to repeat, in terms of time, compute, and money
Data/model/eval/infra work orchestration: in the worst case only one part is working while the others wait
Ambiguous evaluation scope and criteria: there is no deterministic proof of success, which hinders delivery.
Project details follow below.
[1] Multimodal-LLM: 2023 Q2 - 2024 H1
This project aimed at establishing a generic network architecture, and obtaining an optimized module design, for the creation of a multimodal video understanding foundation model. The model takes multimodal prompts including text/video/audio as input and generates textual answers via a transformer decoder network. When the project started, most SoTA multimodal LLM works dealt with image and text (e.g. OpenAI GPT-4V, InstructBLIP, LLaVA, Flamingo), and very few works had started to pay attention to the video and audio modalities (Video-ChatGPT, Meta's contrastive learning work ImageBind). Kicked off in Q2 2023, this work was among the first batch of research explorations in the domain of generative video question answering. In late 2023, we achieved several of the top places on this leaderboard.
In this domain, the network architectures mainly fell in two categories:
(1) Multimodal Instruction Tuning: extending LLMs to Multimodal-LLMs. The network starts by encoding and fusing multimodal prompts instead of only a text prompt, then proceeds to an LLM decoder for text generation. The encoder/fusion/LLM blocks are pretrained on other tasks, and the M-LLM is obtained by finetuning some or all of these blocks. This methodology benefits from pretrained single-modality models, largely reducing the training demands in terms of training data, training time, and computational resources. InstructBLIP and LLaVA are two representative works, illustrated below.
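The following is a minimal PyTorch sketch of this instruction-tuning design, with toy stand-in modules and illustrative dimensions (this is not the actual InstructBLIP/LLaVA code): a vision encoder produces visual features, a small projector maps them into the LLM embedding space, and the fused token sequence is decoded causally.

```python
import torch
import torch.nn as nn

class ToyMultimodalLLM(nn.Module):
    """Minimal LLaVA/InstructBLIP-style sketch: a (frozen) vision encoder,
    a trainable projector, and a decoder-only LLM over fused tokens."""
    def __init__(self, vis_dim=768, llm_dim=512, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)    # stand-in for CLIP-ViT, frozen in practice
        self.projector = nn.Linear(vis_dim, llm_dim)          # the block most often finetuned
        self.llm = nn.TransformerEncoder(                     # stand-in for a pretrained LLM decoder
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, visual_feats, text_embeds):
        # visual_feats: (B, N_vis, vis_dim); text_embeds: (B, N_txt, llm_dim)
        vis_tokens = self.projector(self.vision_encoder(visual_feats))
        fused = torch.cat([vis_tokens, text_embeds], dim=1)   # prepend visual tokens to the prompt
        T = fused.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        hidden = self.llm(fused, mask=causal)                 # causal self-attention over fused tokens
        return self.lm_head(hidden)                           # next-token logits

model = ToyMultimodalLLM()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 8, 512))
print(logits.shape)  # torch.Size([2, 24, 32000])
```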
(2) Native Multimodal-GPT: autoregressively generate tokens of each modality besides text (VideoPoet, Gemini, LWM). This type of M-LLM is more intrinsically multimodal, but it poses much higher requirements on training dataset size, training time, and computational resources, and it is harder to get the model to converge. The following figure illustrates VideoPoet and LWM.
The instruction tuning approach was adopted in our exploration, to gain feasibility insights given the limited data, project time, and GPU resources.
Takeaways: along the journey, a few interesting observations were made:
Simply encoding the video frame by frame and aggregating the image-level features via pooling or concatenation can yield meaningful results in many cases (a minimal sketch of this aggregation appears after this list). This hints that video dynamics at the semantic level do not need much extra feature length to encode. Interestingly, some recent SoTA works are training-free, e.g. LLaVA-NeXT (Video)
The pretrained LLM is a key factor. The encoder-decoder LLM Flan-T5 tends to generate short and precise answers, while the decoder-only LLM Meta Llama tends to generate longer answers but hallucinates or repeats itself toward the end. Among the decoder-only LLMs, the prefix-LM ChatGLM performed better in Chinese, while the causal-LM Llama model worked better in English.
Dataset size/quality is another key factor. Given the same network design and training recipe, simply increasing the dataset scale can improve effectiveness considerably. [The recent SoTA work VideoLLaMA2 also combined many datasets and tasks.]
Be careful: the pretrained LLM is very powerful at generation. It can ignore the visual or other modalities and guess the answer from the text instructions alone. Make sure visual/audio information can effectively reach and interact with the LLM. This was also observed in MVBench, demonstrated by feeding random noise as the visual input to the Q-Former.
The image-level visual encoder (CLIP-ViT) outperformed the video-level encoder (ViCLIP-ViT) in our experiments. Recent works seem to reach different conclusions, e.g. MVBench with UMT, as well as InternVideo2 with its video foundation model.
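To make the first takeaway concrete, below is a minimal sketch (not the project code) of frame-wise encoding followed by mean pooling or concatenation, with a toy linear encoder standing in for CLIP-ViT.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained image encoder such as CLIP-ViT.
image_encoder = nn.Linear(3 * 224 * 224, 768)

def encode_video(frames: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """frames: (T, 3, 224, 224) -> a single video-level feature.

    Each frame is encoded independently; temporal aggregation is just
    mean pooling or concatenation of the per-frame features.
    """
    feats = image_encoder(frames.flatten(1))          # (T, 768)
    if mode == "mean":
        return feats.mean(dim=0)                      # (768,)
    elif mode == "concat":
        return feats.flatten()                        # (T * 768,)
    raise ValueError(mode)

video = torch.randn(8, 3, 224, 224)                   # 8 sampled frames
print(encode_video(video, "mean").shape)              # torch.Size([768])
print(encode_video(video, "concat").shape)            # torch.Size([6144])
```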
Evaluation: in this domain, evaluations were mostly benchmarked on the VideoInstruct dataset, with 100K semi-automatically annotated test cases and OpenAI GPT-3.5-turbo as the evaluator. Five dimensions were examined: answer correctness, detail orientation, contextual understanding, temporal understanding, and consistency. Another evaluation benchmark is MVBench, which defines 20 video understanding tasks. Both evaluation protocols lack scientific rigor, with bias originating from the subjectiveness of the prompts and ground-truth annotation, and from the inaccuracies built into the GPT-3.5 model. Human evaluators are still needed as the last gate, to justify business value.
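As an illustration of such model-based scoring (not the benchmark's exact prompt), a minimal sketch with the OpenAI Python SDK might look as follows; the rubric wording and parsing are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-3.5-turbo to rate a predicted answer from 1 to 5.
    The rubric below is illustrative, not the VideoInstruct prompt."""
    prompt = (
        "Rate the predicted answer against the reference on a 1-5 scale "
        "for correctness. Reply with a single integer.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

score = judge_answer("What is the man doing?", "He is cooking.", "He is preparing food.")
```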
Thoughts: beyond generic effectiveness, some desirable next-stage capabilities in this domain are long video understanding and fine-grained video understanding, which are still under exploration. Towards these directions, Meta SAM may work as an object-level visual parser, and efficient attention mechanisms such as the FlashAttention series, Blockwise Attention, RingAttention, Multi-Query Attention, and Grouped-Query Attention (used in Llama 2 and later Llama models), together with other optimization techniques like KV cache and model quantization, can be adopted to reduce the memory and computational complexity of long-sequence modelling.
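As an example of one of these techniques, here is a minimal grouped-query attention sketch (dimensions and group counts are illustrative, and the causal mask is omitted): key/value heads are shared across groups of query heads, shrinking the KV cache.

```python
import torch

def grouped_query_attention(q, k, v, n_groups):
    """q: (B, Hq, T, D); k, v: (B, Hkv, T, D) with Hq = n_groups * Hkv.
    Each group of query heads attends over one shared key/value head,
    which reduces KV-cache memory by a factor of Hq / Hkv."""
    B, Hq, T, D = q.shape
    Hkv = k.shape[1]
    assert Hq == n_groups * Hkv
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(n_groups, dim=1)
    v = v.repeat_interleave(n_groups, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
    return attn @ v

q = torch.randn(1, 8, 16, 64)    # 8 query heads
k = torch.randn(1, 2, 16, 64)    # 2 shared KV heads -> 4 query heads per group
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v, n_groups=4)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```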
[2] Video Generation - 2023 H2 - 2024 H1
Background: video diffusion models came under the research spotlight in 2023, following the success of stable diffusion models in image generation. The domain heated up with a few milestone works appearing in 2023, such as VideoComposer from Alibaba and the release of the Pika Labs API in Discord. Thereafter, many video generation works emerged within a short period, forming a fast-expanding research domain, such as the VGEN family from Alibaba and the EMU family from Meta. Initially, the domain mainly employed a denoising U-Net as the backbone network for noise estimation; since OpenAI introduced Sora in February 2024, there has been a strong shift towards Diffusion Transformer (DiT) models instead. Some other works include video generation as a subtask in their any-to-any generation framework, such as VideoPoet and LWM. However, any-to-any models require far more data and compute and are harder to train to convergence, limiting the volume of exploration in this direction.
Task: the project aimed at establishing the methodology for developing a video generation foundation model. At that time, limited resources were available in terms of training datasets, pretrained model weights, and training recipes. On the capability side, the videos generated by SoTA models were low in resolution, short in length, and often inconsistent with the condition and prompt.
Exploration: to gain an overall understanding of current SoTA capability, we first composed a pipeline of pretrained video generation models, together with post-processing for spatial and temporal super resolution. We later found this strategy to be similar to MagicVideo, which chained a few generation models in a pipeline to achieve high-resolution, high-framerate video generation. However, this approach was inference-only; no training or training recipe design was explored.
Following the release of the stable video diffusion (SVD) weights and the animate-anything framework, we successfully built the training pipeline of a video foundation model. Initialized with SVD weights, we were able to extend the model's capability to long video generation, from 25 frames to 128, 256, and 512 frames. The model also supports different prompting modes: text, image, or both. The training set for this exploration was relatively small, with only sub-million training instances.
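For reference, a minimal image-to-video inference sketch with the released SVD weights, using the Hugging Face diffusers pipeline, might look like the following (model id and arguments follow the diffusers documentation; extending beyond 25 frames requires further training, as described above):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

# Condition on a single image; SVD generates a short clip from it.
image = load_image("conditioning_frame.png").resize((1024, 576))
frames = pipe(image, num_frames=25, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```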
With Sora appearing in February 2024, the research focus shifted to Diffusion Transformers. Several DiT variants such as Latte and STDiT were explored. After creating an in-house training set, in Q2 2024 we achieved video generation capability competitive with publicly available SoTA models such as Open-Sora and Open-Sora-Plan.
For evaluation, common metrics such as FVD and IS scores were found to be only indicative when comparing generative models. A dedicated manual evaluation team was therefore established.
[3] Human Image Generation with Stable Diffusion
Following the overwhelming success of Stable Diffusion in the image generation domain versus GANs and VAEs, some common artifacts were identified. Human image generation is one of the most common application scenarios, yet artifacts frequently appear in the form of facial distortion, the wrong number of fingers/limbs, and ill-posed body structure. These artifacts create barriers for business application. This project was established to tackle these problems of stable diffusion models in human image generation. The key challenges arise from:
the need to create a large-scale, high-quality <text, human image> dataset
the complexity of the SD architecture, which makes it hard to identify the cause of the artifacts and design countermeasures
the unavoidable ambiguity in model evaluation, as no ground truth can be defined for the test cases.
In exploration, two approaches were mainly pursued: (1) finetuning the stable diffusion model on a human dataset: here we first explored parameter-efficient finetuning with LoRA, then shifted to full-parameter end-to-end finetuning on curated <text, human image> data pairs; (2) adding a pose image as an additional control in ControlNet fashion, with the consideration that these artifacts are mostly related to spatial properties, so adding spatial control to the network could give direct spatial guidance. As is common practice, DDIM and classifier-free guidance were used at inference.
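For the second approach, a minimal pose-conditioned inference sketch with the diffusers library might look as follows (the model ids, scheduler, prompt, and guidance scale are illustrative choices from the diffusers docs, not the project's exact setup):

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, DDIMScheduler
from diffusers.utils import load_image

# OpenPose-conditioned ControlNet attached to a base SD 1.5 model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampling

pose = load_image("pose_skeleton.png")           # precomputed OpenPose skeleton image
image = pipe(
    "a full-body photo of a person walking in a park",
    image=pose,
    num_inference_steps=50,
    guidance_scale=7.5,                           # classifier-free guidance strength
).images[0]
image.save("controlled_human.png")
```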
We adopted model-based evaluation in the experiments, using the ImageReward score, PickScore, and Human Preference Score. These models capture human preferences, and the scores make automatic model evaluation possible without human evaluators. On the other hand, model-based evaluation is also limited by the capability of the preference models themselves.
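As an example, scoring a batch of generated images with ImageReward might look like the sketch below (following the usage documented in the ImageReward repository; the prompt and paths are placeholders):

```python
import ImageReward as RM  # pip install image-reward

# Load the released reward model; it gives higher scores to images
# that better match human preference for the given prompt.
model = RM.load("ImageReward-v1.0")

prompt = "a full-body photo of a person walking in a park"
images = ["sample_0.png", "sample_1.png", "sample_2.png"]
scores = model.score(prompt, images)   # one preference score per image
print(scores)
```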
In late 2023 we achieved slightly better results as measured by the human preference models, but the model was not good enough to substantially remove the aforementioned artifacts in human image generation. One lesson was the hesitation and effort-split between the two approaches: if either one had been explored with full effort, the outcome might have been better. Dataset creation and settling the evaluation protocol also took more time and effort than expected.
[4] Multilingual CLIP - 2023 H1
This project was to develop a multilingual CLIP (Contrastive Language-Image Pretraining) model with 100+ GPU chips and billion-scale text-image pairs, serving as the backend model of a text2image and image2text search engine. The key challenges arise from:
(1) training dataset creation from large-scale, noisy, multilingual, and imbalanced raw data
(2) effective and efficient distributed training across multiple nodes and multiple GPUs
For data management, data quality was strategically prioritized over data quantity: the noisy training data was filtered with very strict rules, e.g. metadata-based filtering to discard unreliable data sources, model-based not-safe-for-work filtering, and model-based <text, image> relevance filtering. This procedure removed over 90% of the initial data pairs. The training dataset was also composed with care to enlarge the proportion of low-resource languages, e.g. Spanish and Arabic.
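As an illustration of the model-based relevance filtering step, a sketch using a public CLIP checkpoint from Hugging Face transformers could look as follows (the checkpoint and threshold are illustrative assumptions, not the project's filter):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def is_relevant(caption: str, image_path: str, threshold: float = 0.25) -> bool:
    """Keep a <text, image> pair only if its CLIP cosine similarity
    exceeds a threshold; pairs below it are treated as noise."""
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (text_emb @ image_emb.T).item() > threshold
```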
A data-parallel strategy was adopted for efficient distributed training, where each GPU server hosts a local trainable copy of the model and has access only to the data splits stored on its own hard disk. Other options were considered but rejected, such as duplicating the training dataset on each node (too big for a single server node), or creating a dedicated data server and streaming data to each server during training (too slow). To make training faster, Microsoft DeepSpeed was applied, although the speedup was not as impressive as it usually is with NLP models (less than 1.5x versus typically over 10x). To make the best use of training resources, an automatic fault recovery strategy was designed to resume training from the latest checkpoint when training crashes.
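At the core of CLIP training is the symmetric contrastive (InfoNCE) objective; under the data-parallel setup each GPU computes it over its local batch (or over features gathered across GPUs). A minimal sketch over a local batch, with illustrative dimensions:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a local batch of paired embeddings.
    image_emb, text_emb: (B, D); row i of each tensor forms a positive pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature       # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)         # image -> matching text
    loss_t2i = F.cross_entropy(logits.T, targets)       # text -> matching image
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```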
Two evaluation datasets were created. The first set contains several thousand <text, image> pairs, annotated as 'Matched', 'Unmatched', or 'Not Sure' by a native speaker of the respective language. The 'Matched' and 'Unmatched' subsets were kept for evaluation, which makes automatic evaluation of trained models possible on text2image and image2text retrieval tasks and accelerates model R&D. Further, another set containing a few hundred query pairs and million-scale gallery pairs was collected for evaluation on text2image and image2text retrieval tasks. As no ground truth was available for this set, manual evaluators were employed to check and summarize the results. It is worth mentioning that manual annotation/evaluation consumes considerable time: on average a person can annotate/evaluate 500 data pairs per day.
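For the automatically evaluable set, retrieval quality can be summarized with recall@k over the embedding matrices; a minimal sketch (assuming one matching image per text) follows:

```python
import torch
import torch.nn.functional as F

def recall_at_k(text_emb, image_emb, k=5):
    """text_emb, image_emb: (N, D), where row i of each is a matched pair.
    Returns the fraction of texts whose matching image ranks in the top-k."""
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T
    topk = sims.topk(k, dim=-1).indices                       # (N, k)
    targets = torch.arange(sims.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

print(recall_at_k(torch.randn(1000, 512), torch.randn(1000, 512), k=5))
```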
In experiments, quick overfitting was observed when initializing from publicly available pretrained models on the new training datasets. A larger overall batch size and a lower learning rate helped the model converge better. The developed model outperformed SoTA public models in all languages, especially the low-resource languages, e.g. Spanish and Arabic. It was also observed that larger models always performed better for the same architecture. Based on this observation, a model distillation approach was also explored and demonstrated promising results.
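One simple form such distillation can take is aligning the student's in-batch similarity distribution with the teacher's; the sketch below shows this relational distillation loss (the loss form and dimensions are assumptions for illustration, not the project's exact recipe):

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(student_emb, teacher_emb, temperature=0.07):
    """Distill a large CLIP teacher into a smaller student by matching the
    in-batch pairwise similarity distributions of the two models.
    student_emb: (B, Ds), teacher_emb: (B, Dt); dimensions may differ."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    s_logits = s @ s.T / temperature                      # student's pairwise similarities
    t_probs = (t @ t.T / temperature).softmax(dim=-1)     # teacher's soft targets
    return F.kl_div(s_logits.log_softmax(dim=-1), t_probs, reduction="batchmean")

loss = similarity_distillation_loss(torch.randn(32, 256), torch.randn(32, 512))
```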