Throughout my research in the CS department, I have been re-running and replicating code from the open-source repositories of relevant papers (sometimes using publicly available checkpoints) to verify results, cite their methods accurately, or draw inspiration for my own work. This effort focuses primarily on multimodal language modeling, spanning autoregressive, diffusion-based, and hybrid architectures. Below is a continuously updated list of such papers and repositories:
Imagine While Reasoning in Space: Multimodal Visualization-of-Thought:
Introduces a multimodal reasoning framework where LLMs generate intermediate visual representations to support spatial and visual reasoning, combining autoregressive language modeling with image generation.
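To keep the mechanics of this interleaved generation loop straight for myself, here is a minimal Python sketch. All names are hypothetical: `think` and `draw` stand in for the model's language and image heads, and each step conditions on the full trace so generated images can serve as intermediate reasoning artifacts.

```python
def visualize_while_reasoning(question, think, draw, max_steps=4):
    """Interleave textual reasoning steps with generated visual sketches.

    `think` and `draw` are hypothetical stand-ins for the model's
    language head and image head; both see the full trace so far, so
    later text steps can condition on earlier generated images.
    """
    trace = [("text", question)]
    for _ in range(max_steps):
        trace.append(("text", think(trace)))  # next textual reasoning step
        image = draw(trace)                   # optional visual step
        if image is not None:
            trace.append(("image", image))
    return trace
```

This is only a control-flow sketch under my own assumptions, not the paper's implementation; the real system emits actual image tokens rather than opaque placeholders.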
AVID: Adapting Video Diffusion Models to World Models:
Proposes adapting pretrained video diffusion models into action-conditioned world models for sequential prediction and planning in dynamic environments.
Pandora: Towards General World Model with Natural Language Actions and Video States:
Presents a hybrid world model that integrates autoregressive language modeling with diffusion-based video generation to connect natural-language actions and visual state transitions.
GILL: Generating Images with Multimodal Language Models:
Introduces a framework that grafts image generation and retrieval onto a frozen autoregressive language model, so text and generated images can be interleaved and serve as intermediate steps in multimodal reasoning.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation:
Proposes a unified multimodal architecture that decouples visual encoding from language modeling to support both understanding and generation within an autoregressive framework.
Autoregressive Model Beats Diffusion: LLaMA for Scalable Image Generation:
Demonstrates that vanilla LLaMA-style autoregressive transformers, predicting discrete image tokens from a VQ tokenizer, can outperform diffusion models for image generation when scaled appropriately.
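The core decoding loop here is plain next-token sampling over a flattened image-token sequence. Below is a minimal sketch under my own assumptions: `logits_fn` is a hypothetical stand-in for the transformer, mapping the token prefix to unnormalized scores over the VQ codebook.

```python
import math
import random

def sample_image_tokens(logits_fn, vocab_size, seq_len, temperature=1.0, seed=0):
    """Autoregressively sample a sequence of discrete image tokens in
    raster order, the core loop behind LLaMA-style image generation.

    `logits_fn` is a hypothetical stand-in for the transformer: it maps
    the current token prefix to unnormalized scores over the codebook.
    """
    rng = random.Random(seed)
    tokens = []
    for _ in range(seq_len):
        logits = logits_fn(tokens)
        m = max(logits)  # subtract max for numerical stability
        weights = [math.exp((l - m) / temperature) for l in logits]
        tokens.append(rng.choices(range(vocab_size), weights=weights)[0])
    return tokens
```

A real model would then decode the resulting token grid back to pixels with the VQ decoder; that step is omitted here.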
Dynamic Parallel Tree Search for Efficient LLM Reasoning:
Introduces a parallelized tree-search inference framework that accelerates autoregressive LLM reasoning by dynamically exploring multiple reasoning paths.
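The central idea, expanding several reasoning paths per step and pruning by score, can be sketched as a beam-style search. This is my own simplification, not the paper's algorithm; `expand` and `score` are hypothetical callbacks for generating and rating partial reasoning paths.

```python
import heapq

def parallel_tree_search(root, expand, score, width=3, depth=4):
    """Expand every frontier node in parallel at each step and keep
    only the `width` highest-scoring partial paths (a beam-style
    simplification of parallel tree search over reasoning paths)."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for node in frontier for child in expand(node)]
        if not candidates:
            break  # no path can be extended further
        frontier = heapq.nlargest(width, candidates, key=score)
    return max(frontier, key=score)
```

In the real setting, `expand` would batch LLM continuations (which is where the parallel speedup comes from) and `score` would be a value model or self-evaluation prompt.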
Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search:
Proposes an adaptive tree-search framework that dynamically balances breadth and depth during autoregressive LLM inference to scale test-time compute and improve reasoning performance.
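To contrast this with the fixed-width search above, here is a toy sketch of the widen-vs-deepen decision under my own assumptions (it is not the paper's algorithm): when the top two children of a node score within `margin` of each other, the search branches wider; when one child clearly dominates, it commits and goes deeper.

```python
def adaptive_tree_search(root, expand, score, budget=8, margin=0.5):
    """Adaptively trade breadth for depth: expand the most promising
    node; if its top two children score within `margin`, keep both
    (widen), otherwise keep only the best (deepen). A toy sketch of
    the widen-vs-deepen idea; `expand` and `score` are hypothetical."""
    frontier = [root]
    best = root
    for _ in range(budget):
        if not frontier:
            break
        frontier.sort(key=score, reverse=True)
        node = frontier.pop(0)  # most promising open node
        children = sorted(expand(node), key=score, reverse=True)
        if not children:
            continue  # dead end; try the next-best open node
        if len(children) > 1 and score(children[0]) - score(children[1]) < margin:
            frontier.extend(children[:2])  # scores are close: widen
        else:
            frontier.append(children[0])   # clear winner: deepen
        best = max(best, children[0], key=score)
    return best
```

The `budget` cap mirrors the test-time-compute framing: the search spends a fixed number of expansions and allocates them between breadth and depth on the fly.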