Research Track: Inference-Time Reasoning for Vision-Language Models

Overview

This page is for students who are interested in improving vision-language models at inference time through reasoning, verification, or structured test-time computation.

You do not need to begin with large-scale multimodal training or expensive hardware. This track is designed for students who want a modern research topic with clear conceptual questions and a realistic paper-reading path.

The goal of this track is to explore questions such as:
- Can a vision-language model answer better at inference time if it is given extra reasoning, sampling, or search?
- When does verification or refinement help more than simply generating longer chains of thought?
- Which tasks actually benefit from extra test-time computation, and why?

If you are interested in these questions, this track may be a good place to start.

What to avoid at the beginning

Do not begin with:
- large-scale multimodal training runs
- experiments that require expensive hardware
- very large end-to-end systems

At the beginning, it is better to focus on a small number of papers with a clear inference-time idea.

When to contact me

If you read some of the papers on this page and find the questions interesting, please contact me.

Part I. A simple starting path

A good starting path is the following:
1. Read the core background papers in Part II.
2. Pick one of the directions in Part III.
3. Try the suggested first mini-project in Part VI.

You do not need to do everything at once.

Part II. Core background for multimodal inference-time reasoning

These are the most important shared starting points for this page.

1. CLIP (ICML 2021)

Paper: Learning Transferable Visual Models From Natural Language Supervision
Code: https://github.com/openai/CLIP

Why read it: a strong starting point for modern vision-language transfer
Focus on: image-text alignment, zero-shot classification, prompt-based inference
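The zero-shot inference idea can be sketched without the real model: embed the image and a set of text prompts, then pick the prompt with the highest cosine similarity. The embeddings and labels below are toy stand-ins for actual CLIP encoder outputs, not CLIP itself.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """CLIP-style zero-shot classification: return the label whose
    text embedding is most cosine-similar to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb  # one cosine similarity per label
    return labels[int(np.argmax(sims))]

# Toy embeddings standing in for real encoder outputs.
labels = ["cat", "dog"]
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([[1.0, 0.0, 0.0],   # prompt: "a photo of a cat"
                      [0.0, 1.0, 0.0]])  # prompt: "a photo of a dog"
print(zero_shot_classify(image_emb, text_embs, labels))  # → cat
```

The key point for this track is that the "classifier" is constructed entirely at inference time from text prompts, with no new training.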

2. CoMT / TTS-CoMT (Findings of ACL 2025)

Paper: Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study
Code: https://github.com/DeepLearnXMU/TTS_COMT

Why read it: one of the clearest entry points for this topic
Focus on: multimodal thought, sampling-based scaling, tree-search-based scaling, verifier-guided reasoning
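The sampling-based scaling the paper studies can be reduced to best-of-N selection with a verifier. The functions `sample_fn` and `verify_fn` below are hypothetical stubs for a multimodal model's decoder and a learned verifier; they are illustrative only, not the paper's actual interfaces.

```python
def best_of_n(sample_fn, verify_fn, n=8):
    """Sampling-based test-time scaling: draw n candidate reasoning
    chains and keep the one the verifier scores highest."""
    candidates = [sample_fn(i) for i in range(n)]
    return max(candidates, key=verify_fn)

# Hypothetical stubs; a real sampler would decode chains from a VLM.
def sample_fn(i):
    return {"answer": 3 + (i % 3), "chain": f"chain-{i}"}

def verify_fn(cand):
    # A real verifier would score consistency with the image.
    return 1.0 if cand["answer"] == 4 else 0.0

best = best_of_n(sample_fn, verify_fn)
print(best["answer"])  # → 4
```

Tree-search-based scaling replaces the flat list of candidates with a search over partial chains, but the verifier plays the same selection role.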

3. UniT (preprint, 2026)

Paper: UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Project page: Meta AI publication page

Why read it: a strong next step for understanding multi-round reasoning in unified multimodal models
Focus on: iterative reasoning, verification, subgoal decomposition, refinement across multiple rounds
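Multi-round refinement of this kind boils down to a generate-verify-refine loop. The three callbacks below are hypothetical stubs, not UniT's actual components; they only show the control flow.

```python
def iterative_refine(generate, verify, refine, max_rounds=3, threshold=0.9):
    """Multi-round test-time reasoning: keep refining the answer until
    the verifier is satisfied or the round budget runs out."""
    answer = generate()
    for _ in range(max_rounds):
        if verify(answer) >= threshold:
            break
        answer = refine(answer)
    return answer

# Hypothetical stubs standing in for a unified multimodal model.
def generate():
    return "draft-0"

def verify(ans):
    return 1.0 if ans == "draft-2" else 0.0  # toy success criterion

def refine(ans):
    return f"draft-{int(ans.split('-')[1]) + 1}"

print(iterative_refine(generate, verify, refine))  # → draft-2
```

Subgoal decomposition would sit inside `refine`, which could rewrite only the failing step instead of the whole answer.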

4. MindJourney (NeurIPS 2025 poster)

Paper: MindJourney: Test-Time Scaling with World Models for Spatial Reasoning
Code: https://github.com/UMass-Embodied-AGI/MindJourney
Project page: MindJourney project page

Why read it: a good example of reasoning that goes beyond text generation
Focus on: spatial reasoning, world models, multi-view evidence gathering, visual imagination at test time
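The world-model idea can be caricatured as a small search over imagined views: for each candidate camera action, a world model predicts the resulting view, and the most informative one is kept as extra evidence. `imagine` and `informativeness` below are hypothetical stubs, not MindJourney's components.

```python
def explore_views(imagine, informativeness, actions, budget=4):
    """Test-time spatial exploration: imagine the view after each
    candidate action and return the most informative (action, view)."""
    tried = [(a, imagine(a)) for a in actions[:budget]]
    return max(tried, key=lambda av: informativeness(av[1]))

# Hypothetical stubs; a real system would query a learned world model.
def imagine(action):
    return f"view-after-{action}"

def informativeness(view):
    return len(view)  # toy score; a real score would use the VLM itself

action, view = explore_views(imagine, informativeness,
                             ["left", "right", "forward"])
print(action)  # → forward
```

The budget parameter is where the "scaling" lives: more imagined views cost more compute but can gather more evidence.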

5. Frankenstein-Style Analysis (preprint, 2026)

Paper: What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

Why read it: a useful reality check
Focus on: what visual reasoning gains actually mean, where improvements happen inside the model, limitations of benchmark-only evaluation

Part III. Main track — Inference-Time Reasoning for VLMs

Typical question:

How can we improve a vision-language model at inference time through reasoning, verification, or controlled visual exploration, without changing the base model itself?

Why this track is good

This track is suitable for students who want:
- a modern research topic with clear conceptual questions
- a realistic paper-reading path
- to start without large-scale training or expensive hardware

Possible directions

A. Multimodal chain-of-thought and test-time scaling

Why study it: this is the most direct analogue of inference-time reasoning in LLMs
Good for students because: the central question is easy to understand and many recent papers are directly comparable

Recommended papers:
- CoMT / TTS-CoMT (Findings of ACL 2025)
- UniT (preprint, 2026)

Typical questions:
- How far does simple sampling-based scaling go for multimodal chains of thought?
- When do tree search or a verifier help beyond drawing more samples?

B. Verification, self-correction, and grounded refinement

Why study it: longer reasoning is not automatically better in VLMs, so verification becomes important
Good for students because: this direction forces careful thinking about what "reasoning" actually means in a multimodal model

Possible direction:
Study when verification and grounded refinement actually correct errors, rather than assuming that longer reasoning chains are automatically better.

This direction fits naturally with recent multimodal test-time scaling work because those papers explicitly discuss verification and refinement rather than one-pass answering.
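One minimal notion of grounding, offered here as an illustrative sketch: accept a reasoning chain only if every object it mentions is actually visible in the image. The detector output below is a toy stand-in for a real object detector.

```python
def grounded_verify(claimed_objects, detected_objects):
    """Grounded verification: reject reasoning chains that mention
    objects the image does not contain."""
    return all(obj in detected_objects for obj in claimed_objects)

# Toy stand-in for an object detector's output on one image.
detected = {"dog", "ball", "grass"}
print(grounded_verify({"dog", "ball"}, detected))     # → True
print(grounded_verify({"dog", "frisbee"}, detected))  # → False
```

A check this simple already separates "longer reasoning" from "visually grounded reasoning", which is the distinction this direction turns on.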

C. Spatial reasoning and world-model-assisted inference

Why study it: some visual reasoning problems require the model to imagine unseen views or state transitions
Good for students because: this gives a more concrete and visual notion of reasoning than just generating longer text

Recommended paper:
- MindJourney (NeurIPS 2025 poster)

Possible direction:
Study when imagining unseen views with a world model actually improves spatial reasoning at test time.

D. Limits of visual reasoning improvement

Why study it: not every benchmark gain corresponds to genuine visual reasoning
Good for students because: this direction is good for careful readers who want to do analysis rather than only follow performance numbers

Recommended paper:
- Frankenstein-Style Analysis (preprint, 2026)

Possible direction:
Analyze where visual reasoning improvements actually happen inside the model, rather than relying on benchmark numbers alone.

Part IV. A promising publication direction

A strong project in this track could be:

Grounded inference-time reasoning for vision-language models under realistic compute constraints

This direction is attractive because it combines:
- inference-time reasoning and verification
- visual grounding
- realistic compute constraints

The page should stay focused on one question:

How can we improve a VLM at inference time while keeping the reasoning process visually grounded and computationally realistic?

Part V. Good starter benchmarks

At the beginning, it is better to avoid very large or overly complex settings.

Good starter tasks include:
- small spatial reasoning tasks of the kind studied in MindJourney
- tasks where sampling-based or verifier-guided scaling has already been evaluated, as in CoMT / TTS-CoMT

Avoid very large end-to-end systems at the beginning.

A good early goal is not to cover every benchmark, but to understand which tasks actually benefit from extra inference-time reasoning and why.

Part VI. Suggested first mini-project

A strong first project is:
Take two of the core papers (for example CoMT / TTS-CoMT and UniT), compare their inference-time strategies on the same tasks, and analyze when the extra test-time computation actually helps.

This is a good starting point because it answers a concrete and modern question:

Can better inference-time reasoning improve multimodal performance without changing the underlying model?

For a reading group, this can be done entirely as a paper-analysis task without running experiments.

Final note

It is better to have one clear track than to combine too many different ideas under the word "reasoning."

So this page should stay centered on:
inference-time reasoning, verification, and structured test-time computation for vision-language models

This makes the track broad enough to be interesting, but still focused enough for students to follow.