In this lab, we study how to optimize advanced machine learning algorithms and bring them into real systems, so that their outstanding performance can actually be exploited in practice. A deeper understanding of both computing systems and emerging algorithms opens up new opportunities for high efficiency. Rather than relying on hand-crafted optimizations, we often use machine learning itself to search for the best solution among candidates. In short, our ultimate goal is the acceleration of machine learning algorithms, and we are happy to use machine learning algorithms as the tool for that acceleration.
LLMs · quantization · model compression · inference efficiency · parameter-efficient fine-tuning · AutoML
Large Language Model (LLM) Optimization research focuses on making enormous language models more efficient, adaptable, and practical. This includes techniques to compress models, merge knowledge from multiple specialized models, and extend their context lengths without sacrificing performance. By optimizing LLMs, we address the pressing real-world need to deploy advanced language AI on limited hardware and ensure they remain cost-effective and fast. This research is crucial as LLMs are increasingly used in applications from virtual assistants to scientific analysis, where improvements in speed or memory footprint can significantly broaden their impact.
GraLoRA: Granular Low-Rank Adaptation for Stable Fine-Tuning (NeurIPS 2025, Spotlight):
GraLoRA enhances the stability and precision of LLM fine-tuning through a granular low-rank adaptation strategy that learns separate adapters for smaller matrix partitions. This fine-grained structure prevents overfitting and abrupt performance drops, yielding smoother optimization. The result is full fine-tuning performance with far fewer trainable parameters and greater robustness to hyperparameters and noise.
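A minimal sketch of the underlying idea, assuming a PyTorch setting: the frozen weight is viewed as a grid of sub-blocks, and each block gets its own small low-rank adapter. The class and variable names below are illustrative, not the released GraLoRA code.

```python
# Illustrative block-partitioned low-rank adapter (not the authors' implementation).
import torch
import torch.nn as nn

class BlockwiseLoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, blocks=2):
        super().__init__()
        assert in_features % blocks == 0 and out_features % blocks == 0
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        self.blocks = blocks
        bi, bo = in_features // blocks, out_features // blocks
        # One (A, B) pair per (output-block, input-block) partition of the weight.
        self.A = nn.ParameterList([nn.Parameter(torch.randn(rank, bi) * 0.01)
                                   for _ in range(blocks * blocks)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(bo, rank))
                                   for _ in range(blocks * blocks)])

    def forward(self, x):
        bi = x.shape[-1] // self.blocks
        out_blocks = []
        for r in range(self.blocks):          # output block index
            acc = 0
            for c in range(self.blocks):      # input block index
                idx = r * self.blocks + c
                xc = x[..., c * bi:(c + 1) * bi]
                acc = acc + (xc @ self.A[idx].T) @ self.B[idx].T
            out_blocks.append(acc)
        return self.base(x) + torch.cat(out_blocks, dim=-1)

layer = BlockwiseLoRALinear(64, 64, rank=4, blocks=2)
print(layer(torch.randn(3, 64)).shape)  # torch.Size([3, 64])
```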
AMQ: Enabling AutoML for Mixed-Precision Weight-Only Quantization of Large Language Models (EMNLP 2025, Oral):
AMQ introduces an AutoML-based quantization framework that automatically allocates mixed-precision bit-widths across LLM layers for optimal efficiency. It integrates pruning, proxy modeling, quality prediction, and iterative refinement to identify Pareto-optimal configurations in a single search. The method surpasses previous mixed-precision quantization methods, achieving faster inference and better accuracy under strict memory budgets.
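As a rough illustration of the search problem AMQ addresses, the sketch below greedily assigns per-layer bit-widths under a memory budget using a stand-in quality predictor. The layer sizes, candidate bit-widths, and the proxy_quality function are invented for the example; the actual framework relies on much stronger proxies and search.

```python
# Hypothetical mixed-precision bit-width search under a memory budget.
layer_params = {"q_proj": 4e6, "k_proj": 4e6, "v_proj": 4e6, "mlp_up": 11e6, "mlp_down": 11e6}
candidate_bits = [2, 3, 4]

def proxy_quality(assignment):
    # Stand-in quality predictor: prefer more bits, weight MLP layers more heavily.
    return sum(bits * (2.0 if "mlp" in name else 1.0)
               for name, bits in assignment.items())

def memory_bytes(assignment):
    return sum(layer_params[name] * bits / 8 for name, bits in assignment.items())

budget = 16e6  # bytes
# Start from the cheapest configuration, then greedily upgrade the layer whose
# bit-width increase gives the best predicted quality gain per extra byte.
assignment = {name: min(candidate_bits) for name in layer_params}
improved = True
while improved:
    improved = False
    best = None
    for name, bits in assignment.items():
        higher = [b for b in candidate_bits if b > bits]
        if not higher:
            continue
        trial = dict(assignment, **{name: higher[0]})
        if memory_bytes(trial) > budget:
            continue
        gain = (proxy_quality(trial) - proxy_quality(assignment)) / \
               (memory_bytes(trial) - memory_bytes(assignment))
        if best is None or gain > best[0]:
            best = (gain, trial)
    if best is not None:
        assignment = best[1]
        improved = True

print(assignment, memory_bytes(assignment) / 1e6, "MB")
```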
HOT: Hadamard-based Optimized Training (CVPR 2025):
HOT introduces a Hadamard-based optimization technique that enhances the efficiency and stability of large-scale model training. By replacing conventional dense projection matrices with structured Hadamard transforms, it reduces memory consumption and computational cost while preserving representational power. This approach accelerates training convergence and improves parameter efficiency, providing a scalable and lightweight alternative for training large neural networks.
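The structured-transform idea can be illustrated with a fixed Hadamard matrix used as an orthogonal, weight-free projection; this is only a conceptual sketch, not the training pipeline described in the paper, and the fast O(n log n) Walsh-Hadamard routine is omitted for brevity.

```python
# Conceptual sketch: a Hadamard matrix as a structured projection with no stored weights.
import numpy as np
from scipy.linalg import hadamard

n = 256                                             # must be a power of two
H = hadamard(n).astype(np.float32) / np.sqrt(n)     # orthonormal Hadamard matrix

x = np.random.randn(8, n).astype(np.float32)        # a batch of activations
y = x @ H                                           # structured projection
x_rec = y @ H.T                                     # H is orthogonal, so it is invertible
print(np.allclose(x, x_rec, atol=1e-4))             # True
```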
QEFT: Quantization for Efficient Fine-Tuning of LLMs (Findings of ACL 2024):
QEFT enables low-precision fine-tuning of LLMs to reduce memory and compute demands while maintaining model quality. By analyzing how quantization affects training dynamics, QEFT shows that quantization-aware fine-tuning can retain task performance with minimal degradation. This approach makes model adaptation lightweight, practical, and resource-efficient for large-scale deployment.
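A generic fake-quantization sketch with a straight-through estimator gives a feel for quantization-aware fine-tuning; it is not the QEFT recipe itself, and the bit-width and scaling choices below are placeholders.

```python
# Generic fake-quantization with a straight-through estimator (illustrative only).
import torch

def fake_quantize(w, bits=4):
    # Symmetric per-tensor quantization; gradients pass straight through the rounding.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()   # forward uses w_q, backward sees the identity

w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16)
loss = (x @ fake_quantize(w)).pow(2).mean()
loss.backward()
print(w.grad.shape)                 # gradients flow to the full-precision weights
```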
Outlier-Aware Weight Quantization for Efficient Fine-Tuning of LLMs (AAAI 2024, Oral):
This work proposes an outlier-aware quantization method that preserves the rare but important outlier weights during compression. By dynamically assigning higher precision to these outlier weights, it minimizes performance loss in low-bit settings. The technique significantly reduces LLM size while maintaining accuracy, enabling deployment on edge and memory-constrained systems.
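The core mechanism can be sketched in a few lines: pick the columns with the largest magnitudes as outliers, keep them in full precision, and quantize the rest to low bits. The ratio, grouping, and selection criterion below are illustrative simplifications of the paper's method.

```python
# Toy outlier-aware weight quantization: large-magnitude columns stay in full precision.
import numpy as np

def quantize_with_outliers(W, bits=4, outlier_ratio=0.01):
    qmax = 2 ** (bits - 1) - 1
    col_norm = np.abs(W).max(axis=0)
    k = max(1, int(outlier_ratio * W.shape[1]))
    outlier_cols = np.argsort(col_norm)[-k:]           # keep these in full precision
    mask = np.ones(W.shape[1], dtype=bool)
    mask[outlier_cols] = False
    W_hat = W.copy()
    scale = np.abs(W[:, mask]).max() / qmax
    W_hat[:, mask] = np.clip(np.round(W[:, mask] / scale), -qmax - 1, qmax) * scale
    return W_hat, outlier_cols

W = np.random.randn(128, 512).astype(np.float32)
W[:, 7] *= 50.0                                        # inject an outlier column
W_hat, outliers = quantize_with_outliers(W)
print(outliers, np.abs(W - W_hat).max())
```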
Diffusion models · autoregressive generation · visual synthesis · parallelization · latency reduction · efficiency
Visual Generative Model Acceleration research is about speeding up image and video generation without compromising output quality. Modern generative models like diffusion models and autoregressive image transformers can create stunning visuals but often require lengthy sequential computations. Our work in this category develops novel algorithms to make these models run faster and more efficiently, which is vital for real-world use cases (e.g. real-time graphics, interactive design tools) where slow generation is a bottleneck. We focus on clever reuse of computations and parallel generation techniques so that creative AI applications become more responsive and scalable.
Grouped Speculative Decoding for Autoregressive Image Generation (ICCV 2025):
GSD adapts speculative decoding, originally from text generation, to autoregressive image models by predicting and verifying groups of tokens at once. A lightweight draft model proposes multiple future tokens, which the main model validates in a single forward pass. This grouping strategy achieves up to 4× faster image generation while maintaining visual quality.
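A toy draft-and-verify loop makes the control flow concrete; the draft_model and target_model stand-ins below are placeholders rather than real networks, and the acceptance rule is the usual greedy prefix check rather than the paper's grouped variant.

```python
# Toy speculative-decoding loop: a cheap draft proposes a group of tokens, and the
# target checks the whole group in one pass, keeping the longest prefix it agrees with.
import random

def draft_model(prefix, k):
    return [random.randint(0, 9) for _ in range(k)]        # k proposed tokens

def target_model(prefix, proposed):
    # Stand-in verifier: the target's own choice at each proposed position.
    return [(sum(prefix) + i) % 10 for i in range(len(proposed))]

def speculative_generate(prefix, steps=4, group=4):
    tokens = list(prefix)
    for _ in range(steps):
        proposal = draft_model(tokens, group)
        verified = target_model(tokens, proposal)          # one "forward pass"
        accepted = []
        for p, v in zip(proposal, verified):
            if p != v:
                accepted.append(v)                         # correct the first mismatch and stop
                break
            accepted.append(p)
        tokens.extend(accepted)
    return tokens

print(speculative_generate([1, 2, 3]))
```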
Picard Consistency Model for Fast Parallel Sampling of Diffusion Models (CVPR 2025):
PCM introduces a parallelizable diffusion framework inspired by Picard iteration to overcome sequential sampling limitations. It predicts and refines denoised outputs in parallel, drastically cutting inference time while maintaining quality. Beyond image generation, PCM accelerates action policy generation in robotics, offering a pathway to real-time embodied intelligence.
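The Picard-iteration view can be demonstrated on a toy ODE: the whole trajectory is treated as the fixed point of an integral equation and refined in parallel rather than integrated step by step. This sketch only illustrates the numerical idea, not the consistency-model training used in PCM.

```python
# Picard iteration on x(t) = x(0) + \int_0^t f(x(s), s) ds: every timestep is
# updated at once per iteration, instead of sequential time stepping.
import numpy as np

def f(x, t):
    return -x                        # toy dynamics with known solution x(t) = x0 * exp(-t)

t = np.linspace(0.0, 1.0, 65)
dt = t[1] - t[0]
x0 = 1.0
x = np.full_like(t, x0)              # initial guess: a constant trajectory

for _ in range(20):                  # each Picard iteration refines all timesteps in parallel
    drift = f(x, t)
    integral = np.concatenate([[0.0], np.cumsum(0.5 * (drift[1:] + drift[:-1]) * dt)])
    x = x0 + integral

print(np.max(np.abs(x - x0 * np.exp(-t))))   # small residual vs. the exact solution
```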
PTQ4VM: Post-Training Quantization for Visual Mamba (WACV 2025):
PTQ4VM introduces the first post-training quantization framework designed specifically for Visual Mamba, addressing the unique challenges of state-space model quantization. By modeling the sensitivity of Mamba’s selective update and state-transition mechanisms, it maintains stability under low-bit settings. The method achieves significant memory and latency reductions while preserving visual accuracy, setting a foundation for efficient deployment of Mamba-based vision models.
FRDiff: Feature Reuse for Training-Free Acceleration of Diffusion Models (ECCV 2024):
FRDiff accelerates diffusion inference by exploiting feature redundancy across consecutive denoising steps. Instead of recomputing similar intermediate features, it reuses and refines them, reducing computation without retraining. This method delivers a Pareto-optimal trade-off between speed and quality, offering a simple, training-free path to faster diffusion generation.
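A schematic loop shows the reuse pattern: the expensive block is recomputed only every few denoising steps and its cached output is reused in between. The functions below are stand-ins, and the real method decides what to reuse based on measured feature similarity rather than a fixed interval.

```python
# Schematic feature reuse across denoising steps (toy stand-ins, not FRDiff itself).
import numpy as np

def expensive_block(x, t):
    return np.tanh(x + 0.01 * t)        # stand-in for a heavy UNet block

def cheap_head(feat, x, t):
    return x - 0.05 * feat              # stand-in for the lightweight remainder of a step

def denoise(x, steps=50, refresh_every=5):
    cached = None
    for t in reversed(range(steps)):
        if cached is None or t % refresh_every == 0:
            cached = expensive_block(x, t)   # recompute occasionally
        x = cheap_head(cached, x, t)         # reuse the cached feature otherwise
    return x

print(denoise(np.random.randn(4)))
```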
Temporal Dynamic Quantization for Diffusion Models (NeurIPS 2023):
TDQ introduces a step-aware quantization strategy for diffusion models, dynamically adjusting precision across denoising stages. Early, noise-heavy steps use higher precision for stability, while later steps adopt lower precision for efficiency. This temporal adaptation significantly reduces memory and compute costs without compromising image fidelity.
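A toy step-dependent schedule conveys the idea: noisier timesteps get more bits and later timesteps fewer. The thresholds and bit-widths below are made up, and TDQ derives its precision policy rather than hard-coding one.

```python
# Toy step-dependent quantization schedule for diffusion sampling (illustrative numbers).
import numpy as np

def bits_for_step(t, total_steps):
    frac = t / total_steps              # t counts down from total_steps to 0 during sampling
    return 8 if frac > 0.7 else 6 if frac > 0.3 else 4

def quantize(x, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-8
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

total = 50
x = np.random.randn(1024)
for t in reversed(range(total)):
    x_q = quantize(x, bits_for_step(t, total))   # per-step precision for activations
print("final step used", bits_for_step(0, total), "bits")
```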
Robot learning · generative control · visuomotor policy · retrieval-based demonstration · embodied cognition · adaptive behavior
Efficient Embodied Intelligence research develops AI policies that let robots and embodied systems act naturally and reliably in complex real-world environments, while making the most of limited computational resources. The focus is on integrating powerful generative AI (like diffusion models) into robot control loops in a way that balances deliberation and reactivity. This is important because robots operating in homes, hospitals, or factories need to respond swiftly to the unexpected (robustness) yet also execute smooth, coherent actions (consistency). Our work in this area bridges cutting-edge AI planning algorithms with practical robotics, ensuring that intelligent agents can perform diverse tasks efficiently and adapt on the fly to new situations.
Retrieval-Based Demonstration Refinement for Robot Manipulation (Ongoing Work):
This work proposes a retrieval-driven learning framework that allows robots to learn continuously from an expanding repository of expert demonstrations without retraining. By embedding visual, linguistic, and motor information in a shared space, the robot retrieves the most relevant examples and refines its behavior through imitation. This approach enables scalable, adaptive robot learning that generalizes across tasks while eliminating the cost of repeated fine-tuning.
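The retrieval step itself is simple to sketch: embed the current observation, score it against a repository of demonstration embeddings, and return the nearest examples. The embeddings below are random placeholders, and the shared visual-language-motor encoder is not shown.

```python
# Toy nearest-neighbour lookup over a demonstration repository (made-up embeddings).
import numpy as np

demo_embeddings = np.random.randn(1000, 128)                    # repository of demonstrations
demo_embeddings /= np.linalg.norm(demo_embeddings, axis=1, keepdims=True)

def retrieve(query_embedding, k=5):
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = demo_embeddings @ q                                # cosine similarity
    return np.argsort(scores)[-k:][::-1]                        # indices of the k closest demos

print(retrieve(np.random.randn(128)))
```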
Improving Generative Behavior Cloning via Self-Guidance and Adaptive Chunking (NeurIPS 2025):
This research enhances diffusion-based robot control by integrating self-guidance and adaptive chunking into behavior cloning. Self-guidance incorporates recent observations to improve safety and accuracy, while adaptive chunking enables dynamic replanning when conditions change. Together, they significantly boost task success and computational efficiency, achieving robust, responsive control for complex manipulation tasks.
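A schematic control loop illustrates adaptive chunking: a chunk of actions is executed open-loop, but the policy replans early whenever the observation drifts from what was expected at planning time. The policy, dynamics, and threshold below are stand-ins, not the paper's models.

```python
# Schematic adaptive-chunking control loop with stand-in policy and dynamics.
import numpy as np

def plan_chunk(obs, horizon=8):
    actions = [0.5 * obs for _ in range(horizon)]                 # stand-in policy output
    predicted = [obs * 0.95 ** (i + 1) for i in range(horizon)]   # what we expect to observe
    return actions, predicted

def step_env(obs, action):
    return obs - 0.1 * action + np.random.randn() * 0.01          # stand-in dynamics

obs, t = 1.0, 0
while t < 40:
    actions, predicted = plan_chunk(obs)
    for action, expected in zip(actions, predicted):
        obs = step_env(obs, action)
        t += 1
        if abs(obs - expected) > 0.2:   # observation deviates from the plan: replan early
            break
print("final observation:", obs)
```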
Picard Consistency Model for Fast Parallel Sampling of Diffusion Models (CVPR 2025):
PCM introduces a parallelizable diffusion framework inspired by Picard iteration to overcome sequential sampling limitations. It predicts and refines denoised outputs in parallel, drastically cutting inference time while maintaining quality. Beyond image generation, PCM accelerates action policy generation in robotics, offering a pathway to real-time embodied intelligence.
System optimization · hardware–software co-design · resource scheduling · hybrid parallelism · processing-in-memory (PIM) · scalability
Performant System Design for AI Applications research focuses on the hardware-software co-design needed to run modern AI models at peak efficiency. As AI models grow in size and complexity, conventional computing architectures (CPUs, GPUs) often become the bottleneck – especially for memory-hungry operations. This research area addresses how to redesign systems and algorithms together: from leveraging novel hardware like Processing-in-Memory (PIM) to automating parallelization strategies for distributed AI training. By building systems that are tailored for AI workloads, we enable faster inference and training, lower energy consumption, and the ability to deploy advanced AI in a range of settings from cloud datacenters to mobile devices.
Automated Resource Allocation for Efficient Training and Inference (Ongoing Work):
This study presents a self-optimizing distributed AI system that automatically determines optimal parallelization strategies for large-scale model training and inference. Leveraging heuristic search and learned cost models, it dynamically balances data, tensor, and pipeline parallelism across heterogeneous hardware. The approach maximizes throughput and minimizes energy consumption, enabling scalable, adaptive resource management without manual tuning.
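A hypothetical cost-model-driven search gives a flavor of the problem: enumerate (data, tensor, pipeline) parallelism degrees that fit the GPU count and pick the one with the lowest predicted step time. All constants and cost formulas below are invented for illustration; the actual system uses learned cost models and a richer search.

```python
# Hypothetical parallelism search over a toy analytical cost model.
from itertools import product

GPUS = 16
MODEL_FLOPS, PARAMS, BANDWIDTH, PEAK = 6e12, 7e9, 100e9, 300e12   # made-up hardware/model numbers

def predicted_step_time(dp, tp, pp):
    compute = MODEL_FLOPS / (tp * pp) / PEAK                               # per-GPU compute time
    grad_sync = 2 * (PARAMS / (tp * pp)) * 4 / BANDWIDTH * (dp - 1) / dp   # data-parallel all-reduce
    bubble = (pp - 1) * compute * 0.1                                      # crude pipeline-bubble penalty
    return compute + grad_sync + bubble

configs = [(dp, tp, pp) for dp, tp, pp in product([1, 2, 4, 8, 16], repeat=3)
           if dp * tp * pp == GPUS]
best = min(configs, key=lambda c: predicted_step_time(*c))
print("best (dp, tp, pp):", best)
```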
Cost-Effective Extension of DRAM-PIM for Group-wise LLM Quantization (IEEE CAL 2025):
This work proposes a hardware–algorithm co-design approach that extends DRAM-based Processing-in-Memory (PIM) to efficiently handle group-wise quantized LLM operations. By integrating quantization-aware dataflow directly into memory, it reduces data movement bottlenecks and improves throughput. The result is a cost-effective system that accelerates inference for large-scale models on memory-bound platforms.
Fast Performance Prediction for Efficient Distributed DNN Training (IEEE CAL 2023):
This work presents a lightweight performance prediction model that enables efficient planning of distributed deep neural network (DNN) training. By accurately estimating computation and communication costs under various parallelism strategies, it eliminates the need for expensive full-scale profiling. The proposed method reduces exploration time by orders of magnitude, allowing practitioners to rapidly identify optimal configurations for large-scale training environments.
Reliability · robustness · long-context reasoning · inference consistency · contrastive decoding · model merging
Robust and Reliable AI research aims to ensure that AI systems not only perform well under ideal conditions but also maintain their integrity and trustworthiness in the wild. As AI is deployed in high-stakes domains (from medical diagnosis to autonomous driving), it must be resilient to noisy inputs, adversarial perturbations, or shifts in context, and it should produce outputs that users can trust. Our work in this category ranges from mitigating issues like hallucinations in language models to techniques for stable learning across domains. Ultimately, this research strives to make AI behavior more predictable, transparent, and aligned with human expectations, which is essential for broader adoption of AI technologies.
PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality (EMNLP 2025):
PruneCD tackles the issue of hallucinations in LLMs through a contrastive decoding framework that compares outputs from a full model and a pruned “self” model. The pruned model provides corrective feedback during decoding, helping the main model avoid overconfident or inaccurate generations. This design enhances factual accuracy while maintaining inference speed, offering a lightweight path to more trustworthy LLM outputs.
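A generic contrastive-decoding step shows the mechanics: the pruned model's logits act as a contrast signal that down-weights tokens the weaker model is also confident about, restricted to tokens the full model already finds plausible. The exact scoring used by PruneCD may differ from this sketch.

```python
# Generic contrastive-decoding step with a plausibility constraint (illustrative).
import numpy as np

def contrastive_next_token(logits_full, logits_pruned, alpha=1.0, plaus_cut=0.1):
    p_full = np.exp(logits_full - logits_full.max())
    p_full /= p_full.sum()
    score = logits_full - alpha * logits_pruned          # contrast against the pruned model
    score[p_full < plaus_cut * p_full.max()] = -np.inf   # keep only plausible tokens
    return int(np.argmax(score))

vocab = 10
logits_full = np.random.randn(vocab)
logits_pruned = np.random.randn(vocab)
print(contrastive_next_token(logits_full, logits_pruned))
```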
SEAL: Scaling to Emphasize Attention for Long-Context Retrieval (ACL 2025):
SEAL introduces per-head and per-channel scaling within the self-attention mechanism to improve long-context understanding in LLMs. By adaptively emphasizing key information over extended inputs, SEAL boosts retrieval accuracy on long-sequence benchmarks. This method enables stable and efficient reasoning even as the model’s context window grows.
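The scaling itself is easy to sketch: learned per-head and per-channel factors multiply the attention output. Where SEAL inserts its scales and how they are calibrated is not captured by this toy module.

```python
# Toy per-head and per-channel scaling applied to attention outputs (illustrative placement).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledHeads(nn.Module):
    def __init__(self, n_heads, head_dim):
        super().__init__()
        self.head_scale = nn.Parameter(torch.ones(n_heads, 1, 1))    # one factor per head
        self.chan_scale = nn.Parameter(torch.ones(1, 1, head_dim))   # one factor per channel

    def forward(self, q, k, v):
        # q, k, v: (batch, heads, seq, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)
        return out * self.head_scale * self.chan_scale

q = k = v = torch.randn(2, 8, 128, 64)
print(ScaledHeads(8, 64)(q, k, v).shape)   # torch.Size([2, 8, 128, 64])
```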
Merge-Friendly Domain Adaptation via Model Merging (2024):
This research proposes a parameter-merging approach to build multi-domain models from specialized experts without retraining. It resolves conflicts between domain-specific knowledge, ensuring each task’s performance is preserved after merging. The result is a robust, unified model capable of adapting to diverse domains—advancing continual and lifelong learning for reliable AI systems.
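As background, the simplest form of parameter merging is a weighted average of expert state dicts, sketched below; the paper's merge-friendly adaptation goes well beyond this naive average to resolve conflicts between domains.

```python
# Naive parameter merging: a weighted average of matching tensors from expert models.
import torch

def merge_state_dicts(dicts, weights):
    merged = {}
    for key in dicts[0]:
        merged[key] = sum(w * d[key] for d, w in zip(dicts, weights))
    return merged

expert_a = {"layer.weight": torch.randn(4, 4)}
expert_b = {"layer.weight": torch.randn(4, 4)}
merged = merge_state_dicts([expert_a, expert_b], [0.5, 0.5])
print(merged["layer.weight"].shape)
```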