HanRong Ye 叶汉荣

Ph.D. student at CSE HKUST


E-mail

hanrong.ye  AT  connect.ust.hk


Office

4/F Academic Building, HKUST

Clear Water Bay, Kowloon, Hong Kong 

Selected Projects

# Simultaneous 2D/3D Multi-Task Scene Perception

DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data

Accepted by the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Website, Paper, GitHub

Hanrong Ye and Dan Xu

We design a joint multi-task denoising diffusion framework that significantly improves prediction quality in the partially annotated multi-task learning setting, where the initial multi-task predictions are noisy due to the severe lack of ground-truth supervision.
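A minimal sketch of the core idea (illustrative only; module and variable names such as `MultiTaskDenoiser` are hypothetical, not the released DiffusionMTL code): the noisy per-task prediction maps are concatenated with shared image features and refined jointly, with one lightweight head per task.

```python
# Illustrative sketch of a joint multi-task denoising step (hypothetical names,
# not the released DiffusionMTL code). Noisy per-task prediction maps are
# concatenated with shared image features and refined by a small conv denoiser.
import torch
import torch.nn as nn

class MultiTaskDenoiser(nn.Module):
    def __init__(self, feat_dim, task_channels):  # e.g. {"seg": 40, "depth": 1}
        super().__init__()
        self.task_channels = task_channels
        in_ch = feat_dim + sum(task_channels.values())
        self.shared = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        # One lightweight head per task predicts the refined map.
        self.heads = nn.ModuleDict(
            {t: nn.Conv2d(256, c, 1) for t, c in task_channels.items()}
        )

    def forward(self, image_feat, noisy_preds):
        # image_feat: [B, feat_dim, H, W]; noisy_preds: dict of [B, C_t, H, W]
        x = torch.cat([image_feat] + [noisy_preds[t] for t in self.task_channels], dim=1)
        h = self.shared(x)
        return {t: head(h) for t, head in self.heads.items()}

# One reverse "denoising" pass; in practice this would be applied over multiple
# diffusion timesteps with timestep conditioning.
denoiser = MultiTaskDenoiser(feat_dim=64, task_channels={"seg": 40, "depth": 1})
feat = torch.randn(2, 64, 32, 32)
preds = {"seg": torch.randn(2, 40, 32, 32), "depth": torch.randn(2, 1, 32, 32)}
refined = denoiser(feat, preds)
```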

# Multi-Modality Generation for Segmentation

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

Technical Report 2023

Website, Preprint, GitHub

Hanrong Ye, Jason Kuen, Qing Liu, Zhe Lin, Brian Price, Dan Xu

Can generative models help boost segmentation performance on prevalent segmentation benchmarks? We propose a new training-data generation method for image segmentation, the first data generation method that pushes the performance limits of state-of-the-art semantic/panoptic/instance segmentation models to a significant extent. We first leverage text to generate diverse segmentation masks, and then use the masks to synthesize the corresponding realistic images. This generation order avoids the "chicken-or-egg" dilemma (annotating synthetic images would require the very segmentation models we aim to improve), and is thus able to produce very high-quality synthetic data, which translates into significantly improved segmentation performance.

Notably, on ADE20K semantic segmentation, Mask2Former R50 is largely boosted from 47.2 to 49.9 mIoU (+2.7), and Mask2Former Swin-L is significantly increased from 56.1 to 57.4 mIoU (+1.3). We also observe strong gains on COCO panoptic segmentation (+0.7 PQ) and instance segmentation (+0.8 AP). Moreover, training with our synthetic data makes the segmentation models more robust to unseen domains.
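A rough sketch of the generation order (the function names below are hypothetical stand-ins, not the released SegGen models): the mask is synthesized from text first, and the image is then rendered to match that mask, so every synthetic image comes with a pixel-aligned annotation by construction.

```python
# Illustrative sketch of the SegGen data-generation order (all function names
# are hypothetical stand-ins, not the released API). The crucial point is that
# the label (mask) is generated first and the image is rendered to match it.
import numpy as np

def text_to_mask(prompt, size=(512, 512), num_classes=150):
    # Stand-in for the Text2Mask generator: returns a class-index mask.
    return np.random.randint(0, num_classes, size=size)

def mask_to_image(mask, prompt):
    # Stand-in for the Mask2Img generator: returns an RGB image aligned with the mask.
    return np.random.randint(0, 256, size=(*mask.shape, 3), dtype=np.uint8)

def generate_synthetic_pairs(prompts):
    pairs = []
    for prompt in prompts:
        mask = text_to_mask(prompt)          # 1) synthesize the label layout from text
        image = mask_to_image(mask, prompt)  # 2) render a realistic image for that layout
        pairs.append((image, mask))
    return pairs

# Synthetic (image, mask) pairs are then mixed into the real training set of a
# segmentation model (e.g. Mask2Former) during training.
synthetic = generate_synthetic_pairs(["a living room with a sofa and a lamp"])
```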

# Simultaneous 2D/3D Multi-Task Scene Perception

TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts

Accepted by the IEEE/CVF International Conference on Computer Vision (ICCV) 2023

Paper, Cite

Hanrong Ye and Dan Xu

TaskExpert is a novel multi-task mixture-of-experts model that learns multiple representative task-generic feature spaces and decodes task-specific features in a dynamic manner. Specifically, TaskExpert introduces a set of expert networks to decompose the backbone feature into several representative task-generic features. The task-specific features are then decoded by dynamic task-specific gating networks operating on the decomposed task-generic features. To establish long-range modeling of the task-specific representations across different layers of TaskExpert, we design a multi-task feature memory that is updated at each layer and acts as an additional feature expert for dynamic task-specific feature decoding. More details can be found in the paper.
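A minimal sketch of the mixture-of-experts idea described above (hypothetical names and simplified shapes, not the released TaskExpert implementation): expert networks decompose the backbone tokens, per-task gates produce dynamic mixing weights, and a per-task feature memory joins the experts as one extra candidate.

```python
# Minimal sketch of one TaskExpert-style layer (hypothetical names, simplified).
import torch
import torch.nn as nn

class TaskExpertLayer(nn.Module):
    def __init__(self, dim, num_experts, tasks):
        super().__init__()
        self.tasks = tasks
        # Expert networks decompose the backbone feature into task-generic features.
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        # Per-task gating networks produce dynamic (token-wise) mixing weights over
        # the experts plus one extra slot for the multi-task feature memory.
        self.gates = nn.ModuleDict({t: nn.Linear(dim, num_experts + 1) for t in tasks})

    def forward(self, feat, memory):
        # feat:   [B, N, C] backbone tokens at this layer
        # memory: dict task -> [B, N, C] multi-task feature memory (extra expert)
        expert_feats = torch.stack([e(feat) for e in self.experts], dim=2)  # [B, N, E, C]
        task_feats, new_memory = {}, {}
        for t in self.tasks:
            candidates = torch.cat([expert_feats, memory[t].unsqueeze(2)], dim=2)  # [B, N, E+1, C]
            weights = self.gates[t](feat).softmax(dim=-1).unsqueeze(-1)            # [B, N, E+1, 1]
            task_feats[t] = (weights * candidates).sum(dim=2)                      # dynamic assembly
            new_memory[t] = task_feats[t]  # memory is refreshed layer by layer
        return task_feats, new_memory

layer = TaskExpertLayer(dim=64, num_experts=4, tasks=["seg", "depth", "normal"])
feat = torch.randn(2, 196, 64)
mem = {t: torch.zeros(2, 196, 64) for t in ["seg", "depth", "normal"]}
task_feats, mem = layer(feat, mem)
```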

# Simultaneous 2D/3D Multi-Task Scene Perception

TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding

Accepted by the International Conference on Learning Representations (ICLR) 2023

Paper, GitHub, Cite

Hanrong Ye and Dan Xu

TaskPrompter jointly models task-generic and task-specific representations, as well as cross-task representation interactions, within one single module. Its compact design establishes new SOTA performance while reducing computation costs. Its novelty was well received by all reviewers (ratings: 8, 8, 6, 6, 6). We further propose a new joint 2D-3D multi-task benchmark based on Cityscapes. TaskPrompter generates predictions for 3D detection, segmentation, and depth estimation with one model, one round of training, and one inference pass. Our 3D detection performance largely surpasses the previous best single-task results on Cityscapes.
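A rough, simplified sketch of the spatial prompting idea (hypothetical names; the actual module also performs channel-wise prompting as described in the paper): learnable per-task prompt tokens are concatenated with the patch tokens, so one shared attention layer simultaneously updates the task-generic patch features, the task-specific prompts, and their cross-task interactions.

```python
# Simplified sketch of spatial multi-task prompting (hypothetical names,
# not the released TaskPrompter code).
import torch
import torch.nn as nn

class SpatialTaskPromptingBlock(nn.Module):
    def __init__(self, dim, tasks, num_heads=4):
        super().__init__()
        self.tasks = tasks
        self.prompts = nn.Parameter(torch.randn(len(tasks), dim))  # one prompt token per task
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):
        # patch_tokens: [B, N, C] task-generic image tokens
        b = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)        # [B, T, C]
        x = torch.cat([prompts, patch_tokens], dim=1)                # prompts and patches attend jointly
        x = self.norm(x + self.attn(x, x, x, need_weights=False)[0])
        task_tokens = {t: x[:, i] for i, t in enumerate(self.tasks)}  # task-specific representations
        return task_tokens, x[:, len(self.tasks):]                    # plus updated patch tokens

block = SpatialTaskPromptingBlock(dim=64, tasks=["seg", "depth", "normal"])
task_tokens, patches = block(torch.randn(2, 196, 64))
```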

# Simultaneous 2D/3D Multi-Task Scene Perception

Inverted Pyramid Multi-task Transformer for Dense Scene Understanding 

Accepted by the European Conference on Computer Vision (ECCV) 2022

GitHub, Demo, arXiv, Cite

Hanrong Ye and Dan Xu

We propose a novel end-to-end Inverted Pyramid Multi-task (InvPT) Transformer that simultaneously models spatial positions and multiple tasks in a unified framework. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms the previous SOTA.
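A simplified sketch of one decoding stage (hypothetical names, not the released InvPT code): tokens from all tasks are attended jointly at each stage, and the spatial resolution is doubled stage by stage, forming the inverted pyramid.

```python
# Simplified sketch of an "inverted pyramid" decoding stage (hypothetical names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvertedPyramidStage(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, task_feats):
        # task_feats: dict task -> [B, C, H, W]
        tasks = list(task_feats)
        b, c, h, w = task_feats[tasks[0]].shape
        # Flatten each task map to tokens and concatenate across tasks, so
        # self-attention models spatial and cross-task relations together.
        tokens = torch.cat(
            [task_feats[t].flatten(2).transpose(1, 2) for t in tasks], dim=1)  # [B, T*H*W, C]
        tokens = self.norm(tokens + self.attn(tokens, tokens, tokens, need_weights=False)[0])
        out = {}
        for i, t in enumerate(tasks):
            chunk = tokens[:, i * h * w:(i + 1) * h * w].transpose(1, 2).reshape(b, c, h, w)
            # Inverted pyramid: resolution increases as we move up the decoder.
            out[t] = F.interpolate(chunk, scale_factor=2, mode="bilinear", align_corners=False)
        return out

stage = InvertedPyramidStage(dim=64)
feats = {t: torch.randn(2, 64, 16, 16) for t in ["seg", "depth", "normal"]}
feats = stage(feats)  # each task map is now 32x32
```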

Journal Extension: InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding

Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2024

GitHub, Demo, arXiv, Cite

Hanrong Ye and Dan Xu

In the journal extension, we further propose two types of Cross-Scale Self-Attention modules, namely Fusion Attention and Selective Attention, on top of InvPT. These attention modules facilitate cross-task interaction across different feature scales, and the new designs help InvPT++ achieve notably better performance and efficiency.

About

Introduction

Hi! I am now a final-year Ph.D. student in the CSE department of the Hong Kong University of Science and Technology (HKUST), supervised by Prof. Dan Xu.


I was a happy research intern at Adobe Research (2023) and NVIDIA Research (2024).


I received a master's degree in Computer Science from Peking University (PKU) and a B.S. from the School of Physics, Sun Yat-sen University (SYSU).


Research Experience

Currently, my research focuses on developing multi-task, multi-media, and multi-modality models for simultaneous 2D/3D visual scene perception, vision-language alignment (visual LLMs), and generation.

Teaching

TA: COMP 6411B Advanced Topics on 2D and 3D Deep Visual Scene Understanding @ HKUST (2021 Fall)

TA: COMP4411 Computer Graphics @ HKUST (2022 Spring, 2022 Fall)

The photo, "Foggy University", was taken at my dorm at HKUST in 2024.