Haichao Zhang, Yao Lu, Lichen Wang, Yunzhe Li, Daiwei Chen, Yunpeng Xu, Yun Fu
arXiv Preprint 2512.16891, 2025
We developed LinkedOut, the first Video LLM-based recommendation framework that extracts world knowledge directly from raw video pixels, eliminating the language bottleneck of text summarization approaches. The system uses a Cross-layer Knowledge-fusion MoE for adaptive semantic granularity selection and a store-and-retrieve architecture for 1000x faster inference. It achieves state-of-the-art results on MicroLens, with a 27% HR@10 improvement. Applications include personalized video feeds, content-aware recommendation, and large-scale video ranking systems.
[PDF]
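The store-and-retrieve idea is easy to illustrate. Below is a minimal PyTorch sketch — not the LinkedOut implementation, and all names are hypothetical — in which the expensive video embeddings are computed once offline and online ranking reduces to a cheap similarity lookup:

```python
import torch

# Hypothetical sketch of store-and-retrieve ranking: the expensive video
# encoder runs once per item offline; online scoring is a dot product.
class EmbeddingStore:
    def __init__(self):
        self.ids, self.embs = [], []

    def add(self, video_id: str, emb: torch.Tensor):
        self.ids.append(video_id)
        self.embs.append(emb / emb.norm())              # cache L2-normalized

    def retrieve(self, user_emb: torch.Tensor, k: int = 10):
        table = torch.stack(self.embs)                  # (N, d), cached offline
        scores = table @ (user_emb / user_emb.norm())   # cosine similarity
        top = scores.topk(min(k, len(self.ids)))
        return [(self.ids[i], scores[i].item()) for i in top.indices]

store = EmbeddingStore()
for vid in ["v1", "v2", "v3"]:
    store.add(vid, torch.randn(64))    # stand-in for a Video LLM embedding
print(store.retrieve(torch.randn(64), k=2))
```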
Yaoxin Zhuo, Zachary Bessinger, Lichen Wang, Naji Khosravan, Baoxin Li, Sing Bing Kang
IEEE Winter Conference on Applications of Computer Vision (WACV), 2025
We developed a training-free mask cache framework for few-shot open-vocabulary semantic segmentation that enhances vision-language models without fine-tuning or additional labeled data. The system constructs adaptive key-value mask caches from cross-modal vision-language embeddings using dynamic filtering, channel reduction, and feature alignment with only 2-32 shots. It achieves up to 5% mIoU improvement over state-of-the-art across ViT, ResNet, and Swin-Transformer architectures, enabling efficient adaptation to new visual categories without retraining. Applications include vision-language foundation model adaptation, embodied AI perception, and open-world scene understanding.
[PDF] [PDF_supplement]
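For intuition, here is a minimal training-free key-value cache classifier in the spirit of the approach above; the shapes, affinity function, and beta hyperparameter are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def build_cache(support_feats, support_labels, num_classes):
    keys = F.normalize(support_feats, dim=-1)                 # (S, d) few-shot keys
    values = F.one_hot(support_labels, num_classes).float()   # (S, C) label values
    return keys, values

def cache_predict(query_feats, keys, values, beta=5.0):
    q = F.normalize(query_feats, dim=-1)                      # (Q, d)
    affinity = torch.exp(-beta * (1.0 - q @ keys.T))          # (Q, S) similarities
    return affinity @ values                                  # (Q, C) class scores

keys, values = build_cache(torch.randn(32, 512), torch.randint(0, 4, (32,)), 4)
print(cache_predict(torch.randn(8, 512), keys, values).argmax(dim=-1))
```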
Tonmoay Deb, Lichen Wang, Zachary Bessinger, Naji Khosravan, Eric Penner, Sing Bing Kang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2024
We created ZInD-Tell, the first large-scale multimodal dataset bridging 360° panoramas, floor plans, and natural language for indoor scene understanding, with 1575 properties and 3150 human-validated descriptions. The system uses GPT-4/vision with schema/template-based prompt engineering to extract room connectivity graphs and geometric constraints. We further developed ZInD-Agent, a zero-shot baseline integrating vision-language models with layout estimation, achieving significant improvements on language-based home retrieval and description generation for real estate.
[PDF] [PDF_supplement] [GitHub]
Taotao Jing, Lichen Wang, Naji Khosravan, Zhiqiang Wan, Zachary Bessinger, Zhengming Ding, Sing Bing Kang
IEEE Winter Conference on Applications of Computer Vision (WACV), 2024
We developed iBARLE, a data imbalance-aware framework for panoramic room layout estimation that addresses appearance variations and structural complexity through domain generalization. The system combines appearance variation generation with adaptive style transfer, complex structure mix-up via cross-room data augmentation, and gradient-based constraints for occlusion handling. It achieves state-of-the-art performance on large-scale indoor datasets, accurately predicting layouts for diverse room geometries, enabling automated property visualization, virtual touring, and robust AR/VR experiences.
[PDF]
Chang Liu, Lichen Wang, Yun Fu
ACM International Conference on Multimedia (MM), 2023
We developed NCL, a neighborhood consistency learning framework for pseudo-labeling-based unsupervised domain adaptation that prevents uncertain neighborhoods from being pushed to wrong categories. The system uses correlation matrix matching as the consistency objective, performs dual-level semantic and instance learning, and employs uncertainty-aware aggregation to handle negative neighbors. It achieves up to 5.9% improvement over state-of-the-art. Applications include foundation model adaptation, zero-shot domain transfer, and model robustness under distribution shift.
[PDF]
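A correlation-matrix-matching consistency objective can be sketched in a few lines; this is one plausible form under our assumptions, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def correlation_matrix(feats):
    z = F.normalize(feats, dim=-1)      # (B, d) unit-norm features
    return z @ z.T                      # (B, B) pairwise cosine similarities

def consistency_loss(feats_a, feats_b):
    # Match the neighborhood structure of two views of the same batch.
    return F.mse_loss(correlation_matrix(feats_a), correlation_matrix(feats_b))

x = torch.randn(16, 128, requires_grad=True)
loss = consistency_loss(x, x + 0.05 * torch.randn(16, 128))
loss.backward()
print(loss.item())
```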
Yue Bai, Lichen Wang, Yunyu Liu, Yu Yin, Hang Di, Yun Fu
IEEE Transactions on Image Processing (TIP)
We developed VSDA, an unsupervised auto-encoder framework that improves action boundary detection in temporal video segmentation. The system employs dual-side long-short distance constraints with velocity-sensitive guidance based on motion energy variations, performing multi-neighbor reconstruction to capture both local temporal patterns and global distinctiveness. It achieves state-of-the-art performance on standard benchmarks, with potential applications in temporal action localization, video content moderation, and video-language understanding.
[PDF]
Can Qin, Lichen Wang, Qianqian Ma, Yu Yin, Huan Wang, Yun Fu
IEEE Transactions on Image Processing (TIP)
We developed an adaptive structure learning framework for semi-supervised domain adaptation. The system employs dual classifiers with contradictory objectives (source-scattering and target-clustering). It integrates MMD-based explicit alignment and entropy-driven implicit alignment. Self-training with consistency regularization further enhances the framework. It achieves state-of-the-art results on domain adaptation benchmarks, enabling applications in label-efficient transfer learning, few-shot adaptation, and domain generalization under distribution shift.
[PDF]
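As a concrete reference for the explicit alignment term, here is a self-contained RBF-kernel MMD estimator; the kernel choice and bandwidth are assumptions, and the paper's setup may use a different kernel mixture:

```python
import torch

def mmd_rbf(source, target, gamma=1.0):
    # Plain V-statistic estimate of MMD^2 with an RBF kernel.
    def kernel(a, b):
        return torch.exp(-gamma * torch.cdist(a, b).pow(2))
    return (kernel(source, source).mean()
            + kernel(target, target).mean()
            - 2.0 * kernel(source, target).mean())

src = torch.randn(64, 256)
tgt = torch.randn(64, 256) + 0.5     # shifted distribution
print(mmd_rbf(src, tgt).item())      # grows as the domains drift apart
```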
Lichen Wang, Zhengming Ding, Kasey Lee, Seungju Han, Jae-Joon Han, Changkyu Choi, Yun Fu
ACM Transactions on Knowledge Discovery from Data (TKDD)
We developed MUCO, a generative multi-label learning framework combining adversarial data augmentation with explicit correlation modeling. The system uses GANs to address class imbalance and long-tail distributions, paired with a trainable correlation tensor for interpretable label dependency learning. Through end-to-end training, it achieves up to 5.2% precision and 7.8% recall improvements. The framework powers applications in automated content moderation, visual search and retrieval, and intelligent tagging systems.
[PDF]
Yi Xu, Lichen Wang, Yizhou Wang, Can Qin, Yulun Zhang, Yun Fu
International Joint Conference on Artificial Intelligence (IJCAI), 2022
We developed MemREIN, a modular framework for cross-domain few-shot learning that tackles domain shift challenges. It uses instance normalization to reduce domain-specific features, memory banks to preserve discriminative information, and reverse contrastive loss for feature separation. This plug-and-play approach works with existing architectures, achieving up to 16.37% accuracy gains. Applications include edge AI deployment, rapid model customization, and scalable transfer learning pipelines.
[PDF]
Yi Xu, Lichen Wang, Yizhou Wang, Yun Fu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
We developed T-GNN, the first framework to tackle domain shift in pedestrian trajectory prediction. It combines domain-invariant GNNs with attention-based adaptive learning to transfer motion models from source domains (e.g., ETH streets) to target domains (e.g., HOTEL indoor) without retraining. Through spatial-temporal feature extraction and individual-level knowledge transfer, it outperforms baselines by over 20%. Applications include autonomous vehicle planning, robot crowd navigation, video surveillance, and smart city pedestrian flow.
[PDF]
Chang Liu, Lichen Wang, Yun Fu
SIAM International Conference on Data Mining (SDM), 2022
We developed MAW, a meta-learning framework that mitigates representation distortion in adversarial domain adaptation. It employs a meta-learner to estimate optimal per-sample weights, preventing well-aligned samples from being misaligned. The system constructs a meta-dataset via mix-up of source and pseudo-labeled target samples for bi-level optimization. MAW boosts DANN by 7.2% and CDAN by 2.3% on Office-31. Applications include test-time adaptation, source-free domain transfer, label-efficient learning, and foundation model fine-tuning.
[PDF]
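The meta-dataset construction can be illustrated with a standard mix-up of source samples and pseudo-labeled target samples; this sketch mixes at the input level and is an assumption about the form, not the authors' exact recipe:

```python
import torch

def mixup(x_src, y_src, x_tgt, y_pseudo, alpha=0.2):
    # Convex combination of a source batch and a pseudo-labeled target batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_src + (1.0 - lam) * x_tgt
    y_mix = lam * y_src + (1.0 - lam) * y_pseudo     # soft labels, (B, C)
    return x_mix, y_mix

x_s, x_t = torch.randn(8, 128), torch.randn(8, 128)
y_s = torch.eye(10)[torch.randint(0, 10, (8,))]      # one-hot source labels
y_t = torch.softmax(torch.randn(8, 10), dim=-1)      # pseudo-label distribution
x_m, y_m = mixup(x_s, y_s, x_t, y_t)
print(x_m.shape, y_m.sum(dim=-1))                    # labels stay normalized
```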
Yue Bai, Zhiqiang Tao, Lichen Wang, Sheng Li, Yu Yin, Yun Fu
SIAM International Conference on Data Mining (SDM), 2022
We developed CAM, a collaborative attention framework for multi-modal time series that addresses overlooked temporal patterns. It introduces Mutual-Aid RNN cells that leverage attention differences across modalities. When one modality focuses on certain time steps, it guides others to explore overlooked information. This strategy enhances knowledge discovery without direct fusion, improving accuracy by 4% on action recognition benchmarks. Applications include multimodal foundation models, embodied AI, and video-language understanding.
[PDF]
Lichen Wang, Yunyu Liu, Hang Di, Can Qin, Gan Sun, Yun Fu
IEEE Transactions on Image Processing (TIP)
We developed SDRL, a semi-supervised framework for multi-label classification addressing label-efficient learning under long-tail distributions. The system employs adversarial dual-classifier domain adaptation to align labeled and unlabeled data, confidence-based pseudo-label selection from classifier disagreements, and a trainable correlation tensor for pairwise label dependencies. It achieves up to 3% mAP gain across benchmarks. Applications include automated image tagging, content moderation, and visual search.
[PDF]
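One way to realize confidence-based selection from classifier disagreement is sketched below; the thresholds and agreement rule are illustrative assumptions:

```python
import torch

def select_pseudo_labels(logits_a, logits_b, thresh=0.9):
    # Two classifier heads give independent multi-label probabilities.
    p_a, p_b = logits_a.sigmoid(), logits_b.sigmoid()
    conf_a = torch.maximum(p_a, 1 - p_a)         # per-entry confidence
    conf_b = torch.maximum(p_b, 1 - p_b)
    agree = (p_a > 0.5) == (p_b > 0.5)           # same hard decision
    mask = agree & (conf_a > thresh) & (conf_b > thresh)
    return (p_a > 0.5).float(), mask             # pseudo-labels + reliability mask

labels, mask = select_pseudo_labels(torch.randn(4, 6), torch.randn(4, 6))
print(mask.float().mean().item(), "fraction of label entries kept")
```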
Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, Yun Fu
arXiv Preprint 2110.06161, 2021
We developed SAM-SLR-v2, an extended version of our CVPR 2021 Challenge championship solution for multi-modal sign language recognition. The system fuses skeleton, RGB, depth, and optical flow via a learned ensemble. It uses graph reduction to extract skeleton graphs, processed through a multi-stream SL-GCN and SSTCN. A Global Ensemble Model (GEM) automatically learns optimal fusion weights. It achieves state-of-the-art results, including 99% accuracy on SLR500 and an 8% improvement on WLASL2000. Applications include multimodal video understanding and real-time gesture interfaces.
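A learned multi-stream ensemble in the spirit of GEM can be as small as a softmax over per-stream weights; the sketch below is a simplification under our assumptions, not the released model:

```python
import torch
import torch.nn as nn

class LearnedEnsemble(nn.Module):
    def __init__(self, num_streams: int):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_streams))   # one logit per stream

    def forward(self, stream_logits):                     # list of (B, C) tensors
        weights = torch.softmax(self.w, dim=0)            # trained end-to-end
        return sum(w * s for w, s in zip(weights, stream_logits))

fuse = LearnedEnsemble(num_streams=4)                     # skeleton/RGB/flow/depth
out = fuse([torch.randn(2, 500) for _ in range(4)])       # e.g. 500 sign classes
print(out.shape)
```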
Can Qin, Handong Zhao, Lichen Wang, Huan Wang, Yulun Zhang, Yun Fu
Neural Information Processing Systems (NeurIPS), 2021
We developed a knowledge distillation framework that accelerates graph similarity computation by 10x while enabling offline embedding storage for real-time retrieval. The system uses a multi-level co-attention teacher network with GIN backbone, then distills knowledge to a lightweight student model via embedding decomposition. Applications include molecular drug screening, malware detection, graph-based anomaly detection, and large-scale graph retrieval.
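The distillation step can be sketched as matching the student's embeddings to a frozen teacher while keeping the task loss; the paper's embedding-decomposition details are omitted and the loss form here is an assumption:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_emb, teacher_emb, student_pred, target, alpha=0.5):
    # Pull student embeddings toward the frozen teacher's embedding space,
    # while still fitting the similarity-regression target (e.g., a GED score).
    emb_loss = F.mse_loss(student_emb, teacher_emb.detach())
    task_loss = F.mse_loss(student_pred, target)
    return alpha * emb_loss + (1 - alpha) * task_loss

s_emb = torch.randn(8, 64, requires_grad=True)
t_emb = torch.randn(8, 64)                   # from the teacher, frozen
pred = torch.rand(8, 1, requires_grad=True)
target = torch.rand(8, 1)                    # normalized similarity labels
distill_loss(s_emb, t_emb, pred, target).backward()
```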
Lichen Wang, Bo Zong, Yunyu Liu, Can Qin, Wei Cheng, Wenchao Yu, Xuchao Zhang, Haifeng Chen, Yun Fu
IEEE International Conference on Data Mining (ICDM), 2021
We developed SentRL, a reinforcement learning framework for aspect-based sentiment classification enabling learning under limited supervision. The system transforms sentences into dependency graphs and deploys an RL-guided agent to explore optimal paths to sentiment-bearing words, skipping irrelevant context to focus on discriminative clues. It achieves up to 3.7% F1 improvement over state-of-the-art. Applications include LLM-enhanced review systems, social listening, voice of customer platforms, and automated brand intelligence pipelines.
[PDF]
Chang Liu, Lichen Wang, Kai Li, Yun Fu
ACM International Conference on Multimedia (MM), 2021
We developed a feature variation decorrelation method for domain generalization that improves robustness across unseen domains. The system uses online memory banks to estimate class prototypes and disentangles semantic variations from features. A decorrelation loss makes variations class-independent, focusing on categorical concepts while ignoring domain-specific changes. It achieves state-of-the-art results with a 3% improvement across benchmarks. Applications include foundation model adaptation, cross-environment autonomous systems, and training-free vision.
[PDF]
Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, Yun Fu
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop (First prize), 2021
We developed SAM-SLR, a skeleton-aware multi-modal sign language recognition framework that won 1st place in the 2021 CVPR Challenge with 98.42% (RGB) and 98.53% (RGB-D) accuracy. The key innovation is graph reduction that simplifies the whole-body skeleton for efficient hand gesture and body motion modeling. An SL-GCN with decoupled graph convolution and STC attention captures spatial-temporal dynamics. A multi-modal ensemble fuses skeleton, RGB, optical flow, and depth for comprehensive understanding. Applications include embodied AI, spatial computing, and multimodal video understanding.
Lichen Wang, Zhengming Ding, Yun Fu
ACM Transactions on Knowledge Discovery from Data (TKDD)
We developed AGMA, a semi-supervised framework for multi-label learning that handles long-tail distributions and label noise with limited annotations. The system learns adaptive similarity graphs for structure discovery, applies marginalized augmentation for robustness, and uses a feature-label autoencoder for visual-semantic projection. Joint optimization achieves improvements across benchmarks in general and zero-shot settings. Applications include vision-language adaptation, content tagging, multi-modal retrieval, and label-efficient training.
[PDF]
Can Qin, Lichen Wang, Qianqian Ma, Yu Yin, Huan Wang, Yun Fu
SIAM International Conference on Data Mining (SDM), 2021
We developed UODA, a semi-supervised domain adaptation framework enabling effective transfer with only 1-3 labeled samples per class. It employs opposite structure learning: a source-scattering classifier expands decision boundaries while a target-clustering classifier groups features through entropy optimization, enabling efficient cross-domain alignment. It achieves up to 3% accuracy improvement on benchmarks. Applications include few-shot domain adaptation, label-efficient training, and cross-domain deployment.
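The opposite-structure objective can be caricatured with entropy terms of opposite sign on target predictions; the signs, weighting, and scheduling below are illustrative assumptions, not the paper's exact optimization:

```python
import torch

def entropy(logits):
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-8)).sum(dim=-1).mean()

# Stand-ins for the two heads' predictions on unlabeled target data.
scatter_logits = torch.randn(32, 10, requires_grad=True)
cluster_logits = torch.randn(32, 10, requires_grad=True)

# The scattering head is pushed toward high entropy (expanded boundaries),
# the clustering head toward low entropy (tight target clusters).
loss = -entropy(scatter_logits) + entropy(cluster_logits)
loss.backward()
```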
Yue Bai, Lichen Wang, Zhiqiang Tao, Sheng Li, Yun Fu
AAAI Conference on Artificial Intelligence (AAAI), 2021
We developed C2AF, an end-to-end framework for multimodal time series classification fusing complementary sensor information. The system combines LSTM and CNN encoders for global-local temporal patterns, then captures intra-modal and inter-modal correlations through cross-modal learnable fusion with 1x1 convolutions. It achieves state-of-the-art on action recognition and sensor classification benchmarks. Applications include multimodal action recognition, healthcare monitoring, and time series foundation models.
Jiahua Dong, Yang Cong, Gan Sun, Bingtao Ma, Lichen Wang
AAAI Conference on Artificial Intelligence (AAAI), 2021
We developed I3DOL, the first incremental learning framework for 3D perception enabling learning new classes without forgetting previous ones. The key challenge is catastrophic forgetting from irregular point cloud structures. The system uses adaptive geometric centroid for discriminative features, attention-based selection for informative 3D characteristics, and score compensation for balanced prediction. It improves accuracy by up to 25%. Applications include autonomous systems, robotic perception, and continual 3D learning.
Yue Bai, Lichen Wang, Yunyu Liu, Yu Yin, Yun Fu
IEEE International Conference on Data Mining (ICDM), 2020
We developed DSAE, an unsupervised framework for high-dimensional time series segmentation that captures temporal patterns without labeled data. It uses a single-to-multiple auto-encoder to preserve local correlations, with dual-side long-short constraints ensuring nearby similarity and distant distinctiveness. It achieves state-of-the-art performance with up to 8.5% NMI improvement. Applications include VideoLLM preprocessing, temporal grounding, embodied AI, and long-form video understanding.
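The dual-side long-short constraint can be written directly on per-frame embeddings; the gap sizes and margin below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dual_side_loss(z, short_gap=1, long_gap=50, margin=1.0):
    # z: (T, d) per-frame embeddings; pull short-range pairs together,
    # push long-range pairs at least `margin` apart.
    near = F.mse_loss(z[:-short_gap], z[short_gap:])
    far_dist = (z[:-long_gap] - z[long_gap:]).norm(dim=-1)
    far = F.relu(margin - far_dist).mean()
    return near + far

z = torch.randn(200, 32, requires_grad=True)
dual_side_loss(z).backward()
```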
Yunyu Liu, Lichen Wang, Yue Bai, Can Qin, Zhengming Ding, Yun Fu
European Conference on Computer Vision (ECCV), 2020
We developed GVCA, a generative semi-supervised framework for multimodal learning that enables effective training with limited annotations. The key innovation is SeMix, a generative augmentation method that leverages unlabeled data to expand feature distributions. Combined with entropy-guided cross-modal adversarial adaptation and label correlation fusion, the system achieves 89%+ accuracy using only 50% of the labeled samples. Applications include generative data augmentation, vision-language foundation model pretraining, and label-efficient multimodal learning.
Lichen Wang, Bin Sun, Joseph Robinson, Taotao Jing, Yun Fu
IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2020
We introduced EV-Action, the first large-scale action dataset fusing 5 modalities across visual and biosignal domains. It includes RGB, depth, EMG, and dual skeleton streams (Kinect and Vicon) from 70 subjects performing 20 action classes. The Vicon system captures skeletons at 100 fps with sub-millimeter accuracy for fine-grained motion analysis. EMG signals reveal muscle intention and force patterns invisible to visual sensors. An FFT-LSTM baseline demonstrates EMG's unique contribution to recognition. Applications include embodied AI, wearable analytics, and biomechanical research.
Lichen Wang, Bo Zong, Qianqian Ma, Wei Cheng, Jingchao Ni, Wenchao Yu, Yanchi Liu, Dongjin Song, Haifeng Chen, Yun Fu
International Conference on Learning Representations (ICLR), 2020
We developed SEED, a production-ready unsupervised graph learning framework that generates embeddings for arbitrary graphs without requiring labeled training data. The system combines WEAVE-based subgraph sampling with distribution embedding, enabling efficient inference on unseen graph structures. It achieves up to 10% improvement over GNN baselines while supporting graphs with or without node attributes. Applications include malware analysis, knowledge graph systems, and large-scale graph retrieval.
Lichen Wang, Yunyu Liu, Can Qin, Gan Sun, Yun Fu
AAAI Conference on Artificial Intelligence (AAAI), 2020
We developed DRML, a semi-supervised framework that automatically discovers label correlations for multi-label prediction without manual annotation of label relationships. The system combines dual-classifier domain adaptation with a graph-based relation network, enabling effective learning from limited labeled data. It achieves best F1 scores on 6 image annotation benchmarks while supporting zero-shot generalization. Applications include image tagging, scene attribute recognition, and multi-label content understanding.
Can Qin, Haoxuan You, Lichen Wang, C.-C. Jay Kuo, Yun Fu
Neural Information Processing Systems (NeurIPS), 2019
We developed PointDAN, a multi-scale domain adaptation framework that enables 3D point cloud models trained on synthetic data to generalize to real-world scans. It uses Self-Adaptive nodes for local geometric alignment and adversarial training for global feature matching. We also released the PointDA-10 benchmark dataset with 25K+ samples across synthetic and real-world domains. It improves accuracy by 14% across all 6 adaptation scenarios. Applications include autonomous driving, sim-to-real robotics transfer, and industrial 3D inspection.
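The adversarial global alignment relies on the standard gradient-reversal trick, sketched below; the Self-Adaptive node module for local geometric alignment is not reproduced here:

```python
import torch

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; flips and scales gradients in the backward
    # pass, so the feature extractor is trained to fool a domain classifier.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

feat = torch.randn(4, 256, requires_grad=True)
domain_score = grad_reverse(feat, lam=0.5).sum()  # stand-in for a domain head
domain_score.backward()
print(feat.grad[0, :3])   # gradients arrive sign-flipped and scaled
```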
Lichen Wang, Zhengming Ding, Seungju Han, Jae-Joon Han, Changkyu Choi, Yun Fu
IEEE International Conference on Data Mining (ICDM) (Long paper), 2019
We developed GCDN, a generative multi-label learning framework that handles long-tail distributions and discovers label correlations automatically. The system uses conditional feature generation to augment sparse training data and learns label dependencies end-to-end without expert-defined relationships. It achieves up to 13% mAP improvement with foundation model-style generalization to novel categories. Applications include vision-language model training, large-scale content tagging, and cross-modal retrieval and understanding.
Denghui Zhang, Junming Liu, Hengshu Zhu, Yanchi Liu, Lichen Wang, Pengyang Wang, Hui Xiong
ACM International Conference on Information and Knowledge Management (CIKM) (Long paper), 2019
We developed Job2Vec, a multi-view graph learning framework for job title matching across companies. The system learns unified embeddings by jointly modeling graph topology, semantic descriptions, transition balance, and transition duration. It achieves 200% improvement in matching accuracy on large-scale career data from major tech and finance companies. Applications include AI-powered talent acquisition, compensation intelligence, career recommendation systems, and workforce analytics.
[PDF]
Lichen Wang, Zhengming Ding, Zhiqiang Tao, Yunyu Liu, Yun Fu
International Conference on Computer Vision (ICCV) (Oral), 2019
We developed GMVAR, a multimodal action recognition framework that works seamlessly with complete, partial, or missing modalities. The system combines conditional generation for cross-modal synthesis with View Correlation Discovery Network (VCDN) for label-space fusion. Unlike feature-level fusion, VCDN captures action-specific distinctiveness across modalities. It achieves up to 5% improvement on 3 RGB-D benchmarks. Applications include multimodal foundation model training, embodied AI, and robust video understanding under sensor degradation.
Can Qin, Lichen Wang, Yulun Zhang, Yun Fu
International Conference on Computer Vision (ICCV) Workshop (Best Paper), 2019
We developed GICT, an unsupervised domain adaptation framework that eliminates the need for expensive real-world data annotation by transferring knowledge from synthetic sources. The system aligns low-level features via CycleGAN generation and matches class-conditional distributions through a dual-classifier co-training strategy. It improves segmentation mIoU by 3% on GTA5-to-Cityscapes. Applications include autonomous vehicle perception, industrial sim-to-real transfer, and annotation-free model adaptation.
Gan Sun, Yang Cong, Lichen Wang, Zhengming Ding, Yun Fu
International Conference on Computer Vision (ICCV) Workshop, 2019
We developed OMTC, an online clustering framework that segments human motion videos in real-time across multiple agents without requiring complete video access. The key innovation is an encoder-decoder architecture that learns transferable motion representations across distributed cameras while preserving temporal structure. It improves accuracy by 8-12% with significantly lower memory footprint. Applications include multi-camera surveillance, healthcare monitoring, and distributed video analytics.
[PDF]
Lichen Wang, Zhengming Ding, Yun Fu
IEEE Transactions on Image Processing (TIP)
We developed a transfer learning framework for human motion segmentation that enables knowledge transfer from labeled to unlabeled video data. The system aligns distributions with domain-invariant projections, preserves temporal structure via graph regularization, and applies low-rank clustering constraints. It achieves 5% accuracy improvement over previous methods. Applications include VideoLLM preprocessing, temporal action localization, embodied AI, and automated video content analysis.
Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, Yun Fu
European Conference on Computer Vision (ECCV), 2018 (6000+ citations)
We developed RCAN, a channel attention-based super-resolution network enabling very deep architectures (400+ layers) with stable training. The key innovations are a Residual-in-Residual (RIR) structure with long and short skip connections, and Channel Attention (CA) that adaptively rescales features by modeling channel interdependencies. It has become foundational in image restoration, achieving state-of-the-art PSNR/SSIM with 60% fewer parameters. Applications include image enhancement, medical imaging, surveillance, and content delivery.
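The CA block itself is compact: global average pooling followed by a bottleneck gating MLP that rescales each channel. The sketch below mirrors that published design, though the module organization is our own:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, 1, 1) global stats
            nn.Conv2d(channels, channels // reduction, 1),  # channel bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel gate in (0, 1)
        )

    def forward(self, x):            # x: (B, C, H, W)
        return x * self.gate(x)      # rescale channels adaptively

x = torch.randn(1, 64, 48, 48)
print(ChannelAttention(64)(x).shape)   # torch.Size([1, 64, 48, 48])
```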
Lichen Wang, Zhengming Ding, Yun Fu
International Joint Conference on Artificial Intelligence (IJCAI), 2018
We developed AG2E, a graph-based semi-supervised framework for multi-label annotation that learns from both labeled and unlabeled data. The system jointly learns adaptive graph structure and feature embeddings while propagating labels. This design enables robust annotation under noise and label imbalance. It improves precision by 3.5% and recall by 6% across image, audio, and emotion datasets. Applications include automated content tagging, scene attribute recognition, and zero-shot learning.
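The propagation mechanism at the core of such graph-based frameworks is classical; the sketch below fixes the affinity graph for brevity, whereas AG2E learns it adaptively:

```python
import numpy as np

def propagate(W, Y, alpha=0.9, iters=50):
    # W: (n, n) nonnegative affinities; Y: (n, c) labels, zero rows if unlabeled.
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))           # symmetric normalization
    F = Y.copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y   # spread labels along the graph
    return F                                  # soft multi-label scores

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
W = np.exp(-np.square(X[:, None] - X[None]).sum(-1))   # RBF affinities
Y = np.zeros((6, 2)); Y[0, 0] = Y[1, 1] = 1.0          # two labeled points
print(propagate(W, Y).round(2))
```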
Lichen Wang, Zhengming Ding, Yun Fu
AAAI Conference on Artificial Intelligence (AAAI), 2018
We developed the first transfer learning framework for human motion segmentation that leverages labeled source data to improve clustering on unlabeled target videos. The system jointly learns domain-invariant projections and temporal graph regularization to align distributions while preserving sequential structure. It achieves 5.3% accuracy and 7.1% NMI improvement over state-of-the-art methods. Applications include video temporal segmentation, action localization, and preprocessing for action recognition.
Lichen Wang, Aimin Zhang, Chujia Guo, Pervez Bhan, Tian Yan
Chinese Control Conference (CCC), 2015
We developed a vision-based optical communication system for real-time multi-target localization and identification. The system uses LED-encoded signals with FFT-based feature extraction and machine learning classification, achieving 1000x speedup over conventional methods. Applications include indoor positioning systems, AGV/robot guidance, autonomous vehicle navigation, and smart manufacturing.
[PDF]
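The decoding step reduces to finding the dominant frequency in each tracked LED's brightness-over-time signal; the frame rate and window length below are assumptions for illustration:

```python
import numpy as np

fs, n = 240.0, 256                   # camera frame rate (Hz), window length
t = np.arange(n) / fs
signal = 0.5 * (1 + np.sign(np.sin(2 * np.pi * 30.0 * t)))   # LED on/off at 30 Hz
signal += 0.1 * np.random.default_rng(0).normal(size=n)      # sensor noise

spectrum = np.abs(np.fft.rfft(signal - signal.mean()))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
print(f"dominant frequency: {freqs[spectrum.argmax()]:.1f} Hz")   # recovers 30 Hz
```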
Lichen Wang, Aimin Zhang, Chujia Guo, Songyun Zhao, Pervez Bhan
Chinese Control and Decision Conference (CCDC), 2015
We developed a computer vision algorithm for 3D reconstruction of SMT solder joints. The system uses structured LED lighting to capture shadows, extracts profiles via image processing, and reconstructs geometry using a physics-based shape model derived from solder surface tension. It achieves 2x speedup over Shape from Shading methods with single-camera low-cost hardware. Applications include automated optical inspection, smart factory quality control, and electronics manufacturing.
[PDF]
Lichen Wang
Thesis, Doctor of Philosophy, 2021
This dissertation develops algorithms for multimodal correlation discovery and multi-label learning. For multimodal learning, I developed cross-modal fusion networks that discover complementary patterns across RGB and depth modalities for action recognition. For multi-label prediction, I proposed generative frameworks that automatically learn label correlations and address long-tail distributions through conditional feature generation. For graph-structured data, I introduced SEED, an unsupervised framework that learns graph representations through sampling and embedding distributions. These methods advance vision-language systems, multimodal foundation models, content understanding pipelines, and knowledge graph applications.
Lichen Wang
Thesis, Master of Science in Engineering, 2016
I developed an automated optical inspection (AOI) system for PCB quality control using computer vision and machine learning. The system integrates AdaBoost-based component localization, SVM classification for defect detection, and a novel shadow-based 3D solder joint reconstruction. It achieves pixel-level positioning accuracy and real-time performance, outperforming Shape-from-Shading on reflective surfaces. Applications include smart manufacturing, edge-deployed quality inspection, and electronics production automation.
Lichen Wang
Thesis, Bachelor of Engineering, 2013
I developed an automated optical inspection system for pharmaceutical manufacturing that detects foreign particles in intravenous bottles. The system integrates dual-source illumination, motion-based segmentation, and blob analysis to achieve 50-micron detection sensitivity at 60+ bottles/minute throughput. It was deployed in real manufacturing environments and resulted in a granted patent. Applications include smart manufacturing, edge-deployed quality inspection, and pharmaceutical production automation.