Projects

The Data Intelligence Lab aims to pioneer Responsible/Trustworthy/Safe AI, Data-centric AI, and Big Data – AI Integration across all of machine learning, including Large Language Models (LLMs). We are especially interested in addressing fairness, robustness, privacy, and explainability challenges in machine learning from a data perspective.

Surveys and Tutorials

Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective (VLDB Journal '23)
Machine Learning Robustness, Fairness, and their Convergence (ACM SIGKDD '21 Tutorial)
A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective (IEEE TKDE '21)
Responsible AI Challenges in End-to-end Machine Learning (IEEE Data Engineering Bulletin '21)
Data Collection and Quality Challenges for Deep Learning (VLDB '20 Tutorial)
Data Lifecycle Challenges in Production Machine Learning: A Survey (ACM SIGMOD Record '18)

Data Acquisition, Generation, and Labeling

PFGuard: A Generative Framework with Privacy and Fairness Safeguards (ICLR '25)
Falcon: Fair Active Learning using Multi-armed Bandits (VLDB '24)
iFlipper: Label Flipping for Individual Fairness (ACM SIGMOD '23)
Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models (ACM SIGMOD '21)
Inspector Gadget: A Data Programming-based Labeling System for Industrial Images (VLDB '21)

Data Cleaning, Validation, Augmentation, and Selection

GradMix: Gradient-based Selective Mixup for Robust Data Augmentation in Class-Incremental Learning (ACM SIGKDD '26)
Fair Class-Incremental Learning using Sample Weighting (ACM SIGKDD '26)
MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning (NeurIPS '25)
RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks (ACM SIGKDD '24)
Quilt: Robust Data Segment Selection against Concept Drifts (AAAI '24)
Redactor: A Data-centric and Individualized Defense Against Inference Attacks (AAAI '23)
RegMix: Data Mixing Augmentation for Regression (arXiv)
Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach (DEEM@ACM SIGMOD '19)
Data Validation for Machine Learning (MLSys '19)

Model Training

T-CIL: Temperature Scaling using Adversarial Perturbation for Calibration in Class-Incremental Learning (CVPR '25)
LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views (ICML '24)
Improving Fair Training under Correlation Shifts (ICML '23)
Dr-Fairness: Dynamic Data Ratio Adjustment for Fair Training on Real and Generated Data (TMLR '23)
XClusters: Explainability-first Clustering (AAAI '23)
Sample Selection for Fair and Robust Training (NeurIPS '21)
FairBatch: Batch Selection for Model Fairness (ICLR '21)
FR-Train: A Mutual Information-based Approach to Fair and Robust Training (ICML '20)

Model Explanation, Validation, and Evaluation

SHAP-based Explanations are Sensitive to Feature Representation (ACM FAccT '25)
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models (NeurIPS '24)
Open-world COVID-19 Data Visualization (DMAH@VLDB '20)
Automated Data Slicing for Model Validation: A Big data - AI Integration Approach (IEEE TKDE '19)
Slice Finder: Automated Data Slicing for Model Validation (IEEE ICDE '19)
