Responsible AI & Data-centric AI Projects
Robust Fine-Tuning and Model Merging in Large Language Models
(Model Robustness & Data Selection | Microsoft Research Asia)
- Domain: Text
- Motivation: Fine-tuning on new data alone, and model merging between pre-trained and fine-tuned LLMs, are neither optimal nor interpretable.
- Method: In progress
Robust Anti-Cheat in Competitive Gaming (PUBG: BATTLEGROUNDS)
(Model Robustness & Data Selection | KRAFTON AI)
- Domain: Tabular
- Motivation: Three data challenges exist in the anti-cheat dataset: label noise, distribution shifts, and class imbalance.
- Method: Robust data-centric framework including label noise cleaning, out-of-distribution (OOD) data filtering, and weighted undersampling.
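The weighted-undersampling step above can be sketched as follows. This is a hypothetical interface (the function name, `ood_score` input, and `keep_ratio` parameter are assumptions, not the project's actual API): majority-class (non-cheater) samples are undersampled, with in-distribution samples kept with higher probability.

```python
import numpy as np

def weighted_undersample(X, y, ood_score, majority_label=0, keep_ratio=0.1, seed=0):
    """Undersample the majority (non-cheater) class, keeping in-distribution
    samples with higher probability. `ood_score` in [0, 1): higher = more OOD.
    Hypothetical sketch of the weighted-undersampling idea only."""
    rng = np.random.default_rng(seed)
    maj = np.where(y == majority_label)[0]
    minr = np.where(y != majority_label)[0]
    # Sampling weight favors in-distribution majority samples.
    w = 1.0 - ood_score[maj]
    w = w / w.sum()
    n_keep = max(1, int(len(maj) * keep_ratio))
    kept = rng.choice(maj, size=n_keep, replace=False, p=w)
    idx = np.concatenate([kept, minr])
    return X[idx], y[idx]
```

All minority-class (cheater) samples are retained; only the majority class shrinks.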
Robust Data Selection against Concept Drifts
(Model Robustness & Data Selection)
- Domain: Tabular
- Motivation: Although concept drifts originate in the data, existing drift adaptation relies on model-centric rather than data-centric methods.
- Method: Gradient-based data segment selection to discard drifted data segments and select coreset segments.
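One way such gradient-based selection could work, as a minimal sketch (the alignment criterion and the `val_grad` reference gradient are assumptions for illustration, not the project's exact algorithm): keep only segments whose average loss gradient aligns with a trusted reference gradient, and discard the drifted rest.

```python
import numpy as np

def select_segments(segment_grads, val_grad, threshold=0.0):
    """Keep data segments whose average loss gradient aligns with a clean
    reference gradient (cosine similarity above `threshold`); drop drifted
    segments. `segment_grads` is a list of per-segment gradient vectors."""
    keep = []
    for i, g in enumerate(segment_grads):
        cos = g @ val_grad / (np.linalg.norm(g) * np.linalg.norm(val_grad) + 1e-12)
        if cos > threshold:
            keep.append(i)  # gradient agrees: segment joins the coreset
    return keep
```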
Robust Data-Feature Selection against Concept Drifts
(Model Robustness & Data-Feature Selection | SK Hynix)
- Domain: Tabular
- Motivation: Concept drifts occur in semiconductor fault detection and classification (FDC) data due to periodic maintenance over time, which reduces model performance.
- Method: Efficient sequential data-feature selection using negative and positive transitivity properties of data segments and features for robust model performance.
Robust Data Augmentation in Continual Learning
(Model Robustness & Data Augmentation)
- Domain: Image
- Motivation: Applying naive mixup for data augmentation in continual learning can lead to greater catastrophic forgetting.
- Method: Gradient-based selective mixup by mixing only samples from helpful class pairs and not from detrimental class pairs to mitigate catastrophic forgetting.
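The selective-mixup step can be sketched as below, assuming the helpful class pairs have already been identified (e.g., by the gradient analysis mentioned above); the function signature and the `helpful_pairs` set are hypothetical, and samples whose class pair is not listed are left unmixed.

```python
import numpy as np

def selective_mixup(x, y, helpful_pairs, alpha=0.2, seed=0):
    """Mixup restricted to pre-identified helpful class pairs; detrimental
    pairs are left unmixed to limit catastrophic forgetting. Returns mixed
    inputs plus (y_a, y_b, lam) for the standard mixup loss."""
    rng = np.random.default_rng(seed)
    lam = float(rng.beta(alpha, alpha))
    perm = rng.permutation(len(x))
    x_mix = x.copy()
    y_b = y.copy()
    for i, j in enumerate(perm):
        if (int(y[i]), int(y[j])) in helpful_pairs:
            x_mix[i] = lam * x[i] + (1 - lam) * x[j]
            y_b[i] = y[j]
    return x_mix, y, y_b, lam
```

The training loss would then be the usual mixup objective, `lam * loss(x_mix, y_a) + (1 - lam) * loss(x_mix, y_b)`.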
Robust Data Augmentation for Regression Tasks
(Model Robustness & Data Augmentation)
- Domain: Tabular
- Motivation: C-Mixup is vulnerable to noisy data as it performs selective mixup based on label distance for regression tasks.
- Method: Integrating C-Mixup with multi-round robust training for a synergistic effect: C-Mixup improves robust training in identifying clean data, while robust training provides clean data for C-Mixup.
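C-Mixup's label-distance-based partner sampling can be sketched as follows; this is a minimal illustration of the sampling step only (the multi-round robust-training loop is omitted, and the `bandwidth` parameter is an assumption).

```python
import numpy as np

def cmixup_pairs(y, bandwidth=1.0, seed=0):
    """For each sample, draw a mixing partner with probability proportional
    to a Gaussian kernel of label distance, as in C-Mixup for regression."""
    rng = np.random.default_rng(seed)
    d = (y[:, None] - y[None, :]) ** 2          # pairwise squared label distance
    p = np.exp(-d / (2 * bandwidth ** 2))       # Gaussian kernel
    np.fill_diagonal(p, 0.0)                    # never mix a sample with itself
    p = p / p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(y), p=p[i]) for i in range(len(y))])
```

Samples with close labels are far more likely to be mixed together, which is exactly why noisy labels corrupt the pairing and motivate coupling it with robust training.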
Fair Catastrophic Forgetting in Continual Learning
(Model Fairness & Sample Weighting)
- Domain: Image, Text, and Tabular
- Motivation: Training on all new-task samples in continual learning causes disproportionately severe catastrophic forgetting for certain sensitive groups, including classes.
- Method: Fairness-aware sample weighting by adjusting the training weights of the new task samples for a better accuracy-fairness tradeoff.
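One minimal sketch of such group-level sample weighting (the exponential form, the `lam` strength parameter, and the per-group forgetting scores are assumptions for illustration, not the project's actual weighting rule): samples from groups that suffer more forgetting receive proportionally larger training weights.

```python
import numpy as np

def fairness_weights(group_ids, group_forgetting, lam=1.0):
    """Assign higher training weight to new-task samples whose sensitive group
    (or class) suffers more forgetting; weights are normalized to mean 1."""
    f = np.array([group_forgetting[g] for g in group_ids], dtype=float)
    w = np.exp(lam * f)            # more forgetting -> larger weight
    return w * len(w) / w.sum()    # normalize so the mean weight is 1
```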
Model Calibration in Continual Learning
(Model Calibration & Data Perturbation)
- Domain: Image
- Motivation: Model confidence calibration is challenging in continual learning as most post-hoc calibration techniques are not designed to work with the limited validation set from old tasks.
- Method: Temperature scaling specifically designed for continual learning that leverages adversarially perturbed new task samples as a validation set without requiring a validation set from old tasks.
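The temperature-scaling component can be sketched as a simple grid search over temperatures on the proxy validation set (here the adversarial-perturbation step that produces that set is omitted, and the grid-search fitting is an assumption; the original may optimize the temperature differently).

```python
import numpy as np

def fit_temperature(logits, labels, temps=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing negative log-likelihood on a proxy
    validation set (e.g., adversarially perturbed new-task samples)."""
    best_t, best_nll = 1.0, np.inf
    for t in temps:
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)               # stable log-softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        nll = -logp[np.arange(len(labels)), labels].mean()
        if nll < best_nll:
            best_t, best_nll = float(t), nll
    return best_t
```

At inference, predictions use `softmax(logits / best_t)`, which leaves the argmax (accuracy) unchanged while recalibrating confidence.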
Personalized Video Recommendation (NAVER TV)
(Model Personalization & Multimodal Data Fusion | NAVER)
- Domain: Multimodal (Text and Tabular)
- Motivation: Conventional recommendation systems often fail to capture user preferences due to data sparsity and bias toward similar content.
- Method: Recommendation framework that integrates collaborative filtering and content-based filtering with multimodal data fusion and applies category-aware calibration to improve personalization and diversity.