As machine learning becomes increasingly ingrained in everyday technology, it has steadily moved into higher-stakes domains such as medical applications and self-driving vehicles. Deploying models in the real world without appropriate safeguards to ensure their robustness is increasingly dangerous. While state-of-the-art models often excel in controlled lab settings with clean datasets and no malicious interference, these conditions are rarely reflective of the real world. This is especially true for larger models, such as Large Language Models (LLMs), which often rely on vast amounts of unscreened data for training. In these uncontrolled environments, the absence of safety guarantees can lead to significant risks, preventing us from safely harnessing AI's full potential.
One key obstacle we face is model backdoors, or trojans. Trojan attacks in machine learning embed hidden triggers within a model during training, causing it to behave maliciously when specific inputs are provided. These attacks severely undermine model trustworthiness by creating hidden vulnerabilities that can be exploited, eroding confidence in the model's outputs. Robustness, the model's ability to perform consistently across different inputs, is also compromised, as a trojan can lead to unpredictable and harmful behavior. Ensuring models are free from such hidden threats is critical for maintaining their integrity, particularly in high-stakes applications where reliability is essential. At ACES Lab, we combat these challenges by developing advanced techniques to detect and mitigate trojans, enhancing the safety and robustness of AI systems deployed in the real world. Our recent work, MergeGuard, provides an architecture-agnostic framework for cleansing trojans, and it is one of the first methods that extends to prompt-level injection attacks on text-to-image generative models. By leveraging lightweight structural linearization rather than trigger inversion or pruning, MergeGuard fully removes prompt-triggered backdoors using only a small clean dataset, while preserving clean-prompt fidelity and remaining practical for large-scale generative systems.
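To make the idea of structural linearization concrete, the sketch below is a minimal, hypothetical toy example, not the actual MergeGuard implementation: a chosen nonlinear block of a possibly trojaned model is replaced by a linear surrogate fitted only on activations from a small clean calibration set. The intuition is that trigger-specific behavior the clean data never exercises has little pathway to survive such a replacement. All module names, shapes, and the choice of block are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one internal block of a larger, possibly trojaned model.
nonlinear_block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

# Small clean calibration set of intermediate activations (assumed available).
clean_acts = torch.randn(256, 64)
with torch.no_grad():
    targets = nonlinear_block(clean_acts)

# Fit a single affine map (W, b) that mimics the block on clean data only.
ones = torch.ones(clean_acts.size(0), 1)
X = torch.cat([clean_acts, ones], dim=1)            # (256, 65); last column carries the bias
solution = torch.linalg.lstsq(X, targets).solution  # (65, 64)
W, b = solution[:-1], solution[-1]

linear_surrogate = nn.Linear(64, 64)
with torch.no_grad():
    linear_surrogate.weight.copy_(W.T)  # nn.Linear stores weights as (out_features, in_features)
    linear_surrogate.bias.copy_(b)

# The surrogate tracks the block's clean behavior; in a full pipeline it would
# replace the original block inside the model.
with torch.no_grad():
    err = (linear_surrogate(clean_acts) - targets).abs().mean()
print(f"mean clean-approximation error: {err.item():.4f}")
```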
Beyond adversarial threats, our recent work also investigates a new and increasingly important failure mode in large-scale generative modeling: model collapse. As generative models are recursively trained on data produced by earlier generations, their outputs tend to degrade, lose diversity, or converge to pathological distributions, a phenomenon that threatens the long-term reliability of AI ecosystems. Our ForTIFAI framework provides a formal understanding of failures induced by recursive training and introduces principled techniques for preventing collapse in language models and other generative models. By combining theoretical insights with large-scale empirical evaluations, we aim to safeguard future AI systems from degradation, ensuring they remain informative, diverse, and aligned even as they evolve over multiple training cycles.
Figure: An overview of the ForTIFAI framework, in which model collapse is analyzed under realistic assumptions simulating how data sources will change in the near future.
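To build intuition for the collapse dynamics described above, the following toy script is an illustrative sketch of the phenomenon itself, not of the ForTIFAI method: a simple Gaussian "model" is fit to data, and each new generation is trained only on samples drawn from the previous generation's fit. The fitted spread tends to drift toward zero over generations, mirroring the loss of diversity under recursive training. The sample sizes and generation count are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=20)

for generation in range(1, 51):
    # Fit a simple Gaussian "model" to whatever data this generation sees.
    mu, sigma = data.mean(), data.std()
    # The next generation is trained purely on synthetic samples from that fit.
    data = rng.normal(mu, sigma, size=20)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: fitted std = {sigma:.3f}")
```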