As AI systems and large language models (LLMs) scale to unprecedented levels of complexity and deployment, understanding their faults and establishing their trustworthiness and explainability become critical to ensuring reliable and responsible operation in real-world environments.
FAULTS:
Modern AI pipelines, including training, inference, retrieval-augmented generation (RAG), and agentic systems, are vulnerable to a variety of faults and subtle failures. These include distribution drift, prompt and tool variability, infrastructure noise, and limited-precision effects. Such faults can cascade into hallucinations, instability, miscalibration, and silent regressions that waste computational resources, degrade system performance, and mislead users.
TRUSTWORTHINESS:
Building trust in AI systems at scale requires robust mechanisms to detect, diagnose, and defend against faults and failures. Trustworthiness encompasses system reliability, reproducibility, and resilience under real-world variability. It also involves establishing rigorous measurement, benchmarking, and validation practices that provide confidence in AI-assisted decisions and operational outcomes.
EXPLAINABILITY:
Explainability plays a vital role as a diagnostic and attribution tool across the AI pipeline. By providing insights into the behavior of models, retrieval components, tools, infrastructure, and precision layers, explainability helps identify the root causes of faults and supports transparent evaluation. It enables stakeholders to understand, interpret, and trust AI system outputs, especially when addressing complex failure modes at scale.
We invite original research papers, case studies, and position papers on topics including, but not limited to:
- Faults, failures, and variability in AI/LLM pipelines at scale
- Detection and diagnosis of hallucinations, regressions, and silent errors
- Explainability and interpretability methods for fault attribution
- Measurement, benchmarking, and evaluation protocols for AI system reliability
- Techniques for distribution drift detection and mitigation
- Robustness and fault tolerance in training and inference systems
- Telemetry, monitoring, and observability for large-scale AI deployments
- Impact of infrastructure noise and limited-precision arithmetic on AI outputs
- Reproducibility and validation frameworks for AI systems at scale
- Case studies on operational AI system failures and recovery strategies
- Cross-disciplinary approaches combining HPC, AI, and systems reliability
- Tools and checklists to improve AI system trustworthiness in production
- Security and adversarial fault injection in AI pipelines