Benchmarking autonomous agents for predictive maintenance, diagnosis, work orders, and root cause analysis, at scale.
The AI Challenge is organized as part of the 13th International Conference on Data Science (CODS).
This AI challenge invites participants to learn, design, develop, and evaluate autonomous AI agents that solve realistic industrial tasks across the full pipeline: Sensing → Reasoning → Actuation. You'll work with a curated set of scenarios rooted in Industry 4.0 applications such as predictive maintenance, fault diagnosis, work-order generation, and root-cause analysis. These tasks demand both strong single-agent intelligence and coordinated multi-agent behavior.
Participants will interpret heterogeneous data, including textual logs and multivariate time series, and build modular pipelines where agents take on roles such as Work-Order Agent, Time-Series Foundation Model Agent, and Supervisor Agent. The benchmark encourages innovation in the agent development lifecycle, autonomous decision-making, modular reasoning, and collaborative problem-solving under realistic constraints, advancing the next generation of intelligent industrial systems.
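To make the modular-pipeline idea concrete, here is a minimal sketch of a Supervisor Agent coordinating a time-series agent and a work-order agent. The class names, method signatures, and the simple z-score heuristic are illustrative assumptions only; they do not reflect the official starter kit or any required design.

```python
# Minimal illustrative sketch of a modular agent pipeline.
# All class names, signatures, and the z-score heuristic are hypothetical --
# they do not reflect the official starter kit or any required design.
from dataclasses import dataclass, field


@dataclass
class TimeSeriesAgent:
    """Interprets sensor readings and flags anomalous samples."""
    threshold: float = 1.5  # loose z-score cutoff, purely for demonstration

    def analyze(self, readings: list[float]) -> dict:
        mean = sum(readings) / len(readings)
        std = (sum((x - mean) ** 2 for x in readings) / len(readings)) ** 0.5 or 1.0
        anomalies = [i for i, x in enumerate(readings)
                     if abs(x - mean) / std > self.threshold]
        return {"anomalous_indices": anomalies, "mean": mean}


@dataclass
class WorkOrderAgent:
    """Turns a diagnostic finding into a structured work-order record."""

    def create(self, asset_id: str, finding: dict) -> dict:
        return {
            "asset_id": asset_id,
            "action": "inspect" if finding["anomalous_indices"] else "no_action",
            "evidence": finding,
        }


@dataclass
class SupervisorAgent:
    """Coordinates the specialist agents: sensing -> reasoning -> actuation."""
    ts_agent: TimeSeriesAgent = field(default_factory=TimeSeriesAgent)
    wo_agent: WorkOrderAgent = field(default_factory=WorkOrderAgent)

    def run(self, asset_id: str, readings: list[float]) -> dict:
        finding = self.ts_agent.analyze(readings)        # sensing + reasoning
        return self.wo_agent.create(asset_id, finding)   # actuation


if __name__ == "__main__":
    pressures = [101.2, 101.5, 101.1, 250.0, 101.3]  # toy compressor-pressure trace
    print(SupervisorAgent().run("compressor-07", pressures))
```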
Aligned with national priorities, the challenge targets operational inefficiencies and maintenance bottlenecks in the manufacturing sector. Using the provided datasets and tools, teams aim to improve predictive accuracy, increase operational uptime, and reduce costs, outcomes that can shape industry practice and set new benchmarks for broad adoption.
Gain hands-on experience tackling real Agentic AI problems in collaboration with leading researchers.
Unlock opportunities for internships with top startups.
Make real-world impact: top solutions may be considered for deployment.
Get recognized: winners will attend the CODS conference with travel fellowships.
Below are the key milestones and dates for the Agentic AI Challenge. We recommend subscribing to our GitHub repository and Codabench Challenge for submission updates and the starter kit.
Sep 01: Website and dataset release
Sep 21: Registration deadline. Note: scoring begins after the registration deadline; participants can conduct local testing until registration closes.
Nov 13: Submission deadline
Dec 05: Notification of winners
Dec 20: Award ceremony
All deadlines are 11:59 PM AoE (Anywhere on Earth).
Analyze time-series data (e.g., compressor pressure), fetch documentation, suggest failure modes and next steps.
Fuse logs + sensor patterns, identify likely faults, and recommend inspection steps.
Orchestration agent delegates RCA, repair planning, and auto-creates work items.
Guide field personnel through step-by-step procedures for a given root cause/task.
Summarize diagnosis & maintenance, update asset logs with structured records.
Challenges participants to design better prompts that transform complex multi-agent interactions into clear, structured DAG plans, ensuring effective sequencing, communication, and fallback handling.
Challenges participants to move beyond rigid sequential pipelines and design flexible, fault-tolerant workflows that enable parallelism, multi-agent collaboration, dynamic context sharing, and adaptive execution paths.
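For intuition, one possible way to picture a "structured DAG plan" for these tracks is a simple node/dependency encoding like the sketch below. The schema (node names, "agent", "depends_on", "fallback") is an assumption for illustration, not the challenge's required plan or submission format.

```python
# Illustrative only: one possible encoding of a multi-agent plan as a DAG.
# The schema (node names, "agent", "depends_on", "fallback") is an assumption,
# not the challenge's required plan or submission format.
from graphlib import TopologicalSorter  # Python 3.9+

plan = {
    "sense_pressure":   {"agent": "TimeSeriesAgent", "depends_on": []},
    "diagnose_fault":   {"agent": "DiagnosisAgent",  "depends_on": ["sense_pressure"]},
    "plan_repair":      {"agent": "SupervisorAgent", "depends_on": ["diagnose_fault"]},
    "create_workorder": {"agent": "WorkOrderAgent",  "depends_on": ["plan_repair"],
                         "fallback": "escalate_to_human"},
}

# A well-formed plan must be acyclic; TopologicalSorter raises CycleError otherwise,
# and static_order() gives one valid execution sequence.
order = list(TopologicalSorter({k: v["depends_on"] for k, v in plan.items()}).static_order())
print(order)  # ['sense_pressure', 'diagnose_fault', 'plan_repair', 'create_workorder']
```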
You can participate individually or as a team. Submissions must follow our track-specific formats. Resources, data, and examples are provided in the Starter Kit.
Includes dataset samples, baseline code, and submission guidelines. Get started with our GitHub repository.
Upload your outputs via Codabench, where they are ranked live on a leaderboard with track filtering.
Participants submit their modified template files for both tracks, ensuring that only the designated TODO sections are edited and the workflow runs end-to-end to produce valid JSON outputs.
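As a rough illustration of that pattern, a template might look like the sketch below, where only the marked block is edited and the surrounding scaffold, including the JSON serialization, is left intact. The file layout, function name, and output fields are hypothetical; the actual templates and required schema are provided in the Starter Kit.

```python
# Hypothetical illustration of the "edit only the TODO section" pattern.
# The file layout, function name, and output fields are assumptions; the
# actual templates and required JSON schema are provided in the Starter Kit.
import json


def run_workflow(scenario: dict) -> dict:
    # --- TODO: participant logic goes here ---------------------------------
    # e.g., call your agents on scenario["sensor_data"], reason about faults,
    # and assemble a structured answer.
    result = {"root_cause": "bearing wear", "recommended_action": "inspect bearing"}
    # --- end TODO -----------------------------------------------------------
    return result


if __name__ == "__main__":
    scenario = {"asset": "pump-12", "sensor_data": [0.1, 0.2, 0.9]}
    output = run_workflow(scenario)
    print(json.dumps(output))  # the workflow must emit valid, serializable JSON
```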
Participants are allowed and encouraged to form teams.
There are no strict limits on the number of members in a team, but:
Each participant can only join one team.
Each team member must register with a separate Codabench account.
Only one registration per team is required. Individual members do not need to register separately.Β
Specify:
- Team Name
- Team Leader (Contact Person)
- Email Address
- Codabench Username
The competition organizers will review and approve teams via email.
Once approved, the team will be officially registered for the competition.
A team's total number of submissions will be capped according to competition rules:
Stage 1 (Public Phase):
Teams must respect both the daily submission limit and the absolute submission limit.
Current limit: A total of 50 submissions per team. These 50 submissions will be used to compute the public leaderboard.
Based on these 50 evaluations, participants must select one best solution for consideration on the private leaderboard.
If no selection is made, the most recent submission with the highest score will be used by default.
Team's total submissions = (number of submissions per day) × (days since the competition started), capped at 50.
Note: Submissions from all team members are pooled together and counted toward this total.
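The cap formula above works out as in this small sketch; the per-day limit used here is a placeholder, since the live limits are configured on Codabench.

```python
# Worked example of the submission cap. The per-day limit here is a placeholder;
# the live per-day and total limits are configured on Codabench.
DAILY_LIMIT = 5
TOTAL_CAP = 50

def allowed_submissions(days_since_start: int) -> int:
    return min(DAILY_LIMIT * days_since_start, TOTAL_CAP)

print(allowed_submissions(7))   # 35
print(allowed_submissions(30))  # 50 -- the absolute cap applies
```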
Model: all submissions must use the fixed model, LLaMA-3-70B.
Submissions will be evaluated on the following metrics:
- Task Accomplishment (public leaderboard)
- T-match, a semantic measure (published in final testing)
Local Development (Warm-up)
Participants may run their solutions on 2β3 scenarios for local testing and debugging.
Phase 1: Leaderboard Evaluation (Agent Development)
The public leaderboard will be based on 10 selected scenarios drawn from the existing pool of 141 scenarios.
Performance on these scenarios will determine Phase 1 rankings.
Phase 2: Generalization Test (Final Testing)
After Phase 1, participants will be asked to submit a finalized solution.
This solution will then be evaluated on 10 new scenarios drawn from an entirely different set of datasets (outside the original 141) that represent different asset classes.
The final leaderboard will be determined by the weighted average of task accomplishment scores across both phases for the same solution.
Each participantβs performance in Phase 1 and Phase 2 will be measured separately.
The final score is computed as a weighted average of the two phases.
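In other words, with per-phase scores and weights w1 and w2, the final score is a weighted average, as in the sketch below. The weight values shown are placeholders; the organizers define the actual weights used for the final leaderboard.

```python
# Sketch of the weighted-average final score. The weights are placeholders;
# the organizers define the actual values used for the final leaderboard.
def final_score(phase1: float, phase2: float, w1: float = 0.5, w2: float = 0.5) -> float:
    return (w1 * phase1 + w2 * phase2) / (w1 + w2)

print(final_score(phase1=0.82, phase2=0.74))  # 0.78 with equal weights
```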
Top-performing agents may be validated in real industrial settings or high-fidelity simulations.
We plan limited-scale trials with domain experts and compare agent-generated actions (e.g., maintenance scheduling, diagnostics) against expert workflows to assess practical effectiveness and safety.
Questions? Contact us at: CODS-2025 AI-Agent Challenge Team.