Our Orchestrator-based Multi-Agent System (MAS) prioritizes operational explainability and precision over raw detection speed. By integrating Large Multimodal Models (LMMs) with specialized RAG pipelines, the system achieves a critical balance between reducing false alarms and providing actionable, verified intelligence for wildfire management.
The proposed Multi-Agent System was evaluated using distinct LMM backbones to assess the trade-off between reasoning capability and operational efficiency.
Comparative Analysis of Large Multimodal Models
Table: Quantitative performance comparison of OpenAI and Gemini models across wildfire detection tasks.
Figure: Radar chart comparing the performance profiles of the evaluated models.
Our evaluation benchmarked leading LMMs, including the GPT-4.1 and GPT-5 series and Gemini 2.5, to identify the optimal engine for wildfire reasoning. As detailed in the comparison table, GPT-5-Nano emerged as the superior choice, achieving the highest system-wide Precision of 0.797 while maintaining a competitive inference time of 176.70 seconds. The accompanying radar chart visualizes this distinct performance profile: unlike the symmetrical "diamond" shape of balanced models such as GPT-4o, GPT-5-Nano shows a strategic skew toward precision. This indicates a "conservative decision boundary," ensuring the system prioritizes high-confidence, verified alerts to minimize false alarms, a critical requirement for resource-efficient wildfire management.
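For reference, the reported scores follow the standard definitions over true positives (TP), false positives (FP), and false negatives (FN):

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}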
The results demonstrate that the Orchestrator-based architecture significantly outperforms the standard RAG proxy, improving the F1-score from 0.515 to 0.736 and Precision from 0.569 to 0.797. Although the non-agentic model achieves higher overall classification performance, including a Precision of 0.898, it behaves as a "black box," offering no cross-modal grounding applicable to disaster response tasks. Unlike the proposed MAS, which validates its outputs against externally retrieved evidence, non-agentic approaches rely solely on the model's parametric knowledge, leaving them prone to unverifiable hallucinations with no traceable links to the original data.
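As a minimal sketch of this evidence-grounding step, the following Python fragment shows how an Orchestrator might reject unverifiable alerts; the Alert structure, validate_alert helper, and threshold value are hypothetical illustrations, not the system's actual implementation.

from dataclasses import dataclass, field

@dataclass
class Alert:
    """A candidate wildfire alert produced by the LMM reasoning agent (illustrative)."""
    claim: str                                      # e.g. "smoke plume in sector 7"
    confidence: float                               # model-reported confidence in [0, 1]
    evidence_ids: list[str] = field(default_factory=list)  # cited retrieval records

def validate_alert(alert: Alert, retrieved_ids: set[str], threshold: float = 0.8) -> bool:
    """Accept an alert only if every record it cites was actually retrieved
    (cross-modal grounding) and its confidence clears the threshold; alerts
    citing no retrieved evidence are rejected as unverifiable."""
    grounded = bool(alert.evidence_ids) and all(e in retrieved_ids for e in alert.evidence_ids)
    return grounded and alert.confidence >= threshold

# Example: an alert grounded in two retrieved records passes validation.
alert = Alert("active fire front near grid 14", 0.91, ["img_042", "csv_row_311"])
print(validate_alert(alert, {"img_042", "csv_row_311", "img_007"}))  # True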
Ablation Study
To quantify the system's dependency on its multimodal components, an ablation study was performed by systematically disabling the CSV Data Retrieval, Image Data Retrieval, and Multimodal RAG pipelines, as sketched below.
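A sketch of this protocol, assuming hypothetical build_system and evaluate helpers (neither is part of the described system):

# One retrieval pipeline is disabled at a time and the evaluation is re-run;
# the drop relative to the full system measures that component's contribution.
PIPELINES = ("csv_retrieval", "image_retrieval", "multimodal_rag")

def run_ablation(build_system, evaluate, test_set):
    """build_system(disabled=...) constructs the MAS with the given pipelines
    switched off; evaluate(system, data) returns an F1-score."""
    baseline = evaluate(build_system(disabled=frozenset()), test_set)
    degradation = {}
    for pipeline in PIPELINES:
        ablated = build_system(disabled=frozenset({pipeline}))
        degradation[pipeline] = baseline - evaluate(ablated, test_set)
    return baseline, degradation  # e.g. multimodal_rag: 0.736 - 0.515 = 0.221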
Figure: Performance degradation heatmap visualizing the ablation results.
In wildfire management, hazardous hallucinations manifest primarily as False Positives, where the system issues a confident, soundly reasoned warning in the absence of any hazard. With GPT-5-Nano, the Precision of 0.797 indicates that the system's reasoning was predominantly grounded in genuine positive evidence rather than fabricated. Removing the Multimodal RAG pipeline caused a significant reduction in F1-score from 0.736 to 0.515, demonstrating the system's contextual dependency. This implies a cause-and-effect relationship between the derived insights and the retrieved evidence: if the system were relying on hallucinated parametric knowledge, disabling retrieval would not produce such a dramatic drop in performance.
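Expressed as a relative degradation, disabling the Multimodal RAG pipeline removes roughly 30% of the full system's F1-score:

\Delta F_1^{\mathrm{rel}} = \frac{0.736 - 0.515}{0.736} \approx 0.30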