1. Cross-Domain Difficulty Gradient
The four benchmarks form a clear difficulty hierarchy visible at a glance: GSM8K < MATH < Chemistry < Physics. On GSM8K (Figure 1), nearly every cell across all models is saturated deep red, indicating near-ceiling accuracy regardless of the structural combination. On MATH (Figure 2), the heatmaps remain predominantly red, but the high-difficulty corners begin to bleach noticeably. Chemistry (Figure 4) degrades further: its problems require domain-specific knowledge such as reaction pathways and stoichiometry, though they rely less on extended causal chains than physics problems do. Finally, Physics (Figure 3) exhibits the most widespread accuracy loss. Crucially, the drop from MATH to Physics is far steeper than the drop from GSM8K to MATH, which points to a qualitative shift rather than a mere quantitative increase in difficulty: physics problems demand situational modeling and causal grounding on top of formal manipulation.
2. Depth × Complexity (Column 4): A Universal Bottleneck
Across all four datasets and all eight models, the Depth vs. Complexity pair (Column 4) is consistently the most challenging. On GSM8K, this manifests only as a small triangle of grey (missing) cells in the high-difficulty corner, reflecting the natural scarcity of such problems. In Chemistry and Physics, however, this grey zone expands substantially, and the boundary cells fade to pale tones, indicating a steep accuracy decline. The pattern holds throughout: when reasoning depth or expression complexity is elevated in isolation, most models degrade gracefully; once both dimensions rise simultaneously, accuracy collapses in a cliff-like fashion. The interaction is therefore multiplicative rather than additive: the joint difficulty of deep reasoning over complex expressions is far greater than the sum of the individual challenges.
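One way to make the additive-versus-multiplicative distinction concrete is to compare each cell with an additive baseline built from the two single-dimension margins. The sketch below is illustrative only: the function name `interaction_excess`, the grid layout (rows as depth bins, columns as complexity bins, NaN for missing cells), and the toy accuracies are all assumptions, not values read from the figures.

```python
import math


def interaction_excess(acc):
    """Return, per cell, how far observed accuracy falls below an additive baseline.

    acc[i][j] is the accuracy at depth bin i and complexity bin j
    (hypothetical layout); missing cells are NaN and stay NaN in the output.
    The additive baseline assumes the drop at (i, j) equals the drop caused
    by depth alone plus the drop caused by complexity alone.
    """
    base = acc[0][0]  # easiest corner
    excess = []
    for i, row in enumerate(acc):
        out_row = []
        for j, observed in enumerate(row):
            drop_depth = base - acc[i][0]        # depth raised, complexity lowest
            drop_complexity = base - acc[0][j]   # complexity raised, depth lowest
            predicted = base - drop_depth - drop_complexity
            out_row.append(predicted - observed)  # > 0: worse than additive
        excess.append(out_row)
    return excess


# Toy accuracies, invented purely for illustration.
toy = [
    [0.95, 0.92, 0.88],
    [0.91, 0.80, 0.60],
    [0.86, 0.58, float("nan")],  # grey (missing) cell in the hard corner
]

excess = interaction_excess(toy)
for row in excess:
    print(["nan" if math.isnan(v) else f"{v:+.2f}" for v in row])
```

Values near zero along the first row and column are expected by construction. Clearly positive values in the cells where both dimensions are elevated mean the observed drop exceeds the sum of the two single-dimension drops, which is the signature described above.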
3. Asymmetric Gradients and Structural Patterns
Within each heatmap, the transition from the "easy" corner to the "hard" corner is rarely a smooth, monotonic gradient. Instead, many models sustain high accuracy when only one dimension increases (for example, high execution time but low complexity), yet suffer abrupt drops when both dimensions co-escalate. This asymmetry is most pronounced in Physics. Additionally, the Time vs. Space (Column 2) and Depth vs. Time (Column 3) pairs tend to produce more structured, gradient-like patterns. In contrast, the Time vs. Complexity (Column 1) maps are often speckled and irregular, suggesting that execution time and expression complexity interact in less predictable ways across different model architectures.
4. Checkerboard Instability in Reasoning Models
A striking visual signature distinguishes certain reasoning models, such as QwQ and o4-mini: alternating dark–light checkerboard textures, where adjacent difficulty bins exhibit sharply different accuracy. This instability is virtually absent on GSM8K but intensifies through MATH and Chemistry, becoming severe on Physics. We interpret this as evidence that Chain-of-Thought (CoT) reasoning strategies are brittle: certain parameter combinations happen to align with a model’s reasoning templates, while nearby combinations fall into blind spots. In contrast, the strongest model, ChatGPT-5, displays a markedly smoother and more uniform color distribution. This uniformity of the capability surface, rather than peak accuracy on any single slice, may be the most reliable indicator of robust, parameter-insensitive reasoning.
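The checkerboard texture is described here visually; a simple way to quantify it, offered only as a sketch, is the mean absolute accuracy difference between neighbouring bins. The function name `roughness`, the NaN convention for missing cells, and both toy grids are assumptions introduced for illustration and do not correspond to any model in the figures.

```python
import math


def roughness(grid):
    """Mean absolute difference between horizontally/vertically adjacent cells.

    grid[i][j] is the accuracy in one difficulty bin (hypothetical layout).
    Pairs that involve a missing (NaN) cell are skipped. Returns NaN when
    no valid neighbouring pair exists.
    """
    diffs = []
    rows = len(grid)
    for i in range(rows):
        for j in range(len(grid[i])):
            a = grid[i][j]
            if math.isnan(a):
                continue
            # Look only down and right so each neighbouring pair is counted once.
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < rows and nj < len(grid[ni]):
                    b = grid[ni][nj]
                    if not math.isnan(b):
                        diffs.append(abs(a - b))
    return sum(diffs) / len(diffs) if diffs else float("nan")


# Toy grids, invented for illustration: a smooth gradient and a checkerboard.
smooth = [
    [0.9, 0.8, 0.7],
    [0.8, 0.7, 0.6],
    [0.7, 0.6, 0.5],
]
checker = [
    [0.9, 0.5, 0.9],
    [0.5, 0.9, 0.5],
    [0.9, 0.5, 0.9],
]

print(f"smooth:  {roughness(smooth):.2f}")
print(f"checker: {roughness(checker):.2f}")
```

Under this measure the checkerboard grid scores well above the smooth gradient even though both span a similar accuracy range, which is the sense in which uniformity of the surface carries information that peak accuracy does not.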
Figure 1: GSM8K Dataset: Heatmap visualization of metric correlations in model success rates.
Figure 2: MATH Dataset: Heatmap visualization of metric correlations in model success rates.
Figure 3: Physics Dataset: Heatmap visualization of metric correlations in model success rates.
Figure 4: Chemistry Dataset: Heatmap visualization of metric correlations in model success rates.