X-RAY: Mapping LLM Reasoning Capability Frontiers via

Formalized and Calibrated Probes

To illustrate the core idea of our framework, we present two examples to find the reasoning boundaries of LLMs

Example 1: Postage Stamp Problem

Figure 1: Capability of GPT-4o and o4-mini measured on stamp-coverage probes as a solution space restructuring ex ample.

The Base Task:

"Consider the question: "Using precise 20 stamps of denominations 1, 5, what is the smallest number of 5-cent stamps required to form every integer amount from 1 to 100?"

Formal Calibration:

To generate difficulty-controlled variants that preserve the same constraint topology, we calibrate two semantically meaningful values while keeping the formal structure unchanged: (i) the coverage horizon, from [1, c2] to [1, x1], and (ii) a stamp budget, by fixing the total inventory size to N = x2. Under this calibration, the probe becomes: "Using x2 stamps of denominations {1, y1} (1 < y1), what is the smallest number of y1 required to form every integer amount from 1 to x1?"

Importantly, this is not a surface paraphrase: x1 and x2 directly control the coverage requirement and resource budget, thereby inducing monotone changes in feasibility/optimality. Similarly, we can generate a harder probe with three types of face value: "Using x2 stamps of denominations {1, y1, y2} (1 < y1 < y2), what is the smallest number of y2 required to form every integer amount from 1 to x1?

Example 2: N-Primable Numbers

Figure 2: Capability of GPT-4o and o4-mini measured on N-primable problems as a constraint refinement example.

The Base Task:

"A positive number is called $n$-primable if it is divisible by $n$ and each of its digits is a one-digit prime number. How many 3-primable positive integers are there that are less than 1000?"

Formal Calibration:

We systematically increase the compositional density of the prompt by injecting two levels of extra semantic constraints while keeping the formal skeleton unchanged:

1 Extra Condition (sum_prime): The sum of its digits must itself be a prime number.

2 Extra Conditions (sum_prime + non_decreasing): An additional structural ordering is that the digits must be in non-decreasing order from left to right.

Our comparative analysis reveals a systematic asymmetry in the reasoning capabilities of LLMs:

Robustness to Constraint Refinement: In the $N$-Primable Numbers problem, adding conditions like "non_decreasing" or "sum_prime" inherently acts as constraint refinement. Models exhibit relatively strong robustness when facing such refinements because these extra conditions merely shrink the existing solution space. They filter the current valid set without fundamentally altering the underlying search topology.

Contrast with Solution-Space Restructuring: This explains why the performance degradation here is less severe than in the postage stamp example（especially for reasoning models like o4 mini）. In the postage stamp problem, altering variables induces solution-space restructuring—modifications that change the basic structural form of the solution manifold. When faced with this reorganization of the underlying geometry, the model's reasoning capabilities degrade significantly.

Page updated

Google Sites

Report abuse