In summary, Chain-of-Thought prompting means asking the model to “show its work.” This often improves accuracy on complex tasks and yields interpretable solutions.
Why and When Does CoT Prompting Work?
From a research perspective, CoT prompting works because it leverages the way large language models have learned from text. LLMs are typically trained on vast amounts of internet data, books, and other sources, which likely include examples of people reasoning through problems (think of forums where math problems are solved stepwise, or Q&A sites with explanations). By prompting the model to produce a reasoning chain, we are activating those learned patterns of stepwise explanation. Essentially, we nudge the model to use its latent knowledge in a more structured way, which can reduce errors from jumping straight to a conclusion.
Another way to understand CoT’s effectiveness is to consider the cognitive load of a question. A complex question might require combining several facts, performing a calculation, or considering multiple aspects. If we force the model to answer in one step, it has to handle all of these sub-tasks implicitly in a single forward pass of text generation. With CoT prompting, the model’s generation is broken into parts, and it can allocate more computation (more internal “thought,” or, in practical terms, more reasoning tokens) to each part of the problem. In essence, CoT acts like dynamic time allocation: more complex problems get more steps.
CoT prompting is most useful in scenarios where reasoning or multi-step analysis is needed. According to the original CoT research, it shines on tasks like multi-step math problems, logical inference, and commonsense reasoning. In the context of finance and accounting, many tasks fit this description: analyzing a financial report involves reasoning over multiple sections of text, determining the implications of a policy change requires a chain of logical deductions, and diagnosing why a certain metric changed involves piecing together several data points.
However, CoT is not a silver bullet for all tasks. If a question is purely factual recall (e.g., “What is the capital of Japan?”), a chain of thought might be unnecessary. The model either knows the fact or not. In some straightforward classification tasks, CoT might even introduce confusion if the reasoning is trivial. CoT can also be counterproductive if the model is not capable enough to stick to logical steps (a small model might produce incoherent “reasoning”). Thus, CoT is particularly valuable when the task is complex, ambiguous, or requires combining multiple pieces of information. In finance/accounting research, such tasks abound.
To ground this, consider a practical example: financial ratio analysis. If you ask an LLM directly, “The company’s revenue grew 5% but its net income fell 10%. What might explain this discrepancy?”, the model might give a superficial answer or make something up. Prompted with a chain of thought, by contrast, it could reason: “Revenue up 5% could be offset by higher costs. Perhaps expenses or one-time charges grew significantly. Let’s consider: if costs grew more than revenue, net income could drop. A 10% profit drop alongside a 5% revenue rise suggests margin contraction…” and then conclude with a plausible explanation. The chain of thought ensures the model works through the components of the problem (revenue versus expense changes) and thus provides a more grounded answer.
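For readers who prefer to see this in code, the following minimal sketch sends the same question with and without a chain-of-thought instruction using the OpenAI Python SDK. The model name and prompt wording here are illustrative assumptions, not recommendations.

```python
# Minimal sketch: the same question asked directly vs. with a CoT instruction.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in
# the environment; the model name and prompt wording are illustrative choices.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # substitute whichever model you have access to

question = (
    "The company's revenue grew 5% but its net income fell 10%. "
    "What might explain this discrepancy?"
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content

# Direct prompt: the model jumps straight to a conclusion.
direct_answer = ask(question + " Answer in one sentence.")

# CoT prompt: the model is asked to show its work before concluding.
cot_answer = ask(
    question
    + " Think step by step: consider how costs, margins, and one-time items"
    " could move revenue and net income in different directions, then state"
    " the most plausible explanation."
)

print("Direct:\n", direct_answer)
print("\nChain-of-thought:\n", cot_answer)
```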
In short, CoT prompting works because it aligns the model’s output format with human-like analytical reasoning, and it is most useful for tasks where such analysis is needed, which includes many scenarios in accounting and finance research.
Model Uncertainty, Calibration, and Limitations
Although CoT prompting enhances the reasoning capabilities of LLMs, it does not make the models infallible. It is important for researchers to understand the limitations and potential pitfalls:
Overconfidence and calibration: LLMs, by default, often sound very confident even when they are incorrect. This is a well-documented issue: the probability or “confidence” a model assigns to an answer does not always correlate well with actual correctness (poor calibration). CoT prompting alone doesn’t solve this, as a model can produce a very convincing chain-of-thought that leads to a wrong conclusion. In fact, a detailed but flawed explanation can be more misleading than a terse “I think the answer is X.” Researchers should remain critical of model outputs.
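One rough, practical way to approximate confidence is self-consistency-style sampling: draw several independent chains of thought and treat agreement among their final answers as a crude confidence signal. The sketch below is illustrative only; it assumes the prompt asks the model to end with a line beginning “Final answer:”, and the answer parsing is deliberately naive.

```python
# Rough confidence proxy via self-consistency: sample several chains of
# thought at a higher temperature and measure how often their final answers
# agree. Assumes the OpenAI Python SDK and a prompt that instructs the model
# to end with a line "Final answer: <label>"; the parsing is deliberately naive.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sampled_answers(prompt: str, n: int = 10, model: str = "gpt-4o") -> list[str]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,  # higher temperature so the sampled chains differ
        n=n,              # n independent completions in one call
    )
    answers = []
    marker = "Final answer:"
    for choice in resp.choices:
        text = (choice.message.content or "").strip()
        ans = text.rsplit(marker, 1)[-1] if marker in text else text
        answers.append(ans.strip().lower().rstrip("."))
    return answers

prompt = (
    "Did this firm's operating margin improve or deteriorate? Reason step by "
    "step, then end with the line 'Final answer: improved' or "
    "'Final answer: deteriorated'.\n\n<paste the relevant excerpt here>"
)
answers = sampled_answers(prompt)
top_answer, votes = Counter(answers).most_common(1)[0]
print(f"Majority answer: {top_answer} (agreement {votes}/{len(answers)})")
# Low agreement flags a case for manual review; high agreement is not proof
# of correctness.
```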
When CoT may not help: If a task is primarily about factual recall or straightforward language understanding, CoT might be unnecessary. For example, asking “What year was the Sarbanes-Oxley Act passed?” doesn’t benefit from a chain-of-thought, as the model either knows it (2002) or not. CoT could even introduce errors if the model tries to “derive” a fact from flawed memory. Similarly, if the question is extremely simple (“Calculate 2+2”), CoT is overkill. In some cases, CoT can degrade performance on trivial tasks by introducing verbosity or chances for the model to go off-track. There’s also evidence that for models below a certain size, forcing CoT yields gibberish reasoning (they mimic the format without actual understanding).
Hallucinations and logical errors: CoT can mitigate some hallucinations (especially factual ones, when combined with ReAct-style tool use or with self-consistency checks in which the model effectively double-checks itself), but it can also produce lengthy hallucinated justifications. A model might invent an entire sequence of financial analysis that sounds plausible but is entirely fictional with respect to the input data. Always ensure the chain of thought stays grounded in verifiable information. One best practice is to restrict CoT to the provided context rather than the model’s open-ended knowledge, if possible, for example by prefixing the prompt with: “Base your reasoning only on the report above.”
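A minimal sketch of such a “grounded” CoT prompt, assuming the OpenAI Python SDK and one possible wording of the instructions, might look like this:

```python
# Sketch of a "grounded" CoT prompt: reasoning is restricted to a provided
# excerpt rather than the model's open-ended knowledge. Assumes the OpenAI
# Python SDK; the instruction wording is one possible formulation.
from openai import OpenAI

client = OpenAI()

report_excerpt = """<paste the MD&A or footnote text here>"""

prompt = f"""Report excerpt:
\"\"\"{report_excerpt}\"\"\"

Question: Why did gross margin decline this year?

Instructions:
- Base your reasoning only on the report excerpt above.
- Reason step by step, quoting the sentence you rely on at each step.
- If the excerpt does not contain enough information, say so instead of guessing.
"""

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp.choices[0].message.content)
```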
Bias in reasoning: The chain-of-thought can reveal biases in the model’s thinking. This is double-edged: on one hand, it is useful that such biases become visible (transparency); on the other, the model might articulate problematic reasoning. For instance, a model might (incorrectly) reason that a CEO is “greedy” because of certain language, reflecting a stereotype rather than fact. CoT makes such bias easier to spot, but users must be vigilant. In sensitive applications (like deciding whether a statement is fraudulent or whether an executive is behaving unethically), the model’s reasoning may include unsound jumps, and intervention might be needed to correct or guide it.
Scaling and cost: CoT answers are longer. If you are using an API like OpenAI’s, that means more tokens and higher cost, and it also means slower responses. In a research pipeline where hundreds of thousands of documents are analyzed, the token overhead can be significant, so the improved accuracy has to be weighed against the cost. Sometimes a hybrid approach works well: use a quick non-CoT classification to narrow candidates, then apply CoT to the borderline or most complex cases, as sketched below.
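To make the hybrid idea concrete, here is a sketch in which a cheap, terse first pass labels the easy cases and a CoT pass is reserved for those flagged as unclear. The model names and the label set are illustrative assumptions.

```python
# Sketch of a cost-aware hybrid pipeline: a short, cheap prompt handles easy
# cases and a longer CoT prompt is reserved for cases the first pass flags as
# unclear. Model names and the label set are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
CHEAP_MODEL = "gpt-4o-mini"  # fast, inexpensive first pass
STRONG_MODEL = "gpt-4o"      # CoT pass on the harder cases

def classify_cheap(text: str) -> str:
    resp = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[{"role": "user", "content":
            "Label the tone of this disclosure with exactly one word: "
            "positive, negative, or unclear.\n\n" + text}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().rstrip(".")

def classify_cot(text: str) -> str:
    resp = client.chat.completions.create(
        model=STRONG_MODEL,
        messages=[{"role": "user", "content":
            "Reason step by step about the tone of this disclosure, weighing "
            "hedging language, quantitative statements, and forward-looking "
            "remarks, then end with 'Label: positive' or 'Label: negative'.\n\n"
            + text}],
        temperature=0,
    )
    return resp.choices[0].message.content

def label_disclosure(text: str) -> str:
    first_pass = classify_cheap(text)
    if first_pass in {"positive", "negative"}:
        return first_pass        # cheap answer accepted as-is
    return classify_cot(text)    # spend CoT tokens only where needed
```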
Model dependency: The effectiveness of CoT prompting is model-dependent. GPT-4, for example, is generally better at CoT reasoning (and more likely to follow the steps through correctly) than GPT-3.5. If you are using open-source models, the differences can be stark. Some newer models (such as Anthropic’s Claude) are trained with greater emphasis on reasoning and respond well to CoT prompts, whereas older, GPT-2-level models will not. Always test on a small scale to confirm that the model you use actually benefits from CoT. Wei et al. (2022) found that at smaller model scales, CoT did not help and sometimes hurt; the benefit only emerged at larger scales. In 2025, most cutting-edge models are large, but if you use a smaller model for privacy or offline reasons, be aware of this.
Human in the loop: Especially in finance and accounting, expert oversight is needed. The outputs of a CoT-empowered model can be very convincing. It’s easy to get seduced by the logical flow and assume it must be correct. But as any teacher knows, a student can have a very logical-looking solution that arrives at the wrong answer due to one assumption being off. The same is true for LLMs. Treat the chain-of-thought as you would a student’s explanation: check the premises, check the math, verify the factual claims. A positive development is that some research shows models can identify their own mistakes if prompted to reflect (e.g., a technique called “reflective prompting” or using the model to critique its earlier answer). But this is not foolproof and can double the work (the model might need to be run again to check itself).
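One way to operationalize such a self-check is a two-pass prompt: the model first answers with a chain of thought, and a second call asks it to audit that chain. The sketch below uses one possible wording and, as noted above, roughly doubles the cost.

```python
# Sketch of a two-pass "answer, then critique" check: the model first answers
# with a chain of thought, and a second call asks it to audit that chain.
# This is not foolproof (the critic can miss its own mistakes) and roughly
# doubles the token cost. Wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

question = ("Given the excerpt below, did the firm's leverage increase year "
            "over year?\n\n<paste the excerpt here>")

draft = ask(question + "\n\nReason step by step before giving your answer.")

critique = ask(
    "Here is a question and a step-by-step answer.\n\n"
    f"Question:\n{question}\n\nAnswer:\n{draft}\n\n"
    "Check each step: are the premises supported by the excerpt, is the "
    "arithmetic correct, and does the conclusion follow? List any errors, "
    "then state whether the original answer should be revised."
)

print(critique)
```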
Uncertainty estimation: As touched on above, there are prompting strategies that encourage models to express uncertainty (“I’m not entirely sure, but I think…”). Interestingly, Zhou et al. (2023) found that injecting phrases of uncertainty led to increased accuracy, possibly because it allows the model to consider alternatives rather than forcing a single answer. However, be careful: just because the model says “I’m not sure” does not guarantee it actually knows when it is wrong; models sometimes claim high confidence incorrectly or express undue uncertainty. That said, combining uncertainty prompting with CoT can produce more calibrated responses. For example, if you want an assessment of risk from a text, you might prompt: “Provide a step-by-step rationale and, if confidence is low, say you are unsure.” The result can be a nuanced answer with an explicit note of uncertainty where appropriate.
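A sketch of such an uncertainty-aware CoT prompt is shown below; the wording is one possible formulation, not a standard, and the model’s self-reported confidence should be read as a heuristic rather than a calibrated probability.

```python
# Sketch of combining CoT with an explicit uncertainty instruction.
# The prompt wording is one possible formulation, not a standard; treat the
# self-reported confidence as a heuristic, not a calibrated probability.
from openai import OpenAI

client = OpenAI()

filing_text = """<paste the risk-factor or MD&A excerpt here>"""

prompt = f"""Assess the liquidity risk described in the text below.

Text:
\"\"\"{filing_text}\"\"\"

Provide a step-by-step rationale. If your confidence in any step is low, say
explicitly that you are unsure and note what additional information would
resolve the uncertainty. End with:
"Overall assessment: <low/moderate/high risk>" and "Confidence: <low/medium/high>".
"""

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(resp.choices[0].message.content)
```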
When CoT fails or is overkill: CoT may not help if the primary difficulty is not reasoning but knowledge. For instance, “What is the current GDP of Panama?” The model either knows from training data or not. Reasoning won’t invent the correct number (hopefully). In fact, CoT might lead it to guess a number via some flawed reasoning. Similarly, if the task is to parse a straightforward structured document (e.g., extract a value from a specific field), a simple regex-like approach might outperform an LLM with CoT, which could hallucinate or misinterpret the task. Use CoT where reasoning is the bottleneck, not where precision or retrieval is the main issue.
Finally, despite these limitations, it is worth highlighting that CoT often helps more than one might expect. Wei et al. (2022) reported that simply prompting the model to explain or reason can uncover correct answers to problems it initially gets wrong. CoT also makes the model’s errors more detectable. So it is a trade-off: you get more insight into the model’s thinking, which helps you spot errors, but you also get a lot more text to sift through.
The safest approach is to treat LLM outputs as a draft analysis: helpful, time-saving, but needing verification. CoT makes that draft more useful for verification. In critical research, you might use the model to get 90% of the way (with CoT), then have a research assistant or co-author verify key points, akin to how you would verify a colleague’s work.
Conclusion
Chain-of-Thought prompting has opened a new frontier in how we interact with AI models, moving from terse question-answering to a more dialogue-like, explanation-rich process. For accounting and finance researchers and educators, this is a promising development. We have seen what CoT prompting is and why it works: it leverages the latent reasoning abilities of large language models by simply asking them to articulate intermediate steps.
As of 2025, tools like GPT-4 or even GPT-5 have made CoT prompting accessible without needing to fine-tune models or write custom code. You simply ask the model to “walk through the reasoning.” Looking ahead, I expect CoT prompting and its offshoots to become standard in analytical AI applications. Models might become better calibrated, or have built-in mechanisms to check their own work (we see early signs of this in research on self-reflection and verification steps). For the research community, an exciting possibility is combining human and machine reasoning, e.g., a researcher and an AI both provide chains-of-thought on a problem and then reconcile differences. Such “hybrid reasoning” could lead to more robust conclusions.