
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
arXiv: 10.48550/arXiv.2505.13775

Kaya Stechly, SCAI, Arizona State University

Karthik Valmeekam, SCAI, Arizona State University

Atharva Gundawar, SCAI, Arizona State University

Vardhan Palod, SCAI, Arizona State University

Subbarao Kambhampati, SCAI, Arizona State University

Abstract

Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens—often anthropomorphized as “thoughts” or reasoning traces and which are claimed to display behaviors like backtracking, self-verification, and meta-cognition—actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver. By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences the former. Our experiments involve training transformer models on traces and solutions generated by A* search. We notice that, despite significant improvements on the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases can improve upon it and generalize more robustly on out-of-distribution tasks. These results challenge the assumption that intermediate tokens or “Chains of Thought” reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.

1 Introduction

Recent advances in general planning and problem solving have been spearheaded by so-called “Long Chain-of-Thought” models, most notably DeepSeek’s R1 [8]. These transformer-based large language models are further post-trained using iterative fine-tuning and reinforcement learning methods. Following the now-standard teacher-forced pre-training, instruction fine-tuning, and preference alignment stages, they undergo additional training on reasoning tasks: at each step, the model is presented with a question; it generates a sequence of intermediate tokens (colloquially or perhaps fancifully called a “Chain of Thought” or “reasoning trace”); and it ends with a specially delimited answer sequence. After verification of this answer sequence by a formal system, the model’s parameters are updated so that it is more likely to output sequences that end in correct answers and less likely to output those that end in incorrect answers.

While (typically) no optimization pressure is applied to the intermediate tokens [1, 2], empirically it has been observed that language models perform better on many domains if they output such tokens first [3, 4, 5, 6, 7, 8, 9, 10, 11]. While the fact of the performance increase is well-known, the reasons for it are less clear. Previous work has often framed it in anthropomorphic terms, claiming that these models are “thinking” before outputting their answers [3, 12, 8, 13, 2, 14]. Simultaneously, the process of performing more auto-regressive forward passes before outputting the final answer has been credited as an instance of inference-time scaling – that is, these models are assumed to be doing problem-adaptive computation.

Famously, DeepSeek’s R1 paper claimed that one of the most impressive observed behaviors of their trained models was the so-called “aha” moment: as part of the chain of thought it was producing in order to answer some question, the model output the token “aha”, seeming to indicate that it had come upon a sudden realization. While a human may say “aha” to indicate exactly a sudden internal state change, this interpretation is unwarranted for models which do not have any such internal state, and which on the next forward pass will only differ from the pre-aha pass by the inclusion of that single token in their context. Interpreting this token as meaningful in this way requires making an additional assumption that has thus far been brushed to the side in discussions of how long CoT models function and what they do – that the derivational traces they produce are semantically meaningful in the same way that the traces they were trained on were or at least in the way that a human might expect them to be.

For R1 and similar large models, this is nearly impossible to check. The intermediate tokens that massive pre-trained and post-RL’d models produce meander for dozens of pages, are written wholly in ambiguous and polysemantic natural language, and – perhaps much worse – are the result of long, opaque training processes on data that we have no access to and cannot compare against.

In this paper, we shed some light on the question of whether intermediate traces are semantically meaningful. Following previous work that elucidated important functional aspects of large scale models through controlled small scale experiments [32, 33, 34] and working within a sort of “model organism” paradigm, we focus the current work on fully controlled, open, and replicable models trained from scratch. Our models are trained on a simple and well-understood shortest path planning problem for randomly generated mazes, with our training runs including varying kinds of intermediate traces – from none, to ones generated by the classic A\( ^*\) algorithm, to noisy and irrelevant ones. This setup is not only well-understood as a classical computer science problem, but has also grown to be a well-studied domain for trace-augmented transformer training [18, 21, 35, 36].

We approach the problem of understanding intermediate token semantics from three major novel angles, performing empirical evaluations on models we train on small planning tasks. First, we construct a validator for A\( ^*\) execution traces and use it to validate and compare trace accuracy to solution accuracy, finding only a loose correlation between the two. Then, we train half-billion-parameter Qwen models on no traces, on correct traces, and on deliberately irrelevant traces. We present a dataset manipulation that – despite the fact that it removes all problem-specific semantics – leads to trained models that perform better on both in-distribution and out-of-distribution tasks. We argue that, if performance is the goal, assuming human-like or algorithm-interpretable trace semantics is not only unnecessary but potentially misleading.

2 Related Work

Training Transformer models from scratch to plan using derivational traces - There have been various approaches to training transformer models on algorithmic execution traces. Searchformer trained transformer models on A* search execution traces to emulate the A* algorithm for solving pathfinding tasks [18]. Similarly, Stream-of-Search trained Transformers to internalize the search processes of breadth-first search (BFS) and depth-first search (DFS) within a linguistic representation for solving the Countdown arithmetic game [20]. Yang et al. [37] trained transformer models on search trajectories to mimic search strategies such as Monte Carlo Tree Search or BFS. In concurrent research, System 1.x introduced a dual-model approach coordinated by an external meta-controller, where one model quickly generates solutions without explicit reasoning traces, while the other operates more deliberately, providing step-by-step explanations [38]. Similarly, SwiftSage employed multiple models within an agent-style workflow to differentiate between fast and slow modes of reasoning [39]. In these works, the System 2 or slow-reasoning models are transformer models trained on execution traces of formal procedures. Pan et al. [40] trained transformer models to solve Boolean SAT problems and also measured trace correctness with respect to the abstract Davis–Putnam–Logemann–Loveland (DPLL) procedure on which the transformer models were trained. However, their work focused mainly on showing that decoder-only transformers with CoT can solve Boolean SAT problems, whereas ours provides a much deeper analysis of trace-plan correlation and semantic correctness.

Trace evaluation - Previous studies have examined whether intermediate derivational steps correspond meaningfully to the final answers generated by Large Language Models (LLMs) and Large Reasoning Models (LRMs), finding notable discrepancies. Specifically, prior research has demonstrated that intermediate CoT steps, despite appearing coherent, do not reliably reflect the actual computations used by models to reach their conclusions [41, 42]. Even SOTA reasoning models, whose performance gains are typically attributed to explicit intermediate "think" tokens, have been shown to produce intermediate reasoning chains that fail to meaningfully correspond to the underlying computations leading to their final outputs [15, 16, 17, 1].

Training on Noisy traces - Li et al. [11] investigated the impact of perturbations in reasoning traces by distilling DeepSeek R1's and QwQ 32B-Preview's derivational outputs on math and coding tasks. Their findings reveal that models remain robust to noise in the trace, showing improved performance relative to the base model even when trained on derivations containing incorrect mathematical operations. However, as previously noted, systematically evaluating the semantic correctness of natural language derivational traces remains infeasible, so no definitive conclusions can be drawn regarding the semantic alignment between reasoning traces and final answers. Nonetheless, this work does seem to indicate that there is no strong causal connection between trace correctness and solution correctness.

Dualformer, an extension of Searchformer, trained transformer models on truncated A* derivational traces by arbitrarily removing steps from the original A* search process [21]. This pruning renders the traces semantically invalid, as they no longer reflect any faithful execution of the A* algorithm. Despite this, models trained on these shortened traces outperformed those trained on full A* traces used in Searchformer. These findings further support the notion that the semantic correctness of derivational traces with respect to ground-truth algorithms like A* may not be causally linked to the correctness of the final output plan.

Post Training methods - Numerous approaches have shown that supervised fine-tuning (SFT) over derivational traces and RL methods improve the task performance of LLMs on planning and reasoning tasks [19, 11, 10, 43, 44, 45, 8, 46]. Among these, one of the first approaches to show impressive results is STaR, a method in which the LLM is prompted to generate multiple responses to a problem along with their intermediate CoTs, and the responses are then filtered based on whether the final answer is correct. The LLM is further fine-tuned on examples where the model generated the correct final answer [19]. This approach has been shown to significantly outperform direct answer prediction fine-tuning. Recently, since the release of DeepSeek’s R1, which is post-trained using GRPO [8], two major types of post-training methods have emerged: 1. fine-tuning LLMs on derivational traces of LRMs, mainly R1, to enable CoT generation in smaller LLMs (model distillation), and 2. using various RL algorithms, mainly different versions of GRPO, to improve task performance [47, 48]. In all of these approaches there is no semantic evaluation of the derivational traces generated by the LLMs or LRMs. In the case of iterative SFT, the model responses are filtered based on the correctness of the final answers, and the smaller models are trained to mimic these long derivational traces to "elicit" reasoning and procedure following. Similarly, for RL-based approaches, reward is determined by sound verifiers based only on the final answers. Since there are no process reward models, there is no local evaluation of the correctness of the produced intermediate tokens.

3 Background

Though recently popularized by Large Reasoning Models, especially DeepSeek’s R1 [2], training on intermediate traces in order to improve transformer performance dates back to at least GPT-2 [3] and has been extended and analyzed from many angles [18, 19, 20, 21, 22, 23, 24]. While these papers demonstrated ways to improve final-answer accuracy, they neither evaluate trace accuracy nor explicitly attempt to train on incorrect or irrelevant traces.

Thus, while they do show that accuracy increases, they leave open the question of whether that accuracy increase actually stems from the additional semantic information in the trace. In many cases, especially with pre-trained models, it is nearly impossible to formally verify reasoning traces, due to the ambiguity of natural language and the lack of a clear ground truth. However, for small, well-scoped formal domains like the gridworld path planning domain used in [18, 21] and this paper, and with models carefully trained from scratch on those domains, we have the ability to check whether generated traces follow the exact semantics enforced in the training data and causally predict the final solutions that the model outputs.

Figure 1. Examples of mazes. The left is generated by Wilson’s algorithm and is used for model training. The right is generated by the Drunkard’s Walk algorithm and used to evaluate models as an out of distribution task. The goal is represented by a green square and the start state by a yellow square. Black squares represent impassable walls. Blue squares represent steps along the optimal path (as found by A\( ^*\) ). Gray squares are squares that were explored by A\( ^*\) but are not along the optimal path. White squares are unexplored traversable squares.

3.1 The Maze Pathfinding Domain

We consider a standard grid-based path-finding domain. The task is to find a legal path between a given start cell and goal cell in a \( 30\times30\) grid. Every cell of this grid is either free (traversable) or a wall (impassable). The agent begins at the start state, and at every state may take one of four actions: go up, go down, go left, or go right. The transformer is given a full description of this problem (in token format – we follow the formulation used by [18] and [21]) and must output as its final answer a plan, which consists of a sequence of actions. A plan is considered correct if it is executable – that is, every action it presents moves the agent from a free cell to an adjacent free cell – and its final action results in the agent being at the goal cell.
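To make this plan-correctness check concrete, a minimal sketch is shown below (an illustrative helper, not the paper's released code; the grid encoding with grid[y][x] == True for free cells is an assumption):

    # Sketch of the plan-validity check described above (illustrative, not the authors' code).
    # grid[y][x] is True for free (traversable) cells and False for walls.
    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

    def plan_is_valid(grid, start, goal, plan):
        x, y = start
        for action in plan:
            dx, dy = MOVES[action]
            x, y = x + dx, y + dy
            # every step must stay on the grid and land on a free cell
            if not (0 <= x < len(grid[0]) and 0 <= y < len(grid)) or not grid[y][x]:
                return False
        return (x, y) == goal   # the plan must terminate exactly at the goal cell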

We generate navigation problems using diverse generation algorithms, resulting in varied structural patterns and exploration dynamics. This enables systematic out-of-distribution (OOD) evaluation by testing models on maze types unseen during training, which was done entirely on mazes generated with Wilson’s algorithm. These generation algorithms can be sorted into two major categories: 1) algorithms that do not permit cycles and sample over the spanning trees of the \( 30\times30\) grid and 2) algorithms that permit loops and create noisy, less-structured dungeon or cave-like instances. For all algorithms except SearchFormer’s, which has its own start and goal generation loop, we sample a legal (start, goal) pair after maze generation.
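A corresponding (start, goal) sampler might look like the following sketch; the helper name and the reachability note are assumptions, since the paper does not spell out its sampler:

    import random

    def sample_start_goal(grid, seed=0):
        rng = random.Random(seed)
        # collect all free cells; in spanning-tree mazes every pair of free cells is connected,
        # so any two distinct free cells form a legal (start, goal) pair. For cave-like maps
        # one would additionally check reachability (e.g., with an A* call).
        free = [(x, y) for y, row in enumerate(grid) for x, cell in enumerate(row) if cell]
        return tuple(rng.sample(free, 2))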

Acyclic Maze Generation

  1. Wilson’s algorithm: This is the algorithm that we use to generate mazes for training models. Wilson’s algorithm generates uniform random mazes by performing loop-erased random walks from unvisited cells until they connect to the current maze [49]. Each walk removes any loops it creates, ensuring a valid tree structure. This process continues until all cells are included, producing a uniform sample from the space of all possible spanning trees of the \( 30\times30\) graph.
  2. Kruskal’s algorithm: Kruskal’s algorithm, originally proposed for finding a minimum spanning forest of an undirected edge-weighted graph [50], generates mazes by treating each cell as a node and randomly removing walls between unconnected regions, using a union–find structure to avoid cycles. This results in a fully connected maze without loops, though the maze distribution is not perfectly uniform. The method produces mazes biased towards short local connections and dead ends.
  3. Randomized Depth-First Search algorithm: The randomized depth-first search (DFS) or recursive backtracker algorithm generates mazes by carving a path forward until reaching a dead-end [51]. When it hits a dead-end (no unvisited neighbors), it backtracks until it finds a new direction to explore, repeating until all cells are visited and connected into a complete maze. Depth-first search is biased towards generating mazes with low branching factors and many long corridors.

Cave Generation

  4. Drunkard’s Walk: We implement a version of the “Drunkard’s Walk” algorithm, as described by [25], and originally used for procedurally generating dungeons for top-down two-dimensional video games. Starting from a grid of solid walls, a random walk is performed, carving out the current cell on every step. The walk continues until a predefined number or percentage of floor tiles has been dug out (a minimal sketch of this carving loop appears after this list). This method preserves cycles, producing cave-like structures with open chambers and looping corridors. The output space includes grid states unreachable by perfect acyclic maze generators.
  5. SearchFormer-style generation: We also implement the random generation algorithm used in the SearchFormer paper [18], though we use it for evaluation rather than training. Tasks are generated by exhaustive rejection sampling: first, randomly select a percentage between 30% and 50%; then select that percentage of cells to be wall cells. Randomly choose a start and goal location and execute A\( ^*\) to find an optimal plan. Reject unsolvable, too easy, or duplicate instances and resample. These instances also allow loops and so are also out of distribution for our models.
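Below is a minimal sketch of the Drunkard’s Walk carving loop referenced in item 4 above; the grid size default and the 40% floor target are illustrative assumptions rather than the paper’s settings:

    import random

    def drunkards_walk(size=30, floor_fraction=0.4, seed=0):
        rng = random.Random(seed)
        grid = [[False] * size for _ in range(size)]        # False = wall, True = floor
        x, y = rng.randrange(size), rng.randrange(size)
        carved, target = 0, int(floor_fraction * size * size)
        while carved < target:
            if not grid[y][x]:
                grid[y][x] = True                           # carve out the current cell
                carved += 1
            dx, dy = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
            x = min(max(x + dx, 0), size - 1)               # clamp the walk to the grid
            y = min(max(y + dy, 0), size - 1)
        return grid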

3.2 The A\( ^*\) Search Algorithm

A* is a classic best-first graph–search procedure that combines the uniform-cost guarantee of Dijkstra’s algorithm [26] with domain-specific heuristics to focus exploration on promising states, originally introduced to compute minimum-cost paths in state-space graphs [27].

The algorithm maintains an open list (a priority queue) keyed by \( f(n) = g(n) + h(n)\) , where \( g(n)\) is the exact cost from the start and \( h(n)\) is a heuristic estimate to the goal, and also maintains a closed list of already visited nodes. It repeatedly pops the open list node with the smallest \( f\) ; if this is the goal, it reconstructs the path that led to this node, and this path is returned as the final plan. Otherwise, it generates child nodes (in our case, traversable neighbor cells) and calculates their \( g\) and \( f\) values. For each child, it either inserts it into the open list or – if the node is already in the list – updates its \( g\) value if the new value is lower. The popped node is added to the closed list to prevent re-expansion.

The effectiveness of A* is dependent on the heuristic it is implemented with. For solvable graph search problems like the ones featured in this paper, any consistent (\( h(n) \leq c(n, n') + h(n')\) for all neighboring \( n'\) ) heuristic will guarantee that the plan returned is not only satisficing but optimal [28].

For the maze path planning problems we examine in this paper, we use the very standard Manhattan heuristic \( h(n) = |x_n - x_g| + |y_n - y_g|\) , which computes the sum of horizontal and vertical displacements between a cell and the goal. On a 2-D grid with only orthogonal, unit-cost movement, this heuristic is consistent, ensuring A* returns an optimal path.
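The following is a compact sketch of A* with the Manhattan heuristic on this grid domain (an illustration of the procedure described above, not the paper’s implementation; it reuses the grid[y][x] encoding assumed earlier):

    import heapq

    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def astar(grid, start, goal):
        open_list = [(manhattan(start, goal), 0, start)]    # entries are (f, g, node)
        g_best = {start: 0}
        parents = {start: None}
        closed = set()
        while open_list:
            f, g, node = heapq.heappop(open_list)
            if node in closed:                              # stale entry; skip
                continue
            closed.add(node)
            if node == goal:                                # reconstruct the optimal path
                path = []
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            x, y = node
            for child in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                cx, cy = child
                if (0 <= cx < len(grid[0]) and 0 <= cy < len(grid) and grid[cy][cx]
                        and g + 1 < g_best.get(child, float("inf"))):
                    g_best[child] = g + 1
                    parents[child] = node
                    heapq.heappush(open_list, (g + 1 + manhattan(child, goal), g + 1, child))
        return None                                         # no path exists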

Finally, following SearchFormer and Stream of Search, we modify the A* implementation to output a linearized execution trace [20, 18]. That is, whenever the algorithm creates a child node and adds it to the open list, it prints create x y cA cB, and when it closes a node and adds it to the closed list, it prints close x y cA cB. Here, ‘A’ in cA represents the exact cost from the start state to the node (i.e., the \( g(n)\) value) and ‘B’ in cB represents the heuristic estimate from that node to the goal state (i.e., the \( h(n)\) value). Similar to the Searchformer notation [18], we use the prefix "c" to differentiate node coordinates from cost estimates. In the next section, we construct an A* validator that reverses this process – it takes in a linearized trace and attempts to simulate the corresponding open and closed list operations to check whether they are valid with respect to the semantics of this implementation.
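As an illustration of this format, the opening of a linearized trace for a hypothetical problem with start cell (2, 3) and goal cell (5, 3) might read as follows (values invented for illustration; not drawn from the dataset):

    create 2 3 c0 c3
    close 2 3 c0 c3
    create 3 3 c1 c2
    create 2 4 c1 c4
    close 3 3 c1 c2
    ...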

4 Validating Traces and Solutions

Figure 4. Trace validation procedure. Our A\( ^*\) validator runs through the model’s output stream sequentially. Assuming no parsing errors, it will flag a trace as invalid if at some point it contains an invalid action. The left bottom corner is \( (0,0)\) . The goal is represented by a green square and the start state by a yellow square.

While previous work evaluated the final accuracy of trace-trained models, it did not evaluate the traces themselves. For large, production ready RL-post-trained models like DeepSeek’s R1, this is practically impossible. For even a simple query, the model produces dozens of pages of convoluted and meandering output before arriving at an answer, and this output is all in natural language, which makes it very easy to read multiple equally valid interpretations into it.

To truly tell whether the traces that were trained on helped in the expected way, we need a formal way of validating their correctness. By training models on traces produced by a well-known algorithm with well-known semantics, it is possible to check whether the model’s emulations of the algorithm’s execution trace are correct.

We construct a formal verifier for A\( ^*\) traces. The format for these traces follows [18], and is described in more detail in Section 3. Essentially, our validator consumes the generated trace and simulates the operations proposed in that trace on open and closed lists. It runs through the generated trace sequentially, parsing each action x y cA cB sequence as an operation and using it to update its open and closed list. It marks a trace as valid if it can correctly execute this procedure until it closes the goal node. Errors in execution can be one of the following:

  • Parsing Error: a substring is malformed and does not parse into either a create or a close action with the correct arguments.
  • Invalid Neighbor: the current create action is attempting to create an illegal child, either referencing a wall cell or a cell that is not adjacent to the last closed node.
  • Already Closed: the current create action is attempting to re-create a node that has already been closed.
  • Not in Open List: the current close action is referencing a node that is not in the open list.
  • Not Lowest \( f\) -value: the current close action is attempting to close a node when there is at least one other node in the open list with a lower \( f\) -value.
  • Goal Not Reached: after the entire sequence was processed, the goal node was not in the closed list, and so the reconstruction step cannot proceed.
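A minimal sketch of such a validator is given below, covering a subset of the checks above; the step representation (pre-parsed tuples) and the handling of the very first create as the start node are simplifying assumptions, not details from the paper:

    # steps is a list of pre-parsed tuples like ("create", x, y, g, h) or ("close", x, y, g, h).
    def validate_trace(grid, start, goal, steps):
        open_list, closed, last_closed = {}, set(), None
        for op, x, y, g, h in steps:
            node, f = (x, y), g + h
            if op == "create":
                if node in closed:
                    return "Already Closed"
                if last_closed is None:
                    if node != start:                       # first create must be the start node
                        return "Invalid Neighbor"
                elif (abs(x - last_closed[0]) + abs(y - last_closed[1]) != 1
                      or not grid[y][x]):
                    return "Invalid Neighbor"
                open_list[node] = min(open_list.get(node, f), f)
            else:                                           # a close action
                if node not in open_list:
                    return "Not in Open List"
                if f > min(open_list.values()):
                    return "Not Lowest f-value"
                del open_list[node]
                closed.add(node)
                last_closed = node
        return "valid" if goal in closed else "Goal Not Reached"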

With this verifier in hand, we can now distinguish between plan validity and trace validity for models trained on this kind of dataset. To construct our training sets, we generate 50,000 mazes using Wilson’s algorithm, and randomly select a start and goal cell. Then, we use A* with the Manhattan distance heuristic to find an optimal plan for each maze as well as to produce a trace that is saved with each datapoint.

We modify the architecture of the Qwen2.5 0.5B model [29] to support a vocabulary of exactly 944 different tokens (which reduces the parameter count to about 380 million from 500 million), randomly initialize the model, and then train it for 85,000 training steps with a batch size of 8 on two NVIDIA H100s. The model has a context length of 32,000 tokens to support the long lengths of intermediate token generation. (Our other experiments later in the paper also use this architecture, but train on different datasets, from solution-only through to irrelevant and noisy traces. All code and data will be made public.)
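A rough sketch of how such a vocabulary-reduced model could be instantiated with Hugging Face transformers is given below; the non-vocabulary hyperparameters are stock Qwen2.5-0.5B-style values and are assumptions on our part, not settings reported in the paper:

    from transformers import Qwen2Config, Qwen2ForCausalLM

    config = Qwen2Config(
        vocab_size=944,                   # custom domain tokenizer (per the paper)
        max_position_embeddings=32000,    # long context for lengthy intermediate traces
        hidden_size=896,                  # assumed Qwen2.5-0.5B-style backbone values below
        num_hidden_layers=24,
        num_attention_heads=14,
        num_key_value_heads=2,
        intermediate_size=4864,
        tie_word_embeddings=True,
    )
    model = Qwen2ForCausalLM(config)      # randomly initialized, then trained from scratch
    print(sum(p.numel() for p in model.parameters()))   # on the order of 0.38B parameters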

We test this model, trained on Wilson mazes, on a thousand instances each of mazes generated by the Wilson, Kruskal, DFS, SF-Style, and Drunkard approaches, evaluating solution accuracy as well as trace validity. We present these results as confusion matrices in Figure 5, with each domain represented by a separate matrix. These results break down the correlation between model accuracy and trace validity. As can be seen from the results, trace accuracy is not a perfect predictor of plan accuracy. In fact, as the off-diagonal entries show, the model can produce a valid trace and then go on to produce an incorrect plan, or produce an invalid trace and yet end up at a correct plan.

Figure 5. Plan versus trace validity for the model trained on correct traces, measured across domains. Wilson = generated by Wilson’s algorithm, Kruskal = mazes generated by Kruskal’s algorithm, DFS = mazes generated by Depth-First Search, SF-Style = instances generated in the SearchFormer Style, Drunkard = instances generated using the Drunkard’s algorithm.

5 Training with Traces: Does Meaning Matter?

If plan and trace validity are only loosely connected for models trained on the A* trace dataset, then perhaps the validity of the trace isn’t as important to the performance increase as previously believed. To test this empirically, we construct a second training dataset called Swap, which we build by randomly permuting reasoning traces between problems.

This dataset consists of the exact same 50,000 problems as the original trace dataset, but problem 1’s trace will be given to, say, problem 4; problem 4’s will be given to problem 7; and so forth. In other words, while the traces continue to have the right form and some generic domain information, they no longer have any connection to the specific problems they are associated with. Training examples consist of a start and goal state, a maze definition, an A\( ^*\) trace for searching for the shortest path across a totally unrelated maze from a different start and goal state, and the correct solution plan for the original maze.
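The swap manipulation itself amounts to a single permutation of the trace field across the dataset, as in the sketch below (field names are illustrative assumptions; the sketch does not enforce a derangement, so a trace can occasionally land back on its own problem):

    import random

    def make_swap_dataset(dataset, seed=0):
        rng = random.Random(seed)
        traces = [example["trace"] for example in dataset]
        rng.shuffle(traces)                 # permute traces across problems
        return [
            {"maze": ex["maze"], "query": ex["query"],
             "trace": swapped_trace, "plan": ex["plan"]}   # plan stays paired with its own maze
            for ex, swapped_trace in zip(dataset, traces)
        ]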

What we find is that our most competent model not only maintains performance on the in-distribution test set, but also generalizes better to the other maze distributions we test – all despite the lack of algorithmically valid semantics in the traces it was trained on and in the traces it generates.

For these experiments, we continue to use the same model architecture described in the previous section, varying the datasets we train on to see how they affect performance – even as they further corrupt or completely destroy the correctness of the traces. For best results, we performed hyperparameter optimization (via Optuna [30]). We provide additional details on hyperparameters and initializations in the Appendix, and we will publicly open-source our codebase for full transparency.

The most basic training run is the standard solution-only baseline, where the model is trained on just solutions without any derivational traces. The next baseline, following previous work [18, 21, 20], is training the model with A*-generated traces, teacher-forcing during training to make it output intermediate tokens before the final solution. These are the models discussed in the previous section. Finally, we use the same underlying training data and distribution, but modify it by corrupting the traces. Our trace corruption process is very simple: we randomly permute which problem is associated with which trace – so, for example, the third problem might have the 5th problem’s trace, which is an execution trace of A* on an unrelated maze with unrelated start and goal states.

The problems in our training data are all generated by Wilson’s algorithm. For our test sets we generate data with several maze generation algorithms (as described in Section 3), including Wilson’s algorithm, to get both in and out of distribution data. Our training data consists of 50k samples, while our test sets each contain 1k.

Unintuitively, as seen in Table 1, the best model on both in- and out-of-distribution test sets turns out to be the model trained on swapped (incorrect) traces! We see that the swapped model has a 0% trace validity – as it has been trained to output well-structured but problem-irrelevant traces in response to every problem – but nevertheless performs noticeably better than both the correct-trace and solution-only baselines. An interesting point to note is the performance difference on out-of-distribution datasets. While most of the performance differences are within a few percentage points, and in-distribution testing results in near identical performance, on the Drunkard dataset the swapped model is roughly 10 times better than the original model, reaching 26% to the correct-trace model’s 2.5%, and on the DFS maze set, it reaches 41.7% to the original model’s 30.8%.

Table 1. Performance of Swap, A* Trace, and Solution-Only Models across maze distributions. "Plan Val." = Plan Validity, "Trace Val." = Trace Validity within Valid Plans
Test Set            | Soln. Only |   Regular A* traces    |   Swapped A* traces
                    | Plan Val.  | Plan Val. | Trace Val. | Plan Val. | Trace Val.
Wilson              |    4.0%    |   50.1%   |   95.2%    |   51.6%   |    0%
Kruskal             |    2.1%    |   49.7%   |   96.2%    |   51.9%   |    0%
DFS                 |    2.8%    |   30.8%   |   82.1%    |   41.7%   |    0%
Drunkard            |    0.0%    |    2.5%   |    4.0%    |   26.0%   |    0%
Searchformer-style  |    0.0%    |    0%     |    0%      |    0.2%   |    0%

If intermediate tokens improve accuracy because they teach the model a given reasoning procedure, then we should expect their influence on performance to fluctuate with how closely they are connected to the problem. However, we find that this is not always the case. In fact, intermediate token sequences that have almost nothing to do with the problem at hand can provide a significantly larger performance boost (and, counterintuitively, may even generalize better) than well-grounded, semantically meaningful execution traces. This throws doubt on the seemingly widespread intuition that the effectiveness of traces stems from allowing the transformer to perform structured, interpretable, algorithmic procedures.

5.1 Intermediate Tokens Don’t Need to be Thoughts

Our results hint that the impact of trace content on performance and the legibility of that content have been somewhat conflated – if all we care about is increasing the accuracy and capability of a model, enforcing human readability may be counterproductive, a lesson also mentioned in the R1 paper [2]. Furthermore, examining traces produced by a model – though they may look right at first glance – is not necessarily informative if those traces are not predictive of the model’s final answers.

Of course, if trace semantics don’t matter, then the question immediately arises: why does generating intermediate tokens increase accuracy at all? We speculate that what is helping is finding the right prompt augmentation. That is, for a given task prompt \( T\) , there exists a prompt augmentation \( PA\) which boosts the LLM’s performance on that task:

\[ \exists\,PA\;\text{s.t.}\; \mathbb{P}\bigl(\mathrm{Sol}\bigl(\mathrm{LLM}(T + PA),\,T\bigr)\bigr) > \mathbb{P}\bigl(\mathrm{Sol}(\mathrm{LLM}(T),\,T)\bigr) \]

Here \( \text{Sol}(y,T)\) indicates that \( y\) solves \( T\) , and \( \text{LLM}(x)\) is the model’s completion for input \( x\) . The central challenge then is to learn the Skolem function

\[ \mathrm{PA} = f_{\theta}(T, \text{LLM}), \]

that maps each task to an effective augmentation. This can be accomplished by modifying the model itself to inherently and automatically augment prompts, as is the case in models that first generate long chains of intermediate tokens before their final answers. Crucially, prompt augmentations have no need to be human‐interpretable. In fact, results in the adversarial prompting literature back this up: effective jailbreaks can be achieved by augmenting prompts with human-uninterpretable strings [52, 53, 54, 55] or by modifying them with random syntactic permutations, capitalizations, and shufflings [31].

6 Conclusion

In this paper, we challenged the prevailing narrative that intermediate tokens or “Chains of Thought” generated by Large Reasoning Models like DeepSeek’s R1 are interpretable, semantically valid sequences with predictable effects on the model’s behavior. Because we do not have access to any frontier LLM’s training data or exact training procedure, and because the traces these models output are in multiply-interpretable natural language without a concrete ground truth, we designed a series of experiments building on previous smaller-model reasoning work – mainly Searchformer and Stream of Search [20, 18]. We constructed an A\( ^*\) trace validator and found only a loose correlation between the correctness of the trace and the correctness of the output plan. We then trained additional models on noisy or irrelevant traces and found that there are (nonsensical) trace formats that nevertheless maintain or even increase the model’s performance, despite being much less informative about, or connected to, the problem at hand. Finally, we argued that, if the goal is to increase model performance, enforcing trace semantics is unnecessary and potentially very misleading. Altogether, our counter-intuitive results demonstrate ways in which common interpretations of Large Reasoning Models may be anthropomorphizations or simplifications.

7 Acknowledgements

This research is supported in part by ONR grant N0001423-1-2409, DARPA grant HR00112520016, and gifts from Qualcomm, J.P. Morgan and Amazon.

Appendix

A Appendix

A.1 Additional experiment details

For all our experiments, we trained Qwen-2.5-0.5B decoder-only models. We used a custom tokenizer with a domain-specific vocabulary, which reduced the model parameter count to around 380M. We optimized with AdamW (\( \beta_1\) =0.9, \( \beta_2\) =0.999) and applied a weight decay of 0.1528, a peak learning rate of 2.2758e-4, and 100 warm-up steps, all under bf16 precision. We train the models for 95k training steps. All randomness was controlled with fixed seeds. The training dataset size is 50k datapoints unless specified otherwise.
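In PyTorch terms, these optimizer settings correspond to something like the sketch below (reusing a model object such as the one configured in Section 4; the linear decay schedule is an assumption, since the appendix only specifies 100 warm-up steps):

    import torch
    from transformers import get_scheduler

    optimizer = torch.optim.AdamW(model.parameters(), lr=2.2758e-4,
                                  betas=(0.9, 0.999), weight_decay=0.1528)
    scheduler = get_scheduler("linear", optimizer=optimizer,
                              num_warmup_steps=100, num_training_steps=95_000)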

A.2 Validating Traces and Solutions of Searchformer Models

Along with our own trained models, we have also evaluated models trained by [18]. These models have an encoder-decoder architecture and are trained on A*-generated traces on 30x30 mazes. The mazes are generated by their random generation method as described in Section 3. We see that across model sizes (from 15M to 175M parameters) there are a significant number of instances where the model produces a correct plan but the trace it outputs is invalid. This is in line with the results of our models and provides further evidence that trace accuracy is not a perfect predictor of plan accuracy.

Figure 6. Plan validity versus trace validity for models trained on correct A* traces on 30x30 mazes, measured across varying model sizes and averaged over five runs (6400 responses per run).

A.3 Training on more data

To check the influence of more data on performance, we trained models on larger datasets containing 500k data points. We trained these models for 200k steps. Our results (in Table 2) show that the swapped model completely outperforms the regular model: on all the datasets, the swapped model achieves significant performance gains over the regular model. Notice that the trace validity for the swapped model is 0% while it achieves 70% solution accuracy. This shows that trace correctness can actually be an albatross for the model’s performance.

Table 2. Performance of Swap, A* Trace, and Solution-Only Models across maze distributions. "Plan Val." = Plan Validity, "Trace Val." = Trace Validity within Valid Plans
Test Set            | Soln. Only |   Regular A* traces    |   Swapped A* traces
                    | Plan Val.  | Plan Val. | Trace Val. | Plan Val. | Trace Val.
Wilson              |   3.30%    |   31.9%   |   79.0%    |   70.3%   |    0%
Kruskal             |   2.10%    |   34.3%   |  80.76%    |   75.8%   |    0%
DFS                 |   2.60%    |   18.1%   |  78.45%    |   52.5%   |    0%
Drunkard            |    0.0%    |    0%     |    0%      |   12.9%   |    0%
Searchformer-style  |    0.0%    |    0%     |    0%      |    0.4%   |    0%

A.4 Training on other maze generation algorithms

We also train models on mazes other than Wilson’s. We specifically choose the Searchformer-style mazes as all the Wilson models perform the worst on this data. We generate 50k unique problems and then train on both correct and swapped traces. Similar to what we have seen with the Wilson models, we see that the swapped model performs better than the regular model on two out-of-distribution datasets (DFS and Drunkard’s walk).

Table 3. Performance of Swap and A* trace models across maze distributions. The training data is generated via Searchformer‐style trace generation. “Plan Val.” = Plan Validity, “Trace Val.” = Trace Validity within valid plans.
Test Set            |   Regular A* traces    |   Swapped A* traces
                    | Plan Val. | Trace Val. | Plan Val. | Trace Val.
Wilson              |   28.5%   |   81.8%    |    6.8%   |   0.0%
Kruskal             |   31.7%   |   77.9%    |    8.6%   |   0.0%
DFS                 |   12.0%   |   66.7%    |   14.3%   |   0.0%
Drunkard’s Walk     |   44.3%   |   87.6%    |   60.9%   |   0.0%
Searchformer‐style  |   47.0%   |   63.4%    |   23.4%   |   0.0%

A.5 Training with a different swap seed

To check whether the particular way the traces were swapped changes any of the trends seen so far, we also train on a dataset in which the traces were swapped using a different random seed from the previously swapped dataset. As seen in Table 4, we find that the new model still outperforms the regular model on the DFS and Drunkard’s walk datasets.

Table 4. Performance of Solution‐Only, A* Trace, original Swapped‐Trace, new seed Swapped-Trace Models across maze distributions. “Plan Val.” = Plan Validity, “Trace Val.” = Trace Validity within valid plans.
Test Set            | Soln. Only |   Regular A* traces    |   Swapped A* traces    | Swapped A* traces (new seed)
                    | Plan Val.  | Plan Val. | Trace Val. | Plan Val. | Trace Val. | Plan Val. | Trace Val.
Wilson              |    4.0%    |   50.1%   |   95.2%    |   51.6%   |   0.0%     |   41.5%   |   0.0%
Kruskal             |    2.1%    |   49.7%   |   96.2%    |   51.9%   |   0.0%     |   42.2%   |   0.0%
DFS                 |    2.8%    |   30.8%   |   82.1%    |   41.7%   |   0.0%     |   35.9%   |   0.0%
Drunkard’s Walk     |    0.0%    |    2.5%   |    4.0%    |   26.0%   |   0.0%     |   31.3%   |   0.0%
Searchformer‐style  |    0.0%    |    0.0%   |    0.0%    |    0.2%   |   0.0%     |    0.1%   |   0.0%

A.6 Solving Sokoban Problems

To test if similar results can be obtained on a different and more complex task with a different transition structure, we repeat our experiments for Sokoban puzzles. Sokoban is a PSPACE-complete [56], grid-based puzzle where an agent must push each box from its start position onto a designated storage dock. At each timestep, the agent may move one cell up, down, left, or right, and can push—but never pull—exactly one adjacent box into the next cell, provided that both the agent’s target cell and the box’s destination cell are free. We encode the entire level (grid layout, agent and box start positions, and dock positions) as a token sequence similar to that of [18]. The model must output a sequence of valid moves that, when executed, places all boxes on their docks; a plan is correct only if every action is executable in the simulated environment and achieves the goal configuration.

Similar to the A* trace generation described in Section 3.2, we modify the A* implementation to output a linearized execution trace. Whenever the algorithm creates a child node and adds it to the open list, it prints create worker x y box a b box c d cA cB, and when it closes a node and adds it to the closed list, it prints close worker x y box a b box c d cA cB. Here, x y denotes the worker location, a b and c d denote the respective box locations, ‘A’ in cA represents the exact cost from the start state to the node (i.e., the \( g(n)\) value), and ‘B’ in cB represents the heuristic estimate from that node to the goal state (i.e., the \( h(n)\) value). Similar to the Searchformer notation [18], we use the prefix "c" to differentiate coordinates from cost estimates. We compute the heuristic value at each node by taking, for every possible assignment of boxes to docks, the sum of the box-to-dock Manhattan distances, and using the minimum over all such assignments as the heuristic value.
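For illustration, the assignment-minimizing Manhattan heuristic described above can be computed as in the following sketch (illustrative code, not the paper’s implementation; with only two boxes the permutation enumeration is trivial):

    from itertools import permutations

    def sokoban_heuristic(boxes, docks):
        # minimum, over all box-to-dock assignments, of the summed Manhattan distances
        best = float("inf")
        for assignment in permutations(docks, len(boxes)):
            cost = sum(abs(bx - dx) + abs(by - dy)
                       for (bx, by), (dx, dy) in zip(boxes, assignment))
            best = min(best, cost)
        return best

    # e.g. sokoban_heuristic([(2, 3), (4, 1)], [(5, 3), (4, 4)]) == 6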

Similar to the validator described in Section 4, we construct an A* validator that reverses this process for Sokoban puzzles – it takes in a linearized trace and attempts to simulate the corresponding open and closed list operations to check if they are valid with respect to the semantics of this implementation.

Training and Test Dataset - We use the same problem generation procedure as [18]. A 7 × 7 grid is sampled and two additional wall cells are added as obstacles to the interior of the map. Then two docks, two boxes, and the worker location are randomly placed. If the sampled task is solvable by A*, it is admitted to the dataset. We generate 50,000 Sokoban puzzles to construct our training dataset. We also generate the corresponding swap dataset for the Sokoban problems.

For the test dataset, we use plan length as a proxy for problem difficulty and generate problems that have a final plan length greater than the mean plan length of training dataset problems.

Table 5. Performance of Swapped, A* Trace, and Solution-Only Models on the Test dataset. "Plan Val." = Plan Validity, "Trace Val." = Trace Validity within Valid Plans
Test Set | Soln. Only |   Regular A* traces    |   Swapped A* traces
         | Plan Val.  | Plan Val. | Trace Val. | Plan Val. | Trace Val.
Sokoban  |   18.1%    |    1.1%   |    0%      |    2.3%   |    0%

Even in the case of Sokoban problems, we see that correct traces do not help the model perform better than swapped (and incorrect) traces (as shown in Table 5). This reinforces our point that trace accuracy and plan accuracy are not semantically correlated.

References

[1] Bowen Baker and Joost Huizinga and Leo Gao and Zehao Dou and Melody Y Guan and Aleksander Madry and Wojciech Zaremba and Jakub Pachocki and David Farhi Monitoring reasoning models for misbehavior and the risks of promoting obfuscation arXiv preprint arXiv:2503.11926 2025

[2] Hengguang Zhou and Xirui Li and Ruochen Wang and Minhao Cheng and Tianyi Zhou and Cho-Jui Hsieh R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model arXiv preprint arXiv:2503.05132 2025

[3] Maxwell Nye and Anders Johan Andreassen and Guy Gur-Ari and Henryk Michalewski and Jacob Austin and David Bieber and David Dohan and Aitor Lewkowycz and Maarten Bosma and David Luan and others Show your work: Scratchpads for intermediate computation with language models 2021

[4] Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Fei Xia and Ed Chi and Quoc V Le and Denny Zhou and others Chain-of-thought prompting elicits reasoning in large language models Advances in neural information processing systems 2022 35 24824–24837

[5] Zhuosheng Zhang and Aston Zhang and Mu Li and Alex Smola Automatic chain of thought prompting in large language models arXiv preprint arXiv:2210.03493 2022

[6] Cheng-Yu Hsieh and Chun-Liang Li and Chih-Kuan Yeh and Hootan Nakhost and Yasuhisa Fujii and Alexander Ratner and Ranjay Krishna and Chen-Yu Lee and Tomas Pfister Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes arXiv preprint arXiv:2305.02301 2023

[7] Yuxian Gu and Li Dong and Furu Wei and Minlie Huang MiniLLM: Knowledge distillation of large language models arXiv preprint arXiv:2306.08543 2023

[8] Daya Guo and Dejian Yang and Haowei Zhang and Junxiao Song and Ruoyu Zhang and Runxin Xu and Qihao Zhu and Shirong Ma and Peiyi Wang and Xiao Bi and others Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning arXiv preprint arXiv:2501.12948 2025

[9] Jacob Pfau and William Merrill and Samuel R Bowman Let's think dot by dot: Hidden computation in transformer language models arXiv preprint arXiv:2404.15758 2024

[10] Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto s1: Simple test-time scaling arXiv preprint arXiv:2501.19393 2025

[11] Dacheng Li and Shiyi Cao and Tyler Griggs and Shu Liu and Xiangxi Mo and Eric Tang and Sumanth Hegde and Kourosh Hakhamaneshi and Shishir G Patil and Matei Zaharia and others LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! arXiv preprint arXiv:2502.07374 2025

[12] Kanishk Gandhi and Ayush Chakravarthy and Anikait Singh and Nathan Lile and Noah D Goodman Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars arXiv preprint arXiv:2503.01307 2025

[13] Shu Yang and Junchao Wu and Xin Chen and Yunze Xiao and Xinyi Yang and Derek F Wong and Di Wang Understanding Aha Moments: from External Observations to Internal Mechanisms arXiv preprint arXiv:2504.02956 2025

[14] Sébastien Bubeck and Varun Chadrasekaran and Ronen Eldan and Johannes Gehrke and Eric Horvitz and Ece Kamar and Peter Lee and Yin Tat Lee and Yuanzhi Li and Scott Lundberg and others Sparks of artificial general intelligence: Early experiments with gpt-4 2023

[15] Yanda Chen and Joe Benton and Ansh Radhakrishnan and Jonathan Uesato and Carson Denison and John Schulman and Arushi Somani and Peter Hase and Misha Wagner and Fabien Roger and others Reasoning Models Don't Always Say What They Think arXiv preprint arXiv:2505.05410 2025

[16] James Chua and Owain Evans Are DeepSeek R1 And Other Reasoning Models More Faithful? ICLR 2025 Workshop on Foundation Models in the Wild

[17] Iván Arcuschin and Jett Janiak and Robert Krzyzanowski and Senthooran Rajamanoharan and Neel Nanda and Arthur Conmy Chain-of-thought reasoning in the wild is not always faithful arXiv preprint arXiv:2503.08679 2025

[18] Lucas Lehnert and Sainbayar Sukhbaatar and DiJia Su and Qinqing Zheng and Paul Mcvay and Michael Rabbat and Yuandong Tian Beyond a*: Better planning with transformers via search dynamics bootstrapping arXiv preprint arXiv:2402.14083 2024

[19] Eric Zelikman and Yuhuai Wu and Jesse Mu and Noah Goodman Star: Bootstrapping reasoning with reasoning Advances in Neural Information Processing Systems 2022 35 15476–15488

[20] Kanishk Gandhi and Denise Lee and Gabriel Grand and Muxin Liu and Winson Cheng and Archit Sharma and Noah D Goodman Stream of search (sos): Learning to search in language arXiv preprint arXiv:2404.03683 2024

[21] DiJia Su and Sainbayar Sukhbaatar and Michael Rabbat and Yuandong Tian and Qinqing Zheng Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces The Thirteenth International Conference on Learning Representations 2024

[22] Juno Kim and Taiji Suzuki Transformers provably solve parity efficiently with chain of thought arXiv preprint arXiv:2410.08633 2024

[23] William Merrill and Ashish Sabharwal The expressive power of transformers with chain of thought arXiv preprint arXiv:2310.07923 2023

[24] Guhao Feng and Bohang Zhang and Yuntian Gu and Haotian Ye and Di He and Liwei Wang Towards revealing the mystery behind chain of thought: a theoretical perspective Advances in Neural Information Processing Systems 2023 36 70757–70798

[26] Edsger W. Dijkstra A Note on Two Problems in Connexion with Graphs Numerische Mathematik 1959 1 1 269–271 10.1007/BF01386390

[27] Peter E Hart and Nils J Nilsson and Bertram Raphael A formal basis for the heuristic determination of minimum cost paths IEEE transactions on Systems Science and Cybernetics 1968 4 2 100–107

[28] Judea Pearl Heuristics: intelligent search strategies for computer problem solving Addison-Wesley 1984

[29] Qwen Team Qwen2.5 Technical Report arXiv preprint arXiv:2412.15115 2024

[30] Takuya Akiba and Shotaro Sano and Toshihiko Yanase and Takeru Ohta and Masanori Koyama Optuna: A next-generation hyperparameter optimization framework Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining 2019 2623–2631

[31] John Hughes and Sara Price and Aengus Lynch and Rylan Schaeffer and Fazl Barez and Sanmi Koyejo and Henry Sleight and Erik Jones and Ethan Perez and Mrinank Sharma Best-of-n jailbreaking arXiv preprint arXiv:2412.03556 2024

[32] Boshi Wang and Xiang Yue and Yu Su and Huan Sun Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization arXiv preprint arXiv:2405.15071 2024

[33] Alethea Power and Yuri Burda and Harri Edwards and Igor Babuschkin and Vedant Misra Grokking: Generalization beyond overfitting on small algorithmic datasets arXiv preprint arXiv:2201.02177 2022

[34] Ziqian Zhong and Ziming Liu and Max Tegmark and Jacob Andreas The clock and the pizza: Two stories in mechanistic explanation of neural networks Advances in neural information processing systems 2023 36 27223–27250

[35] Niklas Nolte and Ouail Kitouni and Adina Williams and Mike Rabbat and Mark Ibrahim Transformers Can Navigate Mazes With Multi-Step Prediction arXiv preprint arXiv:2412.05117 2024

[36] Yongjing Yin and Junran Ding and Kai Song and Yue Zhang Semformer: Transformer language models with semantic planning arXiv preprint arXiv:2409.11143 2024

[37] Mengjiao Sherry Yang and Dale Schuurmans and Pieter Abbeel and Ofir Nachum Chain of thought imitation with procedure cloning Advances in Neural Information Processing Systems 2022 35 36366–36381

[38] Swarnadeep Saha and Archiki Prasad and Justin Chih-Yao Chen and Peter Hase and Elias Stengel-Eskin and Mohit Bansal System-1. x: Learning to balance fast and slow planning with language models arXiv preprint arXiv:2407.14414 2024

[39] Bill Yuchen Lin and Yicheng Fu and Karina Yang and Faeze Brahman and Shiyu Huang and Chandra Bhagavatula and Prithviraj Ammanabrolu and Yejin Choi and Xiang Ren Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks Advances in Neural Information Processing Systems 2023 36 23813–23825

[40] Leyan Pan and Vijay Ganesh and Jacob Abernethy and Chris Esposo and Wenke Lee Can Transformers Reason Logically? A Study in SAT Solving arXiv preprint arXiv:2410.07432 2024

[41] Miles Turpin and Julian Michael and Ethan Perez and Samuel Bowman Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting Advances in Neural Information Processing Systems 2023 36 74952–74965

[42] Tamera Lanham and Anna Chen and Ansh Radhakrishnan and Benoit Steiner and Carson Denison and Danny Hernandez and Dustin Li and Esin Durmus and Evan Hubinger and Jackson Kernion and others Measuring faithfulness in chain-of-thought reasoning arXiv preprint arXiv:2307.13702 2023

[43] Hunter Lightman and Vineet Kosaraju and Yuri Burda and Harrison Edwards and Bowen Baker and Teddy Lee and Jan Leike and John Schulman and Ilya Sutskever and Karl Cobbe Let's verify step by step The Twelfth International Conference on Learning Representations 2023

[44] Nathan Lambert and Jacob Morrison and Valentina Pyatkin and Shengyi Huang and Hamish Ivison and Faeze Brahman and Lester James V Miranda and Alisa Liu and Nouha Dziri and Shane Lyu and others Tülu 3: Pushing frontiers in open language model post-training arXiv preprint arXiv:2411.15124 2024

[45] Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng Free process rewards without process labels arXiv preprint arXiv:2412.01981 2024

[46] Hao Sun Reinforcement learning in the era of llms: What is essential? what is needed? an rl perspective on rlhf, prompting, and beyond arXiv preprint arXiv:2310.06147 2023

[47] Daman Arora and Andrea Zanette Training language models to reason efficiently 2025 URL https://arxiv.org/abs/2502.04463

[48] Qiying Yu and Zheng Zhang and Ruofei Zhu and Yufeng Yuan and Xiaochen Zuo and Yu Yue and Tiantian Fan and Gaohong Liu and Lingjun Liu and Xin Liu and others Dapo: An open-source llm reinforcement learning system at scale arXiv preprint arXiv:2503.14476 2025

[49] David Bruce Wilson Generating random spanning trees more quickly than the cover time Proceedings of the twenty-eighth annual ACM symposium on Theory of computing 1996 296–303

[50] Joseph B Kruskal On the shortest spanning subtree of a graph and the traveling salesman problem Proceedings of the American Mathematical society 1956 7 1 48–50

[51] Robert Tarjan Depth-first search and linear graph algorithms SIAM journal on computing 1972 1 2 146–160

[52] Andy Zou and Zifan Wang and Nicholas Carlini and Milad Nasr and J. Zico Kolter and Matt Fredrikson Universal and Transferable Adversarial Attacks on Aligned Language Models 2023

[53] Valeriia Cherepanova and James Zou Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs 2024

[54] Yue Liu and Xiaoxin He and Miao Xiong and Jinlan Fu and Shumin Deng and Bryan Hooi FlipAttack: Jailbreak LLMs via Flipping OpenReview pre-print, submitted to ICLR 2025 2024

[55] William Hackett and Lewis Birch and Stefan Trawicki and Neeraj Suri and Peter Garraghan Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails 2025

[56] Joseph Culberson Sokoban is PSPACE-complete 1997