Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. In this paper, we critically examine that interpretation by investigating how the semantics of intermediate tokens—often anthropomorphized as “thoughts” or reasoning traces and which are claimed to display behaviors like backtracking, self-verification, and meta-cognition—actually influence model performance. We train transformer models on formally verifiable reasoning traces and solutions, constraining both intermediate steps and final outputs to align with those of a formal solver. By constructing a formal interpreter of the semantics of our problems and intended algorithm, we systematically evaluate not only solution accuracy but also the correctness of intermediate traces, thus allowing us to evaluate whether the latter causally influences the former. Our experiments involve training transformer models on traces and solutions generated by A* search. We notice that, despite significant improvements over the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions. To further show that trace accuracy is only loosely connected to solution accuracy, we then train models on noisy, corrupted traces which have no relation to the specific problem each is paired with, and find that not only does performance remain largely consistent with models trained on correct data, but in some cases it can improve upon it and generalize more robustly on out-of-distribution tasks. These results challenge the assumption that intermediate tokens or “Chains of Thought” reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct forms) as evidence of human-like or algorithmic behaviors in language models.
Recent advances in general planning and problem solving have been spearheaded by so-called “Long Chain-of-Thought” models, most notably DeepSeek’s R1 [8]. These transformer-based large language models are further post-trained using iterative fine-tuning and reinforcement learning methods. Following the now-standard teacher-forced pre-training, instruction fine-tuning, and preference alignment stages, they undergo additional training on reasoning tasks: at each step, the model is presented with a question; it generates a sequence of intermediate tokens (colloquially, or perhaps fancifully, called a “Chain of Thought” or “reasoning trace”); and it ends with a specially delimited answer sequence. After verification of this answer sequence by a formal system, the model’s parameters are updated so that it is more likely to output sequences that end in correct answers and less likely to output those that end in incorrect answers.
While (typically) no optimization pressure is applied to the intermediate tokens [1, 2], empirically it has been observed that language models perform better on many domains if they output such tokens first [3, 4, 5, 6, 7, 8, 9, 10, 11]. While the fact of the performance increase is well-known, the reasons for it are less clear. Previous work has often framed it in anthropomorphic terms, claiming that these models are “thinking” before outputting their answers [3, 12, 8, 13, 2, 14]. Simultaneously, the process of performing more auto-regressive forward passes before outputting the final answer has been credited as an instance of inference-time scaling – that is, these models are assumed to be doing problem-adaptive computation.
Famously, DeepSeek’s R1 paper claimed that one of the most impressive observed behaviors of their trained models was the so-called “aha” moment: as part of the chain of thought it was producing in order to answer some question, the model output the token “aha”, seeming to indicate that it had come upon a sudden realization. While a human may say “aha” to indicate exactly such a sudden internal state change, this interpretation is unwarranted for models which do not have any such internal state, and which on the next forward pass will only differ from the pre-aha pass by the inclusion of that single token in their context. Interpreting this token as meaningful in this way requires an additional assumption that has thus far been brushed aside in discussions of how long-CoT models function and what they do – that the derivational traces they produce are semantically meaningful in the same way that the traces they were trained on were, or at least in the way that a human might expect them to be.
For R1 and similar large models, this is nearly impossible to check. The intermediate tokens that massive pre-trained and post-RL’d models produce meander for dozens of pages, are written wholly in ambiguous and polysemantic natural language, and – perhaps much worse – are the result of long, opaque training processes on data that we have no access to and cannot compare against.
In this paper, we shed some light on the question of whether intermediate traces are semantically meaningful. Following previous work that elucidated important functional aspects of large-scale models through controlled small-scale experiments [32, 33, 34], and working within a sort of “model organism” paradigm, we focus the current work on fully controlled, open, and replicable models trained from scratch. Our models are trained on a simple and well-understood shortest-path planning problem for randomly generated mazes, with our training runs including varying kinds of intermediate traces – from none, to ones generated by the classic A\( ^*\) algorithm, to noisy and irrelevant ones. This setup is not only well-understood as a classical computer science problem, but has also grown to be a well-studied domain for trace-augmented transformer training [18, 21, 35, 36].
We approach the problem of understanding intermediate token semantics from three major novel angles, performing empirical evaluations on models we train on small planning tasks. First, we construct a validator for A\( ^*\) execution traces and use it to validate and compare trace accuracy to solution accuracy, finding only a loose correlation between the two. Then, we train half-billion-parameter Qwen models on no traces, on correct traces, and on deliberately irrelevant traces. We present a dataset manipulation that – despite the fact that it removes all problem-specific semantics – leads to trained models that perform better on both in- and out-of-distribution tasks. We argue that, if performance is the goal, assuming human-like or algorithm-interpretable trace semantics is not only unnecessary but potentially misleading.
Training Transformer models from scratch to plan using derivational traces - There have been various approaches to training transformer models on algorithmic execution traces. Searchformer trained transformer models on A* search execution traces to emulate the A* algorithm for solving pathfinding tasks [18]. Similarly, Stream-of-Search trained transformers to internalize the search processes of breadth-first search (BFS) and depth-first search (DFS) within a linguistic representation for solving the Countdown arithmetic game [20]. Yang et al. [37] trained transformer models on search trajectories to mimic search strategies such as Monte Carlo Tree Search or BFS. In concurrent research, System 1.x introduced a dual-model approach coordinated by an external meta-controller, where one model quickly generates solutions without explicit reasoning traces while the other operates more deliberately, providing step-by-step explanations [38]. Similarly, SwiftSage employed multiple models within an agent-style workflow to differentiate between fast and slow modes of reasoning [39]. In these works, the System 2 (slow reasoning) models are transformer models trained on execution traces of formal procedures. Pan et al. [40] trained transformer models to solve Boolean SAT problems and also measured trace correctness with respect to the abstract Davis–Putnam–Logemann–Loveland (DPLL) procedure on which the models were trained. However, their work focused mainly on showing that decoder-only transformers with CoT can solve Boolean SAT problems, whereas our work provides a much deeper analysis of trace-plan correlation and semantic correctness.
Trace evaluation - Previous studies have examined whether intermediate derivational steps correspond meaningfully to the final answers generated by Large Language Models (LLMs) and Large Reasoning Models (LRMs), finding notable discrepancies. Specifically, prior research has demonstrated that intermediate CoT steps, despite appearing coherent, do not reliably reflect the actual computations used by models to reach their conclusions [41, 42]. Even SOTA reasoning models, whose performance gains are typically attributed to explicit intermediate "think" tokens, have been shown to produce intermediate reasoning chains that fail to meaningfully correspond with the underlying computations leading to their final outputs [15, 16, 17, 1].
Training on Noisy traces - Li et al. [11] investigated the impact of perturbations in reasoning traces by distilling DeepSeek R1’s and QwQ-32B-Preview’s derivational outputs on math and coding tasks. Their findings reveal that models remain robust to noise in the trace, showing improved performance relative to the base model even when trained on derivations containing incorrect mathematical operations. However, as previously noted, systematically evaluating the semantic correctness of natural-language derivational traces remains infeasible, so no definitive conclusions can be drawn regarding the semantic alignment between reasoning traces and final answers. Nonetheless, this work does seem to indicate that there is no strong causal connection between trace correctness and solution correctness.
Dualformer, an extension of Searchformer, trained transformer models on truncated A* derivational traces by arbitrarily removing steps from the original A* search process [21]. This pruning renders the traces semantically invalid, as they no longer reflect any faithful execution of the A* algorithm. Despite this, models trained on these shortened traces outperformed those trained on full A* traces used in Searchformer. These findings further support the notion that the semantic correctness of derivational traces with respect to ground-truth algorithms like A* may not be causally linked to the correctness of the final output plan.
Post-training methods - Numerous approaches have shown that supervised fine-tuning (SFT) over derivational traces and RL methods improve the task performance of LLMs on planning and reasoning tasks [19, 11, 10, 43, 44, 45, 8, 46]. Among these, one of the first approaches to show impressive results was STaR, a method in which the LLM is prompted to generate multiple responses to a problem together with their intermediate CoTs, and the responses are then filtered based on whether the final answer is correct. The LLM is further fine-tuned on examples where the model generated the correct final answer [19]. This approach has been shown to significantly outperform direct answer-prediction fine-tuning. Recently, since the release of DeepSeek’s R1, which is post-trained using GRPO [8], two major types of post-training methods have emerged: 1. fine-tuning LLMs on the derivational traces of LRMs, mainly R1, to enable CoT generation in smaller LLMs (model distillation), and 2. using various RL algorithms, mainly different versions of GRPO, to improve task performance [47, 48]. In none of these approaches is there any semantic evaluation of the derivational traces generated by the LLMs or LRMs. In the case of iterative SFT, model responses are filtered based on the correctness of the final answers, and smaller models are trained to mimic these long derivational traces to "elicit" reasoning and procedure following. Similarly, for RL-based approaches, reward is determined by sound verifiers based only on the final answers. Since there are no process reward models, there is no local evaluation of the correctness of the produced intermediate tokens.
Though recently popularized by Large Reasoning Models, especially DeepSeek’s R1 [8], training on intermediate traces in order to improve transformer performance dates back to at least GPT-2 [3] and has been extended and analyzed from many angles [18, 19, 20, 21, 22, 23, 24]. While these papers demonstrated ways to improve final-answer accuracy, they neither evaluate trace accuracy nor explicitly attempt to train on incorrect or irrelevant traces.
Thus, while they do show that accuracy increases, they leave open the question of whether that accuracy increase actually stems from the additional semantic information in the trace. In many cases, especially with pre-trained models, it is nearly impossible to formally verify reasoning traces, due to the ambiguity of natural language and the lack of a clear ground truth. However, for small, well-scoped formal domains like the gridworld path planning domain used in [18, 21] and this paper, and with models carefully trained from scratch on those domains, we have the ability to check whether generated traces follow the exact semantics enforced in the training data and causally predict the final solutions that the model outputs.
We consider a standard grid-based path-finding domain. The task is to find a legal path between a given start cell and goal cell in a \( 30\times30\) grid. Every cell of this grid is either free (traversable) or a wall (impassable). The agent begins at the start state, and at every state may take one of four actions: go up, go down, go left, or go right. The transformer is given a full description of this problem (in token format – we follow the formulation used by [18] and [21]) and must output as its final answer a plan, which consists of a sequence of actions. A plan is considered correct if it is executable – that is, every action it presents moves the agent from a free cell to an adjacent free cell – and its final action results in the agent being at the goal cell.
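To make this executability criterion concrete, the following is a minimal sketch of such a check in Python; the set-of-free-cells maze encoding and the function name are illustrative assumptions, not the token format actually fed to the model.

```python
# Minimal sketch of the plan-validity check described above (illustrative encoding:
# the maze is a set of free (row, col) cells; walls are simply absent from the set).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def plan_is_valid(free_cells, start, goal, plan):
    """Return True iff every action moves between adjacent free cells and the
    final action leaves the agent at the goal cell."""
    cur = start
    if cur not in free_cells:
        return False
    for action in plan:
        dr, dc = MOVES[action]
        cur = (cur[0] + dr, cur[1] + dc)
        if cur not in free_cells:   # stepping into a wall or off the maze
            return False
    return cur == goal
```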
We generate navigation problems using diverse generation algorithms, resulting in varied structural patterns and exploration dynamics. This enables systematic out-of-distribution (OOD) evaluation by testing models on maze types unseen during training (all training is done on mazes generated with Wilson’s algorithm). These generation algorithms can be sorted into two major categories: 1) algorithms that do not permit cycles and sample over the spanning trees of the \( 30\times30\) grid, and 2) algorithms that permit loops and create noisy, less-structured dungeon- or cave-like instances. For all algorithms except SearchFormer’s, which has its own start and goal generation loop, we sample a legal (start, goal) pair after maze generation.
Acyclic Maze Generation
Cave Generation
A* is a classic best-first graph–search procedure that combines the uniform-cost guarantee of Dijkstra’s algorithm [26] with domain-specific heuristics to focus exploration on promising states, originally introduced to compute minimum-cost paths in state-space graphs [27].
The algorithm maintains an open list (a priority queue) keyed by \( f(n) = g(n) + h(n)\), where \( g(n)\) is the exact cost from the start and \( h(n)\) is a heuristic estimate to the goal, and also maintains a closed list of already visited nodes. It repeatedly pops the open list node with the smallest \( f\); if this is the goal, it reconstructs the path that led to this node, and this path is returned as the final plan. Otherwise, it generates child nodes (in our case, traversable neighbor cells) and calculates their \( g\) and \( f\) values. For each child, it either inserts it into the open list or – if the node is already in the list – updates its \( g\) value if the new value is lower. The popped node is added to the closed list to prevent re-expansion.
The effectiveness of A* is dependent on the heuristic it is implemented with. For solvable graph search problems like the ones featured in this paper, any consistent (\( h(n) \leq c(n, n') + h(n')\) for all neighboring \( n'\) ) heuristic will guarantee that the plan returned is not only satisficing but optimal [28].
For the maze path planning problems we examine in this paper, we use the standard Manhattan heuristic \( h(n) = |x_n - x_g| + |y_n - y_g|\), which computes the sum of horizontal and vertical displacements between a cell and the goal. On a 2-D grid with only orthogonal, unit-cost movement, this heuristic is consistent, ensuring that A* returns an optimal path.
Finally, following Searchformer and Stream of Search, we modify the A* implementation to output a linearized execution trace [20, 18]. That is, whenever the algorithm creates a child node and adds it to the open list, it prints create x y cA cB, and when it closes a node and adds it to the closed list, it prints close x y cA cB. Here, ‘A’ in cA represents the exact cost from the start state to the node (i.e., the \( g(n)\) value) and ‘B’ in cB represents the heuristic estimate from that node to the goal state (i.e., the \( h(n)\) value). Similar to the Searchformer notation [18], we use the prefix "c" to differentiate between a node’s coordinates and its cost estimates. In the next section, we construct an A* validator that reverses this process – it takes in a linearized trace and attempts to simulate the corresponding open and closed list operations to check whether they are valid with respect to the semantics of this implementation.
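Putting the pieces of this section together, the sketch below shows A* with the Manhattan heuristic emitting create/close lines in this format. It is an illustrative simplification under the encoding assumptions above (free cells as a set, plans carried as action lists), not the exact implementation used to generate our training data.

```python
import heapq

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def manhattan(cell, goal):
    return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

def astar_with_trace(free_cells, start, goal):
    """A* over free grid cells that also emits a linearized
    'create x y cA cB' / 'close x y cA cB' trace as described above.
    Carrying the action list with each node replaces the usual
    parent-pointer path reconstruction for brevity."""
    trace = [f"create {start[0]} {start[1]} c0 c{manhattan(start, goal)}"]
    open_heap = [(manhattan(start, goal), 0, start, [])]   # (f, g, cell, actions)
    best_g = {start: 0}
    closed = set()
    while open_heap:
        f, g, cell, actions = heapq.heappop(open_heap)
        if cell in closed:
            continue                       # stale entry superseded by a cheaper one
        closed.add(cell)
        trace.append(f"close {cell[0]} {cell[1]} c{g} c{manhattan(cell, goal)}")
        if cell == goal:
            return actions, trace          # plan found: a sequence of up/down/left/right
        for action, (dr, dc) in MOVES.items():
            child = (cell[0] + dr, cell[1] + dc)
            if child not in free_cells or child in closed:
                continue
            child_g = g + 1
            if child_g < best_g.get(child, float("inf")):
                best_g[child] = child_g
                h = manhattan(child, goal)
                trace.append(f"create {child[0]} {child[1]} c{child_g} c{h}")
                heapq.heappush(open_heap, (child_g + h, child_g, child, actions + [action]))
    return None, trace                     # no path (does not occur for our solvable mazes)
```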
While previous work evaluated the final accuracy of trace-trained models, it did not evaluate the traces themselves. For large, production ready RL-post-trained models like DeepSeek’s R1, this is practically impossible. For even a simple query, the model produces dozens of pages of convoluted and meandering output before arriving at an answer, and this output is all in natural language, which makes it very easy to read multiple equally valid interpretations into it.
To truly tell whether the traces the model was trained on helped in the expected way, we need a formal way of validating their correctness. By training models on traces produced by a well-known algorithm with well-known semantics, it is possible to check whether the model’s emulations of the algorithm’s execution trace are correct.
We construct a formal verifier for A\( ^*\) traces. The format for these traces follows [18], and is described in more detail in Section 3. Essentially, our validator consumes the generated trace and simulates the operations proposed in that trace on open and closed lists. It runs through the generated trace sequentially, parsing each action x y cA cB sequence as an operation and using it to update its open and closed lists. It marks a trace as valid if it can correctly execute this procedure until it closes the goal node. Errors in execution can be one of the following:
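Independent of which of these errors occurs, the core of the validator is a step-by-step replay of the claimed open- and closed-list operations against the problem definition. The sketch below illustrates this replay under the same encoding assumptions as the earlier sketches; it simplifies some of the checks and is not the exact validator used in our experiments.

```python
import heapq

def validate_trace(free_cells, goal, trace_lines):
    """Replay a linearized A* trace against its own open and closed lists.
    Returns True iff the replay is consistent and ends by closing the goal."""
    open_heap, best_g, closed = [], {}, set()
    for line in trace_lines:
        op, x, y, ca, cb = line.split()
        cell, g, h = (int(x), int(y)), int(ca[1:]), int(cb[1:])
        if op == "create":
            if cell not in free_cells or cell in closed:
                return False                    # creating a wall cell or an already-closed node
            if cell in best_g and g >= best_g[cell]:
                return False                    # re-creation must strictly improve g
            best_g[cell] = g
            heapq.heappush(open_heap, (g + h, cell))
        elif op == "close":
            if cell not in best_g or best_g[cell] != g:
                return False                    # closing a node that is not on the open list
            while open_heap and open_heap[0][1] in closed:
                heapq.heappop(open_heap)        # drop entries left over from earlier closes
            if not open_heap or open_heap[0][0] < g + h:
                return False                    # a cheaper open node should have been closed first
            closed.add(cell)
            del best_g[cell]
            if cell == goal:
                return True                     # goal closed: trace is valid
        else:
            return False                        # malformed operation
    return False                                # trace ended without closing the goal
```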
With this verifier in hand, we can now distinguish between plan validity and trace validity for models trained on this kind of dataset. To construct our training sets, we generate 50,000 mazes using Wilson’s algorithm, and randomly select a start and goal cell. Then, we use A* with the Manhattan distance heuristic to find an optimal plan for each maze as well as to produce a trace that is saved with each datapoint.
We modify the architecture of the Qwen2.5 0.5B [29] to support a vocabulary of exactly 944 different tokens (which reduces the parameter count to about 380 million from 500 million), randomly initialize the model, and then train it for 85,000 training steps with a batch size of 8 on two NVIDIA H100s. The model has a context length of 32,000 tokens to support the long lengths of intermediate token generation. (Our other experiments later in the paper also use this architecture, but train on different datasets, from solution-only through to irrelevant and noisy traces. All code and data will be made public.)
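As an illustration of this setup using the Hugging Face transformers library (a sketch; every value other than the 944-token vocabulary and the context length reported above is either the library default or an assumption):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Start from the published Qwen2.5-0.5B architecture, shrink the vocabulary to the
# 944 domain-specific tokens, and instantiate the weights from scratch (random init).
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")
config.vocab_size = 944                  # custom maze/trace tokenizer
config.max_position_embeddings = 32_000  # long contexts for the intermediate tokens

model = AutoModelForCausalLM.from_config(config)          # no pre-trained weights loaded
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # lands near the ~380M noted above
```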
We test this model, trained on Wilson mazes, on a thousand instances each of mazes generated by the Wilson, Kruskal, DFS, SF-style, and Drunkard approaches, evaluating solution accuracy as well as trace validity. We present these results as confusion matrices in Figure 5, with each domain represented by a separate matrix. These results break down the correlation between plan accuracy and trace validity. As can be seen from the diagonal entries, trace accuracy is not a perfect predictor of plan accuracy: the model can produce valid traces and then continue on to produce an incorrect plan, or produce invalid traces and yet end up at a correct plan.
If plan and trace validity are only loosely connected for models trained on the A* trace dataset, then perhaps the validity of the trace isn’t as important to the performance increase as previously believed. To test this empirically, we construct a second training dataset called Swap, which we build by randomly permuting reasoning traces between problems.
This dataset consists of the exact same problems as the original 50,000-problem trace dataset, but problem 1’s trace will be given to, say, problem 4; problem 4’s will be given to problem 7; and so forth. In other words, while the traces continue to have the right form and some generic domain information, they no longer have any connection to the specific problems they are associated with. Training examples consist of a start and goal state, a maze definition, an A\( ^*\) trace from searching for the shortest path across a totally unrelated maze with a different start and goal state, and the correct solution plan for the original maze.
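Concretely, the swap amounts to permuting the trace field across the dataset while leaving prompts and plans in place; a sketch follows (the dictionary keys are illustrative assumptions, not our actual data schema):

```python
import random

def build_swap_dataset(dataset, seed=0):
    """Return a copy of `dataset` (a list of dicts with keys 'prompt', 'trace',
    and 'plan') in which every problem keeps its own prompt and optimal plan
    but receives the A* trace of a randomly chosen other problem."""
    rng = random.Random(seed)
    donor = list(range(len(dataset)))
    rng.shuffle(donor)  # a plain random permutation; with 50k problems, the handful
                        # of coincidental self-matches is negligible
    return [
        {"prompt": ex["prompt"], "trace": dataset[j]["trace"], "plan": ex["plan"]}
        for ex, j in zip(dataset, donor)
    ]
```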
What we find is that the model trained on these swapped traces not only maintains performance on the in-distribution test set, but also generalizes better to the other maze distributions we test – all despite the lack of algorithmically valid semantics in the traces it was trained on and the traces it generates.
For these experiments, we continue to use the same model architecture described in the previous section, varying the datasets we train on to see how they affect performance – even as they further corrupt or completely destroy the correctness of the traces. For best results, we performed hyperparameter optimization (via Optuna [30]). We provide additional details on hyperparameters and initializations in the Appendix, and we will publicly open-source our codebase for full transparency.
The most basic training run is the standard solution-only baseline, where the model is trained on just solutions, without any derivational traces. The next baseline, following previous work [18, 21, 20], is training the model with A*-generated traces, teacher-forcing during training to make it output intermediate tokens before the final solution. These are the models discussed in the previous section. Finally, we use the same underlying training data and distribution, but modify it by corrupting the traces. Our trace corruption process is very simple: we randomly permute which problem is associated with which trace – so, for example, the third problem might have the fifth problem’s trace, which is an execution trace of A* on an unrelated maze with unrelated start and goal states.
The problems in our training data are all generated by Wilson’s algorithm. For our test sets we generate data with several maze generation algorithms (as described in Section 3), including Wilson’s algorithm, to get both in and out of distribution data. Our training data consists of 50k samples, while our test sets each contain 1k.
Unintuitively, as seen in Table 2, the best model on both in- and out-of-distribution test sets turns out to be the model trained on swapped (incorrect) traces! We see that the swapped model has 0% trace validity – as it has been trained to output well-structured but problem-irrelevant traces in response to every problem – but nevertheless performs noticeably better than both the correct-trace and solution-only baselines. An interesting point to note is the performance difference on out-of-distribution datasets. While most of the performance differences are within a few percentage points, and in-distribution testing results in near-identical performance, on the Drunkard dataset the swapped model is 10 times better than the original model, giving 26% to the correct-trace model’s 2.6%, and on the DFS maze set it reaches 41.7% to the original model’s 30.8%.
| Test Set | Soln. Only: Plan Val. | Regular A* traces: Plan Val. | Regular A* traces: Trace Val. | Swapped A* traces: Plan Val. | Swapped A* traces: Trace Val. |
| --- | --- | --- | --- | --- | --- |
| Wilson | 4.0% | 50.1% | 95.2% | 51.6% | 0% |
| Kruskal | 2.1% | 49.7% | 96.2% | 51.9% | 0% |
| DFS | 2.8% | 30.8% | 82.1% | 41.7% | 0% |
| Drunkard | 0.0% | 2.5% | 4.0% | 26.0% | 0% |
| Searchformer-style | 0.0% | 0% | 0% | 0.2% | 0% |
If intermediate tokens improve accuracy because they teach the model a given reasoning procedure, then we should expect their influence on performance to fluctuate exactly with their connection to the problem. However, we find that this is not always the case – in fact, intermediate token sequences that have almost nothing to do with the problem at hand can provide a significantly higher performance boost (and, counterintuitively, might even generalize better) than well-grounded, semantically meaningful execution traces, thus throwing doubt on the seemingly widespread intuition that the effectiveness of traces stems from allowing the transformer to perform structured, interpretable, and algorithmic procedures.
Our results hint that the impact of trace content on performance and the legibility of that content have been somewhat conflated – if all we care about is increasing the accuracy and capability of a model, enforcing human readability may be counterproductive, a lesson also mentioned in the R1 paper [8]. Furthermore, examining traces produced by a model – though they may look right at first glance – is not necessarily informative if those traces are not predictive of the model’s final answers.
Of course, if trace semantics don’t matter, then the question immediately arises: why does generating intermediate tokens increase accuracy at all? We speculate that what is helping is finding the right prompt augmentation. That is, for a given task prompt \( T\) , there exists a prompt augmentation \( PA\) which boosts the LLM’s performance on that task:
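In symbols, one minimal formalization of this claim, assuming the augmentation is simply concatenated with the task prompt (written \( \oplus\) here):

\[ \exists\, PA \ \text{ such that } \ \text{Sol}\bigl(\text{LLM}(T \oplus PA),\, T\bigr) \]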
Here \( \text{Sol}(y,T)\) indicates that \( y\) solves \( T\), and \( \text{LLM}(x)\) is the model’s completion for input \( x\). The central challenge, then, is to learn the Skolem function that maps each task to an effective augmentation. This can be accomplished by modifying the model itself to inherently and automatically augment prompts, as is the case in models that first generate long chains of intermediate tokens before their final answers. Crucially, prompt augmentations need not be human-interpretable. In fact, the adversarial prompting literature backs this up: effective jailbreaks can be achieved by augmenting prompts with human-uninterpretable strings [52, 53, 54, 55] or by modifying prompts with random syntactic permutations, capitalizations, and shufflings [31].
In this paper, we challenged the prevailing narrative that intermediate tokens or “Chains of Thought” generated by Large Reasoning Models like DeepSeek’s R1 are interpretable, semantically valid sequences with predictable effects on the model’s behavior. As we don’t have access to any frontier LLM’s training data or even its exact training procedure, and since the traces these models output are in multiply-interpretable natural language without a concrete ground truth, we designed a series of experiments building on previous smaller-model reasoning work – mainly Searchformer and Stream of Search [20, 18] – and constructed an A\( ^*\) trace validator, finding that there is only a loose correlation between the correctness of the trace and the correctness of the output plan. We then trained additional models on noisy or irrelevant traces and found that there are (nonsensical) trace formats that nevertheless maintain or even increase the model’s performance – all despite being much less informative about, or connected to, the problem at hand. Finally, we argued that, if the goal is to increase model performance, enforcing trace semantics is unnecessary and potentially very misleading. Altogether, our counter-intuitive results demonstrate ways in which common interpretations of Large Reasoning Models may be anthropomorphizations or simplifications.
This research is supported in part by ONR grant N0001423-1-2409, DARPA grant HR00112520016, and gifts from Qualcomm, J.P. Morgan and Amazon.
For all our experiments, we trained Qwen2.5-0.5B decoder-only models. We used a custom tokenizer with a domain-specific vocabulary, which reduced the model parameter count to around 380M. We optimized with AdamW (\( \beta_1\) =0.9, \( \beta_2\) =0.999) and applied a weight decay of 0.1528, a peak learning rate of 2.2758e-4, and 100 warm-up steps, all under bf16 precision. We train the models for 95k training steps. All randomness was controlled with fixed seeds. The training dataset size is 50k datapoints unless specified otherwise.
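For reference, the optimizer settings above correspond to the following PyTorch sketch; the placeholder module and the shape of the warm-up schedule beyond its 100 steps are assumptions of this sketch.

```python
import torch

# `model` stands in for the 380M-parameter Qwen2.5-style model from the main text;
# a tiny placeholder module is used here so the snippet runs standalone.
model = torch.nn.Linear(16, 16)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2.2758e-4,          # peak learning rate
    betas=(0.9, 0.999),
    weight_decay=0.1528,
)
# 100 linear warm-up steps; the decay schedule after warm-up is not specified above.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3, total_iters=100)
```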
Along with our own trained models, we have also evaluated models trained by [18]. These models have an encoder-decoder architecture and are trained on A*-generated traces on 30x30 mazes. The mazes are generated by their random generation method as described in Section 3. We see that across model sizes (from 15M to 175M parameters) there are a significant number of instances where the model produces a correct plan but the trace it outputs is invalid. This is in line with the results of our models and provides further evidence that trace accuracy is not a perfect predictor of plan accuracy.
To check the influence of more data on performance, we trained models on larger datasets containing 500k data points. We trained these models for 200k steps. Our results (in Table 2) show that the swapped model completely outperforms the regular model: on all the datasets, the swapped model achieves significant performance gains over the regular model. Notice that the trace validity for the swapped model is 0% while it achieves 70% solution accuracy. This shows that trace correctness can actually be an albatross for the model’s performance.
| Test Set | Soln. Only: Plan Val. | Regular A* traces: Plan Val. | Regular A* traces: Trace Val. | Swapped A* traces: Plan Val. | Swapped A* traces: Trace Val. |
| --- | --- | --- | --- | --- | --- |
| Wilson | 3.30% | 31.9% | 79.0% | 70.3% | 0% |
| Kruskal | 2.10% | 34.3% | 80.76% | 75.8% | 0% |
| DFS | 2.60% | 18.1% | 78.45% | 52.5% | 0% |
| Drunkard | 0.0% | 0% | 0% | 12.9% | 0% |
| Searchformer-style | 0.0% | 0% | 0% | 0.4% | 0% |
We also train models on mazes other than Wilson’s. We specifically choose the Searchformer-style mazes as all the Wilson models perform the worst on this data. We generate 50k unique problems and then train on both correct and swapped traces. Similar to what we have seen with the Wilson models, we see that the swapped model performs better than the regular model on two out-of-distribution datasets (DFS and Drunkard’s walk).
| Test Set | Regular A* traces: Plan Val. | Regular A* traces: Trace Val. | Swapped A* traces: Plan Val. | Swapped A* traces: Trace Val. |
| --- | --- | --- | --- | --- |
| Wilson | 28.5% | 81.8% | 6.8% | 0.0% |
| Kruskal | 31.7% | 77.9% | 8.6% | 0.0% |
| DFS | 12.0% | 66.7% | 14.3% | 0.0% |
| Drunkard’s Walk | 44.3% | 87.6% | 60.9% | 0.0% |
| Searchformer‐style | 47.0% | 63.4% | 23.4% | 0.0% |
To check whether the way the traces have been swapped changes any of the previously observed performance trends, we also train on a dataset where the traces were swapped using a different random seed from the previously swapped dataset. As seen in Table 4, we find that the new model still outperforms the regular model on the DFS and Drunkard’s walk datasets.
| Test Set | Soln. Only: Plan Val. | Regular A* traces: Plan Val. | Regular A* traces: Trace Val. | Swapped A* traces: Plan Val. | Swapped A* traces: Trace Val. | Swapped A* traces (new seed): Plan Val. | Swapped A* traces (new seed): Trace Val. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wilson | 4.0% | 50.1% | 95.2% | 51.6% | 0.0% | 41.5% | 0.0% |
| Kruskal | 2.1% | 49.7% | 96.2% | 51.9% | 0.0% | 42.2% | 0.0% |
| DFS | 2.8% | 30.8% | 82.1% | 41.7% | 0.0% | 35.9% | 0.0% |
| Drunkard’s Walk | 0.0% | 2.5% | 4.0% | 26.0% | 0.0% | 31.3% | 0.0% |
| Searchformer‐style | 0.0% | 0.0% | 0.0% | 0.2% | 0.0% | 0.1% | 0.0% |
To test whether similar results can be obtained on a different and more complex task with a different transition structure, we repeat our experiments on Sokoban puzzles. Sokoban is a PSPACE-complete [56], grid-based puzzle in which an agent must push each box from its start position onto a designated storage dock. At each timestep, the agent may move one cell up, down, left, or right, and can push – but never pull – exactly one adjacent box into the next cell, provided that both the agent’s target cell and the box’s destination cell are free. We encode the entire level (grid layout, agent and box start positions, and dock positions) as a token sequence similar to that of [18]. The model must output a sequence of valid moves that, when executed, places all boxes on their docks; a plan is correct only if every action is executable in the simulated environment and achieves the goal configuration.
Similar to the A* trace generation described in Section 3.2, we modify the A* implementation to output a linearized execution trace. Whenever the algorithm creates a child node and adds it to the open list, it prints create worker x y box a b box c d cA cB, and when it closes a node and adds it to the closed list, it prints close worker x y box a b box c d cA cB. Here, x y denotes the worker location, a b and c d denote the respective box locations, ‘A’ in cA represents the exact cost from the start state to the node (i.e., the \( g(n)\) value), and ‘B’ in cB represents the heuristic estimate from that node to the goal state (i.e., the \( h(n)\) value). Similar to the Searchformer notation [18], we use the prefix "c" to differentiate between coordinates and cost estimates. We compute the heuristic value at each node by computing, for every possible one-to-one assignment of boxes to docks, the sum of the box-to-dock Manhattan distances, and taking the minimum over all such assignments.
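For the two-box levels used here, this assignment-minimizing heuristic can be computed by brute-force enumeration, as in the sketch below; with many boxes one would instead solve the assignment problem (e.g., with the Hungarian algorithm). The function name is illustrative.

```python
from itertools import permutations

def sokoban_heuristic(boxes, docks):
    """Minimum, over all one-to-one assignments of boxes to docks, of the summed
    Manhattan distances. With two boxes this enumerates only two assignments."""
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    return min(
        sum(manhattan(box, dock) for box, dock in zip(boxes, perm))
        for perm in permutations(docks)
    )

# Example: boxes at (2, 3) and (4, 4), docks at (2, 5) and (5, 4)
print(sokoban_heuristic([(2, 3), (4, 4)], [(2, 5), (5, 4)]))   # -> 3
```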
Similar to the validator described in Section 4, we construct an A* validator that reverses this process for Sokoban puzzles – it takes in a linearized trace and attempts to simulate the corresponding open and closed list operations to check if they are valid with respect to the semantics of this implementation.
Training and Test Dataset - We use the same problem generation procedure used by [18]. A 7 × 7 grid was sampled and two additional wall cells were added as obstacles to the interior of the map. Then two docks, two boxes, and the worker location were randomly placed. If the sampled task was solvable by A*, it was admitted to the dataset. We generate 50,000 Sokoban puzzles to construct our training dataset. We also generate the corresponding swap dataset for the Sokoban problems.
For the test dataset, we use plan length as a proxy for problem difficulty and generate problems that have a final plan length greater than the mean plan length of training dataset problems.
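This difficulty filter reduces to a comparison against the training-set mean plan length; a sketch with illustrative field names:

```python
def filter_hard_instances(train_set, candidate_set):
    """Keep only candidate puzzles whose optimal plan is longer than the mean
    optimal plan length of the training set."""
    mean_len = sum(len(ex["plan"]) for ex in train_set) / len(train_set)
    return [ex for ex in candidate_set if len(ex["plan"]) > mean_len]
```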
| Test Set | Soln. Only: Plan Val. | Regular A* traces: Plan Val. | Regular A* traces: Trace Val. | Swapped A* traces: Plan Val. | Swapped A* traces: Trace Val. |
| --- | --- | --- | --- | --- | --- |
| Sokoban | 18.1% | 1.1% | 0% | 2.3% | 0% |
Even in the case of Sokoban problems, we see that correct traces do not help the model perform better than swapped (and incorrect) traces (as shown in Table 5). This reinforces our point that trace accuracy and plan accuracy are not semantically correlated.
[1] Monitoring reasoning models for misbehavior and the risks of promoting obfuscation arXiv preprint arXiv:2503.11926 2025
[2] R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model arXiv preprint arXiv:2503.05132 2025
[3] Show your work: Scratchpads for intermediate computation with language models 2021
[4] Chain-of-thought prompting elicits reasoning in large language models Advances in neural information processing systems 2022 35 24824–24837
[5] Automatic chain of thought prompting in large language models arXiv preprint arXiv:2210.03493 2022
[6] Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes arXiv preprint arXiv:2305.02301 2023
[7] MiniLLM: Knowledge distillation of large language models arXiv preprint arXiv:2306.08543 2023
[8] Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning arXiv preprint arXiv:2501.12948 2025
[9] Let's think dot by dot: Hidden computation in transformer language models arXiv preprint arXiv:2404.15758 2024
[10] s1: Simple test-time scaling arXiv preprint arXiv:2501.19393 2025
[11] LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! arXiv preprint arXiv:2502.07374 2025
[12] Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars arXiv preprint arXiv:2503.01307 2025
[13] Understanding Aha Moments: from External Observations to Internal Mechanisms arXiv preprint arXiv:2504.02956 2025
[14] Sparks of artificial general intelligence: Early experiments with gpt-4 2023
[15] Reasoning Models Don't Always Say What They Think arXiv preprint arXiv:2505.05410 2025
[16] Are DeepSeek R1 And Other Reasoning Models More Faithful? ICLR 2025 Workshop on Foundation Models in the Wild
[17] Chain-of-thought reasoning in the wild is not always faithful arXiv preprint arXiv:2503.08679 2025
[18] Beyond a*: Better planning with transformers via search dynamics bootstrapping arXiv preprint arXiv:2402.14083 2024
[19] Star: Bootstrapping reasoning with reasoning Advances in Neural Information Processing Systems 2022 35 15476–15488
[20] Stream of search (sos): Learning to search in language arXiv preprint arXiv:2404.03683 2024
[21] Dualformer: Controllable fast and slow thinking by learning with randomized reasoning traces The Thirteenth International Conference on Learning Representations 2024
[22] Transformers provably solve parity efficiently with chain of thought arXiv preprint arXiv:2410.08633 2024
[23] The expressive power of transformers with chain of thought arXiv preprint arXiv:2310.07923 2023
[24] Towards revealing the mystery behind chain of thought: a theoretical perspective Advances in Neural Information Processing Systems 2023 36 70757–70798
[26] A Note on Two Problems in Connexion with Graphs Numerische Mathematik 1959 1 1 269–271 10.1007/BF01386390
[27] A formal basis for the heuristic determination of minimum cost paths IEEE transactions on Systems Science and Cybernetics 1968 4 2 100–107
[28] Heuristics: intelligent search strategies for computer problem solving Addison-Wesley 1984
[29] Qwen2.5 Technical Report arXiv preprint arXiv:2412.15115 2024
[30] Optuna: A next-generation hyperparameter optimization framework Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining 2019 2623–2631
[31] Best-of-n jailbreaking arXiv preprint arXiv:2412.03556 2024
[32] Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization arXiv preprint arXiv:2405.15071 2024
[33] Grokking: Generalization beyond overfitting on small algorithmic datasets arXiv preprint arXiv:2201.02177 2022
[34] The clock and the pizza: Two stories in mechanistic explanation of neural networks Advances in neural information processing systems 2023 36 27223–27250
[35] Transformers Can Navigate Mazes With Multi-Step Prediction arXiv preprint arXiv:2412.05117 2024
[36] Semformer: Transformer language models with semantic planning arXiv preprint arXiv:2409.11143 2024
[37] Chain of thought imitation with procedure cloning Advances in Neural Information Processing Systems 2022 35 36366–36381
[38] System-1. x: Learning to balance fast and slow planning with language models arXiv preprint arXiv:2407.14414 2024
[39] Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks Advances in Neural Information Processing Systems 2023 36 23813–23825
[40] Can Transformers Reason Logically? A Study in SAT Solving arXiv preprint arXiv:2410.07432 2024
[41] Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting Advances in Neural Information Processing Systems 2023 36 74952–74965
[42] Measuring faithfulness in chain-of-thought reasoning arXiv preprint arXiv:2307.13702 2023
[43] Let's verify step by step The Twelfth International Conference on Learning Representations 2023
[44] Tülu 3: Pushing frontiers in open language model post-training arXiv preprint arXiv:2411.15124 2024
[45] Free process rewards without process labels arXiv preprint arXiv:2412.01981 2024
[46] Reinforcement learning in the era of llms: What is essential? what is needed? an rl perspective on rlhf, prompting, and beyond arXiv preprint arXiv:2310.06147 2023
[47] Training language models to reason efficiently arXiv preprint arXiv:2502.04463 2025
[48] Dapo: An open-source llm reinforcement learning system at scale arXiv preprint arXiv:2503.14476 2025
[49] Generating random spanning trees more quickly than the cover time Proceedings of the twenty-eighth annual ACM symposium on Theory of computing 1996 296–303
[50] On the shortest spanning subtree of a graph and the traveling salesman problem Proceedings of the American Mathematical society 1956 7 1 48–50
[51] Depth-first search and linear graph algorithms SIAM journal on computing 1972 1 2 146–160
[52] Universal and Transferable Adversarial Attacks on Aligned Language Models 2023
[53] Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs 2024
[54] FlipAttack: Jailbreak LLMs via Flipping OpenReview pre-print, submitted to ICLR 2025 2024
[55] Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails 2025
[56] Sokoban is PSPACE-complete 1997