If you see this, something is wrong

Collapse and expand sections

To get acquainted with the document, the best thing to do is to select the "Collapse all sections" item from the "View" menu. This will leave visible only the titles of the top-level sections.

Clicking on a section title toggles the visibility of the section content. If you have collapsed all of the sections, this will let you discover the document progressively, from the top-level sections to the lower-level ones.

Cross-references and related material

Generally speaking, anything that is blue is clickable.

Clicking on a reference link (like an equation number, for instance) will display the reference as close as possible, without breaking the layout. Clicking on the displayed content or on the reference link hides the content. This is recursive: if the content includes a reference, clicking on it will have the same effect. These "links" are not necessarily numbers, as it is possible in LaTeX2Web to use full text for a reference.

Clicking on a bibliographical reference (i.e., a number within brackets) will display the reference.

Speech bubbles indicate a footnote. Click on the bubble to reveal the footnote (there is no page in a web document, so footnotes are placed inside the text flow). Acronyms work the same way as footnotes, except that you have the acronym instead of the speech bubble.

Discussions

By default, discussions are open in a document. Click on the discussion button below to reveal the discussion thread. However, you must be registered to participate in the discussion.

If a thread has been initialized, you can reply to it. Any modification to any comment, or a reply to it, in the discussion is signified by email to the owner of the document and to the author of the comment.

First published on Tuesday, Jul 1, 2025 and last modified on Tuesday, Jul 1, 2025 by François Chaplais.

Multi-Timescale Hierarchical Reinforcement Learning for Unified Behavior and Control of Autonomous Driving

arXiv
Published version: 10.48550/arXiv.2506.23771

Guizhe Jin School of Automotive Studies, Tongji University, Shanghai 201804, China Email

Zhuoren Li School of Automotive Studies, Tongji University, Shanghai 201804, China Email

Bo Leng School of Automotive Studies, Tongji University, Shanghai 201804, China Email

Ran Yu School of Automotive Studies, Tongji University, Shanghai 201804, China Email

Lu Xiong School of Automotive Studies, Tongji University, Shanghai 201804, China Email

Keywords: Hierarchical reinforcement learning, autonomous driving, multiple timescale, motion planning

Abstract

Reinforcement Learning (RL) is increasingly used in autonomous driving (AD) and shows clear advantages. However, most RL-based AD methods overlook policy structure design. An RL policy that only outputs short-timescale vehicle control commands results in fluctuating driving behavior due to fluctuations in network outputs, while one that only outputs long-timescale driving goals cannot achieve unified optimality of driving behavior and control. Therefore, we propose a multi-timescale hierarchical reinforcement learning approach. Our approach adopts a hierarchical policy structure, where high- and low-level RL policies are unified-trained to produce long-timescale motion guidance and short-timescale control commands, respectively. Therein, motion guidance is explicitly represented by hybrid actions to capture multimodal driving behaviors on structured road and support incremental low-level extend-state updates. Additionally, a hierarchical safety mechanism is designed to ensure multi-timescale safety. Evaluation in simulator-based and HighD dataset-based highway multi-lane scenarios demonstrates that our approach significantly improves AD performance, effectively increasing driving efficiency, action consistency and safety.

1 Introduction

Reinforcement learning (RL) has demonstrated strong capabilities in solving sequential decision-making problems, making it a promising paradigm for autonomous driving (AD) applications [1, 2]. However, current RL-based AD approaches often suffer from inappropriate policy output structures, resulting in weak correlations between agent outputs and actual driving behavior. Typically, the RL agent directly outputs vehicle control commands, such as steering angle and acceleration [3, 4]. Fluctuations in policy network’s outputs can cause inconsistent control sequences [5], making it difficult to achieve stable and coherent driving—especially in lane-structured scenarios—thereby increasing risks [6, 7].

Hierarchical policy output structures are better suited to AD tasks than directly outputting control commands, as they more closely resemble human driving [8]. Behavioral science indicates that human driving behavior is inherently hierarchical in nature, involving both conscious trajectory planning and subconscious action control [9]. Based on this, a common RL approach is to use a hierarchical structure where the high-level policy outputs discrete semantic decisions or trajectory goals, while the low-level rule-based policy generates control commands [10]. However, this design limits the flexibility of RL in vehicle control and makes it difficult to produce optimal control commands that adapt to high-level outputs [11].

In contrast, a hierarchical structure in which both high- and low-level actions are generated by RL policies better leverages RL’s flexibility, enabling unified learning of driving behaviors and control commands for complex tasks [12]. Some studies implement this approach using either a single RL agent with a parameterized actor-critic architecture or two independently trained agents [13, 14]. However, these designs often impose timescale consistency constraints on both levels, leading to either fluctuating high-level behaviors or slow low-level control responses [12]. In practice, high-level policy require long-timescale behavioral goals, while low-level policy need short-timescale immediate control [15]. Moreover, given the safety-critical nature of driving, capturing the safer hierarchical policy outputs is also essential.

Therefore, this paper proposes a Multi-Timescale Hierarchical RL approach for autonomous driving. Specifically, we design a hierarchical RL policy structure with high- and low-level components operating at different timescales: the high-level policy generates long-timescale guidance-actions (i.e., motion guidance), while the low-level policy produces short-timescale execution-actions (i.e., control commands). Both policies are jointly trained to achieve unified optimal performance. Furthermore, we construct a continuous-discrete hybrid-action motion guidance to accommodate structured road constraints—discrete laterally and continuous longitudinally. To further enhance safety, we develop a hierarchical safety mechanism that operates in parallel with the policy structure. The main contributions are as follows:

2 Related Works

A general RL-based AD approach with a hierarchical architecture combines RL with a rule-based method for vehicle control. Specifically, RL is used for the high-level policy, producing outputs that can either be semantic decisions (e.g., lane changes) [16, 17, 7, 10], motion primitives from a discrete space [8, 18, 19], or target points in a continuous space [20, 6, 21, 22]. Based on these behavior goals, the low-level rule-based controller generates actual vehicle control commands (e.g., steering angle, acceleration). However, this structure limits the flexibility of the RL policy due to indirect vehicle control. Additionally, the low-level controller may fail to respond effectively to dynamic environmental changes, or its response may deviate from the intended high-level behavior, preventing unified optimization across both levels [11].

In contrast, using RL policies to simultaneously generate both abstract driving behaviors and concrete control commands is more advanced. Some classical studies construct implicit hierarchical policy structures [5, 13, 23, 24, 25, 26], or train independent RL agents for high- and low-level policies [11, 12, 14, 27], to enhance unified optimization between the two levels. However, these methods face timescale consistency constraints, making it difficult to set appropriate timescales for both levels. Specifically, a too-short timescale leads to driving behavior fluctuations, while a too-long timescale slows responses to dynamic environment changes.

Further, some studies attempt to use RL with different timescales to construct hierarchical policy structures, commonly adopting a skill-based approach [9, 28, 29]. For such an approach, low-level RL agents are pre-trained to output sequences of control commands over short timescales, known as motion skills. A high-level RL agent is then trained to select the optimal motion skill from this skill space. This approach breaks the timescale consistency constraint between different levels, thereby better leveraging the flexibility of RL. However, the skill space is typically fixed, and the high-level RL policy essentially learns to combine these time-extended control commands. This limits the potential of RL to explore optimal actions at each driving step [30].

To address the limitations of previous works, we propose a multi-timescale hierarchical RL approach. Two joint-trained hierarchical RL policies output long-timescale abstract motion guidance and short-timescale concrete control commands, respectively. Few studies have explored similar approaches in AD [15, 30], and those that do have notable shortcomings: (1) high-level outputs are restricted to either purely discrete or continuous action spaces, failing to match structured road constraints; and (2) hierarchical policies lack safety considerations. In contrast, we integrate the parameterized actor-critic (P-AC) technique into the hierarchical structure, explicitly representing high-level motion guidance with discrete-continuous hybrid actions. Additionally, a hierarchical safety mechanism is designed to support the policy structure.

Figure 1. The framework of our multi-timescale hierarchical RL approach. It consists of two unified-trained policies at the high and low levels, with supporting safety mechanisms.

3 Methodology

3.1 Framework Construction

3.1.1 MDP Re-Formulation

Inspired by DAC [31], we reformulate the training as two augmented Markov Decision Processes (MDPs): high-MDP \( {\cal M}^h\) and low-MDP \( {\cal M}^l\) . The high-level policy \( \pi^h\) and low-level policy \( \pi^l\) make decisions within their respective MDPs and are optimized jointly.

The \( {\cal M}^h\) can be defined by a tuple \( <{{\cal S}^h},{{\cal H}^h},{{\cal R}^h},{{\cal T}^h},\gamma >\) , where: 1) \( {\cal S}^h\) is the high-level state space, derived from the environment; 2) \( {\cal H}^h\) is the hybrid action space, composed of discrete and continuous subspaces, \( {\cal O}\) and \( {\cal A}^h\) ; 3) \( {\cal R}^h\) is the high-level reward function, determined by low-level rewards and agent violations; and 4) \( {\cal T}^h\) and \( \gamma\) are the high-level transition function and discount factor.

Similarly, the \( {\cal M}^l\) is defined by \( <{{\cal Z}^l},{{\cal A}^l},{{\cal R}^l},{{\cal T}^l},\gamma>\) , where: 1) \( {\cal Z}^l\) is the low-level state space, combining the original low-level state space \( {\cal S}^l\) and all motion guidance mapping to \( {\cal H}^l\) ; 2) \( {\cal A}^l\) is the low-level continuous action space; 3) \( {\cal R}^l\) is the low-level reward function derived from environmental feedback; and 4) \( {\cal T}^l\) is the transition function.

At each long timestep \( {T^h}\) , \( {\pi ^h}( o, {a^h} \mid {s^h} )\) outputs guidance-action \( (o, {a^h}) \in {{\cal H}^h}\) , with \( {s^h} \in {{\cal S}^h}\) . Each guidance-action explicitly represents a motion guidance \( {\cal G}\) via a bi-directional mapping: \( {\cal G} \leftrightarrow (o, {a^h})\) . Hereafter, \( {\cal G}(o, a^h)\) denotes the motion guidance corresponding to \( (o, a^h)\) . Then, at each short timestep \( {T^l}\) , \( {\pi ^l}( {a^l} \mid )\) receives extend-state \( {z^l} = ({s^l}, {\cal G}(o, {a^h}))\) from the environment and the high-level policy \( \pi^h\) , where \( {z^l} \in {{\cal Z}^l}\) and \( {s^l} \in {{\cal S}^l}\) , to output a execution-action \( a^l \in {{\cal A}^l}\) . The timesteps are related by \( {T^h} = n{T^l}\) , where \( n\) is determined by the termination function \( \beta\) , i.e., \( {\arg _i}\left[ {\beta ( {z_{t + i{T^l}}^l} ) = 1} \right]\) . Here, \( z_{t + i{T^l}}^l\) is required for the \( i\) -th output of \( {\pi ^l}\) after receiving \( \cal G\) . Thus, each high-level motion guidance corresponds to a variable-length sequence of low-level control commands.

3.1.2 Safety Mechanism Definition

The safety mechanism for the hierarchical policy includes: 1) safety evaluation module, 2) high- and low-level independent safety correction module, and 3) the safety-aware termination function. The safety evaluation module crosses both levels and generates risk severity of motion guidance \( {\cal K}( {{\cal G}( {o,{a^h}} ),s} )\) at different timescales. The safety correction module fuses \( {\cal K}\) with action values to produce safer actions \( o^{[s]}\) , \( a^{h,[s]}\) , and \( a^{l,[s]}\) . The safety-aware termination function \( \beta\) combines the safety evaluation results from both levels, prioritizing high-level correction to further enhance safety.

3.2 Multi-Timescale Hierarchical Policy Design

The policies \( \pi^h\) and \( \pi^l\) operate in two parallel augmented MDPs, which is trained simultaneously under same sampling conditions [31]. Optimizing \( \pi^h\) requires the parameterized actor-critic algorithm [32], while any policy optimization algorithm can be applied to \( \pi^l\) . To achieve joint optimality of \( \pi^h\) and \( \pi^l\) , a strong coupling between the two policies is essential: 1) The guidance-action from \( \pi^h\) is incorporated into the extend-state as input to \( \pi^l\) , and 2) the rewards obtained by \( \pi^l\) within the timestep \( {T}^l\) are also used to update \( \pi^h\) .

3.2.1 High-Level Policy

In high-MDP, the state-action value function for the optimal high-level policy is defined by the following Bellman optimality equation:

\[ \begin{equation} \begin{split} {Q^h}&\left( {s_t^h,{o_t},a_t^h} \right) = \\ &{\mathbb{E}} \left[ {r_{t + {T^h}}^h + \gamma {\max }\limits_{o \in {\cal O}} {\sup }\limits_{{a^h} \in {{\cal A}^h}} {Q^h}\left( {s_{t + {T^h}}^h,o,{a^h}} \right)} \right], \end{split} \end{equation} \]

(1)

where \( r_{t + {T^h}}^h \in {\cal R}^h\) is given by:

\[ \begin{equation} \begin{split} r_{t + {T^h}}^h = \left( {1 - {f_{v}}\left( {{s^l}} \right)} \right){{\mathbb{E}}_{i = 1 \sim n}}\left[ {r_{t + i{T^l}}^l} \right] + {f_{v}}\left( {{s^l}} \right){{\cal R}_{vio}}, \end{split} \end{equation} \]

(2)

where \( r_{t + i{T^l}}^l \in {\cal R}^l\) is the feedback reward from the environment for the \( i\) -th action of \( \pi^l\) . The violation flag function \( f_v\) is set to 1 in case of agent violations (e.g., vehicle collisions) and 0 otherwise. Introducing \( f_v\) prevents the dilution of violation-related rewards \( {\cal R}_{vio}\) by expectation-seeking operations, ensuring that the high-level policy maintains a strong emphasis on violations. The definition of \( {\cal R}_{vio}\) is provided in Sec. 4.1.

However, finding the optimal \( a^h\) in a hybrid action space is challenging. Following the idea of parameterized actor-critic, the high-level policy \( \pi^h\) outputs \( a^h\) through the cooperation between a deterministic policy network \( \mu^h(s^h; \theta^h)\) and a value network \( Q^h(s^h, o, a^h; \omega^h)\) . Details of this cooperation can be found in our previous work [13]. Thus, with \( {\pi ^h} = ( { \cdot \left| {{\mu ^h}( { \cdot {\rm{ }};{\theta ^h}} ),{Q^h}( { \cdot {\rm{ }};{\omega ^h}} ),{s^h}} \right.} )\) , the optimal state-action value function can be rewritten as:

\[ \begin{equation} \begin{split} &{Q^h} \left( {s_t^h,{o_t},a_t^h} \right) = \\ & {\mathbb{E}} \left[ {r_{t + {T^h}}^h + \gamma {\max }\limits_{o \in {\cal O}} {Q^h}\left( {s_{t + {T^h}}^h,o,{\mu ^h}\left( {s_{t + {T^h}}^h;{\theta ^h}} \right);{\omega ^h}} \right)} \right]. \end{split} \end{equation} \]

(3)

This function’s solution is the optimal guidance-action \( (o, a^h)\) .

The \( (o, a^h)\) explicitly represents the motion guidance \( {\cal G}\) through a bi-directional mapping:

\[ \begin{equation} \begin{split} {\cal G} \leftarrow \Psi \left( {o,{a^h}} \right),{\rm{ }}\left( {o,{a^h}} \right) \leftarrow {\Psi ^{ - 1}}\left( {\cal G} \right). \end{split} \end{equation} \]

(4)

where \( \Psi\) is an explicit representation function, depending on the practical significance of \( (o, a^h)\) and \( \cal G\) . A generalized example for AD is provided in Sec. 4.1.

3.2.2 Low-Level Policy

Illustration of the low-level extend-state transition. Assume that ^ h) generates a guidance-action of &rsquo;Lane Left&rsquo; for T^h) . When the vehicle crosses the lane divider at iT^l) , the guidance-action observed from the agent&rsquo;s viewpoint becomes &rsquo;Lane Keeping&rsquo;, even though the T^h) has not yet ended. — Figure 2. Illustration of the low-level extend-state transition. Assume that \( \pi ^ h\) generates a guidance-action of ’Lane Left’ for \( T^h\) . When the vehicle crosses the lane divider at \( iT^l\) , the guidance-action observed from the agent’s viewpoint becomes ’Lane Keeping’, even though the \( T^h\) has not yet ended.

In low-MDP, \( {\cal G}(o, a^h)\) is provided every \( T^h\) , while the low-level policy \( \pi^l\) acquires the observed extend-state \( z_{t + i{T^l}}^l\) at much shorter timestep \( T^l\) . As a result, the high-level output observed from the low-level perspective may change, as illustrated in Figure 2 with a lane-change scenario. Directly using the fixed \( {\cal G}(o, a^h)\) in constructing \( z_{t + i{T^l}}^l\) would introduce state inconsistencies, hindering stable training of \( \pi^l\) . To address this, motion guidance is incrementally updated at each short timestep using physical information, allowing the guidance-action to be naturally updated:

\[ \begin{equation} \begin{split} \left( {o_{t + i{T^l}}^{\left[ u \right]},a_{t + i{T^l}}^{h,\left[ u \right]}} \right) \stackrel{\Phi}{\leftrightarrow } {{\cal G}_{t + i{T^l}}^{\left[ u \right]}} \leftarrow _{t}\left( {{o_t},a_t^h} \right), \forall i \in \left[ {1, \cdots ,n} \right], \end{split} \end{equation} \]

(5)

where the superscript \( u\) denotes the variable has undergone incremental updating. Accordingly, the actual low-level extend-state becomes \( z_{t + i{T^l}}^l = ( {s_{t + i{T^l}}^l,{\cal G}_{t + i{T^l}}^{\left[ u \right]}} )\) . Since motion guidance is environment-specific, incremental updating using physical information is straightforward to implement. An example is provided in Sec. 4.1.

Therefore, in low-MDP, the state-action value function for the optimal low-level policy is given by the following Bellman optimality equation:

\[ \begin{equation} \begin{split} {Q^l} \left( {z_t^l,a_t^l} \right) = \mathbb{E} \left[ {r_{t + {T^l}}^l + \gamma U\left( {z_{t + {T^l}}^l} \right)} \right], \end{split} \end{equation} \]

(6)

\[ \begin{equation} \begin{split} U \left( {z_{t + {T^l}}^l} \right) =& \left( {1 - \beta \left( {s_{t + {T^l}}^l} \right)} \right) {\sup }\limits_{{a^l} \in {{\cal A}^l}} {Q^l}\left( {z_{t + {T^l}}^l,{a^l}} \right) + \\ &\beta \left( {s_{t + {T^l}}^l} \right) {\sup }\limits_{{a^l} \in {{\cal A}^l}} {Q^l}\left( {{\pi ^h}\left( {s^h} \right),s_{t + {T^l}}^l,{a^l}} \right), \end{split} \end{equation} \]

(7)

where \( U( {z_{t + {T^l}}^l} )\) is the optimal state value function. Since \( \pi^l\) outputs continuous actions and is compatible with any optimization algorithm, it can be directly approximated using a policy network \( {\mu ^l}( {{z^l}{\rm{ }};{\theta ^l}} )\) . Meanwhile, a value network \( {Q^l}( {z_t^l,a_t^l;{\omega ^l}} )\) is introduced to estimate the state-action value.

3.3 Hierarchical Safety Mechanism

The original motion guidance from \( \pi^h\) may pose safety risks. To quantify these risks, a safety evaluation module is used to assess risk severity: \( {\cal K}_t^h( {{\cal G}_t( {{o_t},a_t^h} ),s_t^h} )\) . Since the parameterized actor-critic provides values for all alternative actions, when risk exceeds a threshold, i.e., \( \eta {\cal K}_t^h \ge {{\cal K}_{th}}\) , the safety correction module generates safer guidance-actions guided by \( {{\cal K}^h}\) and \( Q^h\) :

\[ \begin{equation} \begin{split} \begin{array}{l} \left( {o_t^{\left[ s \right]},\!a_t^{h,\left[ s \right]}} \right) = \left\{{ \begin{array}{*{2}{l}} { {\arg \max }_{\cal {H}^{[s]}} Q^h\left( s_t^h,\!o,\!a^{h}; \omega^h \right)}&{{{\cal H}^{\left[ s \right]}} \ne \emptyset }\\ \arg \min _{\cal {O}, \mu^h} \cal{K}^{h}\left( \cal{G}\left( o,\!a^h \right),\!s_t^h \right) &\cal{H}^{\left[ s \right]} = \emptyset \end{array}} \right.\\ \end{array} \end{split} \end{equation} \]

(8)

\[ \begin{equation} \begin{split} \begin{array}{l} {{\cal H}^{\left[ s \right]}} = {\arg }_{o,a^h} \left[ \eta \cal{K}^h \left( \cal{G}\left( o,a^h \right) , s_t^h \right) < \cal {K}_{th} \right] \left| _{\forall o \in \cal{O},{a^h} = \mu^h} \right. \end{array} \end{split} \end{equation} \]

(9)

where the superscript \( [s]\) indicates processing by the safety mechanism. The variable \( \eta\) is an attention weight for \( {\cal K}_t^h\) , gradually increased during training to avoid early convergence to a conservative policy. The \( {{\cal H}^{\left[ s \right]}}\) represents a safe guidance-action space comprising alternative actions with risk severity below \( {{\cal K}_{th}}\) . Then, safer motion guidance \( {{\cal G}_t^{\left[ s \right]}}\) is reconstructed based on \( ( {o_t^{\left[ s \right]},a_t^{h,\left[ s \right]}} )\) , and they are updated to \( {{\cal G}_t^{\left[ {s,u} \right]}}\) and \( ( {o_t^{\left[ {s,u} \right]},a_t^{h,\left[ {s,u} \right]}} )\) according to Eq. 5. Additionally, \( ( {o_t^{\left[ s \right]},a_t^{h,\left[ s \right]}} )\) forms a tuple with \( s_t^h\) , \( r_t^h\) , and \( s_{t+T^h}^h\) , which is stored in the high-level replay buffer \( {\cal D}^h\) for updating \( \pi^h\) .

The safety mechanism for \( \pi^h\) cannot always ensure low risk severity, as execution-actions from \( \pi^l\) or dynamic environmental changes may lead to sudden risk increases. To address this, the safety evaluation module operating on a shorter timescale is introduced to assess risk severity: \( {\cal K}_t^l( {\cal G}_t^{[s,u]}( o_t^{[s,u]}, a_t^{h,[s,u]} ), s_t^l )\) . Since \( \pi^l\) outputs a deterministic action, an alternative action \( a_t^{l,r}\) is obtained from a priori conservative control model \( {{\cal F}_{p-m}}\) , to generate a safer action:

\[ \begin{equation} \begin{split} \begin{array}{l} a_t^{l,\left[ s \right]} = \left\{ {\begin{array}{*{20}{l}} a_t^l &\eta \cal{K}_t^l < \cal{K}_{th}\\ {\arg \max }_{{a^l} \in \left[ {a_t^l,a_t^{l,r}} \right]} Q^l\left( z_t^l,a^l;\omega^l \right) &{\eta {\cal K}_t^l \ge {{\cal K}_{th}}} \end{array}} \right. \end{array} \end{split} \end{equation} \]

(10)

where \( {{\cal F}_{p - m}}\) may vary depending on the specific implementation, with details and an example provided in Sec. 4.2. The \( a_t^{l,\left[ s \right]}\) is used to control the agent’s interaction with the environment and, together with \( z_t^l\) , \( r_t^l\) , and \( z_{t + {T^l}}^l\) , is stored in the low-level replay buffer \( {\cal D}^l\) for updating \( \pi^l\) .

When the risk severity of motion guidance is high, the high-level safety mechanism should be prioritized over a longer timescale to ensure long-term safety. Accordingly, the safety evaluations results of \( \pi^h\) and \( \pi^l\) are integrated, resulting in a safety-aware termination function \( \beta\) :

\[ \begin{equation} \begin{split} \begin{array}{l} \beta \left( {z_{t + i{T^l}}^l} \right) = {f_{v}}\left( {s_{t + i{T^l}}^l} \right) \vee \left[ {i = {n_{\max }}} \right] \vee {\cal C}\left( {z_{t + i{T^l}}^l} \right) \end{array} \end{split} \end{equation} \]

(11)

\[ \begin{equation} \begin{split} \begin{array}{l} {\cal C}\left( {z_{t + i{T^l}}^l} \right) = \left[ {\left( {\eta {\cal K}_{t + i{T^l}}^l \ge {{\cal K}_{th}}} \right) \wedge \left( \eta {\cal K}_t^h < {{\cal K}_{th}} \right)} \right] \end{array} \end{split} \end{equation} \]

(12)

Three conditions trigger \( \beta ( {z_{t + i{T^l}}^l} ) = 1\) : 1) the agent is in violation, i.e., \( {f_v} = 1\) ; 2) the cumulative number of low-level decisions reaches the limit \( {n_{\max }}\) , which can be fixed or task-dependent; and 3) \( {\cal C}( z_{t + i{T^l}}^l ) = 1\) , indicating that, during the operation of \( \pi^l\) , the risk severity exceeds the threshold while the safety correction of \( \pi^h\) is inactive.

3.4 Policy Optimization Training

The update of \( \pi^h\) involves two networks: \( {Q^h}(\cdot\,;{\omega^h})\) and \( {\mu^h}(\cdot\,;{\theta^h})\) . Corresponding target networks, \( {Q^h_*}(\cdot\,;{\omega^h})\) and \( {\mu^h_*}(\cdot\,;{\theta^h})\) , are introduced, being updated using a soft update parameter \( \tau\) . The gradient for updating \( {Q^h}(\cdot\,;{\omega^h})\) is computed from randomly sampled transitions \( < s_t^h, o_t^{[s]}, a_t^{h,[s]}, r_{t+T^h}^h, s_{t+T^h}^h >\) from \( {\cal D}^h\) , and is given by:

\[ \begin{equation} \begin{split} \begin{array}{l} {\nabla}\left( \right) = - \left[ {y\left( {{Q^h}} \right) - {Q^h}\left( {s_t^h,\!o_t^{\left[ s \right]},\!a_t^{h,\left[ s \right]};} \right)} \right]\\ y\left( {{Q^h}} \right) = r_{t + }^h + \gamma {\max }\limits_{o \in {\cal O}} Q_*^h\left( {s_{t + {T^h}}^h,\!o,\mu _*^h\left( {s_{t + }^h;\theta _*^h} \right);\omega _*^h} \right). \end{array} \end{split} \end{equation} \]

(13)

For \( {\mu ^h}( { \cdot {\rm{ }};{\theta ^h}} )\) , the update objective is to maximize the value function over all discrete actions, so its policy gradient is:

\[ \begin{equation} \begin{split} {\nabla}\left( {{\theta ^h}} \right) = \sum\limits_{o \in {\cal O}} {{\nabla _{{\theta ^h}}}{\mu ^h}\left( {s_t^h;} \right)\left( {s_t^h,\!o,;} \right)} \left| {_{{a^h} = {\mu ^h}}} \right. \end{split} \end{equation} \]

(14)

Similarly, updating the low-level \( {Q^l}( { \cdot {\rm{ }};{\omega ^l}} )\) and \( {\mu ^l}( { \cdot {\rm{ }};{\theta ^l}} )\) relies on target networks \( Q_*^l( { \cdot {\rm{ }};\omega _*^l} )\) and \( \mu _*^l( { \cdot {\rm{ }};\theta _*^l} )\) . The gradient of \( {Q^l}( { \cdot {\rm{ }};{\omega ^l}} )\) is computed based on \( < z_t^l,a_t^{l,\left[ s \right]},r_{t + {T^l}}^l,z_{t + {T^l}}^l > \) , which is randomly sampled from \( {\cal D}^l\) :

\[ \begin{equation} \begin{split} \begin{array}{l} \nabla {{\cal L}_t}\left( {{\omega ^l}} \right) = - \left[ {y\left( {{Q^l}} \right) - {Q^l}\left( {z_t^l,a_t^{l,\left[ s \right]};{\omega ^l}} \right)} \right]{\nabla _{{\omega ^l}}}{Q^l},\\ y\left( {{Q^l}} \right) = r_{t + {T^l}}^l + \gamma U\left( {z_{t + {T^l}}^l} \right)\left| {_{\pi _*^h,{s^h} = s_{t + {T^l}}^l}} \right., \end{array} \end{split} \end{equation} \]

(15)

where \( \pi _*^h\) denotes the high-level target policy. The gradient for updating \( {\mu ^l}( { \cdot {\rm{ }};{\theta ^l}} )\) is:

\[ \begin{equation} \begin{split} \begin{array}{l} \nabla {{\cal J}_t}\left( {{\theta ^l}} \right) = {\nabla _{{\theta ^{_l}}}}{\mu ^l}\left( {z_t^l;{\theta ^l}} \right){\nabla _{{a^l}}}{Q^l}\left( {z_t^l,{a^l};{\omega ^l}} \right)\left| {_{{a^l} = {\mu ^l}}} \right.. \end{array} \end{split} \end{equation} \]

(16)

The training procedure of our multi-timescale hierarchical RL, with safety mechanisms, is presented in Algorithm 1.

Algorithm 1 Training process of our method

1.Require:

2.Initialize: replay buffer \( \{{\cal D}^h, {\cal D}^l\}\) , high- and low-level networks \( \{Q^h, \mu^h, Q_{*}^h, \mu _{*}^h, Q^l, \mu^l, Q_{*}^l, \mu _{*}^l\}\) .

3.for \( t = 0\) to \( T\) do

4.Get \( {s_t^h}\) from environment.

5.Select \( a_t^h \sim {\mu ^h}\left( {{s^h_t};{\theta ^h}} \right)\) for \( {\forall o \in {\cal O}}\) .

6.Select \( \left( {{o_t},a_t^h} \right) \sim { {Q^h}\left( {s_{t}^h,o, a_t^h;{\omega ^h}} \right)} \left| {_{\forall o \in {\cal O}}} \right.\) .

7.Construct motion guidance \( {\cal G}_t\) according to \( \Psi \left( {{o_t},a_t^h} \right)\) .

8.Get safer \( \left( {o_t^{\left[ s \right]},a_t^{h,\left[ s \right]}} \right)\) and \( {{\cal G}_t^{\left[ s \right]}}\) according to Eq. 8.

9.while not \( \beta\) do

10.Select \( a_t^l \sim {\mu ^l}\left( {{z^l_t}{\rm{ }};{\theta ^l}} \right)\) .

11.Get safer \( a_t^{l,\left[ s \right]}\) according to Eq. 10.

12.Get \( s^l_{t + T^l}\) and \( r^l_{t + T^l}\) from environment.

13.Incremental update for \( {\cal G}_{t+T^l}^{\left[ {s,u} \right]}\left( {o_{t+T^l}^{\left[ {s,u} \right]},a_{t+T^l}^{h,\left[ {s,u} \right]}} \right)\) .

14.Get \( \beta \left( {z_{t+T^l}^l} \right)\) according to Eq. 11.

15.Update \( \omega ^l_{t + T^l}\) , \( \theta ^l_{t + T^l}\) , \( {\omega ^{l}_{*}}\) , \( {\theta ^{l}_{*}}\) .

16.\( z _{t}^l \leftarrow z _{t + T^l}^l\) , \( i \leftarrow i + 1\) , \( t \leftarrow t + T^l\) .

17.end while

18.\( n \leftarrow i\) , \( T^h = nT^l\) , \( s _{t + T^h}^h \leftarrow s _{t + T^l}^l\) .

19.Get \( {r_{t + {T^h}}^h}\) according to Eq. 2.

20.Update \( \omega ^h_{t + T^h}\) , \( \theta ^h_{t + T^h}\) , \( {\omega ^{h}_{*}}\) , \( {\theta ^{h}_{*}}\) .

21.\( s_t^h \leftarrow s_{t + T^h}^h\) , \( i \leftarrow 0\) , \( \beta \leftarrow 0\) .

22.end for

4 Implementation

The highway multi-lane scenario is a common yet challenging environment, requiring the ego vehicle (EV) to dynamically adjust its position and speed within defined lanes to ensure efficiency, action consistency, and safety. Therefore, we implement our approach in this setting.

4.1 Two Augmented MDPs Formulation

Both high-MDP and low-MDP involve three key elements: 1) Action space: The high-MDP adopts a hybrid action space for motion guidance under road constraints, focusing on long-term target position planning. In contrast, low-MDP uses a continuous action space to generate control commands, focusing on short-term speed adjustment. 2) State space: The two MDPs share a same original state space, i.e., \( {\cal S}^h = {\cal S}^l\) . 3) Reward function: The high-MDP’s reward is implicitly defined by that of the low-MDP and requires no separate design.

4.1.1 High-MDP Action Space

In lanes that are discrete laterally but continuous longitudinally, the hybrid action space \( {\cal H}^h\) of the high-level policy is defined as follows:

\[ \begin{equation} \begin{split} \begin{array}{l} \left\{ \begin{array}{l} {\cal O}: \left\{ { - {w_r},0,{w_r}} \right\}\\ {{\cal A}^h}: \left[ {\min \left( {\sqrt {4{R_0}{w_r} - w_r^2} ,\frac{{v_e^2}}{{2a_{\max }^ - }}} \right),{e^{\left| {{v_e}} \right| + {w_r}}}} \right] \end{array} \right. \end{array} \end{split} \end{equation} \]

(17)

where \( w_r\) is the lane width, \( {R_0}\) is the minimum turning radius, \( {2a_{\max }^-}\) is the maximum braking acceleration, and \( v_e\) is the EV’s speed. The \( {\cal O}\) allows \( o\) to represent lane selection, restricting the target to the current or adjacent lane. Given the EV’s state and kinematics, the \( {\cal A}^h\) allows \( a^h\) to represent the selection of a feasible target location within the chosen lane. Together, \( (o, a^h)\) specifies a target point without further processing, which can be considered as a coarse-grained motion guidance.

To provide more comprehensive and detailed guidance for low-level policy, a finer-grained motion guidance is designed using polynomial curve: \( {\cal G} = {\arg _{(x_j, y_j)}}[ y_j = \sum_{m=0}^5 c_m x_j^m ]\) , where \( j \in [1, …, g]\) . Here, \( \cal G\) is a set of \( g\) target points lying on a fifth-degree polynomial parameterized by coefficients \( c_m\) . In the Frenet frame, with the EV at the origin, \( x_j\) and \( y_j\) are the longitudinal and lateral coordinates of the \( j\) -th target point. The start point is set by the EV’s current state: \( (x_1, y_1) = (x_e, y_e)\) , with heading angle \( \varphi_1 = \varphi_e\) . The end point is given by \( \pi^h\) : \( (x_g, y_g) = (o, a^h)\) , with \( \varphi_g\) from lane information. Solving the resulting linear system yields the coefficients \( c_m\) [13]. The mapping \( (o, a^h) \equiv (x_g, y_g) \in {\cal G}\) defines an explicit representation function \( \Psi\) , which satisfies Eq. 4 given the current EV and road states.

In timestep \( T^h\) , the frenet frame moves with EV. Since the points in \( \cal G\) are defined relative to previous frenet frame, their coordinates must be updated. A simple coordinate transformation yields the updated set \( {{\cal G}^{[u]}}\) , corresponding to Eq. 5.

4.1.2 Low-MDP Action Space

The control commands required for EV driving are acceleration and steering angle. Thus, the execution action output by \( \pi^l\) consists of two items: \( a^l = (\delta_e, a_e)\) . According to general vehicle kinematics, the low-level action space is defined as:

\[ \begin{equation} \begin{split} {\cal A}^l = \left\{\left[-\pi/6 \text{ rad}, \pi/6 \text{ rad}\right], \left[-3\text{ m/s}^2,3\text{ m/s}^2\right]\right\}. \end{split} \end{equation} \]

(18)

4.1.3 State Space

Both \( \pi^h\) and \( \pi^l\) should account for the states of the EV and surrounding vehicles (SVs) in adjacent lanes. Thus, the state space is defined as:

\[ \begin{equation} \begin{split} {{\cal S}^h} \equiv {{\cal S}^l} = \left\{ {\begin{array}{*{20}{c}} {{{\left[ {I{D_{lane}},{x_e},{y_e},{\varphi _e},v_e^x,v_e^y} \right]}^{{\rm{EV}}}},}\\ {\left[ {{p_k},\Delta {x_k},\Delta {y_k},{\varphi _k},\Delta v_k^x,\Delta v_k^y} \right]_{k \in \left[ {1 \cdots 6} \right]}^{{\rm{SVs}}}} \end{array}} \right\}, \end{split} \end{equation} \]

(19)

The EV state includes lane ID, longitudinal and lateral positions, heading angle, and longitudinal and lateral speeds. SVs ahead of and behind the EV in the current and adjacent lanes are considered, with up to six SVs in total. Each SV state includes a presence flag, relative longitudinal and lateral positions, relative heading angle, and relative longitudinal and lateral speeds. The EV only considers SVs within the observation range \( \Delta x \in [-80\,\mathrm{m},\,160\,\mathrm{m}]\) .

4.1.4 Reward Function

The reward function is designed to consider driving safety, efficiency, and action consistency:

\[ \begin{equation} \begin{split} \begin{array}{l} {{\cal R}^l} = {{\cal R}_{s}} + {{\cal R}_{e}} + {{\cal R}_{c}}, {{\cal R}_s} = - 10{f_v} - 5\left( {{{\cal K}^h} + {{\cal K}^l}} \right),\\ {{\cal R}_e} = {{\left| {{v_e} - {v_*}} \right|} \left/ {\vphantom {{\left| {{v_e} - {v_*}} \right|} {{v_*}}}} \right. - {{v_*}}} - \max \left( {0,{{{v_p} - {v_e}} \left/ {\vphantom {{{v_p} - {v_e}} {{v_p}}}} \right. - {{v_p}}}} \right),\\ {{\cal R}_c} = - {{\left( {0.5\left| {{\delta _e}} \right| + 0.2\left| {\Delta {\delta _e}} \right|} \right)}} - \end{array} \end{split} \end{equation} \]

(20)

where the weights for each term reflect its relative importance. In safety reward \( {{\cal R}_s}\) , \( f_v = 1\) indicates an EV violation, such as road departure or collision with an SV. The results from the safety evaluation module are also included. In efficiency reward \( {{\cal R}_e}\) , the EV is encouraged to reach the target speed \( v_*\) , while speeds below the threshold \( v_p\) are penalized. In action consistency reward \( {{\cal R}_c}\) , fluctuations in steering and acceleration commands are penalized to promote smoothness.

4.2 Safety Mechanism Formulation

4.2.1 Risk Severity Evaluation

Artificial Potential Fields (APF) integrate discrete events into a unified field over high-dimensional observations [26], providing reliable and expressive measures of driving risk severity. Based on APF, we propose a risk severity evaluation model for \( \cal G\) :

\[ \begin{equation} \begin{split} {\cal K} = \frac{1}{g}\sum\limits_{j = 1}^g {\left[ {{{\cal I}_j} \cdot {\max }\limits_{k \in \left[ {1, \cdots ,6} \right]} \left\{ {\rho _j^k} \right\}} \right]}, \end{split} \end{equation} \]

(21)

where \( {\rho _j^k}\) is the risk potential field at the \( j\) -th point relative to the \( k\) -th SV. Since points closer to the EV are more critical, the importance at the \( i\) -th point is: \( {{\cal I}_j} = 1 - {e^{{K_r}\left( {j - g} \right)}}\) , where \( K_r\) is a decay rate coefficient. Additionally, \( {\rho _j^k}\) is defined as:

\[ \begin{equation} \begin{split} \begin{array}{l} \rho _j^k = {w_1}{e^{\left( { - \frac{1}{2}{{\rm P}_1}{{\rm B}^{ - 1}}{\rm P}_1^{\rm{T}}} \right)}} + {w_2}{e^{\left( { - \frac{1}{2}{{\rm P}_2}{{\rm B}^{ - 1}}{\rm P}_2^{\rm{T}}} \right)}}\\ {\rm B} = \left[ {\begin{array}{*{20}{c}} {X_s^2}&0\\ 0&{Y_s^2} \end{array}} \right],{{\rm{P}}_1} = \left[ {\begin{array}{*{20}{c}} {\Delta x_j^k}&{\Delta y_j^k} \end{array}} \right],\\ {{\rm{P}}_2} = \left[ {\begin{array}{*{20}{c}} {{_{\Delta a_x^k < 0}}\Delta x_j^k}&{{_{\Delta a_y^k < 0}}\Delta y_j^k} \end{array}} \right] \end{array} \end{split} \end{equation} \]

(22)

where \( {w_1} \in [0.5,1]\) and \( {w_1} + {w_2} = 1\) . In addition, \( {X_s}\) and \( {Y_s}\) are the minimum safe distances in the longitudinal and lateral directions, respectively. The \( \Delta x_j^k\) and \( \Delta y_j^k\) are the longitudinal and lateral distances of the \( j\) -th point relative to the \( k\) -th SV.

4.2.2 Prior Control Model Design

To obtain \( a^{l,r}\) and simplify the design, we combine the widely used Intelligent Driver Model (IDM) and Stanley path-tracking algorithm as a conservative prior control model. Specifically, IDM determines the acceleration based on environment, while Stanley algorithm computes the steering angle according to motion guidance.

5 Experiments

5.1 General Settings

5.1.1 Environment

The training scenario is built in Highway-Env with three defined lanes [33]. At the start of each episode, the EV and SVs are randomly placed in any lane with random initial speeds. SVs follow the IDM and MOBIL models, allowing lane changes to reach their target speeds, which may interfere with the EV. The vehicle capacity (V/C) ratio, representing traffic density.

With this setup, training is conducted for 2,000 episodes using five seeds. Testing is then performed over 100 episodes on both Highway-Env and the HighD dataset [34], with each episode limited to 100s. Notably, the traffic density in Highway-Env is set to 0.3, presenting a challenging scenario.

5.1.2 Comparison Methods

We select several popular RL-based AD methods as baselines, which have different policy structures. PPO directly outputs control commands. The General Hierarchical method (Gen-H-RL) uses a high-level RL policy to select discrete behaviors for a rule-based low-level controller. Classical Hierarchical RL (Class-HRL) combines a high-level value-based policy with a low-level actor-critic structure. RL-PTA have an implicit hierarchical policy structure with fixed timescale. MHRL-I and Skill-Critic are preliminary methods exploring multi-timescale RL policy training, with discrete and continues high-level outputs, respectively. In contrast, our method features a more advanced hierarchical policy structure: P-AC is used to generate Hybrid action-based motion guidance (path points), and a supporting Safety mechanism is also included. The details of all methods are shown in Table 1. For all MT-based methods, the high- and low-level outputs operate at timesteps of 1s and 0.1s, respectively, as this has proven effective [6].

Method	High-Level		Low-Level	Is MT
	Model	Output
PPO [1]	N/A	N/A	PPO	No
Gen-H-RL [2]	VB	Dis-Beh.	PID	Yes
Class-HRL [3]	VB	Dis-Beh.	AC	No
RL-PTA [4]	P-AC	Path Points	P-AC	No
MHRL-I [5]	VB	Dis-Beh.	AC	Yes
Skill-Critic [6]	AC	Path Points	AC	Yes
MTHRL-H (Ours)	P-AC	Path Points	AC	Yes
MTHRL-HS (Ours)	P-AC	Path Points	AC	Yes

5.1.3 Evaluation Metrics

To comprehensively evaluate the performance of each driving policy, we select key metrics across four aspects:

5.2 Performance Comparison

5.2.1 Training

The total reward curves for all methods during training are shown in Fig. 3. All methods converge after 1,800 episodes. Our proposed MTHRL-HS achieves the highest reward, indicating superior driving performance.

In Fig. 3(a), PPO converges more slowly and achieve significantly lower rewards than methods using hierarchical techniques, suggesting that direct control output hinders effective policy learning. Gen-H-RL, which applies RL only at the high level, converges fastest but yields relatively lower rewards. In contrast, Class-HRL uses RL at both levels, leading to marginally improved performance yet slightly slower convergence. The methods in Fig. 3(b) generally benefit from more advanced hierarchical structures, resulting in better policies. RL-PTA, while implicitly hierarchical with a fixed timescale, achieves strong performance with less fluctuation across different seeds. MHRL-I/Skill-Critic perform significantly worse than MTHRL-H, demonstrating that hybrid action-based motion guidance, which better aligns with road structure, leads to superior policies compared to purely continuous or discrete actions. Furthermore, MTHRL-HS, with its safety mechanism, further enhances policy performance.

Figure 3. The training process of our method with comparison methods.

5.2.2 Testing

Method	Overall Perf.	Efficiency		Action Consistency			Safety
	TR	DS \( [m/s]\)	TLC	AS \([ rad]\)	AA \( [m/s^2]\)	CDD \([ m]\)	CR	TTC-T \( [s]\)	TTC-C \( [s]\)
PPO	68.22(2.10)	8.87(2.20)	1.11(0.93)	0.086(0.184)	0.98(0.68)	0.22(0.35)	0.01%	9.14(1.85)	9.19(2.03)
Gen-H-RL	72.65(1.94)	10.43(3.41)	6.22(1.98)	0.022(0.090)	0.42(0.63)	0.09(0.24)	0.33%	7.67(1.93)	7.13(2.38)
Class-HRL	73.20(0.58)	9.99(3.16)	5.68(1.85)	0.044(0.133)	0.59(0.55)	0.12(0.29)	0.29%	7.42(2.18)	7.51(2.10)
RL-PTA	81.37(2.70)	11.21(3.02)	7.89(2.55)	0.024(0.052)	0.54(0.51)	0.13(0.25)	0.10%	9.10(1.19)	8.87(1.50)
MHRL-I	78.91(4.99)	11.03(3.60)	7.62(3.10)	0.039(0.069)	0.49(0.48)	0.12(0.19)	0.09%	9.05(1.26)	8.93(1.41)
Skill-Critic	79.26(4.01)	11.15(3.79)	7.15(2.54)	0.023(0.042)	0.50(0.44)	0.10(0.20)	0.11%	8.89(1.87)	8.73(1.95)
MTHRL-H(ours)	84.99(3.04)	12.68(2.87)	8.19(2.69)	0.012(0.040)	0.31(0.37)	0.07(0.14)	0.07%	9.14(1.36)	9.15(2.05)
MTHRL-HS(ours)	89.14(2.85)	12.61(3.81)	8.03(2.60)	0.013(0.034)	0.33(0.36)	0.07(0.16)	0.03%	9.36(1.26)	9.03(1.89)

Method	Overall Perf.	Efficiency		Action Consistency			Safety
	TR	DS \([ m/s]\)	TLC	AS \( [rad]\)	AA \( [m/s^2]\)	CDD\([ m]\)	CR	TTC-T \( [s]\)	TTC-C \( [s]\)
PPO	81.62(3.54)	10.22(1.92)	0.68(1.20)	0.081(0.188)	1.08(0.65)	0.18(0.23)	0.02%	9.23(1.88)	9.12(1.68)
Gen-H-RL	87.40(5.33)	13.08(2.40)	5.73(2.14)	0.030(0.073)	0.36(0.51)	0.07(0.18)	0.20%	8.06(1.85)	7.66(2.00)
Class-HRL	89.25(3.11)	12.82(3.15)	5.55(2.16)	0.041(0.121)	0.74(0.58)	0.11(0.20)	0.21%	7.84(1.99)	7.89(2.07)
RL-PTA	95.57(7.04)	15.01(2.98)	6.96(3.84)	0.022(0.050)	0.60(0.52)	0.12(0.23)	0.05%	9.21(1.15)	8.99(1.43)
MHRL-I	94.91(8.95)	14.13(3.63)	7.42(4.23)	0.028(0.062)	0.53(0.50)	0.12(0.18)	0.08%	9.10(1.21)	9.04(1.28)
Skill-Critic	95.30(8.01)	14.80(4.04)	7.08(3.80)	0.021(0.039)	0.59(0.39)	0.10(0.21)	0.05%	8.96(1.77)	8.97(1.81)
MTHRL-H(ours)	96.62(6.30)	16.18(3.00)	7.39(3.52)	0.013(0.041)	0.37(0.35)	0.06(0.15)	0.05%	9.15(1.76)	9.10(1.88)
MTHRL-HS(ours)	96.77(5.51)	16.09(3.44)	7.07(3.27)	0.013(0.037)	0.32(0.30)	0.07(0.17)	0.02%	9.30(1.33)	9.09(1.72)

Figure 4. The joint distribution of acceleration and speed in testing.

Figure 5. The joint distribution of steering angle and CDD in testing.

Figure 6. The distributions of TTC-C and TTC-T in testing.

The testing results further show that different methods produce distinct driving behaviors, as detailed in Table 2. Notably, our method achieves the highest TR, consistent with the training results. Compared to RL-PTA, which has the second-highest TR, MTHRL-H improves TR by 4.4%, and with the safety mechanism ‘S’, this improvement increases to 9.5%. After analyzing all the metrics, we provide additional details for some key metrics of the well-performing methods. The joint distribution of acceleration and speed is shown in Fig. 4, while the joint distribution of steering angle and CDD is shown in Fig. 5. The Fig. 6 further presents the distributions of TTC-C and TTC-T as boxplots.

For driving efficiency, MTHRL-H achieves the highest DS and TLC, with DS improved by 13.1% over the suboptimal method. This indicates that the multi-timescale hierarchical policy enables more flexible lane-changing behaviors to enhance driving efficiency, while the introduction of safety mechanism does not significantly compromise this performance. More specifically, under the same testing environment, Fig. 4 shows that the peak speed distributions for each method correspond to two potential cases: one near 20 m/s, representing opportunities for overtaking to reach the target speed; and another near 10 m/s, indicating the EV must follow SVs due to traffic congestion. Notably, MTHRL-HS achieves better lane-change timing in the former case, yielding the highest peak speed distribution. In the latter case, it effectively chooses to follow a faster SV, shifting the peak speed distribution upward. In contrast, other methods show lower driving efficiency, with worse DS, TLC, and speed distributions. In particular, PPO, lacking hierarchical structures, tend to adopt overly conservative following policies.

For driving action consistency, MTHRL-H shows the lowest means and standard deviations for AS, AA, and CDD, which are 50.0%, 26.2%, and 22.2% lower than those of the suboptimal method, respectively. The safety mechanism does not adversely affect these metrics. As shown in Fig. 4 and Fig. 5, our method produces acceleration, steering angle, and CDD values more tightly centered around 0. Even during lane changes, while CDD increases, steering angles remain smaller than with other methods. This suggests that high-level hybrid action-based motion guidance improves action consistency. It reduces fluctuations in control commands, resulting in smoother longitudinal and lateral driving behaviors. In contrast, PPO, which directly outputs control commands, lead to more fluctuating behaviors and make it harder to keep the EV on the centerline. Among hierarchical methods, generating finer-grained path points at the high level yields greater action consistency than discrete behavior generation.

For driving safety, excluding the over-conservative PPO, MTHRL-H reduces the CR by 22.2% compared to the suboptimal method. With the safety mechanism, the reduction in CR increases to 66.7%, clearly demonstrating its effectiveness in improving safety. Additionally, TTC-C and TTC-T reflect the policy’s ability in ensuring safer driving during interactions with SVs. As shown in Table 2 and Fig. 6, MTHRL-HS achieves the highest TTC-C and TTC-T, indicating a more cautious driving style. Meanwhile, the safety mechanism results in a significantly higher TTC-T than TTC-C, highlighting its ability to guide the EV into safer lanes.

5.2.3 Validation on HighD Dataset

The metrics in Table 3 show the driving performance of each method on the HighD dataset. Compared to the Highway-Env scenario, all methods perform better due to lower traffic density in HighD. MTHRL-H and MTHRL-HS remain the top performers overall, despite a smaller TR gap with other methods, demonstrating strong adaptability and robust policy performance in real traffic. Benefiting from the multi-timescale hierarchical architecture and hybrid action-based motion guidance, they lead in both efficiency and action consistency. Specifically, DS remains the highest, with only minor impact from the safety mechanism, while AS, AA, and CDD stay at the lowest levels. For safety, excluding the over-conservative PPO, MTHRL-HS further reduces the CR to 0.02% by leveraging the hierarchical safety mechanism. Therefore, above results confirm the superiority of our method across all driving metrics and its strong potential for real-world deployment.

6 Conclusion and Future Work

This paper proposes a Multi-Timescale Hierarchical RL approach for AD. The approach features a hierarchical policy structure: a high-level RL policy generates long-timescale motion guidance, while a low-level RL policy produces short-timescale vehicle control commands. Therein, a hybrid action-based explicit representation is designed for motion guidance to better adapt to structured roads and to facilitate addressing low-level state inconsistencies. In addition, supporting hierarchical safety mechanisms are introduced to enhance the safety of both high- and low-level outputs. We evaluate our approach against advanced baselines in both simulator-based and HighD data-based highway multi-lane scenarios, and conduct a comprehensive analysis of various driving behavior metrics. Results demonstrate that our approach effectively improves driving efficiency, action consistency, and safety.

Future work aims to apply our approach to more complex urban structured roads scenarios and develop advanced safety mechanisms to further enhance driving safety.

Acknowledgments

This work is supported in part by the National Science Fund for Distinguished Young Scholars of China under Grant No. 52325212.

References

[1] Dong Hu and Longfei Mo and Jingda Wu and Chao Huang “Feariosity”-Guided Reinforcement Learning for Safe and Efficient Autonomous End-to-end Navigation IEEE Robot. Autom. Lett. 2025

[2] Hojoon Lee and Takuma Seno and Jun Jet Tai and Kaushik Subramanian and Kenta Kawamoto and Peter Stone and Peter R Wurman A Champion-Level Vision-Based Reinforcement Learning Agent for Competitive Racing in Gran Turismo 7 IEEE Robot. Autom. Lett. 2025

[3] Xiaolin Tang and Bing Huang and Teng Liu and Xianke Lin Highway decision-making and motion planning for autonomous driving via soft actor-critic IEEE Trans. Veh. Technol. 2022 71 5 4706–4717

[4] Tanmay Agarwal and Hitesh Arora and Jeff Schneider Learning urban driving policies using deep reinforcement learning Proc. IEEE Intell. Transp. Syst. Conf. (ITSC) 2021 607–614

[5] Longquan Chen and Ying He and Qiang Wang and Weike Pan and Zhong Ming Joint optimization of sensing, decision-making and motion-controlling for autonomous vehicles: A deep reinforcement learning approach IEEE Trans. Veh. Technol. 2022 71 5 4642–4654

[6] Letian Wang and Jie Liu and Hao Shao and Wenshuo Wang and Ruobing Chen and Yu Liu and Steven L Waslander Efficient reinforcement learning for autonomous driving with parameterized skills and priors arXiv preprint arXiv:2305.04412 2023

[7] Yuyang Xia and Shuncheng Liu and Quanlin Yu and Liwei Deng and You Zhang and Han Su and Kai Zheng Parameterized Decision-Making with Multi-Modality Perception for Autonomous Driving IEEE Int. Conf. on Data Eng. (ICDE) 2024 4463–4476 IEEE

[8] Chao Lu and Hongliang Lu and Danni Chen and Haoyang Wang and Penghui Li and Jianwei Gong Human-like decision making for lane change based on the cognitive map and hierarchical reinforcement learning Transp. Res. Part C Emerg. Technol. 2023 156 104328

[9] Yigit Gurses and Kaan Buyukdemirci and Yildiray Yildiz Developing driving strategies efficiently: A skill-based hierarchical reinforcement learning approach IEEE Control Syst. Lett. 2024

[10] Yaping Liao and Guizhen Yu and Peng Chen and Bin Zhou and Han Li Integration of Decision-Making and Motion Planning for Autonomous Driving Based on Double-Layer Reinforcement Learning Framework IEEE Trans. Veh. Technol. 2023 73 3 3142–3158

[11] Ruibin Zhao and Zhanbo Sun and Ang Ji A deep reinforcement learning approach for automated on-ramp merging Proc. IEEE Intell. Transp. Syst. Conf. (ITSC) 2022 3800–3806 IEEE

[12] Zhihao Zhang and Ekim Yurtsever and Keith A Redmill Extensive Exploration in Complex Traffic Scenarios using Hierarchical Reinforcement Learning arXiv preprint arXiv:2501.14992 2025

[13] Guizhe Jin and Zhuoren Li and Bo Leng and Wei Han and Lu Xiong Stability Enhanced Hierarchical Reinforcement Learning for Autonomous Driving with Parameterized Trajectory Action Proc. IEEE Intell. Transp. Syst. Conf. (ITSC) 2024

[14] Jiankun Peng and Siyu Zhang and Yang Zhou and Zhibin Li An integrated model for autonomous speed and lane change decision-making based on deep reinforcement learning IEEE Trans. Intell.Transp. Syst. 2022 23 11 21848–21860

[15] Longquan Chen and Ying He and Weike Pan and F Richard Yu and Zhong Ming A Novel Generalized Meta Hierarchical Reinforcement Learning Method for Autonomous Vehicles IEEE Network 2023 37 4 230–236

[16] Sang-Hyun Lee and Yoonjae Jung and Seung-Woo Seo Imagination-Augmented Hierarchical Reinforcement Learning for Safe and Interactive Autonomous Driving in Urban Environments IEEE Trans on Intell. Transp. Syst. 2024

[17] Jiao Wang and Haoyi Sun and Can Zhu Vision-based autonomous driving: A hierarchical reinforcement learning approach IEEE Trans. Veh. Technol. 2023 72 9 11213–11226

[18] Guizhe Jin and Zhuoren Li and Bo Leng and Minhua Shao Deep reinforcement learning Lane-Change Decision-Making for autonomous vehicles based on motion primitives library in hierarchical action space Artif. Intell. Auton. Syst. 2024 1 2 1–2

[19] Hongliang Lu and Chao Lu and Yang Yu and Guangming Xiong and Jianwei Gong Autonomous overtaking for intelligent vehicles considering social preference based on hierarchical reinforcement learning Automotive Innovation 2022 5 2 195–208

[20] Tong Zhou and Letian Wang and Ruobing Chen and Wenshuo Wang and Yu Liu Accelerating reinforcement learning for autonomous driving using task-agnostic and ego-centric motion skills Proc. IEEE/RSJ Int. Conf. Intell. Rob. Syst (IROS) 2023 11289–11296

[21] Murtaza Dalal and Deepak Pathak and Russ R Salakhutdinov Accelerating robotic reinforcement learning via parameterized action primitives Adv. Neural Inf. Process. Syst. (NeurIPS) 2021 34 21847–21859

[22] Siyuan Chen and Li Yang and Zihao Mao and Mingyu Hou and Liu He and Wenjie Song Unified Planning Framework with Drivable Area Attention Extraction for Autonomous Driving in Urban Scenarios IEEE Robot. Autom. Lett. 2025

[23] Yuan Lin and Xiao Liu and Zishun Zheng Discretionary lane-change decision and control via parameterized soft actor–critic for hybrid action space Machines 2024 12 4 213

[24] Guizhe Jin and Zhuoren Li and Bo Leng and Wei Han and Lu Xiong and Chen Sun Hybrid Action Based Reinforcement Learning for Multi-Objective Compatible Autonomous Driving arXiv preprint arXiv:2501.08096 2025

[25] Zhuoren Li and Guizhe Jin and Ran Yu and Bo Leng and Lu Xiong Interaction-Aware Deep Reinforcement Learning Approach Based on Hybrid Parameterized Action Space for Autonomous Driving Proc. SAE Intell. Connected Veh. Symposium (SAE ICVS) 2024

[26] Di Chen and Hao Li and Zhicheng Jin and Huizhao Tu and Meixin Zhu Risk-Anticipatory Autonomous Driving Strategies Considering Vehicles’ Weights Based on Hierarchical Deep Reinforcement Learning IEEE Trans on Intell. Transp. Syst. 2024

[27] Qiangqiang Guo and Ohay Angah and Zhijun Liu and Xuegang Jeff Ban Hybrid deep reinforcement learning based eco-driving for low-level connected and automated vehicles along signalized corridors Transp. Res. Part C Emerg. Technol. 2021 124 102980

[28] Zhiqi Mao and Yang Liu and Xiaobo Qu Integrating big data analytics in autonomous driving: An unsupervised hierarchical reinforcement learning approach Transp. Res. Part C Emerg. Technol. 2024 162 104606

[29] Kyowoon Lee and Seongun Kim and Jaesik Choi Adaptive and explainable deployment of navigation skills via hierarchical deep reinforcement learning Proc. IEEE Int. Conf. Robot. Autom. (ICRA) 2023 1673–1679 IEEE

[30] Ce Hao and Catherine Weaver and Chen Tang and Kenta Kawamoto and Masayoshi Tomizuka and Wei Zhan Skill-critic: Refining learned skills for hierarchical reinforcement learning IEEE Robot. Autom. Lett. 2024

[31] S. Zhang and S. Whiteson, “Dac: The double actor-critic architecture for learning options,'' Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 32, 2019.

[32] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, and H. Liu, “Parametrized deep q-networks learning: Reinforcement learning with discrete-continuous hybrid action space,'' arXiv preprint arXiv:1810.06394, 2018.

[33] E. Leurent, “An environment for autonomous driving decision-making.'' https://github.com/eleurent/highway-env, 2018.

[34] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein, “The highd dataset: A drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems,'' in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), pp. 2118–2125, 2018.