Latent Policy Steering through One-Step Flow Policies

Preprint

1Yonsei University, 2Microsoft Research

TL;DR: Robust, Tuning-Free Offline RL via Steering a Latent Actor through a One-Step Policy

Abstract

Offline reinforcement learning (RL) should be ideal for robotics, allowing policies to be learned from pre-collected datasets without risky exploration. Yet offline RL's performance often hinges on a brittle trade-off between (1) return maximization, which can push policies outside dataset support, and (2) behavioral constraints, which typically require sensitive hyperparameter tuning. Latent steering offers a structural way to stay within dataset support during RL, but existing offline adaptations commonly approximate action values with latent-space critics learned via indirect distillation, which can lose information and hinder convergence. We propose Latent Policy Steering (LPS), which enables high-fidelity latent policy improvement by backpropagating original-action-space Q-gradients through a differentiable one-step MeanFlow policy to update a latent-action-space actor. By eliminating proxy latent critics, LPS allows an original-action-space critic to guide end-to-end latent-space optimization, while the one-step MeanFlow policy serves as a behavior-constrained generative prior. This decoupling yields a robust method that works out of the box with minimal tuning. Across OGBench and real-world robotic tasks, LPS achieves state-of-the-art performance and consistently outperforms behavioral cloning and strong latent-steering baselines.


Limitations of Prior Works

Most state-of-the-art offline RL algorithms follow the TD3+BC paradigm, aiming to maximize return while constraining the learned policy to dataset support by adding a regularization term weighted by a hyperparameter α. This introduces a delicate trade-off: weak regularization leads to out-of-distribution actions and extrapolation error, while excessive regularization reduces offline RL to behavioral cloning. Meanwhile, latent steering methods like DSRL resolve this trade-off structurally but require learning a latent-space critic via indirect distillation, which can be lossy and produce inaccurate gradients.
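To make the trade-off concrete, here is a minimal numeric sketch of a TD3+BC-style actor objective; the function, state, and values are illustrative, not the paper's implementation:

```python
import numpy as np

# Illustrative TD3+BC-style actor loss (hypothetical names): maximize the
# critic value Q(s, pi(s)) while penalizing distance to the dataset action,
# with the balance controlled by the regularization weight alpha.
def td3_bc_actor_loss(q_value, policy_action, dataset_action, alpha):
    bc_term = np.sum((policy_action - dataset_action) ** 2)
    return -q_value + alpha * bc_term

q = 1.0                          # critic value at the policy's action
pi_a = np.array([0.2, 0.1])      # action proposed by the policy
a_data = np.array([0.0, 0.0])    # action from the dataset

# Small alpha: the value term dominates (risking out-of-support actions);
# large alpha: the BC term dominates (collapsing toward behavioral cloning).
loss_weak = td3_bc_actor_loss(q, pi_a, a_data, alpha=0.01)
loss_strong = td3_bc_actor_loss(q, pi_a, a_data, alpha=100.0)
```

The same policy action receives very different losses depending on α, which is why the best α shifts with reward scale and dataset diversity.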


Behavioral constraint is too sensitive

Figure: sensitivity to the regularization weight α.

The way behavior-regularized offline RL methods balance value maximization and behavioral adherence, via a weighting hyperparameter α, can be fragile even in simple settings. A large α yields overly conservative policies, while a small α encourages out-of-support actions. The best α is highly sensitive to reward scale, dataset diversity, and model capacity, making extensive hyperparameter sweeps feasible in simulation but prohibitively expensive and risky on real-world robots.

Latent-space critic is not enough

Figure: latent-space critic gradient mismatch.

Latent steering methods (e.g., DSRL) optimize latents using a value function defined in the latent space, typically obtained by distilling the action-space critic through a frozen decoder. However, matching values does not guarantee that the latent gradients used for improvement are accurate. Even when the distilled Q approximates values reasonably well, its gradient direction ∇zQϕ(s,z) can deviate substantially from that of the action-space critic, particularly near sharp boundaries of the data manifold. Such gradient mismatch can lead to suboptimal latent updates and degrade purely offline performance.
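A toy one-dimensional example (unrelated to any learned model) illustrates why value agreement alone is insufficient: a function can approximate another to within 0.01 everywhere while its gradient deviates by an order of magnitude more.

```python
import numpy as np

# f is a stand-in "true" critic along one latent direction; g approximates
# its values to within 0.01, yet g's gradient oscillates far from f' = 1.
x = np.linspace(0.0, 1.0, 1001)
f = x                            # true critic slice, gradient exactly 1
g = x + 0.01 * np.sin(100 * x)   # distilled approximation with tiny value error

value_gap = np.max(np.abs(g - f))               # bounded by 0.01
grad_gap = np.max(np.abs(np.gradient(g, x) - 1.0))  # close to 1.0
```

Small value error, large gradient error: exactly the failure mode a distilled latent critic can exhibit near sharp regions of the data manifold.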


Latent Policy Steering (LPS)

We propose Latent Policy Steering (LPS), which addresses both limitations above. First, LPS avoids the explicit behavior-regularization trade-off by separating reward maximization and distributional constraints: a fixed generative behavior policy defines the support, while a latent actor performs value-driven steering (resolving α-sensitivity). Second, LPS eliminates proxy latent critics by directly backpropagating action-space critic gradients through a differentiable generative base policy to update the latent actor (avoiding the inaccurate latent critic).

1. Differentiable Base Policy via MeanFlow

The base policy πβ defines the "safe manifold" of the dataset. LPS treats it as a differentiable mapping, enabling gradients from the action-space critic to backpropagate to the latent actor. We employ MeanFlow, which provides efficient one-step deterministic generation via a single integration step of the flow ODE: â = z − uβ(z, 0, 1), where z is the latent noise and uβ is the learned average-velocity field. A noise-to-action reformulation predicts the denoised action directly rather than the displacement, improving numerical stability.
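A minimal sketch of one-step generation under this scheme; a fixed linear map is a hypothetical stand-in for the learned average-velocity network uβ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the learned average-velocity field u_beta(z, r, t); a real
# model would condition on the state and the time pair (r, t).
W = 0.1 * rng.normal(size=(4, 4))
def u_beta(z, r, t):
    return W @ z

# One-step MeanFlow sampling: draw latent noise z, then integrate the flow
# ODE in a single step over [0, 1]: a_hat = z - u_beta(z, 0, 1).
z = rng.normal(size=4)
a_hat = z - u_beta(z, 0.0, 1.0)
```

Because generation is a single differentiable map from z to â, gradients with respect to z are cheap and exact, with no multi-step integration to backpropagate through.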

2. Spherical Latent Geometry

To prevent the "norm explosion" problem, where the latent actor inflates ‖z‖ to query atypical regions of the base policy prior, we leverage the concentration-of-measure property of high-dimensional Gaussians. Both the base-policy latent and the latent actor's output are constrained to the hypersphere Sᵈ⁻¹ of radius √d, keeping latent queries within the valid coverage of the base policy while maintaining well-conditioned gradients.

3. Direct Latent Policy Steering

The latent actor πϕ is optimized directly using the action-space critic Qθ:

ℒLPS(ϕ) = −𝔼s∼D[Qθ(s, πβ(s, πϕ(s)))]

Gradients propagate through πβ via the chain rule, yielding low-variance latent updates without introducing a proxy latent critic Q(s, z). No explicit behavior-regularization coefficient α is needed: behavioral constraints are enforced structurally by the fixed generative prior.
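With linear stand-ins for the actor, base policy, and critic (all hypothetical, chosen so the chain rule is explicit), the pulled-back gradient can be checked numerically:

```python
import numpy as np

# Toy linear maps: latent actor z = A s, frozen base policy a = B z, and a
# critic slice Q(s, a) = q . a. The chain rule gives dQ/dz = B^T q, so the
# action-space critic steers the latent with no latent-space critic trained.
rng = np.random.default_rng(1)
s = rng.normal(size=3)
A = rng.normal(size=(4, 3))   # latent actor parameters
B = rng.normal(size=(2, 4))   # frozen one-step base policy
q = rng.normal(size=2)        # action-space critic gradient dQ/da

z = A @ s
a = B @ z
Q = q @ a

grad_z = B.T @ q              # critic gradient pulled back through the base policy
grad_z_fd = np.array([        # finite-difference check of the same gradient
    (q @ (B @ (z + 1e-6 * np.eye(4)[i])) - Q) / 1e-6 for i in range(4)
])
```

The analytic pullback matches the finite-difference estimate, which is exactly the quantity a distilled latent critic would only approximate.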

Key Insight: LPS structurally decouples behavioral constraints from reward maximization—the fixed MeanFlow base policy confines output to the data manifold, while the latent actor freely maximizes Q-value within that manifold. This eliminates the need for sensitive α tuning.

Experiments

We evaluate LPS across OGBench manipulation tasks in simulation and real-world robotic manipulation tasks on the DROID platform. All methods share the same Q-Chunking (QC) value-learning backbone and differ only in policy extraction strategy. We compare against: QC-FQL and QC-MFQL (action-space distillation baselines), DSRL (latent steering with latent-space critic), and CFGRL (inference-time classifier-free guidance).

Real-World (Franka)

Our real-world experiments use the DROID platform with a Franka Research 3 robot across four manipulation tasks: pick-and-place carrots, eggplant to bin, plug in bulb, and refill tape. We collected 50 human-teleoperated demonstrations per task. Across all tasks, LPS achieves the highest success rates and the best average performance (56.2%), outperforming both behavioral cloning baselines (Flow-BC: 31.2%, MF-BC: 28.7%) and prior latent-steering methods (DSRL: 35.0%). LPS is particularly effective on precision-critical tasks where DSRL struggles.

Simulation (OGBench)

We evaluate on five state-based manipulation tasks from OGBench (cube-single, cube-double, scene-sparse, puzzle-3x3-sparse, puzzle-4x4) and pixel-based visual task variants. LPS consistently outperforms the one-step distillation baselines (QC-FQL and QC-MFQL). DSRL exhibits higher variance and performs poorly on the challenging cube-double domain, highlighting the limitations of relying on a distilled latent-space critic. CFGRL underperforms explicit policy extraction methods, suggesting that inference-time guidance alone provides weaker improvement signals than direct critic-based optimization.


Robustness to α

Figure: sensitivity to the regularization weight α.

To evaluate robustness to behavior-regularization tuning, we sweep α on representative OGBench tasks. QC-MFQL exhibits a sharp performance peak at a specific α, with success rates dropping rapidly when α deviates from the task-specific optimum. In contrast, LPS remains stable across a wide range of α, consistent with its design goal of decoupling policy improvement from explicit behavior-regularization weights.

When BC Fails and How LPS Improves

Figure: BC failure modes and LPS corrections.

Our dataset consists solely of successful trajectories, which are often suboptimal due to human teleoperation artifacts like hesitation, micro-corrections, and jittery motions. BC baselines frequently suffer from premature release, repetitive motion loops, and freezing during precision alignment. LPS effectively mitigates these issues by steering the latent policy toward high-value regions, enabling the agent to execute decisive actions where BC would otherwise stall or oscillate.


Ablation Study

We conduct comprehensive ablation studies to validate three key design choices in LPS:

  • Latent-space geometry: Replacing the spherical prior with a standard normal or truncated normal significantly reduces performance. Without a spherical constraint, latent optimization tends to increase |z| (norm growth), pushing the actor into atypical regions. The spherical constraint prevents this norm explosion and yields stable optimization.
  • One-step generative backbone: Comparing against flow-matching variants shows that standard FM vector fields incur large approximation errors under single-step integration, and multi-step backpropagation introduces additional instability. MeanFlow enables high-fidelity one-step generation and stable end-to-end gradients.
  • Noise-to-action reformulation: Training LPS with the original MeanFlow parameterization leads to unstable learning, whereas the reformulated objective consistently yields strong performance, highlighting the importance of grounding training in the action space.

Together, these ablations show that each component of LPS contributes meaningfully to its overall performance.


Contributions

  1. We identify two practical bottlenecks for real-world offline RL: the sensitivity induced by explicit behavior regularization and the approximation error induced by indirect latent distillation (e.g., noise aliasing).
  2. We propose Latent Policy Steering (LPS), which structurally decouples behavioral constraints from reward maximization by enabling direct latent policy improvement via backpropagation through a differentiable one-step generative model.
  3. We demonstrate that LPS achieves state-of-the-art performance on OGBench and exhibits superior practicality on real-world robotic manipulation, consistently outperforming behavioral cloning without task-specific tuning.

BibTeX

@article{im2025lps,
  title={Latent Policy Steering through One-Step Flow Policies},
  author={Hokyun Im and Andrey Kolobov and Jianlong Fu and Youngwoon Lee},
  year={2025},
}