1. Differentiable Base Policy via MeanFlow
The base policy πβ defines the "safe manifold" of the dataset. LPS treats it as a
differentiable mapping, enabling backpropagation of gradients from the action-space critic
to the latent actor. We employ MeanFlow, whose learned average-velocity field uβ yields
efficient one-step deterministic generation: â = z − uβ(z, 0, 1).
A noise-to-action reformulation predicts the denoised action directly rather than the
displacement from noise, improving numerical stability.
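The one-step decode above can be sketched as follows; this is a minimal illustration, assuming uβ is an MLP conditioned on the state, the latent, and the interval endpoints (r, t), with the latent sharing the action's dimensionality. Network width and conditioning scheme are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MeanFlowBasePolicy(nn.Module):
    """Sketch of a one-step MeanFlow base policy pi_beta (architecture assumed).

    u_beta approximates the average velocity of the probability-flow ODE over an
    interval [r, t]; one-step sampling evaluates it once at (r, t) = (0, 1).
    """
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        # Inputs: state s, latent z (same dim as the action), and the pair (r, t).
        self.u_beta = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # One-step MeanFlow sampling: a_hat = z - u_beta(z, 0, 1).
        rt = z.new_tensor([0.0, 1.0]).expand(z.size(0), 2)
        return z - self.u_beta(torch.cat([s, z, rt], dim=-1))
```

Because the whole map from z to â is a single network evaluation, gradients from a downstream critic flow back to z without ODE-solver unrolling.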
2. Spherical Latent Geometry
To prevent the "norm explosion" problem, in which the latent actor inflates ∥z∥ to query
atypical regions of the base policy's prior, we leverage the concentration-of-measure property of
high-dimensional Gaussians. Both the base policy latent and the latent actor output are constrained
to the hypersphere S^{d−1} of radius √d, keeping latent queries within the valid
coverage of the base policy while maintaining well-conditioned gradients.
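The spherical constraint amounts to a differentiable rescaling of each latent onto the radius-√d sphere, matching the typical norm of a d-dimensional standard Gaussian. A minimal sketch (the epsilon guard against zero norms is an implementation assumption):

```python
import torch

def project_to_sphere(z: torch.Tensor) -> torch.Tensor:
    """Project latents onto the hypersphere of radius sqrt(d), where d is the
    latent dimension. A standard d-dimensional Gaussian concentrates near this
    radius, so projected queries stay in the base policy's well-covered region."""
    d = z.size(-1)
    norm = z.norm(dim=-1, keepdim=True).clamp_min(1e-8)  # avoid division by zero
    return z * (d ** 0.5) / norm
```

Applied to both the sampled prior latents and the latent actor's outputs, the projection is smooth away from the origin, so gradients pass through it unimpeded.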
3. Direct Latent Policy Steering
The latent actor πϕ is optimized directly using the action-space critic Qθ:
ℒLPS = −𝔼s∼D[Qθ(s, πβ(s, πϕ(s)))]
Gradients propagate through πβ via the chain rule, yielding low-variance latent updates
without introducing a proxy latent critic Q(s, z). No explicit behavior-regularization coefficient α is needed:
behavioral constraints are enforced structurally by the fixed generative prior.
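The actor objective is a direct transcription of ℒLPS: decode the latent through the frozen base policy and ascend the critic. The sketch below assumes the base policy's parameters are already frozen (e.g. via `requires_grad_(False)`); the module interfaces are illustrative, not the paper's exact API.

```python
import torch
import torch.nn as nn

def lps_actor_loss(states, latent_actor, base_policy, critic):
    """L_LPS = -E_{s~D}[ Q_theta(s, pi_beta(s, pi_phi(s))) ].

    Gradients reach pi_phi through the differentiable one-step decode of the
    (frozen) base policy; no proxy latent critic Q(s, z) and no alpha term.
    """
    z = latent_actor(states)          # latent query (ideally sphere-projected)
    actions = base_policy(states, z)  # differentiable one-step MeanFlow decode
    return -critic(states, actions).mean()
```

Note that only the latent actor's parameters are updated by this loss; the base policy acts purely as a fixed, differentiable decoder.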
Key Insight: LPS structurally decouples behavioral constraints from reward
maximization—the fixed MeanFlow base policy confines output to the data manifold, while the latent
actor freely maximizes Q-value within that manifold. This eliminates the need for sensitive α tuning.