r/MachineLearning · May 20, 2026 · 1 min read

NOML-NOML: hierarchical TD3 + anchor policy for flight control [P]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

I built a custom RL algorithm for continuous flight control and open-sourced it. Sharing here in case the structural ideas are useful for anyone doing continuous control where one action axis dominates.

I've been training continuous control on a 6-DoF flight sim (pitch/roll/yaw/throttle/brake/fire) and kept hitting the same wall: vanilla TD3 would peak, then collapse into pitch oscillation and never recover. I tried reward shaping for a while before concluding the problem was structural, not in the reward. NOML is what came out of that.

Three structural changes on top of a standard TD3 skeleton:

Anchor policy — the action is anchor + delta·gate, where the anchor is a fixed safe action (wings level, MIL throttle). The policy literally cannot fully forget how to fly straight; the worst a collapsed policy can do is fall back to the anchor.
Hierarchical actor — three MLPs with independent optimizers (pitch → roll → rest), so a roll-side gradient update can't corrupt the pitch head. This is what actually killed the oscillation for me.
Mirror learning — left-right symmetry means every transition can be mirrored into a free second sample. 2× data when env steps are the bottleneck.

One thing that surprised me and goes against the usual advice: my best results came with exploration noise effectively off. On this task adding Gaussian action noise mostly just shook the stick and hurt. The anchor+gate structure seems to provide enough of the "fall back to safe behavior" role that noise usually plays.

Code (Apache 2.0), full writeup, and a test video are here: https://github.com/9138noms/NOML

https://www.youtube.com/watch?v=ZNn6wo_PX8Y

submitted by /u/9138NOMS
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/MachineLearning