NOML-NOML: hierarchical TD3 + anchor policy for flight control [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I built a custom RL algorithm for continuous flight control and open-sourced it. Sharing here in case the structural ideas are useful for anyone doing continuous control where one action axis dominates.
I've been training continuous control on a 6-DoF flight sim (pitch/roll/yaw/throttle/brake/fire) and kept hitting the same wall: vanilla TD3 would peak, then collapse into pitch oscillation and never recover. I tried reward shaping for a while before concluding the problem was structural, not in the reward. NOML is what came out of that.
Three structural changes on top of a standard TD3 skeleton:
- Anchor policy — the action is
anchor + delta·gate, where the anchor is a fixed safe action (wings level, MIL throttle). The policy literally cannot fully forget how to fly straight; the worst a collapsed policy can do is fall back to the anchor. - Hierarchical actor — three MLPs with independent optimizers (pitch → roll → rest), so a roll-side gradient update can't corrupt the pitch head. This is what actually killed the oscillation for me.
- Mirror learning — left-right symmetry means every transition can be mirrored into a free second sample. 2× data when env steps are the bottleneck.
One thing that surprised me and goes against the usual advice: my best results came with exploration noise effectively off. On this task adding Gaussian action noise mostly just shook the stick and hurt. The anchor+gate structure seems to provide enough of the "fall back to safe behavior" role that noise usually plays.
Code (Apache 2.0), full writeup, and a test video are here: https://github.com/9138noms/NOML
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.