r/LocalLLaMA · 1 min read

Training a Qwen 3.5 4B/9B agent for multi-tool use: SFT first or go directly to RL?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I want to train Qwen 3.5 4B or 9B for a custom multi-tool agent workflow and would appreciate guidance from people who have done this successfully.

A few questions:

  1. SFT → RL or RL-only?

    - Is it still recommended to first do supervised fine-tuning (tool-calling traces, reasoning trajectories, etc.) and then apply RL?

    - Or are people seeing good results with RL-based training directly for tool-use tasks?

  2. Reward design

    - How do you design reward functions for tool-use agents?

  3. Parallel tool execution

    - One complication in my workflow:

      - Tool A returns N items

      - The agent must call Tool B N times, potentially in parallel

      - Then aggregate the results

    - How would you represent and train this behavior?
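To make the workflow in question concrete, here is a minimal sketch of the fan-out and aggregate pattern in Python with `asyncio`. The functions `tool_a`, `tool_b`, and `run_workflow`, along with their signatures and return values, are hypothetical stand-ins for illustration only. They are not part of any real tool set or of Qwen's tool-calling format.

```python
import asyncio

# Hypothetical stand-ins for "Tool A" and "Tool B"; names, signatures,
# and return values are assumptions made purely for illustration.
async def tool_a(query: str) -> list[str]:
    """Return N items for a query (here: three fake items)."""
    return [f"{query}-item-{i}" for i in range(3)]

async def tool_b(item: str) -> dict:
    """Process one item returned by Tool A."""
    await asyncio.sleep(0)  # placeholder for real I/O latency
    return {"item": item, "length": len(item)}

async def run_workflow(query: str) -> dict:
    # Step 1: Tool A returns N items.
    items = await tool_a(query)
    # Step 2: call Tool B once per item, concurrently.
    results = await asyncio.gather(*(tool_b(item) for item in items))
    # Step 3: aggregate the N results into one summary.
    return {
        "count": len(results),
        "total_length": sum(r["length"] for r in results),
    }

summary = asyncio.run(run_workflow("doc"))
print(summary)
```

`asyncio.gather` returns results in the same order as its inputs, so the aggregation step can rely on the results lining up with the items from Tool A.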

For those who have trained production-quality tool-use models, what training recipe worked best?

submitted by /u/siri_1110