Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hello,
I have a task to fine-tune small LLMs on annotated conversational data. The dataset contains not only the final answers, but also reasoning traces and tool-calling decisions (i.e., when the model should think and when it should call a tool).
I am wondering what the best training approach would be and why.
My current dataset is stored in a chat format similar to this:
```text
system
user
assistant_think
assistant_tool
assistant_answer
user
assistant_think
assistant_tool
assistant_answer
...
```
My current idea is to split each conversation into multiple training samples. For example, if a conversation contains two user turns, I would create two samples:
Sample 1
```text
system
user
assistant_think
assistant_tool
assistant_answer
```
Sample 2
```text
system
user
assistant_think
assistant_tool
assistant_answer
user
assistant_think
assistant_tool
assistant_answer
```
In other words, each sample contains all previous conversation history up to the assistant response being trained.
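To make the splitting concrete, here is a minimal sketch of how I imagine it. It assumes each conversation is a list of `{"role": ..., "content": ...}` dicts using the role names above; the helper name `split_into_samples` is made up for illustration and does not come from any library.

```python
from typing import Dict, List

Message = Dict[str, str]

# Role names follow the chat format described above (an assumption about the data).
ASSISTANT_ROLES = {"assistant_think", "assistant_tool", "assistant_answer"}


def split_into_samples(conversation: List[Message]) -> List[List[Message]]:
    """Return one sample per completed assistant turn.

    Each sample contains the full history up to and including the
    assistant messages that respond to the most recent user message.
    """
    samples: List[List[Message]] = []
    for i, msg in enumerate(conversation):
        is_last = i == len(conversation) - 1
        next_is_user = not is_last and conversation[i + 1]["role"] == "user"
        # An assistant turn ends when the next message is a user message
        # or when the conversation ends.
        if msg["role"] in ASSISTANT_ROLES and (is_last or next_is_user):
            samples.append(conversation[: i + 1])
    return samples


if __name__ == "__main__":
    roles = [
        "system", "user", "assistant_think", "assistant_tool", "assistant_answer",
        "user", "assistant_think", "assistant_tool", "assistant_answer",
    ]
    conversation = [{"role": r, "content": f"<{r}>"} for r in roles]
    for n, sample in enumerate(split_into_samples(conversation), start=1):
        print(f"Sample {n}:", [m["role"] for m in sample])
```

For a conversation with two user turns this yields the two samples shown above: the first with five messages, the second with all nine.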
For training, the loss would be computed only on the assistant-generated tokens (`assistant_think`, `assistant_tool`, `assistant_answer`), while the system and user messages would be masked out from the loss.
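A rough sketch of the masking I have in mind, assuming the conversation has already been tokenized into `(role, token_ids)` segments. `build_labels` and `train_last_turn_only` are hypothetical names for illustration. The value `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying that label contribute nothing to the loss. Shifting labels for next-token prediction is assumed to happen inside the model or trainer.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
ASSISTANT_ROLES = {"assistant_think", "assistant_tool", "assistant_answer"}

Segment = Tuple[str, List[int]]  # (role, token ids of that message)


def build_labels(
    segments: List[Segment], train_last_turn_only: bool = False
) -> Tuple[List[int], List[int]]:
    """Concatenate segments into input_ids and build loss labels.

    System and user tokens are always masked. If train_last_turn_only is
    set, assistant tokens before the final user message are masked too.
    """
    # Position of the final user message; assistant segments after it
    # make up the response being trained in this sample.
    last_user = max(
        (i for i, (role, _) in enumerate(segments) if role == "user"),
        default=-1,
    )
    input_ids: List[int] = []
    labels: List[int] = []
    for i, (role, ids) in enumerate(segments):
        trainable = role in ASSISTANT_ROLES and (
            not train_last_turn_only or i > last_user
        )
        input_ids.extend(ids)
        labels.extend(ids if trainable else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


if __name__ == "__main__":
    # Toy token ids, purely illustrative.
    segments = [
        ("system", [1]), ("user", [2]),
        ("assistant_think", [3]), ("assistant_answer", [4]),
        ("user", [5]),
        ("assistant_think", [6]), ("assistant_answer", [7]),
    ]
    print(build_labels(segments))
    print(build_labels(segments, train_last_turn_only=True))
```

One thing I am unsure about: if every conversation is split into prefix samples and the loss covers all assistant tokens in each sample, the earlier turns get trained on several times. The `train_last_turn_only` flag is one way to avoid that. The other would be to keep one sample per conversation and train on all assistant turns at once.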
Is this approach correct, or is there a better way to structure the training data for reasoning and tool-calling behavior?
My second question is about reinforcement learning.
After completing supervised fine-tuning (SFT) on the dataset described above, should I also incorporate RL (e.g., PPO, GRPO, DPO, or another approach) to further train the model on when a tool should or should not be called?
If so:
- What advantages would RL provide over SFT alone for tool use and reasoning?
- How would you design the reward function?
- Under what circumstances is RL actually necessary, and when is SFT sufficient?
I would appreciate any practical advice, papers, blog posts, or open-source examples related to training reasoning and tool-calling models.