r/MachineLearning · June 1, 2026 · 1 min read

Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

Hello,

I have a task to fine-tune small LLMs on annotated conversational data. The dataset contains not only the final answers, but also reasoning traces and tool-calling decisions (i.e., when the model should think and when it should call a tool).

I am wondering what the best training approach would be and why.

My current dataset is stored in a chat format similar to this:

```text
system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer
...
```

My current idea is to split each conversation into multiple training samples. For example, if a conversation contains two user turns, I would create two samples:

Sample 1

```text
system
user
assistant_think
assistant_tool
assistant_answer
```

Sample 2

```text
system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer
```

In other words, each sample contains all previous conversation history up to the assistant response being trained.
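A minimal sketch of this split, assuming each conversation is a list of dicts with hypothetical `role` and `content` keys (my actual storage format may differ). The function name `split_into_samples` and the placeholder contents are illustrative only:

```python
from typing import Dict, List

Message = Dict[str, str]

# Roles produced by the model (taken from the chat format above).
ASSISTANT_ROLES = {"assistant_think", "assistant_tool", "assistant_answer"}


def split_into_samples(conversation: List[Message]) -> List[List[Message]]:
    """Return one sample per assistant response.

    Each sample holds the full history up to and including the last
    assistant message of that response.
    """
    samples = []
    for i, msg in enumerate(conversation):
        is_assistant = msg["role"] in ASSISTANT_ROLES
        next_is_assistant = (
            i + 1 < len(conversation)
            and conversation[i + 1]["role"] in ASSISTANT_ROLES
        )
        # A response ends where an assistant message is not followed
        # by another assistant message.
        if is_assistant and not next_is_assistant:
            samples.append(conversation[: i + 1])
    return samples


# Hypothetical two-turn conversation with placeholder contents.
roles = [
    "system",
    "user", "assistant_think", "assistant_tool", "assistant_answer",
    "user", "assistant_think", "assistant_tool", "assistant_answer",
]
conversation = [{"role": r, "content": f"<{r} text>"} for r in roles]
samples = split_into_samples(conversation)
```

With the two-turn conversation above, this yields two samples: the first ends at the first `assistant_answer`, the second contains the whole conversation.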

For training, the loss would be computed only on the assistant-generated tokens:

```text
assistant_think
assistant_tool
assistant_answer
```

while the system and user messages would be masked out from the loss.
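A minimal sketch of the masking, assuming each sample has already been tokenized into per-message segments of `(role, token_ids)`. The token ids below are made up, and `build_labels` and `last_turn_only` are hypothetical names. The sketch relies on the convention that label `-100` is ignored by PyTorch's `CrossEntropyLoss` by default:

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
ASSISTANT_ROLES = {"assistant_think", "assistant_tool", "assistant_answer"}


def build_labels(
    segments: List[Tuple[str, List[int]]],
    last_turn_only: bool = True,
) -> Tuple[List[int], List[int]]:
    """Concatenate segments into input_ids and build masked labels.

    Non-assistant tokens always get IGNORE_INDEX. If last_turn_only is
    True (an assumption, not something the dataset dictates), assistant
    tokens from earlier turns are masked as well.
    """
    # Position of the last user segment; assistant segments after it
    # form the response being trained.
    last_user = max(
        (i for i, (role, _) in enumerate(segments) if role == "user"),
        default=-1,
    )
    input_ids: List[int] = []
    labels: List[int] = []
    for i, (role, ids) in enumerate(segments):
        trainable = role in ASSISTANT_ROLES and (
            not last_turn_only or i > last_user
        )
        input_ids.extend(ids)
        labels.extend(ids if trainable else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


# Hypothetical single-turn sample with made-up token ids.
segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant_think", [6, 7]),
    ("assistant_tool", [8]),
    ("assistant_answer", [9, 10]),
]
input_ids, labels = build_labels(segments)
```

The `last_turn_only` flag reflects one open choice in the setup above: since every sample repeats the earlier history, training on all assistant tokens in every sample would count earlier responses more than once.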

Is this approach correct, or is there a better way to structure the training data for reasoning and tool-calling behavior?

My second question is about reinforcement learning.

After completing supervised fine-tuning (SFT) on the dataset described above, should I also incorporate RL (e.g., PPO, GRPO, DPO, or another approach) to further train the model on when a tool should or should not be called?

If so:

- What advantages would RL provide over SFT alone for tool use and reasoning?
- How would you design the reward function?
- Under what circumstances is RL actually necessary, and when is SFT sufficient?

I would appreciate any practical advice, papers, blog posts, or open-source examples related to training reasoning and tool-calling models.

submitted by /u/zdeneklapes
