How to fine-tune an LLM for open-ended problems? [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I want to develop an LLM that can solve open-ended math problems, such as proof-only problems. For these, RLVR that uses only the final answer as the reward signal is not enough, because a proof has no single final answer to check. Since SFT is useless here and GRPO/PPO methods will not have an appropriate reward function, what kind of fine-tuning can I do? For data, I will use the MathNet dataset.
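To make the reward problem concrete, here is a rough sketch of what I mean. It is not taken from any library: `final_answer_reward` and `group_relative_advantages` are hypothetical helpers, the `\boxed{}` answer convention is an assumption, and the normalization is only a simplified GRPO-style (reward minus group mean, divided by group standard deviation) calculation.

```python
import re
from statistics import mean, pstdev
from typing import List, Optional


def final_answer_reward(completion: str, reference: Optional[str]) -> float:
    """Hypothetical RLVR-style reward: 1.0 if the boxed final answer matches."""
    if reference is None:
        # Proof-only problem: there is no reference answer to compare against,
        # so every completion gets the same (zero) reward.
        return 0.0
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == reference.strip():
        return 1.0
    return 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Simplified GRPO-style advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Numeric problem: samples differ in correctness, so the advantages differ.
numeric_samples = ["... so \\boxed{42}", "... so \\boxed{41}"]
numeric_rewards = [final_answer_reward(s, "42") for s in numeric_samples]
numeric_adv = group_relative_advantages(numeric_rewards)

# Proof-only problem: all rewards are identical, so all advantages are zero
# and the policy gets no signal about which proof attempt was better.
proof_samples = ["By induction on n, ...", "Assume for contradiction ..."]
proof_rewards = [final_answer_reward(s, None) for s in proof_samples]
proof_adv = group_relative_advantages(proof_rewards)

print(numeric_rewards, numeric_adv)
print(proof_rewards, proof_adv)
```

With a constant reward across the group, the advantages collapse to zero, which is why I think a final-answer check alone cannot drive GRPO-style training on proofs.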