r/LocalLLaMA

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

Built a decision-reasoning engine (Orlog) and wanted to fine-tune a local model for it instead of paying per-call forever.

The method (DV-DPO):

  • Run a 3-voice council on each question, produce a synthesis
  • Cross-examine: losing voices challenge the synthesis
  • If synthesis gets revised → DPO pair (chosen=post-revision, rejected=pre-revision)
  • If synthesis holds → no pair (good reasoning produces nothing to learn from)

Only genuine revisions made under adversarial pressure become training signal, not format preferences or sampling variance.
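
The pair-selection rule above can be sketched roughly as follows. This is a minimal illustration, not the actual Orlog pipeline: the `CouncilResult` fields and the helper names are hypothetical, and "revised" is approximated here as a plain text difference, since the post does not say how a revision is detected. The `prompt`/`chosen`/`rejected` keys follow the layout commonly used for preference datasets (for example by Hugging Face TRL's `DPOTrainer`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CouncilResult:
    """Outcome for one question. Field names are hypothetical."""
    question: str
    pre_revision: str   # synthesis before cross-examination
    post_revision: str  # synthesis after the losing voices challenged it


def to_dpo_pair(result: CouncilResult) -> Optional[dict]:
    """Return a DPO pair only if cross-examination changed the synthesis."""
    # Assumption: a revision is any difference after whitespace normalisation.
    before = " ".join(result.pre_revision.split())
    after = " ".join(result.post_revision.split())
    if before == after:
        return None  # synthesis held, so there is no training signal
    return {
        "prompt": result.question,
        "chosen": result.post_revision,    # post-revision synthesis
        "rejected": result.pre_revision,   # pre-revision synthesis
    }


def build_dataset(results: list[CouncilResult]) -> list[dict]:
    """Keep only the questions whose synthesis was actually revised."""
    pairs = (to_dpo_pair(r) for r in results)
    return [p for p in pairs if p is not None]
```

A real pipeline would probably need a stricter revision check than string inequality (for instance a judge model deciding whether the conclusion changed), otherwise cosmetic rewording would leak in as training signal.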

Results:

  • 1,040 pairs total (~$3 at Haiku rates)
  • Head-to-head vs Claude Haiku: Format 100%, Commits 100%, Context 89%, Composite 96%
  • Latency: 11s for the fine-tuned model (T4 GPU, 4-bit quantized) vs 3s for Claude Haiku
  • Adversarial failure rate: 2% on 96 targeted questions
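
As a quick sanity check on the numbers above, here is a small sketch of the arithmetic. It assumes the composite is the unweighted mean of the three sub-scores (the post does not say how it is computed) and treats the ~$3 figure as exactly $3.00.

```python
# Sub-scores reported in the head-to-head comparison.
scores = {"Format": 100, "Commits": 100, "Context": 89}

# Assumption: composite = unweighted mean of the three sub-scores.
composite = sum(scores.values()) / len(scores)

# Approximate API cost per DPO pair, taking "~$3" as $3.00.
cost_per_pair = 3.00 / 1040

# Number of failing questions implied by a 2% failure rate on 96 questions.
implied_failures = 0.02 * 96

print(f"composite: {composite:.1f}%")
print(f"cost per pair: ${cost_per_pair:.4f}")
print(f"implied adversarial failures: {implied_failures:.2f}")
```

Under that assumption the mean comes out at about 96.3%, which matches the reported 96% composite. The cost works out to roughly a third of a cent per pair, and a 2% failure rate on 96 questions corresponds to about 2 failures.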

Autonomous loop now running:
failure_detector → auto_red_team → DPO pairs → retrain → redeploy → eval. v5 pairs accumulating.
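
One way to picture the loop is as an ordered list of stages that each take and return a shared state. The sketch below is purely illustrative: only the stage order comes from the post, while the `run_cycle` helper, the `dict` state, and the handler signature are assumptions, and the handlers themselves are left as stand-ins.

```python
from typing import Callable

# Stage order as listed in the post; identifiers are normalised to snake_case.
STAGES = [
    "failure_detector",
    "auto_red_team",
    "dpo_pairs",
    "retrain",
    "redeploy",
    "eval",
]

# Hypothetical handler type: takes the cycle state and returns the updated state.
Stage = Callable[[dict], dict]


def run_cycle(state: dict, handlers: dict[str, Stage]) -> dict:
    """Run one pass of the loop, applying each stage in order."""
    for name in STAGES:
        state = handlers[name](state)  # KeyError if a stage has no handler
        state.setdefault("completed", []).append(name)
    return state


if __name__ == "__main__":
    # Stand-in handlers that pass the state through unchanged.
    noop_handlers = {name: (lambda s: dict(s)) for name in STAGES}
    result = run_cycle({"version": "v5"}, noop_handlers)
    print(result["completed"])
```

Keeping each stage behind the same signature makes it easy to swap one out (for example a different red-teaming strategy) without touching the rest of the loop.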

GGUF ready for Ollama. Happy to share the pipeline if there's interest.

submitted by /u/Lower-Economics6910