r/LocalLLaMA · June 10, 2026 · 1 min read

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers


Built a decision-reasoning engine (Orlog) and wanted to fine-tune a local model for it instead of paying per-call forever.

The method (DV-DPO):

Run a 3-voice council on each question, produce a synthesis
Cross-examine: losing voices challenge the synthesis
If synthesis gets revised → DPO pair (chosen=post-revision, rejected=pre-revision)
If synthesis holds → no pair (good reasoning produces nothing to learn from)

Only genuine revisions under adversarial pressure become training signal. Not format preference, not sampling variance.
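A minimal sketch of the pair-selection rule above, not the actual Orlog pipeline. `CouncilResult`, `to_dpo_pair`, and `build_pairs` are hypothetical names, and the prompt/chosen/rejected record layout is an assumption (it is a common shape for preference datasets):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CouncilResult:
    """Outcome of one council run (hypothetical structure)."""
    question: str
    synthesis: str           # synthesis before cross-examination
    revised: Optional[str]   # synthesis after cross-examination, None if it held


def to_dpo_pair(result: CouncilResult) -> Optional[Dict[str, str]]:
    """Return a DPO pair only if cross-examination changed the synthesis."""
    if result.revised is None:
        return None  # synthesis held: no pair
    if result.revised.strip() == result.synthesis.strip():
        return None  # nothing actually changed: treat as held
    return {
        "prompt": result.question,
        "chosen": result.revised,      # post-revision
        "rejected": result.synthesis,  # pre-revision
    }


def build_pairs(results: List[CouncilResult]) -> List[Dict[str, str]]:
    """Keep only the runs that produced a genuine revision."""
    pairs = []
    for r in results:
        pair = to_dpo_pair(r)
        if pair is not None:
            pairs.append(pair)
    return pairs
```

The whitespace-only comparison is a crude stand-in for "genuine revision"; a real filter would probably need something stricter to keep pure rewording out of the training set.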

Results:

1,040 pairs total (~$3 at Haiku rates)
Head-to-head vs Claude Haiku: Format 100%, Commits 100%, Context 89%, Composite 96%
Latency: 11s local (T4 GPU, 4-bit quantized) vs 3s for Haiku
Adversarial failure rate: 2% on 96 targeted questions
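
Quick sanity check on the numbers above. How Composite is computed isn't stated here; an unweighted mean of the three axes is an assumption that happens to reproduce the 96%. The `composite` helper is illustrative only:

```python
def composite(scores):
    """Unweighted mean of per-axis percentages (assumed formula)."""
    return sum(scores.values()) / len(scores)


scores = {"format": 100, "commits": 100, "context": 89}  # figures from the results above

print(round(composite(scores)))  # 96
print(round(0.02 * 96))          # 2, i.e. roughly 2 failing questions out of 96
print(3 / 1040)                  # approximate dollars per pair, well under one cent
```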

Autonomous loop now running:
failure_detector → auto_red_team → DPO pairs → retrain → redeploy → eval. v5 pairs accumulating.
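
Rough shape of one pass through that loop, as a sketch only. `failure_detector` and `auto_red_team` are the stage names above; `run_cycle`, `make_pairs`, `retrain`, `redeploy`, `evaluate`, and `min_pairs` are hypothetical placeholders, and every stage is passed in as a plain callable so nothing here assumes a particular training stack:

```python
from typing import Any, Callable, Dict, List, Optional


def run_cycle(
    failure_detector: Callable[[], List[str]],
    auto_red_team: Callable[[List[str]], List[str]],
    make_pairs: Callable[[List[str]], List[Dict[str, str]]],
    retrain: Callable[[List[Dict[str, str]]], Any],
    redeploy: Callable[[Any], None],
    evaluate: Callable[[Any], float],
    min_pairs: int = 1,
) -> Optional[float]:
    """Run one failure -> red team -> pairs -> retrain -> redeploy -> eval pass.

    Returns the eval score, or None if there was nothing worth training on.
    """
    failures = failure_detector()
    if not failures:
        return None  # no failures observed, skip this cycle

    probes = auto_red_team(failures)  # targeted questions derived from failures
    pairs = make_pairs(probes)        # only revised syntheses become pairs
    if len(pairs) < min_pairs:
        return None  # not enough new signal to justify a retrain

    model = retrain(pairs)
    redeploy(model)
    return evaluate(model)
```

The `min_pairs` gate is an assumption: it keeps the loop from retraining on a handful of pairs while a new batch is still accumulating.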

GGUF ready for Ollama. Happy to share the pipeline if there's interest.
