Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm wondering if anyone else has come across this.
I've tested the same model on llama.cpp and vLLM with similar settings and quantizations. The performance and concurrency in vLLM are much noticeably better, but sometimes the model feels less reliable.
Some things I've noticed:
* More mistakes with formatting and tool calls
* Forgetting context suddenly
* Sometimes acting like messages didn't exist
* Lower quality code even with similar parameters
I'm not trying to start a comparison. I just want to know if others have seen differences in quality between inference backends... Is it usually because of quantization, chat templates, parser problems or configuration errors.
What has your experience been, like?
[link] [comments]
More from r/LocalLLaMA
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
-
Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.