r/LocalLLaMA · 1 min read

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I'm wondering if anyone else has come across this.

I've tested the same model on llama.cpp and vLLM with similar settings and quantizations. Performance and concurrency are noticeably better in vLLM, but the model sometimes feels less reliable.

Some things I've noticed:

* More mistakes with formatting and tool calls

* Forgetting context suddenly

* Sometimes acting as if earlier messages didn't exist

* Lower-quality code even with similar parameters

I'm not trying to start a comparison. I just want to know if others have seen differences in quality between inference backends. Is it usually because of quantization, chat templates, parser problems, or configuration errors?
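For anyone who wants to compare the two directly, here is a minimal sketch of an A/B check. It assumes both servers are running with their OpenAI-compatible chat endpoints; the URLs, ports, and model name are placeholders, and `build_payload`, `query`, and `diff_outputs` are hypothetical helper names, not part of either project. The idea is to send the exact same request with sampling pinned explicitly, so neither backend's defaults are in play, and then diff the replies.

```python
import difflib
import json
import urllib.request

# Placeholder endpoints: adjust host, port, and path to your own setup.
BACKENDS = {
    "llama.cpp": "http://localhost:8080/v1/chat/completions",
    "vllm": "http://localhost:8000/v1/chat/completions",
}


def build_payload(messages, model="my-model"):
    """Build one request body for both backends, with sampling set explicitly.

    "my-model" is a placeholder. The seed may be ignored by a backend that
    does not support it.
    """
    return {
        "model": model,
        "messages": messages,
        "temperature": 0.0,
        "top_p": 1.0,
        "max_tokens": 256,
        "seed": 0,
    }


def query(url, payload, timeout=120):
    """POST the payload and return the assistant text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Content can be null when the reply is a tool call, so fall back to "".
    return body["choices"][0]["message"]["content"] or ""


def diff_outputs(a, b, name_a="llama.cpp", name_b="vllm"):
    """Return a unified diff of two replies, or "" if they are identical."""
    lines = difflib.unified_diff(
        a.splitlines(),
        b.splitlines(),
        fromfile=name_a,
        tofile=name_b,
        lineterm="",
    )
    return "\n".join(lines)
```

To use it, build one payload from a conversation that triggers the problem, call `query` once per entry in `BACKENDS`, and pass the two replies to `diff_outputs`. Small wording differences are expected even at temperature 0, since the backends use different kernels and quantization formats. Dropped turns or broken tool-call formatting would point more toward a chat template or parser mismatch.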

What has your experience been like?

submitted by /u/recro69