r/LocalLLaMA · 1 min read

Is using vLLM actually worth it if you aren't serving the model to other people?


So, like most of us here, I'm a llama.cpp loyalist. It's easy to understand, has great configuration options, is relatively stable, etc. But I've been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine in Lemonade, and I happen to have an AMD GPU. The thing is, I've never actually used vLLM directly, but I've heard good things about how it performs compared to llama.cpp, with vLLM apparently outperforming it pretty much across the board.

Buuuuut I only serve my model to myself - no hosting for others to worry about - and another thing I've heard is that vLLM is engineered more for scenarios where you're serving many concurrent requests. Still, the apparent speedup piques my interest.
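
From what I've read, the single-user workflow looks roughly like this (a minimal sketch based on the docs, not something I've run myself - the model name, port, and prompt are just placeholders, and it assumes vLLM's OpenAI-compatible server was already started with something like `vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000`):

```python
# Minimal single-user sketch: even without other users, vLLM is reached
# through its local OpenAI-compatible HTTP server rather than in-process.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM server, nothing remote
    api_key="not-needed",                 # vLLM doesn't check the key by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",     # must match the model the server loaded
    messages=[{"role": "user", "content": "Hello from a single-user setup"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

So functionally it's the same client-side experience as pointing anything OpenAI-compatible at llama.cpp's server - the question is whether the backend switch alone is worth it.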

Has anybody here actually done this? Is it worth all the hassle, or is the difference basically unnoticeable and not something to bother with? It would be great to hear from people who aren't just using it in enterprise-type settings.

Appreciate any help, ty!

submitted by /u/ayylmaonade
