I'm running an agentic system with kobold.cpp as my backend. Am I losing performance?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.
I plan to move to a larger MoE model once I'm satisfied with how everything is working, but I'm just wondering if I'm sacrificing performance by not using llama.cpp standalone and relying on a program that's more focused on ease of use.
To my knowledge it's just a simple wrapper, but I'm curious if anyone has any experience swapping between Kobold and other local endpoints. Thanks!
[link] [comments]
More from r/LocalLLaMA
-
How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)
May 22
-
BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.
May 22
-
trained a prompt injection detector using ml-intern and DeepSeek v4 Flash, runs in the browser
May 22
-
ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop
May 22
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.