llama.cpp constantly reprocessing huge prompts with opencode/pi.dev
I'm using llama-swap with llama.cpp, mainly through opencode + pi.dev, and I'm seeing frequent massive prompt reprocessing / prefills even though the prompts are very similar between requests.
Example behavior:
- context grows to 50k+ tokens
- LCP similarity often shows 0.99+
- but sometimes `n_past` suddenly falls back to ~4-5k
- then llama.cpp reprocesses 40k+ tokens again
- TTFT jumps to multiple minutes
Example logs:
```
sim_best = 0.996
restored context checkpoint ... n_tokens = 4750
prompt eval time = 222411 ms / 44016 tokens
```

Normal reuse looks fine:

```
prompt eval time = 473 ms / 19 tokens
```

Current config:

```
llama-server --ctx-size 150000 --parallel 1 --ctx-checkpoints 32 \
  --cache-ram 2500 --cache-reuse 256 -no-kvu --no-context-shift
```

Also seeing:
```
cache state: 1 prompts, 4676 MiB (limits: 2500 MiB)
```

I suspect either:
- cache invalidation
- bad KV reuse
- or opencode changing early prompt tokens too often (a quick way to check this is sketched below).
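To sanity-check that last point, my plan is to capture two consecutive request bodies from the client (e.g. via a small logging proxy in front of llama-server, or however you dump the raw prompts) and just cmp them. The file names below are placeholders, not something llama.cpp produces:

```
# prompt_a.txt / prompt_b.txt are hypothetical dumps of two consecutive
# prompts sent by opencode. cmp prints the first byte (and line) at which
# they differ; a small offset would mean the head of the prompt is being
# rewritten, so the cached KV prefix can't be reused.
cmp prompt_a.txt prompt_b.txt
```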
Would love to hear from others running long-context coding agents with llama.cpp and what settings helped reduce huge prompt reprocessing.
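For what it's worth, the first thing I'll try myself, assuming the cache state line above means the saved prompt state (4676 MiB) simply doesn't fit under the --cache-ram limit (2500 MiB) and keeps getting evicted, is raising that limit past the observed size. Something like:

```
# same config as above, but with --cache-ram raised above the observed
# 4676 MiB prompt cache size (8192 is an arbitrary value I picked, not a
# documented recommendation)
llama-server --ctx-size 150000 --parallel 1 --ctx-checkpoints 32 \
  --cache-ram 8192 --cache-reuse 256 -no-kvu --no-context-shift
```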