Evaluated a RAG chatbot and the most expensive model was the worst performer. Notes on what actually moved the needle.
We had a customer support RAG bot. Standard setup: ChromaDB, a system prompt, an LLM doing generation. Nobody had actually measured response quality; the only "evaluation" was a keyword-matching script producing numbers that looked like scores and meant nothing. I went in to fix this properly. Sharing what I found, because most of it was not where I expected.

1. Retrieval problems disguise themselves as LLM problems.

User asks "hey what do you guys do?" Bot says "I don't have access to specific information about our company's services." Everyone's first instinct is to tweak the prompt or swap the model. Wrong. The similarity threshold in ChromaDB was set to 0.7 (ChromaDB reports cosine distance, where lower means more similar, so that cutoff is stricter than it looks). Casual openers don't produce embeddings close enough to any chunk to pass the filter, so zero docs were retrieved. The model was honestly reporting that it had nothing. Lesson: always log what context the LLM actually received before blaming generation (retrieval-logging sketch at the end of this post). If retrieval returns nothing, no amount of prompt engineering fixes it.

2. Heuristic evaluators are worse than no evaluator.

Counting keywords and source references gives you a number, but that number has no correlation with whether users are being helped. Worse, it gives you false confidence that you are measuring something. Bit the bullet and used an LLM judge (Claude Haiku 4.5 via OpenRouter) scoring relevance, accuracy, helpfulness, and overall on a 0-10 scale (judge sketch below). Costs a few cents per full run. Cheap insurance.

3. Deduplicate chunks before sending them to the model.

Two of our turns had three near-identical FAQ chunks in the context window. Added a check for >80% token overlap between chunks from the same source file (dedup sketch below). Cleaner context, fewer tokens, and the agent stopped hallucinating product names on one turn (probably because the noise was gone).

4. Stricter grounding trades helpfulness for accuracy.

Added a rule that the agent only states facts present in the retrieved docs (grounding-rule sketch below). Accuracy went up. Helpfulness went down on knowledge-gap turns, because the bot started saying "the docs don't specify this, contact support" instead of guessing. This is the right call for a factual support bot, but you need to make it consciously. Otherwise users complain the bot got worse even though your scores say it got better.

5. Run a model sweep. The defaults are usually wrong.

I was running Gemini 3.1 Flash Lite Preview. Swept 5 models against the same eval harness (sweep sketch below). Gemma 4 26B scored higher (7.88 vs 7.33) and cost 75% less per session. Mistral Small 3.2 was a close second. Nova Micro was cheapest, but its terse responses got penalized for not being actionable. The point is not that Gemma is the best model. The point is that your production model is probably not on the Pareto frontier, and you only find that out by measuring.

End to end: quality went from 6.62 to 7.88 (+19%), and cost went from $0.002420 to $0.000509 per session (−79%). Both directions, same run.

This entire evaluation was done using Neo AI Engineer. It built the eval harness, handled checkpointed runs, dealt with timeout and context-limit issues, and consolidated results. I reviewed everything manually and made the calls on what to ship. Full walkthrough write-up in the comments if anyone wants to replicate it on their own system. 👇
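For point 1, here is a minimal sketch of what "log the retrieved context" can look like, assuming a ChromaDB collection named support_docs with its default embedding function and a 0.7 distance cutoff. The collection name, path, and top-k are placeholders, and the threshold logic is my reconstruction of what's described above, not the actual code:

```python
import chromadb

# Collection name and path are placeholders; adjust to your setup.
client = chromadb.PersistentClient(path="./chroma")
collection = client.get_collection("support_docs")

def retrieve(query: str, k: int = 4, max_distance: float = 0.7) -> list[str]:
    """Query the store and log exactly what will reach the LLM."""
    res = collection.query(query_texts=[query], n_results=k)
    docs, dists = res["documents"][0], res["distances"][0]

    kept = [doc for doc, dist in zip(docs, dists) if dist <= max_distance]
    print(f"query={query!r} retrieved={len(docs)} kept_after_threshold={len(kept)}")
    for doc, dist in zip(docs, dists):
        print(f"  distance={dist:.3f} kept={dist <= max_distance} text={doc[:80]!r}")
    return kept
```

If a turn like "hey what do you guys do?" logs kept_after_threshold=0, the problem is retrieval, not generation.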
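For point 2, a sketch of the LLM-judge call. OpenRouter exposes an OpenAI-compatible API, so the openai client works against it; the model slug and the rubric wording are my assumptions, not the exact harness:

```python
import json
from openai import OpenAI

# OpenRouter speaks the OpenAI API; key and slug below are placeholders.
judge = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

RUBRIC = """You are grading a support-bot answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score each of relevance, accuracy, helpfulness, and overall from 0 to 10.
Reply with JSON only, e.g. {{"relevance": 8, "accuracy": 7, "helpfulness": 6, "overall": 7}}."""

def judge_turn(question: str, context: str, answer: str) -> dict:
    resp = judge.chat.completions.create(
        model="anthropic/claude-haiku-4.5",  # assumed OpenRouter slug
        messages=[{"role": "user", "content": RUBRIC.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```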
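For point 3, one way to implement the ">80% token overlap from the same source file" check. The post doesn't give the exact metric; this sketch uses whitespace tokens and measures overlap against the smaller chunk, which is an assumption:

```python
def dedupe_chunks(chunks: list[dict], overlap_threshold: float = 0.8) -> list[dict]:
    """Drop a chunk if it shares >80% of its tokens with an already-kept chunk
    from the same source file. Chunks are dicts like {"text": ..., "source": ...}."""
    kept: list[dict] = []
    for chunk in chunks:
        tokens = set(chunk["text"].lower().split())
        duplicate = False
        for other in kept:
            if other["source"] != chunk["source"]:
                continue
            other_tokens = set(other["text"].lower().split())
            overlap = len(tokens & other_tokens) / max(1, min(len(tokens), len(other_tokens)))
            if overlap > overlap_threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(chunk)
    return kept
```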
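For point 4, the grounding change is just a prompt rule. The post doesn't quote the actual wording, so this is purely illustrative:

```python
# Hypothetical wording; the post doesn't quote the real rule.
BASE_SYSTEM_PROMPT = "You are a customer support assistant. Answer using the retrieved documents."
GROUNDING_RULE = (
    "Only state facts that appear in the retrieved documents. "
    "If the documents do not cover the question, say the docs don't specify this "
    "and suggest contacting support. Never guess."
)
SYSTEM_PROMPT = BASE_SYSTEM_PROMPT + "\n\n" + GROUNDING_RULE
```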
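For point 5, the shape of a model sweep against a fixed eval set. The slugs are guesses at OpenRouter-style names (the post swept 5 models), and answer() stands in for whatever generation call the bot uses, returning the reply text and its cost:

```python
# Model slugs are guesses; answer() is your own generation call.
CANDIDATES = [
    "google/gemini-3.1-flash-lite-preview",
    "google/gemma-4-26b-it",
    "mistralai/mistral-small-3.2",
    "amazon/nova-micro",
]

def sweep(eval_set: list[dict], answer, judge_turn) -> dict:
    results = {}
    for model in CANDIDATES:
        scores, cost = [], 0.0
        for turn in eval_set:
            reply, turn_cost = answer(model, turn["question"])
            cost += turn_cost
            scores.append(judge_turn(turn["question"], turn["context"], reply)["overall"])
        results[model] = {
            "mean_overall": sum(scores) / len(scores),
            "cost_per_session": cost / len(eval_set),
        }
    return results
```

Plotting mean_overall against cost_per_session makes the Pareto-frontier point from the post concrete: the production default is often dominated on both axes.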