r/MachineLearning · 2 min read

Notes from evaluating a customer support chat agent system: heuristic evaluators give false signal, retrieval bugs masquerade as LLM failures, and the cost/quality Pareto frontier is rarely where you think [D]


Posting some practical findings from a structured audit of a production customer support RAG system. Methodology and caveats up front.

Methodology:

  • 6 representative turns from a real production session as the eval set (small, acknowledged limitation)
  • LLM-as-judge using Claude Haiku 4.5, scoring relevance/accuracy/helpfulness/overall on 0-10 and returning per-turn reasoning strings for verification (minimal sketch after this list)
  • Same judge across all conditions, same questions, same retrieval state where possible
  • Production model held constant while isolating retrieval changes, then swept across 5 LLMs once retrieval was fixed
  • Live pricing from OpenRouter /models API rather than estimates
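
For anyone who wants the shape of the judge call, here's a minimal sketch. It assumes the OpenAI-compatible OpenRouter endpoint; the judge slug, rubric wording, and JSON parsing are illustrative, not the pipeline's exact code.

    # LLM-as-judge sketch: one rubric-scored call per turn, keeping the reasoning string.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

    RUBRIC = (
        "You are grading one turn of a customer support agent.\n"
        "Score 0-10 on relevance, accuracy (vs. the retrieved documents), helpfulness, overall.\n"
        'Return JSON only: {"relevance": int, "accuracy": int, "helpfulness": int, '
        '"overall": int, "reasoning": str}'
    )

    def judge_turn(question: str, retrieved_docs: list[str], answer: str) -> dict:
        user_msg = (
            "Question:\n" + question
            + "\n\nRetrieved documents:\n" + "\n---\n".join(retrieved_docs)
            + "\n\nAgent answer:\n" + answer
        )
        resp = client.chat.completions.create(
            model="anthropic/claude-haiku-4.5",  # judge slug is an assumption
            messages=[{"role": "system", "content": RUBRIC},
                      {"role": "user", "content": user_msg}],
            temperature=0,
        )
        # The reasoning field is what makes spot-checking possible; keep it with the scores.
        return json.loads(resp.choices[0].message.content)

Keeping the per-turn reasoning is the part that pays off: the scores alone are as opaque as the old heuristic numbers until you can read why a turn got a 4.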

Findings:

  1. Heuristic evaluation produces zero signal. The existing evaluator counted keywords and source references. Output was numerical but uncorrelated with response quality. LLM judges with explicit rubrics caught hallucinations, identified zero-retrieval turns, and produced reasoning that could be spot-checked. The cost is real but small (cents per run) compared to shipping undetected regressions.
  2. Retrieval failures present as generation failures. A turn where the agent said "I don't have information about our company" looked like a model knowledge problem; the trace showed zero documents retrieved. Root cause was a similarity threshold (cosine distance 0.7 in Chroma) too strict for casual openers. Always inspect what entered the context window before tuning the generation step (see the retrieval check sketched after this list).
  3. The production model was not on the Pareto frontier. The sweep covered Gemini Flash Lite Preview (the incumbent), Gemma 4 26B, Mistral Small 3.2, Nova Micro, and one more. Gemma 4 26B dominated the incumbent on both axes: higher quality scores (7.88 vs 7.33) at 75% lower cost. The incumbent was neither the cheapest nor the best option.
  4. Grounding constraints have measurable helpfulness cost. Adding "only state facts present in retrieved documents" to the system prompt improved accuracy scores and reduced helpfulness scores on turns where docs didn't fully answer the question. The judge consistently flagged "the documents don't specify this, contact support" responses as accurate but less actionable. Real tradeoff worth surfacing rather than discovering post-deployment.
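
The retrieval check from finding 2 is mechanical enough to sketch. It assumes a Chroma collection and reuses the 0.7 distance cutoff from the trace; the path, collection name, and example query are placeholders.

    # Inspect what actually enters the context window before blaming the generator.
    import chromadb

    client = chromadb.PersistentClient(path="./chroma")    # path is an assumption
    collection = client.get_collection("support_docs")     # collection name is an assumption
    DISTANCE_CUTOFF = 0.7                                   # the threshold from the trace

    def inspect_retrieval(query: str, k: int = 5) -> list[str]:
        res = collection.query(query_texts=[query], n_results=k,
                               include=["documents", "distances"])
        kept = []
        for doc, dist in zip(res["documents"][0], res["distances"][0]):
            status = "kept" if dist <= DISTANCE_CUTOFF else "dropped"
            print(f"{status}  distance={dist:.3f}  {doc[:60]!r}")
            if dist <= DISTANCE_CUTOFF:
                kept.append(doc)
        return kept

    # A casual opener can land every document above the cutoff, so the generator
    # sees an empty context and answers "I don't have information about our company."
    inspect_retrieval("hi, can you tell me about your company?")

Printing per-document distances per turn is what separated "the model doesn't know" from "the model was handed nothing."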

Limitations I want to be honest about:

  • n=6 is small. Treat the deltas as directional, not as confidence intervals.
  • LLM-as-judge has known biases (length, verbosity, self-preference). Using a judge from a different model family than the production models reduces but doesn't eliminate this. I sanity-checked by reading the reasoning strings.
  • "Quality" here is judge-defined, not user-defined. A proper next step would be correlating judge scores with user satisfaction signals.

End-to-end delta: +19% quality, −79% cost. The cost win is robust because pricing is mechanical. The quality win I'd want to see replicated on a larger eval set before claiming it generalizes.
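
Since the cost side of that delta is mechanical, here's roughly what it looks like: a sketch assuming OpenRouter's public /models endpoint, with placeholder model slugs and token counts rather than the sweep's actual numbers.

    # Pull live per-token prices from OpenRouter and cost out a run.
    import requests

    models = {m["id"]: m["pricing"]
              for m in requests.get("https://openrouter.ai/api/v1/models",
                                    timeout=30).json()["data"]}

    def run_cost(model_id: str, prompt_tokens: int, completion_tokens: int) -> float:
        pricing = models[model_id]
        # OpenRouter reports pricing as USD-per-token strings.
        return (prompt_tokens * float(pricing["prompt"])
                + completion_tokens * float(pricing["completion"]))

    # Placeholder slugs and token counts, purely illustrative.
    for slug in ("vendor/incumbent-model", "vendor/challenger-model"):
        if slug in models:
            print(slug, f"${run_cost(slug, prompt_tokens=120_000, completion_tokens=15_000):.4f}")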

I've also written a detailed write-up for anyone who wants to go deeper into the evaluation process; it's linked in the comments 👇

submitted by /u/gvij