r/MachineLearning · May 23, 2026

Tested chunking + embeddings data from 3 production websites. [P]

Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:

Workspace	Sources	Chunks	HIGH	MEDIUM	LOW	REJECTED
Intercom	188	941	96	200	541	104
HubSpot	251	1705	40	508	1153	4
KPMG	53	209	3	14	127	65

(Tier legend: HIGH = average operational score of 0.84; MEDIUM = 0.55-0.65; LOW = 0; REJECTED = navigation, legal, and careers pages.)

87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies ("23% increase in ACV"). KPMG's HIGH tier is basically empty (3 of 209 chunks) because the entire corpus is positioning prose.

Retrieval probes on KPMG (the worst-case corpus):

"Family business succession" → /private-enterprise.html (cosine 0.721)
"ESG and climate risk" → /our-insights/esg.html (cosine 0.794)
"Cybersecurity for energy sector" → /energy-natural-resources-chemicals.html (cosine 0.656)
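
For anyone who wants to reproduce this kind of probe, here is a minimal sketch of cosine-ranked page lookup. The 3-d vectors and the `top_match` helper are made up for illustration; the post does not say which embedding model or vector store was used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_match(query_vec, page_vecs):
    """Return (url, score) for the page whose embedding is closest to the query."""
    return max(
        ((url, cosine(query_vec, vec)) for url, vec in page_vecs.items()),
        key=lambda pair: pair[1],
    )

# Toy embeddings, NOT real data; real embeddings have hundreds of dimensions.
PAGES = {
    "/private-enterprise.html": [0.9, 0.1, 0.2],
    "/our-insights/esg.html": [0.1, 0.9, 0.3],
}

url, score = top_match([0.2, 1.0, 0.2], PAGES)
print(url, round(score, 3))
```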

So semantic relevance routes correctly even on a thin corpus. Tier weighting (HIGH × 1.20) shifts the top-k composition meaningfully: on Q2 (the second probe), a HIGH chunk with cosine 0.535 gets reranked above LOW chunks with cosine 0.6+ (weighted 0.642 vs 0.51-0.59).
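
A rough sketch of the tier-weighted rerank described above. Only the HIGH multiplier (1.20) comes from the post. The MEDIUM and LOW multipliers are assumptions: 0.85 for LOW is just a value consistent with "0.6+ cosine becomes 0.51-0.59", not a confirmed setting. The chunk ids are hypothetical.

```python
# Only HIGH = 1.20 is stated in the post; MEDIUM and LOW are assumed values.
TIER_WEIGHTS = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}

def rerank(candidates, weights=TIER_WEIGHTS, k=5):
    """Multiply each cosine score by its tier weight and return the top k.

    candidates: iterable of (chunk_id, tier, cosine) tuples.
    Returns a list of (chunk_id, tier, weighted_score), best first.
    """
    scored = [(cid, tier, cos * weights[tier]) for cid, tier, cos in candidates]
    return sorted(scored, key=lambda item: item[2], reverse=True)[:k]

# Hypothetical candidates shaped like the Q2 example.
candidates = [
    ("low-a", "LOW", 0.60),
    ("high-a", "HIGH", 0.535),
    ("low-b", "LOW", 0.69),
]

for cid, tier, score in rerank(candidates):
    print(f"{cid:7s} {tier:5s} {score:.3f}")
```

With these assumed weights the HIGH chunk lands at 0.642 and sorts ahead of both LOW chunks, even though its raw cosine is the lowest of the three.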

Key takeaway: a "yield score", defined as (HIGH + MEDIUM chunks) / total chunks, is itself useful telemetry. It comes to 31% for Intercom, 32% for HubSpot, and 8% for KPMG. That tells you before generation which brands will need softer claims and more swap-resistant phrasing.
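
The yield numbers can be recomputed directly from the table above. A small sketch; the `TierCounts` class and `yield_report` function are names invented here, and only the counts come from the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierCounts:
    high: int
    medium: int
    low: int
    rejected: int

    @property
    def total(self) -> int:
        return self.high + self.medium + self.low + self.rejected

    @property
    def yield_score(self) -> float:
        """(HIGH + MEDIUM) / total chunks; 0.0 for an empty corpus."""
        return (self.high + self.medium) / self.total if self.total else 0.0

# Counts copied from the table in the post.
CORPORA = {
    "Intercom": TierCounts(96, 200, 541, 104),
    "HubSpot": TierCounts(40, 508, 1153, 4),
    "KPMG": TierCounts(3, 14, 127, 65),
}

def yield_report(corpora):
    """Yield score per corpus as a whole-number percentage."""
    return {name: round(c.yield_score * 100) for name, c in corpora.items()}

print(yield_report(CORPORA))
```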

Is anyone publishing benchmarks on this kind of corpus-quality awareness? Most RAG benchmarks assume the source material is uniformly substantive, which is far from true in the wild.

submitted by /u/Otherwise_Economy576