Tested chunking + embeddings data from 3 production websites. [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:
| Workspace | Sources | Chunks | HIGH | MEDIUM | LOW | REJECTED |
|---|---|---|---|---|---|---|
| Intercom | 188 | 941 | 96 | 200 | 541 | 104 |
| HubSpot | 251 | 1705 | 40 | 508 | 1153 | 4 |
| KPMG | 53 | 209 | 3 | 14 | 127 | 65 |
(HIGH = avg operational score 0.84, MEDIUM = 0.55-0.65, LOW = 0, REJECTED = nav/legal/careers)
87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies (e.g. "23% increase in ACV"). KPMG's HIGH tier is nearly empty (3 of 209 chunks) because almost the entire corpus is positioning prose.
Retrieval probes on KPMG (the worst-case corpus):
- "Family business succession" → /private-enterprise.html (cosine 0.721)
- "ESG and climate risk" → /our-insights/esg.html (cosine 0.794)
- "Cybersecurity for energy sector" → /energy-natural-resources-chemicals.html (cosine 0.656)
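For anyone who wants to reproduce this kind of probe, here is a minimal sketch of a top-1 cosine lookup. The 3-dimensional vectors and the `top_hit` helper are made up for illustration (real embeddings would come from whatever model you use); only the URLs are taken from the probes above, and the scores it produces are not the ones reported.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_hit(query_vec, chunks):
    """chunks: list of (url, vector). Returns (url, cosine) of the best match."""
    scored = ((url, cosine(query_vec, vec)) for url, vec in chunks)
    return max(scored, key=lambda pair: pair[1])

# Toy vectors, purely hypothetical stand-ins for real chunk embeddings.
toy_chunks = [
    ("/private-enterprise.html", [0.9, 0.1, 0.0]),
    ("/our-insights/esg.html", [0.1, 0.9, 0.1]),
    ("/energy-natural-resources-chemicals.html", [0.0, 0.2, 0.9]),
]
toy_query = [0.2, 1.0, 0.0]  # stand-in for an embedded query

best_url, best_score = top_hit(toy_query, toy_chunks)
print(best_url, round(best_score, 3))
```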
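So semantic relevance routes correctly even on a thin corpus. Tier weighting (HIGH × 1.20) shifts the top-k composition meaningfully: on Q2, a HIGH chunk at cosine 0.535 gets reranked above LOW chunks at cosine 0.6+ (weighted 0.642 vs 0.51-0.59). Note that the LOW chunks' weighted scores sit below their raw cosines, so LOW is down-weighted by a multiplier under 1.0, not just left alone.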
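A rough sketch of what that tier-weighted rerank could look like. Only the HIGH multiplier (1.20) comes from the numbers above. The MEDIUM and LOW multipliers, the hit records, and dropping REJECTED chunks before ranking are all assumptions for illustration. LOW = 0.85 was picked only because it maps cosines of 0.60-0.69 onto roughly the 0.51-0.59 weighted range quoted.

```python
# Only HIGH = 1.20 is from the post; MEDIUM and LOW are assumed values.
TIER_WEIGHT = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}

def rerank(hits, k=3):
    """hits: list of dicts with 'id', 'tier', 'cosine'.

    Returns the top-k hits by cosine * tier weight.
    REJECTED chunks are dropped entirely (an assumption).
    """
    scored = []
    for hit in hits:
        if hit["tier"] == "REJECTED":
            continue
        weight = TIER_WEIGHT[hit["tier"]]
        scored.append({**hit, "weighted": hit["cosine"] * weight})
    scored.sort(key=lambda h: h["weighted"], reverse=True)
    return scored[:k]

# Hypothetical hits shaped like the Q2 example above.
hits = [
    {"id": "low-a", "tier": "LOW", "cosine": 0.69},
    {"id": "low-b", "tier": "LOW", "cosine": 0.60},
    {"id": "high-a", "tier": "HIGH", "cosine": 0.535},
    {"id": "nav", "tier": "REJECTED", "cosine": 0.70},
]

top = rerank(hits)
for h in top:
    print(h["id"], h["tier"], round(h["weighted"], 3))
```

With these assumed weights the HIGH chunk lands first at 0.642 even though both LOW chunks have a higher raw cosine, which is the behaviour described above.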
Key takeaway: a "yield score" ((HIGH + MEDIUM chunks) / total chunks) is itself useful telemetry. For Intercom that ratio is 31% (296/941). For HubSpot it's 32% (548/1705). For KPMG it's 8% (17/209). That predicts, before generation, which brands will need softer claims and more swap-resistant phrasing.
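The yield score is cheap to compute from the tier counts alone. A minimal sketch using the counts from the table above (the `yield_score` name and the dict layout are just illustrative choices, and REJECTED chunks are counted in the denominator, which is what reproduces the quoted percentages):

```python
# Tier counts copied from the results table.
corpora = {
    "Intercom": {"HIGH": 96, "MEDIUM": 200, "LOW": 541, "REJECTED": 104},
    "HubSpot": {"HIGH": 40, "MEDIUM": 508, "LOW": 1153, "REJECTED": 4},
    "KPMG": {"HIGH": 3, "MEDIUM": 14, "LOW": 127, "REJECTED": 65},
}

def yield_score(counts):
    """(HIGH + MEDIUM) / total chunks, with REJECTED included in the total."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["HIGH"] + counts["MEDIUM"]) / total

yields = {name: yield_score(counts) for name, counts in corpora.items()}
for name, y in yields.items():
    print(f"{name}: {y:.0%}")
# Intercom: 31%
# HubSpot: 32%
# KPMG: 8%
```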
Anyone publishing benchmarks on this kind of corpus-quality awareness? Most RAG benchmarks assume the source material is uniformly substantive, which is wildly untrue in the wild.