r/MachineLearning · 1 min read

Tested chunking + embeddings data from 3 production websites. [P]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:

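| Workspace | Sources | Chunks | HIGH | MEDIUM | LOW | REJECTED |
|-----------|---------|--------|------|--------|------|----------|
| Intercom  | 188     | 941    | 96   | 200    | 541  | 104      |
| HubSpot   | 251     | 1705   | 40   | 508    | 1153 | 4        |
| KPMG      | 53      | 209    | 3    | 14     | 127  | 65       |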

(HIGH = avg operational score 0.84, MEDIUM = 0.55-0.65, LOW = 0, REJECTED = nav/legal/careers)

87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies ("23% increase in ACV"). KPMG's HIGH tier is basically empty (3 of 209 chunks) because the entire corpus is positioning prose.

Retrieval probes on KPMG (the worst-case corpus):

  • "Family business succession" → /private-enterprise.html (cosine 0.721)
  • "ESG and climate risk" → /our-insights/esg.html (cosine 0.794)
  • "Cybersecurity for energy sector" → /energy-natural-resources-chemicals.html (cosine 0.656)

So semantic relevance routes correctly even on a thin corpus. Tier weighting (HIGH × 1.20) shifts the top-k composition meaningfully: on Q2 (the ESG probe), a 0.535-cosine HIGH chunk gets reranked above LOW chunks with cosine 0.6+ (weighted 0.642 vs 0.51-0.59).
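A minimal sketch of that reranking step, to make the arithmetic concrete. Only the HIGH multiplier (1.20) comes from the numbers above. The MEDIUM and LOW multipliers, the `Hit` class, the `rerank` helper, and the chunk ids and LOW cosines are all hypothetical; LOW = 0.85 was picked only because it reproduces the quoted 0.51-0.59 range for cosines of 0.60-0.69, not because the pipeline is known to use it.

```python
from dataclasses import dataclass

# HIGH = 1.20 is from the post. MEDIUM and LOW are assumptions for illustration.
TIER_WEIGHTS = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}


@dataclass
class Hit:
    chunk_id: str  # hypothetical identifier
    tier: str      # "HIGH", "MEDIUM", or "LOW"
    cosine: float  # raw cosine similarity from the vector search


def rerank(hits, k=3):
    """Return the top-k (weighted_score, hit) pairs, best first."""
    scored = [(h.cosine * TIER_WEIGHTS[h.tier], h) for h in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Illustrative candidates: two LOW chunks with cosine 0.6+, one HIGH at 0.535.
hits = [
    Hit("low-a", "LOW", 0.69),
    Hit("low-b", "LOW", 0.60),
    Hit("high-a", "HIGH", 0.535),
]

for weighted, hit in rerank(hits):
    print(f"{hit.chunk_id:7s} {hit.tier:6s} cosine={hit.cosine:.3f} weighted={weighted:.3f}")
```

With these assumed weights the HIGH chunk lands first at 0.642, ahead of both LOW chunks, even though its raw cosine is the lowest of the three.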

Key takeaway: a "yield score" (HIGH+MEDIUM chunks / total chunks, REJECTED included in the total) is itself useful telemetry. For Intercom that ratio is 31%. For HubSpot it's 32%. For KPMG it's 8%. That ratio predicts, before generation, which brands will need softer claims and more swap-resistant phrasing.
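For anyone who wants to reproduce the ratio, here is a small sketch that recomputes it from the tier counts in the table. The `TIER_COUNTS` layout and the `yield_score` name are just illustrative choices, and the total is assumed to include REJECTED chunks, which is what matches the percentages quoted.

```python
# Tier counts copied from the table in the post.
TIER_COUNTS = {
    "Intercom": {"HIGH": 96, "MEDIUM": 200, "LOW": 541, "REJECTED": 104},
    "HubSpot": {"HIGH": 40, "MEDIUM": 508, "LOW": 1153, "REJECTED": 4},
    "KPMG": {"HIGH": 3, "MEDIUM": 14, "LOW": 127, "REJECTED": 65},
}


def yield_score(counts):
    """(HIGH + MEDIUM) / all chunks, with REJECTED counted in the total."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["HIGH"] + counts["MEDIUM"]) / total


for workspace, counts in TIER_COUNTS.items():
    print(f"{workspace}: {yield_score(counts):.0%}")
# Intercom: 31%
# HubSpot: 32%
# KPMG: 8%
```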

Anyone publishing benchmarks on this kind of corpus-quality awareness? Most RAG benchmarks assume the source material is uniformly substantive, which is wildly untrue in practice.

submitted by /u/Otherwise_Economy576