Hugging Face Daily Papers · 4 min read

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.17110


Published on May 16, 2026 · Submitted by Fangzhou Wu on May 21, 2026
Authors: Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang
Organization: University of Wisconsin-Madison

Abstract

AI-generated summary: ECC, a query clustering algorithm, improves LLM capability evaluation by aligning semantic embeddings with latent capability demands through posterior model comparisons and Bradley-Terry modeling.

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.
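The abstract gives no formulas, so the following is only a minimal sketch of one way a mixture of Bradley-Terry capability profiles could yield query-specific win probabilities and rankings. It is not the authors' implementation. The function names, the softmax parameterization of the mixture weights, the weighted-strength ranking rule, and all numbers are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bt_win_prob(strength_i, strength_j):
    """Bradley-Terry probability that model i beats model j."""
    return 1.0 / (1.0 + math.exp(-(strength_i - strength_j)))


def mixture_win_prob(weights, profiles, i, j):
    """Win probability for a query with mixed capability demands.

    weights:  one non-negative weight per cluster, summing to 1
    profiles: profiles[k][m] is the (hypothetical) strength of model m
              in cluster k
    """
    return sum(w * bt_win_prob(p[i], p[j]) for w, p in zip(weights, profiles))


def rank_models(weights, profiles):
    """Rank model indices by mixture-weighted strength, strongest first.

    Assumed scoring rule, chosen for simplicity.
    """
    n_models = len(profiles[0])
    scores = [
        sum(w * p[m] for w, p in zip(weights, profiles))
        for m in range(n_models)
    ]
    return sorted(range(n_models), key=lambda m: scores[m], reverse=True)


# Hypothetical data: 2 clusters, 3 models.
profiles = [
    [1.5, 0.0, -0.5],  # cluster 0: model 0 is strongest
    [-1.0, 0.5, 2.0],  # cluster 1: model 2 is strongest
]
weights = softmax([2.0, 0.0])  # a query leaning toward cluster 0

print(rank_models(weights, profiles))
print(mixture_win_prob(weights, profiles, 0, 2))
```

In this sketch, a query assigned entirely to one cluster inherits that cluster's ranking, while a query with mixed weights interpolates between cluster profiles. In a learned version, the mixture logits would presumably be derived from the query's semantic embedding and fitted together with the profiles on observed model comparisons.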



