τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems
Authors: Bharath Sivaram Narasimhan, Karthik R Narasimhan
Published: 2026-06-08 (arXiv:2606.10156)
Dataset: https://huggingface.co/datasets/nbharaths/tau-rec
Abstract
A benchmark for agentic recommender systems is introduced that uses verifiable rewards and controlled dialogue constraints to evaluate conversational agent reliability, revealing significant performance gaps among leading models.
As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high cost, and inconsistency. We present τ-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, τ-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini) reveals a steep reliability cliff: even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
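The abstract names a pass^k reliability metric without defining it. The sketch below assumes the definition commonly used for pass^k in the τ-bench line of work: for each task, run n independent trials, count the c successes, and estimate the chance that k trials all succeed as C(c, k) / C(n, k), then average over tasks. Whether τ-Rec computes it exactly this way is an assumption; the function names and the per-task success counts are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k independent trials of one task all succeed.

    Assumed estimator: C(successes, k) / C(trials, k).
    """
    if not 1 <= k <= trials:
        raise ValueError("k must satisfy 1 <= k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must satisfy 0 <= successes <= trials")
    # math.comb returns 0 when k > successes, so a task with fewer than
    # k successes contributes 0 to pass^k.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(success_counts: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks."""
    if not success_counts:
        raise ValueError("success_counts must not be empty")
    per_task = [pass_hat_k(c, trials, k) for c in success_counts]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Hypothetical results: 4 tasks, 4 trials each, successes per task.
    counts = [4, 4, 2, 0]
    for k in range(1, 5):
        print(f"pass^{k} = {benchmark_pass_hat_k(counts, trials=4, k=k):.3f}")
```

With these made-up counts, the score falls from 0.625 at pass^1 to 0.500 at pass^4, because the task solved in only two of four trials stops contributing once k exceeds 2. This is the kind of drop the abstract calls a reliability cliff: pass^1 rewards occasional success, while larger k rewards only tasks the agent solves every time.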