Hugging Face Daily Papers · June 18, 2026 · 4 min read

Learning User Simulators with Turing Rewards

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

We propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Across two domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

arxiv:2606.19336

Published on Jun 17

Submitted by Ced Zhang on Jun 18 · Massachusetts Institute of Technology

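Authors: Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim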

Abstract

A reinforcement learning approach using Turing test-based rewards trains language models to generate responses indistinguishable from human users in conversational and forum discussion settings.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Learning to simulate human users in interactive settings could advance the training of agent assistants, the evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally train a large language model (LLM) to match a single ground-truth response, either by maximizing its log probability or by optimizing a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward, in which an LLM judge scores how indistinguishable a generated response is from the real user's response given the user's history. Trained with this reward, the user simulator LLM learns to produce responses indistinguishable from what the user could have said. Across two domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.
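To make the contrast between response matching and indistinguishability concrete, here is a minimal sketch under stated assumptions. The abstract does not give the exact form of the Turing reward, so this assumes a pairwise setup: a judge sees the user's history plus the real and generated responses in random order, and the reward is the fraction of trials in which it picks the generated one as real. The names `turing_reward`, `overlap_reward`, and `toy_length_judge` are hypothetical, the judge is a simple length heuristic standing in for an LLM judge, and the token-overlap score is only a stand-in for a similarity reward. None of this is taken from the authors' code.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical judge signature: (history, candidate_a, candidate_b) -> 0 or 1,
# the index of the candidate the judge believes the real user wrote.
Judge = Callable[[Sequence[str], str, str], int]


def turing_reward(
    history: Sequence[str],
    real: str,
    generated: str,
    judge: Judge,
    n_trials: int = 8,
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of trials in which the judge mistakes `generated` for the real reply.

    Assumed pairwise formulation; the paper's actual reward may differ.
    """
    rng = rng or random.Random(0)
    fooled = 0
    for _ in range(n_trials):
        # Randomize presentation order so the judge cannot exploit position.
        generated_first = rng.random() < 0.5
        a, b = (generated, real) if generated_first else (real, generated)
        picked = judge(history, a, b)
        # The judge is fooled when the index it picked holds the generated text.
        fooled += int((picked == 0) == generated_first)
    return fooled / n_trials


def overlap_reward(real: str, generated: str) -> float:
    """Jaccard token overlap: a stand-in for a response-matching reward."""
    r, g = set(real.lower().split()), set(generated.lower().split())
    return len(r & g) / len(r | g) if (r | g) else 0.0


def toy_length_judge(history: Sequence[str], a: str, b: str) -> int:
    """Stub judge (NOT an LLM): picks the candidate whose length is closer
    to the average length of the user's past messages."""
    avg = sum(len(m) for m in history) / max(len(history), 1)
    return 0 if abs(len(a) - avg) <= abs(len(b) - avg) else 1


# Toy data, invented for illustration only.
history = ["lol ok", "nah im good", "brb"]
real = "sure thing"
casual = "yea sure"
formal = "Certainly, I would be delighted to assist you."

for name, candidate in [("casual", casual), ("formal", formal)]:
    print(
        name,
        "turing:", turing_reward(history, real, candidate, toy_length_judge),
        "overlap:", round(overlap_reward(real, candidate), 3),
    )
```

In this toy setup the casual reply shares only one token with the real reply, yet it fits the user's style well enough to fool the stub judge, while the formal reply does not. That is the intuition behind rewarding indistinguishability instead of closeness to a single reference response. In practice the stub would be replaced by a call to an LLM judge.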

GitHub: https://github.com/SusanWYS/turing-rl

Get this paper in your agent:

hf papers read 2606.19336

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
