WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts
Authors: Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang
Organization: Tsinghua IIGroup
Published: 2026-06-02
Project page: https://iigroup.github.io/WebRISE
Code: https://github.com/IIGROUP/WebRISE
Abstract
WebRISE evaluates MLLM-generated web artifacts by analyzing interaction contracts that capture user intent transitions and requirement checks across multiple input modalities, revealing significant gaps in model performance and demonstrating superior error detection compared to traditional methods.
Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
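To make the idea of an Interaction Contract Graph concrete, the following is a minimal sketch, not the WebRISE implementation. The class names, the flat-dictionary page snapshot, the shopping-cart scenario, and the two metric definitions (a transition counts as valid when all of its assertions hold; a requirement counts as covered when all of its linked transitions are valid) are assumptions made for illustration only. The abstract does not specify the actual ICG schema or scoring formulas.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for DOM/visual observations taken after an action.
Snapshot = Dict[str, object]
Assertion = Callable[[Snapshot], bool]


@dataclass
class Transition:
    source: str                  # observable state before the action
    intent: str                  # user intent, e.g. "add item to cart"
    target: str                  # observable state expected afterwards
    assertions: List[Assertion] = field(default_factory=list)


@dataclass
class RequirementCheck:
    name: str
    kind: str                    # "explicit" (user-stated) or "implicit"
    transitions: List[int]       # indices of transitions that evidence it


def transition_validity(
    transitions: List[Transition], observed: Dict[int, Snapshot]
) -> Tuple[List[bool], float]:
    """Assumed definition: a transition is valid if all its assertions hold."""
    valid = [
        all(check(observed[i]) for check in t.assertions)
        for i, t in enumerate(transitions)
    ]
    return valid, (sum(valid) / len(valid) if valid else 0.0)


def requirement_coverage(
    checks: List[RequirementCheck], valid: List[bool], kind: Optional[str] = None
) -> float:
    """Assumed definition: a requirement is covered if all linked transitions are valid."""
    selected = [c for c in checks if kind is None or c.kind == kind]
    if not selected:
        return 0.0
    covered = sum(all(valid[i] for i in c.transitions) for c in selected)
    return covered / len(selected)


# Hypothetical shopping-cart contract with three transitions.
transitions = [
    Transition("empty_cart", "add item", "one_item",
               [lambda s: s.get("cart_count") == 1]),
    Transition("one_item", "remove item", "empty_cart",
               [lambda s: s.get("cart_count") == 0]),
    # Implicit constraint: checking out an empty cart must not leave the cart page.
    Transition("empty_cart", "click checkout", "empty_cart",
               [lambda s: s.get("url") == "/cart"]),
]
checks = [
    RequirementCheck("add item", "explicit", [0]),
    RequirementCheck("remove item", "explicit", [1]),
    RequirementCheck("block empty checkout", "implicit", [2]),
]

# Snapshots a browser driver might record; here the page wrongly navigates away.
observed = {
    0: {"cart_count": 1},
    1: {"cart_count": 0},
    2: {"url": "/checkout"},
}

valid, rate = transition_validity(transitions, observed)
print(valid, round(rate, 3))
print(requirement_coverage(checks, valid, "explicit"))
print(requirement_coverage(checks, valid, "implicit"))
```

In this toy run the page satisfies both user-stated functions but violates the implicit constraint, which mirrors the explicit/implicit distinction the abstract draws. Because the assertions only read observable page state, the same contract could in principle be replayed against any implementation.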
arXiv: arxiv.org/abs/2606.03220