ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
Abstract
We introduce ShapeCodeBench, a synthetic benchmark for perception-to-program reconstruction: given a rendered raster image, a model must emit an executable drawing program that a deterministic evaluator re-renders and compares with the target. The v1 DSL has four primitives on a 512 x 512 black-on-white canvas, but every instance is generated from a seeded RNG, so fresh held-out sets can be created to reduce exact-instance contamination. We release a frozen eval_v1 split with 150 samples across easy, medium, and hard tiers, scored by exact match, pixel accuracy, foreground IoU, parse success, and execution success. We evaluate an empty-program floor, a classical computer-vision heuristic, Claude Opus 4.7 at high and max effort, and GPT-5.5 at medium and extra_high reasoning effort. The heuristic is competitive on easy scenes but collapses when overlaps fuse components; the strongest multimodal configuration preserves much of the foreground structure but still misses exact match because of small parameter errors. Best overall exact match remains low, so ShapeCodeBench is far from saturated. The benchmark code, frozen dataset, run artifacts, and paper sources are released to support independent replication and extension.
AI-generated summary
ShapeCodeBench presents a synthetic benchmark for perception-to-program reconstruction where models generate executable drawing programs from raster images, evaluated on multiple metrics including exact match and pixel accuracy.
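The pixel-level metrics named above (pixel accuracy and foreground IoU) have standard definitions on binary canvases. The following sketch is illustrative only: the function names and list-of-lists canvas representation are assumptions, not the benchmark's actual scoring API, which treats foreground as the black (ink) pixels of a black-on-white render.

```python
# Illustrative sketch of pixel accuracy and foreground IoU on binary
# canvases (1 = foreground/ink, 0 = background). Names and representation
# are hypothetical, not ShapeCodeBench's real scoring code.

def pixel_accuracy(pred, target):
    """Fraction of all pixels where the two canvases agree."""
    total = sum(len(row) for row in target)
    correct = sum(p == t for prow, trow in zip(pred, target)
                  for p, t in zip(prow, trow))
    return correct / total

def foreground_iou(pred, target):
    """Intersection-over-union restricted to foreground pixels."""
    inter = union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0  # two empty scenes match trivially

# Tiny 4x4 example: the prediction misses one foreground pixel.
target = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
pred = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(pixel_accuracy(pred, target))  # 15/16 = 0.9375
print(foreground_iou(pred, target))  # 3/4 = 0.75
```

The gap between the two metrics mirrors the headline result: on mostly-white canvases, pixel accuracy can stay high while foreground IoU, which ignores the shared background, penalizes every misplaced shape pixel.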
Community
ShapeCodeBench is a synthetic multimodal benchmark for perception-to-program reconstruction: models infer executable drawing programs from rendered shape scenes.
The v1 release is intentionally small but challenging: the best multimodal exact-match rate is only 2.7%, even though foreground IoU reaches 0.87. The benchmark is renewable from fresh seeds to reduce exact-instance contamination, and includes deterministic scoring, a frozen eval_v1 split, baselines, code, data, paper source, and archived reproducibility artifacts.
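The deterministic scoring loop described above (emit a program, re-render it, compare with the target) can be sketched as follows. The `rect x y w h` syntax and the `render` helper are invented for illustration; the real v1 DSL's four primitives and concrete syntax are defined in the paper and benchmark code.

```python
# Hypothetical mini-evaluator in the spirit of ShapeCodeBench: parse a toy
# drawing program, deterministically re-render it, and compare canvases.
# The "rect" primitive and its syntax are assumptions for this sketch.

def render(program, size=8):
    """Render a program onto a size x size white (0) canvas; 1 = ink."""
    canvas = [[0] * size for _ in range(size)]
    for line in program.strip().splitlines():
        op, *args = line.split()
        if op == "rect":  # rect x y w h
            x, y, w, h = map(int, args)
            for r in range(y, min(y + h, size)):
                for c in range(x, min(x + w, size)):
                    canvas[r][c] = 1
        else:
            raise ValueError(f"unknown primitive: {op}")
    return canvas

target_prog = "rect 1 1 3 2"
pred_prog = "rect 1 1 3 2"

# Deterministic scoring: re-render both programs and compare pixel-for-pixel.
exact = render(pred_prog) == render(target_prog)
print(exact)  # True
```

Because the evaluator re-renders rather than string-matching programs, a prediction that is off by a single parameter (say, `rect 1 1 3 3`) still executes and parses successfully yet fails exact match, which is consistent with the reported pattern of high foreground IoU alongside very low exact-match rates.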