Hugging Face Daily Papers · 3 min read

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.02580

Published on Jun 1, 2026 · Submitted by taesiri on Jun 2, 2026
Authors: Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor

Abstract

AI-generated summary: Pretrained vision-language models can reconstruct 3D scenes from single images as editable Blender programs through progressive refinement, with staged reconstruction improving fidelity.

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
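The staged idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the abstract does not say in what order the scene factors are refined, how candidates are scored, or how many attempts are made. `SceneProgram`, `staged_reconstruct`, `toy_propose`, and `toy_score` are hypothetical names; in a real system `propose` would presumably query a VLM and `score` would compare a Blender render against the input image.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Scene factors named in the abstract; treating them as an ordered list of
# stages is an assumption made for this sketch.
STAGES = ["geometry", "materials", "composition", "lighting"]


@dataclass
class SceneProgram:
    """Hypothetical container: one code fragment per scene factor."""
    fragments: Dict[str, str] = field(default_factory=dict)

    def source(self) -> str:
        # Join the fragments written so far, in stage order.
        return "\n".join(self.fragments[s] for s in STAGES if s in self.fragments)


def staged_reconstruct(
    propose: Callable[[str, str, int], str],
    score: Callable[[str], float],
    attempts: int = 2,
) -> SceneProgram:
    """Refine one stage at a time, keeping the best-scoring fragment per stage."""
    program = SceneProgram()
    for stage in STAGES:
        best_score, best_fragment = None, None
        for attempt in range(attempts):
            fragment = propose(stage, program.source(), attempt)
            trial = SceneProgram({**program.fragments, stage: fragment})
            trial_score = score(trial.source())
            if best_score is None or trial_score > best_score:
                best_score, best_fragment = trial_score, fragment
        program.fragments[stage] = best_fragment
    return program


# Toy stand-ins so the sketch runs without a VLM or Blender.
def toy_propose(stage: str, current_source: str, attempt: int) -> str:
    return f"{stage}_v{attempt} = True"


def toy_score(source: str) -> float:
    # Pretend that later attempts match the target image better.
    return float(source.count("_v1"))


result = staged_reconstruct(toy_propose, toy_score)
print(result.source())
```

The loop fixes earlier stages before moving on, so each later stage is proposed against the program accumulated so far. That is one simple reading of "task decomposition"; the paper may revisit earlier stages or score candidates differently.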

Community

Paper submitter about 6 hours ago

SEIG is an agentic framework that reconstructs 3D scenes from single images by progressively generating executable Blender code, enabling novel-view synthesis, scene editing, and relighting.
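To make "executable Blender code" and relighting concrete, here is a small sketch, not taken from the paper, that emits a Blender Python script from a scene description and then relights the scene by changing one field. `ToyScene` and `to_blender_script` are hypothetical names, and the single cube and sun light are placeholder content. The emitted script uses standard `bpy` operators and is meant to be run inside Blender; generating the text itself needs no Blender install.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyScene:
    """Hypothetical scene description: one cube lit by one sun light."""
    cube_size: float = 2.0
    cube_location: Tuple[float, float, float] = (0.0, 0.0, 1.0)
    sun_location: Tuple[float, float, float] = (4.0, -4.0, 6.0)
    sun_energy: float = 3.0


def to_blender_script(scene: ToyScene) -> str:
    """Emit Python source meant to be run inside Blender (it imports bpy)."""
    return "\n".join([
        "import bpy",
        f"bpy.ops.mesh.primitive_cube_add(size={scene.cube_size}, location={scene.cube_location})",
        f"bpy.ops.object.light_add(type='SUN', location={scene.sun_location})",
        "# light_add makes the new light the active object",
        f"bpy.context.object.data.energy = {scene.sun_energy}",
    ])


original = ToyScene()
relit = replace(original, sun_energy=5.0)  # relighting as a one-field edit

print(to_blender_script(relit))
```

Because the scene is a program rather than a fixed mesh or image, an edit such as relighting changes one parameter and leaves the geometry lines untouched.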


Get this paper in your agent:

hf papers read 2606.02580
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
