Hugging Face Daily Papers · June 2, 2026 · 3 min read

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


SEIG is an agentic framework that reconstructs 3D scenes from single images by progressively generating executable Blender code, enabling novel-view synthesis, scene editing, and relighting.


arxiv:2606.02580


Published on Jun 1 · Submitted by taesiri on Jun 2


Authors: Guangzhao He, Rundong Luo, Wei-Chiu Ma, Hadar Averbuch-Elor

Abstract

AI-generated summary: Pretrained vision-language models can reconstruct 3D scenes from single images as editable Blender programs through progressive refinement, demonstrating improved fidelity through staged reconstruction approaches.

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
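To make the staged idea concrete, here is a minimal sketch of how a scene could be accumulated as a Blender Python program one factor at a time. It is not the authors' implementation: the stage order simply follows the factors listed in the abstract, and `SceneProgram`, `propose`, and `reconstruct` are hypothetical names. `propose` returns canned code where a real system would presumably query a VLM and compare renders against the input image. The emitted script uses standard `bpy` calls but is only syntax-checked here, so the sketch runs without Blender installed.

```python
from dataclasses import dataclass, field

# Assumed stage order, taken from the scene factors named in the abstract.
STAGES = ("geometry", "materials", "composition", "lighting")

# Canned per-stage Blender code; a stand-in for what a VLM might emit.
CANNED = {
    "geometry": "\n".join([
        "bpy.ops.mesh.primitive_cube_add(size=2.0, location=(0.0, 0.0, 1.0))",
        "box = bpy.context.active_object",
        "box.name = 'Box'",
    ]),
    "materials": "\n".join([
        "mat = bpy.data.materials.new(name='BoxMaterial')",
        "mat.diffuse_color = (0.8, 0.2, 0.2, 1.0)",
        "bpy.data.objects['Box'].data.materials.append(mat)",
    ]),
    "composition": "\n".join([
        "cam_data = bpy.data.cameras.new('SketchCamera')",
        "cam = bpy.data.objects.new('SketchCamera', cam_data)",
        "bpy.context.collection.objects.link(cam)",
        "cam.location = (6.0, -6.0, 4.0)",
        "bpy.context.scene.camera = cam",
    ]),
    "lighting": "\n".join([
        "sun_data = bpy.data.lights.new(name='SketchSun', type='SUN')",
        "sun_data.energy = 3.0",
        "sun = bpy.data.objects.new('SketchSun', sun_data)",
        "bpy.context.collection.objects.link(sun)",
    ]),
}


@dataclass
class SceneProgram:
    """Holds one block of Blender Python source per stage."""
    blocks: dict = field(default_factory=dict)

    def set_stage(self, stage: str, code: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.blocks[stage] = code.strip()

    def source(self) -> str:
        # Emit blocks in the fixed stage order, regardless of insertion order.
        parts = ["import bpy"]
        for stage in STAGES:
            if stage in self.blocks:
                parts.append(f"# --- {stage} ---\n{self.blocks[stage]}")
        return "\n\n".join(parts) + "\n"


def propose(stage: str, program: SceneProgram) -> str:
    """Hypothetical VLM call; here it ignores `program` and returns canned code."""
    return CANNED[stage]


def reconstruct() -> str:
    program = SceneProgram()
    for stage in STAGES:
        program.set_stage(stage, propose(stage, program))
        # Syntax check only; rendering and image comparison are omitted.
        compile(program.source(), f"<{stage}>", "exec")
    return program.source()


if __name__ == "__main__":
    print(reconstruct())
```

The printed script can be pasted into Blender's scripting workspace to run it. Keeping each stage as a separate block means a later stage can be regenerated without touching the earlier ones, which is one plausible reading of "task decomposition" in this setting.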




