Hugging Face Daily Papers · 4 min read

Brain-IT-VQA: From Brain Signals to Answers

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2605.29588

Brain-IT-VQA: From Brain Signals to Answers

Published on May 28, 2026 · Submitted by Roman Beliy on Jun 2
Authors: Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman, Michal Irani
Organization: Weizmann Institute of Science
Project page: https://mcosarinsky.github.io/brain-it-vqa/

Abstract

AI-generated summary: The Brain-IT-VQA framework decodes visual content from fMRI signals using a transformer-based architecture and introduces the NSD-VQA dataset for improved visual question answering evaluation.

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.
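
The abstract describes evaluation across controlled question categories but does not spell out the scoring procedure. As a rough illustration only, the sketch below shows one common way to report per-category accuracy and a macro average for a VQA benchmark. The category names, the toy records, and the exact-match scoring rule are all assumptions made for this example and are not taken from the paper or from NSD-VQA.

```python
from collections import defaultdict


def per_category_accuracy(records):
    """Return {category: accuracy} for (category, predicted, reference) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, reference in records:
        total[category] += 1
        # Assumed scoring rule: case-insensitive exact match.
        # The benchmark's actual metric may differ.
        if predicted.strip().lower() == reference.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


def macro_average(scores):
    """Average the per-category accuracies so each category counts equally."""
    return sum(scores.values()) / len(scores)


# Hypothetical toy records; category names are invented for illustration.
records = [
    ("object_presence", "yes", "yes"),
    ("object_presence", "no", "yes"),
    ("scene_type", "Indoor", "indoor"),
    ("scene_type", "outdoor", "outdoor"),
    ("object_count", "2", "3"),
]

scores = per_category_accuracy(records)
macro = macro_average(scores)
print(scores)
print(macro)
```

Reporting a score per category, rather than a single pooled accuracy, is what lets a benchmark of this kind show which types of visual and semantic information are decodable and which are not. The macro average keeps a category with many questions from dominating the summary number.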

Community

Paper submitter about 8 hours ago

A step toward connecting brain signals and large language models.

Get this paper in your agent:

hf papers read 2605.29588
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
