Authors: Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Ming Li, Jiaqi Wang, Kaicheng Yu
Published: 2026-05-25
arXiv: 2605.26089
Code: https://github.com/songweii/CVQ
Channel-wise Vector Quantization
Abstract
Channel-wise Vector Quantization replaces patch-wise tokens with channel-wise tokens in image tokenization, enabling a next-channel prediction framework that generates images by sequentially refining visual details.
AI-generated summary
We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
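The contrast between patch-wise and channel-wise tokens can be made concrete with a toy sketch. The abstract does not specify how a channel is turned into a token, so the code below makes several assumptions: each channel is flattened into a single vector of length `H * W`, one shared codebook serves all channels, and tokens are chosen by plain nearest-neighbour lookup. The function names, the random codebooks, and the toy sizes are hypothetical and are not taken from the authors' repository. The `codebook_utilization` helper shows one common way to compute the utilization figure the abstract refers to.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Return, for each row of `vectors`, the index of the closest codebook row."""
    # Squared Euclidean distance between every vector and every code: shape (N, K).
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def patchwise_tokens(feat, codebook):
    """Conventional VQ: one token per spatial position; each vector has length C."""
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W).T  # (H*W, C)
    return nearest_code(vectors, codebook)  # (H*W,)


def channelwise_tokens(feat, codebook):
    """Channel-wise VQ (assumed form): one token per channel; each vector has length H*W."""
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W)  # (C, H*W)
    return nearest_code(vectors, codebook)  # (C,)


def codebook_utilization(tokens, codebook_size):
    """Fraction of codebook entries that appear at least once in `tokens`."""
    return np.unique(tokens).size / codebook_size


rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 32  # toy sizes, chosen only for illustration
feat = rng.normal(size=(C, H, W))  # stand-in for an encoder feature map

patch_codebook = rng.normal(size=(K, C))        # codes live in channel space
channel_codebook = rng.normal(size=(K, H * W))  # codes live in spatial space

patch_ids = patchwise_tokens(feat, patch_codebook)
channel_ids = channelwise_tokens(feat, channel_codebook)

print(patch_ids.shape)    # (16,): one token per spatial position
print(channel_ids.shape)  # (8,): one token per channel
```

Under these assumptions the token sequence length changes from `H * W` to `C`, and each token summarizes the whole spatial extent of one channel rather than all channels at one location. That is what makes a channel-by-channel generation order possible in principle; how the actual CAR model orders and conditions on channels is not described in the abstract.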