r/LocalLLaMA · June 21, 2026 · 2 min read

[NEW MODEL] SupraLabs started the Any2Any model family!

#multimodal



SupraLabs Supra-A2A-Nano-Exp - ~30M Any-to-Any Multimodal Transformer

Status: Experimental / Educational Prototype

🚀 Overview

Supra-A2A-Nano-Exp is a ~30M parameter autoregressive Transformer that unifies text, image, and video into a single token stream.

There are:

- No separate vision encoder
- No diffusion model
- No cross-attention modules between modalities

Instead, everything is treated as tokens in one shared sequence.

🧠 Core Idea

The model predicts the next token in a unified stream where tokens can represent:

Text tokens
Image patches (VQ-VAE codes)
Video frames (sequences of visual tokens)

👉 Multimodality = language modeling over a shared vocabulary.

🔤 Unified Token Stream Format

<TEXT>some text</TEXT> <IMAGE><FRAME>[64 visual tokens]</IMAGE> <VIDEO><FRAME>[frames of visual tokens]</VIDEO>
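The layout above can be sketched as plain list-building. This is only an illustration, not the repository's code: the helper names are hypothetical, visual codes are shown as raw integers, and emitting one `<FRAME>` marker before every video frame is an assumption, since the format line does not say whether the marker repeats.

```python
from typing import List, Sequence, Union

# str = text or special marker, int = visual code (illustrative representation)
Token = Union[str, int]


def text_segment(text: str) -> List[Token]:
    """Wrap raw text in its modality markers."""
    return ["<TEXT>", text, "</TEXT>"]


def image_segment(codes: Sequence[int]) -> List[Token]:
    """One image = one <FRAME> marker followed by its visual codes."""
    return ["<IMAGE>", "<FRAME>", *codes, "</IMAGE>"]


def video_segment(frames: Sequence[Sequence[int]]) -> List[Token]:
    """Assumption: every frame is introduced by its own <FRAME> marker."""
    stream: List[Token] = ["<VIDEO>"]
    for codes in frames:
        stream.append("<FRAME>")
        stream.extend(codes)
    stream.append("</VIDEO>")
    return stream
```

With 64 codes per frame, `image_segment` yields 67 items: three markers plus the codes.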

📚 Tokenization

Text side

GPT-2 BPE tokenizer: 50,257 tokens
Special tokens (7):
<TEXT>, </TEXT>
<IMAGE>, </IMAGE>
<VIDEO>, </VIDEO>
<FRAME>

Total text vocab: 50,264 tokens

Vision side

VQ-VAE encoder/decoder
3-layer convolutional encoder (8× spatial downsampling)
Codebook: 256 entries × 64 dimensions
Image 64×64 → 8×8 grid → 64 tokens
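To make the 64×64 → 64 tokens step concrete, here is a minimal pure-Python sketch of the standard VQ-VAE quantization step: each latent vector is replaced by the index of its nearest codebook entry. The function names are hypothetical, and the real pipeline presumably does this with tensors on the encoder output rather than Python lists.

```python
from typing import List, Sequence


def tokens_per_image(image_size: int = 64, downsample: int = 8) -> int:
    """Number of visual tokens for a square image after spatial downsampling."""
    side = image_size // downsample
    return side * side


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map every latent vector in the grid to a discrete code index."""
    return [nearest_code(v, codebook) for v in latents]
```

In the described setup the codebook would hold 256 entries of 64 dimensions, and `latents` would be the 8×8 grid flattened to 64 vectors.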

Combined vocabulary

50,264 (text) + 256 (visual) = 50,520 tokens
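One way to lay out the shared vocabulary is sketched below. The sizes come from the post; the ordering (GPT-2 ids first, then the 7 special tokens in the order listed, then the 256 visual codes) is an assumption, so check the actual tokenizer files before relying on these ids.

```python
GPT2_VOCAB = 50_257
SPECIAL_TOKENS = ["<TEXT>", "</TEXT>", "<IMAGE>", "</IMAGE>",
                  "<VIDEO>", "</VIDEO>", "<FRAME>"]
VISUAL_CODES = 256

# Assumed layout: visual codes are appended after the text vocabulary.
VISUAL_OFFSET = GPT2_VOCAB + len(SPECIAL_TOKENS)
VOCAB_SIZE = VISUAL_OFFSET + VISUAL_CODES


def visual_to_token_id(code: int) -> int:
    """Shift a VQ-VAE codebook index into the shared vocabulary."""
    if not 0 <= code < VISUAL_CODES:
        raise ValueError(f"visual code out of range: {code}")
    return VISUAL_OFFSET + code


def token_kind(token_id: int) -> str:
    """Classify a shared-vocabulary id by the range it falls in."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id out of range: {token_id}")
    if token_id < GPT2_VOCAB:
        return "text"
    if token_id < VISUAL_OFFSET:
        return "special"
    return "visual"
```

Under this layout the model needs no modality flag: the id range alone tells a decoder whether to detokenize text or look up a codebook vector.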

🏗️ Architecture

Component	Specification
Backbone	GPT-style Transformer
Layers	4
Embedding size	256
Context length	384 tokens
Attention heads	4 (assumed)
MLP	4× expansion
Total parameters	~29.9M
Precision	FP32
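As a rough sanity check on the table, the sketch below counts parameters for a GPT-2-style layout: learned position embeddings, biases on all linear layers, two LayerNorms per block plus a final one. That layout is an assumption, as is the question of whether the output head shares weights with the token embedding. Under these assumptions the count is about 16.2M with a tied head and about 29.1M with an untied one. The untied figure is closer to the stated ~29.9M, but this sketch does not establish where the remainder comes from.

```python
def gpt_param_count(vocab: int = 50_520, n_embd: int = 256, n_layer: int = 4,
                    n_ctx: int = 384, mlp_ratio: int = 4,
                    tied_head: bool = True) -> int:
    """Parameter estimate for an assumed GPT-2-style decoder."""
    hidden = mlp_ratio * n_embd
    # Token embedding plus learned position embedding.
    embeddings = vocab * n_embd + n_ctx * n_embd
    # Fused QKV projection and output projection, both with bias.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Two linear layers with bias.
    mlp = (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)
    # Two LayerNorms per block, each with weight and bias.
    layer_norms = 2 * (2 * n_embd)
    block = attention + mlp + layer_norms
    final_norm = 2 * n_embd
    # An untied output head adds a second vocab-sized matrix.
    head = 0 if tied_head else vocab * n_embd
    return embeddings + n_layer * block + final_norm + head
```

Either way, the vocabulary matrices dominate: the four Transformer blocks account for only about 3.2M parameters.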

📁 Repository Files

File	Description
`model.safetensors`	GPT backbone weights
`vqvae.safetensors`	VQ-VAE weights
`tokenizer.json`	BPE tokenizer
`tokenizer_config.json`	Tokenizer metadata
`run_supra_a2a.py`	Full inference pipeline (code in Readme.md)

⚙️ Installation

pip install torch transformers huggingface_hub safetensors Pillow numpy

🧪 Usage Modes

Text generation

python run_supra_a2a.py --mode text --prompt "<TEXT>Once upon a time"

Chat mode

python run_supra_a2a.py --mode chat

Image reconstruction

python run_supra_a2a.py --mode reconstruct --image input.png --out output.png

Text-to-image

python run_supra_a2a.py --mode text2image --prompt "<TEXT>a red square</TEXT><IMAGE>" --out output.png

🧩 Key Insight

This model does not switch between modalities.

It simply predicts the next token.

That token might be:

- a word
- a visual code
- a frame element

Everything is treated equally.
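The "one loop for every modality" idea can be sketched as a generic decoding loop. Here `next_token` is a hypothetical stand-in for the model (the real pipeline presumably samples from logits), and the stop id is whatever closing marker fits the task, for example `</IMAGE>` for text-to-image.

```python
from typing import Callable, List, Sequence


def generate(next_token: Callable[[List[int]], int],
             prompt_ids: Sequence[int],
             stop_id: int,
             max_new_tokens: int) -> List[int]:
    """Append tokens one at a time until the stop id or the budget is reached.

    Nothing in the loop depends on whether a token is text or visual.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)
        if token == stop_id:
            break
    return ids
```

The only modality-specific work happens after the loop, when the generated ids are routed to the text detokenizer or the VQ-VAE decoder.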

⚠️ Important Caveats

Attention heads (inferred)

Default assumption: 4 heads
May be incorrect depending on checkpoint
Incorrect value can silently degrade performance
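Why the failure is silent: in the common GPT-style layout, Q, K, and V come from one fused projection whose shape depends only on the embedding size, so a checkpoint loads without error for any head count that divides 256. Whether this repository uses that layout is an assumption. The sketch below, with hypothetical helper names, lists the head counts that would load and the per-head dimension each implies.

```python
from typing import List


def head_dim(n_embd: int, n_head: int) -> int:
    """Per-head dimension; the embedding size must split evenly across heads."""
    if n_head <= 0 or n_embd % n_head != 0:
        raise ValueError(f"{n_embd} is not divisible by {n_head} heads")
    return n_embd // n_head


def loadable_head_counts(n_embd: int) -> List[int]:
    """Head counts that leave weight shapes unchanged, so loading cannot catch a mismatch."""
    return [h for h in range(1, n_embd + 1) if n_embd % h == 0]
```

For an embedding size of 256, the default of 4 heads gives a head dimension of 64, but 2, 8, or 16 heads would load just as cleanly and compute different attention.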

VQ-VAE output activation

Default assumption: sigmoid (0–1 range)

Alternative: tanh (-1 to 1 range)
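The choice matters when decoder outputs are converted to pixels. A minimal sketch, using a hypothetical helper on a single value: a sigmoid-range output maps directly to 0–255, while a tanh-range output has to be shifted and scaled first.

```python
def to_uint8(value: float, activation: str = "sigmoid") -> int:
    """Convert one decoder output value to an 8-bit pixel intensity."""
    if activation == "sigmoid":
        unit = value                  # already in [0, 1]
    elif activation == "tanh":
        unit = (value + 1.0) / 2.0    # map [-1, 1] to [0, 1]
    else:
        raise ValueError(f"unknown activation: {activation}")
    unit = min(max(unit, 0.0), 1.0)   # clip anything outside the expected range
    return int(round(unit * 255))
```

If the checkpoint actually uses tanh but outputs are read as sigmoid, every negative value clips to 0, so reconstructions that look too dark or too high in contrast are a hint to try the other setting.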

📉 Limitations

~30M parameters (small scale)
384 token context window
Low-resolution, abstract image generation
No RLHF or instruction tuning
Experimental research prototype
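The context limit translates directly into a frame budget. The arithmetic below assumes 64 visual tokens per frame, one `<FRAME>` marker per frame, and one opening and one closing `<VIDEO>` marker; the per-frame marker is an assumption about the stream format, and the function name is hypothetical.

```python
CONTEXT_LENGTH = 384
TOKENS_PER_FRAME = 64


def max_video_frames(context: int = CONTEXT_LENGTH, prompt_tokens: int = 0) -> int:
    """How many whole frames fit after the prompt and the <VIDEO>, </VIDEO> markers."""
    budget = context - prompt_tokens - 2
    per_frame = TOKENS_PER_FRAME + 1  # codes plus the assumed <FRAME> marker
    return max(budget // per_frame, 0)
```

With an empty prompt this gives 5 frames. A single image costs 67 tokens (`<IMAGE>`, `<FRAME>`, 64 codes, `</IMAGE>`), which leaves 317 tokens for everything else in the sequence.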

💡 Interpretation

This architecture explores a radical simplification:

Instead of separate systems for vision and language:

👉 everything becomes tokens

👉 everything is modeled by one Transformer

👉 modality boundaries disappear

🧠 Final Take

This is not a production-grade model.

But it is a clean conceptual experiment showing that:

images can be token sequences
video can be token sequences
multimodal learning can be pure language modeling

Feedback welcome!

submitted by /u/Dangerous_Try3619