r/LocalLLaMA · · 2 min read

[NEW MODEL] SupraLabs started the Any2Any model family!

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

SupraLabs Supra-A2A-Nano-Exp - ~30M Any-to-Any Multimodal Transformer

Status: Experimental / Educational Prototype


🚀 Overview

Supra-A2A-Nano-Exp is a ~30M parameter autoregressive Transformer that unifies text, image, and video into a single token stream.

There are:

  • No separate vision encoder
  • No diffusion model
  • No cross-attention modules between modalities

Instead, everything is treated as tokens in one shared sequence.


🧠 Core Idea

The model predicts the next token in a unified stream where tokens can represent:

  • Text tokens
  • Image patches (VQ-VAE codes)
  • Video frames (sequences of visual tokens)

👉 Multimodality = language modeling over a shared vocabulary.


🔤 Unified Token Stream Format

<TEXT>some text</TEXT>
<IMAGE><FRAME>[64 visual tokens]</IMAGE>
<VIDEO><FRAME>[frames of visual tokens]</VIDEO>
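A minimal sketch of how such a stream might be assembled, using string markers in place of real token IDs. The helper names (`text_segment`, `image_segment`, `video_segment`) are hypothetical, and placing one `<FRAME>` marker before each 64-token frame is an assumption read off the format above, not something confirmed from the repository code.

```python
# Illustrative only: symbolic markers stand in for real token IDs.
TOKENS_PER_FRAME = 64  # 8x8 grid of VQ-VAE codes per 64x64 image


def text_segment(text_tokens):
    """Wrap text tokens in <TEXT> ... </TEXT>."""
    return ["<TEXT>", *text_tokens, "</TEXT>"]


def image_segment(codes):
    """Wrap one frame of visual codes in <IMAGE><FRAME> ... </IMAGE>."""
    if len(codes) != TOKENS_PER_FRAME:
        raise ValueError(f"expected {TOKENS_PER_FRAME} codes, got {len(codes)}")
    return ["<IMAGE>", "<FRAME>", *codes, "</IMAGE>"]


def video_segment(frames):
    """Wrap several frames, each preceded by <FRAME> (assumed layout)."""
    out = ["<VIDEO>"]
    for codes in frames:
        if len(codes) != TOKENS_PER_FRAME:
            raise ValueError(f"expected {TOKENS_PER_FRAME} codes, got {len(codes)}")
        out += ["<FRAME>", *codes]
    out.append("</VIDEO>")
    return out


frame = [f"v{i}" for i in range(TOKENS_PER_FRAME)]  # placeholder visual codes
stream = text_segment(["a", "red", "square"]) + image_segment(frame)
print(len(stream))  # 5 text-side items + 67 image-side items
```

Under this layout a single image costs 67 positions (64 codes plus three markers), which matters given the 384-token context.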


📚 Tokenization

Text side

  • GPT-2 BPE tokenizer: 50,257 tokens
  • Special tokens (7):
      • <TEXT>, </TEXT>
      • <IMAGE>, </IMAGE>
      • <VIDEO>, </VIDEO>
      • <FRAME>

Total text vocab: 50,264 tokens


Vision side

  • VQ-VAE encoder/decoder
  • 3-layer convolutional encoder (8× spatial downsampling)
  • Codebook: 256 entries × 64 dimensions
  • 64×64 image → 8×8 latent grid → 64 tokens
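The step from an 8×8 latent grid to 64 tokens is a nearest-neighbour lookup in the codebook. The sketch below shows that lookup with NumPy on random stand-in data. The `quantize` helper is hypothetical, the row-major ordering of the grid is an assumption, and the real encoder and trained codebook are not used here.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))  # 256 entries x 64 dims (random stand-in)
latents = rng.normal(size=(8, 8, 64))  # stand-in encoder output for one image


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    flat = latents.reshape(-1, codebook.shape[1])  # (64, 64), row-major (assumed)
    # Squared Euclidean distance between every latent and every code: (64, 256)
    d2 = (
        (flat ** 2).sum(axis=1, keepdims=True)
        - 2.0 * flat @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)  # 64 integer codes in [0, 256)


codes = quantize(latents, codebook)
print(codes.shape)
```

Decoding would go the other way: look up `codebook[codes]`, reshape to 8×8×64, and pass the result through the VQ-VAE decoder.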

Combined vocabulary

50,264 (text) + 256 (visual) = 50,520 tokens
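One plausible way to lay out the shared ID space is to append the 256 visual codes after the text vocabulary. The sketch below assumes exactly that: the offset of 50,264 and the function names are illustrative, and the actual ID assignment in the checkpoint may differ.

```python
GPT2_VOCAB = 50_257
SPECIAL_TOKENS = ["<TEXT>", "</TEXT>", "<IMAGE>", "</IMAGE>",
                  "<VIDEO>", "</VIDEO>", "<FRAME>"]
CODEBOOK_SIZE = 256

TEXT_VOCAB = GPT2_VOCAB + len(SPECIAL_TOKENS)  # 50,264
VISUAL_OFFSET = TEXT_VOCAB                     # assumption: visual IDs come last
TOTAL_VOCAB = TEXT_VOCAB + CODEBOOK_SIZE       # 50,520


def visual_code_to_token_id(code):
    """Shift a VQ-VAE code (0..255) into the shared vocabulary."""
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"code out of range: {code}")
    return VISUAL_OFFSET + code


def token_id_to_visual_code(token_id):
    """Inverse of visual_code_to_token_id."""
    if not VISUAL_OFFSET <= token_id < TOTAL_VOCAB:
        raise ValueError(f"not a visual token: {token_id}")
    return token_id - VISUAL_OFFSET


def token_kind(token_id):
    """Classify an ID as 'text', 'special', or 'visual' under this layout."""
    if not 0 <= token_id < TOTAL_VOCAB:
        raise ValueError(f"token id out of range: {token_id}")
    if token_id < GPT2_VOCAB:
        return "text"
    if token_id < TEXT_VOCAB:
        return "special"
    return "visual"


print(TEXT_VOCAB, TOTAL_VOCAB)
```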


🏗️ Architecture

| Component | Specification |
|---|---|
| Backbone | GPT-style Transformer |
| Layers | 4 |
| Embedding size | 256 |
| Context length | 384 tokens |
| Attention heads | 4 (assumed) |
| MLP | 4× expansion |
| Total parameters | ~29.9M |
| Precision | FP32 |
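A rough back-of-the-envelope check of these figures, assuming a GPT-2-style block layout (biased linear layers, two LayerNorms per block, learned position embeddings). Whether the input embedding and output head share weights is not stated, so the hypothetical `gpt_param_count` helper computes both cases.

```python
def gpt_param_count(vocab, d_model, n_layers, context, mlp_mult=4, tied=True):
    """Estimate parameters for a GPT-2-style decoder (assumed layout)."""
    embed = vocab * d_model + context * d_model           # token + position tables
    attn = (d_model * 3 * d_model + 3 * d_model)          # fused QKV projection
    attn += d_model * d_model + d_model                   # attention output projection
    hidden = mlp_mult * d_model
    mlp = (d_model * hidden + hidden) + (hidden * d_model + d_model)
    norms = 2 * (2 * d_model)                             # two LayerNorms per block
    block = attn + mlp + norms
    final_norm = 2 * d_model
    head = 0 if tied else vocab * d_model                 # separate output head if untied
    return embed + n_layers * block + final_norm + head


tied = gpt_param_count(50_520, 256, 4, 384, tied=True)
untied = gpt_param_count(50_520, 256, 4, 384, tied=False)
print(f"tied: {tied:,}  untied: {untied:,}")
```

Under these assumptions the count comes to roughly 16.2M with tied embeddings and 29.1M untied. The untied figure is close to the reported ~29.9M, though this sketch does not explain the remaining gap. In both cases the embedding tables, not the four Transformer blocks, account for most of the parameters.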

📁 Repository Files

| File | Description |
|---|---|
| model.safetensors | GPT backbone weights |
| vqvae.safetensors | VQ-VAE weights |
| tokenizer.json | BPE tokenizer |
| tokenizer_config.json | Tokenizer metadata |
| run_supra_a2a.py | Full inference pipeline (code is in README.md) |

⚙️ Installation

```bash
pip install torch transformers huggingface_hub safetensors Pillow numpy
```


🧪 Usage Modes

Text generation

```bash
python run_supra_a2a.py --mode text --prompt "<TEXT>Once upon a time"
```

Chat mode

```bash
python run_supra_a2a.py --mode chat
```

Image reconstruction

```bash
python run_supra_a2a.py --mode reconstruct --image input.png --out output.png
```

Text-to-image

```bash
python run_supra_a2a.py --mode text2image --prompt "<TEXT>a red square</TEXT><IMAGE>" --out output.png
```


🧩 Key Insight

This model does not switch between modalities.

It simply predicts the next token.

That token might be:

  • a word
  • a visual code
  • a frame element

Everything is treated equally.
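To make this concrete, here is a sketch of a text-to-image decoding loop in which the only modality-specific step is a mask over the logits. The `fake_logits` function is a random stand-in for the Transformer. Greedy decoding, restricting sampling to visual IDs after `<IMAGE>`, and the 50,264 offset are all assumptions; the decoding strategy of the actual script is not described in this post.

```python
import numpy as np

TEXT_VOCAB = 50_264      # GPT-2 BPE + 7 special tokens
TOTAL_VOCAB = 50_520     # + 256 visual codes (assumed to occupy the last IDs)
TOKENS_PER_IMAGE = 64


def fake_logits(tokens, rng):
    """Stand-in for the model: one logit per entry in the shared vocabulary."""
    return rng.normal(size=TOTAL_VOCAB)


def generate_image_codes(prompt_ids, rng):
    """Greedy next-token loop, restricted to visual IDs (an assumed strategy)."""
    tokens = list(prompt_ids)
    codes = []
    for _ in range(TOKENS_PER_IMAGE):
        logits = fake_logits(tokens, rng)
        logits[:TEXT_VOCAB] = -np.inf          # forbid text and special tokens
        next_id = int(np.argmax(logits))
        tokens.append(next_id)                 # same stream, same loop as text
        codes.append(next_id - TEXT_VOCAB)     # back to a 0..255 codebook index
    return codes


codes = generate_image_codes([0, 1, 2], np.random.default_rng(0))
print(len(codes))
```

The resulting 64 codes would then be reshaped to an 8×8 grid and passed to the VQ-VAE decoder.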


⚠️ Important Caveats

Attention heads (inferred)

  • Default assumption: 4 heads
  • May be incorrect depending on checkpoint
  • Incorrect value can silently degrade performance
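Why a wrong head count can fail silently: in a standard fused-QKV attention layer, the weight shapes depend only on the embedding size, so the same weights load under any head count that divides 256. The NumPy sketch below illustrates this with random stand-in weights. It assumes a conventional GPT-style attention layout, which has not been verified against this checkpoint.

```python
import numpy as np


def causal_self_attention(x, w_qkv, w_out, n_heads):
    """Multi-head causal self-attention with a fused QKV projection."""
    seq, d_model = x.shape
    head_dim = d_model // n_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)           # each (seq, d_model)

    def split_heads(t):
        return t.reshape(seq, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)           # mask future positions
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    merged = (weights @ v).transpose(1, 0, 2).reshape(seq, d_model)
    return merged @ w_out


rng = np.random.default_rng(0)
d_model = 256
x = rng.normal(size=(10, d_model))
w_qkv = rng.normal(size=(d_model, 3 * d_model)) * 0.1   # shape independent of n_heads
w_out = rng.normal(size=(d_model, d_model)) * 0.1

out4 = causal_self_attention(x, w_qkv, w_out, n_heads=4)
out8 = causal_self_attention(x, w_qkv, w_out, n_heads=8)
print(out4.shape == out8.shape, np.allclose(out4, out8))
```

Both calls run without error and return the same shape, but the outputs differ, so a mismatched head count would show up only as worse generations and never as a loading error.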

VQ-VAE output activation

  • Default assumption: sigmoid (0–1 range)
  • Alternative: tanh (-1 to 1 range)
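The choice matters when converting decoder output to pixels. A small sketch of the two conversions follows; the `decoder_output_to_uint8` helper is hypothetical and only illustrates the range mapping, not the script's actual post-processing.

```python
import numpy as np


def decoder_output_to_uint8(x, activation="sigmoid"):
    """Map decoder output to 0..255 pixels for the assumed output activation."""
    if activation == "sigmoid":
        scaled = x                      # already in [0, 1]
    elif activation == "tanh":
        scaled = (x + 1.0) / 2.0        # shift [-1, 1] into [0, 1]
    else:
        raise ValueError(f"unknown activation: {activation}")
    return (np.clip(scaled, 0.0, 1.0) * 255.0).round().astype(np.uint8)


tanh_out = np.array([-1.0, 1.0])
print(decoder_output_to_uint8(tanh_out, "tanh"))     # full range preserved
print(decoder_output_to_uint8(tanh_out, "sigmoid"))  # negative values clipped to 0
```

If a tanh decoder is read as sigmoid, every negative value clips to black, so images that look unusually dark or high-contrast are a hint that the assumption is wrong.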


📉 Limitations

  • ~30M parameters (small scale)
  • 384 token context window
  • Low-resolution, abstract image generation
  • No RLHF or instruction tuning
  • Experimental research prototype
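The context limit translates directly into a frame budget for video. The estimate below assumes each frame costs 64 visual tokens plus one `<FRAME>` marker, and that the clip is wrapped in `<VIDEO>` and `</VIDEO>`. The `max_video_frames` helper is illustrative.

```python
CONTEXT = 384
TOKENS_PER_FRAME = 64


def max_video_frames(prompt_tokens=0):
    """Frames that fit after the prompt, under the assumed stream layout."""
    budget = CONTEXT - prompt_tokens - 2           # reserve <VIDEO> and </VIDEO>
    return max(budget // (TOKENS_PER_FRAME + 1), 0)  # +1 for each <FRAME> marker


print(max_video_frames(0))   # no text prompt
print(max_video_frames(60))  # 60 tokens of text, including its markers
```

Under these assumptions at most five frames fit with no prompt at all, and a 60-token prompt reduces that to four.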

💡 Interpretation

This architecture explores a radical simplification:

Instead of separate systems for vision and language:

👉 everything becomes tokens

👉 everything is modeled by one Transformer

👉 modality boundaries disappear

🧠 Final Take

This is not a production-grade model.

But it is a clean conceptual experiment showing that:

  • images can be token sequences
  • video can be token sequences
  • multimodal learning can be pure language modeling

Feedback welcome!

submitted by /u/Dangerous_Try3619