[NEW MODEL] SupraLabs just released SupraVL-Nano-900k, a Vision-Language Model built entirely from scratch!
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hey r/LocalLLaMA! We just released SupraVL-Nano-900k, our first VLM. It has ~900k parameters, was trained from scratch on Flickr8k, and the entire architecture fits in a single Jupyter notebook. This is not a production model; it's a fully transparent, readable blueprint for anyone who wants to understand how image-to-text models actually work under the hood.
What is this?
Most VLMs are black boxes: CLIP encoders, billion-parameter LLMs, fusion layers you can't easily read. SupraVL-Nano builds the whole thing from scratch: a CNN visual encoder, a GPT-2-style transformer decoder, a BPE tokenizer trained on the dataset itself, and a prefix concatenation fusion strategy. Every component is documented.
The goal is simple: if you want to understand how a VLM works, you should be able to read the code.
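To make the "BPE tokenizer trained on the dataset itself" idea concrete, here is a toy sketch of the merge-learning loop in plain Python. It is not the project's tokenizer (the quick start loads a `tokenizer.json` through the `tokenizers` library); the function name `train_bpe`, the `</w>` end-of-word marker, and the whitespace pre-splitting are assumptions made for illustration.

```python
from collections import Counter

def train_bpe(captions, num_merges):
    """Learn up to `num_merges` pair merges from a list of caption strings (toy version)."""
    # Each word starts as a tuple of characters plus a hypothetical end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for c in captions for w in c.lower().split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into a single symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = train_bpe(["a dog runs", "a dog jumps", "the dog runs"], num_merges=3)
print(merges)
```

Running the loop until the symbol inventory reaches a target size (2048 in the table below) is the general idea; the real tokenizer may differ in pre-tokenization and special tokens.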
Architecture
| Component | Details |
|---|---|
| Visual encoder | 3× Conv-BN-ReLU + AdaptiveAvgPool(4×4) → 16 spatial tokens |
| Visual channels | 64-d → projected to 128-d |
| Decoder | GPT-2 style, 3 layers, d=128, 4 heads, FF=256 |
| Vocabulary | 2048 BPE tokens trained on Flickr8k captions |
| Context | 16 visual tokens + 48 text tokens = 64 total positions |
| Parameters | ~900k |
| Fusion | Prefix concatenation (visual tokens prepended to text sequence) |
| Weight tying | tok_emb ↔ lm_head (GPT-2 style) |
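As a rough cross-check of the table, the sketch below tallies trainable parameters by hand. It assumes the layer shapes used in the quick-start code further down (bias-free attention projections, biased MLP and conv layers, BatchNorm scale and shift, learned position embeddings); the helper names `linear` and `conv_bn` are made up for this example, and the notebook's own count may follow a different convention.

```python
# Hyperparameters as listed in the architecture table.
D, FF, LAYERS, VOCAB, POS, VIS_CH = 128, 256, 3, 2048, 64, 64

def linear(n_in, n_out, bias=True):
    """Parameters of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def conv_bn(c_in, c_out, k=3):
    """Conv2d weights and bias, plus BatchNorm scale and shift."""
    return c_in * c_out * k * k + c_out + 2 * c_out

layer_norm = 2 * D
block = (
    linear(D, 3 * D, bias=False)       # fused q, k, v projection
    + linear(D, D, bias=False)         # attention output projection
    + linear(D, FF) + linear(FF, D)    # MLP
    + 2 * layer_norm                   # two LayerNorms per block
)

c1, c2, c3 = VIS_CH // 4, VIS_CH // 2, VIS_CH
visual = conv_bn(3, c1) + conv_bn(c1, c2) + conv_bn(c2, c3) + linear(c3, D)

embeddings = VOCAB * D + POS * D       # token table + position table
decoder = LAYERS * block + layer_norm  # blocks + final LayerNorm

tied = embeddings + decoder + visual   # lm_head shares tok_emb's weights
untied = tied + VOCAB * D              # if lm_head were counted separately

print(f"tied: {tied:,}  untied: {untied:,}")
```

Under these assumptions the tally lands near 0.70M with the tied `lm_head` counted once and near 0.96M if it is counted separately, so the quoted ~900k may reflect a different counting convention or details not visible in the quick start.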
The 4×4 spatial grid is a deliberate choice over a single global token — the decoder can attend to different image regions when generating different words, which is closer to how real VLMs work.
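The path from pixels to 16 visual tokens can be traced with the standard convolution output-size formula. This sketch assumes a 112×112 input and kernel 3, stride 2, padding 1 for each conv stage, which is what the quick-start code and its config comments suggest; `conv_out` is a helper written for this example.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output side length of a square convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

side = 112                 # IMG_SIZE, per the config comment in the quick start
trace = [side]
for _ in range(3):         # three stride-2 Conv-BN-ReLU stages
    side = conv_out(side)
    trace.append(side)

grid = 4                   # AdaptiveAvgPool2d output side
n_visual_tokens = grid * grid

print("feature map sides:", trace)
print("visual tokens:", n_visual_tokens)
```

Each of the 16 tokens is therefore an average over a patch of the final 14×14 feature map, projected from 64 to 128 dimensions before it is prepended to the text sequence.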
Training
| Setting | Value |
|---|---|
| Dataset | Flickr8k (30k train / 5k val pairs) |
| Epochs | 15 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 3e-4 → cosine decay → 3e-5 |
| Batch size | 64 |
| Precision | Mixed (AMP) |
| Hardware | Kaggle 2× T4 / Google Colab T4 |
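The learning-rate row can be read as a cosine curve from 3e-4 down to 3e-5. The sketch below is one plausible form of that schedule, assuming per-step decay, no warmup, and roughly 30k training pairs per epoch; none of those details are stated in the post, and `cosine_lr` is a name chosen for this example.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine decay from lr_max at step 0 to lr_min at the final step."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(30_000 / 64)  # ~30k training pairs, batch size 64
total_steps = 15 * steps_per_epoch        # 15 epochs

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

In PyTorch the same curve is usually obtained with `torch.optim.lr_scheduler.CosineAnnealingLR` by setting `eta_min` to the floor value.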
Quick start
Install:
```
pip install torch torchvision pillow huggingface_hub safetensors tokenizers
```
Run:
```python
import json
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tokenizers import Tokenizer

REPO = "SupraLabs/SupraVL-Nano-900k"
ckpt_path = hf_hub_download(REPO, "model.safetensors")
tok_path = hf_hub_download(REPO, "tokenizer.json")
cfg_path = hf_hub_download(REPO, "config.json")

with open(cfg_path) as f:
    cfg = json.load(f)
tokenizer = Tokenizer.from_file(tok_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Config keys: D_MODEL, N_HEADS, N_LAYERS, D_FF, VIS_CH, N_VIS, VOCAB_SIZE, MAX_SEQ, IMG_SIZE
N_EMBD = cfg["D_MODEL"]           # 128
N_HEAD = cfg["N_HEADS"]           # 4
N_LAYER = cfg["N_LAYERS"]         # 3
D_FF = cfg["D_FF"]                # 256
VIS_CH = cfg["VIS_CH"]            # 64
VIS_TOKENS = cfg["N_VIS"]         # 16
VOCAB_SIZE = cfg["VOCAB_SIZE"]    # 2048
MAX_SEQ = cfg["MAX_SEQ"]          # 48
IMG_SIZE = cfg["IMG_SIZE"]        # 112
TOTAL_POS = VIS_TOKENS + MAX_SEQ  # 64
BOS_ID = cfg.get("bos_token_id", 1)
EOS_ID = cfg.get("eos_token_id", 2)


# --- Model definition ---
class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(N_EMBD, 3 * N_EMBD, bias=False)
        self.proj = nn.Linear(N_EMBD, N_EMBD, bias=False)
        self.n_head = N_HEAD
        # Lower-triangular mask over all positions (visual prefix + text).
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(TOTAL_POS, TOTAL_POS)).view(1, 1, TOTAL_POS, TOTAL_POS),
        )

    def forward(self, x):
        B, T, C = x.shape
        nh, hs = self.n_head, C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=-1)
        q = q.view(B, T, nh, hs).transpose(1, 2)
        k = k.view(B, T, nh, hs).transpose(1, 2)
        v = v.view(B, T, nh, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (hs ** -0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        return self.proj((att @ v).transpose(1, 2).contiguous().view(B, T, C))


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(N_EMBD, D_FF)
        self.fc2 = nn.Linear(D_FF, N_EMBD)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.attn = CausalSelfAttention()
        self.ln2 = nn.LayerNorm(N_EMBD)
        self.mlp = MLP()

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class VisualEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        c1, c2, c3 = VIS_CH // 4, VIS_CH // 2, VIS_CH
        self.conv1 = nn.Sequential(nn.Conv2d(3, c1, 3, 2, 1), nn.BatchNorm2d(c1), nn.ReLU(True))
        self.conv2 = nn.Sequential(nn.Conv2d(c1, c2, 3, 2, 1), nn.BatchNorm2d(c2), nn.ReLU(True))
        self.conv3 = nn.Sequential(nn.Conv2d(c2, c3, 3, 2, 1), nn.BatchNorm2d(c3), nn.ReLU(True))
        grid = int(VIS_TOKENS ** 0.5)
        self.pool = nn.AdaptiveAvgPool2d((grid, grid))
        self.proj = nn.Linear(c3, N_EMBD)

    def forward(self, x):
        x = self.pool(self.conv3(self.conv2(self.conv1(x))))
        B, C, H, W = x.shape
        # Flatten the grid into a sequence of H*W visual tokens.
        x = x.view(B, C, H * W).transpose(1, 2)
        return self.proj(x)


class MiniVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vis_enc = VisualEncoder()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)
        self.pos_emb = nn.Embedding(TOTAL_POS, N_EMBD)
        self.blocks = nn.ModuleList([Block() for _ in range(N_LAYER)])
        self.ln_f = nn.LayerNorm(N_EMBD)
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE, bias=False)

    def forward(self, img_tokens, tok_ids):
        B, T = tok_ids.shape
        # Prefix concatenation: visual tokens first, then text embeddings.
        seq = torch.cat([img_tokens, self.tok_emb(tok_ids)], dim=1)
        pos = self.pos_emb(torch.arange(VIS_TOKENS + T, device=tok_ids.device))
        x = seq + pos.unsqueeze(0)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))

    @torch.no_grad()
    def generate_beam(self, img, beam_width=3, max_new=48):
        self.eval()
        img_tokens = self.vis_enc(img)
        beams = [(0.0, [BOS_ID])]
        for _ in range(max_new):
            candidates = []
            for score, seq in beams:
                if seq[-1] == EOS_ID:
                    # Finished beams are carried over unchanged.
                    candidates.append((score, seq))
                    continue
                ids = torch.tensor([seq], dtype=torch.long, device=img.device)
                logits = self.forward(img_tokens, ids)
                # The last text position sits after the visual prefix.
                lprobs = F.log_softmax(logits[0, VIS_TOKENS + len(seq) - 1], dim=-1)
                topk = torch.topk(lprobs, beam_width)
                for lp, tok in zip(topk.values.tolist(), topk.indices.tolist()):
                    candidates.append((score + lp, seq + [tok]))
            beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
            if all(s[-1] == EOS_ID for _, s in beams):
                break
        best = [t for t in beams[0][1] if t not in (BOS_ID, EOS_ID)]
        return tokenizer.decode(best)


# --- Load weights ---
model = MiniVLM()
# strict=False tolerates missing or extra keys; lm_head is re-tied to tok_emb right after.
model.load_state_dict(load_file(ckpt_path, device=str(device)), strict=False)
model.lm_head.weight = model.tok_emb.weight
model.to(device).eval()

# --- Run inference ---
transform = T.Compose([
    T.Resize((IMG_SIZE, IMG_SIZE)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = Image.open("your_image.jpg").convert("RGB")
img_t = transform(img).unsqueeze(0).to(device)
print("Caption:", model.generate_beam(img_t, beam_width=3, max_new=48))
```
Generation strategies
| Method | Notes |
|---|---|
| Greedy | model.generate_greedy(img) — fast, deterministic |
| Top-k sampling | model.generate_topk(img, temperature=0.8, top_k=50) — more varied |
| Beam search | model.generate_beam(img, beam_width=3) — most fluent, recommended |
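The quick start above defines only `generate_beam`; the bodies of `generate_greedy` and `generate_topk` are not shown in this post. As a hedged illustration of what the other two strategies do at each decoding step, here is a plain-Python sketch that picks one token index from a list of logits. The function names `greedy_pick` and `top_k_pick` are hypothetical and this is not the author's implementation.

```python
import math
import random

def greedy_pick(logits):
    """Index of the largest logit (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_pick(logits, temperature=0.8, top_k=50, rng=random):
    """Sample an index from the top_k logits after temperature scaling."""
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

step_logits = [0.1, 2.0, -1.0, 1.5]
print("greedy:", greedy_pick(step_logits))
print("top-k :", top_k_pick(step_logits, temperature=0.8, top_k=2))
```

Lower temperatures sharpen the distribution toward the greedy choice, while a larger `top_k` admits more candidates and more varied captions.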
Limitations (be honest with yourselves)
This was trained on Flickr8k in under an hour on a T4. Expect short, generic captions, repetition on out-of-distribution images, occasional nonsense outputs, and no instruction following whatsoever. It is not competing with LLaVA. It is competing with nothing; it's an educational artifact.
Roadmap
- Replace CNN with a tiny ViT patch encoder
- Cross-attention layers instead of prefix concatenation (Flamingo-style)
- Pretrained frozen CLIP backbone
- Scale decoder to 6-12 layers, d=512+
- Train on CC3M / LAION-400M
- Scale up
Apache 2.0. Go read the code. That's the whole point.