[NEW MODEL] SupraLabs just released SupraVL-Nano-900k, a Vision-Language Model built entirely from scratch!

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hey r/LocalLLaMA! We just released SupraVL-Nano-900k, our first VLM. It has ~900k parameters, was trained from scratch on Flickr8k, and the entire architecture fits in a single Jupyter notebook. This is not a production model; it's a fully transparent, readable blueprint for anyone who wants to understand how image-to-text models actually work under the hood.

🤗 SupraVL-Nano-900k

What is this?

Most VLMs are black boxes: CLIP encoders, billion-parameter LLMs, and fusion layers you can't easily read. SupraVL-Nano builds the whole thing from scratch: a CNN visual encoder, a GPT-2-style transformer decoder, a BPE tokenizer trained on the dataset itself, and a prefix-concatenation fusion strategy. Every component is documented.

The goal is simple: if you want to understand how a VLM works, you should be able to read the code.

Architecture

Component        Details
Visual encoder   3× Conv-BN-ReLU + AdaptiveAvgPool(4×4) → 16 spatial tokens
Visual channels  64-d → projected to 128-d
Decoder          GPT-2 style, 3 layers, d=128, 4 heads, FF=256
Vocabulary       2048 BPE tokens trained on Flickr8k captions
Context          16 visual tokens + 48 text tokens = 64 total positions
Parameters       ~900k
Fusion           Prefix concatenation (visual tokens prepended to text sequence)
Weight tying     tok_emb ↔ lm_head (GPT-2 style)

The 4×4 spatial grid is a deliberate choice over a single global token — the decoder can attend to different image regions when generating different words, which is closer to how real VLMs work.
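To make the shapes concrete, here is a rough sketch of the arithmetic behind the 16 visual tokens and the 64 positions. It assumes the 112×112 input and the stride-2, kernel-3, padding-1 convolutions that appear in the quick-start code; the helper names (conv_out, adaptive_pool_bins, next_token_row) are made up for illustration and are not part of the released model.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output side length of one Conv2d layer (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def adaptive_pool_bins(in_size, out_size):
    """Half-open [start, end) input ranges averaged into each output cell.

    Uses the usual adaptive-pooling rule: start = floor(i * in / out),
    end = ceil((i + 1) * in / out). Neighbouring bins may overlap.
    """
    bins = []
    for i in range(out_size):
        start = (i * in_size) // out_size
        end = -((-(i + 1) * in_size) // out_size)  # ceiling division
        bins.append((start, end))
    return bins


IMG_SIZE, GRID, MAX_TEXT = 112, 4, 48  # values taken from the post

# Three stride-2 convolutions shrink the image side: 112 -> 56 -> 28 -> 14.
side = IMG_SIZE
for _ in range(3):
    side = conv_out(side)

# Each of the 4 x 4 pooled cells becomes one visual token.
n_vis = GRID * GRID
total_positions = n_vis + MAX_TEXT


def next_token_row(n_text_so_far):
    """Row of the decoder output that predicts the next text token.

    Visual tokens occupy rows 0..n_vis-1, so the last text token sits at
    row n_vis + n_text_so_far - 1.
    """
    return n_vis + n_text_so_far - 1


print(side, adaptive_pool_bins(side, GRID), n_vis, total_positions)
```

Each pooled cell covers roughly a quarter of the feature map along each axis, which is why different visual tokens correspond to different image regions.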

Training

Setting        Value
Dataset        Flickr8k (30k train / 5k val pairs)
Epochs         15
Optimizer      AdamW (β₁=0.9, β₂=0.95, wd=0.01)
Learning rate  3e-4 → cosine decay → 3e-5
Batch size     64
Precision      Mixed (AMP)
Hardware       Kaggle 2× T4 / Google Colab T4
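For readers unfamiliar with the schedule, a minimal sketch of a cosine decay from 3e-4 to 3e-5 follows. It assumes per-step decay with no warmup and derives the step count from 30k training pairs, batch size 64, and 15 epochs; the post does not say whether warmup was used or whether the decay was per step or per epoch, so treat those as assumptions. The function name cosine_lr is hypothetical.

```python
import math


def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=3e-5):
    """Cosine decay from lr_max (step 0) to lr_min (final step)."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Assumed step budget: 30k pairs / batch 64, rounded up, times 15 epochs.
steps_per_epoch = math.ceil(30_000 / 64)
total_steps = 15 * steps_per_epoch

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

In a PyTorch training loop the same curve could come from a built-in scheduler instead; the hand-written version is only meant to show the formula.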

Quick start

Install:

pip install torch torchvision pillow huggingface_hub safetensors tokenizers 

Run:

import json
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from tokenizers import Tokenizer

REPO = "SupraLabs/SupraVL-Nano-900k"
ckpt_path = hf_hub_download(REPO, "model.safetensors")
tok_path = hf_hub_download(REPO, "tokenizer.json")
cfg_path = hf_hub_download(REPO, "config.json")

with open(cfg_path) as f:
    cfg = json.load(f)
tokenizer = Tokenizer.from_file(tok_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Config keys: D_MODEL, N_HEADS, N_LAYERS, D_FF, VIS_CH, N_VIS, VOCAB_SIZE, MAX_SEQ, IMG_SIZE
N_EMBD = cfg["D_MODEL"]           # 128
N_HEAD = cfg["N_HEADS"]           # 4
N_LAYER = cfg["N_LAYERS"]         # 3
D_FF = cfg["D_FF"]                # 256
VIS_CH = cfg["VIS_CH"]            # 64
VIS_TOKENS = cfg["N_VIS"]         # 16
VOCAB_SIZE = cfg["VOCAB_SIZE"]    # 2048
MAX_SEQ = cfg["MAX_SEQ"]          # 48
IMG_SIZE = cfg["IMG_SIZE"]        # 112
TOTAL_POS = VIS_TOKENS + MAX_SEQ  # 64
BOS_ID = cfg.get("bos_token_id", 1)
EOS_ID = cfg.get("eos_token_id", 2)

# --- Model definition ---

class CausalSelfAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.qkv = nn.Linear(N_EMBD, 3 * N_EMBD, bias=False)
        self.proj = nn.Linear(N_EMBD, N_EMBD, bias=False)
        self.n_head = N_HEAD
        # Lower-triangular mask over all 64 positions (visual prefix + text).
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(TOTAL_POS, TOTAL_POS)).view(1, 1, TOTAL_POS, TOTAL_POS),
        )

    def forward(self, x):
        # Note: the local T shadows the torchvision alias only inside this method.
        B, T, C = x.shape
        nh, hs = self.n_head, C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=-1)
        q = q.view(B, T, nh, hs).transpose(1, 2)
        k = k.view(B, T, nh, hs).transpose(1, 2)
        v = v.view(B, T, nh, hs).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) * (hs ** -0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        return self.proj((att @ v).transpose(1, 2).contiguous().view(B, T, C))


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(N_EMBD, D_FF)
        self.fc2 = nn.Linear(D_FF, N_EMBD)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.attn = CausalSelfAttention()
        self.ln2 = nn.LayerNorm(N_EMBD)
        self.mlp = MLP()

    def forward(self, x):
        # Pre-norm residual block.
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x


class VisualEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        c1, c2, c3 = VIS_CH // 4, VIS_CH // 2, VIS_CH
        self.conv1 = nn.Sequential(nn.Conv2d(3, c1, 3, 2, 1), nn.BatchNorm2d(c1), nn.ReLU(True))
        self.conv2 = nn.Sequential(nn.Conv2d(c1, c2, 3, 2, 1), nn.BatchNorm2d(c2), nn.ReLU(True))
        self.conv3 = nn.Sequential(nn.Conv2d(c2, c3, 3, 2, 1), nn.BatchNorm2d(c3), nn.ReLU(True))
        grid = int(VIS_TOKENS ** 0.5)
        self.pool = nn.AdaptiveAvgPool2d((grid, grid))
        self.proj = nn.Linear(c3, N_EMBD)

    def forward(self, x):
        x = self.conv3(self.conv2(self.conv1(x)))
        x = self.pool(x)  # pool once, then flatten the grid into tokens
        B, C, H, W = x.shape
        x = x.view(B, C, H * W).transpose(1, 2)  # (B, 16, VIS_CH)
        return self.proj(x)                      # (B, 16, N_EMBD)


class MiniVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vis_enc = VisualEncoder()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, N_EMBD)
        self.pos_emb = nn.Embedding(TOTAL_POS, N_EMBD)
        self.blocks = nn.ModuleList([Block() for _ in range(N_LAYER)])
        self.ln_f = nn.LayerNorm(N_EMBD)
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE, bias=False)

    def forward(self, img_tokens, tok_ids):
        B, T = tok_ids.shape
        # Prefix concatenation: visual tokens first, then text embeddings.
        seq = torch.cat([img_tokens, self.tok_emb(tok_ids)], dim=1)
        pos = self.pos_emb(torch.arange(VIS_TOKENS + T, device=tok_ids.device))
        x = seq + pos.unsqueeze(0)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))

    @torch.no_grad()
    def generate_beam(self, img, beam_width=3, max_new=48):
        self.eval()
        img_tokens = self.vis_enc(img)
        beams = [(0.0, [BOS_ID])]  # (cumulative log-prob, token ids)
        for _ in range(max_new):
            candidates = []
            for score, seq in beams:
                if seq[-1] == EOS_ID:
                    # Finished beams are carried over unchanged.
                    candidates.append((score, seq))
                    continue
                ids = torch.tensor([seq], dtype=torch.long, device=img.device)
                logits = self.forward(img_tokens, ids)
                # The row for the last text token predicts the next token.
                lprobs = F.log_softmax(logits[0, VIS_TOKENS + len(seq) - 1], dim=-1)
                topk = torch.topk(lprobs, beam_width)
                for lp, tok in zip(topk.values.tolist(), topk.indices.tolist()):
                    candidates.append((score + lp, seq + [tok]))
            beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
            if all(s[-1] == EOS_ID for _, s in beams):
                break
        best = [t for t in beams[0][1] if t not in (BOS_ID, EOS_ID)]
        return tokenizer.decode(best)


# --- Load weights ---
model = MiniVLM()
model.load_state_dict(load_file(ckpt_path, device=str(device)), strict=False)
model.lm_head.weight = model.tok_emb.weight  # re-tie the output head to the embedding
model.to(device).eval()

# --- Run inference ---
transform = T.Compose([
    T.Resize((IMG_SIZE, IMG_SIZE)),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
img = Image.open("your_image.jpg").convert("RGB")
img_t = transform(img).unsqueeze(0).to(device)
print("Caption:", model.generate_beam(img_t, beam_width=3, max_new=48))
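As a sanity check on model size, here is a back-of-the-envelope parameter count. It assumes exactly the layer shapes in the quick-start code (bias-free attention projections, biased MLP and conv layers, BatchNorm with a weight and a bias per channel); the released notebook may differ in details, and the helper names are made up for illustration.

```python
D, LAYERS, D_FF, VOCAB, POSITIONS, VIS_CH = 128, 3, 256, 2048, 64, 64


def linear(n_in, n_out, bias=True):
    """Parameters in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def conv_bn(c_in, c_out, k=3):
    """Parameters in a k x k conv (with bias) followed by BatchNorm2d."""
    conv = c_in * c_out * k * k + c_out
    bn = 2 * c_out  # learnable scale and shift; running stats are buffers
    return conv + bn


def layer_norm(d):
    """Parameters in a LayerNorm (scale and shift)."""
    return 2 * d


block = (
    layer_norm(D) + linear(D, 3 * D, bias=False) + linear(D, D, bias=False)
    + layer_norm(D) + linear(D, D_FF) + linear(D_FF, D)
)
encoder = (
    conv_bn(3, VIS_CH // 4) + conv_bn(VIS_CH // 4, VIS_CH // 2)
    + conv_bn(VIS_CH // 2, VIS_CH) + linear(VIS_CH, D)
)
embeddings = VOCAB * D + POSITIONS * D

# lm_head shares its matrix with tok_emb, so it adds nothing when tied.
tied_total = embeddings + LAYERS * block + layer_norm(D) + encoder
untied_total = tied_total + linear(D, VOCAB, bias=False)

print(f"tied: {tied_total:,}  untied: {untied_total:,}")
```

Under these assumptions the total depends on whether the tied lm_head matrix is counted once or twice, and the quoted ~900k falls between the two results.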

Generation strategies

Method          Call                                                 Notes
Greedy          model.generate_greedy(img)                           fast, deterministic
Top-k sampling  model.generate_topk(img, temperature=0.8, top_k=50)  more varied
Beam search     model.generate_beam(img, beam_width=3)               most fluent, recommended

Only generate_beam is defined in the quick-start snippet above.
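To show what separates greedy decoding from top-k sampling at a single step, here is a small pure-Python sketch operating on a plain list of logits. It is not the model's generate_greedy or generate_topk; the function names and the toy logits are illustrative only, with the temperature and top_k defaults borrowed from the table.

```python
import math
import random
from typing import Optional, Sequence


def greedy(logits: Sequence[float]) -> int:
    """Pick the single highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(
    logits: Sequence[float],
    top_k: int = 50,
    temperature: float = 0.8,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token id from the top_k highest logits after temperature scaling."""
    rng = rng or random.Random()
    k = max(1, min(top_k, len(logits)))
    # Keep only the k best token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
print("greedy:", greedy(toy_logits))
print("top-k :", sample_top_k(toy_logits, top_k=2, rng=random.Random(0)))
```

With top_k=1 the sampler reduces to greedy decoding. Beam search differs from both in that it keeps several partial captions alive and ranks them by cumulative log-probability.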

Limitations (be honest with yourselves)

This was trained on Flickr8k in under an hour on a T4. Expect short, generic captions, repetition on out-of-distribution images, occasional nonsense outputs, and no instruction following whatsoever. It is not competing with LLaVA. It is competing with nothing; it's an educational artifact.

Roadmap

  • Replace CNN with a tiny ViT patch encoder
  • Cross-attention layers instead of prefix concatenation (Flamingo-style)
  • Pretrained frozen CLIP backbone
  • Scale decoder to 6-12 layers, d=512+
  • Train on CC3M / LAION-400M
  • Scale up

Apache 2.0. Go read the code. That's the whole point.

submitted by /u/Dangerous_Try3619