r/LocalLLaMA · June 13, 2026 · 3 min read

[NEW FAMILY OF MODELS] Supra1.5 family just released!


SupraLabs just released the Supra-1.5-exp line: Base, Instruct, and GGUF! (Reasoning coming soon.)

Hey r/LocalLLaMA! We are releasing the experimental Supra-1.5-50M family today: a new Base model with 5x the context window of the original Supra-50M, an Instruct fine-tune on top of it, and a GGUF quantized version ready to run anywhere.

🤗 Supra-1.5-50M-Base-exp | 🤗 Supra-1.5-50M-Instruct-exp | 🤗 GGUF | Supra1.5 50M Instruct Demo

These are experimental releases, part of Project Chimera.

The Instruct model uses the Alpaca prompt format!

    Below is an instruction that describes a task. Write a response that appropriately completes the request.

    ### Instruction:
    [INSTRUCTION]

    ### Response:

With additional input:

    Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    [INSTRUCTION]

    ### Input:
    [CONTEXT]

    ### Response:

----

What changed from Supra-50M?

The biggest upgrade is context. Supra-1.5 expands from 1,024 to 5,120 tokens using RoPE scaling, with continued pretraining on a 3B token mix of tool calling data, ChatML conversations, factual text, and math. Same architecture, same tokenizer, just a much better base for SFT and future RL work.
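
The post does not say which RoPE scaling variant was used. As a rough illustration only, here is a minimal sketch assuming plain linear position interpolation, where positions are divided by 5 so that the 5,120-token window maps back onto the 1,024-position range seen in pretraining. The `head_dim`, `base`, and function name are illustrative and not taken from the Supra config.

```python
import math

def rope_angles(position, head_dim=64, base=10000.0, scale=1.0):
    """Rotation angles for one token position under RoPE.

    scale > 1 applies linear position interpolation: the position is
    divided by `scale` before the angles are computed.
    head_dim, base, and scale are illustrative values, not Supra's config.
    """
    scaled_pos = position / scale
    return [
        scaled_pos * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]

ORIGINAL_CTX = 1024
EXTENDED_CTX = 5120
SCALE = EXTENDED_CTX / ORIGINAL_CTX  # 5.0

# The last position of the extended window lands inside the original range.
last_extended = rope_angles(EXTENDED_CTX - 1, scale=SCALE)
last_original = rope_angles(ORIGINAL_CTX - 1)
print(f"scale factor: {SCALE}")
print(f"highest-frequency angle at scaled position {EXTENDED_CTX - 1}: {last_extended[0]:.1f}")
print(f"highest-frequency angle at unscaled position {ORIGINAL_CTX - 1}: {last_original[0]:.1f}")
```

Under this assumption the model never sees rotation angles beyond what it met during the original 1,024-token pretraining, which is why a comparatively short continued-pretraining run can be enough to adapt to the longer window.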

Spec	Supra-50M	Supra-1.5-50M
Context length	1,024 tokens	5,120 tokens
Training data (CPT)	20B tokens (pretraining)	3T tokens (continued) (experimental 1T)
Data mix	Fineweb-Edu only	Tool calling, ChatML, factual, math
Instruct format	Alpaca	ChatML

Benchmarks (Instruct)

BLiMP sits at a consistent 67.4 across evaluations. The model also showed an interesting split between raw and normalized accuracy: science and factual tasks score higher on raw accuracy, while math and logic tasks score higher on normalized accuracy. Make of that what you will for a 50M model.
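
For readers unfamiliar with the raw vs. normalized distinction, here is a small sketch. It assumes the two numbers correspond to the usual multiple-choice metrics reported by evaluation harnesses: raw accuracy picks the answer with the highest summed log-likelihood, and normalized accuracy divides that sum by the answer's byte length first. The post does not name the harness, and the log-likelihood values below are made up.

```python
def pick_raw(choices):
    """Index of the choice with the highest summed log-likelihood."""
    return max(range(len(choices)), key=lambda i: choices[i][1])

def pick_normalized(choices):
    """Index of the choice with the highest log-likelihood per byte."""
    return max(
        range(len(choices)),
        key=lambda i: choices[i][1] / len(choices[i][0].encode("utf-8")),
    )

# (answer text, summed log-likelihood) pairs; the numbers are invented.
choices = [
    ("4", -3.0),
    ("the number four, which is two plus two", -20.0),
]

print("raw pick:       ", pick_raw(choices))
print("normalized pick:", pick_normalized(choices))
```

Raw scoring favors short answers because every extra token adds more negative log-likelihood. Normalization removes that length bias, so the two metrics can disagree on the same item, which is one plausible source of a per-task split like the one described above.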

The model is already listed on the Open SLM Leaderboard by AxiomicLabs.

Quick start

Base model:

    from transformers import pipeline
    import torch

    print("[*] Loading Supra-1.5-50M Base...")
    pipe = pipeline(
        "text-generation",
        model="SupraLabs/Supra-1.5-50M-Base-exp",
        device_map="auto",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
    )

    def generate_text(prompt, max_new_tokens=150):
        result = pipe(
            prompt,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.5,
            top_k=25,
            top_p=0.9,
            repetition_penalty=1.2,
            pad_token_id=pipe.tokenizer.pad_token_id,
            eos_token_id=pipe.tokenizer.eos_token_id
        )
        return result[0]['generated_text']

    print(generate_text("The importance of education is"))

Instruct model:

    import os, warnings
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
    warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

    import torch
    from transformers import pipeline, AutoTokenizer, logging
    logging.set_verbosity_error()

    MODEL_ID = "SupraLabs/Supra-1.5-50M-Instruct-exp"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, clean_up_tokenization_spaces=False)
    pipe = pipeline(
        "text-generation",
        model=MODEL_ID,
        tokenizer=tokenizer,
        device_map="auto",
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
    )

    def build_prompt(instruction, input_text=""):
        if input_text.strip():
            return (
                "Below is an instruction that describes a task, paired with an input "
                "that provides further context. Write a response that appropriately "
                "completes the request.\n\n"
                f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{input_text}\n\n### Response:\n"
            )
        return (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n### Response:\n"
        )

    def generate(instruction, input_text=""):
        result = pipe(
            build_prompt(instruction, input_text),
            max_new_tokens=512,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.9,
            repetition_penalty=1.15,
            pad_token_id=pipe.tokenizer.pad_token_id,
            eos_token_id=pipe.tokenizer.eos_token_id,
            return_full_text=False
        )
        return result[0]['generated_text'].strip()

    while True:
        print("\nEnter an instruction (or 'exit' to quit):")
        user_input = input().strip()
        if user_input.lower() == "exit":
            break
        print("\nEnter additional context (optional, press Enter to skip):")
        context_input = input().strip()
        print(f"\nResponse:\n{generate(user_input, context_input)}\n")

GGUF quantizations:

Bits	Quant	Size
1-bit	Q1_D	19.6 MB
1-bit	TQ1_0	25.1 MB
2-bit	Q2_K	28.8 MB
2-bit	TQ2_0	26.4 MB
3-bit	IQ3_S	31 MB
3-bit	Q3_K_S	31 MB
3-bit	IQ3_M	31.7 MB
3-bit	Q3_K_M	32.7 MB
3-bit	Q3_K_L	33.8 MB
4-bit	IQ4_XS	33.8 MB
4-bit	Q4_K_S	35.7 MB
4-bit	IQ4_NL	34.7 MB
4-bit	Q4_0	34.5 MB
4-bit	Q4_1	36.8 MB
4-bit	Q4_K_M	37.4 MB
5-bit	Q5_K_S	39.5 MB
5-bit	Q5_0	39 MB
5-bit	Q5_1	41.2 MB
5-bit	Q5_K_M	41 MB
6-bit	Q6_K	45.8 MB
8-bit	Q8_0	56.2 MB
16-bit	BF16	105 MB
16-bit	F16 (recommended)	105 MB
32-bit	F32 (recommended)	208 MB
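
To put the file sizes in perspective, the sketch below converts a few of them into average bits per weight. It rests on two assumptions that are not from the post: that F16 stores exactly 2 bytes per parameter with negligible metadata (so the 105 MB file implies roughly 52.5M parameters), and that "MB" means 10^6 bytes.

```python
def bits_per_weight(size_mb, n_params):
    """Average bits stored per parameter for a file of `size_mb` megabytes."""
    return size_mb * 1_000_000 * 8 / n_params

# Assumption: F16 stores 2 bytes per weight and metadata is negligible,
# so the 105 MB F16 file implies roughly 52.5M parameters.
N_PARAMS = 105 * 1_000_000 / 2

# A few sizes copied from the table above.
sizes_mb = {"Q2_K": 28.8, "Q4_K_M": 37.4, "Q8_0": 56.2, "F16": 105.0}
for name, size in sizes_mb.items():
    print(f"{name:8s} {bits_per_weight(size, N_PARAMS):5.2f} bits/weight")
```

The low-bit files come out well above their nominal bit width under these assumptions. That would be consistent with some tensors being kept at higher precision, which weighs more heavily on a model this small, but the post does not confirm the cause.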

GGUF with llama.cpp:

    # Run directly (replace Q4_K_M with your preferred quant)
    llama-cli -hf SupraLabs/Supra-1.5-50M-instruct-exp-gguf:Q4_K_M \
      --chat-template alpaca \
      -p "Write a short poem about open source AI." \
      -n 256

    # Or run as a local OpenAI-compatible server
    llama-server -hf SupraLabs/Supra-1.5-50M-instruct-exp-gguf:Q4_K_M \
      --chat-template alpaca \
      -c 5120
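
Once the server is up, any OpenAI-style client should be able to talk to it. Here is a minimal standard-library sketch, assuming llama-server is listening on its default local port 8080. The `build_request` and `chat` helper names and the sampling values are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server is running locally on its default port (8080).
BASE_URL = "http://localhost:8080"

def build_request(prompt, max_tokens=256, base_url=BASE_URL):
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a short poem about open source AI."))` should print the model's reply. The server applies the chat template itself, so the client sends only the plain user message.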

What's next?

Supra-124M - Base, Chat, Reasoning (legacy family, in production)

Supra-350M - Base, Chat, Reasoning, Coding (legacy family, in production)

All weights Apache 2.0. Feedback welcome!

submitted by /u/Dangerous_Try3619