r/LocalLLaMA

[NEW FAMILY OF MODELS] Supra1.5 family just released!


SupraLabs just released the Supra-1.5-exp line: Base, Instruct, and GGUF! (Reasoning coming soon.)

Hey r/LocalLLaMA! We are releasing the experimental Supra-1.5-50M family today: a new Base model with 5x the context window of the original Supra-50M, an Instruct fine-tune on top of it, and a GGUF quantized version ready to run anywhere.

🤗 Supra-1.5-50M-Base-exp | 🤗 Supra-1.5-50M-Instruct-exp | 🤗 GGUF | Supra1.5 50M Instruct Demo

These are experimental releases. Part of Project Chimera.

The Instruct model uses the Alpaca prompt format!

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

[INSTRUCTION]

### Response:

With additional input:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:

[INSTRUCTION]

### Input:

[CONTEXT]

### Response:

----

What changed from Supra-50M?

The biggest upgrade is context. Supra-1.5 expands from 1,024 to 5,120 tokens using RoPE scaling, with continued pretraining on a 3B token mix of tool calling data, ChatML conversations, factual text, and math. Same architecture, same tokenizer, just a much better base for SFT and future RL work.

| Spec | Supra-50M | Supra-1.5-50M |
|---|---|---|
| Context length | 1,024 tokens | 5,120 tokens |
| Training data (CPT) | 20B tokens (pretraining) | 3T tokens (continued) (experimental 1T) |
| Data mix | Fineweb-Edu only | Tool calling, ChatML, factual, math |
| Instruct format | Alpaca | ChatML |
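
The post does not say which RoPE scaling variant was used for the 1,024 to 5,120 jump. As a rough sketch only, assuming plain linear position interpolation with a factor of 5120 / 1024 = 5, and with a hypothetical head dimension of 64 and RoPE base of 10000 (neither is stated in the post), the idea looks something like this:

```python
import math

ORIG_CTX = 1024              # Supra-50M context length (from the post)
NEW_CTX = 5120               # Supra-1.5 context length (from the post)
SCALE = NEW_CTX / ORIG_CTX   # 5.0

HEAD_DIM = 64                # hypothetical, not stated in the post
ROPE_BASE = 10000.0          # common default, assumed here


def rope_angles(position, scale=1.0, head_dim=HEAD_DIM, base=ROPE_BASE):
    """Rotation angles for one position; linear scaling divides the position."""
    scaled_pos = position / scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0, x1, angle):
    """Apply a 2-D rotation to one (even, odd) feature pair."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


# With scale=5, every position in the new window maps to a scaled
# position below 1024, so the angles stay in the range seen originally.
last_scaled = rope_angles(NEW_CTX - 1, scale=SCALE)
last_orig = rope_angles(ORIG_CTX - 1)
print(last_scaled[0], last_orig[0])
```

Other variants (NTK-aware, YaRN) change the frequencies rather than the positions, so treat this as an illustration of the general mechanism, not the actual Supra-1.5 recipe.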

Benchmarks (Instruct)

BLiMP sits at a consistent 67.4 across evaluations. The model also showed an interesting raw vs. normalized accuracy split: science and factual tasks perform better under raw inference, while math and logic tasks benefit from normalized inference. Make of that what you will for a 50M model.
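
The raw vs. normalized split presumably refers to the two multiple-choice scoring rules found in evaluation harnesses such as lm-evaluation-harness (acc and acc_norm). That reading is an assumption, since the post does not name the tool. A minimal sketch of the difference, with made-up log-likelihoods:

```python
def pick_raw(choices, loglikes):
    """Raw accuracy rule: pick the answer with the highest total log-likelihood."""
    return max(range(len(choices)), key=lambda i: loglikes[i])


def pick_normalized(choices, loglikes):
    """Normalized rule: divide each log-likelihood by the answer's length first."""
    return max(range(len(choices)), key=lambda i: loglikes[i] / len(choices[i]))


# Hypothetical numbers for illustration only, not measured from Supra-1.5.
choices = ["4", "the number four"]
loglikes = [-3.0, -9.0]

print(choices[pick_raw(choices, loglikes)])         # short answer wins on total score
print(choices[pick_normalized(choices, loglikes)])  # long answer wins per character
```

Raw scoring favors short answers because every extra token lowers the total, while normalization removes that length penalty. The two rules can therefore rank the same model differently depending on how long the answer options are in a given task.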

The model is already listed on the Open SLM Leaderboard by AxiomicLabs.

Quick start

Base model:

from transformers import pipeline
import torch

print("[*] Loading Supra-1.5-50M Base...")
pipe = pipeline(
    "text-generation",
    model="SupraLabs/Supra-1.5-50M-Base-exp",
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

def generate_text(prompt, max_new_tokens=150):
    result = pipe(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.5,
        top_k=25,
        top_p=0.9,
        repetition_penalty=1.2,
        pad_token_id=pipe.tokenizer.pad_token_id,
        eos_token_id=pipe.tokenizer.eos_token_id
    )
    return result[0]['generated_text']

print(generate_text("The importance of education is"))

Instruct model:

import os, warnings
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
warnings.filterwarnings("ignore", category=UserWarning, module="transformers")

import torch
from transformers import pipeline, AutoTokenizer, logging
logging.set_verbosity_error()

MODEL_ID = "SupraLabs/Supra-1.5-50M-Instruct-exp"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, clean_up_tokenization_spaces=False)
pipe = pipeline(
    "text-generation",
    model=MODEL_ID,
    tokenizer=tokenizer,
    device_map="auto",
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32
)

def build_prompt(instruction, input_text=""):
    if input_text.strip():
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

def generate(instruction, input_text=""):
    result = pipe(
        build_prompt(instruction, input_text),
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
        repetition_penalty=1.15,
        pad_token_id=pipe.tokenizer.pad_token_id,
        eos_token_id=pipe.tokenizer.eos_token_id,
        return_full_text=False
    )
    return result[0]['generated_text'].strip()

while True:
    print("\nEnter an instruction (or 'exit' to quit):")
    user_input = input().strip()
    if user_input.lower() == "exit":
        break
    print("\nEnter additional context (optional, press Enter to skip):")
    context_input = input().strip()
    print(f"\nResponse:\n{generate(user_input, context_input)}\n")

GGUF quantizations:

| Bits | Quant | Size |
|---|---|---|
| 1-bit | Q1_D | 19.6 MB |
| 1-bit | TQ1_0 | 25.1 MB |
| 2-bit | Q2_K | 28.8 MB |
| 2-bit | TQ2_0 | 26.4 MB |
| 3-bit | IQ3_S | 31 MB |
| 3-bit | Q3_K_S | 31 MB |
| 3-bit | IQ3_M | 31.7 MB |
| 3-bit | Q3_K_M | 32.7 MB |
| 3-bit | Q3_K_L | 33.8 MB |
| 4-bit | IQ4_XS | 33.8 MB |
| 4-bit | Q4_K_S | 35.7 MB |
| 4-bit | IQ4_NL | 34.7 MB |
| 4-bit | Q4_0 | 34.5 MB |
| 4-bit | Q4_1 | 36.8 MB |
| 4-bit | Q4_K_M | 37.4 MB |
| 5-bit | Q5_K_S | 39.5 MB |
| 5-bit | Q5_0 | 39 MB |
| 5-bit | Q5_1 | 41.2 MB |
| 5-bit | Q5_K_M | 41 MB |
| 6-bit | Q6_K | 45.8 MB |
| 8-bit | Q8_0 | 56.2 MB |
| 16-bit | BF16 | 105 MB |
| 16-bit | F16 (recommended) | 105 MB |
| 32-bit | F32 (recommended) | 208 MB |
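
As a back-of-the-envelope check on the table, file size can be converted into an approximate parameter count and an effective bits-per-weight figure. This sketch assumes MB means 10^6 bytes and ignores GGUF metadata, so the results are rough estimates rather than exact values:

```python
MB = 1_000_000  # assumption: the table uses decimal megabytes

# A few sizes copied from the table above.
sizes_mb = {"F32": 208, "F16": 105, "Q8_0": 56.2, "Q4_K_M": 37.4, "Q2_K": 28.8}


def approx_params(f32_size_mb):
    """F32 stores 4 bytes per weight, so size / 4 approximates the parameter count."""
    return f32_size_mb * MB / 4


def bits_per_weight(size_mb, n_params):
    """Average number of bits the file spends per parameter."""
    return size_mb * MB * 8 / n_params


n_params = approx_params(sizes_mb["F32"])
print(f"approx. parameters: {n_params / 1e6:.0f}M")
for name, size in sizes_mb.items():
    print(f"{name:7s} {bits_per_weight(size, n_params):5.2f} bits/weight")
```

The low-bit files come out well above their nominal bit width. That is likely because some tensors (embeddings, for example) are stored at higher precision, and those make up a large share of a model this small.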

GGUF with llama.cpp:

# Run directly (replace Q4_K_M with your preferred quant)
llama-cli -hf SupraLabs/Supra-1.5-50M-instruct-exp-gguf:Q4_K_M \
  --chat-template alpaca \
  -p "Write a short poem about open source AI." \
  -n 256

# Or run as a local OpenAI-compatible server
llama-server -hf SupraLabs/Supra-1.5-50M-instruct-exp-gguf:Q4_K_M \
  --chat-template alpaca \
  -c 5120
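
Once llama-server is running, it can be queried from Python using only the standard library. This is a minimal sketch, assuming the server listens on llama-server's default address (127.0.0.1:8080) and exposes the usual /v1/chat/completions endpoint. The helper names build_request and ask are made up for this example:

```python
import json
import urllib.request

# Assumed default llama-server address; change it if you pass --host or --port.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_request(prompt, max_tokens=256, temperature=0.7):
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Write a short poem about open source AI."))` should return the model's reply. The server applies the chat template on its side, so the client sends only the plain user message.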

What's next?

Supra-124M - Base, Chat, Reasoning (legacy family, in production)

Supra-350M - Base, Chat, Reasoning, Coding (legacy family, in production)

All weights Apache 2.0. Feedback welcome!

submitted by /u/Dangerous_Try3619