Tag

Edge

197 articles archived under #edge · RSS

r/MachineLearning community 24d ago

Are We Underestimating Small Edge AI Models? [D]

A lot of recent discussion around Edge AI focuses on running increasingly larger local LLMs. Meanwhile modern smartphones already have enough compute for many practical computer vision tasks that don't require massive models at all. I recently built and released an Android…

7
r/LocalLLaMA community 25d ago

Run (your largest) local models from your iPhone

Submitted by /u/BustyMeow.

18
llama.cpp releases dev-tools 25d ago

b9518

server: disable on-device spec checkpoints (#24108). macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED; macOS Intel (x64); iOS XCFramework. Linux: Ubuntu x64 (CPU); Ubuntu arm64 (CPU); Ubuntu s390x (CPU); Ubuntu x64 (Vulkan); Ubuntu arm64…

15
r/LocalLLaMA community 25d ago

I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards. My setup (Node #04): Gigabyte X399 Designare EX; Threadripper 1950X; 128GB DDR4; 4x RTX 3090; 10GbE TP-Link/Aquantia NIC; llama.cpp NCCL build; vLLM for safetensors models. I…

15
The Information — AI news-outlet 26d ago

Apple to Launch New Siri in September With Help of Google, Nvidia

Apple is currently on track to launch its overhauled Siri in September, to run in part on Google’s cloud computing servers using Nvidia chips, according to people familiar with the matter. While Apple will try to run as much as possible of the new Siri on devices such as…

31
r/LocalLLaMA community 26d ago

Big Model Value Wars - DeepSeek V4 Pro vs MiMo-V2.5-Pro vs MiniMax M3

For those who sometimes boost their local model use with OpenRouter options, or the madlads who have the infrastructure to actually run those locally, it feels like those three models have the edge in best bang for your buck. How then do you decide which one to use? Do you have a…

19
r/LocalLLaMA community 26d ago

Best way to index full Italian Wikipedia for 100% offline RAG in LM Studio?

Hi everyone, I want to set up a 100% offline RAG system using LM Studio and the entire Italian Wikipedia (text-only, no images). My goal is to index the database once so my local LLMs can query it for up-to-date factual knowledge without internet access. Here are my PC specs:…

14
r/LocalLLaMA community 27d ago

Microsoft Aion 1.0 Instruct and Aion 1.0 Plan models!

Microsoft announced 2 new on-device models at Microsoft Build 2026. Aion 1.0 Instruct: efficiency at scale. Aion 1.0 Instruct is our next-generation small language model, smaller, faster and more efficient than our current Windows OS SLM. Designed from the ground up for…

14
r/LocalLLaMA community 27d ago

I Put a Datacenter GPU in My Gaming PC for £200

Hey there! I wrote a blogpost about my experience running local models on a V100 from a newbie perspective and got loads of views outside of reddit, so I thought I'd share it here too! Submitted by /u/tymscar.

33
r/LocalLLaMA community 27d ago

What are you using to preprocess pdfs before feeding them to a local model?

I have been running a local setup for document QA and the output quality varies a lot depending on what the pdf looks like when it hits the LLM. clean prose docs are fine but anything with tables or multi column layouts comes out garbled and the model just works with whatever…

37
r/LocalLLaMA community 27d ago

Ignoring benchmarks, how do the newest local models (gemma 4 31B, 26BA4B, Qwen 3.6) “feel” to you? What do you think they compare to?

I use local ai mainly for creative writing, and benchmarks are a bit iffy on that I feel like. I’d like to compare Gemma mainly to Gemini as I like their writing the best, I do know that qwen 3.6 is amazing but mostly for coding and agentic work. I’d like to ask everyone how the…

30
r/LocalLLaMA community 27d ago

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks

For two weeks I ran my multi-agent orchestrator entirely on Qwen3.6-27B via Ollama, on a single 3090. The goal: see if a local model could replace Claude as the reasoning layer for the lead/manager/sub-agent loop. Here's where it worked and where it broke. Setup: - RTX 3090,…

13
r/LocalLLaMA community 28d ago

Man trains local model to detect and kill mosquitos with a laser

Now this is local AI innovation we can all get behind. https://x.com/stevencheng/status/2059836738449854898 Submitted by /u/No_Information9314.

37
r/LocalLLaMA community 28d ago

Stop asking what model to run. There are literally only two.

Can we please ban the daily "I have an RTX 3060, what should I run?" slop threads? It’s not complicated. As of right now, Hugging Face is empty and exactly two local models exist on this entire planet: Qwen 3.6 35b a3b and Qwen 3.6 27b. That is the entire list. Your specs don’t…

30
arXiv — NLP / Computation & Language research 29d ago

The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

arXiv:2605.30628v1 Announce Type: new Abstract: Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no…

38
arXiv — NLP / Computation & Language research 29d ago

Translation Analytics for Freelancers II: Benchmarking Local LLMs for Confidential Translation Workflows

arXiv:2605.31452v1 Announce Type: new Abstract: Building on our previous work, this paper develops practical, low-barrier methods for freelance translators and smaller language service providers to evaluate translation technologies using rigorous yet accessible analytic methods.…

23
Hugging Face Daily Papers research 29d ago

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Abstract Multi-step trojan attacks in local LLM agents can bypass existing defenses by embedding malicious prompts across multiple operations, requiring new detection methods like DASGuard for effective protection. AI-generated summary LLM agents are evolving from conversational…

20
r/LocalLLaMA community 29d ago

Don’t bite me for that question please…

And the question is… How are you earning money on your local LLM setups? (Except coding, of course.) I see people spending SO MUCH MONEY on the compute power to run LLMs locally, and many of them say that their setups have already paid for themselves or that they are earning much more (I guess they don't mean…

29
r/MachineLearning community 29d ago

I built mlx-Chronos — a community benchmark leaderboard for local LLM engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama) [P]

Hey! I'm a CS student and I got tired of not being able to compare MLX inference engines properly — every benchmark out there is either made by the engine's own developers, runs on an M3 Ultra nobody has, or just shows tok/s with zero context. So I built mlx-Chronos — a small…

11
r/LocalLLaMA community 1mo ago

what do you use your local llm?

what do you use your local llm for? for me, i run everything on linux and it ends up generating api tokens i can plug into other stuff. on my laptop (and for personal projects), i mostly use it for coding help—then i’ve got an ai agent (not openclaw ) that monitors stock prices…

34
r/LocalLLaMA community 1mo ago

How do I try to run Gemma 4 31B at Q8 quantization? Only seeing Q4_K_M on Ollama

Just got my new PC up and running and want to test some local models. I'm a complete noob but I've managed to install Ollama. I'm on Fedora Linux. Submitted by /u/JayoTree.

9
r/LocalLLaMA community 1mo ago

Cost Analysis of my $6.4k Local LLM Server

I haven't seen any of these done, so I just wanted to share my experience in case it is useful for anyone. The purpose of this post is to show total cost of ownership of my local llm server versus API equivalent. Before you look at the final numbers, note that most people do not…

17
r/LocalLLaMA community 1mo ago

Why does Thinking Output More Tokens Than a Response?

I was too lazy to use a vector DB + Embedding + Clustering for this list of 1000 items I wanted to categorize. I was hoping to use a local LLM to do it, but it would only respond with a list of about 100 items or so and their categories. It confused me because when I saw the…

22
r/LocalLLaMA community 1mo ago

Shoutout to Gemma4 as a conversational assistant / agent

I'm seriously impressed by Gemma4 26B A4B. On my M5 Pro (so not much memory bandwidth by GPU standards), it's blazingly fast and it's a very good generalist / everyday local LLM. It has a little bit of personality to its responses, and seems to perform decently for everything:…

37
r/LocalLLaMA community 1mo ago

LiquidAI/LFM2.5-8B-A1B · Hugging Face

looks like you can run it on any potato (A1B)! https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF from LiquidAI: LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.…

22
r/LocalLLaMA community 1mo ago

Qwen3.6 35B - TXT vs Markdown vs HTML vs HTML+CSS

There's been talk of late about using HTML rather than Markdown in Claude Code. I was curious how this worked with a local model, so I loaded up Qwen3.6 35B A3B at Q8 with an F16 KV cache. Then I gave it the same prompt: write a detailed explanation of the Blazor render cycle first…

31
The Information — AI news-outlet 1mo ago

Apple to Renew Push for AI That Runs on Devices, Instead of the Cloud

At Apple’s annual developer conference next month, the star of the show will be a series of long-delayed artificial intelligence upgrades to the iPhone. But the company is also expected to emphasize what could be an underrated asset in its efforts to catch up in AI: Its ability…

26
r/LocalLLaMA community 1mo ago

Heterogeneous GPU Weighting & Layer Splitting

This is what I worked on today. With local LLM of course. So if I didn't write the code, did I really work on it? Who cares. It was my idea and I simply asked it to implement it. I basically downloaded /main/ branch, which is totally broken for Windows by the way (i had to…

21
arXiv — Machine Learning research 1mo ago

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

arXiv:2605.27599v1 Announce Type: new Abstract: Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being targeted for edge deployment, with NVIDIA, Dell, HP, ASUS, MSI, Acer, and Gigabyte all…

37
r/LocalLLaMA community 1mo ago

Local LLMs on Refurb M4 Max vs new M5 Max

Hoping the community can guide me on this one. I'm on the fence about the following purchase: Refurbished 16-inch MacBook Pro Apple M4 Max Chip with 16‑Core CPU and 40‑Core GPU, 64gb ram, 1Tb Drv for $3,479.00 vs The new 16-inch MacBook Pro Apple M5 Max Chip with 18‑core CPU,…

30
r/LocalLLaMA community 1mo ago

CrankGPT by Squeez Labs - hand-cranked edge AI - talk about local AI!!!

I met Katrin from Squeez Labs at an event hosted by Pathway AI (the team behind Baby Dragon Hatchling) where she told me about CrankGPT, a literally hand-cranked device for running local LLMs. It's apparently real. It's apparently launched. It's apparently glorious. Check it…

15
r/LocalLLaMA community 1mo ago

Qwen3.6 huge quality gain from Q4 to Q6 for coding agent

So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and deepseek was too cheap. First thing I stopped using Ollama and now I only use llama.cpp built in server that works really great. The quality improvement from Q4 to…

34
arXiv — Machine Learning research 1mo ago

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

arXiv:2605.26128v1 Announce Type: new Abstract: Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments,…

24
arXiv — Machine Learning research 1mo ago

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

arXiv:2605.26496v1 Announce Type: new Abstract: The Mixture of Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from scratch incurs prohibitive costs. Current methods attempt to alleviate this by upcycling dense…

32
Hugging Face Daily Papers research 1mo ago

MobileMoE: Scaling On-Device Mixture of Experts

Abstract MobileMoE introduces efficient on-device Mixture-of-Experts language models with sub-billion parameters that achieve better performance and efficiency compared to dense baselines and existing MoE models. AI-generated summary Mixture-of-Experts (MoE) has become the de…

17
arXiv — Machine Learning research 1mo ago

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

arXiv:2605.24058v1 Announce Type: new Abstract: On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a…

28
r/LocalLLaMA community 1mo ago

New local model reaching near frontier on PII removal at 9 ms CPU inference

Hi all, I've been working on this model to strip sensitive information from computer use data and would love some feedback! Submitted by /u/louis3195.

34
r/LocalLLaMA community 1mo ago

Using Local LLMs for Generating Custom Interactive Recursive Textbooks on the Fly

Submitted by /u/Ryoiki-Tokuiten.

28
llama.cpp releases dev-tools 1mo ago

b9315

llama: document that only one on-device state can be saved per sequence (#23520). macOS/iOS: macOS Apple Silicon (arm64); macOS Apple Silicon (arm64, KleidiAI enabled); macOS Intel (x64); iOS XCFramework. Linux: Ubuntu x64 (CPU); Ubuntu arm64 (CPU); Ubuntu s390x (CPU); Ubuntu x64…

13
r/MachineLearning community 1mo ago

Is AI inference platform really that saturated now? [D]

I’m thinking of expanding an on-device inference SDK into a full-blown AI inference platform, and I'm seeing more and more inference platforms popping up. Been talking with a VC from Seattle/NY. Is this space really that saturated? Submitted by /u/kampak212.

35
r/LocalLLaMA community 1mo ago

RAG for developer docs so local llm can code using latest library?

I was wondering if it would make local llm better at coding if it has access to the latest documentation available through a RAG. I'm specifically interested in python. But then this might lead ingesting and embedding a very large number of documents. Or I could just focus on…

28
r/LocalLLaMA community 1mo ago

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp

Imagine you are using a local model for agentic coding. You discuss the idea (50k tokens), then say “implement it”. The agent reads files, writes files, runs commands, produces another 20k tokens and the code is ready. Then your next prompt is just “thank you”, and... nothing…

6
r/LocalLLaMA community 1mo ago

llama.cpp has a clever trick for speeding up KV cache decode

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one under developer options. This is the…

23
r/LocalLLaMA community 1mo ago

Is NVIDIA still the default best choice for local LLMs in 2026?

Submitted by /u/pmv143.

9
r/LocalLLaMA community 1mo ago

Local model doing accounting tasks

So I've been using qwen 3.6 27b for monthly closes, bank recs, payable and receivables. Built a simple sql lite database it manages. Anyhow, wanted to post I integrated Claude skills and the https://github.com/anthropics/financial-services repo. It works well. Just wanted to…

22
r/LocalLLaMA community 1mo ago

club-rdna16: practical 16GB AMD/Radeon local LLM testing repo

Following on from club-5060ti, I’ve been doing some testing with my desktop AMD GPU and wanted to make a similar repo for 16GB Radeon cards. Repo: https://github.com/5p00kyy/club-rdna16 Pages/results: https://5p00kyy.github.io/club-rdna16/ The first test machine is an RX 6900 XT…

24
r/LocalLLaMA community 1mo ago

Gmail tie-ins

hey folks. I’m looking to setup a way to give a local LLM access to google cloud SDK for Gmail functions. The goal is to be able to have an LLM once daily check a spreadsheet, and based on criteria send an email that will be structured exactly the same way each time, simply as a…

14
arXiv — Machine Learning research 1mo ago

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

arXiv:2605.20295v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization…

36
arXiv — NLP / Computation & Language research 1mo ago

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

arXiv:2605.20815v1 Announce Type: new Abstract: Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments…

32
r/LocalLLaMA community 1mo ago

24GB M4 Mac - is Qwen 9B only option while system is running?

I have a Mac at work on which I want to use a local model for prototyping and basic prompts that need to stay on device. What sort of model can I run that fits at least 64k context? Any setups or guides welcome. I need to have Firefox open with at least one tab. Problem I…

6
