r/LocalLLaMA · · 3 min read

Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster.

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Ran a head-to-head on two open-weight models for tool-calling on a 4-core CPU, no GPU, no cherry-picking. Wanted to see if the small specialist (Needle, 26M, distilled from Gemini 3.1 for function calls) actually holds up against a small generalist (Qwen3-0.6B) that also does tools.

Setup: 50 queries across 5 tiers (simple, paraphrased, implicit, ambiguous, edge cases including foreign language and a "don't call any tool" trap). 5 mock tools. Three metrics per run: parse_success, tool_match, args_match. Same queries, same eval rubric, same hardware.

Headline numbers:

 Needle (26M) Qwen3 (0.6B) tool_match overall 72.0% 56.0% parse_success 84.0% 54.0% args_match | match 97.2% 100.0% mean latency 10.9s 47.9s 

The interesting part is not the overall win, it's the failure shapes. They diverge completely:

  • Needle fails by picking the wrong tool. When it does pick a tool, args are right 97% of the time. Its sin is selection, mostly routing system commands to search_web instead of run_command.
  • Qwen3 fails by not calling a tool at all. Every single one of its 22 misses is a parse failure where it answered in prose instead of emitting <tool_call> tags. When it does emit a call, args are perfect 100% of the time.

Tier breakdown is where it gets sharp. T1 and T2 (literal and paraphrased) are tied at ~95% each. T3 (implicit, like "should I bring an umbrella in Amsterdam?" where the tool name never appears) is where Qwen3 falls off a cliff: 80% to 10%. Needle just maps the intent. Qwen3 tries to be helpful in prose and apologizes for not having real-time data.

T5 (edge) is the only tier Qwen3 wins, by 10 pts. Hindi queries broke Needle's tokenizer (Devanagari fragments badly, one query timed out at 73s with garbled output). Qwen3 handled both Hindi and French cleanly.

One thing that almost killed the Needle run: first pass it scored 8% because I was feeding it OpenAI JSON Schema. Needle was trained on a flat schema ({location: {type, description, required}}) and was literally echoing the word "properties" back as an argument value. Wrote a converter, accuracy jumped from 8% to 72% with no other changes. Worth knowing if anyone else picks up the Needle weights.

Qwen3 had its own issue, it never emitted EOS on the hand-rolled prompt template and burned the full 256-token budget on every query (~230s each). Switching to tokenizer.apply_chat_template(tools=...) with enable_thinking=False dropped it to ~37s and the <tool_call> tags started appearing naturally.

My read: these are not the same product category even though they sound like they are. Needle is a dispatcher. Qwen3 is a tiny chatbot that can also call tools. If you want on-device single-shot tool routing with a fixed palette, Needle is genuinely good for 13MB. If you want any conversational ability, Needle has zero of it and Qwen3 wins by default.

Limitations: n=50 is small. Single CPU hardware. Mock tools, not real ones. Would love anyone who reproduces it on different hardware or with a paraphrase-stress-test to share results.

Repo with full code, raw_log.jsonl, summary.json, and the 5 charts are in comments below 👇

This evaluation was done using NEO, an AI engineering agent. It built the eval harness, handled the checkpointed runs, debugged the schema mismatch and the EOS issue, and consolidated results. I reviewed everything manually and made the calls on what to ship.

submitted by /u/gvij
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA