Releasing Apodex-1.0 Smol Models (0.8B, 2B, 4B Open-Weights) optimized for Agentic Verification + AgentHarness Evals

Hey r/LocalLLaMA,

We just released Apodex 1.0, and alongside our flagship API, we are releasing the weights for our Smol models (0.8B, 2B, and 4B).

Our core research focuses on independent verification in long-horizon tasks. Instead of just scaling up parameter counts for raw generation, we’ve been experimenting with small, highly specialized local models that handle specific sub-tasks in an agentic loop, such as source cross-examination, hypothesis testing, and tool-grounded synthesis.

We wanted to share the open weights and our evaluation harness with the community to get your thoughts on local agent workflows.

🧠 The Setup: What are these Smol models for?

When running long-horizon agents locally, using a massive 70B+ model for every single step (like checking if a URL is broken or verifying a regex) is incredibly inefficient.
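To make the routing idea concrete, here is a minimal sketch. The tier names, step kinds, and the `pick_model` helper are all hypothetical and chosen for illustration; they are not part of Apodex or AgentOS.

```python
from dataclasses import dataclass

# Hypothetical tier labels, not real model IDs.
SMALL_MODEL = "small-verifier"
LARGE_MODEL = "large-controller"

# Assumed set of narrow, cheap step types that a small model could handle.
CHEAP_STEPS = {"check_url", "verify_regex", "validate_json"}


@dataclass
class Step:
    kind: str      # what the agent wants done at this step
    payload: str   # input for the step (URL, pattern, text, ...)


def pick_model(step: Step) -> str:
    """Send narrow checks to the small tier and everything else to the large tier."""
    return SMALL_MODEL if step.kind in CHEAP_STEPS else LARGE_MODEL


steps = [
    Step("plan", "outline the research task"),
    Step("check_url", "https://example.com"),
    Step("verify_regex", r"\d+"),
]
routing = [pick_model(s) for s in steps]
```

In this toy run only the planning step reaches the large tier; the two narrow checks go to the small one.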

We specialized these 0.8B, 2B, and 4B models to act as sub-agents within our AgentOS runtime. They are trained to:

  1. Fact-check/Cross-examine: Treat external text outputs as "claims" rather than ground truth.
  2. Execute & Verify: Formulate precise tool calls and verify structural outputs before passing them back to the main controller.
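As a rough illustration of the "verify structural outputs" step, the sketch below checks a model-emitted tool call before it would be handed back to a controller. The tool names, the `TOOL_SCHEMAS` table, and the `verify_tool_call` helper are assumptions made up for this example; the actual AgentOS interfaces may look quite different.

```python
import json

# Hypothetical tool schema: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "fetch_url": {"url": str},
    "run_regex": {"pattern": str, "text": str},
}


def verify_tool_call(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for a tool call given as JSON text."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in TOOL_SCHEMAS:
        return False, "unknown tool"

    args = call.get("args")
    if not isinstance(args, dict):
        return False, "args must be an object"

    # Every required argument must be present with the expected type.
    for name, expected in TOOL_SCHEMAS[tool].items():
        if not isinstance(args.get(name), expected):
            return False, f"bad or missing argument: {name}"
    return True, "ok"


ok, reason = verify_tool_call(
    '{"tool": "fetch_url", "args": {"url": "https://example.com"}}'
)
```

The model's output is treated as a claim to be checked rather than trusted: anything that fails to parse, names an unknown tool, or lacks a required argument is rejected with a reason the controller can act on.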

📊 Flagship Model Benchmarks (For Context)

To give you an idea of what the full architecture is capable of when these verification loops are running at scale, our flagship model (Apodex-1.0-H) achieved the following scores:

  • DeepSearchQA: 94.4 | BrowseComp: 90.3
  • HLE-Text: 60.8
  • SuperChem: 74.2
  • FrontierScience Research: 46.7 (frontier science reasoning is still a brutal bottleneck for all of us)

🛠️ Open-Source Components & Local Evals

We’ve open-sourced AgentHarness, the framework we use to test and evaluate these agentic workflows locally over 50+ steps without drift.

The open-weight models are hosted on Hugging Face, and the evaluation code is on GitHub.

(Note: To keep this post strictly compliant with the sub's rules, I’ve put all the Hugging Face links, GitHub repos, and the free early-access web platform in the stickied comment below).

For those into local agent orchestration:

  • Have you tried routing smaller tasks to <4B models in your local agent workflows? How do you mitigate the formatting/JSON adherence drift?
  • What are your thoughts on optimizing small models specifically for verification rather than conversational fluency?

Would love to hear your feedback, and let me know if you want us to cook up some GGUF/EXL2 quants for these!

submitted by /u/wuqiao