Tag

Agents + tool use

500 articles archived under #agents · RSS

Hugging Face Daily Papers research 7d ago

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Abstract: PlanBench-XL evaluates large language model agents' ability to plan and adapt in complex tool-rich environments with limited visibility and dynamic disruptions. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) LLM agents increasingly operate in large tool ecosystems, where…

10
Hugging Face Daily Papers research 7d ago

EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory

Abstract: EvoEmbedding is a dynamic embedding model that generates adaptive representations by maintaining a continuously updated latent memory, enabling improved retrieval performance in long-context scenarios. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Existing embedding…

32
Hugging Face Daily Papers research 7d ago

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

Abstract: Search agents face challenges in real-world evaluation due to limited benchmarks and coarse metrics, necessitating more nuanced assessment approaches. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Search Agents (SAs) typically leverage large language models (LLMs) to…

14
Hugging Face Daily Papers research 7d ago

CalVerT: Augmenting Agents with Calibrated Verifier Telemetry Improves Action and Learning in Knowledge-Intensive Tasks

Abstract: Calibrated verifier telemetry enhances LLM agents in knowledge-intensive question answering by providing confidence scores and grounding verification, reducing both over-retrieval and unsupported answers. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) LLM agents in…

7
Hugging Face Daily Papers research 7d ago

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Abstract: PhySciBench benchmark reveals limited performance of current LLM agents in physical science research, leading to development of DelveAgent framework that improves accuracy through modular design and physics-grounded mechanisms. Generated by…

5
r/LocalLLaMA community 7d ago

Why is NO one talking about Microsoft's open source Fast Context!!!

https://huggingface.co/microsoft/FastContext-1.0-4B-SFT https://github.com/microsoft/fastcontext FastContext-1.0 is a lightweight repository-exploration subagent for LLM coding agents. Instead of letting a single model both explore the repository and solve the task, FastContext…

38
TechCrunch — AI news-outlet 7d ago

The AI world is getting ‘loopy’

The loop takes agentic AI a step further, by authorizing a swarm of agents to work continuously in the background, endlessly.

28
r/LocalLLaMA community 7d ago

TMax: A Simple Recipe for Terminal Agents

TMax is the strongest open RL recipe for terminal agents to date, bringing open data recipes closer to the frontier. We release two things. The first is TMax-15k, a dataset of 14,600 RL environments built from a compositional pipeline with explicit control over difficulty and…

22
Interconnects (Nathan Lambert) research 7d ago

GLM-5.2 is the step change for open agents

A capability threshold I've been carefully monitoring.

12
r/LocalLLaMA community 7d ago

Same model, same prompt, 4 different agents

Setup: one self-hosted Qwen3.6-27B (Q4) on llama.cpp, identical prompt, identical hardware. The only variable is the agent scaffolding. Agents tested: pi, opencode, hermes, qwen code. Task: a single-file 2D canvas solar system with scripted orbits and gravity that acts only on…

14
Vercel — AI dev-tools 7d ago

Chat SDK adds Novu support

Chat SDK now supports Novu with the new vendor-official adapter. One handler set puts your agent on Slack, Microsoft Teams, WhatsApp, Telegram, and email. Novu handles credentials, identity, and delivery, keeping OAuth and tokens outside your app and mapping each channel to one…

32
r/LocalLLaMA community 7d ago

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

arXiv: https://arxiv.org/abs/2606.15079 Full Paper: https://arxiv.org/pdf/2606.15079 HuggingFace: https://huggingface.co/inclusionAI/models?sort=created (This month they released base models for both Ling-2.6-1T & Ling-2.6-flash.) Wish they released…

11
r/LocalLLaMA community 8d ago

I want to love hermes agent, but it looks so ugly, and ux is not nice

I am currently rechecking hermes agent, partly because many report great experiences, but oh my, does it look ugly. The web UI uses such ugly fonts and background graphics, and for some reason the UX feels slow and tedious (even in the TUI). Pi mono agent feels quick and fast…

20
Hugging Face Daily Papers research 8d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Abstract: Current memory agents lack reliable shared institutional deployment due to challenges in balancing utility, access control, and forgetting across multiple principals with diverse authorization contexts. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Memory benchmarks for…

5
Hugging Face Daily Papers research 8d ago

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Abstract: WorldLines benchmark evaluates long-term memory in embodied agents through household scenarios, while ObsMem framework addresses challenges in partial observability and memory translation for decision-making. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) To assist humans…

19
Vercel — AI dev-tools 8d ago

Sakana Fugu Ultra now available on AI Gateway

Sakana Fugu Ultra from Sakana AI is now available on AI Gateway. Fugu Ultra is built on a pool of publicly accessible frontier models, rather than running as a single model. It coordinates several models, routing work to 1-3 agents depending on the problem and combining their…

31
Simon Willison community 8d ago

Temporary Cloudflare Accounts for AI agents

Temporary Cloudflare Accounts for AI agents The announcement says this is "for AI agents" but (as is pretty common these days) the AI hook isn't really necessary, this is an interesting feature for everyone else as well. Short version: you can now create a Cloudflare Workers…

16
r/LocalLLaMA community 8d ago

I pretrained and post trained a 500M parameter LLM and 330M parameter Image generator from scratch

Hey folks, hope you are doing well. I started HobbyLM as a side project last month. Initially I wrote an agent harness using Claude SDK which takes notes on various LLM architectures and does ablation studies to find an optimised or well-fitting architecture for this model training, then I…

16
r/LocalLLaMA community 8d ago

Sandboxing code execution for AI agents

For those giving their agents the ability to execute code, how are you sandboxing it? The spectrum seems to be: Docker containers: familiar, decent isolation, but heavyweight for per-request sandboxing; microVMs: great isolation, fast boot, but operational complexity; WASM:…

5
r/LocalLLaMA community 8d ago

8-16 MI50s Minimax M3 @19 tps TG (peak)

TL;DR: Speeds are not too bad for this old 2018 hardware but, imo, not very usable for agentic coding (compare qwen3.6 27B on 8 MI50 @ 50 tps TG, 800 tps PP). More concerning is that the reasoning output is very, very long, and I still haven't checked the quality of…

27
r/LocalLLaMA community 8d ago

I mapped every agent config file (AGENTS.md, CLAUDE.md, llms.txt, .cursorrules, SKILL.md...) and tagged how widely each is actually used

Every tool ships its own magic file now and after a while the names all blur together. I put together a guide to the ones agents actually read and write, with a tag on each for real adoption instead of hype. https://github.com/ItamarZand88/awesome-agent-conventions 21…

22
r/LocalLLaMA community 9d ago

Board where every tile is an agent

I've been hacking on a project which I find extremely useful and wanted to share. Imagine a board where every tile is an agent whose job is to maintain the tile. I tried to illustrate the idea with a video here. The project is open source on GitHub and you can also try it out here…

36
Hacker News — AI on Front Page community 9d ago

Temporary Cloudflare accounts for AI agents

Article URL: https://blog.cloudflare.com/temporary-accounts/ Comments URL: https://news.ycombinator.com/item?id=48608394 Points: 203 # Comments: 106

15
r/LocalLLaMA community 10d ago

Local AI for local office files

Which AI agent do you think is the best for working with local files (Excel, PDF, Word, txt, json, etc.)? What have you used for this? What workflows have you implemented?

29
r/LocalLLaMA community 10d ago

Giving a local agent web access without paid search/scrape APIs: SearXNG + Scrapling

I wanted web access for a local-first agent without reaching for Tavily, Serper, Firecrawl, etc. For this agent path, I wanted no paid API keys, a search service I control, and page extraction I can run myself. What I ended up with is two tools: web_search and web_extract.…

6
r/LocalLLaMA community 10d ago

Local agent on 4090 - looking for LM Studio settings

I have moved on from using Ollama just to dink around and instead want to start running a local agent from time to time. With the 24GB of a 4090 (Gigabyte OC edition) that should be quite possible. But no matter what settings I use for context and batching, token generation is slow as…

36
Simon Willison community 10d ago

Quoting Sean Lynch

The real valuable capability MCP offers over skills/CLI is isolating the auth flow outside of the agent’s context window, and potentially out of the harness completely. [...] Maybe the idealized form of MCP is just an auth gateway for the API and nothing else. That’d still be a…

8
r/LocalLLaMA community 10d ago

Best Local Agents - Jun 2026

A megathread that is overdue! Let's discuss and debate what the best local agents available today are. Prologue: First, a note on terminology: While most regular users are going to have a general sense of what these are, I think it's worth a brief pause to preempt turbulence in…

6
Hugging Face Daily Papers research 10d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Abstract: LEDGERAGENT is a method for customer service agents that maintains task states in a separate ledger to improve policy adherence and state management during tool calling. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Policy-adherent tool-calling agents in customer-service…

36
GitHub Blog — AI & ML official-blog 10d ago

How we built an internal data analytics agent

Qubot, our internal Copilot-powered analytics agent, allows any GitHub employee to ask questions about our data in plain language. Here's what we learned as we built it.

18
Hugging Face Daily Papers research 10d ago

Context-Aware RL for Agentic and Multimodal LLMs

Abstract: ContextRL enhances long-horizon reasoning and multimodal performance through reinforcement learning that rewards context selection for supporting query-answer pairs, achieving improvements over standard methods on diverse benchmarks. Generated by…

21
r/LocalLLaMA community 10d ago

Improving local models with an API based "consultant"?

I'm sure that someone else has come up with this before, but I just wanted to ask: Has it occurred to anyone to improve their local AI workflow by adding a more powerful API-based "consultant" agent (GLM 5.2 now springs to mind) to call upon for refining plans, learnings and…

35
Hugging Face Daily Papers research 10d ago

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Abstract: ACIE, an agentic RAG system deployed in a clinical setting, demonstrates high accuracy in extracting medical information from complex patient contexts, achieving 96.5% acceptance rate by nuclear-medicine physicians across 7,326 judgments. Generated by…

5
r/LocalLLaMA community 10d ago

Watching a local AI voice assistant get dumber (A 9B to 0.8B agent experiment on my RTX 5060 Ti)

I wanted to find the exact floor for running an intelligent, local voice assistant agent on consumer hardware. I kept the environment, tools, and prompts identical and stepped the model sizes down through Qwen 3.5 9B, 4B, 2B, and 0.8B to see how agentic reasoning degrades. The…

12
r/LocalLLaMA community 10d ago

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts

You can read about it here: https://artificialanalysis.ai/articles/aa-briefcase This is a solid benchmark from Artificial Analysis. It basically tests an LLM's ability to plan and execute tasks. And more importantly, it is a new benchmark that is not saturated, so no one can…

32
r/LocalLLaMA community 10d ago

Researchers trained a Deep Research agent with 32 H100s and open-sourced everything

Ohio State University's NLP team released QUEST-35B, an open-source Deep Research agent trained using ~32 H100s and ~8K synthetic samples. The team open-sourced the training recipe, code, weights and datasets. Benchmark results show competitive performance against several…

13
Hugging Face Daily Papers research 10d ago

ENPIRE: Agentic Robot Policy Self-Improvement in the Real World

Abstract: ENPIRE framework enables autonomous robotics research through a closed-loop system that automates policy improvement via environment feedback, policy refinement, and evolutionary code optimization. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Achieving dexterous robotic…

27
Hugging Face Daily Papers research 11d ago

Playful Agentic Robot Learning

Abstract: Embodied robots learn reusable skills through self-directed play and exploration, then apply these skills to improve performance on downstream tasks without additional training. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Current agentic robot systems can write…

4
arXiv — Machine Learning research 11d ago

MortarBench: Evaluating Mortgage Loan Origination Agents

arXiv:2606.19416v1 Announce Type: new Abstract: Loan origination is the process by which a lender creates a new loan, from application and underwriting through approval and funding. This process serves a critical role in evaluating the eligibility and level of risk posed by an…

15
arXiv — Machine Learning research 11d ago

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

arXiv:2606.19595v1 Announce Type: new Abstract: Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for…

35
arXiv — Machine Learning research 11d ago

OnDeFog: Online Decision Transformer under Frame Dropping

arXiv:2606.19721v1 Announce Type: new Abstract: In challenging real-world reinforcement learning applications, communication delays or sensor failures often cause frame dropping, in which the agent cannot receive the dropped states and associated rewards. To address the…

20
arXiv — NLP / Computation & Language research 11d ago

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

arXiv:2606.20002v1 Announce Type: cross Abstract: This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it…

13
arXiv — NLP / Computation & Language research 11d ago

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

arXiv:2606.19659v1 Announce Type: new Abstract: On-policy distillation (OPD) improves student models by training them on trajectories induced by their own policy, making it a promising approach for mitigating exposure bias in agent training. However, most OPD studies focus on…

17
arXiv — NLP / Computation & Language research 11d ago

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

arXiv:2606.19847v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented…

32
arXiv — NLP / Computation & Language research 11d ago

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

arXiv:2606.19852v1 Announce Type: new Abstract: Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional…

26
arXiv — NLP / Computation & Language research 11d ago

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

arXiv:2606.20113v1 Announce Type: new Abstract: Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the…

21
arXiv — NLP / Computation & Language research 11d ago

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

arXiv:2606.20487v1 Announce Type: new Abstract: Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition…

16
arXiv — NLP / Computation & Language research 11d ago

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

arXiv:2606.19388v1 Announce Type: cross Abstract: Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct…

31
arXiv — NLP / Computation & Language research 11d ago

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

arXiv:2606.19501v1 Announce Type: cross Abstract: Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing…

14
arXiv — NLP / Computation & Language research 11d ago

Uncertainty Decomposition for Clarification Seeking in LLM Agents

arXiv:2606.19559v1 Announce Type: cross Abstract: Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable…

9
