Tag

Agents + tool use

500 articles archived under #agents

arXiv — Machine Learning research 15d ago

Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments

arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications…

5
arXiv — Machine Learning research 15d ago

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

arXiv:2606.13710v1 Announce Type: cross Abstract: Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended…

6
arXiv — NLP / Computation & Language research 15d ago

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce…

25
arXiv — NLP / Computation & Language research 15d ago

When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation

arXiv:2606.13835v1 Announce Type: new Abstract: LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a…

10
arXiv — NLP / Computation & Language research 15d ago

SANA: What Matters for QA Agents over Massive Data Lakes?

arXiv:2606.13904v1 Announce Type: new Abstract: Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish…

38
arXiv — NLP / Computation & Language research 15d ago

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

arXiv:2606.13995v1 Announce Type: new Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this…

10
arXiv — NLP / Computation & Language research 15d ago

CacheRL: Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward

arXiv:2606.14179v1 Announce Type: new Abstract: We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach…

7
arXiv — NLP / Computation & Language research 15d ago

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

arXiv:2606.14302v1 Announce Type: new Abstract: LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online…

15
arXiv — NLP / Computation & Language research 15d ago

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

arXiv:2606.14574v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type…

8
arXiv — NLP / Computation & Language research 15d ago

LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

arXiv:2606.14600v1 Announce Type: new Abstract: Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce…

32
arXiv — NLP / Computation & Language research 15d ago

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

arXiv:2606.14674v1 Announce Type: new Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often…

13
arXiv — NLP / Computation & Language research 15d ago

Orchestra-o1: Omnimodal Agent Orchestration

arXiv:2606.13707v1 Announce Type: cross Abstract: The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition…

38
arXiv — NLP / Computation & Language research 15d ago

WorkBench Revisited: Workplace Agents Two Years On

arXiv:2606.13715v1 Announce Type: cross Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best…

34
arXiv — NLP / Computation & Language research 15d ago

Same-Origin Policy for Agentic Browsers

arXiv:2606.14027v1 Announce Type: cross Abstract: Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that…

16
arXiv — NLP / Computation & Language research 15d ago

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software…

37
arXiv — NLP / Computation & Language research 15d ago

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

arXiv:2606.14672v1 Announce Type: cross Abstract: Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which…

28
arXiv — NLP / Computation & Language research 15d ago

Large Language Model Agents Are Not Always Faithful Self-Evolvers

arXiv:2601.22436v3 Announce Type: replace Abstract: Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the…

16
Vercel — AI dev-tools 15d ago

Auth0 joins the Vercel Marketplace

You can now add Auth0, a production-ready authentication platform, to your Vercel app in just a few clicks. Built for modern frameworks like Next.js, Auth0 is an identity and access management platform for securing your apps and agentic workflows. This integration enables: Automatic…

26
r/MachineLearning community 16d ago

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents. The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,…

24
r/LocalLLaMA community 16d ago

32 bit crossplatform coding agent running on pentium m with less than a second startup time

supports subagents, goals, MacOS, Unix, linux, BSD, windows 7 +, minimum needed CPU is 386, tried it on an 800mhz pentium 3 and still got sub second startup time and less than 1% cpu usage during use without --slow-cpu flag, prism supports plugins and is small enough to fit on a…

7
NVIDIA Developer Blog official-blog 17d ago

NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark

AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how...

8
Hugging Face Daily Papers research 17d ago

See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents

Abstract: Heterogeneous multi-agent systems can effectively transfer knowledge through aligned KV-cache communication, achieving better performance than text-based methods with reduced computational costs. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multi-agent systems…

21
Hugging Face Daily Papers research 17d ago

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Abstract: TRACE is a skill-layer pipeline that mines user corrections to create runtime checks, significantly reducing preference violations in interactive LLM agents. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Interactive LLM agents are becoming part of daily work, but they do…

30
Hacker News — AI on Front Page community 17d ago

How to setup a local coding agent on macOS

Article URL: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent-on-macos Comments URL: https://news.ycombinator.com/item?id=48507020 Points: 261 # Comments: 71

36
Hugging Face Daily Papers research 17d ago

WebChallenger: A Reliable and Efficient Generalist Web Agent

Abstract: WebChallenger presents a web agent framework that improves autonomous navigation through structured page representation and cognitive-inspired mechanisms, achieving high performance with open-weight models. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Autonomous web…

15
Hugging Face Daily Papers research 17d ago

The Cold-Start Safety Gap in LLM Agents

Abstract: Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Are tool-calling LLM agents equally safe…

37
r/LocalLLaMA community 17d ago

MiniMax Sparse Attention (MSA)

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax…

14
NVIDIA Developer Blog official-blog 17d ago

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...

25
r/MachineLearning community 17d ago

Just thinking, what about conducting a 1 day virtual session on fundamentals of computer vision ??? [D]

Hi all, A real story from my current experience: I'm associated with an internship where the primary work revolves around autonomous UAVs. What has shocked me the most is that almost everyone is so heavily focused on coding agents and AI tools that they're building things…

17
r/LocalLLaMA community 18d ago

moonshotai/Kimi-K2.7-Code · Hugging Face

Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing…

10
OpenAI official-blog 18d ago

New OpenAI Academy courses for the next era of work

OpenAI introduces three Academy courses that help people build practical AI skills, create repeatable workflows, and apply agents in everyday work.

27
Hugging Face Daily Papers research 18d ago

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

Abstract: ArogyaBodha dataset and ArogyaSutra framework enhance multilingual medical reasoning in low-resource settings through diverse data integration and actor-critic multi-agent reasoning. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Multimodal Large Language Models (MLLMs)…

30
r/LocalLLaMA community 18d ago

[browser-use-wasm] I made a browser-use agent that runs in WASM at zero cost

The only cost is electricity! I built this in a few weeks since I couldn't find anything else like it. Demo: https://pdufour.github.io/browser-use-wasm/ Source Code: https://github.com/pdufour/browser-use-wasm One thing I've wanted to do for a while was add a widget to my page…

12
Smol AI News news-outlet 18d ago

not much happened today

Anthropic suspended access to Claude Fable 5 and Mythos 5 due to US export controls, sparking a debate on model sovereignty and geopolitical risks for frontier AI vendors. Artificial Analysis updated its coding agent benchmark, replacing SWE-Bench Pro…

17
Hacker News — AI on Front Page community 18d ago

AI agent bankrupted their operator while trying to scan DN42

Article URL: https://lantian.pub/en/article/fun/ai-agent-bankrupted-their-operator-scan-dn42lantian.lantian/ Comments URL: https://news.ycombinator.com/item?id=48500012 Points: 347 # Comments: 99

34
arXiv — NLP / Computation & Language research 18d ago

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global…

19
arXiv — NLP / Computation & Language research 18d ago

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

arXiv:2606.12908v1 Announce Type: new Abstract: Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an…

18
arXiv — NLP / Computation & Language research 18d ago

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

arXiv:2606.13115v1 Announce Type: new Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive…

15
arXiv — NLP / Computation & Language research 18d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to…

26
arXiv — NLP / Computation & Language research 18d ago

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

arXiv:2606.13177v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate,…

4
arXiv — NLP / Computation & Language research 18d ago

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

arXiv:2606.13317v1 Announce Type: new Abstract: Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them,…

11
arXiv — NLP / Computation & Language research 18d ago

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

arXiv:2606.13349v1 Announce Type: new Abstract: Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the…

29
arXiv — NLP / Computation & Language research 18d ago

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

arXiv:2606.13572v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource…

20
arXiv — NLP / Computation & Language research 18d ago

Recursive Agent Harnesses

arXiv:2606.13643v1 Announce Type: new Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in…

35
arXiv — NLP / Computation & Language research 18d ago

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

arXiv:2606.13663v1 Announce Type: new Abstract: Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally…

23
arXiv — NLP / Computation & Language research 18d ago

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to…

30
arXiv — NLP / Computation & Language research 18d ago

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

arXiv:2606.12616v1 Announce Type: cross Abstract: Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single…

16
arXiv — NLP / Computation & Language research 18d ago

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

arXiv:2606.12634v1 Announce Type: cross Abstract: Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by…

26
arXiv — NLP / Computation & Language research 18d ago

Agentic MPC for Semantic Control System Resynthesis

arXiv:2606.12774v1 Announce Type: cross Abstract: While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language…

21
arXiv — NLP / Computation & Language research 18d ago

ProPlay: Procedural World Models for Self-Evolving LLM Agents

arXiv:2606.12780v1 Announce Type: cross Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and…

33
