Agents + tool use
500 articles archived under #agents

arXiv — Machine Learning research 15d ago Running the Gauntlet: Re-evaluating the Capabilities of Agents Beyond Familiar Environments arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications… 5

arXiv — Machine Learning research 15d ago Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher arXiv:2606.13710v1 Announce Type: cross Abstract: Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended… 6

arXiv — NLP / Computation & Language research 15d ago Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces arXiv:2606.13686v1 Announce Type: new Abstract: As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce… 25

arXiv — NLP / Computation & Language research 15d ago When Plausible Is Not Realistic: Evaluating Human Mobility in LLM-Based Urban Simulation arXiv:2606.13835v1 Announce Type: new Abstract: LLM-based generative agents are increasingly used in urban simulators, yet it remains unclear whether they reproduce empirically realistic human mobility patterns or merely generate plausible mobility narratives. We introduce a… 10

arXiv — NLP / Computation & Language research 15d ago SANA: What Matters for QA Agents over Massive Data Lakes?
arXiv:2606.13904v1 Announce Type: new Abstract: Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish… 38

arXiv — NLP / Computation & Language research 15d ago Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents arXiv:2606.13995v1 Announce Type: new Abstract: AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this… 10

arXiv — NLP / Computation & Language research 15d ago CacheRL:Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward arXiv:2606.14179v1 Announce Type: new Abstract: We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach… 7

arXiv — NLP / Computation & Language research 15d ago Retrospective Progress-Aware Self-Refinement for LLM Agent Training arXiv:2606.14302v1 Announce Type: new Abstract: LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online… 15

arXiv — NLP / Computation & Language research 15d ago SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model arXiv:2606.14574v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments.
While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type… 8

arXiv — NLP / Computation & Language research 15d ago LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations arXiv:2606.14600v1 Announce Type: new Abstract: Online group chats are social spaces with local conversational norms that are rarely stated explicitly. The ability and willingness of LLM-based agents to recognize and adapt to these norms remains mostly unexplored. We introduce… 32

arXiv — NLP / Computation & Language research 15d ago AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition arXiv:2606.14674v1 Announce Type: new Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often… 13

arXiv — NLP / Computation & Language research 15d ago Orchestra-o1: Omnimodal Agent Orchestration arXiv:2606.13707v1 Announce Type: cross Abstract: The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition… 38

arXiv — NLP / Computation & Language research 15d ago WorkBench Revisited: Workplace Agents Two Years On arXiv:2606.13715v1 Announce Type: cross Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best… 34

arXiv — NLP / Computation & Language research 15d ago Same-Origin Policy for Agentic Browsers arXiv:2606.14027v1 Announce Type: cross Abstract: Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions.
The same-origin policy (SOP) is a fundamental browser security mechanism that… 16

arXiv — NLP / Computation & Language research 15d ago GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software… 37

arXiv — NLP / Computation & Language research 15d ago Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows arXiv:2606.14672v1 Announce Type: cross Abstract: Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which… 28

arXiv — NLP / Computation & Language research 15d ago Large Language Model Agents Are Not Always Faithful Self-Evolvers arXiv:2601.22436v3 Announce Type: replace Abstract: Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the… 16

Vercel — AI dev-tools 15d ago Auth0 joins the Vercel Marketplace You can now add Auth0, a production-ready authentication to your Vercel app in just a few clicks. Built for modern frameworks like Next.js, Auth0 is an identity and access management platform for securing your apps and agentic workflows. This integration enables: Automatic… 26

r/MachineLearning community 16d ago The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R] We recently presented a paper at ACM CAIS 2026 on safety evaluation for tool-using LLM agents.
The core issue is that task completion alone can be misleading: an agent may complete a task while violating a safety or policy constraint. We separate outcomes into safe success,… 24

r/LocalLLaMA community 16d ago 32 bit crossplatform coding agent running on pentium m with less than a second startup time supports subagents, goals, MacOS, Unix, linux, BSD, windows 7 +, minimum needed CPU is 386, tried it on an 800mhz pentium 3 and still got sub second startup time and less than 1% cpu usage during use without --slow-cpu flag, prism supports plugins and is small enough to fit on a… 7

NVIDIA Developer Blog official-blog 17d ago NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how... 8

Hugging Face Daily Papers research 17d ago See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents Abstract Heterogeneous multi-agent systems can effectively transfer knowledge through aligned KV-cache communication, achieving better performance than text-based methods with reduced computational costs. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-agent systems… 21

Hugging Face Daily Papers research 17d ago Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents Abstract TRACE is a skill-layer pipeline that mines user corrections to create runtime checks, significantly reducing preference violations in interactive LLM agents.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Interactive LLM agents are becoming part of daily work, but they do… 30

Hacker News — AI on Front Page community 17d ago How to setup a local coding agent on macOS Article URL: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent-on-macos Comments URL: https://news.ycombinator.com/item?id=48507020 Points: 261 # Comments: 71 36

Hugging Face Daily Papers research 17d ago WebChallenger: A Reliable and Efficient Generalist Web Agent Abstract WebChallenger presents a web agent framework that improves autonomous navigation through structured page representation and cognitive-inspired mechanisms, achieving high performance with open-weight models. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Autonomous web… 15

Hugging Face Daily Papers research 17d ago The Cold-Start Safety Gap in LLM Agents Abstract Tool-calling language model agents exhibit improved safety after initial interactions, with a systematic benchmark demonstrating enhanced security through prior task completion. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Are tool-calling LLM agents equally safe… 37

r/LocalLLaMA community 17d ago MiniMax Sparse Attention (MSA) Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax… 14

NVIDIA Developer Blog official-blog 17d ago Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and... 25

r/MachineLearning community 17d ago Just thinking, what about conducting a 1 day virtual session on fundamentals of computer vision ???
[D] Hi all, A real story from my current experience: I'm associated with an internship where the primary work revolves around autonomous UAVs. What has shocked me the most is that almost everyone is so heavily focused on coding agents and AI tools that they're building things… 17

r/LocalLLaMA community 18d ago moonshotai/Kimi-K2.7-Code · Hugging Face Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing… 10

OpenAI official-blog 18d ago New OpenAI Academy courses for the next era of work OpenAI introduces three Academy courses that help people build practical AI skills, create repeatable workflows, and apply agents in everyday work. 27

Hugging Face Daily Papers research 18d ago ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages Abstract ArogyaBodha dataset and ArogyaSutra framework enhance multilingual medical reasoning in low-resource settings through diverse data integration and actor-critic multi-agent reasoning. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal Large Language Models (MLLMs)… 30

r/LocalLLaMA community 18d ago [browser-use-wasm] I made a browser-use agent that runs in WASM at zero cost The only cost is electricity! I built this in a few weeks since I couldn't find anything else like it. Demo: https://pdufour.github.io/browser-use-wasm/ Source Code: https://github.com/pdufour/browser-use-wasm One thing I've wanted to do for a while was add a widget to my page… 12

Smol AI News news-outlet 18d ago not much happened today **Anthropic** suspended access to **Claude Fable 5** and **Mythos 5** due to **US export controls**, sparking a debate on **model sovereignty** and geopolitical risks for frontier AI vendors.
**Artificial Analysis** updated its coding agent benchmark, replacing **SWE-Bench Pro**… 17

Hacker News — AI on Front Page community 18d ago AI agent bankrupted their operator while trying to scan DN42 Article URL: https://lantian.pub/en/article/fun/ai-agent-bankrupted-their-operator-scan-dn42lantian.lantian/ Comments URL: https://news.ycombinator.com/item?id=48500012 Points: 347 # Comments: 99 34

arXiv — NLP / Computation & Language research 18d ago LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling arXiv:2606.12837v1 Announce Type: new Abstract: Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global… 19

arXiv — NLP / Computation & Language research 18d ago SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents arXiv:2606.12908v1 Announce Type: new Abstract: Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an… 18

arXiv — NLP / Computation & Language research 18d ago G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents arXiv:2606.13115v1 Announce Type: new Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive… 15

arXiv — NLP / Computation & Language research 18d ago EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge arXiv:2606.13120v1 Announce Type: new Abstract: Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks.
Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to… 26

arXiv — NLP / Computation & Language research 18d ago MemRefine: LLM-Guided Compression for Long-Term Agent Memory arXiv:2606.13177v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate,… 4

arXiv — NLP / Computation & Language research 18d ago SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents arXiv:2606.13317v1 Announce Type: new Abstract: Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them,… 11

arXiv — NLP / Computation & Language research 18d ago From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent arXiv:2606.13349v1 Announce Type: new Abstract: Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence.
We argue that a key limitation is the… 29

arXiv — NLP / Computation & Language research 18d ago ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages arXiv:2606.13572v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource… 20

arXiv — NLP / Computation & Language research 18d ago Recursive Agent Harnesses arXiv:2606.13643v1 Announce Type: new Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in… 35

arXiv — NLP / Computation & Language research 18d ago HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents arXiv:2606.13663v1 Announce Type: new Abstract: Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally… 23

arXiv — NLP / Computation & Language research 18d ago EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments arXiv:2606.13681v1 Announce Type: new Abstract: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments.
In contrast, real-world deployment is inherently dynamic, requiring agents to… 30

arXiv — NLP / Computation & Language research 18d ago PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation arXiv:2606.12616v1 Announce Type: cross Abstract: Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single… 16

arXiv — NLP / Computation & Language research 18d ago Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents arXiv:2606.12634v1 Announce Type: cross Abstract: Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by… 26

arXiv — NLP / Computation & Language research 18d ago Agentic MPC for Semantic Control System Resynthesis arXiv:2606.12774v1 Announce Type: cross Abstract: While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language… 21

arXiv — NLP / Computation & Language research 18d ago ProPlay: Procedural World Models for Self-Evolving LLM Agents arXiv:2606.12780v1 Announce Type: cross Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and… 33