Tag: Agents + tool use
500 articles archived under #agents

r/LocalLLaMA · community · 12d ago
Headless screenshot loops let a local 30B agent finish a raytraced FPS demo in pure C Some background so this is honest. Over the past few months I ran a lot of oneshot experiments with single file three.js games. Minecraft clones, that kind of thing. I picked those on purpose because they sit deep in the training data and are trivial to debug by eye. The goal… 37

Hugging Face Daily Papers · research · 12d ago
Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion Abstract DR-DCI framework combines retrieval with direct corpus interaction by dynamically pulling relevant documents into a local workspace, enabling scalable and efficient agentic search across large corpora. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic search over… 27

Hugging Face Daily Papers · research · 12d ago
Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning Abstract Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming proprietary models on real-world web environments.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models (MLLMs) have demonstrated… 25

llama.cpp releases · dev-tools · 12d ago
b9674 SYCL: fix use-after-free bug with async memcpy in MoE prefill (#24676) SYCL: fix a bug with async memcpy make mmid_row_mapping_host persistent comment on stream->wait Apply suggestion from @sanmai Apply suggestion from @sanmai Apply suggestion from @sanmai macOS/iOS: macOS… 34

Hugging Face · official-blog · 12d ago
From the Hugging Face Hub to robot hardware with Strands Agents and LeRobot Enterprise Article. Published June 17, 2026. Sundar Raghavan (rsundaraws, amazon), Cagatay Cali (cagataydev, amazon). A walkthrough of the LeRobot integration in Strands… 28

arXiv — Machine Learning · research · 13d ago
ProCUA-SFT Technical Report arXiv:2606.17321v1 Announce Type: new Abstract: Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest… 9

arXiv — Machine Learning · research · 13d ago
Offline Preference-Based Trajectory Evaluation arXiv:2606.17541v1 Announce Type: new Abstract: Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective… 20

arXiv — NLP / Computation & Language · research · 13d ago
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning arXiv:2606.17680v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents.
However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards.… 15

arXiv — NLP / Computation & Language · research · 13d ago
MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision arXiv:2606.17162v1 Announce Type: new Abstract: Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn… 25

arXiv — NLP / Computation & Language · research · 13d ago
PromptMN: Pseudo Prompting Language arXiv:2606.17164v1 Announce Type: new Abstract: Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic… 13

arXiv — NLP / Computation & Language · research · 13d ago
Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery arXiv:2606.17519v1 Announce Type: new Abstract: Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed… 14

arXiv — NLP / Computation & Language · research · 13d ago
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation arXiv:2606.17628v1 Announce Type: new Abstract: Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it.
Existing memory agents can store trajectories, retrieve reflections, or accumulate… 37

arXiv — NLP / Computation & Language · research · 13d ago
Environment-Grounded Automated Prompt Optimization for LLM Game Agents arXiv:2606.17838v1 Announce Type: new Abstract: LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes… 20

arXiv — NLP / Computation & Language · research · 13d ago
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? arXiv:2606.17861v1 Announce Type: new Abstract: Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a… 28

arXiv — NLP / Computation & Language · research · 13d ago
Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose arXiv:2606.18051v1 Announce Type: new Abstract: LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem:… 8

arXiv — NLP / Computation & Language · research · 13d ago
RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills arXiv:2606.18203v1 Announce Type: new Abstract: The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access.
However, large-scale clinical deployment remains constrained by an… 28

arXiv — NLP / Computation & Language · research · 13d ago
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues arXiv:2606.18237v1 Announce Type: new Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale… 36

arXiv — NLP / Computation & Language · research · 13d ago
Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization arXiv:2606.17092v1 Announce Type: cross Abstract: Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a… 8

arXiv — NLP / Computation & Language · research · 13d ago
Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models arXiv:2606.17389v1 Announce Type: cross Abstract: Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that… 24

arXiv — NLP / Computation & Language · research · 13d ago
PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents arXiv:2606.17467v1 Announce Type: cross Abstract: Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content.
We demonstrate this gap with… 17

arXiv — NLP / Computation & Language · research · 13d ago
Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns arXiv:2606.17645v1 Announce Type: cross Abstract: Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow… 32

arXiv — NLP / Computation & Language · research · 13d ago
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent arXiv:2606.17698v1 Announce Type: cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked.… 24

arXiv — NLP / Computation & Language · research · 13d ago
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering arXiv:2606.17799v1 Announce Type: cross Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically… 33

arXiv — NLP / Computation & Language · research · 13d ago
A Framework for Evaluating Agentic Skills at Scale arXiv:2606.17819v1 Announce Type: cross Abstract: Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain… 10

arXiv — NLP / Computation & Language · research · 13d ago
ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents arXiv:2606.18037v1 Announce Type: cross Abstract: Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases,
clinical records, and formulary tools. Standard factuality metrics usually… 27

arXiv — NLP / Computation & Language · research · 13d ago
PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience arXiv:2606.18060v1 Announce Type: cross Abstract: As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that… 13

arXiv — NLP / Computation & Language · research · 13d ago
Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models arXiv:2606.18142v1 Announce Type: cross Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts,… 21

arXiv — NLP / Computation & Language · research · 13d ago
Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning arXiv:2601.03872v2 Announce Type: replace Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool… 27

arXiv — NLP / Computation & Language · research · 13d ago
LVLMs and Humans Ground Differently in Referential Communication arXiv:2601.19792v4 Announce Type: replace Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common… 9

Vercel — AI · dev-tools · 13d ago
Introducing Vercel Connect Giving your agents access to your tools, data, and services is what makes them useful.
As agents perform deeper work across systems, authenticating and authorizing that access becomes central to your application architecture. Today, agent access is usually granted through… 21

Vercel — AI · dev-tools · 13d ago
Introducing eve Today, we are proud to introduce eve, an open-source agent framework for building, running, and scaling agents. eve is designed around the idea that building an agent should mean defining what it does without assembling all of the pieces that it needs to run in production.… 15

Hugging Face Daily Papers · research · 13d ago
MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision Abstract MemSlides presents a hierarchical memory framework for personalized presentation agents that separates long-term user profiles, working memory for session constraints, and tool memory for reusable execution experiences to enable stable personalization and reliable local… 21

Hugging Face Daily Papers · research · 13d ago
ProCUA-SFT Technical Report Abstract Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training computer-use agents… 4

Hugging Face Daily Papers · research · 13d ago
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation Abstract OPD-Evolver is a self-evolving agent framework that combines slow-fast co-evolution with on-policy self-distillation to enhance memory management and policy learning across multiple domains.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory has become a standard… 28

Hugging Face Daily Papers · research · 13d ago
Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus Abstract Research agents face significant challenges when evidence is in a different language than the query, with performance degrading even when gold evidence is provided directly. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep research agents are increasingly evaluated on… 28

Hugging Face Daily Papers · research · 13d ago
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive… 31

Hugging Face Daily Papers · research · 13d ago
LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching Abstract LectūraAgents is a multi-agent framework that enables personalized learning through adaptive embodied teaching by mimicking professor-student interactions and generating coordinated teaching actions aligned with learner profiles. Generated by… 9

Vercel — AI · dev-tools · 13d ago
Introducing eve, an open-source agent framework eve is now available in public preview. eve is an open-source framework for building, running, and scaling agents. An agent is just a directory of files, and production comes built in: Durable execution, Sandboxed compute, Human-in-the-loop approvals, Subagents, Evals. The smallest… 31

Hugging Face · official-blog · 13d ago
Agentic Resource Discovery: Let agents search for tools, skills, and other agents.
Published June 17, 2026. ben burtenshaw (burtenshaw), shaun smith (evalstate). If you build with agents today, you probably know three protocols.… 15

Vercel — AI · dev-tools · 13d ago
CLI deployment limits removed We've removed CLI-specific deployment limits, making it easier to deploy from local machine and external CI/CD pipelines with instant feedback. Teams and AI agents can now deploy at the pace their workflows demand. Learn more about limits in the Documentation. 5

Vercel — AI · dev-tools · 13d ago
Vercel for Enterprise Apps and Agents Today we are introducing Vercel for Enterprise Apps and Agents, a platform that gives your entire company the ability to ship with AI safely, behind your access and security boundaries. Over the past year, employees across Vercel shipped hundreds of agents and internal apps.… 34

NVIDIA Developer Blog · official-blog · 13d ago
Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI Developers building for AR glasses and wearable devices face an infrastructure gap. The hardware is ready, but creating AI experiences requires integrating live... 33

Ars Technica — AI · news-outlet · 13d ago
Anthropic "pauses" token-based billing for its Claude Agent SDK Move originally planned for Monday would have heavily increased power users' costs. 21

NVIDIA Developer Blog · official-blog · 13d ago
Build On-Device AI Companions with the NVIDIA ACE Game Agent SDK and Unreal Engine 5 Plugins NVIDIA RTX technologies are deeply integrated into Unreal Engine 5 through the NVIDIA RTX Branch of Unreal Engine and the NVIDIA DLSS Unreal Engine plugin. This... 23

Google DeepMind · official-blog · 13d ago
Securing the future of AI agents Securing internal systems with an AI Control Roadmap, combining traditional safeguards and real-time monitoring.
27

Hugging Face Daily Papers · research · 13d ago
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale Abstract Ling-2.6 and Ring-2.6 models are presented as scalable solutions for agentic intelligence, featuring architectural upgrades and specialized training methods to balance fast response times with advanced reasoning capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 34

TechCrunch — AI · news-outlet · 14d ago
Malaysia’s AI agent-powered messaging app Respond.io raises $62.5M, eyes acquisitions Respond.io, one of Malaysia startups to watch, uses AI agents to handle high volumes of customer inquiries and charges per convo, not per seat. 9

Smol AI News · news-outlet · 14d ago
GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs Z.ai released GLM-5.2, an MIT-licensed open-weight frontier model targeting coding and long-horizon agentic tasks with a 1M-token context window and two reasoning-effort modes. It features a 744B-parameter mixture-of-experts architecture with 40B active… 14

Hugging Face Daily Papers · research · 14d ago
Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Web agents act through long… 28

Hugging Face Daily Papers · research · 14d ago
PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions Abstract PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.… 13