Tag

Agents + tool use

500 articles archived under #agents · RSS

r/LocalLLaMA community 12d ago

Headless screenshot loops let a local 30B agent finish a raytraced FPS demo in pure C

Some background so this is honest. Over the past few months I ran a lot of one-shot experiments with single-file three.js games: Minecraft clones, that kind of thing. I picked those on purpose because they sit deep in the training data and are trivial to debug by eye. The goal…

37
Hugging Face Daily Papers research 12d ago

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Abstract DR-DCI framework combines retrieval with direct corpus interaction by dynamically pulling relevant documents into a local workspace, enabling scalable and efficient agentic search across large corpora. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic search over…

27
Hugging Face Daily Papers research 12d ago

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Abstract Visual-Seeker enables visual-native multimodal deep search through active visual reasoning, outperforming proprietary models on real-world web environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multimodal large language models (MLLMs) have demonstrated…

25
llama.cpp releases dev-tools 12d ago

b9674

SYCL: fix use-after-free bug with async memcpy in MoE prefill (#24676); SYCL: fix a bug with async memcpy; make mmid_row_mapping_host persistent; comment on stream->wait; Apply suggestion from @sanmai; macOS/iOS: macOS…

34
Hugging Face official-blog 12d ago

From the Hugging Face Hub to robot hardware with Strands Agents and LeRobot

Enterprise Article. Published June 17, 2026. Sundar Raghavan (rsundaraws, amazon), Cagatay Cali (cagataydev, amazon). A walkthrough of the LeRobot integration in Strands…

28
arXiv — Machine Learning research 13d ago

ProCUA-SFT Technical Report

arXiv:2606.17321v1 Announce Type: new Abstract: Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest…

9
arXiv — Machine Learning research 13d ago

Offline Preference-Based Trajectory Evaluation

arXiv:2606.17541v1 Announce Type: new Abstract: Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective…

20
arXiv — NLP / Computation & Language research 13d ago

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

arXiv:2606.17680v1 Announce Type: cross Abstract: Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards.…

15
arXiv — NLP / Computation & Language research 13d ago

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

arXiv:2606.17162v1 Announce Type: new Abstract: Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn…

25
arXiv — NLP / Computation & Language research 13d ago

PromptMN: Pseudo Prompting Language

arXiv:2606.17164v1 Announce Type: new Abstract: Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic…

13
arXiv — NLP / Computation & Language research 13d ago

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

arXiv:2606.17519v1 Announce Type: new Abstract: Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed…

14
arXiv — NLP / Computation & Language research 13d ago

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

arXiv:2606.17628v1 Announce Type: new Abstract: Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate…

37
arXiv — NLP / Computation & Language research 13d ago

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

arXiv:2606.17838v1 Announce Type: new Abstract: LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes…

20
arXiv — NLP / Computation & Language research 13d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

arXiv:2606.17861v1 Announce Type: new Abstract: Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a…

28
arXiv — NLP / Computation & Language research 13d ago

Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and Compose

arXiv:2606.18051v1 Announce Type: new Abstract: LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem:…

8
arXiv — NLP / Computation & Language research 13d ago

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

arXiv:2606.18203v1 Announce Type: new Abstract: The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an…

28
arXiv — NLP / Computation & Language research 13d ago

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

arXiv:2606.18237v1 Announce Type: new Abstract: Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale…

36
arXiv — NLP / Computation & Language research 13d ago

Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization

arXiv:2606.17092v1 Announce Type: cross Abstract: Agentic systems are increasingly integrated with geographic information systems (GIS), where multi-agent coordination enables complex conversational and spatial analysis but introduces security risks. This work presents a…

8
arXiv — NLP / Computation & Language research 13d ago

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

arXiv:2606.17389v1 Announce Type: cross Abstract: Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that…

24
arXiv — NLP / Computation & Language research 13d ago

PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents

arXiv:2606.17467v1 Announce Type: cross Abstract: Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with…

17
arXiv — NLP / Computation & Language research 13d ago

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

arXiv:2606.17645v1 Announce Type: cross Abstract: Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow…

32
arXiv — NLP / Computation & Language research 13d ago

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

arXiv:2606.17698v1 Announce Type: cross Abstract: As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked.…

24
arXiv — NLP / Computation & Language research 13d ago

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

arXiv:2606.17799v1 Announce Type: cross Abstract: Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically…

33
arXiv — NLP / Computation & Language research 13d ago

A Framework for Evaluating Agentic Skills at Scale

arXiv:2606.17819v1 Announce Type: cross Abstract: Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain…

10
arXiv — NLP / Computation & Language research 13d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

arXiv:2606.18037v1 Announce Type: cross Abstract: Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually…

27
arXiv — NLP / Computation & Language research 13d ago

PseudoBench: Measuring How Agentic Auto-Research Fuels Pseudoscience

arXiv:2606.18060v1 Announce Type: cross Abstract: As Large Language Model based agents enter autonomous scientific research, their ability to resist pseudoscience becomes increasingly important. Otherwise, such systems may rapidly generate plausible yet misleading studies that…

13
arXiv — NLP / Computation & Language research 13d ago

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

arXiv:2606.18142v1 Announce Type: cross Abstract: AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts,…

21
arXiv — NLP / Computation & Language research 13d ago

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

arXiv:2601.03872v2 Announce Type: replace Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool…

27
arXiv — NLP / Computation & Language research 13d ago

LVLMs and Humans Ground Differently in Referential Communication

arXiv:2601.19792v4 Announce Type: replace Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common…

9
Vercel — AI dev-tools 13d ago

Introducing Vercel Connect

Giving your agents access to your tools, data, and services is what makes them useful. As agents perform deeper work across systems, authenticating and authorizing that access becomes central to your application architecture. Today, agent access is usually granted through…

21
Vercel — AI dev-tools 13d ago

Introducing eve

Today, we are proud to introduce eve, an open-source agent framework for building, running, and scaling agents. eve is designed around the idea that building an agent should mean defining what it does without assembling all of the pieces that it needs to run in production.…

15
Hugging Face Daily Papers research 13d ago

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

Abstract MemSlides presents a hierarchical memory framework for personalized presentation agents that separates long-term user profiles, working memory for session constraints, and tool memory for reusable execution experiences to enable stable personalization and reliable local…

21
Hugging Face Daily Papers research 13d ago

ProCUA-SFT Technical Report

Abstract Training computer-use agents using a large-scale synthetic dataset with automated task generation and verification achieves significantly improved performance on desktop interaction benchmarks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training computer-use agents…

4
Hugging Face Daily Papers research 13d ago

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

Abstract OPD-Evolver is a self-evolving agent framework that combines slow-fast co-evolution with on-policy self-distillation to enhance memory management and policy learning across multiple domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Memory has become a standard…

28
Hugging Face Daily Papers research 13d ago

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Abstract Research agents face significant challenges when evidence is in a different language than the query, with performance degrading even when gold evidence is provided directly. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Deep research agents are increasingly evaluated on…

28
Hugging Face Daily Papers research 13d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Abstract End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive…

31
Hugging Face Daily Papers research 13d ago

LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

Abstract LectūraAgents is a multi-agent framework that enables personalized learning through adaptive embodied teaching by mimicking professor-student interactions and generating coordinated teaching actions aligned with learner profiles. Generated by…

9
Vercel — AI dev-tools 13d ago

Introducing eve, an open-source agent framework

eve is now available in public preview. eve is an open-source framework for building, running, and scaling agents. An agent is just a directory of files, and production comes built in: Durable execution Sandboxed compute Human-in-the-loop approvals Subagents Evals The smallest…

31
Hugging Face official-blog 13d ago

Agentic Resource Discovery: Let agents search

Agentic Resource Discovery: Let agents search for tools, skills, and other agents. Published June 17, 2026. ben burtenshaw (burtenshaw), shaun smith (evalstate). If you build with agents today, you probably know three protocols.…

15
Vercel — AI dev-tools 13d ago

CLI deployment limits removed

We've removed CLI-specific deployment limits, making it easier to deploy from local machines and external CI/CD pipelines with instant feedback. Teams and AI agents can now deploy at the pace their workflows demand. Learn more about limits in the Documentation.

5
Vercel — AI dev-tools 13d ago

Vercel for Enterprise Apps and Agents

Today we are introducing Vercel for Enterprise Apps and Agents, a platform that gives your entire company the ability to ship with AI safely, behind your access and security boundaries. Over the past year, employees across Vercel shipped hundreds of agents and internal apps.…

34
NVIDIA Developer Blog official-blog 13d ago

Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI

Developers building for AR glasses and wearable devices face an infrastructure gap. The hardware is ready, but creating AI experiences requires integrating live...

33
Ars Technica — AI news-outlet 13d ago

Anthropic "pauses" token-based billing for its Claude Agent SDK

Move originally planned for Monday would have heavily increased power users' costs.

21
NVIDIA Developer Blog official-blog 13d ago

Build On-Device AI Companions with the NVIDIA ACE Game Agent SDK and Unreal Engine 5 Plugins

NVIDIA RTX technologies are deeply integrated into Unreal Engine 5 through the NVIDIA RTX Branch of Unreal Engine and the NVIDIA DLSS Unreal Engine plugin. This...

23
Google DeepMind official-blog 13d ago

Securing the future of AI agents

Securing internal systems with an AI Control Roadmap, combining traditional safeguards and real-time monitoring.

27
Hugging Face Daily Papers research 13d ago

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Abstract Ling-2.6 and Ring-2.6 models are presented as scalable solutions for agentic intelligence, featuring architectural upgrades and specialized training methods to balance fast response times with advanced reasoning capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

34
TechCrunch — AI news-outlet 14d ago

Malaysia’s AI agent-powered messaging app Respond.io raises $62.5M, eyes acquisitions

Respond.io, one of Malaysia's startups to watch, uses AI agents to handle high volumes of customer inquiries and charges per convo, not per seat.

9
Smol AI News news-outlet 14d ago

GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs

Z.ai released GLM-5.2, an MIT-licensed open-weight frontier model targeting coding and long-horizon agentic tasks with a 1M-token context window and two reasoning-effort modes. It features a 744B-parameter mixture-of-experts architecture with 40B active…

14
Hugging Face Daily Papers research 14d ago

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

Abstract WebStep benchmark enables process-level analysis of web agents through semantic MDP tracking, revealing detailed performance differences and error localization that terminal success metrics miss. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Web agents act through long…

28
Hugging Face Daily Papers research 14d ago

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

Abstract PhoneHarness presents a mixed-action benchmark and execution framework for evaluating phone-use agents on verifiable mobile workflows, demonstrating superior performance over existing approaches through deterministic action routing and auditable execution traces.…

13
