Tag

Agents + tool use

500 articles archived under #agents · RSS

arXiv — Machine Learning research 30m ago

On the Necessity of a Liquid Substrate for Mesh Intelligence

arXiv:2606.28413v1 Announce Type: new Abstract: A mesh of sovereign agents has no center: no shared clock, no shared model, and no coordinator to gather data or retrain. Its competence rests on each agent folding the projections its peers emit into a single internal state,…

8
arXiv — Machine Learning research 30m ago

An Agentic AI Pipeline for Appliance-Level Energy Anomaly Detection and LLM-Driven Recommendations

arXiv:2606.28467v1 Announce Type: new Abstract: Appliance-level energy monitoring in office buildings produces noisy alerts that non-expert facility managers struggle to use. This paper proposes an end-to-end agentic pipeline that combines deep time-series forecasting,…

11
arXiv — Machine Learning research 30m ago

Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization

arXiv:2606.28764v1 Announce Type: new Abstract: Hierarchical decision-making frameworks are pivotal for addressing complex control tasks, enabling agents to decompose intricate problems into manageable subgoals. Despite their promise, existing hierarchical policies face critical…

20
arXiv — Machine Learning research 30m ago

The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables

arXiv:2606.28839v1 Announce Type: new Abstract: We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps. From the tensor we derive the Coupling Amplification…

38
arXiv — Machine Learning research 30m ago

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived…

16
arXiv — Machine Learning research 30m ago

Modification-Considering Value Learning for Reward Hacking Mitigation in RL

arXiv:2606.28955v1 Announce Type: new Abstract: Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking. Existing practical defenses typically constrain…

10
arXiv — Machine Learning research 30m ago

A Linear Matching Bandit Approach to Online Multi-Human Multi-Robot Teaming

arXiv:2606.29221v1 Announce Type: new Abstract: We address the problem of online multi-human multi-robot teaming through the lens of a linear matching bandit framework, where a learner assigns robots with unknown features from a fixed pool to distinct sets of human agents over…

15
arXiv — Machine Learning research 30m ago

Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning

arXiv:2606.29280v1 Announce Type: new Abstract: We identify intervention bias as a previously unquantified failure mode of zero-shot large-language-model (LLM) educational advisory agents: without task-specific training, they recommend action when a hindsight-optimal oracle…

31
arXiv — Machine Learning research 30m ago

Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents

arXiv:2606.29459v1 Announce Type: new Abstract: Inverse design of metal-organic frameworks (MOFs) requires searching a combinatorially vast space where property labels are expensive and most machine-learning models reveal little about why a structure succeeds. We introduce…

8
arXiv — Machine Learning research 30m ago

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

arXiv:2606.29476v1 Announce Type: new Abstract: Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. The prevailing recipe gates this loss by a…

24
arXiv — NLP / Computation & Language research 30m ago

Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models

arXiv:2606.28524v1 Announce Type: new Abstract: Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described by text, as measured by the false belief task (FBT), yet persistent concerns of construct validity remain. We adopt a…

25
arXiv — NLP / Computation & Language research 30m ago

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce…

28
arXiv — NLP / Computation & Language research 30m ago

MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling

arXiv:2606.29265v1 Announce Type: new Abstract: Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. However, existing LLM-based counseling agents,…

17
arXiv — NLP / Computation & Language research 30m ago

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to…

33
arXiv — NLP / Computation & Language research 30m ago

SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution

arXiv:2606.29713v1 Announce Type: new Abstract: Hallucination is the reliability bottleneck for LLM-based agents, and fact attribution verifiers are the last line of defense -- yet today's verifiers emit only opaque binary labels, leaving agents unable to self-correct and…

24
arXiv — NLP / Computation & Language research 30m ago

Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering

arXiv:2606.29824v1 Announce Type: new Abstract: While Large Language Models (LLMs) excel as static solvers, transforming them into autonomous agents remains challenging. This transition requires continuous environmental interaction, yet current agents lack the necessary…

17
arXiv — NLP / Computation & Language research 30m ago

KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search

arXiv:2606.29863v1 Announce Type: new Abstract: Agentic search equips large language models with dynamic retrieval abilities, but existing reinforcement learning methods remain limited by reward sparsity in knowledge boundary calibration -- deciding when to trust parametric…

38
arXiv — NLP / Computation & Language research 30m ago

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it…

4
arXiv — NLP / Computation & Language research 30m ago

Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

arXiv:2606.29920v1 Announce Type: new Abstract: Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is…

17
arXiv — NLP / Computation & Language research 30m ago

LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard

arXiv:2606.30005v1 Announce Type: new Abstract: Long-horizon tool agents are bottlenecked by how their context grows toward the limits of the context window. Recent systems make context management agent- or system-controlled, but they either learn a compression policy that…

34
MIT Technology Review — AI news-outlet 10h ago

AI agents are not your “coworkers”

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here. Imagine coming in to work to learn that a new underling will report to you. The worker is not a person but an AI tool—one that your company…

6
TechCrunch — AI news-outlet 11h ago

Cursor now has a mobile app for guiding your coding agent on the go

Cursor has launched a new mobile app for remote oversight over coding agents.

29
Simon Willison community 12h ago

Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding

Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding This is an interesting new open weights (MIT licensed) model, the first model release from DeepReinforce. [...] with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen…

5
NVIDIA Developer Blog official-blog 12h ago

How to Govern Autonomous Agents in Enterprise AI Factories

AI agents are quickly moving beyond chat. They inspect code, run tests, read documents, search knowledge bases, query internal systems, and operate for hours on...

28
MIT Technology Review — AI news-outlet 13h ago

Agent confidence on the technical frontier

Enterprise investment in AI is booming. Gartner is calling 2026 an “inflection year” for organizations to align their AI projects with strategic business objectives. As the pressure to prove ROI mounts, executives and technology leaders are looking to agentic AI to drive the…

19
r/LocalLLaMA community 14h ago

Anyone else end up building a web access layer for local AI agents?

I've been running local models for most of my experiments, and I kept running into the same issue. The model lives locally, but everything it needs to interact with doesn't. Every new agent ended up with another GitHub client, another Reddit integration, another documentation…

10
r/MachineLearning community 18h ago

Google's Agentic Peer-Reviewer Handled ~10K Papers at ICML/STOC — Formal Research Paper Now Out [R]

Google deployed an agentic AI peer-reviewer at two top CS conferences — reviewing ~10,000 papers with 30-minute turnaround — and the new formal research paper shows it catches 34% more mathematical errors than zero-shot prompting; the precedent for AI-automated scientific review…

23
Vercel — AI dev-tools 21h ago

Build realtime voice agents on AI Gateway

AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI . Each…

26
arXiv — Machine Learning research 1d ago

Training Observable Control Policies to Expose Agent State Through Actions

arXiv:2606.27609v1 Announce Type: new Abstract: Physical or operational constraints often impose communications limitations on autonomous agents. Such limitations complicate monitoring or multiagent coordination. Even when strong communications are absent, some information may…

13
arXiv — Machine Learning research 1d ago

COOPA: A Modular LLM Agent Architecture for Operations Research Problems

arXiv:2606.27611v1 Announce Type: new Abstract: Operations Research (OR) provides a rigorous framework for high-stakes decision-making, but effective OR modeling requires substantial domain knowledge, mathematical abstraction, and solver expertise. Recent LLM-based systems…

18
arXiv — Machine Learning research 1d ago

CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association

arXiv:2606.28179v1 Announce Type: new Abstract: Identifying robust associations between cardiac imaging phenotypes and clinical diseases is fundamental to population-scale cardiovascular research and reliable risk stratification. However, current phenome-wide association studies…

13
arXiv — Machine Learning research 1d ago

LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior

arXiv:2606.28182v1 Announce Type: new Abstract: Embodied agents operating in decentralized and partially observable environments have attracted growing attention in recent years. However, existing large language model (LLM)-based agents often exhibit behaviors that are…

28
arXiv — Machine Learning research 1d ago

Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives

arXiv:2606.28217v1 Announce Type: new Abstract: We propose a framework for reward allocation in fully delegated AI cooperatives where humans are represented by agents that contribute data and participate in model updates under heterogeneous value constraints. The key idea is to…

32
arXiv — NLP / Computation & Language research 1d ago

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

arXiv:2606.27409v1 Announce Type: cross Abstract: Multi-agent large language model (LLM) systems often rely on verifier and critic agents to suppress hallucinations, but verification is delayed. During this delay, false claims can propagate through the agent network. We model…

25
arXiv — NLP / Computation & Language research 1d ago

Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents

arXiv:2606.27472v1 Announce Type: new Abstract: Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding…

16
arXiv — NLP / Computation & Language research 1d ago

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

arXiv:2606.27595v1 Announce Type: new Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated,…

32
arXiv — NLP / Computation & Language research 1d ago

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume…

27
arXiv — NLP / Computation & Language research 1d ago

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write…

11
arXiv — NLP / Computation & Language research 1d ago

AI Persuasive Framing in Collective Dilemmas

arXiv:2606.27951v1 Announce Type: cross Abstract: AI agents are promising tools that can act as flexible behavioral nudges to enhance human cooperation in addressing large-scale societal problems. However, evidence on whether AI agents can effectively boost cooperation remains…

32
arXiv — NLP / Computation & Language research 1d ago

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

arXiv:2510.16492v4 Announce Type: replace Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn…

20
arXiv — NLP / Computation & Language research 1d ago

Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification

arXiv:2511.03217v2 Announce Type: replace Abstract: Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet…

4
arXiv — NLP / Computation & Language research 1d ago

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

arXiv:2604.13072v2 Announce Type: replace Abstract: OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments. Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be…

28
arXiv — NLP / Computation & Language research 1d ago

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from…

34
arXiv — NLP / Computation & Language research 1d ago

Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using…

4
r/LocalLLaMA community 1d ago

I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers.

This is something I've been working on, I like playing around with smaller local models but found most agent harness's not well suited for them. The failure modes across different model family's tend to be the same: Failed tool calls Poor varication of environment variables Poor…

12
Vercel — AI dev-tools 1d ago

Realtime voice, speech, and transcription now supported on AI Gateway

AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text. This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with…

17
Simon Willison community 1d ago

Quoting Jon Udell

Human Agent in the loop I dislike the phrase “human in the loop” because it cedes authority to the machines. Let’s flip the narrative. It’s our loop, we work the same way we always have, now we recruit agents to join the team. An agent-assisted process need not be a black box…

4
r/LocalLLaMA community 2d ago

2x RX 9060xt 16gb, is it worth it?

I'm planning to buy 2x RX 9060xt with 16gb each to run Qwen 3.6 27B and alike. Would it be a good investment? How much tk/s should i expect in generation and prefill? I'm planning to use this as a coding agent in a large codebase. Currently I'm running this on my i7 64gb laptop…

35
r/LocalLLaMA community 2d ago

Full document redaction with Qwen 3.6 27B with a Pi agent harness

Link to full blog post with all method details, results, and links to all relevant code/skills/prompts at the bottom of this post. Apologies for not having more links throughout, it seems this subreddit restricts too many links in posts. Document redaction tasks are complex…

10
llama.cpp releases dev-tools 2d ago

b9827

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy ( #25057 ) [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous…

14

On the Necessity of a Liquid Substrate for Mesh Intelligence

An Agentic AI Pipeline for Appliance-Level Energy Anomaly Detection and LLM-Driven Recommendations

Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization

The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables

Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation

Modification-Considering Value Learning for Reward Hacking Mitigation in RL

A Linear Matching Bandit Approach to Online Multi-Human Multi-Robot Teaming

Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning

Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models

SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages

MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling

Hybrid Retriever Evolution for Multimodal Document Reasoning Agents

SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution

Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering

KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search

MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard

AI agents are not your &#8220;coworkers&#8221;

Cursor now has a mobile app for guiding your coding agent on the go

Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding

How to Govern Autonomous Agents in Enterprise AI Factories

Agent confidence on the technical frontier

Anyone else end up building a web access layer for local AI agents?

Google's Agentic Peer-Reviewer Handled ~10K Papers at ICML/STOC — Formal Research Paper Now Out [R]

Build realtime voice agents on AI Gateway

Training Observable Control Policies to Expose Agent State Through Actions

COOPA: A Modular LLM Agent Architecture for Operations Research Problems

CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association

LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior

Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives

Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement

Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

AI Persuasive Framing in Collective Dilemmas

Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification

LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers.

Realtime voice, speech, and transcription now supported on AI Gateway

Quoting Jon Udell

2x RX 9060xt 16gb, is it worth it?

Full document redaction with Qwen 3.6 27B with a Pi agent harness

b9827

AI agents are not your “coworkers”