News / #agents Tag Agents + tool use 500 articles archived under #agents · RSS Sign in to follow arXiv — Machine Learning research 30m ago On the Necessity of a Liquid Substrate for Mesh Intelligence arXiv:2606.28413v1 Announce Type: new Abstract: A mesh of sovereign agents has no center: no shared clock, no shared model, and no coordinator to gather data or retrain. Its competence rests on each agent folding the projections its peers emit into a single internal state,… 8 arXiv — Machine Learning research 30m ago An Agentic AI Pipeline for Appliance-Level Energy Anomaly Detection and LLM-Driven Recommendations arXiv:2606.28467v1 Announce Type: new Abstract: Appliance-level energy monitoring in office buildings produces noisy alerts that non-expert facility managers struggle to use. This paper proposes an end-to-end agentic pipeline that combines deep time-series forecasting,… 11 arXiv — Machine Learning research 30m ago Hierarchical Decision Making with Structured Policies: A Principled Design via Inverse Optimization arXiv:2606.28764v1 Announce Type: new Abstract: Hierarchical decision-making frameworks are pivotal for addressing complex control tasks, enabling agents to decompose intricate problems into manageable subgoals. Despite their promise, existing hierarchical policies face critical… 20 arXiv — Machine Learning research 30m ago The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables arXiv:2606.28839v1 Announce Type: new Abstract: We introduce the Contagion Tensor, a measurement framework for quantifying how large language model (LLM) output distributions couple across modalities, agents, and time steps. From the tensor we derive the Coupling Amplification… 38 arXiv — Machine Learning research 30m ago Multi-Agent Routing as Set-Valued Prediction: A WildChat Benchmark and Cost-Aware Evaluation arXiv:2606.28925v1 Announce Type: new Abstract: Tool and agent routing from natural-language prompts is naturally a set-valued prediction problem: a single query may require multiple agents, while over-selection increases execution cost. The benchmark introduced here is derived… 16 arXiv — Machine Learning research 30m ago Modification-Considering Value Learning for Reward Hacking Mitigation in RL arXiv:2606.28955v1 Announce Type: new Abstract: Reinforcement learning agents can exploit misspecified reward signals to achieve high apparent returns while failing on the intended objective, a failure mode known as reward hacking. Existing practical defenses typically constrain… 10 arXiv — Machine Learning research 30m ago A Linear Matching Bandit Approach to Online Multi-Human Multi-Robot Teaming arXiv:2606.29221v1 Announce Type: new Abstract: We address the problem of online multi-human multi-robot teaming through the lens of a linear matching bandit framework, where a learner assigns robots with unknown features from a fixed pool to distinct sets of human agents over… 15 arXiv — Machine Learning research 30m ago Deterministic Decisions for High-Stakes AI. A Zero-Egress Pipeline with the Deployability of RAG and the Accuracy of Machine Learning arXiv:2606.29280v1 Announce Type: new Abstract: We identify intervention bias as a previously unquantified failure mode of zero-shot large-language-model (LLM) educational advisory agents: without task-specific training, they recommend action when a hindsight-optimal oracle… 31 arXiv — Machine Learning research 30m ago Interpretable Inverse Design of Metal-Organic Frameworks with Large Language Model Agents arXiv:2606.29459v1 Announce Type: new Abstract: Inverse design of metal-organic frameworks (MOFs) requires searching a combinatorially vast space where property labels are expensive and most machine-learning models reveal little about why a structure succeeds. We introduce… 8 arXiv — Machine Learning research 30m ago CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning arXiv:2606.29476v1 Announce Type: new Abstract: Self-distilled agentic reinforcement learning augments trajectory-level reward with a token-level distillation loss, using as its teacher the same policy conditioned on privileged context. The prevailing recipe gates this loss by a… 24 arXiv — NLP / Computation & Language research 30m ago Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models arXiv:2606.28524v1 Announce Type: new Abstract: Recent work suggests that Large Language Models (LLMs) are sensitive to the belief states of agents described by text, as measured by the false belief task (FBT), yet persistent concerns of construct validity remain. We adopt a… 25 arXiv — NLP / Computation & Language research 30m ago SEATauBench: Adapting Tool-Agent-User Evaluation Into Low-Resource Southeast Asian Languages arXiv:2606.28715v1 Announce Type: new Abstract: While AI development and evaluation for Southeast Asia (SEA) has grown rapidly, agent capabilities in regional languages are still poorly understood despite its importance to sovereign AI. To fill this gap, we introduce… 28 arXiv — NLP / Computation & Language research 30m ago MIThinker: A Plug-and-Play Policy-Optimized Thinker For Motivational Interviewing Counseling arXiv:2606.29265v1 Announce Type: new Abstract: Reasoning large language models (LLMs) have recently made much progress in complex problem-solving, leveraging internal reasoning (or thought) to guide their solution generation. However, existing LLM-based counseling agents,… 17 arXiv — NLP / Computation & Language research 30m ago Hybrid Retriever Evolution for Multimodal Document Reasoning Agents arXiv:2606.29648v1 Announce Type: new Abstract: Different retrievers, including lexical, semantic, and multimodal approaches, provide highly complementary strengths for multimodal document understanding, yet most systems combine them through fixed pipelines that cannot adapt to… 33 arXiv — NLP / Computation & Language research 30m ago SEVA: Self-Evolving Verification Agent with Process Reward for Fact Attribution arXiv:2606.29713v1 Announce Type: new Abstract: Hallucination is the reliability bottleneck for LLM-based agents, and fact attribution verifiers are the last line of defense -- yet today's verifiers emit only opaque binary labels, leaving agents unable to self-correct and… 24 arXiv — NLP / Computation & Language research 30m ago Neural Procedural Memory: Empowering LLM Agents with Implicit Activation Steering arXiv:2606.29824v1 Announce Type: new Abstract: While Large Language Models (LLMs) excel as static solvers, transforming them into autonomous agents remains challenging. This transition requires continuous environmental interaction, yet current agents lack the necessary… 17 arXiv — NLP / Computation & Language research 30m ago KbSD: Knowledge Boundary aware Self-Distillation for Behavioral Calibration in Agentic Search arXiv:2606.29863v1 Announce Type: new Abstract: Agentic search equips large language models with dynamic retrieval abilities, but existing reinforcement learning methods remain limited by reward sparsity in knowledge boundary calibration -- deciding when to trust parametric… 38 arXiv — NLP / Computation & Language research 30m ago MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation arXiv:2606.29914v1 Announce Type: new Abstract: Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it… 4 arXiv — NLP / Computation & Language research 30m ago Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios? arXiv:2606.29920v1 Announce Type: new Abstract: Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is… 17 arXiv — NLP / Computation & Language research 30m ago LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard arXiv:2606.30005v1 Announce Type: new Abstract: Long-horizon tool agents are bottlenecked by how their context grows toward the limits of the context window. Recent systems make context management agent- or system-controlled, but they either learn a compression policy that… 34 MIT Technology Review — AI news-outlet 10h ago AI agents are not your “coworkers” This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here. Imagine coming in to work to learn that a new underling will report to you. The worker is not a person but an AI tool—one that your company… 6 TechCrunch — AI news-outlet 11h ago Cursor now has a mobile app for guiding your coding agent on the go Cursor has launched a new mobile app for remote oversight over coding agents. 29 Simon Willison community 12h ago Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding This is an interesting new open weights (MIT licensed) model, the first model release from DeepReinforce. [...] with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen… 5 NVIDIA Developer Blog official-blog 12h ago How to Govern Autonomous Agents in Enterprise AI Factories AI agents are quickly moving beyond chat. They inspect code, run tests, read documents, search knowledge bases, query internal systems, and operate for hours on... 28 MIT Technology Review — AI news-outlet 13h ago Agent confidence on the technical frontier Enterprise investment in AI is booming. Gartner is calling 2026 an “inflection year” for organizations to align their AI projects with strategic business objectives. As the pressure to prove ROI mounts, executives and technology leaders are looking to agentic AI to drive the… 19 r/LocalLLaMA community 14h ago Anyone else end up building a web access layer for local AI agents? I've been running local models for most of my experiments, and I kept running into the same issue. The model lives locally, but everything it needs to interact with doesn't. Every new agent ended up with another GitHub client, another Reddit integration, another documentation… 10 r/MachineLearning community 18h ago Google's Agentic Peer-Reviewer Handled ~10K Papers at ICML/STOC — Formal Research Paper Now Out [R] Google deployed an agentic AI peer-reviewer at two top CS conferences — reviewing ~10,000 papers with 30-minute turnaround — and the new formal research paper shows it catches 34% more mathematical errors than zero-shot prompting; the precedent for AI-automated scientific review… 23 Vercel — AI dev-tools 21h ago Build realtime voice agents on AI Gateway AI Gateway now supports audio/voice. You can add realtime voice, text to speech, and speech to text with the same calls you already use for text, image, and video, routed through AI Gateway alongside every other modality. Audio launches with models from OpenAI and xAI . Each… 26 arXiv — Machine Learning research 1d ago Training Observable Control Policies to Expose Agent State Through Actions arXiv:2606.27609v1 Announce Type: new Abstract: Physical or operational constraints often impose communications limitations on autonomous agents. Such limitations complicate monitoring or multiagent coordination. Even when strong communications are absent, some information may… 13 arXiv — Machine Learning research 1d ago COOPA: A Modular LLM Agent Architecture for Operations Research Problems arXiv:2606.27611v1 Announce Type: new Abstract: Operations Research (OR) provides a rigorous framework for high-stakes decision-making, but effective OR modeling requires substantial domain knowledge, mathematical abstraction, and solver expertise. Recent LLM-based systems… 18 arXiv — Machine Learning research 1d ago CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association arXiv:2606.28179v1 Announce Type: new Abstract: Identifying robust associations between cardiac imaging phenotypes and clinical diseases is fundamental to population-scale cardiovascular research and reliable risk stratification. However, current phenome-wide association studies… 13 arXiv — Machine Learning research 1d ago LLawCo: Learning Laws of Cooperation for Modeling Embodied Multi-Agent Behavior arXiv:2606.28182v1 Announce Type: new Abstract: Embodied agents operating in decentralized and partially observable environments have attracted growing attention in recent years. However, existing large language model (LLM)-based agents often exhibit behaviors that are… 28 arXiv — Machine Learning research 1d ago Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives arXiv:2606.28217v1 Announce Type: new Abstract: We propose a framework for reward allocation in fully delegated AI cooperatives where humans are represented by agents that contribute data and participate in model updates under heterogeneous value constraints. The key idea is to… 32 arXiv — NLP / Computation & Language research 1d ago Delayed Verification Destabilizes Multi-Agent LLM Belief: Instability Thresholds and Optimal Corrector Placement arXiv:2606.27409v1 Announce Type: cross Abstract: Multi-agent large language model (LLM) systems often rely on verifier and critic agents to suppress hallucinations, but verification is delayed. During this delay, false claims can propagate through the agent network. We model… 25 arXiv — NLP / Computation & Language research 1d ago Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents arXiv:2606.27472v1 Announce Type: new Abstract: Large language model (LLM) agents operate over long, multi-session interactions in which facts change: a user moves, a price updates, a plan is revised. Acting correctly requires using the current value of a fact and discarding… 16 arXiv — NLP / Computation & Language research 1d ago Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents arXiv:2606.27595v1 Announce Type: new Abstract: Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated,… 32 arXiv — NLP / Computation & Language research 1d ago When Search Agents Should Ask: DiscoBench for Clarification-Aware Deep Search arXiv:2606.27669v1 Announce Type: new Abstract: Search agents powered by large language models (LLMs) are increasingly used to solve complex information-seeking tasks, requiring multi-step retrieval and reasoning to fulfill user goals. However, existing benchmarks often assume… 27 arXiv — NLP / Computation & Language research 1d ago DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection arXiv:2606.27499v1 Announce Type: cross Abstract: Research on agent memory has matured rapidly, but almost entirely on the text side: few existing benchmarks ask, in an interactive environment, when an agent genuinely needs to remember what it saw rather than what it could write… 11 arXiv — NLP / Computation & Language research 1d ago AI Persuasive Framing in Collective Dilemmas arXiv:2606.27951v1 Announce Type: cross Abstract: AI agents are promising tools that can act as flexible behavioral nudges to enhance human cooperation in addressing large-scale societal problems. However, evidence on whether AI agents can effectively boost cooperation remains… 32 arXiv — NLP / Computation & Language research 1d ago Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety arXiv:2510.16492v4 Announce Type: replace Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn… 20 arXiv — NLP / Computation & Language research 1d ago Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification arXiv:2511.03217v2 Announce Type: replace Abstract: Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet… 4 arXiv — NLP / Computation & Language research 1d ago LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks arXiv:2604.13072v2 Announce Type: replace Abstract: OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments. Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be… 28 arXiv — NLP / Computation & Language research 1d ago EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning arXiv:2603.09731v3 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from… 34 arXiv — NLP / Computation & Language research 1d ago Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents arXiv:2606.16682v3 Announce Type: replace-cross Abstract: When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using… 4 r/LocalLLaMA community 1d ago I built an agent Harness for Small Models. I got Qwen 3.5 4b managing servers. This is something I've been working on, I like playing around with smaller local models but found most agent harness's not well suited for them. The failure modes across different model family's tend to be the same: Failed tool calls Poor varication of environment variables Poor… 12 Vercel — AI dev-tools 1d ago Realtime voice, speech, and transcription now supported on AI Gateway AI Gateway now supports voice and audio models. You can build realtime voice agents, generate speech from text, and transcribe audio to text. This provides the same observability, spend controls, and bring-your-own-key support as text, image, and video models in AI Gateway, with… 17 Simon Willison community 1d ago Quoting Jon Udell Human Agent in the loop I dislike the phrase “human in the loop” because it cedes authority to the machines. Let’s flip the narrative. It’s our loop, we work the same way we always have, now we recruit agents to join the team. An agent-assisted process need not be a black box… 4 r/LocalLLaMA community 2d ago 2x RX 9060xt 16gb, is it worth it? I'm planning to buy 2x RX 9060xt with 16gb each to run Qwen 3.6 27B and alike. Would it be a good investment? How much tk/s should i expect in generation and prefill? I'm planning to use this as a coding agent in a large codebase. Currently I'm running this on my i7 64gb laptop… 35 r/LocalLLaMA community 2d ago Full document redaction with Qwen 3.6 27B with a Pi agent harness Link to full blog post with all method details, results, and links to all relevant code/skills/prompts at the bottom of this post. Apologies for not having more links throughout, it seems this subreddit restricts too many links in posts. Document redaction tasks are complex… 10 llama.cpp releases dev-tools 2d ago b9827 [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy ( #25057 ) [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous… 14 Page 1 of 10 · 500 articles Older →