Tag

Agents + tool use

500 articles archived under #agents · RSS

arXiv — NLP / Computation & Language research 6d ago

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

arXiv:2606.23937v1 Announce Type: new Abstract: Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B…

11
arXiv — Machine Learning research 6d ago

Critique of Agent Model

arXiv:2606.23991v1 Announce Type: cross Abstract: What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as "coding agents", "AI co-scientists", and other "agentic" tools that promise to drive up productivity, and at the same…

31
arXiv — Machine Learning research 6d ago

Toward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines

arXiv:2606.24598v1 Announce Type: cross Abstract: While expert-validated "LLM + script" workflows deliver significant value, they remain static: they encode hard-won domain knowledge yet fail to adapt execution based on feedback. Existing agent research predominantly targets…

22
arXiv — Machine Learning research 6d ago

ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

arXiv:2606.24601v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives. Prior work has investigated transfer learning between source and target…

29
arXiv — NLP / Computation & Language research 6d ago

Metis: Bridging Text and Code Memory for Self-Evolving Agents

arXiv:2606.24151v1 Announce Type: new Abstract: Self-evolving agents improve over time by distilling experience from past executions and reusing it in future tasks. Existing systems represent such experience either as natural-language text injected into the agent context or as…

38
arXiv — NLP / Computation & Language research 6d ago

Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning

arXiv:2606.24428v1 Announce Type: new Abstract: Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent…

17
arXiv — NLP / Computation & Language research 6d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of…

36
arXiv — NLP / Computation & Language research 6d ago

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

arXiv:2606.24530v1 Announce Type: new Abstract: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real…

21
arXiv — NLP / Computation & Language research 6d ago

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

arXiv:2606.24595v1 Announce Type: new Abstract: Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through…

32
arXiv — NLP / Computation & Language research 6d ago

Qwen-AgentWorld: Language World Models for General Agents

arXiv:2606.24597v1 Announce Type: new Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can…

8
arXiv — NLP / Computation & Language research 6d ago

Are We Ready For An Agent-Native Memory System?

arXiv:2606.24775v1 Announce Type: new Abstract: Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic…

8
arXiv — NLP / Computation & Language research 6d ago

Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce

arXiv:2606.24783v1 Announce Type: new Abstract: Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2)…

23
arXiv — NLP / Computation & Language research 6d ago

SHERLOC: Structured Diagnostic Localization for Code Repair Agents

arXiv:2606.24820v1 Announce Type: new Abstract: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval…

20
arXiv — NLP / Computation & Language research 6d ago

Bayesian control for coding agents

arXiv:2606.24453v1 Announce Type: cross Abstract: Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty.…

21
arXiv — NLP / Computation & Language research 6d ago

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

arXiv:2409.11363v2 Announce Type: replace Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially,…

20
Hugging Face Daily Papers research 6d ago

Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning

Abstract EDV is a three-stage framework that uses multiple heterogeneous agents to collaboratively construct reliable experiences for LLM agents, preventing self-confirmatory errors through execute-distill-verify processes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

29
Hugging Face Daily Papers research 6d ago

Qwen-AgentWorld: Language World Models for General Agents

Abstract Language-based world models enable agentic environment simulation across multiple domains and enhance general agent performance through scalable simulation and improved downstream task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A world model predicts…

16
Hugging Face Daily Papers research 6d ago

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Abstract NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation…

21
Hugging Face Daily Papers research 6d ago

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

Abstract A comprehensive multimodal misinformation detection framework is introduced that handles complex, multilingual content with multiple images and diverse verification approaches, achieving superior performance while reducing computational costs. Generated by…

29
Hugging Face Daily Papers research 6d ago

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Abstract MemGUI-Agent addresses long-horizon mobile GUI task limitations through proactive context management using Context-as-Action (ConAct) to maintain critical information across extended sequences. Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents…

32
Hugging Face Daily Papers research 6d ago

MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

Abstract MobileForge enables efficient adaptation of mobile GUI agents through annotation-free learning by combining real app interaction grounding with hierarchical feedback-guided policy optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents…

18
TechCrunch — AI news-outlet 6d ago

India’s MoEngage bets that the future of marketing is millions of AI agents

The all-cash deal gives MoEngage access to technology that assigns AI agents to individual customers.

17
r/LocalLLaMA community 6d ago

Mimo 2.5 is _fast_ at large context (dual RTX Pro 6000)

For agentic work fast high context is king, OpenCode fills the window quickly and most models that feel snappy at 8k context turn into dial-up ADSL brrr by the time you're at 150k context deep. So I've been testing lots of models and runners trying to get "local Sonnet" on 2x…

14
r/LocalLLaMA community 6d ago

MiniMax2.7 @47tg 1200pp

MiniMax 2.7 REAP Q4 on 96GB VRAM and 192 GB DDR5 udimm ram on a b840 MSI board and 9900X cpu. 1250W PSU and all cards are power limited. Linux Ubuntu. Agent class model. Excellent instruction following and tool calling. I run this model in a round robin loop with 3 sequencing…

19
r/LocalLLaMA community 6d ago

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Hey everyone, wanted to share some work on making the new Tmax-27B terminal agent actually runnable on consumer hardware. What is Tmax-27B? Ai2 just released Tmax, a family of terminal-agent LLMs trained with DPPO (RL) on top of Qwen3.6. The 27B model hits ~43% on Terminal Bench…

32
Hugging Face Daily Papers research 6d ago

Self-Compacting Language Model Agents

Abstract SelfCompact is a scaffolding approach that enables models to autonomously determine optimal compaction timing and methods for managing long agent traces, achieving better performance with reduced token costs compared to fixed-interval methods. Generated by…

13
Hugging Face Daily Papers research 6d ago

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

Abstract Premature commitment in long-horizon LLM agents leads to silent failures where agents defend early interpretations without considering alternatives, and hidden-state convergence serves as an early diagnostic for trajectory consistency. Generated by…

24
Hugging Face Daily Papers research 6d ago

Libretto: Giving LLM Agents a Sense of Musical Structure

Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from…

18
Hugging Face Daily Papers research 6d ago

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?

Abstract Computer-use agents frequently expose inappropriate information across applications, prompting the creation of AgentCIBench to evaluate and mitigate privacy risks in cross-application contexts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use agents (CUAs) now…

7
NVIDIA Developer Blog official-blog 6d ago

Build an AI Scientist for Life Science Discovery with NVIDIA BioNeMo Agent Toolkit

AI scientists are emerging as a new interface for scientific computing. These agents can read papers, write code, generate hypotheses, call APIs, inspect files,...

12
TechCrunch — AI news-outlet 6d ago

Fika Jobs raises $4M to build a video-first hiring platform where AI agents interview candidates

Stockholm-based startup Fika Jobs is building a video-first hiring platform that combines AI interview agents with short-form video profiles, creating something that feels like a cross between LinkedIn and TikTok.

5
Hugging Face official-blog 6d ago

Build real agentic apps using CUGA: two dozen working examples on a lightweight harness

Enterprise Article. Published June 23, 2026. Anupama Murthi (anupamamurthi, ibm-research), Hamid Adebayo (harmedox, ibm-research), Sami Marreed (samimarreed)…

30
Hugging Face Daily Papers research 6d ago

Training Open Models for Agentic Phone Use

Abstract PhoneBuddy combines real and mock app environments to improve training of open models for phone use, demonstrating enhanced task success rates through mixed reinforcement learning approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Phones are becoming an important…

11
r/LocalLLaMA community 6d ago

Pooled round robin hardware with friends?

I have a rig. Friend1 has a rig. Friend2 has a rig. Each rig is idle 90% of the time. With agentic work, how could we round robin as a group? So when I load up, it checks if friends' rigs are idle (vpn etc) and if idle farms out tasks. If I understand right, agents work in parallel, so this…

29
Hugging Face Daily Papers research 6d ago

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Abstract A large-scale dataset of human-metaevaluations of LLM critiques for agentic tasks is introduced to improve the calibration and reliability of automated evaluation methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As agentic systems tackle increasingly complex…

22
r/LocalLLaMA community 6d ago

My local server idling 99% of the time!

Guys, what are you running to keep agents busy? Like some crazy 24/7 tasks, or maybe some useful ideas on how to utilize a local llm with some purpose/use? I'm personally running Qwen3.6-27B with owu and with pi for coding (little-coder) but as in the title - it's idling all the time…

33
Hugging Face Daily Papers research 6d ago

Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

Abstract Notes2Skills framework converts laboratory notes into verifiable skills for AI agents while maintaining author uncertainty levels, addressing gaps in scientific AI development. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Scientific discovery workflows usually contain…

27
Hugging Face Daily Papers research 6d ago

SkillHarness: Harnessing Safe Skills for Computer-Use Agents

Abstract SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-Use Agents (CUAs)…

24
Hugging Face Daily Papers research 7d ago

AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction

Abstract AOHP presents an Android-based operating system framework that treats AI agents as first-class entities, enhancing task completion rates and reducing execution costs through specialized agent-oriented mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI agents…

16
NVIDIA Developer Blog official-blog 7d ago

How Telcos Build Autonomous Networks with Agentic AI

Telecom operators are adopting AI across network operations, customer care, and back-office workflows, but most are still early in the journey to autonomy. In...

37
r/LocalLLaMA community 7d ago

Training a Qwen 3.5 4B/9B agent for multi-tool use: SFT first or go directly to RL?

I want to train Qwen 3.5 4B or 9B for a custom multi-tool agent workflow and would appreciate guidance from people who have done this successfully. A few questions: SFT → RL or RL-only? - Is it still recommended to first do supervised fine-tuning (tool-calling traces, reasoning…

15
Hugging Face Daily Papers research 7d ago

DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams

Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured…

19
Smol AI News news-outlet 7d ago

not much happened today

**Prime Intellect's `prime-rl` v0.6.0** advances agentic reinforcement learning infrastructure supporting **1 trillion parameter MoE models** with sub-5-minute step times and a **131k context GLM-5 agentic setup**. The release includes optimizations in inference, training, and…

37
r/LocalLLaMA community 7d ago

Is there any reason for a lack of love for Gemma 4 26b?

The answer to most questions on here is Qwen3.6 27b or 35b and then Gemma4 31b (but lesser so as it doesn’t fit well on a solo 3090). Is there any reason why Gemma 4 26b moe isn’t mentioned more? I plan on using Qwen for my coding agents. But I’ve been building a Jarvis for…

20
Hugging Face Daily Papers research 7d ago

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Abstract A principled synthesis engine generates high-quality terminal-agent tasks through multi-dimensional capability taxonomy and evidence-guided research, creating a distilled dataset that enables significant performance gains in LLM training. Generated by…

5
Hugging Face Daily Papers research 7d ago

Causal Discovery in the Era of Agents

Abstract Language models should assist causal discovery workflows by providing contextual support and explanations rather than generating causal conclusions, as demonstrated through a platform that integrates data analysis and expert knowledge. Generated by…

31
Hugging Face Daily Papers research 7d ago

EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions

Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents…

30
Hugging Face Daily Papers research 7d ago

Tmax: A simple recipe for terminal agents

Abstract A novel RL training approach for terminal agents achieves superior performance using a simplified recipe and expanded dataset, enabling effective training with fewer parameters than previous methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Terminal-using agents…

36
Hugging Face Daily Papers research 7d ago

OpenRath: Session-Centered Runtime State for Agent Systems

Abstract OpenRath introduces a PyTorch-like programming model for multi-agent systems using Session as a central runtime abstraction that enables explicit fork, merge, and replay operations while recording comprehensive execution state. Generated by…

21
Hugging Face Daily Papers research 7d ago

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Abstract Large language models can be trained through reinforcement learning to develop a meta-capability enabling continuous learning and adaptation across long sequences of tasks in dynamic environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This work presents a general…

31
