Tag: Agents + tool use · 500 articles archived under #agents arXiv — NLP / Computation & Language research 6d ago When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents arXiv:2606.23937v1 Announce Type: new Abstract: Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B… 11 arXiv — Machine Learning research 6d ago Critique of Agent Model arXiv:2606.23991v1 Announce Type: cross Abstract: What is an agent? What constitutes agency? With the rise of Large Language Model (LLM) systems marketed as "coding agents", "AI co-scientists", and other "agentic" tools that promise to drive up productivity, and at the same… 31 arXiv — Machine Learning research 6d ago Toward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines arXiv:2606.24598v1 Announce Type: cross Abstract: While expert-validated "LLM + script" workflows deliver significant value, they remain static: they encode hard-won domain knowledge yet fail to adapt execution based on feedback. Existing agent research predominantly targets… 22 arXiv — Machine Learning research 6d ago ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning arXiv:2606.24601v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) addresses the problem of training multiple agents that pursue collaborative, competitive, or mixed objectives.
Prior work has investigated transfer learning between source and target… 29 arXiv — NLP / Computation & Language research 6d ago Metis: Bridging Text and Code Memory for Self-Evolving Agents arXiv:2606.24151v1 Announce Type: new Abstract: Self-evolving agents improve over time by distilling experience from past executions and reusing it in future tasks. Existing systems represent such experience either as natural-language text injected into the agent context or as… 38 arXiv — NLP / Computation & Language research 6d ago Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning arXiv:2606.24428v1 Announce Type: new Abstract: Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent… 17 arXiv — NLP / Computation & Language research 6d ago AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning arXiv:2606.24526v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy collection of… 36 arXiv — NLP / Computation & Language research 6d ago NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? 
arXiv:2606.24530v1 Announce Type: new Abstract: We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real… 21 arXiv — NLP / Computation & Language research 6d ago MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery arXiv:2606.24595v1 Announce Type: new Abstract: Long-term memory promises LLM agents that grow more capable across sessions, maintaining an accurate, evolving understanding of the user that interaction forms. In practice, however, this memory is evaluated mostly through… 32 arXiv — NLP / Computation & Language research 6d ago Qwen-AgentWorld: Language World Models for General Agents arXiv:2606.24597v1 Announce Type: new Abstract: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can… 8 arXiv — NLP / Computation & Language research 6d ago Are We Ready For An Agent-Native Memory System? arXiv:2606.24775v1 Announce Type: new Abstract: Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic… 8 arXiv — NLP / Computation & Language research 6d ago Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce arXiv:2606.24783v1 Announce Type: new Abstract: Commercial NLP treats the shopping chatbot as a recommender or a conversion tool: its job is to match a user to a catalogue entry and close a sale. 
We argue that the arrival of agent-native micro-payment rails (e.g., x402, AP2)… 23 arXiv — NLP / Computation & Language research 6d ago SHERLOC: Structured Diagnostic Localization for Code Repair Agents arXiv:2606.24820v1 Announce Type: new Abstract: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval… 20 arXiv — NLP / Computation & Language research 6d ago Bayesian control for coding agents arXiv:2606.24453v1 Announce Type: cross Abstract: Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty.… 21 arXiv — NLP / Computation & Language research 6d ago CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark arXiv:2409.11363v2 Announce Type: replace Abstract: AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially,… 20 Hugging Face Daily Papers research 6d ago Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning Abstract EDV is a three-stage framework that uses multiple heterogeneous agents to collaboratively construct reliable experiences for LLM agents, preventing self-confirmatory errors through execute-distill-verify processes. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 29 Hugging Face Daily Papers research 6d ago Qwen-AgentWorld: Language World Models for General Agents Abstract Language-based world models enable agentic environment simulation across multiple domains and enhance general agent performance through scalable simulation and improved downstream task performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A world model predicts… 16 Hugging Face Daily Papers research 6d ago NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? Abstract NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation… 21 Hugging Face Daily Papers research 6d ago ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection Abstract A comprehensive multimodal misinformation detection framework is introduced that handles complex, multilingual content with multiple images and diverse verification approaches, achieving superior performance while reducing computational costs. Generated by… 29 Hugging Face Daily Papers research 6d ago MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management Abstract MemGUI-Agent addresses long-horizon mobile GUI task limitations through proactive context management using Context-as-Action (ConAct) to maintain critical information across extended sequences. 
Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents… 32 Hugging Face Daily Papers research 6d ago MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization Abstract MobileForge enables efficient adaptation of mobile GUI agents through annotation-free learning by combining real app interaction grounding with hierarchical feedback-guided policy optimization. Generated by Qwen/Qwen2.5-Coder-32B-Instruct MLLM-based mobile GUI agents… 18 TechCrunch — AI news-outlet 6d ago India’s MoEngage bets that the future of marketing is millions of AI agents The all-cash deal gives MoEngage access to technology that assigns AI agents to individual customers. 17 r/LocalLLaMA community 6d ago Mimo 2.5 is _fast_ at large context (dual RTX Pro 6000) For agentic work fast high context is king, OpenCode fills the window quickly and most models that feel snappy at 8k context turn into dial-up ADSL brrr by the time you're at 150k context deep. So I've been testing lots of models and runners trying to get "local Sonnet" on 2x… 14 r/LocalLLaMA community 6d ago MiniMax2.7 @47tg 1200pp MiniMax 2.7 REAP Q4 on 96GB VRAM and 192 GB DDR5 udimm ram on a b840 MSI board and 9900X cpu. 1250W PSU and all cards are power limited. Linux Ubuntu. Agent class model. Excellent instruction following and tool calling. I run this model in a round robin loop with 3 sequencing… 19 r/LocalLLaMA community 6d ago Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL) Hey everyone, wanted to share some work on making the new Tmax-27B terminal agent actually runnable on consumer hardware. What is Tmax-27B? Ai2 just released Tmax, a family of terminal-agent LLMs trained with DPPO (RL) on top of Qwen3.6. 
The 27B model hits ~43% on Terminal Bench… 32 Hugging Face Daily Papers research 6d ago Self-Compacting Language Model Agents Abstract SelfCompact is a scaffolding approach that enables models to autonomously determine optimal compaction timing and methods for managing long agent traces, achieving better performance with reduced token costs compared to fixed-interval methods. Generated by… 13 Hugging Face Daily Papers research 6d ago When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents Abstract Premature commitment in long-horizon LLM agents leads to silent failures where agents defend early interpretations without considering alternatives, and hidden-state convergence serves as an early diagnostic for trajectory consistency. Generated by… 24 Hugging Face Daily Papers research 6d ago Libretto: Giving LLM Agents a Sense of Musical Structure Abstract Libretto provides a structured framework for symbolic music generation and revision using LLM-native grammar and statistical evaluation across musical dimensions. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Generative music systems can now produce impressive audio from… 18 Hugging Face Daily Papers research 6d ago Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? Abstract Computer-use agents frequently expose inappropriate information across applications, prompting the creation of AgentCIBench to evaluate and mitigate privacy risks in cross-application contexts. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use agents (CUAs) now… 7 NVIDIA Developer Blog official-blog 6d ago Build an AI Scientist for Life Science Discovery with NVIDIA BioNeMo Agent Toolkit AI scientists are emerging as a new interface for scientific computing. These agents can read papers, write code, generate hypotheses, call APIs, inspect files,...
12 TechCrunch — AI news-outlet 6d ago Fika Jobs raises $4M to build a video-first hiring platform where AI agents interview candidates Stockholm-based startup Fika Jobs is building a video-first hiring platform that combines AI interview agents with short-form video profiles, creating something that feels like a cross between LinkedIn and TikTok. 5 Hugging Face official-blog 6d ago Build real agentic apps using CUGA: two dozen working examples on a lightweight harness Enterprise Article Published June 23, 2026 Anupama Murthi anupamamurthi ibm-research Hamid Adebayo harmedox ibm-research Sami Marreed samimarreed… 30 Hugging Face Daily Papers research 6d ago Training Open Models for Agentic Phone Use Abstract PhoneBuddy combines real and mock app environments to improve training of open models for phone use, demonstrating enhanced task success rates through mixed reinforcement learning approaches. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Phones are becoming an important… 11 r/LocalLLaMA community 6d ago Pooled round robin hardware with friends? I have a rig Friend1 has rig Friend2 has a rig Each rig idle 90% of the time With agentic, how could we round robin as a group? So when I load up, it checks if friends rigs are idle (vpn etc) and if idle farms out tasks. If I understand right, agents work in parallel, so this… 29 Hugging Face Daily Papers research 6d ago Counsel: A Meta-Evaluation Dataset for Agentic Tasks Abstract A large-scale dataset of human-metaevaluations of LLM critiques for agentic tasks is introduced to improve the calibration and reliability of automated evaluation methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct As agentic systems tackle increasingly complex… 22 r/LocalLLaMA community 6d ago My local server idling 99% of the time! Guys what you running to make agents busy?
Like some crazy 24/7 tasks, or maybe some useful ideas on how to utilize local llm with some purpose/use? I personally running Qwen3.6-27B with owu and with pi for coding (little-coder) but as in title - it’s idling all the time…  … 33 Hugging Face Daily Papers research 6d ago Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills Abstract Notes2Skills framework converts laboratory notes into verifiable skills for AI agents while maintaining author uncertainty levels, addressing gaps in scientific AI development. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Scientific discovery workflows usually contain… 27 Hugging Face Daily Papers research 6d ago SkillHarness: Harnessing Safe Skills for Computer-Use Agents Abstract SkillHarness is a framework that enables computer-use agents to safely learn and execute skills in dynamic environments by incorporating safety constraints and adaptive skill selection mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-Use Agents (CUAs)… 24 Hugging Face Daily Papers research 7d ago AOHP: An Open-Source OS-Level Agent Harness for Personalized, Efficient and Secure Interaction Abstract AOHP presents an Android-based operating system framework that treats AI agents as first-class entities, enhancing task completion rates and reducing execution costs through specialized agent-oriented mechanisms. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI agents… 16 NVIDIA Developer Blog official-blog 7d ago How Telcos Build Autonomous Networks with Agentic AI Telecom operators are adopting AI across network operations, customer care, and back-office workflows, but most are still early in the journey to autonomy. In... 37 r/LocalLLaMA community 7d ago Training a Qwen 3.5 4B/9B agent for multi-tool use: SFT first or go directly to RL? To train Qwen 3.5 4B or 9B for a custom multi-tool agent workflow and would appreciate guidance from people who have done this successfully. A few questions: SFT → RL or RL-only? 
- Is it still recommended to first do supervised fine-tuning (tool-calling traces, reasoning… 15 Hugging Face Daily Papers research 7d ago DataClaw0: Agentic Tailoring Multimodal Data from Raw Streams Abstract Agentic Data Tailoring paradigm uses learnable data processing to structure high-entropy multimodal streams, with DataClaw_0-9B model achieving robust alignment through SFT and GRPO on a novel benchmark. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Massive unstructured… 19 Smol AI News news-outlet 7d ago not much happened today **Prime Intellect's `prime-rl` v0.6.0** advances agentic reinforcement learning infrastructure supporting **1 trillion parameter MoE models** with sub-5-minute step times and a **131k context GLM-5 agentic setup**. The release includes optimizations in inference, training, and… 37 r/LocalLLaMA community 7d ago Is there any reason for a lack of love for Gemma 4 26b? The answer to most questions on here is Qwen3.6 27b or 35b and then Gemma4 31b (but lesser so as it doesn’t fit well on a solo 3090). Is there any reason why Gemma 4 26b moe isn’t mentioned more? I plan on using Qwen for my coding agents. But I’ve been building a Jarvis for… 20 Hugging Face Daily Papers research 7d ago CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents Abstract A principled synthesis engine generates high-quality terminal-agent tasks through multi-dimensional capability taxonomy and evidence-guided research, creating a distilled dataset that enables significant performance gains in LLM training. Generated by… 5 Hugging Face Daily Papers research 7d ago Causal Discovery in the Era of Agents Abstract Language models should assist causal discovery workflows by providing contextual support and explanations rather than generating causal conclusions, as demonstrated through a platform that integrates data analysis and expert knowledge. 
Generated by… 31 Hugging Face Daily Papers research 7d ago EnterpriseClawBench: Benchmarking Agents from Real Workplace Sessions Abstract EnterpriseClawBench presents a benchmark for enterprise agents based on real-world sessions with 852 reproducible tasks, emphasizing comprehensive evaluation metrics beyond single performance scores. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Enterprise agents… 30 Hugging Face Daily Papers research 7d ago Tmax: A simple recipe for terminal agents Abstract A novel RL training approach for terminal agents achieves superior performance using a simplified recipe and expanded dataset, enabling effective training with fewer parameters than previous methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Terminal-using agents… 36 Hugging Face Daily Papers research 7d ago OpenRath: Session-Centered Runtime State for Agent Systems Abstract OpenRath introduces a PyTorch-like programming model for multi-agent systems using Session as a central runtime abstraction that enables explicit fork, merge, and replay operations while recording comprehensive execution state. Generated by… 21 Hugging Face Daily Papers research 7d ago Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning Abstract Large language models can be trained through reinforcement learning to develop a meta-capability enabling continuous learning and adaptation across long sequences of tasks in dynamic environments. Generated by Qwen/Qwen2.5-Coder-32B-Instruct This work presents a general… 31