#agents: Agents + tool use
500 articles archived under #agents

arXiv — NLP / Computation & Language · research · 18d ago
MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback arXiv:2606.12916v1 Announce Type: cross Abstract: Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert… 8

arXiv — NLP / Computation & Language · research · 18d ago
MiniPIC: Flexible Position-Independent Caching in <100LOC arXiv:2606.13126v1 Announce Type: cross Abstract: Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV… 12

arXiv — NLP / Computation & Language · research · 18d ago
Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents arXiv:2606.13174v1 Announce Type: cross Abstract: Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next.
We study this gap between preference… 23

arXiv — NLP / Computation & Language · research · 18d ago
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm arXiv:2606.13239v1 Announce Type: cross Abstract: Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with… 34

arXiv — NLP / Computation & Language · research · 18d ago
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models arXiv:2606.13441v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility… 32

arXiv — NLP / Computation & Language · research · 18d ago
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents arXiv:2606.13544v1 Announce Type: cross Abstract: Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice… 35

Hugging Face Daily Papers · research · 18d ago
Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning Abstract A multi-agent framework with shared MLLM policy and role-specific training methods improves visual reasoning by reducing hallucinations and enabling efficient parallel processing.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Visual reasoning requires integrating… 6

Hugging Face Daily Papers · research · 18d ago
EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search… 26

Hugging Face Daily Papers · research · 18d ago
EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments Abstract EvoArena benchmark and EvoMem memory paradigm address the challenge of dynamic environments in LLM agents by modeling progressive updates and structured memory evolution, showing improved performance on evolving tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large… 5

Hugging Face Daily Papers · research · 18d ago
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents Abstract Evoflux enables compact language models to execute tool workflows more reliably by using evolutionary search to repair failed plans during inference, significantly improving execution feasibility compared to traditional fine-tuning methods. Generated by… 20

Hugging Face Daily Papers · research · 18d ago
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces Abstract WeaveBench presents a comprehensive benchmark for evaluating computer-use agents across multiple interfaces, revealing significant challenges in long-horizon task orchestration and highlighting the limitations of traditional performance assessment methods.
Generated by… 38

Hugging Face Daily Papers · research · 18d ago
EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery Abstract Environment engineering enhances autonomous scientific discovery by designing structured agent environments that optimize behaviors like exploration and collaboration while mitigating issues such as reward hacking and human oversight friction, as demonstrated by the… 35

Hugging Face Daily Papers · research · 18d ago
InterleaveThinker: Reinforcing Agentic Interleaved Generation Abstract InterleaveThinker enables interleaved generation capabilities for image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while enhancing reasoning benchmarks. Generated by… 36

Hugging Face Daily Papers · research · 18d ago
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents Abstract A framework for creating shortcut-resistant training data for deep search agents by identifying and mitigating four shortcut risks in data synthesis processes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training deep search agents requires verifiable questions whose… 11

Hugging Face Daily Papers · research · 18d ago
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning Abstract SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks. Generated by… 36

r/LocalLLaMA · community · 18d ago
Has anyone used agents to decompile binary executables? Was wondering if there was a way to set it up so that you just drop in the binary file and then it goes to work reversing the file?
submitted by /u/qzrz 25

Hugging Face Daily Papers · research · 18d ago
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness Abstract Learnable harness controller called HarnessBridge is introduced to parameterize agent-environment interfaces through bidirectional projections, achieving performance comparable to specialized harnesses with reduced computational overhead. Generated by… 21

Vercel — AI · dev-tools · 18d ago
Program Claude Code, Codex, Pi and other agent harnesses with AI SDK AI SDK 7 introduces HarnessAgent, a single API for running established agent harnesses, including Claude Code, Codex, and Pi. AI SDK has always let you switch models without rewriting your agent. Now you can switch the harness the same way. Write the agent once. Use the best… 7

Hugging Face Daily Papers · research · 18d ago
Can Generalist Agents Automate Data Curation? Abstract Automated data curation using generalist coding agents shows promise but requires structured scaffolding to achieve superior performance compared to traditional methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Curating training data is among the most consequential… 33

Hugging Face Daily Papers · research · 18d ago
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs Abstract ModSleuth is an agentic system that recursively reconstructs large-scale dependency graphs for LLM development by analyzing public artifacts and resolving inconsistencies in documentation and artifact identities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern LLM… 6

Hugging Face Daily Papers · research · 18d ago
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction Abstract ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates.
Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use… 10

The Information — AI · news-outlet · 18d ago
Snowflake Mounts Full-Court Press to Get Employees Using AI When Snowflake CEO Sridhar Ramaswamy and CFO Brian Robins face Wall Street analysts on quarterly earnings calls nowadays, they’re armed with material from an internally developed AI agent. The agent tells the two executives the questions they’re expecting the analysts to ask and… 38

r/MachineLearning · community · 18d ago
What should context compression keep? I looked at how six agents handle it [D] I use Claude Code, Codex CLI, OpenCode, Cline, Cursor, and Amp enough to notice a pattern in how they handle long context. They are all converging on layered progressive compression, but they disagree on what to protect. Most protect recent user messages as a first-class asset.… 20

r/LocalLLaMA · community · 18d ago
As we know Minimax M3 is just going to be open sourced in few days and because of that I was surfing on internet searching for its scores and I found out pretty interesting results. Is Minimax M3 really that good in agentic stuff and in coding? Is it better than older gpt models? Has anyone personally compared the Minimax M3 model against other proprietary models to determine its relative performance tier? I am trying to understand where it currently ranks in the broader AI landscape. Can we say Minimax M3 is better than GPT 5.2 in coding and agentic… 26

Hugging Face Daily Papers · research · 18d ago
τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems Abstract A benchmark for agentic recommender systems is introduced that uses verifiable rewards and controlled dialogue constraints to evaluate conversational agent reliability, revealing significant performance gaps among leading models. Generated by… 6

r/LocalLLaMA · community · 18d ago
Cognitor: open-source semantic search engine. Automatically chunks, embeds and indexes the content of a target folder, making it searchable semantically.
https://github.com/tanaos/cognitor Cognitor is an open-source semantic search engine and vector database which automatically chunks, embeds and indexes the entire content of a target folder (and its subfolders), making it easily searchable by both AI agents and humans.… 15

MIT Technology Review — AI · news-outlet · 19d ago
Google DeepMind is worried about what happens when millions of agents start to interact Google DeepMind is funding research into the potential dangers of situations where millions of different AI agents interact with each other online. According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of agents that can… 35

Hugging Face Daily Papers · research · 19d ago
TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning Abstract TRACE is a rollout allocation framework that improves reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 27

Hugging Face Daily Papers · research · 19d ago
POISE: Position-Aware Undetectable Skill Injection on LLM Agents Abstract POISE is a stealthy skill-poisoning attack that embeds malicious triggers within benign-looking instructions, achieving high attack success rates while avoiding detection by LLM scanners that are overly sensitive to privileged tool operations. Generated by… 16

Hugging Face Daily Papers · research · 19d ago
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning Abstract EvoTrainer autonomously evolves both language model policies and training harnesses through empirical feedback, demonstrating superior performance in complex reasoning and coding tasks compared to traditional handcrafted approaches.
Generated by… 6

Hugging Face Daily Papers · research · 19d ago
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 16

arXiv — NLP / Computation & Language · research · 19d ago
FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse arXiv:2606.11290v1 Announce Type: cross Abstract: Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy… 30

arXiv — Machine Learning · research · 19d ago
Signed Compression Progress on a Sealed Audit is Goodhart-Resistant arXiv:2606.11417v1 Announce Type: new Abstract: Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is… 16

arXiv — Machine Learning · research · 19d ago
Counterexample Guided Learning in the Large using Reasoning Agents arXiv:2606.11521v1 Announce Type: new Abstract: LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to… 16

arXiv — Machine Learning · research · 19d ago
TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models arXiv:2606.11625v1 Announce Type: new Abstract: Time-series foundation models (TSFMs) are increasingly explored as predictive experts within emerging agentic time-series systems.
However, TSFMs exhibit heterogeneous inductive biases, and no single model consistently dominates… 10

arXiv — Machine Learning · research · 19d ago
IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents arXiv:2606.11652v1 Announce Type: new Abstract: This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic… 29

arXiv — Machine Learning · research · 19d ago
Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents arXiv:2606.11998v1 Announce Type: new Abstract: Trusted monitoring is a cornerstone of AI control. However, as frontier models grow more capable, the increasing capabilities gap between trusted and untrusted models may render trusted models unreliable monitors. We introduce… 30

arXiv — Machine Learning · research · 19d ago
Fourier Features Let Agents Learn High Precision Policies with Imitation Learning arXiv:2606.12334v1 Announce Type: new Abstract: High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues.
Policies that leverage 3D information… 14

arXiv — Machine Learning · research · 19d ago
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks arXiv:2606.12344v1 Announce Type: new Abstract: General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch,… 27

arXiv — NLP / Computation & Language · research · 19d ago
Beyond Compaction: Structured Context Eviction for Long-Horizon Agents arXiv:2606.11213v1 Announce Type: new Abstract: We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through… 6

arXiv — NLP / Computation & Language · research · 19d ago
Agent Skill Evaluation and Evolution: Frameworks and Benchmarks arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in… 20

arXiv — NLP / Computation & Language · research · 19d ago
AI Coding Agents Can Reproduce Social Science Findings arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited.
Existing evaluation benchmarks… 8

arXiv — NLP / Computation & Language · research · 19d ago
AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable arXiv:2606.11456v1 Announce Type: new Abstract: The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated… 34

arXiv — NLP / Computation & Language · research · 19d ago
ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories arXiv:2606.11520v1 Announce Type: new Abstract: Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate ->… 32

arXiv — NLP / Computation & Language · research · 19d ago
Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness arXiv:2606.11686v1 Announce Type: new Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed… 14

arXiv — NLP / Computation & Language · research · 19d ago
Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents arXiv:2606.11688v1 Announce Type: new Abstract: Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified.
We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric… 13

arXiv — NLP / Computation & Language · research · 19d ago
WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning arXiv:2606.11816v1 Announce Type: new Abstract: Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model… 6

arXiv — NLP / Computation & Language · research · 19d ago
Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills arXiv:2606.11897v1 Announce Type: new Abstract: Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving… 29

arXiv — NLP / Computation & Language · research · 19d ago
FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents arXiv:2606.12087v1 Announce Type: new Abstract: Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph… 21

arXiv — NLP / Computation & Language · research · 19d ago
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application arXiv:2606.12191v1 Announce Type: new Abstract: Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work… 18

Page 10 of 10 · 500 articles