Tag

Agents + tool use

500 articles archived under #agents

arXiv — NLP / Computation & Language research 18d ago

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

arXiv:2606.12916v1 Announce Type: cross Abstract: Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principle physics. Designing an MD pipeline for a new system requires substantial expert…

8
arXiv — NLP / Computation & Language research 18d ago

MiniPIC: Flexible Position-Independent Caching in <100LOC

arXiv:2606.13126v1 Announce Type: cross Abstract: Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV…

12
arXiv — NLP / Computation & Language research 18d ago

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

arXiv:2606.13174v1 Announce Type: cross Abstract: Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference…

23
arXiv — NLP / Computation & Language research 18d ago

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

arXiv:2606.13239v1 Announce Type: cross Abstract: Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with…

34
arXiv — NLP / Computation & Language research 18d ago

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

arXiv:2606.13441v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility…

32
arXiv — NLP / Computation & Language research 18d ago

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

arXiv:2606.13544v1 Announce Type: cross Abstract: Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice…

35
Hugging Face Daily Papers research 18d ago

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Abstract A multi-agent framework with shared MLLM policy and role-specific training methods improves visual reasoning by reducing hallucinations and enabling efficient parallel processing. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Visual reasoning requires integrating…

6
Hugging Face Daily Papers research 18d ago

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

Abstract EvoBrowseComp is an evolving benchmark with 800 contamination-free questions synthesized through a three-agent framework that ensures temporal freshness and prevents parametric memorization in search agent evaluation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Search…

26
Hugging Face Daily Papers research 18d ago

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Abstract EvoArena benchmark and EvoMem memory paradigm address the challenge of dynamic environments in LLM agents by modeling progressive updates and structured memory evolution, showing improved performance on evolving tasks. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large…

5
Hugging Face Daily Papers research 18d ago

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Abstract Evoflux enables compact language models to execute tool workflows more reliably by using evolutionary search to repair failed plans during inference, significantly improving execution feasibility compared to traditional fine-tuning methods. Generated by…

20
Hugging Face Daily Papers research 18d ago

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Abstract WeaveBench presents a comprehensive benchmark for evaluating computer-use agents across multiple interfaces, revealing significant challenges in long-horizon task orchestration and highlighting the limitations of traditional performance assessment methods. Generated by…

38
Hugging Face Daily Papers research 18d ago

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

Abstract Environment engineering enhances autonomous scientific discovery by designing structured agent environments that optimize behaviors like exploration and collaboration while mitigating issues such as reward hacking and human oversight friction, as demonstrated by the…

35
Hugging Face Daily Papers research 18d ago

InterleaveThinker: Reinforcing Agentic Interleaved Generation

Abstract InterleaveThinker enables interleaved generation capabilities for image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while enhancing reasoning benchmarks. Generated by…

36
Hugging Face Daily Papers research 18d ago

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

Abstract A framework for creating shortcut-resistant training data for deep search agents by identifying and mitigating four shortcut risks in data synthesis processes. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Training deep search agents requires verifiable questions whose…

11
Hugging Face Daily Papers research 18d ago

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Abstract SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks. Generated by…

36
r/LocalLLaMA community 18d ago

Has anyone used agents to decompile binary executables?

Was wondering if there was a way to set it up so that you just drop in the binary file and then it goes to work reversing the file?

25
Hugging Face Daily Papers research 18d ago

HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness

Abstract Learnable harness controller called HarnessBridge is introduced to parameterize agent-environment interfaces through bidirectional projections, achieving performance comparable to specialized harnesses with reduced computational overhead. Generated by…

21
Vercel — AI dev-tools 18d ago

Program Claude Code, Codex, Pi and other agent harnesses with AI SDK

AI SDK 7 introduces HarnessAgent, a single API for running established agent harnesses, including Claude Code, Codex, and Pi. AI SDK has always let you switch models without rewriting your agent. Now you can switch the harness the same way. Write the agent once. Use the best…

7
Hugging Face Daily Papers research 18d ago

Can Generalist Agents Automate Data Curation?

Abstract Automated data curation using generalist coding agents shows promise but requires structured scaffolding to achieve superior performance compared to traditional methods. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Curating training data is among the most consequential…

33
Hugging Face Daily Papers research 18d ago

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

Abstract ModSleuth is an agentic system that recursively reconstructs large-scale dependency graphs for LLM development by analyzing public artifacts and resolving inconsistencies in documentation and artifact identities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Modern LLM…

6
Hugging Face Daily Papers research 18d ago

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Abstract ReVision improves computer-use agent efficiency by removing redundant visual patches from consecutive screenshots while preserving spatial structure, reducing token usage by 46% and improving success rates. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Computer-use…

10
The Information — AI news-outlet 18d ago

Snowflake Mounts Full-Court Press to Get Employees Using AI

When Snowflake CEO Sridhar Ramaswamy and CFO Brian Robins face Wall Street analysts on quarterly earnings calls nowadays, they’re armed with material from an internally developed AI agent. The agent tells the two executives the questions they’re expecting the analysts to ask and…

38
r/MachineLearning community 18d ago

What should context compression keep? I looked at how six agents handle it [D]

I use Claude Code, Codex CLI, OpenCode, Cline, Cursor, and Amp enough to notice a pattern in how they handle long context. They are all converging on layered progressive compression, but they disagree on what to protect. Most protect recent user messages as a first-class asset.…

20
r/LocalLLaMA community 18d ago

As we know Minimax M3 is just going to be open sourced in few days and because of that I was surfing on internet searching for its scores and I found out pretty interesting results. Is Minimax M3 really that good in agentic stuff and in coding? Is it better than older gpt models?

Has anyone personally compared the Minimax M3 model against other proprietary models to determine its relative performance tier? I am trying to understand where it currently ranks in the broader AI landscape. Can we say Minimax M3 is better than GPT 5.2 in coding and agentic…

26
Hugging Face Daily Papers research 18d ago

τ-Rec: A Verifiable Benchmark for Agentic Recommender Systems

Abstract A benchmark for agentic recommender systems is introduced that uses verifiable rewards and controlled dialogue constraints to evaluate conversational agent reliability, revealing significant performance gaps among leading models. Generated by…

6
r/LocalLLaMA community 18d ago

Cognitor: open-source semantic search engine. Automatically chunks, embeds and indexes the content of a target folder, making it searchable semantically.

https://github.com/tanaos/cognitor Cognitor is an open-source semantic search engine and vector database which automatically chunks, embeds and indexes the entire content of a target folder (and its subfolders), making it easily searchable by both AI agents and humans.…

15
MIT Technology Review — AI news-outlet 19d ago

Google DeepMind is worried about what happens when millions of agents start to interact

Google DeepMind is funding research into the potential dangers of situations where millions of different AI agents interact with each other online. According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of agents that can…

35
Hugging Face Daily Papers research 19d ago

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Abstract TRACE is a rollout allocation framework that improves reward contrast in multi-turn agentic reinforcement learning by dynamically distributing resources across tree-structured rollouts based on prefix-level informativeness. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

27
Hugging Face Daily Papers research 19d ago

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Abstract POISE is a stealthy skill-poisoning attack that embeds malicious triggers within benign-looking instructions, achieving high attack success rates while avoiding detection by LLM scanners that are overly sensitive to privileged tool operations. Generated by…

16
Hugging Face Daily Papers research 19d ago

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Abstract EvoTrainer autonomously evolves both language model policies and training harnesses through empirical feedback, demonstrating superior performance in complex reasoning and coding tasks compared to traditional handcrafted approaches. Generated by…

6
Hugging Face Daily Papers research 19d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Abstract A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

16
arXiv — NLP / Computation & Language research 19d ago

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

arXiv:2606.11290v1 Announce Type: cross Abstract: Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy…

30
arXiv — Machine Learning research 19d ago

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

arXiv:2606.11417v1 Announce Type: new Abstract: Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is…

16
arXiv — Machine Learning research 19d ago

Counterexample Guided Learning in the Large using Reasoning Agents

arXiv:2606.11521v1 Announce Type: new Abstract: LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to…

16
arXiv — Machine Learning research 19d ago

TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models

arXiv:2606.11625v1 Announce Type: new Abstract: Time-series foundation models (TSFMs) are increasingly explored as predictive experts within emerging agentic time-series systems. However, TSFMs exhibit heterogeneous inductive biases, and no single model consistently dominates…

10
arXiv — Machine Learning research 19d ago

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

arXiv:2606.11652v1 Announce Type: new Abstract: This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic…

29
arXiv — Machine Learning research 19d ago

Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

arXiv:2606.11998v1 Announce Type: new Abstract: Trusted monitoring is a cornerstone of AI control. However, as frontier models grow more capable, the increasing capabilities gap between trusted and untrusted models may render trusted models unreliable monitors. We introduce…

30
arXiv — Machine Learning research 19d ago

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

arXiv:2606.12334v1 Announce Type: new Abstract: High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information…

14
arXiv — Machine Learning research 19d ago

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

arXiv:2606.12344v1 Announce Type: new Abstract: General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch,…

27
arXiv — NLP / Computation & Language research 19d ago

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

arXiv:2606.11213v1 Announce Type: new Abstract: We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through…

6
arXiv — NLP / Computation & Language research 19d ago

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in…

20
arXiv — NLP / Computation & Language research 19d ago

AI Coding Agents Can Reproduce Social Science Findings

arXiv:2606.11447v1 Announce Type: new Abstract: Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with original data and code; yet systematic evaluation across social sciences remains limited. Existing evaluation benchmarks…

8
arXiv — NLP / Computation & Language research 19d ago

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

arXiv:2606.11456v1 Announce Type: new Abstract: The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated…

34
arXiv — NLP / Computation & Language research 19d ago

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

arXiv:2606.11520v1 Announce Type: new Abstract: Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate ->…

32
arXiv — NLP / Computation & Language research 19d ago

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

arXiv:2606.11686v1 Announce Type: new Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed…

14
arXiv — NLP / Computation & Language research 19d ago

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

arXiv:2606.11688v1 Announce Type: new Abstract: Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric…

13
arXiv — NLP / Computation & Language research 19d ago

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

arXiv:2606.11816v1 Announce Type: new Abstract: Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model…

6
arXiv — NLP / Computation & Language research 19d ago

Notes2Skills: From Lab Notebooks to Certainty-Aware Scientific Agent Skills

arXiv:2606.11897v1 Announce Type: new Abstract: Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving…

29
arXiv — NLP / Computation & Language research 19d ago

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

arXiv:2606.12087v1 Announce Type: new Abstract: Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph…

21
arXiv — NLP / Computation & Language research 19d ago

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

arXiv:2606.12191v1 Announce Type: new Abstract: Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work…

18
