Tag

Agents + tool use

500 articles archived under #agents

Hugging Face Daily Papers research 4d ago

PrivacyAlign: Contextual Privacy Alignment for LLM Agents

Abstract Researchers develop a human-centered approach to align AI agents with privacy norms by creating a comprehensive dataset of privacy judgments and using annotation-conditioned reward modeling to improve agent behavior. Generated by Qwen/Qwen2.5-Coder-32B-Instruct AI…

7
r/MachineLearning community 4d ago

Tool selection at scale is a retrieval problem, and document-style defaults are the wrong starting point [D]

A pattern I keep running into building agents. Posting as a discussion because I think the standard intuition is backwards for this specific case. Setup is an agent with a big set of callable tools (mine are MCP-exposed, but the shape generalises to any function-calling loop).…

21
Vercel — AI dev-tools 4d ago

AI SDK 7

AI SDK, with over 16 million weekly downloads, is the TypeScript SDK for building AI applications, features, frameworks, and agents across any model provider. It's the same layer eve, Vercel's open-source agent framework, is built on. AI SDK 7 adds production depth for agent…

15
Hugging Face Daily Papers research 4d ago

Autodata: An agentic data scientist to create high quality synthetic data

Abstract Autodata enables AI agents to function as data scientists who create high-quality training data through meta-optimization, demonstrating improved performance across multiple task domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct We introduce Autodata, a general…

30
Vercel — AI dev-tools 4d ago

AI SDK 7 is now available

AI SDK 7 is a major release for building production agents in TypeScript. The SDK has grown from model calls and chat primitives into a broader agent platform for developing, running, integrating, and observing agents across text, audio, realtime, image, and video. Every major…

8
Vercel — AI dev-tools 4d ago

Teaching agents product design at Vercel

Coding agents can produce working UI fast, but what's harder is a different shape. They can copy your product's style, match its patterns, and try to follow its conventions. What they cannot do is understand why those patterns exist. Code shows agents what shipped, not why one…

17
Smol AI News news-outlet 5d ago

not much happened today

**Z.ai's GLM-5.2** leads in coding and agent benchmarks with top scores like **1595** on Code Arena: Frontend and **34.29%** reasoning accuracy with zero failures. Databricks improved GLM-5.2 speed to **392 tok/s** using hardware and optimizations. **Ornith-1.0**, a new…

13
Hugging Face Daily Papers research 5d ago

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

Abstract The book provides a comprehensive guide to building autonomous AI systems, covering foundational elements like transformer architecture and training methods, along with advanced topics such as reinforcement learning, agent architectures, and production deployment.…

5
Hugging Face Daily Papers research 5d ago

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

Abstract RL-Index introduces an agentic indexing framework that shifts reasoning from query time to indexing stage by using LLM-generated rationales and reinforcement learning to improve retrieval effectiveness and reduce latency. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

25
Hugging Face Daily Papers research 5d ago

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

Abstract LLM agents frequently select higher-privilege tools unnecessarily, and while safety alignment doesn't ensure least-privilege choices, a post-training defense can reduce excessive privilege use without sacrificing performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

26
arXiv — Machine Learning research 5d ago

GCT-MARL: Graph-Based Contrastive Transfer for Sample-Efficient Cooperative Multi-Agent Reinforcement Learning

arXiv:2606.25073v1 Announce Type: new Abstract: In cooperative multi-agent reinforcement learning (MARL), from a deployment perspective, it is challenging and expensive to train agents from scratch for each new environment or task. In this work, we propose GCT-MARL, a transfer…

30
arXiv — Machine Learning research 5d ago

Forget to Improve: On-Device LLM-Agent Continual Learning via Budget-Curated Memory

arXiv:2606.25115v1 Announce Type: new Abstract: On-device language-model agents improve by accumulating experience in retrieved memory rather than by updating weights. This memory is hard-bounded and exposed: it consumes RAM and energy, reaches peers through a thin uplink, and…

24
arXiv — Machine Learning research 5d ago

Reward-Conditioned Attention: How Reward Design Shapes What Autonomous Driving Agents See

arXiv:2606.25127v1 Announce Type: new Abstract: We investigate how reward design shapes the internal attention patterns of reinforcement learning agents trained for autonomous driving. Using three Perceiver-based agents that share identical architectures and training data but…

33
arXiv — NLP / Computation & Language research 5d ago

The Interplay of Harness Design and Post-Training in LLM Agents

arXiv:2606.25447v1 Announce Type: cross Abstract: Tool-integrated LLM agents are often wrapped within a harness: the scaffolding that determines which tools are exposed, how they are described, and what auxiliary information accompanies each per-step observation. While agents…

15
arXiv — Machine Learning research 5d ago

Low Variance Trust Region Optimization with Independent Actors and Sequential Updates in Cooperative Multi-agent Reinforcement Learning

arXiv:2606.25526v1 Announce Type: new Abstract: Cooperative multi-agent reinforcement learning assumes each agent shares the same reward function and can be trained effectively using the Trust Region framework of single-agent. Instead of relying on other agents' actions, the…

28
arXiv — Machine Learning research 5d ago

Beyond One-Size-Fits-All: Diagnosis-Driven Online Reinforcement Learning with Offline Priors

arXiv:2606.25527v1 Announce Type: new Abstract: Online reinforcement learning (RL) agents increasingly depend on knowledge acquired offline to achieve practical efficiency. Originally studied in offline-to-online RL, this paradigm now spans foundation model post-training and…

27
arXiv — NLP / Computation & Language research 5d ago

AgentOdyssey: Open-Ended Long-Horizon Text Game Generation for Test-Time Continual Learning Agents

arXiv:2606.24893v1 Announce Type: new Abstract: For agents to learn continuously from interaction with the world at test time, they must be able to explore effectively, acquire new world knowledge and skills, retain relevant episodic experiences, and plan over long horizons. To…

17
arXiv — NLP / Computation & Language research 5d ago

Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents

arXiv:2606.25361v1 Announce Type: new Abstract: Prior research on memory mechanism in RAG-based conversational system has emphasized how memory is stored and retrieved. However, far less is known about how memories with different functional roles influence response quality.…

27
arXiv — NLP / Computation & Language research 5d ago

Beyond Next-Observation Prediction: Agent-Authored World Modeling for Sequential Decision Making

arXiv:2606.25421v1 Announce Type: new Abstract: Recent studies on world modeling for Large Language Model (LLM) agents typically formulate the learning objective as next-observation prediction. However, this objective ties supervision to what a transition happens to reveal,…

32
arXiv — NLP / Computation & Language research 5d ago

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

arXiv:2606.25556v1 Announce Type: new Abstract: Stepwise group-based RL is an attractive way to train long-horizon LLM agents without a learned critic: it reuses multiple sampled rollouts to estimate local advantages. Its weakness is less visible but more fundamental: every…

11
arXiv — NLP / Computation & Language research 5d ago

Constraint Tax in Open-Weight LLMs: An Empirical Study of Tool Calling Suppression Under Structured Output Constraints

arXiv:2606.25605v1 Announce Type: new Abstract: Tool Calling and Structured Output are two core capabilities of modern Agent systems, yet their interaction under joint deployment conditions remains insufficiently understood. This paper reports a reproducible phenomenon observed…

10
arXiv — NLP / Computation & Language research 5d ago

Staying In Character: Perspective-Bounded Memory For Book-Based Role-Playing Agents

arXiv:2606.25632v1 Announce Type: new Abstract: Recent LLM role-playing systems build character agents from novels by extracting characters, scenes, and relations. Yet long-narrative role-playing suffers from two failures: Factual Overreach, where shared retrieval or parametric…

30
arXiv — NLP / Computation & Language research 5d ago

Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization

arXiv:2606.25656v1 Announce Type: new Abstract: As advanced RAG variants like GraphRAG and Agentic RAG emerge, one leading question is when and how to use them. Here, we introduce a framework for different RAG scenarios evaluation and comparison on semi-structured knowledge…

21
arXiv — NLP / Computation & Language research 5d ago

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

arXiv:2606.25819v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume…

26
arXiv — NLP / Computation & Language research 5d ago

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

arXiv:2606.26027v1 Announce Type: new Abstract: Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited…

17
arXiv — NLP / Computation & Language research 5d ago

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

arXiv:2606.24937v1 Announce Type: cross Abstract: The Hitchhiker's Guide to Agentic AI is a comprehensive practitioner's reference for building autonomous AI systems. The book covers the full stack from first principles to production deployment, organized around a central…

25
arXiv — NLP / Computation & Language research 5d ago

Diagnosing and Mitigating Compounding Failures in Agentic Persuasion via Taxonomic Strategy Retrieval

arXiv:2606.24976v1 Announce Type: cross Abstract: Foundation-model agents in multi-step, open-ended environments frequently suffer from compounding errors, where early mistakes contaminate long-horizon trajectories. While Multi-Agent Debate (MAD) succeeds in deterministic…

10
arXiv — NLP / Computation & Language research 5d ago

Uncertainty Quantification for Computer-Use Agents: A Benchmark across Vision-Language Models and GUI Grounding Datasets

arXiv:2606.25760v1 Announce Type: cross Abstract: Computer-use agents turn vision-language model (VLM) predictions into executable GUI clicks, so reliable uncertainty estimates are essential for rejection, calibration, miss-severity ranking, and spatial safety regions. Yet…

14
arXiv — NLP / Computation & Language research 5d ago

Autodata: An agentic data scientist to create high quality synthetic data

arXiv:2606.25996v1 Announce Type: cross Abstract: We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to…

30
arXiv — NLP / Computation & Language research 5d ago

Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

arXiv:2601.03785v3 Announce Type: replace Abstract: Long-term human-agent dialogues are organized by topic continuity: adjacent turns often develop the same goal, plan, problem, or event, while related activities may recur across distant sessions. Yet many LLM agent memory…

25
MIT News — AI research 5d ago

Improving the speed and energy-efficiency of AI agents

A new system, known as Murakkab, optimizes the design and deployment of multistep workflows that power AI applications.

26
Hugging Face Daily Papers research 5d ago

Are We Ready For An Agent-Native Memory System?

Abstract Large language model agents' memory systems have evolved into complex data management frameworks requiring systematic evaluation across multiple modules and workloads to understand their performance characteristics and trade-offs. Generated by…

7
OpenAI official-blog 5d ago

How agents are transforming work

A new OpenAI research paper shows how AI agents are transforming work, enabling longer, more complex tasks and expanding productivity across roles.

28
Hugging Face Daily Papers research 5d ago

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

Abstract Long-term memory in LLM agents should be evaluated as an auditable post-interaction artifact by reconstructing structured user state from the agent's memory, as demonstrated by MEMPROBE, a benchmark testing memory recovery against synthetic ground truth across 50…

21
Vercel — AI dev-tools 5d ago

Deep Agents and OpenCode are now available in the AI SDK Harness

The AI SDK Harness lets you run established coding-agent runtimes through one unified interface, so you can switch runtimes without changing your application code. Today we're adding two new adapters, Deep Agents and OpenCode, both running inside a Vercel Sandbox. Deep Agents…

27
Simon Willison community 5d ago

simonw/browser-compat-db

Inspired by Mozilla's new MDN MCP service - source code here - I decided to try converting their comprehensive mdn/browser-compat-data repository full of browser compatibility data into a SQLite database. This new GitHub Repo includes a Claude Code for…

11
Hugging Face Daily Papers research 5d ago

Critique of Agent Model

Abstract True artificial agency requires internalized structures for goals, identity, decision-making, self-regulation, and learning, distinguishing autonomous systems from task-specific ones. Generated by Qwen/Qwen2.5-Coder-32B-Instruct What is an agent? What constitutes…

24
Latent.Space news-outlet 5d ago

Why the Frontier Ecosystem must be Open — Matei Zaharia and Reynold Xin, Databricks

In a rare double-interview, the Databricks technical leaders riff on what it will take for every company to build Agent Clouds

19
Google DeepMind official-blog 5d ago

Introducing computer use in Gemini 3.5 Flash

Jun 24, 2026 · Computer use is now a built-in tool in Gemini 3.5 Flash to build agents that can interact across platforms. Mateo Quiros, Product Manager, Google DeepMind…

9
r/MachineLearning community 5d ago

I made a superhuman Generals.io agent with self-play RL [P]

Hi everyone, I trained a self-play RL agent for Generals.io that reached superhuman level and ranked #1 on the human 1v1 leaderboard. It began as my master's thesis, where the goal was to beat a prior algorithm-based agent. We succeeded using behavior cloning, RL fine-tuning and…

6
r/LocalLLaMA community 5d ago

Nex-N2-Mini-Ultra-Uncensored-Heretic Is Out Now, an Agentic Model With Agentic Thinking Now Uncensored With 5/100 Refusals and 0.0020 KLD, Available in Safetensors and GGUF Formats!

Safetensors: https://huggingface.co/llmfan46/Nex-N2-mini-ultra-uncensored-heretic GGUFs: https://huggingface.co/llmfan46/Nex-N2-mini-ultra-uncensored-heretic-GGUF Find all my models here: HuggingFace-LLMFan46 If you like my work and find my models useful, then I would really…

36
r/LocalLLaMA community 5d ago

Qwen-AgentWorld-35B-A3B for Coding?

Benchmark from its model card. Removed online models & Qwen-AgentWorld-397B-A17B from the table. Just Open models.
Model            MCP    Search  Term.  SWE    Android  Web    OS     Overall
DeepSeek-V4-Pro  63.27  27.61   51.26  59.44  55.17    50.32  63.70  52.97
GLM-5.1          67.60  22.46   47.32  52.07  59.10    51.50  59.13  …

11
Hugging Face Daily Papers research 5d ago

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Abstract Large language models face challenges in archive-grounded reasoning tasks involving evidence retrieval and synthesis across diverse document collections, with performance varying significantly across domains. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Large language…

26
Latent.Space news-outlet 5d ago

[AINews] Claude Tag: Multiplayer, Proactive, Persistent Agents in Slack

Claude finally gets a Slackbot upgrade

8
Hugging Face Daily Papers research 5d ago

OpenThoughts-Agent: Data Recipes for Agentic Models

Abstract An open-source data curation pipeline for training agentic language models is presented, demonstrating superior performance through systematic experimentation and scalable training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Agentic language models dramatically…

34
Hugging Face Daily Papers research 5d ago

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Abstract A large-scale multi-agent benchmark for evaluating LLMs in Chinese psychiatric diagnosis is introduced, highlighting challenges in dynamic consultation and the gap between consultation quality and diagnostic accuracy. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Mental…

36
r/LocalLLaMA community 6d ago

Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments

Qwen just released Qwen-AgentWorld-35B-A3B — a 35B-parameter MoE with only ~3B active parameters per token. The interesting part: this is not positioned as a standard chat/instruction model or a full autonomous agent. It is a language world model trained to predict what an…

6
r/LocalLLaMA community 6d ago

GitHub - QwenLM/Qwen-AgentWorld: Qwen-AgentWorld: Language World Models for General Agents

submitted by /u/dan945

5
arXiv — Machine Learning research 6d ago

Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets

arXiv:2606.23961v1 Announce Type: new Abstract: Long-context and agentic LLM workloads push the KV cache past any fixed memory budget, forcing the inference stack to permanently evict tokens at every step of a continuous-inference stream. Existing methods all share the same…

20
arXiv — NLP / Computation & Language research 6d ago

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

arXiv:2606.23937v1 Announce Type: new Abstract: Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B…

11
