Tag

Agents + tool use

500 articles archived under #agents · RSS

arXiv — NLP / Computation & Language research 11d ago

Benchmarking Agentic Review Systems

arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems…

15
arXiv — NLP / Computation & Language research 11d ago

Multi-Agent Transactive Memory

arXiv:2606.19911v1 Announce Type: cross Abstract: The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated…

24
arXiv — NLP / Computation & Language research 11d ago

When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents

arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving…

17
arXiv — NLP / Computation & Language research 11d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

arXiv:2606.20529v1 Announce Type: cross Abstract: Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and…

27
arXiv — NLP / Computation & Language research 11d ago

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and…

22
Hugging Face Daily Papers research 11d ago

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

27
r/LocalLLaMA community 11d ago

GLM-5.2 is above GPT-5.5 in AA-Briefcase, Artificial Analysis' new agentic knowledge work eval

submitted by /u/analysis_scaled

7
Hacker News — AI on Front Page community 11d ago

Zero-Touch OAuth for MCP

Article URL: https://blog.modelcontextprotocol.io/posts/enterprise-managed-auth/ Comments URL: https://news.ycombinator.com/item?id=48592163 Points: 202 # Comments: 66

17
Hugging Face official-blog 11d ago

MosaicLeaks: Can your research agent keep a secret?

Enterprise Article. Published June 18, 2026. By Alexander Gurung (agurung, ServiceNow) and Rafael Pardinas (rafapi-snow, ServiceNow). TL;DR: Deep research agents increasingly combine private local documents…

24
r/LocalLLaMA community 11d ago

poolside/Laguna-M.1 · Hugging Face - 225B-A23B

Laguna M.1 is a 225B total parameter Mixture-of-Experts model with 23B activated parameters per token designed for agentic coding and long-horizon work. Highlights: Large sparse MoE for agentic coding: Laguna M.1 is a 70-layer MoE transformer with 225B total…

26
TechCrunch — AI news-outlet 11d ago

General Intuition in talks to raise $300M at around $2B valuation

General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. The startup trains AI agents on spatial-temporal reasoning.

14
Hugging Face Daily Papers research 11d ago

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Abstract iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A useful phone agent needs to be…

6
Hugging Face Daily Papers research 11d ago

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Abstract MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and…

29
r/LocalLLaMA community 11d ago

gave my local llm agent mcp tools for local image + video gen, so it just generates when i ask (fully offline+free)

free and open source, runs fully offline. the local llm agent does the image and video gen itself via mcp tools. details and github in the comments. submitted by /u/GroundbreakingMall54

33
Hugging Face Daily Papers research 11d ago

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Abstract Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multicultural multi-agent systems…

28
r/LocalLLaMA community 11d ago

Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF · Hugging Face

Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family. Highlights Outstanding Video Understanding and…

29
Hugging Face Daily Papers research 12d ago

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Abstract RODS addresses sample depletion in multi-turn tool-use reinforcement learning by dynamically synthesizing new data based on reward variance to maintain informative training samples. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-turn tool-use RL is bottlenecked by…

21
r/LocalLLaMA community 12d ago

I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it?

Yes, I know this is a simple question I could just ask Claude or something, but I want to see what the community suggests. For context, it’s a 16in MacBook Pro and I use Hermes agent as a harness connected to LM studio, as obviously it’s preferable to be running MLX models especially…

4
Vercel — AI dev-tools 12d ago

The Agent Stack

Agents are designed to do almost any kind of work, from answering support tickets to writing code. No matter how complex the workload, how long it runs, or how many turns it takes to complete, every agent needs three core capabilities to operate: Agents need to connect to models…

16
Hugging Face Daily Papers research 12d ago

Native Active Perception as Reasoning for Omni-Modal Understanding

Abstract OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing. Generated by…

24
arXiv — NLP / Computation & Language research 12d ago

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

arXiv:2606.18284v1 Announce Type: cross Abstract: The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve,…

21
arXiv — NLP / Computation & Language research 12d ago

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

arXiv:2606.18388v1 Announce Type: cross Abstract: RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to…

34
arXiv — Machine Learning research 12d ago

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals,…

18
arXiv — Machine Learning research 12d ago

Stealthy World Model Manipulation via Data Poisoning

arXiv:2606.18697v1 Announce Type: new Abstract: Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack…

18
arXiv — Machine Learning research 12d ago

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

arXiv:2606.18820v1 Announce Type: new Abstract: Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational…

19
arXiv — NLP / Computation & Language research 12d ago

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory…

22
arXiv — Machine Learning research 12d ago

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

arXiv:2606.18963v1 Announce Type: new Abstract: We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact,…

21
arXiv — Machine Learning research 12d ago

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

arXiv:2606.18967v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive…

25
arXiv — NLP / Computation & Language research 12d ago

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

arXiv:2606.18406v1 Announce Type: new Abstract: Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces…

11
arXiv — NLP / Computation & Language research 12d ago

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the…

19
arXiv — NLP / Computation & Language research 12d ago

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators…

37
arXiv — NLP / Computation & Language research 12d ago

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv:2606.18831v1 Announce Type: new Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a…

36
arXiv — NLP / Computation & Language research 12d ago

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

arXiv:2606.19308v1 Announce Type: new Abstract: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm…

13
arXiv — NLP / Computation & Language research 12d ago

Learning User Simulators with Turing Rewards

arXiv:2606.19336v1 Announce Type: new Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by…

37
arXiv — NLP / Computation & Language research 12d ago

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

arXiv:2606.18264v1 Announce Type: cross Abstract: Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors…

8
arXiv — NLP / Computation & Language research 12d ago

CEO-Bench: Can Agents Play the Long Game?

arXiv:2606.18543v1 Announce Type: cross Abstract: Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain…

29
arXiv — NLP / Computation & Language research 12d ago

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

arXiv:2606.18947v1 Announce Type: cross Abstract: Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider…

20
arXiv — NLP / Computation & Language research 12d ago

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

arXiv:2603.00026v2 Announce Type: replace Abstract: Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may…

15
Hugging Face Daily Papers research 12d ago

CEO-Bench: Can Agents Play the Long Game?

Abstract CEO-Bench evaluates language model agents' ability to manage a simulated startup over 500 days, testing their proficiency in long-term planning, noise handling, adaptability, and multi-task coordination through a Python interface. Generated by…

5
Hugging Face Daily Papers research 12d ago

Guava: An Effective and Universal Harness for Embodied Manipulation

Abstract A harness framework for embodied tool use combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Language models trained on large-scale…

15
Hugging Face official-blog 12d ago

Is it agentic enough? Benchmarking open models on your own tooling

Published June 18, 2026. By Lysandre (lysandre), Nathan Habib (SaylorTwift), and Pedro Cuenca (pcuenq). Benchmarking transformers revisions across different metrics. This is a…

26
llama.cpp releases dev-tools 12d ago

b9691

ggml-cpu: Conditionally enable power11 backend based on compiler support ( #24687 ) ggml: Conditionally enable power11 backend based on compiler support Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang…

14
r/LocalLLaMA community 12d ago

Lemonade v10.8: auto memory management, cloud offload, Omni improvements, and call your local models as MCP tools

v10.8 is out, so here's a project update on what landed. This was a 20-contributor release in just 7 days! Smarter memory and context management Dynamic VRAM management now auto-unloads idle models and downsizes their KV-cache to reclaim GPU memory on the fly, plus model pinning…

27
Ars Technica — AI news-outlet 12d ago

AI coding agents taught robots how to install GPUs and cut zip-ties

NVIDIA’s self-improvement program for robots enlists teams of AI coding agents.

13
TechCrunch — AI news-outlet 12d ago

NEA’s Tiffany Luck on AI IPOs, personal agents, and the ROI reckoning

Tokenmaxxing was the hottest trend in Silicon Valley earlier this year, with CEOs encouraging employees to push AI usage as far as it would go. Then the bill came due. Uber reportedly blew through its annual AI budget in a few months, some companies…

23
r/LocalLLaMA community 12d ago

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

arXiv : https://arxiv.org/abs/2606.17861 Full Paper : https://arxiv.org/pdf/2606.17861 HuggingFace : https://huggingface.co/papers/2606.17861 GitHub : https://github.com/tongxuluo/gamecraft-bench Project : https://tongxuluo.github.io/gamecraft-bench-website/ I see big/large…

20
llama.cpp releases dev-tools 12d ago

b9685

[SYCL] add dev2dev memcpy by SYCL API (#24476): add dev2dev memcpy by SYCL API; mv GGML_SYCL_DEV2DEV_MEMCPY to runtime table; update the detect method for p2p comm; fix the error created during conflict fix. Co-authored-by: Neo Zhang. macOS/iOS: macOS Apple Silicon (arm64) macOS…

33
Vercel — AI dev-tools 12d ago

Vercel Ship 2026 recap

For a decade, Vercel has shaped how the web gets built. Now, we’re doing the same for agents. The companies that win the next decade will build on infrastructure designed for agents from the start, and over 2,500 people gathered in London this week to do just that at Vercel Ship…

20
r/LocalLLaMA community 12d ago

GLM-5.2 is a win for local AI

I know GLM 5.2's massive 753B footprint means none of us are running it at home without an enterprise cluster, but having a true frontier-level, MIT-licensed coding agent out in the wild makes me optimistic. The distillation potential here is massive. Once the community starts…

38
r/LocalLLaMA community 12d ago

Headless screenshot loops let a local 30B agent finish a raytraced FPS demo in pure C

Some background so this is honest. Over the past few months I ran a lot of oneshot experiments with single file three.js games. Minecraft clones, that kind of thing. I picked those on purpose because they sit deep in the training data and are trivial to debug by eye. The goal…

37
