Agents + tool use · 500 articles archived under #agents
arXiv — NLP / Computation & Language research 11d ago Benchmarking Agentic Review Systems arXiv:2606.19749v1 Announce Type: cross Abstract: A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems… 15 arXiv — NLP / Computation & Language research 11d ago Multi-Agent Transactive Memory arXiv:2606.19911v1 Announce Type: cross Abstract: The decentralized deployment of LLM agents with diverse capabilities across diverse tasks motivates infrastructure for knowledge sharing across heterogeneous agent populations. Just as search engines index human-generated… 24 arXiv — NLP / Computation & Language research 11d ago When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents arXiv:2606.20023v1 Announce Type: cross Abstract: As LLM agents increasingly select tools autonomously, their choices among tools with different privileges become safety-relevant. However, prior tool-selection studies focus on safety-agnostic metadata preferences, leaving… 17 arXiv — NLP / Computation & Language research 11d ago LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents arXiv:2606.20529v1 Announce Type: cross Abstract: Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and… 27 arXiv — NLP / Computation & Language research 11d ago ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents arXiv:2508.04266v4 Announce Type: replace Abstract: Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. 
However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and… 22 Hugging Face Daily Papers research 11d ago Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents Abstract Aggregate-score leaderboards in agent benchmarks fail to capture deployment-relevant dimensions and show rank instability, necessitating new evaluation frameworks based on predictive validity and out-of-distribution criteria. Generated by Qwen/Qwen2.5-Coder-32B-Instruct… 27 r/LocalLLaMA community 11d ago GLM-5.2 is above GPT-5.5 in AA-Briefcase, Artificial Analysis' new agentic knowledge work eval submitted by /u/analysis_scaled 7 Hacker News — AI on Front Page community 11d ago Zero-Touch OAuth for MCP Article URL: https://blog.modelcontextprotocol.io/posts/enterprise-managed-auth/ Comments URL: https://news.ycombinator.com/item?id=48592163 Points: 202 # Comments: 66 17 Hugging Face official-blog 11d ago MosaicLeaks: Can your research agent keep a secret? Enterprise Article Published June 18, 2026 Alexander Gurung agurung ServiceNow Rafael Pardinas rafapi-snow ServiceNow TL;DR Deep research agents increasingly combine private local documents… 24 r/LocalLLaMA community 11d ago poolside/Laguna-M.1 · Hugging Face - 225B-A23B Laguna M.1 Laguna M.1 is a 225B total parameter Mixture-of-Experts model with 23B activated parameters per token designed for agentic coding and long-horizon work. Highlights Large sparse MoE for agentic coding: Laguna M.1 is a 70-layer MoE transformer with 225B total… 26 TechCrunch — AI news-outlet 11d ago General Intuition in talks to raise $300M at around $2B valuation General Intuition is in talks to raise around $300 million at a roughly $2 billion valuation from backers including Jeff Bezos. The startup trains AI agents on spatial-temporal reasoning. 
14 Hugging Face Daily Papers research 11d ago iOSWorld: A Benchmark for Personally Intelligent Phone Agents Abstract iOSWorld is introduced as the first interactive native iOS simulator benchmark featuring persistent user identity across multiple apps to evaluate personalized mobile agent capabilities. Generated by Qwen/Qwen2.5-Coder-32B-Instruct A useful phone agent needs to be… 6 Hugging Face Daily Papers research 11d ago MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents Abstract MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and… 29 r/LocalLLaMA community 11d ago gave my local llm agent mcp tools for local image + video gen, so it just generates when i ask (fully offline+free) free and open source, runs fully offline. the local llm agent does the image and video gen itself via mcp tools. details and github in the comments. submitted by /u/GroundbreakingMall54 33 Hugging Face Daily Papers research 11d ago Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems Abstract Multicultural multi-agent systems exhibit limited value diversity despite cultural alignment, with social interaction reducing diversity and compromising collective decision-making breadth. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multicultural multi-agent systems… 28 r/LocalLLaMA community 11d ago Kwai-Keye/Keye-VL-2.0-30B-A3B-GGUF · Hugging Face Meet Keye-VL-2.0-30B-A3B — the latest 30B-class flagship base model in the Keye series, purpose-built to push the frontier of long-video understanding and to unlock the first generation of Agent capabilities in the Keye family. 
Highlights Outstanding Video Understanding and… 29 Hugging Face Daily Papers research 12d ago RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents Abstract RODS addresses sample depletion in multi-turn tool-use reinforcement learning by dynamically synthesizing new data based on reward variance to maintain informative training samples. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Multi-turn tool-use RL is bottlenecked by… 21 r/LocalLLaMA community 12d ago I have a M5 Max MacBook Pro with 128gb of ram, what models should I run on it? Yes I know this is a simple question I could just ask Claude or something but I want to see what the community suggests For context it’s a 16in MacBook Pro and i use Hermes agent as a harness connected to LM studio as obviously it’s preferable to be running MLX models especially… 4 Vercel — AI dev-tools 12d ago The Agent Stack Agents are designed to do almost any kind of work, from answering support tickets to writing code. No matter how complex the workload, how long it runs, or how many turns it takes to complete, every agent needs three core capabilities to operate: Agents need to connect to models… 16 Hugging Face Daily Papers research 12d ago Native Active Perception as Reasoning for Omni-Modal Understanding Abstract OmniAgent is a novel omni-modal agent that addresses long video understanding by using an iterative observation-thought-action cycle with active perception, achieving superior performance compared to larger models through efficient selective processing. Generated by… 24 arXiv — NLP / Computation & Language research 12d ago Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier arXiv:2606.18284v1 Announce Type: cross Abstract: The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. 
As reasoning and agentic models improve,… 21 arXiv — NLP / Computation & Language research 12d ago LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents arXiv:2606.18388v1 Announce Type: cross Abstract: RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to… 34 arXiv — Machine Learning research 12d ago Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals,… 18 arXiv — Machine Learning research 12d ago Stealthy World Model Manipulation via Data Poisoning arXiv:2606.18697v1 Announce Type: new Abstract: Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. 
However, the process of updating world models from collected experience creates a training-time attack… 18 arXiv — Machine Learning research 12d ago Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets arXiv:2606.18820v1 Announce Type: new Abstract: Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational… 19 arXiv — NLP / Computation & Language research 12d ago GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents arXiv:2606.18829v1 Announce Type: cross Abstract: Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory… 22 arXiv — Machine Learning research 12d ago Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards arXiv:2606.18963v1 Announce Type: new Abstract: We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact,… 21 arXiv — Machine Learning research 12d ago EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts arXiv:2606.18967v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. 
However, rollout generation remains a dominant latency bottleneck because autoregressive… 25 arXiv — NLP / Computation & Language research 12d ago CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents arXiv:2606.18406v1 Announce Type: new Abstract: Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces… 11 arXiv — NLP / Computation & Language research 12d ago VISUALSKILL: Multimodal Skills for Computer-Use Agents arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the… 19 arXiv — NLP / Computation & Language research 12d ago LegalWorld: A Life-Cycle Interactive Environment for Legal Agents arXiv:2606.18728v1 Announce Type: new Abstract: Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators… 37 arXiv — NLP / Computation & Language research 12d ago Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning arXiv:2606.18831v1 Announce Type: new Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. 
Reinforcement learning (RL) has recently emerged as a… 36 arXiv — NLP / Computation & Language research 12d ago Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play arXiv:2606.19308v1 Announce Type: new Abstract: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm… 13 arXiv — NLP / Computation & Language research 12d ago Learning User Simulators with Turing Rewards arXiv:2606.19336v1 Announce Type: new Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by… 37 arXiv — NLP / Computation & Language research 12d ago Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies arXiv:2606.18264v1 Announce Type: cross Abstract: Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors… 8 arXiv — NLP / Computation & Language research 12d ago CEO-Bench: Can Agents Play the Long Game? arXiv:2606.18543v1 Announce Type: cross Abstract: Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. 
Yet real-world challenges require a combination of sophisticated skills that remain… 29 arXiv — NLP / Computation & Language research 12d ago Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents arXiv:2606.18947v1 Announce Type: cross Abstract: Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider… 20 arXiv — NLP / Computation & Language research 12d ago ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents arXiv:2603.00026v2 Announce Type: replace Abstract: Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may… 15 Hugging Face Daily Papers research 12d ago CEO-Bench: Can Agents Play the Long Game? Abstract CEO-Bench evaluates language model agents' ability to manage a simulated startup over 500 days, testing their proficiency in long-term planning, noise handling, adaptability, and multi-task coordination through a Python interface. Generated by… 5 Hugging Face Daily Papers research 12d ago Guava: An Effective and Universal Harness for Embodied Manipulation Abstract A harness framework for embodied tool use combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Generated by Qwen/Qwen2.5-Coder-32B-Instruct Language models trained on large-scale… 15
Hugging Face official-blog 12d ago Is it agentic enough? Benchmarking open models on your own tooling Published June 18, 2026 Lysandre lysandre Nathan Habib SaylorTwift Pedro Cuenca pcuenq Benchmarking transformers revisions across different metrics This is a… 26 llama.cpp releases dev-tools 12d ago b9691 ggml-cpu: Conditionally enable power11 backend based on compiler support (#24687) ggml: Conditionally enable power11 backend based on compiler support Guard POWER11 backend creation behind a compiler flag check for -mcpu=power11. This avoids build failures on current GCC/Clang… 14 r/LocalLLaMA community 12d ago Lemonade v10.8: auto memory management, cloud offload, Omni improvements, and call your local models as MCP tools v10.8 is out, so here's a project update on what landed. This was a 20-contributor release in just 7 days! Smarter memory and context management Dynamic VRAM management now auto-unloads idle models and downsizes their KV-cache to reclaim GPU memory on the fly, plus model pinning… 27 Ars Technica — AI news-outlet 12d ago AI coding agents taught robots how to install GPUs and cut zip-ties NVIDIA’s self-improvement program for robots enlists teams of AI coding agents. 13 TechCrunch — AI news-outlet 12d ago NEA’s Tiffany Luck on AI IPOs, personal agents, and the ROI reckoning Tokenmaxxing was the hottest trend in Silicon Valley earlier this year, with CEOs encouraging employees to push AI usage as far as it would go. Then the bill came due. Uber reportedly blew through its annual AI budget in a few months, some companies… 23 r/LocalLLaMA community 12d ago GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? 
arXiv: https://arxiv.org/abs/2606.17861 Full Paper: https://arxiv.org/pdf/2606.17861 HuggingFace: https://huggingface.co/papers/2606.17861 GitHub: https://github.com/tongxuluo/gamecraft-bench Project: https://tongxuluo.github.io/gamecraft-bench-website/ I see big/large… 20 llama.cpp releases dev-tools 12d ago b9685 [SYCL] add dev2dev memcpy by SYCL API (#24476) add dev2dev memcpy by SYCL API mv GGML_SYCL_DEV2DEV_MEMCPY to runtime table update the detect method for p2p comm fix the error created during fix conflict Co-authored-by: Neo Zhang macOS/iOS: macOS Apple Silicon (arm64) macOS… 33 Vercel — AI dev-tools 12d ago Vercel Ship 2026 recap For a decade, Vercel has shaped how the web gets built. Now, we’re doing the same for agents. The companies that win the next decade will build on infrastructure designed for agents from the start, and over 2,500 people gathered in London this week to do just that at Vercel Ship… 20 r/LocalLLaMA community 12d ago GLM-5.2 is a win for local AI I know GLM 5.2's massive 753B footprint means none of us are running it at home without an enterprise cluster, but having a true frontier-level, MIT-licensed coding agent out in the wild makes me optimistic. The distillation potential here is massive. Once the community starts… 38 r/LocalLLaMA community 12d ago Headless screenshot loops let a local 30B agent finish a raytraced FPS demo in pure C Some background so this is honest. Over the past few months I ran a lot of oneshot experiments with single file three.js games. Minecraft clones, that kind of thing. I picked those on purpose because they sit deep in the training data and are trivial to debug by eye. The goal… 37