Long Context

217 articles archived under #long-context

r/LocalLLaMA community 16d ago

GLM 5.2 is deployed in GLM Coding Plan. API and MIT weights in a week. Voting and benchmarks on X.

The model now supports a 1M context window and two thinking modes: max and high. z.ai recommends using max for coding. Vote on X: What should we prioritize most? Longer context window / MIT-licensed open weights / No price increase. Other links: GLM 5.2 announcement, LLM Benchmark…

32
r/LocalLLaMA community 17d ago

MiniMax Sparse Attention (MSA)

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax…

14
NVIDIA Developer Blog official-blog 17d ago

Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure

As enterprise AI adoption scales, developers are increasingly forced to stitch together fragmented pipelines—separate models for text, vision, and...

25
r/LocalLLaMA community 17d ago

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

What it is, in plain words. Your GPU keeps two float vectors for every token of your conversation. That’s the KV cache, and it’s why long contexts eat VRAM: Llama-3.1-8B needs about 0.12 MB per token, so 100k tokens costs 12 GB and a million tokens costs 122 GB. No consumer card…
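The per-token figure quoted in this post can be reproduced with a short calculation. The sketch below is not from the post: it assumes the published Llama-3.1-8B shape (32 layers, 8 key/value heads, head dimension 128) and 16-bit cache values, and the helper names are made up for illustration. Under those assumptions the post's "MB" and "GB" figures line up with binary units (MiB and GiB).

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed model shape (Llama-3.1-8B config values, fp16/bf16 cache).
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

MIB, GIB = 2**20, 2**30
per_token = kv_cache_bytes_per_token(**LLAMA_31_8B)


def cache_gib(n_tokens):
    """Total KV cache size in GiB for a context of n_tokens."""
    return n_tokens * per_token / GIB


print(f"{per_token / MIB:.3f} MiB per token")      # 0.125 MiB per token
for n in (100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {cache_gib(n):.1f} GiB")
# 100,000 tokens -> 12.2 GiB; 1,000,000 tokens -> 122.1 GiB
```

Quantizing the cache to 8 bits would halve these numbers (`bytes_per_value=1`), which is the trade-off the KV cache quantization posts further down this page measure.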

33
Hugging Face Daily Papers research 18d ago

MiniMax Sparse Attention

Abstract: MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance. Generated by Qwen/Qwen2.5-Coder-32B-Instruct…

20
arXiv — NLP / Computation & Language research 18d ago

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

arXiv:2606.13100v1 Announce Type: new Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most…

23
arXiv — NLP / Computation & Language research 18d ago

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

arXiv:2606.13115v1 Announce Type: new Abstract: While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive…

15
arXiv — NLP / Computation & Language research 18d ago

Recursive Agent Harnesses

arXiv:2606.13643v1 Announce Type: new Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in…

35
Vercel — AI dev-tools 18d ago

GLM 5.2 now available on AI Gateway

GLM 5.2 is now available on AI Gateway. Built for long-horizon tasks, GLM 5.2 carries project-level engineering context across a single task, runs long-running tasks more reliably, and follows engineering standards more consistently. The context window for this model has been…

16
Hugging Face Daily Papers research 18d ago

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Abstract: SparDA is a decoupled sparse attention architecture that improves long-context LLM inference by reducing KV cache bottlenecks and attention complexity through a Forecast projection for lookahead selection. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Sparse attention…

23
r/MachineLearning community 18d ago

What should context compression keep? I looked at how six agents handle it [D]

I use Claude Code, Codex CLI, OpenCode, Cline, Cursor, and Amp enough to notice a pattern in how they handle long context. They are all converging on layered progressive compression, but they disagree on what to protect. Most protect recent user messages as a first-class asset.…

20
arXiv — NLP / Computation & Language research 19d ago

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

arXiv:2606.11213v1 Announce Type: new Abstract: We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through…

6
r/LocalLLaMA community 19d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the…

25
Hugging Face Daily Papers research 20d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Abstract: Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key…

8
arXiv — Machine Learning research 20d ago

Blurry Window Attention

arXiv:2606.09862v1 Announce Type: new Abstract: The Softmax Attention operation in Transformer language models has a quadratic complexity in the sequence length and a growing state size in the form of KV cache, which becomes a bottleneck in long context scenarios. To overcome…

33
arXiv — NLP / Computation & Language research 20d ago

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

arXiv:2606.10537v1 Announce Type: new Abstract: Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose…

37
arXiv — NLP / Computation & Language research 20d ago

Dynamic Linear Attention

arXiv:2606.10650v1 Announce Type: new Abstract: The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To…

34
arXiv — NLP / Computation & Language research 20d ago

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

arXiv:2606.10694v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management…

22
arXiv — NLP / Computation & Language research 20d ago

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

arXiv:2606.11052v1 Announce Type: new Abstract: Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including…

26
arXiv — NLP / Computation & Language research 20d ago

Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

arXiv:2606.10435v1 Announce Type: cross Abstract: Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this…

21
Hugging Face Daily Papers research 20d ago

Dynamic Linear Attention

Abstract: DLA addresses limitations in long-context LLMs by introducing adaptive state merging and capacity-bounded memory modeling for improved multi-state linear attention. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. The scalability of Large Language Models (LLMs) to long…

25
Smol AI News news-outlet 21d ago

Anthropic Claude Fable 5

**Anthropic** released two major models: **Claude Fable 5** for general availability and **Claude Mythos 5** for restricted access, with fallback to **Claude Opus 4.8** for sensitive queries. **Fable 5** features a **1M-token context window** and pricing at **$10/million input…

24
arXiv — Machine Learning research 21d ago

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

arXiv:2606.07703v1 Announce Type: new Abstract: Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve…

4
Hugging Face Daily Papers research 21d ago

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Abstract: Lookahead Sparse Attention with Neural Memory Indexer reduces GPU memory usage for long-context LLM inference while maintaining accuracy through proactive KV cache management and decoupled training. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Conventional LLMs keep the…

19
Hugging Face Daily Papers research 21d ago

End-to-End Context Compression at Scale

Abstract: Encoder-decoder compression techniques are improved through architectural search and large-scale pretraining to create Latent Context Language Models that efficiently handle long contexts with better performance and memory usage compared to traditional KV cache methods.…

25
r/LocalLLaMA community 21d ago

Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance

I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointed out the tool-eval-bench utility by…

9
Hugging Face Daily Papers research 21d ago

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Abstract: RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets. Generated by Qwen/Qwen2.5-Coder-32B-Instruct. Efficient inference is critical for long-context language models, where…

28
arXiv — NLP / Computation & Language research 22d ago

A Four-Condition Diagnostic Protocol for Evidence Utilization in Long-Context and Retrieval-Augmented Language Models

arXiv:2606.06758v1 Announce Type: new Abstract: Final-answer accuracy, retrieval recall, and citation overlap do not by themselves identify whether a long-context or retrieval-augmented language model used the evidence it was given. A model can answer from parametric memory,…

21
arXiv — NLP / Computation & Language research 22d ago

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

arXiv:2606.06906v1 Announce Type: new Abstract: Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate…

6
r/LocalLLaMA community 22d ago

How are you all managing multiple MCP servers on startup?

Hello! I'm using openCode and loading a bunch of different MCP servers at startup. This starts becoming a mess, it eats up tokens and pollutes the context window before I even type a single prompt. How are you all handling this locally? Are you using a proxy/hub to route…

14
r/LocalLLaMA community 22d ago

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ

Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine due to support of additional types:…

31
r/LocalLLaMA community 23d ago

KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive!

TL;DR Based on long context KLD benchmarks, KVarN appears to be just better than usual llama.cpp KV cache quants. At every size, KVarN matches precision of usual quants of one bit higher. A number of people in the comments under my previous post asked a fair question: what if we…

21
arXiv — Machine Learning research 25d ago

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

arXiv:2606.06034v1 Announce Type: new Abstract: Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We…

23
arXiv — NLP / Computation & Language research 25d ago

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

arXiv:2606.05182v1 Announce Type: new Abstract: Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer…

14
arXiv — NLP / Computation & Language research 25d ago

Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs

arXiv:2606.06203v1 Announce Type: new Abstract: Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information --…

4
r/LocalLLaMA community 25d ago

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 · Hugging Face

Model Summary: Total Parameters: 550B (55B active); Architecture: LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP); Context Length: Up to 1M tokens; Minimum GPU Requirement: 8x GB200/B200/GB300/B300, 16x H100, 8x H200; Supported Languages: English, French,…

21
Vercel — AI dev-tools 25d ago

Nemotron 3 Ultra now available on AI Gateway

Nemotron 3 Ultra from Nvidia is now available on Vercel AI Gateway. Nemotron 3 Ultra is an open Mixture-of-Experts reasoning model built for orchestrating long-running agent workflows, with a 1M token context window. The model targets multi-turn agent workflows: planning, tool…

37
Smol AI News news-outlet 26d ago

not much happened today

**NVIDIA** released **Nemotron 3 Ultra**, a fully open **550B MoE** model with **55B active parameters** and **1M context**, optimized for long-running agent tasks with up to **5x speedup** and **30% cost reduction**. It features hybrid Mamba/attention, LatentMoE, native MTP,…

7
arXiv — NLP / Computation & Language research 26d ago

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

arXiv:2606.04120v1 Announce Type: new Abstract: Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents…

10
arXiv — NLP / Computation & Language research 26d ago

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

arXiv:2606.04302v1 Announce Type: new Abstract: Key-value (KV) caching accelerates inference of large language models (LLMs) by reusing past computations for generated tokens. Its importance becomes even greater in long-context applications such as retrieval-augmented generation…

10
arXiv — NLP / Computation & Language research 26d ago

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv:2606.04511v1 Announce Type: new Abstract: Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe…

5
arXiv — NLP / Computation & Language research 26d ago

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

arXiv:2606.04557v1 Announce Type: new Abstract: Large Language Models can reason over long contexts, yet prefilling millions of tokens is wasteful as much of the content remains static across queries. Cartridges address this by distilling document collections into reusable…

14
arXiv — NLP / Computation & Language research 27d ago

Memory Retrieval for Changing Preferences

arXiv:2606.02976v1 Announce Type: new Abstract: Long-context dialogue systems must decide both when to access memory and which parts of the interaction history are relevant. Existing approaches typically rely on heuristic retrieval signals or always-on memory usage, failing to…

19
arXiv — NLP / Computation & Language research 27d ago

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

arXiv:2606.03363v1 Announce Type: new Abstract: Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider~2.0 evaluate schema generalization, large-scale databases,…

15
arXiv — NLP / Computation & Language research 27d ago

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

arXiv:2606.02812v1 Announce Type: cross Abstract: Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but…

38
r/MachineLearning community 27d ago

MiniMax dropped a new attention architecture. [N]

It contains something interesting about context windows. They’re natively scaling to 1M tokens with MiniMax Sparse Attention (MSA), bypassing standard quadratic complexity by completely restructuring the memory access patterns at the operator level. Instead of relying on…

26
r/LocalLLaMA community 27d ago

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

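Llama benchmark results

| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | pp512 | 977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | tg128 | 70.54 ± 0.12 |

I've…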

22
arXiv — NLP / Computation & Language research 28d ago

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

arXiv:2606.00024v1 Announce Type: new Abstract: Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the extensive Key-Value (KV) cache. Most existing KV management methods rely on key-only pruning before…

19
arXiv — NLP / Computation & Language research 28d ago

MemPro: Agentic Memory Systems as Evolvable Programs

arXiv:2606.00619v1 Announce Type: new Abstract: Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory…

5
arXiv — NLP / Computation & Language research 28d ago

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

arXiv:2606.00724v1 Announce Type: new Abstract: Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in…

28
