Tag

Long Context

217 articles archived under #long-context

Hugging Face Daily Papers research 1mo ago

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Abstract: Rotary Positional Embeddings in Transformer models lose locality bias and token relevance consistency as context length increases, leading to unpredictable attention patterns that cannot be mitigated by multi-head, multi-layer architectures. AI-generated summary: We…

9
r/LocalLLaMA community 1mo ago

RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help

MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. Three configs, tested at real coding-agent context lengths (not just 512 tokens). The main finding surprised me. TL;DR: 35B Q4_K_XL, no MTP,…

38
Hugging Face Daily Papers research 1mo ago

Context Memorization for Efficient Long Context Generation

Abstract: Attention-state memory enables efficient long-prefix inference by storing precomputed attention states in lightweight memory, improving accuracy and reducing latency compared to traditional methods. AI-generated summary: Modern large language model (LLM) applications…

13
Hugging Face Daily Papers research 1mo ago

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Abstract: PEEK enables large language model agents to efficiently reuse orientation knowledge about recurring external contexts through a persistent context map that reduces computational costs and improves performance. AI-generated summary: Large language model (LLM) agents…

4
arXiv — Machine Learning research 1mo ago

Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery

arXiv:2605.18854v1 Announce Type: new Abstract: Coding agents accumulate extensive context during long-running tasks, yet fixed context windows force practitioners to choose between truncation and task failure. While numerous memory condensation strategies have been proposed,…

12
arXiv — Machine Learning research 1mo ago

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

arXiv:2605.18856v1 Announce Type: new Abstract: Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods…

22
arXiv — Machine Learning research 1mo ago

KVBuffer: IO-aware Serving for Linear Attention

arXiv:2605.19049v1 Announce Type: new Abstract: Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by…

28
arXiv — NLP / Computation & Language research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

arXiv:2605.19577v1 Announce Type: new Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter…

17
arXiv — NLP / Computation & Language research 1mo ago

OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

arXiv:2605.19660v1 Announce Type: cross Abstract: The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel…

30
arXiv — NLP / Computation & Language research 1mo ago

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

arXiv:2605.19932v1 Announce Type: cross Abstract: Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory,…

22
Hugging Face Daily Papers research 1mo ago

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Abstract: GoLongRL presents an open-source approach for long-context reinforcement learning with diverse reward optimization through capability-oriented data construction and TMN-Reweight methodology. AI-generated summary: We present GoLongRL, a fully open-source,…

37
arXiv — Machine Learning research 1mo ago

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

arXiv:2605.16360v1 Announce Type: new Abstract: Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision…

13
arXiv — Machine Learning research 1mo ago

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

arXiv:2605.16883v1 Announce Type: new Abstract: Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work…

26
arXiv — NLP / Computation & Language research 1mo ago

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

arXiv:2605.16839v1 Announce Type: new Abstract: Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed…

31
arXiv — NLP / Computation & Language research 1mo ago

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

arXiv:2605.16928v1 Announce Type: new Abstract: Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an…

13
arXiv — NLP / Computation & Language research 1mo ago

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

arXiv:2605.18071v1 Announce Type: new Abstract: Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding,…

28
arXiv — NLP / Computation & Language research 1mo ago

Context Memorization for Efficient Long Context Generation

arXiv:2605.18226v1 Announce Type: new Abstract: Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the…

25
r/LocalLLaMA community 1mo ago

Is there any <3B model with usable 200k+ context window?

I need a small model for processing conversation transcripts from larger models, so need usable context window out to at least 200k tokens. I know some models claim to support this, but I don’t know which are actually good at this in practice. Also desirable: low hallucination…

15
r/LocalLLaMA community 1mo ago

Configuration Qwen3.6-35b-a3b (12Gb VRAM)

Has anyone here tested different KV cache quantizations and compared their performance? I’m currently using the model in Q5_K_M with Q4 KV cache on a 12 GB VRAM GPU. With this setup, I’m offloading about 27 MoE layers to the CPU and getting around 40 tok/s with a 128k total…

38
arXiv — Machine Learning research 1mo ago

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv:2605.15422v1 Announce Type: new Abstract: Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both…

36
arXiv — NLP / Computation & Language research 1mo ago

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

arXiv:2605.15514v1 Announce Type: new Abstract: We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its…

29
arXiv — NLP / Computation & Language research 1mo ago

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

arXiv:2605.15913v1 Announce Type: new Abstract: Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG).…

29
arXiv — NLP / Computation & Language research 1mo ago

RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

arXiv:2605.16045v1 Announce Type: new Abstract: Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process…

31
r/LocalLLaMA community 1mo ago

Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster

I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96GB, one PRO 5000 48GB, two 5090 32GB, and…

4
r/LocalLLaMA community 1mo ago

Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4

CPU is just a secondhand 10900x. Using 128k context, unquantized kv cache. Model is at q8_0 to mitigate some weird behavior I was seeing at lower quants. Speed is very slow at around 50tps pp, 10tps tg, but usable for coding agent workflows. Anybody else running MoE models in…

22
r/LocalLLaMA community 1mo ago

Developers who use local AI - Q4_0 vs Q8_0 KV quant?

I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territory. I don't really need a full study,…

25
r/LocalLLaMA community 1mo ago

Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

Saw some posts around PP being slower, so they were cautious on trying it. Here's a real-world datapoint. Settings: headless RTX 3090 24G; OpenCode; model: unsloth's Qwen3.6-27B-MTP-Q4_K_M.gguf; 128k context; q8_0 KV cache; --spec-draft-n-max: 3; --draft-p-min: 0. Use Cases: Research…

8
r/LocalLLaMA community 1mo ago

Deepseek V4's 1M context window: the breaking point

Just ran a test to verify DeepSeek V4's 1M context claim across three production codebases: 45k (microservice), 180k (monorepo backend) and 520k (full stack app). For the observation, tasks included dependency tracing, cross-file refactors and bug isolation to see…

13
Ahead of AI (Sebastian Raschka) research 1mo ago

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs

22
r/LocalLLaMA community 1mo ago

RAG on Snapdragon X2 Laptop, 200K documents.

Qualcomm recently released the new Snapdragon X2 laptop chipset. I immediately ordered one: ASUS Zenbook A16 16" 3K OLED Touchscreen Laptop, Snapdragon X2 Elite Extreme (2026). A few things I really like about this machine: Extremely light.…

26
r/LocalLLaMA community 1mo ago

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)

TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with 4 concurrent batches 😄 At 8k context…

12
Hugging Face Daily Papers research 1mo ago

Long Context Pre-Training with Lighthouse Attention

Abstract: Lighthouse Attention enables efficient training of causal transformers at long sequences by using hierarchical selection-based attention that reduces computational complexity while maintaining model performance. AI-generated summary: Training causal transformers at…

33
r/LocalLLaMA community 1mo ago

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests. The project was a test - building a full iterative step-by-step pygame; a small mystery dungeon-style game. At first I set 100-200k context…

28
arXiv — NLP / Computation & Language research 1mo ago

EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

arXiv:2605.14589v1 Announce Type: new Abstract: Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to…

14
Hugging Face Daily Papers research 1mo ago

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Abstract: A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches. AI-generated summary: Memory is essential for large vision-language models (LVLMs) to…

25
Hugging Face Daily Papers research 1mo ago

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Abstract: RealICU benchmark evaluates large language models for ICU decision support using hindsight-annotated patient trajectories, revealing limitations in clinical recommendation accuracy and early interpretation bias. AI-generated summary: Intensive care units (ICU) generate…

32
Hugging Face Daily Papers research 1mo ago

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

Abstract: MemReread addresses long-context reasoning challenges by avoiding intermediate retrieval and employing question decomposition with rereading to recover discarded information, maintaining linear time complexity. AI-generated summary: To tackle long-context reasoning tasks…

38
Hugging Face Daily Papers research 1mo ago

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Abstract: Long-context continued pre-training enhances vision-language models' ability to handle extended documents while maintaining performance across diverse contexts through strategic data mixture design. AI-generated summary: Long-context modeling is becoming a core…

24
r/LocalLLaMA community 1mo ago

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M models, 128k context): Model tok/s Key…

19
r/LocalLLaMA community 1mo ago

The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens?

Xiaomi open-sourced MiMo-V2.5-Pro. 1.02 trillion parameters, 42B active (MoE), 1M context, MIT license. On paper, this is exciting. In practice, I'm stuck on the math. What I've been doing with it I've been running V2.5-Pro via the API through Claude Code for autonomous coding…

13
Hugging Face Daily Papers research 1mo ago

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Abstract: Training framework FocuSFT improves long-context language model performance by addressing attention allocation issues through bilevel optimization with parametric memory that focuses attention on semantically relevant content. AI-generated summary: Large language models…

25
arXiv — Machine Learning research 1mo ago

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

arXiv:2605.11196v1 Announce Type: new Abstract: Linear attention reduces the quadratic cost of softmax attention to $\mathcal{O}(T)$, but its memory state grows as $\mathcal{O}(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce…

13
arXiv — Machine Learning research 1mo ago

Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer

arXiv:2605.11414v1 Announce Type: new Abstract: While traditional time-series classifiers assume full sequences at inference, practical constraints (latency and cost) often limit inputs to partial prefixes. The absence of class-discriminative patterns in partial data can…

29
arXiv — NLP / Computation & Language research 1mo ago

Training-Inference Consistent Segmented Execution for Long-Context LLMs

arXiv:2605.11744v1 Announce Type: new Abstract: Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many…

7
arXiv — NLP / Computation & Language research 1mo ago

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

arXiv:2605.12227v1 Announce Type: new Abstract: Adapting large language models (LLMs) to long-context tasks requires post-training methods that remain accurate and coherent over thousands of tokens. Existing approaches are limited in several ways: 1) off-policy methods such as…

12
arXiv — NLP / Computation & Language research 1mo ago

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

arXiv:2605.12260v1 Announce Type: new Abstract: Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the…

8
r/LocalLLaMA community 1mo ago

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call attention drift: as the drafter…

6
Smol AI News news-outlet 1mo ago

GPT-Realtime-2, -Translate, and -Whisper: new SOTA realtime voice APIs

OpenAI released GPT-Realtime-2, a voice model with GPT-5-class reasoning, tool use, interruption handling, and extended context windows up to 128K tokens, achieving top scores on Big Bench Audio and Conversational Dynamics benchmarks. They also launched a…

22
Vercel — AI dev-tools 2mo ago

Grok 4.3 on AI Gateway

Grok 4.3 is now available on Vercel AI Gateway. The model has a 1M token context window and improvements in accuracy, tool calling, and instruction following. To use Grok 4.3, set model to xai/grok-4.3 in the AI SDK. AI Gateway provides a unified API for calling models,…

7
Vercel — AI dev-tools 2mo ago

Deepseek V4 on AI Gateway

DeepSeek V4 is now available on Vercel AI Gateway. There are 2 model variants: DeepSeek V4 Pro and DeepSeek V4 Flash. A 1M token context window is the default across both models. DeepSeek V4 Pro focuses on agentic coding, formal mathematical reasoning, and long-horizon…

27
