Tag

Long Context

217 articles archived under #long-context

arXiv — Machine Learning research 1h ago

HARD-KV: Head-Adaptive Regularization for Decoding-time KV Compression

arXiv:2606.28831v1 Announce Type: new Abstract: Long-context LLM inference faces a fundamental conflict: head-adaptive compression algorithms (e.g., Top-$p$ nucleus sampling) offer superior accuracy by dynamically fluctuating memory budgets, yet modern inference engines (e.g.,…

29
arXiv — NLP / Computation & Language research 1h ago

Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution

arXiv:2606.28548v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) have become a useful tool for extracting interpretable features in language models. However, standard SAE architectures operate on individual token activations, meaning that the number of active features…

25
arXiv — NLP / Computation & Language research 1h ago

Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory

arXiv:2606.28876v1 Announce Type: new Abstract: Long-context language models often conflate two different goals: compressing history into an efficient state, and maintaining reliable long-term memory. Linear, recurrent, and sparse attention reduce the cost of processing long…

14
arXiv — NLP / Computation & Language research 1h ago

Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM

arXiv:2606.29563v1 Announce Type: new Abstract: Large language models (LLMs) excel at complex tasks like question answering and summarization, thanks to their ability to handle long-context inputs. However, deploying LLMs is costly, not only due to the high computational demands…

7
arXiv — NLP / Computation & Language research 1h ago

MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers

arXiv:2606.29844v1 Announce Type: new Abstract: The quadratic computational cost of traditional attention mechanisms poses a major bottleneck to the scalability and practical deployment of large language models (LLMs), particularly in long-context scenarios. To improve…

15
arXiv — NLP / Computation & Language research 1h ago

LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard

arXiv:2606.30005v1 Announce Type: new Abstract: Long-horizon tool agents are bottlenecked by how their context grows toward the limits of the context window. Recent systems make context management agent- or system-controlled, but they either learn a compression policy that…

34
arXiv — NLP / Computation & Language research 1d ago

Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

arXiv:2606.27705v1 Announce Type: new Abstract: Large Language Models (LLMs) still struggle with the "lost-in-the-middle" problem, where critical information located in the middle of long-context inputs is often underrepresented or lost. While existing methods attempt to…

4
arXiv — NLP / Computation & Language research 1d ago

NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

arXiv:2606.27791v1 Announce Type: new Abstract: Hybrid attention models that mix full and sliding-window attention across layers offer a promising approach to efficient long-context inference, but the critical question of which layers should retain full attention remains…

19
arXiv — NLP / Computation & Language research 1d ago

Position Bias Correction is Insufficient for One-Pass Attention Sorting

arXiv:2606.27793v1 Announce Type: new Abstract: Long-context language models suffer from position bias, where information in middle positions is underutilized. Attention Sorting addresses this by iteratively reordering documents based on attention patterns, but its multiple…

9
r/LocalLLaMA community 1d ago

High-quality GLM-5.2 Quant on 4x DGX Spark - Guide, Results, and Comps

I got GLM-5.2 NVFP4 running on four DGX Sparks at 128K context. This is still a niche/hacky setup, but it is now a real serving point rather than just a proof of life. Objective: a high-quality 4-bit quant running on 4x Spark. Model: https://huggingface.co/Mapika/GLM-5.2-NVFP4…

9
r/LocalLLaMA community 1d ago

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

Follow-up to my previous Ornith-1.0-35B Q3_K_M post. I grafted a native MTP draft head onto the IQ4_XS body (head at Q6) for self-speculative decode, single GPU, llama.cpp: 1.3-1.35x single-stream decode (172.6 -> 233.8 tok/s). Next-token distribution is byte-identical to…

11
NVIDIA Developer Blog official-blog 3d ago

Creating the NVIDIA Nemotron 3 Ultra NVFP4 Checkpoint with NVIDIA Model Optimizer

As context windows grow longer, moving large model weights efficiently becomes critical to performance. A common way to address this is quantization, an...

37
Hugging Face Daily Papers research 3d ago

Information-Aware KV Cache Compression for Long Reasoning

Abstract InfoKV is an entropy-aware KV cache compression framework that enhances long-context reasoning in LLMs by incorporating information-theoretic signals alongside attention weights. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Reasoning capability has advanced rapidly in…

10
r/MachineLearning community 3d ago

What if context compression is a diffusion noise function? Proposal + honest results from untrained-model experiments [R]

I'm proposing a way to handle massive context longer than a model's context window by treating semantic compression as the noise function of a diffusion-like process. Instead of denoising masked tokens into coherent text (like DiffusionGemma or Nemotron-Diffusion do for…

20
arXiv — Machine Learning research 4d ago

SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

arXiv:2606.26290v1 Announce Type: new Abstract: While parameter-efficient fine-tuning (PEFT) typically targets attention projectors, its efficacy for tasks requiring sequential state accumulation remains under-explored. We examine if PEFT for such tasks can benefit from state…

18
arXiv — Machine Learning research 4d ago

PersistentKV: Page-Aware Decode Scheduling for Long-Context LLM Serving on Commodity GPUs

arXiv:2606.26666v1 Announce Type: new Abstract: Autoregressive large language model (LLM) serving is increasingly limited by key-value (KV) cache movement rather than dense matrix multiplication. Modern paged-attention systems reduce KV-cache fragmentation and mature kernels…

20
arXiv — NLP / Computation & Language research 4d ago

Context Recycling for Long-Horizon LLM Inference

arXiv:2606.26105v1 Announce Type: new Abstract: Large language models (LLMs) exhibit strong capabilities in short-context reasoning but degrade in performance over long conversational horizons due to context window limitations and inefficient token usage. We introduce…

27
arXiv — NLP / Computation & Language research 5d ago

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

arXiv:2606.24957v1 Announce Type: new Abstract: While speculative decoding improves inference throughput for multi-batch long-context Large Language Models (LLMs), its efficiency is often limited by a verification bottleneck where Key-Value (KV) cache loading dominates latency.…

19
arXiv — Machine Learning research 6d ago

Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets

arXiv:2606.23961v1 Announce Type: new Abstract: Long-context and agentic LLM workloads push the KV cache past any fixed memory budget, forcing the inference stack to permanently evict tokens at every step of a continuous-inference stream. Existing methods all share the same…

20
arXiv — NLP / Computation & Language research 6d ago

AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

arXiv:2606.24286v1 Announce Type: new Abstract: Multimodal Large Language Models have achieved remarkable progress in short-form audio-video understanding, yet long-form audio-video comprehension remains challenged by limited context windows and severe information redundancy. To…

15
arXiv — NLP / Computation & Language research 6d ago

Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling

arXiv:2606.24650v1 Announce Type: new Abstract: We present Harmonic, a hierarchical state space model (SSM) for language modeling. The architecture stacks three recurrent levels at progressively slower timescales; each level receives the prediction error of the level below as…

21
arXiv — NLP / Computation & Language research 6d ago

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

arXiv:2504.17768v3 Announce Type: replace Abstract: Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with…

29
Hugging Face Daily Papers research 6d ago

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Abstract A data-centric approach using curated datasets and a minimal GRPO setup significantly improves long-context reasoning in large language models, outperforming prior reinforcement learning methods. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Long-context reasoning is an…

15
Hugging Face Daily Papers research 7d ago

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Abstract HydraHead is a novel attention hybridization architecture that combines Full Attention and Linear Attention at the head level, achieving superior long-context performance with reduced training overhead through interpretability-driven selection and scale-normalized…

34
Hugging Face Daily Papers research 7d ago

EvoEmbedding: Evolvable Representations for Long-Context Retrieval and Agentic Memory

Abstract EvoEmbedding is a dynamic embedding model that generates adaptive representations by maintaining a continuously updated latent memory, enabling improved retrieval performance in long-context scenarios. (Generated by Qwen/Qwen2.5-Coder-32B-Instruct.) Existing embedding…

32
r/LocalLLaMA community 7d ago

Is Gemma 4 going to be the next Mistral (or Qwen3.6) one day? Concerning the lack of finetunes

https://eqbench.com/creative_writing.html#:~:text=gemma%2D4%2D31B,Sample From what I've seen Gemma 4 has better everything (especially long-context adherence) EXCEPT for the raw prosing performance of Mistral... finetunes. Comparing bases only, Mistral Small 3.2 (the…

5
r/LocalLLaMA community 7d ago

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

Just sharing some speed test numbers for GLM-5.2 running on llama.cpp. Setup: Model: unsloth/GLM-5.2-GGUF, UD-IQ1_M quant; GPUs: RTX 5090 + RTX 3090 Ti; 186 GB DDR5 used; Debian 13; CUDA 13.3; 128k context, q8_0 KV cache. Prefill (prompt processing), n_tokens / tokens/s: 8,201 / 579.75…

4
r/LocalLLaMA community 8d ago

Not a new model, just a Happy Father's Day and a thank you.

I know this isn't our usual discussion about context windows, quantization, or the latest model drop, but I just wanted to take a quick moment to say thank you. As a dad myself, I really appreciate this great community. Between the daily grind and family life, diving into this…

12
r/MachineLearning community 8d ago

I released a softmax-free attention model at GPT-2 Medium scale (~354M params, 11.5B tokens): structural sparsity + tile-skipping kernels for long-context VRAM savings. Open weights + custom Triton kernels [R]

submitted by /u/NonGameCatharsis

29
Simon Willison community 10d ago

Quoting Sean Lynch

The real valuable capability MCP offers over skills/CLI is isolating the auth flow outside of the agent’s context window, and potentially out of the harness completely. [...] Maybe the idealized form of MCP is just an auth gateway for the API and nothing else. That’d still be a…

8
arXiv — NLP / Computation & Language research 11d ago

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

arXiv:2606.19847v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented…

32
arXiv — NLP / Computation & Language research 11d ago

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

arXiv:2606.20097v1 Announce Type: new Abstract: The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the…

13
arXiv — NLP / Computation & Language research 11d ago

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

arXiv:2606.20164v1 Announce Type: new Abstract: Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and…

29
r/LocalLLaMA community 11d ago

2 weeks since the release of Gemma 4 12b Unified, how are we feeling about it?

I'm looking for a good model to run on a 5090 and have ample context ~128k. This model looks good for me, it seems to have good performance in the 12b range, almost comparable to Gemma 4 26B A4B. Building a custom harness for it and have ~300m of tokens to fine tune on. Do you…

6
arXiv — Machine Learning research 12d ago

Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

arXiv:2606.18283v1 Announce Type: new Abstract: The dense token-to-token interaction pattern of standard dot-product attention remains a central bottleneck in scaling Transformer architectures to long contexts. We introduce Gaussian Mixture Attention (GMA), a…

26
arXiv — NLP / Computation & Language research 12d ago

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv:2606.18831v1 Announce Type: new Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a…

36
arXiv — NLP / Computation & Language research 12d ago

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

arXiv:2606.19144v1 Announce Type: cross Abstract: Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as…

33
Hugging Face Daily Papers research 12d ago

Rethinking the Role of Efficient Attention in Hybrid Architectures

Abstract Hybrid architectures combining full attention with efficient attention modules like sliding-window attention exhibit distinct scaling behaviors and optimization trajectories, with efficient attention primarily affecting the emergence speed of long-context capabilities…

29
arXiv — Machine Learning research 13d ago

AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor

arXiv:2606.17872v1 Announce Type: new Abstract: Large language models (LLMs) outperform earlier architectures on generative inference and long-context tasks, but their large size introduces significant challenges in memory usage, energy cost, and on-device deployment. Since…

27
r/LocalLLaMA community 13d ago

GLM-5.2 just dropped open weights and it already looks weirdly strong for coding

GLM-5.2 just released and the early numbers look pretty insane. 1M context window, open weights, MIT license, two reasoning effort modes, and it is already showing up near the top of coding arenas. I know every new model gets hyped for 24 hours, but this one actually looks worth…

28
Smol AI News news-outlet 13d ago

GLM 5.2: the top Frontend Coding model in the world, IndexShare reduces costs

Z.ai released GLM-5.2, an MIT-licensed open-weight frontier model targeting coding and long-horizon agentic tasks with a 1M-token context window and two reasoning-effort modes. It features a 744B-parameter mixture-of-experts architecture with 40B active…

14
arXiv — Machine Learning research 14d ago

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

arXiv:2606.15157v1 Announce Type: new Abstract: KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all…

29
arXiv — NLP / Computation & Language research 14d ago

Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

arXiv:2606.16093v1 Announce Type: new Abstract: Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while…

11
r/LocalLLaMA community 14d ago

Maybe dumb question, but how do you serve multiple users with the full context length?

After experimenting with llama.cpp, I'm wondering about one thing. Let's say we have an LLM with a context size of 128k. Now let's say we want to have up to 8 parallel users, and we want to provide each client with the full context capabilities. With llama.cpp, how does that work? AFAIK it…

20
Ollama releases dev-tools 14d ago

v0.30.9-rc1

server: context shift for context windows larger than 8k, add error w…

28
r/LocalLLaMA community 14d ago

Context window + project size + Aider?

Forgive the naivety of this post, I'm a noob, bear with me! If a project, understood as a set of files, is larger than the context window of a model, how do you fit it in? After doing some naive research, various major LLMs like Deepseek, Kimi, and company say the solution is…

32
arXiv — NLP / Computation & Language research 15d ago

Knowledge Graph Enhanced Memory-Augmented Retrieval for Long Context Modeling

arXiv:2606.14047v1 Announce Type: cross Abstract: Long-context language modeling requires not only extending context windows but maintaining coherent understanding of entity states and relationships across thousands of tokens -- a challenge that semantic similarity alone cannot…

12
arXiv — NLP / Computation & Language research 15d ago

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

arXiv:2606.14470v1 Announce Type: cross Abstract: Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software…

37
Hacker News — AI on Front Page community 15d ago

Don't trust large context windows

Article URL: https://garrit.xyz/posts/2026-05-06-dont-trust-large-context-windows Comments URL: https://news.ycombinator.com/item?id=48524620 Points: 201 # Comments: 146

27
r/LocalLLaMA community 16d ago

[NEW FAMILY OF MODELS] Supra1.5 family just released!

SupraLabs just released the Supra-1.5-exp line: Base, Instruct, and GGUF! (Reasoning soon) Hey r/LocalLLaMA! We are releasing the experimental Supra-1.5-50M family today: a new Base model with 5x the context window of the original Supra-50M, an Instruct fine-tune on top of it,…

20
