Tag

Long Context

217 articles archived under #long-context · RSS

arXiv — NLP / Computation & Language research 28d ago

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

arXiv:2606.01223v1 Announce Type: new Abstract: Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into…

10
arXiv — NLP / Computation & Language research 28d ago

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

arXiv:2606.01294v1 Announce Type: new Abstract: Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of…

21
arXiv — NLP / Computation & Language research 28d ago

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

arXiv:2606.01336v1 Announce Type: new Abstract: As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs…

35
Hugging Face Daily Papers research 28d ago

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

Abstract LongAttnComp adapts AttnComp for long-context processing by fine-tuning lightweight attention layers and implementing token-level chunking and positional reordering techniques. AI-generated summary As real-world applications increasingly require processing inputs of…

27
NVIDIA Developer Blog official-blog 28d ago

Run Local AI Agents with Faster Models and Multi-Node Clustering on NVIDIA DGX Spark

The rise of autonomous, long-running AI agents has introduced a new class of compute demand, namely tasks that maintain large context windows, spawn concurrent...

16
r/LocalLLaMA community 28d ago

For Ling-2.6-1T, what would make the size feel justified first: quality per token, local serving reality, or long context stability?

The first question I have about Ling-2.6-1T is not “is the model card impressive?” It is whether the boring trade-off makes sense. It is an open-sourced Ant/InclusionAI flagship with about 1T total params / 63B activated params, up to 1M native context, and 256K currently…

21
arXiv — Machine Learning research 29d ago

CoMem: Context Management with A Decoupled Long-Context Model

arXiv:2605.30842v1 Announce Type: new Abstract: Context management enables agentic models to solve long-horizon tasks through iterative summarization of previous interaction histories. However, this process typically incurs substantial decoding overhead for the extra…

11
arXiv — NLP / Computation & Language research 29d ago

GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

arXiv:2605.31105v1 Announce Type: new Abstract: Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache…

4
Hugging Face Daily Papers research 29d ago

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Abstract LongTraceRL addresses long-context reasoning challenges in large language models through tiered distractor construction and rubric reward design for improved reasoning quality. AI-generated summary Long-context reasoning remains a central challenge for large language…

38
r/LocalLLaMA community 29d ago

MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal

submitted by /u/dryadofelysium

14
Vercel — AI dev-tools 1mo ago

MiniMax M3 on AI Gateway

MiniMax M3 is now available on Vercel AI Gateway. M3 is MiniMax's first model with a 1M-token context window and native multimodality, built around MiniMax Sparse Attention (MSA). M3 improves on software engineering, terminal-based tool use, and agentic web browsing, and is…

8
r/LocalLLaMA community 1mo ago

Liquid AI releases LFM2.5-8B-A1B

Liquid AI released LFM2.5-8B-A1B, an edge model designed to power real-life applications. It builds on LFM2-8B-A1B with three major upgrades: an expanded 128K context window, 38T tokens of pre-training (up from 12T), and large-scale reinforcement learning. It also comes with a…

14
arXiv — NLP / Computation & Language research 1mo ago

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

arXiv:2605.29324v1 Announce Type: new Abstract: Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy…

26
arXiv — NLP / Computation & Language research 1mo ago

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

arXiv:2605.29379v1 Announce Type: new Abstract: We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's…

38
r/LocalLLaMA community 1mo ago

Upgrade path from 4x 3090s

Hey everyone, looking for some upgrade advice. Right now, I’m running 4x 3090s hosting Qwen 3.6 27B 128K in full precision. It's a great model, but I'm looking for a step up and trying to figure out the best "middle-tier" hardware path. I've seen people here mention running 8x…

5
Hacker News — AI on Front Page community 1mo ago

Bricks and Minifigs Stole a Man's $200k Lego Collection

Article URL: https://mybricklog.com/blog/bricks-minifigs-corporate-stole-old-mans-200000-lego-collection Comments URL: https://news.ycombinator.com/item?id=48314136 Points: 208 # Comments: 57

34
r/LocalLLaMA community 1mo ago

Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model

I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork ( github.com/spiritbuun/buun-llama-cpp ) and mudler's APEX quantizations ( huggingface.co/mudler ). Spiritbuun's CUDA optimizations for NVIDIA GPUs — fused MMA…

18
r/LocalLLaMA community 1mo ago

Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context.

Used the vLLM version of https://github.com/noonghunna/club-3090. It worked fine for maybe 20-40k context; haven't tried the new one. Has anyone used the new llama.cpp patched one for a single 3090? The project is starting to seem very bloated, at least README-wise. I use…

6
arXiv — Machine Learning research 1mo ago

Heterogeneous Parallelism for Multimodal Large Language Model Training

arXiv:2605.27678v1 Announce Type: new Abstract: Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP…

34
arXiv — NLP / Computation & Language research 1mo ago

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

arXiv:2605.27740v1 Announce Type: new Abstract: Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but…

35
arXiv — NLP / Computation & Language research 1mo ago

Periodic RoPE for Infinite Context LLMs

arXiv:2605.27980v1 Announce Type: new Abstract: The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence…

33
arXiv — NLP / Computation & Language research 1mo ago

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

arXiv:2605.28009v1 Announce Type: new Abstract: Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and…

7
arXiv — NLP / Computation & Language research 1mo ago

ATLAS: All-round Testing of Long-context Abilities across Scales

arXiv:2605.28079v1 Announce Type: new Abstract: Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and…

4
r/LocalLLaMA community 1mo ago

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantization affects the precision in a vacuum, but also how…

11
r/LocalLLaMA community 1mo ago

Finally pioneering beyond the local 256k context window frontier!

The autocompact at 341.5k tokens is manually set and I'll be slowly pushing it back now I'm confident there's overhead for memory eviction of key values into cache. The question now is will the proposed fix complete in those remaining 16k tokens, as I'll be cross if the trial…

11
arXiv — NLP / Computation & Language research 1mo ago

NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

arXiv:2605.26678v1 Announce Type: new Abstract: Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation,…

30
r/LocalLLaMA community 1mo ago

Long-context performance at lower quants

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k context use, it starts to get dumb all of a…

26
Hugging Face Daily Papers research 1mo ago

Language Models Need Sleep

Abstract A sleep-like consolidation mechanism for transformer models uses fast weights and recurrent passes to improve long-context processing while maintaining inference speed. AI-generated summary Transformer-based large language models are increasingly used for long-horizon…

25
Hugging Face Daily Papers research 1mo ago

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Abstract ThriftAttention reduces long-context attention computation by selectively applying higher precision to critical query-key interactions, achieving near-full precision quality at reduced bitwidth efficiency. AI-generated summary Efficient attention algorithms are critical…

8
Smol AI News news-outlet 1mo ago

not much happened today

Inference optimization is increasingly architectural, with EAGLE 3.1 improving speculative decoding and long-context handling, collaborating with vLLM and TorchSpec. Perplexity open-sourced a rebuilt Unigram tokenizer cutting CPU use by 5–6× and…

15
Hugging Face Daily Papers research 1mo ago

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Abstract MemForest presents a memory framework for long-context LLM agents that improves scalability and reduces latency through parallel chunk extraction and hierarchical temporal indexing. AI-generated summary Memory is a fundamental component for enabling long-context LLM…

4
arXiv — NLP / Computation & Language research 1mo ago

WhenLoss: Diagnosing Write and Retrieval Bottlenecks in Long-Context Memory Systems

arXiv:2605.24579v1 Announce Type: new Abstract: Long-context memory systems often fail under fixed budgets, but end-to-end evaluation does not reveal whether evidence was discarded during compression or preserved but never retrieved. We introduce a four-condition diagnostic…

27
arXiv — NLP / Computation & Language research 1mo ago

H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

arXiv:2605.24930v1 Announce Type: new Abstract: Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream…

13
r/LocalLLaMA community 1mo ago

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

submitted by /u/miserlou

5
r/LocalLLaMA community 1mo ago

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost,…

37
arXiv — Machine Learning research 1mo ago

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

arXiv:2605.23081v1 Announce Type: new Abstract: Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit…

37
arXiv — Machine Learning research 1mo ago

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

arXiv:2605.23200v1 Announce Type: new Abstract: The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on…

16
arXiv — Machine Learning research 1mo ago

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

arXiv:2605.23258v1 Announce Type: new Abstract: KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical…

38
arXiv — NLP / Computation & Language research 1mo ago

The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

arXiv:2605.23071v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and…

20
arXiv — NLP / Computation & Language research 1mo ago

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

arXiv:2605.23170v1 Announce Type: new Abstract: Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11…

33
Hugging Face Daily Papers research 1mo ago

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Abstract RTPurbo leverages intrinsic sparsity in full-attention LLMs to achieve efficient long-context inference with minimal training overhead, enabling significant speedups while maintaining near-lossless accuracy. AI-generated summary Long-context inference in large language…

10
arXiv — Machine Learning research 1mo ago

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv:2605.21649v1 Announce Type: new Abstract: Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting…

26
arXiv — Machine Learning research 1mo ago

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

arXiv:2605.21768v1 Announce Type: new Abstract: Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session…

13
arXiv — NLP / Computation & Language research 1mo ago

ACC: Compiling Agent Trajectories for Long-Context Training

arXiv:2605.21850v1 Announce Type: new Abstract: Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents…

11
Hugging Face Daily Papers research 1mo ago

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Abstract Gated DeltaNet-2 improves upon existing linear attention models by separating erase and write operations through distinct channel-wise gates, achieving superior performance in long-context language modeling and retrieval tasks. AI-generated summary Linear attention…

29
Hugging Face Daily Papers research 1mo ago

ACC: Compiling Agent Trajectories for Long-Context Training

Abstract Agent Context Compilation (ACC) enhances long-context reasoning in LLMs by converting multi-turn agent trajectories into structured QA pairs, enabling direct supervision of distant context integration without additional annotation. AI-generated summary Recent…

28
Hugging Face Daily Papers research 1mo ago

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Abstract Mix-Quant is a phase-aware quantization framework that accelerates long-context, multi-turn LLM inference by applying high-throughput NVFP4 quantization to the prefilling phase while maintaining BF16 precision for decoding. AI-generated summary LLM agents have recently…

30
arXiv — NLP / Computation & Language research 1mo ago

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

arXiv:2605.20201v1 Announce Type: new Abstract: Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context --…

7
arXiv — NLP / Computation & Language research 1mo ago

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

arXiv:2605.20626v1 Announce Type: new Abstract: We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then…

8
Latent.Space news-outlet 1mo ago

Railway: The Agent-Native Cloud — Jake Cooper

3M Users, 100K Signups/Week, Own-Metal Data Centers, $200K+ Coding Agent Spend, and the Death of PRs

21
