Hugging Face Daily Papers · 4 min read

FastKernels: Benchmarking GPU Kernel Generation in Production

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.23215

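Published on May 22, 2026 · Submitted by Gabriele Oliaro on May 27, 2026
Authors: Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari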

AI-generated summary

FastKernels addresses the gap between benchmark evaluation and production performance for LLM kernel agents by providing a representative set of architectures and a production-grade inference framework that aligns evaluation with real-world deployment.

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements.

Code is available at https://github.com/Snowflake-AI-Research/fastkernels
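The two headline numbers in the abstract can be read with a little arithmetic. The sketch below first checks the quoted coverage figure (409 of 425 architectures), then shows how an aggregate speedup below 1× can arise when a kernel agent is slower than the baseline on most tasks. The abstract does not state how the aggregate is computed, so the geometric mean used here is an assumption, and the function name `geomean_speedup` and the latency values are hypothetical placeholders, not results from the paper.

```python
import math

# Coverage figure quoted in the abstract: 409 of 425 architectures.
covered, total = 409, 425
coverage_pct = round(100 * covered / total, 1)


def geomean_speedup(baseline_ms, candidate_ms):
    """Geometric mean of per-task speedups (baseline time / candidate time).

    ASSUMPTION: the abstract does not say how the aggregate speedup is
    computed; the geometric mean is used here only as a common convention.
    """
    ratios = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-task latencies in milliseconds, invented for illustration.
baseline_ms = [1.0, 2.0, 4.0]   # production baseline
candidate_ms = [0.8, 2.5, 5.0]  # agent-generated kernels

aggregate = geomean_speedup(baseline_ms, candidate_ms)

print(f"coverage: {coverage_pct}%")
print(f"aggregate speedup: {aggregate:.2f}x")
```

In this made-up example the candidate wins on one task and loses on two, so the aggregate lands below 1×. That is the same reading the abstract gives for its reported 0.94×, 0.78×, and 0.53× figures: the generated kernels are, on aggregate, slower than the production baselines.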

Community

FastKernels: A production-aligned GPU kernel generation benchmark that doubles as a minimal inference framework, with compositional tasks from primitives to full models, deployable module interfaces, captured production tensors, and evaluation against real inference-system baselines.

Get this paper in your agent:

hf papers read 2605.23215
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
