Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning
Authors: Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim
Abstract
We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.
AI-generated summary
The Darwin Family framework enables training-free evolutionary merging of large language models through gradient-free weight-space recombination, achieving superior reasoning performance without additional training.
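The abstract describes MRI-Trust Fusion as balancing diagnostic layer-importance signals against evolutionary search through a learnable trust parameter. A minimal sketch of that balancing step is below, assuming merging reduces to a per-layer convex combination of two parents; all names (`mri_trust_fusion`, `mri_scores`, `genes`, `tau`) are hypothetical, and the paper's 14-dimensional merge genome and search loop are not reproduced here.

```python
import numpy as np

def mri_trust_fusion(parent_a, parent_b, mri_scores, genes, tau):
    """Merge two checkpoints layer by layer (illustrative sketch).

    parent_a, parent_b : dict[str, np.ndarray]  per-layer weight tensors
    mri_scores         : dict[str, float]       diagnostic layer importance in [0, 1]
    genes              : dict[str, float]       evolved per-layer mixing coefficients in [0, 1]
    tau                : float                  trust parameter in [0, 1]
    """
    merged = {}
    for name, w_a in parent_a.items():
        w_b = parent_b[name]
        # Trust parameter blends the diagnostic signal with the evolved gene.
        alpha = tau * mri_scores[name] + (1.0 - tau) * genes[name]
        # Convex combination of the parents' weights for this layer.
        merged[name] = alpha * w_a + (1.0 - alpha) * w_b
    return merged
```

Under this reading, tau = 1 would follow the diagnostics alone and tau = 0 the evolutionary search alone; the paper's learnable tau interpolates between the two regimes.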
Community
FINAL Bench introduces a new evaluation paradigm for LLMs:
functional metacognitive reasoning — not just "can the model solve it,"
but "does the model know when, why, and how it solves it."
- 100 tasks across 15 domains, built on the TICOS framework
(Task / Introspection / Calibration / Output / Self-correction; see the task-record sketch below)
- Already #5 globally in HF Datasets popularity
- Officially endorsed by the HF Evaluation Team (Nathan Habib)
We believe metacognition is the missing axis in current LLM benchmarks.
Feedback welcome.
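To make the TICOS axes concrete, here is a hypothetical task record; the field names and contents are illustrative assumptions, not the actual FINAL Bench schema.

```python
# A hypothetical FINAL Bench item organized along the five TICOS axes.
# Field names and contents are assumptions for illustration only.
task = {
    "task": "Differentiate f(x) = x**3 * sin(x).",
    "introspection": "Does the model state which rule (product rule) the task requires?",
    "calibration": "Does the model's stated confidence match its measured accuracy?",
    "output": "f'(x) = 3*x**2*sin(x) + x**3*cos(x)",
    "self_correction": "Given a seeded sign error, does the model detect and repair it?",
}

for axis, probe in task.items():
    print(f"{axis:>16}: {probe}")
```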
Darwin Family — Architecture Overview

[Darwin Family Diagram: https://huggingface.co/FINAL-Bench/Darwin-36B-Opus/resolve/main/DARWIN.png]

Flagship update: Darwin-36B-Opus achieves 88.4% on GPQA Diamond,
matching Qwen3.5-397B-A17B with ~10× fewer parameters, training-free.

- Model: https://huggingface.co/FINAL-Bench/Darwin-36B-Opus
- Paper: https://arxiv.org/abs/2605.14386
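For readers who want to try the flagship checkpoint, a minimal loading sketch follows, under the unverified assumption that the repo hosts a standard transformers-compatible causal-LM checkpoint.

```python
# Minimal sketch: load and query the checkpoint with Hugging Face transformers.
# Assumes a standard causal-LM repo layout; requires `accelerate` for device_map.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FINAL-Bench/Darwin-36B-Opus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

prompt = "Question: Why is the sky blue?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```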