FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines
Authors: Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi
Organization: Cisco Foundation AI
Published: 2026-06-17
arXiv: 2606.19605
Code: https://github.com/cisco-foundation-ai/fully-automated-prompt-optimization
Abstract
FAPO optimizes LLM pipelines by combining prompt editing with structural changes, demonstrating superior performance across multiple benchmarks and security tasks.
Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO repeatedly evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants against a score function. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the GEPA baseline in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security task that maps CVEs to CWEs, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
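The abstract describes two mechanisms that a small example may make more concrete: a prompt-first search that escalates to structural edits, and a "non-overlapping mean ± standard deviation" win criterion. The Python sketch below is only an illustration under stated assumptions and is not FAPO's implementation. The names `optimize`, `wins_without_overlap`, the proposer callbacks, and the `min_gain` threshold are all hypothetical. In particular, a simple score-gain threshold stands in for the attribution step the abstract mentions, and the sample standard deviation is assumed for the trial spread.

```python
from statistics import mean, stdev
from typing import Callable, List, Sequence, Tuple

# Hypothetical model: a pipeline is a list of prompt strings, one per step.
Pipeline = List[str]


def optimize(
    pipeline: Pipeline,
    score: Callable[[Pipeline], float],
    propose_prompt_edit: Callable[[Pipeline], Pipeline],
    propose_structural_edit: Callable[[Pipeline], Pipeline],
    rounds: int = 3,
    min_gain: float = 0.01,  # assumed stand-in for "prompt edits look insufficient"
) -> Tuple[Pipeline, float]:
    """Prompt-first search that escalates to a structural edit (illustrative)."""
    best, best_score = pipeline, score(pipeline)
    for _ in range(rounds):
        # Step 1: try a prompt-level edit.
        candidate = propose_prompt_edit(best)
        cand_score = score(candidate)
        # Step 2: escalate only if the prompt edit gained too little.
        if cand_score - best_score < min_gain:
            structural = propose_structural_edit(best)
            structural_score = score(structural)
            if structural_score > cand_score:
                candidate, cand_score = structural, structural_score
        # Step 3: keep a variant only if it validates as better.
        if cand_score > best_score:
            best, best_score = candidate, cand_score
    return best, best_score


def wins_without_overlap(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if the range mean(a) ± sd(a) lies entirely above mean(b) ± sd(b)."""
    return mean(a) - stdev(a) > mean(b) + stdev(b)


# Toy setup: each step adds 0.2 to the score, a "concise" hint adds 0.1.
def toy_score(p: Pipeline) -> float:
    bonus = 0.1 if any("concise" in step for step in p) else 0.0
    return min(1.0, 0.2 * len(p) + bonus)


def toy_prompt_edit(p: Pipeline) -> Pipeline:
    # Adds the hint once; later calls return an unchanged copy.
    if "concise" in p[0]:
        return list(p)
    return [p[0] + " concise"] + list(p[1:])


def toy_structural_edit(p: Pipeline) -> Pipeline:
    # Appends a verification step to the chain.
    return list(p) + ["verify"]


best, best_score = optimize(["answer"], toy_score, toy_prompt_edit, toy_structural_edit)
```

In this toy run the first round accepts the prompt edit, and the later rounds escalate because further prompt edits no longer change the score. The overlap check can be applied to per-trial scores of two optimizers, for example `wins_without_overlap([0.80, 0.82, 0.84], [0.60, 0.62, 0.64])`.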