Hugging Face Daily Papers · June 8, 2026 · 13 min read

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2605.26046

Published on May 25 · Submitted by Abhishek Divekar on Jun 8

Indian Institute of Technology Jodhpur

Authors: Parth Darshan, Abhishek Divekar

Abstract

Multi-objective LLM judge customization using textual gradients faces challenges from gradient dilution and instruction interference that limit optimization effectiveness.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion; however, they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to the multi-objective textual gradient setting. We test five decomposition modes of textual gradient optimizers by varying how much cross-task information the loss, gradient, and optimizer LLMs share. In 6 of 10 configurations, we observe that optimization never improves over the initial prompt. Gradient specificity drops by 59% (from 9.0 to 3.7) when the gradient LLM processes multiple criteria jointly. Separately, we observe that naively combining per-task instructions into a single prompt degrades Spearman's rho by 5.3%. These results identify two separable failure modes, optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge customization using textual feedback.


Community

adivekar · Paper submitter

Title: When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Authors: Parth Darshan (IIT Jodhpur), Abhishek Divekar (Amazon)
Blogpost: https://textgrad-failure-modes.github.io
Codebase: https://github.com/adivekar-utexas/when-gradients-collide

Introduction

LLM judges increasingly score text along multiple criteria at once. TextGrad can optimize a prompt for one criterion, but its "gradients" are natural-language edit suggestions, not numerical vectors, so they cannot be projected, averaged, or constrained the way PCGrad or MGDA manipulate vector gradients. This paper asks what happens when textual gradients are forced into the multi-objective setting. We find two separable failure modes: during optimization, jointly generated gradients lose criterion-specific information; during inference, individually optimized instructions interfere when packed into a single judge prompt.

We evaluate on SummEval, which provides expert annotations for four separable summary-evaluation criteria: fluency, relevance, coherence, and consistency. Each optimization step has three stages where the criteria can interact: the loss LLM, the gradient LLM, and the optimizer LLM. We encode each mode with three letters: S means the stage processes each criterion separately; C means the stage processes all four criteria jointly.

The four multi-objective modes are: SSS (all stages separate), SSC (loss and gradient separate, optimizer combined), SCC (only loss separate, gradient and optimizer combined), and CCC (all stages combined). We also include a Single-Task baseline where each criterion receives its own independent optimization run. This baseline is not a deployable one-prompt judge, but it measures the ceiling we would hope to approach if multi-objective coupling caused no damage. All experiments use N=3 independent runs per configuration over 12 optimization steps.
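To make the three-letter encoding concrete, the following is a minimal Python sketch that expands a mode string into the criterion groups each stage sees. The function names, and the assumption that each stage issues exactly one LLM call per group, are illustrative and not taken from the paper's codebase.

```python
CRITERIA = ("fluency", "relevance", "coherence", "consistency")
STAGES = ("loss", "gradient", "optimizer")


def decode_mode(mode, criteria=CRITERIA):
    """Map a mode string such as 'SSC' to the criterion groups per stage."""
    if len(mode) != len(STAGES) or set(mode) - {"S", "C"}:
        raise ValueError(f"mode must be three letters from S/C, got {mode!r}")
    groups = {}
    for stage, letter in zip(STAGES, mode):
        if letter == "S":
            # Separate: the stage sees one criterion at a time.
            groups[stage] = [(c,) for c in criteria]
        else:
            # Combined: the stage sees all criteria in a single group.
            groups[stage] = [tuple(criteria)]
    return groups


def calls_per_step(mode):
    """Count LLM calls per step, ASSUMING one call per criterion group."""
    return sum(len(g) for g in decode_mode(mode).values())


for mode in ("SSS", "SSC", "SCC", "CCC"):
    print(mode, calls_per_step(mode))
```

Under that one-call-per-group assumption, SSS needs 12 calls per step and CCC needs 3, which illustrates the cost pressure toward the combined modes.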

Failure Mode 1: Gradient Dilution

The first failure happens during optimization. We measure each textual gradient for gradient specificity: how targeted its improvement suggestions are to a single criterion (scored 1–10 by an LLM evaluator). When the gradient LLM processes each task separately (modes Single, SSS, SSC), gradients are sharply focused, scoring a mean of 9.0 (±0.3). But when it must reconcile feedback from all four criteria in one call (modes SCC, CCC), specificity drops to 3.7 (±0.5), a 59% reduction with no overlap between the per-task and cross-task distributions.
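The 59% figure follows from the two reported means. The short check below recomputes it; only the values 9.0 and 3.7 come from the text, and the helper name is ours.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


per_task_mean = 9.0    # modes Single, SSS, SSC
cross_task_mean = 3.7  # modes SCC, CCC

drop = relative_reduction(per_task_mean, cross_task_mean)
print(f"{drop:.1%}")  # prints 58.9%, which rounds to the reported 59%
```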

The per-criterion breakdown reveals uneven dilution. Consistency is the most diluted: SCC scores 2.6 and CCC scores 2.4. Coherence retains more focus: SCC scores 4.8 and CCC scores 5.1. Joint gradients do not merely become uniformly worse; they become uneven, preserving generic writing-quality feedback while losing the criterion whose rubric is easiest to confuse with other dimensions.

This finding extends the rule-dilution hypothesis of CARO from the within-criterion to the cross-criterion setting. CARO shows that aggregating heterogeneous error modes in a single optimization step degrades rubric accuracy; we observe the analogous effect when multiple task gradients are combined in a single gradient call, degrading the per-task optimization signal.

Failure Mode 2: Instruction Interference

Gradient dilution explains why the cross-task modes fail. But why do the per-task modes (SSS, SSC) also stagnate, when their gradients are sharp and their edits faithful? The answer lives at inference time, not optimization time.

We run an oracle experiment: for each criterion, we pick the single best instruction across all single-task runs, the one with the highest held-out Spearman for that task, then combine the four oracle-optimal instructions into one prompt. Even these individually-best instructions degrade when combined, falling from 0.305 to 0.220 average Spearman (−0.085), strictly worse than the generic baseline (0.284).
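For readers who want to run this kind of comparison on their own judge, the sketch below computes Spearman's rho per criterion and averages it across criteria. It is a pure-Python illustration: the function names and the toy scores are hypothetical, and the paper's own evaluation code may differ.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(judge_scores, human_scores):
    """Average per-criterion rho; both arguments map criterion -> scores."""
    rhos = [spearman(judge_scores[c], human_scores[c]) for c in human_scores]
    return sum(rhos) / len(rhos)


# Hypothetical toy data, not SummEval annotations.
human = {"fluency": [1, 2, 3, 4, 5], "relevance": [1, 2, 3, 4, 5]}
judge = {"fluency": [2, 4, 6, 8, 10], "relevance": [5, 4, 3, 2, 1]}
print(mean_spearman(judge, human))
```

Computing the same average once for the per-criterion prompts and once for the combined prompt gives the before/after pair that the oracle experiment compares.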

The mechanism is instruction-length asymmetry. Optimization over-specifies some criteria (the fluency rubric expands to approximately 800 tokens with detailed scoring anchors) while leaving others under-specified (the relevance instruction remains at approximately 4 tokens of the initial prompt). Packed into a single prompt, verbose instructions receive disproportionate attention relative to brief ones at inference time. Individually good rubrics can hurt when combined, so interference cannot be fixed by better per-task optimization alone.
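One rough way to spot this asymmetry before deployment is to measure each criterion's share of the combined prompt. The sketch below uses whitespace splitting as a stand-in for a real tokenizer; the helper name and the example instructions are hypothetical, with lengths chosen only to mirror the approximate 800-versus-4 contrast described above.

```python
def length_shares(instructions):
    """Return each criterion's fraction of the total instruction length.

    Length is counted in whitespace-separated words, an ASSUMED proxy
    for model tokens.
    """
    counts = {name: len(text.split()) for name, text in instructions.items()}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("all instructions are empty")
    return {name: n / total for name, n in counts.items()}


# Hypothetical instructions: a long placeholder rubric and a 4-word one.
instructions = {
    "fluency": "anchor " * 800,
    "relevance": "Rate the summary's relevance.",
}
shares = length_shares(instructions)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

A share near 100% for one criterion is a cheap warning sign that the combined prompt may be dominated by that rubric.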

This result strengthens a finding from RRD, which shows that naive rubric construction degrades GPT-4o preference-judgment accuracy by 13 points on JudgeBench. RRD shows that bad rubrics hurt; our oracle experiment shows that even individually good rubrics can hurt once they are combined.

What This Means for Custom LLM Judges

For practitioners customizing judges to domain-specific criteria, these results indicate that architectural changes are required before the multi-objective setting can work reliably. Addressing either failure mode alone is insufficient.

For gradient dilution: conflict-aware gradient resolution adapted from numerical multi-task learning (PCGrad, CAGrad) could address dilution if textual gradients can be meaningfully embedded and projected. A specificity-aware router could fall back to per-task gradient calls when multi-task specificity drops below a threshold, capturing CCC's hypervolume gains without losing task focus.
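The router idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation: the three callables stand in for LLM calls, the fallback is applied per criterion, and the default threshold of 6.0 is an arbitrary placeholder between the reported cross-task (3.7) and per-task (9.0) means.

```python
from typing import Callable, Dict, List


def route_gradients(
    criteria: List[str],
    joint_gradient: Callable[[List[str]], str],
    per_task_gradient: Callable[[str], str],
    specificity: Callable[[str, str], float],
    threshold: float = 6.0,  # ASSUMED cutoff, not a value from the paper
) -> Dict[str, str]:
    """Use the joint gradient where it is specific enough, else fall back."""
    joint = joint_gradient(criteria)  # one combined gradient call
    routed = {}
    for criterion in criteria:
        if specificity(joint, criterion) >= threshold:
            routed[criterion] = joint
        else:
            # Joint feedback is too diluted for this criterion.
            routed[criterion] = per_task_gradient(criterion)
    return routed
```

The design trades cost for focus: the joint call is always made, and extra per-task calls are spent only on criteria whose joint feedback scores below the threshold.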

For instruction interference: issuing a separate judge call per criterion eliminates interference but multiplies inference cost. Length-aware instruction synthesis that normalizes rubric length during optimization could prevent verbose rubrics from dominating the attention budget. Next-token attention masking that exposes only the relevant criterion instruction during each output field could eliminate interference at no additional inference cost.
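The first of these options, one judge call per criterion, is simple enough to sketch. The `judge` callable, the prompt layout, and the function name below are assumptions for illustration; any real judge client would replace the stub.

```python
from typing import Callable, Dict


def judge_separately(
    summary: str,
    instructions: Dict[str, str],
    judge: Callable[[str], float],
) -> Dict[str, float]:
    """Score a summary with one judge call per criterion.

    Each prompt contains only that criterion's instruction, so no other
    rubric can compete with it; the price is len(instructions) calls.
    """
    scores = {}
    for criterion, instruction in instructions.items():
        prompt = f"{instruction}\n\nSummary:\n{summary}"  # ASSUMED layout
        scores[criterion] = judge(prompt)
    return scores
```

With the four SummEval criteria this means four judge calls per summary instead of one, which is the inference-cost multiplier mentioned above.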

The diagnostics we provide (gradient specificity and feedback adherence) give a way to measure both failure modes, so future work can evaluate mitigations against the same yardstick.
