One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Abstract: Instruction-tuned embedding models have become common among state-of-the-art models, yet they are evaluated using a single prompt per task. This single-point evaluation ignores a central problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, for a total of 990 model-dataset-prompt combinations. We show that reported scores misrepresent the distribution of scores over plausible prompts: the default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
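As a rough illustration of what reporting sensitivity alongside a point estimate could look like, the sketch below summarizes each model's scores over several prompts and checks whether favorable prompt selection could move it to first place. This is not the authors' code: the model names, the scores (toy numbers, not results from the paper), the convention that the first score is the "default" prompt, and the promotion criterion (the model's best prompt against every other model's worst prompt) are all assumptions made for illustration.

```python
from statistics import mean, stdev

# Hypothetical per-prompt scores for one dataset (toy numbers, NOT from the paper).
# Assumption: index 0 is the score under the "default" prompt.
scores = {
    "model_a": [0.62, 0.58, 0.66, 0.60],
    "model_b": [0.61, 0.64, 0.57, 0.63],
    "model_c": [0.59, 0.65, 0.60, 0.56],
}


def summarize(per_prompt):
    """Point estimate (default prompt) plus spread over all prompts."""
    return {
        "default": per_prompt[0],
        "mean": mean(per_prompt),
        "std": stdev(per_prompt),
        "min": min(per_prompt),
        "max": max(per_prompt),
    }


def can_be_promoted_to_first(all_scores, model):
    """True if the model's best prompt beats every other model's worst prompt.

    This is one assumed reading of "choosing prompts favorably".
    """
    best = max(all_scores[model])
    return all(
        best > min(other_scores)
        for name, other_scores in all_scores.items()
        if name != model
    )


for name, per_prompt in scores.items():
    summary = summarize(per_prompt)
    promotable = can_be_promoted_to_first(scores, name)
    print(name, summary, "promotable:", promotable)
```

With these toy numbers the default-prompt ranking puts `model_a` first, yet every model passes the promotion check, which is the kind of instability the abstract describes.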
| Subjects: | Computation and Language (cs.CL); Information Retrieval (cs.IR) |
| Cite as: | arXiv:2605.22544 [cs.CL] (or arXiv:2605.22544v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2605.22544 (arXiv-issued DOI via DataCite, pending registration) |
Submission history
From: Kenneth Enevoldsen
[v1] Thu, 21 May 2026 14:27:46 UTC (8,807 KB)