The MTEB leaderboard ranks embedding models on aggregate benchmark scores, and picking the top-ranked model for your use case is usually the wrong call. MTEB averages performance across dozens of tasks, most of which are not retrieval, so a model that tops the average can underperform on the one task that matters for AEO (answer engine optimization): retrieving the right passage for a real query. The score that matters is retrieval performance on data resembling yours, not the headline average. This article covers why the leaderboard misleads, which sub-scores actually predict AEO performance, and how to choose without running a full benchmark.
When teams pick an embedding model, they tend to open the MTEB leaderboard, sort by the top-line score, and choose whatever sits at number one. It feels rigorous because it is a benchmark. It is misleading because of what the benchmark averages. MTEB, the Massive Text Embedding Benchmark, scores models across a large suite of tasks spanning classification, clustering, reranking, retrieval, summarization, and more. The headline number is an average across all of them. For AEO, you care about exactly one category, and the average obscures it.
This matters because the model that wins the average is not necessarily the model that wins retrieval, and retrieval is the entire game for getting your content found and cited. A model can post a strong overall MTEB score on the strength of its classification and clustering performance while being merely average at retrieval, and you would never see that by sorting on the headline number. Reading the benchmark correctly means ignoring most of it.
What MTEB actually measures
MTEB is a genuinely valuable resource; the problem is not the benchmark but how people read it. It evaluates embedding models across many task categories, each testing a different capability. Classification tests whether embeddings separate labeled categories. Clustering tests whether similar items group together. Retrieval tests whether, given a query, the model surfaces the relevant passage from a corpus. Reranking, summarization, and other tasks each test something else. The headline MTEB score is the average across all of these.
For most production AEO and RAG use cases, retrieval is the only category that directly predicts your outcome, because retrieval is literally what happens when an engine tries to find your passage for a query. The other categories are real capabilities, but they do not determine whether your content gets found. A model that is brilliant at clustering and average at retrieval will disappoint you for AEO no matter how high its overall score, because clustering is not the job.
Why the average inverts the ranking
The averaging creates a specific failure: models optimized to win the overall benchmark may distribute their strength across many categories rather than maximizing the one you need. A model tuned for a strong average is tuned to be good at everything, which can mean it is not the best at retrieval specifically, because being the best at retrieval might require tradeoffs that hurt its classification or clustering scores and thus its average. The leaderboard rewards generalists; AEO needs a retrieval specialist.
This is why the top-ranked model by average is often not the top-ranked model by retrieval sub-score. When you sort MTEB by the retrieval category specifically rather than by the headline, the ordering changes, sometimes substantially. The model you would have picked by average and the model you should pick for retrieval can be different models. Reading the sub-score rather than the average is the single correction that fixes most embedding-model selection mistakes.
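The reordering is easy to see with a toy table. The sketch below uses made-up scores for three hypothetical models (`model_a`, `model_b`, `model_c`) and a simple mean over three categories as a stand-in for the headline number; none of the figures come from the real leaderboard.

```python
# Hypothetical scores for illustration only; these are NOT real leaderboard numbers.
scores = {
    "model_a": {"classification": 80.0, "clustering": 52.0, "retrieval": 50.0},
    "model_b": {"classification": 74.0, "clustering": 46.0, "retrieval": 57.0},
    "model_c": {"classification": 77.0, "clustering": 49.0, "retrieval": 54.0},
}


def headline_average(task_scores):
    """Simple mean over categories, standing in for the headline score."""
    return sum(task_scores.values()) / len(task_scores)


# Rank the same models two ways: by the average, and by retrieval alone.
by_average = sorted(scores, key=lambda m: headline_average(scores[m]), reverse=True)
by_retrieval = sorted(scores, key=lambda m: scores[m]["retrieval"], reverse=True)

print("By headline average:", by_average)
print("By retrieval only:  ", by_retrieval)
```

With these invented numbers the model that leads on the average comes last on retrieval, which is the pattern to watch for when you re-sort the real table.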
The sub-scores that actually predict AEO performance
Sort MTEB by retrieval performance, not the average. That is the first and biggest correction. But even the retrieval sub-score is an average across many retrieval datasets, and your content may resemble some of those datasets more than others. A model that excels at retrieval over scientific abstracts may not be the best for conversational ecommerce queries. Look at the retrieval performance on the specific datasets closest to your domain, not just the retrieval average.
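As a rough sketch of that second correction, the snippet below compares the overall retrieval average with an average over only the datasets that resemble your domain. The dataset names, the model names, and every score are placeholders invented for the example; substitute the per-dataset numbers you read off the leaderboard.

```python
# Placeholder per-dataset retrieval scores; invented for illustration.
retrieval_scores = {
    "model_a": {"scientific_abstracts": 72.0, "financial_qa": 40.0, "product_questions": 45.0},
    "model_b": {"scientific_abstracts": 56.0, "financial_qa": 47.0, "product_questions": 52.0},
}

# Assumption: these are the datasets closest to your own content.
datasets_like_mine = ["financial_qa", "product_questions"]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Average over every retrieval dataset versus only the domain-relevant ones.
overall = {m: mean(s.values()) for m, s in retrieval_scores.items()}
in_domain = {m: mean(s[d] for d in datasets_like_mine) for m, s in retrieval_scores.items()}

best_overall = max(overall, key=overall.get)
best_in_domain = max(in_domain, key=in_domain.get)

print("Best retrieval average:", best_overall)
print("Best on my datasets:   ", best_in_domain)
```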
Two practical constraints sit alongside the score. Dimension: higher-dimensional embeddings can capture more nuance but cost more to store and search, and the marginal retrieval gain often does not justify the cost at scale. Cost and latency: a top-scoring model that is expensive or slow to run may be the wrong production choice versus a slightly lower-scoring model that is cheaper and faster, because in production you run it on every chunk of every page. The best model on the leaderboard is not the best model for a budget-constrained pipeline processing millions of chunks.
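The storage side of the dimension tradeoff is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per value) and counts raw vector bytes only, ignoring index overhead and metadata; the chunk count and the dimension values are example inputs, not recommendations.

```python
def raw_vector_bytes(n_chunks, dimensions, bytes_per_value=4):
    """Bytes needed to store the vectors alone, assuming float32 by default."""
    return n_chunks * dimensions * bytes_per_value


n_chunks = 5_000_000  # example corpus size (an assumption for the illustration)

for dimensions in (384, 1024, 3072):
    gigabytes = raw_vector_bytes(n_chunks, dimensions) / 1e9
    print(f"{dimensions:>5} dims: {gigabytes:.2f} GB of raw vectors")
```

At five million chunks, 384 dimensions works out to 7.68 GB of raw vectors and 3072 dimensions to 61.44 GB, so an eightfold increase in dimension is an eightfold increase in storage before any index overhead.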
How to choose without a full benchmark
You do not need to run the entire MTEB suite yourself. The practical selection process is shorter. Start by sorting the public leaderboard on the retrieval category and noting the top few models, ignoring the headline average entirely. Filter that shortlist by your real constraints: dimension you can afford to store and search, cost and latency you can sustain at your volume, and any deployment requirement like open-weights versus API.
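Those two steps can be sketched as a sort followed by a filter. Everything in the example is hypothetical: the `Candidate` fields, the model names, the scores, and the thresholds are placeholders for whatever you read off the leaderboard and your own budget.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    retrieval_score: float   # retrieval sub-score, not the headline average
    dimensions: int
    latency_ms: float        # measured on your own hardware or API plan
    open_weights: bool


def shortlist(candidates, max_dimensions, max_latency_ms, require_open_weights, top_n=3):
    """Take the top_n models by retrieval score, then drop any that break a constraint."""
    ranked = sorted(candidates, key=lambda c: c.retrieval_score, reverse=True)
    return [
        c for c in ranked[:top_n]
        if c.dimensions <= max_dimensions
        and c.latency_ms <= max_latency_ms
        and (c.open_weights or not require_open_weights)
    ]


# Invented candidates for illustration only.
candidates = [
    Candidate("model_a", 57.0, 4096, 40.0, True),
    Candidate("model_b", 55.0, 1024, 12.0, True),
    Candidate("model_c", 54.0, 768, 9.0, False),
    Candidate("model_d", 50.0, 384, 5.0, True),
]

picked = shortlist(candidates, max_dimensions=1024, max_latency_ms=20.0,
                   require_open_weights=True)
print([c.name for c in picked])
```

If the filter empties the shortlist, widening `top_n` is usually a better response than relaxing a constraint you cannot actually afford.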
Then validate the shortlist on a small sample of your own data, which is the step that actually decides it. Take fifteen or twenty real queries from your domain and the passages that should answer them, embed both with each candidate model, and check which model retrieves the right passage most reliably. This is a small, honest, first-party test you can run in an afternoon, and it beats any leaderboard ranking because it measures the model on your content and your queries rather than on a benchmark's. The leaderboard narrows the field; your own small sample picks the winner.
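A minimal sketch of that test follows. It scores each candidate by top-1 hit rate: the fraction of queries whose most similar passage is the one you expected. The `embed` argument is any function that turns a list of strings into a list of vectors; `toy_embed` is a crude hashed bag-of-words stand-in included only so the sketch runs without downloading a model, and the sample queries and passages are invented.

```python
import math
import zlib


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_hit_rate(embed, queries, passages, expected):
    """Fraction of queries whose best-matching passage is the expected one.

    embed:    callable mapping a list of strings to a list of vectors
    expected: expected[i] is the index in `passages` that should answer queries[i]
    """
    if not queries:
        return 0.0
    query_vecs = embed(queries)
    passage_vecs = embed(passages)
    hits = 0
    for q_vec, want in zip(query_vecs, expected):
        sims = [cosine(q_vec, p_vec) for p_vec in passage_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == want)
    return hits / len(queries)


def toy_embed(texts, size=256):
    """Stand-in embedder (hashed word counts). Replace with a real model."""
    vectors = []
    for text in texts:
        vec = [0.0] * size
        for word in text.lower().split():
            vec[zlib.crc32(word.encode("utf-8")) % size] += 1.0
        vectors.append(vec)
    return vectors


# Invented sample data; use fifteen or twenty real queries from your domain.
passages = [
    "Returns are accepted within 30 days of delivery with the original receipt.",
    "Standard shipping takes three to five business days within the country.",
]
queries = ["how long does shipping take", "can I return an item"]
expected = [1, 0]

candidates = {"toy_baseline": toy_embed}  # add one entry per shortlisted model
for name, embed in candidates.items():
    print(name, top1_hit_rate(embed, queries, passages, expected))
```

To evaluate a real candidate, pass its encoding function as `embed`. With the sentence-transformers library, for example, the `encode` method of a loaded `SentenceTransformer` accepts a list of strings and should slot in directly. Top-1 hit rate is a deliberately strict choice; if your pipeline retrieves several passages per query, a top-k variant may reflect it better.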
The contrarian conclusion is that the embedding-model decision is not won by reading a leaderboard more carefully. It is won by testing the shortlist on your own data; the leaderboard's role is only to narrow the candidates so your small test is tractable. Teams that pick by the headline number and skip the validation are optimizing for a benchmark's average instead of their own retrieval reality, which is the same category error as optimizing for citation count instead of the signals that actually predict outcomes. The model that retrieves your passages for your queries is the right model, whatever its rank, and the only way to know that is to test it on the content you are actually trying to get cited. It is the same first-party discipline behind measuring readiness on your real pages rather than trusting a generic score.
Sources
- MTEB leaderboard on Hugging Face: the public benchmark, sortable by task category. huggingface.co
- MTEB paper: the methodology behind the benchmark and its task categories. arxiv.org/abs/2210.07316
- OpenAI, embeddings guide: dimension, cost, and retrieval considerations for production embeddings. platform.openai.com
- Hugging Face, sentence-transformers documentation: running and evaluating embedding models on your own data. sbert.net
- Website AI Score, AEO scoring signals: why outcome-based measurement beats proxy benchmarks.

