A reranker is the second pass in retrieval: after the embedding search returns a rough candidate set, the reranker re-scores those candidates against the query with a more precise model and reorders them. It is where many citations are won or lost, because the reranker, not the initial search, often decides final order. Two families dominate: open-weight cross-encoders like BGE-Reranker and hosted API rerankers like Cohere Rerank. Neither is universally better. BGE wins on cost, control, and on-prem deployment; Cohere wins on convenience and managed scaling. The choice is an engineering tradeoff, not a quality verdict. This is how reranking works and when each family fits.
Most explanations of AI retrieval stop at the embedding search: you embed the query, find the nearest chunks, done. That is only the first pass, and it is the imprecise one. Embedding search is fast and approximate; it casts a wide net and returns candidates that are roughly relevant. The reranking step is what turns "roughly relevant" into "correctly ordered," and the order is what determines which passage the engine actually uses. A chunk that the embedding search ranked fifth can be promoted to first by a reranker, or a chunk that ranked first can be demoted. The final order is often the reranker's decision, not the search's.
This matters for AEO because the reranker sits at the attribution layer, deciding which of the retrieved candidates is good enough to quote. Understanding the two dominant reranker families, and when each fits, is useful whether you are building your own retrieval pipeline or reasoning about how the engines that cite you make their final ordering decisions.
What a reranker actually does
The embedding search and the reranker use fundamentally different mechanisms. Embedding search is a bi-encoder approach: it embeds the query and the documents separately, ahead of time, and compares them by vector distance. This is fast because the document embeddings are precomputed, but it is approximate because the query and document never actually meet; they are compared as independent vectors. It is good enough to find candidates and not precise enough to order them well.
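The first pass can be sketched in a few lines. This is a minimal illustration, not any engine's actual implementation: the three-dimensional vectors, the document ids, and the `embedding_search` helper are all made up for the example, and a real system would use a learned embedding model and an approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Document vectors are computed once, ahead of time.
# Toy values, invented for illustration only.
DOC_VECTORS = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.7, 0.6, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}

def embedding_search(query_vector, doc_vectors, top_k=2):
    """Return the top_k (doc_id, similarity) pairs for the query vector."""
    scored = [(doc_id, cosine(query_vector, vec))
              for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# At query time only the query is embedded; the documents are never re-read.
candidates = embedding_search([1.0, 0.2, 0.0], DOC_VECTORS)
```

Note that the query text and the document text never appear in the same computation: only their vectors are compared, which is the source of both the speed and the imprecision.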
A reranker is typically a cross-encoder: it takes the query and a candidate document together, as a pair, and runs them through the model jointly to produce a relevance score. Because the query and document are processed together, the cross-encoder can model their interaction directly, which makes it far more precise than the bi-encoder search. The cost is that it cannot precompute; it has to run fresh for every query-document pair at query time, which is why it only runs on the small candidate set the search already narrowed, not the whole corpus. Search casts the net, the reranker sorts the catch.
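The second pass can be sketched the same way. The important caveat is that `toy_pair_score` below is a stand-in, not a cross-encoder: it counts shared words so the example runs without a model. What it does share with a real cross-encoder is the shape of the computation, one joint score per query-document pair, computed at query time over a short candidate list. The candidate texts and function names are hypothetical.

```python
def toy_pair_score(query, document):
    """Stand-in for a cross-encoder: scores the pair jointly.

    A real cross-encoder runs both texts through one model together.
    This toy version only measures the fraction of query words that
    the document contains.
    """
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms)

def rerank(query, candidates, score_pair=toy_pair_score, top_n=None):
    """Re-score first-pass candidates against the query and reorder them."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    # Python's sort is stable, so ties keep their first-pass order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]

# Candidates in the order a first-pass search might have returned them.
first_pass = [
    "rerankers and embeddings are both used in retrieval systems",
    "vector databases store embeddings for fast search",
    "a reranker re-scores candidates against the query and reorders them",
]
query = "how does a reranker reorder candidates"
final_order = rerank(query, first_pass)
```

In this toy run the passage the first pass placed third is promoted to first, which is the reordering effect described above.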
BGE-Reranker: the open-weight family
BGE-Reranker is a family of open-weight cross-encoder rerankers you download and run yourself. Its advantages are the advantages of open weights generally. There is no per-query API fee, which matters enormously at scale: reranking runs on every query, so a metered fee compounds fast, while an owned model costs only the compute you provision for it. It runs on your own hardware, so your data never leaves your perimeter, which is decisive for regulated industries or sensitive content. And you have full control: you can fine-tune it on your domain, deploy it on-prem, and tune its behavior to your needs.
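One common way to run such a model locally is the `CrossEncoder` class from the sentence-transformers library. The sketch below assumes that library is installed and that the `BAAI/bge-reranker-base` checkpoint is the variant you want; both are choices made for this example, not recommendations from the model's authors. The helper names `load_bge_reranker` and `rerank_locally` are hypothetical.

```python
def load_bge_reranker(model_name="BAAI/bge-reranker-base"):
    """Load an open-weight cross-encoder (downloads weights on first use).

    Assumes `pip install sentence-transformers`. The import is kept
    inside the function so the rest of this sketch runs without it.
    """
    from sentence_transformers import CrossEncoder
    return CrossEncoder(model_name)

def rerank_locally(model, query, documents):
    """Score each (query, document) pair with the model, best first.

    `model` is anything with a `predict(pairs)` method that returns
    one score per pair, which is how CrossEncoder behaves.
    """
    pairs = [(query, doc) for doc in documents]
    scores = model.predict(pairs)
    ranked = sorted(zip(documents, scores),
                    key=lambda item: float(item[1]), reverse=True)
    return [(doc, float(score)) for doc, score in ranked]
```

Usage would look like `rerank_locally(load_bge_reranker(), query, candidates)`. Everything happens on your own machine, which is the data-perimeter point made above; the cost is that you now own the hardware, batching, and latency.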
The cost of BGE is operational. You run the infrastructure, you handle the scaling, you own the tuning and the maintenance. For a team with ML engineering capacity and either a scale or a data-sensitivity reason, this is often the right choice, and the published benchmarks show the stronger BGE-Reranker variants performing competitively with hosted options on standard reranking tasks. But "competitive on benchmarks" is not the deciding factor; the deciding factor is whether you have the engineering capacity to run it well, because a poorly deployed open-weight reranker underperforms a well-run hosted one.
Cohere Rerank: the hosted family
Cohere Rerank is a hosted API reranker. You send it the query and the candidates, it returns them reordered with relevance scores. Its advantages are the advantages of managed services. There is no infrastructure to run, scaling is handled for you, and integration is fast, often a single API call dropped into your pipeline. For a team without ML infrastructure capacity, or one that wants to ship quickly and not own a model deployment, this convenience is real and valuable.
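The "single API call" can be sketched as follows. This assumes the official `cohere` Python SDK, whose client exposes a `rerank` method returning results that carry an `index` into the submitted documents and a `relevance_score`. The model name `rerank-v3.5` is an assumption for the example; check the current Cohere documentation for the model identifiers available to you. The wrapper name `rerank_hosted` is hypothetical.

```python
def rerank_hosted(client, query, documents, model="rerank-v3.5", top_n=3):
    """Send the query and candidates to a hosted reranker, best first.

    `client` is assumed to be a Cohere SDK client, created for example
    with `cohere.ClientV2(api_key=...)`. The model name is an assumption
    and should be checked against the current documentation.
    """
    response = client.rerank(
        model=model,
        query=query,
        documents=documents,
        top_n=top_n,
    )
    # Each result points back at the original list by position.
    return [(documents[result.index], result.relevance_score)
            for result in response.results]
```

The whole integration is one network round trip per query, which is the convenience; it is also the point at which the query and the candidate texts leave your infrastructure.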
The costs are the mirror image of BGE's advantages. There is a per-query fee that compounds at scale, so a high-volume pipeline can become expensive in a way an owned model would not. And your data leaves your perimeter to reach the API, which is a non-starter for some compliance regimes and a consideration for any sensitive content. The hosted reranker trades cost and control for convenience and managed operations, which is exactly the right trade for some teams and exactly the wrong one for others.
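The point at which a per-query fee overtakes an owned model is simple arithmetic. The figures below are placeholders invented for the example, not real prices from any vendor, and the model is deliberately crude: it treats self-hosting as a flat monthly cost, which ignores engineering time and the fact that hardware needs also grow with volume.

```python
def monthly_api_cost(queries_per_month, price_per_1k_queries):
    """Metered cost: grows linearly with query volume."""
    return queries_per_month / 1000 * price_per_1k_queries

def break_even_queries(fixed_monthly_cost, price_per_1k_queries):
    """Monthly volume at which API fees equal a flat self-hosting cost."""
    return fixed_monthly_cost / price_per_1k_queries * 1000

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
PRICE_PER_1K = 2.00            # assumed API fee per 1,000 rerank queries
SELF_HOSTED_MONTHLY = 1500.00  # assumed flat monthly cost of owned serving

threshold = break_even_queries(SELF_HOSTED_MONTHLY, PRICE_PER_1K)
low_volume_bill = monthly_api_cost(100_000, PRICE_PER_1K)
high_volume_bill = monthly_api_cost(5_000_000, PRICE_PER_1K)
```

With these assumed numbers the break-even is 750,000 queries a month: at 100,000 queries the API bill is 200.00 and hosted is clearly cheaper, while at 5,000,000 queries it is 10,000.00 and the owned model wins. Substituting your own figures is the useful part of the exercise.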
When each one wins
The honest framing is that this is an engineering fit decision, not a quality ranking. BGE wins when you have scale that makes per-query API fees painful, when your data cannot leave your perimeter, or when you have the ML engineering capacity to run and tune it well and want the control. Cohere wins when you want to ship fast without owning infrastructure, when your volume is modest enough that per-query cost is not the dominant concern, and when your data sensitivity allows an external API.
The mistake is treating this as "which reranker is better" and looking for a benchmark winner, the same error as picking an embedding model by its headline MTEB score. The published reranking benchmarks tell you these models are broadly competitive on standard tasks; they do not tell you which fits your cost structure, your compliance requirements, and your team's capacity, which are the factors that actually decide it. As with every other retrieval component, the right choice is the one that performs on your data under your constraints, validated on a small real sample rather than chosen by leaderboard.
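Validating on a small real sample can be as simple as computing mean reciprocal rank (MRR) over a handful of labeled queries. The sketch below is one possible harness, with MRR swapped in as a convenient metric rather than one the benchmarks above necessarily use. It assumes each sample has a single known-correct candidate and that a reranker is any function taking a query and a list of candidate ids and returning them reordered. The two "rerankers" and the sample data are toys.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1/position of the relevant item, or 0.0 if it is missing."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rerank_fn, sample):
    """Average reciprocal rank of rerank_fn over labeled queries.

    sample: list of (query, candidate_ids, relevant_id) tuples.
    """
    total = sum(reciprocal_rank(rerank_fn(query, candidates), relevant)
                for query, candidates, relevant in sample)
    return total / len(sample)

# Toy stand-ins for two rerankers under comparison.
def keep_order(query, candidates):
    return list(candidates)

def reverse_order(query, candidates):
    return list(reversed(candidates))

# Hypothetical labeled sample: which candidate should rank first.
SAMPLE = [
    ("q1", ["a", "b", "c"], "c"),
    ("q2", ["d", "e"], "d"),
]

score_keep = mean_reciprocal_rank(keep_order, SAMPLE)
score_reverse = mean_reciprocal_rank(reverse_order, SAMPLE)
```

Run against a few dozen of your own queries with each candidate reranker plugged in as `rerank_fn`, a harness like this says more about fit than a leaderboard position does.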
Why this matters even if you never run a reranker
Most people reading this will never deploy a reranker; the engines that cite them run the rerankers. The reason to understand it anyway is that it explains a behavior you will otherwise find baffling: why a page that should obviously be the top result for a query sometimes loses to one that looks less relevant, and why a page that looked like a weaker match sometimes wins. The reranker is making the final ordering call on a precise query-document interaction, and its judgment can differ from what the rough embedding similarity suggested. Knowing the reranker exists explains why final citation order is not simply "closest embedding wins."
The takeaway for content is the same as it has been throughout the retrieval stack: the reranker rewards passages that precisely and directly answer the query when examined closely, because the cross-encoder examines the query and your passage together in detail. A passage that is topically near the query but does not directly answer it survives the embedding search and then loses the rerank, because under close joint examination it does not actually respond to what was asked. Writing passages that directly and completely answer specific questions is what wins the rerank. It is the same discipline that wins the attribution layer of the citation stack, and the same reason leading with a direct answer outperforms building toward one.
Sources
- BAAI, BGE-Reranker model cards: the open-weight reranker family and its published performance. huggingface.co/BAAI
- Cohere, Rerank documentation: the hosted reranker API and its usage. docs.cohere.com
- Pinecone, rerankers and two-stage retrieval: a technical overview of why reranking improves retrieval. pinecone.io
- Hugging Face, Sentence Transformers documentation on cross-encoders: how cross-encoder rerankers differ from bi-encoder search. sbert.net
- Website AI Score, the Citation Stack: where reranking sits in the attribution layer.

