We often pick embedding models based on public benchmarks like MTEB, along with other factors such as reputation and popularity. But how often do we make an informed decision by validating the model against our own dataset, on our own machines, under conditions we control and close to the real application?
To pick the right embedding model, we need to measure a few things ourselves. Here is the list of metrics we should measure:
| Metric | What it measures | Target |
|---|---|---|
| Encode throughput | Documents embedded per second | ⬆ Higher |
| Query latency mean | Avg time from query to results (ms) | ⬇ Lower |
| Query latency p95 | Latency that 95% of queries stay at or under (ms) | ⬇ Lower |
| Query latency p99 | Tail latency — slowest 1 in 100 queries | ⬇ Lower |
| Recall@1 | Top result was relevant | ⬆ Higher · ideal 1.0 |
| Recall@3 | Relevant doc found in top 3 | ⬆ Higher · ideal 1.0 |
| Recall@5 | Relevant doc found in top 5 | ⬆ Higher · ideal 1.0 |
| MRR | Mean reciprocal rank of the first relevant doc | ⬆ Higher · ideal 1.0 |
| Cosine distribution | Similarity score spread across top-K hits | ⬆ Mean · ↔ narrow spread |
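
To make the retrieval-quality and latency rows concrete, here is a minimal sketch of how they could be computed from per-query results. It is not taken from EmbedComp: the function names and the input shape (one `(ranked_ids, relevant_ids)` pair per query) are assumptions for illustration. Recall@K follows the table's definition (a relevant doc appears in the top K); strict recall would instead divide by the number of relevant docs.

```python
import math
import statistics


def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant doc is in the top-k results, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs, latencies_ms, ks=(1, 3, 5)):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query (assumed shape)."""
    report = {
        f"recall@{k}": statistics.mean(hit_at_k(r, rel, k) for r, rel in runs)
        for k in ks
    }
    report["mrr"] = statistics.mean(reciprocal_rank(r, rel) for r, rel in runs)
    report["latency_mean_ms"] = statistics.mean(latencies_ms)
    report["latency_p95_ms"] = percentile(latencies_ms, 95)
    report["latency_p99_ms"] = percentile(latencies_ms, 99)
    return report


# Toy data: three queries and ten latency samples, made up for the example.
runs = [
    (["d1", "d2", "d3"], {"d1"}),  # relevant doc at rank 1
    (["d4", "d5", "d6"], {"d6"}),  # relevant doc at rank 3
    (["d7", "d8", "d9"], {"d0"}),  # relevant doc not retrieved
]
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
report = summarize(runs, latencies_ms)
print(report)
```

Encode throughput is simply the number of documents divided by the wall-clock seconds spent encoding them, so it needs no helper of its own.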
But how?
I set out to write a simple program, one thing led to another, and I ended up creating a full open source project named EmbedComp.
These are the models we have validated so far:
- 'e5-base-v2'
- 'bge-base-en'
- 'multilingual-e5-base'
- 'all-MiniLM-L6-v2'
- 'nomic-embed-v1'
EmbedComp is a framework plus a Jupyter notebook, built around a completely metadata/configuration-driven approach: you define the list of models you want to benchmark and run the validation.
# Short name -> Hugging Face model ID
MODELS = {
    'e5-base-v2': 'intfloat/e5-base-v2',
    'bge-base-en': 'BAAI/bge-base-en-v1.5',
    'multilingual-e5-base': 'intfloat/multilingual-e5-base',
    'all-MiniLM-L6-v2': 'sentence-transformers/all-MiniLM-L6-v2',
    'nomic-embed-v1': 'nomic-ai/nomic-embed-text-v1',
}

# Prefix prepended to every query before encoding
QUERY_PREFIX = {
    'e5-base-v2': 'query: ',
    'bge-base-en': 'Represent this sentence for searching relevant passages: ',
    'multilingual-e5-base': 'query: ',
    'all-MiniLM-L6-v2': '',
    # nomic-embed-text-v1 expects task prefixes, not the E5-style ones
    'nomic-embed-v1': 'search_query: ',
}

# Prefix prepended to every document before encoding
DOC_PREFIX = {
    'e5-base-v2': 'passage: ',
    'bge-base-en': '',
    'multilingual-e5-base': 'passage: ',
    'all-MiniLM-L6-v2': '',
    'nomic-embed-v1': 'search_document: ',
}
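
As a rough illustration of how prefix tables like these are typically consumed, the sketch below prepends the per-model prefix to queries and documents before handing them to an encoder. The helper names (`with_prefix`, `encode_for_model`) and the `encode` callable are hypothetical and not taken from EmbedComp; the two-entry dictionaries are a trimmed copy of the configuration above so the block runs on its own.

```python
# Trimmed copies of the config above, so this sketch is self-contained.
QUERY_PREFIX = {'e5-base-v2': 'query: ', 'all-MiniLM-L6-v2': ''}
DOC_PREFIX = {'e5-base-v2': 'passage: ', 'all-MiniLM-L6-v2': ''}


def with_prefix(texts, prefix):
    """Prepend the model-specific prefix to every text."""
    return [prefix + text for text in texts]


def encode_for_model(model_key, queries, docs, encode):
    """Encode queries and docs with the right prefix for `model_key`.

    `encode` is any callable that maps a list of strings to embeddings
    (hypothetical parameter; a real run would pass a model's encode method).
    """
    query_vecs = encode(with_prefix(queries, QUERY_PREFIX.get(model_key, '')))
    doc_vecs = encode(with_prefix(docs, DOC_PREFIX.get(model_key, '')))
    return query_vecs, doc_vecs


# Stand-in encoder that just echoes its input, to show what the model would see.
echo = lambda texts: list(texts)
q_seen, d_seen = encode_for_model(
    'e5-base-v2', ['what is covid-19?'], ['COVID-19 is a disease.'], echo
)
print(q_seen, d_seen)
```

With the sentence-transformers library, `encode` could be `SentenceTransformer(MODELS[model_key]).encode`. Keeping the prefixes in configuration means the benchmarking loop stays identical across models.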
For brevity, and to keep the notebook self-contained, I have used a public dataset available on Hugging Face. Even the models load directly from Hugging Face.
dataset = load_dataset('BeIR/trec-covid', 'corpus')['corpus'].select(range(CORPUS_SIZE))
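
Once the corpus is loaded, each row has to be flattened into a single string before it can be embedded. The sketch below shows one way to do that, assuming the BeIR corpus rows expose `_id`, `title` and `text` fields. The helper names are hypothetical, and joining title and body with a space is just one reasonable choice, not necessarily what EmbedComp does.

```python
def row_to_text(row):
    """Join title and body into one string, skipping whichever is empty."""
    title = (row.get("title") or "").strip()
    body = (row.get("text") or "").strip()
    return " ".join(part for part in (title, body) if part)


def corpus_to_lists(rows):
    """Split corpus rows into parallel lists of doc IDs and embeddable texts."""
    doc_ids, texts = [], []
    for row in rows:
        doc_ids.append(row["_id"])
        texts.append(row_to_text(row))
    return doc_ids, texts


# Plain dicts stand in for dataset rows here; the content is made up.
sample_rows = [
    {"_id": "a1", "title": "Coronavirus overview", "text": "A short summary."},
    {"_id": "b2", "title": "", "text": "Body without a title."},
]
doc_ids, texts = corpus_to_lists(sample_rows)
print(doc_ids, texts)
```

Keeping `doc_ids` in the same order as `texts` lets the position of each embedding map back to the document ID used in the relevance judgments.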
The entire source code is available in the following GitHub repo:
https://github.com/AKSarav/EmbedComp
Here is the HTML version of the entire notebook, with all the code and screenshots.
Thanks
Sarav





