Benchmarking Open Embedding models the effective way | AI Engineering

We often pick embedding models based on public benchmarks like MTEB, plus other factors such as reputation and popularity. But how often do we make an informed decision by validating the model against our own dataset, on our own machines, under curated conditions that are close to the real application?

To pick the right embedding model, we need to validate a few metrics and parameters. Here is the list of metrics we should measure:

Metric              | What it measures                                        | Target
--------------------|---------------------------------------------------------|---------------------------------
Encode throughput   | Documents embedded per second                           | ⬆ Higher
Query latency mean  | Avg time from query to results (ms)                     | ⬇ Lower
Query latency p95   | Worst-case latency for 95% of queries                   | ⬇ Lower
Query latency p99   | Tail latency: the slowest 1 in 100 queries              | ⬇ Lower
Recall@1            | Top result was relevant                                 | ⬆ Higher · ideal 1.0
Recall@3            | Relevant doc found in top 3                             | ⬆ Higher · ideal 1.0
Recall@5            | Relevant doc found in top 5                             | ⬆ Higher · ideal 1.0
MRR                 | Mean reciprocal rank (1/rank) of the first relevant doc | ⬆ Higher · ideal 1.0
Cosine distribution | Similarity score spread across top-K hits               | ⬆ Higher mean · ↔ narrow spread
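
To make the retrieval-quality rows concrete, here is a minimal sketch of Recall@k and MRR in plain Python. It assumes the "hit" style of recall described in the table (a query scores 1.0 if at least one relevant doc appears in the top k), and the function names and toy doc ids are made up for illustration; EmbedComp itself may compute these differently.

```python
from statistics import mean

def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant doc appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs, ks=(1, 3, 5)):
    """Average per-query scores; `runs` is a list of (ranked_ids, relevant_ids)."""
    scores = {f"recall@{k}": mean(recall_at_k(r, rel, k) for r, rel in runs) for k in ks}
    scores["mrr"] = mean(reciprocal_rank(r, rel) for r, rel in runs)
    return scores

# Toy data with made-up doc ids (not taken from any real dataset)
runs = [
    (["d1", "d7", "d3"], {"d1"}),  # relevant doc at rank 1
    (["d4", "d2", "d9"], {"d9"}),  # relevant doc at rank 3
    (["d5", "d6", "d8"], {"d0"}),  # relevant doc not retrieved
]
print(evaluate(runs))
```

On this toy data one of three queries hits at rank 1 and two of three hit within the top 3, so Recall@1 is about 0.33, Recall@3 about 0.67, and MRR about 0.44.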

But how?

I thought of writing a simple program, one thing led to another, and I ended up creating a full open source project named EmbedComp.

These are the models we have tried to validate:

  • 'e5-base-v2'
  • 'bge-base-en'
  • 'multilingual-e5-base'
  • 'all-MiniLM-L6-v2'
  • 'nomic-embed-v1'

It is a framework and a Jupyter Notebook built on a completely metadata/configuration-driven approach: you define the list of models you want to benchmark and run the validation.

MODELS = {
    'e5-base-v2':           'intfloat/e5-base-v2',
    'bge-base-en':          'BAAI/bge-base-en-v1.5',
    'multilingual-e5-base': 'intfloat/multilingual-e5-base',
    'all-MiniLM-L6-v2':    'sentence-transformers/all-MiniLM-L6-v2',
    'nomic-embed-v1':       'nomic-ai/nomic-embed-text-v1'
}

# Each model expects its own task prefix; an empty string means no prefix.
QUERY_PREFIX = {
    'e5-base-v2':           'query: ',
    'bge-base-en':          'Represent this sentence for searching relevant passages: ',
    'multilingual-e5-base': 'query: ',
    'all-MiniLM-L6-v2':     '',
    # nomic-embed-text-v1 documents 'search_query: ' / 'search_document: ' as its prefixes
    'nomic-embed-v1':       'search_query: '
}
DOC_PREFIX = {
    'e5-base-v2':           'passage: ',
    'bge-base-en':          '',
    'multilingual-e5-base': 'passage: ',
    'all-MiniLM-L6-v2':     '',
    'nomic-embed-v1':       'search_document: '
}
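
As a rough illustration of how a configuration like this can drive the benchmark, the sketch below prepends the per-model document prefix and times an encode call to get documents per second. The `fake_encode` function is a hypothetical stand-in so the example runs without downloading any model; in a real run you would pass the model's own encode function instead. The helper names and the trimmed two-model config are assumptions for illustration, not EmbedComp's actual code.

```python
import time

# Trimmed copy of the config (two models only, to keep the sketch short)
MODELS = {
    'e5-base-v2':       'intfloat/e5-base-v2',
    'all-MiniLM-L6-v2': 'sentence-transformers/all-MiniLM-L6-v2',
}
DOC_PREFIX = {'e5-base-v2': 'passage: ', 'all-MiniLM-L6-v2': ''}

def with_prefix(texts, prefix):
    """Prepend the model-specific prefix to every text."""
    return [prefix + t for t in texts]

def fake_encode(texts):
    """Hypothetical stand-in encoder: returns a tiny 2-d 'vector' per text."""
    return [[float(len(t)), float(t.count(' '))] for t in texts]

def benchmark_encode(name, docs, encode=fake_encode):
    """Time one encode pass over `docs` and report documents per second."""
    inputs = with_prefix(docs, DOC_PREFIX.get(name, ''))
    start = time.perf_counter()
    vectors = encode(inputs)
    elapsed = time.perf_counter() - start
    rate = len(vectors) / elapsed if elapsed > 0 else float('inf')
    return {'model': name, 'hub_id': MODELS[name], 'docs': len(vectors), 'docs_per_sec': rate}

docs = ['first toy document', 'second toy document', 'third one']
results = [benchmark_encode(name, docs) for name in MODELS]
for row in results:
    print(row)
```

Adding a model to the comparison then only means adding one entry to each dictionary; the loop itself does not change.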

For brevity, and to keep the notebook self-contained, I have used a public dataset available on Hugging Face; even the models load directly from there.

from datasets import load_dataset
dataset = load_dataset('BeIR/trec-covid', 'corpus')['corpus'].select(range(CORPUS_SIZE))
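
Once a corpus is loaded and indexed, the latency rows of the metrics table could be collected roughly as follows. This is a sketch using only the standard library; `search` is a hypothetical callable that stands for "embed the query and return the top hits", and it is not a function from EmbedComp.

```python
import time
from statistics import mean, quantiles

def time_queries(search, queries):
    """Run each query once and return per-query latencies in milliseconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def latency_summary(latencies_ms):
    """Mean, p95 and p99 of per-query latencies (needs at least 2 samples)."""
    cuts = quantiles(latencies_ms, n=100, method='inclusive')  # 99 cut points
    return {
        'mean_ms': mean(latencies_ms),
        'p95_ms': cuts[94],  # 95th percentile
        'p99_ms': cuts[98],  # 99th percentile
    }

# Usage with a dummy search function (an assumption, purely for illustration)
latencies = time_queries(lambda q: sorted(q), ['what is covid', 'vaccine trial results'])
print(latency_summary(latencies))
```

With only a handful of queries the p95 and p99 values are mostly interpolation, so a few hundred queries or more gives far more meaningful tail numbers.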

The entire source code is available in the following GitHub repo:

https://github.com/AKSarav/EmbedComp

Here is the HTML version of the entire notebook, with all the code and screenshots.

Thanks
Sarav