Coursify

Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

Using the MTEB Leaderboard

Learn to use the Massive Text Embedding Benchmark (MTEB) leaderboard to select the right embedding model. Understand retrieval, clustering, classification, and semantic similarity tasks.

With hundreds of embedding models available, how do you know which one to pick? The answer lies in the MTEB (Massive Text Embedding Benchmark) leaderboard.

Hosted on Hugging Face, the MTEB leaderboard is the most widely used public reference for comparing text embedding models across dozens of datasets and languages.

Learning Goals

  • Navigate the MTEB leaderboard to select embedding models.
  • Interpret benchmark scores across retrieval and clustering tasks.
  • Understand the difference between 'Average' scores and task-specific performance.

Core Concepts

1. What is MTEB?

The original MTEB benchmark groups its datasets into 8 task types:

  • Retrieval: Finding the right document for a query (the most critical task for RAG).
  • Clustering: Grouping similar documents together.
  • Classification: Predicting categories for text.
  • Semantic Textual Similarity (STS): Measuring how similar two sentences are.
  • Reranking: Ordering a set of documents by relevance.
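  • Summarization: Scoring how well machine-written summaries match human-written ones.
  • Pair Classification: Deciding whether two texts belong together (for example, whether they are paraphrases or duplicates).
  • Bitext Mining: Matching sentences with their translations across languages.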
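
To make the Retrieval task concrete, here is a minimal sketch of what a retrieval benchmark exercises: a query and a set of documents are embedded, and the documents are ranked by cosine similarity to the query. The three-dimensional vectors and the helper names (`cosine_similarity`, `rank_documents`) are made up for illustration; a real embedding model would produce the vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored]

# Made-up toy embeddings (a real model outputs hundreds of dimensions).
query = [0.9, 0.1, 0.0]
documents = {
    "doc_refund_policy": [0.8, 0.2, 0.1],
    "doc_shipping_times": [0.1, 0.9, 0.2],
    "doc_company_history": [0.0, 0.2, 0.9],
}

ranking = rank_documents(query, documents)
print(ranking)
```

A model that scores well on retrieval is one whose vectors put the truly relevant document at the top of this kind of ranking more often.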

2. How to Read the Leaderboard

When you visit the MTEB Leaderboard, you'll see several columns (exact column names can vary between leaderboard versions):

  • Rank: Overall position based on average score.
  • Model Name: The identifier used to download the model (e.g., BAAI/bge-large-en-v1.5).
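  • Retrieval Average: The mean score across the retrieval datasets, and the most important metric for RAG developers.
  • Params (M): The number of parameters, in millions. This is a rough proxy for memory usage and inference speed.
  • Embedding Dimensions: The size of the output vector, which determines how much storage each embedded chunk needs.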
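
The Embedding Dimensions column translates directly into storage. Below is a rough back-of-the-envelope sketch, assuming vectors are stored as uncompressed 32-bit floats and ignoring index overhead and metadata. Both are simplifying assumptions: your vector database may quantize vectors or add its own overhead. The function names are hypothetical.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors: one value per dimension per vector."""
    return num_vectors * dimensions * bytes_per_value

def to_gib(num_bytes):
    """Convert bytes to gibibytes (1 GiB = 1024**3 bytes)."""
    return num_bytes / (1024 ** 3)

# One million chunks at two common output sizes.
small = embedding_storage_bytes(1_000_000, 384)
large = embedding_storage_bytes(1_000_000, 1024)

print(f"384 dims:  {to_gib(small):.2f} GiB")
print(f"1024 dims: {to_gib(large):.2f} GiB")
```

Under these assumptions, moving from 384 to 1024 dimensions raises raw vector storage for a million chunks from about 1.43 GiB to about 3.81 GiB.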

The Decision Workflow

Finding your Model

  1. Go to the Hugging Face MTEB space.
  2. Use the tabs to filter for English, Chinese, or Multilingual models.
  3. Click the 'Retrieval' column header to see which models excel at finding relevant context.
  4. Look at the 'Params' column. A 100M parameter model will be much faster and cheaper to run than a 7B parameter model, even if the 7B model has a slightly higher score.
  5. Ensure the 'Embedding Dimensions' fit within your vector database's limits and your storage budget.
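
The workflow above can be sketched as a small filter-and-sort routine, assuming you have copied a few leaderboard rows by hand. The `LeaderboardRow` class, the `shortlist` function, and every model name and number below are invented placeholders, not real leaderboard data.

```python
from dataclasses import dataclass

@dataclass
class LeaderboardRow:
    model: str
    retrieval: float   # retrieval task average
    params_m: float    # parameters, in millions
    dimensions: int    # embedding output size

def shortlist(rows, max_params_m, max_dimensions):
    """Keep models within the size limits, best retrieval score first."""
    eligible = [
        row for row in rows
        if row.params_m <= max_params_m and row.dimensions <= max_dimensions
    ]
    return sorted(eligible, key=lambda row: row.retrieval, reverse=True)

# Placeholder rows: names and numbers are invented for illustration.
rows = [
    LeaderboardRow("example/huge-7b", retrieval=60.0, params_m=7000, dimensions=4096),
    LeaderboardRow("example/small", retrieval=52.0, params_m=33, dimensions=384),
    LeaderboardRow("example/large", retrieval=54.0, params_m=335, dimensions=1024),
]

candidates = shortlist(rows, max_params_m=500, max_dimensions=1024)
for row in candidates:
    print(row.model, row.retrieval)
```

Applying the size limits first (steps 4 and 5) and only then ranking by retrieval (step 3) keeps a model you cannot afford to run from distracting you with its score.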

Practice: Analysis Task

Go to the leaderboard and compare BAAI/bge-small-en-v1.5 with BAAI/bge-large-en-v1.5.

  • Question: Is the 'Retrieval' score for the Large model significantly higher than that of the Small model?
  • Insight: Often, the 'Small' or 'Base' versions of a model family retain most of the retrieval performance of the 'Large' version (around 95% is a common rule of thumb) while running several times faster on the same hardware.
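
One way to answer the question is to put the two rows side by side and compute how much of the score the smaller model keeps. This is a minimal sketch: the `compare` helper is hypothetical, and the numbers passed to it are placeholders to be replaced with the values you read off the leaderboard.

```python
def compare(small_score, large_score, small_params_m, large_params_m):
    """Summarize what the larger model buys and what it costs."""
    return {
        "score_retained": small_score / large_score,
        "size_ratio": large_params_m / small_params_m,
    }

# Placeholder values: substitute the numbers shown on the leaderboard.
result = compare(small_score=50.0, large_score=54.0,
                 small_params_m=30, large_params_m=300)

print(f"Small model keeps {result['score_retained']:.0%} of the score "
      f"at 1/{result['size_ratio']:.0f} of the parameters")
```

Note that the parameter ratio is only a proxy for speed. Actual latency depends on hardware, batch size, and sequence length, so measure it on your own setup before deciding.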

Common Mistakes

  • Sorting by 'Average' Only: A model might have a high average because it's great at classification, but it might be mediocre at retrieval. Always sort by the specific task you are building for.
  • Ignoring Model Size: State-of-the-art models are often huge (7B+ parameters). For many RAG applications, models in the 100M-500M parameter range offer a good balance of speed and accuracy.
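
The first mistake is easy to see with two rows. In this sketch the model names and scores are invented to illustrate the point, not taken from the leaderboard: one model wins on the overall average thanks to classification, while the other is the better retriever.

```python
# Placeholder scores: invented to illustrate the point, not real data.
models = [
    {"name": "example/model-a", "average": 66.0, "classification": 80.0, "retrieval": 50.0},
    {"name": "example/model-b", "average": 64.0, "classification": 72.0, "retrieval": 56.0},
]

best_by_average = max(models, key=lambda m: m["average"])
best_by_retrieval = max(models, key=lambda m: m["retrieval"])

print("Top by average:  ", best_by_average["name"])
print("Top by retrieval:", best_by_retrieval["name"])
```

For a RAG pipeline, the second ordering is the one that matters.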

Recap

  • MTEB is the "Gold Standard" for comparing embedding models.
  • For RAG, the Retrieval score is more important than the overall average.
  • Balance benchmark scores with operational costs (Model Size and Dimensions).

Knowledge Check

Which MTEB task is most relevant for a RAG (Retrieval-Augmented Generation) system?