Retrieval-Augmented Generation (RAG) — From Fundamentals to Production-Ready Agentic RAG Systems

How Embeddings Work — Vector Representation of Text

Understand how text embeddings represent words, sentences, and documents as dense vectors. Learn about semantic similarity measured by cosine distance and dot product.

Learning Goals

Explain how embeddings map text to vector spaces
Calculate semantic similarity using cosine distance

What are Embeddings?

In the world of Retrieval-Augmented Generation (RAG), embeddings are the bridge between human language and machine understanding. They are the mathematical foundation that allows us to perform semantic search—finding information based on meaning rather than just keyword matching.

By the end of this lesson, you will understand what embeddings are, how they represent semantic relationships in high-dimensional space, and why they are indispensable for modern AI applications.

Learning Goals

Define embeddings and their role in Natural Language Processing (NLP).
Explain the concept of vector space and semantic similarity.
Understand the difference between sparse and dense representations.
Identify how similarity measures (such as cosine similarity) are used to compare meaning.

Core Concepts

1. From Words to Numbers

Computers cannot 'read' text. To process language, they must first convert words and sentences into numbers. An embedding is a numerical representation of a piece of data (such as a word, a sentence, or even an image) in the form of a vector (a list of numbers).

Unlike simple encoding (where 'apple' might be 1 and 'orange' might be 2), in which the numbers are arbitrary labels that say nothing about how the words relate, embeddings capture contextual meaning: texts with similar meanings receive similar vectors.

2. High-Dimensional Vector Space

Imagine a plot on which every word is a point. A word like 'king' might be represented by coordinates, not in two or three dimensions but in a space with hundreds or thousands of dimensions. In this space:

Words with similar meanings are located close to each other.
The 'distance' and 'direction' between vectors represent relationships.
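
To make 'close to each other' concrete, here is a minimal sketch of the two measures named in the lesson summary, the dot product and cosine similarity. The three-dimensional vectors are invented for illustration and do not come from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 = same direction, 0 = orthogonal)."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)

# Hypothetical toy "embeddings", hand-written for this example only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "pizza": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_pizza = cosine_similarity(toy_vectors["king"], toy_vectors["pizza"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs pizza: {sim_king_pizza:.3f}")
```

With these toy values, 'king' scores much higher against 'queen' than against 'pizza'. Cosine similarity ignores vector length and compares direction only, while the raw dot product also grows with the vectors' magnitudes; for vectors normalized to length 1 the two give the same value.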

3. Sparse vs. Dense Vectors

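Sparse Vectors (Traditional): Think of a massive list with one slot for every word in the vocabulary. In the simplest scheme, a word that is present gets a 1 and every other slot gets a 0, so the vectors are very long and mostly empty (sparse). TF-IDF replaces the 1s with term weights, and BM25 is a ranking function built on the same kind of term statistics.
Dense Vectors (Modern Embeddings): These are shorter, fixed-length lists of floating-point numbers (e.g., 768 or 1536 dimensions). Almost every value is non-zero, and meaning is spread across the dimensions together rather than assigned to one word per slot.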
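
The sparse side of this comparison can be sketched in a few lines. The example below builds binary bag-of-words vectors over a tiny, made-up corpus (the helper names `tokenize` and `bag_of_words` are illustrative, not from any library) and shows the limitation that dense embeddings are meant to address: two texts with the same meaning but no shared words have zero overlap.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text, vocabulary):
    """Binary sparse vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

corpus = [
    "How to fix a car",
    "Automobile repair guide",
    "I am eating a delicious pizza",
]
# One dimension per distinct word in the corpus.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

query_vec = bag_of_words(corpus[0], vocabulary)
doc_vec = bag_of_words(corpus[1], vocabulary)

# Number of vocabulary words the two texts have in common.
overlap = sum(q * d for q, d in zip(query_vec, doc_vec))
print(len(vocabulary), "dimensions, one per vocabulary word")
print("words shared by query and document:", overlap)
```

Even in this tiny corpus most entries of each vector are 0; with a realistic vocabulary of tens of thousands of words, nearly all of them would be. A dense embedding model would instead be expected to place the first two texts near each other despite the lack of shared words.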

Visualizing Semantic Relationships

[Figure: a simplified 2D view of how embeddings cluster related concepts. In reality, embeddings exist in much higher dimensions.]

How an Embedding is Created

Step 1: You provide a string, for example: "The cat sat on the mat."
Step 2: The model breaks the text into smaller pieces called tokens (words or sub-words).
Step 3: The tokens are passed through a neural network (like BERT or OpenAI's text-embedding-3-small).
Step 4: The model analyzes the relationships between tokens and calculates a fixed-length array of numbers.
Step 5: You receive a vector, e.g., [0.12, -0.45, 0.88, ...] with 1536 values in total.
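
The shape of this pipeline can be sketched without a real model. In the toy version below, every name (`tokenize`, `toy_token_vector`, `toy_embed`, `DIM`) is hypothetical, the 'neural network' is replaced by hash-derived pseudo-random vectors, and the token vectors are combined by averaging, which is one common pooling strategy. The result therefore carries no semantic meaning; the point is only that text of any length goes in and a vector of fixed length comes out.

```python
import hashlib
import re

DIM = 8  # toy size; real models use hundreds or thousands of dimensions

def tokenize(text):
    """Step 2 (toy): lowercase word tokens; real models use sub-word tokenizers."""
    return re.findall(r"[a-z0-9']+", text.lower())

def toy_token_vector(token, dim=DIM):
    """Stand-in for a learned token embedding.

    The values are derived from a hash, so they are deterministic
    but carry NO semantic meaning.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Map each of the first `dim` bytes from 0..255 into the range [-1.0, 1.0].
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]

def toy_embed(text, dim=DIM):
    """Steps 3-5 (toy): average the token vectors into one fixed-length vector."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no tokens")
    vectors = [toy_token_vector(tok, dim) for tok in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

short = toy_embed("The cat sat on the mat.")
long = toy_embed("The cat sat on the mat, then wandered off to find somewhere warmer to sleep.")
print(len(short), len(long))  # both vectors have DIM values
```

A real embedding model replaces `toy_token_vector` and the averaging step with a trained network in which each token's representation depends on the tokens around it.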

Example: The 'King - Man + Woman ≈ Queen' Analogy

A famous property of word embeddings, popularized by word2vec, is that they can capture relationships as geometric offsets. If you take the vector for 'King', subtract 'Man', and add 'Woman', the resulting vector is often closer to the vector for 'Queen' than to that of any other word (once the three input words are excluded). This suggests that the model has 'learned' something like gender as a direction in vector space, although the effect is approximate and does not hold for every analogy.
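
The arithmetic can be traced by hand with a deliberately idealized example. The two-dimensional vectors below are hand-made so that one axis stands for 'royalty' and the other for 'gender'; they are not taken from any trained model, where the relationship would hold only approximately.

```python
import math

# Hypothetical hand-made vectors: [royalty, gender], with +1 = male, -1 = female.
vectors = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8,  1.0],
}

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# king - man + woman, computed dimension by dimension.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Nearest remaining word, excluding the three words used in the query.
excluded = {"king", "man", "woman"}
candidates = {word: vec for word, vec in vectors.items() if word not in excluded}
nearest = min(candidates, key=lambda word: euclidean_distance(candidates[word], target))
print(target, "->", nearest)  # [1.0, -1.0] -> queen
```

Excluding the input words matters in practice: with real embeddings the result of the arithmetic often lies closest to 'king' itself.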

Practice: Thinking in Dimensions

Consider the following three sentences:

"The weather is sunny today."
"It is a bright and clear day."
"I am eating a delicious pizza."

Which two sentences do you think will have the smallest 'distance' between their embedding vectors?

Answer: Sentences 1 and 2, because they share semantic meaning (weather/sunny/bright), even though they use different words. Sentence 3 is semantically unrelated.

Common Mistakes

Assuming Keywords Matter Most: Embeddings focus on meaning. If you search for "How to fix a car," an embedding model might return "Automobile repair guide" because the meaning is the same, even if the words are different.
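Comparing Different Models: You cannot compare a vector produced by one model (say, an OpenAI embedding model) with a vector produced by another (say, an open-source model from the Hugging Face Hub). Each model uses its own 'map' (vector space), often with a different number of dimensions, so embed queries and documents with the same model.
Ignoring Sequence Length: Every model has a maximum input length (e.g., 512 tokens for BERT, 8192 tokens for some newer models). If your text is longer, the request may be rejected or the text truncated, and the meaning of the later parts is lost.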
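
The last two mistakes can be guarded against in code. The sketch below is illustrative only: `truncate_tokens` and `assert_same_model` are hypothetical helpers, the model names are placeholders, and whitespace splitting stands in for a real tokenizer. It shows what is lost when input exceeds a token limit, and a simple check that refuses to compare vectors from different models.

```python
def truncate_tokens(tokens, max_tokens):
    """Split tokens into the part a model with a hard limit keeps and the part it drops."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    return tokens[:max_tokens], tokens[max_tokens:]

def assert_same_model(record_a, record_b):
    """Raise if two stored embeddings were not produced by the same model."""
    if record_a["model"] != record_b["model"]:
        raise ValueError(
            f"vectors come from different models: "
            f"{record_a['model']} vs {record_b['model']}"
        )
    if len(record_a["vector"]) != len(record_b["vector"]):
        raise ValueError("vectors have different numbers of dimensions")

text = "Refunds are available within 30 days. Opened items are excluded from refunds."
tokens = text.split()  # crude stand-in for a real tokenizer
kept, dropped = truncate_tokens(tokens, max_tokens=6)
print("embedded:", " ".join(kept))
print("lost:    ", " ".join(dropped))

# Placeholder model names; storing the model name next to each vector is an assumption.
query = {"model": "model-a", "vector": [0.1, 0.2, 0.3]}
document = {"model": "model-b", "vector": [0.3, 0.1, 0.2]}
try:
    assert_same_model(query, document)
except ValueError as err:
    print("refusing to compare:", err)
```

In this example the exclusion for opened items never reaches the model, so a search for it could not match this text. Note that the two vectors here have the same number of dimensions and would produce a similarity score without any error, which is why the check is on the model name rather than on length alone.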

Recap

Embeddings transform text into numerical vectors.
They capture semantic meaning, not just keywords.
Related concepts are close together in vector space.
Similarity measures (like cosine similarity) are used to find 'neighbor' documents in RAG.
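
As a rough illustration of the last point, the sketch below ranks a handful of documents by cosine similarity to a query and keeps the top results. The vectors are hand-written stand-ins for what an embedding model would produce, and `top_k` is a hypothetical helper that scans every document; production systems typically use a vector index instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k documents most similar to the query as (text, score) pairs."""
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for text, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(text, score) for score, text in scored[:k]]

# Hypothetical 3-dimensional vectors, invented for illustration only.
doc_vecs = {
    "Automobile repair guide": [0.9, 0.1, 0.0],
    "It is a bright and clear day.": [0.1, 0.9, 0.1],
    "I am eating a delicious pizza.": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stand-in for the embedding of "How to fix a car"

results = top_k(query_vec, doc_vecs, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

In a RAG system, the texts returned by a step like this are what get passed to the language model as context.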

Knowledge Check

Question 1 of 3 (single choice)

What is the primary advantage of embeddings over keyword search?

They are faster to calculate

They capture semantic meaning and context

They require less storage space
