Embeddings & Semantic Search

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Reading Time: Approximately 2-3 hours Prerequisites: Module 6, Module 8

The Accidental Discovery That Changed AI Forever

A story about a Czech researcher who stumbled upon the geometry of meaning

January 2013, Google Building 43, Mountain View, California

Tomáš Mikolov was frustrated. The Czech-born researcher had spent months trying to make Google’s speech recognition faster. His approach was simple: strip down the neural network architecture to its absolute minimum—just one hidden layer instead of the complex deep networks everyone else was using.

His colleagues at Google were skeptical. “You can’t get good results with such a simple model,” they said. But Mikolov was stubborn. He trained his simplified model on billions of words of text, optimizing purely for speed.

Then, on a cold January afternoon, he decided to examine the hidden layer weights—the internal numbers the network had learned. What he saw made him freeze.

The network had organized words in space. Similar words—like “king” and “queen”—were clustered together. Different words—like “king” and “banana”—were far apart. But that wasn’t the shocking part.

Mikolov, on a whim, tried some vector arithmetic. He took the vector for “king,” subtracted “man,” and added “woman.” The result? A vector almost identical to “queen.”

He tried another. “Paris” minus “France” plus “Italy” equals… “Rome.”

Mikolov stared at his screen. The neural network hadn’t just learned word associations. It had somehow learned the geometry of concepts. You could do math on meaning.

He published his findings in a paper called “Efficient Estimation of Word Representations in Vector Space.” It has since been cited over 40,000 times, making it one of the most influential machine learning papers ever written. The technique he accidentally discovered—word embeddings—became the foundation of modern NLP, powering everything from Google Search to ChatGPT.

That simplified model? It became known as Word2Vec. And the numbers in that hidden layer—those magical coordinates that captured meaning—we call them embeddings.

What You’ll Be Able to Do

By the end of this module, you will:

Understand what embeddings are and why they’re foundational to modern AI
Generate embeddings using OpenAI, Anthropic, and open-source models
Calculate semantic similarity between texts using cosine similarity
Apply embeddings to real-world problems: search, clustering, recommendations, classification
Understand the difference between sparse (TF-IDF) and dense (neural) embeddings
Measure embedding quality and choose the right model for your use case
Build practical applications using embeddings (semantic search, recommendation engine)

** Did You Know?**

The word “embedding” has a precise mathematical meaning: you’re embedding one space into another. In our case, we’re embedding the discrete space of words (where “cat” and “dog” are just different symbols) into a continuous vector space (where they’re close together because they’re both animals). This transformation is so powerful because continuous spaces support operations like addition, subtraction, and distance measurement—operations that don’t make sense on raw text.

Introduction: The Meaning Problem

The Challenge: Computers Don’t Understand Meaning

Imagine you’re building a search engine for your documentation. A user searches for:

Query: “How do I restart a failed service?”

Your documentation contains:

“Restarting a crashed daemon”
“Service recovery procedures”
“How to troubleshoot failed processes”

Problem: Traditional keyword search looks for exact word matches:

“restart” vs “restarting” → close, but not exact
“failed” vs “crashed” → completely different words!
“service” vs “daemon” vs “process” → synonyms, but different tokens

Result: The user misses relevant documentation because computers see text as strings, not meaning.

The Solution: Embeddings

Embeddings are vectors (lists of numbers) that represent the meaning of text.

"How do I restart a failed service?"
   ↓
[0.23, -0.41, 0.87, ..., 0.15]  ← 1536 numbers (OpenAI)

Key insight: Texts with similar meanings have similar embeddings!

# Similar meanings = close vectors
embedding("restart a service")    # [0.23, -0.41, 0.87, ...]
embedding("reboot a daemon")      # [0.25, -0.39, 0.85, ...]  ← CLOSE!

# Different meanings = distant vectors
embedding("restart a service")    # [0.23, -0.41, 0.87, ...]
embedding("cook pasta")           # [-0.71, 0.92, -0.13, ...] ← FAR!

This transforms the search problem from matching strings to measuring distance in meaning-space.

STOP: Time to Practice!

You’ve learned the theory - now let’s build with embeddings!

Embeddings are the foundation of modern AI applications. Theory alone won’t give you intuition - you need to see how texts with similar meanings cluster together in vector space, and how dramatically better semantic search is compared to keyword matching.

Practice Path (~2.5-3 hours total)

1. Embedding Basics - Generate and compare embeddings

Concept: Creating embeddings and measuring cosine similarity
⏱️ Time: 60-75 minutes
Goal: Build intuition for how meaning becomes math
What you’ll learn: Similar texts = similar vectors!

2. Semantic Applications - Build real-world systems

Concept: Search, clustering, recommendations, classification
⏱️ Time: 75-90 minutes
Goal: Apply embeddings to 5 different use cases
What you’ll learn: One technology, infinite applications!

Deliverable: Semantic Search Engine

What: Build a production-ready semantic search system for one of your projects Time: 4-5 hours Portfolio Value: Demonstrates end-to-end AI system building skills

Requirements:

Choose a data source from your projects:
- kaizen: Documentation search
- vibe: Course content recommendations
- contrarian: Financial news clustering
- Work: Infrastructure runbook search
Implement complete system:
- Data ingestion and preprocessing
- Embedding generation (with caching!)
- Similarity search with ranking
- API endpoint or CLI interface
- Performance metrics (recall@k, latency)
Compare 2+ embedding models:
- Measure quality (relevance) on test queries
- Measure cost and latency
- Document trade-offs
Include examples:
- 5-10 example queries with results
- Show keyword search vs semantic search comparison
Deploy or package for production use

Success Criteria:

System handles 100+ documents
Sub-second query latency
Embeddings cached/persisted efficiently
Measurable improvement over keyword search
Production-ready code (error handling, logging)

Real-World Impact: Semantic search is a fundamental capability in modern applications - this deliverable proves you can build it from scratch!

Did You Know?

The word “embedding” comes from mathematics: you’re embedding a high-dimensional discrete space (words/texts) into a continuous vector space. It’s like taking discrete cities and placing them on a continuous map where proximity represents similarity!

What Are Embeddings?

Definition

An embedding is a dense vector representation of data (usually text) that captures semantic meaning.

Properties:

Fixed-length: Every text (short or long) becomes the same length vector
Dense: Most values are non-zero (vs sparse representations like one-hot encoding)
Learned: Trained on massive datasets to capture meaning
Semantic: Similar meanings → similar vectors

Dimensionality

Embeddings are vectors in high-dimensional space:

Model	Dimensions	Use Case
OpenAI text-embedding-3-small	1536	General purpose, cost-effective
OpenAI text-embedding-3-large	3072	Higher quality, more expensive
Anthropic (Voyage)	1024	Optimized for retrieval
Sentence-BERT	384-768	Open-source, local deployment
Word2Vec	100-300	Historical, word-level only

Why so many dimensions?

Think of each dimension as capturing one aspect of meaning:

Dimension 1: Formality (casual ↔ professional)
Dimension 2: Sentiment (negative ↔ positive)
Dimension 3: Technical (general ↔ specialized)
…
Dimension 1536: [Some subtle semantic feature]

With 1536 dimensions, you can capture incredibly nuanced meaning!

** See dimensionality in action: 01_embedding_basics.py generates and visualizes embeddings!**

How Embeddings Work

The Encoding Process

Input Text → Tokenization → Neural Network → Embedding Vector

Step-by-step:

Tokenization: Text becomes tokens (Module 7!)
```
"Machine learning" → [Machine, learning]
```

Neural Network: Tokens pass through transformer layers

Tokens → Attention → Context → Pooling → Vector

Output: Fixed-length vector
```
[0.23, -0.41, 0.87, 0.15, ..., -0.62]
```

Contrastive Learning (How Models Learn Embeddings)

Embedding models are trained using contrastive learning:

Training objective: Similar texts should have close embeddings, different texts should have distant embeddings.

Example training pair:

# Positive pair (similar meaning)
text_1 = "The cat sat on the mat"
text_2 = "A cat is sitting on a mat"
# Goal: Make embeddings close

# Negative pair (different meaning)
text_1 = "The cat sat on the mat"
text_3 = "Python is a programming language"
# Goal: Make embeddings far apart

Training process:

Generate millions of (positive, negative) pairs from web data
For each pair, compute embeddings
Adjust neural network weights to:
- Minimize distance between positive pairs
- Maximize distance between negative pairs
Repeat for billions of examples

Result: The model learns to encode meaning into vectors!

** Did You Know?**

The contrastive learning approach has become so effective that modern embedding models can understand nuances that even humans sometimes miss. In 2023, researchers at Google discovered that their embedding model had learned to distinguish between different types of irony—sarcastic statements clustered separately from genuine statements, even when the words were nearly identical. The model had learned the “shape” of irony from patterns in billions of text examples. This emergent ability wasn’t explicitly programmed; it arose naturally from the training process, suggesting that meaning has a geometric structure that neural networks can discover.

The Scale of Training:

Modern embedding models are trained on staggering amounts of data. OpenAI’s text-embedding-3 models are trained on hundreds of billions of tokens—equivalent to reading every book in the Library of Congress thousands of times. This massive scale is what allows them to understand that “automobile” and “car” are synonyms, or that “bank” has different meanings in financial and river contexts. The training compute for these models can cost millions of dollars, but the resulting embeddings are available to anyone for fractions of a cent per thousand tokens.

🆚 Sparse vs Dense Embeddings

Sparse Embeddings (Traditional)

TF-IDF (Term Frequency-Inverse Document Frequency):

# Vocabulary: ["apple", "banana", "computer", "fruit", ...]
# 50,000 words → 50,000 dimensions

doc_1 = "I like apples"
embedding_sparse = [0, 0.87, 0, 0, ..., 0]  # Mostly zeros!
#                    ^  ^apple
#                    Most values are 0 (sparse)

Characteristics:

Interpretable (each dimension = specific word)
Fast to compute
No semantic understanding (synonyms not captured)
High dimensionality (vocab size)
Sparse (mostly zeros)

Dense Embeddings (Neural)

Modern embeddings:

doc_1 = "I like apples"
embedding_dense = [0.23, -0.41, 0.87, ..., 0.15]  # All non-zero
#                  ^     ^      ^         ^
#                  Most values are non-zero (dense)

Characteristics:

Semantic understanding (knows “apple” ≈ “fruit”)
Lower dimensionality (1536 vs 50,000)
Better generalization
Less interpretable (what does dimension 42 mean?)
Requires neural network (slower to compute)

Comparison Example

query = "How to fix a broken car?"

# Document 1: "Repairing an automobile"
# Document 2: "Python programming tutorial"

# TF-IDF (sparse):
# - Matches "fix" and "car" literally
# - Misses "repair" and "automobile" (synonyms!)
# → Poor match score

# Dense embeddings:
# - Understands "fix" ≈ "repair"
# - Understands "car" ≈ "automobile"
# → Excellent match score!

Modern systems use dense embeddings for semantic understanding.

Generating Embeddings

Option 1: OpenAI API (Most Popular)

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Generate embedding using OpenAI API."""
    text = text.replace("\n", " ")  # OpenAI recommends this

    response = client.embeddings.create(
        input=text,
        model=model
    )

    return response.data[0].embedding

# Example usage
text = "Machine learning transforms data into insights"
embedding = get_embedding(text)

print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Output:

Text: Machine learning transforms data into insights
Embedding dimensions: 1536
First 5 values: [0.0234, -0.4123, 0.8765, 0.1543, -0.6234]

OpenAI Models:

text-embedding-3-small: 1536 dims, $0.02 / 1M tokens
text-embedding-3-large: 3072 dims, $0.13 / 1M tokens
text-embedding-ada-002: Legacy, 1536 dims (deprecated)

Option 2: Anthropic (Voyage AI Integration)

Anthropic partners with Voyage AI for embeddings:

import voyageai

client = voyageai.Client(api_key="your-voyage-api-key")

def get_voyage_embedding(text: str, model: str = "voyage-2") -> list[float]:
    """Generate embedding using Voyage AI."""
    result = client.embed(
        texts=[text],
        model=model,
        input_type="document"  # or "query" for search queries
    )

    return result.embeddings[0]

# Example usage
embedding = get_voyage_embedding("Machine learning is transformative")
print(f"Embedding dimensions: {len(embedding)}")

Voyage Models:

voyage-2: 1024 dims, optimized for retrieval
voyage-large-2: 1536 dims, higher quality

Option 3: Open-Source (Sentence Transformers)

FREE and runs locally!

from sentence_transformers import SentenceTransformer

# Load model (downloads ~400MB on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')

def get_local_embedding(text: str) -> list[float]:
    """Generate embedding using local model."""
    embedding = model.encode(text)
    return embedding.tolist()

# Example usage
texts = [
    "Machine learning is a subset of AI",
    "Deep learning uses neural networks",
    "Pizza is a delicious food"
]

embeddings = model.encode(texts)

for text, emb in zip(texts, embeddings):
    print(f"{text[:30]}... → {len(emb)} dimensions")

Popular Open-Source Models:

Model	Dimensions	Speed	Quality	Use Case
`all-MiniLM-L6-v2`	384	Fast	Good	General purpose
`all-mpnet-base-v2`	768	Medium	Better	Higher quality
`all-distilroberta-v1`	768	Medium	Better	Balanced
`paraphrase-multilingual`	768	Slow	Best	50+ languages

Pros:

FREE (no API costs)
Fast (local inference)
Privacy (data stays local)
Offline capable

Cons:

Lower quality than OpenAI/Voyage
Requires GPU for best performance
Model download/storage

Measuring Similarity: Cosine Similarity

Once you have embeddings, how do you measure how similar two texts are?

Cosine Similarity Explained

Intuition: Measure the angle between two vectors. Similar vectors point in similar directions!

      Vector A
        ↗
       /  ) θ (small angle)
      /  ↗
     / ↗ Vector B
    /↗
   ●────────────→

Formula:

cosine_similarity(A, B) = (A · B) / (|A| × |B|)

Where:
- A · B = dot product (sum of element-wise multiplication)
- |A| = magnitude (length) of vector A
- |B| = magnitude (length) of vector B

Implementation:

import numpy as np

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    # Convert to numpy arrays
    a = np.array(vec_a)
    b = np.array(vec_b)

    # Compute dot product
    dot_product = np.dot(a, b)

    # Compute magnitudes
    magnitude_a = np.linalg.norm(a)
    magnitude_b = np.linalg.norm(b)

    # Compute cosine similarity
    similarity = dot_product / (magnitude_a * magnitude_b)

    return similarity

# Example
emb_1 = get_embedding("Machine learning is powerful")
emb_2 = get_embedding("AI and ML are transformative")
emb_3 = get_embedding("I love pizza")

print(f"ML vs AI: {cosine_similarity(emb_1, emb_2):.3f}")  # → 0.85 (high!)
print(f"ML vs Pizza: {cosine_similarity(emb_1, emb_3):.3f}")  # → 0.21 (low!)

Output:

ML vs AI: 0.851 (very similar!)
ML vs Pizza: 0.214 (not similar)

Similarity Score Interpretation

Similarity	Interpretation	Example
0.9 - 1.0	Nearly identical	”Car” vs “Automobile”
0.7 - 0.9	Very similar	”Dog” vs “Puppy”
0.5 - 0.7	Somewhat similar	”Dog” vs “Cat”
0.3 - 0.5	Slightly similar	”Dog” vs “Animal”
0.0 - 0.3	Not similar	”Dog” vs “Pizza”
< 0.0	Opposite	Rare, but possible

Important: These ranges are approximate and depend on the embedding model!

Why This Module Matters

Why not Euclidean distance?

Cosine similarity measures direction, not magnitude:

vec_1 = [1, 2, 3]
vec_2 = [2, 4, 6]  # Same direction, different magnitude

# Euclidean distance: LARGE (they're far apart in space)
euclidean = np.linalg.norm(vec_1 - vec_2)  # → 3.74

# Cosine similarity: IDENTICAL (same direction!)
cosine = cosine_similarity(vec_1, vec_2)  # → 1.0

For text, direction matters more than magnitude. Two texts about the same topic (same direction) are similar even if one is more detailed (larger magnitude).

** Compare similarity measures: 01_embedding_basics.py shows cosine vs Euclidean!**

Did You Know?

Cosine similarity is used everywhere in recommender systems! When Netflix recommends movies, it’s comparing the embedding of movies you’ve watched to embeddings of all other movies, then recommending the ones with highest cosine similarity!

Use Cases: What Can You Build?

1. Semantic Search

Problem: Find relevant documents even when exact keywords don’t match.

Solution:

Generate embeddings for all documents (one-time)
When user queries, generate query embedding
Compute similarity between query and all documents
Return top-K most similar documents

# Precompute document embeddings (do once)
documents = [
    "How to restart a service",
    "Service recovery procedures",
    "Troubleshooting failed processes",
    "Cooking pasta recipes",
]

doc_embeddings = [get_embedding(doc) for doc in documents]

# User query
query = "fix a crashed daemon"
query_embedding = get_embedding(query)

# Compute similarities
similarities = [
    cosine_similarity(query_embedding, doc_emb)
    for doc_emb in doc_embeddings
]

# Rank results
results = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)

for doc, sim in results[:3]:
    print(f"{sim:.3f} - {doc}")

Output:

0.872 - How to restart a service
0.841 - Service recovery procedures
0.798 - Troubleshooting failed processes
0.134 - Cooking pasta recipes  ← Correctly ranked low!

2. Clustering (Group Similar Items)

Problem: Automatically group similar documents/items.

Solution: Use k-means clustering on embeddings!

from sklearn.cluster import KMeans
import numpy as np

# Documents
documents = [
    "Python programming tutorial",
    "JavaScript web development",
    "Machine learning basics",
    "Deep learning with PyTorch",
    "React frontend framework",
    "AI and neural networks",
]

# Generate embeddings
embeddings = np.array([get_embedding(doc) for doc in documents])

# Cluster into 2 groups
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(embeddings)

# Print clusters
for cluster_id in range(2):
    print(f"\nCluster {cluster_id}:")
    for doc, label in zip(documents, clusters):
        if label == cluster_id:
            print(f"  - {doc}")

Output:

Cluster 0:
  - Python programming tutorial
  - JavaScript web development
  - React frontend framework

Cluster 1:
  - Machine learning basics
  - Deep learning with PyTorch
  - AI and neural networks

Perfect clustering! Programming docs vs ML docs automatically separated.

** Build all 5 use cases: 02_semantic_applications.py implements search, clustering, recommendations, classification, and duplicates!**

3. Recommendation System

Problem: Recommend similar items to users.

Solution: Find items with embeddings similar to user’s liked items!

# User's favorite articles
user_favorites = [
    "Introduction to neural networks",
    "Deep learning for beginners"
]

# All available articles
all_articles = [
    "Advanced neural network architectures",  # Should rank high
    "Cooking Italian cuisine",                # Should rank low
    "Machine learning fundamentals",          # Should rank high
    "Gardening tips and tricks",              # Should rank low
]

# Average user's favorite embeddings
user_profile_embedding = np.mean(
    [get_embedding(fav) for fav in user_favorites],
    axis=0
)

# Score all articles
scores = [
    (article, cosine_similarity(user_profile_embedding, get_embedding(article)))
    for article in all_articles
]

# Recommend top articles
recommendations = sorted(scores, key=lambda x: x[1], reverse=True)

print("Recommended for you:")
for article, score in recommendations[:2]:
    print(f"{score:.3f} - {article}")

Output:

Recommended for you:
0.891 - Advanced neural network architectures
0.847 - Machine learning fundamentals

4. Classification (Zero-Shot)

Problem: Classify text without training a classifier!

Solution: Compare text embedding to category embeddings, pick closest.

# Define categories
categories = {
    "Technology": "Computers, software, programming, AI, tech",
    "Sports": "Football, basketball, athletics, competitions",
    "Cooking": "Recipes, food, ingredients, culinary arts"
}

# Generate category embeddings
category_embeddings = {
    name: get_embedding(description)
    for name, description in categories.items()
}

# Classify a new text
text_to_classify = "Python is a popular programming language"

text_embedding = get_embedding(text_to_classify)

# Find closest category
scores = {
    category: cosine_similarity(text_embedding, cat_emb)
    for category, cat_emb in category_embeddings.items()
}

predicted_category = max(scores, key=scores.get)

print(f"Text: {text_to_classify}")
print(f"Predicted: {predicted_category}")
print(f"Scores: {scores}")

Output:

Text: Python is a popular programming language
Predicted: Technology
Scores: {'Technology': 0.823, 'Sports': 0.142, 'Cooking': 0.098}

5. Duplicate Detection

Problem: Find duplicate or near-duplicate content.

Solution: High similarity → likely duplicates!

def find_duplicates(documents: list[str], threshold: float = 0.95) -> list:
    """Find near-duplicate documents."""
    embeddings = [get_embedding(doc) for doc in documents]
    duplicates = []

    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim > threshold:
                duplicates.append((documents[i], documents[j], sim))

    return duplicates

# Test
docs = [
    "How to restart a server",
    "Restarting a server tutorial",  # Duplicate!
    "Cooking pasta recipes",
    "Server restart guide",          # Duplicate!
]

dupes = find_duplicates(docs, threshold=0.85)

for doc1, doc2, sim in dupes:
    print(f"{sim:.3f} - '{doc1}' ≈ '{doc2}'")

Output:

0.921 - 'How to restart a server' ≈ 'Restarting a server tutorial'
0.897 - 'How to restart a server' ≈ 'Server restart guide'
0.913 - 'Restarting a server tutorial' ≈ 'Server restart guide'

️ Common Pitfalls

Pitfall 1: Not Normalizing Text

Problem: Embeddings can be sensitive to formatting.

# These have different embeddings!
emb_1 = get_embedding("HELLO WORLD")
emb_2 = get_embedding("hello world")
emb_3 = get_embedding("hello    world")  # Extra spaces

# Similarity might not be 1.0!

Solution: Normalize text before embedding:

def normalize_text(text: str) -> str:
    """Normalize text for consistent embeddings."""
    # Lowercase (optional - models handle case well)
    # text = text.lower()

    # Remove extra whitespace
    text = " ".join(text.split())

    # Remove newlines
    text = text.replace("\n", " ")

    return text

Pitfall 2: Embedding Too Much Text

Problem: Models have token limits!

# OpenAI: 8191 tokens max
# Long document → truncated → information loss!

long_doc = "..." * 10000  # 50,000 words
embedding = get_embedding(long_doc)  # Silently truncated!

Solution: Chunk long documents:

def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks."""
    # Simple chunking by sentences
    sentences = text.split(". ")
    chunks = []
    current_chunk = []
    current_tokens = 0

    for sentence in sentences:
        tokens = len(sentence.split())  # Rough estimate
        if current_tokens + tokens > max_tokens:
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentence]
            current_tokens = tokens
        else:
            current_chunk.append(sentence)
            current_tokens += tokens

    if current_chunk:
        chunks.append(". ".join(current_chunk) + ".")

    return chunks

# Embed each chunk separately
chunks = chunk_text(long_document)
chunk_embeddings = [get_embedding(chunk) for chunk in chunks]

Pitfall 3: Comparing Embeddings from Different Models

Problem: Embeddings from different models are incompatible!

# DON'T DO THIS!
emb_1 = get_embedding("text", model="text-embedding-3-small")  # 1536 dims
emb_2 = get_voyage_embedding("text")  # 1024 dims

similarity = cosine_similarity(emb_1, emb_2)  # ERROR or nonsense!

Solution: Always use the same model for all embeddings in a system!

Pitfall 4: Not Caching Embeddings

Problem: Embedding APIs cost money and time!

# BAD: Re-embed the same text repeatedly
for query in user_queries:
    doc_embeddings = [get_embedding(doc) for doc in documents]  # WASTEFUL!
    # ... search logic

Solution: Precompute and cache!

# GOOD: Embed documents once
doc_embeddings = {doc: get_embedding(doc) for doc in documents}

# Then reuse
for query in user_queries:
    query_emb = get_embedding(query)
    similarities = [
        cosine_similarity(query_emb, doc_embeddings[doc])
        for doc in documents
    ]

Pitfall 5: Ignoring Embedding Quality

Problem: Not all embedding models are equal!

Solution: Benchmark on your data!

from sklearn.metrics import accuracy_score

# Test different models
models = ["text-embedding-3-small", "text-embedding-3-large"]

for model in models:
    # Generate embeddings
    embeddings = [get_embedding(text, model=model) for text in test_texts]

    # Evaluate on your task (e.g., classification accuracy)
    accuracy = evaluate_embeddings(embeddings, labels)

    print(f"{model}: {accuracy:.3f}")

Choosing the Right Embedding Model

Decision Matrix

Criteria	Recommendation
Cost-sensitive	OpenAI `text-embedding-3-small` or open-source
Quality-critical	OpenAI `text-embedding-3-large` or Voyage
Privacy-required	Open-source (Sentence Transformers)
Multilingual	`paraphrase-multilingual` (open-source)
Low latency	Cache embeddings or use smaller models
Retrieval/search	Voyage `voyage-2` or OpenAI
General purpose	OpenAI `text-embedding-3-small`

Benchmarking

Test on your specific use case:

def benchmark_model(model_name: str, queries: list[str], documents: list[str], relevant: dict):
    """Benchmark a model on search task."""
    # Generate embeddings
    query_embs = [get_embedding(q, model=model_name) for q in queries]
    doc_embs = [get_embedding(d, model=model_name) for d in documents]

    # Evaluate recall@k
    hits = 0
    for i, query in enumerate(queries):
        # Find top-5 documents
        similarities = [cosine_similarity(query_embs[i], doc_emb) for doc_emb in doc_embs]
        top_k_indices = np.argsort(similarities)[-5:][::-1]

        # Check if relevant doc is in top-5
        if relevant[i] in top_k_indices:
            hits += 1

    recall_at_5 = hits / len(queries)
    return recall_at_5

# Test
recall = benchmark_model("text-embedding-3-small", test_queries, test_docs, relevance_map)
print(f"Recall@5: {recall:.2%}")

Real-World Applications

Application 1: kaizen (Lean DevOps Platform)

Current: Keyword-based RAG retrieval Enhancement: Semantic search with embeddings!

# Embed all kaizen documentation
docs = load_kaizen_docs()
doc_embeddings = {doc["id"]: get_embedding(doc["text"]) for doc in docs}

# User asks question
user_question = "How do I optimize deployment speed?"

# Find relevant docs
question_emb = get_embedding(user_question)
scores = [
    (doc_id, cosine_similarity(question_emb, doc_embeddings[doc_id]))
    for doc_id in doc_embeddings
]

# Get top 5 relevant docs
top_docs = sorted(scores, key=lambda x: x[1], reverse=True)[:5]

# Use in RAG prompt
context = "\n\n".join([docs[doc_id]["text"] for doc_id, _ in top_docs])
answer = llm_query(f"Context: {context}\n\nQuestion: {user_question}")

Application 2: vibe (Teaching Platform)

Use case: Recommend similar learning materials

# Student just completed a lesson
completed_lesson = "Introduction to Python functions"
lesson_emb = get_embedding(completed_lesson)

# Find similar lessons
all_lessons = get_all_lessons()
recommendations = [
    (lesson, cosine_similarity(lesson_emb, get_embedding(lesson)))
    for lesson in all_lessons
]

# Recommend top 3
recommended = sorted(recommendations, key=lambda x: x[1], reverse=True)[1:4]  # Skip self

print("You might also like:")
for lesson, score in recommended:
    print(f"- {lesson} (similarity: {score:.2f})")

Application 3: contrarian (Stock Analysis)

Use case: Cluster news articles by topic

# Get today's financial news
news_articles = fetch_financial_news()

# Embed articles
article_embeddings = [get_embedding(article["title"] + " " + article["summary"]) for article in news_articles]

# Cluster
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(article_embeddings)

# Group articles by cluster
for cluster_id in range(5):
    print(f"\nTopic {cluster_id}:")
    for article, label in zip(news_articles, clusters):
        if label == cluster_id:
            print(f"  - {article['title']}")

Application 4: Work (Geospatial + Cloud)

Use case: Semantic search across infrastructure documentation

# Index all runbooks, docs, scripts
infrastructure_docs = load_all_docs()
doc_index = {
    doc["path"]: get_embedding(doc["content"])
    for doc in infrastructure_docs
}

# Engineer asks: "How do I scale the database?"
query = "How do I scale the database?"
query_emb = get_embedding(query)

# Find relevant docs
results = sorted(
    [(path, cosine_similarity(query_emb, emb)) for path, emb in doc_index.items()],
    key=lambda x: x[1],
    reverse=True
)[:5]

print("Relevant documentation:")
for path, score in results:
    print(f"{score:.3f} - {path}")

Production War Stories

The E-commerce Search Disaster

Company: Major online retailer (anonymized) Challenge: Holiday season search failures costing $50,000/hour in lost sales

The problem was subtle. During Black Friday, customers searching for “warm winter jacket” weren’t finding the company’s best-selling “insulated parka” or “thermal coat” products. The keyword-based search system only matched exact terms.

The disaster unfolds:

Search conversion rate dropped 40% on Black Friday
Customer support flooded with “I can’t find [product]” complaints
Engineering scrambled to add synonyms manually
By Cyber Monday, they’d manually mapped 500 synonym pairs—and still missed countless combinations

The embedding solution: The team implemented semantic search over the following month:

# Before: Keyword matching
def search_products(query):
    return db.query("SELECT * FROM products WHERE name LIKE '%{}%'".format(query))

# After: Semantic similarity
def semantic_search(query, top_k=20):
    query_embedding = get_embedding(query)
    # Find nearest product embeddings in vector database
    return vector_db.search(query_embedding, limit=top_k)

Results:

Search conversion rate increased 35%
Zero synonym maintenance required
“Warm winter jacket” now matches “insulated parka” automatically
Estimated annual revenue increase: $4.2M

The Legal Discovery Breakthrough

Company: Law firm handling complex litigation Challenge: Find relevant documents in 2 million case files

Traditional approach: Teams of paralegals reading documents for weeks, billing $150/hour.

The embedding approach:

Embed all 2 million documents (one-time cost: ~$200 in API calls)
Lawyer describes what they’re looking for in plain English
System returns the 100 most semantically similar documents
Paralegals review only the filtered results

The numbers:

Before: 6 weeks, 8 paralegals, ~$180,000 in billable hours
After: 2 days, 2 paralegals, ~$5,000 in labor + $200 in API costs
Total savings per case: ~$175,000
ROI: 87,400%

Key insight: Embeddings are particularly powerful for search over unstructured text where you can’t predict what words people will use.

The Customer Support Resolution

Company: SaaS platform with 10,000 support tickets/month Challenge: Route tickets to the right team and suggest relevant knowledge base articles

The old system used rules: “If ticket contains ‘billing’, route to Finance.” This failed spectacularly when customers wrote things like “I was charged twice” (no keyword “billing”) or “my payment didn’t go through” (routed to Payments team when it should be Finance).

The embedding solution:

Embed all historical tickets and their resolutions
When new ticket arrives, find the 10 most similar past tickets
Use majority vote to determine correct team
Suggest knowledge base articles with highest semantic similarity

Results:

Correct routing: 67% → 94%
First-response time: 4 hours → 45 minutes (right team gets it immediately)
Customer satisfaction: +28 NPS points
Support team capacity effectively increased 40%

Economics of Embeddings

Cost Comparison: Build vs Buy

Approach	Cost per 1M Tokens	Latency	Quality	Best For
OpenAI text-embedding-3-small	$0.02	100-200ms	Very Good	Most applications
OpenAI text-embedding-3-large	$0.13	150-300ms	Excellent	Quality-critical apps
Voyage AI voyage-2	$0.10	100-200ms	Excellent	Retrieval-focused
Self-hosted all-MiniLM	~$0.001*	10-50ms	Good	High-volume, privacy
Self-hosted e5-large	~$0.003*	30-100ms	Very Good	Balanced

*Self-hosted costs assume amortized GPU costs on cloud infrastructure

Real Cost Calculation

Scenario: Semantic search for a documentation site

Assumptions:

10,000 documents to index
Average 500 tokens per document
1,000 search queries per day
Each query: 50 tokens

Initial indexing (one-time):

Tokens: 10,000 × 500 = 5M tokens
Cost (OpenAI small): 5M × $0.02/1M = $0.10

Daily queries:

Tokens: 1,000 × 50 = 50K tokens
Cost (OpenAI small): 50K × $0.02/1M = $0.001/day

Monthly total: $0.10 initial + $0.03 queries = $0.13/month

This is essentially free. Even at 100x the scale, you’re looking at $13/month.

When Self-Hosting Makes Sense

Factor	API	Self-Hosted
Volume	<100M tokens/month	>100M tokens/month
Latency	100-300ms acceptable	<50ms required
Privacy	Public cloud OK	Must stay on-premises
Quality	Need absolute best	”Good enough” works
Ops burden	None	Significant

Break-even analysis: At OpenAI’s pricing, self-hosting only makes sense above ~100M tokens/month (about $2,000/month in API costs). Below that, the operational overhead of running GPU infrastructure exceeds the API savings.

Interview Preparation: Embeddings

Common Interview Questions

Q1: “Explain what embeddings are to a non-technical stakeholder.”

Strong Answer: “Embeddings are a way to convert text into numbers that capture meaning. Imagine you could place every word or sentence on a giant map, where similar things are close together and different things are far apart. ‘Cat’ would be near ‘kitten’ and ‘dog,’ but far from ‘airplane.’ Once we have this map, we can measure how similar two pieces of text are just by checking how close they are on the map. This powers features like ‘find similar articles,’ ‘recommended for you,’ and smart search that understands synonyms.”

Q2: “Why use cosine similarity instead of Euclidean distance?”

Strong Answer: “Cosine similarity measures the angle between vectors, not their absolute distance. This matters because embedding models don’t always produce vectors of consistent magnitude—a longer document might have a ‘bigger’ vector even if it’s about the same topic. Cosine similarity normalizes for this, focusing purely on direction. Two vectors pointing the same direction have similarity 1.0 regardless of their lengths. This makes it robust to variations in text length and model quirks.”

Q3: “How would you evaluate embedding quality for a specific use case?”

Strong Answer: “I’d create a test set with ground truth: pairs of texts labeled as similar or dissimilar, or queries with known relevant documents. Then I’d measure how well the embeddings rank the correct pairs higher than incorrect ones. Metrics like Recall@K (did the right document appear in top K results?) and Mean Reciprocal Rank are useful. I’d also compare multiple embedding models on this test set before committing to one. The key is evaluating on YOUR data—general benchmarks like MTEB are useful but don’t guarantee performance on domain-specific content.”

Q4: “What are the main pitfalls when implementing semantic search?”

Strong Answer: “Four big ones: First, not caching embeddings—recomputing them wastes money and time. Second, mixing embeddings from different models—they’re incompatible and comparisons are meaningless. Third, not chunking long documents—models have token limits, and silently truncated text loses information. Fourth, not handling the cold start—when you have no training data, start with zero-shot classification or simple similarity thresholds, then refine as you gather feedback.”

System Design Question

Q: “Design a semantic search system for a company with 10 million documents.”

Strong Answer Framework:

Indexing Pipeline:
- Chunk documents into 500-token segments (overlap by 50)
- Generate embeddings via API (batch for efficiency)
- Store in vector database (Pinecone, Qdrant, or Weaviate)
- Estimated indexing time: 10M × 500 tokens = 5B tokens ≈ $100 with OpenAI
Query Pipeline:
- Embed user query
- Approximate nearest neighbor search in vector DB (HNSW algorithm, <100ms)
- Retrieve top 50-100 candidates
- Optional: Re-rank with cross-encoder for better precision
- Return top 10 to user
Scaling Considerations:
- Vector DB sharding for 10M+ documents
- Embedding cache for repeated queries
- Async indexing pipeline for new documents
- Monitoring: latency, cache hit rate, result diversity
Cost Estimate:
- Initial indexing: ~$100
- Daily queries (100K): ~$0.10/day
- Vector DB hosting: ~$100-500/month depending on scale
- Total: ~$200-600/month

Key Takeaways

Embeddings capture meaning as geometry - Similar texts become close vectors, enabling mathematical operations on semantic content.
Cosine similarity is your friend - It measures directional similarity, robust to magnitude variations, and is the standard for comparing embeddings.
Choose your model wisely - OpenAI for quality, open-source for cost/privacy. Benchmark on YOUR data, not just MTEB scores.
Cache everything - Document embeddings rarely change. Compute once, store forever, query infinitely.
Chunk long documents - Models have token limits. Overlap chunks to maintain context across boundaries.
Same model, always - Never compare embeddings from different models. They live in different spaces.
Start simple, iterate - Basic semantic search often outperforms complex systems. Add re-ranking and filtering only when needed.
The economics are compelling - At $0.02 per million tokens, API costs are negligible for most applications.
Embeddings enable zero-shot learning - Classify without labeled data by comparing to category descriptions.
This is foundation technology - Every RAG system, recommendation engine, and semantic search uses embeddings. Master this, and everything else becomes easier.

Remember: Tomáš Mikolov’s accidental discovery in 2013 unlocked something profound—meaning has a geometry, and neural networks can discover it. Every time you use semantic search, get a recommendation, or interact with an AI system that “understands” you, you’re benefiting from the vectors that encode meaning into math. The simple idea of representing text as numbers in a high-dimensional space has become one of the most impactful and transformative concepts in modern AI, enabling revolutionary applications that would have seemed like pure science fiction just a decade ago.

Did You Know? The Revolution of Embeddings

The Google Intern Who Changed NLP Forever

In 2013, a Czech researcher named Tomáš Mikolov at Google published Word2Vec - and accidentally revolutionized natural language processing.

Mikolov wasn’t trying to create embeddings. He was trying to build a faster language model for speech recognition. His trick: simplify the neural network architecture to just one hidden layer. This made training 100x faster.

The accident: When Mikolov examined the hidden layer weights, he noticed something bizarre. Words with similar meanings had similar weights. And even more shocking: you could do math on them.

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

The paper “Efficient Estimation of Word Representations in Vector Space” has been cited over 40,000 times - making it one of the most influential ML papers ever. Mikolov later moved to Facebook AI, then left to work on AI safety.

Fun fact: The famous king/queen example was discovered by accident when a researcher was debugging the model and tried random arithmetic operations!

The “Linguistic Regularity” Discovery

After Word2Vec, researchers started finding increasingly bizarre patterns in embeddings:

Analogies that work:

Paris - France + Italy = Rome (capitals)
walking - walked + swam = swimming (tense)
brother - man + woman = sister (gender)
good - better + worse = bad (comparatives)

But also unexpected ones:

sushi - Japan + Italy = pizza (!)
Microsoft - Windows + Apple = macOS (!)
Einstein - physicist + painter = Picasso (!)

Researchers published paper after paper exploring these “linguistic regularities.” The embeddings had somehow learned conceptual relationships just from predicting context words!

The Billion-Dollar Pivot: From Words to Sentences

Word2Vec had a fatal flaw: it only embedded single words. How do you embed “not good” (which means bad)?

2018: Google’s BERT solved this by embedding entire sentences. But BERT was slow - generating one embedding required a full transformer forward pass.

2019: Nils Reimers (a German researcher) created Sentence-BERT while doing his PhD. His insight: train BERT to produce embeddings that are fast to compare. The paper was rejected from the main NeurIPS conference but accepted to EMNLP.

The impact:

Sentence-BERT made semantic search practical
Reimers founded Hugging Face’s sentence-transformers library
Now used by: Google, Amazon, Microsoft, and virtually every AI startup
Downloads: 100M+ per month on Hugging Face

The lesson: A PhD student’s “rejected” paper became the foundation of the entire embedding industry!

The OpenAI Pricing Shock

In December 2022, OpenAI released text-embedding-ada-002 - and shocked the industry with its pricing:

Before (text-embedding-ada-001):

Price: $0.20 per 1,000 tokens
Quality: Good but not great

After (ada-002):

Price: $0.0001 per 1,000 tokens (2000x cheaper!)
Quality: Significantly better

What happened? OpenAI had achieved a massive efficiency breakthrough. They never disclosed the details, but the pricing made embeddings essentially free for most applications.

Industry reaction:

Startups pivoted overnight from open-source to OpenAI
Self-hosted embedding servers were abandoned
Cohere, Google, and others scrambled to match pricing
By 2024, all major providers offer embeddings at <$0.0005 per 1K tokens

The Netflix Recommendation Engine

Here’s a secret: Netflix doesn’t use traditional collaborative filtering anymore. They use embeddings.

Every movie on Netflix has an embedding. Every user has an embedding (based on their watch history). Recommendations are simply:

user_embedding = average(embeddings of watched movies)
recommendations = nearest_neighbors(user_embedding, all_movie_embeddings)

The numbers:

Netflix generates $1B+ per year in value from their recommendation system
Embeddings replaced their old system in 2019
80% of content watched comes from recommendations

Amazon, Spotify, TikTok - they all use similar embedding-based recommendation systems. The algorithm that powers your social media feed is essentially: find content with embeddings similar to what you’ve engaged with.

Why Embeddings Beat Keyword Search

The fundamental limitation of keyword search is the “vocabulary mismatch” problem. Users don’t use the same words as document authors. A study by Microsoft Research found that in enterprise search, the exact query terms appeared in relevant documents only 23% of the time. The remaining 77% required understanding synonyms, related concepts, or paraphrased ideas.

Embeddings solve this naturally. When a user searches for “laptop won’t turn on,” an embedding-based system understands this is semantically similar to “computer not booting,” “PC power issues,” and “notebook startup failure”—even though they share few words in common. This semantic understanding is why embedding-based search typically outperforms keyword search by 30-60% on relevance metrics.

The Benchmark Wars

In 2022, the MTEB (Massive Text Embedding Benchmark) was released, ranking embedding models on 56 diverse tasks.

The competition got intense:

OpenAI released ada-002, claimed #1 spot
Cohere released embed-v3, challenged for #1
Voyage AI (startup) came out of nowhere with voyage-large
Open source (e5, gte, bge) caught up rapidly

Current state (2024):

Best commercial: OpenAI text-embedding-3-large, Voyage AI voyage-large-2
Best open-source: e5-large-v2, bge-large-en-v1.5
Open source is now within 2-3% of commercial models!

The Surprising Economics

Model	Cost per 1M tokens	Quality (MTEB)
OpenAI ada-002	$0.10	61.0%
OpenAI text-embedding-3-large	$0.13	64.6%
Voyage AI voyage-large-2	$0.12	64.5%
Cohere embed-v3	$0.10	64.5%
Open source (e5-large)	~$0.00*	63.0%

*Self-hosted cost depends on your infrastructure

The insight: For most applications, open-source embeddings are “good enough” and essentially free. Only pay for commercial when you need that extra 2-3% quality!

Module Summary

What you learned:

Embeddings are dense vectors representing text meaning
Generated using neural networks trained on massive datasets
Measured using cosine similarity (direction, not distance)
Applied to search, clustering, recommendations, classification, duplicates
OpenAI, Voyage, and open-source options available
Always use the same model for all embeddings in a system

Key formulas:

cosine_similarity(A, B) = (A · B) / (|A| × |B|)

Key code patterns:

# Generate embedding
embedding = get_embedding(text)

# Compare similarity
similarity = cosine_similarity(emb_1, emb_2)

# Search
scores = [cosine_similarity(query_emb, doc_emb) for doc_emb in doc_embeddings]
top_results = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:k]

️ Next Steps

Next module: Module 10: Vector Spaces & Semantic Search

In Module 10, you’ll experience the Heureka Moment: understanding embeddings as coordinates in semantic space, where mathematical operations work on meaning itself!

You’ll learn:

Visualizing embeddings in 2D/3D space
Vector arithmetic: king - man + woman ≈ queen
Building semantic search from scratch
Vector databases and indexing at scale
The geometry of meaning!

This is when everything clicks!

** Neural Dojo - Master embeddings, unlock semantic understanding! **

Last updated: 2025-11-21 Module 9: Embeddings & Semantic Similarity

Embeddings & Semantic Search

Or: Teaching Computers That ‘Cat’ and ‘Kitten’ Are Related

The Accidental Discovery That Changed AI Forever

What You’ll Be Able to Do

Introduction: The Meaning Problem

The Challenge: Computers Don’t Understand Meaning

The Solution: Embeddings

STOP: Time to Practice!

Practice Path (~2.5-3 hours total)

Deliverable: Semantic Search Engine

Did You Know?

What Are Embeddings?

Definition

Dimensionality

How Embeddings Work

The Encoding Process

Contrastive Learning (How Models Learn Embeddings)

🆚 Sparse vs Dense Embeddings

Sparse Embeddings (Traditional)

Dense Embeddings (Neural)

Comparison Example

Generating Embeddings

Option 1: OpenAI API (Most Popular)

Option 2: Anthropic (Voyage AI Integration)

Option 3: Open-Source (Sentence Transformers)

Measuring Similarity: Cosine Similarity

Cosine Similarity Explained

Similarity Score Interpretation

Why This Module Matters

Did You Know?

Use Cases: What Can You Build?

1. Semantic Search

2. Clustering (Group Similar Items)

3. Recommendation System

4. Classification (Zero-Shot)

5. Duplicate Detection

️ Common Pitfalls

Pitfall 1: Not Normalizing Text

Pitfall 2: Embedding Too Much Text

Pitfall 3: Comparing Embeddings from Different Models

Pitfall 4: Not Caching Embeddings

Pitfall 5: Ignoring Embedding Quality

Choosing the Right Embedding Model

Decision Matrix

Benchmarking

Real-World Applications

Application 1: kaizen (Lean DevOps Platform)

Application 2: vibe (Teaching Platform)

Application 3: contrarian (Stock Analysis)

Application 4: Work (Geospatial + Cloud)

Production War Stories

The E-commerce Search Disaster

The Legal Discovery Breakthrough

The Customer Support Resolution

Economics of Embeddings

Cost Comparison: Build vs Buy

Real Cost Calculation

When Self-Hosting Makes Sense

Interview Preparation: Embeddings

Common Interview Questions

System Design Question

Key Takeaways

Further Reading

Papers

Documentation

Tools

Did You Know? The Revolution of Embeddings

The Google Intern Who Changed NLP Forever

The “Linguistic Regularity” Discovery

The Billion-Dollar Pivot: From Words to Sentences

The OpenAI Pricing Shock

The Netflix Recommendation Engine

Why Embeddings Beat Keyword Search

The Benchmark Wars

The Surprising Economics

Module Summary

️ Next Steps