Skip to content

Embeddings & Semantic Search

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Or: Teaching Computers That ‘Cat’ and ‘Kitten’ Are Related

Section titled “Or: Teaching Computers That ‘Cat’ and ‘Kitten’ Are Related”

Reading Time: Approximately 2-3 hours Prerequisites: Module 6, Module 8


The Accidental Discovery That Changed AI Forever

Section titled “The Accidental Discovery That Changed AI Forever”

A story about a Czech researcher who stumbled upon the geometry of meaning

January 2013, Google Building 43, Mountain View, California

Tomáš Mikolov was frustrated. The Czech-born researcher had spent months trying to make Google’s speech recognition faster. His approach was simple: strip down the neural network architecture to its absolute minimum—just one hidden layer instead of the complex deep networks everyone else was using.

His colleagues at Google were skeptical. “You can’t get good results with such a simple model,” they said. But Mikolov was stubborn. He trained his simplified model on billions of words of text, optimizing purely for speed.

Then, on a cold January afternoon, he decided to examine the hidden layer weights—the internal numbers the network had learned. What he saw made him freeze.

The network had organized words in space. Similar words—like “king” and “queen”—were clustered together. Different words—like “king” and “banana”—were far apart. But that wasn’t the shocking part.

Mikolov, on a whim, tried some vector arithmetic. He took the vector for “king,” subtracted “man,” and added “woman.” The result? A vector almost identical to “queen.”

He tried another. “Paris” minus “France” plus “Italy” equals… “Rome.”

Mikolov stared at his screen. The neural network hadn’t just learned word associations. It had somehow learned the geometry of concepts. You could do math on meaning.

He published his findings in a paper called “Efficient Estimation of Word Representations in Vector Space.” It has since been cited over 40,000 times, making it one of the most influential machine learning papers ever written. The technique he accidentally discovered—word embeddings—became the foundation of modern NLP, powering everything from Google Search to ChatGPT.

That simplified model? It became known as Word2Vec. And the numbers in that hidden layer—those magical coordinates that captured meaning—we call them embeddings.


By the end of this module, you will:

  • Understand what embeddings are and why they’re foundational to modern AI
  • Generate embeddings using OpenAI, Anthropic, and open-source models
  • Calculate semantic similarity between texts using cosine similarity
  • Apply embeddings to real-world problems: search, clustering, recommendations, classification
  • Understand the difference between sparse (TF-IDF) and dense (neural) embeddings
  • Measure embedding quality and choose the right model for your use case
  • Build practical applications using embeddings (semantic search, recommendation engine)

** Did You Know?**

The word “embedding” has a precise mathematical meaning: you’re embedding one space into another. In our case, we’re embedding the discrete space of words (where “cat” and “dog” are just different symbols) into a continuous vector space (where they’re close together because they’re both animals). This transformation is so powerful because continuous spaces support operations like addition, subtraction, and distance measurement—operations that don’t make sense on raw text.


The Challenge: Computers Don’t Understand Meaning

Section titled “The Challenge: Computers Don’t Understand Meaning”

Imagine you’re building a search engine for your documentation. A user searches for:

Query: “How do I restart a failed service?”

Your documentation contains:

  1. “Restarting a crashed daemon”
  2. “Service recovery procedures”
  3. “How to troubleshoot failed processes”

Problem: Traditional keyword search looks for exact word matches:

  • “restart” vs “restarting” → close, but not exact
  • “failed” vs “crashed” → completely different words!
  • “service” vs “daemon” vs “process” → synonyms, but different tokens

Result: The user misses relevant documentation because computers see text as strings, not meaning.

Embeddings are vectors (lists of numbers) that represent the meaning of text.

"How do I restart a failed service?"
[0.23, -0.41, 0.87, ..., 0.15] ← 1536 numbers (OpenAI)

Key insight: Texts with similar meanings have similar embeddings!

# Similar meanings = close vectors
embedding("restart a service") # [0.23, -0.41, 0.87, ...]
embedding("reboot a daemon") # [0.25, -0.39, 0.85, ...] ← CLOSE!
# Different meanings = distant vectors
embedding("restart a service") # [0.23, -0.41, 0.87, ...]
embedding("cook pasta") # [-0.71, 0.92, -0.13, ...] ← FAR!

This transforms the search problem from matching strings to measuring distance in meaning-space.


You’ve learned the theory - now let’s build with embeddings!

Embeddings are the foundation of modern AI applications. Theory alone won’t give you intuition - you need to see how texts with similar meanings cluster together in vector space, and how dramatically better semantic search is compared to keyword matching.

1. Embedding Basics - Generate and compare embeddings

  • Concept: Creating embeddings and measuring cosine similarity
  • ⏱️ Time: 60-75 minutes
  • Goal: Build intuition for how meaning becomes math
  • What you’ll learn: Similar texts = similar vectors!

2. Semantic Applications - Build real-world systems

  • Concept: Search, clustering, recommendations, classification
  • ⏱️ Time: 75-90 minutes
  • Goal: Apply embeddings to 5 different use cases
  • What you’ll learn: One technology, infinite applications!

What: Build a production-ready semantic search system for one of your projects Time: 4-5 hours Portfolio Value: Demonstrates end-to-end AI system building skills

Requirements:

  1. Choose a data source from your projects:
    • kaizen: Documentation search
    • vibe: Course content recommendations
    • contrarian: Financial news clustering
    • Work: Infrastructure runbook search
  2. Implement complete system:
    • Data ingestion and preprocessing
    • Embedding generation (with caching!)
    • Similarity search with ranking
    • API endpoint or CLI interface
    • Performance metrics (recall@k, latency)
  3. Compare 2+ embedding models:
    • Measure quality (relevance) on test queries
    • Measure cost and latency
    • Document trade-offs
  4. Include examples:
    • 5-10 example queries with results
    • Show keyword search vs semantic search comparison
  5. Deploy or package for production use

Success Criteria:

  • System handles 100+ documents
  • Sub-second query latency
  • Embeddings cached/persisted efficiently
  • Measurable improvement over keyword search
  • Production-ready code (error handling, logging)

Real-World Impact: Semantic search is a fundamental capability in modern applications - this deliverable proves you can build it from scratch!


The word “embedding” comes from mathematics: you’re embedding a high-dimensional discrete space (words/texts) into a continuous vector space. It’s like taking discrete cities and placing them on a continuous map where proximity represents similarity!


An embedding is a dense vector representation of data (usually text) that captures semantic meaning.

Properties:

  1. Fixed-length: Every text (short or long) becomes the same length vector
  2. Dense: Most values are non-zero (vs sparse representations like one-hot encoding)
  3. Learned: Trained on massive datasets to capture meaning
  4. Semantic: Similar meanings → similar vectors

Embeddings are vectors in high-dimensional space:

ModelDimensionsUse Case
OpenAI text-embedding-3-small1536General purpose, cost-effective
OpenAI text-embedding-3-large3072Higher quality, more expensive
Anthropic (Voyage)1024Optimized for retrieval
Sentence-BERT384-768Open-source, local deployment
Word2Vec100-300Historical, word-level only

Why so many dimensions?

Think of each dimension as capturing one aspect of meaning:

  • Dimension 1: Formality (casual ↔ professional)
  • Dimension 2: Sentiment (negative ↔ positive)
  • Dimension 3: Technical (general ↔ specialized)
  • Dimension 1536: [Some subtle semantic feature]

With 1536 dimensions, you can capture incredibly nuanced meaning!

** See dimensionality in action: 01_embedding_basics.py generates and visualizes embeddings!**


Input Text → Tokenization → Neural Network → Embedding Vector

Step-by-step:

  1. Tokenization: Text becomes tokens (Module 7!)

    "Machine learning" → [Machine, learning]
  2. Neural Network: Tokens pass through transformer layers

    Tokens → Attention → Context → Pooling → Vector
  3. Output: Fixed-length vector

    [0.23, -0.41, 0.87, 0.15, ..., -0.62]

Contrastive Learning (How Models Learn Embeddings)

Section titled “Contrastive Learning (How Models Learn Embeddings)”

Embedding models are trained using contrastive learning:

Training objective: Similar texts should have close embeddings, different texts should have distant embeddings.

Example training pair:

# Positive pair (similar meaning)
text_1 = "The cat sat on the mat"
text_2 = "A cat is sitting on a mat"
# Goal: Make embeddings close
# Negative pair (different meaning)
text_1 = "The cat sat on the mat"
text_3 = "Python is a programming language"
# Goal: Make embeddings far apart

Training process:

  1. Generate millions of (positive, negative) pairs from web data
  2. For each pair, compute embeddings
  3. Adjust neural network weights to:
    • Minimize distance between positive pairs
    • Maximize distance between negative pairs
  4. Repeat for billions of examples

Result: The model learns to encode meaning into vectors!

** Did You Know?**

The contrastive learning approach has become so effective that modern embedding models can understand nuances that even humans sometimes miss. In 2023, researchers at Google discovered that their embedding model had learned to distinguish between different types of irony—sarcastic statements clustered separately from genuine statements, even when the words were nearly identical. The model had learned the “shape” of irony from patterns in billions of text examples. This emergent ability wasn’t explicitly programmed; it arose naturally from the training process, suggesting that meaning has a geometric structure that neural networks can discover.

The Scale of Training:

Modern embedding models are trained on staggering amounts of data. OpenAI’s text-embedding-3 models are trained on hundreds of billions of tokens—equivalent to reading every book in the Library of Congress thousands of times. This massive scale is what allows them to understand that “automobile” and “car” are synonyms, or that “bank” has different meanings in financial and river contexts. The training compute for these models can cost millions of dollars, but the resulting embeddings are available to anyone for fractions of a cent per thousand tokens.


TF-IDF (Term Frequency-Inverse Document Frequency):

# Vocabulary: ["apple", "banana", "computer", "fruit", ...]
# 50,000 words → 50,000 dimensions
doc_1 = "I like apples"
embedding_sparse = [0, 0.87, 0, 0, ..., 0] # Mostly zeros!
# ^ ^apple
# Most values are 0 (sparse)

Characteristics:

  • Interpretable (each dimension = specific word)
  • Fast to compute
  • No semantic understanding (synonyms not captured)
  • High dimensionality (vocab size)
  • Sparse (mostly zeros)

Modern embeddings:

doc_1 = "I like apples"
embedding_dense = [0.23, -0.41, 0.87, ..., 0.15] # All non-zero
# ^ ^ ^ ^
# Most values are non-zero (dense)

Characteristics:

  • Semantic understanding (knows “apple” ≈ “fruit”)
  • Lower dimensionality (1536 vs 50,000)
  • Better generalization
  • Less interpretable (what does dimension 42 mean?)
  • Requires neural network (slower to compute)
query = "How to fix a broken car?"
# Document 1: "Repairing an automobile"
# Document 2: "Python programming tutorial"
# TF-IDF (sparse):
# - Matches "fix" and "car" literally
# - Misses "repair" and "automobile" (synonyms!)
# → Poor match score
# Dense embeddings:
# - Understands "fix" ≈ "repair"
# - Understands "car" ≈ "automobile"
# → Excellent match score!

Modern systems use dense embeddings for semantic understanding.


from openai import OpenAI
client = OpenAI(api_key="your-api-key")
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
"""Generate embedding using OpenAI API."""
text = text.replace("\n", " ") # OpenAI recommends this
response = client.embeddings.create(
input=text,
model=model
)
return response.data[0].embedding
# Example usage
text = "Machine learning transforms data into insights"
embedding = get_embedding(text)
print(f"Text: {text}")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Output:

Text: Machine learning transforms data into insights
Embedding dimensions: 1536
First 5 values: [0.0234, -0.4123, 0.8765, 0.1543, -0.6234]

OpenAI Models:

  • text-embedding-3-small: 1536 dims, $0.02 / 1M tokens
  • text-embedding-3-large: 3072 dims, $0.13 / 1M tokens
  • text-embedding-ada-002: Legacy, 1536 dims (deprecated)

Option 2: Anthropic (Voyage AI Integration)

Section titled “Option 2: Anthropic (Voyage AI Integration)”

Anthropic partners with Voyage AI for embeddings:

import voyageai
client = voyageai.Client(api_key="your-voyage-api-key")
def get_voyage_embedding(text: str, model: str = "voyage-2") -> list[float]:
"""Generate embedding using Voyage AI."""
result = client.embed(
texts=[text],
model=model,
input_type="document" # or "query" for search queries
)
return result.embeddings[0]
# Example usage
embedding = get_voyage_embedding("Machine learning is transformative")
print(f"Embedding dimensions: {len(embedding)}")

Voyage Models:

  • voyage-2: 1024 dims, optimized for retrieval
  • voyage-large-2: 1536 dims, higher quality

Option 3: Open-Source (Sentence Transformers)

Section titled “Option 3: Open-Source (Sentence Transformers)”

FREE and runs locally!

from sentence_transformers import SentenceTransformer
# Load model (downloads ~400MB on first run)
model = SentenceTransformer('all-MiniLM-L6-v2')
def get_local_embedding(text: str) -> list[float]:
"""Generate embedding using local model."""
embedding = model.encode(text)
return embedding.tolist()
# Example usage
texts = [
"Machine learning is a subset of AI",
"Deep learning uses neural networks",
"Pizza is a delicious food"
]
embeddings = model.encode(texts)
for text, emb in zip(texts, embeddings):
print(f"{text[:30]}... → {len(emb)} dimensions")

Popular Open-Source Models:

ModelDimensionsSpeedQualityUse Case
all-MiniLM-L6-v2384FastGoodGeneral purpose
all-mpnet-base-v2768MediumBetterHigher quality
all-distilroberta-v1768MediumBetterBalanced
paraphrase-multilingual768SlowBest50+ languages

Pros:

  • FREE (no API costs)
  • Fast (local inference)
  • Privacy (data stays local)
  • Offline capable

Cons:

  • Lower quality than OpenAI/Voyage
  • Requires GPU for best performance
  • Model download/storage

Once you have embeddings, how do you measure how similar two texts are?

Intuition: Measure the angle between two vectors. Similar vectors point in similar directions!

Vector A
/ ) θ (small angle)
/ ↗
/ ↗ Vector B
/↗
●────────────→

Formula:

cosine_similarity(A, B) = (A · B) / (|A| × |B|)
Where:
- A · B = dot product (sum of element-wise multiplication)
- |A| = magnitude (length) of vector A
- |B| = magnitude (length) of vector B

Implementation:

import numpy as np
def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
"""Calculate cosine similarity between two vectors."""
# Convert to numpy arrays
a = np.array(vec_a)
b = np.array(vec_b)
# Compute dot product
dot_product = np.dot(a, b)
# Compute magnitudes
magnitude_a = np.linalg.norm(a)
magnitude_b = np.linalg.norm(b)
# Compute cosine similarity
similarity = dot_product / (magnitude_a * magnitude_b)
return similarity
# Example
emb_1 = get_embedding("Machine learning is powerful")
emb_2 = get_embedding("AI and ML are transformative")
emb_3 = get_embedding("I love pizza")
print(f"ML vs AI: {cosine_similarity(emb_1, emb_2):.3f}") # → 0.85 (high!)
print(f"ML vs Pizza: {cosine_similarity(emb_1, emb_3):.3f}") # → 0.21 (low!)

Output:

ML vs AI: 0.851 (very similar!)
ML vs Pizza: 0.214 (not similar)
SimilarityInterpretationExample
0.9 - 1.0Nearly identical”Car” vs “Automobile”
0.7 - 0.9Very similar”Dog” vs “Puppy”
0.5 - 0.7Somewhat similar”Dog” vs “Cat”
0.3 - 0.5Slightly similar”Dog” vs “Animal”
0.0 - 0.3Not similar”Dog” vs “Pizza”
< 0.0OppositeRare, but possible

Important: These ranges are approximate and depend on the embedding model!

Why not Euclidean distance?

Cosine similarity measures direction, not magnitude:

vec_1 = [1, 2, 3]
vec_2 = [2, 4, 6] # Same direction, different magnitude
# Euclidean distance: LARGE (they're far apart in space)
euclidean = np.linalg.norm(vec_1 - vec_2) # → 3.74
# Cosine similarity: IDENTICAL (same direction!)
cosine = cosine_similarity(vec_1, vec_2) # → 1.0

For text, direction matters more than magnitude. Two texts about the same topic (same direction) are similar even if one is more detailed (larger magnitude).

** Compare similarity measures: 01_embedding_basics.py shows cosine vs Euclidean!**


Cosine similarity is used everywhere in recommender systems! When Netflix recommends movies, it’s comparing the embedding of movies you’ve watched to embeddings of all other movies, then recommending the ones with highest cosine similarity!


Problem: Find relevant documents even when exact keywords don’t match.

Solution:

  1. Generate embeddings for all documents (one-time)
  2. When user queries, generate query embedding
  3. Compute similarity between query and all documents
  4. Return top-K most similar documents
# Precompute document embeddings (do once)
documents = [
"How to restart a service",
"Service recovery procedures",
"Troubleshooting failed processes",
"Cooking pasta recipes",
]
doc_embeddings = [get_embedding(doc) for doc in documents]
# User query
query = "fix a crashed daemon"
query_embedding = get_embedding(query)
# Compute similarities
similarities = [
cosine_similarity(query_embedding, doc_emb)
for doc_emb in doc_embeddings
]
# Rank results
results = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, sim in results[:3]:
print(f"{sim:.3f} - {doc}")

Output:

0.872 - How to restart a service
0.841 - Service recovery procedures
0.798 - Troubleshooting failed processes
0.134 - Cooking pasta recipes ← Correctly ranked low!

Problem: Automatically group similar documents/items.

Solution: Use k-means clustering on embeddings!

from sklearn.cluster import KMeans
import numpy as np
# Documents
documents = [
"Python programming tutorial",
"JavaScript web development",
"Machine learning basics",
"Deep learning with PyTorch",
"React frontend framework",
"AI and neural networks",
]
# Generate embeddings
embeddings = np.array([get_embedding(doc) for doc in documents])
# Cluster into 2 groups
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(embeddings)
# Print clusters
for cluster_id in range(2):
print(f"\nCluster {cluster_id}:")
for doc, label in zip(documents, clusters):
if label == cluster_id:
print(f" - {doc}")

Output:

Cluster 0:
- Python programming tutorial
- JavaScript web development
- React frontend framework
Cluster 1:
- Machine learning basics
- Deep learning with PyTorch
- AI and neural networks

Perfect clustering! Programming docs vs ML docs automatically separated.

** Build all 5 use cases: 02_semantic_applications.py implements search, clustering, recommendations, classification, and duplicates!**

Problem: Recommend similar items to users.

Solution: Find items with embeddings similar to user’s liked items!

# User's favorite articles
user_favorites = [
"Introduction to neural networks",
"Deep learning for beginners"
]
# All available articles
all_articles = [
"Advanced neural network architectures", # Should rank high
"Cooking Italian cuisine", # Should rank low
"Machine learning fundamentals", # Should rank high
"Gardening tips and tricks", # Should rank low
]
# Average user's favorite embeddings
user_profile_embedding = np.mean(
[get_embedding(fav) for fav in user_favorites],
axis=0
)
# Score all articles
scores = [
(article, cosine_similarity(user_profile_embedding, get_embedding(article)))
for article in all_articles
]
# Recommend top articles
recommendations = sorted(scores, key=lambda x: x[1], reverse=True)
print("Recommended for you:")
for article, score in recommendations[:2]:
print(f"{score:.3f} - {article}")

Output:

Recommended for you:
0.891 - Advanced neural network architectures
0.847 - Machine learning fundamentals

Problem: Classify text without training a classifier!

Solution: Compare text embedding to category embeddings, pick closest.

# Define categories
categories = {
"Technology": "Computers, software, programming, AI, tech",
"Sports": "Football, basketball, athletics, competitions",
"Cooking": "Recipes, food, ingredients, culinary arts"
}
# Generate category embeddings
category_embeddings = {
name: get_embedding(description)
for name, description in categories.items()
}
# Classify a new text
text_to_classify = "Python is a popular programming language"
text_embedding = get_embedding(text_to_classify)
# Find closest category
scores = {
category: cosine_similarity(text_embedding, cat_emb)
for category, cat_emb in category_embeddings.items()
}
predicted_category = max(scores, key=scores.get)
print(f"Text: {text_to_classify}")
print(f"Predicted: {predicted_category}")
print(f"Scores: {scores}")

Output:

Text: Python is a popular programming language
Predicted: Technology
Scores: {'Technology': 0.823, 'Sports': 0.142, 'Cooking': 0.098}

Problem: Find duplicate or near-duplicate content.

Solution: High similarity → likely duplicates!

def find_duplicates(documents: list[str], threshold: float = 0.95) -> list:
"""Find near-duplicate documents."""
embeddings = [get_embedding(doc) for doc in documents]
duplicates = []
for i in range(len(documents)):
for j in range(i + 1, len(documents)):
sim = cosine_similarity(embeddings[i], embeddings[j])
if sim > threshold:
duplicates.append((documents[i], documents[j], sim))
return duplicates
# Test
docs = [
"How to restart a server",
"Restarting a server tutorial", # Duplicate!
"Cooking pasta recipes",
"Server restart guide", # Duplicate!
]
dupes = find_duplicates(docs, threshold=0.85)
for doc1, doc2, sim in dupes:
print(f"{sim:.3f} - '{doc1}' ≈ '{doc2}'")

Output:

0.921 - 'How to restart a server' ≈ 'Restarting a server tutorial'
0.897 - 'How to restart a server' ≈ 'Server restart guide'
0.913 - 'Restarting a server tutorial' ≈ 'Server restart guide'

Problem: Embeddings can be sensitive to formatting.

# These have different embeddings!
emb_1 = get_embedding("HELLO WORLD")
emb_2 = get_embedding("hello world")
emb_3 = get_embedding("hello world") # Extra spaces
# Similarity might not be 1.0!

Solution: Normalize text before embedding:

def normalize_text(text: str) -> str:
"""Normalize text for consistent embeddings."""
# Lowercase (optional - models handle case well)
# text = text.lower()
# Remove extra whitespace
text = " ".join(text.split())
# Remove newlines
text = text.replace("\n", " ")
return text

Problem: Models have token limits!

# OpenAI: 8191 tokens max
# Long document → truncated → information loss!
long_doc = "..." * 10000 # 50,000 words
embedding = get_embedding(long_doc) # Silently truncated!

Solution: Chunk long documents:

def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
"""Split text into chunks."""
# Simple chunking by sentences
sentences = text.split(". ")
chunks = []
current_chunk = []
current_tokens = 0
for sentence in sentences:
tokens = len(sentence.split()) # Rough estimate
if current_tokens + tokens > max_tokens:
chunks.append(". ".join(current_chunk) + ".")
current_chunk = [sentence]
current_tokens = tokens
else:
current_chunk.append(sentence)
current_tokens += tokens
if current_chunk:
chunks.append(". ".join(current_chunk) + ".")
return chunks
# Embed each chunk separately
chunks = chunk_text(long_document)
chunk_embeddings = [get_embedding(chunk) for chunk in chunks]

Pitfall 3: Comparing Embeddings from Different Models

Section titled “Pitfall 3: Comparing Embeddings from Different Models”

Problem: Embeddings from different models are incompatible!

# DON'T DO THIS!
emb_1 = get_embedding("text", model="text-embedding-3-small") # 1536 dims
emb_2 = get_voyage_embedding("text") # 1024 dims
similarity = cosine_similarity(emb_1, emb_2) # ERROR or nonsense!

Solution: Always use the same model for all embeddings in a system!

Problem: Embedding APIs cost money and time!

# BAD: Re-embed the same text repeatedly
for query in user_queries:
doc_embeddings = [get_embedding(doc) for doc in documents] # WASTEFUL!
# ... search logic

Solution: Precompute and cache!

# GOOD: Embed documents once
doc_embeddings = {doc: get_embedding(doc) for doc in documents}
# Then reuse
for query in user_queries:
query_emb = get_embedding(query)
similarities = [
cosine_similarity(query_emb, doc_embeddings[doc])
for doc in documents
]

Problem: Not all embedding models are equal!

Solution: Benchmark on your data!

from sklearn.metrics import accuracy_score
# Test different models
models = ["text-embedding-3-small", "text-embedding-3-large"]
for model in models:
# Generate embeddings
embeddings = [get_embedding(text, model=model) for text in test_texts]
# Evaluate on your task (e.g., classification accuracy)
accuracy = evaluate_embeddings(embeddings, labels)
print(f"{model}: {accuracy:.3f}")

CriteriaRecommendation
Cost-sensitiveOpenAI text-embedding-3-small or open-source
Quality-criticalOpenAI text-embedding-3-large or Voyage
Privacy-requiredOpen-source (Sentence Transformers)
Multilingualparaphrase-multilingual (open-source)
Low latencyCache embeddings or use smaller models
Retrieval/searchVoyage voyage-2 or OpenAI
General purposeOpenAI text-embedding-3-small

Test on your specific use case:

def benchmark_model(model_name: str, queries: list[str], documents: list[str], relevant: dict):
"""Benchmark a model on search task."""
# Generate embeddings
query_embs = [get_embedding(q, model=model_name) for q in queries]
doc_embs = [get_embedding(d, model=model_name) for d in documents]
# Evaluate recall@k
hits = 0
for i, query in enumerate(queries):
# Find top-5 documents
similarities = [cosine_similarity(query_embs[i], doc_emb) for doc_emb in doc_embs]
top_k_indices = np.argsort(similarities)[-5:][::-1]
# Check if relevant doc is in top-5
if relevant[i] in top_k_indices:
hits += 1
recall_at_5 = hits / len(queries)
return recall_at_5
# Test
recall = benchmark_model("text-embedding-3-small", test_queries, test_docs, relevance_map)
print(f"Recall@5: {recall:.2%}")

Application 1: kaizen (Lean DevOps Platform)

Section titled “Application 1: kaizen (Lean DevOps Platform)”

Current: Keyword-based RAG retrieval Enhancement: Semantic search with embeddings!

# Embed all kaizen documentation
docs = load_kaizen_docs()
doc_embeddings = {doc["id"]: get_embedding(doc["text"]) for doc in docs}
# User asks question
user_question = "How do I optimize deployment speed?"
# Find relevant docs
question_emb = get_embedding(user_question)
scores = [
(doc_id, cosine_similarity(question_emb, doc_embeddings[doc_id]))
for doc_id in doc_embeddings
]
# Get top 5 relevant docs
top_docs = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
# Use in RAG prompt
context = "\n\n".join([docs[doc_id]["text"] for doc_id, _ in top_docs])
answer = llm_query(f"Context: {context}\n\nQuestion: {user_question}")

Use case: Recommend similar learning materials

# Student just completed a lesson
completed_lesson = "Introduction to Python functions"
lesson_emb = get_embedding(completed_lesson)
# Find similar lessons
all_lessons = get_all_lessons()
recommendations = [
(lesson, cosine_similarity(lesson_emb, get_embedding(lesson)))
for lesson in all_lessons
]
# Recommend top 3
recommended = sorted(recommendations, key=lambda x: x[1], reverse=True)[1:4] # Skip self
print("You might also like:")
for lesson, score in recommended:
print(f"- {lesson} (similarity: {score:.2f})")

Application 3: contrarian (Stock Analysis)

Section titled “Application 3: contrarian (Stock Analysis)”

Use case: Cluster news articles by topic

# Get today's financial news
news_articles = fetch_financial_news()
# Embed articles
article_embeddings = [get_embedding(article["title"] + " " + article["summary"]) for article in news_articles]
# Cluster
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(article_embeddings)
# Group articles by cluster
for cluster_id in range(5):
print(f"\nTopic {cluster_id}:")
for article, label in zip(news_articles, clusters):
if label == cluster_id:
print(f" - {article['title']}")

Use case: Semantic search across infrastructure documentation

# Index all runbooks, docs, scripts
infrastructure_docs = load_all_docs()
doc_index = {
doc["path"]: get_embedding(doc["content"])
for doc in infrastructure_docs
}
# Engineer asks: "How do I scale the database?"
query = "How do I scale the database?"
query_emb = get_embedding(query)
# Find relevant docs
results = sorted(
[(path, cosine_similarity(query_emb, emb)) for path, emb in doc_index.items()],
key=lambda x: x[1],
reverse=True
)[:5]
print("Relevant documentation:")
for path, score in results:
print(f"{score:.3f} - {path}")

Company: Major online retailer (anonymized) Challenge: Holiday season search failures costing $50,000/hour in lost sales

The problem was subtle. During Black Friday, customers searching for “warm winter jacket” weren’t finding the company’s best-selling “insulated parka” or “thermal coat” products. The keyword-based search system only matched exact terms.

The disaster unfolds:

  • Search conversion rate dropped 40% on Black Friday
  • Customer support flooded with “I can’t find [product]” complaints
  • Engineering scrambled to add synonyms manually
  • By Cyber Monday, they’d manually mapped 500 synonym pairs—and still missed countless combinations

The embedding solution: The team implemented semantic search over the following month:

# Before: Keyword matching
def search_products(query):
return db.query("SELECT * FROM products WHERE name LIKE '%{}%'".format(query))
# After: Semantic similarity
def semantic_search(query, top_k=20):
query_embedding = get_embedding(query)
# Find nearest product embeddings in vector database
return vector_db.search(query_embedding, limit=top_k)

Results:

  • Search conversion rate increased 35%
  • Zero synonym maintenance required
  • “Warm winter jacket” now matches “insulated parka” automatically
  • Estimated annual revenue increase: $4.2M

Company: Law firm handling complex litigation Challenge: Find relevant documents in 2 million case files

Traditional approach: Teams of paralegals reading documents for weeks, billing $150/hour.

The embedding approach:

  1. Embed all 2 million documents (one-time cost: ~$200 in API calls)
  2. Lawyer describes what they’re looking for in plain English
  3. System returns the 100 most semantically similar documents
  4. Paralegals review only the filtered results

The numbers:

  • Before: 6 weeks, 8 paralegals, ~$180,000 in billable hours
  • After: 2 days, 2 paralegals, ~$5,000 in labor + $200 in API costs
  • Total savings per case: ~$175,000
  • ROI: 87,400%

Key insight: Embeddings are particularly powerful for search over unstructured text where you can’t predict what words people will use.

Company: SaaS platform with 10,000 support tickets/month Challenge: Route tickets to the right team and suggest relevant knowledge base articles

The old system used rules: “If ticket contains ‘billing’, route to Finance.” This failed spectacularly when customers wrote things like “I was charged twice” (no keyword “billing”) or “my payment didn’t go through” (routed to Payments team when it should be Finance).

The embedding solution:

  1. Embed all historical tickets and their resolutions
  2. When new ticket arrives, find the 10 most similar past tickets
  3. Use majority vote to determine correct team
  4. Suggest knowledge base articles with highest semantic similarity

Results:

  • Correct routing: 67% → 94%
  • First-response time: 4 hours → 45 minutes (right team gets it immediately)
  • Customer satisfaction: +28 NPS points
  • Support team capacity effectively increased 40%

ApproachCost per 1M TokensLatencyQualityBest For
OpenAI text-embedding-3-small$0.02100-200msVery GoodMost applications
OpenAI text-embedding-3-large$0.13150-300msExcellentQuality-critical apps
Voyage AI voyage-2$0.10100-200msExcellentRetrieval-focused
Self-hosted all-MiniLM~$0.001*10-50msGoodHigh-volume, privacy
Self-hosted e5-large~$0.003*30-100msVery GoodBalanced

*Self-hosted costs assume amortized GPU costs on cloud infrastructure

Scenario: Semantic search for a documentation site

Assumptions:

  • 10,000 documents to index
  • Average 500 tokens per document
  • 1,000 search queries per day
  • Each query: 50 tokens

Initial indexing (one-time):

  • Tokens: 10,000 × 500 = 5M tokens
  • Cost (OpenAI small): 5M × $0.02/1M = $0.10

Daily queries:

  • Tokens: 1,000 × 50 = 50K tokens
  • Cost (OpenAI small): 50K × $0.02/1M = $0.001/day

Monthly total: $0.10 initial + $0.03 queries = $0.13/month

This is essentially free. Even at 100x the scale, you’re looking at $13/month.

FactorAPISelf-Hosted
Volume<100M tokens/month>100M tokens/month
Latency100-300ms acceptable<50ms required
PrivacyPublic cloud OKMust stay on-premises
QualityNeed absolute best”Good enough” works
Ops burdenNoneSignificant

Break-even analysis: At OpenAI’s pricing, self-hosting only makes sense above ~100M tokens/month (about $2,000/month in API costs). Below that, the operational overhead of running GPU infrastructure exceeds the API savings.


Q1: “Explain what embeddings are to a non-technical stakeholder.”

Strong Answer: “Embeddings are a way to convert text into numbers that capture meaning. Imagine you could place every word or sentence on a giant map, where similar things are close together and different things are far apart. ‘Cat’ would be near ‘kitten’ and ‘dog,’ but far from ‘airplane.’ Once we have this map, we can measure how similar two pieces of text are just by checking how close they are on the map. This powers features like ‘find similar articles,’ ‘recommended for you,’ and smart search that understands synonyms.”

Q2: “Why use cosine similarity instead of Euclidean distance?”

Strong Answer: “Cosine similarity measures the angle between vectors, not their absolute distance. This matters because embedding models don’t always produce vectors of consistent magnitude—a longer document might have a ‘bigger’ vector even if it’s about the same topic. Cosine similarity normalizes for this, focusing purely on direction. Two vectors pointing the same direction have similarity 1.0 regardless of their lengths. This makes it robust to variations in text length and model quirks.”

Q3: “How would you evaluate embedding quality for a specific use case?”

Strong Answer: “I’d create a test set with ground truth: pairs of texts labeled as similar or dissimilar, or queries with known relevant documents. Then I’d measure how well the embeddings rank the correct pairs higher than incorrect ones. Metrics like Recall@K (did the right document appear in top K results?) and Mean Reciprocal Rank are useful. I’d also compare multiple embedding models on this test set before committing to one. The key is evaluating on YOUR data—general benchmarks like MTEB are useful but don’t guarantee performance on domain-specific content.”

Q4: “What are the main pitfalls when implementing semantic search?”

Strong Answer: “Four big ones: First, not caching embeddings—recomputing them wastes money and time. Second, mixing embeddings from different models—they’re incompatible and comparisons are meaningless. Third, not chunking long documents—models have token limits, and silently truncated text loses information. Fourth, not handling the cold start—when you have no training data, start with zero-shot classification or simple similarity thresholds, then refine as you gather feedback.”

Q: “Design a semantic search system for a company with 10 million documents.”

Strong Answer Framework:

  1. Indexing Pipeline:

    • Chunk documents into 500-token segments (overlap by 50)
    • Generate embeddings via API (batch for efficiency)
    • Store in vector database (Pinecone, Qdrant, or Weaviate)
    • Estimated indexing time: 10M × 500 tokens = 5B tokens ≈ $100 with OpenAI
  2. Query Pipeline:

    • Embed user query
    • Approximate nearest neighbor search in vector DB (HNSW algorithm, <100ms)
    • Retrieve top 50-100 candidates
    • Optional: Re-rank with cross-encoder for better precision
    • Return top 10 to user
  3. Scaling Considerations:

    • Vector DB sharding for 10M+ documents
    • Embedding cache for repeated queries
    • Async indexing pipeline for new documents
    • Monitoring: latency, cache hit rate, result diversity
  4. Cost Estimate:

    • Initial indexing: ~$100
    • Daily queries (100K): ~$0.10/day
    • Vector DB hosting: ~$100-500/month depending on scale
    • Total: ~$200-600/month

  1. Embeddings capture meaning as geometry - Similar texts become close vectors, enabling mathematical operations on semantic content.

  2. Cosine similarity is your friend - It measures directional similarity, robust to magnitude variations, and is the standard for comparing embeddings.

  3. Choose your model wisely - OpenAI for quality, open-source for cost/privacy. Benchmark on YOUR data, not just MTEB scores.

  4. Cache everything - Document embeddings rarely change. Compute once, store forever, query infinitely.

  5. Chunk long documents - Models have token limits. Overlap chunks to maintain context across boundaries.

  6. Same model, always - Never compare embeddings from different models. They live in different spaces.

  7. Start simple, iterate - Basic semantic search often outperforms complex systems. Add re-ranking and filtering only when needed.

  8. The economics are compelling - At $0.02 per million tokens, API costs are negligible for most applications.

  9. Embeddings enable zero-shot learning - Classify without labeled data by comparing to category descriptions.

  10. This is foundation technology - Every RAG system, recommendation engine, and semantic search uses embeddings. Master this, and everything else becomes easier.

Remember: Tomáš Mikolov’s accidental discovery in 2013 unlocked something profound—meaning has a geometry, and neural networks can discover it. Every time you use semantic search, get a recommendation, or interact with an AI system that “understands” you, you’re benefiting from the vectors that encode meaning into math. The simple idea of representing text as numbers in a high-dimensional space has become one of the most impactful and transformative concepts in modern AI, enabling revolutionary applications that would have seemed like pure science fiction just a decade ago.


  • Sentence-BERT (2019): Paper - Introduced sentence embeddings
  • SimCSE (2021): Paper - Contrastive learning for embeddings
  • E5 (2022): Paper - Text embeddings by weak supervision
  • MTEB Leaderboard: Compare embedding models on 56 tasks
  • txtai: Semantic search framework built on embeddings
  • Qdrant: Vector database for embeddings (Module 14!)

Did You Know? The Revolution of Embeddings

Section titled “Did You Know? The Revolution of Embeddings”

In 2013, a Czech researcher named Tomáš Mikolov at Google published Word2Vec - and accidentally revolutionized natural language processing.

Mikolov wasn’t trying to create embeddings. He was trying to build a faster language model for speech recognition. His trick: simplify the neural network architecture to just one hidden layer. This made training 100x faster.

The accident: When Mikolov examined the hidden layer weights, he noticed something bizarre. Words with similar meanings had similar weights. And even more shocking: you could do math on them.

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

The paper “Efficient Estimation of Word Representations in Vector Space” has been cited over 40,000 times - making it one of the most influential ML papers ever. Mikolov later moved to Facebook AI, then left to work on AI safety.

Fun fact: The famous king/queen example was discovered by accident when a researcher was debugging the model and tried random arithmetic operations!

After Word2Vec, researchers started finding increasingly bizarre patterns in embeddings:

Analogies that work:

  • Paris - France + Italy = Rome (capitals)
  • walking - walked + swam = swimming (tense)
  • brother - man + woman = sister (gender)
  • good - better + worse = bad (comparatives)

But also unexpected ones:

  • sushi - Japan + Italy = pizza (!)
  • Microsoft - Windows + Apple = macOS (!)
  • Einstein - physicist + painter = Picasso (!)

Researchers published paper after paper exploring these “linguistic regularities.” The embeddings had somehow learned conceptual relationships just from predicting context words!

The Billion-Dollar Pivot: From Words to Sentences

Section titled “The Billion-Dollar Pivot: From Words to Sentences”

Word2Vec had a fatal flaw: it only embedded single words. How do you embed “not good” (which means bad)?

2018: Google’s BERT solved this by embedding entire sentences. But BERT was slow - generating one embedding required a full transformer forward pass.

2019: Nils Reimers (a German researcher) created Sentence-BERT while doing his PhD. His insight: train BERT to produce embeddings that are fast to compare. The paper was rejected from the main NeurIPS conference but accepted to EMNLP.

The impact:

  • Sentence-BERT made semantic search practical
  • Reimers founded Hugging Face’s sentence-transformers library
  • Now used by: Google, Amazon, Microsoft, and virtually every AI startup
  • Downloads: 100M+ per month on Hugging Face

The lesson: A PhD student’s “rejected” paper became the foundation of the entire embedding industry!

In December 2022, OpenAI released text-embedding-ada-002 - and shocked the industry with its pricing:

Before (text-embedding-ada-001):

  • Price: $0.20 per 1,000 tokens
  • Quality: Good but not great

After (ada-002):

  • Price: $0.0001 per 1,000 tokens (2000x cheaper!)
  • Quality: Significantly better

What happened? OpenAI had achieved a massive efficiency breakthrough. They never disclosed the details, but the pricing made embeddings essentially free for most applications.

Industry reaction:

  • Startups pivoted overnight from open-source to OpenAI
  • Self-hosted embedding servers were abandoned
  • Cohere, Google, and others scrambled to match pricing
  • By 2024, all major providers offer embeddings at <$0.0005 per 1K tokens

Here’s a secret: Netflix doesn’t use traditional collaborative filtering anymore. They use embeddings.

Every movie on Netflix has an embedding. Every user has an embedding (based on their watch history). Recommendations are simply:

user_embedding = average(embeddings of watched movies)
recommendations = nearest_neighbors(user_embedding, all_movie_embeddings)

The numbers:

  • Netflix generates $1B+ per year in value from their recommendation system
  • Embeddings replaced their old system in 2019
  • 80% of content watched comes from recommendations

Amazon, Spotify, TikTok - they all use similar embedding-based recommendation systems. The algorithm that powers your social media feed is essentially: find content with embeddings similar to what you’ve engaged with.

The fundamental limitation of keyword search is the “vocabulary mismatch” problem. Users don’t use the same words as document authors. A study by Microsoft Research found that in enterprise search, the exact query terms appeared in relevant documents only 23% of the time. The remaining 77% required understanding synonyms, related concepts, or paraphrased ideas.

Embeddings solve this naturally. When a user searches for “laptop won’t turn on,” an embedding-based system understands this is semantically similar to “computer not booting,” “PC power issues,” and “notebook startup failure”—even though they share few words in common. This semantic understanding is why embedding-based search typically outperforms keyword search by 30-60% on relevance metrics.

In 2022, the MTEB (Massive Text Embedding Benchmark) was released, ranking embedding models on 56 diverse tasks.

The competition got intense:

  • OpenAI released ada-002, claimed #1 spot
  • Cohere released embed-v3, challenged for #1
  • Voyage AI (startup) came out of nowhere with voyage-large
  • Open source (e5, gte, bge) caught up rapidly

Current state (2024):

  • Best commercial: OpenAI text-embedding-3-large, Voyage AI voyage-large-2
  • Best open-source: e5-large-v2, bge-large-en-v1.5
  • Open source is now within 2-3% of commercial models!
ModelCost per 1M tokensQuality (MTEB)
OpenAI ada-002$0.1061.0%
OpenAI text-embedding-3-large$0.1364.6%
Voyage AI voyage-large-2$0.1264.5%
Cohere embed-v3$0.1064.5%
Open source (e5-large)~$0.00*63.0%

*Self-hosted cost depends on your infrastructure

The insight: For most applications, open-source embeddings are “good enough” and essentially free. Only pay for commercial when you need that extra 2-3% quality!


What you learned:

  • Embeddings are dense vectors representing text meaning
  • Generated using neural networks trained on massive datasets
  • Measured using cosine similarity (direction, not distance)
  • Applied to search, clustering, recommendations, classification, duplicates
  • OpenAI, Voyage, and open-source options available
  • Always use the same model for all embeddings in a system

Key formulas:

cosine_similarity(A, B) = (A · B) / (|A| × |B|)

Key code patterns:

# Generate embedding
embedding = get_embedding(text)
# Compare similarity
similarity = cosine_similarity(emb_1, emb_2)
# Search
scores = [cosine_similarity(query_emb, doc_emb) for doc_emb in doc_embeddings]
top_results = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:k]

Next module: Module 10: Vector Spaces & Semantic Search

In Module 10, you’ll experience the Heureka Moment: understanding embeddings as coordinates in semantic space, where mathematical operations work on meaning itself!

You’ll learn:

  • Visualizing embeddings in 2D/3D space
  • Vector arithmetic: king - man + woman ≈ queen
  • Building semantic search from scratch
  • Vector databases and indexing at scale
  • The geometry of meaning!

This is when everything clicks!


** Neural Dojo - Master embeddings, unlock semantic understanding! **


Last updated: 2025-11-21 Module 9: Embeddings & Semantic Similarity