Embeddings & Semantic Search
Цей контент ще не доступний вашою мовою.
AI/ML Engineering Track | Complexity:
[COMPLEX]| Time: 5-6
Or: Teaching Computers That ‘Cat’ and ‘Kitten’ Are Related
Section titled “Or: Teaching Computers That ‘Cat’ and ‘Kitten’ Are Related”Reading Time: Approximately 2-3 hours Prerequisites: Module 6, Module 8
The Accidental Discovery That Changed AI Forever
Section titled “The Accidental Discovery That Changed AI Forever”A story about a Czech researcher who stumbled upon the geometry of meaning
January 2013, Google Building 43, Mountain View, California
Tomáš Mikolov was frustrated. The Czech-born researcher had spent months trying to make Google’s speech recognition faster. His approach was simple: strip down the neural network architecture to its absolute minimum—just one hidden layer instead of the complex deep networks everyone else was using.
His colleagues at Google were skeptical. “You can’t get good results with such a simple model,” they said. But Mikolov was stubborn. He trained his simplified model on billions of words of text, optimizing purely for speed.
Then, on a cold January afternoon, he decided to examine the hidden layer weights—the internal numbers the network had learned. What he saw made him freeze.
The network had organized words in space. Similar words—like “king” and “queen”—were clustered together. Different words—like “king” and “banana”—were far apart. But that wasn’t the shocking part.
Mikolov, on a whim, tried some vector arithmetic. He took the vector for “king,” subtracted “man,” and added “woman.” The result? A vector almost identical to “queen.”
He tried another. “Paris” minus “France” plus “Italy” equals… “Rome.”
Mikolov stared at his screen. The neural network hadn’t just learned word associations. It had somehow learned the geometry of concepts. You could do math on meaning.
He published his findings in a paper called “Efficient Estimation of Word Representations in Vector Space.” It has since been cited over 40,000 times, making it one of the most influential machine learning papers ever written. The technique he accidentally discovered—word embeddings—became the foundation of modern NLP, powering everything from Google Search to ChatGPT.
That simplified model? It became known as Word2Vec. And the numbers in that hidden layer—those magical coordinates that captured meaning—we call them embeddings.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”By the end of this module, you will:
- Understand what embeddings are and why they’re foundational to modern AI
- Generate embeddings using OpenAI, Anthropic, and open-source models
- Calculate semantic similarity between texts using cosine similarity
- Apply embeddings to real-world problems: search, clustering, recommendations, classification
- Understand the difference between sparse (TF-IDF) and dense (neural) embeddings
- Measure embedding quality and choose the right model for your use case
- Build practical applications using embeddings (semantic search, recommendation engine)
** Did You Know?**
The word “embedding” has a precise mathematical meaning: you’re embedding one space into another. In our case, we’re embedding the discrete space of words (where “cat” and “dog” are just different symbols) into a continuous vector space (where they’re close together because they’re both animals). This transformation is so powerful because continuous spaces support operations like addition, subtraction, and distance measurement—operations that don’t make sense on raw text.
Introduction: The Meaning Problem
Section titled “Introduction: The Meaning Problem”The Challenge: Computers Don’t Understand Meaning
Section titled “The Challenge: Computers Don’t Understand Meaning”Imagine you’re building a search engine for your documentation. A user searches for:
Query: “How do I restart a failed service?”
Your documentation contains:
- “Restarting a crashed daemon”
- “Service recovery procedures”
- “How to troubleshoot failed processes”
Problem: Traditional keyword search looks for exact word matches:
- “restart” vs “restarting” → close, but not exact
- “failed” vs “crashed” → completely different words!
- “service” vs “daemon” vs “process” → synonyms, but different tokens
Result: The user misses relevant documentation because computers see text as strings, not meaning.
The Solution: Embeddings
Section titled “The Solution: Embeddings”Embeddings are vectors (lists of numbers) that represent the meaning of text.
"How do I restart a failed service?" ↓[0.23, -0.41, 0.87, ..., 0.15] ← 1536 numbers (OpenAI)Key insight: Texts with similar meanings have similar embeddings!
# Similar meanings = close vectorsembedding("restart a service") # [0.23, -0.41, 0.87, ...]embedding("reboot a daemon") # [0.25, -0.39, 0.85, ...] ← CLOSE!
# Different meanings = distant vectorsembedding("restart a service") # [0.23, -0.41, 0.87, ...]embedding("cook pasta") # [-0.71, 0.92, -0.13, ...] ← FAR!This transforms the search problem from matching strings to measuring distance in meaning-space.
STOP: Time to Practice!
Section titled “STOP: Time to Practice!”You’ve learned the theory - now let’s build with embeddings!
Embeddings are the foundation of modern AI applications. Theory alone won’t give you intuition - you need to see how texts with similar meanings cluster together in vector space, and how dramatically better semantic search is compared to keyword matching.
Practice Path (~2.5-3 hours total)
Section titled “Practice Path (~2.5-3 hours total)”1. Embedding Basics - Generate and compare embeddings
- Concept: Creating embeddings and measuring cosine similarity
- ⏱️ Time: 60-75 minutes
- Goal: Build intuition for how meaning becomes math
- What you’ll learn: Similar texts = similar vectors!
2. Semantic Applications - Build real-world systems
- Concept: Search, clustering, recommendations, classification
- ⏱️ Time: 75-90 minutes
- Goal: Apply embeddings to 5 different use cases
- What you’ll learn: One technology, infinite applications!
Deliverable: Semantic Search Engine
Section titled “Deliverable: Semantic Search Engine”What: Build a production-ready semantic search system for one of your projects Time: 4-5 hours Portfolio Value: Demonstrates end-to-end AI system building skills
Requirements:
- Choose a data source from your projects:
- kaizen: Documentation search
- vibe: Course content recommendations
- contrarian: Financial news clustering
- Work: Infrastructure runbook search
- Implement complete system:
- Data ingestion and preprocessing
- Embedding generation (with caching!)
- Similarity search with ranking
- API endpoint or CLI interface
- Performance metrics (recall@k, latency)
- Compare 2+ embedding models:
- Measure quality (relevance) on test queries
- Measure cost and latency
- Document trade-offs
- Include examples:
- 5-10 example queries with results
- Show keyword search vs semantic search comparison
- Deploy or package for production use
Success Criteria:
- System handles 100+ documents
- Sub-second query latency
- Embeddings cached/persisted efficiently
- Measurable improvement over keyword search
- Production-ready code (error handling, logging)
Real-World Impact: Semantic search is a fundamental capability in modern applications - this deliverable proves you can build it from scratch!
Did You Know?
Section titled “Did You Know?”The word “embedding” comes from mathematics: you’re embedding a high-dimensional discrete space (words/texts) into a continuous vector space. It’s like taking discrete cities and placing them on a continuous map where proximity represents similarity!
What Are Embeddings?
Section titled “What Are Embeddings?”Definition
Section titled “Definition”An embedding is a dense vector representation of data (usually text) that captures semantic meaning.
Properties:
- Fixed-length: Every text (short or long) becomes the same length vector
- Dense: Most values are non-zero (vs sparse representations like one-hot encoding)
- Learned: Trained on massive datasets to capture meaning
- Semantic: Similar meanings → similar vectors
Dimensionality
Section titled “Dimensionality”Embeddings are vectors in high-dimensional space:
| Model | Dimensions | Use Case |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose, cost-effective |
| OpenAI text-embedding-3-large | 3072 | Higher quality, more expensive |
| Anthropic (Voyage) | 1024 | Optimized for retrieval |
| Sentence-BERT | 384-768 | Open-source, local deployment |
| Word2Vec | 100-300 | Historical, word-level only |
Why so many dimensions?
Think of each dimension as capturing one aspect of meaning:
- Dimension 1: Formality (casual ↔ professional)
- Dimension 2: Sentiment (negative ↔ positive)
- Dimension 3: Technical (general ↔ specialized)
- …
- Dimension 1536: [Some subtle semantic feature]
With 1536 dimensions, you can capture incredibly nuanced meaning!
** See dimensionality in action: 01_embedding_basics.py generates and visualizes embeddings!**
How Embeddings Work
Section titled “How Embeddings Work”The Encoding Process
Section titled “The Encoding Process”Input Text → Tokenization → Neural Network → Embedding VectorStep-by-step:
-
Tokenization: Text becomes tokens (Module 7!)
"Machine learning" → [Machine, learning] -
Neural Network: Tokens pass through transformer layers
Tokens → Attention → Context → Pooling → Vector -
Output: Fixed-length vector
[0.23, -0.41, 0.87, 0.15, ..., -0.62]
Contrastive Learning (How Models Learn Embeddings)
Section titled “Contrastive Learning (How Models Learn Embeddings)”Embedding models are trained using contrastive learning:
Training objective: Similar texts should have close embeddings, different texts should have distant embeddings.
Example training pair:
# Positive pair (similar meaning)text_1 = "The cat sat on the mat"text_2 = "A cat is sitting on a mat"# Goal: Make embeddings close
# Negative pair (different meaning)text_1 = "The cat sat on the mat"text_3 = "Python is a programming language"# Goal: Make embeddings far apartTraining process:
- Generate millions of (positive, negative) pairs from web data
- For each pair, compute embeddings
- Adjust neural network weights to:
- Minimize distance between positive pairs
- Maximize distance between negative pairs
- Repeat for billions of examples
Result: The model learns to encode meaning into vectors!
** Did You Know?**
The contrastive learning approach has become so effective that modern embedding models can understand nuances that even humans sometimes miss. In 2023, researchers at Google discovered that their embedding model had learned to distinguish between different types of irony—sarcastic statements clustered separately from genuine statements, even when the words were nearly identical. The model had learned the “shape” of irony from patterns in billions of text examples. This emergent ability wasn’t explicitly programmed; it arose naturally from the training process, suggesting that meaning has a geometric structure that neural networks can discover.
The Scale of Training:
Modern embedding models are trained on staggering amounts of data. OpenAI’s text-embedding-3 models are trained on hundreds of billions of tokens—equivalent to reading every book in the Library of Congress thousands of times. This massive scale is what allows them to understand that “automobile” and “car” are synonyms, or that “bank” has different meanings in financial and river contexts. The training compute for these models can cost millions of dollars, but the resulting embeddings are available to anyone for fractions of a cent per thousand tokens.
🆚 Sparse vs Dense Embeddings
Section titled “🆚 Sparse vs Dense Embeddings”Sparse Embeddings (Traditional)
Section titled “Sparse Embeddings (Traditional)”TF-IDF (Term Frequency-Inverse Document Frequency):
# Vocabulary: ["apple", "banana", "computer", "fruit", ...]# 50,000 words → 50,000 dimensions
doc_1 = "I like apples"embedding_sparse = [0, 0.87, 0, 0, ..., 0] # Mostly zeros!# ^ ^apple# Most values are 0 (sparse)Characteristics:
- Interpretable (each dimension = specific word)
- Fast to compute
- No semantic understanding (synonyms not captured)
- High dimensionality (vocab size)
- Sparse (mostly zeros)
Dense Embeddings (Neural)
Section titled “Dense Embeddings (Neural)”Modern embeddings:
doc_1 = "I like apples"embedding_dense = [0.23, -0.41, 0.87, ..., 0.15] # All non-zero# ^ ^ ^ ^# Most values are non-zero (dense)Characteristics:
- Semantic understanding (knows “apple” ≈ “fruit”)
- Lower dimensionality (1536 vs 50,000)
- Better generalization
- Less interpretable (what does dimension 42 mean?)
- Requires neural network (slower to compute)
Comparison Example
Section titled “Comparison Example”query = "How to fix a broken car?"
# Document 1: "Repairing an automobile"# Document 2: "Python programming tutorial"
# TF-IDF (sparse):# - Matches "fix" and "car" literally# - Misses "repair" and "automobile" (synonyms!)# → Poor match score
# Dense embeddings:# - Understands "fix" ≈ "repair"# - Understands "car" ≈ "automobile"# → Excellent match score!Modern systems use dense embeddings for semantic understanding.
Generating Embeddings
Section titled “Generating Embeddings”Option 1: OpenAI API (Most Popular)
Section titled “Option 1: OpenAI API (Most Popular)”from openai import OpenAI
client = OpenAI(api_key="your-api-key")
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]: """Generate embedding using OpenAI API.""" text = text.replace("\n", " ") # OpenAI recommends this
response = client.embeddings.create( input=text, model=model )
return response.data[0].embedding
# Example usagetext = "Machine learning transforms data into insights"embedding = get_embedding(text)
print(f"Text: {text}")print(f"Embedding dimensions: {len(embedding)}")print(f"First 5 values: {embedding[:5]}")Output:
Text: Machine learning transforms data into insightsEmbedding dimensions: 1536First 5 values: [0.0234, -0.4123, 0.8765, 0.1543, -0.6234]OpenAI Models:
text-embedding-3-small: 1536 dims, $0.02 / 1M tokenstext-embedding-3-large: 3072 dims, $0.13 / 1M tokenstext-embedding-ada-002: Legacy, 1536 dims (deprecated)
Option 2: Anthropic (Voyage AI Integration)
Section titled “Option 2: Anthropic (Voyage AI Integration)”Anthropic partners with Voyage AI for embeddings:
import voyageai
client = voyageai.Client(api_key="your-voyage-api-key")
def get_voyage_embedding(text: str, model: str = "voyage-2") -> list[float]: """Generate embedding using Voyage AI.""" result = client.embed( texts=[text], model=model, input_type="document" # or "query" for search queries )
return result.embeddings[0]
# Example usageembedding = get_voyage_embedding("Machine learning is transformative")print(f"Embedding dimensions: {len(embedding)}")Voyage Models:
voyage-2: 1024 dims, optimized for retrievalvoyage-large-2: 1536 dims, higher quality
Option 3: Open-Source (Sentence Transformers)
Section titled “Option 3: Open-Source (Sentence Transformers)”FREE and runs locally!
from sentence_transformers import SentenceTransformer
# Load model (downloads ~400MB on first run)model = SentenceTransformer('all-MiniLM-L6-v2')
def get_local_embedding(text: str) -> list[float]: """Generate embedding using local model.""" embedding = model.encode(text) return embedding.tolist()
# Example usagetexts = [ "Machine learning is a subset of AI", "Deep learning uses neural networks", "Pizza is a delicious food"]
embeddings = model.encode(texts)
for text, emb in zip(texts, embeddings): print(f"{text[:30]}... → {len(emb)} dimensions")Popular Open-Source Models:
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
all-MiniLM-L6-v2 | 384 | Fast | Good | General purpose |
all-mpnet-base-v2 | 768 | Medium | Better | Higher quality |
all-distilroberta-v1 | 768 | Medium | Better | Balanced |
paraphrase-multilingual | 768 | Slow | Best | 50+ languages |
Pros:
- FREE (no API costs)
- Fast (local inference)
- Privacy (data stays local)
- Offline capable
Cons:
- Lower quality than OpenAI/Voyage
- Requires GPU for best performance
- Model download/storage
Measuring Similarity: Cosine Similarity
Section titled “Measuring Similarity: Cosine Similarity”Once you have embeddings, how do you measure how similar two texts are?
Cosine Similarity Explained
Section titled “Cosine Similarity Explained”Intuition: Measure the angle between two vectors. Similar vectors point in similar directions!
Vector A ↗ / ) θ (small angle) / ↗ / ↗ Vector B /↗ ●────────────→Formula:
cosine_similarity(A, B) = (A · B) / (|A| × |B|)
Where:- A · B = dot product (sum of element-wise multiplication)- |A| = magnitude (length) of vector A- |B| = magnitude (length) of vector BImplementation:
import numpy as np
def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float: """Calculate cosine similarity between two vectors.""" # Convert to numpy arrays a = np.array(vec_a) b = np.array(vec_b)
# Compute dot product dot_product = np.dot(a, b)
# Compute magnitudes magnitude_a = np.linalg.norm(a) magnitude_b = np.linalg.norm(b)
# Compute cosine similarity similarity = dot_product / (magnitude_a * magnitude_b)
return similarity
# Exampleemb_1 = get_embedding("Machine learning is powerful")emb_2 = get_embedding("AI and ML are transformative")emb_3 = get_embedding("I love pizza")
print(f"ML vs AI: {cosine_similarity(emb_1, emb_2):.3f}") # → 0.85 (high!)print(f"ML vs Pizza: {cosine_similarity(emb_1, emb_3):.3f}") # → 0.21 (low!)Output:
ML vs AI: 0.851 (very similar!)ML vs Pizza: 0.214 (not similar)Similarity Score Interpretation
Section titled “Similarity Score Interpretation”| Similarity | Interpretation | Example |
|---|---|---|
| 0.9 - 1.0 | Nearly identical | ”Car” vs “Automobile” |
| 0.7 - 0.9 | Very similar | ”Dog” vs “Puppy” |
| 0.5 - 0.7 | Somewhat similar | ”Dog” vs “Cat” |
| 0.3 - 0.5 | Slightly similar | ”Dog” vs “Animal” |
| 0.0 - 0.3 | Not similar | ”Dog” vs “Pizza” |
| < 0.0 | Opposite | Rare, but possible |
Important: These ranges are approximate and depend on the embedding model!
Why This Module Matters
Section titled “Why This Module Matters”Why not Euclidean distance?
Cosine similarity measures direction, not magnitude:
vec_1 = [1, 2, 3]vec_2 = [2, 4, 6] # Same direction, different magnitude
# Euclidean distance: LARGE (they're far apart in space)euclidean = np.linalg.norm(vec_1 - vec_2) # → 3.74
# Cosine similarity: IDENTICAL (same direction!)cosine = cosine_similarity(vec_1, vec_2) # → 1.0For text, direction matters more than magnitude. Two texts about the same topic (same direction) are similar even if one is more detailed (larger magnitude).
** Compare similarity measures: 01_embedding_basics.py shows cosine vs Euclidean!**
Did You Know?
Section titled “Did You Know?”Cosine similarity is used everywhere in recommender systems! When Netflix recommends movies, it’s comparing the embedding of movies you’ve watched to embeddings of all other movies, then recommending the ones with highest cosine similarity!
Use Cases: What Can You Build?
Section titled “Use Cases: What Can You Build?”1. Semantic Search
Section titled “1. Semantic Search”Problem: Find relevant documents even when exact keywords don’t match.
Solution:
- Generate embeddings for all documents (one-time)
- When user queries, generate query embedding
- Compute similarity between query and all documents
- Return top-K most similar documents
# Precompute document embeddings (do once)documents = [ "How to restart a service", "Service recovery procedures", "Troubleshooting failed processes", "Cooking pasta recipes",]
doc_embeddings = [get_embedding(doc) for doc in documents]
# User queryquery = "fix a crashed daemon"query_embedding = get_embedding(query)
# Compute similaritiessimilarities = [ cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
# Rank resultsresults = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, sim in results[:3]: print(f"{sim:.3f} - {doc}")Output:
0.872 - How to restart a service0.841 - Service recovery procedures0.798 - Troubleshooting failed processes0.134 - Cooking pasta recipes ← Correctly ranked low!2. Clustering (Group Similar Items)
Section titled “2. Clustering (Group Similar Items)”Problem: Automatically group similar documents/items.
Solution: Use k-means clustering on embeddings!
from sklearn.cluster import KMeansimport numpy as np
# Documentsdocuments = [ "Python programming tutorial", "JavaScript web development", "Machine learning basics", "Deep learning with PyTorch", "React frontend framework", "AI and neural networks",]
# Generate embeddingsembeddings = np.array([get_embedding(doc) for doc in documents])
# Cluster into 2 groupskmeans = KMeans(n_clusters=2, random_state=42)clusters = kmeans.fit_predict(embeddings)
# Print clustersfor cluster_id in range(2): print(f"\nCluster {cluster_id}:") for doc, label in zip(documents, clusters): if label == cluster_id: print(f" - {doc}")Output:
Cluster 0: - Python programming tutorial - JavaScript web development - React frontend framework
Cluster 1: - Machine learning basics - Deep learning with PyTorch - AI and neural networksPerfect clustering! Programming docs vs ML docs automatically separated.
** Build all 5 use cases: 02_semantic_applications.py implements search, clustering, recommendations, classification, and duplicates!**
3. Recommendation System
Section titled “3. Recommendation System”Problem: Recommend similar items to users.
Solution: Find items with embeddings similar to user’s liked items!
# User's favorite articlesuser_favorites = [ "Introduction to neural networks", "Deep learning for beginners"]
# All available articlesall_articles = [ "Advanced neural network architectures", # Should rank high "Cooking Italian cuisine", # Should rank low "Machine learning fundamentals", # Should rank high "Gardening tips and tricks", # Should rank low]
# Average user's favorite embeddingsuser_profile_embedding = np.mean( [get_embedding(fav) for fav in user_favorites], axis=0)
# Score all articlesscores = [ (article, cosine_similarity(user_profile_embedding, get_embedding(article))) for article in all_articles]
# Recommend top articlesrecommendations = sorted(scores, key=lambda x: x[1], reverse=True)
print("Recommended for you:")for article, score in recommendations[:2]: print(f"{score:.3f} - {article}")Output:
Recommended for you:0.891 - Advanced neural network architectures0.847 - Machine learning fundamentals4. Classification (Zero-Shot)
Section titled “4. Classification (Zero-Shot)”Problem: Classify text without training a classifier!
Solution: Compare text embedding to category embeddings, pick closest.
# Define categoriescategories = { "Technology": "Computers, software, programming, AI, tech", "Sports": "Football, basketball, athletics, competitions", "Cooking": "Recipes, food, ingredients, culinary arts"}
# Generate category embeddingscategory_embeddings = { name: get_embedding(description) for name, description in categories.items()}
# Classify a new texttext_to_classify = "Python is a popular programming language"
text_embedding = get_embedding(text_to_classify)
# Find closest categoryscores = { category: cosine_similarity(text_embedding, cat_emb) for category, cat_emb in category_embeddings.items()}
predicted_category = max(scores, key=scores.get)
print(f"Text: {text_to_classify}")print(f"Predicted: {predicted_category}")print(f"Scores: {scores}")Output:
Text: Python is a popular programming languagePredicted: TechnologyScores: {'Technology': 0.823, 'Sports': 0.142, 'Cooking': 0.098}5. Duplicate Detection
Section titled “5. Duplicate Detection”Problem: Find duplicate or near-duplicate content.
Solution: High similarity → likely duplicates!
def find_duplicates(documents: list[str], threshold: float = 0.95) -> list: """Find near-duplicate documents.""" embeddings = [get_embedding(doc) for doc in documents] duplicates = []
for i in range(len(documents)): for j in range(i + 1, len(documents)): sim = cosine_similarity(embeddings[i], embeddings[j]) if sim > threshold: duplicates.append((documents[i], documents[j], sim))
return duplicates
# Testdocs = [ "How to restart a server", "Restarting a server tutorial", # Duplicate! "Cooking pasta recipes", "Server restart guide", # Duplicate!]
dupes = find_duplicates(docs, threshold=0.85)
for doc1, doc2, sim in dupes: print(f"{sim:.3f} - '{doc1}' ≈ '{doc2}'")Output:
0.921 - 'How to restart a server' ≈ 'Restarting a server tutorial'0.897 - 'How to restart a server' ≈ 'Server restart guide'0.913 - 'Restarting a server tutorial' ≈ 'Server restart guide'️ Common Pitfalls
Section titled “️ Common Pitfalls”Pitfall 1: Not Normalizing Text
Section titled “Pitfall 1: Not Normalizing Text”Problem: Embeddings can be sensitive to formatting.
# These have different embeddings!emb_1 = get_embedding("HELLO WORLD")emb_2 = get_embedding("hello world")emb_3 = get_embedding("hello world") # Extra spaces
# Similarity might not be 1.0!Solution: Normalize text before embedding:
def normalize_text(text: str) -> str: """Normalize text for consistent embeddings.""" # Lowercase (optional - models handle case well) # text = text.lower()
# Remove extra whitespace text = " ".join(text.split())
# Remove newlines text = text.replace("\n", " ")
return textPitfall 2: Embedding Too Much Text
Section titled “Pitfall 2: Embedding Too Much Text”Problem: Models have token limits!
# OpenAI: 8191 tokens max# Long document → truncated → information loss!
long_doc = "..." * 10000 # 50,000 wordsembedding = get_embedding(long_doc) # Silently truncated!Solution: Chunk long documents:
def chunk_text(text: str, max_tokens: int = 500) -> list[str]: """Split text into chunks.""" # Simple chunking by sentences sentences = text.split(". ") chunks = [] current_chunk = [] current_tokens = 0
for sentence in sentences: tokens = len(sentence.split()) # Rough estimate if current_tokens + tokens > max_tokens: chunks.append(". ".join(current_chunk) + ".") current_chunk = [sentence] current_tokens = tokens else: current_chunk.append(sentence) current_tokens += tokens
if current_chunk: chunks.append(". ".join(current_chunk) + ".")
return chunks
# Embed each chunk separatelychunks = chunk_text(long_document)chunk_embeddings = [get_embedding(chunk) for chunk in chunks]Pitfall 3: Comparing Embeddings from Different Models
Section titled “Pitfall 3: Comparing Embeddings from Different Models”Problem: Embeddings from different models are incompatible!
# DON'T DO THIS!emb_1 = get_embedding("text", model="text-embedding-3-small") # 1536 dimsemb_2 = get_voyage_embedding("text") # 1024 dims
similarity = cosine_similarity(emb_1, emb_2) # ERROR or nonsense!Solution: Always use the same model for all embeddings in a system!
Pitfall 4: Not Caching Embeddings
Section titled “Pitfall 4: Not Caching Embeddings”Problem: Embedding APIs cost money and time!
# BAD: Re-embed the same text repeatedlyfor query in user_queries: doc_embeddings = [get_embedding(doc) for doc in documents] # WASTEFUL! # ... search logicSolution: Precompute and cache!
# GOOD: Embed documents oncedoc_embeddings = {doc: get_embedding(doc) for doc in documents}
# Then reusefor query in user_queries: query_emb = get_embedding(query) similarities = [ cosine_similarity(query_emb, doc_embeddings[doc]) for doc in documents ]Pitfall 5: Ignoring Embedding Quality
Section titled “Pitfall 5: Ignoring Embedding Quality”Problem: Not all embedding models are equal!
Solution: Benchmark on your data!
from sklearn.metrics import accuracy_score
# Test different modelsmodels = ["text-embedding-3-small", "text-embedding-3-large"]
for model in models: # Generate embeddings embeddings = [get_embedding(text, model=model) for text in test_texts]
# Evaluate on your task (e.g., classification accuracy) accuracy = evaluate_embeddings(embeddings, labels)
print(f"{model}: {accuracy:.3f}")Choosing the Right Embedding Model
Section titled “Choosing the Right Embedding Model”Decision Matrix
Section titled “Decision Matrix”| Criteria | Recommendation |
|---|---|
| Cost-sensitive | OpenAI text-embedding-3-small or open-source |
| Quality-critical | OpenAI text-embedding-3-large or Voyage |
| Privacy-required | Open-source (Sentence Transformers) |
| Multilingual | paraphrase-multilingual (open-source) |
| Low latency | Cache embeddings or use smaller models |
| Retrieval/search | Voyage voyage-2 or OpenAI |
| General purpose | OpenAI text-embedding-3-small |
Benchmarking
Section titled “Benchmarking”Test on your specific use case:
def benchmark_model(model_name: str, queries: list[str], documents: list[str], relevant: dict): """Benchmark a model on search task.""" # Generate embeddings query_embs = [get_embedding(q, model=model_name) for q in queries] doc_embs = [get_embedding(d, model=model_name) for d in documents]
# Evaluate recall@k hits = 0 for i, query in enumerate(queries): # Find top-5 documents similarities = [cosine_similarity(query_embs[i], doc_emb) for doc_emb in doc_embs] top_k_indices = np.argsort(similarities)[-5:][::-1]
# Check if relevant doc is in top-5 if relevant[i] in top_k_indices: hits += 1
recall_at_5 = hits / len(queries) return recall_at_5
# Testrecall = benchmark_model("text-embedding-3-small", test_queries, test_docs, relevance_map)print(f"Recall@5: {recall:.2%}")Real-World Applications
Section titled “Real-World Applications”Application 1: kaizen (Lean DevOps Platform)
Section titled “Application 1: kaizen (Lean DevOps Platform)”Current: Keyword-based RAG retrieval Enhancement: Semantic search with embeddings!
# Embed all kaizen documentationdocs = load_kaizen_docs()doc_embeddings = {doc["id"]: get_embedding(doc["text"]) for doc in docs}
# User asks questionuser_question = "How do I optimize deployment speed?"
# Find relevant docsquestion_emb = get_embedding(user_question)scores = [ (doc_id, cosine_similarity(question_emb, doc_embeddings[doc_id])) for doc_id in doc_embeddings]
# Get top 5 relevant docstop_docs = sorted(scores, key=lambda x: x[1], reverse=True)[:5]
# Use in RAG promptcontext = "\n\n".join([docs[doc_id]["text"] for doc_id, _ in top_docs])answer = llm_query(f"Context: {context}\n\nQuestion: {user_question}")Application 2: vibe (Teaching Platform)
Section titled “Application 2: vibe (Teaching Platform)”Use case: Recommend similar learning materials
# Student just completed a lessoncompleted_lesson = "Introduction to Python functions"lesson_emb = get_embedding(completed_lesson)
# Find similar lessonsall_lessons = get_all_lessons()recommendations = [ (lesson, cosine_similarity(lesson_emb, get_embedding(lesson))) for lesson in all_lessons]
# Recommend top 3recommended = sorted(recommendations, key=lambda x: x[1], reverse=True)[1:4] # Skip self
print("You might also like:")for lesson, score in recommended: print(f"- {lesson} (similarity: {score:.2f})")Application 3: contrarian (Stock Analysis)
Section titled “Application 3: contrarian (Stock Analysis)”Use case: Cluster news articles by topic
# Get today's financial newsnews_articles = fetch_financial_news()
# Embed articlesarticle_embeddings = [get_embedding(article["title"] + " " + article["summary"]) for article in news_articles]
# Clusterkmeans = KMeans(n_clusters=5)clusters = kmeans.fit_predict(article_embeddings)
# Group articles by clusterfor cluster_id in range(5): print(f"\nTopic {cluster_id}:") for article, label in zip(news_articles, clusters): if label == cluster_id: print(f" - {article['title']}")Application 4: Work (Geospatial + Cloud)
Section titled “Application 4: Work (Geospatial + Cloud)”Use case: Semantic search across infrastructure documentation
# Index all runbooks, docs, scriptsinfrastructure_docs = load_all_docs()doc_index = { doc["path"]: get_embedding(doc["content"]) for doc in infrastructure_docs}
# Engineer asks: "How do I scale the database?"query = "How do I scale the database?"query_emb = get_embedding(query)
# Find relevant docsresults = sorted( [(path, cosine_similarity(query_emb, emb)) for path, emb in doc_index.items()], key=lambda x: x[1], reverse=True)[:5]
print("Relevant documentation:")for path, score in results: print(f"{score:.3f} - {path}")Production War Stories
Section titled “Production War Stories”The E-commerce Search Disaster
Section titled “The E-commerce Search Disaster”Company: Major online retailer (anonymized) Challenge: Holiday season search failures costing $50,000/hour in lost sales
The problem was subtle. During Black Friday, customers searching for “warm winter jacket” weren’t finding the company’s best-selling “insulated parka” or “thermal coat” products. The keyword-based search system only matched exact terms.
The disaster unfolds:
- Search conversion rate dropped 40% on Black Friday
- Customer support flooded with “I can’t find [product]” complaints
- Engineering scrambled to add synonyms manually
- By Cyber Monday, they’d manually mapped 500 synonym pairs—and still missed countless combinations
The embedding solution: The team implemented semantic search over the following month:
# Before: Keyword matchingdef search_products(query): return db.query("SELECT * FROM products WHERE name LIKE '%{}%'".format(query))
# After: Semantic similaritydef semantic_search(query, top_k=20): query_embedding = get_embedding(query) # Find nearest product embeddings in vector database return vector_db.search(query_embedding, limit=top_k)Results:
- Search conversion rate increased 35%
- Zero synonym maintenance required
- “Warm winter jacket” now matches “insulated parka” automatically
- Estimated annual revenue increase: $4.2M
The Legal Discovery Breakthrough
Section titled “The Legal Discovery Breakthrough”Company: Law firm handling complex litigation Challenge: Find relevant documents in 2 million case files
Traditional approach: Teams of paralegals reading documents for weeks, billing $150/hour.
The embedding approach:
- Embed all 2 million documents (one-time cost: ~$200 in API calls)
- Lawyer describes what they’re looking for in plain English
- System returns the 100 most semantically similar documents
- Paralegals review only the filtered results
The numbers:
- Before: 6 weeks, 8 paralegals, ~$180,000 in billable hours
- After: 2 days, 2 paralegals, ~$5,000 in labor + $200 in API costs
- Total savings per case: ~$175,000
- ROI: 87,400%
Key insight: Embeddings are particularly powerful for search over unstructured text where you can’t predict what words people will use.
The Customer Support Resolution
Section titled “The Customer Support Resolution”Company: SaaS platform with 10,000 support tickets/month Challenge: Route tickets to the right team and suggest relevant knowledge base articles
The old system used rules: “If ticket contains ‘billing’, route to Finance.” This failed spectacularly when customers wrote things like “I was charged twice” (no keyword “billing”) or “my payment didn’t go through” (routed to Payments team when it should be Finance).
The embedding solution:
- Embed all historical tickets and their resolutions
- When new ticket arrives, find the 10 most similar past tickets
- Use majority vote to determine correct team
- Suggest knowledge base articles with highest semantic similarity
Results:
- Correct routing: 67% → 94%
- First-response time: 4 hours → 45 minutes (right team gets it immediately)
- Customer satisfaction: +28 NPS points
- Support team capacity effectively increased 40%
Economics of Embeddings
Section titled “Economics of Embeddings”Cost Comparison: Build vs Buy
Section titled “Cost Comparison: Build vs Buy”| Approach | Cost per 1M Tokens | Latency | Quality | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 100-200ms | Very Good | Most applications |
| OpenAI text-embedding-3-large | $0.13 | 150-300ms | Excellent | Quality-critical apps |
| Voyage AI voyage-2 | $0.10 | 100-200ms | Excellent | Retrieval-focused |
| Self-hosted all-MiniLM | ~$0.001* | 10-50ms | Good | High-volume, privacy |
| Self-hosted e5-large | ~$0.003* | 30-100ms | Very Good | Balanced |
*Self-hosted costs assume amortized GPU costs on cloud infrastructure
Real Cost Calculation
Section titled “Real Cost Calculation”Scenario: Semantic search for a documentation site
Assumptions:
- 10,000 documents to index
- Average 500 tokens per document
- 1,000 search queries per day
- Each query: 50 tokens
Initial indexing (one-time):
- Tokens: 10,000 × 500 = 5M tokens
- Cost (OpenAI small): 5M × $0.02/1M = $0.10
Daily queries:
- Tokens: 1,000 × 50 = 50K tokens
- Cost (OpenAI small): 50K × $0.02/1M = $0.001/day
Monthly total: $0.10 initial + $0.03 queries = $0.13/month
This is essentially free. Even at 100x the scale, you’re looking at $13/month.
When Self-Hosting Makes Sense
Section titled “When Self-Hosting Makes Sense”| Factor | API | Self-Hosted |
|---|---|---|
| Volume | <100M tokens/month | >100M tokens/month |
| Latency | 100-300ms acceptable | <50ms required |
| Privacy | Public cloud OK | Must stay on-premises |
| Quality | Need absolute best | ”Good enough” works |
| Ops burden | None | Significant |
Break-even analysis: At OpenAI’s pricing, self-hosting only makes sense above ~100M tokens/month (about $2,000/month in API costs). Below that, the operational overhead of running GPU infrastructure exceeds the API savings.
Interview Preparation: Embeddings
Section titled “Interview Preparation: Embeddings”Common Interview Questions
Section titled “Common Interview Questions”Q1: “Explain what embeddings are to a non-technical stakeholder.”
Strong Answer: “Embeddings are a way to convert text into numbers that capture meaning. Imagine you could place every word or sentence on a giant map, where similar things are close together and different things are far apart. ‘Cat’ would be near ‘kitten’ and ‘dog,’ but far from ‘airplane.’ Once we have this map, we can measure how similar two pieces of text are just by checking how close they are on the map. This powers features like ‘find similar articles,’ ‘recommended for you,’ and smart search that understands synonyms.”
Q2: “Why use cosine similarity instead of Euclidean distance?”
Strong Answer: “Cosine similarity measures the angle between vectors, not their absolute distance. This matters because embedding models don’t always produce vectors of consistent magnitude—a longer document might have a ‘bigger’ vector even if it’s about the same topic. Cosine similarity normalizes for this, focusing purely on direction. Two vectors pointing the same direction have similarity 1.0 regardless of their lengths. This makes it robust to variations in text length and model quirks.”
Q3: “How would you evaluate embedding quality for a specific use case?”
Strong Answer: “I’d create a test set with ground truth: pairs of texts labeled as similar or dissimilar, or queries with known relevant documents. Then I’d measure how well the embeddings rank the correct pairs higher than incorrect ones. Metrics like Recall@K (did the right document appear in top K results?) and Mean Reciprocal Rank are useful. I’d also compare multiple embedding models on this test set before committing to one. The key is evaluating on YOUR data—general benchmarks like MTEB are useful but don’t guarantee performance on domain-specific content.”
Q4: “What are the main pitfalls when implementing semantic search?”
Strong Answer: “Four big ones: First, not caching embeddings—recomputing them wastes money and time. Second, mixing embeddings from different models—they’re incompatible and comparisons are meaningless. Third, not chunking long documents—models have token limits, and silently truncated text loses information. Fourth, not handling the cold start—when you have no training data, start with zero-shot classification or simple similarity thresholds, then refine as you gather feedback.”
System Design Question
Section titled “System Design Question”Q: “Design a semantic search system for a company with 10 million documents.”
Strong Answer Framework:
-
Indexing Pipeline:
- Chunk documents into 500-token segments (overlap by 50)
- Generate embeddings via API (batch for efficiency)
- Store in vector database (Pinecone, Qdrant, or Weaviate)
- Estimated indexing time: 10M × 500 tokens = 5B tokens ≈ $100 with OpenAI
-
Query Pipeline:
- Embed user query
- Approximate nearest neighbor search in vector DB (HNSW algorithm, <100ms)
- Retrieve top 50-100 candidates
- Optional: Re-rank with cross-encoder for better precision
- Return top 10 to user
-
Scaling Considerations:
- Vector DB sharding for 10M+ documents
- Embedding cache for repeated queries
- Async indexing pipeline for new documents
- Monitoring: latency, cache hit rate, result diversity
-
Cost Estimate:
- Initial indexing: ~$100
- Daily queries (100K): ~$0.10/day
- Vector DB hosting: ~$100-500/month depending on scale
- Total: ~$200-600/month
Key Takeaways
Section titled “Key Takeaways”-
Embeddings capture meaning as geometry - Similar texts become close vectors, enabling mathematical operations on semantic content.
-
Cosine similarity is your friend - It measures directional similarity, robust to magnitude variations, and is the standard for comparing embeddings.
-
Choose your model wisely - OpenAI for quality, open-source for cost/privacy. Benchmark on YOUR data, not just MTEB scores.
-
Cache everything - Document embeddings rarely change. Compute once, store forever, query infinitely.
-
Chunk long documents - Models have token limits. Overlap chunks to maintain context across boundaries.
-
Same model, always - Never compare embeddings from different models. They live in different spaces.
-
Start simple, iterate - Basic semantic search often outperforms complex systems. Add re-ranking and filtering only when needed.
-
The economics are compelling - At $0.02 per million tokens, API costs are negligible for most applications.
-
Embeddings enable zero-shot learning - Classify without labeled data by comparing to category descriptions.
-
This is foundation technology - Every RAG system, recommendation engine, and semantic search uses embeddings. Master this, and everything else becomes easier.
Remember: Tomáš Mikolov’s accidental discovery in 2013 unlocked something profound—meaning has a geometry, and neural networks can discover it. Every time you use semantic search, get a recommendation, or interact with an AI system that “understands” you, you’re benefiting from the vectors that encode meaning into math. The simple idea of representing text as numbers in a high-dimensional space has become one of the most impactful and transformative concepts in modern AI, enabling revolutionary applications that would have seemed like pure science fiction just a decade ago.
Further Reading
Section titled “Further Reading”Papers
Section titled “Papers”- Sentence-BERT (2019): Paper - Introduced sentence embeddings
- SimCSE (2021): Paper - Contrastive learning for embeddings
- E5 (2022): Paper - Text embeddings by weak supervision
Documentation
Section titled “Documentation”- MTEB Leaderboard: Compare embedding models on 56 tasks
- txtai: Semantic search framework built on embeddings
- Qdrant: Vector database for embeddings (Module 14!)
Did You Know? The Revolution of Embeddings
Section titled “Did You Know? The Revolution of Embeddings”The Google Intern Who Changed NLP Forever
Section titled “The Google Intern Who Changed NLP Forever”In 2013, a Czech researcher named Tomáš Mikolov at Google published Word2Vec - and accidentally revolutionized natural language processing.
Mikolov wasn’t trying to create embeddings. He was trying to build a faster language model for speech recognition. His trick: simplify the neural network architecture to just one hidden layer. This made training 100x faster.
The accident: When Mikolov examined the hidden layer weights, he noticed something bizarre. Words with similar meanings had similar weights. And even more shocking: you could do math on them.
vector("king") - vector("man") + vector("woman") ≈ vector("queen")The paper “Efficient Estimation of Word Representations in Vector Space” has been cited over 40,000 times - making it one of the most influential ML papers ever. Mikolov later moved to Facebook AI, then left to work on AI safety.
Fun fact: The famous king/queen example was discovered by accident when a researcher was debugging the model and tried random arithmetic operations!
The “Linguistic Regularity” Discovery
Section titled “The “Linguistic Regularity” Discovery”After Word2Vec, researchers started finding increasingly bizarre patterns in embeddings:
Analogies that work:
Paris - France + Italy = Rome(capitals)walking - walked + swam = swimming(tense)brother - man + woman = sister(gender)good - better + worse = bad(comparatives)
But also unexpected ones:
sushi - Japan + Italy = pizza(!)Microsoft - Windows + Apple = macOS(!)Einstein - physicist + painter = Picasso(!)
Researchers published paper after paper exploring these “linguistic regularities.” The embeddings had somehow learned conceptual relationships just from predicting context words!
The Billion-Dollar Pivot: From Words to Sentences
Section titled “The Billion-Dollar Pivot: From Words to Sentences”Word2Vec had a fatal flaw: it only embedded single words. How do you embed “not good” (which means bad)?
2018: Google’s BERT solved this by embedding entire sentences. But BERT was slow - generating one embedding required a full transformer forward pass.
2019: Nils Reimers (a German researcher) created Sentence-BERT while doing his PhD. His insight: train BERT to produce embeddings that are fast to compare. The paper was rejected from the main NeurIPS conference but accepted to EMNLP.
The impact:
- Sentence-BERT made semantic search practical
- Reimers founded Hugging Face’s sentence-transformers library
- Now used by: Google, Amazon, Microsoft, and virtually every AI startup
- Downloads: 100M+ per month on Hugging Face
The lesson: A PhD student’s “rejected” paper became the foundation of the entire embedding industry!
The OpenAI Pricing Shock
Section titled “The OpenAI Pricing Shock”In December 2022, OpenAI released text-embedding-ada-002 - and shocked the industry with its pricing:
Before (text-embedding-ada-001):
- Price: $0.20 per 1,000 tokens
- Quality: Good but not great
After (ada-002):
- Price: $0.0001 per 1,000 tokens (2000x cheaper!)
- Quality: Significantly better
What happened? OpenAI had achieved a massive efficiency breakthrough. They never disclosed the details, but the pricing made embeddings essentially free for most applications.
Industry reaction:
- Startups pivoted overnight from open-source to OpenAI
- Self-hosted embedding servers were abandoned
- Cohere, Google, and others scrambled to match pricing
- By 2024, all major providers offer embeddings at <$0.0005 per 1K tokens
The Netflix Recommendation Engine
Section titled “The Netflix Recommendation Engine”Here’s a secret: Netflix doesn’t use traditional collaborative filtering anymore. They use embeddings.
Every movie on Netflix has an embedding. Every user has an embedding (based on their watch history). Recommendations are simply:
user_embedding = average(embeddings of watched movies)recommendations = nearest_neighbors(user_embedding, all_movie_embeddings)The numbers:
- Netflix generates $1B+ per year in value from their recommendation system
- Embeddings replaced their old system in 2019
- 80% of content watched comes from recommendations
Amazon, Spotify, TikTok - they all use similar embedding-based recommendation systems. The algorithm that powers your social media feed is essentially: find content with embeddings similar to what you’ve engaged with.
Why Embeddings Beat Keyword Search
Section titled “Why Embeddings Beat Keyword Search”The fundamental limitation of keyword search is the “vocabulary mismatch” problem. Users don’t use the same words as document authors. A study by Microsoft Research found that in enterprise search, the exact query terms appeared in relevant documents only 23% of the time. The remaining 77% required understanding synonyms, related concepts, or paraphrased ideas.
Embeddings solve this naturally. When a user searches for “laptop won’t turn on,” an embedding-based system understands this is semantically similar to “computer not booting,” “PC power issues,” and “notebook startup failure”—even though they share few words in common. This semantic understanding is why embedding-based search typically outperforms keyword search by 30-60% on relevance metrics.
The Benchmark Wars
Section titled “The Benchmark Wars”In 2022, the MTEB (Massive Text Embedding Benchmark) was released, ranking embedding models on 56 diverse tasks.
The competition got intense:
- OpenAI released ada-002, claimed #1 spot
- Cohere released embed-v3, challenged for #1
- Voyage AI (startup) came out of nowhere with voyage-large
- Open source (e5, gte, bge) caught up rapidly
Current state (2024):
- Best commercial: OpenAI text-embedding-3-large, Voyage AI voyage-large-2
- Best open-source: e5-large-v2, bge-large-en-v1.5
- Open source is now within 2-3% of commercial models!
The Surprising Economics
Section titled “The Surprising Economics”| Model | Cost per 1M tokens | Quality (MTEB) |
|---|---|---|
| OpenAI ada-002 | $0.10 | 61.0% |
| OpenAI text-embedding-3-large | $0.13 | 64.6% |
| Voyage AI voyage-large-2 | $0.12 | 64.5% |
| Cohere embed-v3 | $0.10 | 64.5% |
| Open source (e5-large) | ~$0.00* | 63.0% |
*Self-hosted cost depends on your infrastructure
The insight: For most applications, open-source embeddings are “good enough” and essentially free. Only pay for commercial when you need that extra 2-3% quality!
Module Summary
Section titled “Module Summary”What you learned:
- Embeddings are dense vectors representing text meaning
- Generated using neural networks trained on massive datasets
- Measured using cosine similarity (direction, not distance)
- Applied to search, clustering, recommendations, classification, duplicates
- OpenAI, Voyage, and open-source options available
- Always use the same model for all embeddings in a system
Key formulas:
cosine_similarity(A, B) = (A · B) / (|A| × |B|)Key code patterns:
# Generate embeddingembedding = get_embedding(text)
# Compare similaritysimilarity = cosine_similarity(emb_1, emb_2)
# Searchscores = [cosine_similarity(query_emb, doc_emb) for doc_emb in doc_embeddings]top_results = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:k]️ Next Steps
Section titled “️ Next Steps”Next module: Module 10: Vector Spaces & Semantic Search
In Module 10, you’ll experience the Heureka Moment: understanding embeddings as coordinates in semantic space, where mathematical operations work on meaning itself!
You’ll learn:
- Visualizing embeddings in 2D/3D space
- Vector arithmetic:
king - man + woman ≈ queen - Building semantic search from scratch
- Vector databases and indexing at scale
- The geometry of meaning!
This is when everything clicks!
** Neural Dojo - Master embeddings, unlock semantic understanding! **
Last updated: 2025-11-21 Module 9: Embeddings & Semantic Similarity