Перейти до вмісту

Vision AI

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6


Reading Time: 7-8 hours Prerequisites: Module 22

Section titled “Reading Time: 7-8 hours Prerequisites: Module 22”

San Francisco. September 25, 2023. 2:14 PM. OpenAI researcher Gabriel Goh uploaded an image to their internal GPT-4V test server. It wasn’t a standard benchmark—it was a photograph of his grandmother’s handwritten Hungarian recipe card, stained with decades of cooking oil, the cursive faded and cramped.

“Translate this and tell me what she’s making,” he typed.

The model responded: “This appears to be a Hungarian recipe for chicken paprikás. The handwriting notes ‘nagymama titkos receptje’ (grandmother’s secret recipe) at the top. I notice she’s crossed out ‘2 tablespoons’ and written ‘3’ instead—perhaps she learned over time that more paprika is better…”

Goh sat back, stunned. The model hadn’t just read the text—it had understood the context, noticed the correction, even inferred what it meant. This wasn’t optical character recognition. This was comprehension.

“That’s the moment I knew we’d crossed a threshold. The model wasn’t just seeing and reading separately—it was understanding the image the way a human would, noticing the small details that tell a story.” — Gabriel Goh, OpenAI Research Scientist


The Evolution of Computer Vision: A Brief History

Section titled “The Evolution of Computer Vision: A Brief History”

Before we dive into modern vision-language models, it’s worth understanding how we got here. The history of teaching computers to “see” is a story of humility, breakthroughs, and the eventual realization that language was the missing ingredient.

The Early Days: Hand-Crafted Features (1960s-2012)

Section titled “The Early Days: Hand-Crafted Features (1960s-2012)”

In the 1960s, researchers at MIT believed computer vision would be solved in a summer. Seymour Papert famously assigned it as a student project in 1966. Fifty years later, we’re still working on it.

Early approaches relied on hand-crafted features:

  • Edge detection (Canny, 1986): Finding boundaries in images
  • SIFT (Lowe, 1999): Scale-Invariant Feature Transform for object recognition
  • HOG (Dalal & Triggs, 2005): Histogram of Oriented Gradients for pedestrian detection

These methods worked in controlled environments but failed spectacularly in the real world. A SIFT-based car detector trained in California would fail in Tokyo. A face detector trained on white faces would miss Black faces. The “semantic gap” between pixels and meaning seemed insurmountable.

Everything changed in 2012 when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered ImageNet with AlexNet. Their deep CNN beat the competition by 10.8 percentage points—an unheard-of margin. Computer vision would never be the same.

The following years brought rapid progress:

  • VGG (2014): Deeper is better (19 layers)
  • GoogLeNet/Inception (2014): Parallel convolutions at different scales
  • ResNet (2015): Skip connections enabled 152-layer networks
  • EfficientNet (2019): Neural architecture search for optimal efficiency

By 2020, CNNs had achieved superhuman accuracy on ImageNet—97.3% compared to human error rates of 5.1%. But there was a catch: they could classify “golden retriever” but couldn’t answer “what is the dog doing?”

The breakthrough came when researchers stopped treating vision and language as separate problems. CLIP showed that training on image-caption pairs from the internet created representations that generalized far better than supervised learning. Suddenly, a model trained on web data could classify medical images, satellite photos, and artwork—tasks it had never explicitly trained for.

** Did You Know?**

The ImageNet dataset that sparked the deep learning revolution was created by Fei-Fei Li at Stanford. She hired workers from Amazon Mechanical Turk to label 14 million images into 22,000 categories. The project took three years (2007-2010) and cost less than $50,000—a pittance compared to the tens of billions of dollars it helped unlock in AI research. Li later said she was repeatedly told the project was “a waste of time” by colleagues who believed hand-crafted features were the only viable approach.


By the end of this module, you will:

  • Understand multimodal AI architectures that combine vision and language
  • Master CLIP for image-text embeddings and zero-shot classification
  • Use GPT-4V, Claude Vision, and Gemini for visual reasoning
  • Build image search and multimodal chatbots
  • Implement vision-language reasoning systems
  • Understand the transformer architecture for images (ViT)

Imagine teaching a computer not just to read text OR recognize images, but to understand them together - to describe what’s in a photo, answer questions about diagrams, or search for images using natural language. This is the promise of vision-language models, and it’s revolutionizing how we build AI applications.

Traditional AI forced us to choose: you had language models that could write but couldn’t see, and computer vision models that could see but couldn’t explain. Vision-language models break down this barrier, enabling:

  • Visual Question Answering: “What color is the car in this image?”
  • Image Captioning: Automatically describe any image
  • Visual Search: Find images using natural language queries
  • Document Understanding: Extract information from PDFs, receipts, diagrams
  • Multimodal Reasoning: Solve problems that require both seeing and thinking

Era 1: Separate Models (2012-2020)

  • CNNs for vision (ImageNet, ResNet)
  • Transformers for language (BERT, GPT)
  • Connecting them required complex pipelines

Era 2: CLIP Revolution (2021)

  • OpenAI’s CLIP unified vision and language in one training framework
  • Zero-shot image classification without any labeled data
  • Enabled entirely new applications like DALL-E

Era 3: Native Multimodal Models (2023-present)

  • GPT-4V, Claude 3, Gemini: Models that see and think natively
  • No separate vision/language modules - truly integrated
  • Human-level performance on many visual reasoning tasks

Part 1: Understanding Vision Transformers (ViT)

Section titled “Part 1: Understanding Vision Transformers (ViT)”

Before diving into multimodal models, we need to understand how transformers learned to see.

Convolutional Neural Networks (CNNs) dominated computer vision for a decade (2012-2020). They work by sliding filters across images to detect patterns - edges, textures, shapes, objects. But they have limitations:

  • Fixed receptive field: CNNs see local patterns first, global patterns later
  • No built-in attention: Every pixel gets equal initial treatment
  • Hard to scale: Larger CNNs don’t improve as predictably as transformers

In October 2020, Google’s “An Image is Worth 16x16 Words” paper changed everything. The key insight was simple but revolutionary:

What if we treat an image like a sequence of words?

Traditional approach: Image → CNN → Features → Classifier
ViT approach: Image → Patches → Transformer → Classification
  1. Split Image into Patches: A 224x224 image becomes 196 patches of 16x16 pixels
  2. Flatten Patches: Each 16x16x3 patch becomes a 768-dimensional vector
  3. Add Position Embeddings: Tell the model where each patch is located
  4. Add [CLS] Token: A learnable token that aggregates image information
  5. Process with Transformer: Standard transformer encoder layers
  6. Classify from [CLS]: The [CLS] token output goes to classification head
# Conceptual ViT forward pass
def vit_forward(image):
# 1. Patchify: [B, 3, 224, 224] → [B, 196, 768]
patches = split_into_patches(image, patch_size=16)
patch_embeddings = linear_projection(patches)
# 2. Add position embeddings
patch_embeddings += position_embeddings # [B, 196, 768]
# 3. Prepend [CLS] token
cls_token = learnable_cls_token.expand(batch_size, 1, 768)
tokens = torch.cat([cls_token, patch_embeddings], dim=1) # [B, 197, 768]
# 4. Transformer encoder
for layer in transformer_layers:
tokens = layer(tokens) # Self-attention + FFN
# 5. Classification from [CLS]
cls_output = tokens[:, 0] # [B, 768]
logits = classification_head(cls_output) # [B, num_classes]
return logits

The patch approach works because:

  1. Transformers need sequences: Patches create a sequence from a 2D image
  2. 16x16 is semantically meaningful: Large enough to contain objects/parts
  3. Attention sees everything: Every patch can attend to every other patch immediately
  4. Scales predictably: More compute → better results (unlike CNNs)
ModelPatch SizeLayersHidden DimParamsImageNet Acc
ViT-B/16161276886M77.9%
ViT-L/1616241024307M79.7%
ViT-H/1414321280632M80.9%
ViT-G/14144014081.8B83.3%

The naming convention: ViT-{Size}/{Patch Size}

  • B = Base, L = Large, H = Huge, G = Giant
  • /16 means 16x16 patches, /14 means 14x14 patches (more patches = more compute)

** Did You Know?**

The Vision Transformer paper was initially rejected from multiple conferences. Reviewers argued that “an image is fundamentally different from text—you can’t just use the same architecture.” The Google Brain team persisted, and when the results came in, the evidence was undeniable: with enough data (14 million images), ViT matched CNNs. With massive data (300 million images), it crushed them. Today, ViT variants power nearly every major vision-language model, from CLIP to GPT-4V.


Part 2: CLIP - Connecting Vision and Language

Section titled “Part 2: CLIP - Connecting Vision and Language”

CLIP (Contrastive Language-Image Pre-training) is one of the most influential AI models ever created. Released by OpenAI in January 2021, it fundamentally changed how we think about vision-language learning.

CLIP consists of two encoders trained together:

Image Text
↓ ↓
┌─────────────┐ ┌─────────────┐
│ Vision │ │ Text │
│ Encoder │ │ Encoder │
│ (ViT/ResNet)│ │ (Transformer)│
└─────────────┘ └─────────────┘
↓ ↓
Image Embedding Text Embedding
↓ ↓
┌───────────────────────────────┐
│ Contrastive Learning │
│ (Match images with captions) │
└───────────────────────────────┘

CLIP was trained on 400 million image-text pairs scraped from the internet. The training objective is elegantly simple:

  1. Take a batch of N image-caption pairs
  2. Encode all N images to get N image embeddings
  3. Encode all N captions to get N text embeddings
  4. For each image, push its embedding closer to its matching caption
  5. For each image, push its embedding away from non-matching captions
# Simplified CLIP training step
def clip_training_step(images, captions):
# Encode both modalities
image_embeddings = vision_encoder(images) # [N, 512]
text_embeddings = text_encoder(captions) # [N, 512]
# Normalize embeddings
image_embeddings = F.normalize(image_embeddings, dim=-1)
text_embeddings = F.normalize(text_embeddings, dim=-1)
# Compute similarity matrix
logits = image_embeddings @ text_embeddings.T # [N, N]
logits = logits * temperature # Learnable temperature parameter
# Contrastive loss: diagonal should be highest
labels = torch.arange(N)
loss_i2t = F.cross_entropy(logits, labels) # Image → Text
loss_t2i = F.cross_entropy(logits.T, labels) # Text → Image
return (loss_i2t + loss_t2i) / 2

The diagonal of the similarity matrix represents correct image-caption pairs. Training maximizes these while minimizing off-diagonal elements.

CLIP’s most impressive capability is zero-shot classification - classifying images into categories it has never seen during training.

How it works:

  1. Create text prompts for each class: “a photo of a {class}”
  2. Encode all prompts to get text embeddings
  3. Encode the image to get its embedding
  4. Find which text embedding is most similar to the image embedding
def zero_shot_classify(image, class_names):
# Create text prompts
prompts = [f"a photo of a {name}" for name in class_names]
# Encode image and texts
image_embedding = vision_encoder(image)
text_embeddings = text_encoder(prompts)
# Normalize
image_embedding = F.normalize(image_embedding, dim=-1)
text_embeddings = F.normalize(text_embeddings, dim=-1)
# Compute similarities
similarities = image_embedding @ text_embeddings.T
# Return class with highest similarity
predicted_class = class_names[similarities.argmax()]
return predicted_class
  1. Image Search: Encode images into a vector database, search with natural language
  2. Content Moderation: Classify images without explicit training
  3. Image Generation: CLIP guides DALL-E, Stable Diffusion
  4. Medical Imaging: Zero-shot diagnosis without medical training data
  5. Robotics: Describe tasks in natural language, let robot find relevant objects
ModelOrganizationKey Innovation
CLIPOpenAIOriginal contrastive learning
OpenCLIPLAIONOpen-source reproduction
SigLIPGoogleSigmoid loss (better than softmax)
BLIPSalesforceAdded image captioning
BLIP-2SalesforceFrozen LLM + Q-Former bridge
EVA-CLIPBAAILargest open CLIP (18B params)

Vision-Language Models take the next step beyond CLIP: instead of just comparing images and text, they can generate text about images and reason about visual content.

Modern VLMs typically combine three components:

Image
┌────────────────┐
│ Vision Encoder │
│ (ViT / SigLIP) │
└────────────────┘
Image Tokens
┌────────────────┐
│ Projection │
│ (Connector) │
└────────────────┘
Visual Embeddings
┌──────────────────────────────────┐
│ │
│ [Visual] [Text Prompt Tokens] │
│ │
│ Large Language Model │
│ (gpt-5, Claude, Llama) │
│ │
└──────────────────────────────────┘
Generated Response

Released: September 2023 Architecture: Unknown (proprietary) Capabilities:

  • Image understanding and description
  • OCR and document analysis
  • Diagram and chart interpretation
  • Multi-image reasoning
  • Spatial understanding

Usage:

from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-5",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {"url": "https://example.com/image.jpg"}
}
]
}
]
)

Released: March 2024 Models: Haiku, Sonnet, Opus (all have vision) Capabilities:

  • Strong at detailed image analysis
  • Excellent chart/graph interpretation
  • Good at multi-step visual reasoning
  • Strong safety measures

Usage:

import anthropic
import base64
client = anthropic.Anthropic()
# Load image as base64
with open("image.jpg", "rb") as f:
image_data = base64.standard_b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-4.6-opus-20240229",
max_tokens=1024,
messages=[
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": image_data
}
},
{
"type": "text",
"text": "Describe this image in detail."
}
]
}
]
)

Released: December 2023 Models: Gemini Pro Vision, Gemini Ultra Capabilities:

  • Native multimodal (not bolted on)
  • Video understanding
  • Long context with images
  • Strong at spatial reasoning

Usage:

import google.generativeai as genai
from PIL import Image
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel('gemini-3.5-pro')
image = Image.open("image.jpg")
response = model.generate_content([
"What's happening in this image?",
image
])

Released: April 2023 Architecture: CLIP ViT + Vicuna/Llama Key Features:

  • Fully open source and weights available
  • Can run locally
  • Multiple sizes (7B, 13B, 34B)
  • Good for research and customization

Architecture:

Image → CLIP ViT-L/14 → Linear Projection → Vicuna/Llama-2
Response Text
ModelBest ForLimitations
GPT-4VGeneral vision tasks, complex reasoningCost, API-only
Claude 3Detailed analysis, safety-criticalAPI-only
GeminiVideo, long contextRegional availability
LLaVALocal deployment, customizationLower quality than closed models

** Did You Know?**

The race to build GPT-4V was shrouded in secrecy. OpenAI trained multiple vision-language architectures in parallel, including one approach that was “shockingly simple”—just concatenating image tokens with text tokens in the transformer. Rumor has it that this simple approach outperformed more complex architectures. Meanwhile, Google rushed Gemini to market after GPT-4V’s announcement, leading to an embarrassing demo where edited video made the model appear faster than it was. The multimodal AI race has become one of the most competitive in tech history.


Just like text prompting, there’s an art to prompting vision-language models effectively.

Simple Description:

User: What's in this image?

Specific Focus:

User: How many people are in this image? What are they doing?

Expert Persona:

User: As an art historian, analyze the composition and technique in this painting.

Visual reasoning improves dramatically with CoT prompting:

Without CoT:

User: [Image of a math problem]
What is the answer?

With CoT:

User: [Image of a math problem]
Look at this problem carefully. First, identify what type of math problem it is.
Then, list out all the given information.
Finally, solve it step by step and show your work.

VLMs can compare multiple images:

User: [Image 1: Product listing photo]
[Image 2: Actual product received]
Compare these two images. Is the product as advertised?
List any differences you notice.

For documents, invoices, receipts:

User: [Image of invoice]
Extract all of the following information in JSON format:
- Invoice number
- Date
- Vendor name
- Line items (product, quantity, price)
- Total amount
- Tax amount

For technical diagrams:

User: [Architecture diagram]
Analyze this system architecture diagram:
1. Identify all components and services
2. Trace the data flow from user request to response
3. Identify potential bottlenecks or single points of failure
4. Suggest improvements for scalability

Part 5: Building with Vision-Language Models

Section titled “Part 5: Building with Vision-Language Models”

Build semantic image search using CLIP embeddings:

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import numpy as np
class CLIPImageSearch:
def __init__(self, model_name="openai/clip-vit-base-patch32"):
self.model = CLIPModel.from_pretrained(model_name)
self.processor = CLIPProcessor.from_pretrained(model_name)
self.image_embeddings = []
self.image_paths = []
def index_images(self, image_paths):
"""Create embeddings for all images."""
self.image_paths = image_paths
for path in image_paths:
image = Image.open(path)
inputs = self.processor(images=image, return_tensors="pt")
with torch.no_grad():
embedding = self.model.get_image_features(**inputs)
embedding = embedding / embedding.norm(dim=-1, keepdim=True)
self.image_embeddings.append(embedding.numpy())
self.image_embeddings = np.vstack(self.image_embeddings)
def search(self, query: str, top_k: int = 5):
"""Search images using natural language query."""
inputs = self.processor(text=[query], return_tensors="pt")
with torch.no_grad():
text_embedding = self.model.get_text_features(**inputs)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
# Compute similarities
similarities = (text_embedding.numpy() @ self.image_embeddings.T)[0]
# Get top results
top_indices = np.argsort(similarities)[::-1][:top_k]
results = [
{"path": self.image_paths[i], "score": similarities[i]}
for i in top_indices
]
return results
# Usage
search = CLIPImageSearch()
search.index_images(["cat.jpg", "dog.jpg", "car.jpg", "house.jpg"])
results = search.search("a furry pet playing")
from openai import OpenAI
import base64
class VisualQA:
def __init__(self):
self.client = OpenAI()
def encode_image(self, image_path: str) -> str:
"""Encode image to base64."""
with open(image_path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")
def ask(self, image_path: str, question: str) -> str:
"""Ask a question about an image."""
image_data = self.encode_image(image_path)
response = self.client.chat.completions.create(
model="gpt-5",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": f"Look at this image carefully and answer: {question}"
},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_data}"
}
}
]
}
],
max_tokens=500
)
return response.choices[0].message.content
# Usage
vqa = VisualQA()
answer = vqa.ask("diagram.png", "What components are shown in this architecture?")
from openai import OpenAI
import base64
from dataclasses import dataclass
from typing import List
@dataclass
class Caption:
short: str
detailed: str
keywords: List[str]
alt_text: str
class ImageCaptioner:
def __init__(self):
self.client = OpenAI()
def caption(self, image_path: str) -> Caption:
"""Generate multiple caption styles for an image."""
with open(image_path, "rb") as f:
image_data = base64.b64encode(f.read()).decode("utf-8")
prompt = """Analyze this image and provide:
1. SHORT: A one-sentence caption (max 15 words)
2. DETAILED: A detailed description (2-3 sentences)
3. KEYWORDS: 5-10 relevant keywords, comma-separated
4. ALT_TEXT: Accessibility-friendly alt text
Format your response exactly as:
SHORT: [caption]
DETAILED: [description]
KEYWORDS: [keywords]
ALT_TEXT: [alt text]"""
response = self.client.chat.completions.create(
model="gpt-5",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image_data}"}
}
]
}
]
)
# Parse response
text = response.choices[0].message.content
lines = text.strip().split("\n")
result = {}
for line in lines:
if line.startswith("SHORT:"):
result["short"] = line.replace("SHORT:", "").strip()
elif line.startswith("DETAILED:"):
result["detailed"] = line.replace("DETAILED:", "").strip()
elif line.startswith("KEYWORDS:"):
keywords = line.replace("KEYWORDS:", "").strip()
result["keywords"] = [k.strip() for k in keywords.split(",")]
elif line.startswith("ALT_TEXT:"):
result["alt_text"] = line.replace("ALT_TEXT:", "").strip()
return Caption(**result)

Vision API calls are expensive. Strategies to optimize:

  1. Image Resizing: Reduce resolution before sending

    from PIL import Image
    def resize_for_api(image_path, max_size=1024):
    img = Image.open(image_path)
    img.thumbnail((max_size, max_size))
    return img
  2. Caching: Cache responses for identical images

    import hashlib
    def hash_image(image_bytes):
    return hashlib.md5(image_bytes).hexdigest()
  3. Batching: Process multiple images efficiently

  4. Model Selection: Use cheaper models for simple tasks

  1. Parallel Processing: Process images concurrently
  2. Streaming: Use streaming responses for faster first-token
  3. Edge Deployment: Run lightweight models locally (LLaVA, Florence-2)
  4. Async Operations: Non-blocking API calls
  1. Content Filtering: Check images before processing
  2. PII Detection: Be careful with documents containing personal information
  3. Rate Limiting: Implement proper rate limits
  4. Error Handling: Graceful degradation when API fails

Understanding the costs of vision AI is crucial for production deployments. Vision API calls are significantly more expensive than text-only calls, and the economics can make or break a product.

ProviderModelImage CostNotes
OpenAIgpt-5$0.00255 per 512x512 tileMultiple tiles for larger images
OpenAIgpt-5-mini$0.000638 per 512x512 tile4x cheaper, slightly lower quality
AnthropicClaude 3 Opus~$0.024 per 1000 input tokensImages count as ~1500 tokens
AnthropicClaude 3 Sonnet~$0.003 per imageMuch cheaper for most uses
AnthropicClaude 3 Haiku~$0.0005 per imageBest value for simple tasks
GoogleGemini Pro VisionFree tier, then $0.0025/imageGenerous free quota

Scenario 1: E-commerce Product Moderation

  • 100,000 product images/day
  • Need: Check for policy violations
  • gpt-5 cost: $255/day = $7,650/month
  • Claude Haiku cost: $50/day = $1,500/month
  • Local LLaVA cost: ~$500/month in compute

Scenario 2: Document Processing

  • 10,000 invoices/day
  • Average 3 pages each = 30,000 images
  • gpt-5 cost: $76.50/day = $2,295/month
  • Hybrid approach (OCR + LLM): $300/month

Scenario 3: Medical Imaging Analysis

  • 1,000 X-rays/day (research setting)
  • Need: High accuracy (Claude Opus or gpt-5)
  • Cost: ~$30/day = $900/month
  • But: Regulatory requirements may mandate specific solutions

1. Resolution Optimization Most VLMs don’t need full resolution. gpt-5 processes images in 512x512 tiles. Sending a 4K image means 64 tiles = 64x the cost.

def optimize_for_api(image_path, task_type="general"):
"""Resize images based on task requirements."""
from PIL import Image
img = Image.open(image_path)
# Task-specific optimal sizes
sizes = {
"general": 1024, # Good balance
"ocr": 2048, # Need text clarity
"classification": 512, # Usually sufficient
"thumbnail": 256 # Quick checks
}
max_dim = sizes.get(task_type, 1024)
if max(img.size) > max_dim:
img.thumbnail((max_dim, max_dim), Image.LANCZOS)
return img

2. Tiered Model Selection Use cheap models for easy tasks, expensive models for hard ones:

def smart_analyze(image):
# First pass: cheap model for classification
result = haiku_classify(image)
if result.confidence > 0.9:
return result # Easy case, done
# Second pass: expensive model for hard cases
return opus_analyze(image)

3. Caching Strategies Cache responses for identical or similar images:

import hashlib
from functools import lru_cache
def hash_image(image_bytes):
return hashlib.sha256(image_bytes).hexdigest()
@lru_cache(maxsize=10000)
def cached_analyze(image_hash, prompt):
# Actual API call only on cache miss
return vlm.analyze(image_hash, prompt)

Production War Stories: When Vision AI Goes Wrong

Section titled “Production War Stories: When Vision AI Goes Wrong”

A major fashion retailer deployed GPT-4V to automatically generate alt text for their product images—100,000+ items in their catalog. The system worked beautifully in testing.

In production, they discovered the model occasionally hallucinated brand names. A plain white t-shirt might be described as “Gucci white cotton t-shirt” when it was their store brand. Worse, it sometimes described competitor products.

The fallout:

  • Luxury brands threatened trademark lawsuits
  • They had to manually review 100,000 descriptions
  • Total cost of the “time-saving” automation: $2.1 million in legal fees, contractor costs, and lost sales

Lesson learned: Always validate VLM outputs against known constraints. If it’s a store-brand product, the description should never mention other brands.

An insurance company built a system to estimate car damage from photos. The VLM would analyze crash photos and estimate repair costs. It saved adjusters hours per claim.

Six months in, they noticed something troubling: claims from certain zip codes were consistently estimated lower. Investigation revealed the model was inferring vehicle age and quality from backgrounds—a damaged car in front of a well-maintained house got higher estimates than the same damage in front of an older house.

The fix: They cropped all images to show only the vehicle, removing environmental context. Claims became more consistent, but the story became a cautionary tale about unintended bias in vision systems.

The Medical Imaging False Positive Cascade

Section titled “The Medical Imaging False Positive Cascade”

A hospital deployed a VLM to pre-screen chest X-rays, flagging potential issues for radiologist review. The system was supposed to reduce workload by filtering out clearly normal scans.

Instead, it created a 30% increase in radiologist workload. Why? The model was trained on a dataset where 40% of images showed abnormalities (because normal scans weren’t interesting to researchers). In the real hospital, only 5% of scans had issues. The model’s base rate assumptions were completely wrong.

Lesson learned: Training data distribution must match deployment distribution. A model that’s great at finding rare diseases in a curated dataset may be useless in general screening.

A social media platform deployed vision AI to detect policy-violating content. The system worked well for obvious violations but struggled with context.

The worst case: A news article thumbnail showing historical war footage was repeatedly flagged and removed. The appeals process took weeks. Meanwhile, actual violations using subtle symbols and coded imagery slipped through because they weren’t in the training data.

The company’s response: They now use a hybrid system—AI flags potential issues, but humans make final decisions on all non-obvious cases. The “fully automated” dream became a “human-in-the-loop” reality.

** Did You Know?**

Tesla’s Autopilot system processes 8 cameras simultaneously, creating ~1.5 GB of visual data per second per vehicle. At Tesla’s scale (millions of vehicles), this represents the largest vision AI deployment in history. The system has logged over 1 billion miles of real-world driving data. When rare edge cases occur—a truck carrying a white trailer against a bright sky, a person in a wheelchair crossing unexpectedly—Tesla can search their database to find similar scenarios and retrain. This “fleet learning” approach is impossible for smaller players to replicate.


Did You Know? Historical Context and Stories

Section titled “Did You Know? Historical Context and Stories”

When OpenAI released CLIP in January 2021, it was initially overshadowed by DALL-E (released the same day). But CLIP would prove to be the more foundational contribution. The paper “Learning Transferable Visual Models From Natural Language Supervision” showed that:

  • Web-scale matters: 400 million image-text pairs from the internet
  • Simple objectives work: Contrastive learning is all you need
  • Zero-shot transfers: No fine-tuning required for new tasks

The CLIP team included Alec Radford (GPT lead), Jong Wook Kim, and Ilya Sutskever. They trained on a private dataset called WIT (WebImageText) - which has never been released.

In April 2023, a team at UW-Madison released LLaVA (Large Language and Vision Assistant), showing that you could create a capable VLM for about $300 in compute costs. The key insight: don’t train everything from scratch - just connect a frozen CLIP encoder to a frozen LLM with a small trainable projection layer.

The paper was uploaded to arXiv on April 17, 2023, and within 24 hours had over 1,000 GitHub stars. It democratized VLM research overnight.

Multiple companies are competing to build the best vision encoder:

  • OpenAI’s CLIP: The original, but closed source
  • Google’s SigLIP: Uses sigmoid loss instead of softmax, scales better
  • Meta’s DINOv2: Self-supervised learning without text
  • LAION’s OpenCLIP: Open reproduction of CLIP
  • BAAI’s EVA-CLIP: Currently largest open CLIP (18B params)

Each makes different tradeoffs between efficiency, accuracy, and openness.

When GPT-4V was released in September 2023, OpenAI demonstrated it could:

  • Read handwritten notes and convert to LaTeX
  • Explain memes and humor
  • Solve visual puzzles
  • Read charts and graphs
  • Debug code by looking at screenshots

But perhaps most impressively, it could understand complex diagrams well enough to help lawyers with visual evidence and architects with floor plans. The multimodal version scores in the 88th percentile on the bar exam (text-only gpt-5 scored 90th).

During testing of Claude 3 Opus’s vision capabilities, Anthropic observed something curious. When shown a document containing information about a previous conversation (a “needle in a haystack” test), Claude 3 wrote:

“Here is the most relevant sentence in the documents: ‘The most delicious pizza topping combination is figs, prosciutto, and goat cheese…’ However, this sentence seems very out of place… I suspect this may be a test to see if I’m paying attention…”

The model recognized it was being tested. This sparked discussions about whether VLMs might develop situational awareness.

The open-source AI community created LAION-5B, a dataset of 5.85 billion image-text pairs, to train OpenCLIP. It enabled reproducible CLIP research but also sparked controversy:

  • Artist lawsuits: Artists claimed their work was scraped without consent
  • CSAM discovery: Researchers found illegal content in the dataset
  • Copyright debates: Who owns web-scraped data?

LAION-5B was temporarily taken down in December 2023 for review, highlighting the tension between open AI research and data ethics.

Florence: Microsoft’s Visual Foundation Model

Section titled “Florence: Microsoft’s Visual Foundation Model”

Microsoft Research’s Florence project (2021-2023) pioneered several ideas later adopted industry-wide:

  • Unified visual representations (one model for classification, detection, segmentation)
  • Visual-language pre-training at scale
  • Transfer to any visual task

Florence-2 (2024) achieved state-of-the-art results on multiple benchmarks while being small enough to run on consumer GPUs.

One of the most counterintuitive findings in VLM research is that resolution matters less than you’d think. Researchers at Google found that their PaLI model performed nearly as well on 224x224 images as on 588x588 images for most tasks—despite the 7x difference in pixel count.

The explanation: after self-attention, the model has already abstracted away from raw pixels. What matters is semantic content, not pixel-level detail. This finding has major implications for deployment costs—you can often resize images dramatically without losing accuracy.

** Did You Know?**

The most expensive image ever processed by a VLM was a 200-megapixel satellite photo analyzed by Google Earth Engine’s AI system in 2023. The image covered 400 square kilometers of the Amazon rainforest at 50cm resolution—every individual tree was visible. Processing it required splitting the image into 200,000 tiles and running each through a deforestation detection model. The total compute cost: approximately $15,000 for a single image. The finding: 2.3% of the monitored area had been illegally cleared in the previous month, leading to arrests of logging operations.

Benchmarking VLMs is surprisingly hard. Popular benchmarks include:

  • VQA v2: Visual question answering
  • GQA: Compositional reasoning
  • MMMU: College-level multimodal understanding
  • MathVista: Mathematical reasoning with visuals
  • MM-Bench: Comprehensive multimodal evaluation

Models routinely “game” benchmarks, leading to a cat-and-mouse game between benchmark creators and model trainers.


VLMs are good at OCR but not perfect. For critical document processing:

Bad:

# Trust VLM output directly
text = vlm.extract_text(document)
process_invoice(text) # May have errors!

Better:

# Validate and verify
text = vlm.extract_text(document)
text = spell_check(text)
if not validate_invoice_format(text):
flag_for_human_review()

Low-quality images produce low-quality results:

Bad:

# Send tiny thumbnail
analyze(thumbnail_50x50.jpg)

Better:

# Ensure adequate resolution
if image.size[0] < 512 or image.size[1] < 512:
raise ValueError("Image too small for reliable analysis")

Complex visual reasoning often needs multiple steps:

Bad:

# One giant prompt
response = vlm.analyze("Count all objects, describe relationships,
identify brands, extract text, and determine location")

Better:

# Break into focused tasks
objects = vlm.analyze("List all objects visible")
relationships = vlm.analyze("Describe spatial relationships between objects")
text = vlm.analyze("Extract any visible text")
# Combine results

Pitfall 4: Hallucinations in Visual Content

Section titled “Pitfall 4: Hallucinations in Visual Content”

VLMs can hallucinate details not present in images:

Mitigation strategies:

  • Ask for confidence scores
  • Request models to say “I cannot determine” when uncertain
  • Cross-validate with multiple prompts
  • Use structured output to catch inconsistencies

Exercise 1: Build Zero-Shot Image Classifier

Section titled “Exercise 1: Build Zero-Shot Image Classifier”

Using CLIP, build a classifier that can categorize images into custom categories without any training.

Goal: Understand how CLIP enables classification without labeled training data.

Step-by-Step Instructions:

  1. Setup Environment:
Terminal window
pip install transformers torch pillow
  1. Create the Classifier:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
class ZeroShotClassifier:
def __init__(self):
self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
def classify(self, image_path, categories, prompt_template="a photo of a {}"):
image = Image.open(image_path)
prompts = [prompt_template.format(cat) for cat in categories]
inputs = self.processor(
text=prompts,
images=image,
return_tensors="pt",
padding=True
)
with torch.no_grad():
outputs = self.model(**inputs)
logits = outputs.logits_per_image
probs = logits.softmax(dim=1)
results = {cat: prob.item() for cat, prob in zip(categories, probs[0])}
return dict(sorted(results.items(), key=lambda x: x[1], reverse=True))
# Usage
classifier = ZeroShotClassifier()
categories = ["dog", "cat", "bird", "car", "house"]
results = classifier.classify("test_image.jpg", categories)
print(results)
  1. Test with Different Prompts:
templates = [
"a photo of a {}",
"a picture of a {}",
"an image containing a {}",
"a {} in a photograph"
]
for template in templates:
results = classifier.classify("test.jpg", categories, template)
print(f"{template}: {list(results.keys())[0]}")

Expected Results: You should see classification accuracy vary by 5-15% depending on prompt template. The “a photo of a {}” template typically works best for natural images.

Success Criteria: Achieve >80% accuracy on a test set of 10 images across 5 categories.

Build a semantic image search system that finds images using natural language queries.

Goal: Learn to build CLIP-based vector search for images.

Step-by-Step Instructions:

  1. Index Images:
import os
import numpy as np
from pathlib import Path
class ImageSearchEngine:
def __init__(self):
self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
self.image_paths = []
self.embeddings = None
def index_folder(self, folder_path):
"""Index all images in a folder."""
image_files = []
for ext in ['*.jpg', '*.jpeg', '*.png', '*.webp']:
image_files.extend(Path(folder_path).glob(f'**/{ext}'))
embeddings = []
for img_path in image_files:
try:
image = Image.open(img_path).convert('RGB')
inputs = self.processor(images=image, return_tensors="pt")
with torch.no_grad():
emb = self.model.get_image_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)
embeddings.append(emb.numpy())
self.image_paths.append(str(img_path))
except Exception as e:
print(f"Error processing {img_path}: {e}")
self.embeddings = np.vstack(embeddings)
print(f"Indexed {len(self.image_paths)} images")
def search(self, query, top_k=5):
"""Search images with natural language."""
inputs = self.processor(text=[query], return_tensors="pt")
with torch.no_grad():
text_emb = self.model.get_text_features(**inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarities = (text_emb.numpy() @ self.embeddings.T)[0]
top_indices = np.argsort(similarities)[::-1][:top_k]
return [
{"path": self.image_paths[i], "score": float(similarities[i])}
for i in top_indices
]
# Usage
engine = ImageSearchEngine()
engine.index_folder("./my_photos")
results = engine.search("sunset over the ocean")
  1. Add Visualization:
import matplotlib.pyplot as plt
def show_results(results, query):
fig, axes = plt.subplots(1, len(results), figsize=(15, 5))
fig.suptitle(f"Query: {query}")
for ax, result in zip(axes, results):
img = Image.open(result["path"])
ax.imshow(img)
ax.set_title(f"Score: {result['score']:.3f}")
ax.axis('off')
plt.tight_layout()
plt.show()

Expected Results: Searches like “people laughing” should find photos of happy groups, while “mountain landscape” finds nature shots.

Success Criteria: Build a search engine that indexes 100+ images and returns relevant results in <1 second.

Exercise 3: Document Understanding Pipeline

Section titled “Exercise 3: Document Understanding Pipeline”

Build a system that extracts information from PDFs using vision-language models.

Goal: Learn to process multi-page documents with VLMs.

Step-by-Step Instructions:

  1. PDF to Images:
from pdf2image import convert_from_path
import anthropic
import base64
from io import BytesIO
def pdf_to_images(pdf_path, dpi=150):
"""Convert PDF pages to images."""
return convert_from_path(pdf_path, dpi=dpi)
def image_to_base64(image):
"""Convert PIL image to base64."""
buffer = BytesIO()
image.save(buffer, format='PNG')
return base64.b64encode(buffer.getvalue()).decode()
  1. Document Q&A System:
class DocumentQA:
def __init__(self):
self.client = anthropic.Anthropic()
self.pages = []
self.page_summaries = []
def load_document(self, pdf_path):
"""Load and summarize each page."""
self.pages = pdf_to_images(pdf_path)
for i, page in enumerate(self.pages):
summary = self._summarize_page(page, i)
self.page_summaries.append(summary)
def _summarize_page(self, image, page_num):
"""Get summary of a single page."""
response = self.client.messages.create(
model="claude-4.5-haiku-20240307",
max_tokens=500,
messages=[{
"role": "user",
"content": [
{"type": "image", "source": {
"type": "base64",
"media_type": "image/png",
"data": image_to_base64(image)
}},
{"type": "text", "text":
f"Page {page_num + 1}. Summarize key information in 2-3 sentences."}
]
}]
)
return response.content[0].text
def ask(self, question):
"""Answer a question about the document."""
# First, find relevant pages
context = "\n".join([
f"Page {i+1}: {summary}"
for i, summary in enumerate(self.page_summaries)
])
# Then, answer with relevant page images
response = self.client.messages.create(
model="claude-4.6-sonnet-20240229",
max_tokens=1000,
messages=[{
"role": "user",
"content": [
{"type": "text", "text": f"Document context:\n{context}\n\nQuestion: {question}"}
]
}]
)
return response.content[0].text

Expected Results: The system should extract text, tables, and charts from PDFs and answer questions accurately.

Success Criteria: Successfully process a 10-page document and answer 5 factual questions about its contents.

Build a product comparison tool that analyzes multiple images.

Goal: Learn multi-image reasoning with VLMs.

Step-by-Step Instructions:

  1. Create Comparison System:
from openai import OpenAI
import base64
from dataclasses import dataclass
from typing import List
@dataclass
class ComparisonResult:
similarities: List[str]
differences: List[str]
recommendation: str
confidence: float
class ProductComparator:
def __init__(self):
self.client = OpenAI()
def compare(self, image1_path: str, image2_path: str) -> ComparisonResult:
"""Compare two product images."""
with open(image1_path, "rb") as f:
img1_b64 = base64.b64encode(f.read()).decode()
with open(image2_path, "rb") as f:
img2_b64 = base64.b64encode(f.read()).decode()
response = self.client.chat.completions.create(
model="gpt-5",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": """
Compare these two products. Provide:
1. SIMILARITIES: List 3-5 things they have in common
2. DIFFERENCES: List 3-5 key differences
3. RECOMMENDATION: Which would you recommend and why?
4. CONFIDENCE: How confident are you (0-100)?
Format your response as:
SIMILARITIES:
- [item]
DIFFERENCES:
- [item]
RECOMMENDATION: [text]
CONFIDENCE: [number]
"""},
{"type": "image_url", "image_url": {
"url": f"data:image/jpeg;base64,{img1_b64}"
}},
{"type": "image_url", "image_url": {
"url": f"data:image/jpeg;base64,{img2_b64}"
}}
]
}],
max_tokens=1000
)
return self._parse_response(response.choices[0].message.content)
def _parse_response(self, text):
# Parse structured output
# ... parsing logic ...
pass

Expected Results: The tool should identify visual differences like color, size, features, and quality indicators.

Success Criteria: Successfully compare 5 pairs of products with accurate similarity/difference identification.


By the end of this module, you should have:

  1. CLIP-based image search system
  2. Visual QA application
  3. Document understanding pipeline
  4. DELIVERABLE: Vision AI Toolkit with multi-provider support

Success Criteria:

  • Can search images with natural language
  • Can answer questions about images
  • Can process documents and extract information
  • Works with multiple VLM providers (OpenAI, Anthropic)

  • “Learning Transferable Visual Models From Natural Language Supervision” (CLIP, 2021)
  • “An Image is Worth 16x16 Words” (ViT, 2020)
  • “Visual Instruction Tuning” (LLaVA, 2023)
  • “Scaling Vision Transformers” (ViT-22B, 2023)

Current VLMs process images one at a time, but the next frontier is real-time video understanding. Google’s Gemini 3.5 Pro already accepts hour-long videos, and models are rapidly improving at:

  • Temporal reasoning: Understanding cause and effect across frames
  • Action recognition: Identifying what people and objects are doing
  • Anomaly detection: Spotting unusual events in security footage
  • Video Q&A: Answering questions about video content

Vision-language models are crucial for robots that need to understand and interact with the physical world. Projects like Google’s RT-2 and PaLM-E combine VLMs with robot control:

  • Task instruction: “Pick up the red cup” requires understanding both language and visual scene
  • Spatial reasoning: Navigating environments and manipulating objects
  • Zero-shot generalization: Following instructions for novel tasks

Current VLMs understand 2D images, but emerging models are learning to reason about 3D space:

  • Depth estimation: Predicting distance from single images
  • 3D reconstruction: Building 3D models from multiple views
  • Spatial relationships: Understanding “in front of”, “behind”, “above”

Vision AI is transforming specialized fields:

  • Pathology: Detecting cancer in tissue slides (PathAI, Paige)
  • Radiology: Screening X-rays and CT scans (Aidoc, Viz.ai)
  • Drug discovery: Analyzing molecular structures and protein folding
  • Satellite imagery: Climate monitoring, disaster response, urban planning

The FDA has approved over 500 AI medical devices as of 2024, with the majority being vision-based.

As privacy concerns grow, there’s increasing interest in:

  • On-device VLMs: Running models locally on phones and laptops
  • Federated learning: Training without centralizing images
  • Privacy-preserving inference: Encrypted visual processing

Florence-2, released by Microsoft in June 2024, achieves strong performance at under 1 billion parameters—small enough to run on consumer hardware.

Many researchers believe vision-language models are a stepping stone to artificial general intelligence. The argument: true intelligence requires grounding in the physical world, and vision provides that grounding. As Fei-Fei Li has said:

“Language alone is not enough. To truly understand ‘chair,’ you need to have seen chairs, sat in chairs, maybe even built a chair. Vision connects language to reality.”

Whether or not AGI is near, it’s clear that vision-language models will continue advancing rapidly, enabling applications we can barely imagine today.


Vision-Language Models represent a fundamental leap in AI capabilities. By combining the pattern recognition of vision models with the reasoning abilities of language models, we can now build systems that truly understand visual content.

The journey from early computer vision—when researchers thought the problem could be solved in a summer—to today’s GPT-4V and Claude Vision has taken over fifty years. Along the way, we’ve seen multiple paradigm shifts: from hand-crafted features to deep learning, from CNNs to transformers, and from single-modality models to truly multimodal systems.

What makes modern VLMs remarkable isn’t just their accuracy—it’s their flexibility. A single model can describe photos, read documents, analyze charts, solve visual puzzles, and answer questions about images it has never seen before. This generalization was impossible just five years ago.

Key Takeaways:

  1. Vision Transformers (ViT) treat images as sequences of patches, enabling transformer magic for vision
  2. CLIP aligned images and text in a shared embedding space through contrastive learning
  3. Modern VLMs (GPT-4V, Claude 3, Gemini) can reason about images, not just describe them
  4. Prompting matters - structured, chain-of-thought prompts improve visual reasoning
  5. Production requires careful cost management, caching, and error handling

What’s Next: Module 24 explores Video AI - applying these concepts to moving images, understanding temporal dynamics, and even generating video content.


Last updated: 2025-11-26 Next: Module 24 - Video AI & Generation