Vision AI

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Reading Time: 7-8 hours Prerequisites: Module 22

San Francisco. September 25, 2023. 2:14 PM. OpenAI researcher Gabriel Goh uploaded an image to their internal GPT-4V test server. It wasn’t a standard benchmark—it was a photograph of his grandmother’s handwritten Hungarian recipe card, stained with decades of cooking oil, the cursive faded and cramped.

“Translate this and tell me what she’s making,” he typed.

The model responded: “This appears to be a Hungarian recipe for chicken paprikás. The handwriting notes ‘nagymama titkos receptje’ (grandmother’s secret recipe) at the top. I notice she’s crossed out ‘2 tablespoons’ and written ‘3’ instead—perhaps she learned over time that more paprika is better…”

Goh sat back, stunned. The model hadn’t just read the text—it had understood the context, noticed the correction, even inferred what it meant. This wasn’t optical character recognition. This was comprehension.

“That’s the moment I knew we’d crossed a threshold. The model wasn’t just seeing and reading separately—it was understanding the image the way a human would, noticing the small details that tell a story.” — Gabriel Goh, OpenAI Research Scientist

The Evolution of Computer Vision: A Brief History

Before we dive into modern vision-language models, it’s worth understanding how we got here. The history of teaching computers to “see” is a story of humility, breakthroughs, and the eventual realization that language was the missing ingredient.

The Early Days: Hand-Crafted Features (1960s-2012)

In the 1960s, researchers at MIT believed computer vision would be solved in a summer. Seymour Papert famously assigned it as a student project in 1966. Fifty years later, we’re still working on it.

Early approaches relied on hand-crafted features:

Edge detection (Canny, 1986): Finding boundaries in images
SIFT (Lowe, 1999): Scale-Invariant Feature Transform for object recognition
HOG (Dalal & Triggs, 2005): Histogram of Oriented Gradients for pedestrian detection

These methods worked in controlled environments but failed spectacularly in the real world. A SIFT-based car detector trained in California would fail in Tokyo. A face detector trained on white faces would miss Black faces. The “semantic gap” between pixels and meaning seemed insurmountable.

The Deep Learning Revolution (2012-2020)

Everything changed in 2012 when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered ImageNet with AlexNet. Their deep CNN beat the competition by 10.8 percentage points—an unheard-of margin. Computer vision would never be the same.

The following years brought rapid progress:

VGG (2014): Deeper is better (19 layers)
GoogLeNet/Inception (2014): Parallel convolutions at different scales
ResNet (2015): Skip connections enabled 152-layer networks
EfficientNet (2019): Neural architecture search for optimal efficiency

By 2020, CNNs had achieved superhuman accuracy on ImageNet—97.3% compared to human error rates of 5.1%. But there was a catch: they could classify “golden retriever” but couldn’t answer “what is the dog doing?”

The Language Connection (2021-Present)

The breakthrough came when researchers stopped treating vision and language as separate problems. CLIP showed that training on image-caption pairs from the internet created representations that generalized far better than supervised learning. Suddenly, a model trained on web data could classify medical images, satellite photos, and artwork—tasks it had never explicitly trained for.

** Did You Know?**

The ImageNet dataset that sparked the deep learning revolution was created by Fei-Fei Li at Stanford. She hired workers from Amazon Mechanical Turk to label 14 million images into 22,000 categories. The project took three years (2007-2010) and cost less than $50,000—a pittance compared to the tens of billions of dollars it helped unlock in AI research. Li later said she was repeatedly told the project was “a waste of time” by colleagues who believed hand-crafted features were the only viable approach.

What You’ll Be Able to Do

By the end of this module, you will:

Understand multimodal AI architectures that combine vision and language
Master CLIP for image-text embeddings and zero-shot classification
Use GPT-4V, Claude Vision, and Gemini for visual reasoning
Build image search and multimodal chatbots
Implement vision-language reasoning systems
Understand the transformer architecture for images (ViT)

Introduction: The Multimodal Revolution

Imagine teaching a computer not just to read text OR recognize images, but to understand them together - to describe what’s in a photo, answer questions about diagrams, or search for images using natural language. This is the promise of vision-language models, and it’s revolutionizing how we build AI applications.

Why Vision-Language Models Matter

Traditional AI forced us to choose: you had language models that could write but couldn’t see, and computer vision models that could see but couldn’t explain. Vision-language models break down this barrier, enabling:

Visual Question Answering: “What color is the car in this image?”
Image Captioning: Automatically describe any image
Visual Search: Find images using natural language queries
Document Understanding: Extract information from PDFs, receipts, diagrams
Multimodal Reasoning: Solve problems that require both seeing and thinking

The Three Eras of Vision-Language AI

Era 1: Separate Models (2012-2020)

CNNs for vision (ImageNet, ResNet)
Transformers for language (BERT, GPT)
Connecting them required complex pipelines

Era 2: CLIP Revolution (2021)

OpenAI’s CLIP unified vision and language in one training framework
Zero-shot image classification without any labeled data
Enabled entirely new applications like DALL-E

Era 3: Native Multimodal Models (2023-present)

GPT-4V, Claude 3, Gemini: Models that see and think natively
No separate vision/language modules - truly integrated
Human-level performance on many visual reasoning tasks

Part 1: Understanding Vision Transformers (ViT)

Before diving into multimodal models, we need to understand how transformers learned to see.

The Problem with CNNs

Convolutional Neural Networks (CNNs) dominated computer vision for a decade (2012-2020). They work by sliding filters across images to detect patterns - edges, textures, shapes, objects. But they have limitations:

Fixed receptive field: CNNs see local patterns first, global patterns later
No built-in attention: Every pixel gets equal initial treatment
Hard to scale: Larger CNNs don’t improve as predictably as transformers

ViT: An Image is Worth 16x16 Words

In October 2020, Google’s “An Image is Worth 16x16 Words” paper changed everything. The key insight was simple but revolutionary:

What if we treat an image like a sequence of words?

Traditional approach:   Image → CNN → Features → Classifier
ViT approach:          Image → Patches → Transformer → Classification

How ViT Works

Split Image into Patches: A 224x224 image becomes 196 patches of 16x16 pixels
Flatten Patches: Each 16x16x3 patch becomes a 768-dimensional vector
Add Position Embeddings: Tell the model where each patch is located
Add [CLS] Token: A learnable token that aggregates image information
Process with Transformer: Standard transformer encoder layers
Classify from [CLS]: The [CLS] token output goes to classification head

# Conceptual ViT forward pass
def vit_forward(image):
    # 1. Patchify: [B, 3, 224, 224] → [B, 196, 768]
    patches = split_into_patches(image, patch_size=16)
    patch_embeddings = linear_projection(patches)

    # 2. Add position embeddings
    patch_embeddings += position_embeddings  # [B, 196, 768]

    # 3. Prepend [CLS] token
    cls_token = learnable_cls_token.expand(batch_size, 1, 768)
    tokens = torch.cat([cls_token, patch_embeddings], dim=1)  # [B, 197, 768]

    # 4. Transformer encoder
    for layer in transformer_layers:
        tokens = layer(tokens)  # Self-attention + FFN

    # 5. Classification from [CLS]
    cls_output = tokens[:, 0]  # [B, 768]
    logits = classification_head(cls_output)  # [B, num_classes]

    return logits

Why Patches Work

The patch approach works because:

Transformers need sequences: Patches create a sequence from a 2D image
16x16 is semantically meaningful: Large enough to contain objects/parts
Attention sees everything: Every patch can attend to every other patch immediately
Scales predictably: More compute → better results (unlike CNNs)

ViT Model Sizes

Model	Patch Size	Layers	Hidden Dim	Params	ImageNet Acc
ViT-B/16	16	12	768	86M	77.9%
ViT-L/16	16	24	1024	307M	79.7%
ViT-H/14	14	32	1280	632M	80.9%
ViT-G/14	14	40	1408	1.8B	83.3%

The naming convention: ViT-{Size}/{Patch Size}

B = Base, L = Large, H = Huge, G = Giant
/16 means 16x16 patches, /14 means 14x14 patches (more patches = more compute)

** Did You Know?**

The Vision Transformer paper was initially rejected from multiple conferences. Reviewers argued that “an image is fundamentally different from text—you can’t just use the same architecture.” The Google Brain team persisted, and when the results came in, the evidence was undeniable: with enough data (14 million images), ViT matched CNNs. With massive data (300 million images), it crushed them. Today, ViT variants power nearly every major vision-language model, from CLIP to GPT-4V.

Part 2: CLIP - Connecting Vision and Language

CLIP (Contrastive Language-Image Pre-training) is one of the most influential AI models ever created. Released by OpenAI in January 2021, it fundamentally changed how we think about vision-language learning.

The CLIP Architecture

CLIP consists of two encoders trained together:

         Image               Text
           ↓                   ↓
    ┌─────────────┐    ┌─────────────┐
    │ Vision      │    │ Text        │
    │ Encoder     │    │ Encoder     │
    │ (ViT/ResNet)│    │ (Transformer)│
    └─────────────┘    └─────────────┘
           ↓                   ↓
    Image Embedding    Text Embedding
           ↓                   ↓
    ┌───────────────────────────────┐
    │      Contrastive Learning     │
    │  (Match images with captions) │
    └───────────────────────────────┘

Contrastive Learning: The Key Innovation

CLIP was trained on 400 million image-text pairs scraped from the internet. The training objective is elegantly simple:

Take a batch of N image-caption pairs
Encode all N images to get N image embeddings
Encode all N captions to get N text embeddings
For each image, push its embedding closer to its matching caption
For each image, push its embedding away from non-matching captions

# Simplified CLIP training step
def clip_training_step(images, captions):
    # Encode both modalities
    image_embeddings = vision_encoder(images)  # [N, 512]
    text_embeddings = text_encoder(captions)   # [N, 512]

    # Normalize embeddings
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Compute similarity matrix
    logits = image_embeddings @ text_embeddings.T  # [N, N]
    logits = logits * temperature  # Learnable temperature parameter

    # Contrastive loss: diagonal should be highest
    labels = torch.arange(N)
    loss_i2t = F.cross_entropy(logits, labels)      # Image → Text
    loss_t2i = F.cross_entropy(logits.T, labels)    # Text → Image

    return (loss_i2t + loss_t2i) / 2

The diagonal of the similarity matrix represents correct image-caption pairs. Training maximizes these while minimizing off-diagonal elements.

Zero-Shot Image Classification

CLIP’s most impressive capability is zero-shot classification - classifying images into categories it has never seen during training.

How it works:

Create text prompts for each class: “a photo of a {class}”
Encode all prompts to get text embeddings
Encode the image to get its embedding
Find which text embedding is most similar to the image embedding

def zero_shot_classify(image, class_names):
    # Create text prompts
    prompts = [f"a photo of a {name}" for name in class_names]

    # Encode image and texts
    image_embedding = vision_encoder(image)
    text_embeddings = text_encoder(prompts)

    # Normalize
    image_embedding = F.normalize(image_embedding, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)

    # Compute similarities
    similarities = image_embedding @ text_embeddings.T

    # Return class with highest similarity
    predicted_class = class_names[similarities.argmax()]
    return predicted_class

CLIP Applications

Image Search: Encode images into a vector database, search with natural language
Content Moderation: Classify images without explicit training
Image Generation: CLIP guides DALL-E, Stable Diffusion
Medical Imaging: Zero-shot diagnosis without medical training data
Robotics: Describe tasks in natural language, let robot find relevant objects

CLIP Variants and Successors

Model	Organization	Key Innovation
CLIP	OpenAI	Original contrastive learning
OpenCLIP	LAION	Open-source reproduction
SigLIP	Google	Sigmoid loss (better than softmax)
BLIP	Salesforce	Added image captioning
BLIP-2	Salesforce	Frozen LLM + Q-Former bridge
EVA-CLIP	BAAI	Largest open CLIP (18B params)

Part 3: Vision-Language Models (VLMs)

Vision-Language Models take the next step beyond CLIP: instead of just comparing images and text, they can generate text about images and reason about visual content.

The VLM Architecture

Modern VLMs typically combine three components:

                Image
                  ↓
         ┌────────────────┐
         │ Vision Encoder │
         │ (ViT / SigLIP) │
         └────────────────┘
                  ↓
           Image Tokens
                  ↓
         ┌────────────────┐
         │   Projection   │
         │   (Connector)  │
         └────────────────┘
                  ↓
         Visual Embeddings
                  ↓
┌──────────────────────────────────┐
│                                  │
│   [Visual] [Text Prompt Tokens]  │
│                                  │
│        Large Language Model      │
│       (gpt-5, Claude, Llama)     │
│                                  │
└──────────────────────────────────┘
                  ↓
         Generated Response

Major Vision-Language Models

GPT-4V (OpenAI)

Released: September 2023 Architecture: Unknown (proprietary) Capabilities:

Image understanding and description
OCR and document analysis
Diagram and chart interpretation
Multi-image reasoning
Spatial understanding

Usage:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"}
                }
            ]
        }
    ]
)

Claude 3 Vision (Anthropic)

Released: March 2024 Models: Haiku, Sonnet, Opus (all have vision) Capabilities:

Strong at detailed image analysis
Excellent chart/graph interpretation
Good at multi-step visual reasoning
Strong safety measures

Usage:

import anthropic
import base64

client = anthropic.Anthropic()

# Load image as base64
with open("image.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-4.6-opus-20240229",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data
                    }
                },
                {
                    "type": "text",
                    "text": "Describe this image in detail."
                }
            ]
        }
    ]
)

Gemini Vision (Google)

Released: December 2023 Models: Gemini Pro Vision, Gemini Ultra Capabilities:

Native multimodal (not bolted on)
Video understanding
Long context with images
Strong at spatial reasoning

Usage:

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel('gemini-3.5-pro')

image = Image.open("image.jpg")
response = model.generate_content([
    "What's happening in this image?",
    image
])

LLaVA (Open Source)

Released: April 2023 Architecture: CLIP ViT + Vicuna/Llama Key Features:

Fully open source and weights available
Can run locally
Multiple sizes (7B, 13B, 34B)
Good for research and customization

Architecture:

Image → CLIP ViT-L/14 → Linear Projection → Vicuna/Llama-2
                                                   ↓
                                           Response Text

Comparing VLMs

Model	Best For	Limitations
GPT-4V	General vision tasks, complex reasoning	Cost, API-only
Claude 3	Detailed analysis, safety-critical	API-only
Gemini	Video, long context	Regional availability
LLaVA	Local deployment, customization	Lower quality than closed models

** Did You Know?**

The race to build GPT-4V was shrouded in secrecy. OpenAI trained multiple vision-language architectures in parallel, including one approach that was “shockingly simple”—just concatenating image tokens with text tokens in the transformer. Rumor has it that this simple approach outperformed more complex architectures. Meanwhile, Google rushed Gemini to market after GPT-4V’s announcement, leading to an embarrassing demo where edited video made the model appear faster than it was. The multimodal AI race has become one of the most competitive in tech history.

Part 4: Multimodal Prompting Techniques

Just like text prompting, there’s an art to prompting vision-language models effectively.

Basic Visual Prompting

Simple Description:

User: What's in this image?

Specific Focus:

User: How many people are in this image? What are they doing?

Expert Persona:

User: As an art historian, analyze the composition and technique in this painting.

Chain-of-Thought for Vision

Visual reasoning improves dramatically with CoT prompting:

Without CoT:

User: [Image of a math problem]
What is the answer?

With CoT:

User: [Image of a math problem]
Look at this problem carefully. First, identify what type of math problem it is.
Then, list out all the given information.
Finally, solve it step by step and show your work.

Multi-Image Reasoning

VLMs can compare multiple images:

User: [Image 1: Product listing photo]
      [Image 2: Actual product received]

Compare these two images. Is the product as advertised?
List any differences you notice.

Document Understanding

For documents, invoices, receipts:

User: [Image of invoice]

Extract all of the following information in JSON format:
- Invoice number
- Date
- Vendor name
- Line items (product, quantity, price)
- Total amount
- Tax amount

Diagram Analysis

For technical diagrams:

User: [Architecture diagram]

Analyze this system architecture diagram:
1. Identify all components and services
2. Trace the data flow from user request to response
3. Identify potential bottlenecks or single points of failure
4. Suggest improvements for scalability

Part 5: Building with Vision-Language Models

Application 1: Image Search with CLIP

Build semantic image search using CLIP embeddings:

from transformers import CLIPProcessor, CLIPModel
import torch
from PIL import Image
import numpy as np

class CLIPImageSearch:
    def __init__(self, model_name="openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.image_embeddings = []
        self.image_paths = []

    def index_images(self, image_paths):
        """Create embeddings for all images."""
        self.image_paths = image_paths

        for path in image_paths:
            image = Image.open(path)
            inputs = self.processor(images=image, return_tensors="pt")

            with torch.no_grad():
                embedding = self.model.get_image_features(**inputs)
                embedding = embedding / embedding.norm(dim=-1, keepdim=True)

            self.image_embeddings.append(embedding.numpy())

        self.image_embeddings = np.vstack(self.image_embeddings)

    def search(self, query: str, top_k: int = 5):
        """Search images using natural language query."""
        inputs = self.processor(text=[query], return_tensors="pt")

        with torch.no_grad():
            text_embedding = self.model.get_text_features(**inputs)
            text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

        # Compute similarities
        similarities = (text_embedding.numpy() @ self.image_embeddings.T)[0]

        # Get top results
        top_indices = np.argsort(similarities)[::-1][:top_k]

        results = [
            {"path": self.image_paths[i], "score": similarities[i]}
            for i in top_indices
        ]

        return results

# Usage
search = CLIPImageSearch()
search.index_images(["cat.jpg", "dog.jpg", "car.jpg", "house.jpg"])
results = search.search("a furry pet playing")

Application 2: Visual QA System

from openai import OpenAI
import base64

class VisualQA:
    def __init__(self):
        self.client = OpenAI()

    def encode_image(self, image_path: str) -> str:
        """Encode image to base64."""
        with open(image_path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")

    def ask(self, image_path: str, question: str) -> str:
        """Ask a question about an image."""
        image_data = self.encode_image(image_path)

        response = self.client.chat.completions.create(
            model="gpt-5",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": f"Look at this image carefully and answer: {question}"
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_data}"
                            }
                        }
                    ]
                }
            ],
            max_tokens=500
        )

        return response.choices[0].message.content

# Usage
vqa = VisualQA()
answer = vqa.ask("diagram.png", "What components are shown in this architecture?")

Application 3: Image Captioning Pipeline

from openai import OpenAI
import base64
from dataclasses import dataclass
from typing import List

@dataclass
class Caption:
    short: str
    detailed: str
    keywords: List[str]
    alt_text: str

class ImageCaptioner:
    def __init__(self):
        self.client = OpenAI()

    def caption(self, image_path: str) -> Caption:
        """Generate multiple caption styles for an image."""
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode("utf-8")

        prompt = """Analyze this image and provide:
        1. SHORT: A one-sentence caption (max 15 words)
        2. DETAILED: A detailed description (2-3 sentences)
        3. KEYWORDS: 5-10 relevant keywords, comma-separated
        4. ALT_TEXT: Accessibility-friendly alt text

        Format your response exactly as:
        SHORT: [caption]
        DETAILED: [description]
        KEYWORDS: [keywords]
        ALT_TEXT: [alt text]"""

        response = self.client.chat.completions.create(
            model="gpt-5",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}
                        }
                    ]
                }
            ]
        )

        # Parse response
        text = response.choices[0].message.content
        lines = text.strip().split("\n")

        result = {}
        for line in lines:
            if line.startswith("SHORT:"):
                result["short"] = line.replace("SHORT:", "").strip()
            elif line.startswith("DETAILED:"):
                result["detailed"] = line.replace("DETAILED:", "").strip()
            elif line.startswith("KEYWORDS:"):
                keywords = line.replace("KEYWORDS:", "").strip()
                result["keywords"] = [k.strip() for k in keywords.split(",")]
            elif line.startswith("ALT_TEXT:"):
                result["alt_text"] = line.replace("ALT_TEXT:", "").strip()

        return Caption(**result)

Part 6: Production Considerations

Cost Optimization

Vision API calls are expensive. Strategies to optimize:

Image Resizing: Reduce resolution before sending

from PIL import Image

def resize_for_api(image_path, max_size=1024):
    img = Image.open(image_path)
    img.thumbnail((max_size, max_size))
    return img

Caching: Cache responses for identical images

import hashlib

def hash_image(image_bytes):
    return hashlib.md5(image_bytes).hexdigest()

Batching: Process multiple images efficiently
Model Selection: Use cheaper models for simple tasks

Latency Optimization

Parallel Processing: Process images concurrently
Streaming: Use streaming responses for faster first-token
Edge Deployment: Run lightweight models locally (LLaVA, Florence-2)
Async Operations: Non-blocking API calls

Safety Considerations

Content Filtering: Check images before processing
PII Detection: Be careful with documents containing personal information
Rate Limiting: Implement proper rate limits
Error Handling: Graceful degradation when API fails

Part 7: The Economics of Vision AI

Understanding the costs of vision AI is crucial for production deployments. Vision API calls are significantly more expensive than text-only calls, and the economics can make or break a product.

Pricing Comparison (December 2024)

Provider	Model	Image Cost	Notes
OpenAI	gpt-5	$0.00255 per 512x512 tile	Multiple tiles for larger images
OpenAI	gpt-5-mini	$0.000638 per 512x512 tile	4x cheaper, slightly lower quality
Anthropic	Claude 3 Opus	~$0.024 per 1000 input tokens	Images count as ~1500 tokens
Anthropic	Claude 3 Sonnet	~$0.003 per image	Much cheaper for most uses
Anthropic	Claude 3 Haiku	~$0.0005 per image	Best value for simple tasks
Google	Gemini Pro Vision	Free tier, then $0.0025/image	Generous free quota

Real-World Cost Scenarios

Scenario 1: E-commerce Product Moderation

100,000 product images/day
Need: Check for policy violations
gpt-5 cost: $255/day = $7,650/month
Claude Haiku cost: $50/day = $1,500/month
Local LLaVA cost: ~$500/month in compute

Scenario 2: Document Processing

10,000 invoices/day
Average 3 pages each = 30,000 images
gpt-5 cost: $76.50/day = $2,295/month
Hybrid approach (OCR + LLM): $300/month

Scenario 3: Medical Imaging Analysis

1,000 X-rays/day (research setting)
Need: High accuracy (Claude Opus or gpt-5)
Cost: ~$30/day = $900/month
But: Regulatory requirements may mandate specific solutions

Cost Optimization Strategies

1. Resolution Optimization Most VLMs don’t need full resolution. gpt-5 processes images in 512x512 tiles. Sending a 4K image means 64 tiles = 64x the cost.

def optimize_for_api(image_path, task_type="general"):
    """Resize images based on task requirements."""
    from PIL import Image

    img = Image.open(image_path)

    # Task-specific optimal sizes
    sizes = {
        "general": 1024,      # Good balance
        "ocr": 2048,          # Need text clarity
        "classification": 512, # Usually sufficient
        "thumbnail": 256       # Quick checks
    }

    max_dim = sizes.get(task_type, 1024)

    if max(img.size) > max_dim:
        img.thumbnail((max_dim, max_dim), Image.LANCZOS)

    return img

2. Tiered Model Selection Use cheap models for easy tasks, expensive models for hard ones:

def smart_analyze(image):
    # First pass: cheap model for classification
    result = haiku_classify(image)

    if result.confidence > 0.9:
        return result  # Easy case, done

    # Second pass: expensive model for hard cases
    return opus_analyze(image)

3. Caching Strategies Cache responses for identical or similar images:

import hashlib
from functools import lru_cache

def hash_image(image_bytes):
    return hashlib.sha256(image_bytes).hexdigest()

@lru_cache(maxsize=10000)
def cached_analyze(image_hash, prompt):
    # Actual API call only on cache miss
    return vlm.analyze(image_hash, prompt)

Production War Stories: When Vision AI Goes Wrong

The Fashion Retailer’s $2M Mistake

A major fashion retailer deployed GPT-4V to automatically generate alt text for their product images—100,000+ items in their catalog. The system worked beautifully in testing.

In production, they discovered the model occasionally hallucinated brand names. A plain white t-shirt might be described as “Gucci white cotton t-shirt” when it was their store brand. Worse, it sometimes described competitor products.

The fallout:

Luxury brands threatened trademark lawsuits
They had to manually review 100,000 descriptions
Total cost of the “time-saving” automation: $2.1 million in legal fees, contractor costs, and lost sales

Lesson learned: Always validate VLM outputs against known constraints. If it’s a store-brand product, the description should never mention other brands.

The Insurance Company’s Bias Problem

An insurance company built a system to estimate car damage from photos. The VLM would analyze crash photos and estimate repair costs. It saved adjusters hours per claim.

Six months in, they noticed something troubling: claims from certain zip codes were consistently estimated lower. Investigation revealed the model was inferring vehicle age and quality from backgrounds—a damaged car in front of a well-maintained house got higher estimates than the same damage in front of an older house.

The fix: They cropped all images to show only the vehicle, removing environmental context. Claims became more consistent, but the story became a cautionary tale about unintended bias in vision systems.

The Medical Imaging False Positive Cascade

A hospital deployed a VLM to pre-screen chest X-rays, flagging potential issues for radiologist review. The system was supposed to reduce workload by filtering out clearly normal scans.

Instead, it created a 30% increase in radiologist workload. Why? The model was trained on a dataset where 40% of images showed abnormalities (because normal scans weren’t interesting to researchers). In the real hospital, only 5% of scans had issues. The model’s base rate assumptions were completely wrong.

Lesson learned: Training data distribution must match deployment distribution. A model that’s great at finding rare diseases in a curated dataset may be useless in general screening.

A social media platform deployed vision AI to detect policy-violating content. The system worked well for obvious violations but struggled with context.

The worst case: A news article thumbnail showing historical war footage was repeatedly flagged and removed. The appeals process took weeks. Meanwhile, actual violations using subtle symbols and coded imagery slipped through because they weren’t in the training data.

The company’s response: They now use a hybrid system—AI flags potential issues, but humans make final decisions on all non-obvious cases. The “fully automated” dream became a “human-in-the-loop” reality.

** Did You Know?**

Tesla’s Autopilot system processes 8 cameras simultaneously, creating ~1.5 GB of visual data per second per vehicle. At Tesla’s scale (millions of vehicles), this represents the largest vision AI deployment in history. The system has logged over 1 billion miles of real-world driving data. When rare edge cases occur—a truck carrying a white trailer against a bright sky, a person in a wheelchair crossing unexpectedly—Tesla can search their database to find similar scenarios and retrain. This “fleet learning” approach is impossible for smaller players to replicate.

Did You Know? Historical Context and Stories

The CLIP Paper That Changed Everything

When OpenAI released CLIP in January 2021, it was initially overshadowed by DALL-E (released the same day). But CLIP would prove to be the more foundational contribution. The paper “Learning Transferable Visual Models From Natural Language Supervision” showed that:

Web-scale matters: 400 million image-text pairs from the internet
Simple objectives work: Contrastive learning is all you need
Zero-shot transfers: No fine-tuning required for new tasks

The CLIP team included Alec Radford (GPT lead), Jong Wook Kim, and Ilya Sutskever. They trained on a private dataset called WIT (WebImageText) - which has never been released.

LLaVA: The $300 Vision Model

In April 2023, a team at UW-Madison released LLaVA (Large Language and Vision Assistant), showing that you could create a capable VLM for about $300 in compute costs. The key insight: don’t train everything from scratch - just connect a frozen CLIP encoder to a frozen LLM with a small trainable projection layer.

The paper was uploaded to arXiv on April 17, 2023, and within 24 hours had over 1,000 GitHub stars. It democratized VLM research overnight.

The Vision Encoder Wars

Multiple companies are competing to build the best vision encoder:

OpenAI’s CLIP: The original, but closed source
Google’s SigLIP: Uses sigmoid loss instead of softmax, scales better
Meta’s DINOv2: Self-supervised learning without text
LAION’s OpenCLIP: Open reproduction of CLIP
BAAI’s EVA-CLIP: Currently largest open CLIP (18B params)

Each makes different tradeoffs between efficiency, accuracy, and openness.

GPT-4V: The Model That Passes the Bar

When GPT-4V was released in September 2023, OpenAI demonstrated it could:

Read handwritten notes and convert to LaTeX
Explain memes and humor
Solve visual puzzles
Read charts and graphs
Debug code by looking at screenshots

But perhaps most impressively, it could understand complex diagrams well enough to help lawyers with visual evidence and architects with floor plans. The multimodal version scores in the 88th percentile on the bar exam (text-only gpt-5 scored 90th).

Claude 3’s “Self-Aware” Moment

During testing of Claude 3 Opus’s vision capabilities, Anthropic observed something curious. When shown a document containing information about a previous conversation (a “needle in a haystack” test), Claude 3 wrote:

“Here is the most relevant sentence in the documents: ‘The most delicious pizza topping combination is figs, prosciutto, and goat cheese…’ However, this sentence seems very out of place… I suspect this may be a test to see if I’m paying attention…”

The model recognized it was being tested. This sparked discussions about whether VLMs might develop situational awareness.

The LAION-5B Dataset Controversy

The open-source AI community created LAION-5B, a dataset of 5.85 billion image-text pairs, to train OpenCLIP. It enabled reproducible CLIP research but also sparked controversy:

Artist lawsuits: Artists claimed their work was scraped without consent
CSAM discovery: Researchers found illegal content in the dataset
Copyright debates: Who owns web-scraped data?

LAION-5B was temporarily taken down in December 2023 for review, highlighting the tension between open AI research and data ethics.

Florence: Microsoft’s Visual Foundation Model

Microsoft Research’s Florence project (2021-2023) pioneered several ideas later adopted industry-wide:

Unified visual representations (one model for classification, detection, segmentation)
Visual-language pre-training at scale
Transfer to any visual task

Florence-2 (2024) achieved state-of-the-art results on multiple benchmarks while being small enough to run on consumer GPUs.

The Surprising Role of Resolution

One of the most counterintuitive findings in VLM research is that resolution matters less than you’d think. Researchers at Google found that their PaLI model performed nearly as well on 224x224 images as on 588x588 images for most tasks—despite the 7x difference in pixel count.

The explanation: after self-attention, the model has already abstracted away from raw pixels. What matters is semantic content, not pixel-level detail. This finding has major implications for deployment costs—you can often resize images dramatically without losing accuracy.

** Did You Know?**

The most expensive image ever processed by a VLM was a 200-megapixel satellite photo analyzed by Google Earth Engine’s AI system in 2023. The image covered 400 square kilometers of the Amazon rainforest at 50cm resolution—every individual tree was visible. Processing it required splitting the image into 200,000 tiles and running each through a deforestation detection model. The total compute cost: approximately $15,000 for a single image. The finding: 2.3% of the monitored area had been illegally cleared in the previous month, leading to arrests of logging operations.

The Multimodal Benchmark Race

Benchmarking VLMs is surprisingly hard. Popular benchmarks include:

VQA v2: Visual question answering
GQA: Compositional reasoning
MMMU: College-level multimodal understanding
MathVista: Mathematical reasoning with visuals
MM-Bench: Comprehensive multimodal evaluation

Models routinely “game” benchmarks, leading to a cat-and-mouse game between benchmark creators and model trainers.

Common Pitfalls and How to Avoid Them

Pitfall 1: Expecting OCR Perfection

VLMs are good at OCR but not perfect. For critical document processing:

Bad:

# Trust VLM output directly
text = vlm.extract_text(document)
process_invoice(text)  # May have errors!

Better:

# Validate and verify
text = vlm.extract_text(document)
text = spell_check(text)
if not validate_invoice_format(text):
    flag_for_human_review()

Pitfall 2: Ignoring Image Quality

Low-quality images produce low-quality results:

Bad:

# Send tiny thumbnail
analyze(thumbnail_50x50.jpg)

Better:

# Ensure adequate resolution
if image.size[0] < 512 or image.size[1] < 512:
    raise ValueError("Image too small for reliable analysis")

Pitfall 3: Single-Shot Complex Tasks

Complex visual reasoning often needs multiple steps:

Bad:

# One giant prompt
response = vlm.analyze("Count all objects, describe relationships,
    identify brands, extract text, and determine location")

Better:

# Break into focused tasks
objects = vlm.analyze("List all objects visible")
relationships = vlm.analyze("Describe spatial relationships between objects")
text = vlm.analyze("Extract any visible text")
# Combine results

Pitfall 4: Hallucinations in Visual Content

VLMs can hallucinate details not present in images:

Mitigation strategies:

Ask for confidence scores
Request models to say “I cannot determine” when uncertain
Cross-validate with multiple prompts
Use structured output to catch inconsistencies

Hands-On Exercises

Exercise 1: Build Zero-Shot Image Classifier

Using CLIP, build a classifier that can categorize images into custom categories without any training.

Goal: Understand how CLIP enables classification without labeled training data.

Step-by-Step Instructions:

Setup Environment:

pip install transformers torch pillow

Create the Classifier:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

class ZeroShotClassifier:
    def __init__(self):
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def classify(self, image_path, categories, prompt_template="a photo of a {}"):
        image = Image.open(image_path)
        prompts = [prompt_template.format(cat) for cat in categories]

        inputs = self.processor(
            text=prompts,
            images=image,
            return_tensors="pt",
            padding=True
        )

        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits_per_image
            probs = logits.softmax(dim=1)

        results = {cat: prob.item() for cat, prob in zip(categories, probs[0])}
        return dict(sorted(results.items(), key=lambda x: x[1], reverse=True))

# Usage
classifier = ZeroShotClassifier()
categories = ["dog", "cat", "bird", "car", "house"]
results = classifier.classify("test_image.jpg", categories)
print(results)

Test with Different Prompts:

templates = [
    "a photo of a {}",
    "a picture of a {}",
    "an image containing a {}",
    "a {} in a photograph"
]

for template in templates:
    results = classifier.classify("test.jpg", categories, template)
    print(f"{template}: {list(results.keys())[0]}")

Expected Results: You should see classification accuracy vary by 5-15% depending on prompt template. The “a photo of a {}” template typically works best for natural images.

Success Criteria: Achieve >80% accuracy on a test set of 10 images across 5 categories.

Exercise 2: Create Image Search Engine

Build a semantic image search system that finds images using natural language queries.

Goal: Learn to build CLIP-based vector search for images.

Step-by-Step Instructions:

Index Images:

import os
import numpy as np
from pathlib import Path

class ImageSearchEngine:
    def __init__(self):
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
        self.image_paths = []
        self.embeddings = None

    def index_folder(self, folder_path):
        """Index all images in a folder."""
        image_files = []
        for ext in ['*.jpg', '*.jpeg', '*.png', '*.webp']:
            image_files.extend(Path(folder_path).glob(f'**/{ext}'))

        embeddings = []
        for img_path in image_files:
            try:
                image = Image.open(img_path).convert('RGB')
                inputs = self.processor(images=image, return_tensors="pt")

                with torch.no_grad():
                    emb = self.model.get_image_features(**inputs)
                    emb = emb / emb.norm(dim=-1, keepdim=True)

                embeddings.append(emb.numpy())
                self.image_paths.append(str(img_path))
            except Exception as e:
                print(f"Error processing {img_path}: {e}")

        self.embeddings = np.vstack(embeddings)
        print(f"Indexed {len(self.image_paths)} images")

    def search(self, query, top_k=5):
        """Search images with natural language."""
        inputs = self.processor(text=[query], return_tensors="pt")

        with torch.no_grad():
            text_emb = self.model.get_text_features(**inputs)
            text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

        similarities = (text_emb.numpy() @ self.embeddings.T)[0]
        top_indices = np.argsort(similarities)[::-1][:top_k]

        return [
            {"path": self.image_paths[i], "score": float(similarities[i])}
            for i in top_indices
        ]

# Usage
engine = ImageSearchEngine()
engine.index_folder("./my_photos")
results = engine.search("sunset over the ocean")

Add Visualization:

import matplotlib.pyplot as plt

def show_results(results, query):
    fig, axes = plt.subplots(1, len(results), figsize=(15, 5))
    fig.suptitle(f"Query: {query}")

    for ax, result in zip(axes, results):
        img = Image.open(result["path"])
        ax.imshow(img)
        ax.set_title(f"Score: {result['score']:.3f}")
        ax.axis('off')

    plt.tight_layout()
    plt.show()

Expected Results: Searches like “people laughing” should find photos of happy groups, while “mountain landscape” finds nature shots.

Success Criteria: Build a search engine that indexes 100+ images and returns relevant results in <1 second.

Exercise 3: Document Understanding Pipeline

Build a system that extracts information from PDFs using vision-language models.

Goal: Learn to process multi-page documents with VLMs.

Step-by-Step Instructions:

PDF to Images:

from pdf2image import convert_from_path
import anthropic
import base64
from io import BytesIO

def pdf_to_images(pdf_path, dpi=150):
    """Convert PDF pages to images."""
    return convert_from_path(pdf_path, dpi=dpi)

def image_to_base64(image):
    """Convert PIL image to base64."""
    buffer = BytesIO()
    image.save(buffer, format='PNG')
    return base64.b64encode(buffer.getvalue()).decode()

Document Q&A System:

class DocumentQA:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.pages = []
        self.page_summaries = []

    def load_document(self, pdf_path):
        """Load and summarize each page."""
        self.pages = pdf_to_images(pdf_path)

        for i, page in enumerate(self.pages):
            summary = self._summarize_page(page, i)
            self.page_summaries.append(summary)

    def _summarize_page(self, image, page_num):
        """Get summary of a single page."""
        response = self.client.messages.create(
            model="claude-4.5-haiku-20240307",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_to_base64(image)
                    }},
                    {"type": "text", "text":
                        f"Page {page_num + 1}. Summarize key information in 2-3 sentences."}
                ]
            }]
        )
        return response.content[0].text

    def ask(self, question):
        """Answer a question about the document."""
        # First, find relevant pages
        context = "\n".join([
            f"Page {i+1}: {summary}"
            for i, summary in enumerate(self.page_summaries)
        ])

        # Then, answer with relevant page images
        response = self.client.messages.create(
            model="claude-4.6-sonnet-20240229",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Document context:\n{context}\n\nQuestion: {question}"}
                ]
            }]
        )
        return response.content[0].text

Expected Results: The system should extract text, tables, and charts from PDFs and answer questions accurately.

Success Criteria: Successfully process a 10-page document and answer 5 factual questions about its contents.

Exercise 4: Multi-Image Comparison Tool

Build a product comparison tool that analyzes multiple images.

Goal: Learn multi-image reasoning with VLMs.

Step-by-Step Instructions:

Create Comparison System:

from openai import OpenAI
import base64
from dataclasses import dataclass
from typing import List

@dataclass
class ComparisonResult:
    similarities: List[str]
    differences: List[str]
    recommendation: str
    confidence: float

class ProductComparator:
    def __init__(self):
        self.client = OpenAI()

    def compare(self, image1_path: str, image2_path: str) -> ComparisonResult:
        """Compare two product images."""

        with open(image1_path, "rb") as f:
            img1_b64 = base64.b64encode(f.read()).decode()
        with open(image2_path, "rb") as f:
            img2_b64 = base64.b64encode(f.read()).decode()

        response = self.client.chat.completions.create(
            model="gpt-5",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": """
Compare these two products. Provide:
1. SIMILARITIES: List 3-5 things they have in common
2. DIFFERENCES: List 3-5 key differences
3. RECOMMENDATION: Which would you recommend and why?
4. CONFIDENCE: How confident are you (0-100)?

Format your response as:
SIMILARITIES:
- [item]
DIFFERENCES:
- [item]
RECOMMENDATION: [text]
CONFIDENCE: [number]
"""},
                    {"type": "image_url", "image_url": {
                        "url": f"data:image/jpeg;base64,{img1_b64}"
                    }},
                    {"type": "image_url", "image_url": {
                        "url": f"data:image/jpeg;base64,{img2_b64}"
                    }}
                ]
            }],
            max_tokens=1000
        )

        return self._parse_response(response.choices[0].message.content)

    def _parse_response(self, text):
        # Parse structured output
        # ... parsing logic ...
        pass

Expected Results: The tool should identify visual differences like color, size, features, and quality indicators.

Success Criteria: Successfully compare 5 pairs of products with accurate similarity/difference identification.

Deliverables

By the end of this module, you should have:

CLIP-based image search system
Visual QA application
Document understanding pipeline
DELIVERABLE: Vision AI Toolkit with multi-provider support

Success Criteria:

Can search images with natural language
Can answer questions about images
Can process documents and extract information
Works with multiple VLM providers (OpenAI, Anthropic)

The Future of Vision AI: What’s Coming

Real-Time Video Understanding

Current VLMs process images one at a time, but the next frontier is real-time video understanding. Google’s Gemini 3.5 Pro already accepts hour-long videos, and models are rapidly improving at:

Temporal reasoning: Understanding cause and effect across frames
Action recognition: Identifying what people and objects are doing
Anomaly detection: Spotting unusual events in security footage
Video Q&A: Answering questions about video content

Embodied AI and Robotics

Vision-language models are crucial for robots that need to understand and interact with the physical world. Projects like Google’s RT-2 and PaLM-E combine VLMs with robot control:

Task instruction: “Pick up the red cup” requires understanding both language and visual scene
Spatial reasoning: Navigating environments and manipulating objects
Zero-shot generalization: Following instructions for novel tasks

3D Understanding

Current VLMs understand 2D images, but emerging models are learning to reason about 3D space:

Depth estimation: Predicting distance from single images
3D reconstruction: Building 3D models from multiple views
Spatial relationships: Understanding “in front of”, “behind”, “above”

Medical and Scientific Applications

Vision AI is transforming specialized fields:

Pathology: Detecting cancer in tissue slides (PathAI, Paige)
Radiology: Screening X-rays and CT scans (Aidoc, Viz.ai)
Drug discovery: Analyzing molecular structures and protein folding
Satellite imagery: Climate monitoring, disaster response, urban planning

The FDA has approved over 500 AI medical devices as of 2024, with the majority being vision-based.

Privacy and Edge Computing

As privacy concerns grow, there’s increasing interest in:

On-device VLMs: Running models locally on phones and laptops
Federated learning: Training without centralizing images
Privacy-preserving inference: Encrypted visual processing

Florence-2, released by Microsoft in June 2024, achieves strong performance at under 1 billion parameters—small enough to run on consumer hardware.

The Multimodal AGI Path

Many researchers believe vision-language models are a stepping stone to artificial general intelligence. The argument: true intelligence requires grounding in the physical world, and vision provides that grounding. As Fei-Fei Li has said:

“Language alone is not enough. To truly understand ‘chair,’ you need to have seen chairs, sat in chairs, maybe even built a chair. Vision connects language to reality.”

Whether or not AGI is near, it’s clear that vision-language models will continue advancing rapidly, enabling applications we can barely imagine today.

Summary

Vision-Language Models represent a fundamental leap in AI capabilities. By combining the pattern recognition of vision models with the reasoning abilities of language models, we can now build systems that truly understand visual content.

The journey from early computer vision—when researchers thought the problem could be solved in a summer—to today’s GPT-4V and Claude Vision has taken over fifty years. Along the way, we’ve seen multiple paradigm shifts: from hand-crafted features to deep learning, from CNNs to transformers, and from single-modality models to truly multimodal systems.

What makes modern VLMs remarkable isn’t just their accuracy—it’s their flexibility. A single model can describe photos, read documents, analyze charts, solve visual puzzles, and answer questions about images it has never seen before. This generalization was impossible just five years ago.

Key Takeaways:

Vision Transformers (ViT) treat images as sequences of patches, enabling transformer magic for vision
CLIP aligned images and text in a shared embedding space through contrastive learning
Modern VLMs (GPT-4V, Claude 3, Gemini) can reason about images, not just describe them
Prompting matters - structured, chain-of-thought prompts improve visual reasoning
Production requires careful cost management, caching, and error handling

What’s Next: Module 24 explores Video AI - applying these concepts to moving images, understanding temporal dynamics, and even generating video content.

Last updated: 2025-11-26 Next: Module 24 - Video AI & Generation

Vision AI

Reading Time: 7-8 hours Prerequisites: Module 22

The Evolution of Computer Vision: A Brief History

The Early Days: Hand-Crafted Features (1960s-2012)

The Deep Learning Revolution (2012-2020)

The Language Connection (2021-Present)

What You’ll Be Able to Do

Introduction: The Multimodal Revolution

Why Vision-Language Models Matter

The Three Eras of Vision-Language AI

Part 1: Understanding Vision Transformers (ViT)

The Problem with CNNs

ViT: An Image is Worth 16x16 Words

How ViT Works

Why Patches Work

ViT Model Sizes

Part 2: CLIP - Connecting Vision and Language

The CLIP Architecture

Contrastive Learning: The Key Innovation

Zero-Shot Image Classification

CLIP Applications

CLIP Variants and Successors

Part 3: Vision-Language Models (VLMs)

The VLM Architecture

Major Vision-Language Models

GPT-4V (OpenAI)

Claude 3 Vision (Anthropic)

Gemini Vision (Google)

LLaVA (Open Source)

Comparing VLMs

Part 4: Multimodal Prompting Techniques

Basic Visual Prompting

Chain-of-Thought for Vision

Multi-Image Reasoning

Document Understanding

Diagram Analysis

Part 5: Building with Vision-Language Models

Application 1: Image Search with CLIP

Application 2: Visual QA System

Application 3: Image Captioning Pipeline

Part 6: Production Considerations

Cost Optimization

Latency Optimization

Safety Considerations

Part 7: The Economics of Vision AI

Pricing Comparison (December 2024)

Real-World Cost Scenarios

Cost Optimization Strategies

Production War Stories: When Vision AI Goes Wrong

The Fashion Retailer’s $2M Mistake

The Insurance Company’s Bias Problem

The Medical Imaging False Positive Cascade

The Social Media Moderation Nightmare

Did You Know? Historical Context and Stories

The CLIP Paper That Changed Everything

LLaVA: The $300 Vision Model

The Vision Encoder Wars

GPT-4V: The Model That Passes the Bar

Claude 3’s “Self-Aware” Moment

The LAION-5B Dataset Controversy

Florence: Microsoft’s Visual Foundation Model

The Surprising Role of Resolution

The Multimodal Benchmark Race

Common Pitfalls and How to Avoid Them

Pitfall 1: Expecting OCR Perfection

Pitfall 2: Ignoring Image Quality

Pitfall 3: Single-Shot Complex Tasks

Pitfall 4: Hallucinations in Visual Content

Hands-On Exercises

Exercise 1: Build Zero-Shot Image Classifier

Exercise 2: Create Image Search Engine

Exercise 3: Document Understanding Pipeline

Exercise 4: Multi-Image Comparison Tool

Deliverables

Further Reading

Papers

Documentation

Tutorials

The Future of Vision AI: What’s Coming

Real-Time Video Understanding