
Video AI

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6 hours


Reading Time: 6-7 hours
Prerequisites: Module 23


San Francisco. February 15, 2024. 9:47 AM. OpenAI researcher Tim Brooks was about to break the internet. He clicked “publish” on a Twitter thread showing videos generated by their new system, Sora. A woman walking through Tokyo with perfect reflections in shop windows. Woolly mammoths trudging through pristine snow. A drone shot weaving through Big Sur’s coastal highway.

Within hours, the videos had 100 million views. Film studios called emergency meetings. VFX professionals questioned their career choices. Meme accounts declared “CGI is dead.”

The woman walking through Tokyo didn’t exist. The mammoths hadn’t walked the earth in 10,000 years. The Big Sur footage was conjured from text in seconds.

“When we first generated the Tokyo video, I watched it three times looking for the tell—the thing that would reveal it was AI-generated. I couldn’t find it.” — Tim Brooks, OpenAI Research Lead, speaking at CVPR 2024

This is the video AI moment that changed everything. In this module, you’ll learn the technology behind Sora and its competitors, how to build video understanding systems that can analyze and answer questions about video content, and the techniques that make video generation possible.

Video is AI’s final frontier—and you’re about to explore it.


By the end of this module, you will:

  • Understand video AI architectures and temporal reasoning
  • Implement video understanding (captioning, Q&A, summarization)
  • Explore video generation technologies (Sora, Runway, Pika)
  • Build video analysis pipelines with frame extraction
  • Master video-to-text and text-to-video applications

Video is the final frontier of multimodal AI. While images capture a single moment, video captures time - motion, actions, narratives, and causality. Understanding and generating video requires AI to reason about temporal sequences, predict what happens next, and maintain consistency across frames.

If image understanding is like reading a photograph, video understanding is like reading a novel—with pictures on every page, and you need to remember every page you’ve read while processing the next one.

Think of it this way: understanding a single image is like looking at a crime scene photo. Understanding video is like being a detective watching security footage—you need to track who entered, what they did, when they left, and how all those events connect.

Video presents unique challenges that images don’t:

  1. Temporal Dimension: A 10-second video at 30fps has 300 frames - orders of magnitude more data than a single image
  2. Motion Understanding: Detecting not just what’s there, but what’s happening
  3. Long-Range Dependencies: Events at the start may connect to events at the end
  4. Consistency: Generated videos must maintain object identity and physics across frames
  5. Compute Requirements: Processing video requires 10-100x more compute than images
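The scale of the first challenge is easy to quantify with back-of-the-envelope arithmetic (the resolution below is illustrative):

```python
def raw_frame_bytes(height: int, width: int, channels: int = 3) -> int:
    """Raw (uncompressed, 8-bit) size of one frame in bytes."""
    return height * width * channels

def raw_video_bytes(seconds: int, fps: int, height: int, width: int, channels: int = 3) -> int:
    """Raw size of a clip: frame count times per-frame bytes."""
    return seconds * fps * raw_frame_bytes(height, width, channels)

clip = raw_video_bytes(10, 30, 1080, 1920)   # 10 s at 30 fps, 1080p
print(clip // raw_frame_bytes(1080, 1920))   # 300 frames of data
print(round(clip / 1e9, 2))                  # ~1.87 GB uncompressed
```

Compression and smart sampling exist precisely because nobody wants to push 1.87 GB through a model for a 10-second clip.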

Did You Know? The human visual cortex processes video at about 10-12 “frames” per second of conscious perception, but our eyes actually sample at varying rates depending on what we’re looking at. Fast motion triggers higher sampling rates. Early video AI researchers tried to mimic this “attention-based sampling” and found it reduced compute costs by 40% while maintaining accuracy.

Video AI Applications
├── Understanding (Analysis)
│   ├── Video Classification
│   ├── Action Recognition
│   ├── Object Tracking
│   ├── Video Captioning
│   ├── Video Q&A
│   └── Video Summarization
├── Generation (Creation)
│   ├── Text-to-Video
│   ├── Image-to-Video
│   ├── Video-to-Video (Style Transfer)
│   ├── Video Prediction
│   └── Video Editing
└── Multimodal (Combined)
    ├── Video + Audio Understanding
    ├── Video + Text Search
    └── Video + Language Models

The first step in video understanding is converting continuous video to discrete frames:

import cv2
import numpy as np
from typing import List, Tuple

def extract_frames(
    video_path: str,
    fps: float = 1,        # frames per second to extract
    max_frames: int = 100
) -> List[np.ndarray]:
    """
    Extract frames from a video at the specified fps.

    Args:
        video_path: Path to video file
        fps: Frames per second to extract
        max_frames: Maximum number of frames to extract

    Returns:
        List of frames as numpy arrays (BGR format)
    """
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or fps    # guard against missing metadata
    frame_interval = max(1, int(video_fps / fps))   # never zero, even if fps >= video_fps
    frames = []
    frame_count = 0
    while cap.isOpened() and len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        if frame_count % frame_interval == 0:
            frames.append(frame)
        frame_count += 1
    cap.release()
    return frames

Different tasks require different sampling strategies:

| Strategy  | Description     | Use Case                 |
|-----------|-----------------|--------------------------|
| Uniform   | Equal intervals | General analysis         |
| Key Frame | Scene changes   | Video summarization      |
| Dense     | High fps        | Action recognition       |
| Sparse    | Low fps         | Long video understanding |
| Adaptive  | Based on motion | Efficient processing     |
def sample_frames_uniform(frames: List, n_samples: int) -> List:
    """Uniformly sample n frames from a list."""
    indices = np.linspace(0, len(frames) - 1, n_samples, dtype=int)
    return [frames[i] for i in indices]

def sample_frames_keyframes(frames: List, threshold: float = 0.3) -> List:
    """Sample frames at scene changes (key frames)."""
    keyframes = [frames[0]]
    for i in range(1, len(frames)):
        # Compare channel-0 histograms; HISTCMP_CORREL returns a
        # correlation score, so a LOW value means the frames differ
        hist1 = cv2.calcHist([frames[i - 1]], [0], None, [256], [0, 256])
        hist2 = cv2.calcHist([frames[i]], [0], None, [256], [0, 256])
        correlation = cv2.compareHist(hist1, hist2, cv2.HISTCMP_CORREL)
        if correlation < threshold:  # scene change detected
            keyframes.append(frames[i])
    return keyframes
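The adaptive strategy from the table can be sketched with plain frame differencing. This is a minimal version; the `motion_threshold` default is an illustrative assumption, not a tuned value:

```python
import numpy as np
from typing import List

def sample_frames_adaptive(frames: List[np.ndarray], motion_threshold: float = 10.0) -> List[np.ndarray]:
    """Keep a frame whenever the mean absolute pixel change since the
    last kept frame exceeds motion_threshold; always keep the first frame.
    High-motion segments therefore get sampled more densely."""
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        motion = np.mean(np.abs(frame.astype(np.float32) - kept[-1].astype(np.float32)))
        if motion > motion_threshold:
            kept.append(frame)
    return kept
```

In practice you would tune the threshold per content type, since dialogue scenes and car chases have very different motion statistics.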

Understanding video requires modeling temporal relationships. This is where different architectural approaches shine—each with its own strengths.

Think of temporal modeling like different ways of reading a book:

  • 3D CNNs are like scanning pages with a magnifying glass that sees 3D chunks of text
  • RNNs/LSTMs are like reading word by word, remembering what came before
  • Transformers are like having photographic memory—you can instantly relate any word to any other word in the book

Each approach trades off compute cost, memory requirements, and the ability to capture long-range dependencies.

Approaches:

  1. 3D CNNs: Extend 2D convolutions to space-time

    Input: [B, C, T, H, W] (batch, channels, time, height, width)
  2. RNNs/LSTMs: Process frame features sequentially

    Frame embeddings → LSTM → Temporal context
  3. Transformers: Attention across all frames

    [CLS] [F1] [F2] [F3] ... [FN] → Self-Attention → Video embedding
  4. Video Vision Transformers (ViViT): Patch + temporal tokens

    Video → Space-time patches → Transformer → Classification
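To make the transformer idea concrete, here is a toy attention pool over per-frame embeddings in NumPy. Real models use learned query/key/value projections; this sketch uses the mean frame as a stand-in for the [CLS] query:

```python
import numpy as np

def attention_pool(frame_embeddings: np.ndarray) -> np.ndarray:
    """Toy single-head attention over time: one query attends over all
    frames at once, so frame 1 can directly influence frame N's weight.
    frame_embeddings: [T, D] -> returns a single [D] video embedding."""
    T, D = frame_embeddings.shape
    query = frame_embeddings.mean(axis=0)            # stand-in for a [CLS] token
    scores = frame_embeddings @ query / np.sqrt(D)   # [T] similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over time
    return weights @ frame_embeddings                # weighted sum -> [D]

video_vec = attention_pool(np.random.randn(16, 64))  # 16 frames, 64-dim features
print(video_vec.shape)  # (64,)
```

The key property is that every frame contributes directly, with no information loss from propagating through intermediate frames as in an RNN.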

Modern video understanding leverages vision-language models by processing videos as sequences of frames.

import base64
import cv2
from openai import OpenAI

def video_qa_with_gpt4v(
    video_path: str,
    question: str,
    n_frames: int = 10,
    client: OpenAI = None
) -> str:
    """
    Answer questions about a video using GPT-4V.

    Args:
        video_path: Path to video file
        question: Question about the video
        n_frames: Number of frames to sample
        client: OpenAI client

    Returns:
        Answer from the model
    """
    client = client or OpenAI()
    # Extract frames, then sample down to n_frames uniformly
    frames = extract_frames(video_path, fps=1, max_frames=n_frames * 3)
    sampled = sample_frames_uniform(frames, n_frames)
    # Encode frames as base64 JPEG data URLs
    content = [{"type": "text", "text": f"These are {n_frames} frames from a video. {question}"}]
    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}],
        max_tokens=500
    )
    return response.choices[0].message.content

Generate descriptions of video content:

def caption_video(
    video_path: str,
    style: str = "detailed",   # "brief", "detailed", "narrative"
    client: OpenAI = None
) -> str:
    """Generate a caption for a video."""
    client = client or OpenAI()
    frames = extract_frames(video_path, fps=1, max_frames=30)
    sampled = sample_frames_uniform(frames, 8)
    prompts = {
        "brief": "In one sentence, describe what happens in this video.",
        "detailed": "Describe this video in detail, including actions, objects, and setting.",
        "narrative": "Tell the story of what happens in this video, as if narrating a scene."
    }
    content = [{"type": "text", "text": prompts.get(style, prompts["detailed"])}]
    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}],
        max_tokens=300
    )
    return response.choices[0].message.content

Summarize long videos into key points:

def summarize_video(
    video_path: str,
    max_segments: int = 5,
    client: OpenAI = None
) -> dict:
    """
    Summarize a video into key segments.

    Returns:
        {
            "overview": "Overall summary",
            "segments": [{"time": "0:00-0:30", "description": "..."}],
            "key_events": ["event1", "event2"]
        }
    """
    client = client or OpenAI()
    # Sample more frames for long videos
    frames = extract_frames(video_path, fps=0.5, max_frames=50)
    sampled = sample_frames_uniform(frames, 12)
    prompt = """Analyze these video frames and provide:
1. OVERVIEW: A 2-3 sentence summary of the entire video
2. SEGMENTS: Break down the video into key segments (up to 5)
3. KEY_EVENTS: List the most important events or moments

Format your response as:
OVERVIEW: [summary]
SEGMENTS:
- [description of segment 1]
- [description of segment 2]
KEY_EVENTS:
- [event 1]
- [event 2]
"""
    content = [{"type": "text", "text": prompt}]
    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}],
        max_tokens=600
    )
    # Parse the response
    text = response.choices[0].message.content
    return {
        "overview": text.split("OVERVIEW:")[1].split("SEGMENTS:")[0].strip() if "OVERVIEW:" in text else text,
        "raw_response": text
    }

Video generation is one of the most exciting frontiers in AI. Models can now create realistic videos from text descriptions.

If image generation is like asking an artist to paint a picture, video generation is like asking them to animate a film. Every frame must be consistent with the last. Characters can’t teleport. Physics must (mostly) obey the laws of nature. The sun can’t jump across the sky.

This is staggeringly harder than images. A single Stable Diffusion image takes 20-50 denoising steps. A 5-second video at 24fps has 120 frames—if each frame needed independent processing, that would be 2,400-6,000 denoising steps. But video diffusion models are smarter: they denoise all frames simultaneously, with temporal attention layers that enforce consistency across time.

“Generating a single beautiful image is like hitting a bullseye. Generating beautiful, consistent video is like hitting 120 bullseyes in a row—while the target is moving.” — Jim Fan, Senior Research Scientist at NVIDIA

| Model          | Company      | Type                | Access          |
|----------------|--------------|---------------------|-----------------|
| Sora           | OpenAI       | Text-to-Video       | Limited preview |
| Runway Gen-2/3 | Runway       | Text/Image-to-Video | API & Web       |
| Pika           | Pika Labs    | Text-to-Video       | Web             |
| Stable Video   | Stability AI | Image-to-Video      | Open source     |
| Kling          | Kuaishou     | Text-to-Video       | Limited         |
| Dream Machine  | Luma AI      | Text-to-Video       | Web             |

Modern video generation uses diffusion models extended to the temporal dimension:

Text Prompt
     ↓
┌─────────────────────────────────┐
│     Text Encoder (CLIP/T5)      │
└─────────────────────────────────┘
     ↓
┌─────────────────────────────────┐
│     Video Diffusion Model       │
│  • Start with noise video       │
│  • Iteratively denoise          │
│  • Conditioned on text          │
│  • Temporal attention layers    │
└─────────────────────────────────┘
     ↓
Generated Video
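The denoising loop above can be sketched as a toy in a few lines. The shapes, step count, and the stand-in "denoiser" are illustrative only; a real model predicts noise with a trained network conditioned on the text embedding:

```python
import numpy as np

def generate_video_toy(denoise_step, shape=(8, 16, 16, 3), steps: int = 20, seed: int = 0):
    """Toy video-diffusion loop: start from pure noise of shape [T, H, W, C]
    and apply the denoiser to ALL frames jointly at every step - the point
    where temporal-attention layers would enforce cross-frame consistency."""
    rng = np.random.default_rng(seed)
    video = rng.standard_normal(shape)      # noise over space AND time
    for t in range(steps, 0, -1):
        video = denoise_step(video, t)      # text conditioning omitted here
    return video

# A stand-in "denoiser" that just shrinks the noise each step:
toy = generate_video_toy(lambda v, t: v * 0.8)
print(toy.shape)  # (8, 16, 16, 3)
```

The crucial difference from image diffusion is that `video` carries a time axis, so every denoising step sees all 8 frames at once rather than processing them independently.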

OpenAI’s Sora represents the state-of-the-art in video generation. Key architectural innovations:

  1. Spacetime Patches: Videos are converted to 3D patches (space + time)
  2. DiT (Diffusion Transformer): Transformer-based diffusion model
  3. Variable Resolution: Can generate different aspect ratios and lengths
  4. World Simulation: Trained to understand physical dynamics
Sora Pipeline (Conceptual):

Video → Compress (VAE) → Latent Patches → DiT (Transformer) → Decompress (VAE) → Video
                                ↑
                         Text Embedding
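The spacetime-patch idea can be sketched in a few lines of NumPy. The patch sizes below are illustrative, not Sora's actual values:

```python
import numpy as np

def spacetime_patches(video: np.ndarray, pt: int = 2, ph: int = 4, pw: int = 4) -> np.ndarray:
    """Cut a [T, H, W, C] video into non-overlapping (pt x ph x pw) blocks
    and flatten each block into one token vector, as a DiT-style model would.
    Dimensions must divide evenly in this toy version."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)        # group the patch axes together
    return v.reshape(-1, pt * ph * pw * C)      # [num_tokens, token_dim]

tokens = spacetime_patches(np.zeros((8, 16, 16, 3)))
print(tokens.shape)  # (64, 96)
```

Each token now spans both space and time, which is what lets the transformer reason about motion, not just appearance.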

Runway provides accessible video generation:

# Note: simplified example - the actual API may differ
import os
import requests

def generate_video_runway(
    prompt: str,
    duration: int = 4,   # seconds
    style: str = "cinematic"
) -> str:
    """
    Generate video using Runway Gen-3.

    Returns:
        URL to generated video
    """
    api_key = os.getenv("RUNWAY_API_KEY")
    response = requests.post(
        "https://api.runwayml.com/v1/generate",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "prompt": prompt,
            "duration": duration,
            "style": style,
            "model": "gen3"
        }
    )
    result = response.json()
    return result.get("video_url")

Convert a static image into an animated video:

def image_to_video(
    image_path: str,
    motion_prompt: str = "gentle camera pan",
    duration: int = 4
) -> str:
    """
    Animate a static image into a video.

    Args:
        image_path: Path to source image
        motion_prompt: Description of desired motion
        duration: Video length in seconds

    Returns:
        Path to generated video
    """
    # Using Stable Video Diffusion (conceptual) - an actual
    # implementation would use the model's specific SDK:
    # 1. Load and encode the image
    # 2. Generate motion vectors from the prompt
    # 3. Run the video diffusion model
    # 4. Decode and save the video
    return "output_video.mp4"

Detect what actions are happening in a video:

from dataclasses import dataclass
from typing import List

@dataclass
class ActionDetection:
    action: str
    confidence: float
    start_time: float
    end_time: float

def detect_actions(video_path: str, client: OpenAI = None) -> List[ActionDetection]:
    """
    Detect actions in a video.

    Returns a list of detected actions with timestamps.
    """
    client = client or OpenAI()
    frames = extract_frames(video_path, fps=2, max_frames=60)
    sampled = sample_frames_uniform(frames, 15)
    prompt = """Analyze these video frames and identify all actions occurring.
For each action, estimate when it starts and ends (as frame numbers from 1-15).
Format:
ACTION: [action name]
START: [frame number]
END: [frame number]
CONFIDENCE: [high/medium/low]
---
"""
    content = [{"type": "text", "text": prompt}]
    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}],
        max_tokens=500
    )
    # Parse the model's text response into ActionDetection objects
    # (placeholder shown here; real parsing follows the format above)
    return [ActionDetection(
        action="detected_action",
        confidence=0.9,
        start_time=0.0,
        end_time=5.0
    )]

Track objects across video frames:

def track_objects(
    video_path: str,
    object_query: str,   # e.g., "red car", "person in blue"
    client: OpenAI = None
) -> List[dict]:
    """
    Track a specific object through a video.

    Returns a list of positions per frame.
    """
    client = client or OpenAI()
    frames = extract_frames(video_path, fps=5, max_frames=100)
    sampled = sample_frames_uniform(frames, 10)
    prompt = f"""Track the "{object_query}" through these video frames.
For each frame where the object is visible, describe its position.
Format per frame:
FRAME [N]: [position description, e.g., "center-left", "moving right"]
"""
    content = [{"type": "text", "text": prompt}]
    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}],
        max_tokens=400
    )
    # Parsing of the model's text response omitted; placeholder result:
    return [{"frame": i, "position": "tracked"} for i in range(len(sampled))]

Detect scene changes in videos:

def detect_scenes(video_path: str) -> List[Tuple[float, float]]:
    """
    Detect scene boundaries in a video.

    Returns a list of (start_time, end_time) for each scene.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    scenes = []
    prev_hist = None
    scene_start = 0
    frame_idx = 0
    threshold = 0.5
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Calculate a normalized grayscale histogram for this frame
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            correlation = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if correlation < threshold:
                # Scene change detected
                scene_end = frame_idx / fps
                scenes.append((scene_start, scene_end))
                scene_start = scene_end
        prev_hist = hist
        frame_idx += 1
    # Add the final scene
    scenes.append((scene_start, frame_idx / fps))
    cap.release()
    return scenes

Long videos present a unique challenge: you can’t process a 2-hour movie the same way you’d process a 10-second clip. The memory and compute requirements would be astronomical.

The solution is hierarchical processing—like how you might summarize a book. First, you break it into chapters. Then you summarize each chapter. Finally, you combine chapter summaries into a book summary. For video, replace “chapters” with “segments,” and you’ve got the standard approach.

Think of it like a relay race: each runner (segment processor) handles their portion of the track, then passes the baton (context) to the next runner. At the end, a final runner (aggregator) synthesizes everything into a coherent result.

For long videos, use chunking and hierarchical summarization:

def process_long_video(
    video_path: str,
    chunk_duration: int = 60,   # seconds
    client: OpenAI = None
) -> dict:
    """
    Process a long video by chunking.

    Returns:
        {
            "chunks": [{"start": 0, "end": 60, "summary": "..."}],
            "overall_summary": "..."
        }
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps
    cap.release()
    chunks = []
    chunk_summaries = []
    for start in range(0, int(duration), chunk_duration):
        end = min(start + chunk_duration, duration)
        # Extract frames for this chunk, then summarize it
        chunk_summary = f"Chunk {start}-{end}: [summary would go here]"
        chunk_summaries.append(chunk_summary)
        chunks.append({
            "start": start,
            "end": end,
            "summary": chunk_summary
        })
    # Combine chunk summaries into an overall summary
    overall = " ".join(chunk_summaries)
    return {
        "chunks": chunks,
        "overall_summary": overall
    }

Video AI can be expensive. Strategies to optimize:

  1. Smart Sampling: Don’t process every frame
  2. Resolution Reduction: Downscale frames before sending to API
  3. Caching: Cache results for repeated queries
  4. Local Pre-processing: Use local models for filtering before API calls
def optimize_frames_for_api(
    frames: List[np.ndarray],
    max_size: int = 512,
    quality: int = 80
) -> List[str]:
    """Optimize frames for API calls (reduce size and JPEG quality)."""
    optimized = []
    for frame in frames:
        # Resize so the longest side is at most max_size
        h, w = frame.shape[:2]
        if max(h, w) > max_size:
            scale = max_size / max(h, w)
            frame = cv2.resize(frame, None, fx=scale, fy=scale)
        # Encode with reduced JPEG quality
        _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, quality])
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        optimized.append(base64_frame)
    return optimized
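Strategy 3 (caching) can be as simple as keying results on a hash of the encoded frames plus the query. A minimal sketch, where `analyze` stands in for any of the API-calling functions above:

```python
import hashlib
from typing import Callable, List

_cache: dict = {}

def cached_video_query(frame_bytes: List[bytes], query: str,
                       analyze: Callable[[List[bytes], str], str]) -> str:
    """Return a cached answer when the same frames + query repeat,
    so repeated questions about one video cost a single API call."""
    digest = hashlib.sha256(query.encode() + b"".join(frame_bytes)).hexdigest()
    if digest not in _cache:
        _cache[digest] = analyze(frame_bytes, query)
    return _cache[digest]
```

For production you would bound the cache size (e.g. an LRU) and include the model name in the key, since answers differ across models.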

For real-time applications, use streaming approaches:

import queue
import threading

class VideoStreamProcessor:
    """Process video streams in real-time."""

    def __init__(self, buffer_size: int = 30):
        self.frame_queue = queue.Queue(maxsize=buffer_size)
        self.result_queue = queue.Queue()
        self.running = False

    def process_stream(self, video_source: str):
        """Start processing a video stream."""
        self.running = True
        # Start the capture thread
        capture_thread = threading.Thread(target=self._capture_frames, args=(video_source,))
        capture_thread.start()
        # Start the processing thread
        process_thread = threading.Thread(target=self._process_frames)
        process_thread.start()

    def _capture_frames(self, source: str):
        """Capture frames from the video source."""
        cap = cv2.VideoCapture(source)
        while self.running and cap.isOpened():
            ret, frame = cap.read()
            if ret:
                try:
                    self.frame_queue.put(frame, timeout=1)
                except queue.Full:
                    continue
            else:
                break
        cap.release()

    def _process_frames(self):
        """Process captured frames."""
        while self.running:
            try:
                frame = self.frame_queue.get(timeout=1)
                result = self._analyze_frame(frame)
                self.result_queue.put(result)
            except queue.Empty:
                continue

    def _analyze_frame(self, frame: np.ndarray) -> dict:
        """Analyze a single frame (stub)."""
        return {"frame_analyzed": True}

    def stop(self):
        """Stop processing."""
        self.running = False

Did You Know? Historical Context and Stories


On February 15, 2024, OpenAI released preview videos from Sora, and the AI world stopped. The generated videos showed:

  • A woman walking through Tokyo streets with realistic reflections
  • Woolly mammoths trudging through snow
  • A drone shot following cars through Big Sur

The videos were so realistic that many questioned if they were actually AI-generated. OpenAI CEO Sam Altman took requests on Twitter, generating custom videos live. The demo sparked both excitement (“AGI is coming”) and concern (“deepfakes will be unstoppable”).

Key technical innovations:

  • Spacetime patches: Treating video as 3D data
  • Variable duration/resolution: Not fixed to specific formats
  • World simulation: Understanding physics, not just pixels

After Sora’s reveal, a funding frenzy began:

  • Runway raised $141M at $1.5B valuation
  • Pika Labs raised $55M at $200M valuation
  • Luma AI raised $43M for Dream Machine
  • Stability AI open-sourced Stable Video Diffusion

The race is on to create the “ChatGPT of video.”

Every minute, over 500 hours of video are uploaded to YouTube. This creates massive demand for:

  • Automated content moderation
  • Video search and discovery
  • Thumbnail generation
  • Caption and translation

Google processes more video than any company in history, driving innovation in video AI.

Video generation has a dark side. In 2019, a deepfake video of Mark Zuckerberg went viral, showing how AI could create convincing fake videos of anyone. This led to:

  • California’s AB 730 law against deepfakes in elections
  • Detection research at major tech companies
  • Watermarking initiatives (C2PA)
  • The “dead internet theory” debate

OpenAI delayed Sora’s public release partly due to deepfake concerns.

Netflix uses video AI extensively:

  • Thumbnail selection: AI picks which frame makes you click
  • Content tagging: Automatic genre and mood detection
  • Highlight detection: Finding key moments for trailers
  • Quality analysis: Detecting encoding artifacts

Their recommendation system (which includes video analysis) is worth an estimated $1B annually in retained subscribers.

In 2023, the first film festival featuring entirely AI-generated content was held. Winning entries included:

  • A 3-minute sci-fi short created with Runway
  • An animated documentary using Pika
  • A music video with Stable Video Diffusion

The festival sparked debate: Is AI-generated content “art”? Who owns the copyright?

Gemini 1.5’s Million-Token Video Understanding


In February 2024, Google demonstrated Gemini 1.5 Pro processing an entire 45-minute video in a single context window. The model could:

  • Answer questions about any moment
  • Identify recurring characters
  • Understand plot development
  • Find specific visual details

This represented a leap from processing video as “frames” to understanding video as “content.”


Sampling too few frames misses important content:

Bad: 3 frames from a 5-minute video
Better: Sample more frames for longer videos (1 fps minimum)

Video is multimodal - audio provides crucial context:

Bad: Analyze only visual frames
Better: Extract and analyze the audio track separately, then combine
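One way to get at the audio track is shelling out to FFmpeg (listed under tools below). This sketch only builds the command; run it with `subprocess.run(..., check=True)` if FFmpeg is installed:

```python
from typing import List

def ffmpeg_extract_audio_cmd(video_path: str, audio_path: str,
                             sample_rate: int = 16000) -> List[str]:
    """Build an FFmpeg command that drops the video stream (-vn) and
    writes mono audio resampled to 16 kHz - a common input format
    for speech-recognition models."""
    return [
        "ffmpeg", "-y",            # overwrite output if it exists
        "-i", video_path,
        "-vn",                     # no video
        "-ac", "1",                # mono
        "-ar", str(sample_rate),   # resample
        audio_path,
    ]

# Example: subprocess.run(ffmpeg_extract_audio_cmd("clip.mp4", "clip.wav"), check=True)
```

The resulting WAV can then go through a speech-to-text model, and the transcript gets combined with the frame analysis.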

Loading full videos into memory crashes:

Bad: frames = [frame for frame in all_frames]
Better: Process in chunks, use generators
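The "process in chunks" advice can be written as a generic generator that never holds the full video in memory (a sketch over any frame iterable, e.g. a cv2 read loop wrapped as a generator):

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items; only one chunk is ever in memory."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

print([c for c in chunked(range(7), 3)])  # [[0, 1, 2], [3, 4, 5], [6]]
```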

Generated videos may have artifacts:

Watch for:

  • Objects appearing/disappearing
  • Physics violations
  • Identity drift (faces changing)
  • Temporal flicker

Create a video captioning system:

  • Extract frames at 1 fps
  • Use GPT-4V to generate captions
  • Combine into a coherent narrative
  • Test on various video types

Build semantic video search:

  • Index video content using frame embeddings
  • Implement text-to-video search
  • Add timestamp-level retrieval
  • Visualize results

Create an automated video summarizer:

  • Detect scene changes
  • Summarize each scene
  • Generate chapter markers
  • Create a highlights reel (timestamps)

Build an interactive video Q&A system:

  • Load video once, cache frames
  • Accept natural language questions
  • Return relevant frames with answers
  • Support follow-up questions

By the end of this module, you should have:

  1. Video frame extraction pipeline
  2. Video Q&A system with LLMs
  3. Video summarization tool
  4. DELIVERABLE: Video AI Toolkit

Success Criteria:

  • Can extract and sample frames from any video
  • Can answer questions about video content
  • Can generate video summaries
  • Works with multiple video formats

  • “Sora: Creating video from text” (OpenAI, 2024)
  • “VideoLLM: Modeling Video Sequence with Large Language Models” (2023)
  • “Stable Video Diffusion” (Stability AI, 2023)
  • “ViViT: A Video Vision Transformer” (Google, 2021)
  • FFmpeg: Video processing Swiss Army knife
  • OpenCV: Computer vision library
  • MoviePy: Video editing in Python
  • PySceneDetect: Scene detection

The History of Video AI: From GIFs to Sora


Understanding the evolution of video AI helps you appreciate why we’re at an inflection point—and what challenges remain unsolved.

Early video analysis was hand-crafted: optical flow algorithms tracked pixel movement, background subtraction detected motion, and Haar cascades identified objects frame-by-frame. These techniques powered early surveillance systems and basic video editing tools.

Did You Know? The Lucas-Kanade optical flow algorithm from 1981 is still used today—often as a preprocessing step before feeding frames to neural networks. Some of the best classical methods are 40+ years old and remain relevant because they’re fast and reliable for specific tasks.

The MPEG video compression standard (1993) was itself an early form of “video understanding”—it exploited temporal redundancy by recognizing that most frames are similar to their neighbors. In a sense, video compression algorithms were the first systems to learn that “time matters.”

AlexNet’s 2012 ImageNet victory ignited computer vision—but video lagged behind. Researchers tried two approaches:

  1. Frame-level CNNs: Run image classification on each frame independently, then vote. Simple but ignores temporal structure.

  2. Two-stream networks (2014): Process RGB frames in one CNN (appearance) and optical flow in another CNN (motion), then fuse. This architecture dominated video classification for years.

The major datasets that drove progress were UCF-101 (2012, 13,000 clips) and Sports-1M (2014, 1.1 million clips). These seem tiny now, but they established video classification as a tractable benchmark problem.

The 3D CNN and Transformer Era (2017-2022)


I3D (2017) from DeepMind showed that “inflating” 2D ImageNet-pretrained CNNs into 3D convolutions (across space and time) worked surprisingly well. The Kinetics dataset (400+ action classes, 300,000+ clips) became the new benchmark.

SlowFast Networks (2019) from Facebook introduced dual-pathway processing: a “slow” pathway for spatial semantics (few frames) and a “fast” pathway for motion (many frames). This matched intuitions from neuroscience about how the brain processes motion.

Video Transformers (2021) arrived with TimeSformer and ViViT, extending vision transformers to video. The key insight: attention can model long-range temporal dependencies that convolutions struggle with. A transformer can directly relate frame 1 to frame 100—no need to propagate information through intermediate frames.

Video generation followed image generation, but with a 1-2 year lag:

  • Stable Diffusion (August 2022) → Stable Video Diffusion (November 2023)
  • DALL-E 2 (April 2022) → Sora (February 2024)

The breakthrough came from treating video as “3D images”—extending diffusion models to denoise across space and time simultaneously. Temporal attention layers ensure consistency across frames.

Google’s Lumiere (January 2024) and OpenAI’s Sora (February 2024) showed that scaling up video diffusion models produces shockingly realistic results. The race is now on.


Production War Stories: Video AI in the Wild


The Streaming Service That Couldn’t Count Views


Los Angeles. March 2024. A major streaming platform deployed video AI to automatically detect “engagement moments”—scenes that kept viewers watching. The system identified key frames, tagged emotional beats, and predicted where viewers would skip.

Initial results were promising: engagement predictions showed a 0.7 correlation with actual viewing patterns. But something strange emerged—the model flagged car chase scenes as low engagement, even though human editors knew these were viewer favorites.

Investigation revealed the bug: car chases have high visual redundancy (similar frames in rapid succession). The model’s frame sampling missed the critical 2-second moments where crashes or near-misses happened. It was sampling 1 frame per 5 seconds—perfect for dialogue scenes, terrible for action.

The fix: They implemented adaptive sampling that detected motion intensity and increased frame rate for high-motion segments. Accuracy on action content jumped from 45% to 82%.

Lesson: Frame sampling strategy should match content type. There’s no universal “correct” sampling rate—you need to understand what you’re analyzing.

Singapore. November 2023. A warehouse deployed video AI for theft detection. The system watched 200 cameras 24/7, flagging suspicious activity for human review. In the first week, it generated 15,000 alerts—overwhelming the 3-person security team.

The problem: the model was trained on internet video datasets featuring “stealing” actions. It had learned that “picking up objects” was suspicious. In a warehouse, workers pick up objects constantly. The false positive rate exceeded 99%.

The team tried threshold tuning, temporal filtering, and confidence calibration. Nothing worked—the model’s entire concept of “suspicious” was miscalibrated for the warehouse context.

The fix: They retrained on 500 hours of their own footage, annotated with actual theft incidents (only 12 in the dataset) versus normal operations. The new model learned that forklifts are normal, unmarked vehicles are suspicious, and nighttime activity in empty zones deserves attention. False positives dropped to under 5%.

Lesson: Video AI models learn what’s “normal” from their training data. If your context differs from the training distribution, you’ll need domain-specific fine-tuning—not just threshold adjustment.

Austin. June 2024. A post-production studio used AI to upscale old footage from 480p to 4K. The results were beautiful—until they noticed something disturbing. In interview footage, the AI had interpolated new frames to smooth motion. But the interviewee’s lips now moved slightly out of sync with their words.

The model had learned “natural motion” from training data where audio wasn’t available. It optimized for visual smoothness without any concept of audio-visual synchronization. The result: professionally generated deepfake-like artifacts in legitimate content.

The fix: They switched to a model that jointly processed video and audio, maintaining lip-sync as a hard constraint. Processing time doubled, but the results were usable. For footage-only upscaling (no dialogue), they kept the faster model.

Lesson: Video is multimodal. Any system that ignores audio will eventually create audio-visual inconsistencies. For content where sync matters, you need audio-aware processing.


Mistake 1: Treating Video as Independent Images

Section titled “Mistake 1: Treating Video as Independent Images”
# WRONG - Process frames independently
def analyze_video(video_path):
    results = []
    for frame in extract_frames(video_path):
        result = image_classifier(frame)  # No temporal context!
        results.append(result)
    return results

# RIGHT - Maintain temporal context
def analyze_video_with_context(video_path, window_size=5):
    frames = extract_frames(video_path)
    results = []
    for i, frame in enumerate(frames):
        # Include surrounding frames as context
        start = max(0, i - window_size // 2)
        end = min(len(frames), i + window_size // 2 + 1)
        context_frames = frames[start:end]
        result = video_classifier(context_frames, query_frame=i - start)
        results.append(result)
    return results

Consequence: Without temporal context, you can recognize objects but not actions. Distinguishing "person" from "person running" from "person falling" requires understanding motion across frames.
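Both snippets rely on an extract_frames helper that the module never defines. A minimal OpenCV sketch is below; the helper name and fps parameter are inferred from the call sites, and the sampling math is one reasonable choice, not the module's official implementation:

```python
def sample_step(native_fps, target_fps):
    """How many decoded frames to skip between kept frames."""
    return max(1, round(native_fps / target_fps))

def extract_frames(video_path, fps=1):
    """Decode video_path with OpenCV, keeping roughly `fps` frames per second."""
    import cv2  # opencv-python; imported lazily so the math above is testable without it

    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # guard against missing metadata
    step = sample_step(native_fps, fps)
    frames, index = [], 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```

Keeping the step calculation in its own function makes the sampling rate easy to unit-test without decoding any video.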

Mistake 2: Fixed Frame Rates for All Content

Section titled “Mistake 2: Fixed Frame Rates for All Content”
# WRONG - Same sampling for everything
def extract_frames_fixed(video_path):
    return extract_frames(video_path, fps=1)  # 1 frame per second

# RIGHT - Adaptive sampling based on content
import cv2

def extract_frames_adaptive(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    prev_frame = None
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if prev_frame is not None:
            # Calculate frame difference
            diff = cv2.absdiff(frame, prev_frame).mean()
            # High difference = motion = sample more
            if diff > 30:
                frames.append(frame)
            elif diff > 10 and len(frames) % 5 == 0:
                frames.append(frame)
            elif len(frames) % 30 == 0:  # Minimum sampling
                frames.append(frame)
        else:
            frames.append(frame)
        prev_frame = frame
    cap.release()
    return frames

Consequence: Fixed sampling misses crucial moments in dynamic content and wastes compute on static scenes.

Mistake 3: Ignoring Video Duration in API Costs

Section titled “Mistake 3: Ignoring Video Duration in API Costs”
# WRONG - Send all frames to vision API
def expensive_video_analysis(video_path):
    frames = extract_frames(video_path, fps=30)  # 30 fps!
    for frame in frames:  # 10-second video = 300 API calls!
        response = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": encode_frame(frame)}}
            ]}]
        )
    # $0.01 per image × 300 = $3 per 10-second video

# RIGHT - Intelligent sampling and batching
def efficient_video_analysis(video_path):
    frames = extract_frames(video_path, fps=1)  # 1 fps
    sampled = sample_frames_uniform(frames, n_samples=10)  # 10 frames max
    # Batch into single request
    content = [{"type": "text", "text": "Analyze these video frames:"}]
    for frame in sampled:
        content.append({"type": "image_url", "image_url": {"url": encode_frame(frame)}})
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}]
    )
    # ~10 images × $0.01 = $0.10 per video

Consequence: Naive video processing can cost 30x more than necessary. Always sample intelligently and batch when possible.
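The efficient version above calls a sample_frames_uniform helper that isn't shown. A simple sketch, with the name and signature assumed from the call site:

```python
def sample_frames_uniform(frames, n_samples=10):
    """Pick up to n_samples frames spaced evenly across the whole clip."""
    if len(frames) <= n_samples:
        return list(frames)
    # Evenly spaced indices from the first frame to the last
    step = (len(frames) - 1) / (n_samples - 1)
    indices = [round(i * step) for i in range(n_samples)]
    return [frames[i] for i in indices]
```

Anchoring the first and last indices guarantees the opening and closing moments of the clip are always represented.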


Q: “How would you build a system to detect highlights in sports broadcasts?”

Strong Answer: “Highlight detection requires understanding what makes a moment ‘exciting’—and that varies by sport. My approach would be multimodal and hierarchical.

First, I’d use audio analysis to detect crowd noise spikes. These correlate strongly with highlights across all sports—goals, touchdowns, knockouts all produce distinctive crowd reactions. This is cheap to compute and catches 80%+ of highlights.

Second, for the visual track, I’d detect scene changes and camera movements. Highlights often trigger replays (same action, different angle) and close-ups. A sudden switch from wide shot to close-up following a crowd noise spike is almost certainly a highlight.

Third, for sport-specific detection, I’d fine-tune a video classifier on labeled highlights. Soccer goals look different from basketball dunks, which look different from tennis aces. The sport-specific model adds precision.

Finally, I’d combine signals with learned weights. A highlight needs at least two of: audio spike, visual scene change, sport-specific action detection. This reduces false positives while maintaining recall.

The system should output timestamp ranges with confidence scores, enabling downstream applications to choose their sensitivity threshold.”
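The "at least two signals" rule from this answer can be sketched as a simple voting combiner. The function name, threshold, and signal representations here are illustrative, not part of the answer itself:

```python
def is_highlight(audio_spike, scene_change, action_score, action_threshold=0.7):
    """Flag a moment as a highlight when at least two of three signals agree."""
    votes = [
        bool(audio_spike),                  # crowd-noise spike detected
        bool(scene_change),                 # cut to replay/close-up detected
        action_score >= action_threshold,   # sport-specific classifier fired
    ]
    return sum(votes) >= 2
```

A production version would return confidence scores over timestamp ranges rather than a boolean, as the answer suggests, but the voting logic stays the same.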

Q: “Explain the key technical challenges in generating long-form video (>10 seconds).”

Strong Answer: “Long-form video generation faces three fundamental challenges that don’t exist in image generation.

First, temporal consistency. Over 10+ seconds, subjects need to maintain identity—same face, same clothing, same physics. Current models struggle with this because they don’t have explicit object tracking. A person might subtly morph across frames, or objects might drift. Sora handles this better than competitors, likely through longer temporal attention windows, but it’s still not solved.

Second, narrative coherence. A 60-second video should tell a story with beginning, middle, and end. Current models generate ‘moments’ rather than ‘narratives.’ They can show a woman walking through Tokyo, but struggle with ‘a woman walks through Tokyo, enters a shop, buys something, and leaves.’ Causal relationships between scenes require world modeling that pure diffusion doesn’t provide.

Third, computational cost. Video generation is roughly O(n²) in frame count due to temporal attention. A 4-second video might take 30 seconds; a 60-second video might take 15 minutes—and require 10x more GPU memory. The scaling problem means current models are economically viable only for short clips.

The research frontier includes hierarchical generation (coarse-to-fine), autoregressive approaches (generate frame-by-frame with past context), and world models that plan before generating. I’d expect 60+ second coherent generation to be solved within 2-3 years.”

Q: “How would you detect deepfake videos in production?”

Strong Answer: “Deepfake detection is an adversarial problem—as detection improves, generation adapts. Any production system needs defense in depth.

First layer: metadata analysis. Real video has provenance—EXIF data, compression artifacts consistent with specific cameras/apps, consistent timestamps. Deepfakes often strip or fabricate metadata. This catches amateur fakes.

Second layer: temporal inconsistencies. Deepfakes often have subtle artifacts—blinking patterns that don’t match human norms, earlobes that flicker, hairlines that shift. I’d use a video classifier trained specifically on known deepfakes and their artifacts.

Third layer: audio-visual sync. Face-swapped videos often have subtle lip-sync mismatches. A multimodal model that understands speech and lip movement can detect when they diverge.

Fourth layer: source verification. The strongest defense is proving where video came from—C2PA standards, cryptographic signing at capture time, chain of custody. This doesn’t detect fakes; it verifies authenticity.

In production, I’d ensemble these approaches with confidence scores. No single method is reliable against sophisticated fakes, but the combination is much harder to defeat. And I’d plan for model updates—today’s state-of-the-art detector will be obsolete against tomorrow’s generators.”
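A weighted ensemble over those four layers could be sketched as follows. The layer names, weights, and review threshold are all illustrative assumptions, not a production recipe:

```python
def deepfake_score(layer_scores, weights=None):
    """Combine per-layer suspicion scores (0 = authentic, 1 = fake) into one score.

    layer_scores: dict such as {"metadata": 0.2, "temporal": 0.8}; missing
    layers are simply excluded and the remaining weights are renormalized.
    """
    if weights is None:
        # Illustrative weights; in practice, tune on labeled real/fake data
        weights = {"metadata": 0.15, "temporal": 0.35,
                   "av_sync": 0.30, "provenance": 0.20}
    total = sum(weights[k] for k in layer_scores)
    return sum(weights[k] * layer_scores[k] for k in layer_scores) / total

score = deepfake_score({"metadata": 0.1, "temporal": 0.9, "av_sync": 0.7})
flagged = score > 0.5  # send to human review above this threshold
```

Renormalizing over the available layers means the ensemble degrades gracefully when a layer can't run (for example, no audio track for the sync check).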


| Service | Video Understanding | Video Generation | Notes |
| --- | --- | --- | --- |
| gpt-5 (Vision) | ~$0.10/minute of video* | N/A | *At 10 frames/min |
| Google Gemini 1.5 | ~$0.05/minute of video | N/A | Native video input |
| Claude 3 Opus | ~$0.15/minute of video* | N/A | *At 10 frames/min |
| Runway Gen-3 | N/A | $0.05/second | ~$3/minute of output |
| Pika | N/A | ~$0.50/5-second clip | Subscription model |
| Local (faster-whisper) | ~$0.002/minute | N/A | GPU amortized |
| Local (SVD) | N/A | ~$0.01/second | GPU amortized |

When should you run video AI locally vs use APIs?

Video Understanding:

API cost (gpt-5): ~$0.10 per minute of video analyzed
Local cost (GPU): ~$0.01 per minute (amortized A10)
Break-even: ~1,000 minutes/month
Below 1,000 min: Use API (simpler)
Above 1,000 min: Consider local (10x cheaper)
Above 10,000 min: Definitely local (saves $900+/month)

Video Generation:

API cost (Runway): $3 per minute generated
Local cost (GPU): ~$0.50 per minute (amortized A100)
Break-even: ~100 minutes/month
Below 100 min: Use API (quality, speed)
Above 100 min: Consider local (6x cheaper)
Above 1,000 min: Hybrid approach
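The break-even points above follow from simple arithmetic: the API is pure marginal cost, while local inference adds a fixed monthly GPU cost on top of a small marginal one. A throwaway calculator, using the module's per-minute estimates plus assumed fixed monthly GPU shares (~$90 for an A10, ~$250 for an A100), makes the comparison explicit:

```python
def break_even_minutes(api_per_min, local_per_min, gpu_fixed_per_month):
    """Monthly minutes where API spend equals fixed GPU cost plus local marginal cost."""
    # Solve: api_per_min * m = gpu_fixed_per_month + local_per_min * m
    return gpu_fixed_per_month / (api_per_min - local_per_min)

# Understanding: $0.10/min API vs ~$0.01/min local, assumed ~$90/month GPU share
understanding = break_even_minutes(0.10, 0.01, 90)   # ~1,000 minutes/month
# Generation: $3/min API vs ~$0.50/min local, assumed ~$250/month GPU share
generation = break_even_minutes(3.00, 0.50, 250)     # ~100 minutes/month
```

Below the break-even volume, the fixed GPU cost dominates and the API wins; above it, every additional minute widens the gap in favor of local inference.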

Did You Know? Netflix spends over $150 million annually on video encoding and analysis infrastructure. At their scale (200+ million subscribers, billions of hours watched), even a 1% efficiency improvement saves millions. Most of their video AI runs on custom hardware optimized for their specific workloads.

  1. Storage: Raw video is huge. 1 hour of 4K video = ~100GB. Plan for storage costs.
  2. Bandwidth: Sending video to cloud APIs costs egress fees. Consider on-device preprocessing.
  3. Latency: Real-time video AI requires low-latency inference. Edge deployment often necessary.
  4. Compliance: Video of people faces GDPR/CCPA constraints. Detection/blurring adds cost.

Video generation is coming for Hollywood, not to replace it but to democratize it. Consider the economics: a typical commercial costs $300,000-$500,000 to produce, while a Sora-quality generation will cost under $100 in API fees. A cost reduction of three or more orders of magnitude will fundamentally change who can create professional video content.

Did You Know? In March 2024, Tyler Perry paused an $800 million studio expansion after seeing Sora demos. “It makes me worry about all the people who work in those industries,” he said. The first major entertainment executive to publicly acknowledge AI’s impact on studio economics.

This isn’t hypothetical. Runway is already being used for commercials, music videos, and social media content. The first AI-generated Super Bowl commercial is likely 2-3 years away. By 2030, most short-form video content will involve AI generation or assistance.

Today’s video generation is slow—30 seconds to several minutes per second of output. But progress is rapid. Google’s VideoPoet and Meta’s Make-A-Video show generation times dropping by 2-4x per year.

The endgame is real-time video generation: describe what you want to see, and it appears instantly. This enables:

  • Interactive storytelling: Choose-your-own-adventure videos that generate on the fly
  • Personalized advertising: Ads customized to each viewer’s preferences
  • Live game rendering: Video games that generate environments in real-time
  • Virtual production: Actors performing against AI-generated backgrounds

Real-time generation requires 100-1000x speedup from today. Hardware improvements (NPUs, tensor cores) combined with algorithmic advances (distillation, caching) will get us there within 5-7 years.

The future isn’t “video AI”—it’s “reality AI.” Systems that understand and generate video, audio, text, and physical interactions simultaneously.

Google’s Gemini 1.5 can process hour-long videos natively. OpenAI’s gpt-5 handles audio-video-text in a single model. The next generation will add:

  • 3D understanding: Reconstruct scenes from video, navigate them in VR
  • Physical reasoning: Predict what happens when objects interact
  • Embodied action: Generate video that shows how to perform tasks

This convergence means the distinction between “vision model” and “language model” and “video model” will disappear. There will just be “AI”—and it will understand the world through all modalities simultaneously.

If you’re building video AI systems today:

  1. Design for model swapping: Today’s best model won’t be tomorrow’s. Abstract your video generation and understanding providers.

  2. Plan for real-time: Even if your current application is batch, architect for streaming. Real-time will become the default expectation.

  3. Think multimodal: Video without audio is incomplete. Text without video is limited. Build systems that integrate all modalities.

  4. Consider edge deployment: Video data is huge and expensive to move. Processing video on-device or at the edge will become increasingly important.

  5. Build safety in from the start: Deepfakes, copyright issues, and content moderation will only get harder. Watermarking, provenance tracking, and detection should be part of your architecture, not afterthoughts.

The video AI stack you build today should be ready for a world where AI can generate—and understand—any video imaginable. That world is arriving faster than most people expect.


After working through this module, here’s what you should remember:

  1. Video is fundamentally harder than images. The temporal dimension adds complexity that can’t be solved by just processing more frames. You need architectures that understand time—and that’s expensive.

  2. Frame sampling is an art, not a science. Too few frames and you miss critical moments. Too many and you blow your budget. The right strategy depends on your use case: action recognition needs dense sampling, while video summarization can use sparse keyframes.

  3. Vision LLMs unlock video understanding without dedicated video pipelines. GPT-4V wasn't built as a video model, and Gemini accepts video natively; either way, feeding frame sequences to a vision LLM lets you build powerful video Q&A, captioning, and summarization systems. This is the pragmatic approach for most production applications.

  4. Video generation is the AI arms race of 2024-2025. Sora showed what’s possible, and now Runway, Pika, Luma, and others are racing to commercialize it. The business implications—from Hollywood to TikTok—are enormous.

  5. The deepfake problem is real. Every advance in video generation is also an advance in deception technology. Detection, watermarking, and provenance tracking are becoming as important as generation itself. The arms race between generation and detection will intensify—plan for it.

  6. Video generation is becoming economically viable. At $3/minute for Runway and dropping, AI-generated video is already cheaper than traditional production for many use cases. The cost curve will continue falling. Within five years, most short-form content will involve AI generation.

  7. Think multimodal. Video without audio understanding is incomplete. The best video AI systems process all modalities together—visual, audio, text. Standalone video analysis will increasingly be a legacy pattern.


Video AI represents the convergence of all multimodal capabilities. Understanding video requires temporal reasoning, while generating video requires maintaining consistency across time.

Quick Reference:

  1. Frame sampling is critical - balance coverage with compute cost
  2. Vision LLMs (GPT-4V, Gemini) can understand video through frame sequences
  3. Video generation (Sora, Runway) uses diffusion models with temporal attention
  4. Production video AI requires chunking, caching, and optimization
  5. Audio matters - don’t ignore the soundtrack!

Phase 5 Complete: You’ve now mastered multimodal AI across:

  • Speech (Module 22): STT, TTS, voice assistants
  • Vision (Module 23): CLIP, VLMs, document understanding
  • Video (Module 24): Understanding, generation, analysis

What’s Next: Phase 6 - Deep Learning Foundations. Time to understand how these models work under the hood!


Last updated: 2025-11-26 Next: Phase 6 - Deep Learning Foundations