Video AI
AI/ML Engineering Track | Complexity: COMPLEX | Time: 5-6 hours
Or: When Images Just Aren’t Enough
Reading Time: 6-7 hours | Prerequisites: Module 23
San Francisco. February 15, 2024. 9:47 AM. OpenAI researcher Tim Brooks was about to break the internet. He clicked “publish” on a Twitter thread showing videos generated by their new system, Sora. A woman walking through Tokyo with perfect reflections in shop windows. Woolly mammoths trudging through pristine snow. A drone shot weaving through Big Sur’s coastal highway.
Within hours, the videos had 100 million views. Film studios scrambled emergency meetings. VFX professionals questioned their career choices. Meme accounts declared “CGI is dead.”
The woman walking through Tokyo didn’t exist. The mammoths hadn’t walked the earth in 10,000 years. The Big Sur footage was conjured from text in seconds.
“When we first generated the Tokyo video, I watched it three times looking for the tell—the thing that would reveal it was AI-generated. I couldn’t find it.” — Tim Brooks, OpenAI Research Lead, speaking at CVPR 2024
This is the video AI moment that changed everything. In this module, you’ll learn the technology behind Sora and its competitors, how to build video understanding systems that can analyze and answer questions about video content, and the techniques that make video generation possible.
Video is AI’s final frontier—and you’re about to explore it.
What You’ll Be Able to Do
By the end of this module, you will:
- Understand video AI architectures and temporal reasoning
- Implement video understanding (captioning, Q&A, summarization)
- Explore video generation technologies (Sora, Runway, Pika)
- Build video analysis pipelines with frame extraction
- Master video-to-text and text-to-video applications
Introduction: The Video Frontier
Video is the final frontier of multimodal AI. While images capture a single moment, video captures time: motion, actions, narratives, and causality. Understanding and generating video requires AI to reason about temporal sequences, predict what happens next, and maintain consistency across frames.
Why Video AI is Hard
If image understanding is like reading a photograph, video understanding is like reading a novel—with pictures on every page, and you need to remember every page you’ve read while processing the next one.
Think of it this way: understanding a single image is like looking at a crime scene photo. Understanding video is like being a detective watching security footage—you need to track who entered, what they did, when they left, and how all those events connect.
Video presents unique challenges that images don’t:
- Temporal Dimension: A 10-second video at 30fps has 300 frames, orders of magnitude more data than a single image
- Motion Understanding: Detecting not just what’s there, but what’s happening
- Long-Range Dependencies: Events at the start may connect to events at the end
- Consistency: Generated videos must maintain object identity and physics across frames
- Compute Requirements: Processing video requires 10-100x more compute than images
Did You Know? The human visual cortex processes video at about 10-12 “frames” per second of conscious perception, but our eyes actually sample at varying rates depending on what we’re looking at. Fast motion triggers higher sampling rates. Early video AI researchers tried to mimic this “attention-based sampling” and found it reduced compute costs by 40% while maintaining accuracy.
The Video AI Landscape
```
Video AI Applications
├── Understanding (Analysis)
│   ├── Video Classification
│   ├── Action Recognition
│   ├── Object Tracking
│   ├── Video Captioning
│   ├── Video Q&A
│   └── Video Summarization
│
├── Generation (Creation)
│   ├── Text-to-Video
│   ├── Image-to-Video
│   ├── Video-to-Video (Style Transfer)
│   ├── Video Prediction
│   └── Video Editing
│
└── Multimodal (Combined)
    ├── Video + Audio Understanding
    ├── Video + Text Search
    └── Video + Language Models
```
Part 1: Video Understanding Fundamentals
Frame Extraction and Sampling
The first step in video understanding is converting continuous video to discrete frames:
```python
import cv2
import numpy as np
from pathlib import Path
from typing import List, Tuple

def extract_frames(
    video_path: str,
    fps: float = 1,        # Frames per second to extract
    max_frames: int = 100
) -> List[np.ndarray]:
    """
    Extract frames from a video at the specified fps.

    Args:
        video_path: Path to video file
        fps: Frames per second to extract
        max_frames: Maximum frames to extract

    Returns:
        List of frames as numpy arrays (BGR format)
    """
    cap = cv2.VideoCapture(video_path)

    video_fps = cap.get(cv2.CAP_PROP_FPS)
    frame_interval = max(1, int(video_fps / fps))  # Guard against a zero interval

    frames = []
    frame_count = 0

    while cap.isOpened() and len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break

        if frame_count % frame_interval == 0:
            frames.append(frame)

        frame_count += 1

    cap.release()
    return frames
```
Sampling Strategies
Different tasks require different sampling strategies:
| Strategy | Description | Use Case |
|---|---|---|
| Uniform | Equal intervals | General analysis |
| Key Frame | Scene changes | Video summarization |
| Dense | High fps | Action recognition |
| Sparse | Low fps | Long video understanding |
| Adaptive | Based on motion | Efficient processing |
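The adaptive strategy from the table can be sketched with a simple motion score: measure how much each frame differs from its predecessor and keep more frames where motion is high. A minimal numpy sketch; the `motion_threshold` and `low_motion_stride` values are illustrative, not from this module:

```python
import numpy as np
from typing import List

def sample_frames_adaptive(
    frames: List[np.ndarray],
    motion_threshold: float = 10.0,  # mean absolute pixel difference (illustrative)
    low_motion_stride: int = 5
) -> List[np.ndarray]:
    """Keep every frame during high motion, every Nth frame otherwise."""
    if not frames:
        return []

    sampled = [frames[0]]
    for i in range(1, len(frames)):
        # Motion score: mean absolute difference between consecutive frames
        motion = np.mean(np.abs(frames[i].astype(np.float32)
                                - frames[i - 1].astype(np.float32)))
        if motion >= motion_threshold or i % low_motion_stride == 0:
            sampled.append(frames[i])
    return sampled

# Synthetic check: ten static frames, then one sudden change
static = [np.zeros((8, 8), np.uint8) for _ in range(10)]
moving = [np.full((8, 8), 200, np.uint8)]
kept = sample_frames_adaptive(static + moving)
# kept has 3 frames: the first frame, one low-motion keeper, and the change
```

This is the same idea the streaming-platform war story later in this module relied on: dense sampling where motion is high, sparse sampling where it is low.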
```python
def sample_frames_uniform(frames: List, n_samples: int) -> List:
    """Uniformly sample n frames from a list."""
    indices = np.linspace(0, len(frames) - 1, n_samples, dtype=int)
    return [frames[i] for i in indices]

def sample_frames_keyframes(frames: List, threshold: float = 0.3) -> List:
    """Sample frames at scene changes (key frames)."""
    keyframes = [frames[0]]

    for i in range(1, len(frames)):
        # Compare histogram correlation between consecutive frames
        hist1 = cv2.calcHist([frames[i - 1]], [0], None, [256], [0, 256])
        hist2 = cv2.calcHist([frames[i]], [0], None, [256], [0, 256])
        diff = cv2.compareHist(hist1, hist2, cv2.HISTCMP_CORREL)

        if diff < threshold:  # Low correlation: scene change detected
            keyframes.append(frames[i])

    return keyframes
```
Temporal Modeling
Understanding video requires modeling temporal relationships. This is where different architectural approaches shine—each with its own strengths.
Think of temporal modeling like different ways of reading a book:
- 3D CNNs are like scanning pages with a magnifying glass that sees 3D chunks of text
- RNNs/LSTMs are like reading word by word, remembering what came before
- Transformers are like having photographic memory—you can instantly relate any word to any other word in the book
Each approach trades off compute cost, memory requirements, and the ability to capture long-range dependencies.
Approaches:

- 3D CNNs: Extend 2D convolutions to space-time
  `Input: [B, C, T, H, W]` (batch, channels, time, height, width)
- RNNs/LSTMs: Process frame features sequentially
  `Frame embeddings → LSTM → Temporal context`
- Transformers: Attention across all frames
  `[CLS] [F1] [F2] [F3] ... [FN] → Self-Attention → Video embedding`
- Video Vision Transformers (ViViT): Patch + temporal tokens
  `Video → Space-time patches → Transformer → Classification`
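To make the `[B, C, T, H, W]` layout concrete, here is a naive, loop-based 3D convolution in plain numpy. Real systems use optimized kernels in frameworks like PyTorch; the tiny shapes and single output channel here are purely illustrative:

```python
import numpy as np

def conv3d_naive(video: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """
    Naive 3D convolution (no padding, stride 1, one output channel).
    video:  [C, T, H, W]   kernel: [C, kT, kH, kW]
    Returns [T-kT+1, H-kH+1, W-kW+1].
    """
    C, T, H, W = video.shape
    _, kT, kH, kW = kernel.shape
    out = np.zeros((T - kT + 1, H - kH + 1, W - kW + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Multiply the space-time patch by the kernel and sum
                patch = video[:, t:t + kT, y:y + kH, x:x + kW]
                out[t, y, x] = np.sum(patch * kernel)
    return out

video = np.random.rand(3, 8, 16, 16)   # C=3 channels, T=8 frames, 16x16 pixels
kernel = np.random.rand(3, 3, 3, 3)    # a 3x3x3 space-time kernel
features = conv3d_naive(video, kernel)
# features.shape == (6, 14, 14): the kernel slides across time as well as space
```

The key difference from a 2D convolution is that the kernel also spans the time axis, so each output value summarizes a short burst of motion, not just a spatial neighborhood.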
Part 2: Video Understanding with LLMs
Modern video understanding leverages vision-language models by processing videos as sequences of frames.
Video Q&A with GPT-4V
Section titled “Video Q&A with GPT-4V”import base64import cv2from openai import OpenAI
def video_qa_with_gpt4v( video_path: str, question: str, n_frames: int = 10, client: OpenAI = None) -> str: """ Answer questions about a video using GPT-4V.
Args: video_path: Path to video file question: Question about the video n_frames: Number of frames to sample client: OpenAI client
Returns: Answer from the model """ client = client or OpenAI()
# Extract frames frames = extract_frames(video_path, fps=1, max_frames=n_frames * 3) sampled = sample_frames_uniform(frames, n_frames)
# Encode frames to base64 content = [{"type": "text", "text": f"These are {n_frames} frames from a video. {question}"}]
for i, frame in enumerate(sampled): _, buffer = cv2.imencode('.jpg', frame) base64_frame = base64.b64encode(buffer).decode('utf-8') content.append({ "type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"} })
response = client.chat.completions.create( model="gpt-5", messages=[{"role": "user", "content": content}], max_tokens=500 )
return response.choices[0].message.contentVideo Captioning
Generate descriptions of video content:
```python
def caption_video(
    video_path: str,
    style: str = "detailed",   # "brief", "detailed", "narrative"
    client: OpenAI = None
) -> str:
    """Generate a caption for a video."""
    client = client or OpenAI()

    frames = extract_frames(video_path, fps=1, max_frames=30)
    sampled = sample_frames_uniform(frames, 8)

    prompts = {
        "brief": "In one sentence, describe what happens in this video.",
        "detailed": "Describe this video in detail, including actions, objects, and setting.",
        "narrative": "Tell the story of what happens in this video, as if narrating a scene."
    }

    content = [{"type": "text", "text": prompts.get(style, prompts["detailed"])}]

    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=300
    )

    return response.choices[0].message.content
```
Video Summarization
Summarize long videos into key points:
```python
def summarize_video(
    video_path: str,
    max_segments: int = 5,
    client: OpenAI = None
) -> dict:
    """
    Summarize a video into key segments.

    Returns:
        {
            "overview": "Overall summary",
            "segments": [{"time": "0:00-0:30", "description": "..."}],
            "key_events": ["event1", "event2"]
        }
    """
    client = client or OpenAI()

    # Sample more frames for long videos
    frames = extract_frames(video_path, fps=0.5, max_frames=50)
    sampled = sample_frames_uniform(frames, 12)

    prompt = """Analyze these video frames and provide:
1. OVERVIEW: A 2-3 sentence summary of the entire video
2. SEGMENTS: Break down the video into key segments (up to 5)
3. KEY_EVENTS: List the most important events or moments

Format your response as:
OVERVIEW: [summary]
SEGMENTS:
- [description of segment 1]
- [description of segment 2]
KEY_EVENTS:
- [event 1]
- [event 2]
"""

    content = [{"type": "text", "text": prompt}]

    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=600
    )

    # Parse response
    text = response.choices[0].message.content

    return {
        "overview": (text.split("OVERVIEW:")[1].split("SEGMENTS:")[0].strip()
                     if "OVERVIEW:" in text else text),
        "raw_response": text
    }
```
Part 3: Video Generation
Video generation is one of the most exciting frontiers in AI. Models can now create realistic videos from text descriptions.
If image generation is like asking an artist to paint a picture, video generation is like asking them to animate a film. Every frame must be consistent with the last. Characters can’t teleport. Physics must (mostly) obey the laws of nature. The sun can’t jump across the sky.
This is staggeringly harder than images. A single Stable Diffusion image takes 20-50 denoising steps. A 5-second video at 24fps has 120 frames—if each frame needed independent processing, that would be 2,400-6,000 denoising steps. But video diffusion models are smarter: they denoise all frames simultaneously, with temporal attention layers that enforce consistency across time.
“Generating a single beautiful image is like hitting a bullseye. Generating beautiful, consistent video is like hitting 120 bullseyes in a row—while the target is moving.” — Jim Fan, Senior Research Scientist at NVIDIA
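The temporal attention idea mentioned above can be illustrated in a few lines of numpy: treat each frame's feature vector as a token and let every frame attend to every other frame. This is a bare-bones sketch (single head, no learned projections), not Sora's actual implementation:

```python
import numpy as np

def temporal_self_attention(frame_feats: np.ndarray) -> np.ndarray:
    """
    frame_feats: [T, D], one feature vector per frame.
    Returns [T, D] where each frame's features mix information
    from every other frame, weighted by similarity.
    """
    T, D = frame_feats.shape
    # Scores: how strongly each frame attends to each other frame
    scores = frame_feats @ frame_feats.T / np.sqrt(D)          # [T, T]
    # Numerically stable softmax over the time axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # rows sum to 1
    return weights @ frame_feats                               # [T, D]

feats = np.random.rand(120, 64)   # 120 frames (5 s at 24 fps), 64-dim features
mixed = temporal_self_attention(feats)
```

Because every frame can draw on every other frame in one step, the denoiser can keep an object's appearance consistent from frame 1 to frame 120 without propagating information frame by frame.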
The Video Generation Landscape
| Model | Company | Type | Access |
|---|---|---|---|
| Sora | OpenAI | Text-to-Video | Limited preview |
| Runway Gen-2/3 | Runway | Text/Image-to-Video | API & Web |
| Pika | Pika Labs | Text-to-Video | Web |
| Stable Video | Stability AI | Image-to-Video | Open source |
| Kling | Kuaishou | Text-to-Video | Limited |
| Dream Machine | Luma AI | Text-to-Video | Web |
How Video Generation Works
Modern video generation uses diffusion models extended to the temporal dimension:
```
Text Prompt
     │
     ▼
┌─────────────────────────────────┐
│   Text Encoder (CLIP/T5)        │
└─────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────┐
│   Video Diffusion Model         │
│   • Start with noise video      │
│   • Iteratively denoise         │
│   • Conditioned on text         │
│   • Temporal attention layers   │
└─────────────────────────────────┘
     │
     ▼
Generated Video
```
Sora Architecture (Conceptual)
OpenAI’s Sora represents the state-of-the-art in video generation. Key architectural innovations:
- Spacetime Patches: Videos are converted to 3D patches (space + time)
- DiT (Diffusion Transformer): Transformer-based diffusion model
- Variable Resolution: Can generate different aspect ratios and lengths
- World Simulation: Trained to understand physical dynamics
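Spacetime patches can be illustrated with a pure-numpy reshape: cut a video into small 3D blocks (a few frames by a few pixels) and flatten each block into a token. The patch sizes below are illustrative, not Sora's actual values:

```python
import numpy as np

def to_spacetime_patches(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """
    video: [T, H, W, C]; dimensions must be divisible by the patch sizes.
    Returns [n_patches, pt*ph*pw*C]: one flattened token per 3D patch.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Split each axis into (number of blocks, block size),
    # then group the block-index axes and the within-block axes
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)        # [nT, nH, nW, pt, ph, pw, C]
    return v.reshape(-1, pt * ph * pw * C)      # one token per patch

video = np.random.rand(16, 32, 32, 3)           # 16 frames of 32x32 RGB
tokens = to_spacetime_patches(video, pt=4, ph=8, pw=8)
# 4 temporal blocks * 4 * 4 spatial blocks = 64 tokens of length 4*8*8*3 = 768
```

These tokens are what the diffusion transformer (DiT) attends over; because a token spans several frames, even single-token relationships carry motion information.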
Sora Pipeline (Conceptual):
```
Video → Compress → Latent Space →    DiT    → Decompress → Video
         (VAE)       Patches      Transformer    (VAE)
                                      ↑
                                Text Embedding
```
Using Runway API
Runway provides accessible video generation:
```python
# Note: Simplified example - actual API may differ
import os
import requests

def generate_video_runway(
    prompt: str,
    duration: int = 4,   # seconds
    style: str = "cinematic"
) -> str:
    """
    Generate video using Runway Gen-3.

    Returns:
        URL to generated video
    """
    api_key = os.getenv("RUNWAY_API_KEY")

    response = requests.post(
        "https://api.runwayml.com/v1/generate",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "prompt": prompt,
            "duration": duration,
            "style": style,
            "model": "gen3"
        }
    )

    result = response.json()
    return result.get("video_url")
```
Image-to-Video
Convert a static image into an animated video:
```python
def image_to_video(
    image_path: str,
    motion_prompt: str = "gentle camera pan",
    duration: int = 4
) -> str:
    """
    Animate a static image into a video.

    Args:
        image_path: Path to source image
        motion_prompt: Description of desired motion
        duration: Video length in seconds

    Returns:
        Path to generated video
    """
    # Using Stable Video Diffusion (conceptual)
    # Actual implementation would use a specific SDK

    # 1. Load and encode image
    # 2. Generate motion vectors from prompt
    # 3. Run video diffusion model
    # 4. Decode and save video

    return "output_video.mp4"
```
Part 4: Video Analysis Applications
Action Recognition
Detect what actions are happening in a video:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class ActionDetection:
    action: str
    confidence: float
    start_time: float
    end_time: float

def detect_actions(video_path: str, client: OpenAI = None) -> List[ActionDetection]:
    """
    Detect actions in a video.

    Returns a list of detected actions with timestamps.
    """
    client = client or OpenAI()

    frames = extract_frames(video_path, fps=2, max_frames=60)
    sampled = sample_frames_uniform(frames, 15)

    prompt = """Analyze these video frames and identify all actions occurring.
For each action, estimate when it starts and ends (as frame numbers from 1-15).

Format:
ACTION: [action name]
START: [frame number]
END: [frame number]
CONFIDENCE: [high/medium/low]
---
"""

    content = [{"type": "text", "text": prompt}]
    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=500
    )

    # Placeholder: a real implementation would parse the response text
    # into ActionDetection objects
    return [ActionDetection(
        action="detected_action",
        confidence=0.9,
        start_time=0.0,
        end_time=5.0
    )]
```
Object Tracking
Track objects across video frames:
```python
def track_objects(
    video_path: str,
    object_query: str,   # e.g., "red car", "person in blue"
    client: OpenAI = None
) -> List[dict]:
    """
    Track a specific object through a video.

    Returns a list of positions per frame.
    """
    client = client or OpenAI()

    frames = extract_frames(video_path, fps=5, max_frames=100)
    sampled = sample_frames_uniform(frames, 10)

    prompt = f"""Track the "{object_query}" through these video frames.
For each frame where the object is visible, describe its position.

Format per frame:
FRAME [N]: [position description, e.g., "center-left", "moving right"]
"""

    content = [{"type": "text", "text": prompt}]
    for frame in sampled:
        _, buffer = cv2.imencode('.jpg', frame)
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{base64_frame}"}
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=400
    )

    # Placeholder: a real implementation would parse the response into positions
    return [{"frame": i, "position": "tracked"} for i in range(len(sampled))]
```
Scene Detection
Detect scene changes in videos:
```python
def detect_scenes(video_path: str) -> List[Tuple[float, float]]:
    """
    Detect scene boundaries in a video.

    Returns a list of (start_time, end_time) for each scene.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    scenes = []
    prev_hist = None
    scene_start = 0
    frame_idx = 0
    threshold = 0.5

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Calculate normalized grayscale histogram
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()

        if prev_hist is not None:
            correlation = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)

            if correlation < threshold:  # Scene change detected
                scene_end = frame_idx / fps
                scenes.append((scene_start, scene_end))
                scene_start = scene_end

        prev_hist = hist
        frame_idx += 1

    # Add final scene
    scenes.append((scene_start, frame_idx / fps))

    cap.release()
    return scenes
```
Part 5: Production Considerations
Processing Long Videos
Long videos present a unique challenge: you can’t process a 2-hour movie the same way you’d process a 10-second clip. The memory and compute requirements would be astronomical.
The solution is hierarchical processing—like how you might summarize a book. First, you break it into chapters. Then you summarize each chapter. Finally, you combine chapter summaries into a book summary. For video, replace “chapters” with “segments,” and you’ve got the standard approach.
Think of it like a relay race: each runner (segment processor) handles their portion of the track, then passes the baton (context) to the next runner. At the end, a final runner (aggregator) synthesizes everything into a coherent result.
For long videos, use chunking and hierarchical summarization:
```python
def process_long_video(
    video_path: str,
    chunk_duration: int = 60,   # seconds
    client: OpenAI = None
) -> dict:
    """
    Process a long video by chunking.

    Returns:
        {
            "chunks": [{"start": 0, "end": 60, "summary": "..."}],
            "overall_summary": "..."
        }
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    duration = total_frames / fps
    cap.release()

    chunks = []
    chunk_summaries = []

    for start in range(0, int(duration), chunk_duration):
        end = min(start + chunk_duration, duration)

        # Extract frames for this chunk
        # Summarize chunk
        chunk_summary = f"Chunk {start}-{end}: [summary would go here]"
        chunk_summaries.append(chunk_summary)

        chunks.append({
            "start": start,
            "end": end,
            "summary": chunk_summary
        })

    # Combine chunk summaries into an overall summary
    overall = " ".join(chunk_summaries)

    return {
        "chunks": chunks,
        "overall_summary": overall
    }
```
Cost Optimization
Video AI can be expensive. Strategies to optimize:
- Smart Sampling: Don’t process every frame
- Resolution Reduction: Downscale frames before sending to API
- Caching: Cache results for repeated queries
- Local Pre-processing: Use local models for filtering before API calls
```python
def optimize_frames_for_api(
    frames: List[np.ndarray],
    max_size: int = 512,
    quality: int = 80
) -> List[str]:
    """Optimize frames for API calls (reduce size/quality)."""
    optimized = []

    for frame in frames:
        # Resize so the longest side is at most max_size
        h, w = frame.shape[:2]
        if max(h, w) > max_size:
            scale = max_size / max(h, w)
            frame = cv2.resize(frame, None, fx=scale, fy=scale)

        # Encode with reduced JPEG quality
        _, buffer = cv2.imencode('.jpg', frame,
                                 [cv2.IMWRITE_JPEG_QUALITY, quality])
        base64_frame = base64.b64encode(buffer).decode('utf-8')
        optimized.append(base64_frame)

    return optimized
```
Real-time Video Processing
For real-time applications, use streaming approaches:
```python
import queue
import threading

class VideoStreamProcessor:
    """Process video streams in real-time."""

    def __init__(self, buffer_size: int = 30):
        self.frame_queue = queue.Queue(maxsize=buffer_size)
        self.result_queue = queue.Queue()
        self.running = False
        self.threads = []

    def process_stream(self, video_source: str):
        """Start processing a video stream."""
        self.running = True

        # Start capture thread
        capture_thread = threading.Thread(
            target=self._capture_frames, args=(video_source,))
        capture_thread.start()

        # Start processing thread
        process_thread = threading.Thread(target=self._process_frames)
        process_thread.start()

        self.threads = [capture_thread, process_thread]

    def _capture_frames(self, source: str):
        """Capture frames from the video source."""
        cap = cv2.VideoCapture(source)

        while self.running and cap.isOpened():
            ret, frame = cap.read()
            if ret:
                try:
                    self.frame_queue.put(frame, timeout=1)
                except queue.Full:
                    continue
            else:
                break

        cap.release()

    def _process_frames(self):
        """Process captured frames."""
        while self.running:
            try:
                frame = self.frame_queue.get(timeout=1)
                result = self._analyze_frame(frame)
                self.result_queue.put(result)
            except queue.Empty:
                continue

    def _analyze_frame(self, frame: np.ndarray) -> dict:
        """Analyze a single frame."""
        return {"frame_analyzed": True}

    def stop(self):
        """Stop processing and wait for worker threads to exit."""
        self.running = False
        for t in self.threads:
            t.join(timeout=2)
```
Did You Know? Historical Context and Stories
Sora: The Model That Broke the Internet
On February 15, 2024, OpenAI released preview videos from Sora, and the AI world stopped. The generated videos showed:
- A woman walking through Tokyo streets with realistic reflections
- Woolly mammoths trudging through snow
- A drone shot following cars through Big Sur
The videos were so realistic that many questioned if they were actually AI-generated. OpenAI CEO Sam Altman took requests on Twitter, generating custom videos live. The demo sparked both excitement (“AGI is coming”) and concern (“deepfakes will be unstoppable”).
Key technical innovations:
- Spacetime patches: Treating video as 3D data
- Variable duration/resolution: Not fixed to specific formats
- World simulation: Understanding physics, not just pixels
The $1.5 Billion Video Generation Race
After Sora’s reveal, a funding frenzy began:
- Runway raised $141M at $1.5B valuation
- Pika Labs raised $55M at $200M valuation
- Luma AI raised $43M for Dream Machine
- Stability AI open-sourced Stable Video Diffusion
The race is on to create the “ChatGPT of video.”
YouTube’s 500 Hours Per Minute
Every minute, over 500 hours of video are uploaded to YouTube. This creates massive demand for:
- Automated content moderation
- Video search and discovery
- Thumbnail generation
- Caption and translation
Google processes more video than any company in history, driving innovation in video AI.
The DeepFake Dilemma
Video generation has a dark side. In 2019, a deepfake video of Mark Zuckerberg went viral, showing how AI could create convincing fake videos of anyone. This led to:
- California’s AB 730 law against deepfakes in elections
- Detection research at major tech companies
- Watermarking initiatives (C2PA)
- The “dead internet theory” debate
OpenAI delayed Sora’s public release partly due to deepfake concerns.
Netflix’s $1B Content Analysis
Netflix uses video AI extensively:
- Thumbnail selection: AI picks which frame makes you click
- Content tagging: Automatic genre and mood detection
- Highlight detection: Finding key moments for trailers
- Quality analysis: Detecting encoding artifacts
Their recommendation system (which includes video analysis) is worth an estimated $1B annually in retained subscribers.
The First AI-Generated Film Festival
In 2023, the first film festival featuring entirely AI-generated content was held. Winning entries included:
- A 3-minute sci-fi short created with Runway
- An animated documentary using Pika
- A music video with Stable Video Diffusion
The festival sparked debate: Is AI-generated content “art”? Who owns the copyright?
Gemini 1.5’s Million-Token Video Understanding
In February 2024, Google demonstrated Gemini 1.5 Pro processing an entire 45-minute video in a single context window. The model could:
- Answer questions about any moment
- Identify recurring characters
- Understand plot development
- Find specific visual details
This represented a leap from processing video as “frames” to understanding video as “content.”
Common Pitfalls and How to Avoid Them
Pitfall 1: Too Few Frames
Sampling too few frames misses important content:
Bad: 3 frames from a 5-minute video
Better: Sample more frames for longer videos (1 fps minimum)
Pitfall 2: Ignoring Audio
Video is multimodal; audio provides crucial context:
Bad: Analyze only visual frames
Better: Extract and analyze audio track separately, then combine
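A common fix is to pull the audio track out with ffmpeg and transcribe it separately (e.g. with a speech-to-text model), then pass the transcript alongside the frames. A minimal sketch of the extraction step; the helper names are ours, and the flags shown (`-vn` to drop video, `-ac`/`-ar` for mono 16 kHz output) are standard ffmpeg options:

```python
import subprocess
from typing import List

def build_audio_extract_cmd(video_path: str, wav_path: str) -> List[str]:
    """Build an ffmpeg command that extracts mono 16 kHz audio."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",            # drop the video stream
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz sample rate (typical speech-model input)
        wav_path,
    ]

def extract_audio(video_path: str, wav_path: str) -> None:
    """Run ffmpeg to extract the audio track (requires ffmpeg on PATH)."""
    subprocess.run(build_audio_extract_cmd(video_path, wav_path), check=True)

cmd = build_audio_extract_cmd("talk.mp4", "talk.wav")
```

The resulting WAV file can then be fed to a transcription API, and the transcript included as extra text in the same multimodal prompt as the frames.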
Pitfall 3: Memory Issues
Loading a full video into memory can crash your process:
Bad: frames = [frame for frame in all_frames]
Better: Process in chunks, use generators
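The generator pattern keeps only a small window of frames in memory at once. A sketch using a synthetic frame source for illustration; in practice you would wrap a cv2.VideoCapture read loop in exactly the same way:

```python
import numpy as np
from typing import Iterable, Iterator, List

def frame_source(n_frames: int) -> Iterator[np.ndarray]:
    """Yield frames one at a time (stand-in for a cv2.VideoCapture loop)."""
    for _ in range(n_frames):
        yield np.zeros((480, 640, 3), dtype=np.uint8)

def chunked(frames: Iterable[np.ndarray], size: int) -> Iterator[List[np.ndarray]]:
    """Group a frame stream into fixed-size chunks without materializing it."""
    chunk = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Only `size` frames are ever held in memory at once
chunk_sizes = [len(c) for c in chunked(frame_source(250), size=100)]
# chunk_sizes == [100, 100, 50]
```

Because both `frame_source` and `chunked` are generators, a 2-hour video never exists in memory as a list; each chunk is processed and discarded before the next is read.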
Pitfall 4: Generation Consistency
Generated videos may have artifacts:
Watch for:
- Objects appearing/disappearing
- Physics violations
- Identity drift (faces changing)
- Temporal flicker
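Temporal flicker is often detectable with simple statistics: track mean frame brightness over time and flag sudden jumps that real footage rarely shows. A rough numpy sketch; the jump threshold is illustrative:

```python
import numpy as np
from typing import List

def flicker_frames(frames: List[np.ndarray],
                   jump_threshold: float = 15.0) -> List[int]:
    """Return indices where mean brightness jumps abruptly between frames."""
    brightness = [float(np.mean(f)) for f in frames]
    flagged = []
    for i in range(1, len(brightness)):
        if abs(brightness[i] - brightness[i - 1]) > jump_threshold:
            flagged.append(i)
    return flagged

# Synthetic clip: steady brightness with one flickering frame
clip = [np.full((8, 8), 100, np.uint8) for _ in range(10)]
clip[5] = np.full((8, 8), 180, np.uint8)   # sudden brightness spike
bad = flicker_frames(clip)
# bad == [5, 6]: the spike up at frame 5 and the drop back at frame 6
```

Object disappearance and identity drift are harder to catch automatically; those usually require tracking or face-embedding comparisons across frames rather than global statistics.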
Hands-On Exercises
Exercise 1: Build Video Captioner
Create a video captioning system:
- Extract frames at 1 fps
- Use GPT-4V to generate captions
- Combine into a coherent narrative
- Test on various video types
Exercise 2: Video Search Engine
Build semantic video search:
- Index video content using frame embeddings
- Implement text-to-video search
- Add timestamp-level retrieval
- Visualize results
Exercise 3: Video Summarizer
Create an automated video summarizer:
- Detect scene changes
- Summarize each scene
- Generate chapter markers
- Create a highlights reel (timestamps)
Exercise 4: Video Q&A Bot
Build an interactive video Q&A system:
- Load video once, cache frames
- Accept natural language questions
- Return relevant frames with answers
- Support follow-up questions
Deliverables
By the end of this module, you should have:
- Video frame extraction pipeline
- Video Q&A system with LLMs
- Video summarization tool
- DELIVERABLE: Video AI Toolkit
Success Criteria:
- Can extract and sample frames from any video
- Can answer questions about video content
- Can generate video summaries
- Works with multiple video formats
Further Reading
Papers
Section titled “Papers”- “Sora: Creating video from text” (OpenAI, 2024)
- “VideoLLM: Modeling Video Sequence with Large Language Models” (2023)
- “Stable Video Diffusion” (Stability AI, 2023)
- “ViViT: A Video Vision Transformer” (Google, 2021)
Documentation
- FFmpeg: Video processing Swiss Army knife
- OpenCV: Computer vision library
- MoviePy: Video editing in Python
- PySceneDetect: Scene detection
The History of Video AI: From GIFs to Sora
Understanding the evolution of video AI helps you appreciate why we’re at an inflection point—and what challenges remain unsolved.
The Pre-Deep Learning Era (1960s-2012)
Early video analysis was hand-crafted: optical flow algorithms tracked pixel movement, background subtraction detected motion, and Haar cascades identified objects frame-by-frame. These techniques powered early surveillance systems and basic video editing tools.
Did You Know? The Lucas-Kanade optical flow algorithm from 1981 is still used today—often as a preprocessing step before feeding frames to neural networks. Some of the best classical methods are 40+ years old and remain relevant because they’re fast and reliable for specific tasks.
The MPEG video compression standard (1993) was itself an early form of “video understanding”—it exploited temporal redundancy by recognizing that most frames are similar to their neighbors. In a sense, video compression algorithms were the first systems to learn that “time matters.”
The CNN Era (2012-2017)
AlexNet’s 2012 ImageNet victory ignited computer vision—but video lagged behind. Researchers tried two approaches:
1. Frame-level CNNs: Run image classification on each frame independently, then vote. Simple but ignores temporal structure.
2. Two-stream networks (2014): Process RGB frames in one CNN (appearance) and optical flow in another CNN (motion), then fuse. This architecture dominated video classification for years.
The major datasets that drove progress were UCF-101 (2012, 13,000 clips) and Sports-1M (2014, 1.1 million clips). These seem tiny now, but they established video classification as a tractable benchmark problem.
The 3D CNN and Transformer Era (2017-2022)
I3D (2017) from DeepMind showed that “inflating” 2D ImageNet-pretrained CNNs into 3D convolutions (across space and time) worked surprisingly well. The Kinetics dataset (400+ action classes, 300,000+ clips) became the new benchmark.
SlowFast Networks (2019) from Facebook introduced dual-pathway processing: a “slow” pathway for spatial semantics (few frames) and a “fast” pathway for motion (many frames). This matched intuitions from neuroscience about how the brain processes motion.
Video Transformers (2021) arrived with TimeSformer and ViViT, extending vision transformers to video. The key insight: attention can model long-range temporal dependencies that convolutions struggle with. A transformer can directly relate frame 1 to frame 100—no need to propagate information through intermediate frames.
The Generation Revolution (2022-Present)
Video generation followed image generation, but with a 1-2 year lag:
- Stable Diffusion (August 2022) → Stable Video Diffusion (November 2023)
- DALL-E 2 (April 2022) → Sora (February 2024)
The breakthrough came from treating video as “3D images”—extending diffusion models to denoise across space and time simultaneously. Temporal attention layers ensure consistency across frames.
Google’s Lumiere (January 2024) and OpenAI’s Sora (February 2024) showed that scaling up video diffusion models produces shockingly realistic results. The race is now on.
Production War Stories: Video AI in the Wild
The Streaming Service That Couldn’t Count Views
Los Angeles. March 2024. A major streaming platform deployed video AI to automatically detect “engagement moments”—scenes that kept viewers watching. The system identified key frames, tagged emotional beats, and predicted where viewers would skip.
Initial results were promising: engagement predictions correlated at 0.7 with actual viewing patterns. But something strange emerged—the model flagged car chase scenes as low engagement, even though human editors knew these were viewer favorites.
Investigation revealed the bug: car chases have high visual redundancy (similar frames in rapid succession). The model’s frame sampling missed the critical 2-second moments where crashes or near-misses happened. It was sampling 1 frame per 5 seconds—perfect for dialogue scenes, terrible for action.
The fix: They implemented adaptive sampling that detected motion intensity and increased frame rate for high-motion segments. Accuracy on action content jumped from 45% to 82%.
Lesson: Frame sampling strategy should match content type. There’s no universal “correct” sampling rate—you need to understand what you’re analyzing.
The Security System That Cried Wolf
Singapore. November 2023. A warehouse deployed video AI for theft detection. The system watched 200 cameras 24/7, flagging suspicious activity for human review. In the first week, it generated 15,000 alerts—overwhelming the 3-person security team.
The problem: the model was trained on internet video datasets featuring “stealing” actions. It had learned that “picking up objects” was suspicious. In a warehouse, workers pick up objects constantly. The false positive rate exceeded 99%.
The team tried threshold tuning, temporal filtering, and confidence calibration. Nothing worked—the model’s entire concept of “suspicious” was miscalibrated for the warehouse context.
The fix: They retrained on 500 hours of their own footage, annotated with actual theft incidents (only 12 in the dataset) versus normal operations. The new model learned that forklifts are normal, unmarked vehicles are suspicious, and nighttime activity in empty zones deserves attention. False positives dropped to under 5%.
Lesson: Video AI models learn what’s “normal” from their training data. If your context differs from the training distribution, you’ll need domain-specific fine-tuning—not just threshold adjustment.
The Video Editor That Broke Time
Austin. June 2024. A post-production studio used AI to upscale old footage from 480p to 4K. The results were beautiful—until they noticed something disturbing. In interview footage, the AI had interpolated new frames to smooth motion. But the interviewee’s lips now moved slightly out of sync with their words.
The model had learned “natural motion” from training data where audio wasn’t available. It optimized for visual smoothness without any concept of audio-visual synchronization. The result: professionally generated deepfake-like artifacts in legitimate content.
The fix: They switched to a model that jointly processed video and audio, maintaining lip-sync as a hard constraint. Processing time doubled, but the results were usable. For footage-only upscaling (no dialogue), they kept the faster model.
Lesson: Video is multimodal. Any system that ignores audio will eventually create audio-visual inconsistencies. For content where sync matters, you need audio-aware processing.
Common Mistakes in Video AI Systems
Mistake 1: Treating Video as Independent Images
```python
# WRONG - Process frames independently
def analyze_video(video_path):
    results = []
    for frame in extract_frames(video_path):
        result = image_classifier(frame)  # No temporal context!
        results.append(result)
    return results

# RIGHT - Maintain temporal context
def analyze_video_with_context(video_path, window_size=5):
    frames = extract_frames(video_path)
    results = []

    for i, frame in enumerate(frames):
        # Include surrounding frames as context
        start = max(0, i - window_size // 2)
        end = min(len(frames), i + window_size // 2 + 1)
        context_frames = frames[start:end]

        result = video_classifier(context_frames, query_frame=i - start)
        results.append(result)

    return results
```

Consequence: Without temporal context, you can’t recognize actions—only objects. “Person” vs “person running” vs “person falling” require understanding motion across frames.
Mistake 2: Fixed Frame Rates for All Content
```python
import cv2

# WRONG - Same sampling for everything
def extract_frames_fixed(video_path):
    return extract_frames(video_path, fps=1)  # 1 frame per second

# RIGHT - Adaptive sampling based on content
def extract_frames_adaptive(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    prev_frame = None
    frame_idx = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        if prev_frame is not None:
            # Calculate frame difference
            diff = cv2.absdiff(frame, prev_frame).mean()

            # High difference = motion = sample more
            if diff > 30:
                frames.append(frame)
            elif diff > 10 and frame_idx % 5 == 0:
                frames.append(frame)
            elif frame_idx % 30 == 0:  # Minimum sampling for static scenes
                frames.append(frame)

        prev_frame = frame
        frame_idx += 1

    cap.release()
    return frames
```

Consequence: Fixed sampling misses crucial moments in dynamic content and wastes compute on static scenes.
Mistake 3: Ignoring Video Duration in API Costs
```python
# Assumes an OpenAI-style `client` plus the frame helpers defined elsewhere

# WRONG - Send all frames to vision API
def expensive_video_analysis(video_path):
    frames = extract_frames(video_path, fps=30)  # 30 fps!

    for frame in frames:
        # 10-second video = 300 API calls!
        response = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": encode_frame(frame)}}
            ]}],
        )
    # $0.01 per image × 300 = $3 per 10-second video

# RIGHT - Intelligent sampling and batching
def efficient_video_analysis(video_path):
    frames = extract_frames(video_path, fps=1)  # 1 fps
    sampled = sample_frames_uniform(frames, n_samples=10)  # 10 frames max

    # Batch into single request
    content = [{"type": "text", "text": "Analyze these video frames:"}]
    for frame in sampled:
        content.append({"type": "image_url", "image_url": {"url": encode_frame(frame)}})

    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": content}],
    )
    # ~10 images × $0.01 = $0.10 per video
    return response
```

Consequence: Naive video processing can cost 30x more than necessary. Always sample intelligently and batch when possible.
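The cost arithmetic in those comments generalizes to a small helper. The $0.01-per-image price is the illustrative figure used above, not a published rate:

```python
def estimate_vision_api_cost(duration_sec, sampling_fps, price_per_image=0.01):
    """Rough API cost for frame-by-frame video analysis.

    price_per_image is an assumed flat rate; real pricing varies
    with resolution and token counts.
    """
    n_frames = int(duration_sec * sampling_fps)
    return n_frames * price_per_image

naive = estimate_vision_api_cost(10, sampling_fps=30)  # 300 frames -> $3.00
smart = estimate_vision_api_cost(10, sampling_fps=1)   # 10 frames  -> $0.10
print(f"naive: ${naive:.2f}, sampled: ${smart:.2f}")
```

Running this kind of estimate before wiring up an API call is the cheapest bug fix you will ever ship.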
Interview Prep: Video AI
Common Questions and Strong Answers
Q: “How would you build a system to detect highlights in sports broadcasts?”
Strong Answer: “Highlight detection requires understanding what makes a moment ‘exciting’—and that varies by sport. My approach would be multimodal and hierarchical.
First, I’d use audio analysis to detect crowd noise spikes. These correlate strongly with highlights across all sports—goals, touchdowns, knockouts all produce distinctive crowd reactions. This is cheap to compute and catches 80%+ of highlights.
Second, for the visual track, I’d detect scene changes and camera movements. Highlights often trigger replays (same action, different angle) and close-ups. A sudden switch from wide shot to close-up following a crowd noise spike is almost certainly a highlight.
Third, for sport-specific detection, I’d fine-tune a video classifier on labeled highlights. Soccer goals look different from basketball dunks, which look different from tennis aces. The sport-specific model adds precision.
Finally, I’d combine signals with learned weights. A highlight needs at least two of: audio spike, visual scene change, sport-specific action detection. This reduces false positives while maintaining recall.
The system should output timestamp ranges with confidence scores, enabling downstream applications to choose their sensitivity threshold.”
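A sketch of that final fusion step, assuming each of the three detectors emits a 0-1 score per segment. The weights, firing threshold, and minimum-signal rule are placeholders you would tune on labeled highlights:

```python
def score_highlight(audio_spike, scene_change, action_score,
                    weights=(0.4, 0.3, 0.3), min_signals=2, threshold=0.5):
    """Combine per-segment detector scores into a highlight decision.

    Requires at least `min_signals` detectors to fire (score > 0.5)
    before the weighted confidence is considered at all.
    Returns (confidence, is_highlight).
    """
    signals = [audio_spike, scene_change, action_score]
    fired = sum(s > 0.5 for s in signals)
    if fired < min_signals:
        return 0.0, False  # single-signal spikes are treated as noise
    confidence = sum(w * s for w, s in zip(weights, signals))
    return confidence, confidence >= threshold

# Crowd roar + replay cut, but weak sport-specific score: still a highlight
conf, is_highlight = score_highlight(0.9, 0.8, 0.2)
print(conf, is_highlight)
```

Emitting the raw confidence alongside the boolean lets downstream consumers pick their own sensitivity threshold, as the answer suggests.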
Q: “Explain the key technical challenges in generating long-form video (>10 seconds).”
Strong Answer: “Long-form video generation faces three fundamental challenges that don’t exist in image generation.
First, temporal consistency. Over 10+ seconds, subjects need to maintain identity—same face, same clothing, same physics. Current models struggle with this because they don’t have explicit object tracking. A person might subtly morph across frames, or objects might drift. Sora handles this better than competitors, likely through longer temporal attention windows, but it’s still not solved.
Second, narrative coherence. A 60-second video should tell a story with beginning, middle, and end. Current models generate ‘moments’ rather than ‘narratives.’ They can show a woman walking through Tokyo, but struggle with ‘a woman walks through Tokyo, enters a shop, buys something, and leaves.’ Causal relationships between scenes require world modeling that pure diffusion doesn’t provide.
Third, computational cost. Video generation is roughly O(n²) in frame count due to temporal attention. A 4-second video might take 30 seconds; a 60-second video might take 15 minutes—and require 10x more GPU memory. The scaling problem means current models are economically viable only for short clips.
The research frontier includes hierarchical generation (coarse-to-fine), autoregressive approaches (generate frame-by-frame with past context), and world models that plan before generating. I’d expect 60+ second coherent generation to be solved within 2-3 years.”
Q: “How would you detect deepfake videos in production?”
Strong Answer: “Deepfake detection is an adversarial problem—as detection improves, generation adapts. Any production system needs defense in depth.
First layer: metadata analysis. Real video has provenance—EXIF data, compression artifacts consistent with specific cameras/apps, consistent timestamps. Deepfakes often strip or fabricate metadata. This catches amateur fakes.
Second layer: temporal inconsistencies. Deepfakes often have subtle artifacts—blinking patterns that don’t match human norms, earlobes that flicker, hairlines that shift. I’d use a video classifier trained specifically on known deepfakes and their artifacts.
Third layer: audio-visual sync. Face-swapped videos often have subtle lip-sync mismatches. A multimodal model that understands speech and lip movement can detect when they diverge.
Fourth layer: source verification. The strongest defense is proving where video came from—C2PA standards, cryptographic signing at capture time, chain of custody. This doesn’t detect fakes; it verifies authenticity.
In production, I’d ensemble these approaches with confidence scores. No single method is reliable against sophisticated fakes, but the combination is much harder to defeat. And I’d plan for model updates—today’s state-of-the-art detector will be obsolete against tomorrow’s generators.”
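One way to sketch that ensemble, assuming each detection layer reports a 0-1 suspicion score and verified provenance short-circuits everything. The layer weights are placeholders, not calibrated values:

```python
def deepfake_risk(metadata_flag, temporal_score, sync_score, provenance_ok,
                  weights=(0.2, 0.4, 0.4)):
    """Defense-in-depth risk score for the four layers described above.

    provenance_ok=True means the video carries a verified capture
    signature (e.g. C2PA), so it is treated as authentic outright.
    Detector scores are 0-1; higher means more suspicious.
    """
    if provenance_ok:
        return 0.0
    scores = [float(metadata_flag), temporal_score, sync_score]
    return sum(w * s for w, s in zip(weights, scores))

# Stripped metadata, odd blinking, poor lip-sync, no provenance
print(deepfake_risk(True, 0.7, 0.9, provenance_ok=False))
```

Keeping the layers as separate inputs (rather than one opaque score) also makes the planned model updates easier: you can retrain one detector without touching the others.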
The Economics of Video AI
Cost Comparison
| Service | Video Understanding | Video Generation | Notes |
|---|---|---|---|
| gpt-5 (Vision) | ~$0.10/minute of video* | N/A | *At 10 frames/min |
| Google Gemini 1.5 | ~$0.05/minute of video | N/A | Native video input |
| Claude 3 Opus | ~$0.15/minute of video* | N/A | *At 10 frames/min |
| Runway Gen-3 | N/A | $0.05/second | ~$3/minute of output |
| Pika | N/A | ~$0.50/5-second clip | Subscription model |
| Local (faster-whisper) | ~$0.002/minute | N/A | GPU amortized |
| Local (SVD) | N/A | ~$0.01/second | GPU amortized |
Break-Even Analysis: Build vs Buy
When should you run video AI locally vs use APIs?
Video Understanding:
- API cost (gpt-5): ~$0.10 per minute of video analyzed
- Local cost (GPU): ~$0.01 per minute (amortized A10)
- Break-even: ~1,000 minutes/month
- Below 1,000 min: Use API (simpler)
- Above 1,000 min: Consider local (10x cheaper)
- Above 10,000 min: Definitely local (saves $900+/month)

Video Generation:
- API cost (Runway): $3 per minute generated
- Local cost (GPU): ~$0.50 per minute (amortized A100)
- Break-even: ~100 minutes/month
- Below 100 min: Use API (quality, speed)
- Above 100 min: Consider local (6x cheaper)
- Above 1,000 min: Hybrid approach

Did You Know? Netflix spends over $150 million annually on video encoding and analysis infrastructure. At their scale (200+ million subscribers, billions of hours watched), even a 1% efficiency improvement saves millions. Most of their video AI runs on custom hardware optimized for their specific workloads.
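The break-even math is easy to recompute for your own volumes. This helper models local inference as a fixed monthly GPU cost plus a per-minute rate; the ~$90/month A10 figure is an assumption chosen to reproduce the ~1,000-minute figure above, not a quoted price:

```python
def break_even_minutes(api_cost_per_min, local_cost_per_min, gpu_fixed_monthly):
    """Monthly minutes at which running locally becomes cheaper than the API.

    Below the returned figure, the API's zero fixed cost wins; above it,
    the per-minute saving pays off the fixed GPU spend.
    """
    saving_per_min = api_cost_per_min - local_cost_per_min
    if saving_per_min <= 0:
        return float("inf")  # local never wins on cost
    return gpu_fixed_monthly / saving_per_min

# Video understanding: API $0.10/min vs local $0.01/min + ~$90/month GPU
print(break_even_minutes(0.10, 0.01, gpu_fixed_monthly=90))  # ~1000 minutes
```

Plugging in your own GPU amortization and actual API rates is the whole exercise; the structure of the calculation stays the same.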
Hidden Costs
- Storage: Raw video is huge. 1 hour of 4K video = ~100GB. Plan for storage costs.
- Bandwidth: Sending video to cloud APIs costs egress fees. Consider on-device preprocessing.
- Latency: Real-time video AI requires low-latency inference. Edge deployment often necessary.
- Compliance: Video of people faces GDPR/CCPA constraints. Detection/blurring adds cost.
The Future of Video AI
The Hollywood Disruption
Video generation is coming for Hollywood—not to replace it, but to democratize it. Consider the economics: a typical commercial costs $300,000-$500,000 to produce, while a Sora-quality generation will cost under $100 in API fees. A cost reduction of three to four orders of magnitude will fundamentally change who can create professional video content.
Did You Know? In March 2024, Tyler Perry paused an $800 million studio expansion after seeing Sora demos. “It makes me worry about all the people who work in those industries,” he said. The first major entertainment executive to publicly acknowledge AI’s impact on studio economics.
This isn’t hypothetical. Runway is already being used for commercials, music videos, and social media content. The first AI-generated Super Bowl commercial is likely 2-3 years away. By 2030, most short-form video content will involve AI generation or assistance.
Real-Time Video Generation
Today’s video generation is slow—30 seconds to several minutes per second of output. But progress is rapid. Google’s VideoPoet and Meta’s Make-A-Video show generation times dropping by 2-4x per year.
The endgame is real-time video generation: describe what you want to see, and it appears instantly. This enables:
- Interactive storytelling: Choose-your-own-adventure videos that generate on the fly
- Personalized advertising: Ads customized to each viewer’s preferences
- Live game rendering: Video games that generate environments in real-time
- Virtual production: Actors performing against AI-generated backgrounds
Real-time generation requires 100-1000x speedup from today. Hardware improvements (NPUs, tensor cores) combined with algorithmic advances (distillation, caching) will get us there within 5-7 years.
Multimodal Convergence
The future isn’t “video AI”—it’s “reality AI”: systems that understand and generate video, audio, text, and physical interactions simultaneously.
Google’s Gemini 1.5 can process hour-long videos natively. OpenAI’s gpt-5 handles audio-video-text in a single model. The next generation will add:
- 3D understanding: Reconstruct scenes from video, navigate them in VR
- Physical reasoning: Predict what happens when objects interact
- Embodied action: Generate video that shows how to perform tasks
This convergence means the distinction between “vision model” and “language model” and “video model” will disappear. There will just be “AI”—and it will understand the world through all modalities simultaneously.
What Developers Should Do Now
If you’re building video AI systems today:
- Design for model swapping: Today’s best model won’t be tomorrow’s. Abstract your video generation and understanding providers.
- Plan for real-time: Even if your current application is batch, architect for streaming. Real-time will become the default expectation.
- Think multimodal: Video without audio is incomplete. Text without video is limited. Build systems that integrate all modalities.
- Consider edge deployment: Video data is huge and expensive to move. Processing video on-device or at the edge will become increasingly important.
- Build safety in from the start: Deepfakes, copyright issues, and content moderation will only get harder. Watermarking, provenance tracking, and detection should be part of your architecture, not afterthoughts.
The video AI stack you build today should be ready for a world where AI can generate—and understand—any video imaginable. That world is arriving faster than most people expect.
Key Takeaways
After working through this module, here’s what you should remember:
- Video is fundamentally harder than images. The temporal dimension adds complexity that can’t be solved by just processing more frames. You need architectures that understand time—and that’s expensive.
- Frame sampling is an art, not a science. Too few frames and you miss critical moments. Too many and you blow your budget. The right strategy depends on your use case: action recognition needs dense sampling, while video summarization can use sparse keyframes.
- Vision LLMs unlock video understanding without video-specific training. GPT-4V wasn’t built as a video model, yet by feeding it sequences of frames you can build powerful video Q&A, captioning, and summarization systems; Gemini goes further with native video input. This is the pragmatic approach for most production applications.
- Video generation is the AI arms race of 2024-2025. Sora showed what’s possible, and now Runway, Pika, Luma, and others are racing to commercialize it. The business implications—from Hollywood to TikTok—are enormous.
- The deepfake problem is real. Every advance in video generation is also an advance in deception technology. Detection, watermarking, and provenance tracking are becoming as important as generation itself. The arms race between generation and detection will intensify—plan for it.
- Video generation is becoming economically viable. At $3/minute for Runway and dropping, AI-generated video is already cheaper than traditional production for many use cases. The cost curve will continue falling. Within five years, most short-form content will involve AI generation.
- Think multimodal. Video without audio understanding is incomplete. The best video AI systems process all modalities together—visual, audio, text. Standalone video analysis will increasingly be a legacy pattern.
Summary
Video AI represents the convergence of all multimodal capabilities. Understanding video requires temporal reasoning, while generating video requires maintaining consistency across time.
Quick Reference:
- Frame sampling is critical - balance coverage with compute cost
- Vision LLMs (GPT-4V, Gemini) can understand video through frame sequences
- Video generation (Sora, Runway) uses diffusion models with temporal attention
- Production video AI requires chunking, caching, and optimization
- Audio matters - don’t ignore the soundtrack!
Phase 5 Complete: You’ve now mastered multimodal AI across:
- Speech (Module 22): STT, TTS, voice assistants
- Vision (Module 23): CLIP, VLMs, document understanding
- Video (Module 24): Understanding, generation, analysis
What’s Next: Phase 6 - Deep Learning Foundations. Time to understand how these models work under the hood!
Last updated: 2025-11-26 Next: Phase 6 - Deep Learning Foundations