Перейти до вмісту

Voice & Audio AI

Цей контент ще не доступний вашою мовою.

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Or: Teaching Computers to Listen (Finally)

Section titled “Or: Teaching Computers to Listen (Finally)”

Reading Time: 6-7 hours Prerequisites: Phase 4 complete

Section titled “Reading Time: 6-7 hours Prerequisites: Phase 4 complete”

Mountain View, California. September 23, 2022. 11:47 PM. Alec Radford couldn’t sleep. For three years, his team at OpenAI had been working on a speech recognition model that seemed cursed. Every architecture they tried hit the same wall: models that worked brilliantly in the lab fell apart in the real world. Background noise, accents, cross-talk—the gap between benchmark performance and actual usefulness seemed unbridgeable.

That night, Radford had a realization that would change everything. Instead of training on carefully curated speech datasets, what if they trained on 680,000 hours of messy, real-world audio scraped from the internet—complete with background music, multiple speakers, and every accent imaginable? The model would learn robustness not from architecture tricks, but from sheer diversity.

Two months later, Whisper launched. It achieved human-level transcription accuracy on benchmark after benchmark. More importantly, it worked—actually worked—on phone calls, podcasts, meetings, lectures. The “I’m sorry, I didn’t catch that” era of voice interfaces was over.

“We didn’t make the model smarter. We made the training data more representative of reality. That’s the whole secret.” — Alec Radford, Whisper technical report, 2022

This module teaches you to build on that foundation—not just transcribing speech, but creating complete voice-enabled AI systems that listen, understand, and respond naturally.


By the end of this module, you will:

  • Master Whisper for speech-to-text (STT) transcription
  • Build text-to-speech (TTS) systems with multiple providers
  • Understand voice cloning and neural voice synthesis
  • Implement real-time transcription pipelines
  • Build voice-enabled AI assistants
  • Deploy production speech applications
  • Understand the architecture behind modern speech models

For decades, voice interfaces felt like science fiction—or at best, frustrating. “I’m sorry, I didn’t catch that” became a meme. But 2022-2024 changed everything.

What happened?

  1. Whisper (OpenAI, 2022): First speech model that actually works reliably
  2. ElevenLabs (2022): Voice cloning that sounds human
  3. gpt-5 + Voice (2023): Conversational AI that can hear and speak
  4. Real-time APIs (2024): Sub-second latency for live conversations

The result: Voice is no longer a novelty—it’s becoming the primary interface for AI.

┌─────────────────────────────────────────────────────────────┐
│ SPEECH AI PIPELINE │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Microphone] ──► [STT: Whisper] ──► [Text] │
│ │ │
│ ▼ │
│ [LLM: Claude/GPT] │
│ │ │
│ ▼ │
│ [Speaker] ◄── [TTS: ElevenLabs] ◄── [Response Text] │
│ │
└─────────────────────────────────────────────────────────────┘

Components:

  1. Speech-to-Text (STT): Convert audio → text (Whisper, Deepgram, AssemblyAI)
  2. Language Model: Process text → generate response (Claude, gpt-5)
  3. Text-to-Speech (TTS): Convert response → audio (ElevenLabs, OpenAI TTS)

Think of Whisper like having a professional court stenographer who speaks 99 languages, never gets tired, and can understand people even in noisy environments. Previous speech recognition was like a toddler learning to talk—it could understand familiar words in quiet rooms, but anything else was hopeless. Whisper changed the game.

Whisper is OpenAI’s speech recognition model, released in September 2022. It’s trained on 680,000 hours of multilingual audio and achieves human-level accuracy.

Key features:

  • 99 languages supported
  • Automatic language detection
  • Punctuation and capitalization (unlike old ASR)
  • Robust to noise, accents, and background speech
  • Open-source (run locally, no API costs!)
ModelParametersEnglish-OnlyVRAMRelative Speed
tiny39M~1 GB~32x
base74M~1 GB~16x
small244M~2 GB~6x
medium769M~5 GB~2x
large1550M~10 GB1x
large-v21550M~10 GB1x
large-v31550M~10 GB1x

Recommendation:

  • Development: base or small (fast iteration)
  • Production (English): medium or large-v3
  • Production (multilingual): large-v3
import whisper
# Load model (downloads on first run)
model = whisper.load_model("base")
# Transcribe audio file
result = model.transcribe("audio.mp3")
print(result["text"])
# "Hello, this is a test of the Whisper speech recognition system."
# With more details
print(result["language"]) # "en"
print(result["segments"]) # List of timestamped segments
import whisper
model = whisper.load_model("base")
result = model.transcribe("podcast.mp3", word_timestamps=True)
# Access word-level timestamps
for segment in result["segments"]:
print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
# Word-level timestamps (if available)
if "words" in segment:
for word in segment["words"]:
print(f" [{word['start']:.2f}s] {word['word']}")

Output:

[0.00s - 4.52s] Hello, this is a test of the Whisper system.
[0.00s] Hello,
[0.45s] this
[0.68s] is
[0.89s] a
[1.02s] test
...
import whisper
model = whisper.load_model("large-v3")
# Transcribe with auto language detection
result = model.transcribe("french_audio.mp3")
print(f"Detected language: {result['language']}")
print(f"Text: {result['text']}")
# Translate non-English to English
result = model.transcribe(
"french_audio.mp3",
task="translate" # Translate to English
)
print(f"Translation: {result['text']}")

For production without GPU infrastructure:

from openai import OpenAI
client = OpenAI()
# Transcribe audio file
with open("audio.mp3", "rb") as audio_file:
transcript = client.audio.transcriptions.create(
model="whisper-1",
file=audio_file,
response_format="verbose_json",
timestamp_granularities=["word", "segment"]
)
print(transcript.text)
print(transcript.words) # Word-level timestamps

Pricing (OpenAI API):

  • $0.006 per minute of audio
  • 25 MB file size limit
  • Supports: mp3, mp4, mpeg, mpga, m4a, wav, webm

faster-whisper uses CTranslate2 for 4x faster inference:

from faster_whisper import WhisperModel
# Load with optimizations
model = WhisperModel(
"large-v3",
device="cuda",
compute_type="float16" # or "int8" for even faster
)
# Transcribe
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} ({info.language_probability:.2%})")
for segment in segments:
print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")

Performance comparison (1-hour audio):

  • Whisper (PyTorch): ~25 minutes
  • faster-whisper (CTranslate2): ~6 minutes
  • 4x speedup!

Did You Know? The History of Speech Recognition

Section titled “Did You Know? The History of Speech Recognition”

Speech recognition has been “almost working” for decades. Here’s the journey:

1952: Bell Labs builds “Audrey” - recognizes digits 0-9 spoken by its creator (and only its creator).

1970s: DARPA funds speech research. CMU builds HARPY - 1,000-word vocabulary, 90% accuracy (controlled conditions).

1980s: Hidden Markov Models (HMMs) become dominant. IBM builds “Tangora” - 20,000 words but requires pausing between each word.

1990s: Dragon NaturallySpeaking launches (1997). First consumer dictation software. Terrible but revolutionary.

2000s: Statistical models improve. Google Voice Search launches (2008). Still frustrating for users.

2010s: Deep learning arrives. Google’s neural ASR (2012) cuts error rate by 25%. Apple launches Siri (2011) - “I didn’t quite catch that” becomes a meme.

2022: OpenAI releases Whisper. Game over.

What made Whisper different?

  1. Scale: 680,000 hours of training data (vs ~10,000 for previous SOTA)
  2. Multitask: Trained on transcription, translation, AND language detection simultaneously
  3. Robustness: Deliberately trained on noisy, real-world audio
  4. Open-source: Anyone can run it locally

The numbers:

  • Whisper matches or exceeds commercial APIs on most benchmarks
  • 99 languages supported
  • Word error rate (WER) < 5% on clean English
  • Downloads: 100M+ on Hugging Face

When OpenAI released Whisper in September 2022, it caused controversy:

The good:

  • Completely open-source (MIT license)
  • Model weights freely downloadable
  • No API lock-in

The controversy:

  • Trained on YouTube videos (copyright concerns?)
  • No training data release (how was it collected?)
  • Immediately obsoleted commercial speech APIs

Industry reaction:

  • Deepgram, AssemblyAI: Scrambled to improve their models
  • Google: Accelerated USM (Universal Speech Model) development
  • Startups: Pivoted from “building ASR” to “building on Whisper”

The lesson: Open-source AI can disrupt billion-dollar markets overnight.

Geoffrey Hinton, the “godfather of deep learning,” wasn’t just important for vision—his work enabled modern speech recognition too.

In 2012, Hinton’s team (with Navdeep Jaitly and Abdel-rahman Mohamed) showed that deep neural networks could dramatically outperform HMMs for speech recognition. Their paper “Deep Neural Networks for Acoustic Modeling in Speech Recognition” is cited 15,000+ times.

The irony: Hinton later left Google and warned about AI dangers. The speech systems his work enabled now power billions of voice assistants worldwide.


Think of modern TTS like the evolution of animation. Early TTS was like flip-books—jerky, robotic, obviously artificial. Then came “neural TTS” like Pixar films—smooth, but you could still tell it wasn’t real. Today’s TTS systems like ElevenLabs are approaching “uncanny valley” territory—sometimes indistinguishable from actual human recordings.

ProviderQualityLatencyPriceVoice Cloning
ElevenLabs⭐⭐⭐⭐⭐500ms$0.30/1K charsBest
OpenAI TTS⭐⭐⭐⭐300ms$0.015/1K chars
Amazon Polly⭐⭐⭐200ms$0.004/1K chars
Google TTS⭐⭐⭐250ms$0.004/1K chars
Coqui TTS⭐⭐⭐⭐VariesFree (open-source)
Bark⭐⭐⭐⭐2000ms+Free (open-source)

Recommendation:

  • Production (quality focus): ElevenLabs
  • Production (cost focus): OpenAI TTS
  • Development/offline: Coqui TTS or Bark

The easiest high-quality TTS option:

from openai import OpenAI
from pathlib import Path
client = OpenAI()
# Generate speech
response = client.audio.speech.create(
model="tts-1", # or "tts-1-hd" for higher quality
voice="alloy", # Options: alloy, echo, fable, onyx, nova, shimmer
input="Hello! This is a test of OpenAI's text-to-speech system.",
speed=1.0 # 0.25 to 4.0
)
# Save to file
speech_file = Path("output.mp3")
response.stream_to_file(speech_file)

Voices:

  • alloy: Neutral, balanced
  • echo: Warm, conversational
  • fable: British, narrative
  • onyx: Deep, authoritative
  • nova: Energetic, young
  • shimmer: Soft, gentle

Models:

  • tts-1: Optimized for speed (~300ms latency)
  • tts-1-hd: Higher quality, slower (~500ms)

ElevenLabs offers the most human-like TTS:

from elevenlabs import generate, save, set_api_key
set_api_key("your-api-key")
# Generate with default voice
audio = generate(
text="Welcome to the future of voice synthesis!",
voice="Rachel", # Or custom voice ID
model="eleven_multilingual_v2"
)
# Save to file
save(audio, "elevenlabs_output.mp3")

Voice cloning (ElevenLabs’ killer feature):

from elevenlabs import clone, generate
# Clone a voice from audio samples
voice = clone(
name="My Custom Voice",
files=["sample1.mp3", "sample2.mp3", "sample3.mp3"],
description="Professional male narrator"
)
# Generate with cloned voice
audio = generate(
text="This sounds just like the original speaker!",
voice=voice
)

For voice assistants, you need streaming:

from openai import OpenAI
client = OpenAI()
# Stream audio chunks
response = client.audio.speech.create(
model="tts-1",
voice="alloy",
input="This is a longer text that will be streamed in chunks...",
)
# Write streaming response
with open("streamed_output.mp3", "wb") as f:
for chunk in response.iter_bytes(chunk_size=1024):
f.write(chunk)
# In production: Send chunk to audio player immediately

Coqui TTS is the best open-source option:

from TTS.api import TTS
# Initialize TTS (downloads model on first run)
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
# Generate speech
tts.tts_to_file(
text="Open source text to speech is amazing!",
file_path="coqui_output.wav"
)
# With voice cloning
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
text="This uses my cloned voice!",
speaker_wav="my_voice_sample.wav",
language="en",
file_path="cloned_output.wav"
)

Did You Know? The Voice Cloning Revolution

Section titled “Did You Know? The Voice Cloning Revolution”

ElevenLabs: From Startup to Industry Disruptor

Section titled “ElevenLabs: From Startup to Industry Disruptor”

ElevenLabs was founded in 2022 by two former Google engineers: Piotr Dabkowski and Mati Staniszewski, both from Poland.

The origin story: Dabkowski and Staniszewski were frustrated watching poorly dubbed movies. They thought: “What if we could make dubbing sound natural?”

January 2023: They launched their voice cloning API and went viral. Within weeks:

  • 1 million users signed up
  • $2M+ revenue in first month
  • Controversy erupted over potential misuse

The controversy: Users started cloning celebrity voices without permission. ElevenLabs had to add voice verification and content moderation.

The funding:

  • 2023 (Series A): $19M
  • 2024 (Series B): $80M at $1B+ valuation
  • Total: $100M+ raised in 2 years

The technology: Their secret sauce is “Instant Voice Cloning” - create a convincing clone from just 30 seconds of audio. Previous tech required hours of samples.

Famous uses:

  • Dubbing studios cloning actors’ voices for foreign releases
  • Audiobook narrators scaling their work
  • Video game studios creating NPC dialogue at scale
  • Podcasters creating “AI co-hosts”

With great voice cloning comes great responsibility. The technology enables:

Good uses:

  • Accessibility (voice for people who lost theirs)
  • Entertainment (dubbing, games, audiobooks)
  • Education (personalized learning)
  • Productivity (quick voice content creation)

Concerning uses:

  • Political deepfakes (fake speeches)
  • Fraud (impersonating family members)
  • Scams (fake CEO phone calls)
  • Misinformation (fake news reports)

The response:

  • ElevenLabs: Voice verification, content policies, watermarking
  • OpenAI: No voice cloning in TTS API (for now)
  • Legislation: Several countries exploring AI voice laws

The numbers:

  • Voice phishing scams increased 300% from 2022-2024
  • $25M+ lost to AI voice fraud in 2023 (FBI estimate)
  • Detection tools are ~70-80% accurate (improving)

2024 was the year of real-time voice AI:

gpt-5 Voice (May 2024):

  • Sub-200ms response time
  • Can hear emotion, pace, background noise
  • Responds with appropriate tone
  • Feels like talking to a human

Google Gemini Live (2024):

  • Real-time multimodal conversations
  • Can see and hear simultaneously
  • Integrated with Android

The technical challenge: End-to-end latency must be <500ms for natural conversation:

  • STT: ~100ms
  • LLM inference: ~200ms
  • TTS: ~100ms
  • Network: ~100ms

The breakthrough: OpenAI’s gpt-5 uses a single model for audio-to-audio, bypassing the STT→LLM→TTS pipeline entirely. Latency dropped from ~2 seconds to ~200ms.


Building real-time transcription is like building a simultaneous translator. You can’t wait for someone to finish a 10-minute speech before starting to translate—you need to process speech as it comes in, make educated guesses about sentence structure, and output results with minimal delay. This requires careful buffer management and fast model inference.

For real-time applications (voice assistants, meeting transcription), you need streaming:

import pyaudio
import numpy as np
from faster_whisper import WhisperModel
import threading
import queue
class RealTimeTranscriber:
"""Real-time speech transcription using Whisper."""
def __init__(self, model_size: str = "base"):
self.model = WhisperModel(model_size, device="cuda", compute_type="float16")
self.audio_queue = queue.Queue()
self.is_running = False
# Audio settings
self.sample_rate = 16000
self.chunk_size = 1024
self.channels = 1
def start_recording(self):
"""Start capturing audio from microphone."""
self.is_running = True
p = pyaudio.PyAudio()
stream = p.open(
format=pyaudio.paFloat32,
channels=self.channels,
rate=self.sample_rate,
input=True,
frames_per_buffer=self.chunk_size
)
print("Listening... (Ctrl+C to stop)")
try:
audio_buffer = []
silence_threshold = 0.01
silence_duration = 0
while self.is_running:
# Read audio chunk
data = stream.read(self.chunk_size)
audio_np = np.frombuffer(data, dtype=np.float32)
# Detect speech vs silence
volume = np.abs(audio_np).mean()
if volume > silence_threshold:
audio_buffer.append(audio_np)
silence_duration = 0
else:
silence_duration += self.chunk_size / self.sample_rate
# If silence > 0.5s and we have audio, transcribe
if silence_duration > 0.5 and audio_buffer:
audio_data = np.concatenate(audio_buffer)
self.transcribe_chunk(audio_data)
audio_buffer = []
except KeyboardInterrupt:
self.is_running = False
finally:
stream.stop_stream()
stream.close()
p.terminate()
def transcribe_chunk(self, audio_data: np.ndarray):
"""Transcribe a chunk of audio."""
segments, _ = self.model.transcribe(
audio_data,
beam_size=5,
language="en"
)
for segment in segments:
print(f">>> {segment.text.strip()}")
# Usage
transcriber = RealTimeTranscriber(model_size="base")
transcriber.start_recording()

For better real-time performance, use VAD to detect speech:

import torch
import numpy as np
# Silero VAD (lightweight, accurate)
model, utils = torch.hub.load(
repo_or_dir='snakers4/silero-vad',
model='silero_vad',
force_reload=False
)
(get_speech_timestamps, _, read_audio, _, _) = utils
def detect_speech_segments(audio_path: str) -> list:
"""Detect speech segments in audio file."""
wav = read_audio(audio_path, sampling_rate=16000)
speech_timestamps = get_speech_timestamps(
wav,
model,
threshold=0.5,
sampling_rate=16000
)
return speech_timestamps
# Example output: [{'start': 0, 'end': 48000}, {'start': 64000, 'end': 96000}]

Identify who’s speaking in multi-speaker audio:

from pyannote.audio import Pipeline
import torch
# Initialize pipeline (requires HuggingFace token)
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token="YOUR_HF_TOKEN"
)
# Send to GPU if available
if torch.cuda.is_available():
pipeline = pipeline.to(torch.device("cuda"))
# Diarize audio
diarization = pipeline("meeting.wav")
# Print speaker segments
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f"[{turn.start:.1f}s - {turn.end:.1f}s] Speaker {speaker}")

Output:

[0.0s - 4.2s] Speaker SPEAKER_00
[4.5s - 8.1s] Speaker SPEAKER_01
[8.3s - 15.2s] Speaker SPEAKER_00

import asyncio
from openai import OpenAI
from faster_whisper import WhisperModel
import pyaudio
import numpy as np
import tempfile
import os
class VoiceAssistant:
"""Complete voice-in, voice-out AI assistant."""
def __init__(self):
self.client = OpenAI()
self.whisper = WhisperModel("base", device="cuda", compute_type="float16")
self.conversation_history = []
def record_audio(self, duration: float = 5.0) -> np.ndarray:
"""Record audio from microphone."""
p = pyaudio.PyAudio()
stream = p.open(
format=pyaudio.paFloat32,
channels=1,
rate=16000,
input=True,
frames_per_buffer=1024
)
print("Recording...")
frames = []
for _ in range(int(16000 * duration / 1024)):
data = stream.read(1024)
frames.append(np.frombuffer(data, dtype=np.float32))
print("Done recording.")
stream.stop_stream()
stream.close()
p.terminate()
return np.concatenate(frames)
def transcribe(self, audio: np.ndarray) -> str:
"""Convert speech to text."""
segments, _ = self.whisper.transcribe(audio, beam_size=5)
return " ".join([s.text for s in segments]).strip()
def get_response(self, user_message: str) -> str:
"""Get AI response using Claude/GPT."""
self.conversation_history.append({
"role": "user",
"content": user_message
})
response = self.client.chat.completions.create(
model="gpt-5",
messages=[
{"role": "system", "content": "You are a helpful voice assistant. Keep responses concise (1-2 sentences) for natural conversation."},
*self.conversation_history
]
)
assistant_message = response.choices[0].message.content
self.conversation_history.append({
"role": "assistant",
"content": assistant_message
})
return assistant_message
def speak(self, text: str):
"""Convert text to speech and play."""
response = self.client.audio.speech.create(
model="tts-1",
voice="nova",
input=text
)
# Save to temp file and play
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
response.stream_to_file(f.name)
os.system(f"afplay {f.name}") # macOS; use different player for other OS
os.unlink(f.name)
def conversation_loop(self):
"""Main conversation loop."""
print("Voice Assistant ready! Press Ctrl+C to exit.")
print("Speak after you see 'Recording...'")
while True:
try:
# Listen
audio = self.record_audio(duration=5.0)
# Transcribe
user_text = self.transcribe(audio)
if not user_text.strip():
print("(No speech detected)")
continue
print(f"You: {user_text}")
# Get AI response
response = self.get_response(user_text)
print(f"Assistant: {response}")
# Speak response
self.speak(response)
except KeyboardInterrupt:
print("\nGoodbye!")
break
# Run the assistant
if __name__ == "__main__":
assistant = VoiceAssistant()
assistant.conversation_loop()

For production voice assistants, latency is critical:

import asyncio
from concurrent.futures import ThreadPoolExecutor
class OptimizedVoiceAssistant:
"""Low-latency voice assistant with parallel processing."""
def __init__(self):
self.executor = ThreadPoolExecutor(max_workers=3)
# ... initialization
async def process_turn(self, audio: np.ndarray):
"""Process a conversation turn with optimized latency."""
# Start transcription immediately
transcription_future = self.executor.submit(self.transcribe, audio)
# Wait for transcription
user_text = transcription_future.result()
# Start LLM response with streaming
async def stream_response():
response_chunks = []
async for chunk in self.stream_llm_response(user_text):
response_chunks.append(chunk)
# Start TTS on first sentence
if len("".join(response_chunks)) > 50 and "." in chunk:
first_sentence = "".join(response_chunks).split(".")[0] + "."
self.executor.submit(self.speak, first_sentence)
return "".join(response_chunks)
full_response = await stream_response()
return full_response
# Target latencies:
# - Recording end to transcription complete: <200ms
# - Transcription to first TTS audio: <500ms
# - Total end-to-end: <1000ms

Apple’s Siri launched in 2011 and was revolutionary—but then fell behind:

2011: “Siri, set a timer for 10 minutes” blew people’s minds.

2012-2019: Siri barely improved while Alexa and Google Assistant leaped ahead.

The problem: Apple’s privacy-first approach limited data collection. Google/Amazon learned from billions of queries; Apple couldn’t.

2024: Apple Intelligence finally brought modern AI to Siri. The new Siri uses on-device LLMs and can finally have natural conversations.

The numbers:

  • Siri: 500M+ devices, but lowest satisfaction scores
  • Alexa: 100M+ devices, most smart home integrations
  • Google Assistant: Best accuracy, least privacy

Voice AI is big business:

Segment2023 Revenue2028 Projected
Speech Recognition$12B$28B
Text-to-Speech$3B$9B
Voice Assistants$5B$15B
Voice Biometrics$2B$6B
Total$22B$58B

Key players:

  • Nuance (acquired by Microsoft for $19.7B): Medical transcription
  • Deepgram: Developer-focused STT API
  • AssemblyAI: AI-powered transcription
  • Speechmatics: Enterprise speech recognition

Podcasters discovered AI voice in 2023, and it changed everything:

Before AI:

  • Edit 1 hour of podcast = 3-4 hours of work
  • Transcripts: Expensive or DIY
  • Show notes: Manual writing

After AI (Whisper + LLMs):

  • Automatic transcription (free with Whisper)
  • AI-generated show notes and summaries
  • Auto-detect and remove “um”s and “uh”s
  • Auto-generate clips for social media

Tools that emerged:

  • Descript: Edit audio by editing text
  • Podcastle: AI podcast production
  • Riverside: AI transcription + editing
  • Opus Clip: AI clip generator

The numbers:

  • 80% of top podcasters now use AI transcription
  • Average time savings: 60% on post-production
  • New accessibility: Deaf audiences can read transcripts

Whisper supports 99 languages with varying quality:

Tier 1 (Excellent - WER < 5%): English, Spanish, French, German, Italian, Portuguese, Dutch, Polish

Tier 2 (Good - WER 5-10%): Japanese, Korean, Chinese, Russian, Arabic, Hindi, Turkish

Tier 3 (Usable - WER 10-20%): Most other languages

import whisper
model = whisper.load_model("large-v3")
# Transcribe Japanese audio to Japanese text
result_transcribe = model.transcribe(
"japanese_speech.mp3",
language="ja",
task="transcribe"
)
print(f"Japanese: {result_transcribe['text']}")
# Translate Japanese audio to English text
result_translate = model.transcribe(
"japanese_speech.mp3",
task="translate" # Always translates to English
)
print(f"English: {result_translate['text']}")
from elevenlabs import generate
# ElevenLabs multilingual model
audio = generate(
text="Bonjour! Comment allez-vous aujourd'hui?",
voice="Rachel",
model="eleven_multilingual_v2"
)
# OpenAI TTS also handles multiple languages
from openai import OpenAI
client = OpenAI()
response = client.audio.speech.create(
model="tts-1",
voice="nova",
input="こんにちは、元気ですか?" # Japanese
)

Problem: Garbage in, garbage out.

# BAD: Transcribing noisy audio without preprocessing
result = model.transcribe("noisy_audio.mp3") # Poor results
# GOOD: Preprocess audio first
import librosa
import noisereduce as nr
# Load audio
audio, sr = librosa.load("noisy_audio.mp3", sr=16000)
# Reduce noise
audio_clean = nr.reduce_noise(y=audio, sr=sr)
# Save and transcribe
librosa.output.write_wav("clean_audio.wav", audio_clean, sr)
result = model.transcribe("clean_audio.wav") # Much better!

Pitfall 2: Not Handling Streaming Properly

Section titled “Pitfall 2: Not Handling Streaming Properly”

Problem: Waiting for full audio before transcribing.

# BAD: Transcribe only after recording stops
audio = record_5_seconds()
text = transcribe(audio) # 5+ second latency!
# GOOD: Continuous transcription with VAD
while True:
chunk = get_audio_chunk()
if is_speech(chunk):
buffer.append(chunk)
elif buffer: # End of speech
text = transcribe(buffer)
buffer = []
yield text # Stream results immediately

Problem: Using large model when small is enough.

# For real-time (< 500ms latency): Use base or small
model = WhisperModel("base") # 74M params, fast
# For accuracy (batch processing): Use large-v3
model = WhisperModel("large-v3") # 1.5B params, accurate
# For English-only: Use .en models
model = WhisperModel("base.en") # Faster for English

Problem: Regenerating TTS for repeated content.

import hashlib
def get_cached_audio(text: str, voice: str) -> bytes:
"""Cache TTS results to avoid regeneration."""
cache_key = hashlib.md5(f"{text}:{voice}".encode()).hexdigest()
cache_path = f".tts_cache/{cache_key}.mp3"
if os.path.exists(cache_path):
with open(cache_path, "rb") as f:
return f.read()
# Generate and cache
audio = generate_tts(text, voice)
os.makedirs(".tts_cache", exist_ok=True)
with open(cache_path, "wb") as f:
f.write(audio)
return audio

Use CaseRecommendation
Development/TestingLocal Whisper (free)
Low volume productionOpenAI Whisper API
High volume/real-timeDeepgram or AssemblyAI
On-premise requiredfaster-whisper + GPU
Multilingual focusWhisper large-v3
# For voice assistants (speed matters)
response = client.audio.speech.create(
model="tts-1", # Faster, slightly lower quality
voice="nova",
input=text,
speed=1.1 # Slightly faster speech
)
# For audiobooks/podcasts (quality matters)
response = client.audio.speech.create(
model="tts-1-hd", # Higher quality
voice="fable",
input=text,
speed=0.95 # Slightly slower, more natural
)
async def transcribe_with_fallback(audio_path: str) -> str:
"""Transcribe with fallback to backup service."""
try:
# Primary: Local faster-whisper
return await transcribe_local(audio_path)
except Exception as e:
logger.warning(f"Local transcription failed: {e}")
try:
# Fallback: OpenAI API
return await transcribe_openai(audio_path)
except Exception as e:
logger.warning(f"OpenAI transcription failed: {e}")
try:
# Last resort: Deepgram
return await transcribe_deepgram(audio_path)
except Exception as e:
logger.error(f"All transcription services failed: {e}")
return "[Transcription unavailable]"
from dataclasses import dataclass
import time
@dataclass
class SpeechMetrics:
transcription_latency_ms: float
tts_latency_ms: float
word_error_rate: float # If you have ground truth
audio_quality_score: float # From analysis
def track_metrics(func):
"""Decorator to track speech processing metrics."""
async def wrapper(*args, **kwargs):
start = time.time()
result = await func(*args, **kwargs)
latency = (time.time() - start) * 1000
metrics.record(
name=func.__name__,
latency_ms=latency,
timestamp=time.time()
)
return result
return wrapper
@track_metrics
async def transcribe(audio):
# ... transcription logic
pass

  • Whisper.cpp: Whisper in C++ for edge deployment
  • Silero VAD: Lightweight voice activity detection
  • pyannote.audio: Speaker diarization toolkit

The History of Speech Recognition: From SHOEBOX to Whisper

Section titled “The History of Speech Recognition: From SHOEBOX to Whisper”

Understanding how we arrived at modern speech AI helps you appreciate what makes current systems work—and where they still struggle.

The first speech recognition system was IBM’s SHOEBOX (1962), which could understand 16 words—digits 0-9 plus six command words. It worked by matching audio signals to handcrafted acoustic patterns. More words meant more rules, and the complexity became unmanageable.

Think of rule-based speech recognition like trying to write a dictionary that maps every possible way a word could sound—every accent, every speaking speed, every background noise condition—to its text representation. The task is fundamentally impossible to enumerate.

Did You Know? IBM’s 1970s “Tangora” system could recognize 20,000 words—but only if you spoke. One. Word. At. A. Time. With. Pauses. Between. Each. Word. Continuous speech recognition remained a dream for another two decades.

The breakthrough came from treating speech recognition as a probability problem. Hidden Markov Models (HMMs) didn’t try to define rules for what speech should sound like; they learned statistical patterns from data. “What’s the probability that this audio segment corresponds to the word ‘hello’ given what came before?”

Combine HMMs with Gaussian Mixture Models for acoustic modeling and n-gram language models for predicting word sequences, and you had systems that could handle continuous speech. Dragon NaturallySpeaking (1997) brought this technology to consumers—though “training” your voice profile by reading passages for 20 minutes was still required.

The limitation: these systems required careful feature engineering. Speech was converted into MFCCs (Mel-Frequency Cepstral Coefficients), and the system only saw those hand-designed features. Information was inevitably lost.

Deep learning changed speech recognition the same way it changed everything else: by learning features directly from raw data. Baidu’s Deep Speech (2014) showed that a deep neural network could match state-of-the-art HMM systems on clean speech—and crush them on noisy speech.

But even deep learning systems struggled with real-world robustness. They trained on clean datasets and tested on clean benchmarks. Move to a different microphone, a different accent, a noisy coffee shop, and accuracy fell off a cliff.

Whisper’s innovation wasn’t architectural—it was methodological. Train on everything: 680,000 hours of multilingual audio, including podcasts, YouTube videos, audiobooks, meetings. Don’t clean the data; embrace the messiness. The model learns robustness because the training data is diverse.

The results speak for themselves: Whisper achieves 4.2% word error rate on English, approaching human transcription accuracy (~4%). More importantly, that accuracy holds in the real world—not just on benchmark datasets.


Production War Stories: Speech AI in the Wild

Section titled “Production War Stories: Speech AI in the Wild”

The Call Center That Couldn’t Understand Its Customers

Section titled “The Call Center That Couldn’t Understand Its Customers”

Mumbai. February 2024. A major telecom company deployed Whisper for automatic call transcription. Accuracy was stellar in testing—until they went live. Complaints flooded in: the system was garbling names, misinterpreting requests, and occasionally producing complete gibberish.

Investigation revealed the problem: the test audio was recorded on high-quality headsets. Production calls came through phone lines with compression, background noise, and audio artifacts the model had rarely seen in training.

The fix: They fine-tuned Whisper on 10,000 hours of their actual call recordings. The process took two weeks and cost about $3,000 in compute. Accuracy jumped from 78% to 94% on their specific audio conditions.

Lesson: Off-the-shelf Whisper is a strong baseline, but domain-specific fine-tuning is often essential for production. The distribution of your audio matters more than benchmark accuracy.

The Podcast App That Processed 10 Million Hours

Section titled “The Podcast App That Processed 10 Million Hours”

San Francisco. August 2024. A podcast platform wanted to transcribe their entire catalog—10 million hours of audio. Naive calculation: at ~$0.006/minute for Whisper API, that’s $3.6 million just in API costs. Impossible.

Their solution: run Whisper locally on a fleet of 50 NVIDIA A10 GPUs. They used faster-whisper (the CTranslate2-optimized version) which runs 4x faster than the original. Total processing time: 3 months. Total cost: ~$200,000 in GPU rental—a 94% reduction from API pricing.

The insight: For batch processing at scale, local inference always wins. The break-even point is usually around 10,000 hours of audio—above that, invest in your own infrastructure.

Lesson: Know your volume. API for development and low volume; local inference for production scale.

Austin. November 2024. A startup built their entire product around ElevenLabs for voice synthesis. Their AI tutor spoke in a warm, encouraging voice that users loved. Then ElevenLabs changed their pricing—a 3x increase that blew the startup’s unit economics.

They scrambled to find alternatives. OpenAI TTS was cheaper but sounded different—users noticed immediately and complained. Coqui XTTS was free but required significant GPU resources. Bark was open-source but too slow for real-time.

The fix: They implemented a hybrid approach. For short responses (< 50 characters), they used OpenAI TTS (fast, cheap). For longer responses, they used a locally-hosted XTTS model. They A/B tested until they found the quality threshold where users didn’t notice the switch.

Lesson: Don’t build your product on a single TTS provider without a fallback plan. The voice AI market is volatile—pricing, availability, and capabilities change rapidly.


Mistake 1: Ignoring End-of-Speech Detection

Section titled “Mistake 1: Ignoring End-of-Speech Detection”
# WRONG - Wait for arbitrary timeout
def get_user_input():
audio = record_for_seconds(5) # What if they're still talking?
return transcribe(audio)
# RIGHT - Use Voice Activity Detection (VAD)
import webrtcvad
def get_user_input():
vad = webrtcvad.Vad(3) # Aggressiveness level 3 (most aggressive)
audio_buffer = []
silence_frames = 0
while True:
frame = get_audio_frame(30) # 30ms frame
if vad.is_speech(frame, sample_rate=16000):
audio_buffer.append(frame)
silence_frames = 0
else:
if audio_buffer: # We were speaking
silence_frames += 1
if silence_frames > 20: # 600ms of silence
break # End of utterance
return transcribe(b''.join(audio_buffer))

Consequence: Without proper end-of-speech detection, you either cut users off mid-sentence or waste time waiting after they’ve finished.

# WRONG - Play entire response before listening
def respond(user_text):
response_text = llm.generate(user_text)
audio = tts.synthesize(response_text)
play_audio(audio) # User can't interrupt!
return get_next_input()
# RIGHT - Stream TTS with interruption detection
async def respond(user_text):
response_text = llm.generate(user_text)
for chunk in tts.stream_synthesize(response_text):
# Check for user interruption while playing
if detect_speech_in_microphone():
stop_playback()
return get_user_input() # Let user take over
play_audio_chunk(chunk)
return get_next_input()

Consequence: Forcing users to wait for AI to finish speaking feels robotic and frustrating. Natural conversations have interruptions.

Mistake 3: One-Size-Fits-All Model Selection

Section titled “Mistake 3: One-Size-Fits-All Model Selection”

Think of speech models like vehicles. You wouldn’t use a semi-truck for grocery shopping or a bicycle for moving furniture. Whisper large-v3 is the semi-truck—powerful but slow. Whisper tiny is the bicycle—fast but limited. Match the model to the task.

# WRONG - Always use the biggest model
model = WhisperModel("large-v3") # 3 seconds per 1 second of audio
# RIGHT - Match model to use case
def get_model_for_use_case(use_case: str) -> WhisperModel:
models = {
"real_time": "tiny.en", # 50ms latency, English only
"streaming": "base", # 100ms latency, multilingual
"batch": "medium", # Good balance
"accuracy_critical": "large-v3" # Maximum accuracy
}
return WhisperModel(models[use_case])

Q: “How would you build a real-time voice assistant with sub-second response latency?”

Strong Answer: “Latency in voice systems compounds: STT + LLM + TTS must all complete before the user hears anything. My approach focuses on parallelization and streaming.

For STT, I’d use a small Whisper model (base or tiny) locally, achieving ~100ms latency. I’d implement Voice Activity Detection to know exactly when the user finishes speaking, avoiding arbitrary timeouts.

For the LLM, I’d stream tokens as they’re generated rather than waiting for the complete response. This lets TTS start working on the first sentence while the LLM is still generating the rest.

For TTS, I’d use OpenAI’s streaming TTS API or a local XTTS model with chunk-based synthesis. The first audio chunk can start playing within 200ms of receiving text.

The result: total latency from user-stops-speaking to AI-starts-responding of around 400-600ms. That feels responsive—similar to natural conversation pauses.”

Q: “Explain how Whisper handles multiple languages without explicit language detection.”

Strong Answer: “Whisper uses a clever bootstrapping approach. The first 30 seconds of audio are processed to predict the language token—this is a classification task the model learned during training. Once the language is identified, it guides subsequent transcription.

But here’s the elegant part: Whisper was trained on multilingual data where language tokens were part of the training signal. It learned that certain acoustic patterns co-occur with certain language tokens. It’s not doing language detection then transcription—it’s doing them jointly.

This is why Whisper can handle code-switching (multiple languages in one utterance) relatively well: it doesn’t commit to a single language upfront. It predicts language tokens at a fine-grained level.

In practice, you can also force a language: model.transcribe(audio, language=‘es’). This skips detection and can improve accuracy if you know the language beforehand.”

Q: “What are the ethical considerations around voice cloning technology?”

Strong Answer: “Voice cloning raises serious concerns that responsible engineers must address.

First, consent: cloning someone’s voice without permission is ethically problematic and increasingly illegal. ElevenLabs requires voice verification to prevent unauthorized cloning. Any system I build would include similar safeguards.

Second, deepfakes: cloned voices can be used for fraud, misinformation, and harassment. Detection becomes important—watermarking synthesized audio, training detection models, and supporting provenance tracking.

Third, displacement: as TTS improves, voice actors and narrators face job disruption. While technology advances regardless of our choices, we should consider the human impact and support transitions.

Fourth, accessibility: voice cloning has positive uses too. People who’ve lost their voice to illness can have it recreated. Audiobooks can be produced more affordably. The technology itself isn’t evil—our application of it matters.

In production, I’d implement audit logging, consent verification, usage policies, and detection mechanisms to mitigate misuse while enabling beneficial applications.”


ServiceSTT CostTTS CostNotes
OpenAI Whisper API$0.006/min-Most convenient
OpenAI TTS-$0.015/1K charsHigh quality
ElevenLabs-$0.018/1K charsBest voices
Deepgram$0.0043/min-Real-time optimized
AssemblyAI$0.0037/min-Best value
Local Whisper (GPU)~$0.0005/min*-*Amortized hardware
Local XTTS (GPU)-~$0.001/1K chars**Amortized hardware

When does local inference beat API pricing?

API cost per hour: $0.006 × 60 = $0.36/hour of audio
GPU cost (A10): ~$1/hour
At 1 hour real-time processing per GPU hour (base model):
Break-even when: $0.36/hr × X = $1/hr + setup costs
X ≈ 3 hours of audio per GPU hour needed
With faster-whisper (4x speedup):
Processing 4 hours of audio per GPU hour
Cost: $0.25/hour of audio
Savings: 31% vs API
At 100,000 hours/month:
API: $36,000/month
Local (25 GPU-hours): ~$9,000/month
Savings: $27,000/month

Recommendation: Below 1,000 hours/month, use APIs. Above 10,000 hours/month, invest in local infrastructure. In between, it depends on your latency requirements and engineering capacity.

Did You Know? Spotify uses a custom speech recognition system to transcribe millions of podcast episodes. They estimate local processing saves them over $10 million annually compared to API pricing. At their scale, even a few cents per hour adds up to millions.


Voice is becoming the default way humans interact with AI. The trajectory is clear: keyboards → touchscreens → voice. Each shift made computing more accessible and more natural. Voice removes the last barrier—you don’t need to learn anything. Speaking is hardwired into human biology.

Did You Know? By 2025, an estimated 8.4 billion voice assistants will be in use globally—more than the world’s population. The average American household already has 2.5 voice-enabled devices. We’re approaching a world where voice interaction is expected, not novel.

Today’s voice systems are pipelines: STT → LLM → TTS. Each component introduces latency and information loss. The future is unified audio models that process speech end-to-end.

OpenAI’s gpt-5 previewed this future. It doesn’t transcribe speech to text, process the text, then synthesize speech. It processes audio directly—hearing tone, pace, emotion, and background sounds, then generating audio responses that match the conversational context.

The implications are profound. A unified model can:

  • Respond to paralinguistic cues: sighs, laughter, hesitation
  • Maintain consistent voice personality: same tone throughout a conversation
  • Handle music and environmental audio: not just speech
  • Achieve sub-200ms latency: faster than human conversational pauses

Imagine an AI assistant that sounds like a trusted mentor, a friend, or your favorite audiobook narrator—because it is. Voice personalization is coming fast.

ElevenLabs already enables “professional voice cloning” from hours of audio. The next step is cloning from minutes, then seconds. Eventually, your AI assistant will speak in whatever voice you prefer, trained on a few samples you provide.

The ethical challenges are obvious, but so are the opportunities. Assistive technology for people who’ve lost their voice. Personalized learning with AI tutors who sound like inspiring teachers. Entertainment where NPCs speak with the voices of legendary actors.

Voice doesn’t exist in isolation. When you say “move that over there,” you’re probably pointing. When you say “this looks wrong,” you’re looking at something. The future of voice AI is multimodal—systems that see, hear, and respond with full context.

Google’s Gemini Live and OpenAI’s gpt-5 show glimpses of this future. You can point your phone camera at a restaurant menu and ask “what should I order if I’m vegetarian?” The AI sees the menu, hears your question, and responds with voice. That’s the convergence: vision + voice + language in a seamless interaction.

If you’re building voice applications today, design for tomorrow:

  1. Abstract your STT/TTS providers: The landscape is shifting fast. Don’t lock into one vendor.

  2. Build for multimodal: Even if your current app is voice-only, structure your code to accept additional modalities later.

  3. Measure latency obsessively: Users tolerate delays in text interfaces. Voice must feel instantaneous. Track end-to-end latency as a primary metric.

  4. Plan for personalization: Users will expect voice customization. Design your architecture to support multiple voice profiles.

  5. Consider offline: Edge deployment of speech models is improving rapidly. Whisper.cpp runs on phones. Plan for scenarios where cloud connectivity isn’t guaranteed.

The voice AI stack you build today should be ready for a world where voice is the primary human-AI interface—because that world is arriving faster than most people expect.


  1. Whisper changed the game by training on messy, diverse, real-world audio. The lesson: data diversity trumps architectural cleverness for robustness.

  2. The voice stack is simple: STT → LLM → TTS. But each component has latency, and latencies compound. Optimize each stage, and stream wherever possible.

  3. Model size is a trade-off, not a quality dial. Smaller models for real-time interaction; larger models for batch accuracy. Match the model to your latency budget.

  4. Voice Activity Detection (VAD) is crucial for production systems. Without it, you’re guessing when users start and stop speaking.

  5. Domain-specific fine-tuning often matters more than model size. Whisper trained on podcasts might struggle with your call center audio. Fine-tune on your actual distribution.

  6. TTS quality is perceptible but not always critical. Users notice bad TTS immediately, but the difference between good and great TTS is subtle. Don’t overspend on voice quality for low-stakes interactions.

  7. Handle interruptions gracefully. Real conversations have interruptions. If your AI can’t be interrupted, it feels robotic.

  8. Voice cloning is powerful but risky. Implement consent verification, audit logging, and abuse detection. The technology is too easy to misuse.

  9. Local inference wins at scale. API pricing is convenient but expensive. Above 10,000 hours/month, run your own Whisper infrastructure.

  10. The future is multimodal. Voice is just one modality. The best systems will combine voice with vision, text, and other inputs for natural, context-aware interaction.


What you learned:

  • Whisper for accurate speech-to-text
  • OpenAI TTS and ElevenLabs for natural speech synthesis
  • Real-time transcription with VAD
  • Building complete voice assistants
  • Speaker diarization for multi-speaker audio
  • Production best practices for speech AI

Key technologies:

  • STT: Whisper, faster-whisper, Deepgram
  • TTS: OpenAI TTS, ElevenLabs, Coqui
  • VAD: Silero VAD
  • Diarization: pyannote.audio

The voice stack:

Audio In → VAD → Whisper → LLM → TTS → Audio Out

Next module: Module 23: Vision AI

Now that you can make AI hear and speak, let’s make it see! You’ll learn:

  • CLIP for image-text understanding
  • GPT-4V and Claude Vision
  • Building multimodal applications
  • Image search and analysis

Phase 5 Progress: 1/3 modules complete


** Neural Dojo - Give your AI a voice! ️**


Last updated: 2025-11-26 Module 22: Speech AI