Module 1.1: DNS at Scale & Global Traffic Management
Complexity: [COMPLEX]
Time to Complete: 3 hours
Prerequisites: Basic DNS (A/AAAA/CNAME records), Kubernetes Ingress concepts
Track: Foundations — Advanced Networking
What You’ll Be Able to Do
After completing this module, you will be able to:
- Design DNS architectures for global traffic management using weighted routing, geolocation policies, and health-checked failover
- Diagnose DNS resolution failures by tracing queries through recursive resolvers, authoritative servers, and caching layers
- Implement DNS-based service discovery patterns and explain their tradeoffs compared to service mesh alternatives
- Evaluate DNS security risks (cache poisoning, DDoS amplification, hijacking) and apply DNSSEC, DoH, and split-horizon mitigations
July 22, 2016. A routine configuration update at Dyn, one of the world’s largest managed DNS providers, propagates a change to their Anycast network. Nothing unusual. But three months later, Dyn would learn a very different lesson about DNS at scale.
On October 21, 2016, the Mirai botnet unleashed a massive DDoS attack against Dyn’s infrastructure. Tens of millions of IP addresses, mostly from compromised IoT devices, flooded Dyn’s authoritative DNS servers. The attack was devastating not because Dyn was careless, but because DNS is the single most critical piece of internet infrastructure that almost everyone takes for granted.
Twitter, GitHub, Netflix, Reddit, Spotify, The New York Times — all went dark. Not because their servers were down, but because nobody could look up their IP addresses. It was like erasing every phone number from every phone book simultaneously.
The Dyn attack exposed what infrastructure engineers already knew: DNS is the first thing that happens in every connection and the last thing anyone thinks about until it breaks. This module teaches you to think about DNS the way the engineers who keep the internet running do — as a globally distributed, latency-sensitive, security-critical system that demands deliberate architecture.
Why This Module Matters
Every single request your application serves begins with a DNS lookup. Before TLS handshakes, before HTTP requests, before any application logic — the client must resolve a hostname to an IP address. If that resolution is slow, everything is slow. If it fails, nothing works.
At scale, DNS stops being a simple lookup table and becomes a global traffic management system. It decides which datacenter serves your users. It detects failures and reroutes traffic. It balances load across continents. It enforces security policies before a single packet reaches your infrastructure.
Yet most engineers treat DNS as “set it and forget it.” They paste records into a web UI and wonder why their global application has mysterious latency spikes for users in certain regions, or why failover takes 20 minutes instead of 20 seconds.
The Air Traffic Control Analogy
Think of DNS like air traffic control. Every plane (request) needs to be told which runway (server) to land on. Good ATC considers weather (server health), fuel levels (client proximity), runway capacity (server load), and traffic patterns (routing policies). Bad ATC just assigns runways randomly and hopes for the best. DNS at scale is your application’s ATC system.
What You’ll Learn
- Advanced DNS record types beyond A/AAAA/CNAME (ALIAS, ANAME, CAA, SRV)
- Anycast routing and why it matters for DNS
- Traffic management policies: latency-based, weighted, geolocation, failover
- DNSSEC: how it works and why adoption is still incomplete
- TTL tuning and the caching trap that catches everyone
- Hands-on: Building latency-based routing with health checks and failover
Part 1: Beyond Basic DNS Records
1.1 The Record Types You Already Know
BASIC DNS RECORDS — QUICK REVIEW

A RECORD
Maps hostname → IPv4 address
  app.example.com.    300    IN    A    203.0.113.10

AAAA RECORD
Maps hostname → IPv6 address
  app.example.com.    300    IN    AAAA    2001:db8::1

CNAME RECORD
Maps hostname → another hostname (alias)
  www.example.com.    300    IN    CNAME    app.example.com.
  ⚠️ LIMITATION: CNAME cannot coexist with other records at the same name (RFC 1034). This means you CANNOT put a CNAME at the zone apex (example.com).

MX RECORD
Maps hostname → mail server (with priority)
  example.com.    300    IN    MX    10    mail.example.com.
  example.com.    300    IN    MX    20    backup.example.com.

1.2 Advanced Record Types for Scale
ADVANCED DNS RECORDS
ALIAS / ANAME RECORD (Provider-Specific)
Solves the "CNAME at zone apex" problem.

  Problem:
    example.com.  CNAME  lb.cloud.com.   ← ILLEGAL per RFC
    example.com.  A      ???             ← Need dynamic IP

  Solution: ALIAS/ANAME resolves at the DNS server level
    example.com.  ALIAS  lb.us-east-1.elb.amazonaws.com.

  How it works:
    1. Client queries: example.com A?
    2. DNS server resolves lb.us-east-1.elb.amazonaws.com → 52.1.2.3
    3. DNS server returns: example.com A 52.1.2.3

  ┌──────────┐                      ┌──────────────┐
  │  Client  │─── A example.com? ──→│  DNS Server  │
  │          │                      │  Resolves    │
  │          │                      │  ALIAS target│
  │          │                      │  internally  │
  │          │←──── A 52.1.2.3 ─────│              │
  └──────────┘                      └──────────────┘

  ⚠️ NOT standardized. Called "ALIAS" (Route53, DNSimple), "ANAME" (PowerDNS, RFC draft), "CNAME flattening" (Cloudflare). Behavior varies by provider.
SRV RECORD
Service discovery with port and priority.

  Format: _service._protocol.name TTL IN SRV priority weight port target

  _http._tcp.example.com. 300 IN SRV 10 60 8080 web1.example.com.
  _http._tcp.example.com. 300 IN SRV 10 40 8080 web2.example.com.
  _http._tcp.example.com. 300 IN SRV 20  0 8080 backup.example.com.

  Priority 10 (lower = preferred): 60% to web1, 40% to web2
  Priority 20 (fallback): backup only if priority 10 fails

  Used by: Kubernetes services, LDAP, SIP, XMPP, MongoDB
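The priority and weight rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the full ordering procedure from the SRV specification: `SrvRecord` and `pick_srv` are hypothetical names, and the function simply takes the lowest-priority group and draws one target in proportion to weight.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    port: int
    target: str


def pick_srv(records, rng=random):
    """Pick one record: lowest priority value first, then weighted random."""
    if not records:
        raise ValueError("no SRV records to choose from")
    best = min(r.priority for r in records)
    group = [r for r in records if r.priority == best]
    weights = [r.weight for r in group]
    if sum(weights) == 0:
        # All weights are zero: fall back to a uniform choice.
        return rng.choice(group)
    return rng.choices(group, weights=weights, k=1)[0]


# The three records from the example above.
RECORDS = [
    SrvRecord(10, 60, 8080, "web1.example.com."),
    SrvRecord(10, 40, 8080, "web2.example.com."),
    SrvRecord(20, 0, 8080, "backup.example.com."),
]
```

A client would call `pick_srv(RECORDS)` per connection and, once every priority-10 target has failed, call it again with only the remaining records so that the backup is selected.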
CAA RECORD (Certificate Authority Authorization)
Controls which CAs can issue certificates for your domain.

  example.com. 300 IN CAA 0 issue "letsencrypt.org"
  example.com. 300 IN CAA 0 issuewild "letsencrypt.org"
  example.com. 300 IN CAA 0 iodef "mailto:security@example.com"

  issue     → Who can issue regular certs
  issuewild → Who can issue wildcard certs
  iodef     → Where to report violations

  Since Sept 2017, CAs MUST check CAA before issuing. Missing CAA = any CA can issue (bad for security).

TXT RECORD (Verification & Policy)
Free-form text, used heavily for verification and email auth.

  example.com.                   300 IN TXT "v=spf1 include:_spf.google.com ~all"
  _dmarc.example.com.            300 IN TXT "v=DMARC1; p=reject; rua=..."
  google._domainkey.example.com. 300 IN TXT "v=DKIM1; k=rsa; p=..."

  SPF:   Which servers can send email for your domain
  DKIM:  Cryptographic email signing
  DMARC: What to do with failed SPF/DKIM checks

1.3 DNS Hierarchy and Resolution
DNS RESOLUTION — THE FULL PICTURE
When you type "app.example.com" in your browser:
  ┌──────────┐     ┌───────────────┐     ┌──────────────┐
  │ Browser  │────→│ Stub Resolver │────→│ Recursive    │
  │          │     │ (OS-level)    │     │ Resolver     │
  └──────────┘     └───────────────┘     │ (ISP/Cloud)  │
                                         └──────┬───────┘
                                                │
         ┌───────────────┬──────────────────────┤
         │               │                      │
         ▼               ▼                      ▼
  ┌─────────────┐   ┌──────────┐   ┌──────────────┐
  │ Root Server │   │ .com TLD │   │ example.com  │
  │ (13 groups) │   │ Server   │   │ Authoritative│
  └─────────────┘   └──────────┘   └──────────────┘

Step-by-step (uncached):
  1. Browser checks its cache → miss
  2. OS stub resolver checks /etc/hosts → miss
  3. OS sends query to configured recursive resolver
  4. Recursive resolver asks root: "Where is .com?"
  5. Root says: "Try 192.5.6.30 (a.gtld-servers.net)"
  6. Recursive asks .com TLD: "Where is example.com?"
  7. TLD says: "Try 198.51.100.1 (ns1.example.com)"
  8. Recursive asks authoritative: "What is app.example.com?"
  9. Authoritative responds with the A record
  10. Recursive caches result, returns to client
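The referral walk in steps 4 through 10 can be modelled with a toy in-memory resolver. This is only a sketch under stated assumptions: the `SERVERS` table, the server labels, and the `resolve` function are hypothetical stand-ins for real network queries, and the cache ignores TTLs for brevity.

```python
# Toy delegation data: each "server" either answers or refers to a child zone.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"example.com.": "ns1.example.com"}},
    "ns1.example.com": {"answers": {"app.example.com.": "203.0.113.10"}},
}


def resolve(name, cache):
    """Walk root -> TLD -> authoritative, as a recursive resolver would."""
    if name in cache:
        return cache[name], ["cache hit"]
    server, trace = "root", []
    while True:
        zone = SERVERS[server]
        trace.append(server)
        answers = zone.get("answers", {})
        if name in answers:
            cache[name] = answers[name]  # step 10: cache, then return
            return cache[name], trace
        referrals = zone.get("referrals", {})
        # Follow the most specific delegation that covers the name.
        matches = [z for z in referrals if name == z or name.endswith("." + z)]
        if not matches:
            raise LookupError(f"{name}: no answer or referral at {server}")
        server = referrals[max(matches, key=len)]
```

Calling `resolve("app.example.com.", {})` visits all three servers, while a second call with the same cache dictionary returns immediately, which mirrors the uncached versus cached timings.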
TOTAL UNCACHED LOOKUP TIME
  Root query:           ~5-30ms    (Anycast, nearby)
  TLD query:            ~10-50ms
  Authoritative query:  ~10-200ms  (depends on location)
  ──────────────────────────────────────────────
  Total:                ~25-280ms for first lookup
  Cached:               ~0-5ms for subsequent lookups

Part 2: Anycast — The Secret Behind Fast DNS
2.1 Unicast vs Anycast
UNICAST vs ANYCAST
UNICAST (Traditional)
One IP address → One physical server

  Client in Tokyo ──────────────→ Server in Virginia
                    10,000+ km    IP: 198.51.100.1
                    ~150ms RTT

  Every client goes to the same server, regardless of location.

ANYCAST
One IP address → Multiple physical servers
BGP routing sends each client to the nearest one.

  Client in Tokyo  ───→ Server in Tokyo     IP: 198.51.100.1   ~5ms
  Client in London ───→ Server in London    IP: 198.51.100.1   ~3ms
  Client in NYC    ───→ Server in Virginia  IP: 198.51.100.1   ~8ms

  Same IP, different physical servers!
HOW ANYCAST WORKS
Multiple servers announce the same IP prefix via BGP.

  ┌────────────────┐          ┌────────────────┐
  │ Tokyo PoP      │          │ London PoP     │
  │ 198.51.100.0/24│          │ 198.51.100.0/24│
  └───────┬────────┘          └───────┬────────┘
          │ BGP announces             │ BGP announces
          │ 198.51.100.0/24           │ 198.51.100.0/24
          ▼                           ▼
  ┌────────────────────────────────────────────┐
  │             Internet BGP Mesh              │
  │                                            │
  │  Router sees same prefix from multiple     │
  │  origins → picks shortest AS-path          │
  └────────────────────────────────────────────┘
WHY IT WORKS FOR DNS (AND CDN)
  ✓ DNS is mostly UDP-based → no long-lived connections
  ✓ Each query is independent → stateless
  ✓ A query completes in milliseconds, so it almost never spans a BGP route change (convergence takes ~30-90s)
  ✓ Natural DDoS absorption: attack traffic gets distributed

  ⚠️ Less suitable for TCP (BGP route changes can break established connections). Some providers handle this with connection migration or short-lived flows.

2.2 Anycast in Practice
ANYCAST DEPLOYMENTS — REAL NUMBERS
ROOT DNS SERVERS
  13 root server "identities" (a.root-servers.net through m)
  But NOT 13 physical servers!

  Total root server instances worldwide: 1,700+
  All using Anycast to serve from nearest location.

  Example: f.root-servers.net (ISC)
  - 256+ instances across 6 continents
  - All announce same 192.5.5.241 via Anycast

CLOUDFLARE DNS (1.1.1.1)
  330+ cities worldwide
  Average response time: <11ms globally
  All serving from the same 1.1.1.1 Anycast address

GOOGLE PUBLIC DNS (8.8.8.8)
  Anycast across Google's global network
  ~20+ PoP locations
  Average response time: ~12ms

Part 3: Global Traffic Management with DNS
3.1 Traffic Routing Policies
DNS-BASED TRAFFIC MANAGEMENT
Instead of returning a static IP, smart DNS returns DIFFERENT IPs based on the client's context. This is how global load balancing works at the DNS layer.

SIMPLE (Round Robin)
Return all IPs, rotate the order.

  Query 1: app.example.com → [10.0.1.1, 10.0.2.1, 10.0.3.1]
  Query 2: app.example.com → [10.0.2.1, 10.0.3.1, 10.0.1.1]
  Query 3: app.example.com → [10.0.3.1, 10.0.1.1, 10.0.2.1]

  Pros: Dead simple
  Cons: No health awareness, no locality, uneven distribution

WEIGHTED
Return IPs based on configured weights.

  us-east   70% → 52.1.1.1  (primary, more capacity)
  eu-west   20% → 54.2.2.2  (secondary)
  ap-south  10% → 13.3.3.3  (new region, testing)

  Use cases:
  - Gradual migration (90/10 → 80/20 → 50/50 → 0/100)
  - Canary deployments
  - Proportional to capacity
LATENCY-BASED
Return the IP with lowest latency to the client.

  Client in Tokyo  → ap-northeast-1 (13.3.3.3)  ~5ms
  Client in NYC    → us-east-1 (52.1.1.1)       ~8ms
  Client in Berlin → eu-west-1 (54.2.2.2)       ~12ms

  How it works:
  1. DNS provider continuously measures latency from resolver locations to each endpoint
  2. Maps client's resolver IP to nearest region
  3. Returns the lowest-latency endpoint

  ⚠️ EDNS Client Subnet (ECS) improves accuracy. Without ECS, latency is measured to the RESOLVER, not the actual client. A user in Tokyo using 8.8.8.8 might get routed based on Google's resolver location.
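To make the ECS caveat concrete, here is a small sketch of a latency-based decision. Everything in it is illustrative: the `LATENCY_MS` table, the documentation prefixes standing in for a Tokyo client network and a US-based resolver, and the `pick_endpoint` function are assumptions, not any provider's actual data model.

```python
import ipaddress

# Hypothetical measured latency (ms) from source networks to each region.
LATENCY_MS = {
    # Assumed: a client network in Tokyo.
    "203.0.113.0/24": {"ap-northeast-1": 5, "us-east-1": 150, "eu-west-1": 220},
    # Assumed: the public resolver instance the client happens to use.
    "198.51.100.0/24": {"ap-northeast-1": 160, "us-east-1": 8, "eu-west-1": 80},
}
ENDPOINTS = {
    "ap-northeast-1": "13.3.3.3",
    "us-east-1": "52.1.1.1",
    "eu-west-1": "54.2.2.2",
}


def pick_endpoint(resolver_ip, ecs_subnet=None):
    """Route on the ECS subnet when present, else on the resolver address."""
    source = ecs_subnet or resolver_ip
    addr = ipaddress.ip_network(source, strict=False).network_address
    for prefix, latencies in LATENCY_MS.items():
        if addr in ipaddress.ip_network(prefix):
            region = min(latencies, key=latencies.get)
            return region, ENDPOINTS[region]
    raise LookupError(f"no latency data for {source}")
```

With only the resolver address, `pick_endpoint("198.51.100.53")` selects us-east-1; passing `ecs_subnet="203.0.113.0/24"` for the same query selects ap-northeast-1, which is the correction ECS is meant to provide.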
GEOLOCATION
Return IPs based on the client's geographic location.

  Client in Germany → eu-central-1 (GDPR compliance!)
  Client in China   → cn-north-1   (content compliance!)
  Client in Brazil  → sa-east-1    (data sovereignty!)

  Use cases:
  - Regulatory compliance (data residency)
  - Content licensing (geo-restricted media)
  - Language-specific endpoints

  Hierarchy: Continent → Country → State → City
  Fallback if no match: default record
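The hierarchy and default fallback amount to a most-specific-match lookup. The sketch below assumes a hypothetical `RULES` table keyed by (continent, country, state), leaves out the city level for brevity, and treats the continent-wide Europe rule and the `DEFAULT` region as made-up examples.

```python
# Hypothetical geolocation rules; None acts as a wildcard for that level.
RULES = {
    ("EU", "DE", None): "eu-central-1",
    ("EU", None, None): "eu-west-1",
    ("SA", "BR", None): "sa-east-1",
}
DEFAULT = "us-east-1"  # assumed default record


def route(continent, country=None, state=None):
    """Try the most specific rule first, then widen, then use the default."""
    candidates = (
        (continent, country, state),
        (continent, country, None),
        (continent, None, None),
    )
    for key in candidates:
        if key in RULES:
            return RULES[key]
    return DEFAULT
```

A client located in Bavaria matches the Germany rule, a client in France falls through to the continent rule, and a client in Japan has no match and gets the default record.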
GEOPROXIMITY
Like geolocation, but with a tunable "bias" that shifts the geographic boundary between regions.

  Without bias:
  ┌─────────────────────────────────────────────┐
  │     US-East        │        US-West         │
  │                    │                        │
  │  Chicago gets   ←──│── boundary             │
  │  US-East           │                        │
  └─────────────────────────────────────────────┘

  With bias (US-West +25):
  ┌─────────────────────────────────────────────┐
  │  US-East     │           US-West            │
  │              │                              │
  │  Boundary ──→│  Chicago now gets            │
  │  shifted     │  US-West!                    │
  └─────────────────────────────────────────────┘
FAILOVER (Active-Passive)
Return primary IP unless health check fails, then failover.

  Normal:
    app.example.com → 52.1.1.1 (primary, healthy ✓)

  After primary fails health check:
    app.example.com → 54.2.2.2 (secondary)

  Health check configuration:
    Protocol:  HTTP/HTTPS/TCP
    Path:      /healthz
    Interval:  10s
    Threshold: 3 consecutive failures → failover
    Recovery:  3 consecutive successes → failback
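The threshold and recovery rules behave like a small state machine: the record only flips after three consecutive results that contradict its current state. A minimal sketch, assuming a hypothetical `FailoverRecord` class rather than any provider's API:

```python
class FailoverRecord:
    """Active-passive record that flips after `threshold` consecutive
    health-check results contradicting the current state."""

    def __init__(self, primary, secondary, threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.threshold = threshold
        self.healthy = True
        self._streak = 0  # consecutive results disagreeing with self.healthy

    def observe(self, check_passed):
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with current state: reset
            return
        self._streak += 1
        if self._streak >= self.threshold:
            self.healthy = check_passed  # failover or failback
            self._streak = 0

    def answer(self):
        return self.primary if self.healthy else self.secondary
```

Requiring consecutive results in both directions avoids flapping: a single passing check in the middle of an outage does not send traffic back to the primary.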
MULTIVALUE ANSWER
Like round robin, but with health checks.
Returns up to 8 healthy IPs. Client picks one.

  Healthy:  [52.1.1.1 ✓, 54.2.2.2 ✓, 13.3.3.3 ✓]
  Response: [52.1.1.1, 54.2.2.2, 13.3.3.3]

  After 54.2.2.2 fails:
  Response: [52.1.1.1, 13.3.3.3] (only healthy IPs)

3.2 Health Checks and Failover Architecture
HEALTH CHECK ARCHITECTURE
DNS health checks run from MULTIPLE locations to avoid false positives from network issues.

  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐
  │Virginia│  │Oregon  │  │Ireland │  │Tokyo   │
  │Checker │  │Checker │  │Checker │  │Checker │
  └───┬────┘  └───┬────┘  └───┬────┘  └───┬────┘
      │           │           │           │
      ▼           ▼           ▼           ▼
  ┌────────────────────────────────────────────┐
  │              Your Application              │
  │            52.1.1.1:443/healthz            │
  └────────────────────────────────────────────┘

  Failure determination:
    If 1/4 checkers fail → Network issue, ignore
    If 2/4 checkers fail → Possible problem
    If 3/4 checkers fail → Mark unhealthy, failover
    If 4/4 checkers fail → Definitely down
CALCULATED HEALTH CHECKS (Composite)

  ┌─── AND ──────────────────────────────────────┐
  │                                              │
  │  ┌─── OR ────────────────────────────┐       │
  │  │ Web Server 1: /healthz  ✓         │       │
  │  │ Web Server 2: /healthz  ✓         │  ✓    │
  │  └───────────────────────────────────┘       │
  │                                              │
  │  ┌─── OR ────────────────────────────┐       │
  │  │ DB Primary:   port 5432  ✓        │       │
  │  │ DB Replica:   port 5432  ✓        │  ✓    │
  │  └───────────────────────────────────┘       │
  │                                              │
  │  → HEALTHY ✓                                 │
  └──────────────────────────────────────────────┘

  Region is healthy only if BOTH web AND database have at least one healthy instance.
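Both rules in this section reduce to short boolean expressions. The sketch below uses hypothetical helper names: `checker_quorum` applies the "3 of 4 checkers" rule, and `region_healthy` applies AND across tiers with OR within each tier.

```python
def checker_quorum(results, unhealthy_at=3):
    """results: one pass/fail boolean per checker location.
    The endpoint stays healthy until `unhealthy_at` locations report failure."""
    failures = sum(1 for ok in results if not ok)
    return failures < unhealthy_at


def region_healthy(tiers):
    """tiers: {tier_name: {instance_name: is_healthy}}.
    AND across tiers, OR across the instances inside each tier."""
    return all(any(instances.values()) for instances in tiers.values())
```

For example, a region with one web server down and the database primary down is still healthy as long as the second web server and the replica pass their checks.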
FAILOVER TIMING BREAKDOWN
  Health check interval:            10s
  Failure threshold:                3 checks
  Time to detect failure:           ~30s
  DNS propagation (TTL dependent):  0-60s
  Client cache expiry:              0-300s
  ─────────────────────────────────────────────────
  Worst case total:                 ~6.5 minutes (30s + 60s + 300s)
  Best case total:                  ~30 seconds
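The breakdown is simple addition, which makes it easy to recompute for your own settings. A small sketch, with `failover_window` as a hypothetical helper name:

```python
def failover_window(interval_s, threshold, resolver_ttl_s, client_cache_s):
    """Return (best_case, worst_case) seconds until clients reach the new IP."""
    detect = interval_s * threshold  # time to accumulate enough failed checks
    best = detect                    # every cache happens to expire right away
    worst = detect + resolver_ttl_s + client_cache_s
    return best, worst


best, worst = failover_window(10, 3, 60, 300)
print(best, worst)  # 30 390
```

With the numbers above the worst case is 390 seconds, roughly six and a half minutes; dropping the client-side cache to 60 seconds brings it down to 150 seconds.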
This is why low TTLs matter for failover!

Part 4: TTL — The Caching Trap
4.1 How TTL Actually Works
TTL (TIME TO LIVE) — THE MISUNDERSTOOD SETTING
TTL tells resolvers how long to cache a DNS response.
  app.example.com.    60    IN    A    52.1.1.1
                      ^^
                      TTL = 60 seconds

THE CACHING CHAIN

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌─────────┐
  │ Browser  │   │ OS Cache │   │ Recursive│   │ Auth    │
  │ Cache    │──→│ Stub     │──→│ Resolver │──→│ Server  │
  │ ~1-2min  │   │ ~TTL     │   │ ~TTL     │   │         │
  └──────────┘   └──────────┘   └──────────┘   └─────────┘

  Each layer can cache independently!

THE COUNTDOWN PROBLEM

  t=0:   Auth server sets TTL=300 (5 min)
  t=0:   Recursive resolver caches with TTL=300
  t=120: Client A queries resolver → gets answer with TTL=180
  t=120: Client A caches with TTL=180
  t=290: Client B queries resolver → gets answer with TTL=10
  t=290: Client B caches with TTL=10

  Client A won't re-query for 3 minutes.
  Client B will re-query in 10 seconds.
  SAME answer, DIFFERENT freshness guarantees.
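The countdown is easy to reproduce with a cache that stores an absolute expiry time and hands out the remaining TTL on each lookup. A minimal sketch, assuming a hypothetical `ResolverCache` class and an explicit `now` argument in place of a real clock:

```python
class ResolverCache:
    """Caches answers with an absolute expiry and returns the remaining TTL."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def store(self, name, value, ttl, now):
        self._entries[name] = (value, now + ttl)

    def lookup(self, name, now):
        entry = self._entries.get(name)
        if entry is None or now >= entry[1]:
            return None  # miss or expired: caller must re-query upstream
        value, expires_at = entry
        return value, expires_at - now  # remaining TTL passed downstream


cache = ResolverCache()
cache.store("app.example.com.", "52.1.1.1", ttl=300, now=0)
print(cache.lookup("app.example.com.", now=120))  # ('52.1.1.1', 180)
print(cache.lookup("app.example.com.", now=290))  # ('52.1.1.1', 10)
```

The two lookups correspond to Client A and Client B: the answer is identical, but each client receives a different remaining lifetime.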
WHEN TTL IS IGNORED
  Java's InetAddress: Caches FOREVER by default when a security manager is installed; otherwise the default is implementation-specific (typically 30 seconds). Set networkaddress.cache.ttl=30 explicitly.
  Some ISP resolvers: Ignore low TTLs, cache longer
  Browser DNS cache:  Chrome: ~1 minute regardless
  Corporate proxies:  May cache aggressively
  Mobile carriers:    Often ignore TTL completely

4.2 TTL Strategy Guide
TTL STRATEGY BY USE CASE
HIGH TTL (3600-86400 seconds / 1h-24h)
Use when: Records rarely change, performance matters most

  MX records:             86400 (mail routing rarely changes)
  SPF/DKIM/DMARC:         3600  (email auth records)
  Static infrastructure:  3600  (on-prem servers)

  ✓ Fewer queries to authoritative servers
  ✓ Faster resolution for end users
  ✗ Changes take hours to propagate
  ✗ Failover is slow

MEDIUM TTL (60-300 seconds / 1-5 min)
Use when: Balance between performance and flexibility

  Production apps:  300 (good default)
  API endpoints:    120 (moderate change frequency)

  ✓ Reasonable cache hit rate
  ✓ Changes propagate within minutes
  ✗ More queries than high TTL

LOW TTL (10-60 seconds)
Use when: Fast failover required, frequent changes

  Active failover:     60 (1 minute max staleness)
  Blue-green deploys:  30 (fast cutover)
  During migrations:   10 (minimize risk window)

  ✓ Fast failover
  ✓ Rapid propagation
  ✗ High query volume
  ✗ Higher latency (more cache misses)
  ✗ Higher cost (DNS queries per second pricing)
PRE-MIGRATION TTL STRATEGY

  Day -7: Lower TTL from 3600 to 300 (wait for old caches to expire)
  Day -1: Lower TTL from 300 to 60
  Day 0:  Perform migration (change IP)
  Day +1: Verify everything works
  Day +7: Raise TTL back to 3600

  ⚠️ You must lower the TTL BEFORE the migration! If TTL is 3600, you need to wait 1 hour for the old TTL to expire before the low TTL takes effect.

Part 5: DNSSEC — Authenticating the Phonebook
5.1 The Problem DNSSEC Solves
DNS CACHE POISONING ATTACK
Without DNSSEC, DNS responses are NOT authenticated. An attacker can forge responses.

THE ATTACK (Kaminsky Attack, 2008)

  ┌──────┐                   ┌──────────┐            ┌──────────┐
  │Client│─── bank.com? ────→│ Resolver │───query───→│ Auth DNS │
  │      │                   │          │←──real─────│ "1.2.3.4"│
  │      │                   │          │            └──────────┘
  │      │                   │          │            ┌──────────┐
  │      │← bank.com =       │ Accepts  │←──FAKE─────│ Attacker │
  │      │  6.6.6.6 ─────────│ FAKE!    │            │ "6.6.6.6"│
  └──────┘  (attacker IP)    └──────────┘            └──────────┘

  Attacker races the real response with a forged one. If the forged response arrives first with the right transaction ID, the resolver caches the FAKE answer.

  Now EVERY client using that resolver gets sent to the attacker's server for bank.com, with a valid-looking certificate if the attacker also has one.
HOW DNSSEC PREVENTS THIS

  DNSSEC adds cryptographic signatures to DNS responses.

  Zone owner signs records with private key.
  Resolver verifies signatures with public key.
  Forged responses fail signature verification.

    example.com.  300  A  52.1.1.1

    RRSIG: A 13 2 300 20250101000000 20241201000000
           12345 example.com.
           <base64-encoded-signature>

  Resolver checks: Does the RRSIG match the A record using example.com's DNSKEY? If yes → trust. If no → reject.

5.2 DNSSEC Chain of Trust
DNSSEC CHAIN OF TRUST
  Root Zone (IANA)
    │  KSK → Signs → ZSK → Signs → DS records for .com
    ▼
  .com TLD (Verisign)
    │  KSK → Signs → ZSK → Signs → DS records for example.com
    ▼
  example.com (You)
       KSK → Signs → ZSK → Signs → A, AAAA, MX records

KEY TYPES
  KSK (Key Signing Key):
    Signs the DNSKEY set
    Registered as DS in parent zone
    Rotated infrequently (yearly+)

  ZSK (Zone Signing Key):
    Signs actual records
    Rotated more frequently (monthly)

RECORD TYPES
  DNSKEY:     Public keys for the zone
  RRSIG:      Signature over a record set
  DS:         Hash of child's KSK (in parent zone)
  NSEC/NSEC3: Proves a record does NOT exist (authenticated denial of existence)
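The "hash of child's KSK" in a DS record is a digest over the key's owner name in wire format followed by the DNSKEY record data, and the key tag is a simple 16-bit checksum of that record data. The sketch below follows those definitions using only the standard library; the all-zero public key is a placeholder chosen for illustration, not a valid key, and the function names are hypothetical.

```python
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name as length-prefixed lowercase labels plus a root byte."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def dnskey_rdata(flags, protocol, algorithm, public_key):
    """DNSKEY RDATA: 16-bit flags, 8-bit protocol, 8-bit algorithm, key bytes."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata):
    """16-bit checksum over the DNSKEY RDATA, used to identify a key quickly."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_digest_sha256(owner, rdata):
    """SHA-256 digest published in the parent zone's DS record."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


# Flags 257 marks a KSK; algorithm 13 matches the RRSIG example above.
rdata = dnskey_rdata(257, 3, 13, bytes(64))  # placeholder key material
print(key_tag(rdata), ds_digest_sha256("example.com.", rdata))
```

A validating resolver repeats this computation on the child's DNSKEY and compares the result with the DS record it fetched from the parent, which is what links one level of the chain to the next.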
DNSSEC ADOPTION (2025)
  .com zones signed:  ~5%   (surprisingly low!)
  .gov zones signed:  ~93%  (mandated by policy)
  .nl (Netherlands):  ~58%  (highest country TLD)
  .se (Sweden):       ~52%

  Why so low?
  - Operational complexity (key rotation)
  - Risk of self-inflicted outages (expired signatures)
  - Performance overhead (larger responses)
  - HTTPS/TLS already provides endpoint authentication
  - DNS-over-HTTPS (DoH) provides channel encryption

Part 6: DNS in Kubernetes
6.1 CoreDNS and Cluster DNS
KUBERNETES DNS ARCHITECTURE
Every Kubernetes cluster runs CoreDNS for internal resolution.
  Pod: my-app
    /etc/resolv.conf:
      nameserver 10.96.0.10   (CoreDNS ClusterIP)
      search default.svc.cluster.local
             svc.cluster.local
             cluster.local
      ndots: 5
        │
        ▼
  CoreDNS (kube-dns service: 10.96.0.10)
    Service records:
      my-svc.default.svc.cluster.local → ClusterIP
    Pod records:
      10-244-1-5.default.pod.cluster.local → Pod IP
    Headless service:
      my-svc.default.svc.cluster.local → Pod IPs
    External: Forward to upstream (/etc/resolv.conf)
THE ndots TRAP
  ndots:5 means any name with fewer than 5 dots gets the search domains appended FIRST.

  Query: api.example.com (2 dots, < 5)

  Resolution order:
  1. api.example.com.default.svc.cluster.local → NXDOMAIN
  2. api.example.com.svc.cluster.local → NXDOMAIN
  3. api.example.com.cluster.local → NXDOMAIN
  4. api.example.com. → SUCCESS!

  That's 3 WASTED queries for every external domain!
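The resolution order follows directly from the search list and the dot count. A short sketch of that expansion, with `candidate_names` as a hypothetical helper and the search list taken from the pod's resolv.conf shown earlier:

```python
def candidate_names(name, search, ndots=5):
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: search domains first


SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in candidate_names("api.example.com", SEARCH):
    print(candidate)
```

Running it with the default `ndots=5` prints the three cluster-suffixed names before `api.example.com.`, while `ndots=2` or a trailing dot puts the real name first.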
  Fix: Add trailing dot (api.example.com.) or lower ndots.
  In pod spec:
    dnsConfig:
      options:
        - name: ndots
          value: "2"

6.2 ExternalDNS — Bridging Cluster and Cloud DNS
EXTERNALDNS
ExternalDNS automatically creates DNS records in cloud providers based on Kubernetes resources.

  Kubernetes Cluster
    Ingress:
      host: app.example.com
      → backend: my-svc:80
    Service (type: LoadBalancer):
      external-ip: 52.1.1.1
    ExternalDNS controller
      watches Ingress/Service
      creates DNS records
        │
        ▼
  Route53 / Cloud DNS / Azure DNS
    app.example.com    A    52.1.1.1
    (auto-created and managed by ExternalDNS)
  Annotations for fine-tuning:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"
    external-dns.alpha.kubernetes.io/target: 52.1.1.1

Did You Know?
- The entire root zone file is only about 2MB. Despite being the foundation of the entire internet’s naming system, the root zone contains just around 1,500 TLD delegations. It is signed with DNSSEC and updated roughly twice daily by IANA.
- A single DNS query can trigger up to 23 separate lookups internally. Between CNAME chains, DNAME redirections, and DNSSEC validation (fetching DNSKEY, DS, and RRSIG records at each level), what looks like one query can cascade into a complex resolution tree. This is why recursive resolvers are some of the most heavily optimized software on the internet.
- Cloudflare’s 1.1.1.1 resolver was almost never usable. The IP address 1.1.1.1 was historically “squatted” by many networks that used it as a dummy or test address internally. When Cloudflare launched their resolver in April 2018, they spent months working with networks worldwide to stop hijacking traffic destined for 1.1.1.1. Some corporate networks still inadvertently block it today.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Using CNAME at zone apex | Violates RFC, breaks MX/TXT records | Use ALIAS/ANAME record or A record |
| TTL too high during migration | Users stuck on old IP for hours | Lower TTL days before migration |
| Ignoring Java DNS caching | JVM can cache DNS forever (the default when a security manager is installed) | Set networkaddress.cache.ttl=30 explicitly |
| No health checks on DNS failover | Traffic routed to dead endpoints | Always pair DNS routing with health checks |
| Kubernetes ndots:5 with external calls | 3-4x DNS queries for external domains | Lower ndots to 2 or use FQDN with trailing dot |
| Not monitoring DNS resolution time | Silent latency added to every request | Track DNS resolution time in application metrics |
| DNSSEC without automation | Expired signatures cause total outage | Use managed DNSSEC or automated key rotation |
| Single DNS provider | Provider outage = complete domain blackout | Use multi-provider DNS (NS records from 2+ providers) |
- Why can’t you use a CNAME record at the zone apex (example.com), and what alternatives exist?
Answer
Per RFC 1034, a CNAME record cannot coexist with any other record type at the same name. The zone apex always has SOA and NS records (and often MX and TXT records), so placing a CNAME there would violate the standard.
Alternatives:
- ALIAS record (Route53, DNSimple): The DNS server resolves the target internally and returns an A record to the client.
- ANAME record (PowerDNS, RFC draft): Similar concept, slightly different implementation.
- CNAME flattening (Cloudflare): Automatically resolves the CNAME chain and returns the final A/AAAA records at the apex.
- Static A records: Simply use A/AAAA records pointing to your load balancer IPs, though these must be updated if IPs change.
- Explain how Anycast routing works for DNS. Why is it well-suited for DNS but problematic for long-lived TCP connections?
Answer
Anycast assigns the same IP address to multiple servers in different locations. Each server advertises the same IP prefix via BGP. Internet routers direct each client to the “nearest” server based on BGP path selection (shortest AS-path, lowest cost).
It works well for DNS because:
- DNS primarily uses UDP (stateless, single request-response)
- Each query is independent — no session to maintain
- If a BGP route changes mid-session, the next query just goes to a different server
- DDoS traffic gets naturally distributed across all Anycast nodes
It is problematic for long-lived TCP because:
- TCP connections are stateful (maintained by a specific server)
- If BGP reconverges and changes the best path, packets for an existing TCP connection may be routed to a different server
- The new server has no knowledge of the connection, causing a reset
- Mitigation exists (ECMP pinning, connection migration) but adds complexity
- Your application has a TTL of 3600 seconds and you need to migrate to a new IP. Describe the step-by-step process and timeline.
Answer
Step-by-step migration process:
- Day -7: Lower TTL from 3600 to 300. Wait at least 3600 seconds (1 hour) for all caches holding the old TTL value to expire.
- Day -1: Lower TTL from 300 to 60. Wait at least 300 seconds (5 minutes) for caches to refresh.
- Day 0 (migration): Change the IP address in the DNS record. With TTL at 60 seconds, most clients will pick up the new IP within 1-2 minutes. Some stubborn caches (Java, mobile carriers) may take longer.
- Day +1 to +3: Monitor for any clients still hitting the old IP. Keep the old server running and redirecting if possible.
- Day +7: After confirming all traffic has migrated, raise TTL back to 3600.
Key insight: You must lower the TTL before the migration. If you change the IP while TTL is 3600, some clients will cache the old IP for up to one hour. The pre-migration TTL reduction ensures caches expire quickly when the actual change happens.
- What is the Kubernetes ndots:5 problem, and how would you optimize DNS for a pod that makes many external API calls?
Answer
The ndots:5 setting means any hostname with fewer than 5 dots is treated as a “relative” name, and the search domains are appended before trying the absolute name.
For a query like api.stripe.com (2 dots < 5 ndots):
1. api.stripe.com.default.svc.cluster.local -> NXDOMAIN
2. api.stripe.com.svc.cluster.local -> NXDOMAIN
3. api.stripe.com.cluster.local -> NXDOMAIN
4. api.stripe.com. -> SUCCESS
That is 3 wasted DNS queries for every external call, adding 3-15ms of latency.
Optimization options:
- Use FQDNs with trailing dot in application code: api.stripe.com.
- Lower ndots in the pod spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
- Use a NodeLocal DNSCache DaemonSet to cache responses locally on each node
- For high-throughput services, consider a local caching resolver sidecar
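To make the search-list expansion concrete, here is a minimal sketch that prints the candidate names in the order a typical stub resolver would try them. The function name `expand_queries` is hypothetical, and the logic is a simplification: a real resolver stops at the first successful answer and usually issues separate A and AAAA lookups for each candidate.

```shell
#!/bin/sh
# expand_queries NAME NDOTS SEARCH_DOMAIN...   (hypothetical helper)
# Prints the candidate names in lookup order.
expand_queries() {
    name=$1
    ndots=$2
    shift 2
    case $name in
        *.) echo "$name"; return 0 ;;   # trailing dot: absolute, search list skipped
    esac
    # Count the dots by comparing the length with and without them.
    stripped=$(printf '%s' "$name" | tr -d '.')
    dots=$(( ${#name} - ${#stripped} ))
    # Enough dots: the name is tried as absolute first.
    if [ "$dots" -ge "$ndots" ]; then echo "$name."; fi
    for domain in "$@"; do echo "$name.$domain"; done
    # Too few dots: the absolute name is tried only after the search list.
    if [ "$dots" -lt "$ndots" ]; then echo "$name."; fi
}

# Search list of a pod in the "default" namespace, with ndots:5.
expand_queries api.stripe.com 5 default.svc.cluster.local svc.cluster.local cluster.local
```

Calling the same function with an ndots value of 2, or with api.stripe.com. as the name, puts the absolute name first, which is the effect the first two optimizations rely on.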
- A company runs services in us-east-1 and eu-west-1. They want European users to hit eu-west-1 for GDPR compliance, but if eu-west-1 goes down, they’d rather serve from us-east-1 than go offline. Design the DNS routing policy.
Answer
This requires combining geolocation routing with failover:
- Primary policy: Geolocation routing
  - Europe (all countries) -> eu-west-1 endpoint
  - Default (everything else) -> us-east-1 endpoint
- Secondary policy: Failover per geo-target
  - Europe geolocation target:
    - Primary: eu-west-1 (with health check)
    - Secondary: us-east-1 (failover target)
  - Default geolocation target:
    - Primary: us-east-1 (with health check)
    - Secondary: eu-west-1 (failover target)
- Health checks
  - HTTP health check on /healthz for each endpoint
  - Check interval: 10 seconds
  - Failure threshold: 3 consecutive failures
  - Check from multiple locations
Important GDPR consideration: If eu-west-1 fails and European users are routed to us-east-1, the company must ensure their data processing in us-east-1 still complies with GDPR (valid transfer mechanism such as SCCs). The DNS failover solves availability but does not automatically solve compliance.
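The decision logic of this policy can be sketched in a few lines of shell. Both function names are hypothetical, the health inputs are plain strings rather than real probe results, and the sketch ignores details a managed DNS service would add, such as a separate recovery threshold.

```shell
#!/bin/sh
# health_from_checks THRESHOLD RESULT...   (hypothetical helper)
# RESULT is "ok" or "fail", oldest first. An endpoint counts as down only
# when the most recent THRESHOLD results are all failures.
health_from_checks() {
    threshold=$1
    shift
    streak=0
    for result in "$@"; do
        if [ "$result" = "fail" ]; then streak=$((streak + 1)); else streak=0; fi
    done
    if [ "$streak" -ge "$threshold" ]; then echo down; else echo up; fi
}

# route_target CLIENT_GEO EU_HEALTH US_HEALTH   (hypothetical helper)
# Geolocation picks the primary; failover picks the secondary if needed.
route_target() {
    geo=$1
    eu_health=$2
    us_health=$3
    if [ "$geo" = "europe" ]; then
        primary=eu-west-1;  primary_health=$eu_health
        secondary=us-east-1; secondary_health=$us_health
    else
        primary=us-east-1;  primary_health=$us_health
        secondary=eu-west-1; secondary_health=$eu_health
    fi
    if [ "$primary_health" = "up" ]; then
        echo "$primary"
    elif [ "$secondary_health" = "up" ]; then
        echo "$secondary"
    else
        echo "none"
        return 1
    fi
}

eu=$(health_from_checks 3 ok fail fail fail)   # three failures in a row -> down
us=$(health_from_checks 3 ok ok fail ok)       # isolated failure -> up
route_target europe "$eu" "$us"                # prints us-east-1
```

The final call shows the failover case from the question: a European client is sent to us-east-1 only because eu-west-1 has crossed the failure threshold.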
- Why is DNSSEC adoption still below 10% for .com domains despite being available for over a decade?
Answer
Several factors contribute to low DNSSEC adoption:
- Operational complexity: Key rotation (especially KSK rollover) requires careful coordination between the zone operator and the parent zone (registrar). A mistake can make the entire domain unreachable.
- Self-inflicted outages: Expired RRSIG signatures cause validating resolvers to reject the domain entirely. This is worse than no DNSSEC: you’ve turned a security feature into an availability risk.
- Limited perceived benefit: HTTPS/TLS already authenticates the endpoint. Even if DNS is spoofed, the attacker cannot present a valid TLS certificate for the domain. Many consider TLS sufficient.
- Performance overhead: DNSSEC responses are 5-10x larger due to signatures. This increases bandwidth, complicates UDP (may require TCP fallback), and adds CPU for signature verification.
- DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT): These encrypt the channel between the client and its recursive resolver, preventing on-path tampering on that hop without DNSSEC’s complexity. They do not authenticate the data the resolver itself receives from authoritative servers, but many operators treat them as good enough.
- Registrar support: Many registrars make DNSSEC configuration difficult or don’t support automated DS record management.
Government domains (.gov) show that mandates work — adoption there is ~93% — but without market pressure, commercial adoption remains low.
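The expired-signature failure mode is easy to check for. The sketch below is a hypothetical helper, `expired_rrsigs`, that reads answer lines in the format printed by `dig +dnssec +noall +answer` and reports RRSIG records whose expiration field is in the past. It assumes the expiration is shown as a 14-digit YYYYMMDDHHmmSS timestamp, and the sample records and signature strings are made up for illustration.

```shell
#!/bin/sh
# expired_rrsigs NOW   (hypothetical helper)
# NOW is a UTC timestamp in YYYYMMDDHHmmSS form. Reads answer lines on stdin.
# RRSIG fields: name ttl class RRSIG type-covered algorithm labels
#               original-ttl expiration inception key-tag signer signature
expired_rrsigs() {
    awk -v now="$1" '$4 == "RRSIG" && $9 < now { print $1, $5, "expired", $9 }'
}

# Made-up sample data: the A signature expired on 1 June, the AAAA one has not.
expired_rrsigs 20240615000000 <<'EOF'
example.com. 300 IN A 192.0.2.10
example.com. 300 IN RRSIG A 13 2 300 20240601000000 20240501000000 12345 example.com. c2lnbmF0dXJl
example.com. 300 IN RRSIG AAAA 13 2 300 20240701000000 20240601000000 12345 example.com. c2lnbmF0dXJl
EOF
```

Against a live zone, the same function could be fed from dig, for example `dig +dnssec +noall +answer example.com A | expired_rrsigs "$(date -u +%Y%m%d%H%M%S)"`. Alerting well before the expiration date, rather than after it, is what prevents the outage described above.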
Hands-On Exercise
Objective: Build a latency-based DNS routing setup with health checks and failover using a simulated multi-region architecture.
Environment: kind cluster + CoreDNS + custom DNS server
Part 1: Set Up the Multi-Region Simulation (20 minutes)
    # Create a kind cluster
    cat <<'EOF' > /tmp/dns-lab-cluster.yaml
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
      - role: control-plane
      - role: worker
        labels:
          region: us-east
      - role: worker
        labels:
          region: eu-west
    EOF

    kind create cluster --name dns-lab --config /tmp/dns-lab-cluster.yaml

Part 2: Deploy Region-Specific Applications (15 minutes)
    # Deploy "us-east" application
    cat <<'EOF' | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: app-us-east
      labels:
        app: webapp
        region: us-east
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: webapp
          region: us-east
      template:
        metadata:
          labels:
            app: webapp
            region: us-east
        spec:
          nodeSelector:
            region: us-east
          containers:
            - name: web
              image: nginx:1.27
              ports:
                - containerPort: 80
              readinessProbe:
                httpGet:
                  path: /
                  port: 80
                initialDelaySeconds: 3
                periodSeconds: 5
              volumeMounts:
                - name: html
                  mountPath: /usr/share/nginx/html
          volumes:
            - name: html
              configMap:
                name: us-east-html
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: us-east-html
    data:
      index.html: |
        <h1>Region: us-east-1</h1>
        <p>Latency: ~10ms from East Coast</p>
      healthz: "OK"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app-us-east
      labels:
        region: us-east
    spec:
      selector:
        app: webapp
        region: us-east
      ports:
        - port: 80
          targetPort: 80
      type: ClusterIP
    EOF

    # Deploy "eu-west" application
    cat <<'EOF' | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: app-eu-west
      labels:
        app: webapp
        region: eu-west
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: webapp
          region: eu-west
      template:
        metadata:
          labels:
            app: webapp
            region: eu-west
        spec:
          nodeSelector:
            region: eu-west
          containers:
            - name: web
              image: nginx:1.27
              ports:
                - containerPort: 80
              readinessProbe:
                httpGet:
                  path: /
                  port: 80
                initialDelaySeconds: 3
                periodSeconds: 5
              volumeMounts:
                - name: html
                  mountPath: /usr/share/nginx/html
          volumes:
            - name: html
              configMap:
                name: eu-west-html
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: eu-west-html
    data:
      index.html: |
        <h1>Region: eu-west-1</h1>
        <p>Latency: ~15ms from Western Europe</p>
      healthz: "OK"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app-eu-west
      labels:
        region: eu-west
    spec:
      selector:
        app: webapp
        region: eu-west
      ports:
        - port: 80
          targetPort: 80
      type: ClusterIP
    EOF

Part 3: Deploy a Custom DNS Router (25 minutes)
    # Deploy a CoreDNS instance acting as a "global traffic manager"
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: dns-router-config
    data:
      Corefile: |
        app.lab.local:5353 {
            log
            health :8080

            # Respond with different IPs based on query metadata
            # In production this would use geolocation/latency data
            # 192.0.2.10 is a placeholder (TEST-NET-1 documentation range);
            # substitute the address you want app.lab.local to resolve to.
            template IN A app.lab.local {
                answer "{{ .Name }} 60 IN A 192.0.2.10"
            }

            # Forward everything else to cluster DNS
            forward . /etc/resolv.conf
        }

        .:5353 {
            forward . /etc/resolv.conf
            log
        }
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: dns-router
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: dns-router
      template:
        metadata:
          labels:
            app: dns-router
        spec:
          containers:
            - name: coredns
              image: coredns/coredns:1.12.0
              args: ["-conf", "/etc/coredns/Corefile"]
              ports:
                - containerPort: 5353
                  protocol: UDP
                - containerPort: 5353
                  protocol: TCP
                - containerPort: 8080
                  protocol: TCP
              volumeMounts:
                - name: config
                  mountPath: /etc/coredns
          volumes:
            - name: config
              configMap:
                name: dns-router-config
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: dns-router
    spec:
      selector:
        app: dns-router
      ports:
        - port: 53
          targetPort: 5353
          protocol: UDP
          name: dns-udp
        - port: 53
          targetPort: 5353
          protocol: TCP
          name: dns-tcp
    EOF

Part 4: Health Check and Failover Testing (20 minutes)
    # Deploy a health checker that monitors both regions
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: health-checker-script
    data:
      check.sh: |
        #!/bin/sh
        echo "=== DNS Health Checker ==="
        echo ""

        while true; do
          # Probe each region's Service; any wget failure counts as unhealthy
          US_STATUS=$(wget -q -O /dev/null -T 2 http://app-us-east/healthz 2>&1 && echo "HEALTHY" || echo "UNHEALTHY")
          EU_STATUS=$(wget -q -O /dev/null -T 2 http://app-eu-west/healthz 2>&1 && echo "HEALTHY" || echo "UNHEALTHY")

          TIMESTAMP=$(date '+%H:%M:%S')
          echo "[$TIMESTAMP] us-east: $US_STATUS | eu-west: $EU_STATUS"

          if [ "$US_STATUS" = "UNHEALTHY" ] && [ "$EU_STATUS" = "UNHEALTHY" ]; then
            echo "  ⚠ BOTH REGIONS DOWN: no healthy endpoints!"
          elif [ "$US_STATUS" = "UNHEALTHY" ]; then
            echo "  → Failover: routing all traffic to eu-west"
          elif [ "$EU_STATUS" = "UNHEALTHY" ]; then
            echo "  → Failover: routing all traffic to us-east"
          else
            echo "  → Normal: latency-based routing active"
          fi

          sleep 5
        done
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: health-checker
    spec:
      containers:
        - name: checker
          image: busybox:1.37
          command: ["/bin/sh", "/scripts/check.sh"]
          volumeMounts:
            - name: scripts
              mountPath: /scripts
      volumes:
        - name: scripts
          configMap:
            name: health-checker-script
            defaultMode: 0755
    EOF

    # Watch health check output
    kubectl logs -f health-checker

Part 5: Simulate Region Failure (15 minutes)
    # Simulate us-east failure by scaling to 0
    echo "--- Simulating us-east failure ---"
    kubectl scale deployment app-us-east --replicas=0

    # Watch the health checker detect the failure
    kubectl logs -f health-checker --tail=10

    # Verify eu-west is still serving
    kubectl exec -it health-checker -- wget -qO- http://app-eu-west/

    # Restore us-east
    echo "--- Restoring us-east ---"
    kubectl scale deployment app-us-east --replicas=2

    # Watch recovery
    kubectl logs -f health-checker --tail=10

Part 6: Examine DNS Resolution Behavior (15 minutes)
    # Launch a debug pod
    kubectl run dns-debug --image=busybox:1.37 --rm -it -- sh

    # Inside the pod, examine DNS configuration
    cat /etc/resolv.conf

    # Observe the ndots problem
    # Count queries for an external domain
    nslookup -debug api.example.com 2>&1 | head -30

    # Compare with FQDN (trailing dot)
    nslookup api.example.com. 2>&1 | head -10

    # Test resolution timing
    time nslookup kubernetes.default.svc.cluster.local
    time nslookup google.com
    time nslookup google.com.  # With trailing dot: faster!

Clean Up
    kind delete cluster --name dns-lab

Success Criteria:
- Deployed region-specific applications on labeled nodes
- Health checker correctly detects regional health status
- Observed failover behavior when scaling a region to zero
- Verified recovery when the region comes back
- Examined the ndots:5 behavior and tested FQDN optimization
- Understood the difference between DNS routing policies
Further Reading
- “DNS and BIND” (5th Edition), Cricket Liu & Paul Albitz. The definitive reference for DNS internals and configuration.
- RFC 8499: DNS Terminology. The official glossary of DNS terms, clarifying decades of ambiguity.
- “The Anatomy of the Dyn DDoS Attack”. Dyn’s own post-mortem of the October 2016 attack, detailing how Anycast both helped and complicated recovery.
- Cloudflare Learning Center: DNS. Excellent free resource with interactive diagrams explaining DNS concepts from basics to advanced.
Key Takeaways
Before moving on, ensure you understand:
- ALIAS/ANAME solve the apex CNAME problem by resolving at the DNS server level, returning A records to clients
- Anycast assigns one IP to many servers using BGP, routing clients to the nearest node. Ideal for stateless protocols like DNS
- Traffic policies go beyond round robin: Weighted, latency-based, geolocation, and failover policies turn DNS into a global traffic manager
- Health checks are mandatory for DNS-based failover: Without them, DNS happily routes traffic to dead servers
- TTL is a promise that is frequently broken: Browsers, JVMs, ISPs, and mobile carriers all cache DNS differently. Pre-migration TTL lowering is essential
- DNSSEC authenticates but doesn’t encrypt: It prevents cache poisoning but does not hide your queries. DoH/DoT handle encryption
- Kubernetes ndots:5 multiplies external DNS queries: Use trailing dots or lower ndots for applications making many external calls
- ExternalDNS bridges cluster and cloud DNS: Automatically manages DNS records based on Kubernetes Ingress and Service resources
Next Module
Module 1.2: CDN & Edge Computing. How content delivery networks minimize latency by caching at the edge, and how edge compute is changing application architecture.