
Module 1.1: DNS at Scale & Global Traffic Management

Complexity: [COMPLEX]

Time to Complete: 3 hours

Prerequisites: Basic DNS (A/AAAA/CNAME records), Kubernetes Ingress concepts

Track: Foundations — Advanced Networking

After completing this module, you will be able to:

  1. Design DNS architectures for global traffic management using weighted routing, geolocation policies, and health-checked failover
  2. Diagnose DNS resolution failures by tracing queries through recursive resolvers, authoritative servers, and caching layers
  3. Implement DNS-based service discovery patterns and explain their tradeoffs compared to service mesh alternatives
  4. Evaluate DNS security risks (cache poisoning, DDoS amplification, hijacking) and apply DNSSEC, DoH, and split-horizon mitigations

July 22, 2016. A routine configuration update at Dyn, one of the world’s largest managed DNS providers, propagates a change to their Anycast network. Nothing unusual. But three months later, Dyn would learn a very different lesson about DNS at scale.

On October 21, 2016, the Mirai botnet unleashed a massive DDoS attack against Dyn’s infrastructure. Tens of millions of IP addresses, mostly from compromised IoT devices, flooded Dyn’s authoritative DNS servers. The attack was devastating not because Dyn was careless, but because DNS is the single most critical piece of internet infrastructure that almost everyone takes for granted.

Twitter, GitHub, Netflix, Reddit, Spotify, The New York Times — all went dark. Not because their servers were down, but because nobody could look up their IP addresses. It was like erasing every phone number from every phone book simultaneously.

The Dyn attack exposed what infrastructure engineers already knew: DNS is the first thing that happens in every connection and the last thing anyone thinks about until it breaks. This module teaches you to think about DNS the way the engineers who keep the internet running do — as a globally distributed, latency-sensitive, security-critical system that demands deliberate architecture.


Every single request your application serves begins with a DNS lookup. Before TLS handshakes, before HTTP requests, before any application logic — the client must resolve a hostname to an IP address. If that resolution is slow, everything is slow. If it fails, nothing works.

At scale, DNS stops being a simple lookup table and becomes a global traffic management system. It decides which datacenter serves your users. It detects failures and reroutes traffic. It balances load across continents. It enforces security policies before a single packet reaches your infrastructure.

Yet most engineers treat DNS as “set it and forget it.” They paste records into a web UI and wonder why their global application has mysterious latency spikes for users in certain regions, or why failover takes 20 minutes instead of 20 seconds.

The Air Traffic Control Analogy

Think of DNS like air traffic control. Every plane (request) needs to be told which runway (server) to land on. Good ATC considers weather (server health), fuel levels (client proximity), runway capacity (server load), and traffic patterns (routing policies). Bad ATC just assigns runways randomly and hopes for the best. DNS at scale is your application’s ATC system.


  • Advanced DNS record types beyond A/AAAA/CNAME (ALIAS, ANAME, CAA, SRV)
  • Anycast routing and why it matters for DNS
  • Traffic management policies: latency-based, weighted, geolocation, failover
  • DNSSEC: how it works and why adoption is still incomplete
  • TTL tuning and the caching trap that catches everyone
  • Hands-on: Building latency-based routing with health checks and failover

BASIC DNS RECORDS — QUICK REVIEW
═══════════════════════════════════════════════════════════════
A RECORD
─────────────────────────────────────────────────────────────
Maps hostname → IPv4 address
app.example.com. 300 IN A 203.0.113.10
AAAA RECORD
─────────────────────────────────────────────────────────────
Maps hostname → IPv6 address
app.example.com. 300 IN AAAA 2001:db8::1
CNAME RECORD
─────────────────────────────────────────────────────────────
Maps hostname → another hostname (alias)
www.example.com. 300 IN CNAME app.example.com.
⚠️ LIMITATION: CNAME cannot coexist with other records
at the same name (RFC 1034). This means you CANNOT
put a CNAME at the zone apex (example.com).
MX RECORD
─────────────────────────────────────────────────────────────
Maps hostname → mail server (with priority)
example.com. 300 IN MX 10 mail.example.com.
example.com. 300 IN MX 20 backup.example.com.
ADVANCED DNS RECORDS
═══════════════════════════════════════════════════════════════
ALIAS / ANAME RECORD (Provider-Specific)
─────────────────────────────────────────────────────────────
Solves the "CNAME at zone apex" problem.
Problem:
example.com. CNAME lb.cloud.com. ← ILLEGAL per RFC
example.com. A ??? ← Need dynamic IP
Solution: ALIAS/ANAME resolves at the DNS server level
example.com. ALIAS lb.us-east-1.elb.amazonaws.com.
How it works:
1. Client queries: example.com A?
2. DNS server resolves lb.us-east-1.elb.amazonaws.com → 52.1.2.3
3. DNS server returns: example.com A 52.1.2.3
┌──────────┐ ┌──────────────┐
│ Client │─── A example.com? ──→│ DNS Server │
│ │ │ │
│ │ │ Resolves │
│ │ │ ALIAS target│
│ │ │ internally │
│ │← A 52.1.2.3 ──────│ │
└──────────┘ └──────────────┘
⚠️ NOT standardized. Called "ALIAS" (Route53, DNSimple, PowerDNS),
"ANAME" (DNS Made Easy, IETF draft), "CNAME flattening"
(Cloudflare). Behavior varies by provider.
SRV RECORD
─────────────────────────────────────────────────────────────
Service discovery with port and priority.
Format: _service._protocol.name TTL IN SRV priority weight port target
_http._tcp.example.com. 300 IN SRV 10 60 8080 web1.example.com.
_http._tcp.example.com. 300 IN SRV 10 40 8080 web2.example.com.
_http._tcp.example.com. 300 IN SRV 20 0 8080 backup.example.com.
Priority 10 (lower = preferred): 60% to web1, 40% to web2
Priority 20 (fallback): backup only if priority 10 fails
Used by: Kubernetes services, LDAP, SIP, XMPP, MongoDB
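The priority and weight rules above can be sketched in a few lines of Python. This is a simplified illustration of client-side SRV selection, not a full RFC 2782 implementation; `SrvRecord`, `pick_srv`, and the `is_up` callback are hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    port: int
    target: str


def pick_srv(records, is_up=lambda r: True, rng=random):
    """Pick one target: lowest priority tier with a live member, then weighted random."""
    live = [r for r in records if is_up(r)]
    if not live:
        return None
    best = min(r.priority for r in live)
    tier = [r for r in live if r.priority == best]
    weights = [r.weight for r in tier]
    if sum(weights) == 0:
        # All weights zero (e.g. a lone backup): choose uniformly.
        return rng.choice(tier)
    return rng.choices(tier, weights=weights, k=1)[0]


# The three records from the example above.
RECORDS = [
    SrvRecord(10, 60, 8080, "web1.example.com."),
    SrvRecord(10, 40, 8080, "web2.example.com."),
    SrvRecord(20, 0, 8080, "backup.example.com."),
]
```

While web1 and web2 are reachable, `pick_srv(RECORDS)` only ever returns one of them, in roughly a 60/40 split; the backup is chosen only when `is_up` reports both priority-10 targets as down.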
CAA RECORD (Certificate Authority Authorization)
─────────────────────────────────────────────────────────────
Controls which CAs can issue certificates for your domain.
example.com. 300 IN CAA 0 issue "letsencrypt.org"
example.com. 300 IN CAA 0 issuewild "letsencrypt.org"
example.com. 300 IN CAA 0 iodef "mailto:security@example.com"
issue → Who can issue regular certs
issuewild → Who can issue wildcard certs
iodef → Where to report violations
Since Sept 2017, CAs MUST check CAA before issuing.
Missing CAA = any CA can issue (bad for security).
TXT RECORD (Verification & Policy)
─────────────────────────────────────────────────────────────
Free-form text, used heavily for verification and email auth.
example.com. 300 IN TXT "v=spf1 include:_spf.google.com ~all"
_dmarc.example.com. 300 IN TXT "v=DMARC1; p=reject; rua=..."
google._domainkey.example.com. 300 IN TXT "v=DKIM1; k=rsa; p=..."
SPF: Which servers can send email for your domain
DKIM: Cryptographic email signing
DMARC: What to do with failed SPF/DKIM checks
DNS RESOLUTION — THE FULL PICTURE
═══════════════════════════════════════════════════════════════
When you type "app.example.com" in your browser:
┌──────────┐ ┌───────────────┐ ┌──────────────┐
│ Browser │────→│ Stub Resolver │────→│ Recursive │
│ │ │ (OS-level) │ │ Resolver │
└──────────┘ └───────────────┘ │ (ISP/Cloud) │
└──────┬───────┘
┌─────────────────────────────┼──────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌──────────────┐
│ Root Server │ │ .com TLD │ │ example.com │
│ (13 groups) │ │ Server │ │ Authoritative│
└─────────────┘ └──────────┘ └──────────────┘
Step-by-step (uncached):
1. Browser checks its cache → miss
2. OS stub resolver checks /etc/hosts → miss
3. OS sends query to configured recursive resolver
4. Recursive resolver asks root: "Where is .com?"
5. Root says: "Try 192.5.6.30 (a.gtld-servers.net)"
6. Recursive asks .com TLD: "Where is example.com?"
7. TLD says: "Try 198.51.100.1 (ns1.example.com)"
8. Recursive asks authoritative: "What is app.example.com?"
9. Authoritative responds with the A record
10. Recursive caches result, returns to client
TOTAL UNCACHED LOOKUP TIME
─────────────────────────────────────────────────────────────
Root query: ~5-30ms (Anycast, nearby)
TLD query: ~10-50ms
Authoritative query: ~10-200ms (depends on location)
──────────────────────────────────────────────
Total: ~25-280ms for first lookup
Cached: ~0-5ms for subsequent lookups
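To make the cached versus uncached difference concrete, here is a toy model of the recursive resolver from the walkthrough. It is a sketch only: the three dictionaries stand in for the root, TLD, and authoritative servers, no network traffic is involved, and `ToyRecursiveResolver` is a hypothetical name.

```python
import time

# Hypothetical delegation data mirroring the step-by-step walkthrough.
ROOT = {"com.": "192.5.6.30"}                       # root -> .com TLD server
TLD = {"example.com.": "198.51.100.1"}              # .com -> authoritative server
AUTH = {"app.example.com.": ("203.0.113.10", 300)}  # name -> (A record, TTL)


class ToyRecursiveResolver:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.cache = {}            # name -> (address, expires_at)
        self.upstream_queries = 0  # how many servers we had to ask
        self.trace = []            # servers contacted, in order

    def resolve(self, name):
        hit = self.cache.get(name)
        if hit and hit[1] > self.clock():
            return hit[0]  # cache hit: no upstream traffic at all

        labels = name.rstrip(".").split(".")
        tld = labels[-1] + "."
        zone = ".".join(labels[-2:]) + "."

        self.upstream_queries += 1
        self.trace.append(("root", ROOT[tld]))      # steps 4-5
        self.upstream_queries += 1
        self.trace.append(("tld", TLD[zone]))       # steps 6-7
        self.upstream_queries += 1
        address, ttl = AUTH[name]                   # steps 8-9
        self.trace.append(("authoritative", address))

        self.cache[name] = (address, self.clock() + ttl)  # step 10
        return address
```

The first `resolve("app.example.com.")` costs three upstream queries; repeating it within the 300-second TTL costs none, which is the gap between the uncached and cached timings listed above.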

Part 2: Anycast — The Secret Behind Fast DNS

UNICAST vs ANYCAST
═══════════════════════════════════════════════════════════════
UNICAST (Traditional)
─────────────────────────────────────────────────────────────
One IP address → One physical server
Client in Tokyo ──────────────→ Server in Virginia
10,000+ km IP: 198.51.100.1
~150ms RTT
Every client goes to the same server, regardless of location.
ANYCAST
─────────────────────────────────────────────────────────────
One IP address → Multiple physical servers
BGP routing sends each client to the nearest one.
Client in Tokyo ───→ Server in Tokyo IP: 198.51.100.1
~5ms
Client in London ──→ Server in London IP: 198.51.100.1
~3ms
Client in NYC ─────→ Server in Virginia IP: 198.51.100.1
~8ms
Same IP, different physical servers!
HOW ANYCAST WORKS
─────────────────────────────────────────────────────────────
Multiple servers announce the same IP prefix via BGP.
┌────────────┐ ┌────────────┐
│ Tokyo PoP │ │ London PoP │
│ 198.51.100 │ │ 198.51.100 │
│ .0/24 ↑ │ │ .0/24 ↑ │
└────┬───────┘ └────┬───────┘
│ BGP announces │ BGP announces
│ 198.51.100.0/24 │ 198.51.100.0/24
▼ ▼
┌──────────────────────────────────────────────┐
│ Internet BGP Mesh │
│ │
│ Router sees same prefix from multiple │
│ origins → picks shortest AS-path │
└──────────────────────────────────────────────┘
WHY IT WORKS FOR DNS (AND CDN)
─────────────────────────────────────────────────────────────
✓ DNS is UDP-based → no long-lived connections
✓ Each query is independent → stateless
✓ Queries complete in milliseconds, so a BGP route change (convergence ~30-90s) rarely interrupts one
✓ Natural DDoS absorption: attack traffic gets distributed
⚠️ Less suitable for TCP (BGP route changes can break
established connections). Some providers handle this
with connection migration or short-lived flows.
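The "shortest AS-path wins" rule can be illustrated with a deliberately reduced model. Real BGP best-path selection evaluates several attributes (local preference and others) before and after AS-path length; this sketch assumes they are all equal. The AS numbers are from the documentation range and the route list is hypothetical.

```python
def best_route(announcements):
    """Pick the announcement with the shortest AS path (ties: first seen)."""
    return min(announcements, key=lambda a: len(a["as_path"]))


# Hypothetical view from one router near Tokyo: same prefix, two origins.
seen_by_tokyo_isp = [
    {"pop": "Tokyo", "prefix": "198.51.100.0/24", "as_path": [64500, 64510]},
    {"pop": "London", "prefix": "198.51.100.0/24",
     "as_path": [64500, 64501, 64502, 64510]},
]

chosen = best_route(seen_by_tokyo_isp)

# If the Tokyo PoP withdraws its announcement, the same IP silently
# starts landing in London: no DNS change, no client change.
after_withdrawal = best_route(seen_by_tokyo_isp[1:])
```

The withdrawal case is also why Anycast gives failover for free at the routing layer: a PoP that stops announcing the prefix simply drops out of the candidate list.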
ANYCAST DEPLOYMENTS — REAL NUMBERS
═══════════════════════════════════════════════════════════════
ROOT DNS SERVERS
─────────────────────────────────────────────────────────────
13 root server "identities" (a.root-servers.net through m)
But NOT 13 physical servers!
Total root server instances worldwide: 1,700+
All using Anycast to serve from nearest location.
Example: f.root-servers.net (ISC)
- 256+ instances across 6 continents
- All announce same 192.5.5.241 via Anycast
CLOUDFLARE DNS (1.1.1.1)
─────────────────────────────────────────────────────────────
330+ cities worldwide
Average response time: <11ms globally
All serving from the same 1.1.1.1 Anycast address
GOOGLE PUBLIC DNS (8.8.8.8)
─────────────────────────────────────────────────────────────
Anycast across Google's global network
~20+ PoP locations
Average response time: ~12ms

Part 3: Global Traffic Management with DNS

DNS-BASED TRAFFIC MANAGEMENT
═══════════════════════════════════════════════════════════════
Instead of returning a static IP, smart DNS returns DIFFERENT
IPs based on the client's context. This is how global load
balancing works at the DNS layer.
SIMPLE (Round Robin)
─────────────────────────────────────────────────────────────
Return all IPs, rotate the order.
Query 1: app.example.com → [10.0.1.1, 10.0.2.1, 10.0.3.1]
Query 2: app.example.com → [10.0.2.1, 10.0.3.1, 10.0.1.1]
Query 3: app.example.com → [10.0.3.1, 10.0.1.1, 10.0.2.1]
Pros: Dead simple
Cons: No health awareness, no locality, uneven distribution
WEIGHTED
─────────────────────────────────────────────────────────────
Return IPs based on configured weights.
us-east 70% → 52.1.1.1 (primary, more capacity)
eu-west 20% → 54.2.2.2 (secondary)
ap-south 10% → 13.3.3.3 (new region, testing)
Use cases:
- Gradual migration (90/10 → 80/20 → 50/50 → 0/100)
- Canary deployments
- Proportional to capacity
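A weighted policy is, at its core, a weighted random choice made per query. The sketch below uses the 70/20/10 split from the example and adds a helper for the gradual-migration schedule; `answer` and `migration_steps` are hypothetical names, and real providers apply the weights on the authoritative side in their own way.

```python
import random
from collections import Counter

# Weights from the example above (they need not sum to 100).
WEIGHTS = {"52.1.1.1": 70, "54.2.2.2": 20, "13.3.3.3": 10}


def answer(weights, rng=random):
    """Return one IP, chosen with probability proportional to its weight."""
    ips = list(weights)
    return rng.choices(ips, weights=[weights[ip] for ip in ips], k=1)[0]


def migration_steps(old_ip, new_ip, schedule=(10, 20, 50, 100)):
    """Yield weight tables that shift traffic from old_ip to new_ip."""
    for pct_new in schedule:
        yield {old_ip: 100 - pct_new, new_ip: pct_new}


def sample(weights, n=10_000, seed=1):
    """Simulate n queries and count which IP each one received."""
    rng = random.Random(seed)
    return Counter(answer(weights, rng) for _ in range(n))
```

Keep in mind that the split applies to DNS answers, not to requests: one heavily used recursive resolver caching a single answer can skew the real traffic distribution for the length of the TTL.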
LATENCY-BASED
─────────────────────────────────────────────────────────────
Return the IP with lowest latency to the client.
Client in Tokyo → ap-northeast-1 (13.3.3.3) ~5ms
Client in NYC → us-east-1 (52.1.1.1) ~8ms
Client in Berlin → eu-west-1 (54.2.2.2) ~12ms
How it works:
1. DNS provider continuously measures latency from
resolver locations to each endpoint
2. Maps client's resolver IP to nearest region
3. Returns the lowest-latency endpoint
⚠️ EDNS Client Subnet (ECS) improves accuracy.
Without ECS, latency is measured to the RESOLVER,
not the actual client. A user in Tokyo using 8.8.8.8
might get routed based on Google's resolver location.
GEOLOCATION
─────────────────────────────────────────────────────────────
Return IPs based on the client's geographic location.
Client in Germany → eu-central-1 (GDPR compliance!)
Client in China → cn-north-1 (content compliance!)
Client in Brazil → sa-east-1 (data sovereignty!)
Use cases:
- Regulatory compliance (data residency)
- Content licensing (geo-restricted media)
- Language-specific endpoints
Hierarchy: Continent → Country → State → City
Fallback if no match: default record
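The most-specific-match-then-default behaviour can be sketched as a small lookup. The country rules come from the example above; the continent-level Europe rule, the `DEFAULT` region, and the function name `route` are assumptions added for illustration.

```python
# Hypothetical policy table: (continent, country) -> region.
# A country of None means "anywhere on this continent".
POLICY = {
    ("EU", "DE"): "eu-central-1",
    ("AS", "CN"): "cn-north-1",
    ("SA", "BR"): "sa-east-1",
    ("EU", None): "eu-west-1",   # assumed continent-wide rule
}
DEFAULT = "us-east-1"            # assumed default record


def route(continent, country=None):
    """Try the country rule first, then the continent rule, then the default."""
    for key in ((continent, country), (continent, None)):
        if key in POLICY:
            return POLICY[key]
    return DEFAULT
```

With this table, `route("EU", "DE")` hits the country rule, `route("EU", "FR")` falls back to the continent rule, and `route("NA", "US")` falls through to the default. Without a default record, that last client would get no answer at all.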
GEOPROXIMITY
─────────────────────────────────────────────────────────────
Like geolocation, but with a tunable "bias" that shifts
the geographic boundary between regions.
Without bias:
┌─────────────────────────────────────────────┐
│ US-East │ US-West │
│ │ │
│ Chicago gets ←──│── boundary │
│ US-East │ │
└─────────────────────────────────────────────┘
With bias (US-West +25):
┌─────────────────────────────────────────────┐
│ US-East │ US-West │
│ │ │
│ Boundary ──→│ Chicago now gets │
│ shifted │ US-West! │
└─────────────────────────────────────────────┘
FAILOVER (Active-Passive)
─────────────────────────────────────────────────────────────
Return primary IP unless health check fails, then failover.
Normal:
app.example.com → 52.1.1.1 (primary, healthy ✓)
After primary fails health check:
app.example.com → 54.2.2.2 (secondary)
Health check configuration:
Protocol: HTTP/HTTPS/TCP
Path: /healthz
Interval: 10s
Threshold: 3 consecutive failures → failover
Recovery: 3 consecutive successes → failback
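The threshold logic above is a small state machine: the record flips only after N consecutive results that contradict its current state. A minimal sketch, assuming a single checker and the 3/3 thresholds from the configuration (`FailoverRecord` is a hypothetical name):

```python
class FailoverRecord:
    def __init__(self, primary, secondary, threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.threshold = threshold
        self.healthy = True  # current view of the primary
        self.streak = 0      # consecutive results contradicting that view

    def observe(self, check_passed):
        """Feed one health check result; return the IP DNS would now answer with."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with current state: reset
        else:
            self.streak += 1
            if self.streak >= self.threshold:
                self.healthy = not self.healthy  # failover or failback
                self.streak = 0
        return self.answer()

    def answer(self):
        return self.primary if self.healthy else self.secondary
```

Requiring consecutive results in both directions prevents flapping: a single passed check during an outage does not send traffic back to a primary that is still unstable.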
MULTIVALUE ANSWER
─────────────────────────────────────────────────────────────
Like round robin, but with health checks.
Returns up to 8 healthy IPs. Client picks one.
Healthy: [52.1.1.1 ✓, 54.2.2.2 ✓, 13.3.3.3 ✓]
Response: [52.1.1.1, 54.2.2.2, 13.3.3.3]
After 54.2.2.2 fails:
Response: [52.1.1.1, 13.3.3.3] (only healthy IPs)

3.2 Health Checks and Failover Architecture

HEALTH CHECK ARCHITECTURE
═══════════════════════════════════════════════════════════════
DNS health checks run from MULTIPLE locations to avoid
false positives from network issues.
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│Virginia│ │Oregon │ │Ireland │ │Tokyo │
│Checker │ │Checker │ │Checker │ │Checker │
└───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────────────────────────────────┐
│ Your Application │
│ 52.1.1.1:443/healthz │
└──────────────────────────────────────────────┘
Failure determination:
─────────────────────────────────────────────────
If 1/4 checkers fail → Network issue, ignore
If 2/4 checkers fail → Possible problem
If 3/4 checkers fail → Mark unhealthy, failover
If 4/4 checkers fail → Definitely down
CALCULATED HEALTH CHECKS (Composite)
─────────────────────────────────────────────────────────────
┌─── AND ──────────────────────────────────────┐
│ │
│ ┌─── OR ────────────────────────────┐ │
│ │ Web Server 1: /healthz ✓ │ │
│ │ Web Server 2: /healthz ✓ │ ✓ │
│ └───────────────────────────────────┘ │
│ │
│ ┌─── OR ────────────────────────────┐ │
│ │ DB Primary: port 5432 ✓ │ │
│ │ DB Replica: port 5432 ✓ │ ✓ │
│ └───────────────────────────────────┘ │
│ │ → HEALTHY ✓
└──────────────────────────────────────────────┘
Region is healthy only if BOTH web AND database
have at least one healthy instance.
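The AND-of-ORs structure in the diagram translates directly into code. This is a sketch of the evaluation logic only; the check names and results are hypothetical, and a managed DNS provider would evaluate this on its own side.

```python
def any_healthy(checks):
    """OR group: healthy if at least one member check passes."""
    return any(checks.values())


def region_healthy(*groups):
    """AND over OR groups: every tier needs at least one healthy member."""
    return all(any_healthy(group) for group in groups)


# Hypothetical check results for one region.
web = {"web1:/healthz": True, "web2:/healthz": False}
db = {"db-primary:5432": True, "db-replica:5432": True}

region_ok = region_healthy(web, db)  # one web server down, region still healthy
```

Losing a single web server leaves the region healthy, but losing both database endpoints marks the whole region unhealthy even if every web server still answers `/healthz`.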
FAILOVER TIMING BREAKDOWN
─────────────────────────────────────────────────────────────
Health check interval: 10s
Failure threshold: 3 checks
Time to detect failure: ~30s
DNS propagation (TTL dependent): 0-60s
Client cache expiry: 0-300s
─────────────────────────────────────────────────
Worst case total: ~6.5 minutes (30s + 60s + 300s)
Best case total: ~30 seconds
This is why low TTLs matter for failover!
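The arithmetic behind the breakdown can be written as a small helper, which makes it easy to see how each knob moves the failover window. The parameter names are assumptions; the defaults are the values from the table above.

```python
def failover_window(interval_s=10, threshold=3, resolver_ttl_s=60, client_cache_s=300):
    """Return (best_case_s, worst_case_s) until clients reach the secondary."""
    detect = interval_s * threshold  # time to notice the failure
    best = detect                    # every cache happens to expire immediately
    worst = detect + resolver_ttl_s + client_cache_s
    return best, worst


best, worst = failover_window()

# Same health checks, but with a client cache that honours a 60s TTL.
best_low, worst_low = failover_window(client_cache_s=60)
```

With the defaults this gives 30 seconds at best and 390 seconds at worst; capping the client-side cache at 60 seconds brings the worst case down to 150 seconds without touching the health check settings.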

TTL (TIME TO LIVE) — THE MISUNDERSTOOD SETTING
═══════════════════════════════════════════════════════════════
TTL tells resolvers how long to cache a DNS response.
app.example.com. 60 IN A 52.1.1.1
^^
TTL = 60 seconds
THE CACHING CHAIN
─────────────────────────────────────────────────────────────
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐
│ Browser │ │ OS Cache │ │ Recursive│ │ Auth │
│ Cache │──→│ Stub │──→│ Resolver │──→│ Server │
│ ~1-2min │ │ ~TTL │ │ ~TTL │ │ │
└──────────┘ └──────────┘ └──────────┘ └─────────┘
Each layer can cache independently!
THE COUNTDOWN PROBLEM
─────────────────────────────────────────────────────────────
t=0: Auth server sets TTL=300 (5 min)
t=0: Recursive resolver caches with TTL=300
t=120: Client A queries resolver → gets answer with TTL=180
t=120: Client A caches with TTL=180
t=290: Client B queries resolver → gets answer with TTL=10
t=290: Client B caches with TTL=10
Client A won't re-query for 3 minutes.
Client B will re-query in 10 seconds.
SAME answer, DIFFERENT freshness guarantees.
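The countdown behaviour falls out of how a resolver stores entries: it keeps an absolute expiry time and hands out whatever is left. A minimal sketch, with time passed in explicitly so the timeline above can be replayed (`CountdownCache` is a hypothetical name):

```python
class CountdownCache:
    """Resolver-side cache that hands out the *remaining* TTL."""

    def __init__(self):
        self.entries = {}  # name -> (value, expires_at)

    def store(self, name, value, ttl, now):
        self.entries[name] = (value, now + ttl)

    def lookup(self, name, now):
        """Return (value, remaining_ttl), or None if missing or expired."""
        value, expires_at = self.entries.get(name, (None, 0))
        remaining = expires_at - now
        if value is None or remaining <= 0:
            return None
        return value, remaining


resolver = CountdownCache()
resolver.store("app.example.com.", "52.1.1.1", ttl=300, now=0)

client_a = resolver.lookup("app.example.com.", now=120)  # remaining TTL: 180
client_b = resolver.lookup("app.example.com.", now=290)  # remaining TTL: 10
```

Both clients receive the same address, but client A will hold it for three minutes while client B re-queries after ten seconds, matching the timeline above.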
WHEN TTL IS IGNORED
─────────────────────────────────────────────────────────────
Java's InetAddress: Caches FOREVER when a SecurityManager is installed;
otherwise an implementation default (typically 30s).
Set networkaddress.cache.ttl explicitly (e.g. 30).
Some ISP resolvers: Ignore low TTLs, cache longer
Browser DNS cache: Chrome: ~1 minute regardless
Corporate proxies: May cache aggressively
Mobile carriers: Often ignore TTL completely
TTL STRATEGY BY USE CASE
═══════════════════════════════════════════════════════════════
HIGH TTL (3600-86400 seconds / 1h-24h)
─────────────────────────────────────────────────────────────
Use when: Records rarely change, performance matters most
MX records: 86400 (mail routing rarely changes)
SPF/DKIM/DMARC: 3600 (email auth records)
Static infrastructure: 3600 (on-prem servers)
✓ Fewer queries to authoritative servers
✓ Faster resolution for end users
✗ Changes take hours to propagate
✗ Failover is slow
MEDIUM TTL (60-300 seconds / 1-5 min)
─────────────────────────────────────────────────────────────
Use when: Balance between performance and flexibility
Production apps: 300 (good default)
API endpoints: 120 (moderate change frequency)
✓ Reasonable cache hit rate
✓ Changes propagate within minutes
✗ More queries than high TTL
LOW TTL (10-60 seconds)
─────────────────────────────────────────────────────────────
Use when: Fast failover required, frequent changes
Active failover: 60 (1 minute max staleness)
Blue-green deploys: 30 (fast cutover)
During migrations: 10 (minimize risk window)
✓ Fast failover
✓ Rapid propagation
✗ High query volume
✗ Higher latency (more cache misses)
✗ Higher cost (DNS queries per second pricing)
PRE-MIGRATION TTL STRATEGY
─────────────────────────────────────────────────────────────
Day -7: Lower TTL from 3600 to 300
(wait for old caches to expire)
Day -1: Lower TTL from 300 to 60
Day 0: Perform migration (change IP)
Day +1: Verify everything works
Day +7: Raise TTL back to 3600
⚠️ You must lower the TTL BEFORE the migration!
If TTL is 3600, you need to wait 1 hour for the
old TTL to expire before the low TTL takes effect.

Part 5: DNSSEC — Authenticating the Phonebook

DNS CACHE POISONING ATTACK
═══════════════════════════════════════════════════════════════
Without DNSSEC, DNS responses are NOT authenticated.
An attacker can forge responses.
THE ATTACK (Kaminsky Attack, 2008)
─────────────────────────────────────────────────────────────
┌──────┐ ┌──────────┐ ┌──────────┐
│Client│──── bank.com? ──→│ Resolver │──query──→│Auth DNS │
│ │ │ │ │ │
│ │ │ │←─real───│ "1.2.3.4"│
│ │ │ │ └──────────┘
│ │ │ │
│ │ │ │←─FAKE───┐
│ │← bank.com = │ Accepts │ │ Attacker
│ │ 6.6.6.6 ──────│ FAKE! │ │ "6.6.6.6"
└──────┘ (attacker IP) └──────────┘ └──────────┘
Attacker races the real response with a forged one.
If the forged response arrives first with the right
transaction ID, the resolver caches the FAKE answer.
Now EVERY client using that resolver gets sent to
the attacker's server. For bank.com. With a valid-
looking certificate (if attacker also has a cert).
HOW DNSSEC PREVENTS THIS
─────────────────────────────────────────────────────────────
DNSSEC adds cryptographic signatures to DNS responses.
Zone owner signs records with private key.
Resolver verifies signatures with public key.
Forged responses fail signature verification.
┌──────────────────────────────────────────────────────┐
│ example.com. 300 A 52.1.1.1 │
│ │
│ RRSIG: A 13 2 300 20250101000000 20241201000000 │
│ 12345 example.com. │
│ <base64-encoded-signature> │
└──────────────────────────────────────────────────────┘
Resolver checks: Does the RRSIG match the A record
using example.com's DNSKEY? If yes → trust. If no → reject.
DNSSEC CHAIN OF TRUST
═══════════════════════════════════════════════════════════════
Root Zone (IANA)
│ KSK → Signs → ZSK → Signs → DS records for .com
.com TLD (Verisign)
│ KSK → Signs → ZSK → Signs → DS records for example.com
example.com (You)
KSK → Signs → ZSK → Signs → A, AAAA, MX records
KEY TYPES
─────────────────────────────────────────────────────────────
KSK (Key Signing Key): Signs the DNSKEY set
Registered as DS in parent zone
Rotated infrequently (yearly+)
ZSK (Zone Signing Key): Signs actual records
Rotated more frequently (monthly)
RECORD TYPES
─────────────────────────────────────────────────────────────
DNSKEY: Public keys for the zone
RRSIG: Signature over a record set
DS: Hash of child's KSK (in parent zone)
NSEC/NSEC3: Proves a record does NOT exist
(authenticated denial of existence)
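To show what "hash of child's KSK" means in practice, the sketch below computes a SHA-256 DS digest the way RFC 4034 describes it: hash the owner name in wire format followed by the DNSKEY RDATA (flags, protocol, algorithm, public key). The key bytes are a placeholder, not a real key, and the helper names are hypothetical.

```python
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in canonical (lowercase) DNS wire format."""
    wire = b""
    for label in name.lower().rstrip(".").split("."):
        if label:
            wire += bytes([len(label)]) + label.encode("ascii")
    return wire + b"\x00"  # root label terminates the name


def ds_digest_sha256(owner, flags, protocol, algorithm, public_key):
    """Digest for a DS record with digest type 2 (SHA-256)."""
    rdata = struct.pack("!HBB", flags, protocol, algorithm) + public_key
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


# Placeholder bytes standing in for a 64-byte ECDSA P-256 public key.
FAKE_KSK = bytes(range(64))

# flags=257 marks a KSK, protocol is always 3, algorithm 13 is ECDSA P-256.
digest = ds_digest_sha256("example.com.", 257, 3, 13, FAKE_KSK)
```

Because the parent zone publishes only this digest, rolling the KSK means the DS record in the parent must be updated as well, which is the coordination step that makes KSK rollovers risky.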
DNSSEC ADOPTION (2025)
─────────────────────────────────────────────────────────────
.com zones signed: ~5% (surprisingly low!)
.gov zones signed: ~93% (mandated by policy)
.nl (Netherlands): ~58% (among the highest country TLDs)
.se (Sweden): ~52%
Why so low?
- Operational complexity (key rotation)
- Risk of self-inflicted outages (expired signatures)
- Performance overhead (larger responses)
- HTTPS/TLS already provides endpoint authentication
- DNS-over-HTTPS (DoH) provides channel encryption

KUBERNETES DNS ARCHITECTURE
═══════════════════════════════════════════════════════════════
Every Kubernetes cluster runs CoreDNS for internal resolution.
┌────────────────────────────────────────────────────┐
│ Pod: my-app │
│ /etc/resolv.conf: │
│ nameserver 10.96.0.10 (CoreDNS ClusterIP) │
│ search default.svc.cluster.local │
│ svc.cluster.local │
│ cluster.local │
│ options ndots:5 │
└────────────────┬───────────────────────────────────┘
┌────────────────────────────────────────────────────┐
│ CoreDNS (kube-dns service: 10.96.0.10) │
│ │
│ Service records: │
│ my-svc.default.svc.cluster.local → ClusterIP │
│ │
│ Pod records: │
│ 10-244-1-5.default.pod.cluster.local → Pod IP │
│ │
│ Headless service: │
│ my-svc.default.svc.cluster.local → Pod IPs │
│ │
│ External: Forward to upstream (/etc/resolv.conf) │
└────────────────────────────────────────────────────┘
THE ndots TRAP
─────────────────────────────────────────────────────────────
ndots:5 means any name with fewer than 5 dots gets the
search domains appended FIRST.
Query: api.example.com (2 dots, < 5)
Resolution order:
1. api.example.com.default.svc.cluster.local → NXDOMAIN
2. api.example.com.svc.cluster.local → NXDOMAIN
3. api.example.com.cluster.local → NXDOMAIN
4. api.example.com. → SUCCESS!
That's 3 WASTED queries for every external domain!
Fix: Add trailing dot (api.example.com.) or lower ndots.
In pod spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
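The search-list expansion is easy to reproduce. The sketch below approximates how a stub resolver orders its candidates for a given ndots value; it models only the ordering, not the actual lookups, and `candidate_names` is a hypothetical helper.

```python
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_names(name, search=SEARCH, ndots=5):
    """Return the names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search list at all
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + searched  # enough dots: try as-is first
    return searched + [absolute]      # too few dots: search domains first


default_order = candidate_names("api.example.com")          # ndots=5
tuned_order = candidate_names("api.example.com", ndots=2)   # lowered ndots
with_dot = candidate_names("api.example.com.")              # trailing dot
```

With ndots:5 the real name comes last, after three search-domain attempts. With ndots:2 it comes first, and with a trailing dot it is the only candidate. Short in-cluster names such as `my-svc` still go through the search list in every case, so service discovery keeps working.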

6.2 ExternalDNS — Bridging Cluster and Cloud DNS

EXTERNALDNS
═══════════════════════════════════════════════════════════════
ExternalDNS automatically creates DNS records in cloud
providers based on Kubernetes resources.
┌──────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ Ingress: │
│ host: app.example.com ──────┐ │
│ → backend: my-svc:80 │ │
│ │ │
│ Service (type: LoadBalancer): │ │
│ external-ip: 52.1.1.1 │ │
│ │ │
│ ExternalDNS controller ◄─────┘ │
│ watches Ingress/Service │
│ creates DNS records ──────────────────┐ │
└──────────────────────────────┤ │ │
┌──────────────────────────────────────────────┐
│ Route53 / Cloud DNS / Azure DNS │
│ │
│ app.example.com A 52.1.1.1 │
│ (auto-created and managed by ExternalDNS) │
└──────────────────────────────────────────────┘
Annotations for fine-tuning:
─────────────────────────────────────────────────
external-dns.alpha.kubernetes.io/hostname: app.example.com
external-dns.alpha.kubernetes.io/ttl: "60"
external-dns.alpha.kubernetes.io/target: 52.1.1.1

  • The entire root zone file is only about 2MB. Despite being the foundation of the entire internet’s naming system, the root zone contains just around 1,500 TLD delegations. It is signed with DNSSEC and updated roughly twice daily by IANA.

  • A single DNS query can trigger up to 23 separate lookups internally. Between CNAME chains, DNAME redirections, and DNSSEC validation (fetching DNSKEY, DS, and RRSIG records at each level), what looks like one query can cascade into a complex resolution tree. This is why recursive resolvers are some of the most heavily optimized software on the internet.

  • Cloudflare’s 1.1.1.1 resolver almost wasn’t usable at launch. The IP address 1.1.1.1 was historically “squatted” by many networks that used it as a dummy or test address internally. When Cloudflare launched their resolver in April 2018, they spent months working with networks worldwide to stop hijacking traffic destined for 1.1.1.1. Some corporate networks still inadvertently block it today.


| Mistake | Problem | Solution |
| --- | --- | --- |
| Using CNAME at zone apex | Violates RFC, breaks MX/TXT records | Use ALIAS/ANAME record or A record |
| TTL too high during migration | Users stuck on old IP for hours | Lower TTL days before migration |
| Ignoring Java DNS caching | JVM can cache DNS forever (the default when a SecurityManager is installed) | Set networkaddress.cache.ttl=30 explicitly |
| No health checks on DNS failover | Traffic routed to dead endpoints | Always pair DNS routing with health checks |
| Kubernetes ndots:5 with external calls | 3-4x DNS queries for external domains | Lower ndots to 2 or use FQDN with trailing dot |
| Not monitoring DNS resolution time | Silent latency added to every request | Track DNS resolution time in application metrics |
| DNSSEC without automation | Expired signatures cause total outage | Use managed DNSSEC or automated key rotation |
| Single DNS provider | Provider outage = complete domain blackout | Use multi-provider DNS (NS records from 2+ providers) |

  1. Why can’t you use a CNAME record at the zone apex (example.com), and what alternatives exist?

    Answer

    Per RFC 1034, a CNAME record cannot coexist with any other record type at the same name. The zone apex always has SOA and NS records (and often MX and TXT records), so placing a CNAME there would violate the standard.

    Alternatives:

    1. ALIAS record (Route53, DNSimple): The DNS server resolves the target internally and returns an A record to the client.
    2. ANAME record (DNS Made Easy, IETF draft): Similar concept, slightly different implementation.
    3. CNAME flattening (Cloudflare): Automatically resolves the CNAME chain and returns the final A/AAAA records at the apex.
    4. Static A records: Simply use A/AAAA records pointing to your load balancer IPs, though these must be updated if IPs change.
  2. Explain how Anycast routing works for DNS. Why is it well-suited for DNS but problematic for long-lived TCP connections?

    Answer

    Anycast assigns the same IP address to multiple servers in different locations. Each server advertises the same IP prefix via BGP. Internet routers direct each client to the “nearest” server based on BGP path selection (shortest AS-path, lowest cost).

    It works well for DNS because:

    • DNS primarily uses UDP (stateless, single request-response)
    • Each query is independent — no session to maintain
    • If a BGP route changes mid-session, the next query just goes to a different server
    • DDoS traffic gets naturally distributed across all Anycast nodes

    It is problematic for long-lived TCP because:

    • TCP connections are stateful (maintained by a specific server)
    • If BGP reconverges and changes the best path, packets for an existing TCP connection may be routed to a different server
    • The new server has no knowledge of the connection, causing a reset
    • Mitigation exists (ECMP pinning, connection migration) but adds complexity
  3. Your application has a TTL of 3600 seconds and you need to migrate to a new IP. Describe the step-by-step process and timeline.

    Answer

    Step-by-step migration process:

    1. Day -7: Lower TTL from 3600 to 300. Wait at least 3600 seconds (1 hour) for all caches holding the old TTL value to expire.

    2. Day -1: Lower TTL from 300 to 60. Wait at least 300 seconds (5 minutes) for caches to refresh.

    3. Day 0 (migration): Change the IP address in the DNS record. With TTL at 60 seconds, most clients will pick up the new IP within 1-2 minutes. Some stubborn caches (Java, mobile carriers) may take longer.

    4. Day +1 to +3: Monitor for any clients still hitting the old IP. Keep the old server running and redirecting if possible.

    5. Day +7: After confirming all traffic has migrated, raise TTL back to 3600.

    Key insight: You must lower the TTL before the migration. If you change the IP while TTL is 3600, some clients will cache the old IP for up to one hour. The pre-migration TTL reduction ensures caches expire quickly when the actual change happens.

  4. What is the Kubernetes ndots:5 problem, and how would you optimize DNS for a pod that makes many external API calls?

    Answer

    The ndots:5 setting means any hostname with fewer than 5 dots is treated as a “relative” name, and the search domains are appended before trying the absolute name.

    For a query like api.stripe.com (2 dots < 5 ndots):

    1. api.stripe.com.default.svc.cluster.local -> NXDOMAIN
    2. api.stripe.com.svc.cluster.local -> NXDOMAIN
    3. api.stripe.com.cluster.local -> NXDOMAIN
    4. api.stripe.com. -> SUCCESS

    That is 3 wasted DNS queries for every external call, adding 3-15ms of latency.

    Optimization options:

    1. Use FQDNs with trailing dot in application code: api.stripe.com.
    2. Lower ndots in the pod spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
    3. Use a NodeLocal DNSCache DaemonSet to cache responses locally on each node
    4. For high-throughput services, consider a local caching resolver sidecar
  5. A company runs services in us-east-1 and eu-west-1. They want European users to hit eu-west-1 for GDPR compliance, but if eu-west-1 goes down, they’d rather serve from us-east-1 than go offline. Design the DNS routing policy.

    Answer

    This requires combining geolocation routing with failover:

    1. Primary policy: Geolocation routing

      • Europe (all countries) -> eu-west-1 endpoint
      • Default (everything else) -> us-east-1 endpoint
    2. Secondary policy: Failover per geo-target

      • Europe geolocation target:
        • Primary: eu-west-1 (with health check)
        • Secondary: us-east-1 (failover target)
      • Default geolocation target:
        • Primary: us-east-1 (with health check)
        • Secondary: eu-west-1 (failover target)
    3. Health checks

      • HTTP health check on /healthz for each endpoint
      • Check interval: 10 seconds
      • Failure threshold: 3 consecutive failures
      • Check from multiple locations
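The two-layer policy above can be modeled as a lookup followed by a health-aware choice. This is a minimal sketch, not any provider's API: the `POLICY` table, the `"EU"` continent code, and `pick_endpoint` are hypothetical names chosen for illustration.

```python
from typing import Optional

# Hypothetical policy table: (primary, secondary) per geolocation target.
POLICY = {
    "EU": ("eu-west-1", "us-east-1"),
    "DEFAULT": ("us-east-1", "eu-west-1"),
}


def pick_endpoint(continent: str, healthy: dict[str, bool]) -> Optional[str]:
    """Geolocation selects the record set; failover selects within it."""
    primary, secondary = POLICY.get(continent, POLICY["DEFAULT"])
    if healthy.get(primary, False):
        return primary
    if healthy.get(secondary, False):
        return secondary
    return None  # both regions down: nothing healthy to answer with


all_up = {"eu-west-1": True, "us-east-1": True}
eu_down = {"eu-west-1": False, "us-east-1": True}

print(pick_endpoint("EU", all_up))   # eu-west-1
print(pick_endpoint("EU", eu_down))  # us-east-1
print(pick_endpoint("NA", eu_down))  # us-east-1
```

Keeping the geolocation lookup separate from the failover choice mirrors how the policy is described: each geo target carries its own primary and secondary.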

    Important GDPR consideration: If eu-west-1 fails and European users are routed to us-east-1, the company must ensure their data processing in us-east-1 still complies with GDPR (valid transfer mechanism such as SCCs). The DNS failover solves availability but does not automatically solve compliance.

  6. Why is DNSSEC adoption still below 10% for .com domains despite being available for over a decade?

    Answer

    Several factors contribute to low DNSSEC adoption:

    1. Operational complexity: Key rotation (especially KSK rollover) requires careful coordination between the zone operator and the parent zone (registrar). A mistake can make the entire domain unreachable.

    2. Self-inflicted outages: Expired RRSIG signatures cause validating resolvers to reject the domain entirely. This is worse than no DNSSEC — you’ve turned a security feature into an availability risk.

    3. Limited perceived benefit: HTTPS/TLS already authenticates the endpoint. Even if DNS is spoofed, the attacker cannot present a valid TLS certificate for the domain. Many consider TLS sufficient.

    4. Performance overhead: DNSSEC responses are 5-10x larger due to signatures. This increases bandwidth, complicates UDP (may require TCP fallback), and adds CPU for signature verification.

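    5. DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT): These encrypt the channel between the client and its recursive resolver, which blocks on-path tampering on that hop without DNSSEC’s complexity. They do not authenticate the data the resolver itself received upstream, but many operators still treat them as good enough.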

    6. Registrar support: Many registrars make DNSSEC configuration difficult or don’t support automated DS record management.

    Government domains (.gov) show that mandates work — adoption there is ~93% — but without market pressure, commercial adoption remains low.
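Because expired RRSIG signatures are one of the main ways a signed zone breaks, operators often monitor how long signatures have left. The sketch below assumes the expiration value is in the `YYYYMMDDHHmmSS` UTC presentation form that tools such as `dig` print; the 7-day margin and the function names are hypothetical choices, not a recommendation from any standard.

```python
from datetime import datetime, timedelta, timezone

RRSIG_TIME_FORMAT = "%Y%m%d%H%M%S"  # presentation format, interpreted as UTC


def parse_rrsig_time(value: str) -> datetime:
    """Parse an RRSIG inception/expiration field into an aware datetime."""
    return datetime.strptime(value, RRSIG_TIME_FORMAT).replace(tzinfo=timezone.utc)


def time_until_expiry(expiration: str, now: datetime) -> timedelta:
    """Time remaining before validating resolvers start rejecting the signature."""
    return parse_rrsig_time(expiration) - now


def needs_resigning(expiration: str, now: datetime,
                    margin: timedelta = timedelta(days=7)) -> bool:
    """True when the signature expires within the safety margin."""
    return time_until_expiry(expiration, now) <= margin


# Fixed, hypothetical "now" so the example is reproducible.
now = datetime(2024, 3, 1, tzinfo=timezone.utc)

print(time_until_expiry("20240305000000", now))  # 4 days, 0:00:00
print(needs_resigning("20240305000000", now))    # True
print(needs_resigning("20240401000000", now))    # False
```

An alert driven by a check like this gives time to re-sign the zone before resolvers start returning failures.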


Objective: Build a latency-based DNS routing setup with health checks and failover using a simulated multi-region architecture.

Environment: kind cluster + CoreDNS + custom DNS server

Part 1: Set Up the Multi-Region Simulation (20 minutes)

# Create a kind cluster
cat <<'EOF' > /tmp/dns-lab-cluster.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  labels:
    region: us-east
- role: worker
  labels:
    region: eu-west
EOF
kind create cluster --name dns-lab --config /tmp/dns-lab-cluster.yaml

Part 2: Deploy Region-Specific Applications (15 minutes)

# Deploy "us-east" application
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-us-east
  labels:
    app: webapp
    region: us-east
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
      region: us-east
  template:
    metadata:
      labels:
        app: webapp
        region: us-east
    spec:
      nodeSelector:
        region: us-east
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 3
          periodSeconds: 5
        volumeMounts:
        - name: html
          mountPath: /usr/share/nginx/html
      volumes:
      - name: html
        configMap:
          name: us-east-html
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: us-east-html
data:
  index.html: |
    <h1>Region: us-east-1</h1>
    <p>Latency: ~10ms from East Coast</p>
  healthz: "OK"
---
apiVersion: v1
kind: Service
metadata:
  name: app-us-east
  labels:
    region: us-east
spec:
  selector:
    app: webapp
    region: us-east
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP
EOF
# Deploy "eu-west" application
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-eu-west
  labels:
    app: webapp
    region: eu-west
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
      region: eu-west
  template:
    metadata:
      labels:
        app: webapp
        region: eu-west
    spec:
      nodeSelector:
        region: eu-west
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 3
          periodSeconds: 5
        volumeMounts:
        - name: html
          mountPath: /usr/share/nginx/html
      volumes:
      - name: html
        configMap:
          name: eu-west-html
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: eu-west-html
data:
  index.html: |
    <h1>Region: eu-west-1</h1>
    <p>Latency: ~15ms from Western Europe</p>
  healthz: "OK"
---
apiVersion: v1
kind: Service
metadata:
  name: app-eu-west
  labels:
    region: eu-west
spec:
  selector:
    app: webapp
    region: eu-west
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP
EOF

Part 3: Deploy a Custom DNS Router (25 minutes)

# Deploy a CoreDNS instance acting as a "global traffic manager"
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-router-config
data:
  Corefile: |
    app.lab.local:5353 {
        log
        health :8080
        # Respond with different IPs based on query metadata
        # In production this would use geolocation/latency data
        template IN A app.lab.local {
            # 192.0.2.10 is a documentation placeholder (TEST-NET-1), not a real
            # endpoint; substitute the ClusterIP of app-us-east or app-eu-west
            answer "{{ .Name }} 60 IN A 192.0.2.10"
        }
        # Forward everything else to cluster DNS
        forward . /etc/resolv.conf
    }
    .:5353 {
        forward . /etc/resolv.conf
        log
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dns-router
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dns-router
  template:
    metadata:
      labels:
        app: dns-router
    spec:
      containers:
      - name: coredns
        image: coredns/coredns:1.12.0
        args: ["-conf", "/etc/coredns/Corefile"]
        ports:
        - containerPort: 5353
          protocol: UDP
        - containerPort: 5353
          protocol: TCP
        - containerPort: 8080
          protocol: TCP
        volumeMounts:
        - name: config
          mountPath: /etc/coredns
      volumes:
      - name: config
        configMap:
          name: dns-router-config
---
apiVersion: v1
kind: Service
metadata:
  name: dns-router
spec:
  selector:
    app: dns-router
  ports:
  - port: 53
    targetPort: 5353
    protocol: UDP
    name: dns-udp
  - port: 53
    targetPort: 5353
    protocol: TCP
    name: dns-tcp
EOF

Part 4: Health Check and Failover Testing (20 minutes)

# Deploy a health checker that monitors both regions
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: health-checker-script
data:
  check.sh: |
    #!/bin/sh
    echo "=== DNS Health Checker ==="
    echo ""
    while true; do
      # Discard wget's stderr so that only HEALTHY or UNHEALTHY is captured;
      # with 2>&1 an error message would land in the variable and break the tests below
      US_STATUS=$(wget -q -O /dev/null -T 2 http://app-us-east/healthz 2>/dev/null && echo "HEALTHY" || echo "UNHEALTHY")
      EU_STATUS=$(wget -q -O /dev/null -T 2 http://app-eu-west/healthz 2>/dev/null && echo "HEALTHY" || echo "UNHEALTHY")
      TIMESTAMP=$(date '+%H:%M:%S')
      echo "[$TIMESTAMP] us-east: $US_STATUS | eu-west: $EU_STATUS"
      if [ "$US_STATUS" = "UNHEALTHY" ] && [ "$EU_STATUS" = "UNHEALTHY" ]; then
        echo " ⚠ BOTH REGIONS DOWN: no healthy endpoints!"
      elif [ "$US_STATUS" = "UNHEALTHY" ]; then
        echo " → Failover: routing all traffic to eu-west"
      elif [ "$EU_STATUS" = "UNHEALTHY" ]; then
        echo " → Failover: routing all traffic to us-east"
      else
        echo " → Normal: latency-based routing active"
      fi
      sleep 5
    done
---
apiVersion: v1
kind: Pod
metadata:
  name: health-checker
spec:
  containers:
  - name: checker
    image: busybox:1.37
    command: ["/bin/sh", "/scripts/check.sh"]
    volumeMounts:
    - name: scripts
      mountPath: /scripts
  volumes:
  - name: scripts
    configMap:
      name: health-checker-script
      defaultMode: 0755
EOF
# Watch health check output
kubectl logs -f health-checker
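
The lab's check.sh treats a single failed probe as UNHEALTHY. Managed DNS health checks usually wait for several consecutive failures, such as the threshold of 3 used in the quiz answer above, so one dropped probe does not trigger failover. A minimal sketch of that rule follows; the `HealthTracker` class is hypothetical, and recovering on a single success is a simplification.

```python
class HealthTracker:
    """Marks an endpoint unhealthy only after N consecutive failed probes."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the current health verdict."""
        if probe_ok:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.healthy


tracker = HealthTracker(failure_threshold=3)
for ok in [True, False, False, True, False, False, False]:
    verdict = "HEALTHY" if tracker.record(ok) else "UNHEALTHY"
    print(f"probe_ok={ok} -> {verdict}")
```

Only the last probe in this sequence flips the verdict to UNHEALTHY, because the earlier run of two failures is reset by a success. With a 10-second interval and a threshold of 3, detection takes roughly 30 seconds before any DNS answer changes.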

Part 5: Simulate Region Failure (15 minutes)

# Simulate us-east failure by scaling to 0
echo "--- Simulating us-east failure ---"
kubectl scale deployment app-us-east --replicas=0
# Watch the health checker detect the failure
kubectl logs -f health-checker --tail=10
# Verify eu-west is still serving
kubectl exec -it health-checker -- wget -qO- http://app-eu-west/
# Restore us-east
echo "--- Restoring us-east ---"
kubectl scale deployment app-us-east --replicas=2
# Watch recovery
kubectl logs -f health-checker --tail=10
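
The checker only prints which routing mode would apply. The decision a latency-based traffic manager makes during such a failure can be sketched as follows. The latency table is made up for illustration (the same-region figures echo the lab's index.html pages), and `route` is a hypothetical helper, not part of CoreDNS or any provider.

```python
from typing import Optional

# Hypothetical round-trip times in milliseconds from each client location.
LATENCY_MS = {
    "new-york": {"us-east": 10, "eu-west": 85},
    "frankfurt": {"us-east": 95, "eu-west": 15},
}


def route(client: str, healthy: dict[str, bool]) -> Optional[str]:
    """Answer with the lowest-latency region that is currently healthy."""
    candidates = {
        region: ms
        for region, ms in LATENCY_MS[client].items()
        if healthy.get(region, False)
    }
    if not candidates:
        return None  # both regions down: no healthy endpoints
    return min(candidates, key=candidates.get)


normal = {"us-east": True, "eu-west": True}
us_east_down = {"us-east": False, "eu-west": True}

print(route("new-york", normal))        # us-east
print(route("new-york", us_east_down))  # eu-west
print(route("frankfurt", normal))       # eu-west
```

Filtering by health before comparing latency is what turns a plain latency policy into one that fails over.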

Part 6: Examine DNS Resolution Behavior (15 minutes)

# Launch a debug pod
kubectl run dns-debug --image=busybox:1.37 --rm -it -- sh
# Inside the pod, examine DNS configuration
cat /etc/resolv.conf
# Observe the ndots problem
# Count queries for an external domain
nslookup -debug api.example.com 2>&1 | head -30
# Compare with FQDN (trailing dot)
nslookup api.example.com. 2>&1 | head -10
# Test resolution timing
time nslookup kubernetes.default.svc.cluster.local
time nslookup google.com
time nslookup google.com. # Trailing dot skips the search list, so this is usually faster
# Exit the debug pod (type "exit"), then delete the lab cluster from your workstation
kind delete cluster --name dns-lab

Success Criteria:

  • Deployed region-specific applications on labeled nodes
  • Health checker correctly detects regional health status
  • Observed failover behavior when scaling a region to zero
  • Verified recovery when the region comes back
  • Examined the ndots:5 behavior and tested FQDN optimization
  • Understood the difference between DNS routing policies

  • “DNS and BIND” (5th Edition) — Cricket Liu & Paul Albitz. The definitive reference for DNS internals and configuration.

  • RFC 8499: DNS Terminology — The official glossary of DNS terms, clarifying decades of ambiguity.

  • “The Anatomy of the Dyn DDoS Attack” — Dyn’s own post-mortem of the October 2016 attack, detailing how Anycast both helped and complicated recovery.

  • Cloudflare Learning Center: DNS — Excellent free resource with interactive diagrams explaining DNS concepts from basics to advanced.


Before moving on, ensure you understand:

  • ALIAS/ANAME solve the apex CNAME problem by resolving at the DNS server level, returning A records to clients
  • Anycast announces one IP from many servers using BGP, so clients reach the node that is nearest in routing terms. Ideal for stateless protocols like DNS
  • Traffic policies go beyond round robin: Weighted, latency-based, geolocation, and failover policies turn DNS into a global traffic manager
  • Health checks are mandatory for DNS-based failover: Without them, DNS happily routes traffic to dead servers
  • TTL is a promise that is frequently broken: Browsers, JVMs, ISPs, and mobile carriers all cache DNS differently. Pre-migration TTL lowering is essential
  • DNSSEC authenticates but doesn’t encrypt: It prevents cache poisoning but does not hide your queries. DoH/DoT handle encryption
  • Kubernetes ndots:5 multiplies external DNS queries: Use trailing dots or lower ndots for applications making many external calls
  • ExternalDNS bridges cluster and cloud DNS: Automatically manages DNS records based on Kubernetes Ingress and Service resources

Module 1.2: CDN & Edge Computing — How content delivery networks minimize latency by caching at the edge, and how edge compute is changing application architecture.