
Module 3.8: Cluster Networking Data Path

Complexity: [MEDIUM] - Core troubleshooting topic

Time to Complete: ~35 minutes

Prerequisites: Module 3.1 (Services), Module 3.6 (Network Policies), Module 3.7 (CNI)


After this module, you will be able to:

  • Trace a packet from pod A to pod B across nodes through the full network stack
  • Explain how VXLAN, IP-in-IP, and native routing work at the Linux level
  • Debug cross-node connectivity issues by checking routes, bridges, and tunnel interfaces
  • Compare overlay vs native routing approaches and their performance trade-offs

You can create Services and write NetworkPolicies all day, but when something breaks in production at 3 AM, you need to understand where packets actually go. This module teaches you the mental model that turns networking mysteries into solvable puzzles.

War Story: The Silent MTU Drop

A platform team migrated a cluster from Flannel with host-gw to Flannel with vxlan encapsulation. Everything seemed fine — small health-check probes succeeded, pod-to-pod pings worked, and Services resolved correctly. But every few minutes, a critical batch job would hang and eventually time out.

After two days of fruitless debugging (restarting pods, checking DNS, blaming the application), a junior engineer ran tcpdump on a node and noticed something peculiar: TCP SYN packets crossed nodes fine, but the large data payloads were being silently dropped. The culprit? VXLAN adds a 50-byte header, reducing the effective MTU from 1500 to 1450. The CNI was configured for MTU 1500, so any packet close to the limit was too large for the tunnel, and the Don't Fragment bit caused the kernel to drop it silently instead of fragmenting.

The fix was a one-line config change ("MTU": 1450), but finding it required understanding the actual data path — where packets enter the kernel, how they get encapsulated, and where they exit. That is exactly what this module teaches.
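The arithmetic behind that fix is small enough to sketch. The snippet below is an illustrative model, not part of any CNI: the function names are hypothetical, the 50-byte VXLAN figure comes from the story above, and the 20-byte IP-in-IP figure assumes a single outer IPv4 header without options.

```python
# Per-packet encapsulation overhead in bytes (IPv4 underlay assumed).
ENCAP_OVERHEAD = {
    "none": 0,    # host-gw / native routing: no extra header
    "vxlan": 50,  # figure used in the war story above
    "ipip": 20,   # assumption: one extra IPv4 header, no options
}


def pod_mtu(physical_mtu: int, encap: str) -> int:
    """Largest pod-interface MTU that still fits once the tunnel header is added."""
    return physical_mtu - ENCAP_OVERHEAD[encap]


def fits_in_tunnel(packet_size: int, physical_mtu: int, encap: str) -> bool:
    """True if a pod packet of packet_size bytes survives encapsulation unfragmented."""
    return packet_size + ENCAP_OVERHEAD[encap] <= physical_mtu


if __name__ == "__main__":
    print(pod_mtu(1500, "vxlan"))               # 1450
    print(fits_in_tunnel(64, 1500, "vxlan"))    # True  (small probes pass)
    print(fits_in_tunnel(1500, 1500, "vxlan"))  # False (full-size payloads do not)
```

This mirrors the symptom in the story: small packets always fit, and only packets within 50 bytes of the physical MTU are affected.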


By the end of this module, you’ll be able to:

  • Trace a packet from a client pod through kube-proxy rules to a backend pod
  • Distinguish between CNI responsibilities and kube-proxy responsibilities
  • Explain how CoreDNS resolution works end-to-end
  • Use tcpdump, iptables-save, conntrack, and nslookup to debug real networking issues
  • Apply a systematic troubleshooting mental model for cluster networking

  • kube-proxy does not proxy anything (despite its name). In iptables mode, it simply programs DNAT rules in the kernel. Actual packet forwarding is handled entirely by the Linux networking stack — kube-proxy never sees the data packets themselves.

  • A single Service with 1000 backends generates ~8000 iptables rules in iptables mode. This is why large clusters (5000+ Services) often switch to IPVS mode or eBPF-based solutions like Cilium, which can handle hundreds of thousands of backends without linear rule scanning.

  • Kubernetes requires a flat network: every pod must be able to reach every other pod without NAT. This single design decision (documented in the Kubernetes networking model) is what makes the entire Service abstraction possible — and it is the reason CNI plugins exist.

  • CoreDNS handles roughly 10,000-50,000 queries per second in a typical production cluster. A poorly chosen ndots value can multiply that several times over, because each lookup triggers search-domain expansion (e.g., with ndots:5 and the three default search domains, api.example.com is tried against all three search domains before the fourth, absolute query succeeds).


Part 1: Service to Pod Flow — The Packet Walk


Understanding the exact path a packet takes is the foundation of all network troubleshooting. Let’s trace a request from Pod A to a ClusterIP Service that routes to Pod B.

┌─────────────────────────────────────────────────────────────────────────┐
│ ClusterIP Packet Walk (iptables mode) │
│ │
│ Pod A (10.244.1.5) Pod B (10.244.2.8) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ curl 10.96.0.50 │ │ nginx :80 │ │
│ └────────┬────────┘ └────────▲────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────────┐ ┌────────┴────────┐ │
│ │ 1. veth pair │ │ 7. veth pair │ │
│ │ (pod → node) │ │ (node → pod) │ │
│ └────────┬────────┘ └────────▲────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────────┐ │ │
│ │ 2. iptables │ PREROUTING chain │ │
│ │ DNAT rule │ dst: 10.96.0.50:80 │ │
│ │ rewrites dst │ → 10.244.2.8:80 │ │
│ └────────┬────────┘ │ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────────┐ │ │
│ │ 3. conntrack │ Records the NAT │ │
│ │ table entry │ mapping for return │ │
│ └────────┬────────┘ traffic │ │
│ │ │ │
│ ▼ │ │
│ ┌─────────────────┐ ┌────────┴────────┐ │
│ │ 4. Routing │ │ 6. Routing │ │
│ │ decision │ │ decision │ │
│ │ (same node │──── same node? ──────►│ (deliver │ │
│ │ or tunnel?) │ │ locally) │ │
│ └────────┬────────┘ └─────────────────┘ │
│ │ different node │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ 5a. CNI encap │ ═══ VXLAN/Geneve ═══►│ 5b. CNI decap │ │
│ │ (if overlay) │ tunnel │ (on dst node)│ │
│ └─────────────────┘ └────────┬────────┘ │
│ Node 1 │ Node 2 │
│ └──► step 6 │
└─────────────────────────────────────────────────────────────────────────┘

Here is what happens at each step:

  1. veth pair: The packet leaves Pod A’s network namespace through a virtual ethernet pair that connects it to the host (node) network namespace.
  2. iptables DNAT: kube-proxy has programmed iptables rules. The PREROUTING chain matches the destination 10.96.0.50 (the Service ClusterIP) and rewrites it to a backend pod IP, say 10.244.2.8. If multiple backends exist, one is picked at random via iptables statistic (probability) rules; iptables mode does not do round-robin.
  3. conntrack: The kernel’s connection tracking module records this NAT mapping. When Pod B replies, conntrack automatically reverses the translation so Pod A sees the response coming from the Service IP, not the pod IP.
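  4. Routing decision: The kernel routes the packet based on the rewritten destination. If Pod B is on the same node, it goes directly to Pod B’s veth pair. If Pod B is on another node, the packet follows the route the CNI installed for that node's pod CIDR (a tunnel interface for overlays, or the other node as next hop for routed networks).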
  5. CNI encapsulation: For overlay networks (VXLAN, Geneve), the CNI wraps the packet in an outer header to tunnel it to the destination node. For routed networks (BGP, host-gw), the kernel forwards it directly.
  6. Delivery: On the destination node, the packet is decapsulated (if needed) and routed to Pod B’s veth pair.
  7. Pod receives: Pod B sees a packet from Pod A’s IP destined for its own IP on port 80.
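The backend selection in step 2 can be modelled in a few lines. This sketch assumes the usual iptables-mode layout, in which kube-proxy emits one rule per endpoint and rule i (counting from zero) matches with probability 1/(n - i), so the last rule always matches. The function names are hypothetical and exist only for illustration.

```python
from fractions import Fraction


def rule_probabilities(n_backends: int) -> list[Fraction]:
    """Per-rule match probability, in the order the rules are evaluated."""
    return [Fraction(1, n_backends - i) for i in range(n_backends)]


def backend_shares(n_backends: int) -> list[Fraction]:
    """Overall share of new connections that each backend ends up receiving."""
    shares = []
    still_unmatched = Fraction(1)  # probability that no earlier rule matched
    for p in rule_probabilities(n_backends):
        shares.append(still_unmatched * p)
        still_unmatched *= 1 - p
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))  # [Fraction(1, 3), Fraction(1, 2), Fraction(1, 1)]
    print(backend_shares(3))      # [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
```

The per-rule probabilities look uneven, but because each rule only sees traffic that earlier rules skipped, every backend receives the same share overall.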

Pause and predict: In the packet walk above, the conntrack table records the NAT mapping at step 3. When Pod B sends its response back, the destination is Pod A’s IP — not the Service ClusterIP. How does Pod A know the response came from the Service it called, and not from a random pod?

NodePort adds an extra step at the front:

┌─────────────────────────────────────────────────────────────────────────┐
│ NodePort Packet Walk │
│ │
│ External Client │
│ ┌─────────────────┐ │
│ │ curl │ │
│ │ 192.168.1.10: │ │
│ │ 30080 │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ Node 1 (192.168.1.10) │
│ │ 1. Node eth0 │ │
│ │ receives on │ │
│ │ port 30080 │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 2. iptables │ KUBE-NODEPORTS chain: │
│ │ DNAT │ dst-port 30080 → 10.96.0.50:80 (ClusterIP) │
│ │ │ → then selects backend: 10.244.2.8:80 │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ 3. SNAT (maybe) │ If externalTrafficPolicy: Cluster (default) │
│ │ │ source is rewritten to node IP so return │
│ │ │ traffic comes back through this node │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ (same as ClusterIP flow from step 4 onward) │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Key detail: With externalTrafficPolicy: Cluster (the default), kube-proxy SNATs the source address, which means the backend pod sees the node’s IP, not the client’s real IP. Setting externalTrafficPolicy: Local preserves the client IP but only routes to pods on the receiving node — if none exist there, the connection is dropped.
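The trade-off between the two policies can be written down as a tiny decision function. This is a simplified model of the behaviour described above, not kube-proxy code; the function name, the dictionary keys, and the example addresses are all made up for illustration.

```python
from typing import Optional


def nodeport_delivery(policy: str, client_ip: str, node_ip: str,
                      local_backends: list[str],
                      remote_backends: list[str]) -> Optional[dict]:
    """Return what a backend pod would observe, or None if the connection is dropped."""
    if policy == "Cluster":
        candidates = local_backends + remote_backends
        source_seen = node_ip    # SNAT: the pod sees the node, not the client
    elif policy == "Local":
        candidates = list(local_backends)
        source_seen = client_ip  # no SNAT: the client IP is preserved
    else:
        raise ValueError(f"unknown externalTrafficPolicy: {policy}")
    if not candidates:
        return None              # Local with no pod on the receiving node
    return {"source_seen_by_pod": source_seen, "candidates": candidates}


if __name__ == "__main__":
    # The receiving node hosts no backend; the only pod runs on another node.
    print(nodeport_delivery("Cluster", "203.0.113.7", "192.168.1.10", [], ["10.244.2.8"]))
    print(nodeport_delivery("Local", "203.0.113.7", "192.168.1.10", [], ["10.244.2.8"]))
```

The second call returns None, which is the "connection is dropped" case: with Local, a node that hosts no backend cannot serve the NodePort.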

When a pod calls its own Service (e.g., a pod behind web-svc curls web-svc), the packet might get routed back to itself. This is called hairpin or hairpin NAT:

┌─────────────────────────────────────────────────────────────┐
│ Hairpin Flow │
│ │
│ Pod A sends to Service IP │
│ │ │
│ ▼ │
│ iptables DNAT selects... Pod A itself! │
│ │ │
│ ▼ │
│ Packet must exit Pod A's netns and re-enter it │
│ (requires hairpin mode on the bridge/veth) │
│ │
│ If hairpin mode is OFF → packet is silently dropped │
│ If hairpin mode is ON → packet loops back correctly │
│ │
└─────────────────────────────────────────────────────────────┘

Most CNI plugins enable hairpin mode by default. If you see intermittent failures where a pod sometimes cannot reach its own Service, hairpin is the likely suspect. Check with:

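# Hairpin mode is a property of the host-side veth (the bridge port), not of eth0 inside the pod
# 1. Find the interface index of the pod's veth peer:
k exec <pod> -- cat /sys/class/net/eth0/iflink
# 2. On the node, find the interface with that index:
ip -o link | grep '^<ifindex>:'
# 3. Still on the node, read hairpin mode (1 = on; brport exists only for bridge-attached veths):
cat /sys/devices/virtual/net/<veth-name>/brport/hairpin_mode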

Part 2: CNI vs. kube-proxy Responsibilities


This is one of the most misunderstood distinctions in Kubernetes networking. Getting it wrong leads to debugging the wrong component.

┌─────────────────────────────────────────────────────────────────────────┐
│ CNI vs. kube-proxy: Who Does What? │
│ │
│ ┌──────────────────────────────┐ ┌──────────────────────────────┐ │
│ │ CNI Plugin │ │ kube-proxy │ │
│ │ │ │ │ │
│ │ ✓ Assign pod IPs │ │ ✓ Service → Pod DNAT │ │
│ │ ✓ Create veth pairs │ │ ✓ ClusterIP routing │ │
│ │ ✓ Configure pod routes │ │ ✓ NodePort rules │ │
│ │ ✓ Inter-node tunneling │ │ ✓ LoadBalancer rules │ │
│ │ (VXLAN, Geneve, BGP) │ │ ✓ Session affinity │ │
│ │ ✓ Network Policy │ │ ✓ Endpoint selection │ │
│ │ enforcement (some CNIs) │ │ │ │
│ │ ✓ Pod-to-pod connectivity │ │ ✗ Does NOT assign IPs │ │
│ │ │ │ ✗ Does NOT create tunnels │ │
│ │ ✗ Does NOT handle Services │ │ ✗ Does NOT enforce policies │ │
│ │ (unless eBPF replaces │ │ (except via DNAT rules) │ │
│ │ kube-proxy, e.g. Cilium) │ │ │ │
│ └──────────────────────────────┘ └──────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Quick Diagnosis │ │
│ │ │ │
│ │ "Pod A cannot reach Pod B by pod IP" │ │
│ │ → Problem is CNI (routing, encapsulation, MTU) │ │
│ │ │ │
│ │ "Pod A cannot reach Service, but CAN reach Pod B by pod IP" │ │
│ │ → Problem is kube-proxy (DNAT rules, endpoints) │ │
│ │ │ │
│ │ "Pod A cannot resolve service-name" │ │
│ │ → Problem is CoreDNS (see Part 3) │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
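The quick-diagnosis box above boils down to three yes/no observations. A minimal sketch, with a hypothetical function name and return strings, checks them in the same order as the box (pod IP first, because the Service and DNS layers both depend on it):

```python
def quick_diagnosis(reaches_pod_ip: bool, reaches_service_ip: bool,
                    resolves_name: bool) -> str:
    """Map three observations from a client pod to the component to inspect first."""
    if not reaches_pod_ip:
        return "CNI (routing, encapsulation, MTU)"
    if not reaches_service_ip:
        return "kube-proxy (DNAT rules, endpoints)"
    if not resolves_name:
        return "CoreDNS"
    return "no network-layer fault found"


if __name__ == "__main__":
    # Pod IP works, ClusterIP does not: look at kube-proxy, not the CNI.
    print(quick_diagnosis(reaches_pod_ip=True, reaches_service_ip=False, resolves_name=True))
```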

Different CNI plugins take different approaches. Here is a comparison relevant to troubleshooting:

| Feature | Calico | Flannel | Cilium |
| --- | --- | --- | --- |
| Data plane | iptables or eBPF | VXLAN or host-gw | eBPF |
| Default routing | BGP (IP-in-IP or VXLAN encap optional) | VXLAN overlay | VXLAN tunnel (native routing optional) |
| Network Policies | Yes (native) | No (needs add-on) | Yes (L3-L7) |
| Can replace kube-proxy | Yes (eBPF mode) | No | Yes (kube-proxy replacement) |
| MTU concern | None without encap; reduced when encap is enabled | Yes (-50 bytes for VXLAN) | Depends on config |
| Troubleshooting tool | calicoctl node status | Check VXLAN interface | cilium status |

kube-proxy can operate in different modes. The mode affects how Services are implemented in the kernel:

# Check which mode kube-proxy is using
k get configmap kube-proxy -n kube-system -o yaml | grep mode
# Or check the kube-proxy logs
k logs -n kube-system -l k8s-app=kube-proxy | head -20

| Mode | How It Works | Performance | When to Use |
| --- | --- | --- | --- |
| iptables | DNAT rules per Service/Endpoint | O(n) rule evaluation | Default, fine for < 5000 Services |
| IPVS | Virtual server with real backends | O(1) lookup via hash table | Large clusters (5000+ Services) |
| nftables | Next-gen replacement for iptables | Better than iptables | K8s 1.31+ recommended path |
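The O(n) versus O(1) distinction is easy to see in a toy model. The sketch below is a deliberate simplification (real iptables evaluation jumps between chains and IPVS is implemented in the kernel); the function names and the synthetic Service IPs are invented for illustration.

```python
from typing import Optional


def linear_lookup(rules: list[tuple[str, str]], dst: str) -> tuple[Optional[str], int]:
    """iptables-style: walk the rule list until one matches; return (backend, rules checked)."""
    for checked, (vip, backend) in enumerate(rules, start=1):
        if vip == dst:
            return backend, checked
    return None, len(rules)


def hashed_lookup(table: dict[str, str], dst: str) -> tuple[Optional[str], int]:
    """IPVS-style: a single hash-table probe regardless of table size."""
    return table.get(dst), 1


if __name__ == "__main__":
    # 5000 synthetic Services with made-up ClusterIPs.
    services = [(f"10.96.{i // 256}.{i % 256}", f"backend-{i}") for i in range(5000)]
    last_vip = services[-1][0]
    print(linear_lookup(services, last_vip))        # ('backend-4999', 5000)
    print(hashed_lookup(dict(services), last_vip))  # ('backend-4999', 1)
```

The worst case for the linear scan grows with the number of Services, which is the reason the table suggests IPVS for large clusters.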

DNS is the glue that makes Service names work. When a pod calls curl web-service, here is what actually happens.

┌─────────────────────────────────────────────────────────────────────────┐
│ DNS Resolution Path │
│ │
│ Pod (10.244.1.5) │
│ ┌──────────────────────────────┐ │
│ │ curl http://web-svc │ │
│ │ │ │
│ │ 1. glibc reads │ │
│ │ /etc/resolv.conf: │ │
│ │ nameserver 10.96.0.10 │ ◄── CoreDNS ClusterIP │
│ │ search default.svc. │ │
│ │ cluster.local │ │
│ │ svc.cluster.local │ │
│ │ cluster.local │ │
│ │ options ndots:5 │ │
│ └──────────┬───────────────────┘ │
│ │ │
│ │ 2. "web-svc" has 0 dots, which is < ndots (5) │
│ │ So search domains are tried FIRST: │
│ │ │
│ │ Query 1: web-svc.default.svc.cluster.local ← HIT! │
│ │ (If miss: web-svc.svc.cluster.local) │
│ │ (If miss: web-svc.cluster.local) │
│ │ (If miss: web-svc.) ← absolute query last │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ 3. UDP packet to │ │
│ │ 10.96.0.10:53 │ │
│ │ (CoreDNS ClusterIP) │ │
│ └──────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ 4. kube-proxy DNAT │ │
│ │ 10.96.0.10 → 10.244.0.3 │ ◄── Actual CoreDNS pod IP │
│ └──────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ 5. CoreDNS pod │ │
│ │ - Checks kubernetes │ │
│ │ plugin (in-cluster │ │
│ │ records) │ │
│ │ - Returns A record: │ │
│ │ 10.96.45.123 │ ◄── Service ClusterIP │
│ └──────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ Pod receives IP, connects to 10.96.45.123 │
│ (then the Service → Pod flow from Part 1 takes over) │
│ │
└─────────────────────────────────────────────────────────────────────────┘

Stop and think: The DNS path involves kube-proxy DNAT (step 4 above) to reach the actual CoreDNS pod. This means DNS resolution itself depends on kube-proxy working correctly. If kube-proxy is down, can pods resolve service names? Can they resolve external names like google.com?

The ndots:5 default in Kubernetes means any name with fewer than 5 dots is treated as a relative name. This triggers search domain expansion:

# Querying "api.example.com" (2 dots, < 5) generates these lookups:
# 1. api.example.com.default.svc.cluster.local → NXDOMAIN
# 2. api.example.com.svc.cluster.local → NXDOMAIN
# 3. api.example.com.cluster.local → NXDOMAIN
# 4. api.example.com. → SUCCESS
# That's 4 DNS queries instead of 1!
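The expansion order above follows a simple rule that can be sketched in Python. This is a model of the resolver's search-list logic as described in this section, not the glibc implementation; the function name is hypothetical, and the list it returns is the order of candidates (a real resolver stops at the first one that succeeds).

```python
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def resolver_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return candidate names in the order a stub resolver would try them."""
    if name.endswith("."):
        return [name]  # trailing dot: already absolute, search list is skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: search domains first


if __name__ == "__main__":
    for candidate in resolver_queries("api.example.com", SEARCH):
        print(candidate)
    # api.example.com.default.svc.cluster.local.
    # api.example.com.svc.cluster.local.
    # api.example.com.cluster.local.
    # api.example.com.
```

Calling `resolver_queries("api.example.com", SEARCH, ndots=2)` or passing `"api.example.com."` puts the absolute name first, which is why both fixes below remove the wasted queries.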

For pods that frequently call external domains, reduce ndots or use trailing dots:

# Option 1: Set ndots in pod spec
apiVersion: v1
kind: Pod
metadata:
  name: optimized-dns
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
  containers:
  - name: app
    image: nginx

# Option 2: Use trailing dot (absolute name, skips search)
curl http://api.example.com.
# ^ trailing dot = absolute, no search expansion

| Symptom | Likely Cause | How to Check |
| --- | --- | --- |
| All DNS fails | CoreDNS pods down | k get pods -n kube-system -l k8s-app=kube-dns |
| Intermittent DNS timeouts | CoreDNS overloaded or NetworkPolicy blocking UDP/53 | k top pods -n kube-system, check policies |
| External names fail | CoreDNS cannot reach upstream DNS | Check CoreDNS forward plugin config, node DNS |
| Cross-namespace fails | Wrong FQDN or search domain | Use full FQDN: <service>.<namespace>.svc.cluster.local |
| DNS works, connection fails | DNS is fine, problem is Service/CNI | nslookup succeeds but curl fails = not DNS |

# Quick DNS health check from any pod
k run dns-check --rm -it --image=busybox:1.36 --restart=Never -- \
nslookup kubernetes.default
# Check CoreDNS logs for errors
k logs -n kube-system -l k8s-app=kube-dns --tail=50

When networking breaks, you need a systematic approach. Do not guess — follow the packet.

┌─────────────────────────────────────────────────────────────────────────┐
│ Networking Troubleshooting Decision Tree │
│ │
│ "Pod A cannot reach Service X" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Layer 1: DNS │ │
│ │ Can Pod A resolve the name? │ │
│ │ nslookup <service> │ │
│ └──────┬──────────┬───────────────┘ │
│ NO │ │ YES │
│ │ │ │
│ ▼ ▼ │
│ Check CoreDNS │ │
│ pods, resolv.conf │ │
│ NetworkPolicy │ │
│ on UDP/53 │ │
│ │ │
│ ┌─────────────────▼───────────────┐ │
│ │ Layer 2: Service (kube-proxy) │ │
│ │ Does the Service have endpoints? │ │
│ │ k get endpoints <service> │ │
│ └──────┬──────────┬───────────────┘ │
│ NO │ │ YES │
│ │ │ │
│ ▼ ▼ │
│ Check selector │ │
│ matches, pod │ │
│ readiness probes │ │
│ │ │
│ ┌─────────────────▼───────────────┐ │
│ │ Layer 3: Pod-to-Pod (CNI) │ │
│ │ Can Pod A reach the endpoint │ │
│ │ IP directly? │ │
│ │ curl <endpoint-ip>:<port> │ │
│ └──────┬──────────┬───────────────┘ │
│ NO │ │ YES │
│ │ │ │
│ ▼ ▼ │
│ CNI issue: Pod is not │
│ routes, encap, listening on │
│ MTU, Network the port or │
│ Policies app is broken │
│ │
└─────────────────────────────────────────────────────────────────────────┘
# === Layer 1: DNS ===
# Test DNS from inside a pod
k run debug --rm -it --image=busybox:1.36 --restart=Never -- nslookup web-svc
# Check resolv.conf inside a pod
k exec <pod> -- cat /etc/resolv.conf
# === Layer 2: Service / kube-proxy ===
# Check endpoints exist
k get endpoints <service-name>
# Check iptables rules for a specific service (on the node)
iptables-save | grep <service-name>
# Check conntrack table for stale entries
conntrack -L -d <service-clusterip>
# === Layer 3: Pod-to-Pod / CNI ===
# Test direct pod-to-pod connectivity
k run debug --rm -it --image=busybox:1.36 --restart=Never -- \
wget -qO- --timeout=5 http://<pod-ip>:<port>
# Capture packets on a node (run on the node, not in a pod)
tcpdump -i any -nn host <pod-ip> and port <port>
# Check MTU
k exec <pod> -- ip link show eth0
# Look for "mtu" value -- should match CNI config
# Check routes inside a pod
k exec <pod> -- ip route

What would happen if: You delete a backend pod and Kubernetes immediately creates a replacement pod that happens to get the same IP address as the deleted one. A client has an active TCP connection to the old pod through the Service. Does the connection survive, fail gracefully, or hang?

Conntrack (connection tracking) is the kernel module that makes NAT work. It remembers which connections map to which translations. Stale conntrack entries are a common source of mysterious failures:

# List all conntrack entries for a Service IP
conntrack -L -d 10.96.0.50
# Count entries (high numbers may indicate connection leak)
conntrack -C
# Delete stale entries (careful in production)
conntrack -D -d 10.96.0.50 -p tcp --dport 80

When conntrack bites you: If a pod is deleted and recreated with the same IP (rare but possible), conntrack may still have entries pointing to the old connection state. Symptoms include connections that hang or reset for no apparent reason, but only to specific pods.
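To make the role of conntrack concrete, here is a toy model of the NAT state it keeps for one Service connection. The class and method names are hypothetical and the real kernel table tracks far more (protocol, TCP state, timeouts); the sketch only shows why a reply can be rewritten back to the Service IP and what deleting an entry removes.

```python
Endpoint = tuple[str, int]  # (ip, port)


class ConntrackTable:
    """Toy NAT state: remembers how to undo a DNAT when the reply comes back."""

    def __init__(self) -> None:
        # (reply source, reply destination) -> Service VIP the client originally dialled
        self._reply_to_vip: dict[tuple[Endpoint, Endpoint], Endpoint] = {}

    def dnat(self, client: Endpoint, vip: Endpoint, backend: Endpoint) -> tuple[Endpoint, Endpoint]:
        """Outbound: record the mapping and return the rewritten (src, dst)."""
        self._reply_to_vip[(backend, client)] = vip
        return client, backend

    def un_dnat(self, src: Endpoint, dst: Endpoint) -> tuple[Endpoint, Endpoint]:
        """Reply: restore the VIP as source if this flow is tracked."""
        vip = self._reply_to_vip.get((src, dst))
        return (vip, dst) if vip is not None else (src, dst)

    def delete(self, vip: Endpoint) -> None:
        """Drop every entry for a Service VIP (loosely what `conntrack -D -d` does)."""
        self._reply_to_vip = {k: v for k, v in self._reply_to_vip.items() if v != vip}


if __name__ == "__main__":
    ct = ConntrackTable()
    client, vip, backend = ("10.244.1.5", 40000), ("10.96.0.50", 80), ("10.244.2.8", 80)
    print(ct.dnat(client, vip, backend))  # (('10.244.1.5', 40000), ('10.244.2.8', 80))
    print(ct.un_dnat(backend, client))    # (('10.96.0.50', 80), ('10.244.1.5', 40000))
```

A stale entry is simply a mapping whose backend no longer exists: the table keeps steering the existing flow to it until the entry is deleted or expires.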

When packets are dropped between nodes, work through this list:

  1. MTU mismatch: Does the CNI use encapsulation? If so, is the MTU reduced accordingly?

    # Check MTU on pod interface vs node tunnel interface
    ip link show vxlan.calico # or flannel.1, cilium_vxlan
  2. Firewall rules: Are the node firewalls (iptables, firewalld, cloud security groups) allowing the CNI protocol?

    • VXLAN: UDP port 4789
    • Geneve: UDP port 6081
    • BGP: TCP port 179
    • WireGuard: UDP port 51820
  3. CNI health: Is the CNI daemon running on all nodes?

    k get pods -n kube-system -l k8s-app=calico-node # or flannel, cilium
  4. IP exhaustion: Has the pod CIDR run out of IPs on a specific node?

    k describe node <node> | grep -A5 "PodCIDR"
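For the IP-exhaustion check in the list above, the capacity of a node's pod CIDR can be estimated with Python's standard ipaddress module. This is a rough sketch: subtracting two addresses (network and broadcast) is an assumption, and a real IPAM plugin may reserve more, for example a gateway address. The function names are hypothetical.

```python
import ipaddress


def pod_ip_capacity(pod_cidr: str) -> int:
    """Approximate number of pod IPs a node can hand out from its PodCIDR."""
    network = ipaddress.ip_network(pod_cidr)
    # Assumption: only the network and broadcast addresses are unusable.
    return max(network.num_addresses - 2, 0)


def is_exhausted(pod_cidr: str, pods_on_node: int) -> bool:
    """True when the node has at least as many pods as assignable addresses."""
    return pods_on_node >= pod_ip_capacity(pod_cidr)


if __name__ == "__main__":
    print(pod_ip_capacity("10.244.1.0/24"))   # 254
    print(pod_ip_capacity("10.244.1.0/26"))   # 62
    print(is_exhausted("10.244.1.0/26", 62))  # True
```

Compare the result with the number of pods scheduled on the node; a small per-node CIDR can run out of addresses well before the node runs out of CPU or memory.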

| Mistake | Problem | Solution |
| --- | --- | --- |
| Debugging DNS when the issue is kube-proxy | Wasted time on wrong layer | Follow the three-layer model: DNS first, then Service, then CNI |
| Ignoring MTU after switching CNI modes | Large packets silently dropped | Always set MTU = physical MTU minus encap overhead (50 for VXLAN) |
| Not checking conntrack | Stale NAT entries cause intermittent failures | Use conntrack -L to inspect state when connections hang |
| Forgetting externalTrafficPolicy | Client source IP lost, or no backends on node | Understand Cluster (SNAT, all backends) vs Local (preserves IP, local only) |
| Setting ndots too high | DNS query amplification, slow lookups | Use ndots: 2 for pods calling external services, or use trailing dots |
| Testing from outside the cluster for ClusterIP | Connection timeout | ClusterIP only works inside the cluster; use NodePort/port-forward for external tests |
| Running tcpdump on wrong interface | Captures show nothing | Use tcpdump -i any to capture on all interfaces, then narrow down |
| Blaming the application before checking the network | Hours wasted debugging app code | Always verify network connectivity first with simple tools (wget, curl) |

1. A pod can reach another pod by its IP (10.244.2.8), but cannot reach it via the Service ClusterIP (10.96.0.50). Which component is most likely at fault?

Answer

kube-proxy is most likely at fault. Since pod-to-pod connectivity works, the CNI is functioning correctly. The Service ClusterIP is handled by kube-proxy’s iptables/IPVS/nftables rules. Check:

  • k get endpoints <service> — are there endpoints?
  • iptables-save | grep <service-name> — are the DNAT rules present?
  • Is kube-proxy running? k get pods -n kube-system -l k8s-app=kube-proxy

2. You switch your CNI from Flannel with host-gw to Flannel with vxlan. Small requests (health checks, pings) work fine, but large HTTP responses are dropped. What is the likely cause and fix?

Answer

MTU mismatch. VXLAN encapsulation adds a 50-byte header. If the pod MTU is still 1500 (the default for host-gw), packets near 1500 bytes will exceed the tunnel’s capacity after encapsulation.

Fix: Set the CNI MTU to 1450 (1500 - 50 for VXLAN overhead). In Flannel’s ConfigMap:

{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan",
    "MTU": 1450
  }
}

Then restart the Flannel pods so all nodes pick up the new MTU.

3. A developer reports that curl api.external.com from a pod takes 2 seconds, but only 50ms from their laptop. DNS is the bottleneck. Explain why and how to fix it.

Answer

The default ndots:5 in Kubernetes means api.external.com (2 dots, which is less than 5) is treated as a relative name. The resolver tries these lookups in order before succeeding:

  1. api.external.com.default.svc.cluster.local — NXDOMAIN (~500ms)
  2. api.external.com.svc.cluster.local — NXDOMAIN (~500ms)
  3. api.external.com.cluster.local — NXDOMAIN (~500ms)
  4. api.external.com. — SUCCESS (~50ms)

That is 3 wasted queries adding ~1.5 seconds of latency.

Fixes (pick one):

  • Use a trailing dot: curl api.external.com.
  • Set dnsConfig.options.ndots: 2 in the pod spec
  • Use the FQDN with trailing dot in application configuration

4. You run k get endpoints my-service and see <none>. The pods are running and have the correct labels. What else could cause empty endpoints?

Answer

Even if pods are running with correct labels, endpoints will be empty if:

  1. Pods are not Ready — the readiness probe is failing. Only pods that pass their readiness probe are added to the Endpoints object. Check: k get pods (look for 0/1 READY).
  2. The Service selector requires labels the pods do not have — double check with k describe svc my-service and compare against k get pods --show-labels.
  3. The pods are in a different namespace than the Service. Services only select pods in their own namespace.
  4. The endpoint controller is not running — extremely rare, but check the kube-controller-manager pod in kube-system.

Most commonly, the answer is failed readiness probes.

5. After deleting and recreating a backend pod, some existing connections to the Service hang for 30+ seconds before recovering. New connections work fine. What is happening?

Answer

Stale conntrack entries. The kernel’s connection tracking table still has entries mapping existing connections to the old pod’s IP. Since that IP no longer exists (or belongs to a different pod), packets are being sent into a void.

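The entries do eventually go away, but not quickly: the kernel default timeout for established TCP entries is 5 days (nf_conntrack_tcp_timeout_established = 432000 seconds), and kube-proxy lowers it to 24 hours by default. The connections recover sooner than that because the client's TCP retransmissions or application timeouts give up and the client reconnects. New connections work because they create fresh conntrack entries pointing to valid backends.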

To fix immediately: conntrack -D -d <service-clusterip> -p tcp --dport <port>. To prevent: use graceful pod termination (preStop hooks, connection draining) so the pod removes itself from endpoints before the process exits.


Objective: Trace a request end-to-end from a client pod through DNS, kube-proxy, and CNI to a backend pod. You will use tcpdump, nslookup, and iptables-save to observe each layer.

Environment: kind or minikube cluster (single node is fine for this exercise).

# Create a backend deployment and service
k create deployment trace-backend --image=nginx --replicas=2
k expose deployment trace-backend --port=80 --name=trace-svc
# Wait for pods to be ready
k wait --for=condition=ready pod -l app=trace-backend --timeout=60s
# Create a debug pod that stays running
k run trace-client --image=nicolaka/netshoot --restart=Never -- sleep 3600
# Wait for it
k wait --for=condition=ready pod/trace-client --timeout=60s

# Check the client pod's DNS config
k exec trace-client -- cat /etc/resolv.conf
# Note: nameserver should be the CoreDNS ClusterIP
# Resolve the service name
k exec trace-client -- nslookup trace-svc
# Should return the ClusterIP of trace-svc
# Try the FQDN
k exec trace-client -- nslookup trace-svc.default.svc.cluster.local
# Compare: resolve with trailing dot (skips search domains)
k exec trace-client -- nslookup trace-svc.default.svc.cluster.local.

Record: What IP did trace-svc resolve to? This is the ClusterIP.

# Get the Service details
k get svc trace-svc -o wide
# Get the endpoints (backend pod IPs)
k get endpoints trace-svc
# Look at iptables rules for this service (requires node access)
# On minikube: minikube ssh
# On kind: docker exec -it <node-container> bash
# Then run:
iptables-save | grep trace-svc
# You should see KUBE-SERVICES, KUBE-SVC-*, and KUBE-SEP-* chains
# The KUBE-SVC chain contains probability rules for load balancing
# The KUBE-SEP chains contain the DNAT rules to specific pod IPs

Record: How many KUBE-SEP entries exist? Should match your replica count (2).
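If you save the iptables-save output to a file, the count can be scripted rather than read by eye. The sketch below is one possible way to do it: the sample lines are hand-written to resemble kube-proxy's KUBE-SEP rules (the chain suffixes and pod IPs are invented, and the exact rule layout can differ between Kubernetes versions), and the helper name is hypothetical.

```python
import re

# Hand-written sample resembling `iptables-save` output; not captured from a real node.
SAMPLE = """\
-A KUBE-SEP-AAAA1111 -p tcp -m comment --comment "default/trace-svc" -m tcp -j DNAT --to-destination 10.244.0.5:80
-A KUBE-SEP-BBBB2222 -p tcp -m comment --comment "default/trace-svc" -m tcp -j DNAT --to-destination 10.244.0.6:80
-A KUBE-SEP-CCCC3333 -p tcp -m comment --comment "default/other-svc" -m tcp -j DNAT --to-destination 10.244.0.9:8080
"""

DNAT_RULE = re.compile(
    r'^-A (KUBE-SEP-\S+) .*--comment "([^"]+)".*--to-destination (\S+)'
)


def dnat_targets(iptables_save_output: str, service: str) -> list[str]:
    """Return the DNAT destinations of KUBE-SEP rules tagged with namespace/name."""
    targets = []
    for line in iptables_save_output.splitlines():
        match = DNAT_RULE.match(line)
        # The comment may carry a ":portname" suffix; compare only namespace/name.
        if match and match.group(2).split(":")[0] == service:
            targets.append(match.group(3))
    return targets


if __name__ == "__main__":
    print(dnat_targets(SAMPLE, "default/trace-svc"))  # ['10.244.0.5:80', '10.244.0.6:80']
```

The length of the returned list should equal the replica count, and its entries should match the addresses shown by k get endpoints trace-svc.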

# On your workstation, look up the IPs you will need on the node
# (the k alias and your kubeconfig are usually not available there)
k get pod trace-client -o jsonpath='{.status.podIP}'
k get svc trace-svc -o jsonpath='{.spec.clusterIP}'
# In one terminal, start tcpdump on the node (requires node access)
# On kind: docker exec -it <node-container> bash
tcpdump -i any -nn port 80 and host <client-pod-ip>
# In another terminal, make a request from the client pod
k exec trace-client -- curl -s http://trace-svc
# Observe the tcpdump output:
# 1. You should see the initial SYN from trace-client IP to ClusterIP
# 2. Then the DNAT'd packet from trace-client IP to a backend pod IP
# 3. The response from the backend pod IP to trace-client IP
# 4. Conntrack reverses the NAT, so trace-client sees the ClusterIP

# On the node, check conntrack entries
conntrack -L -d <clusterIP> 2>/dev/null
# You should see entries showing:
# src=<client-pod-ip> dst=<clusterIP> dport=80
# and the reply mapping:
# src=<backend-pod-ip> dst=<client-pod-ip>

# Apply a policy that blocks traffic to the backend
cat << 'EOF' | k apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-trace-backend
spec:
  podSelector:
    matchLabels:
      app: trace-backend
  policyTypes:
  - Ingress
  ingress: []  # Empty = deny all ingress
EOF
# Try the request again (should fail/timeout)
k exec trace-client -- curl -s --connect-timeout 5 http://trace-svc
# Expected: timeout or connection refused (depends on CNI)
# If the request still succeeds, your CNI may not enforce NetworkPolicy
# Check DNS still works (it should, DNS goes to CoreDNS, not backend)
k exec trace-client -- nslookup trace-svc
# Remove the policy
k delete networkpolicy deny-trace-backend
# Verify connectivity is restored
k exec trace-client -- curl -s --connect-timeout 5 http://trace-svc

k delete pod trace-client --force
k delete deployment trace-backend
k delete svc trace-svc

Success Criteria:

  • Can identify the ClusterIP from DNS resolution
  • Can find iptables DNAT rules for a Service
  • Can observe the packet flow in tcpdump (pre-DNAT and post-DNAT)
  • Can inspect conntrack entries for active connections
  • Understand that NetworkPolicy blocks pod-to-pod traffic but not DNS
  • Can articulate the three-layer troubleshooting model (DNS, Service, CNI)

  1. Follow the packet, not your assumptions. Use tcpdump, iptables-save, and conntrack to see what is actually happening instead of guessing.
  2. The three layers (DNS, kube-proxy/Service, CNI/pod-to-pod) can be tested separately, even though the upper layers depend on the lower ones. Isolate which layer is broken before deep-diving.
  3. MTU matters. Any time encapsulation is involved, the effective MTU decreases. Silent drops on large packets are the classic symptom.
  4. conntrack is invisible but critical. It maintains NAT state for every connection through a Service. Stale entries cause some of the most confusing intermittent failures.
  5. ndots:5 is expensive. For workloads calling external services, either reduce ndots or use trailing dots on domain names.


Module 3.3: DNS in Kubernetes - Deep-dive into CoreDNS configuration, custom DNS policies, and advanced troubleshooting.