Module 6.2: GKE Networking: Dataplane V2 and Gateway API
Complexity: [COMPLEX] | Time to Complete: 3h | Prerequisites: Module 6.1 (GKE Architecture)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure GKE Dataplane V2 (Cilium-based) with network policies and network policy logging
- Implement Gateway API on GKE for traffic splitting and header-based routing
- Deploy Private Service Connect for secure control plane access on GKE
- Diagnose GKE networking issues related to IP exhaustion and pod-to-service communication failures
Why This Module Matters
In September 2023, a healthcare SaaS company running on GKE discovered that their network policies were not being enforced. A penetration tester demonstrated that a compromised pod in the staging namespace could freely communicate with pods in the production namespace, despite NetworkPolicy resources that should have blocked cross-namespace traffic. The root cause: the cluster was using the legacy iptables-based kube-proxy dataplane, which does not enforce Kubernetes NetworkPolicy at all. The team had assumed that creating NetworkPolicy resources was sufficient---they did not realize that enforcement requires a CNI that supports it. The compliance violation cost them a SOC 2 audit failure, delaying a $2.3 million enterprise deal by four months. The fix took 30 minutes: enable Dataplane V2 on their next cluster creation. The business impact lasted a quarter.
GKE networking is where Kubernetes meets Google’s global network infrastructure. The decisions you make about cluster networking---VPC-native mode, Dataplane V2, load balancing strategy, and Gateway API configuration---determine your application’s performance, security, and cost. A misconfigured network can leave your pods exposed, introduce unnecessary latency, or rack up egress charges that dwarf your compute costs.
In this module, you will learn how VPC-native clusters use alias IPs to give pods routable addresses, how Dataplane V2 replaces iptables with eBPF for faster and more observable networking, how Cloud Load Balancing integrates with GKE, and how the Gateway API provides a more expressive routing model than Ingress. By the end, you will configure Dataplane V2 network policies and set up a Gateway API canary deployment.
VPC-Native Clusters and Alias IPs
Every modern GKE cluster should be VPC-native. This is the default since GKE 1.21 and is required for features like Dataplane V2, Private Google Access for pods, and VPC flow logs for pod traffic.
How Alias IPs Work
In a VPC-native cluster, each node receives a primary IP from the subnet and a secondary IP range (alias range) for its pods. This means pods get IP addresses that are routable within the VPC---no NAT, no overlay network.
```
VPC: 10.0.0.0/16
┌────────────────────────────────────────────────────────┐
│                                                        │
│  Subnet: 10.0.0.0/24 (Node IPs)                        │
│  ┌─────────────────┐      ┌─────────────────┐          │
│  │ Node A          │      │ Node B          │          │
│  │ IP: 10.0.0.2    │      │ IP: 10.0.0.3    │          │
│  │                 │      │                 │          │
│  │ Alias: 10.4.0.0 │      │ Alias: 10.4.1.0 │          │
│  │ /24 (pods)      │      │ /24 (pods)      │          │
│  │ ┌────┐ ┌────┐   │      │ ┌────┐ ┌────┐   │          │
│  │ │Pod │ │Pod │   │      │ │Pod │ │Pod │   │          │
│  │ │ .2 │ │ .3 │   │      │ │ .5 │ │ .8 │   │          │
│  │ └────┘ └────┘   │      │ └────┘ └────┘   │          │
│  └─────────────────┘      └─────────────────┘          │
│                                                        │
│  Secondary Range "pods":     10.4.0.0/14               │
│  Secondary Range "services": 10.8.0.0/20               │
└────────────────────────────────────────────────────────┘
```
Why This Matters for Networking
| Feature | VPC-Native (Alias IPs) | Routes-Based (Legacy) |
|---|---|---|
| Pod IPs routable in VPC | Yes (directly) | No (requires custom routes) |
| Max pods per cluster | Limited by IP range size | Limited to 300 custom routes |
| Network Policy support | Full (Dataplane V2) | Limited |
| Private Google Access for pods | Yes | No |
| VPC Flow Logs for pods | Yes | No |
| Peering/VPN compatibility | Full | Route export required |
```shell
# Verify your cluster is VPC-native
gcloud container clusters describe my-cluster \
  --region=us-central1 \
  --format="yaml(ipAllocationPolicy)"

# Expected output includes:
# useIpAliases: true
# clusterSecondaryRangeName: pods
# servicesSecondaryRangeName: services
```
IP Address Planning
Stop and think: If a VPC-native cluster uses alias IPs directly from the VPC, what happens if your VPC doesn’t have a large enough secondary range for your planned number of nodes and pods at maximum scale?
Poor IP planning is the number one networking regret for teams that scale. You cannot resize secondary ranges after cluster creation.
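The arithmetic behind pod-range sizing can be scripted as a quick sanity check. The sketch below is illustrative only: `min_pod_mask` is a hypothetical helper, and it assumes the default allocation of a /24 (256 IPs) per node.

```shell
# Smallest pod-range mask that fits `nodes` nodes, assuming the
# default /24 (256 IPs) allocated to each node
min_pod_mask() {
  local nodes=$1
  local ips=$(( nodes * 256 ))
  local mask=32 size=1
  # Walk the mask up until the block is big enough
  while [ "$size" -lt "$ips" ]; do
    mask=$(( mask - 1 ))
    size=$(( size * 2 ))
  done
  echo "/$mask"
}

min_pod_mask 100   # -> /17
min_pod_mask 500   # -> /15
```

These match the planning guide below: 100 nodes need at least a /17 pod range, 500 nodes at least a /15.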
```
Planning Guide:
┌──────────────────────────────────────────────────────┐
│ Each node gets a /24 from the pod range by default   │
│   = 256 IPs per node (110 pods max + overhead)       │
│                                                      │
│ For 100 nodes: you need 100 x /24 = /17 minimum      │
│ For 500 nodes: you need 500 x /24 = /15 minimum      │
│                                                      │
│ Services range:                                      │
│   /20 = 4,096 services (usually sufficient)          │
│   /16 = 65,536 services (very large clusters)        │
└──────────────────────────────────────────────────────┘
```

```shell
# Create a cluster with explicit IP planning for scale
gcloud container clusters create large-cluster \
  --region=us-central1 \
  --num-nodes=2 \
  --network=prod-vpc \
  --subnetwork=gke-subnet \
  --cluster-secondary-range-name=gke-pods \
  --services-secondary-range-name=gke-services \
  --enable-ip-alias \
  --max-pods-per-node=64 \
  --default-max-pods-per-node=64

# Reducing max-pods-per-node from 110 to 64 means each node
# needs a /25 instead of a /24, saving IP space
```
Dataplane V2: eBPF-Powered Networking
Dataplane V2 is GKE’s modern networking stack, built on Cilium and eBPF. It replaces the traditional kube-proxy + iptables approach with a programmable, kernel-level dataplane.
Why eBPF Changes Everything
Traditional Kubernetes networking uses iptables rules for service routing and kube-proxy for load balancing. This works, but it has fundamental limitations:
```
Legacy (iptables/kube-proxy):
┌─────────────────────────────────────────────────────┐
│ Packet arrives at node                              │
│   │                                                 │
│   ▼                                                 │
│ iptables chain (linear scan)                        │
│   Rule 1:     no match                              │
│   Rule 2:     no match                              │
│   Rule 3:     no match                              │
│   ...                                               │
│   Rule 5,000: MATCH → DNAT to pod IP                │
│                                                     │
│ O(n) performance: more services = slower routing    │
└─────────────────────────────────────────────────────┘
```
```
Dataplane V2 (eBPF):
┌─────────────────────────────────────────────────────┐
│ Packet arrives at node                              │
│   │                                                 │
│   ▼                                                 │
│ eBPF hash map lookup                                │
│   Key:   {dest IP, dest port}                       │
│   Value: backend pod IP                             │
│                                                     │
│ O(1) performance: constant time regardless of       │
│ number of services                                  │
└─────────────────────────────────────────────────────┘
```
Dataplane V2 Benefits
Pause and predict: If Dataplane V2 uses eBPF hash maps instead of iptables, how might this change the way you troubleshoot dropped packets or connection timeouts compared to legacy clusters?
| Capability | iptables/kube-proxy | Dataplane V2 |
|---|---|---|
| Service routing | O(n) linear scan | O(1) hash lookup |
| Network Policy enforcement | Requires Calico add-on | Built-in (Cilium) |
| Network Policy logging | Not available | Built-in |
| Kernel bypass | No | Yes (XDP for some paths) |
| Observability | Basic conntrack | Rich eBPF flow logs |
| Scale limit | ~5,000 services practical | 25,000+ services tested |
| FQDN-based policies | Not supported | Supported |
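The O(1) lookup in the table can be mimicked with a toy bash associative array (an illustration only; the real eBPF maps live in the kernel, and every key and IP here is invented):

```shell
# Toy model of service routing as a keyed map: one lookup,
# however many entries the map holds, instead of a rule scan
declare -A svc_map
svc_map["10.8.0.12:443"]="10.4.0.7"     # invented service VIP -> backend pod IP
svc_map["10.8.0.45:8080"]="10.4.1.3"

lookup() {
  # Single hash lookup; falls back to a marker when no backend exists
  echo "${svc_map[$1]:-no-backend}"
}

lookup "10.8.0.45:8080"   # -> 10.4.1.3
lookup "10.8.9.9:80"      # -> no-backend
```

Adding a thousand more entries to `svc_map` does not slow the lookup down, which is the property the table's "O(1) hash lookup" row describes.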
Enabling Dataplane V2
```shell
# Dataplane V2 is enabled at cluster creation time
gcloud container clusters create dpv2-cluster \
  --region=us-central1 \
  --num-nodes=2 \
  --enable-dataplane-v2 \
  --enable-ip-alias \
  --release-channel=regular

# For Autopilot clusters, Dataplane V2 is enabled by default
gcloud container clusters create-auto dpv2-autopilot \
  --region=us-central1

# Verify Dataplane V2 is active
kubectl -n kube-system get pods -l k8s-app=cilium -o wide
```
Network Policies with Dataplane V2
With Dataplane V2, NetworkPolicy resources are enforced without any additional CNI installation. This is the feature that the healthcare company in our opening story was missing.
```yaml
# Deny all ingress to production namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Allow only the API gateway to reach backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-gateway
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          role: gateway
      podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
---
# Allow DNS resolution for all pods (critical, often forgotten)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```
Network Policy Logging
Dataplane V2 can log allowed and denied connections, which is invaluable for debugging and compliance.
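Before enabling it broadly, it helps to know what the entries look like. The sketch below filters a denied-connection record locally; the JSON values are invented, and the field names (`disposition`, `src.pod_name`, `dest.pod_name`) are the same ones the Cloud Logging query in this section filters on.

```shell
# Write two sample policy-log payloads (invented values) to a scratch file
cat > /tmp/policy-log-sample.jsonl <<'EOF'
{"disposition":"deny","src":{"pod_name":"staging-web-1"},"dest":{"pod_name":"prod-api-2"}}
{"disposition":"allow","src":{"pod_name":"frontend-7"},"dest":{"pod_name":"prod-api-2"}}
EOF

# Keep only the denied connections, the ones worth investigating
grep '"disposition":"deny"' /tmp/policy-log-sample.jsonl
```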
```shell
# Enable network policy logging on the cluster
gcloud container clusters update dpv2-cluster \
  --region=us-central1 \
  --enable-network-policy-logging

# View logs in Cloud Logging
gcloud logging read \
  'resource.type="k8s_node" AND jsonPayload.disposition="deny"' \
  --limit=10 \
  --format="table(timestamp, jsonPayload.src.pod_name, jsonPayload.dest.pod_name, jsonPayload.disposition)"
```
War Story: A platform team enabled network policy logging and discovered that their monitoring agent (Datadog) was making 3,000 denied connections per minute to pods in restricted namespaces. The agent had broad scrape targets configured, and every denied connection generated a log entry. Before enabling logging in production, test in a staging environment to understand the log volume---it can be surprisingly high.
Cloud Load Balancing Integration
GKE integrates tightly with Google Cloud Load Balancing. When you create a Kubernetes Service or Ingress, GKE provisions the corresponding Google Cloud load balancer components automatically.
Service Types and Their Load Balancers
```
Kubernetes Concept              GCP Resource Created
──────────────────              ────────────────────
Service type: ClusterIP      →  Nothing (internal only)
Service type: NodePort       →  Nothing (opens port on nodes)
Service type: LoadBalancer   →  Network Load Balancer (L4)
Ingress (external)           →  Application Load Balancer (L7)
Gateway (external)           →  Application Load Balancer (L7)
```
| Service Type | Layer | Scope | Use Case |
|---|---|---|---|
| LoadBalancer | L4 (TCP/UDP) | Regional (default) | Non-HTTP, gRPC without path routing |
| Ingress (GKE Ingress) | L7 (HTTP/S) | Global | HTTP routing with host/path rules |
| Gateway (Gateway API) | L7 (HTTP/S) | Global or Regional | Modern alternative to Ingress |
| Internal LoadBalancer | L4 | Regional | Internal services, not internet-facing |
| Internal Ingress | L7 | Regional | Internal HTTP routing |
External Network Load Balancer (L4)
Stop and think: If you expose an internal gRPC service that requires L7 routing and TLS termination, which GKE service type or ingress method should you choose instead of a standard LoadBalancer?
```yaml
# Simple L4 load balancer
apiVersion: v1
kind: Service
metadata:
  name: game-server
spec:
  type: LoadBalancer
  selector:
    app: game-server
  ports:
  - port: 7777
    targetPort: 7777
    protocol: UDP
```

```shell
# Check the provisioned load balancer
kubectl get svc game-server -o wide
# The EXTERNAL-IP column shows the Google Cloud LB IP

# View the underlying GCP forwarding rule
gcloud compute forwarding-rules list \
  --filter="description~game-server"
```
GKE Ingress (L7)
GKE Ingress creates a Google Cloud Application Load Balancer (formerly HTTP(S) Load Balancer) with features like SSL termination, URL-based routing, and Cloud CDN integration.
```yaml
# Multi-service Ingress with path-based routing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: web-static-ip
    networking.gke.io/managed-certificates: web-cert
    kubernetes.io/ingress.class: gce
spec:
  defaultBackend:
    service:
      name: frontend
      port:
        number: 80
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api/*
        pathType: ImplementationSpecific
        backend:
          service:
            name: api-service
            port:
              number: 8080
      - path: /static/*
        pathType: ImplementationSpecific
        backend:
          service:
            name: static-assets
            port:
              number: 80
```
Gateway API: The Future of Kubernetes Routing
The Gateway API is a Kubernetes-native evolution of Ingress that provides richer routing capabilities, better role separation, and a more consistent experience across implementations. GKE fully supports the Gateway API and it is the recommended approach for new deployments.
Why Gateway API Over Ingress
Pause and predict: In the Gateway API model, if the infrastructure team modifies the Gateway resource to restrict allowed namespaces, what happens to the existing HTTPRoutes in namespaces that are no longer allowed?
```
Ingress Model (flat):
┌──────────────────────────────────────┐
│ Ingress Resource                     │
│ (mixes infra config + routing)       │
│                                      │
│ - TLS config    (infra team concern) │
│ - Host rules    (app team concern)   │
│ - Path rules    (app team concern)   │
│ - Backend refs  (app team concern)   │
│                                      │
│ ONE resource, ONE owner = conflict   │
└──────────────────────────────────────┘
```
```
Gateway API Model (layered):
┌──────────────────────────────────────┐
│ GatewayClass (cluster admin)         │
│ "Which load balancer implementation" │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│ Gateway (infra/platform team)        │
│ "Listener config, TLS, IP address"   │
└──────────────┬───────────────────────┘
               │
┌──────────────▼───────────────────────┐
│ HTTPRoute (app team)                 │
│ "Host matching, path routing,        │
│  headers, canary weights"            │
└──────────────────────────────────────┘
```
GKE Gateway Classes
GKE provides several pre-installed GatewayClasses:
| GatewayClass | Load Balancer Type | Scope | Use Case |
|---|---|---|---|
| `gke-l7-global-external-managed` | Global external ALB | Global | Public-facing web apps |
| `gke-l7-regional-external-managed` | Regional external ALB | Regional | Region-specific apps |
| `gke-l7-rilb` | Regional internal ALB | Regional | Internal microservices |
| `gke-l7-gxlb` | Classic global external ALB | Global | Legacy; avoid for new deployments |
```shell
# List available GatewayClasses in your cluster
kubectl get gatewayclass

# Enable the Gateway API on an existing cluster
gcloud container clusters update my-cluster \
  --region=us-central1 \
  --gateway-api=standard
```
Setting Up a Gateway
```yaml
# Step 1: Create the Gateway (platform/infra team)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
  namespace: infra
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: tls-cert
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            gateway-access: "true"
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            gateway-access: "true"
---
# Step 2: Create an HTTPRoute (app team)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: store
  labels:
    gateway: external-gateway
spec:
  parentRefs:
  - kind: Gateway
    name: external-gateway
    namespace: infra
  hostnames:
  - "store.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: store-api
      port: 8080
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: store-frontend
      port: 80
```
Canary Deployments with Gateway API
The Gateway API natively supports traffic splitting by weight---something that required Istio or custom annotations with Ingress.
```yaml
# Canary: send 90% to stable, 10% to canary
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-api-canary
  namespace: store
spec:
  parentRefs:
  - kind: Gateway
    name: external-gateway
    namespace: infra
  hostnames:
  - "store.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: store-api-stable
      port: 8080
      weight: 90
    - name: store-api-canary
      port: 8080
      weight: 10
```
To gradually shift traffic, update the weights:
```shell
# Move to 50/50
kubectl patch httproute store-api-canary -n store --type=merge -p '{
  "spec": {
    "rules": [{
      "matches": [{"path": {"type": "PathPrefix", "value": "/api"}}],
      "backendRefs": [
        {"name": "store-api-stable", "port": 8080, "weight": 50},
        {"name": "store-api-canary", "port": 8080, "weight": 50}
      ]
    }]
  }
}'

# Promote canary to 100%
kubectl patch httproute store-api-canary -n store --type=merge -p '{
  "spec": {
    "rules": [{
      "matches": [{"path": {"type": "PathPrefix", "value": "/api"}}],
      "backendRefs": [
        {"name": "store-api-canary", "port": 8080, "weight": 100}
      ]
    }]
  }
}'
```
Header-Based Routing
Gateway API also supports routing based on HTTP headers, which is useful for testing in production.
```yaml
# Route requests with X-Canary: true header to canary service
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-api-header-routing
  namespace: store
spec:
  parentRefs:
  - kind: Gateway
    name: external-gateway
    namespace: infra
  hostnames:
  - "store.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
      headers:
      - name: X-Canary
        value: "true"
    backendRefs:
    - name: store-api-canary
      port: 8080
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: store-api-stable
      port: 8080
```
Private Service Connect for GKE
Private Service Connect (PSC) allows you to access the GKE control plane through a private endpoint within your VPC, eliminating exposure to the public internet.
```shell
# Create a private cluster (legacy VPC peering-based control plane)
gcloud container clusters create private-cluster \
  --region=us-central1 \
  --num-nodes=1 \
  --enable-private-nodes \
  --enable-private-endpoint \
  --master-ipv4-cidr=172.16.0.0/28 \
  --enable-ip-alias \
  --enable-master-authorized-networks \
  --master-authorized-networks=10.0.0.0/8

# With PSC (newer approach, recommended):
gcloud container clusters create psc-cluster \
  --region=us-central1 \
  --num-nodes=1 \
  --enable-private-nodes \
  --private-endpoint-subnetwork=psc-subnet \
  --enable-ip-alias
```

```
Private Cluster with PSC:
┌─────────────────────────────────────────────────────┐
│ Google-Managed VPC                                  │
│  ┌─────────────────────────────────────┐            │
│  │ GKE Control Plane                   │            │
│  │ (API Server, etcd, etc.)            │            │
│  └──────────────┬──────────────────────┘            │
│                 │ Private Service Connect           │
└─────────────────┼───────────────────────────────────┘
                  │
┌─────────────────▼───────────────────────────────────┐
│ Customer VPC                                        │
│  ┌──────────────────┐                               │
│  │ PSC Endpoint     │ ← Private IP in your VPC      │
│  │ 10.0.5.2         │   for control plane access    │
│  └──────────────────┘                               │
│                                                     │
│  ┌──────────────────┐                               │
│  │ GKE Nodes        │ ← No public IPs               │
│  │ 10.0.0.0/24      │                               │
│  └──────────────────┘                               │
└─────────────────────────────────────────────────────┘
```
Private Cluster Considerations
Stop and think: If you use Private Service Connect for your GKE control plane and have disabled public IP access, how will your cloud-hosted CI/CD pipeline (e.g., GitHub Actions) authenticate and deploy manifests to the cluster?
| Consideration | Impact | Solution |
|---|---|---|
| Nodes cannot pull from internet | Container images fail | Use Artifact Registry (in same region) or configure Cloud NAT |
| kubectl from local machine blocked | Cannot manage cluster | Use Cloud Shell, a bastion VM, or VPN/Interconnect |
| Webhooks from control plane to nodes | Admission webhooks may fail | Ensure firewall allows control plane CIDR to node ports |
| Cloud Build access | CI/CD pipelines cannot reach API | Use private pools or GKE deploy via Cloud Deploy |
```shell
# Set up Cloud NAT for private nodes to pull images
gcloud compute routers create nat-router \
  --network=prod-vpc \
  --region=us-central1

gcloud compute routers nats create nat-config \
  --router=nat-router \
  --region=us-central1 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```
Did You Know?
Section titled “Did You Know?”-
Dataplane V2 uses the same eBPF technology that powers Meta’s (Facebook’s) entire network stack. Meta processes over 600 billion eBPF events per day across their fleet. In GKE, Dataplane V2’s eBPF programs are compiled and loaded into the Linux kernel at node boot, where they intercept and process packets before they ever reach userspace. This is why Dataplane V2 can achieve 26% lower latency than iptables-based routing in benchmarks with 10,000+ services.
-
A single GKE cluster can support up to 65,000 nodes and 400,000 pods. The practical networking limit is usually IP exhaustion rather than cluster capacity. A /14 pod CIDR gives you roughly 262,144 pod IPs. If each node uses a /24 for pods (the default for 110 max pods per node), you can support about 1,024 nodes before running out of pod IPs. Planning your IP ranges at cluster creation is one of the few decisions you truly cannot change later.
-
The Gateway API was designed by a cross-vendor working group including engineers from Google, Red Hat, HashiCorp, and VMware. The key insight was that Ingress combined infrastructure concerns (TLS, IP addresses) with application concerns (routing rules) in a single resource, making it impossible to safely delegate to different teams. Gateway API’s three-tier model (GatewayClass, Gateway, HTTPRoute) maps directly to the cluster admin, platform team, and application team roles that exist in most organizations.
-
GKE’s Global Application Load Balancer uses Google’s Maglev system, which was published as a research paper in 2016. Maglev is a distributed software load balancer that runs on commodity servers at Google’s edge PoPs. It uses consistent hashing to achieve connection persistence without shared state between load balancer instances. A single Maglev machine can handle 10 million packets per second, and the system has been running Google’s production traffic since 2008.
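The pod-IP arithmetic in the second item above can be checked directly (the numbers assume the default /24-per-node allocation described there):

```shell
# /14 pod range -> 2^(32-14) addresses; each node consumes a /24 (256 IPs)
pod_ips=$(( 1 << (32 - 14) ))        # 262144
ips_per_node=$(( 1 << (32 - 24) ))   # 256
max_nodes=$(( pod_ips / ips_per_node ))

echo "$pod_ips pod IPs / $ips_per_node per node = $max_nodes nodes"
# -> 262144 pod IPs / 256 per node = 1024 nodes
```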
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Creating a routes-based cluster instead of VPC-native | Following outdated tutorials | Always use --enable-ip-alias; it is the default for new clusters but verify |
| Assuming NetworkPolicy works without Dataplane V2 | Creating policies without enforcement | Enable Dataplane V2 at cluster creation; without it, policies are ignored |
| Undersizing the pod CIDR | Not calculating node count x pods per node | Plan for 3-5x your current node count; you cannot expand the range later |
| Forgetting DNS egress in NetworkPolicy | Writing a deny-all egress policy without DNS exception | Always include a rule allowing UDP/TCP port 53 to kube-dns pods |
| Using Ingress annotations for advanced routing | Trying to do canary/header routing with GKE Ingress | Switch to Gateway API which natively supports traffic splitting and header matching |
| Not enabling Cloud NAT for private clusters | Private nodes cannot reach the internet | Configure Cloud NAT on the VPC router before creating private clusters |
| Mixing GKE Ingress and Gateway API on the same cluster | Both create load balancer resources | Choose one approach per cluster; Gateway API is the recommended path forward |
| Ignoring network policy logging | Deploying policies without validation | Enable network policy logging and review denied connections before enforcing broadly |
1. Your e-commerce platform just scaled from 500 to 5,000 microservices. The platform team notices that network routing latency between pods has increased significantly on your older clusters, but remains flat on your new Dataplane V2 clusters. What fundamental architectural difference explains this behavior?
iptables-based routing uses a linear chain of rules that the kernel evaluates sequentially for every packet. When you have 5,000 Services, there are thousands of iptables rules, and each packet must traverse this chain until a match is found, resulting in O(n) complexity. Dataplane V2 uses eBPF hash maps compiled directly into the kernel. Service routing becomes a hash table lookup where the kernel hashes the destination IP and port, looks up the backend pod in O(1) constant time, and rewrites the packet. This means routing performance does not degrade as you add more services, resolving the latency issues seen in older clusters.
2. You deploy a strict `deny-all` egress NetworkPolicy to your `payments` namespace to meet PCI compliance. Suddenly, all pods in the namespace start crash-looping, reporting that they cannot connect to the internal database service `db.backend.svc.cluster.local`, even though you added an egress rule explicitly allowing traffic to the database's IP range. What critical rule is missing?
When you create a NetworkPolicy with policyTypes: ["Egress"] and no egress rules, you implicitly block all outbound traffic from the selected pods, including DNS resolution. Pods resolve service names (like db.backend.svc.cluster.local) by querying the kube-dns (CoreDNS) pods on UDP port 53. Without a DNS exception, pods cannot resolve any service names to IP addresses, meaning your application cannot even attempt the connection to the database. The critical missing rule is an explicit egress rule allowing traffic to kube-dns pods on both UDP and TCP port 53. TCP is required as a fallback for DNS responses larger than 512 bytes.
3. Your organization is moving from Ingress to the Gateway API. The security team wants to strictly control which TLS certificates are used and which namespaces can expose public endpoints, while application developers need the freedom to create path-based routing rules and canary deployments without submitting IT tickets. How does the Gateway API resource model satisfy both teams?
The Gateway API uses a three-tier resource model designed specifically for role-based access. The GatewayClass is managed by the cluster administrator and defines the load balancer implementation. The Gateway is managed by the platform or security team, allowing them to strictly configure TLS certificates, listening ports, and which namespaces can attach routes. The HTTPRoute is managed by the application team, giving them the freedom to define host matching, path routing, headers, and canary weights. This separation means the app team can update their routing autonomously, while the platform team enforces global security policies.
4. A junior engineer provisions a new regional GKE cluster (spanning 3 zones, 2 nodes per zone) and assigns a `/24` CIDR block for the pod secondary range. During the deployment of the first application, several pods remain in a `Pending` state, and the cluster autoscaler fails to add new nodes. What is the root cause of this failure?
A /24 CIDR block provides only 256 IP addresses for the entire pod network. In a VPC-native cluster, each node is allocated its own /24 slice by default to support up to 110 pods. Because a regional cluster with 3 zones and 2 nodes per zone requires 6 nodes in total, it would need at least a /21 for the pod range to accommodate them. The cluster creation will initially succeed, but you will hit scheduling failures and autoscaling blocks when the pod CIDR is immediately exhausted and new pods cannot be assigned IPs. This situation is unrecoverable, as secondary ranges cannot be resized, requiring a full cluster recreation.
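The shortfall can be confirmed with a quick calculation (a sketch; `range_size` is a hypothetical helper, and 6 nodes is the 3-zones x 2-nodes layout from the question, each taking the default /24 slice):

```shell
nodes=6
ips_per_node=$(( 1 << (32 - 24) ))   # /24 slice per node = 256 IPs
needed=$(( nodes * ips_per_node ))   # total pod IPs the nodes will claim

range_size() { echo $(( 1 << (32 - $1) )); }   # IPs in a /N block

echo "/24 range holds $(range_size 24) IPs, but $needed are needed"
echo "/21 range holds $(range_size 21) IPs, which fits"
```

A /24 (256 IPs) cannot even cover one node's allocation plus five more, while a /21 (2,048 IPs) comfortably fits the 1,536 IPs the six nodes claim.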
5. You are rolling out a critical update to the authentication service and want to route exactly 5% of traffic to the new version. Your cluster uses the Gateway API, but you do not have a service mesh like Istio installed. How can you achieve this granular traffic splitting, and where does the actual routing decision take place?
Gateway API supports traffic splitting natively through the weight field on backendRefs within an HTTPRoute rule. You can specify multiple backend services with different weights (e.g., 95 for stable, 5 for canary), and the load balancer distributes incoming requests proportionally. Unlike Istio’s traffic splitting, which requires a sidecar proxy injecting hops into the data path, GKE Gateway API traffic splitting is programmed directly into the Google Cloud Load Balancer. You update the weights by patching the HTTPRoute resource, and the external load balancer reconfigures within seconds. This provides robust canary deployments as a first-class infrastructure feature without the operational overhead of a service mesh.
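The proportional behaviour of the weights can be sanity-checked with a toy simulation. This is an approximation only: a deterministic round-robin over a 100-slot cycle, not the load balancer's actual distribution algorithm.

```shell
# Simulate 1000 requests against weights stable=95 / canary=5
stable=0; canary=0
for i in $(seq 0 999); do
  # First 5 slots of every 100-request cycle go to the canary
  if [ $(( i % 100 )) -lt 5 ]; then
    canary=$(( canary + 1 ))
  else
    stable=$(( stable + 1 ))
  fi
done

echo "stable=$stable canary=$canary"   # -> stable=950 canary=50
```

Over 1,000 requests the canary receives exactly 5%, which is the proportion the `weight: 95` / `weight: 5` split expresses.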
6. Your enterprise network team mandates that all new GKE clusters must be private, but they have exhausted the 25 VPC Peering connections limit on the central shared VPC. They also require that the GKE control plane be accessible via a specific private IP address on your on-premises network through Cloud Interconnect. Why is Private Service Connect (PSC) the only viable architecture for this requirement?
The legacy private cluster model relies on VPC peering between your VPC and the Google-managed VPC hosting the control plane. VPC peering is non-transitive, meaning peered networks cannot reach each other through your VPC, and it consumes a strict peering slot limit per VPC. Private Service Connect (PSC) instead creates a forwarding rule in your VPC that routes traffic to the control plane through a localized endpoint. This completely bypasses VPC peering, freeing up peering slots, and crucially supports transitive connectivity so on-premises networks can access the endpoint via Cloud Interconnect. PSC is the modern, scalable approach for private control plane access.
Hands-On Exercise: Dataplane V2 Network Policies and Gateway API Canary
Objective
Create a GKE cluster with Dataplane V2, enforce network policies between namespaces, and set up a Gateway API canary deployment with traffic splitting.
Prerequisites
Section titled “Prerequisites”gcloudCLI installed and authenticated- A GCP project with billing enabled and the GKE API enabled
kubectlinstalled
Task 1: Create a GKE Cluster with Dataplane V2 and Gateway API
Solution
```shell
export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1

# Create a cluster with Dataplane V2 and Gateway API enabled
gcloud container clusters create net-demo \
  --region=$REGION \
  --num-nodes=1 \
  --machine-type=e2-standard-2 \
  --enable-dataplane-v2 \
  --enable-ip-alias \
  --release-channel=regular \
  --gateway-api=standard \
  --workload-pool=$PROJECT_ID.svc.id.goog

# Get credentials
gcloud container clusters get-credentials net-demo --region=$REGION

# Verify Dataplane V2 (Cilium pods running)
kubectl -n kube-system get pods -l k8s-app=cilium

# Verify Gateway API CRDs are installed
kubectl get gatewayclass
```
Task 2: Deploy Two Namespaces with Applications
Solution
```shell
# Create namespaces
kubectl create namespace frontend
kubectl create namespace backend
kubectl label namespace frontend role=frontend gateway-access=true
kubectl label namespace backend role=backend
```
```shell
# Deploy backend app
# (selector includes version: stable so it does not overlap with the
# canary Deployment's pods)
kubectl apply -n backend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
      version: stable
  template:
    metadata:
      labels:
        app: api
        version: stable
    spec:
      containers:
      - name: api
        image: hashicorp/http-echo
        args: ["-text=API v1 (stable)", "-listen=:8080"]
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api-stable
spec:
  selector:
    app: api
    version: stable
  ports:
  - port: 8080
    targetPort: 8080
EOF
```
```bash
# Deploy canary version of backend
kubectl apply -n backend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
      version: canary
  template:
    metadata:
      labels:
        app: api
        version: canary
    spec:
      containers:
      - name: api
        image: hashicorp/http-echo
        args: ["-text=API v2 (canary)", "-listen=:8080"]
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api-canary
spec:
  selector:
    app: api
    version: canary
  ports:
  - port: 8080
    targetPort: 8080
EOF
```
```bash
# Deploy frontend
kubectl apply -n frontend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
EOF
```
```bash
# Verify all pods running
kubectl get pods -n frontend
kubectl get pods -n backend
```

Task 3: Enforce Network Policies with Dataplane V2
Solution
```bash
# Default deny all ingress in the backend namespace
kubectl apply -n backend -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF
```
```bash
# Test: frontend cannot reach backend (should timeout)
kubectl run test-curl --rm -it --restart=Never \
  -n frontend --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://api-stable.backend:8080 || echo "Connection blocked (expected)"
```
```bash
# Allow frontend namespace to reach backend API
kubectl apply -n backend -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8080
EOF
```
```bash
# Test again: frontend CAN reach backend now
kubectl run test-curl2 --rm -it --restart=Never \
  -n frontend --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://api-stable.backend:8080
```
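The module objective also covers network policy logging, which Dataplane V2 supports natively. The following is a hedged sketch: `NetworkLogging` is a GKE-specific, cluster-scoped resource, and the `v1alpha1` API version shown here may differ on your cluster version, so verify it with `kubectl api-resources | grep -i networklogging` first:

```shell
# Enable logging of denied connections (Dataplane V2 network policy logging).
# The resource must be named "default"; there is one per cluster.
kubectl apply -f - <<'EOF'
kind: NetworkLogging
apiVersion: networking.gke.io/v1alpha1
metadata:
  name: default
spec:
  cluster:
    allow:
      log: false
      delegate: false
    deny:
      log: true
      delegate: false
EOF
```

With deny logging enabled, blocked connection attempts (such as the attacker-namespace test) are written to `/var/log/network/policy_action.log` on each node and exported to Cloud Logging, giving you an audit trail of exactly what the policies rejected.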
```bash
# Test: a random namespace still cannot reach backend
kubectl create namespace attacker
kubectl run test-curl3 --rm -it --restart=Never \
  -n attacker --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://api-stable.backend:8080 || echo "Connection blocked (expected)"
kubectl delete namespace attacker
```

Task 4: Set Up Gateway API with Canary Traffic Splitting
Solution
```bash
# Create a Gateway
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: demo-gateway
  namespace: backend
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
EOF
```
```bash
# Create an HTTPRoute with canary traffic splitting (90/10)
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-stable
      port: 8080
      weight: 90
    - name: api-canary
      port: 8080
      weight: 10
EOF
```
```bash
# Wait for the Gateway to get an IP (takes 2-5 minutes)
echo "Waiting for Gateway IP..."
while true; do
  GW_IP=$(kubectl get gateway demo-gateway -n backend \
    -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
  if [ -n "$GW_IP" ]; then
    echo "Gateway IP: $GW_IP"
    break
  fi
  echo "Still provisioning..."
  sleep 15
done
```
```bash
# Test traffic splitting (run 20 requests, expect ~18 stable, ~2 canary)
echo "Sending 20 requests to $GW_IP..."
for i in $(seq 1 20); do
  curl -s http://$GW_IP
  echo ""
done | sort | uniq -c | sort -rn
```

Task 5: Shift Canary Traffic to 50/50 and Then Promote
Solution
```bash
# Shift to 50/50
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-stable
      port: 8080
      weight: 50
    - name: api-canary
      port: 8080
      weight: 50
EOF
```
```bash
echo "Waiting 30 seconds for LB to reconfigure..."
sleep 30
```
```bash
# Test again
GW_IP=$(kubectl get gateway demo-gateway -n backend \
  -o jsonpath='{.status.addresses[0].value}')
echo "50/50 split results:"
for i in $(seq 1 20); do
  curl -s http://$GW_IP
done | sort | uniq -c | sort -rn
```
```bash
# Full promotion to canary
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-canary
      port: 8080
      weight: 100
EOF
```
```bash
sleep 30
```
```bash
echo "Full canary promotion results:"
for i in $(seq 1 10); do
  curl -s http://$GW_IP
done
```

Task 6: Provision a Private Cluster with Private Service Connect (PSC)
Solution
```bash
# Create a dedicated subnet for PSC in the default network
gcloud compute networks subnets create psc-subnet \
  --network=default \
  --region=$REGION \
  --range=10.10.0.0/28
```
```bash
# Create a private cluster using PSC instead of VPC peering
gcloud container clusters create psc-demo \
  --region=$REGION \
  --num-nodes=1 \
  --enable-private-nodes \
  --private-endpoint-subnetwork=psc-subnet \
  --enable-ip-alias \
  --enable-master-authorized-networks \
  --master-authorized-networks=0.0.0.0/0
```

Note: `--master-authorized-networks` must be paired with `--enable-master-authorized-networks`, and `0.0.0.0/0` is for demo convenience only; in production, restrict this to your actual admin and CI/CD ranges.
```bash
# Verify the PSC endpoint IP address
gcloud container clusters describe psc-demo \
  --region=$REGION \
  --format="value(privateClusterConfig.privateEndpoint)"
```

Task 7: Clean Up
Solution
```bash
# Delete the Gateway API demo cluster
gcloud container clusters delete net-demo \
  --region=$REGION --quiet
```
```bash
# Delete the PSC demo cluster
gcloud container clusters delete psc-demo \
  --region=$REGION --quiet
```
```bash
# Delete the PSC subnet
gcloud compute networks subnets delete psc-subnet \
  --region=$REGION --quiet
```
```bash
echo "Clusters deleted. Verify no orphaned load balancer resources:"
gcloud compute forwarding-rules list --filter="description~net-demo"
gcloud compute target-http-proxies list --filter="description~net-demo"
```

Success Criteria
- Cluster created with Dataplane V2 and Gateway API enabled
- Cilium pods running in kube-system namespace
- Network policy blocks cross-namespace traffic by default
- Network policy allows frontend-to-backend traffic on port 8080
- Gateway API HTTPRoute splits traffic 90/10 between stable and canary
- Traffic shifting to 50/50 and full promotion works correctly
- PSC cluster created with a dedicated private endpoint subnet
- All resources cleaned up
Next Module
Next up: Module 6.3: GKE Workload Identity and Security --- Learn how to securely connect pods to GCP services without storing credentials, enforce binary authorization for trusted images, and leverage GKE’s security posture dashboard.