
Module 6.2: GKE Networking: Dataplane V2 and Gateway API

Complexity: [COMPLEX] | Time to Complete: 3h | Prerequisites: Module 6.1 (GKE Architecture)

After completing this module, you will be able to:

  • Configure GKE Dataplane V2 (Cilium-based) with network policies and network policy logging
  • Implement Gateway API on GKE for traffic splitting and header-based routing
  • Deploy Private Service Connect for secure control plane access on GKE
  • Diagnose GKE networking issues related to IP exhaustion and pod-to-service communication failures

In September 2023, a healthcare SaaS company running on GKE discovered that their network policies were not being enforced. A penetration tester demonstrated that a compromised pod in the staging namespace could freely communicate with pods in the production namespace, despite NetworkPolicy resources that should have blocked cross-namespace traffic. The root cause: the cluster was using the legacy iptables-based kube-proxy dataplane, which does not enforce Kubernetes NetworkPolicy at all. The team had assumed that creating NetworkPolicy resources was sufficient---they did not realize that enforcement requires a CNI that supports it. The compliance violation cost them a SOC 2 audit failure, delaying a $2.3 million enterprise deal by four months. The fix took 30 minutes: enable Dataplane V2 on their next cluster creation. The business impact lasted a quarter.

GKE networking is where Kubernetes meets Google’s global network infrastructure. The decisions you make about cluster networking---VPC-native mode, Dataplane V2, load balancing strategy, and Gateway API configuration---determine your application’s performance, security, and cost. A misconfigured network can leave your pods exposed, introduce unnecessary latency, or rack up egress charges that dwarf your compute costs.

In this module, you will learn how VPC-native clusters use alias IPs to give pods routable addresses, how Dataplane V2 replaces iptables with eBPF for faster and more observable networking, how Cloud Load Balancing integrates with GKE, and how the Gateway API provides a more expressive routing model than Ingress. By the end, you will configure Dataplane V2 network policies and set up a Gateway API canary deployment.


Every modern GKE cluster should be VPC-native. This is the default since GKE 1.21 and is required for features like Dataplane V2, Private Google Access for pods, and VPC flow logs for pod traffic.

In a VPC-native cluster, each node receives a primary IP from the subnet and a secondary IP range (alias range) for its pods. This means pods get IP addresses that are routable within the VPC---no NAT, no overlay network.

VPC: 10.0.0.0/16
┌────────────────────────────────────────────────────┐
│ Subnet: 10.0.0.0/24 (Node IPs)                     │
│                                                    │
│  ┌─────────────────┐      ┌─────────────────┐      │
│  │ Node A          │      │ Node B          │      │
│  │ IP: 10.0.0.2    │      │ IP: 10.0.0.3    │      │
│  │                 │      │                 │      │
│  │ Alias: 10.4.0.0 │      │ Alias: 10.4.1.0 │      │
│  │ /24 (pods)      │      │ /24 (pods)      │      │
│  │ ┌────┐ ┌────┐   │      │ ┌────┐ ┌────┐   │      │
│  │ │Pod │ │Pod │   │      │ │Pod │ │Pod │   │      │
│  │ │.2  │ │.3  │   │      │ │.5  │ │.8  │   │      │
│  │ └────┘ └────┘   │      │ └────┘ └────┘   │      │
│  └─────────────────┘      └─────────────────┘      │
│                                                    │
│ Secondary Range "pods":     10.4.0.0/14            │
│ Secondary Range "services": 10.8.0.0/20            │
└────────────────────────────────────────────────────┘
| Feature | VPC-Native (Alias IPs) | Routes-Based (Legacy) |
| --- | --- | --- |
| Pod IPs routable in VPC | Yes (directly) | No (requires custom routes) |
| Max pods per cluster | Limited by IP range size | Limited to 300 custom routes |
| Network Policy support | Full (Dataplane V2) | Limited |
| Private Google Access for pods | Yes | No |
| VPC Flow Logs for pods | Yes | No |
| Peering/VPN compatibility | Full | Route export required |
Terminal window
# Verify your cluster is VPC-native
gcloud container clusters describe my-cluster \
--region=us-central1 \
--format="yaml(ipAllocationPolicy)"
# Expected output includes:
# useIpAliases: true
# clusterSecondaryRangeName: pods
# servicesSecondaryRangeName: services

Stop and think: If a VPC-native cluster uses alias IPs directly from the VPC, what happens if your VPC doesn’t have a large enough secondary range for your planned number of nodes and pods at maximum scale?

Poor IP planning is the number one networking regret for teams that scale. You cannot resize secondary ranges after cluster creation.

Planning Guide:
┌────────────────────────────────────────────────────┐
│ Each node gets a /24 from the pod range by default │
│   = 256 IPs per node (110 pods max + overhead)     │
│                                                    │
│ For 100 nodes: you need 100 x /24 = /17 minimum    │
│ For 500 nodes: you need 500 x /24 = /15 minimum    │
│                                                    │
│ Services range:                                    │
│   /20 = 4,096 services (usually sufficient)        │
│   /16 = 65,536 services (very large clusters)      │
└────────────────────────────────────────────────────┘
Terminal window
# Create a cluster with explicit IP planning for scale
gcloud container clusters create large-cluster \
--region=us-central1 \
--num-nodes=2 \
--network=prod-vpc \
--subnetwork=gke-subnet \
--cluster-secondary-range-name=gke-pods \
--services-secondary-range-name=gke-services \
--enable-ip-alias \
--max-pods-per-node=64 \
--default-max-pods-per-node=64
# Reducing max-pods-per-node from 110 to 64 means GKE allocates a /25
# per node instead of a /24 (it reserves roughly twice the pod maximum),
# halving the pod IP space each node consumes
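The planning arithmetic above can be checked with plain shell. This is a standalone sketch with no gcloud calls; the numbers mirror the planning guide:

```shell
# Minimum pod-range prefix for a given node count, assuming each
# node consumes a /24 (the default for 110 max pods per node).
nodes=100
ips_per_node=$(( 2 ** (32 - 24) ))      # 256 IPs per node
total=$(( nodes * ips_per_node ))       # 25,600 IPs needed
prefix=32
while [ $(( 2 ** (32 - prefix) )) -lt "$total" ]; do
  prefix=$(( prefix - 1 ))
done
echo "minimum pod range: /${prefix}"    # /17 for 100 nodes
```

Rerunning with nodes=500 yields /15, matching the guide; padding by one or two extra bits on top of this is cheap insurance against growth.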

Dataplane V2 is GKE’s modern networking stack, built on Cilium and eBPF. It replaces the traditional kube-proxy + iptables approach with a programmable, kernel-level dataplane.

Traditional Kubernetes networking uses iptables rules for service routing and kube-proxy for load balancing. This works, but it has fundamental limitations:

Legacy (iptables/kube-proxy):
┌────────────────────────────────────────────────────┐
│ Packet arrives at node                             │
│   │                                                │
│   ▼                                                │
│ iptables chain (linear scan)                       │
│   Rule 1:     no match                             │
│   Rule 2:     no match                             │
│   Rule 3:     no match                             │
│   ...                                              │
│   Rule 5,000: MATCH → DNAT to pod IP               │
│                                                    │
│ O(n) performance: more services = slower routing   │
└────────────────────────────────────────────────────┘
Dataplane V2 (eBPF):
┌────────────────────────────────────────────────────┐
│ Packet arrives at node                             │
│   │                                                │
│   ▼                                                │
│ eBPF hash map lookup                               │
│   Key:   {dest IP, dest port}                      │
│   Value: backend pod IP                            │
│                                                    │
│ O(1) performance: constant time regardless of      │
│ number of services                                 │
└────────────────────────────────────────────────────┘
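You can look at this map directly on a Dataplane V2 node. This is a sketch: the anetd DaemonSet name and the cilium CLI embedded in it are GKE implementation details and may differ between versions:

```shell
# Dump the eBPF load-balancing table that replaces iptables rules.
# Each entry maps {service IP, port} straight to backend pod IPs.
kubectl -n kube-system exec ds/anetd -- cilium bpf lb list | head -20
```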

Pause and predict: If Dataplane V2 uses eBPF hash maps instead of iptables, how might this change the way you troubleshoot dropped packets or connection timeouts compared to legacy clusters?

| Capability | iptables/kube-proxy | Dataplane V2 |
| --- | --- | --- |
| Service routing | O(n) linear scan | O(1) hash lookup |
| Network Policy enforcement | Requires Calico add-on | Built-in (Cilium) |
| Network Policy logging | Not available | Built-in |
| Kernel bypass | No | Yes (XDP for some paths) |
| Observability | Basic conntrack | Rich eBPF flow logs |
| Scale limit | ~5,000 services practical | 25,000+ services tested |
| FQDN-based policies | Not supported | Supported |
Terminal window
# Dataplane V2 is enabled at cluster creation time
gcloud container clusters create dpv2-cluster \
--region=us-central1 \
--num-nodes=2 \
--enable-dataplane-v2 \
--enable-ip-alias \
--release-channel=regular
# For Autopilot clusters, Dataplane V2 is enabled by default
gcloud container clusters create-auto dpv2-autopilot \
--region=us-central1
# Verify Dataplane V2 is active
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

With Dataplane V2, NetworkPolicy resources are enforced without any additional CNI installation. This is the feature that the healthcare company in our opening story was missing.

# Deny all ingress to production namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow only the API gateway to reach backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-gateway
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: gateway
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
---
# Allow DNS resolution for all pods (critical, often forgotten)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
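You can smoke-test these policies with throwaway pods. A sketch: the service name backend.production is hypothetical (the policies above select pods by label, not by service name), and the probes assume the three manifests are applied:

```shell
# From a namespace without the gateway role label: should be blocked.
kubectl run np-probe --rm -it --restart=Never \
  -n default --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://backend.production.svc:8080 \
  || echo "blocked (expected)"

# DNS from production should still resolve, thanks to allow-dns.
kubectl run dns-probe --rm -it --restart=Never \
  -n production --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```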

Dataplane V2 can log allowed and denied connections, which is invaluable for debugging and compliance.

Terminal window
# Enable network policy logging on the cluster
gcloud container clusters update dpv2-cluster \
--region=us-central1 \
--enable-network-policy-logging
# View logs in Cloud Logging
gcloud logging read \
'resource.type="k8s_node" AND jsonPayload.disposition="deny"' \
--limit=10 \
--format="table(timestamp, jsonPayload.src.pod_name, jsonPayload.dest.pod_name, jsonPayload.disposition)"

War Story: A platform team enabled network policy logging and discovered that their monitoring agent (Datadog) was making 3,000 denied connections per minute to pods in restricted namespaces. The agent had broad scrape targets configured, and every denied connection generated a log entry. Before enabling logging in production, test in a staging environment to understand the log volume---it can be surprisingly high.


GKE integrates tightly with Google Cloud Load Balancing. When you create a Kubernetes Service or Ingress, GKE provisions the corresponding Google Cloud load balancer components automatically.

| Kubernetes Concept | GCP Resource Created |
| --- | --- |
| Service type: ClusterIP | Nothing (internal only) |
| Service type: NodePort | Nothing (opens port on nodes) |
| Service type: LoadBalancer | Network Load Balancer (L4) |
| Ingress (external) | Application Load Balancer (L7) |
| Gateway (external) | Application Load Balancer (L7) |
| Service Type | Layer | Scope | Use Case |
| --- | --- | --- | --- |
| LoadBalancer | L4 (TCP/UDP) | Regional (default) | Non-HTTP, gRPC without path routing |
| Ingress (GKE Ingress) | L7 (HTTP/S) | Global | HTTP routing with host/path rules |
| Gateway (Gateway API) | L7 (HTTP/S) | Global or Regional | Modern alternative to Ingress |
| Internal LoadBalancer | L4 | Regional | Internal services, not internet-facing |
| Internal Ingress | L7 | Regional | Internal HTTP routing |

Stop and think: If you expose an internal gRPC service that requires L7 routing and TLS termination, which GKE service type or ingress method should you choose instead of a standard LoadBalancer?

# Simple L4 load balancer
apiVersion: v1
kind: Service
metadata:
  name: game-server
spec:
  type: LoadBalancer
  selector:
    app: game-server
  ports:
    - port: 7777
      targetPort: 7777
      protocol: UDP
Terminal window
# Check the provisioned load balancer
kubectl get svc game-server -o wide
# The EXTERNAL-IP column shows the Google Cloud LB IP
# View the underlying GCP forwarding rule
gcloud compute forwarding-rules list \
--filter="description~game-server"

GKE Ingress creates a Google Cloud Application Load Balancer (formerly HTTP(S) Load Balancer) with features like SSL termination, URL-based routing, and Cloud CDN integration.

# Multi-service Ingress with path-based routing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: web-static-ip
    networking.gke.io/managed-certificates: web-cert
    kubernetes.io/ingress.class: gce
spec:
  defaultBackend:
    service:
      name: frontend
      port:
        number: 80
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api/*
            pathType: ImplementationSpecific
            backend:
              service:
                name: api-service
                port:
                  number: 8080
          - path: /static/*
            pathType: ImplementationSpecific
            backend:
              service:
                name: static-assets
                port:
                  number: 80

Gateway API: The Future of Kubernetes Routing


The Gateway API is a Kubernetes-native evolution of Ingress that provides richer routing capabilities, better role separation, and a more consistent experience across implementations. GKE fully supports the Gateway API and it is the recommended approach for new deployments.

Pause and predict: In the Gateway API model, if the infrastructure team modifies the Gateway resource to restrict allowed namespaces, what happens to the existing HTTPRoutes in namespaces that are no longer allowed?

Ingress Model (flat):
┌──────────────────────────────────────┐
│ Ingress Resource                     │
│ (mixes infra config + routing)       │
│                                      │
│ - TLS config (infra team concern)    │
│ - Host rules (app team concern)      │
│ - Path rules (app team concern)      │
│ - Backend refs (app team concern)    │
│                                      │
│ ONE resource, ONE owner = conflict   │
└──────────────────────────────────────┘
Gateway API Model (layered):
┌──────────────────────────────────────┐
│ GatewayClass (cluster admin)         │
│ "Which load balancer implementation" │
└──────────────┬───────────────────────┘
┌──────────────▼───────────────────────┐
│ Gateway (infra/platform team)        │
│ "Listener config, TLS, IP address"   │
└──────────────┬───────────────────────┘
┌──────────────▼───────────────────────┐
│ HTTPRoute (app team)                 │
│ "Host matching, path routing,        │
│  headers, canary weights"            │
└──────────────────────────────────────┘

GKE provides several pre-installed GatewayClasses:

| GatewayClass | Load Balancer Type | Scope | Use Case |
| --- | --- | --- | --- |
| gke-l7-global-external-managed | Global external ALB | Global | Public-facing web apps |
| gke-l7-regional-external-managed | Regional external ALB | Regional | Region-specific apps |
| gke-l7-rilb | Regional internal ALB | Regional | Internal microservices |
| gke-l7-gxlb | Classic global external ALB | Global | Legacy, avoid for new |
Terminal window
# List available GatewayClasses in your cluster
kubectl get gatewayclass
# Enable the Gateway API on an existing cluster
gcloud container clusters update my-cluster \
--region=us-central1 \
--gateway-api=standard
# Step 1: Create the Gateway (platform/infra team)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
  namespace: infra
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: tls-cert
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
# Step 2: Create an HTTPRoute (app team)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: store
  labels:
    gateway: external-gateway
spec:
  parentRefs:
    - kind: Gateway
      name: external-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api
          port: 8080
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: store-frontend
          port: 80

The Gateway API natively supports traffic splitting by weight---something that required Istio or custom annotations with Ingress.

# Canary: send 90% to stable, 10% to canary
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-api-canary
  namespace: store
spec:
  parentRefs:
    - kind: Gateway
      name: external-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api-stable
          port: 8080
          weight: 90
        - name: store-api-canary
          port: 8080
          weight: 10

To gradually shift traffic, update the weights:

Terminal window
# Move to 50/50
kubectl patch httproute store-api-canary -n store --type=merge -p '{
"spec": {
"rules": [{
"matches": [{"path": {"type": "PathPrefix", "value": "/api"}}],
"backendRefs": [
{"name": "store-api-stable", "port": 8080, "weight": 50},
{"name": "store-api-canary", "port": 8080, "weight": 50}
]
}]
}
}'
# Promote canary to 100%
kubectl patch httproute store-api-canary -n store --type=merge -p '{
"spec": {
"rules": [{
"matches": [{"path": {"type": "PathPrefix", "value": "/api"}}],
"backendRefs": [
{"name": "store-api-canary", "port": 8080, "weight": 100}
]
}]
}
}'
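To confirm the split is taking effect, sample the endpoint. A sketch: it assumes the Gateway's provisioned address is already in $GW_IP and that the stable and canary backends return distinguishable bodies:

```shell
# Send 100 requests and count which backend answered each one.
for i in $(seq 1 100); do
  curl -s -H "Host: store.example.com" "http://$GW_IP/api"
  echo
done | sort | uniq -c
```

With the 50/50 patch applied, the two counts should be roughly equal; expect some variance, since the load balancer distributes proportionally, not deterministically.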

Gateway API also supports routing based on HTTP headers, which is useful for testing in production.

# Route requests with X-Canary: true header to canary service
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-api-header-routing
  namespace: store
spec:
  parentRefs:
    - kind: Gateway
      name: external-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
          headers:
            - name: X-Canary
              value: "true"
      backendRefs:
        - name: store-api-canary
          port: 8080
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api-stable
          port: 8080
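Two curl probes are enough to exercise the header match. A sketch, assuming the Gateway's external address is in $GW_IP (not defined in the manifests above):

```shell
# With the header, the request matches the first rule (canary backend).
curl -s -H "Host: store.example.com" -H "X-Canary: true" "http://$GW_IP/api"

# Without it, the request falls through to the stable backend.
curl -s -H "Host: store.example.com" "http://$GW_IP/api"
```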

Private Service Connect (PSC) allows you to access the GKE control plane through a private endpoint within your VPC, eliminating exposure to the public internet.

Terminal window
# Create a private cluster with PSC
gcloud container clusters create private-cluster \
--region=us-central1 \
--num-nodes=1 \
--enable-private-nodes \
--enable-private-endpoint \
--master-ipv4-cidr=172.16.0.0/28 \
--enable-ip-alias \
--enable-master-authorized-networks \
--master-authorized-networks=10.0.0.0/8
# With PSC (newer approach, recommended):
gcloud container clusters create psc-cluster \
--region=us-central1 \
--num-nodes=1 \
--enable-private-nodes \
--private-endpoint-subnetwork=psc-subnet \
--enable-ip-alias
Private Cluster with PSC:
┌─────────────────────────────────────────────────────┐
│ Google-Managed VPC                                  │
│   ┌─────────────────────────────────────┐           │
│   │ GKE Control Plane                   │           │
│   │ (API Server, etcd, etc.)            │           │
│   └──────────────┬──────────────────────┘           │
│                  │ Private Service Connect          │
└──────────────────┼──────────────────────────────────┘
┌──────────────────▼──────────────────────────────────┐
│ Customer VPC                                        │
│   ┌──────────────────┐                              │
│   │ PSC Endpoint     │ ← Private IP in your VPC     │
│   │ 10.0.5.2         │   for control plane access   │
│   └──────────────────┘                              │
│                                                     │
│   ┌──────────────────┐                              │
│   │ GKE Nodes        │ ← No public IPs              │
│   │ 10.0.0.0/24      │                              │
│   └──────────────────┘                              │
└─────────────────────────────────────────────────────┘

Stop and think: If you use Private Service Connect for your GKE control plane and have disabled public IP access, how will your cloud-hosted CI/CD pipeline (e.g., GitHub Actions) authenticate and deploy manifests to the cluster?

| Consideration | Impact | Solution |
| --- | --- | --- |
| Nodes cannot pull from internet | Container image pulls fail | Use Artifact Registry (in same region) or configure Cloud NAT |
| kubectl from local machine blocked | Cannot manage cluster | Use Cloud Shell, a bastion VM, or VPN/Interconnect |
| Webhooks from control plane to nodes | Admission webhooks may fail | Ensure firewall allows control plane CIDR to node ports |
| Cloud Build access | CI/CD pipelines cannot reach API | Use Cloud Build private pools or deploy via Cloud Deploy |
Terminal window
# Set up Cloud NAT for private nodes to pull images
gcloud compute routers create nat-router \
--network=prod-vpc \
--region=us-central1
gcloud compute routers nats create nat-config \
--router=nat-router \
--region=us-central1 \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges
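Before relying on the NAT, verify it. A sketch: the nat-config and nat-router names come from the commands above, and the registry URL is just an arbitrary external endpoint to prove outbound connectivity:

```shell
# Confirm the NAT config is attached to the router.
gcloud compute routers nats describe nat-config \
  --router=nat-router \
  --region=us-central1

# From inside the cluster, any successful outbound request implies
# working NAT on private nodes (expect HTTP response headers back).
kubectl run nat-probe --rm -it --restart=Never \
  --image=curlimages/curl -- curl -sI https://registry-1.docker.io/v2/
```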

  1. Dataplane V2 uses the same eBPF technology that powers Meta’s (Facebook’s) entire network stack. Meta processes over 600 billion eBPF events per day across their fleet. In GKE, Dataplane V2’s eBPF programs are compiled and loaded into the Linux kernel at node boot, where they intercept and process packets before they ever reach userspace. This is why Dataplane V2 can achieve 26% lower latency than iptables-based routing in benchmarks with 10,000+ services.

  2. A single GKE cluster can support up to 65,000 nodes and 400,000 pods. The practical networking limit is usually IP exhaustion rather than cluster capacity. A /14 pod CIDR gives you roughly 262,144 pod IPs. If each node uses a /24 for pods (the default for 110 max pods per node), you can support about 1,024 nodes before running out of pod IPs. Planning your IP ranges at cluster creation is one of the few decisions you truly cannot change later.

  3. The Gateway API was designed by a cross-vendor working group including engineers from Google, Red Hat, HashiCorp, and VMware. The key insight was that Ingress combined infrastructure concerns (TLS, IP addresses) with application concerns (routing rules) in a single resource, making it impossible to safely delegate to different teams. Gateway API’s three-tier model (GatewayClass, Gateway, HTTPRoute) maps directly to the cluster admin, platform team, and application team roles that exist in most organizations.

  4. GKE’s Global Application Load Balancer uses Google’s Maglev system, which was published as a research paper in 2016. Maglev is a distributed software load balancer that runs on commodity servers at Google’s edge PoPs. It uses consistent hashing to achieve connection persistence without shared state between load balancer instances. A single Maglev machine can handle 10 million packets per second, and the system has been running Google’s production traffic since 2008.


| Mistake | Why It Happens | How to Fix It |
| --- | --- | --- |
| Creating a routes-based cluster instead of VPC-native | Following outdated tutorials | Always use --enable-ip-alias; it is the default for new clusters, but verify |
| Assuming NetworkPolicy works without Dataplane V2 | Creating policies without enforcement | Enable Dataplane V2 at cluster creation; without it, policies are ignored |
| Undersizing the pod CIDR | Not calculating node count x pods per node | Plan for 3-5x your current node count; you cannot expand the range later |
| Forgetting DNS egress in NetworkPolicy | Writing a deny-all egress policy without a DNS exception | Always include a rule allowing UDP/TCP port 53 to kube-dns pods |
| Using Ingress annotations for advanced routing | Trying to do canary/header routing with GKE Ingress | Switch to Gateway API, which natively supports traffic splitting and header matching |
| Not enabling Cloud NAT for private clusters | Private nodes cannot reach the internet | Configure Cloud NAT on the VPC router before creating private clusters |
| Mixing GKE Ingress and Gateway API on the same cluster | Both create load balancer resources | Choose one approach per cluster; Gateway API is the recommended path forward |
| Ignoring network policy logging | Deploying policies without validation | Enable network policy logging and review denied connections before enforcing broadly |

1. Your e-commerce platform just scaled from 500 to 5,000 microservices. The platform team notices that network routing latency between pods has increased significantly on your older clusters, but remains flat on your new Dataplane V2 clusters. What fundamental architectural difference explains this behavior?

iptables-based routing uses a linear chain of rules that the kernel evaluates sequentially for every packet. When you have 5,000 Services, there are thousands of iptables rules, and each packet must traverse this chain until a match is found, resulting in O(n) complexity. Dataplane V2 uses eBPF hash maps compiled directly into the kernel. Service routing becomes a hash table lookup where the kernel hashes the destination IP and port, looks up the backend pod in O(1) constant time, and rewrites the packet. This means routing performance does not degrade as you add more services, resolving the latency issues seen in older clusters.

2. You deploy a strict `deny-all` egress NetworkPolicy to your `payments` namespace to meet PCI compliance. Suddenly, all pods in the namespace start crash-looping, reporting that they cannot connect to the internal database service `db.backend.svc.cluster.local`, even though you added an egress rule explicitly allowing traffic to the database's IP range. What critical rule is missing?

When you create a NetworkPolicy with policyTypes: ["Egress"] and no egress rules, you implicitly block all outbound traffic from the selected pods, including DNS resolution. Pods resolve service names (like db.backend.svc.cluster.local) by querying the kube-dns (CoreDNS) pods on UDP port 53. Without a DNS exception, pods cannot resolve any service names to IP addresses, meaning your application cannot even attempt the connection to the database. The critical missing rule is an explicit egress rule allowing traffic to kube-dns pods on both UDP and TCP port 53. TCP is required as a fallback for DNS responses larger than 512 bytes.

3. Your organization is moving from Ingress to the Gateway API. The security team wants to strictly control which TLS certificates are used and which namespaces can expose public endpoints, while application developers need the freedom to create path-based routing rules and canary deployments without submitting IT tickets. How does the Gateway API resource model satisfy both teams?

The Gateway API uses a three-tier resource model designed specifically for role-based access. The GatewayClass is managed by the cluster administrator and defines the load balancer implementation. The Gateway is managed by the platform or security team, allowing them to strictly configure TLS certificates, listening ports, and which namespaces can attach routes. The HTTPRoute is managed by the application team, giving them the freedom to define host matching, path routing, headers, and canary weights. This separation means the app team can update their routing autonomously, while the platform team enforces global security policies.

4. A junior engineer provisions a new regional GKE cluster (spanning 3 zones, 2 nodes per zone) and assigns a `/24` CIDR block for the pod secondary range. During the deployment of the first application, several pods remain in a `Pending` state, and the cluster autoscaler fails to add new nodes. What is the root cause of this failure?

A /24 CIDR block provides only 256 IP addresses for the entire pod network. In a VPC-native cluster, each node is allocated its own /24 slice by default to support up to 110 pods. Because a regional cluster with 3 zones and 2 nodes per zone requires 6 nodes in total, it would need at least a /21 for the pod range to accommodate them. The cluster creation will initially succeed, but you will hit scheduling failures and autoscaling blocks when the pod CIDR is immediately exhausted and new pods cannot be assigned IPs. This situation is unrecoverable, as secondary ranges cannot be resized, requiring a full cluster recreation.

5. You are rolling out a critical update to the authentication service and want to route exactly 5% of traffic to the new version. Your cluster uses the Gateway API, but you do not have a service mesh like Istio installed. How can you achieve this granular traffic splitting, and where does the actual routing decision take place?

Gateway API supports traffic splitting natively through the weight field on backendRefs within an HTTPRoute rule. You can specify multiple backend services with different weights (e.g., 95 for stable, 5 for canary), and the load balancer distributes incoming requests proportionally. Unlike Istio’s traffic splitting, which requires a sidecar proxy injecting hops into the data path, GKE Gateway API traffic splitting is programmed directly into the Google Cloud Load Balancer. You update the weights by patching the HTTPRoute resource, and the external load balancer reconfigures within seconds. This provides robust canary deployments as a first-class infrastructure feature without the operational overhead of a service mesh.

6. Your enterprise network team mandates that all new GKE clusters must be private, but they have exhausted the 25 VPC Peering connections limit on the central shared VPC. They also require that the GKE control plane be accessible via a specific private IP address on your on-premises network through Cloud Interconnect. Why is Private Service Connect (PSC) the only viable architecture for this requirement?

The legacy private cluster model relies on VPC peering between your VPC and the Google-managed VPC hosting the control plane. VPC peering is non-transitive, meaning peered networks cannot reach each other through your VPC, and it consumes a strict peering slot limit per VPC. Private Service Connect (PSC) instead creates a forwarding rule in your VPC that routes traffic to the control plane through a localized endpoint. This completely bypasses VPC peering, freeing up peering slots, and crucially supports transitive connectivity so on-premises networks can access the endpoint via Cloud Interconnect. PSC is the modern, scalable approach for private control plane access.


Hands-On Exercise: Dataplane V2 Network Policies and Gateway API Canary


Create a GKE cluster with Dataplane V2, enforce network policies between namespaces, and set up a Gateway API canary deployment with traffic splitting.

  • gcloud CLI installed and authenticated
  • A GCP project with billing enabled and the GKE API enabled
  • kubectl installed

Task 1: Create a GKE Cluster with Dataplane V2 and Gateway API

Solution
Terminal window
export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1
# Create a cluster with Dataplane V2 and Gateway API enabled
gcloud container clusters create net-demo \
--region=$REGION \
--num-nodes=1 \
--machine-type=e2-standard-2 \
--enable-dataplane-v2 \
--enable-ip-alias \
--release-channel=regular \
--gateway-api=standard \
--workload-pool=$PROJECT_ID.svc.id.goog
# Get credentials
gcloud container clusters get-credentials net-demo --region=$REGION
# Verify Dataplane V2 (Cilium pods running)
kubectl -n kube-system get pods -l k8s-app=cilium
# Verify Gateway API CRDs are installed
kubectl get gatewayclass

Task 2: Deploy Two Namespaces with Applications

Solution
Terminal window
# Create namespaces
kubectl create namespace frontend
kubectl create namespace backend
kubectl label namespace frontend role=frontend gateway-access=true
kubectl label namespace backend role=backend
# Deploy backend app
kubectl apply -n backend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        version: stable
    spec:
      containers:
        - name: api
          image: hashicorp/http-echo
          args: ["-text=API v1 (stable)", "-listen=:8080"]
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api-stable
spec:
  selector:
    app: api
    version: stable
  ports:
    - port: 8080
      targetPort: 8080
EOF
# Deploy canary version of backend
kubectl apply -n backend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
      version: canary
  template:
    metadata:
      labels:
        app: api
        version: canary
    spec:
      containers:
        - name: api
          image: hashicorp/http-echo
          args: ["-text=API v2 (canary)", "-listen=:8080"]
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api-canary
spec:
  selector:
    app: api
    version: canary
  ports:
    - port: 8080
      targetPort: 8080
EOF
# Deploy frontend
kubectl apply -n frontend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
EOF
# Verify all pods running
kubectl get pods -n frontend
kubectl get pods -n backend

Task 3: Enforce Network Policies with Dataplane V2

Solution
Terminal window
# Default deny all ingress in the backend namespace
kubectl apply -n backend -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
# Test: frontend cannot reach backend (should timeout)
kubectl run test-curl --rm -it --restart=Never \
-n frontend --image=curlimages/curl -- \
curl -s --connect-timeout 5 http://api-stable.backend:8080 || echo "Connection blocked (expected)"
# Allow frontend namespace to reach backend API
kubectl apply -n backend -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8080
EOF
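Note that `allow-from-frontend` selects source namespaces by label, not by name. The test below only passes if the frontend namespace actually carries the `role: frontend` label; if it was created without labels earlier in the lab, a manifest like the following (illustrative) adds it:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: frontend
  labels:
    role: frontend   # must match the namespaceSelector in allow-from-frontend
```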
# Test again: frontend CAN reach backend now
kubectl run test-curl2 --rm -it --restart=Never \
  -n frontend --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://api-stable.backend:8080
# Test: a random namespace still cannot reach backend
kubectl create namespace attacker
kubectl run test-curl3 --rm -it --restart=Never \
  -n attacker --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://api-stable.backend:8080 || echo "Connection blocked (expected)"
kubectl delete namespace attacker
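Dataplane V2 can also log policy verdicts to Cloud Logging (the network policy logging called out in this module's objectives), which turns silent drops like the ones above into auditable events. A sketch of the cluster-wide NetworkLogging object as I understand the `networking.gke.io/v1alpha1` API — verify the field names against current GKE documentation:

```yaml
kind: NetworkLogging
apiVersion: networking.gke.io/v1alpha1
metadata:
  name: default          # singleton; the object must be named "default"
spec:
  cluster:
    allow:
      log: false         # skip logging allowed connections (high volume)
      delegate: false
    deny:
      log: true          # log denied connections to Cloud Logging
      delegate: false
```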

Task 4: Set Up Gateway API with Canary Traffic Splitting

Solution
# Create a Gateway
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: demo-gateway
  namespace: backend
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
EOF
# Create an HTTPRoute with canary traffic splitting (90/10)
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-stable
      port: 8080
      weight: 90
    - name: api-canary
      port: 8080
      weight: 10
EOF
# Wait for the Gateway to get an IP (takes 2-5 minutes)
echo "Waiting for Gateway IP..."
while true; do
  GW_IP=$(kubectl get gateway demo-gateway -n backend \
    -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
  if [ -n "$GW_IP" ]; then
    echo "Gateway IP: $GW_IP"
    break
  fi
  echo "Still provisioning..."
  sleep 15
done
# Test traffic splitting (run 20 requests, expect ~18 stable, ~2 canary)
echo "Sending 20 requests to $GW_IP..."
for i in $(seq 1 20); do
  curl -s http://$GW_IP
  echo ""
done | sort | uniq -c | sort -rn
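Weight-based splitting sends a random 10% of all users to the canary. Gateway API can also route deterministically by request header, which is useful for letting testers opt in to the canary before any general traffic reaches it. A sketch of such a route (the `x-canary` header name is illustrative, not part of the lab):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-header-route
  namespace: backend
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - matches:                 # requests with x-canary: true go to the canary
    - headers:
      - type: Exact
        name: x-canary
        value: "true"
    backendRefs:
    - name: api-canary
      port: 8080
  - backendRefs:             # everything else goes to stable
    - name: api-stable
      port: 8080
```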

Task 5: Shift Canary Traffic to 50/50 and Then Promote

Solution
# Shift to 50/50
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-stable
      port: 8080
      weight: 50
    - name: api-canary
      port: 8080
      weight: 50
EOF
echo "Waiting 30 seconds for LB to reconfigure..."
sleep 30
# Test again
GW_IP=$(kubectl get gateway demo-gateway -n backend \
  -o jsonpath='{.status.addresses[0].value}')
echo "50/50 split results:"
for i in $(seq 1 20); do
  curl -s http://$GW_IP
done | sort | uniq -c | sort -rn
# Full promotion to canary
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-canary
      port: 8080
      weight: 100
EOF
sleep 30
echo "Full canary promotion results:"
for i in $(seq 1 10); do
  curl -s http://$GW_IP
done

Task 6: Provision a Private Cluster with Private Service Connect (PSC)

Solution
# Create a dedicated subnet for PSC in the default network
gcloud compute networks subnets create psc-subnet \
  --network=default \
  --region=$REGION \
  --range=10.10.0.0/28
# Create a private cluster using PSC instead of VPC peering
gcloud container clusters create psc-demo \
  --region=$REGION \
  --num-nodes=1 \
  --enable-private-nodes \
  --private-endpoint-subnetwork=psc-subnet \
  --enable-ip-alias \
  --enable-master-authorized-networks \
  --master-authorized-networks=0.0.0.0/0  # lab-only; restrict this range in production
# Verify the PSC endpoint IP address
gcloud container clusters describe psc-demo \
  --region=$REGION \
  --format="value(privateClusterConfig.privateEndpoint)"
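To confirm that kubectl traffic actually traverses the PSC endpoint, fetch credentials and hit the API server. This assumes your workstation is in a range permitted by `--master-authorized-networks` and that `gcloud` auth is already configured:

```shell
# Point kubectl at the PSC cluster and confirm control plane reachability
gcloud container clusters get-credentials psc-demo --region=$REGION
kubectl get nodes -o wide
```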

Task 7: Clean Up

Solution
# Delete the Gateway API demo cluster
gcloud container clusters delete net-demo \
  --region=$REGION --quiet
# Delete the PSC demo cluster
gcloud container clusters delete psc-demo \
  --region=$REGION --quiet
# Delete the PSC subnet
gcloud compute networks subnets delete psc-subnet \
  --region=$REGION --quiet
echo "Clusters deleted. Verify no orphaned load balancer resources:"
gcloud compute forwarding-rules list --filter="description~net-demo"
gcloud compute target-http-proxies list --filter="description~net-demo"
  • Cluster created with Dataplane V2 and Gateway API enabled
  • Cilium pods running in kube-system namespace
  • Network policy blocks cross-namespace traffic by default
  • Network policy allows frontend-to-backend traffic on port 8080
  • Gateway API HTTPRoute splits traffic 90/10 between stable and canary
  • Traffic shifting to 50/50 and full promotion works correctly
  • PSC cluster created with a dedicated private endpoint subnet
  • All resources cleaned up

Next up: Module 6.3: GKE Workload Identity and Security --- Learn how to securely connect pods to GCP services without storing credentials, enforce binary authorization for trusted images, and leverage GKE’s security posture dashboard.