
Module 6.2: GKE Networking: Dataplane V2 and Gateway API

Complexity: [COMPLEX] | Time to Complete: 3h | Prerequisites: Module 6.1 (GKE Architecture)

After completing this module, you will be able to:

  • Configure GKE Dataplane V2 (Cilium-based) with network policies and network policy logging
  • Implement Gateway API on GKE for traffic splitting and header-based routing
  • Deploy Private Service Connect for secure control plane access on GKE
  • Diagnose GKE networking issues related to IP exhaustion and pod-to-service communication failures

In September 2023, a healthcare SaaS company running on GKE discovered that their network policies were not being enforced. A penetration tester demonstrated that a compromised pod in the staging namespace could freely communicate with pods in the production namespace, despite NetworkPolicy resources that should have blocked cross-namespace traffic. The root cause: the cluster was using the legacy iptables-based kube-proxy dataplane, which does not enforce Kubernetes NetworkPolicy at all. The team had assumed that creating NetworkPolicy resources was sufficient---they did not realize that enforcement requires a CNI that supports it. The compliance violation cost them a SOC 2 audit failure, delaying a $2.3 million enterprise deal by four months. The fix took 30 minutes: enable Dataplane V2 on their next cluster creation. The business impact lasted a quarter.

GKE networking is where Kubernetes meets Google’s global network infrastructure. The decisions you make about cluster networking---VPC-native mode, Dataplane V2, load balancing strategy, and Gateway API configuration---determine your application’s performance, security, and cost. A misconfigured network can leave your pods exposed, introduce unnecessary latency, or rack up egress charges that dwarf your compute costs.

In this module, you will learn how VPC-native clusters use alias IPs to give pods routable addresses, how Dataplane V2 replaces iptables with eBPF for faster and more observable networking, how Cloud Load Balancing integrates with GKE, and how the Gateway API provides a more expressive routing model than Ingress. By the end, you will configure Dataplane V2 network policies and set up a Gateway API canary deployment.


Every modern GKE cluster should be VPC-native. This is the default since GKE 1.21 and is required for features like Dataplane V2, Private Google Access for pods, and VPC flow logs for pod traffic.

In a VPC-native cluster, each node receives a primary IP from the subnet and a secondary IP range (alias range) for its pods. This means pods get IP addresses that are routable within the VPC---no NAT, no overlay network.

VPC: 10.0.0.0/16
┌────────────────────────────────────────────────────┐
│ Subnet: 10.0.0.0/24 (Node IPs)                     │
│                                                    │
│  ┌─────────────────┐      ┌─────────────────┐      │
│  │ Node A          │      │ Node B          │      │
│  │ IP: 10.0.0.2    │      │ IP: 10.0.0.3    │      │
│  │                 │      │                 │      │
│  │ Alias: 10.4.0.0 │      │ Alias: 10.4.1.0 │      │
│  │ /24 (pods)      │      │ /24 (pods)      │      │
│  │ ┌────┐ ┌────┐   │      │ ┌────┐ ┌────┐   │      │
│  │ │Pod │ │Pod │   │      │ │Pod │ │Pod │   │      │
│  │ │.2  │ │.3  │   │      │ │.5  │ │.8  │   │      │
│  │ └────┘ └────┘   │      │ └────┘ └────┘   │      │
│  └─────────────────┘      └─────────────────┘      │
│                                                    │
│ Secondary Range "pods":     10.4.0.0/14            │
│ Secondary Range "services": 10.8.0.0/20            │
└────────────────────────────────────────────────────┘
| Feature | VPC-Native (Alias IPs) | Routes-Based (Legacy) |
| --- | --- | --- |
| Pod IPs routable in VPC | Yes (directly) | No (requires custom routes) |
| Max pods per cluster | Limited by IP range size | Limited to 300 custom routes |
| Network Policy support | Full (Dataplane V2) | Limited |
| Private Google Access for pods | Yes | No |
| VPC Flow Logs for pods | Yes | No |
| Peering/VPN compatibility | Full | Route export required |
Terminal window
# Verify your cluster is VPC-native
gcloud container clusters describe my-cluster \
--region=us-central1 \
--format="yaml(ipAllocationPolicy)"
# Expected output includes:
# useIpAliases: true
# clusterSecondaryRangeName: pods
# servicesSecondaryRangeName: services

Stop and think: If a VPC-native cluster uses alias IPs directly from the VPC, what happens if your VPC doesn’t have a large enough secondary range for your planned number of nodes and pods at maximum scale?

Poor IP planning is the number one networking regret for teams that scale. You cannot resize secondary ranges after cluster creation.

Planning Guide:
┌────────────────────────────────────────────────────┐
│ Each node gets a /24 from the pod range by default │
│   = 256 IPs per node (110 pods max + overhead)     │
│                                                    │
│ For 100 nodes: you need 100 x /24 = /17 minimum    │
│ For 500 nodes: you need 500 x /24 = /15 minimum    │
│                                                    │
│ Services range:                                    │
│   /20 = 4,096 services (usually sufficient)        │
│   /16 = 65,536 services (very large clusters)      │
└────────────────────────────────────────────────────┘
Terminal window
# Create a cluster with explicit IP planning for scale
gcloud container clusters create large-cluster \
--region=us-central1 \
--num-nodes=2 \
--network=prod-vpc \
--subnetwork=gke-subnet \
--cluster-secondary-range-name=gke-pods \
--services-secondary-range-name=gke-services \
--enable-ip-alias \
--max-pods-per-node=64 \
--default-max-pods-per-node=64
# Reducing max-pods-per-node from 110 to 64 means GKE allocates a /25
# per node instead of a /24 (it reserves roughly twice the pod maximum),
# halving the pod IP space each node consumes
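The planning arithmetic above can be checked with plain shell. This is a standalone sketch with no gcloud calls; the numbers mirror the planning guide:

```shell
# Minimum pod-range prefix for a given node count, assuming each
# node consumes a /24 (the default for 110 max pods per node).
nodes=100
ips_per_node=$(( 2 ** (32 - 24) ))      # 256 IPs per node
total=$(( nodes * ips_per_node ))       # 25,600 IPs needed
prefix=32
while [ $(( 2 ** (32 - prefix) )) -lt "$total" ]; do
  prefix=$(( prefix - 1 ))
done
echo "minimum pod range: /${prefix}"    # /17 for 100 nodes
```

Rerunning with nodes=500 yields /15, matching the guide; padding by one or two extra bits on top of this is cheap insurance against growth.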

Dataplane V2 is GKE’s modern networking stack, built on Cilium and eBPF. It replaces the traditional kube-proxy + iptables approach with a programmable, kernel-level dataplane.

Traditional Kubernetes networking uses iptables rules for service routing and kube-proxy for load balancing. This works, but it has fundamental limitations:

Legacy (iptables/kube-proxy):
┌────────────────────────────────────────────────────┐
│ Packet arrives at node                             │
│   │                                                │
│   ▼                                                │
│ iptables chain (linear scan)                       │
│   Rule 1:     no match                             │
│   Rule 2:     no match                             │
│   Rule 3:     no match                             │
│   ...                                              │
│   Rule 5,000: MATCH → DNAT to pod IP               │
│                                                    │
│ O(n) performance: more services = slower routing   │
└────────────────────────────────────────────────────┘
Dataplane V2 (eBPF):
┌────────────────────────────────────────────────────┐
│ Packet arrives at node                             │
│   │                                                │
│   ▼                                                │
│ eBPF hash map lookup                               │
│   Key:   {dest IP, dest port}                      │
│   Value: backend pod IP                            │
│                                                    │
│ O(1) performance: constant time regardless of      │
│ number of services                                 │
└────────────────────────────────────────────────────┘
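You can look at this map directly on a Dataplane V2 node. This is a sketch: the anetd DaemonSet name and the cilium CLI embedded in it are GKE implementation details and may differ between versions:

```shell
# Dump the eBPF load-balancing table that replaces iptables rules.
# Each entry maps {service IP, port} straight to backend pod IPs.
kubectl -n kube-system exec ds/anetd -- cilium bpf lb list | head -20
```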

Pause and predict: If Dataplane V2 uses eBPF hash maps instead of iptables, how might this change the way you troubleshoot dropped packets or connection timeouts compared to legacy clusters?

| Capability | iptables/kube-proxy | Dataplane V2 |
| --- | --- | --- |
| Service routing | O(n) linear scan | O(1) hash lookup |
| Network Policy enforcement | Requires Calico add-on | Built-in (Cilium) |
| Network Policy logging | Not available | Built-in |
| Kernel bypass | No | Yes (XDP for some paths) |
| Observability | Basic conntrack | Rich eBPF flow logs |
| Scale limit | ~5,000 services practical | 25,000+ services tested |
| FQDN-based policies | Not supported | Supported |
Terminal window
# Dataplane V2 is enabled at cluster creation time
gcloud container clusters create dpv2-cluster \
--region=us-central1 \
--num-nodes=2 \
--enable-dataplane-v2 \
--enable-ip-alias \
--release-channel=regular
# For Autopilot clusters, Dataplane V2 is enabled by default
gcloud container clusters create-auto dpv2-autopilot \
--region=us-central1
# Verify Dataplane V2 is active
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

With Dataplane V2, NetworkPolicy resources are enforced without any additional CNI installation. This is the feature that the healthcare company in our opening story was missing.

# Deny all ingress to production namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# Allow only the API gateway to reach backend pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-gateway
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: gateway
          podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080
---
# Allow DNS resolution for all pods (critical, often forgotten)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
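You can smoke-test these policies with throwaway pods. A sketch: the service name backend.production is hypothetical (the policies above select pods by label, not by service name), and the probes assume the three manifests are applied:

```shell
# From a namespace without the gateway role label: should be blocked.
kubectl run np-probe --rm -it --restart=Never \
  -n default --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://backend.production.svc:8080 \
  || echo "blocked (expected)"

# DNS from production should still resolve, thanks to allow-dns.
kubectl run dns-probe --rm -it --restart=Never \
  -n production --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```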

Dataplane V2 can log allowed and denied connections, which is invaluable for debugging and compliance.

Terminal window
# Enable network policy logging on the cluster
gcloud container clusters update dpv2-cluster \
--region=us-central1 \
--enable-network-policy-logging
# View logs in Cloud Logging
gcloud logging read \
'resource.type="k8s_node" AND jsonPayload.disposition="deny"' \
--limit=10 \
--format="table(timestamp, jsonPayload.src.pod_name, jsonPayload.dest.pod_name, jsonPayload.disposition)"

War Story: A platform team enabled network policy logging and discovered that their monitoring agent (Datadog) was making 3,000 denied connections per minute to pods in restricted namespaces. The agent had broad scrape targets configured, and every denied connection generated a log entry. Before enabling logging in production, test in a staging environment to understand the log volume---it can be surprisingly high.


GKE integrates tightly with Google Cloud Load Balancing. When you create a Kubernetes Service or Ingress, GKE provisions the corresponding Google Cloud load balancer components automatically.

| Kubernetes Concept | GCP Resource Created |
| --- | --- |
| Service type: ClusterIP | Nothing (internal only) |
| Service type: NodePort | Nothing (opens port on nodes) |
| Service type: LoadBalancer | Network Load Balancer (L4) |
| Ingress (external) | Application Load Balancer (L7) |
| Gateway (external) | Application Load Balancer (L7) |
| Service Type | Layer | Scope | Use Case |
| --- | --- | --- | --- |
| LoadBalancer | L4 (TCP/UDP) | Regional (default) | Non-HTTP, gRPC without path routing |
| Ingress (GKE Ingress) | L7 (HTTP/S) | Global | HTTP routing with host/path rules |
| Gateway (Gateway API) | L7 (HTTP/S) | Global or Regional | Modern alternative to Ingress |
| Internal LoadBalancer | L4 | Regional | Internal services, not internet-facing |
| Internal Ingress | L7 | Regional | Internal HTTP routing |

Stop and think: If you expose an internal gRPC service that requires L7 routing and TLS termination, which GKE service type or ingress method should you choose instead of a standard LoadBalancer?

# Simple L4 load balancer
apiVersion: v1
kind: Service
metadata:
  name: game-server
spec:
  type: LoadBalancer
  selector:
    app: game-server
  ports:
    - port: 7777
      targetPort: 7777
      protocol: UDP
Terminal window
# Check the provisioned load balancer
kubectl get svc game-server -o wide
# The EXTERNAL-IP column shows the Google Cloud LB IP
# View the underlying GCP forwarding rule
gcloud compute forwarding-rules list \
--filter="description~game-server"

GKE Ingress creates a Google Cloud Application Load Balancer (formerly HTTP(S) Load Balancer) with features like SSL termination, URL-based routing, and Cloud CDN integration.

# Multi-service Ingress with path-based routing
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: web-static-ip
    networking.gke.io/managed-certificates: web-cert
    kubernetes.io/ingress.class: gce
spec:
  defaultBackend:
    service:
      name: frontend
      port:
        number: 80
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /api/*
            pathType: ImplementationSpecific
            backend:
              service:
                name: api-service
                port:
                  number: 8080
          - path: /static/*
            pathType: ImplementationSpecific
            backend:
              service:
                name: static-assets
                port:
                  number: 80

Gateway API: The Future of Kubernetes Routing


The Gateway API is a Kubernetes-native evolution of Ingress that provides richer routing capabilities, better role separation, and a more consistent experience across implementations. GKE fully supports the Gateway API and it is the recommended approach for new deployments.

Pause and predict: In the Gateway API model, if the infrastructure team modifies the Gateway resource to restrict allowed namespaces, what happens to the existing HTTPRoutes in namespaces that are no longer allowed?

Ingress Model (flat):
┌──────────────────────────────────────┐
│ Ingress Resource                     │
│ (mixes infra config + routing)       │
│                                      │
│ - TLS config (infra team concern)    │
│ - Host rules (app team concern)      │
│ - Path rules (app team concern)      │
│ - Backend refs (app team concern)    │
│                                      │
│ ONE resource, ONE owner = conflict   │
└──────────────────────────────────────┘
Gateway API Model (layered):
┌──────────────────────────────────────┐
│ GatewayClass (cluster admin)         │
│ "Which load balancer implementation" │
└──────────────┬───────────────────────┘
┌──────────────▼───────────────────────┐
│ Gateway (infra/platform team)        │
│ "Listener config, TLS, IP address"   │
└──────────────┬───────────────────────┘
┌──────────────▼───────────────────────┐
│ HTTPRoute (app team)                 │
│ "Host matching, path routing,        │
│  headers, canary weights"            │
└──────────────────────────────────────┘

GKE provides several pre-installed GatewayClasses:

| GatewayClass | Load Balancer Type | Scope | Use Case |
| --- | --- | --- | --- |
| gke-l7-global-external-managed | Global external ALB | Global | Public-facing web apps |
| gke-l7-regional-external-managed | Regional external ALB | Regional | Region-specific apps |
| gke-l7-rilb | Regional internal ALB | Regional | Internal microservices |
| gke-l7-gxlb | Classic global external ALB | Global | Legacy, avoid for new |
Terminal window
# List available GatewayClasses in your cluster
kubectl get gatewayclass
# Enable the Gateway API on an existing cluster
gcloud container clusters update my-cluster \
--region=us-central1 \
--gateway-api=standard
# Step 1: Create the Gateway (platform/infra team)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
  namespace: infra
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: tls-cert
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              gateway-access: "true"
# Step 2: Create an HTTPRoute (app team)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-route
  namespace: store
  labels:
    gateway: external-gateway
spec:
  parentRefs:
    - kind: Gateway
      name: external-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api
          port: 8080
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: store-frontend
          port: 80

The Gateway API natively supports traffic splitting by weight---something that required Istio or custom annotations with Ingress.

# Canary: send 90% to stable, 10% to canary
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-api-canary
  namespace: store
spec:
  parentRefs:
    - kind: Gateway
      name: external-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api-stable
          port: 8080
          weight: 90
        - name: store-api-canary
          port: 8080
          weight: 10

To gradually shift traffic, update the weights:

Terminal window
# Move to 50/50
kubectl patch httproute store-api-canary -n store --type=merge -p '{
"spec": {
"rules": [{
"matches": [{"path": {"type": "PathPrefix", "value": "/api"}}],
"backendRefs": [
{"name": "store-api-stable", "port": 8080, "weight": 50},
{"name": "store-api-canary", "port": 8080, "weight": 50}
]
}]
}
}'
# Promote canary to 100%
kubectl patch httproute store-api-canary -n store --type=merge -p '{
"spec": {
"rules": [{
"matches": [{"path": {"type": "PathPrefix", "value": "/api"}}],
"backendRefs": [
{"name": "store-api-canary", "port": 8080, "weight": 100}
]
}]
}
}'
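To confirm the split is taking effect, sample the endpoint. A sketch: it assumes the Gateway's provisioned address is already in $GW_IP and that the stable and canary backends return distinguishable bodies:

```shell
# Send 100 requests and count which backend answered each one.
for i in $(seq 1 100); do
  curl -s -H "Host: store.example.com" "http://$GW_IP/api"
  echo
done | sort | uniq -c
```

With the 50/50 patch applied, the two counts should be roughly equal; expect some variance, since the load balancer distributes proportionally, not deterministically.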

Gateway API also supports routing based on HTTP headers, which is useful for testing in production.

# Route requests with X-Canary: true header to canary service
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: store-api-header-routing
  namespace: store
spec:
  parentRefs:
    - kind: Gateway
      name: external-gateway
      namespace: infra
  hostnames:
    - "store.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
          headers:
            - name: X-Canary
              value: "true"
      backendRefs:
        - name: store-api-canary
          port: 8080
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: store-api-stable
          port: 8080
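Two curl probes are enough to exercise the header match. A sketch, assuming the Gateway's external address is in $GW_IP (not defined in the manifests above):

```shell
# With the header, the request matches the first rule (canary backend).
curl -s -H "Host: store.example.com" -H "X-Canary: true" "http://$GW_IP/api"

# Without it, the request falls through to the stable backend.
curl -s -H "Host: store.example.com" "http://$GW_IP/api"
```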

Private Service Connect (PSC) allows you to access the GKE control plane through a private endpoint within your VPC, eliminating exposure to the public internet.

Terminal window
# Create a private cluster with PSC
gcloud container clusters create private-cluster \
--region=us-central1 \
--num-nodes=1 \
--enable-private-nodes \
--enable-private-endpoint \
--master-ipv4-cidr=172.16.0.0/28 \
--enable-ip-alias \
--enable-master-authorized-networks \
--master-authorized-networks=10.0.0.0/8
# With PSC (newer approach, recommended):
gcloud container clusters create psc-cluster \
--region=us-central1 \
--num-nodes=1 \
--enable-private-nodes \
--private-endpoint-subnetwork=psc-subnet \
--enable-ip-alias
Private Cluster with PSC:
┌─────────────────────────────────────────────────────┐
│ Google-Managed VPC                                  │
│   ┌─────────────────────────────────────┐           │
│   │ GKE Control Plane                   │           │
│   │ (API Server, etcd, etc.)            │           │
│   └──────────────┬──────────────────────┘           │
│                  │ Private Service Connect          │
└──────────────────┼──────────────────────────────────┘
┌──────────────────▼──────────────────────────────────┐
│ Customer VPC                                        │
│   ┌──────────────────┐                              │
│   │ PSC Endpoint     │ ← Private IP in your VPC     │
│   │ 10.0.5.2         │   for control plane access   │
│   └──────────────────┘                              │
│                                                     │
│   ┌──────────────────┐                              │
│   │ GKE Nodes        │ ← No public IPs              │
│   │ 10.0.0.0/24      │                              │
│   └──────────────────┘                              │
└─────────────────────────────────────────────────────┘

Stop and think: If you use Private Service Connect for your GKE control plane and have disabled public IP access, how will your cloud-hosted CI/CD pipeline (e.g., GitHub Actions) authenticate and deploy manifests to the cluster?

| Consideration | Impact | Solution |
| --- | --- | --- |
| Nodes cannot pull from internet | Container image pulls fail | Use Artifact Registry (in same region) or configure Cloud NAT |
| kubectl from local machine blocked | Cannot manage cluster | Use Cloud Shell, a bastion VM, or VPN/Interconnect |
| Webhooks from control plane to nodes | Admission webhooks may fail | Ensure firewall allows control plane CIDR to node ports |
| Cloud Build access | CI/CD pipelines cannot reach API | Use Cloud Build private pools or deploy via Cloud Deploy |
Terminal window
# Set up Cloud NAT for private nodes to pull images
gcloud compute routers create nat-router \
--network=prod-vpc \
--region=us-central1
gcloud compute routers nats create nat-config \
--router=nat-router \
--region=us-central1 \
--auto-allocate-nat-external-ips \
--nat-all-subnet-ip-ranges
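Before relying on the NAT, verify it. A sketch: the nat-config and nat-router names come from the commands above, and the registry URL is just an arbitrary external endpoint to prove outbound connectivity:

```shell
# Confirm the NAT config is attached to the router.
gcloud compute routers nats describe nat-config \
  --router=nat-router \
  --region=us-central1

# From inside the cluster, any successful outbound request implies
# working NAT on private nodes (expect HTTP response headers back).
kubectl run nat-probe --rm -it --restart=Never \
  --image=curlimages/curl -- curl -sI https://registry-1.docker.io/v2/
```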

  1. Dataplane V2 uses the same eBPF technology that powers Meta’s (Facebook’s) entire network stack. Meta processes over 600 billion eBPF events per day across their fleet. In GKE, Dataplane V2’s eBPF programs are compiled and loaded into the Linux kernel at node boot, where they intercept and process packets before they ever reach userspace. This is why Dataplane V2 can achieve 26% lower latency than iptables-based routing in benchmarks with 10,000+ services.

  2. A single GKE cluster can support up to 65,000 nodes and 400,000 pods. The practical networking limit is usually IP exhaustion rather than cluster capacity. A /14 pod CIDR gives you roughly 262,144 pod IPs. If each node uses a /24 for pods (the default for 110 max pods per node), you can support about 1,024 nodes before running out of pod IPs. Planning your IP ranges at cluster creation is one of the few decisions you truly cannot change later.

  3. The Gateway API was designed by a cross-vendor working group including engineers from Google, Red Hat, HashiCorp, and VMware. The key insight was that Ingress combined infrastructure concerns (TLS, IP addresses) with application concerns (routing rules) in a single resource, making it impossible to safely delegate to different teams. Gateway API’s three-tier model (GatewayClass, Gateway, HTTPRoute) maps directly to the cluster admin, platform team, and application team roles that exist in most organizations.

  4. GKE’s Global Application Load Balancer uses Google’s Maglev system, which was published as a research paper in 2016. Maglev is a distributed software load balancer that runs on commodity servers at Google’s edge PoPs. It uses consistent hashing to achieve connection persistence without shared state between load balancer instances. A single Maglev machine can handle 10 million packets per second, and the system has been running Google’s production traffic since 2008.


| Mistake | Why It Happens | How to Fix It |
| --- | --- | --- |
| Creating a routes-based cluster instead of VPC-native | Following outdated tutorials | Always use --enable-ip-alias; it is the default for new clusters, but verify |
| Assuming NetworkPolicy works without Dataplane V2 | Creating policies without enforcement | Enable Dataplane V2 at cluster creation; without it, policies are ignored |
| Undersizing the pod CIDR | Not calculating node count x pods per node | Plan for 3-5x your current node count; you cannot expand the range later |
| Forgetting DNS egress in NetworkPolicy | Writing a deny-all egress policy without a DNS exception | Always include a rule allowing UDP/TCP port 53 to kube-dns pods |
| Using Ingress annotations for advanced routing | Trying to do canary/header routing with GKE Ingress | Switch to Gateway API, which natively supports traffic splitting and header matching |
| Not enabling Cloud NAT for private clusters | Private nodes cannot reach the internet | Configure Cloud NAT on the VPC router before creating private clusters |
| Mixing GKE Ingress and Gateway API on the same cluster | Both create load balancer resources | Choose one approach per cluster; Gateway API is the recommended path forward |
| Ignoring network policy logging | Deploying policies without validation | Enable network policy logging and review denied connections before enforcing broadly |

1. Your e-commerce platform just scaled from 500 to 5,000 microservices. The platform team notices that network routing latency between pods has increased significantly on your older clusters, but remains flat on your new Dataplane V2 clusters. What fundamental architectural difference explains this behavior?

iptables-based routing uses a linear chain of rules that the kernel evaluates sequentially for every packet. When you have 5,000 Services, there are thousands of iptables rules, and each packet must traverse this chain until a match is found, resulting in O(n) complexity. Dataplane V2 uses eBPF hash maps compiled directly into the kernel. Service routing becomes a hash table lookup where the kernel hashes the destination IP and port, looks up the backend pod in O(1) constant time, and rewrites the packet. This means routing performance does not degrade as you add more services, resolving the latency issues seen in older clusters.

2. You deploy a strict `deny-all` egress NetworkPolicy to your `payments` namespace to meet PCI compliance. Suddenly, all pods in the namespace start crash-looping, reporting that they cannot connect to the internal database service `db.backend.svc.cluster.local`, even though you added an egress rule explicitly allowing traffic to the database's IP range. What critical rule is missing?

When you create a NetworkPolicy with policyTypes: ["Egress"] and no egress rules, you implicitly block all outbound traffic from the selected pods, including DNS resolution. Pods resolve service names (like db.backend.svc.cluster.local) by querying the kube-dns (CoreDNS) pods on UDP port 53. Without a DNS exception, pods cannot resolve any service names to IP addresses, meaning your application cannot even attempt the connection to the database. The critical missing rule is an explicit egress rule allowing traffic to kube-dns pods on both UDP and TCP port 53. TCP is required as a fallback for DNS responses larger than 512 bytes.

3. Your organization is moving from Ingress to the Gateway API. The security team wants to strictly control which TLS certificates are used and which namespaces can expose public endpoints, while application developers need the freedom to create path-based routing rules and canary deployments without submitting IT tickets. How does the Gateway API resource model satisfy both teams?

The Gateway API uses a three-tier resource model designed specifically for role-based access. The GatewayClass is managed by the cluster administrator and defines the load balancer implementation. The Gateway is managed by the platform or security team, allowing them to strictly configure TLS certificates, listening ports, and which namespaces can attach routes. The HTTPRoute is managed by the application team, giving them the freedom to define host matching, path routing, headers, and canary weights. This separation means the app team can update their routing autonomously, while the platform team enforces global security policies.

4. A junior engineer provisions a new regional GKE cluster (spanning 3 zones, 2 nodes per zone) and assigns a `/24` CIDR block for the pod secondary range. During the deployment of the first application, several pods remain in a `Pending` state, and the cluster autoscaler fails to add new nodes. What is the root cause of this failure?

A /24 CIDR block provides only 256 IP addresses for the entire pod network. In a VPC-native cluster, each node is allocated its own /24 slice by default to support up to 110 pods. Because a regional cluster with 3 zones and 2 nodes per zone requires 6 nodes in total, it would need at least a /21 for the pod range to accommodate them. The cluster creation will initially succeed, but you will hit scheduling failures and autoscaling blocks when the pod CIDR is immediately exhausted and new pods cannot be assigned IPs. This situation is unrecoverable, as secondary ranges cannot be resized, requiring a full cluster recreation.

5. You are rolling out a critical update to the authentication service and want to route exactly 5% of traffic to the new version. Your cluster uses the Gateway API, but you do not have a service mesh like Istio installed. How can you achieve this granular traffic splitting, and where does the actual routing decision take place?

Gateway API supports traffic splitting natively through the weight field on backendRefs within an HTTPRoute rule. You can specify multiple backend services with different weights (e.g., 95 for stable, 5 for canary), and the load balancer distributes incoming requests proportionally. Unlike Istio’s traffic splitting, which requires a sidecar proxy injecting hops into the data path, GKE Gateway API traffic splitting is programmed directly into the Google Cloud Load Balancer. You update the weights by patching the HTTPRoute resource, and the external load balancer reconfigures within seconds. This provides robust canary deployments as a first-class infrastructure feature without the operational overhead of a service mesh.

6. Your enterprise network team mandates that all new GKE clusters must be private, but they have exhausted the 25 VPC Peering connections limit on the central shared VPC. They also require that the GKE control plane be accessible via a specific private IP address on your on-premises network through Cloud Interconnect. Why is Private Service Connect (PSC) the only viable architecture for this requirement?

The legacy private cluster model relies on VPC peering between your VPC and the Google-managed VPC hosting the control plane. VPC peering is non-transitive, meaning peered networks cannot reach each other through your VPC, and it consumes a strict peering slot limit per VPC. Private Service Connect (PSC) instead creates a forwarding rule in your VPC that routes traffic to the control plane through a localized endpoint. This completely bypasses VPC peering, freeing up peering slots, and crucially supports transitive connectivity so on-premises networks can access the endpoint via Cloud Interconnect. PSC is the modern, scalable approach for private control plane access.


Hands-On Exercise: Dataplane V2 Network Policies and Gateway API Canary


Create a GKE cluster with Dataplane V2, enforce network policies between namespaces, and set up a Gateway API canary deployment with traffic splitting.

  • gcloud CLI installed and authenticated
  • A GCP project with billing enabled and the GKE API enabled
  • kubectl installed

Task 1: Create a GKE Cluster with Dataplane V2 and Gateway API

Solution
Terminal window
export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1
# Create a cluster with Dataplane V2 and Gateway API enabled
gcloud container clusters create net-demo \
--region=$REGION \
--num-nodes=1 \
--machine-type=e2-standard-2 \
--enable-dataplane-v2 \
--enable-ip-alias \
--release-channel=regular \
--gateway-api=standard \
--workload-pool=$PROJECT_ID.svc.id.goog
# Get credentials
gcloud container clusters get-credentials net-demo --region=$REGION
# Verify Dataplane V2 (Cilium pods running)
kubectl -n kube-system get pods -l k8s-app=cilium
# Verify Gateway API CRDs are installed
kubectl get gatewayclass

Task 2: Deploy Two Namespaces with Applications

Solution
Terminal window
# Create namespaces
kubectl create namespace frontend
kubectl create namespace backend
kubectl label namespace frontend role=frontend gateway-access=true
kubectl label namespace backend role=backend
# Deploy backend app
kubectl apply -n backend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        version: stable
    spec:
      containers:
        - name: api
          image: hashicorp/http-echo
          args: ["-text=API v1 (stable)", "-listen=:8080"]
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api-stable
spec:
  selector:
    app: api
    version: stable
  ports:
    - port: 8080
      targetPort: 8080
EOF
# Deploy canary version of backend
kubectl apply -n backend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api
      version: canary
  template:
    metadata:
      labels:
        app: api
        version: canary
    spec:
      containers:
        - name: api
          image: hashicorp/http-echo
          args: ["-text=API v2 (canary)", "-listen=:8080"]
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: api-canary
spec:
  selector:
    app: api
    version: canary
  ports:
    - port: 8080
      targetPort: 8080
EOF
# Deploy frontend
kubectl apply -n frontend -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
EOF
# Verify all pods running
kubectl get pods -n frontend
kubectl get pods -n backend

Task 3: Enforce Network Policies with Dataplane V2

Solution
Terminal window
# Default deny all ingress in the backend namespace
kubectl apply -n backend -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
# Test: frontend cannot reach backend (should timeout)
kubectl run test-curl --rm -it --restart=Never \
-n frontend --image=curlimages/curl -- \
curl -s --connect-timeout 5 http://api-stable.backend:8080 || echo "Connection blocked (expected)"
# Allow frontend namespace to reach backend API
kubectl apply -n backend -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 8080
EOF
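Note that `allow-from-frontend` selects source namespaces by label, not by name. The test below only passes if the frontend namespace actually carries the `role: frontend` label; if it was created without labels earlier in the lab, a manifest like the following (illustrative) adds it:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: frontend
  labels:
    role: frontend   # must match the namespaceSelector in allow-from-frontend
```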
# Test again: frontend CAN reach backend now
kubectl run test-curl2 --rm -it --restart=Never \
  -n frontend --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://api-stable.backend:8080
# Test: a random namespace still cannot reach backend
kubectl create namespace attacker
kubectl run test-curl3 --rm -it --restart=Never \
  -n attacker --image=curlimages/curl -- \
  curl -s --connect-timeout 5 http://api-stable.backend:8080 || echo "Connection blocked (expected)"
kubectl delete namespace attacker
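Dataplane V2 can also log policy verdicts to Cloud Logging (the network policy logging called out in this module's objectives), which turns silent drops like the ones above into auditable events. A sketch of the cluster-wide NetworkLogging object as I understand the `networking.gke.io/v1alpha1` API — verify the field names against current GKE documentation:

```yaml
kind: NetworkLogging
apiVersion: networking.gke.io/v1alpha1
metadata:
  name: default          # singleton; the object must be named "default"
spec:
  cluster:
    allow:
      log: false         # skip logging allowed connections (high volume)
      delegate: false
    deny:
      log: true          # log denied connections to Cloud Logging
      delegate: false
```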

Task 4: Set Up Gateway API with Canary Traffic Splitting

Solution
# Create a Gateway
kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: demo-gateway
  namespace: backend
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: Same
EOF
# Create an HTTPRoute with canary traffic splitting (90/10)
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-stable
      port: 8080
      weight: 90
    - name: api-canary
      port: 8080
      weight: 10
EOF
# Wait for the Gateway to get an IP (takes 2-5 minutes)
echo "Waiting for Gateway IP..."
while true; do
  GW_IP=$(kubectl get gateway demo-gateway -n backend \
    -o jsonpath='{.status.addresses[0].value}' 2>/dev/null)
  if [ -n "$GW_IP" ]; then
    echo "Gateway IP: $GW_IP"
    break
  fi
  echo "Still provisioning..."
  sleep 15
done
# Test traffic splitting (run 20 requests, expect ~18 stable, ~2 canary)
echo "Sending 20 requests to $GW_IP..."
for i in $(seq 1 20); do
  curl -s http://$GW_IP
  echo ""
done | sort | uniq -c | sort -rn
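Weight-based splitting sends a random 10% of all users to the canary. Gateway API can also route deterministically by request header, which is useful for letting testers opt in to the canary before any general traffic reaches it. A sketch of such a route (the `x-canary` header name is illustrative, not part of the lab):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-header-route
  namespace: backend
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - matches:                 # requests with x-canary: true go to the canary
    - headers:
      - type: Exact
        name: x-canary
        value: "true"
    backendRefs:
    - name: api-canary
      port: 8080
  - backendRefs:             # everything else goes to stable
    - name: api-stable
      port: 8080
```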

Task 5: Shift Canary Traffic to 50/50 and Then Promote

Solution
# Shift to 50/50
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-stable
      port: 8080
      weight: 50
    - name: api-canary
      port: 8080
      weight: 50
EOF
echo "Waiting 30 seconds for LB to reconfigure..."
sleep 30
# Test again
GW_IP=$(kubectl get gateway demo-gateway -n backend \
  -o jsonpath='{.status.addresses[0].value}')
echo "50/50 split results:"
for i in $(seq 1 20); do
  curl -s http://$GW_IP
done | sort | uniq -c | sort -rn
# Full promotion to canary
kubectl apply -n backend -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-canary-route
spec:
  parentRefs:
  - kind: Gateway
    name: demo-gateway
    namespace: backend
  rules:
  - backendRefs:
    - name: api-canary
      port: 8080
      weight: 100
EOF
sleep 30
echo "Full canary promotion results:"
for i in $(seq 1 10); do
  curl -s http://$GW_IP
done

Task 6: Provision a Private Cluster with Private Service Connect (PSC)

Solution
# Create a dedicated subnet for PSC in the default network
gcloud compute networks subnets create psc-subnet \
  --network=default \
  --region=$REGION \
  --range=10.10.0.0/28
# Create a private cluster using PSC instead of VPC peering
gcloud container clusters create psc-demo \
  --region=$REGION \
  --num-nodes=1 \
  --enable-private-nodes \
  --private-endpoint-subnetwork=psc-subnet \
  --enable-ip-alias \
  --enable-master-authorized-networks \
  --master-authorized-networks=0.0.0.0/0  # lab-only; restrict this range in production
# Verify the PSC endpoint IP address
gcloud container clusters describe psc-demo \
  --region=$REGION \
  --format="value(privateClusterConfig.privateEndpoint)"
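To confirm that kubectl traffic actually traverses the PSC endpoint, fetch credentials and hit the API server. This assumes your workstation is in a range permitted by `--master-authorized-networks` and that `gcloud` auth is already configured:

```shell
# Point kubectl at the PSC cluster and confirm control plane reachability
gcloud container clusters get-credentials psc-demo --region=$REGION
kubectl get nodes -o wide
```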

Task 7: Clean Up

Solution
# Delete the Gateway API demo cluster
gcloud container clusters delete net-demo \
  --region=$REGION --quiet
# Delete the PSC demo cluster
gcloud container clusters delete psc-demo \
  --region=$REGION --quiet
# Delete the PSC subnet
gcloud compute networks subnets delete psc-subnet \
  --region=$REGION --quiet
echo "Clusters deleted. Verify no orphaned load balancer resources:"
gcloud compute forwarding-rules list --filter="description~net-demo"
gcloud compute target-http-proxies list --filter="description~net-demo"
  • Cluster created with Dataplane V2 and Gateway API enabled
  • Cilium pods running in kube-system namespace
  • Network policy blocks cross-namespace traffic by default
  • Network policy allows frontend-to-backend traffic on port 8080
  • Gateway API HTTPRoute splits traffic 90/10 between stable and canary
  • Traffic shifting to 50/50 and full promotion works correctly
  • PSC cluster created with a dedicated private endpoint subnet
  • All resources cleaned up

Next up: Module 6.3: GKE Workload Identity and Security --- Learn how to securely connect pods to GCP services without storing credentials, enforce binary authorization for trusted images, and leverage GKE’s security posture dashboard.