Module 2.3: GCP Compute Engine
Complexity: [MEDIUM] | Time to Complete: 2.5h | Prerequisites: Module 2.2 (VPC Networking)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy Compute Engine instances with custom machine types, preemptible VMs, and managed instance groups
- Configure instance templates and autoscaling policies for self-healing compute clusters on GCP
- Implement OS Login and metadata-based SSH key management to secure instance access
- Evaluate Compute Engine pricing models (on-demand, committed use, preemptible, Spot) to optimize costs
Why This Module Matters
In November 2022, a fast-growing e-commerce company ran its annual holiday sale on Google Cloud. The architecture was simple: a fleet of Compute Engine VMs behind an HTTP(S) load balancer, with a Cloud SQL database in the backend. At 9:01 AM on Black Friday, traffic spiked to 15x normal levels. The operations team had provisioned what they thought was enough capacity: 24 n1-standard-8 VMs. Within minutes, all 24 VMs were at 100% CPU. The team scrambled to add more VMs manually, but each new VM took 3-4 minutes to boot, install dependencies, and register with the load balancer. By the time they had scaled to 60 VMs, they had lost an estimated $1.8 million in abandoned carts. The post-incident review revealed two failures. First, they were using individual VMs instead of Managed Instance Groups with autoscaling. Second, they were using an older machine family (n1) when the n2 family offered 20% better price-performance. The CEO’s summary was blunt: “We paid more for machines that were slower, and we could not scale them automatically.”
This incident captures why Compute Engine is more than “just VMs.” Choosing the right machine family, configuring instance templates, using Managed Instance Groups with autoscaling, and setting up global load balancing are the difference between an architecture that handles traffic spikes gracefully and one that collapses under load. Compute Engine is the foundational compute service in GCP---even GKE nodes, Cloud SQL instances, and Dataflow workers run on Compute Engine VMs under the hood.
In this module, you will learn how to select the right machine family for your workload, leverage preemptible and Spot VMs for massive cost savings, build golden images with custom images, configure Managed Instance Groups for automatic scaling and self-healing, and tie everything together with Cloud Load Balancing.
Machine Families: Choosing the Right Hardware
GCP offers four machine families, each optimized for different workload characteristics. Selecting the wrong family is one of the most common ways to overspend.
The Four Families
flowchart TD
  subgraph "Machine Families"
    A["<b>General Purpose</b><br/>E2, N2, N2D, T2D, N1<br/><i>Web servers, Dev/test, microservices</i>"]
    B["<b>Compute Optimized</b><br/>C2, C2D, C3, H3<br/><i>HPC, gaming, batch jobs, scientific simulations</i>"]
    C["<b>Memory Optimized</b><br/>M2, M3<br/><i>SAP HANA, in-memory databases</i>"]
    D["<b>Accelerator Optimized</b><br/>A2, A3, G2<br/><i>ML training, inference, video transcoding</i>"]
  end

General Purpose: The Workhorse
| Series | CPU | vCPU:Memory Ratio (vCPU Range) | Best For | Notes |
|---|---|---|---|---|
| E2 | Intel/AMD (automatic) | 1:4 (0.25 to 32 vCPUs) | Cost-sensitive, dev/test | Cheapest, shared-core options (e2-micro: 0.25 vCPU); not eligible for SUDs |
| N2 | Intel Cascade Lake/Ice Lake | 1:4 (2 to 128 vCPUs) | General production | Good balance; eligible for SUDs, use CUDs for deeper savings |
| N2D | AMD EPYC | 1:4 (2 to 224 vCPUs) | Same as N2, prefer AMD | Often 10-15% cheaper than N2 |
| T2D | AMD EPYC | 1:4 (1 to 60 vCPUs) | Scale-out workloads | Best per-thread performance |
| N1 | Intel Skylake/older | 1:3.75 | Legacy (avoid for new) | Still supported but outdated |
# Create a general-purpose VM
gcloud compute instances create web-server \
  --machine-type=e2-medium \
  --zone=us-central1-a \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=20GB \
  --boot-disk-type=pd-balanced

# List available machine types in a zone
gcloud compute machine-types list \
  --zones=us-central1-a \
  --filter="name~'^e2'" \
  --format="table(name, guestCpus, memoryMb)"

Custom Machine Types

If predefined machine types do not fit your workload, GCP allows you to specify exact vCPU and memory combinations.

# Custom machine type: 6 vCPUs, 24GB RAM
gcloud compute instances create custom-vm \
  --custom-cpu=6 \
  --custom-memory=24GB \
  --zone=us-central1-a \
  --image-family=debian-12 \
  --image-project=debian-cloud

# Custom with extended memory (more than 8GB per vCPU)
gcloud compute instances create high-mem-vm \
  --custom-cpu=4 \
  --custom-memory=64GB \
  --custom-vm-type=n2 \
  --custom-extensions \
  --zone=us-central1-a \
  --image-family=debian-12 \
  --image-project=debian-cloud

Rules for custom machine types:
- vCPUs must be 1, or an even number between 2 and 224 (varies by series).
- Memory must be between 0.9 GB and 8 GB per vCPU (or up to 24 GB per vCPU with extended memory).
- Extended memory costs more per GB than standard memory.
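The rules above can be checked before calling gcloud. The following is a minimal sketch, not an official validator: the function name `validate_custom_machine` is hypothetical, it applies only the limits stated in this list, and it does not model per-series differences.

```shell
#!/usr/bin/env bash
# validate_custom_machine VCPUS MEMORY_GB [extended]
# Hypothetical helper: returns 0 if the combination satisfies the rules listed
# above, 1 otherwise. Per-series limits are NOT modeled (assumption).
validate_custom_machine() {
  local vcpus="$1" mem_gb="$2" extended="${3:-}"
  local max_per_vcpu=8
  if [ "$extended" = "extended" ]; then
    max_per_vcpu=24
  fi

  # vCPUs: 1, or an even number between 2 and 224.
  if [ "$vcpus" -ne 1 ]; then
    if [ "$vcpus" -lt 2 ] || [ "$vcpus" -gt 224 ] || [ $((vcpus % 2)) -ne 0 ]; then
      return 1
    fi
  fi

  # Memory: between 0.9 GB and max_per_vcpu GB per vCPU (awk handles decimals).
  awk -v v="$vcpus" -v m="$mem_gb" -v max="$max_per_vcpu" \
    'BEGIN { r = m / v; exit !(r >= 0.9 && r <= max) }'
}

# Usage:
#   validate_custom_machine 6 24            && echo "ok"
#   validate_custom_machine 4 64 extended   && echo "ok with extended memory"
```

Running a check like this in a provisioning script may catch an invalid shape before the API call fails, but the API remains the source of truth for each series.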
Shared-Core Machines
For lightweight workloads that do not need a full vCPU, E2 offers shared-core options:

| Type | vCPUs | Memory | Use Case | Cost (approx vs e2-medium) |
|---|---|---|---|---|
| e2-micro | 0.25 shared | 1 GB | Micro-services, tiny APIs | ~25% of e2-medium |
| e2-small | 0.5 shared | 2 GB | Low-traffic web, dev | ~50% of e2-medium |
| e2-medium | 1 shared | 4 GB | Moderate web, Jenkins agents | Baseline |
Preemptible and Spot VMs: Saving 60-91%
The Pricing Tiers

GCP offers four pricing options for the same hardware:
| Tier | Discount vs On-Demand | Max Lifetime | Guarantee | Use Case |
|---|---|---|---|---|
| On-Demand | 0% (baseline) | Unlimited | Will not be preempted | Production, stateful workloads |
| Committed Use (CUD) | 28-52% | 1 or 3 year term | Will not be preempted | Steady-state production |
| Spot | 60-91% | None (no 24h limit) | Can be preempted anytime | Batch, CI/CD, fault-tolerant |
| Preemptible (legacy) | 60-91% | 24 hours max | Preempted at 24h, or earlier | Use Spot instead (superset) |
Spot VMs replaced Preemptible VMs as the recommended ephemeral option. They offer the same discount but without the 24-hour maximum lifetime. Both can be preempted at any time with a 30-second warning.
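To make the discount range concrete, here is a rough sketch of the arithmetic. The helper name `monthly_cost` and the 730-hour month are assumptions; the hourly rate in the usage comment is the n2-standard-16 on-demand figure quoted later in this module, and actual Spot prices vary by region and over time.

```shell
#!/usr/bin/env bash
# monthly_cost HOURLY_RATE_USD DISCOUNT_PERCENT [HOURS]
# Prints the estimated monthly cost to two decimals.
# HOURS defaults to 730, an assumed average month length.
monthly_cost() {
  local rate="$1" discount="$2" hours="${3:-730}"
  awk -v r="$rate" -v d="$discount" -v h="$hours" \
    'BEGIN { printf "%.2f\n", r * h * (1 - d / 100) }'
}

# Usage:
#   monthly_cost 0.7769 0     # on-demand
#   monthly_cost 0.7769 60    # Spot, low end of the discount range
#   monthly_cost 0.7769 91    # Spot, high end of the discount range
```

For one instance running all month, this works out to about $567 on-demand versus roughly $51 to $227 on Spot, before accounting for any work lost to preemption.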
# Create a Spot VM
gcloud compute instances create batch-worker \
  --machine-type=n2-standard-4 \
  --zone=us-central1-a \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --image-family=debian-12 \
  --image-project=debian-cloud

# termination-action options:
# STOP - VM is stopped (can be restarted later if capacity available)
# DELETE - VM is deleted (for truly ephemeral workloads)

# Create with preemptible (legacy, avoid for new workloads)
gcloud compute instances create legacy-worker \
  --machine-type=n2-standard-4 \
  --zone=us-central1-a \
  --preemptible \
  --image-family=debian-12 \
  --image-project=debian-cloud

Handling Preemption Gracefully
# Inside the VM: check whether a preemption notice has been issued
# (the metadata server returns TRUE once the VM is being preempted, about 30s before shutdown, and FALSE otherwise)
curl -s "http://metadata.google.internal/computeMetadata/v1/instance/preempted" \
  -H "Metadata-Flavor: Google"
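The check above can also be run in a loop by a worker process. The sketch below is one possible shape, not a prescribed pattern: `fetch_preempted` and `wait_for_preemption` are hypothetical names, and the fetch command is passed in as a parameter so the loop can be exercised without a metadata server.

```shell
#!/usr/bin/env bash
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/preempted"

# Prints TRUE or FALSE as returned by the metadata server (only works on a VM).
fetch_preempted() {
  curl -s --max-time 5 "$METADATA_URL" -H "Metadata-Flavor: Google"
}

# wait_for_preemption FETCH_CMD INTERVAL_SECONDS MAX_POLLS
# Returns 0 as soon as FETCH_CMD prints TRUE, 1 if MAX_POLLS is exhausted.
wait_for_preemption() {
  local fetch_cmd="$1" interval="$2" max_polls="$3" i
  for ((i = 0; i < max_polls; i++)); do
    if [ "$("$fetch_cmd")" = "TRUE" ]; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}

# Usage on a Spot VM (the checkpoint script path is the one used in this module):
#   wait_for_preemption fetch_preempted 5 1000000 && /opt/app/save-checkpoint.sh
```

Because the warning window is only about 30 seconds, a short polling interval matters more than the exact loop structure.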
# Create a shutdown script that handles graceful termination
gcloud compute instances create batch-worker \
  --machine-type=n2-standard-4 \
  --zone=us-central1-a \
  --provisioning-model=SPOT \
  --instance-termination-action=STOP \
  --metadata=shutdown-script='#!/bin/bash
echo "Preemption detected at $(date)" >> /var/log/preemption.log
# Save checkpoint, flush buffers, deregister from load balancer
/opt/app/save-checkpoint.sh
/opt/app/deregister.sh'

Committed Use Discounts (CUDs)
For steady-state production workloads, CUDs offer significant savings without any preemption risk.
| Commitment | Duration | Discount |
|---|---|---|
| Resource-based | 1 year | ~28% |
| Resource-based | 3 years | ~52% |
| Spend-based | 1 year | 25% (more flexible) |
| Spend-based | 3 years | 52% |
# Purchase a committed use discount (resource-based)
gcloud compute commitments create my-commitment \
  --region=us-central1 \
  --resources=vcpu=100,memory=400GB \
  --plan=36-month \
  --type=GENERAL_PURPOSE

# View existing commitments
gcloud compute commitments list --region=us-central1

Sustained Use Discounts (SUDs) apply automatically, with no commitment required, on eligible machine series (for example N1 and N2; E2 is not eligible). If a VM runs for more than 25% of the month, GCP applies increasing discounts to the additional usage. For an N2 VM that runs the entire month, the effective discount is about 20%.
Pause and predict: You are designing a video rendering pipeline. If a rendering job is interrupted, it must start over from the beginning. Some jobs take up to 36 hours. Should you use Spot VMs to save costs here?
Custom Images and Image Families
Why Custom Images Matter

Every time you create a VM from a public image (like debian-12), you start with a bare OS. Installing your application, dependencies, and configuration on every new VM wastes time and creates inconsistency. Custom images solve this by baking your software into a reusable image.

# Step 1: Create a VM and configure it
gcloud compute instances create image-builder \
  --machine-type=e2-medium \
  --zone=us-central1-a \
  --image-family=debian-12 \
  --image-project=debian-cloud

# SSH in and install your software
gcloud compute ssh image-builder --zone=us-central1-a
# Inside the VM:
# sudo apt-get update && sudo apt-get install -y nginx nodejs npm
# sudo npm install -g your-app
# sudo systemctl enable nginx
# exit

# Step 2: Stop the VM (required for image creation)
gcloud compute instances stop image-builder --zone=us-central1-a

# Step 3: Create a custom image from the VM's disk
gcloud compute images create my-app-v1-0 \
  --source-disk=image-builder \
  --source-disk-zone=us-central1-a \
  --family=my-app \
  --description="My App v1.0 with nginx and Node.js"

# Step 4: Clean up the builder VM
gcloud compute instances delete image-builder --zone=us-central1-a --quiet

Image Families
Image families are like a “latest” pointer for your custom images. When you create a new image in a family, it automatically becomes the default: the family always resolves to its most recent image that has not been deprecated.
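The "latest pointer" behavior can be modeled in a few lines. This is an illustrative sketch only: `latest_in_family` is a hypothetical helper, the three-column input format is a simplification invented for the example (it is not gcloud output), and the timestamps in the usage comment are made up.

```shell
#!/usr/bin/env bash
# latest_in_family
# Reads "<name> <creationTimestamp> <state>" lines on stdin, where <state> is
# ACTIVE or DEPRECATED (simplified, hypothetical format), and prints the name
# of the newest ACTIVE image. ISO-8601 UTC timestamps sort lexicographically.
latest_in_family() {
  awk '$3 == "ACTIVE" { print $2, $1 }' | sort | tail -n 1 | awk '{ print $2 }'
}

# Usage:
#   printf '%s\n' \
#     "my-app-v1-0 2024-01-10T00:00:00Z ACTIVE" \
#     "my-app-v1-1 2024-02-01T00:00:00Z DEPRECATED" | latest_in_family
```

This is why deprecating the newest image works as a rollback: the family falls back to the previous image without any change to the templates that reference the family.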
# Create new version in the same family
gcloud compute images create my-app-v1-1 \
  --source-disk=image-builder-v2 \
  --source-disk-zone=us-central1-a \
  --family=my-app

# Create a VM using the latest image in the family
gcloud compute instances create web-1 \
  --image-family=my-app \
  --zone=us-central1-a

# List images in a family
gcloud compute images list --filter="family=my-app" \
  --format="table(name, creationTimestamp, status)"

# Roll back: deprecate the latest image, making the previous one current
gcloud compute images deprecate my-app-v1-1 \
  --state=DEPRECATED \
  --replacement=my-app-v1-0

Pause and predict: You need to apply a critical security patch to an OS used by 50 VMs. If you’re using image families, what steps must you take to ensure all VMs run the patched OS?
Instance Templates and Managed Instance Groups
Instance Templates

An instance template is a blueprint that defines the machine type, image, disks, network, and other settings for a VM. Templates are immutable: to change a setting, you create a new template.

# Create an instance template
gcloud compute instance-templates create web-template-v1 \
  --machine-type=e2-standard-2 \
  --image-family=my-app \
  --boot-disk-size=20GB \
  --boot-disk-type=pd-balanced \
  --network=prod-vpc \
  --subnet=web-tier \
  --region=us-central1 \
  --no-address \
  --service-account=web-sa@my-project.iam.gserviceaccount.com \
  --scopes=cloud-platform \
  --tags=web-server \
  --metadata=startup-script='#!/bin/bash
systemctl start nginx
echo "$(hostname) ready" > /var/www/html/health'

# List templates
gcloud compute instance-templates list

# Create a new version (templates are immutable)
gcloud compute instance-templates create web-template-v2 \
  --machine-type=e2-standard-4 \
  --image-family=my-app \
  --boot-disk-size=20GB \
  --boot-disk-type=pd-balanced \
  --network=prod-vpc \
  --subnet=web-tier \
  --region=us-central1 \
  --no-address \
  --service-account=web-sa@my-project.iam.gserviceaccount.com \
  --scopes=cloud-platform

Managed Instance Groups (MIGs)

A MIG is a group of identical VMs created from an instance template. MIGs provide autoscaling, self-healing, rolling updates, and load balancer integration.
# Create the health check first (the MIG below references it)
gcloud compute health-checks create http web-health-check \
  --port=80 \
  --request-path=/health \
  --check-interval=10s \
  --timeout=5s \
  --healthy-threshold=2 \
  --unhealthy-threshold=3

# Create a regional MIG (recommended: spans multiple zones in a region)
gcloud compute instance-groups managed create web-mig \
  --template=web-template-v1 \
  --size=3 \
  --region=us-central1 \
  --health-check=web-health-check \
  --initial-delay=120

Autoscaling
# Add autoscaling to the MIG
gcloud compute instance-groups managed set-autoscaling web-mig \
  --region=us-central1 \
  --min-num-replicas=2 \
  --max-num-replicas=20 \
  --target-cpu-utilization=0.6 \
  --cool-down-period=120
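As a rough mental model for the CPU target above, the group size needed to bring average utilization back to the target is the current size scaled by the ratio of observed to target utilization, clamped to the configured bounds. The sketch below encodes only that simplified model; `desired_size` is a hypothetical name, and the real autoscaler also applies stabilization and the cool-down period, which are not modeled here.

```shell
#!/usr/bin/env bash
# desired_size CURRENT_VMS CURRENT_UTIL_PCT TARGET_UTIL_PCT MIN MAX
# Simplified model (assumption): ceil(current * util / target), clamped to
# the [MIN, MAX] range. Utilization values are whole percentages.
desired_size() {
  local current="$1" util="$2" target="$3" min="$4" max="$5"
  # Integer ceiling division.
  local want=$(( (current * util + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then
    want="$min"
  fi
  if [ "$want" -gt "$max" ]; then
    want="$max"
  fi
  echo "$want"
}

# Usage with the settings above (target 60%, min 2, max 20):
#   desired_size 3 90 60 2 20
```

The clamp also shows why the maximum matters: a fleet that needs more than `--max-num-replicas` VMs to reach the target stays saturated at the cap.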
# Scale based on HTTP load balancing utilization
gcloud compute instance-groups managed set-autoscaling web-mig \
  --region=us-central1 \
  --min-num-replicas=2 \
  --max-num-replicas=20 \
  --custom-metric-utilization=metric=loadbalancing.googleapis.com/https/request_count,utilization-target=1000,utilization-target-type=GAUGE

# View current autoscaling status
gcloud compute instance-groups managed describe web-mig \
  --region=us-central1 \
  --format="yaml(status.autoscaler)"

Rolling Updates

MIGs support zero-downtime updates by gradually replacing instances with a new template.

# Start a rolling update to the new template
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --version=template=web-template-v2 \
  --region=us-central1 \
  --max-surge=3 \
  --max-unavailable=0

# Canary update: run new template on a subset of instances
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --version=template=web-template-v1 \
  --canary-version=template=web-template-v2,target-size=20% \
  --region=us-central1

# Monitor the update
gcloud compute instance-groups managed describe web-mig \
  --region=us-central1 \
  --format="yaml(status.versionTarget, status.isStable)"
# Roll back (just update back to the old template)
gcloud compute instance-groups managed rolling-action start-update web-mig \
  --version=template=web-template-v1 \
  --region=us-central1

| Update Parameter | Description | Recommended |
|---|---|---|
| --max-surge | Extra instances during update | 3 or 20% |
| --max-unavailable | Instances that can be offline | 0 (zero downtime) |
| --replacement-method=SUBSTITUTE | Create new, then delete old | Default (safest) |
| --replacement-method=RECREATE | Delete old, then create new | Only when instance names must be preserved |
| --minimal-action=REPLACE | Replace entire VM | When image/template changes |
| --minimal-action=RESTART | Just restart existing VM | When only metadata changes |
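The surge and canary settings translate into instance counts with simple arithmetic. The helpers below are a hedged sketch with hypothetical names; in particular, rounding a percentage up to the next whole instance is an assumption here and should be confirmed against the MIG documentation.

```shell
#!/usr/bin/env bash
# ceil_percent TOTAL PERCENT
# PERCENT of TOTAL, rounded up to a whole instance (assumed rounding rule).
ceil_percent() {
  echo $(( ($1 * $2 + 99) / 100 ))
}

# peak_instances SIZE MAX_SURGE
# With --max-unavailable=0, the group can temporarily grow to SIZE + MAX_SURGE.
peak_instances() {
  echo $(( $1 + $2 ))
}

# Usage:
#   ceil_percent 100 10    # canary VMs for target-size=10% of a 100-VM group
#   peak_instances 3 3     # peak VM count for size=3 with --max-surge=3
```

The peak count is worth checking against regional quota before an update, since surge instances consume quota like any other VM.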
Self-Healing
When a health check fails, the MIG automatically recreates the unhealthy VM. This is the simplest form of self-healing in GCP.
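The "consecutive failures" rule is easy to misread as "total failures". The sketch below simulates it over a sequence of probe results; `first_unhealthy_probe` is a hypothetical helper written for illustration and does not reflect how the health checker is implemented.

```shell
#!/usr/bin/env bash
# first_unhealthy_probe THRESHOLD RESULT...
# Each RESULT is OK or FAIL. Prints the 1-based index of the probe at which
# THRESHOLD consecutive FAILs is first reached, or 0 if that never happens.
first_unhealthy_probe() {
  local threshold="$1"
  shift
  local streak=0 index=0 result
  for result in "$@"; do
    index=$((index + 1))
    if [ "$result" = "FAIL" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any success resets the streak
    fi
    if [ "$streak" -ge "$threshold" ]; then
      echo "$index"
      return 0
    fi
  done
  echo 0
}

# Usage (unhealthy-threshold=3, as in the health check created earlier):
#   first_unhealthy_probe 3 OK FAIL FAIL OK FAIL FAIL FAIL
```

With a 10-second check interval, three consecutive failures mean a VM is flagged roughly 30 seconds after it starts failing, while intermittent failures that never form a streak do not trigger a replacement.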
flowchart LR
  subgraph "Normal Operation"
    direction LR
    VM1["VM-1<br/>OK"]
    VM2["VM-2<br/>OK"]
    VM3["VM-3<br/>OK"]
  end

  subgraph "Self-Healing"
    direction LR
    VM1b["VM-1<br/>OK"]
    VM2b["VM-2<br/>FAIL"]
    VM3b["VM-3<br/>OK"]

    VM2b -- "Health check fails<br/>3 consecutive times" --> Action["MIG deletes VM-2<br/>and creates VM-2-new<br/>from the template"]
  end

Stop and think: If you manually SSH into a VM managed by a MIG and update a configuration file, what will happen if the VM fails a health check later that day?
Cloud Load Balancing
GCP offers multiple load balancer types, but the most common is the External Application Load Balancer (formerly known as the External HTTP(S) Load Balancer).

Load Balancer Types

| Type | Scope | Layer | Protocol | Use Case |
|---|---|---|---|---|
| External Application LB | Global | L7 | HTTP/HTTPS | Public web apps, APIs |
| Internal Application LB | Regional | L7 | HTTP/HTTPS | Internal microservices |
| External Network LB | Regional | L4 | TCP/UDP | Non-HTTP (gaming, VoIP) |
| Internal Network LB | Regional | L4 | TCP/UDP | Internal TCP/UDP services |
| External Proxy Network LB | Global | L4 | TCP/SSL | Global TCP with Anycast |
Architecture of the External Application Load Balancer
flowchart TD
  Users["Users (Internet)"] --> Anycast["Google Global Anycast IP Address<br/><i>Single IP, served from 100+ edge locations</i>"]
  Anycast --> URLMap["URL Map<br/><i>/api/* → backend-api<br/>/static → cdn-bucket<br/>/* → backend-web</i>"]
  URLMap --> BackendSvc["Backend Service (or Backend Bucket)<br/><i>Health checks, session affinity, connection draining</i>"]
  BackendSvc --> MIG_US["MIG us-central1<br/>(3 VMs)"]
  BackendSvc --> MIG_EU["MIG europe-west1<br/>(3 VMs)"]

Setting Up a Global Load Balancer
# Step 1: Reserve a global static IP
gcloud compute addresses create web-lb-ip \
  --ip-version=IPV4 \
  --global

# Step 2: Create a health check for the backend service
gcloud compute health-checks create http web-lb-health \
  --port=80 \
  --request-path=/health

# Step 3: Create a backend service
gcloud compute backend-services create web-backend \
  --protocol=HTTP \
  --port-name=http \
  --health-checks=web-lb-health \
  --global

# Step 4: Add MIG backends to the backend service
gcloud compute backend-services add-backend web-backend \
  --instance-group=web-mig-us \
  --instance-group-region=us-central1 \
  --balancing-mode=UTILIZATION \
  --max-utilization=0.8 \
  --global

gcloud compute backend-services add-backend web-backend \
  --instance-group=web-mig-eu \
  --instance-group-region=europe-west1 \
  --balancing-mode=UTILIZATION \
  --max-utilization=0.8 \
  --global

# Step 5: Create a URL map
gcloud compute url-maps create web-url-map \
  --default-service=web-backend

# Step 6: Create an HTTPS target proxy with a managed SSL certificate
gcloud compute ssl-certificates create web-cert \
  --domains=www.example.com \
  --global

gcloud compute target-https-proxies create web-https-proxy \
  --url-map=web-url-map \
  --ssl-certificates=web-cert

# Step 7: Create a forwarding rule
gcloud compute forwarding-rules create web-https-rule \
  --address=web-lb-ip \
  --global \
  --target-https-proxy=web-https-proxy \
  --ports=443

Named Ports
MIGs communicate port mappings through named ports. The backend service references a name (like “http”), and the MIG maps that name to an actual port number.
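The indirection can be pictured as a small lookup table on each instance group. The following sketch is illustrative only: `resolve_named_port` is a hypothetical helper that parses the same `name:port` list syntax accepted by `--named-ports`; it is not how the load balancer itself is implemented.

```shell
#!/usr/bin/env bash
# resolve_named_port MAPPINGS NAME
# MAPPINGS is a comma-separated name:port list, e.g. "http:80,https:443".
# Prints the port for NAME, or returns 1 if the group defines no such name.
resolve_named_port() {
  local mappings="$1" wanted="$2" pair
  local IFS=','
  for pair in $mappings; do
    if [ "${pair%%:*}" = "$wanted" ]; then
      echo "${pair##*:}"
      return 0
    fi
  done
  return 1
}

# Usage:
#   resolve_named_port "http:80,https:443" http
```

A missing mapping is the failure case to keep in mind: if a backend service refers to a port name that an instance group never defined, there is no port for traffic to be sent to.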
# Set named port on the MIG
gcloud compute instance-groups managed set-named-ports web-mig-us \
  --named-ports=http:80 \
  --region=us-central1

gcloud compute instance-groups managed set-named-ports web-mig-eu \
  --named-ports=http:80 \
  --region=europe-west1

Disk Types and Storage
| Disk Type | Read IOPS | Throughput | Use Case | Cost |
|---|---|---|---|---|
| pd-standard | 0.75 per GB | 0.12 MB/s per GB | Bulk storage, logs | Lowest |
| pd-balanced | 6 per GB | 0.28 MB/s per GB | General purpose (default) | Medium |
| pd-ssd | 30 per GB | 0.48 MB/s per GB | Databases, high I/O | Higher |
| pd-extreme | Configurable | Configurable | SAP HANA, Oracle DB | Highest |
| local-ssd | 900K total | 9.4 GB/s total | Temp storage, caches | Included with VM |
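Because persistent disk IOPS scale with size, it helps to estimate what a given disk will deliver. This is a back-of-the-envelope sketch using only the per-GB read IOPS figures from the table; `disk_read_iops` is a hypothetical helper, and it ignores baseline allowances and per-VM caps, which also limit real performance.

```shell
#!/usr/bin/env bash
# disk_read_iops DISK_TYPE SIZE_GB
# Prints size-based read IOPS using the per-GB figures from the table above.
# Baselines and per-instance limits are NOT modeled (assumption).
disk_read_iops() {
  local type="$1" size_gb="$2" per_gb
  case "$type" in
    pd-standard) per_gb=0.75 ;;
    pd-balanced) per_gb=6 ;;
    pd-ssd)      per_gb=30 ;;
    *) echo "unsupported disk type: $type" >&2; return 1 ;;
  esac
  awk -v p="$per_gb" -v s="$size_gb" 'BEGIN { printf "%d\n", p * s }'
}

# Usage:
#   disk_read_iops pd-ssd 200
#   disk_read_iops pd-standard 200
```

For a 200 GB disk, the estimate is 6,000 read IOPS on pd-ssd versus 150 on pd-standard, which is the gap that makes pd-standard a poor fit for databases.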
# Create a VM with an additional SSD data disk
gcloud compute instances create db-server \
  --machine-type=n2-standard-8 \
  --zone=us-central1-a \
  --boot-disk-size=20GB \
  --boot-disk-type=pd-balanced \
  --create-disk=name=data-disk,size=200GB,type=pd-ssd,auto-delete=no

# Create a snapshot (backup)
gcloud compute disks snapshot data-disk \
  --zone=us-central1-a \
  --snapshot-names=data-disk-backup-$(date +%Y%m%d)

# Schedule automatic snapshots
gcloud compute resource-policies create snapshot-schedule daily-snapshot \
  --region=us-central1 \
  --max-retention-days=14 \
  --start-time=02:00 \
  --daily-schedule

Securing Access: OS Login and SSH Keys

Historically, accessing a Linux VM involved generating an SSH key pair and pasting the public key into the project or instance metadata. This approach does not scale well: when an employee leaves, you must hunt down and remove their keys across all instances.
OS Login solves this by linking SSH access to IAM (Identity and Access Management). Instead of managing individual SSH keys, you assign IAM roles (roles/compute.osLogin or roles/compute.osAdminLogin) to users or groups.
# Enable OS Login at the project level
gcloud compute project-info add-metadata \
  --metadata enable-oslogin=TRUE

# Grant OS Login IAM role to a user
gcloud projects add-iam-policy-binding my-project \
  --member="user:alice@example.com" \
  --role="roles/compute.osLogin"

When a user connects using gcloud compute ssh, GCP automatically generates a short-lived SSH key, pushes it to their OS Login profile, and allows them to log in. The moment their IAM permissions are revoked, their access to all VMs is instantly cut off. For VMs that do not have external IPs, you combine OS Login with Identity-Aware Proxy (IAP) TCP forwarding to securely tunnel SSH traffic without exposing ports to the internet.
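Connecting through IAP is a single flag on the SSH command. The wrapper below is a small sketch: `iap_ssh` and the `GCLOUD` override variable are hypothetical conveniences (the override lets the command be previewed with `echo`), while `--tunnel-through-iap` is the gcloud flag that routes the session through IAP TCP forwarding. The VM name and zone in the usage comment are examples.

```shell
#!/usr/bin/env bash
# iap_ssh INSTANCE ZONE
# Opens an SSH session through IAP TCP forwarding. Set GCLOUD=echo to print
# the arguments instead of running gcloud (hypothetical dry-run convention).
iap_ssh() {
  local instance="$1" zone="$2"
  "${GCLOUD:-gcloud}" compute ssh "$instance" \
    --zone="$zone" \
    --tunnel-through-iap
}

# Usage:
#   iap_ssh web-1 us-central1-a                 # real connection
#   GCLOUD=echo iap_ssh web-1 us-central1-a     # preview only
```

This only works if a firewall rule admits TCP port 22 from the IAP range (35.235.240.0/20, as used in the hands-on exercise) and the user also holds an IAM role that permits IAP tunneling.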
Did You Know?
- GCP’s global load balancer uses Anycast routing, meaning a single IP address is advertised from over 100 Google edge locations worldwide. When a user in Tokyo connects to your load balancer IP, they are routed to the nearest Google edge, which then forwards the request to the closest healthy backend. This happens at the network layer, with no DNS-based routing tricks needed.
- Spot VMs can save up to 91% compared to on-demand pricing. The actual discount varies by machine type and region. For a batch processing job running n2-standard-16 instances, the difference between on-demand ($0.7769/hr) and Spot ($0.07-0.23/hr) can mean the difference between a $5,000 monthly bill and a $500 one.
- Live migration is a GCP superpower that most users never notice. When Google needs to perform host maintenance, your VMs are transparently migrated to another physical host with no reboot and typically less than a second of degraded performance. This is enabled by default on all standard VMs. Preemptible/Spot VMs do not support live migration; they are terminated instead.
- You can create a VM with up to 416 vCPUs and nearly 12 TB of memory using the memory-optimized M2 machine family. These ultra-high-memory machines are designed for SAP HANA, large in-memory databases, and genomics workloads. Even the smaller m3-megamem-128 costs over $21 per hour at full price, which is still cheaper than buying equivalent on-premises hardware when you factor in a 3-year amortization.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Using N1 machines for new workloads | N1 appears first in old tutorials | Use N2, N2D, or E2---they offer better price-performance |
| Not using Managed Instance Groups | Individual VMs seem simpler initially | Always use MIGs for production; they provide autoscaling and self-healing |
| Setting autoscaler min to 1 | Want to minimize cost | Min should be 2+ for high availability across zones |
| Not configuring health checks | Assumed MIG “just knows” when VMs are unhealthy | Create HTTP health checks with appropriate thresholds |
| Using external IPs on every VM | Easier to SSH directly | Use IAP tunneling; VMs should not have external IPs unless they serve public traffic |
| Ignoring Sustained Use Discounts | Assuming CUDs are the only option | SUDs apply automatically; check billing reports to see your effective discount |
| Choosing pd-standard for databases | It is the cheapest disk type | Use pd-ssd for any workload with latency requirements; pd-standard IOPS scales with disk size |
| Not setting shutdown scripts on Spot VMs | Assuming preemption never happens | Always implement graceful shutdown to save state and deregister from services |
1. Your team is running a batch processing workload that takes 30 hours to complete. A junior engineer suggests using Preemptible VMs to save money. How would you explain to them why Spot VMs are a better choice for this specific scenario?
Preemptible VMs have a hard limitation: GCP will always terminate them after exactly 24 hours of uptime, regardless of whether there is available capacity in the zone. Because your workload takes 30 hours to complete, it would never finish on a Preemptible VM without being interrupted. Spot VMs are the modern successor to Preemptible VMs and remove this 24-hour maximum lifetime restriction. While Spot VMs can still be preempted at any time if GCP needs the capacity, they are allowed to run indefinitely during periods of low demand, making them the only viable low-cost option for uninterrupted jobs longer than 24 hours.
2. During a high-traffic event, one of the three VMs in your Managed Instance Group (MIG) runs out of memory and starts returning 502 Bad Gateway errors. The MIG is configured with an HTTP health check requiring 3 consecutive failures. Describe the exact sequence of events the MIG and load balancer will trigger to resolve this.
As soon as the VM fails the health check three consecutive times, the load balancer instantly stops routing new user traffic to that specific VM to prevent further errors. Concurrently, the MIG’s self-healing mechanism detects the unhealthy state and forcefully deletes the unresponsive VM. The MIG then automatically provisions a brand new VM using the exact specifications defined in the attached instance template. Once the newly created VM boots up and successfully passes its own health checks, the load balancer resumes sending it user traffic, restoring the group to full capacity without manual intervention.
3. Your company is deploying a mission-critical payment processing API. The architecture review board has rejected your proposal to use a zonal Managed Instance Group (MIG) in us-central1-a. Why is a regional MIG a strictly better choice for this architecture?
A zonal MIG places all of your VM instances into a single datacenter zone, which creates a single point of failure if that specific facility experiences a power outage or network partition. By contrast, a regional MIG automatically distributes your VMs across multiple independent zones (like us-central1-a, us-central1-b, and us-central1-c) within the same region. This ensures that even if an entire Google Cloud zone goes offline, your application continues to serve traffic from the remaining healthy zones. Furthermore, a regional MIG allows the autoscaler to intelligently provision new instances in whichever zone currently has the most available hardware capacity.
4. You have just built a new instance template containing a major software update. Instead of updating all 100 production VMs at once, you want to test the new version on just 10% of your traffic. How do you execute this safely using a MIG?
You can achieve this by triggering a rolling update on the MIG using the --canary-version flag and specifying the new instance template. By setting the target size to 10%, the MIG will gradually replace only 10 of your existing VMs with the new template, while leaving the other 90 VMs untouched. You can then monitor application logs and error rates for those specific canary instances to ensure the new software is stable under real-world traffic. If everything looks good, you issue a subsequent command to roll out the update to 100%; if errors spike, you roll back the 10% by starting an update that targets only the old template.
5. You are provisioning a new Compute Engine VM that will host a high-throughput PostgreSQL database. Your colleague suggests using the `pd-standard` disk type because it is the cheapest option. Why is this a poor choice for a database, and what should you choose instead?
The pd-standard persistent disk is backed by standard Hard Disk Drives (HDDs) and offers extremely low IOPS (0.75 per GB), making it suitable only for sequential data like log archives or backups. A database requires random, high-speed read/write operations, and running it on a standard HDD will result in severe I/O bottlenecks and unacceptable latency. For a high-throughput database, you must choose pd-ssd or pd-extreme, which are backed by Solid State Drives (SSDs) and deliver massively higher IOPS and throughput. While pd-balanced offers a middle-ground of performance and cost, pd-ssd is strictly recommended for latency-sensitive workloads like enterprise databases.
6. Google Cloud notifies you that the physical host running your primary web server requires emergency hardware maintenance. You are using standard e2-medium VMs, and you panic because you cannot afford any downtime. Why shouldn't you worry, and under what circumstances would this actually cause an outage?
You shouldn’t worry because standard Compute Engine VMs benefit from a feature called live migration, which transparently moves your running VM from the failing physical host to a healthy one. This process happens automatically without rebooting the VM and typically results in less than a second of degraded performance, meaning your users will not notice the event. However, this would cause an outage if you were using Spot VMs, Preemptible VMs, or VMs with attached GPUs. These specific VM types do not support live migration, and would instead be terminated or stopped entirely when Google performs host maintenance.
7. A developer who recently left the company claims they still have SSH access to several production VMs because they manually added their public SSH key to the `~/.ssh/authorized_keys` file on those machines. How could your organization have prevented this by using OS Login?
When OS Login is enabled at the project level, Compute Engine completely bypasses local SSH key files like ~/.ssh/authorized_keys and exclusively relies on IAM policies to authorize access. With OS Login, a user’s ability to SSH into a VM is directly tied to their Google Cloud identity and IAM roles (like roles/compute.osLogin). The moment the departed developer’s Google Workspace account is suspended or their IAM role is revoked, their SSH access is instantly cut off across all VMs in the project. This eliminates the operational nightmare of hunting down and deleting rogue public keys scattered across individual instances.
Hands-On Exercise: Globally Load-Balanced App Across Two Regions
Objective

Build a production-like architecture with MIGs in two regions behind a global HTTPS load balancer.

Prerequisites

- gcloud CLI installed and authenticated
- A GCP project with billing enabled
- A custom VPC with subnets in us-central1 and europe-west1
Task 1: Create the Network Foundation
Solution
export PROJECT_ID=$(gcloud config get-value project)
export REGION_US=us-central1
export REGION_EU=europe-west1

# Create custom VPC
gcloud compute networks create web-vpc \
  --subnet-mode=custom \
  --bgp-routing-mode=global

# Create subnets
gcloud compute networks subnets create web-us \
  --network=web-vpc \
  --region=$REGION_US \
  --range=10.10.0.0/24 \
  --enable-private-ip-google-access

gcloud compute networks subnets create web-eu \
  --network=web-vpc \
  --region=$REGION_EU \
  --range=10.11.0.0/24 \
  --enable-private-ip-google-access

# Create firewall rules
gcloud compute firewall-rules create web-vpc-allow-http \
  --network=web-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:80 \
  --source-ranges=130.211.0.0/22,35.191.0.0/16 \
  --description="Allow health checks and LB traffic"

gcloud compute firewall-rules create web-vpc-allow-iap \
  --network=web-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22 \
  --source-ranges=35.235.240.0/20

Task 2: Create an Instance Template
Solution
```bash
# Create instance template (uses startup script to install nginx)
# Caveat: --no-address gives the VMs no external IP, so the apt-get calls in the
# startup script need outbound internet access (for example via Cloud NAT).
# Caveat: web-vpc is a custom-mode network, where instances need an explicit
# subnet; per-region templates with --region/--subnet may be required.
gcloud compute instance-templates create web-template \
  --machine-type=e2-small \
  --image-family=debian-12 \
  --image-project=debian-cloud \
  --boot-disk-size=10GB \
  --boot-disk-type=pd-balanced \
  --network=web-vpc \
  --no-address \
  --metadata=startup-script='#!/bin/bash
apt-get update
apt-get install -y nginx
ZONE=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/zone" -H "Metadata-Flavor: Google" | cut -d/ -f4)
HOSTNAME=$(hostname)
cat > /var/www/html/index.html <<HTMLEOF
<h1>Hello from $HOSTNAME</h1>
<p>Zone: $ZONE</p>
<p>Served at: $(date)</p>
HTMLEOF
cat > /var/www/html/health <<HTMLEOF
OK
HTMLEOF
systemctl restart nginx'

# Verify
gcloud compute instance-templates describe web-template \
  --format="yaml(properties.machineType, properties.networkInterfaces)"
```

Task 3: Create Regional MIGs with Autoscaling
Solution
```bash
# Create health check
gcloud compute health-checks create http web-hc \
  --port=80 \
  --request-path=/health \
  --check-interval=10s \
  --timeout=5s \
  --healthy-threshold=2 \
  --unhealthy-threshold=3

# Create MIG in US
gcloud compute instance-groups managed create web-mig-us \
  --template=web-template \
  --size=2 \
  --region=$REGION_US \
  --health-check=web-hc \
  --initial-delay=120

# Create MIG in EU
gcloud compute instance-groups managed create web-mig-eu \
  --template=web-template \
  --size=2 \
  --region=$REGION_EU \
  --health-check=web-hc \
  --initial-delay=120

# Set named ports
gcloud compute instance-groups managed set-named-ports web-mig-us \
  --named-ports=http:80 --region=$REGION_US

gcloud compute instance-groups managed set-named-ports web-mig-eu \
  --named-ports=http:80 --region=$REGION_EU

# Add autoscaling to both MIGs (each entry is "mig-name:region")
for ENTRY in "web-mig-us:$REGION_US" "web-mig-eu:$REGION_EU"; do
  MIG_NAME="${ENTRY%%:*}"
  MIG_REGION="${ENTRY##*:}"

  gcloud compute instance-groups managed set-autoscaling "$MIG_NAME" \
    --region="$MIG_REGION" \
    --min-num-replicas=2 \
    --max-num-replicas=10 \
    --target-cpu-utilization=0.6 \
    --cool-down-period=120
done
```

Task 4: Create the Global Load Balancer
Solution
```bash
# Reserve global IP
gcloud compute addresses create web-global-ip --ip-version=IPV4 --global

# Get the IP address
WEB_IP=$(gcloud compute addresses describe web-global-ip --global --format="get(address)")
echo "Load Balancer IP: $WEB_IP"

# Create backend service
gcloud compute backend-services create web-backend-svc \
  --protocol=HTTP \
  --port-name=http \
  --health-checks=web-hc \
  --global

# Add both MIGs as backends
gcloud compute backend-services add-backend web-backend-svc \
  --instance-group=web-mig-us \
  --instance-group-region=$REGION_US \
  --balancing-mode=UTILIZATION \
  --max-utilization=0.8 \
  --global

gcloud compute backend-services add-backend web-backend-svc \
  --instance-group=web-mig-eu \
  --instance-group-region=$REGION_EU \
  --balancing-mode=UTILIZATION \
  --max-utilization=0.8 \
  --global

# Create URL map
gcloud compute url-maps create web-url-map \
  --default-service=web-backend-svc

# Create HTTP target proxy (use HTTPS with cert in production)
gcloud compute target-http-proxies create web-http-proxy \
  --url-map=web-url-map

# Create forwarding rule
gcloud compute forwarding-rules create web-http-rule \
  --address=web-global-ip \
  --global \
  --target-http-proxy=web-http-proxy \
  --ports=80

echo "Load balancer will be available at http://$WEB_IP in 3-5 minutes"
```

Task 5: Test and Verify
Solution
```bash
# Wait for backends to become healthy (check every 30 seconds)
echo "Waiting for backends to become healthy..."
while true; do
  STATUS=$(gcloud compute backend-services get-health web-backend-svc --global 2>&1)
  # Match "healthState: HEALTHY" so that UNHEALTHY instances are not counted
  HEALTHY=$(echo "$STATUS" | grep -c "healthState: HEALTHY" || true)
  echo "Healthy backends: $HEALTHY"
  if [ "$HEALTHY" -ge 4 ]; then
    echo "All backends healthy!"
    break
  fi
  sleep 30
done

# Test the load balancer (run multiple times to see different backends)
WEB_IP=$(gcloud compute addresses describe web-global-ip --global --format="get(address)")

for i in $(seq 1 6); do
  echo "--- Request $i ---"
  curl -s "http://$WEB_IP"
  echo
done

# Check backend health status
gcloud compute backend-services get-health web-backend-svc --global
```

Task 6: Clean Up
Solution
```bash
# Delete in reverse order of dependencies
gcloud compute forwarding-rules delete web-http-rule --global --quiet
gcloud compute target-http-proxies delete web-http-proxy --quiet
gcloud compute url-maps delete web-url-map --quiet
gcloud compute backend-services delete web-backend-svc --global --quiet
gcloud compute addresses delete web-global-ip --global --quiet

# Delete MIGs
gcloud compute instance-groups managed delete web-mig-us --region=$REGION_US --quiet
gcloud compute instance-groups managed delete web-mig-eu --region=$REGION_EU --quiet

# Delete health check and template
gcloud compute health-checks delete web-hc --quiet
gcloud compute instance-templates delete web-template --quiet

# Delete firewall rules and network
gcloud compute firewall-rules delete web-vpc-allow-http --quiet
gcloud compute firewall-rules delete web-vpc-allow-iap --quiet
gcloud compute networks subnets delete web-us --region=$REGION_US --quiet
gcloud compute networks subnets delete web-eu --region=$REGION_EU --quiet
gcloud compute networks delete web-vpc --quiet

echo "Cleanup complete."
```

Success Criteria
- Custom VPC with subnets in two regions
- Instance template configured with startup script
- MIGs in both regions with health checks and autoscaling
- Global load balancer distributing traffic to both regions
- Multiple curl requests show responses from different VMs/zones
- All resources cleaned up
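To check the "different VMs/zones" criterion without reading each response by eye, the responses can be tallied by zone. The sketch below assumes the page format written by the startup script in Task 2 (a `Zone: <zone>` line in the HTML); `tally_zones` is a hypothetical helper name.

```shell
# Hypothetical helper: read response bodies on stdin and print "<count> <zone>"
# for each zone seen, sorted by zone name.
tally_zones() {
  grep -o 'Zone: [a-z0-9-]*' \
    | awk '{ count[$2]++ } END { for (z in count) print count[z], z }' \
    | sort -k2
}
```

A possible invocation is `for i in $(seq 1 20); do curl -s "http://$WEB_IP"; done | tally_zones`. Because a global load balancer typically routes each client to the closest healthy region, requests from a single location will usually show several zones within one region rather than both regions.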
Next Module
Next up: Module 2.4: Cloud Storage (GCS) --- Master storage classes, lifecycle management, versioning, signed URLs, and the gsutil/gcloud commands you will use every day.