
Module 2.10: GCP Cloud Operations (Monitoring & Logging)

Complexity: [MEDIUM] | Time to Complete: 2.5h | Prerequisites: Module 2.3 (Compute Engine), Module 2.7 (Cloud Run)

After completing this module, you will be able to:

  • Configure Cloud Monitoring dashboards with custom metrics, uptime checks, and alerting policies
  • Implement structured logging with Cloud Logging and build log-based metrics for application-level observability
  • Deploy Cloud Trace and Cloud Profiler to diagnose latency bottlenecks in distributed GCP applications
  • Design multi-project monitoring with metrics scoping and centralized alerting across GCP organizations

In November 2022, a fintech company’s payment processing service began failing intermittently. Customers reported that approximately 5% of transactions were being declined with a generic “server error.” The on-call engineer checked the Cloud Run dashboard and saw that CPU and memory utilization were normal. Request count looked steady. Everything appeared healthy from the infrastructure layer. The issue persisted for 4 hours before a senior engineer noticed an anomaly in the application logs: a third-party payment gateway was returning HTTP 429 (rate limit exceeded) for requests from a specific IP range. This log signal was buried in 2 million log lines per hour because the team had no log-based metrics, no alerting on error rates, and no structured logging. They were flying blind in a sea of unstructured text. The 4-hour delay in diagnosis cost them $340,000 in failed transactions and a significant hit to customer trust.

This incident demonstrates a truth that every platform engineer learns the hard way: metrics tell you that something is wrong; logs tell you why. You need both, and you need them working together. Cloud Operations (formerly Stackdriver) is GCP’s integrated suite for monitoring, logging, and alerting. It is not a separate product you bolt on---it is deeply integrated into every GCP service. Cloud Logging automatically captures logs from managed services like Cloud Run, GKE, and Cloud Functions. Compute Engine instances require the Cloud Ops Agent for application and OS log collection. Cloud Monitoring automatically collects metrics from all GCP resources.

In this module, you will learn how Cloud Logging’s architecture works (the log router, sinks, and exclusions), how to create log-based metrics that turn log patterns into alertable signals, how Cloud Monitoring dashboards and alerting policies work, and how to set up uptime checks for external monitoring of your services.


Every log entry generated in GCP flows through the Log Router. The router evaluates each log entry against a set of rules (called “sinks”) to determine where the log goes.

Log Sources                Log Router                  Destinations
───────────                ──────────                  ────────────
┌──────────────────┐                                 ┌──────────────────┐
│ Compute Engine   │──┐                         ┌──> │ Cloud Logging    │
└──────────────────┘  │                         │    │ (default)        │
┌──────────────────┐  │   ┌─────────────┐       │    └──────────────────┘
│ Cloud Run        │──┤   │ Inclusion   │       │    ┌──────────────────┐
└──────────────────┘  │   │ filters     │       ├──> │ Cloud Storage    │
┌──────────────────┐  ├──>│ Exclusion   │───────┤    │ (long-term)      │
│ GKE              │──┤   │ filters     │       │    └──────────────────┘
└──────────────────┘  │   │ Sinks       │       │    ┌──────────────────┐
┌──────────────────┐  │   └─────────────┘       ├──> │ BigQuery         │
│ Cloud Functions  │──┤                         │    │ (analytics)      │
└──────────────────┘  │                         │    └──────────────────┘
┌──────────────────┐  │                         │    ┌──────────────────┐
│ Cloud Audit Logs │──┘                         └──> │ Pub/Sub          │
└──────────────────┘                                 │ (streaming)      │
                                                     └──────────────────┘
| Log Type | Auto-collected | Cost | Retention | Example |
|---|---|---|---|---|
| Admin Activity | Yes (always on) | Free | 400 days | IAM changes, resource creation |
| Data Access | Must enable | Paid | 30 days (default) | Who read what data |
| System Event | Yes (always on) | Free | 400 days | Live migration, auto-scaling |
| Platform Logs | Yes | Paid | 30 days (default) | Cloud Run requests, GKE events |
| Application Logs | Yes (stdout/stderr) | Paid | 30 days (default) | Your application output |
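The routing behavior above can be modeled in a few lines of Python. This is a conceptual sketch for illustration only (not GCP code): every sink's filter is evaluated independently, while exclusion filters only suppress ingestion into the default Cloud Logging bucket.

```python
# Conceptual model of the Log Router (illustration only, not GCP code).
# Each entry is evaluated against every sink independently; exclusion
# filters affect only the default Cloud Logging bucket.

def route(entry, sinks, exclusions):
    """Return the list of destinations that receive this log entry."""
    destinations = []
    # Custom sinks: an entry is copied to every sink whose filter matches.
    for name, matches in sinks.items():
        if matches(entry):
            destinations.append(name)
    # Default bucket: ingested unless an exclusion filter matches.
    if not any(excluded(entry) for excluded in exclusions):
        destinations.append("_Default")
    return destinations

entry = {"severity": "DEBUG", "resource_type": "cloud_run_revision"}
sinks = {"bigquery": lambda e: e["severity"] in ("DEBUG", "INFO", "ERROR")}
exclusions = [lambda e: e["severity"] == "DEBUG"]

# The entry still reaches the BigQuery sink even though the exclusion
# keeps it out of default storage.
print(route(entry, sinks, exclusions))  # ['bigquery']
```

This independence between sinks and exclusions is the key property to remember when reasoning about where a log entry ends up.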
# Basic log query
gcloud logging read 'resource.type="cloud_run_revision"' \
--limit=20 \
--format=json
# Filter by severity
gcloud logging read 'severity>=ERROR AND resource.type="cloud_run_revision"' \
--limit=10
# Filter by time range
gcloud logging read 'resource.type="gce_instance" AND timestamp>="2024-01-15T00:00:00Z" AND timestamp<"2024-01-16T00:00:00Z"' \
--limit=50
# Search for specific text in log messages
gcloud logging read 'textPayload:"connection refused"' \
--limit=10
# Structured log query (jsonPayload)
gcloud logging read 'jsonPayload.status>=500 AND resource.type="cloud_run_revision"' \
--limit=20
# Query specific resource
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="my-api"' \
--limit=10 \
--format="table(timestamp, severity, textPayload)"

The Log Explorer in the console uses a powerful query language:

# Compound queries
resource.type="cloud_run_revision"
AND resource.labels.service_name="my-api"
AND severity>=WARNING
AND jsonPayload.latency_ms>500
AND timestamp>="2024-01-15T10:00:00Z"
# NOT operator
resource.type="gce_instance"
AND NOT severity="DEBUG"
# Regex matching
textPayload=~"error.*timeout"
# Specific labels
labels."compute.googleapis.com/resource_name"="my-vm"

Stop and think: If a log entry matches both an inclusion filter for BigQuery and an exclusion filter for the default Cloud Logging bucket, where does the log end up?


Sinks route copies of log entries to destinations outside the default Cloud Logging storage. This is essential for long-term retention, analytics, and compliance.

# Create a sink to Cloud Storage (long-term archival)
gcloud logging sinks create archive-all-logs \
storage.googleapis.com/my-log-archive-bucket \
--log-filter='severity>=INFO'
# Create a sink to BigQuery (analytics)
gcloud logging sinks create errors-to-bigquery \
bigquery.googleapis.com/projects/my-project/datasets/error_logs \
--log-filter='severity>=ERROR'
# Create a sink to Pub/Sub (real-time streaming)
gcloud logging sinks create critical-to-pubsub \
pubsub.googleapis.com/projects/my-project/topics/critical-logs \
--log-filter='severity=CRITICAL'
# After creating a sink, grant the sink's writer identity access
# to the destination
WRITER_IDENTITY=$(gcloud logging sinks describe archive-all-logs \
--format="value(writerIdentity)")
gcloud storage buckets add-iam-policy-binding gs://my-log-archive-bucket \
--member="$WRITER_IDENTITY" \
--role="roles/storage.objectCreator"
# List all sinks
gcloud logging sinks list
# Update a sink's filter
gcloud logging sinks update archive-all-logs \
--log-filter='severity>=WARNING'
# Delete a sink
gcloud logging sinks delete archive-all-logs

Exclusion filters prevent specific log entries from being ingested into Cloud Logging’s default storage, drastically reducing costs.

# Exclude debug logs from Cloud Run (they are noisy and expensive)
gcloud logging exclusions create exclude-debug-logs \
--description="Exclude debug-level Cloud Run logs" \
--filter='resource.type="cloud_run_revision" AND severity="DEBUG"'
# Exclude health check logs (extremely noisy)
gcloud logging exclusions create exclude-health-checks \
--description="Exclude health check logs" \
--filter='httpRequest.requestUrl="/health" OR httpRequest.requestUrl="/healthz"'
# View exclusions
gcloud logging exclusions list

Writing structured (JSON) logs instead of plain text enables powerful querying and log-based metrics. It allows you to parse custom fields (like latency or user ID) natively in the Log Explorer.

import json
import logging
import sys

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "component": record.name,
        }
        # Add extra fields if present
        if hasattr(record, "request_id"):
            log_entry["request_id"] = record.request_id
        if hasattr(record, "user_id"):
            log_entry["user_id"] = record.user_id
        if hasattr(record, "latency_ms"):
            log_entry["latency_ms"] = record.latency_ms
        return json.dumps(log_entry)

# Configure logging
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("my-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("Request processed",
            extra={"request_id": "abc-123", "latency_ms": 45, "user_id": "user-456"})
# Output: {"severity": "INFO", "message": "Request processed", "component": "my-api",
#          "request_id": "abc-123", "latency_ms": 45, "user_id": "user-456"}

In Cloud Logging, this is parsed as jsonPayload, allowing queries like:

jsonPayload.latency_ms > 200
jsonPayload.user_id = "user-456"
severity = "ERROR"

(Note: Cloud Logging promotes a "severity" field in the JSON payload to the LogEntry's top-level severity field, so you query it as severity, not jsonPayload.severity.)

Pause and predict: If you use standard print() statements in Python on Cloud Run, they appear in Cloud Logging as plain text within textPayload. How does this limit your ability to create specific alerting policies compared to JSON logging?


Log-Based Metrics: Turning Logs into Signals


Log-based metrics are the bridge between logging and monitoring. They count log entries matching a filter and expose that count as a metric you can alert on.

# Create a metric that counts 5xx errors in Cloud Run
gcloud logging metrics create cloud_run_5xx_errors \
--description="Count of 5xx errors in Cloud Run" \
--log-filter='resource.type="cloud_run_revision" AND httpRequest.status>=500'
# Create a metric that counts authentication failures
gcloud logging metrics create auth_failures \
--description="Authentication failures across all services" \
--log-filter='jsonPayload.event="auth_failure" OR textPayload:"authentication failed"'
# List log-based metrics
gcloud logging metrics list
# View metric details
gcloud logging metrics describe cloud_run_5xx_errors

Distribution metrics capture the distribution of values (like latency) extracted from log fields.

# Create a distribution metric for response latency
gcloud logging metrics create api_latency \
--description="API response latency distribution" \
--log-filter='resource.type="cloud_run_revision" AND httpRequest.latency!=""' \
--bucket-options='linear-buckets={"numFiniteBuckets": 20, "width": 100, "offset": 0}' \
--value-extractor='EXTRACT(httpRequest.latency)'
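The linear-buckets options above produce boundaries at offset + width * i: 21 boundaries (20 finite buckets) from 0 to 2000 ms, plus implicit underflow and overflow buckets. A quick Python sketch of how a latency value maps to a bucket, assuming those options:

```python
import bisect

def linear_bucket_boundaries(num_finite_buckets, width, offset):
    # Boundaries at offset + width * i; values below offset or above the
    # last boundary fall into the implicit underflow/overflow buckets.
    return [offset + width * i for i in range(num_finite_buckets + 1)]

bounds = linear_bucket_boundaries(20, 100, 0)
print(bounds[:4], "...", bounds[-1])  # [0, 100, 200, 300] ... 2000

# A 450 ms latency lands in finite bucket 4, covering [400, 500).
bucket_index = bisect.bisect_right(bounds, 450) - 1
print(bucket_index)  # 4
```

Cloud Monitoring uses these bucket counts to compute percentiles such as P95 and P99 from the distribution.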

Pause and predict: You create a log-based counter metric for HTTP 500 errors. Will this metric retroactively count the errors that occurred yesterday, or only the errors that happen from the moment of creation onward?


GCP automatically collects hundreds of metrics from every service. You do not need to install agents or configure anything for these.

| Service | Example Metrics |
|---|---|
| Compute Engine | compute.googleapis.com/instance/cpu/utilization, disk/read_bytes_count |
| Cloud Run | run.googleapis.com/request_count, request_latencies, container/cpu/utilization |
| Cloud SQL | cloudsql.googleapis.com/database/cpu/utilization, connections |
| Cloud Storage | storage.googleapis.com/api/request_count, total_bytes |
| Cloud Functions | cloudfunctions.googleapis.com/function/execution_count, execution_times |
# List available metric types for a service
gcloud monitoring metrics-descriptors list \
--filter='metric.type = starts_with("run.googleapis.com")' \
--format="table(type, description)" \
--limit=20
# Query a specific metric's recent time series (uses the monitoring filter syntax)
gcloud monitoring time-series list \
--filter='metric.type="run.googleapis.com/request_count" AND resource.labels.service_name="my-api"' \
--interval-start-time=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "1 hour ago" +%Y-%m-%dT%H:%M:%SZ) \
--format=json

Dashboards can be created via the console (recommended for exploration) or via JSON/YAML (recommended for infrastructure-as-code).

# Create a dashboard from a JSON definition
cat > /tmp/dashboard.json << 'EOF'
{
  "displayName": "Cloud Run API Dashboard",
  "mosaicLayout": {
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Request Count by Status",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"run.googleapis.com/request_count\" AND resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"my-api\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_RATE",
                      "crossSeriesReducer": "REDUCE_SUM",
                      "groupByFields": ["metric.labels.response_code_class"]
                    }
                  }
                }
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "P99 Latency",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"my-api\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_PERCENTILE_99"
                    }
                  }
                }
              }
            ]
          }
        }
      }
    ]
  }
}
EOF
gcloud monitoring dashboards create --config-from-file=/tmp/dashboard.json
# List dashboards
gcloud monitoring dashboards list --format="table(displayName, name)"

Cloud Monitoring natively supports PromQL for users familiar with Prometheus.

# Request rate for a Cloud Run service
rate(run_googleapis_com:request_count{service_name="my-api"}[5m])
# Error rate (5xx responses)
rate(run_googleapis_com:request_count{service_name="my-api", response_code_class="5xx"}[5m])
/
rate(run_googleapis_com:request_count{service_name="my-api"}[5m])
* 100
# P95 latency
histogram_quantile(0.95, rate(run_googleapis_com:request_latencies_bucket{service_name="my-api"}[5m]))
# CPU utilization above 80%
compute_googleapis_com:instance_cpu_utilization{instance_name=~"web-.*"} > 0.8
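The error-rate expression above is just the 5xx request rate divided by the total request rate, times 100. A quick sanity check of that arithmetic, with made-up sample counts for one 5-minute window:

```python
# Error-rate arithmetic behind the PromQL above, with made-up sample counts.
requests_5xx = 12      # 5xx responses observed in the window
requests_total = 400   # all responses observed in the window

error_rate_pct = requests_5xx / requests_total * 100
print(f"{error_rate_pct:.1f}%")  # 3.0%
```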

# Create an alert policy for high error rate
cat > /tmp/alert-policy.json << 'EOF'
{
  "displayName": "Cloud Run 5xx Error Rate > 5%",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "5xx error rate exceeds 5%",
      "conditionThreshold": {
        "filter": "metric.type=\"run.googleapis.com/request_count\" AND resource.type=\"cloud_run_revision\" AND metric.labels.response_code_class=\"5xx\"",
        "aggregations": [
          {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_RATE",
            "crossSeriesReducer": "REDUCE_SUM",
            "groupByFields": ["resource.labels.service_name"]
          }
        ],
        "denominatorFilter": "metric.type=\"run.googleapis.com/request_count\" AND resource.type=\"cloud_run_revision\"",
        "denominatorAggregations": [
          {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_RATE",
            "crossSeriesReducer": "REDUCE_SUM",
            "groupByFields": ["resource.labels.service_name"]
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.05,
        "duration": "300s",
        "trigger": {
          "count": 1
        }
      }
    }
  ],
  "notificationChannels": [],
  "alertStrategy": {
    "autoClose": "604800s"
  }
}
EOF
gcloud monitoring policies create --policy-from-file=/tmp/alert-policy.json
# List alert policies
gcloud monitoring policies list \
--format="table(displayName, enabled, conditions[0].displayName)"
# Create an email notification channel
gcloud monitoring channels create \
--display-name="Ops Team Email" \
--type=email \
--channel-labels="email_address=ops@example.com"
# Create a Slack notification channel
gcloud monitoring channels create \
--display-name="Incidents Slack" \
--type=slack \
--channel-labels="channel_name=#incidents,auth_token=xoxb-..."
# List notification channels
gcloud monitoring channels list \
--format="table(displayName, type, name)"
# Update an alert policy to use a notification channel
CHANNEL_ID=$(gcloud monitoring channels list --filter="displayName='Ops Team Email'" --format="value(name)")
gcloud monitoring policies update POLICY_ID \
--add-notification-channels=$CHANNEL_ID
| Practice | Why | Example |
|---|---|---|
| Alert on symptoms, not causes | Symptoms affect users; causes are for investigation | Alert on error rate, not CPU usage |
| Use multi-condition alerts | Reduce noise from transient spikes | Error rate > 5% AND request count > 100 |
| Set appropriate windows | Too short = noise; too long = late | 5-minute window for critical; 15-minute for warning |
| Include runbook links | Reduce MTTR by guiding responders | Link to troubleshooting playbook in alert description |
| Avoid alert fatigue | Too many alerts = ignored alerts | Only alert on actionable conditions |
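A threshold condition's duration requires the condition to hold for the entire window before the alert fires. A minimal Python sketch of that evaluation, assuming one sample per minute (illustrative only, not Cloud Monitoring's actual implementation):

```python
# Duration-window evaluation sketch (illustrative, not the real implementation):
# the alert fires only when every sample in the trailing window violates
# the threshold.
def alert_fires(samples, threshold, window):
    """samples: one value per minute, oldest first."""
    if len(samples) < window:
        return False
    return all(v > threshold for v in samples[-window:])

# Sustained violation for the full 5-minute window: fires.
print(alert_fires([0.95] * 5, threshold=0.80, window=5))  # True

# A brief dip below the threshold resets the streak: does not fire.
cpu = [0.99, 0.99, 0.99, 0.99, 0.30, 0.99, 0.99]
print(alert_fires(cpu, threshold=0.80, window=5))  # False
```

This is why an appropriately sized duration window suppresses transient spikes without hiding sustained problems.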

Stop and think: You set an alert policy for CPU utilization > 80% with a duration window of 5 minutes. The CPU spikes to 99% for 4 minutes, drops to 30% for 30 seconds, and goes back to 99% for 2 minutes. Does the alert trigger? Why or why not?


Uptime checks monitor the availability of your public endpoints from multiple global locations.

# Create an HTTP uptime check
gcloud monitoring uptime create my-api-uptime \
--display-name="My API Health Check" \
--resource-type=uptime-url \
--resource-labels="host=my-api-abc123-uc.a.run.app,project_id=my-project" \
--http-check-path="/health" \
--http-check-port=443 \
--period=60 \
--timeout=10 \
--checker-type=STATIC_IP_CHECKERS
# List uptime checks
gcloud monitoring uptime list-configs \
--format="table(displayName, httpCheck.path, period)"
# Create an alert policy for uptime check failure
# (alert if the check fails from 2+ regions)
cat > /tmp/uptime-alert.json << 'EOF'
{
  "displayName": "API Uptime Check Failed",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Uptime check failing",
      "conditionThreshold": {
        "filter": "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\" AND resource.type=\"uptime_url\"",
        "aggregations": [
          {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_NEXT_OLDER",
            "crossSeriesReducer": "REDUCE_COUNT_FALSE",
            "groupByFields": ["resource.labels.host"]
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 1,
        "duration": "0s"
      }
    }
  ]
}
EOF
gcloud monitoring policies create --policy-from-file=/tmp/uptime-alert.json
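The REDUCE_COUNT_FALSE reducer above counts how many regional checkers report check_passed = false, and the condition compares that count to thresholdValue. A tiny sketch of that reduction, using hypothetical regional results:

```python
# REDUCE_COUNT_FALSE sketch: count regional checkers reporting failure.
# The alert condition compares this count against thresholdValue.
regional_results = {          # hypothetical check_passed values per region
    "usa-oregon": True,
    "usa-virginia": False,
    "europe": False,
    "asia-pacific": True,
}

failed_regions = sum(1 for passed in regional_results.values() if not passed)
print(failed_regions)  # 2
```

Requiring failures from multiple regions before alerting filters out false positives caused by a single checker's network path.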

Pause and predict: Why is it considered best practice to configure uptime checks to alert only when the check fails from multiple geographic regions rather than just a single region?


Diagnosing Latency: Cloud Trace and Cloud Profiler


While logs and metrics tell you that a service is slow or experiencing high load, they often do not tell you exactly where the time is being spent inside the code or across a distributed microservice architecture.

Cloud Trace is a distributed tracing system that collects latency data from your applications and displays it in the GCP Console. When a request enters your system, Trace assigns it a unique Trace ID. As the request passes through various microservices (e.g., Load Balancer → Cloud Run → Cloud SQL → external API), each service reports a “span” representing the time spent in that component.

  • Why use it: To find the exact bottleneck in a chain of microservice calls. If an API request takes 5 seconds, Trace can visually show you that 4.8 seconds were spent waiting on a single slow database query.
  • How to use it: In managed environments like Cloud Run or App Engine, basic tracing is often automatic. For granular, code-level spans, you utilize OpenTelemetry libraries to instrument your application.
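The span model can be illustrated in plain Python (a conceptual sketch, not the Cloud Trace or OpenTelemetry API): each span records a component and the time spent in it under one trace ID, and the widest span in the waterfall is the bottleneck.

```python
# Conceptual trace waterfall (illustration only, not the Cloud Trace API).
# Each span: (component, duration in ms) reported under one trace ID.
trace_id = "abc123"
spans = [
    ("load-balancer", 3),
    ("cloud-run-handler", 120),
    ("cloud-sql-query", 4800),   # the slow database query
    ("external-api-call", 77),
]

total_ms = sum(ms for _, ms in spans)
bottleneck = max(spans, key=lambda span: span[1])

print(f"trace {trace_id}: {total_ms} ms total")
print(f"bottleneck: {bottleneck[0]} ({bottleneck[1]} ms)")
# bottleneck: cloud-sql-query (4800 ms)
```

This mirrors the 5-second request example: the waterfall makes it obvious that nearly all of the time sits in a single downstream call.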

Cloud Profiler provides continuous CPU and heap profiling for applications running on GCP. It statistically gathers performance data from your production applications with minimal overhead (< 1%) and generates flame graphs.

  • Why use it: To identify which specific functions or methods in your code are consuming the most CPU cycles or allocating the most memory. It helps you optimize code efficiency and reduce compute costs.
  • How to use it: You import the Profiler agent into your application code (available for Go, Java, Node.js, and Python) and initialize it when the application boots.

Stop and think: If users report that clicking “Checkout” takes 10 seconds, but your system CPU utilization is hovering at a very healthy 20%, which tool should you reach for first to diagnose the issue: Cloud Trace or Cloud Profiler? Why?


In a real-world GCP organization, resources are rarely confined to a single project. You might have separate projects for networking, databases, frontend services, and backend APIs. Monitoring them individually leads to fragmented visibility.

A Metrics Scope allows you to view and manage monitoring data from multiple GCP projects through a single pane of glass. When you create a Metrics Scope, you designate one project as the “scoping project” (often a dedicated monitoring or DevOps project) and attach other “monitored projects” to it.

  • Dashboards: A dashboard created in the scoping project can query and display metrics side-by-side from all attached projects.
  • Alerting: You can create a single alert policy (e.g., “Alert if any Cloud SQL instance CPU > 80%”) in the scoping project that applies universally to all databases across all attached projects.
  • Access Control: You can grant your SRE or Ops team access to the scoping project, giving them full observability across the organization without needing to provision IAM roles in every individual application project.

Pause and predict: If you have 10 separate production microservice projects, should you manage alert policies in each project separately, or centrally within a single metrics scoping project?


  1. Cloud Logging ingests over 150 petabytes of log data per month across all GCP customers. The log router processes over 50 billion log entries per day. Despite this scale, the median query response time in the Log Explorer is under 3 seconds for queries spanning a 1-hour time window.

  2. Log-based metrics are evaluated in real-time as logs flow through the log router, not after they are stored. This means you can create an alert based on a log-based metric and receive a notification within 60-90 seconds of the triggering log entry being written---even before you could find it manually in the Log Explorer.

  3. Cloud Monitoring’s uptime checks run from 6 global regions simultaneously (USA-Oregon, USA-Virginia, South America, Europe, Asia Pacific-1, Asia Pacific-2). A check is considered “failed” only when it fails from multiple regions, reducing false positives from network partitions. You can see the per-region results in the uptime check dashboard.

  4. The Ops Agent (successor to the legacy Monitoring and Logging agents) supports both Prometheus metric scraping and fluent-bit log collection in a single agent. If you are running custom metrics in Prometheus format on your VMs, the Ops Agent can scrape them and send them to Cloud Monitoring without running a separate Prometheus server.


| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Not creating log sinks for long-term retention | Default 30-day retention seems enough | Create sinks to Cloud Storage for compliance; 30 days passes quickly during incident investigation |
| Logging too much at DEBUG level | Verbose logging during development | Use INFO as default; enable DEBUG only in non-production; use exclusion filters |
| Not creating log-based metrics | Relying on manual log searching | Create metrics for key patterns (errors, auth failures, latency thresholds) |
| Setting alert thresholds too sensitive | Wanting to catch every issue | Use multi-condition alerts and appropriate duration windows (5-15 minutes) |
| Not using structured logging | Plain text seems simpler | JSON logs enable powerful filtering in Log Explorer; use structured logging from day one |
| Ignoring uptime checks | Internal monitoring seems sufficient | Uptime checks verify from an external perspective; they catch DNS, certificate, and network issues |
| Alert fatigue from too many alerts | Adding alerts without reviewing existing ones | Quarterly alert hygiene review; delete alerts that are never actionable |
| Not routing audit logs to BigQuery | Do not know about log sinks | Create a sink for audit logs to BigQuery for security analytics and compliance |

1. Your company wants to retain all access logs for 5 years for compliance, but the security team is complaining that their default Cloud Logging bill is astronomically high due to debug logs from the staging environment. How do you configure the Log Router to satisfy both requirements?

You should create a log sink that routes all access logs to a Cloud Storage bucket configured with a 5-year retention policy and a lifecycle rule for cost optimization. Simultaneously, you must create an exclusion filter in the Log Router for the staging environment’s debug logs. The exclusion filter prevents the noisy debug logs from being ingested into the expensive default Cloud Logging storage, saving money. Because sinks and exclusions operate independently, the compliance sink will still capture the required access logs before any exclusions affect the default bucket.

2. An on-call engineer notices that the "Request Latency" log-based counter metric is firing alerts, but they cannot determine if the slow requests are taking 1 second or 30 seconds. What design flaw exists in their log-based metric, and how should it be redesigned?

The engineer created a log-based counter metric, which simply counts the number of log entries matching a filter (e.g., latency > 500ms) without capturing the actual latency value itself. To fix this, they need to recreate it as a log-based distribution metric. A distribution metric uses a value extractor to pull the specific numeric latency value from each structured JSON log entry. This allows Cloud Monitoring to calculate percentiles like P95 and P99, giving the engineer precise visibility into exactly how slow the requests actually are during an incident.

3. You created an exclusion filter to drop noisy HTTP 200 health check logs to save money on Cloud Logging ingestion. However, the security team complains that these logs are now missing from their custom BigQuery sink, which they use for historical audits. How does the Log Router handle this, and what went wrong?

In GCP, the Log Router processes log exclusions and log sinks completely independently of one another. Creating an exclusion filter prevents the logs from being ingested into the _Default Cloud Logging storage bucket, saving ingestion costs. It does not, however, prevent those same logs from being routed to a custom sink, such as BigQuery or Cloud Storage. If the security team is missing logs in their custom sink, the issue is with that specific sink’s inclusion/exclusion filter, not the general exclusion filter you created for the default bucket.

4. Your team receives pager alerts every night at 3 AM because a database VM's CPU hits 95%. However, this coincides with a scheduled nightly backup, and customer-facing API latency remains completely normal during this time. How should you restructure this alerting strategy to prevent alert fatigue?

This alert is currently firing on a “cause” (high CPU) rather than a “symptom” (user impact). Because the high CPU does not degrade the customer experience during the backup, this alert is unactionable and causes severe alert fatigue. You should restructure the alerting strategy to trigger on symptoms, such as the API’s P99 latency exceeding a certain threshold or the HTTP 5xx error rate spiking. If you still want to monitor the CPU for capacity planning, you should change the notification channel from a paging system to a low-priority email or Slack message that can be reviewed asynchronously during business hours.

5. A developer complains that their microservice is intermittently taking 4 seconds to respond instead of the usual 50ms. The service calls three other downstream GCP services and a Cloud SQL database. They ask you to check the CPU metrics on the Cloud Run instances to find the problem. Which GCP observability tool should you recommend they use instead, and why?

You should recommend using Cloud Trace rather than simply looking at CPU metrics. High latency in a distributed system is often caused by network waits, database locks, or slow downstream API calls, none of which will show up as high CPU utilization. Cloud Trace tracks a single request as it propagates through all the microservices and the database, creating a visual waterfall diagram of spans. This will immediately show exactly which downstream service or specific database query is responsible for the 4-second delay, drastically reducing the time to resolution.

6. You are tasked with centralizing monitoring for 15 different GCP projects belonging to 3 different product teams. Currently, engineers have to switch between projects in the GCP console to view dashboards and alerts, leading to fragmented observability. How do you architect a solution in Cloud Monitoring to provide a "single pane of glass"?

You should implement a Metrics Scope hosted in a dedicated, centralized monitoring project. By attaching the 15 individual product projects to this single scoping project, Cloud Monitoring will aggregate all their metrics into one unified view. This allows you to build centralized dashboards and configure global alerting policies that evaluate resources across all 15 projects simultaneously. Furthermore, you can grant the engineering teams IAM access to the scoping project, giving them full visibility into the organization’s health without needing to grant them permissions in every individual production project.


Hands-On Exercise: Monitoring and Alerting for Cloud Run


Deploy a Cloud Run service with structured logging, create log-based metrics, set up a monitoring dashboard, and configure alerting.

  • gcloud CLI installed and authenticated
  • A GCP project with billing enabled

Task 1: Deploy a Cloud Run Service with Structured Logging

Solution
export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1
# Enable APIs
gcloud services enable \
run.googleapis.com \
monitoring.googleapis.com \
logging.googleapis.com
mkdir -p /tmp/ops-lab && cd /tmp/ops-lab
cat > main.py << 'PYEOF'
import json
import logging
import os
import random
import sys
import time

from flask import Flask, request, jsonify

app = Flask(__name__)

class JSONFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        for key in ["latency_ms", "status_code", "path", "error_type"]:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("ops-lab")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

@app.route("/")
def home():
    start = time.time()
    latency_ms = int((time.time() - start) * 1000) + random.randint(5, 50)
    logger.info("Request processed",
                extra={"latency_ms": latency_ms, "status_code": 200, "path": "/"})
    return jsonify({"status": "ok", "latency_ms": latency_ms})

@app.route("/slow")
def slow():
    delay = random.uniform(0.5, 2.0)
    time.sleep(delay)
    latency_ms = int(delay * 1000)
    logger.warning("Slow request detected",
                   extra={"latency_ms": latency_ms, "status_code": 200, "path": "/slow"})
    return jsonify({"status": "ok", "latency_ms": latency_ms})

@app.route("/error")
def error():
    error_types = ["DatabaseTimeout", "AuthenticationFailed", "RateLimitExceeded"]
    error_type = random.choice(error_types)
    logger.error("Request failed",
                 extra={"latency_ms": 0, "status_code": 500, "path": "/error",
                        "error_type": error_type})
    return jsonify({"status": "error", "error": error_type}), 500

@app.route("/health")
def health():
    return jsonify({"status": "healthy"})

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
PYEOF
cat > requirements.txt << 'EOF'
flask>=3.0.0
gunicorn>=21.2.0
EOF
cat > Dockerfile << 'DEOF'
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "2", "main:app"]
DEOF
gcloud run deploy ops-lab-api \
--source=. \
--region=$REGION \
--allow-unauthenticated \
--memory=256Mi
SERVICE_URL=$(gcloud run services describe ops-lab-api \
--region=$REGION --format="value(status.url)")
echo "Service URL: $SERVICE_URL"

Task 2: Generate Traffic and Logs

Solution
SERVICE_URL=$(gcloud run services describe ops-lab-api \
--region=$REGION --format="value(status.url)")
# Generate normal traffic
for i in $(seq 1 15); do
curl -s "$SERVICE_URL/" > /dev/null
done
# Generate slow requests
for i in $(seq 1 5); do
curl -s "$SERVICE_URL/slow" > /dev/null
done
# Generate errors
for i in $(seq 1 8); do
curl -s "$SERVICE_URL/error" > /dev/null
done
echo "Traffic generated. Waiting for logs to appear..."
sleep 15
# View logs
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="ops-lab-api" AND jsonPayload.message!=""' \
--limit=15 \
--format="table(timestamp, jsonPayload.severity, jsonPayload.message, jsonPayload.status_code, jsonPayload.latency_ms)"

Task 3: Create Log-Based Metrics

Solution
# Metric: Count of 500 errors
gcloud logging metrics create ops_lab_errors \
--description="Count of 500 errors in ops-lab-api" \
--log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="ops-lab-api" AND jsonPayload.status_code=500'
# Metric: Count of slow requests (latency > 500ms)
gcloud logging metrics create ops_lab_slow_requests \
--description="Count of slow requests (>500ms) in ops-lab-api" \
--log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="ops-lab-api" AND jsonPayload.latency_ms>500'
# List metrics
gcloud logging metrics list \
--format="table(name, description, filter)"

Task 4: Create an Uptime Check

Solution
# Get the Cloud Run hostname
SERVICE_HOST=$(echo $SERVICE_URL | sed 's|https://||')
# Create an uptime check
# Note: uptime checks via gcloud have limited support;
# using the REST API is more reliable for complex configs
gcloud monitoring uptime create ops-lab-uptime \
--display-name="Ops Lab API Health" \
--resource-type=uptime-url \
--resource-labels="host=$SERVICE_HOST,project_id=$PROJECT_ID" \
--http-check-path="/health" \
--http-check-port=443 \
--http-check-request-method=GET \
--period=60 \
--timeout=10
# List uptime checks
gcloud monitoring uptime list-configs \
--format="table(displayName, httpCheck.path, period)"
echo "Uptime check created. Results will appear in ~2 minutes."

Task 5: Query Monitoring Metrics

Solution
# Generate more traffic for metrics to populate
for i in $(seq 1 20); do
curl -s "$SERVICE_URL/" > /dev/null
curl -s "$SERVICE_URL/error" > /dev/null 2>&1
done
sleep 30
# Query Cloud Run request count
gcloud monitoring time-series list \
--filter='metric.type="run.googleapis.com/request_count" AND resource.labels.service_name="ops-lab-api"' \
--interval-start-time=$(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "15 minutes ago" +%Y-%m-%dT%H:%M:%SZ) \
--format="table(metric.labels.response_code, points[0].value.int64Value)" \
--limit=10
# Query the log-based error metric
gcloud monitoring time-series list \
--filter='metric.type="logging.googleapis.com/user/ops_lab_errors"' \
--interval-start-time=$(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "15 minutes ago" +%Y-%m-%dT%H:%M:%SZ) \
--format=json \
--limit=5

Task 6: Clean Up

Solution
# Delete Cloud Run service
gcloud run services delete ops-lab-api --region=$REGION --quiet
# Delete log-based metrics
gcloud logging metrics delete ops_lab_errors --quiet
gcloud logging metrics delete ops_lab_slow_requests --quiet
# Delete uptime check
UPTIME_ID=$(gcloud monitoring uptime list-configs \
--filter="displayName='Ops Lab API Health'" --format="value(name)" | head -1)
gcloud monitoring uptime delete $UPTIME_ID --quiet 2>/dev/null || true
# Clean up local files
rm -rf /tmp/ops-lab
echo "Cleanup complete."
  • Cloud Run service deployed with structured JSON logging
  • Traffic generated (normal, slow, and error requests)
  • Structured logs visible in Cloud Logging with queryable fields
  • Log-based metrics created for errors and slow requests
  • Uptime check configured and running
  • All resources cleaned up

Next up: Module 2.11: Cloud Build & CI/CD --- Learn how to define build pipelines with cloudbuild.yaml, use built-in and custom builders, set up triggers from GitHub and GitLab, and orchestrate deployments with Cloud Deploy.