Module 2.10: GCP Cloud Operations (Monitoring & Logging)
Complexity: [MEDIUM] | Time to Complete: 2.5h | Prerequisites: Module 2.3 (Compute Engine), Module 2.7 (Cloud Run)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure Cloud Monitoring dashboards with custom metrics, uptime checks, and alerting policies
- Implement structured logging with Cloud Logging and build log-based metrics for application-level observability
- Deploy Cloud Trace and Cloud Profiler to diagnose latency bottlenecks in distributed GCP applications
- Design multi-project monitoring with metrics scoping and centralized alerting across GCP organizations
Why This Module Matters
In November 2022, a fintech company’s payment processing service began failing intermittently. Customers reported that approximately 5% of transactions were being declined with a generic “server error.” The on-call engineer checked the Cloud Run dashboard and saw that CPU and memory utilization were normal. Request count looked steady. Everything appeared healthy from the infrastructure layer. The issue persisted for 4 hours before a senior engineer noticed an anomaly in the application logs: a third-party payment gateway was returning HTTP 429 (rate limit exceeded) for requests from a specific IP range. This log signal was buried in 2 million log lines per hour because the team had no log-based metrics, no alerting on error rates, and no structured logging. They were flying blind in a sea of unstructured text. The 4-hour delay in diagnosis cost them $340,000 in failed transactions and a significant hit to customer trust.
This incident demonstrates a truth that every platform engineer learns the hard way: metrics tell you that something is wrong; logs tell you why. You need both, and you need them working together. Cloud Operations (formerly Stackdriver) is GCP’s integrated suite for monitoring, logging, and alerting. It is not a separate product you bolt on; it is deeply integrated into every GCP service. Cloud Logging automatically captures logs from managed services like Cloud Run, GKE, and Cloud Functions. Compute Engine instances require the Ops Agent for application and OS log collection. Cloud Monitoring automatically collects metrics from all GCP resources.
In this module, you will learn how Cloud Logging’s architecture works (the log router, sinks, and exclusions), how to create log-based metrics that turn log patterns into alertable signals, how Cloud Monitoring dashboards and alerting policies work, and how to set up uptime checks for external monitoring of your services.
Cloud Logging Architecture
The Log Router
Every log entry generated in GCP flows through the Log Router. The router evaluates each log entry against exclusion filters and a set of routing rules (called “sinks”) to determine where the log goes.
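The routing behavior can be sketched as a toy Python model (an illustration, not GCP’s implementation): sinks and exclusion filters are evaluated independently, so excluding an entry from the default bucket does not stop a BigQuery or Cloud Storage sink from receiving it.

```python
# Toy model of the Log Router (illustrative only, not GCP's implementation).
# Sinks and exclusion filters are evaluated independently: an exclusion on
# the _Default bucket never affects delivery to user-defined sinks.

def route(entry, sinks, default_exclusions):
    """Return the list of destinations that receive this log entry.

    sinks: list of (destination, inclusion_filter) pairs.
    default_exclusions: exclusion filters applied only to the _Default bucket.
    """
    destinations = []
    # The _Default Cloud Logging bucket receives the entry unless an
    # exclusion filter matches it.
    if not any(excl(entry) for excl in default_exclusions):
        destinations.append("cloud-logging-default")
    # Every user-defined sink is evaluated on its own filter.
    for dest, include in sinks:
        if include(entry):
            destinations.append(dest)
    return destinations

sinks = [("bigquery:error_logs", lambda e: e["severity"] in ("ERROR", "CRITICAL")),
         ("gcs:archive", lambda e: True)]
exclusions = [lambda e: e["severity"] == "DEBUG"]

# A DEBUG entry is dropped from default storage but still reaches the archive sink.
print(route({"severity": "DEBUG"}, sinks, exclusions))  # ['gcs:archive']
```

Keep this independence in mind when reading the “Stop and think” questions later in this section.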
```
Log Sources             Log Router              Destinations
───────────             ──────────              ────────────
┌─────────────────┐                          ┌──────────────────┐
│ Compute Engine  │──┐                   ┌──>│ Cloud Logging    │
└─────────────────┘  │                   │   │ (default)        │
┌─────────────────┐  │  ┌─────────────┐  │   └──────────────────┘
│ Cloud Run       │──┤  │ Inclusion   │  │   ┌──────────────────┐
└─────────────────┘  │  │ filters     │  ├──>│ Cloud Storage    │
┌─────────────────┐  ├─>│ Exclusion   │──┤   │ (long-term)      │
│ GKE             │──┤  │ filters     │  │   └──────────────────┘
└─────────────────┘  │  │ Sinks       │  │   ┌──────────────────┐
┌─────────────────┐  │  └─────────────┘  ├──>│ BigQuery         │
│ Cloud Functions │──┤                   │   │ (analytics)      │
└─────────────────┘  │                   │   └──────────────────┘
┌─────────────────┐  │                   │   ┌──────────────────┐
│ Cloud Audit Logs│──┘                   └──>│ Pub/Sub          │
└─────────────────┘                          │ (streaming)      │
                                             └──────────────────┘
```

Log Types
| Log Type | Auto-collected | Cost | Retention | Example |
|---|---|---|---|---|
| Admin Activity | Yes (always on) | Free | 400 days | IAM changes, resource creation |
| Data Access | Must enable | Paid | 30 days (default) | Who read what data |
| System Event | Yes (always on) | Free | 400 days | Live migration, auto-scaling |
| Platform Logs | Yes | Paid | 30 days (default) | Cloud Run requests, GKE events |
| Application Logs | Yes (stdout/stderr) | Paid | 30 days (default) | Your application output |
Querying Logs
```sh
# Basic log query
gcloud logging read 'resource.type="cloud_run_revision"' \
  --limit=20 \
  --format=json

# Filter by severity
gcloud logging read 'severity>=ERROR AND resource.type="cloud_run_revision"' \
  --limit=10

# Filter by time range
gcloud logging read 'resource.type="gce_instance" AND timestamp>="2024-01-15T00:00:00Z" AND timestamp<"2024-01-16T00:00:00Z"' \
  --limit=50

# Search for specific text in log messages
gcloud logging read 'textPayload:"connection refused"' \
  --limit=10

# Structured log query (jsonPayload)
gcloud logging read 'jsonPayload.status>=500 AND resource.type="cloud_run_revision"' \
  --limit=20

# Query specific resource
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="my-api"' \
  --limit=10 \
  --format="table(timestamp, severity, textPayload)"
```

Log Explorer Query Language
The Log Explorer in the console uses a powerful query language:

```
# Compound queries
resource.type="cloud_run_revision"
AND resource.labels.service_name="my-api"
AND severity>=WARNING
AND jsonPayload.latency_ms>500
AND timestamp>="2024-01-15T10:00:00Z"

# NOT operator
resource.type="gce_instance"
AND NOT severity="DEBUG"

# Regex matching
textPayload=~"error.*timeout"

# Specific labels
labels."compute.googleapis.com/resource_name"="my-vm"
```

Stop and think: If a log entry matches both an inclusion filter for BigQuery and an exclusion filter for the default Cloud Logging bucket, where does the log end up?
Log Sinks: Routing Logs to Destinations
Sinks route copies of log entries to destinations outside the default Cloud Logging storage. This is essential for long-term retention, analytics, and compliance.

```sh
# Create a sink to Cloud Storage (long-term archival)
gcloud logging sinks create archive-all-logs \
  storage.googleapis.com/my-log-archive-bucket \
  --log-filter='severity>=INFO'

# Create a sink to BigQuery (analytics)
gcloud logging sinks create errors-to-bigquery \
  bigquery.googleapis.com/projects/my-project/datasets/error_logs \
  --log-filter='severity>=ERROR'

# Create a sink to Pub/Sub (real-time streaming)
gcloud logging sinks create critical-to-pubsub \
  pubsub.googleapis.com/projects/my-project/topics/critical-logs \
  --log-filter='severity=CRITICAL'

# After creating a sink, grant the sink's writer identity access
# to the destination
WRITER_IDENTITY=$(gcloud logging sinks describe archive-all-logs \
  --format="value(writerIdentity)")

gcloud storage buckets add-iam-policy-binding gs://my-log-archive-bucket \
  --member="$WRITER_IDENTITY" \
  --role="roles/storage.objectCreator"

# List all sinks
gcloud logging sinks list

# Update a sink's filter
gcloud logging sinks update archive-all-logs \
  --log-filter='severity>=WARNING'

# Delete a sink
gcloud logging sinks delete archive-all-logs
```

Log Exclusions (Reducing Cost)
Exclusion filters prevent specific log entries from being ingested into Cloud Logging’s default storage, drastically reducing costs.

```sh
# Exclude debug logs from Cloud Run (they are noisy and expensive)
gcloud logging exclusions create exclude-debug-logs \
  --description="Exclude debug-level Cloud Run logs" \
  --filter='resource.type="cloud_run_revision" AND severity="DEBUG"'

# Exclude health check logs (extremely noisy)
gcloud logging exclusions create exclude-health-checks \
  --description="Exclude health check logs" \
  --filter='httpRequest.requestUrl="/health" OR httpRequest.requestUrl="/healthz"'

# View exclusions
gcloud logging exclusions list
```

Structured Logging
Writing structured (JSON) logs instead of plain text enables powerful querying and log-based metrics. It allows you to parse custom fields (like latency or user ID) natively in the Log Explorer.
Python Structured Logging for Cloud Run
```python
import json
import logging
import sys

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "component": record.name,
        }

        # Add extra fields if present
        if hasattr(record, "request_id"):
            log_entry["request_id"] = record.request_id
        if hasattr(record, "user_id"):
            log_entry["user_id"] = record.user_id
        if hasattr(record, "latency_ms"):
            log_entry["latency_ms"] = record.latency_ms

        return json.dumps(log_entry)

# Configure logging
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("my-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("Request processed",
            extra={"request_id": "abc-123", "latency_ms": 45, "user_id": "user-456"})
# Output: {"severity": "INFO", "message": "Request processed", "component": "my-api",
#          "request_id": "abc-123", "user_id": "user-456", "latency_ms": 45}
```

In Cloud Logging, this is parsed as jsonPayload, allowing queries like:

```
jsonPayload.latency_ms > 200
jsonPayload.user_id = "user-456"
severity = "ERROR"
```

(Note that severity is a special field: Cloud Logging promotes it from the JSON line to the log entry’s top-level severity rather than keeping it in jsonPayload.)

Pause and predict: If you use standard print() statements in Python on Cloud Run, they appear in Cloud Logging as plain text within textPayload. How does this limit your ability to create specific alerting policies compared to JSON logging?
Log-Based Metrics: Turning Logs into Signals
Log-based metrics are the bridge between logging and monitoring. They count log entries matching a filter and expose that count as a metric you can alert on.
Counter Metrics

```sh
# Create a metric that counts 5xx errors in Cloud Run
gcloud logging metrics create cloud_run_5xx_errors \
  --description="Count of 5xx errors in Cloud Run" \
  --log-filter='resource.type="cloud_run_revision" AND httpRequest.status>=500'

# Create a metric that counts authentication failures
gcloud logging metrics create auth_failures \
  --description="Authentication failures across all services" \
  --log-filter='jsonPayload.event="auth_failure" OR textPayload:"authentication failed"'

# List log-based metrics
gcloud logging metrics list

# View metric details
gcloud logging metrics describe cloud_run_5xx_errors
```

Distribution Metrics

Distribution metrics capture the distribution of values (like latency) extracted from log fields.

```sh
# Create a distribution metric for response latency
gcloud logging metrics create api_latency \
  --description="API response latency distribution" \
  --log-filter='resource.type="cloud_run_revision" AND httpRequest.latency!=""' \
  --bucket-options='linear-buckets={"numFiniteBuckets": 20, "width": 100, "offset": 0}' \
  --value-extractor='EXTRACT(httpRequest.latency)'
```

Pause and predict: You create a log-based counter metric for HTTP 500 errors. Will this metric retroactively count the errors that occurred yesterday, or only the errors that happen from the moment of creation onward?
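The linear-buckets option above defines 20 finite buckets of width 100 starting at offset 0, plus an underflow and an overflow bucket. As a rough sketch of how an extracted latency value lands in a bucket (the function mirrors the documented linear-bucket layout; it is an illustration, not Cloud Monitoring’s code):

```python
# Linear bucketing as used by distribution metrics: bucket 0 is underflow
# (< offset), buckets 1..N cover [offset + (i-1)*width, offset + i*width),
# and bucket N+1 is overflow. Illustrative sketch only.

def linear_bucket_index(value, num_finite_buckets=20, width=100, offset=0):
    """Map a value to its bucket index (0 .. num_finite_buckets + 1)."""
    if value < offset:
        return 0  # underflow bucket
    index = 1 + int((value - offset) // width)
    return min(index, num_finite_buckets + 1)  # clamp to overflow bucket

print(linear_bucket_index(45))    # 1  -> covers [0, 100)
print(linear_bucket_index(250))   # 3  -> covers [200, 300)
print(linear_bucket_index(9999))  # 21 -> overflow bucket
```

Percentiles like P95 and P99 are then interpolated from the per-bucket counts, which is why bucket width matters: buckets that are too wide make percentile estimates coarse.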
Cloud Monitoring: Dashboards and Metrics
Built-in Metrics
GCP automatically collects hundreds of metrics from every service. You do not need to install agents or configure anything for these.
| Service | Example Metrics |
|---|---|
| Compute Engine | compute.googleapis.com/instance/cpu/utilization, disk/read_bytes_count |
| Cloud Run | run.googleapis.com/request_count, request_latencies, container/cpu/utilization |
| Cloud SQL | cloudsql.googleapis.com/database/cpu/utilization, connections |
| Cloud Storage | storage.googleapis.com/api/request_count, total_bytes |
| Cloud Functions | cloudfunctions.googleapis.com/function/execution_count, execution_times |
```sh
# List available metric types for a service
gcloud monitoring metrics-descriptors list \
  --filter='metric.type = starts_with("run.googleapis.com")' \
  --format="table(type, description)" \
  --limit=20

# Query a specific metric's time series
# (note: -v-1H is BSD/macOS date syntax; on Linux use: date -u -d '-1 hour')
gcloud monitoring time-series list \
  --filter='metric.type="run.googleapis.com/request_count" AND resource.labels.service_name="my-api"' \
  --interval-start-time=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --format=json
```

Creating Dashboards
Dashboards can be created via the console (recommended for exploration) or via JSON/YAML (recommended for infrastructure-as-code).

```sh
# Create a dashboard from a JSON definition
cat > /tmp/dashboard.json << 'EOF'
{
  "displayName": "Cloud Run API Dashboard",
  "mosaicLayout": {
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Request Count by Status",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"run.googleapis.com/request_count\" AND resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"my-api\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_RATE",
                      "crossSeriesReducer": "REDUCE_SUM",
                      "groupByFields": ["metric.labels.response_code_class"]
                    }
                  }
                }
              }
            ]
          }
        }
      },
      {
        "xPos": 6,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "P99 Latency",
          "xyChart": {
            "dataSets": [
              {
                "timeSeriesQuery": {
                  "timeSeriesFilter": {
                    "filter": "metric.type=\"run.googleapis.com/request_latencies\" AND resource.type=\"cloud_run_revision\" AND resource.labels.service_name=\"my-api\"",
                    "aggregation": {
                      "alignmentPeriod": "60s",
                      "perSeriesAligner": "ALIGN_PERCENTILE_99"
                    }
                  }
                }
              }
            ]
          }
        }
      }
    ]
  }
}
EOF

gcloud monitoring dashboards create --config-from-file=/tmp/dashboard.json

# List dashboards
gcloud monitoring dashboards list --format="table(displayName, name)"
```

PromQL in Cloud Monitoring
Cloud Monitoring natively supports PromQL for users familiar with Prometheus.

```
# Request rate for a Cloud Run service
rate(run_googleapis_com:request_count{service_name="my-api"}[5m])

# Error rate (5xx responses)
rate(run_googleapis_com:request_count{service_name="my-api", response_code_class="5xx"}[5m]) /
rate(run_googleapis_com:request_count{service_name="my-api"}[5m]) * 100

# P95 latency
histogram_quantile(0.95, rate(run_googleapis_com:request_latencies_bucket{service_name="my-api"}[5m]))

# CPU utilization above 80%
compute_googleapis_com:instance_cpu_utilization{instance_name=~"web-.*"} > 0.8
```

Alerting Policies
Creating Alert Policies

```sh
# Create an alert policy for high error rate
cat > /tmp/alert-policy.json << 'EOF'
{
  "displayName": "Cloud Run 5xx Error Rate > 5%",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "5xx error rate exceeds 5%",
      "conditionThreshold": {
        "filter": "metric.type=\"run.googleapis.com/request_count\" AND resource.type=\"cloud_run_revision\" AND metric.labels.response_code_class=\"5xx\"",
        "aggregations": [
          {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_RATE",
            "crossSeriesReducer": "REDUCE_SUM",
            "groupByFields": ["resource.labels.service_name"]
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.05,
        "duration": "300s",
        "trigger": { "count": 1 }
      }
    }
  ],
  "notificationChannels": [],
  "alertStrategy": { "autoClose": "604800s" }
}
EOF

gcloud monitoring policies create --policy-from-file=/tmp/alert-policy.json

# List alert policies
gcloud monitoring policies list \
  --format="table(displayName, enabled, conditions[0].displayName)"
```

Notification Channels

```sh
# Create an email notification channel
gcloud monitoring channels create \
  --display-name="Ops Team Email" \
  --type=email \
  --channel-labels="email_address=ops@example.com"

# Create a Slack notification channel
gcloud monitoring channels create \
  --display-name="Incidents Slack" \
  --type=slack \
  --channel-labels="channel_name=#incidents,auth_token=xoxb-..."

# List notification channels
gcloud monitoring channels list \
  --format="table(displayName, type, name)"

# Update an alert policy to use a notification channel
CHANNEL_ID=$(gcloud monitoring channels list \
  --filter="displayName='Ops Team Email'" --format="value(name)")

gcloud monitoring policies update POLICY_ID \
  --add-notification-channels=$CHANNEL_ID
```

Alert Policy Best Practices
| Practice | Why | Example |
|---|---|---|
| Alert on symptoms, not causes | Symptoms affect users; causes are for investigation | Alert on error rate, not CPU usage |
| Use multi-condition alerts | Reduce noise from transient spikes | Error rate > 5% AND request count > 100 |
| Set appropriate windows | Too short = noise; too long = late | 5-minute window for critical; 15-minute for warning |
| Include runbook links | Reduce MTTR by guiding responders | Link to troubleshooting playbook in alert description |
| Avoid alert fatigue | Too many alerts = ignored alerts | Only alert on actionable conditions |
Stop and think: You set an alert policy for CPU utilization > 80% with a duration window of 5 minutes. The CPU spikes to 99% for 4 minutes, drops to 30% for 30 seconds, and goes back to 99% for 2 minutes. Does the alert trigger? Why or why not?
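The interaction between a threshold and its duration window can be simulated in a few lines. The sketch below models the documented behavior (the condition must hold continuously for the full window; a healthy sample resets it) with hypothetical data at one sample per minute; it is an illustration, not Cloud Monitoring’s implementation.

```python
# Simulating an alert policy duration window: the threshold condition must
# hold continuously for `duration_s` seconds before the alert fires.
# Illustrative sketch; sample data is hypothetical.

def alert_fires(samples, threshold, duration_s, period_s=60):
    """samples: one metric value per `period_s` seconds."""
    breach_s = 0
    for value in samples:
        if value > threshold:
            breach_s += period_s
            if breach_s >= duration_s:
                return True
        else:
            breach_s = 0  # a single healthy sample resets the window
    return False

# CPU at 99% for 4 min, a one-sample dip to 30%, then 99% for 2 more min:
cpu = [0.99] * 4 + [0.30] + [0.99] * 2
print(alert_fires(cpu, threshold=0.80, duration_s=300))  # False: the dip reset the timer

# CPU above 80% for 5 consecutive minutes:
print(alert_fires([0.85] * 5, threshold=0.80, duration_s=300))  # True
```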
Uptime Checks

Uptime checks monitor the availability of your public endpoints from multiple global locations.

```sh
# Create an HTTP uptime check (the positional argument is the display name)
gcloud monitoring uptime create my-api-uptime \
  --resource-type=uptime-url \
  --monitored-resource="host=my-api-abc123-uc.a.run.app,project_id=my-project" \
  --http-check-path="/health" \
  --http-check-port=443 \
  --period=60 \
  --timeout=10 \
  --checker-type=STATIC_IP_CHECKERS

# List uptime checks
gcloud monitoring uptime list-configs \
  --format="table(displayName, httpCheck.path, period)"

# Create an alert policy for uptime check failure
# (alert if the check fails from 2+ regions)
cat > /tmp/uptime-alert.json << 'EOF'
{
  "displayName": "API Uptime Check Failed",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Uptime check failing",
      "conditionThreshold": {
        "filter": "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\" AND resource.type=\"uptime_url\"",
        "aggregations": [
          {
            "alignmentPeriod": "300s",
            "perSeriesAligner": "ALIGN_NEXT_OLDER",
            "crossSeriesReducer": "REDUCE_COUNT_FALSE",
            "groupByFields": ["resource.labels.host"]
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 2,
        "duration": "0s"
      }
    }
  ]
}
EOF

gcloud monitoring policies create --policy-from-file=/tmp/uptime-alert.json
```

Pause and predict: Why is it considered best practice to configure uptime checks to alert only when the check fails from multiple geographic regions rather than just a single region?
Diagnosing Latency: Cloud Trace and Cloud Profiler
While logs and metrics tell you that a service is slow or experiencing high load, they often do not tell you exactly where the time is being spent inside the code or across a distributed microservice architecture.
Cloud Trace
Cloud Trace is a distributed tracing system that collects latency data from your applications and displays it in the GCP Console. When a request enters your system, Trace assigns it a unique Trace ID. As the request passes through various microservices (e.g., Load Balancer → Cloud Run → Cloud SQL → external API), each service reports a “span” representing the time spent in that component.
- Why use it: To find the exact bottleneck in a chain of microservice calls. If an API request takes 5 seconds, Trace can visually show you that 4.8 seconds were spent waiting on a single slow database query.
- How to use it: In managed environments like Cloud Run or App Engine, basic tracing is often automatic. For granular, code-level spans, you utilize OpenTelemetry libraries to instrument your application.
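As a concrete example of trace propagation: GCP’s HTTP load balancers and Cloud Run attach an X-Cloud-Trace-Context header (format TRACE_ID/SPAN_ID;o=OPTIONS) to incoming requests, and writing the trace name into the `logging.googleapis.com/trace` field of a structured log entry correlates that log with its trace. A small parsing sketch (the header value below is made up):

```python
# Parse the X-Cloud-Trace-Context header attached to incoming requests.
# Header format: TRACE_ID/SPAN_ID;o=OPTIONS (o=1 means the request is sampled).

def parse_trace_context(header, project_id):
    """Return (trace_resource_name, span_id, sampled)."""
    trace_part, _, options = header.partition(";")
    trace_id, _, span_id = trace_part.partition("/")
    sampled = options == "o=1"
    # Writing this value into the `logging.googleapis.com/trace` field of a
    # structured log entry correlates the log line with its trace.
    return f"projects/{project_id}/traces/{trace_id}", span_id, sampled

trace, span, sampled = parse_trace_context(
    "105445aa7843bc8bf206b12000100000/1;o=1", "my-project")
print(trace)    # projects/my-project/traces/105445aa7843bc8bf206b12000100000
print(sampled)  # True
```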
Cloud Profiler
Cloud Profiler provides continuous CPU and heap profiling for applications running on GCP. It statistically gathers performance data from your production applications with minimal overhead (< 1%) and generates flame graphs.
- Why use it: To identify which specific functions or methods in your code are consuming the most CPU cycles or allocating the most memory. It helps you optimize code efficiency and reduce compute costs.
- How to use it: You import the Profiler agent into your application code (available for Go, Java, Node.js, and Python) and initialize it when the application boots.
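A minimal boot-time initialization sketch for the Python agent, assuming the google-cloud-profiler package; the service name and the defensive guard are illustrative choices, not a required pattern:

```python
# Start the Cloud Profiler agent once at application boot (Python).
# The service name/version below are illustrative.

def start_profiler(service="ops-lab-api", version="1.0.0"):
    """Start continuous profiling; return False if the agent is unavailable."""
    try:
        import googlecloudprofiler  # pip install google-cloud-profiler
        googlecloudprofiler.start(service=service, service_version=version)
        return True
    except Exception as exc:
        # Never let profiling failures break application startup
        # (e.g. running locally without GCP credentials).
        print(f"Profiler not started: {exc}")
        return False

started = start_profiler()
```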
Stop and think: If users report that clicking “Checkout” takes 10 seconds, but your system CPU utilization is hovering at a very healthy 20%, which tool should you reach for first to diagnose the issue: Cloud Trace or Cloud Profiler? Why?
Multi-Project Monitoring
In a real-world GCP organization, resources are rarely confined to a single project. You might have separate projects for networking, databases, frontend services, and backend APIs. Monitoring them individually leads to fragmented visibility.
Metrics Scopes
A Metrics Scope allows you to view and manage monitoring data from multiple GCP projects through a single pane of glass. When you create a Metrics Scope, you designate one project as the “scoping project” (often a dedicated monitoring or DevOps project) and attach other “monitored projects” to it.
- Dashboards: A dashboard created in the scoping project can query and display metrics side-by-side from all attached projects.
- Alerting: You can create a single alert policy (e.g., “Alert if any Cloud SQL instance CPU > 80%”) in the scoping project that applies universally to all databases across all attached projects.
- Access Control: You can grant your SRE or Ops team access to the scoping project, giving them full observability across the organization without needing to provision IAM roles in every individual application project.
Pause and predict: If you have 10 separate production microservice projects, should you manage alert policies in each project separately, or centrally within a single metrics scoping project?
Did You Know?

- Cloud Logging ingests over 150 petabytes of log data per month across all GCP customers. The log router processes over 50 billion log entries per day. Despite this scale, the median query response time in the Log Explorer is under 3 seconds for queries spanning a 1-hour time window.
- Log-based metrics are evaluated in real time as logs flow through the log router, not after they are stored. This means you can create an alert based on a log-based metric and receive a notification within 60-90 seconds of the triggering log entry being written, even before you could find it manually in the Log Explorer.
- Cloud Monitoring’s uptime checks run from 6 global regions simultaneously (USA-Oregon, USA-Virginia, South America, Europe, Asia Pacific-1, Asia Pacific-2). A check is considered “failed” only when it fails from multiple regions, reducing false positives from network partitions. You can see the per-region results in the uptime check dashboard.
- The Ops Agent (successor to the legacy Monitoring and Logging agents) supports both Prometheus metric scraping and fluent-bit log collection in a single agent. If you are running custom metrics in Prometheus format on your VMs, the Ops Agent can scrape them and send them to Cloud Monitoring without running a separate Prometheus server.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Not creating log sinks for long-term retention | Default 30-day retention seems enough | Create sinks to Cloud Storage for compliance; 30 days passes quickly during incident investigation |
| Logging too much at DEBUG level | Verbose logging during development | Use INFO as default; enable DEBUG only in non-production; use exclusion filters |
| Not creating log-based metrics | Relying on manual log searching | Create metrics for key patterns (errors, auth failures, latency thresholds) |
| Setting alert thresholds too sensitive | Wanting to catch every issue | Use multi-condition alerts and appropriate duration windows (5-15 minutes) |
| Not using structured logging | Plain text seems simpler | JSON logs enable powerful filtering in Log Explorer; use structured logging from day one |
| Ignoring uptime checks | Internal monitoring seems sufficient | Uptime checks verify from external perspective; catches DNS, certificate, and network issues |
| Alert fatigue from too many alerts | Adding alerts without reviewing existing ones | Quarterly alert hygiene review; delete alerts that are never actionable |
| Not routing audit logs to BigQuery | Do not know about log sinks | Create a sink for audit logs to BigQuery for security analytics and compliance |
1. Your company wants to retain all access logs for 5 years for compliance, but the security team is complaining that their default Cloud Logging bill is astronomically high due to debug logs from the staging environment. How do you configure the Log Router to satisfy both requirements?
You should create a log sink that routes all access logs to a Cloud Storage bucket configured with a 5-year retention policy and a lifecycle rule for cost optimization. Simultaneously, you must create an exclusion filter in the Log Router for the staging environment’s debug logs. The exclusion filter prevents the noisy debug logs from being ingested into the expensive default Cloud Logging storage, saving money. Because sinks and exclusions operate independently, the compliance sink will still capture the required access logs before any exclusions affect the default bucket.
2. An on-call engineer notices that the "Request Latency" log-based counter metric is firing alerts, but they cannot determine if the slow requests are taking 1 second or 30 seconds. What design flaw exists in their log-based metric, and how should it be redesigned?
The engineer created a log-based counter metric, which simply counts the number of log entries matching a filter (e.g., latency > 500ms) without capturing the actual latency value itself. To fix this, they need to recreate it as a log-based distribution metric. A distribution metric uses a value extractor to pull the specific numeric latency value from each structured JSON log entry. This allows Cloud Monitoring to calculate percentiles like P95 and P99, giving the engineer precise visibility into exactly how slow the requests actually are during an incident.
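The percentile math behind this answer works like PromQL’s histogram_quantile over cumulative bucket counts; a minimal illustrative implementation (not Cloud Monitoring’s or Prometheus’s code, and the bucket data is made up):

```python
# Sketch of percentile estimation from cumulative histogram buckets, as in
# PromQL's histogram_quantile(): linearly interpolate within the first
# bucket whose cumulative count reaches the target rank.

def histogram_quantile(q, buckets):
    """buckets: sorted (upper_bound, cumulative_count) pairs, with the last
    bound being float('inf'), as in Prometheus '_bucket' series."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 100 ms, 40 more under 200 ms, 10 slower than 200 ms.
buckets = [(100.0, 50), (200.0, 90), (float("inf"), 100)]
print(histogram_quantile(0.5, buckets))   # 100.0
print(histogram_quantile(0.95, buckets))  # 200.0 (falls in the +Inf bucket)
```

This is also why a counter metric cannot answer “how slow?”: without per-bucket counts of the actual values, there is nothing to interpolate.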
3. You created an exclusion filter to drop noisy HTTP 200 health check logs to save money on Cloud Logging ingestion. However, the security team complains that these logs are now missing from their custom BigQuery sink, which they use for historical audits. How does the Log Router handle this, and what went wrong?
In GCP, the Log Router processes log exclusions and log sinks completely independently of one another. Creating an exclusion filter prevents the logs from being ingested into the _Default Cloud Logging storage bucket, saving ingestion costs. It does not, however, prevent those same logs from being routed to a custom sink, such as BigQuery or Cloud Storage. If the security team is missing logs in their custom sink, the issue is with that specific sink’s inclusion/exclusion filter, not the general exclusion filter you created for the default bucket.
4. Your team receives pager alerts every night at 3 AM because a database VM's CPU hits 95%. However, this coincides with a scheduled nightly backup, and customer-facing API latency remains completely normal during this time. How should you restructure this alerting strategy to prevent alert fatigue?
This alert is currently firing on a “cause” (high CPU) rather than a “symptom” (user impact). Because the high CPU does not degrade the customer experience during the backup, this alert is unactionable and causes severe alert fatigue. You should restructure the alerting strategy to trigger on symptoms, such as the API’s P99 latency exceeding a certain threshold or the HTTP 5xx error rate spiking. If you still want to monitor the CPU for capacity planning, you should change the notification channel from a paging system to a low-priority email or Slack message that can be reviewed asynchronously during business hours.
5. A developer complains that their microservice is intermittently taking 4 seconds to respond instead of the usual 50ms. The service calls three other downstream GCP services and a Cloud SQL database. They ask you to check the CPU metrics on the Cloud Run instances to find the problem. Which GCP observability tool should you recommend they use instead, and why?
You should recommend using Cloud Trace rather than simply looking at CPU metrics. High latency in a distributed system is often caused by network waits, database locks, or slow downstream API calls, none of which will show up as high CPU utilization. Cloud Trace tracks a single request as it propagates through all the microservices and the database, creating a visual waterfall diagram of spans. This will immediately show exactly which downstream service or specific database query is responsible for the 4-second delay, drastically reducing the time to resolution.
6. You are tasked with centralizing monitoring for 15 different GCP projects belonging to 3 different product teams. Currently, engineers have to switch between projects in the GCP console to view dashboards and alerts, leading to fragmented observability. How do you architect a solution in Cloud Monitoring to provide a "single pane of glass"?
You should implement a Metrics Scope hosted in a dedicated, centralized monitoring project. By attaching the 15 individual product projects to this single scoping project, Cloud Monitoring will aggregate all their metrics into one unified view. This allows you to build centralized dashboards and configure global alerting policies that evaluate resources across all 15 projects simultaneously. Furthermore, you can grant the engineering teams IAM access to the scoping project, giving them full visibility into the organization’s health without needing to grant them permissions in every individual production project.
Hands-On Exercise: Monitoring and Alerting for Cloud Run
Objective
Deploy a Cloud Run service with structured logging, create log-based metrics, set up a monitoring dashboard, and configure alerting.
Prerequisites
- gcloud CLI installed and authenticated
- A GCP project with billing enabled
Task 1: Deploy a Cloud Run Service with Structured Logging
Solution
export PROJECT_ID=$(gcloud config get-value project)export REGION=us-central1
# Enable APIsgcloud services enable \ run.googleapis.com \ monitoring.googleapis.com \ logging.googleapis.com
mkdir -p /tmp/ops-lab && cd /tmp/ops-lab
```bash
cat > main.py << 'PYEOF'
import json
import logging
import os
import random
import sys
import time
from flask import Flask, request, jsonify

app = Flask(__name__)

class JSONFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        for key in ["latency_ms", "status_code", "path", "error_type"]:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("ops-lab")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

@app.route("/")
def home():
    start = time.time()
    latency_ms = int((time.time() - start) * 1000) + random.randint(5, 50)
    logger.info("Request processed", extra={"latency_ms": latency_ms, "status_code": 200, "path": "/"})
    return jsonify({"status": "ok", "latency_ms": latency_ms})

@app.route("/slow")
def slow():
    delay = random.uniform(0.5, 2.0)
    time.sleep(delay)
    latency_ms = int(delay * 1000)
    logger.warning("Slow request detected", extra={"latency_ms": latency_ms, "status_code": 200, "path": "/slow"})
    return jsonify({"status": "ok", "latency_ms": latency_ms})

@app.route("/error")
def error():
    error_types = ["DatabaseTimeout", "AuthenticationFailed", "RateLimitExceeded"]
    error_type = random.choice(error_types)
    logger.error("Request failed", extra={"latency_ms": 0, "status_code": 500, "path": "/error", "error_type": error_type})
    return jsonify({"status": "error", "error": error_type}), 500

@app.route("/health")
def health():
    return jsonify({"status": "healthy"})

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 8080))
    app.run(host="0.0.0.0", port=port)
PYEOF
```
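Before deploying, you can sanity-check the formatter locally with nothing but the standard library. The following sketch reproduces the `JSONFormatter` above and captures a single log line into a string buffer (the logger name `ops-lab-demo` is arbitrary):

```python
# Minimal local reproduction of JSONFormatter from main.py: capture one
# structured log line and confirm it is valid JSON with the expected fields.
import io
import json
import logging

class JSONFormatter(logging.Formatter):
    def format(self, record):
        entry = {"severity": record.levelname, "message": record.getMessage()}
        for key in ["latency_ms", "status_code", "path", "error_type"]:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

stream = io.StringIO()                      # stand-in for stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("ops-lab-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request processed", extra={"latency_ms": 42, "status_code": 200, "path": "/"})

line = stream.getvalue().strip()
print(line)
# → {"severity": "INFO", "message": "Request processed", "latency_ms": 42, "status_code": 200, "path": "/"}
```

Because each line is a single JSON object on stdout, Cloud Run's log agent parses it into `jsonPayload`, which is what makes the fields queryable in later tasks.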
```bash
cat > requirements.txt << 'EOF'
flask>=3.0.0
gunicorn>=21.2.0
EOF
```
```bash
cat > Dockerfile << 'DEOF'
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "2", "main:app"]
DEOF
```
```bash
gcloud run deploy ops-lab-api \
  --source=. \
  --region=$REGION \
  --allow-unauthenticated \
  --memory=256Mi
```
```bash
SERVICE_URL=$(gcloud run services describe ops-lab-api \
  --region=$REGION --format="value(status.url)")
echo "Service URL: $SERVICE_URL"
```

Task 2: Generate Traffic and Logs
Solution
```bash
SERVICE_URL=$(gcloud run services describe ops-lab-api \
  --region=$REGION --format="value(status.url)")

# Generate normal traffic
for i in $(seq 1 15); do
  curl -s "$SERVICE_URL/" > /dev/null
done

# Generate slow requests
for i in $(seq 1 5); do
  curl -s "$SERVICE_URL/slow" > /dev/null
done

# Generate errors
for i in $(seq 1 8); do
  curl -s "$SERVICE_URL/error" > /dev/null
done

echo "Traffic generated. Waiting for logs to appear..."
sleep 15
```
```bash
# View logs
# Note: severity is promoted out of jsonPayload into the LogEntry itself,
# so it is queried as "severity", not "jsonPayload.severity"
gcloud logging read 'resource.type="cloud_run_revision" AND resource.labels.service_name="ops-lab-api" AND jsonPayload.message!=""' \
  --limit=15 \
  --format="table(timestamp, severity, jsonPayload.message, jsonPayload.status_code, jsonPayload.latency_ms)"
```

Task 3: Create Log-Based Metrics
Solution
```bash
# Metric: Count of 500 errors
gcloud logging metrics create ops_lab_errors \
  --description="Count of 500 errors in ops-lab-api" \
  --log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="ops-lab-api" AND jsonPayload.status_code=500'

# Metric: Count of slow requests (latency > 500ms)
gcloud logging metrics create ops_lab_slow_requests \
  --description="Count of slow requests (>500ms) in ops-lab-api" \
  --log-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="ops-lab-api" AND jsonPayload.latency_ms>500'
```
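A counter-type log-based metric does nothing more at ingest than count every log entry that matches its filter. The behavior of the two filters above can be modeled in a few lines of Python (an illustration of the counting semantics, not the actual Cloud Logging filter engine):

```python
# Simplified model of the two log-based metrics: each counts structured
# log entries whose jsonPayload fields match the filter predicate.
def matches_error_filter(entry: dict) -> bool:
    # jsonPayload.status_code=500
    return entry.get("status_code") == 500

def matches_slow_filter(entry: dict) -> bool:
    # jsonPayload.latency_ms>500
    return entry.get("latency_ms", 0) > 500

# Sample entries shaped like the app's structured logs
entries = [
    {"severity": "INFO", "status_code": 200, "latency_ms": 23, "path": "/"},
    {"severity": "WARNING", "status_code": 200, "latency_ms": 1420, "path": "/slow"},
    {"severity": "ERROR", "status_code": 500, "latency_ms": 0, "path": "/error"},
]

errors = sum(matches_error_filter(e) for e in entries)
slow = sum(matches_slow_filter(e) for e in entries)
print(errors, slow)  # → 1 1
```

Each matching entry increments the metric by one in its time bucket, which is why a clean, consistently named set of `jsonPayload` fields is a prerequisite for useful metrics.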
```bash
# List metrics
gcloud logging metrics list \
  --format="table(name, description, filter)"
```

Task 4: Create an Uptime Check
Solution
```bash
# Get the Cloud Run hostname
SERVICE_HOST=$(echo $SERVICE_URL | sed 's|https://||')

# Create an uptime check
# Note: uptime checks via gcloud have limited support;
# using the REST API is more reliable for complex configs.
# The display name is the positional argument, and --period is in minutes.
gcloud monitoring uptime create "Ops Lab API Health" \
  --resource-type=uptime-url \
  --resource-labels="host=$SERVICE_HOST,project_id=$PROJECT_ID" \
  --protocol=https \
  --path=/health \
  --port=443 \
  --request-method=get \
  --period=1 \
  --timeout=10
```
```bash
# List uptime checks
gcloud monitoring uptime list-configs \
  --format="table(displayName, httpCheck.path, period)"

echo "Uptime check created. Results will appear in ~2 minutes."
```

Task 5: Query Monitoring Metrics
Solution
```bash
# Generate more traffic for metrics to populate
for i in $(seq 1 20); do
  curl -s "$SERVICE_URL/" > /dev/null
  curl -s "$SERVICE_URL/error" > /dev/null 2>&1
done

sleep 30
```
```bash
# Query Cloud Run request count
gcloud monitoring time-series list \
  --filter='metric.type="run.googleapis.com/request_count" AND resource.labels.service_name="ops-lab-api"' \
  --interval-start-time=$(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "15 minutes ago" +%Y-%m-%dT%H:%M:%SZ) \
  --format="table(metric.labels.response_code, points[0].value.int64Value)" \
  --limit=10
```
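The `date -u -v-15M … || date -u -d "15 minutes ago" …` fallback papers over the flag differences between BSD/macOS `date` and GNU `date`. If you prefer to avoid shell portability tricks, the same RFC 3339 start timestamp can be computed with Python (a sketch; it assumes `python3` is on your PATH):

```python
# Compute an RFC 3339 UTC timestamp 15 minutes in the past, equivalent to
# the BSD `date -u -v-15M` / GNU `date -u -d "15 minutes ago"` fallback.
from datetime import datetime, timedelta, timezone

start = datetime.now(timezone.utc) - timedelta(minutes=15)
stamp = start.strftime("%Y-%m-%dT%H:%M:%SZ")
print(stamp)  # e.g. 2024-05-01T12:00:00Z
```

Condensed to a one-liner with `python3 -c`, this can be substituted directly into the `--interval-start-time=$(...)` argument on either platform.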
```bash
# Query the log-based error metric
gcloud monitoring time-series list \
  --filter='metric.type="logging.googleapis.com/user/ops_lab_errors"' \
  --interval-start-time=$(date -u -v-15M +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d "15 minutes ago" +%Y-%m-%dT%H:%M:%SZ) \
  --format=json \
  --limit=5
```

Task 6: Clean Up
Solution
```bash
# Delete Cloud Run service
gcloud run services delete ops-lab-api --region=$REGION --quiet

# Delete log-based metrics
gcloud logging metrics delete ops_lab_errors --quiet
gcloud logging metrics delete ops_lab_slow_requests --quiet

# Delete uptime check
UPTIME_ID=$(gcloud monitoring uptime list-configs \
  --filter="displayName='Ops Lab API Health'" --format="value(name)" | head -1)
gcloud monitoring uptime delete $UPTIME_ID --quiet 2>/dev/null || true

# Clean up local files
rm -rf /tmp/ops-lab

echo "Cleanup complete."
```

Success Criteria
- Cloud Run service deployed with structured JSON logging
- Traffic generated (normal, slow, and error requests)
- Structured logs visible in Cloud Logging with queryable fields
- Log-based metrics created for errors and slow requests
- Uptime check configured and running
- All resources cleaned up
Next Module
Next up: Module 2.11: Cloud Build & CI/CD --- Learn how to define build pipelines with cloudbuild.yaml, use built-in and custom builders, set up triggers from GitHub and GitLab, and orchestrate deployments with Cloud Deploy.