Cloud AI Services


AI/ML Engineering Track | Complexity: [MEDIUM] | Time: 5-6 hours | Prerequisites: Phase 5 complete (MLOps & LLMOps)

Learning Outcomes

By the end of this module, you will be able to:

Explain what managed Cloud AI services abstract away and which capacity dimensions remain your responsibility as a consumer.
Compare regional, geographic cross-region, and global routing models across major cloud AI platforms using a vendor-agnostic taxonomy.
Design quota-aware client architectures that prevent 429 retry cascades through backpressure, circuit breakers, and asymmetric scaling.
Analyze token-metered cost structures—including input versus output tokens, caching, and batch versus real-time pricing—to balance over- and under-provisioning.
Evaluate any managed AI offering against a structured requirements framework using the Cloud AI Services Rosetta Stone as a cross-vendor reference.

Why This Module Matters

Hypothetical scenario: Your team launches a customer-facing chat assistant backed entirely by a managed foundation-model API. Kubernetes clusters are healthy—CPU and memory look fine—but user-facing latency climbs from under a second to ten seconds or more. Error rates spike. Support tickets flood in. The root cause is not a pod crash or a misconfigured Deployment; it is exhaustion of on-demand token quotas on a regional endpoint, amplified by aggressive client retries that trigger cascading HTTP 429 responses. The fix involves purchasing provisioned throughput, routing through a cross-region inference profile, and redesigning retry policy—but only after hours of confusion because traditional infrastructure dashboards showed green.

This pattern is increasingly common because managed Cloud AI services shift the hard problem. You no longer provision GPU nodes, patch CUDA drivers, or schedule model weights across a cluster. Instead, you provision capacity contracts—quotas, throughput commitments, endpoint geography, and routing profiles—that behave like invisible infrastructure. Teams that treat a managed foundation-model endpoint as “just another REST call” discover that abstraction has boundaries. The GPUs are gone from your runbook, but capacity engineering remains.

This module teaches the durable spine of consuming hosted foundation-model services: deployment models, quota mechanics, token economics, and a cross-vendor mental model you can apply to any new entrant. For the broader discipline of AI-powered operations—anomaly detection algorithms, causal RCA graphs, automated runbooks—see AIOps, which owns that operational depth. Here we focus on what you must understand before you can operate managed AI APIs reliably at production scale.

Why “Serverless AI” Still Needs Capacity Engineering

The phrase “serverless AI” captures a genuine shift in operational burden. When you invoke a foundation model through a managed API, the provider owns model loading, GPU scheduling, weight updates, and hardware failure recovery. Your application sends a request; the service returns tokens. From a developer’s perspective, this feels like infinite scale—until it does not. Every managed AI platform enforces limits: requests per minute, tokens per minute, concurrent connections, and sometimes separate burst versus sustained quotas. These limits exist because inference is compute-intensive and providers must protect multi-tenant fairness.

Understanding the abstraction boundary is the first engineering skill this module builds. Above the boundary, you manage application logic, prompt design, and user experience. Below the boundary—inside the provider’s data center—the service manages hardware. At the boundary, you negotiate capacity through configuration choices that are easy to overlook during a prototype phase but become critical at launch. Choosing on-demand versus provisioned throughput, picking a regional endpoint versus a cross-region inference profile, and deciding whether data may leave a geographic boundary are all capacity decisions even though none of them involve SSH access to a GPU node.

The mental model that helps most teams is to treat a managed AI endpoint like a load-balanced microservice with opaque autoscaling rules. You cannot see queue depth directly, but you observe it through latency percentiles, time-to-first-token, and throttle response codes. You cannot scale GPU count yourself, but you can purchase dedicated throughput, switch routing profiles, or shard traffic across multiple endpoints. Capacity engineering for Cloud AI services is therefore a discipline of observing proxy metrics and adjusting contractual knobs rather than resizing instance types.

Self-hosted inference—covered in later modules in this track—reintroduces GPU visibility but trades away managed elasticity. Many production architectures hybridize: managed APIs for fast-moving foundation models and self-hosted endpoints for stable, high-volume, or privacy-sensitive workloads. The Rosetta Stone’s self-hosted column exists to remind you that “Cloud AI Services” is a consumption choice, not a permanent architectural destiny. Teams often start managed, measure unit economics at scale, and selectively repatriate workloads when the math and operational maturity justify running their own inference stack.

Multi-tenant capacity adds another layer. On-demand endpoints share provider infrastructure with other customers. During provider-wide demand spikes—or during your own traffic surges—you may encounter noisy-neighbor throttling even when your application’s Kubernetes layer looks idle. Dedicated or provisioned capacity isolates you from some of this contention at the cost of committed spend. The tradeoff between flexibility and guaranteed headroom is central to every production architecture decision in this space.

Finally, capacity engineering extends to the clients that call these APIs. An inference gateway, a batch enrichment pipeline, and an interactive chat UI have different latency and burst profiles. If all three share one endpoint with one quota, a batch job can starve interactive users without any single component appearing “broken” in isolation. Partitioning endpoints, quotas, and retry budgets by workload class is as important as partitioning Kubernetes namespaces—perhaps more so, because the throttle happens outside your cluster.

Platform engineers should document the abstraction boundary explicitly in architecture diagrams. Draw a line between “our cluster” and “provider inference plane,” listing every knob your team controls: endpoint ID, routing profile, API key scope, retry policy, max output tokens, and provisioned commitment IDs. Anything not on your side of the line is a vendor ticket, a purchase order, or a configuration change in the cloud console—not a kubectl scale command. That single diagram prevents more launch-day incidents than any amount of autoscaling tuning.

Landscape snapshot — as of 2026-06. This changes fast; verify against vendor docs before relying on specifics.

Capability	Amazon Bedrock	Google Vertex AI	Azure AI Foundry	OCI Generative AI
On-demand inference	Per-model regional quotas (RPM/TPM)	Per-model regional quotas	Standard deployment type	On-demand hosted models
Provisioned / dedicated throughput	Provisioned Throughput (Model Units)	Provisioned Throughput (where supported)	Provisioned Throughput units	Dedicated AI clusters
Cross-region routing	Geographic and global inference profiles	Regional endpoints; limited global endpoint	Global, data-zone, and regional deployments	Region-specific model availability
OpenAI-compatible API	Converse API (native); some third-party compat layers	OpenAI-compatible endpoints (select models)	Azure OpenAI-compatible endpoints	Documented OpenAI-compatible API
Fine-tuning / customization	Custom models require Provisioned Throughput	Regional endpoints required for tuning	Model deployment per project	Custom model import and fine-tuning
Agent / tool protocols	Agents, knowledge bases, tool use	Vertex AI Agent Builder ecosystem	Agent frameworks via Foundry	Responses API, tool hooks, MCP support documented

This table is illustrative, not a leaderboard or endorsement. Capabilities vary by model and region; confirm against current vendor documentation before designing production architecture.

Deployment and Routing Models

Managed AI platforms expose several durable routing patterns. Learning the taxonomy—not memorizing product marketing names—lets you map any new vendor announcement into a familiar slot within minutes.

Single-region endpoints bind inference to one cloud region. Requests stay local, latency is predictable, and data residency is straightforward to reason about because both storage and compute remain in the chosen region. The tradeoff is a hard ceiling: when that region’s quota is exhausted, every client receives throttling responses until demand subsides or you reconfigure routing. Single-region endpoints suit workloads with strict compliance boundaries, predictable traffic, and tolerance for manual failover during regional incidents.

Geographic cross-region inference profiles route requests across multiple regions within a geography (such as US, EU, or APAC) while keeping data processing inside that geography’s boundary. Providers typically expose these as unified model identifiers that abstract away the underlying per-region resource identifiers (ARNs, in AWS terms). Throughput can exceed single-region quotas because traffic spreads across the profile’s member regions. Latency becomes less predictable because a request may land in any member region, and some capabilities (notably provisioned throughput on some platforms) may not attach to inference profiles at all. Geographic profiles suit production workloads that need higher burst capacity without abandoning data-residency requirements.

Global routing endpoints extend cross-region logic across commercial regions worldwide, prioritizing available capacity and sometimes offering cost advantages over strictly geographic routing. The tradeoff is explicit: global endpoints may process data outside your preferred jurisdiction and often omit features tied to regional data stores—custom tuning, certain batch pipelines, or residency-sensitive logging. Global routing belongs in workloads where maximum throughput and cost efficiency outweigh strict residency, not in regulated environments where audit trails must prove geographic containment.

Dedicated versus multi-tenant capacity forms an orthogonal axis. Multi-tenant on-demand endpoints charge per token with no upfront commitment; you share capacity with other customers and accept throttle risk during peaks. Dedicated or provisioned throughput reserves model capacity measured in provider-specific units—Model Units, provisioned tokens per minute, or isolated AI clusters—delivering predictable performance at committed cost. Custom fine-tuned models on several platforms require provisioned capacity because the provider must host your weights on reserved hardware.

Understanding when to move from on-demand to provisioned capacity is one of the highest-value decisions in this module. Stay with on-demand capacity while traffic is unpredictable, quotas are far from exhaustion, and the cost of occasional throttling is acceptable, which is typical for internal tools and early prototypes. Move to provisioned throughput when you have measured sustained utilization above roughly seventy percent of on-demand limits for multiple weeks, when marketing or compliance events create known spikes, or when SLA commitments to customers make any 429 unacceptable. The transition is usually a configuration and billing change, not an application rewrite, provided you already route through a gateway that can swap model identifiers.
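The decision rule above can be sketched as a small check over weekly utilization records. This is a minimal illustration, not a vendor feature: the `WeeklyUtilization` record, the field names, and the two-week default for "multiple weeks" are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class WeeklyUtilization:
    """Hypothetical record: one week of sustained token usage against the quota."""
    week: str
    sustained_tpm: float  # e.g. a high percentile of tokens per minute for the week
    tpm_limit: float      # on-demand tokens-per-minute limit in force that week


def should_consider_provisioned(weeks, threshold=0.70, min_weeks=2,
                                known_spike=False, strict_sla=False):
    """Return True when any trigger from the text applies.

    Triggers: a known upcoming spike, an SLA that tolerates no 429s, or
    utilization above `threshold` for `min_weeks` consecutive weeks.
    """
    if known_spike or strict_sla:
        return True
    consecutive = 0
    for w in weeks:
        if w.sustained_tpm / w.tpm_limit > threshold:
            consecutive += 1
            if consecutive >= min_weeks:
                return True
        else:
            consecutive = 0  # the run of hot weeks was interrupted
    return False


history = [
    WeeklyUtilization("w1", 60_000, 100_000),
    WeeklyUtilization("w2", 75_000, 100_000),
    WeeklyUtilization("w3", 82_000, 100_000),
]
print(should_consider_provisioned(history))
```

Requiring consecutive weeks keeps a single unusual week from triggering a commitment; tune the window to your own billing terms.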

Data residency deserves explicit architectural treatment because it intersects with every routing choice. A compliance policy that mandates EU-only processing rules out global endpoints entirely and may rule out geographic profiles whose member regions extend beyond the EU. Conversely, a latency-sensitive US consumer application might accept global routing to survive Black-Friday-scale bursts. Document the residency decision alongside the routing decision in your architecture records; auditors ask about data location, not about which model identifier string you passed to the SDK.

Cross-region failover at the application layer remains your responsibility even when the provider offers inference profiles. Profiles handle within-vendor routing; they do not replace a multi-vendor fallback strategy when an entire cloud has an outage. Mature architectures maintain secondary endpoints—often in a different vendor or a self-hosted stack covered in later modules—and switch through gateway-level circuit breakers when primary endpoints degrade.

Latency SLOs should be defined differently for managed AI than for traditional microservices. Time-to-first-token and tokens-per-second during generation often dominate user-perceived latency more than network round-trip to the gateway. Capacity planning therefore tracks not only RPM and TPM but also output length distributions—a shift toward longer answers increases TPM consumption even when request rate stays flat. Instrument both request rate and average output tokens per request as first-class metrics in your gateway.
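A short arithmetic sketch shows why output length belongs next to request rate on the dashboard. It assumes, for illustration only, that the provider counts input and output tokens against one TPM bucket; the traffic numbers are made up.

```python
def tokens_per_minute(rpm, avg_input_tokens, avg_output_tokens):
    """Approximate TPM demand, assuming input and output share one quota bucket."""
    return rpm * (avg_input_tokens + avg_output_tokens)


# Same request rate, but average answers grow from 300 to 600 output tokens.
before = tokens_per_minute(rpm=600, avg_input_tokens=1500, avg_output_tokens=300)
after = tokens_per_minute(rpm=600, avg_input_tokens=1500, avg_output_tokens=600)

growth = after / before - 1
print(before, after, f"{growth:.1%}")
```

Request rate is flat in both cases, yet token demand rises by about a sixth, which is why the gateway should export average output tokens per request as its own metric.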

ROUTING DECISION TREE (DURABLE TAXONOMY)
=========================================

Start: Do compliance rules require data in a specific geography?
  |
  +-- YES --> Use regional or geographic-cross-region endpoints ONLY
  |           (verify profile member regions match policy)
  |
  +-- NO  --> Is predictable latency more important than burst headroom?
              |
              +-- YES --> Single-region + provisioned throughput
              |
              +-- NO  --> Geographic or global inference profile
                          + monitor p99 latency variance
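
The tree above can be encoded directly as a function, which is handy for unit-testing gateway policy. The function name and the returned strings are illustrative choices, not vendor terminology.

```python
def choose_routing(residency_required: bool, latency_over_burst: bool) -> str:
    """Encode the routing decision tree as a pure function."""
    if residency_required:
        # Compliance branch: stay regional or inside a geographic profile,
        # and verify that the profile's member regions match policy.
        return "regional or geographic cross-region only"
    if latency_over_burst:
        return "single-region + provisioned throughput"
    # Burst headroom wins: accept variable latency and monitor p99 variance.
    return "geographic or global inference profile"


for residency, latency in [(True, False), (False, True), (False, False)]:
    print(residency, latency, "->", choose_routing(residency, latency))
```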

When you deploy an inference gateway in Kubernetes—as many teams do to centralize authentication, logging, and routing—the gateway becomes the enforcement point for these decisions. It selects endpoint identifiers, attaches quota-aware retry policies, and exposes metrics your autoscaler consumes. The gateway does not eliminate capacity planning; it concentrates it in one place where platform engineers can reason about traffic holistically.

Consider how three common workload classes map to routing choices. A low-latency customer chatbot benefits from geographic cross-region profiles within the compliance boundary, strict output token caps, and provisioned headroom during known peak hours. An internal document summarization batch job tolerates minutes of latency and should use batch APIs where available, with scheduling that pauses when interactive utilization crosses warning thresholds. A developer sandbox can remain on cheap on-demand endpoints with hard daily spend caps, isolated from production quota pools so experiments never starve paying users. These are not product features—they are architectural patterns you implement through endpoint partitioning and gateway policy.

Failover testing belongs in the same category as regional DR drills. Quarterly, deliberately misconfigure a staging endpoint to return sustained 429 responses and verify that your gateway circuit breaker routes to the secondary profile within the expected time bound. Measure whether fallback endpoints preserve data residency and whether degraded responses meet product requirements when the primary model tier is unavailable.

Quotas, Rate Limits, and the 429 Cascade

HTTP 429 Too Many Requests is the defining failure mode of managed AI consumption. Unlike a 500-series error that suggests provider instability, a 429 often means your architecture is working correctly from the provider’s perspective—you simply exceeded a contracted or default limit. Treating 429 as a transient glitch and retrying immediately is one of the most expensive mistakes teams make.

Quotas on managed AI platforms typically decompose into requests per minute (RPM) and tokens per minute (TPM), sometimes with separate input and output token buckets. Burst allowances may permit short spikes above sustained limits, but sustained traffic above the sustained threshold triggers throttling even when burst capacity remains. Understanding which bucket your workload exhausts first matters for remediation: a chat application with long outputs may hit output TPM while RPM looks healthy, whereas a high-frequency classification service may exhaust RPM with modest token counts.
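To see which bucket a workload exhausts first, compare the request rate that the TPM limit can sustain with the RPM limit itself. The sketch below uses invented limits and a single combined token bucket as simplifying assumptions.

```python
def binding_limit(rpm_limit, tpm_limit, avg_tokens_per_request):
    """Return which quota binds first and the request rate it allows."""
    rpm_allowed_by_tpm = tpm_limit / avg_tokens_per_request
    if rpm_allowed_by_tpm < rpm_limit:
        return "TPM", rpm_allowed_by_tpm
    return "RPM", float(rpm_limit)


# Chat with long prompts and answers: tokens run out long before requests do.
chat = binding_limit(rpm_limit=1000, tpm_limit=400_000, avg_tokens_per_request=2500)

# High-frequency classifier with tiny payloads: the request count binds first.
classifier = binding_limit(rpm_limit=1000, tpm_limit=400_000, avg_tokens_per_request=150)

print(chat, classifier)
```

With these numbers the chat workload is capped at 160 requests per minute by tokens, while the classifier reaches the 1,000 RPM ceiling with token headroom to spare.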

The 429 cascade follows a predictable mechanical sequence. An initial traffic spike, or a sudden increase in output length because users ask harder questions, consumes quota headroom. Some requests receive 429 responses. Well-behaved client libraries retry with exponential backoff, but many application frameworks retry aggressively by default, multiplying the effective request rate. Each retry consumes additional quota (or queue slots) without delivering user value. Latency rises because requests wait in client-side queues. Upstream services time out waiting for LLM responses, triggering their own retries. Within minutes, a modest quota overrun becomes a full service degradation that no amount of Kubernetes pod scaling can fix, because the bottleneck lives outside the cluster.

Preventing cascades requires defense in depth at the client and gateway layers. Backpressure means slowing or rejecting new work when quota utilization crosses a threshold—typically 70–80% of sustained limits—rather than waiting for hard 429s. Retry budgets cap the total retry attempts per time window across all clients sharing an endpoint. Jittered exponential backoff spreads retry timing so thundering herds do not synchronize. Circuit breakers stop forwarding traffic to an endpoint that returns sustained 429s, failing fast to a secondary route or a degraded response mode rather than amplifying load.
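Two of these defenses, a shared retry budget and jittered exponential backoff, fit in a few lines. The sketch assumes a hypothetical `send` callable that returns `(status_code, retry_after_seconds_or_None)`; it is not any provider SDK's interface, and the default limits are placeholders.

```python
import random
import time


class RetryBudget:
    """Caps the number of retries within a sliding time window."""

    def __init__(self, max_retries, window_seconds, clock=time.monotonic):
        self.max_retries = max_retries
        self.window = window_seconds
        self.clock = clock
        self._stamps = []

    def try_acquire(self):
        now = self.clock()
        # Forget retries that have aged out of the window.
        self._stamps = [t for t in self._stamps if now - t < self.window]
        if len(self._stamps) >= self.max_retries:
            return False
        self._stamps.append(now)
        return True


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(send, budget, max_attempts=4, sleep=time.sleep):
    """Retry only on 429, only while the shared budget allows it."""
    for attempt in range(max_attempts):
        status, retry_after = send()
        if status != 429:
            return status
        if attempt == max_attempts - 1 or not budget.try_acquire():
            return 429  # give up instead of amplifying load
        delay = backoff_delay(attempt)
        if retry_after is not None:
            delay = max(delay, retry_after)  # never retry earlier than asked
        sleep(delay)
    return 429
```

Sharing one `RetryBudget` instance across all callers of an endpoint is what keeps the aggregate retry rate bounded; a per-request attempt limit alone does not.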

Design backpressure with explicit user experience tradeoffs. A chat application might show “high demand, please wait” with estimated queue position when utilization crosses the warning threshold, rather than accepting new sessions that will time out sixty seconds later. An API might return HTTP 503 with a Retry-After header sourced from your quota monitor rather than forwarding to the provider and burning remaining headroom. These product decisions require collaboration between platform and application teams; the gateway implements the policy, but product defines acceptable degradation modes.
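A gateway admission check along these lines might look like the following sketch. The `Decision` type, the 75% warning ratio, and the idea that a quota monitor supplies `seconds_until_reset` are assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Decision:
    forward: bool                 # True: send the request on to the provider
    status: int = 200             # status returned to the caller when not forwarding
    headers: dict = field(default_factory=dict)


def admit(tokens_used_this_window, tpm_limit, seconds_until_reset, warn_ratio=0.75):
    """Shed load before the provider starts returning hard 429s."""
    utilization = tokens_used_this_window / tpm_limit
    if utilization < warn_ratio:
        return Decision(forward=True)
    # Above the warning threshold: answer locally with 503 and tell the
    # client when to come back, preserving the remaining provider headroom.
    retry_after = max(1, math.ceil(seconds_until_reset))
    return Decision(forward=False, status=503,
                    headers={"Retry-After": str(retry_after)})


print(admit(50_000, 100_000, 20.0))
print(admit(80_000, 100_000, 12.3))
```

The `Retry-After` value is expressed in whole seconds, which is one of the two forms the HTTP specification allows for that header.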

Asymmetric scaling complements quota awareness on the gateway itself. When predictive metrics (token consumption velocity, queue depth proxies, or forecasted RPM) indicate an approaching limit, scale gateway replicas up aggressively so the gateway has room to cache, queue, and shape requests. Extra replicas do not raise the provider quota; they only keep the gateway from becoming a second bottleneck while it sheds load. Scale down conservatively after demand subsides, because quota limits reset on provider schedules that may not align with your traffic decay. The HorizontalPodAutoscaler behavior block in Kubernetes expresses this pattern natively by separating scaleUp and scaleDown policies.
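As a sketch of that asymmetry, the Python below builds an `autoscaling/v2` HorizontalPodAutoscaler manifest as a dictionary and prints it as JSON, which `kubectl apply -f` accepts. The Deployment name `inference-gateway`, the custom metric `gateway_tokens_per_second`, and every numeric value are hypothetical; a Pods-type metric also presumes a custom metrics adapter is installed in the cluster.

```python
import json


def gateway_hpa(name="inference-gateway", min_replicas=2, max_replicas=20):
    """Build an HPA manifest with fast scale-up and slow scale-down."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    "metric": {"name": "gateway_tokens_per_second"},  # hypothetical metric
                    "target": {"type": "AverageValue", "averageValue": "2000"},
                },
            }],
            "behavior": {
                # Scale up immediately, taking the larger of the two policies.
                "scaleUp": {
                    "stabilizationWindowSeconds": 0,
                    "selectPolicy": "Max",
                    "policies": [
                        {"type": "Percent", "value": 100, "periodSeconds": 15},
                        {"type": "Pods", "value": 4, "periodSeconds": 15},
                    ],
                },
                # Scale down one pod at a time after a ten-minute calm period.
                "scaleDown": {
                    "stabilizationWindowSeconds": 600,
                    "policies": [{"type": "Pods", "value": 1, "periodSeconds": 120}],
                },
            },
        },
    }


manifest = gateway_hpa()
print(json.dumps(manifest, indent=2))
```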

Observability for quota management differs from traditional CPU monitoring. Export provider-side metrics where available—CloudWatch for Bedrock, Cloud Monitoring for Vertex, Azure Monitor for Foundry, OCI Monitoring for Generative AI—and correlate them with application-side counters for tokens sent, tokens received, retry counts, and 429 rates. A dashboard that shows green Kubernetes health alongside climbing 429 rates is the canonical early-warning signal for managed AI incidents.

For deep treatment of time-series forecasting, anomaly detection algorithms, and automated incident correlation applied to these metrics, see AIOps. This module establishes what to measure and why; Module 1.2 teaches how to build the ML pipelines that forecast token consumption and detect pre-throttle anomalies.

429 CASCADE TIMELINE (SIMPLIFIED)
=================================

T+0    Traffic spike consumes 90% of TPM quota
T+30s  First 429 responses; clients begin retries
T+60s  Effective request rate doubles due to retries
T+90s  Quota fully exhausted; most requests fail or queue
T+5m   Upstream timeouts propagate; user-visible outage
T+30m  Quota window resets OR ops purchases provisioned capacity

Prevention: backpressure at 70%, retry budget, circuit breaker to fallback
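
The circuit-breaker step in the prevention line can be sketched as follows. This is a deliberately small illustration: `primary` and `fallback` are hypothetical callables returning an HTTP status code, and the threshold and cooldown are placeholder values.

```python
import time


class CircuitBreaker:
    """Opens after consecutive 429s; allows a probe once the cooldown elapses."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.consecutive_429 = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a probe request through.
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, status):
        if status == 429:
            self.consecutive_429 += 1
            if self.consecutive_429 >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
        else:
            self.consecutive_429 = 0
            self.opened_at = None


def route(breaker, primary, fallback):
    """Try the primary endpoint when allowed; otherwise fail fast to the fallback."""
    if breaker.allow():
        status = primary()
        breaker.record(status)
        if status != 429:
            return "primary", status
    return "fallback", fallback()
```

While the breaker is open, the primary endpoint receives no traffic at all, which is what stops retries from feeding the cascade shown in the timeline.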

Token consumption forecasting—predicting when you will hit limits before 429s fire—is the specific AIOps intersection this module endorses. The forecast inputs include historical TPM by hour-of-day, campaign calendars, rolling output-token averages, and deployment events that change prompt templates. The output is a capacity action: purchase provisioned throughput, enable a cross-region profile, or throttle non-critical batch workloads. General-purpose anomaly detection on CPU metrics will not catch this failure mode; quota-aware forecasting will.
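As a deliberately simplified stand-in for such a forecast, the sketch below fits a straight line to recent per-minute TPM samples and estimates how many minutes remain before the trend crosses the limit. A real forecaster would add the seasonal and calendar inputs listed above; the function name and sample data are illustrative.

```python
def minutes_until_limit(samples, tpm_limit):
    """Least-squares linear trend over evenly spaced per-minute TPM samples.

    Returns the estimated minutes from the latest sample until the fitted
    line reaches `tpm_limit`, or None when usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = cov_xy / var_x
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (tpm_limit - intercept) / slope  # sample index where the line hits the limit
    return max(0.0, crossing - (n - 1))


rising = [100_000, 200_000, 300_000, 400_000]
print(minutes_until_limit(rising, tpm_limit=1_000_000))
```

A forecast like this is only useful when wired to a capacity action, for example pausing batch workloads once the estimate drops below a chosen lead time.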

Client-side architecture deserves equal attention. SDK defaults often retry on any error, including 429, without respecting Retry-After headers when present. Wrap provider SDKs in a gateway layer that centralizes retry policy, enforces per-tenant budgets, and converts hard throttles into queue delays with user-visible status rather than opaque timeouts. Implement token counting before dispatch—estimate input tokens from prompt length and cap requested max output—so you can reject oversize requests locally without consuming provider quota. These patterns cost engineering time upfront but prevent the exponential failure mode that makes managed AI incidents so confusing to debug.
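The pre-dispatch check can be sketched with a rough length-based estimate. The four-characters-per-token ratio is a common rule of thumb for English text, not a property of any specific tokenizer, and the limits below are placeholders; use the provider's own tokenizer where one is available.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Crude token estimate from character count (assumption, not a tokenizer)."""
    return max(1, math.ceil(len(text) / chars_per_token))


def precheck(prompt, requested_max_output, max_input_tokens=8000, output_cap=1024):
    """Reject oversize prompts locally and clamp the requested output length."""
    estimated = estimate_tokens(prompt)
    if estimated > max_input_tokens:
        return {"accepted": False,
                "reason": "prompt exceeds local input token limit",
                "estimated_input_tokens": estimated}
    return {"accepted": True,
            "estimated_input_tokens": estimated,
            "max_output_tokens": min(requested_max_output, output_cap)}


print(precheck("a" * 400, requested_max_output=5000))
print(precheck("a" * 40_000, requested_max_output=100))
```

Rejected requests never reach the provider, so they consume neither RPM nor TPM.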

Cost Model of Token-Metered Services

Managed AI economics invert the familiar cloud compute model. Instead of paying for vCPU-hours whether or not you use them, you pay per token processed—with separate meters often applied to input and output tokens. Output tokens frequently cost more per unit because generation consumes more compute than prompt ingestion. A prompt that elicits a verbose answer therefore costs disproportionately more than the same prompt engineered for concision, making prompt design a direct cost lever.

Input versus output asymmetry shapes architecture. Retrieval-augmented generation pipelines that inject large context windows consume input tokens on every request even when the answer is short. Caching strategies—provider-side prompt caching where available, or application-side semantic caches—reduce repeated input token charges for stable context. Teams that ignore input token volume while optimizing output length often discover that their bill scales with document corpus size rather than user count.

Agentic workflows compound both sides of the meter. Each tool call may inject additional context into the prompt—tool results, memory retrieval, MCP resource payloads—inflating input tokens on every reasoning step. Multi-step agents therefore consume quota faster than single-shot completion APIs for equivalent user-facing tasks. When evaluating agent platforms in the Rosetta Stone, budget capacity for the full loop, not just the first model call. Capacity planning for agents is an emerging discipline; start with measured traces of real agent sessions rather than extrapolating from simple chat benchmarks.

Context and prompt caching (where platforms support it) trades memory residency for discounted re-processing of identical prompt prefixes. The durable concept is straightforward: if thousands of requests share a static system prompt or a fixed knowledge-base preamble, caching that prefix avoids re-billing full input tokens on each call. Cache hit rates become a financial KPI alongside latency. Verify current caching availability and pricing against vendor documentation; implementations differ by vendor and model.
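The effect of cache hit rate on billed input can be sketched with one formula. The price ratio for cached reads is a hypothetical parameter, and the sketch ignores any surcharge a platform may apply when a prefix is first written to the cache.

```python
def effective_input_tokens(prefix_tokens, suffix_tokens, hit_rate, cached_price_ratio):
    """Input tokens per request, in full-price-equivalent tokens.

    `cached_price_ratio` is the assumed price of a cached token relative
    to a normal input token (for example 0.25 for a 75% discount).
    """
    prefix_cost = prefix_tokens * (hit_rate * cached_price_ratio + (1 - hit_rate))
    return suffix_tokens + prefix_cost


full = effective_input_tokens(1500, 500, hit_rate=0.0, cached_price_ratio=0.25)
cached = effective_input_tokens(1500, 500, hit_rate=0.8, cached_price_ratio=0.25)
print(round(full), round(cached), f"{1 - cached / full:.0%} saved")
```

Under these invented numbers an 80% hit rate on a 1,500-token prefix cuts billed input by a little under half, which is why hit rate deserves a place on the cost dashboard.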

Batch versus real-time pricing introduces a second axis. Batch inference APIs accept jobs with higher latency tolerance—minutes to hours—in exchange for lower per-token cost. Real-time endpoints prioritize responsiveness at standard or premium rates. Architectures that can defer non-interactive work—nightly summarization, bulk classification, embedding backfills—should route through batch channels when available, reserving real-time quota for user-facing paths.

Over-provisioning versus under-provisioning carries asymmetric business risk in token-metered services, analogous to but sharper than traditional cloud sizing. Under-provisioning on-demand quota produces 429 errors and lost user sessions—often far costlier than the tokens you failed to purchase. Over-provisioning dedicated throughput produces idle committed spend—annoying on a finance review but rarely catastrophic in a single incident. Production policies therefore bias toward early provisioned capacity for launch events and conservative scale-down afterward.

Abstract cost reasoning helps before you look up any price table. Define a cost unit as one million input tokens and one million output tokens at on-demand rates, then express workloads as multiples of that unit. A support chat averaging 2,000 input and 500 output tokens per session at 10,000 sessions per day consumes predictable unit multiples that finance can budget. Ratios matter more than absolute dollars: if output tokens cost three times input tokens, shortening average response length by twenty percent may dominate savings compared to negotiating a ten percent input discount.
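The ratio argument can be checked with a few lines of arithmetic. Costs are expressed in units of "the price of one million input tokens", so no real price is assumed; the threefold output ratio comes from the example in the text.

```python
def daily_cost_units(sessions, input_tokens, output_tokens,
                     output_price_ratio, input_discount=0.0):
    """Daily cost in units of the price of one million input tokens."""
    input_units = sessions * input_tokens / 1e6 * (1 - input_discount)
    output_units = sessions * output_tokens / 1e6 * output_price_ratio
    return input_units + output_units


baseline = daily_cost_units(10_000, 2_000, 500, output_price_ratio=3.0)

# Option A: responses 20% shorter (500 -> 400 output tokens).
shorter = daily_cost_units(10_000, 2_000, 400, output_price_ratio=3.0)

# Option B: a negotiated 10% discount on input tokens.
discounted = daily_cost_units(10_000, 2_000, 500, output_price_ratio=3.0,
                              input_discount=0.10)

print(f"baseline={baseline:.1f} shorter={shorter:.1f} discounted={discounted:.1f}")
```

The baseline is 35 units per day (20 from input, 15 from output). Shorter responses save 3 units while the input discount saves 2, so the output-side lever wins even though output is only a fifth of the raw token count.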

Committed-use discounts and provisioned throughput contracts introduce time horizon decisions. Hourly provisioned units suit spiky but predictable campaigns; monthly commitments suit steady production baselines. Breaking commitments early may forfeit savings, so align purchase duration with traffic forecasts validated against historical token metrics—not against engineering optimism at launch time.

Walk through a qualitative cost comparison without attaching dollar figures to volatile price tables. Imagine a support assistant handling ten thousand sessions daily, each averaging two thousand input tokens and five hundred output tokens. If output tokens carry triple the unit cost of input tokens in your provider’s meter, output represents a larger share of spend than raw token counts suggest: the five million daily output tokens cost as much as fifteen million input tokens, set against twenty million actual input tokens. Shortening responses through prompt engineering may therefore dominate savings compared to chasing marginal input discounts. If thirty percent of requests repeat an identical system prompt, prompt caching (where the platform supports it) can remove a substantial fraction of input charges for cache hits. If half the workload is overnight batch summarization deferrable by six hours, batch pricing tiers may cut that half’s cost independently of interactive rates. These ratio-based thought experiments survive pricing changes; look up current unit prices in the vendor’s published price list when you need numbers for a budget proposal.

Cloud AI Services Rosetta Stone

The Cloud track teaches infrastructure through Rosetta Stone tables that map durable capabilities across AWS, GCP, and Azure. Managed AI services benefit from the same pattern because vendor marketing names change while underlying capabilities persist. The table below maps durable capabilities (rows) to current vendor offerings (columns). Use it to translate a requirement document written for one cloud into equivalent configurations on another, or to slot a new entrant into your evaluation framework.

Durable capability	Amazon Bedrock	Google Vertex AI	Azure AI Foundry	OCI Generative AI	Self-hosted (see Module 1.3+)
On-demand foundation-model API	InvokeModel / Converse	Generative AI API	Foundry model deployments	Generative AI API	vLLM / SGLang endpoints
Provisioned / dedicated throughput	Provisioned Throughput (MUs)	Provisioned Throughput	Provisioned Throughput units	Dedicated AI clusters	GPU node pools + model replicas
Geographic cross-region routing	Inference profiles (US/EU/APAC)	Limited; check regional tables	Data-zone and global deployments	Region-bound model lists	Multi-cluster + DNS failover
Global routing (max throughput)	Global inference profiles	`locations/global` (feature gaps)	Global deployment type	Not equivalent; regional	Anycast / multi-region gateway
Data residency guarantees	Regional and geographic profiles	Regional endpoints; global caveats	Regional and data-zone options	OC1/OC4/OC19 realms	Full control; you operate residency
OpenAI-compatible API surface	Via Converse; ecosystem compat	OpenAI-compatible endpoints	Azure OpenAI API shape	Documented compat API	OpenAI API compat layers
Fine-tuning / custom weights	Custom models + provisioned	Regional tuning endpoints	Foundry fine-tuning	Import + fine-tune on dedicated	Full training stack ownership
Embeddings / rerank APIs	Titan Embeddings et al.	Vertex embedding models	Foundry embedding deployments	Embedding and rerank endpoints	Sentence-transformers / custom
Knowledge / RAG integration	Knowledge Bases for Bedrock	Vertex AI Search / RAG Engine	Foundry + Azure AI Search	Vector stores documented	Own vector DB + pipeline
Agent / tool / MCP support	Agents, action groups	Agent Builder	Agent frameworks	Responses API, MCP hooks	LangChain / custom MCP servers
Guardrails / safety filters	Guardrails for Bedrock	Safety filters / Model Garden policies	Content filters	AI guardrails (on-demand)	Prompt rules + own moderation
Batch inference	Batch inference jobs	Batch prediction endpoints	Batch deployments	Check current batch APIs	Offline job queues
Audit / enterprise logging	CloudTrail, CloudWatch	Cloud Audit Logs	Azure Monitor / Diagnostic	OCI Audit / Logging	Your observability stack

Reading the Rosetta Stone effectively means focusing on rows, not columns. When a product manager asks for “global routing with EU residency,” translate that into two row requirements—cross-region routing and data residency—and discover immediately that global endpoints and strict EU residency conflict on several platforms. When a security reviewer asks for “OpenAI-compatible API with private networking,” locate the compat API row and verify whether private endpoints exist for that vendor’s deployment type.

No column is universally superior. Each vendor optimizes for different enterprise anchors: existing cloud commitments, compliance certifications, model catalog breadth, or OpenAI API compatibility. Present them as peers with capability tradeoffs, not as a ranked leaderboard that will go stale within weeks.

When onboarding a new platform engineer, walk the Rosetta row-by-row with your organization’s actual ADRs rather than column-by-column through vendor marketing. Ask which rows are hard requirements—EU residency, custom model hosting, MCP tool support—and mark columns that fail any hard row as disqualified before comparing soft preferences like catalog breadth. This inversion prevents the common anti-pattern of choosing a cloud because “we already use it for compute” without verifying that the specific model and routing capabilities you need exist in your target region.
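The hard-row filter can be sketched as a simple lookup. The vendor names and capability flags below are placeholders invented for the example; they make no claim about any real platform.

```python
def failed_hard_rows(vendors, hard_rows):
    """Map each vendor to the hard-requirement rows it fails (empty list = qualified)."""
    return {
        name: [row for row in hard_rows if not capabilities.get(row, False)]
        for name, capabilities in vendors.items()
    }


# Hypothetical evaluation sheet: rows are durable capabilities, not product names.
vendors = {
    "vendor_a": {"eu_residency": True, "custom_model_hosting": True, "mcp_tools": False},
    "vendor_b": {"eu_residency": True, "custom_model_hosting": True, "mcp_tools": True},
    "vendor_c": {"eu_residency": False, "custom_model_hosting": True},
}
hard_rows = ["eu_residency", "mcp_tools"]

results = failed_hard_rows(vendors, hard_rows)
qualified = [name for name, failures in results.items() if not failures]
print(results)
print(qualified)
```

A capability missing from a vendor's sheet counts as a failure, which mirrors the rule of treating undocumented claims as unverified.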

Evaluating Any Managed AI Service

New managed AI offerings launch frequently. A durable evaluation framework protects you from re-architecting on every press release. The following checklist applies to any vendor—hyperscaler, sovereign cloud, or specialized inference provider—and maps directly to Rosetta Stone rows.

Step 1: Classify workload requirements. Document latency class (interactive, near-real-time, batch), data sensitivity (public, internal, regulated), expected token volume ranges, and model customization needs (base model only, fine-tuned, or fully custom). These inputs determine which routing and capacity rows in the Rosetta Stone matter for your decision. Be explicit about peak multipliers—Black Friday traffic is not average traffic times two; measure historical peaks or accept that you are guessing.

Step 2: Verify residency and compliance. Match regulatory obligations to endpoint geography options. Ask whether prompts, outputs, and logs remain in jurisdiction, whether global routing can override regional selection, and whether fine-tuning data storage locations differ from inference locations. If the vendor cannot answer with documentation citations, treat residency as unverified. Involve legal and security stakeholders before production—not after a regulator asks where prompts were processed.

Step 3: Map quota and throughput mechanics. Obtain default and increasable RPM/TPM limits for your target models and regions. Identify whether provisioned throughput is mandatory for your use case—custom models, guaranteed SLAs, or high sustained load usually require it. Model the 429 cascade risk for your peak traffic multiplier by running load tests that include realistic retry behavior, not idealized clients that fail fast.

Step 4: Model total cost of ownership. Compute token costs using input/output split, caching discounts, and batch eligibility. Add provisioned commitment costs, data egress, logging, and gateway infrastructure. Compare range scenarios (±30% traffic) rather than single-point estimates. Present finance with a range, not a single monthly number, because token workloads are inherently variable.
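As a rough illustration of the token portion of Step 4, the sketch below prices a month of traffic at three volume levels. Every number is a made-up assumption (prices per million tokens, cache hit ratio, volumes), and the `Pricing` class and `monthly_token_cost` function are hypothetical names. Provisioned commitments, egress, logging, and gateway costs would be added on top.

```python
from dataclasses import dataclass

@dataclass
class Pricing:
    # Hypothetical USD prices per 1M tokens; not any vendor's actual rates.
    input_per_m: float
    output_per_m: float
    cached_input_discount: float  # fraction off the input price for cache hits

def monthly_token_cost(input_tokens: float, output_tokens: float,
                       cache_hit_ratio: float, p: Pricing) -> float:
    """Token cost with cached input billed at a discounted rate."""
    cached = input_tokens * cache_hit_ratio
    uncached = input_tokens - cached
    billable_input = uncached + cached * (1 - p.cached_input_discount)
    return (billable_input * p.input_per_m + output_tokens * p.output_per_m) / 1e6

pricing = Pricing(input_per_m=1.0, output_per_m=3.0, cached_input_discount=0.5)
baseline_input, baseline_output, hit_ratio = 2_000_000_000, 500_000_000, 0.4

# Present a range (-30%, baseline, +30%) instead of a single-point estimate.
for label, scale in [("-30%", 0.7), ("baseline", 1.0), ("+30%", 1.3)]:
    cost = monthly_token_cost(baseline_input * scale, baseline_output * scale,
                              hit_ratio, pricing)
    print(f"{label:>8}: ${cost:,.0f}")
```

Scaling input and output together is itself a simplification; if a launch is expected to shift the input/output mix, vary the two independently.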

Step 5: Assess integration surface. Evaluate SDK quality, OpenAI compatibility if migrating existing clients, authentication model (API keys, IAM, workload identity), and private connectivity options (VPC endpoints, Private Link, service gateways). Integration cost often dominates engineering time compared to per-token price differences. A two-percent unit price advantage loses to a six-week migration if SDK gaps force custom HTTP clients.

Step 6: Test failure behavior. In a staging environment, deliberately exhaust quotas and observe 429 response bodies, Retry-After headers, and recovery time after quota reset. Fail over to secondary endpoints and measure circuit-breaker effectiveness. Providers differ sharply in how helpful their error payloads are: some return actionable error codes; others return opaque messages that complicate automated remediation.
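When recording what a provider returns during a quota drill, it helps to normalize the Retry-After value, which HTTP allows as either a number of seconds or an HTTP date. The helper below is a minimal sketch using only the standard library; the function name is hypothetical, and whether a given provider sends this header at all is something the drill itself should establish.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After value (delay-seconds or HTTP-date) into seconds.

    Returns None when the header is absent or unparseable so the caller can
    fall back to its own backoff policy.
    """
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Assumption: treat dates without an explicit zone as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Logging the parsed value next to the observed recovery time shows whether the provider's hint is trustworthy enough to drive automated backoff.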

Step 7: Plan operational ownership. Assign owners for quota monitoring, provisioned capacity renewals, model version upgrades, and deprecation migrations. Managed services shift hardware toil but create contract toil—renewals, limit increase tickets, and model ID changes still land on your platform team.

Authentication and network path. Beyond model capabilities, evaluate how workloads authenticate—API keys, cloud IAM roles, workload identity federation—and whether private connectivity (VPC endpoints, Private Link, service gateways) is required so inference traffic never traverses the public internet. These requirements rarely appear in model catalog comparisons but dominate enterprise security reviews. A model that meets every Rosetta row for capability still fails evaluation if it cannot be reached from your network zone without violating policy.

Observability contract. Before signing a production commitment, confirm which metrics the vendor exports natively—TPM/RPM utilization, throttle counts, latency percentiles, error codes—and which you must derive from application-side instrumentation. Gaps in provider metrics force you to build token counters in your gateway, which is feasible but should be planned effort, not a launch-week surprise.
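If the provider does not export token utilization, a gateway-side counter is one way to fill the gap. The following is a minimal single-process sketch. The `TokenWindow` class is a hypothetical name, it is not thread-safe, and it assumes the gateway can read input and output token counts from each provider response.

```python
import time
from collections import deque
from typing import Optional

class TokenWindow:
    """Sliding window of token usage kept in gateway memory."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # entries: (timestamp, input_tokens, output_tokens)

    def record(self, input_tokens: int, output_tokens: int,
               now: Optional[float] = None) -> None:
        """Record one completed request."""
        ts = time.monotonic() if now is None else now
        self._events.append((ts, input_tokens, output_tokens))

    def totals(self, now: Optional[float] = None) -> dict:
        """Drop events older than the window, then sum what remains."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return {
            "requests": len(self._events),
            "input_tokens": sum(e[1] for e in self._events),
            "output_tokens": sum(e[2] for e in self._events),
        }
```

With several gateway replicas, each instance sees only its own traffic, so the per-replica totals would need to be summed in your metrics backend before comparing against a provider limit.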

Document evaluations in an architecture decision record (ADR) that references Rosetta Stone row IDs rather than vendor slogans. When a vendor ships a new feature, ask which row it affects; if none, the feature may be marketing noise relative to your requirements.

Revisit ADRs on a quarterly cadence even when nothing is broken. Managed AI platforms change model availability, deprecate endpoints, and adjust default quotas without fanfare comparable to a major Kubernetes version release. A lightweight quarterly review—compare your ADR assumptions against the current landscape snapshot, re-run staging quota drills, and confirm provisioned commitments still match traffic growth—catches drift before it becomes an incident. Treat this review like certificate expiry checks: boring, scheduled, and invaluable.

Operating Managed AI Services

Operating managed AI APIs combines the capacity concepts above with day-two practices: monitoring, incident response, and continuous capacity adjustment. This section covers operations specific to Cloud AI consumption—not the general AIOps curriculum.

Monitoring stack. At minimum, instrument four metric groups: provider quota utilization (RPM/TPM percentages where exposed), application token counters (input/output per service), error taxonomy (429 versus 5xx versus timeout), and latency (time-to-first-token and total generation time). Correlate provider metrics with gateway metrics to distinguish external throttling from internal bottlenecks. Dashboards should visualize quota headroom on the same page as Kubernetes pod counts so on-call engineers never misdiagnose a quota incident as a cluster problem.

Build alert thresholds on quota utilization, not just error rates. By the time 429 rates trigger a page, users have already suffered. Warning alerts at seventy percent sustained utilization give time to enable cross-region routing or pause batch workloads before hard throttling begins. Pair these alerts with runbook links that distinguish provider-side actions from cluster-side actions so responders do not waste minutes scaling Deployments.
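The "sustained" qualifier matters: a single sample above seventy percent should not page anyone. In practice this condition usually lives in your alerting system's rule language, but the logic can be sketched in Python. The function name and the five-sample window are illustrative assumptions.

```python
def sustained_breach(samples: list, threshold: float = 0.7,
                     min_consecutive: int = 5) -> bool:
    """True when the most recent `min_consecutive` samples all meet the threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(s >= threshold for s in samples[-min_consecutive:])

# Utilization fractions sampled once per minute (hypothetical data).
print(sustained_breach([0.50, 0.72, 0.75, 0.80, 0.71, 0.90]))  # last five all >= 0.7
print(sustained_breach([0.90, 0.95, 0.60, 0.80, 0.80]))        # dip to 0.6 breaks the run
```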

Runbook priorities for 429 incidents. First, stop retry amplification—disable or tighten client retry policies at the gateway. Second, shed non-critical traffic (pause batch jobs, reduce max output tokens temporarily). Third, activate secondary routing (cross-region profile, alternate model tier, fallback vendor). Fourth, initiate provisioned throughput purchase or quota increase request with provider support. Fifth, post-incident, reconcile forecast models if you maintain them per AIOps guidance.
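The first runbook step, stopping retry amplification, is easier to execute if the gateway already enforces a retry budget that can be tightened with one setting. The sketch below shows one possible shape: retries are allowed only while they stay under a fixed fraction of first attempts in the current accounting window. The `RetryBudget` class and its 10% default are assumptions for illustration, not a reference to any particular library.

```python
class RetryBudget:
    """Permit retries only while they stay under a fraction of first attempts."""

    def __init__(self, max_retry_ratio: float = 0.1):
        self.max_retry_ratio = max_retry_ratio
        self.first_attempts = 0
        self.retries = 0

    def record_first_attempt(self) -> None:
        self.first_attempts += 1

    def allow_retry(self) -> bool:
        """Spend one unit of budget if available; otherwise signal fail-fast."""
        if self.retries + 1 > self.first_attempts * self.max_retry_ratio:
            return False
        self.retries += 1
        return True

    def reset(self) -> None:
        """Call at the start of each accounting window (for example, every minute)."""
        self.first_attempts = 0
        self.retries = 0
```

During an incident, lowering `max_retry_ratio` toward zero caps retry traffic immediately without redeploying clients, which is the behavior the first runbook step calls for.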

During the incident, preserve evidence for the postmortem: snapshot provider quota graphs, export gateway retry counters, and record which routing profile was active. Quota incidents often look like mysterious latency in aggregate dashboards; the evidence bundle proves whether the root cause was exhaustion, retry storms, or an unrelated application regression. This discipline also builds the historical dataset that forecasting pipelines in Module 1.2 require.

Capacity planning cadence. Review token consumption weekly during growth phases, monthly at steady state. Before marketing events or product launches, pre-purchase provisioned capacity or pre-approve quota increase tickets. Align planning horizons: hours-ahead for autoscaling gateway replicas, days-ahead for quota increases, weeks-ahead for provisioned commitments, quarters-ahead for contract negotiations.

Model lifecycle management. Providers deprecate model IDs, release new versions, and change default endpoints. Maintain an inventory mapping application configuration to model identifiers and routing profiles. Test new model versions in shadow traffic before cutover. Deprecation notices require the same rigor as any upstream API breaking change.
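A model inventory does not need to be elaborate to be useful. The sketch below keeps one record per service and flags entries whose announced retirement date falls inside a planning horizon. The service names, model identifiers, routing profile labels, and dates are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical inventory: service -> (model identifier, routing profile, retirement date or None)
INVENTORY = {
    "chat-api": ("example-model-v2", "eu-geo-profile", date(2026, 3, 1)),
    "summarizer": ("example-model-v3", "regional", None),
}

def retiring_soon(inventory: dict, today: date, horizon_days: int = 90) -> list:
    """Return services whose model retires on or before today + horizon_days."""
    cutoff = today + timedelta(days=horizon_days)
    return sorted(
        service
        for service, (_model, _profile, retires) in inventory.items()
        if retires is not None and retires <= cutoff
    )

print(retiring_soon(INVENTORY, today=date(2026, 1, 1)))
```

Running a check like this in CI or alongside the quarterly ADR review turns a deprecation notice into a scheduled migration rather than an incident.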

Security operations. Rotate API keys and workload credentials on schedule; prefer short-lived tokens and IAM roles over long-lived secrets. Scope keys to minimum models and operations required. Audit prompt logging policies—full prompt capture aids debugging but creates data-governance surface area.

Incident communication. When a quota incident occurs, status pages should describe user impact and remediation in business terms—“elevated response latency for AI features”—while internal runbooks track TPM utilization, active routing profile, and provisioned capacity headroom. Correlating external user reports with internal provider metrics quickly distinguishes quota throttling from application bugs, preventing wasted hours debugging Kubernetes when the fix is a capacity contract change.

Predictive autoscaling for an inference gateway—scaling Kubernetes replicas based on forecasted token demand rather than CPU—closes the loop between quota awareness and infrastructure action. The HPA example below demonstrates asymmetric behavior: aggressive scale-up when a custom metric predicts token pressure, conservative scale-down to avoid thrashing.

# Kubernetes v1.35+ HorizontalPodAutoscaler for an AI inference gateway
# Scales on a custom metric exported by your gateway (predicted token load)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-inference-gateway
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-gateway
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Object
    object:
      metric:
        name: predicted_token_usage_15m
      describedObject:
        apiVersion: apps/v1
        kind: Deployment
        name: ai-inference-gateway
      target:
        type: Value
        value: 50000
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 1800
      policies:
      - type: Pods
        value: 1
        periodSeconds: 300

The gateway deployment must export predicted_token_usage_15m through the custom metrics API, typically via Prometheus Adapter or a metrics pipeline fed by token consumption forecasts. For methodology on building those forecasts (ARIMA, Prophet, isolation-based anomaly detection on token streams), see the AIOps module.
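To see how the manifest above behaves, it can help to trace one evaluation step by hand. The sketch below applies the ratio formula from the Kubernetes HPA documentation (desired = ceil(current × metric / target), skipped when the ratio is within the default 10% tolerance) and then the manifest's rate limits. It is a deliberate simplification: stabilization windows and the exact period accounting are ignored, and the function names are hypothetical.

```python
import math

def desired_replicas(current: int, metric_value: float, target_value: float,
                     tolerance: float = 0.1) -> int:
    """Core HPA ratio calculation; no change when the ratio is within tolerance."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)

def apply_manifest_limits(current: int, desired: int,
                          min_replicas: int = 3, max_replicas: int = 100) -> int:
    """One step under the manifest's policies (stabilization windows ignored).

    scaleUp: at most +100% of current replicas per period.
    scaleDown: at most -1 pod per period.
    """
    if desired > current:
        desired = min(desired, current * 2)
    elif desired < current:
        desired = max(desired, current - 1)
    return max(min_replicas, min(max_replicas, desired))

# Predicted load at 3x the 50000 target: wants 30 pods, allowed to double to 20.
print(apply_manifest_limits(10, desired_replicas(10, 150_000, 50_000)))
# Predicted load at half the target: wants 10 pods, allowed to drop only to 19.
print(apply_manifest_limits(20, desired_replicas(20, 25_000, 50_000)))
```

The asymmetry is visible in the two calls: a surge can double capacity in a single period, while a lull removes one pod at a time.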

Did You Know?

Cross-region inference profiles and provisioned throughput are often separate product paths. On Amazon Bedrock, inference profiles increase on-demand throughput via geographic or global routing, but provisioned throughput requires invoking dedicated model ARNs—a design constraint that catches teams expecting one knob to solve both burst and guaranteed capacity.
Global endpoints frequently trade features for reach. Google Cloud documents that Vertex AI’s global endpoint does not guarantee data residency and omits capabilities such as model tuning available on regional endpoints—making “global” a throughput tool, not a compliance shortcut.
Azure Foundry deployment types encode residency and routing policy. Microsoft documents distinct global, data-zone, and regional deployment types with different data processing boundaries—choosing a deployment type is equivalent to choosing a compliance and latency class.
Open standards for agent tooling outlast individual vendor agent products. The Model Context Protocol (MCP) provides a cross-vendor interface for tools and context, while AGENTS.md conventions document agent behavior across harnesses—both change far slower than any single cloud agent SKU.

Common Mistakes

Mistake	Why it happens	How to fix
Treating 429 as a transient error	Retry defaults assume server-side flakiness, not quota exhaustion.	Implement retry budgets, backpressure at 70–80% quota utilization, and circuit breakers; purchase headroom before launch.
Using global endpoints for regulated data	Global routing simplifies SDK configuration and improves burst throughput.	Match endpoint geography to residency policy; use regional or geographic profiles and verify member regions in documentation.
Sharing one endpoint across batch and interactive workloads	Simplicity during prototyping carries into production.	Partition endpoints or quotas by workload class; shed batch traffic first during pressure.
Scaling Kubernetes pods during quota incidents	On-call playbooks default to HPA/KEDA without checking provider metrics.	Add provider quota dashboards; scale routing and capacity contracts before replica counts.
Ignoring output token cost	Prompt engineering focuses on input context size.	Track input/output split; cap max output tokens per route; measure cost per successful user task.
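Symmetric gateway scale-down	Default HPA rate policies allow changes of up to 100% per 15 seconds in both directions, damped only by a 300-second scale-down stabilization window, so the defaults look “balanced.”	Configure asymmetric `behavior`: fast scale-up on predicted token load, slow scale-down with long stabilization windows and small per-period pod limits.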
Skipping quota exhaustion drills	Staging environments use tiny traffic and never hit limits.	Deliberately exhaust staging quotas quarterly; validate runbooks and fallback routes under real 429 responses.

Knowledge Check

Scenario 1: Your EU-regulated healthcare application must keep patient prompts within the EU. An engineer proposes using a global inference endpoint to handle launch-day traffic spikes. What is wrong with this approach?

Global inference endpoints route requests to available commercial regions worldwide and typically do not guarantee data residency within the EU. Even if average latency improves, compliance requirements fail. The correct approach is a geographic cross-region profile confined to EU member regions—or a single EU regional endpoint with provisioned throughput for headroom—verified against current vendor documentation for data processing locations.

Scenario 2: Kubernetes shows all gateway pods at 30% CPU, but users report timeouts. CloudWatch/Bedrock metrics show TPM at 98%. What is the likely root cause and first action?

The bottleneck is provider-side token quota exhaustion, not compute inside the cluster. Scaling pods will not increase available TPM. First action: stop retry amplification at the gateway, shed non-critical batch traffic, and either enable a cross-region inference profile or initiate provisioned throughput purchase while quota resets or support processes an increase.

Scenario 3: You need guaranteed capacity for a fine-tuned custom model on a platform where custom models require dedicated hosting. Which Rosetta Stone row guides your decision?

The Provisioned / dedicated throughput row. Custom weights require reserved model capacity on most hyperscaler platforms—not on-demand shared endpoints. You must purchase provisioned units or dedicated cluster capacity and invoke the provisioned model identifier rather than the default on-demand ID.

Scenario 4: A batch embedding job and a live chat API share one endpoint. Overnight, embedding traffic spikes and chat error rates climb. What architectural change prevents recurrence?

Partition workloads by endpoint or quota allocation. Batch embedding should use a separate endpoint—ideally batch-priced if available—with its own retry and throttle policy. Interactive chat retains dedicated RPM/TPM headroom. Optionally schedule batch jobs off-peak with explicit backpressure when interactive utilization exceeds thresholds.

Scenario 5: Finance asks whether shortening average model responses by 25% is worth more than switching to a cheaper input-token rate. Output tokens cost 3× input tokens, and output represents 40% of total token spend. How do you reason about this?

The 40% figure is a share of spend, so the 3× price ratio is already reflected in it. A 25% output length reduction saves roughly 10% of total token cost (0.25 × 0.40). Input accounts for the remaining 60% of spend, so a cheaper input rate would have to cut the input price by about 17% (0.10 / 0.60) to match that saving, a bar that marginal rate negotiations often miss. Run the calculation with your actual input/output ratio before committing; ratios vary by workload.
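The comparison can be checked with a few lines of arithmetic. This sketch assumes token spend consists only of input and output charges and that input volume stays constant when responses get shorter; the function names are illustrative.

```python
def savings_from_shorter_output(output_share: float, length_reduction: float) -> float:
    """Fraction of total token spend saved by trimming output length."""
    return output_share * length_reduction

def savings_from_input_discount(output_share: float, input_discount: float) -> float:
    """Fraction of total token spend saved by a cheaper input rate."""
    return (1.0 - output_share) * input_discount

def breakeven_input_discount(output_share: float, length_reduction: float) -> float:
    """Input-price discount needed to match the output-trimming saving."""
    return savings_from_shorter_output(output_share, length_reduction) / (1.0 - output_share)

print(f"{savings_from_shorter_output(0.40, 0.25):.1%} saved by shorter output")
print(f"{breakeven_input_discount(0.40, 0.25):.1%} input discount needed to match")
```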

Scenario 6: Your HPA scales gateway pods on CPU utilization, but 429 rates spike during predictable daily peaks. What metric should replace or supplement CPU in the HPA?

A custom metric reflecting token demand or predicted token usage, such as predicted_token_usage_15m exported from your gateway’s forecasting pipeline. CPU correlates poorly with LLM API quota pressure because inference compute happens on the provider side; token velocity correlates directly with TPM exhaustion.

Scenario 7: A vendor announces a new "global agent" product. Your evaluation framework requires assessing agent support. Which rows and steps apply before adoption?

Map the announcement to Rosetta Stone rows: Agent / tool / MCP support, Data residency, and Quota / throughput mechanics. Follow evaluation Steps 2–6: verify residency claims, test quota behavior under load, model cost including tool-call token overhead, assess the integration surface, and test failure behavior when tool endpoints throttle independently of the main model.

Hands-On Exercise

Task: Build quota-aware monitoring and asymmetric gateway scaling for a managed AI inference gateway. You will simulate token consumption metrics, implement backpressure logic, and validate an HPA manifest for Kubernetes 1.35.

Part 1: Token Quota Monitor

Create quota_monitor.py that tracks utilization against configurable RPM and TPM limits and recommends actions before hard throttling.

#!/usr/bin/env python3
"""Quota utilization monitor for managed AI endpoints."""
from dataclasses import dataclass

@dataclass
class QuotaLimits:
    rpm: int
    tpm: int
    burst_rpm: int | None = None

@dataclass
class QuotaSnapshot:
    requests_last_minute: int
    tokens_last_minute: int

def utilization(snapshot: QuotaSnapshot, limits: QuotaLimits) -> dict[str, float]:
    """Return RPM and TPM utilization as fractions (0.0–1.0+)."""
    rpm_limit = limits.burst_rpm or limits.rpm
    return {
        "rpm": snapshot.requests_last_minute / rpm_limit,
        "tpm": snapshot.tokens_last_minute / limits.tpm,
    }

def recommend_action(util: dict[str, float], warn: float = 0.7, critical: float = 0.9) -> str:
    """Recommend backpressure or capacity action based on utilization."""
    peak = max(util["rpm"], util["tpm"])
    if peak >= critical:
        return "CRITICAL: Enable backpressure, shed batch traffic, consider provisioned throughput"
    if peak >= warn:
        return "WARN: Reduce retry rates, activate cross-region profile if available"
    return "OK: Continue monitoring"

Append a __main__ block that simulates rising utilization and prints recommendations, then run a quick validation. Because the comparisons use >=, expect WARN at or above 0.7 and CRITICAL at or above 0.9.

python3 quota_monitor.py

Part 2: Retry Budget Calculator

Create retry_budget.py that computes maximum safe retries without exceeding a retry amplification factor.

def max_safe_retries(base_rps: float, quota_rpm: int, headroom_factor: float = 1.5) -> int:
    """
    Given base requests per second and an RPM quota, return the max retries
    per request that keep the total effective rate (original requests plus
    retries) at or below the quota divided by a safety headroom_factor.
    A headroom_factor of 1.5 reserves ~33% of the quota as burst headroom.
    """
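    if base_rps <= 0:
        # Guard against division by zero below; zero traffic has no retry budget to size.
        raise ValueError("base_rps must be positive")
    base_rpm = base_rps * 60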
    safe_ceiling = quota_rpm / headroom_factor
    if safe_ceiling <= base_rpm:
        return 0
    # Total RPM if each request retries n times: base_rpm * (1 + n) <= safe_ceiling
    max_n = int((safe_ceiling - base_rpm) / base_rpm)
    return max(0, max_n)

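Verify: at 100 RPS against a 10,000 RPM quota, base traffic is 6,000 RPM and the safe ceiling is about 6,667 RPM, so max_safe_retries(100, 10_000) returns 0. Even one retry per request would double traffic to 12,000 RPM and exceed the quota, which is far stricter than naive unlimited retry policies assume.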

Part 3: Validate HPA Manifest

Save the HPA YAML from the Operating Managed AI Services section as hpa-ai-gateway.yaml. Validate syntax:

kubectl apply --dry-run=client -f hpa-ai-gateway.yaml

Confirm the manifest uses autoscaling/v2, defines separate scaleUp and scaleDown behavior blocks, and references a custom metric name suitable for token forecasting.

Success Checklist

quota_monitor.py reports OK, WARN, and CRITICAL at appropriate utilization thresholds.
retry_budget.py demonstrates that unlimited retries can exceed quota at modest base traffic.
hpa-ai-gateway.yaml passes kubectl apply --dry-run=client without errors.
You can articulate which actions belong in this module (quota, routing, provisioning) versus AIOps (forecasting algorithms, RCA).

Next Module

You now understand the durable architecture of Cloud AI Services—routing models, quota mechanics, token economics, and cross-vendor evaluation—and how to operate them without mistaking provider throttling for cluster failure.

Up Next: AIOps

In Module 1.2, you will build the operational depth this module intentionally deferred: statistical and ML-based anomaly detection on infrastructure telemetry, AI-powered causal graphs for root cause analysis, hybrid log parsing pipelines, and the forecasting methods that feed token consumption predictions referenced here.

Sources

Amazon Bedrock Cross-Region Inference — Geographic and global inference profiles, throughput tradeoffs, and the separation from Provisioned Throughput.
Amazon Bedrock Provisioned Throughput — Model Units, dedicated capacity for base and custom models, and hourly commitment mechanics.
Google Vertex AI Locations and Endpoints — Regional versus global endpoint behavior, data residency caveats, and feature availability by location.
Microsoft Foundry Deployment Types — Global, data-zone, and regional deployment models with data processing and routing implications.
Azure AI Foundry Provisioned Throughput — Provisioned capacity units (PTUs) and when dedicated throughput replaces standard deployments.
OCI Generative AI Overview — Hosted models, dedicated AI clusters, and OpenAI-compatible API entry points.
OCI Generative AI Dedicated AI Clusters — Isolated compute for fine-tuning and hosting custom models away from multi-tenant throttling.
Model Context Protocol Specification — Open standard for tool and context integration across AI platforms and agent harnesses.
AGENTS.md Convention — Cross-tool documentation convention for agent behavior, complementary to vendor-specific rule files.
Large-scale cluster management at Google with Borg — Verma et al., EuroSys 2015; foundational reading on cluster resource management, slack resources, and over-commitment that informs capacity thinking even when GPUs are provider-managed.
Kubernetes Horizontal Pod Autoscaling — Official documentation for HPA v2 behavior configuration including asymmetric scale-up and scale-down policies.