Module 3.16: Azure SQL Database — Operator Path
Цей контент ще не доступний вашою мовою.
Complexity: [COMPLEX]
Time to Complete: 90-120 min
Prerequisites: 3.1-entra-id, 3.2-vnet, 3.9-key-vault, 3.10-monitor
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Compare Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure VM by control boundary, scaling unit, networking model, restore granularity, and feature fit.
- Design an Azure SQL Database compute and storage model that separates DTU, vCore, General Purpose, Business Critical, Hyperscale, Serverless, replicas, and IO limits.
- Diagnose availability and recovery incidents by selecting the right combination of built-in HA, zone redundancy, failover groups, active geo-replication, PITR, geo-restore, and LTR.
- Implement operator-grade identity, networking, encryption, audit, and Defender controls without turning the module into a T-SQL internals course.
- Evaluate Azure SQL Database cost and observability signals so you can explain slow queries, throttling, restore risk, and monthly spend before the ticket reaches a DBA.
Why This Module Matters
Section titled “Why This Module Matters”Hypothetical scenario: an application team opens an incident because checkout latency tripled after a release, the database CPU chart is at the top of the graph, and the only person on call is a platform engineer rather than a DBA. The service owner asks whether to scale the database, fail over to the secondary region, roll back the application, change a firewall rule, or call the database team. The wrong first move can turn a contained performance incident into an expensive outage because Azure SQL Database combines database-engine behavior, cloud networking, identity, backup policy, and capacity billing in one managed surface. The operator does not need to tune every index from memory, but the operator does need to know which Azure SQL product is running, which tier limits it, which metric proves throttling, which security control blocks a client, and which restore path meets the promised RPO.
Azure SQL is not one service with one operating model. Microsoft uses the Azure SQL family name for Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure VM, each with a different line between what Azure operates and what your team operates. Azure SQL Database is a database-as-a-service for single databases and elastic pools, Azure SQL Managed Instance is closer to an instance-scoped SQL Server migration target, and SQL Server on Azure VM gives you OS-level control with IaaS responsibilities. Operators who collapse those three into “SQL in Azure” usually discover the difference during incidents, especially when they ask for SQL Agent, cross-database behavior, custom networking, full OS control, or a restore workflow that the chosen service does not support.
This module teaches the daily operator path. You will compare the managed SQL choices, choose compute and storage models, reason about HA and disaster recovery, configure identity and network access, model cost, read Azure Monitor and Query Performance Insight, run targeted SQL diagnostic queries, and perform a point-in-time restore in a hands-on exercise. The module deliberately stays out of deep T-SQL internals because the audience is a platform, SRE, or DevOps engineer responsible for production reliability. When a problem crosses into schema design, query plans, statistics, isolation levels, or application data modeling, you will learn when to call a DBA and what evidence to bring.
1. What Azure SQL Actually Means Today
Section titled “1. What Azure SQL Actually Means Today”The first operator question is not “how do I connect?” The first question is “which Azure SQL product am I operating?” Azure SQL Database, Azure SQL Managed Instance, and SQL Server on Azure VM all use the SQL Server engine lineage, but they differ in control boundary, deployment shape, networking, scale behavior, restore model, and feature surface. Microsoft’s Azure SQL overview describes the family as managed products built on SQL Server technology, with SQL Database and Managed Instance as PaaS options and SQL Server on Azure VM as IaaS Azure SQL overview.
Azure SQL Database is the most managed option. You deploy a single database or an elastic pool behind a logical server, choose a purchasing model and service tier, and let Azure handle engine patching, backups, high availability mechanics, and most infrastructure work. The logical server is not a SQL Server instance in the traditional DBA sense. It is an Azure management boundary for databases, firewall rules, identity configuration, private endpoints, auditing settings, and failover group participation.
Azure SQL Managed Instance is the migration-friendly PaaS option. It gives you an instance-scoped surface that is much closer to SQL Server, including broader compatibility for instance-level features, native VNet placement, and migration patterns that assume existing SQL Server applications. It reduces operating-system and engine maintenance, but it has a bigger deployment footprint and a different networking model than Azure SQL Database. Operators usually choose it when the application is too instance-shaped for Azure SQL Database but the team still wants PaaS patching, automated backups, and managed availability.
SQL Server on Azure VM is IaaS. You run SQL Server on a Windows or Linux VM, choose the SQL Server version and edition, configure disks, patch the OS, decide how backups run, and implement SQL Server HA patterns such as availability groups when the application requires them. Azure can help with the SQL IaaS Agent extension, automated backup, patching windows, and portal integration, but the control and responsibility boundary is fundamentally different. This is the choice when legacy features, third-party agents, file-system access, kernel-level control, or strict SQL Server compatibility matter more than PaaS simplicity.
Azure Database for PostgreSQL and Azure Database for MySQL are not Azure SQL Database variants. They are managed relational database services for different engines, drivers, extensions, operational metrics, backup semantics, and tuning habits. An operator may run them on the same platform and monitor them through Azure Monitor, but this module does not cover their engine behavior. Point teams that need PostgreSQL or MySQL toward the managed open-source database tracks instead of trying to transfer SQL Server assumptions across engines.
+----------------------+--------------------------+---------------------------+| Azure SQL family | Best mental model | Operator owns |+----------------------+--------------------------+---------------------------+| SQL Database | Managed database | DB settings, access, tier || Managed Instance | Managed SQL instance | Instance config, access || SQL Server on VM | SQL Server on Azure IaaS | OS, disks, SQL, HA, patch |+----------------------+--------------------------+---------------------------+| Operator comparison | Azure SQL Database | Azure SQL Managed Instance | SQL Server on Azure VM |
|---|---|---|---|
| What you control | Database settings, logical server policy, firewall, private endpoint, tier, replicas, backup retention, users, schema, query tuning | Instance configuration, VNet placement, databases, logins, jobs, policies, service tier, backup retention | VM size, OS, SQL version, disks, patching, backup tooling, HA topology, agents, registry, file system |
| What Azure controls | Hardware, OS, SQL engine patching, most HA, automated backups, storage abstraction | Hardware, OS patching, SQL engine patching, built-in HA, automated backups, platform lifecycle | Azure controls VM host fabric; your team controls guest OS and SQL Server stack |
| Scaling unit | Database, elastic pool, or Hyperscale replica | Managed instance or instance pool | VM, disk layout, SQL Server configuration, availability group replicas |
| Deployment time | Usually minutes for a small database | Commonly longer because the instance is network-integrated and heavier | VM provisioning plus SQL configuration time |
| Networking | Public endpoint, firewall rules, service endpoints, private endpoints | Native VNet integration and private addressing | Full VM networking, NSGs, load balancers, private IPs, custom routing |
| Restore granularity | Database-level PITR, geo-restore, deleted database restore, LTR restore to new DB | Database-level restore inside managed instance patterns | Whatever your backup design supports, often database and instance-level workflows |
| Max DB size | Tier-dependent; Hyperscale reaches the largest single-database size | Tier-dependent managed-instance limits | Limited by SQL edition, disk design, VM limits, and operational design |
| Eligible features | Most database-level features, no full instance control | Many instance-level SQL Server features without OS control | Full SQL Server surface when you install and operate it |
Worked example: a team is moving a vendor application that uses SQL Agent jobs, cross-database queries, CLR integration, and linked-server assumptions. Azure SQL Database may look attractive because it is the most managed option, but those instance-level assumptions are the warning sign. Managed Instance is the first PaaS candidate because it preserves more SQL Server instance behavior while still removing OS and engine patching from the team. SQL Server on Azure VM remains the fallback when the vendor requires OS access, unsupported features, custom agents, or a certified deployment matrix that excludes PaaS.
Now you solve it: your team owns a new SaaS API with one database per tenant, modest data size, no SQL Agent, and a requirement to isolate noisy tenants later. Would you start with Azure SQL Database single databases, an elastic pool, Managed Instance, or SQL Server on Azure VM? Write down the unit you want to scale and the unit you want to restore before choosing, because those two answers usually expose the right managed surface.
Pause and predict: If a production incident says “the Azure SQL server is down,” what three questions should you ask before any remediation command?
The strongest first questions are: which Azure SQL product, which database or instance, and which connection path. “Server” can mean a logical server for Azure SQL Database, a managed instance, or a VM host, and each answer changes the runbook. A firewall rule on a logical server will not fix a VM disk saturation incident, and resizing a VM will not help a serverless Azure SQL Database that is resuming from pause. Clear naming in dashboards, alerts, and runbooks prevents the team from debugging the wrong abstraction.
2. Compute and Storage Models
Section titled “2. Compute and Storage Models”Azure SQL Database has two major purchasing models: DTU and vCore. The DTU model bundles CPU, memory, reads, and writes into database transaction units, with Basic, Standard, and Premium tiers DTU purchasing model. The vCore model exposes logical CPU, memory, hardware family, storage, backup storage, and service tier more explicitly, and Microsoft recommends it for flexibility, transparency, and Azure Hybrid Benefit alignment purchasing models. Operators should understand both because many small or older databases still run on DTU, while production architecture reviews usually land on vCore.
DTU is useful when the workload is simple, small, and not worth detailed sizing. If a development database or low-traffic internal app fits comfortably in Standard and has no licensing optimization requirement, DTU can be a practical way to buy a bundle without arguing about vCores. The danger is that DTU hides which resource is saturated. When a DTU database throttles, the operator still has to look at CPU percentage, data IO percentage, log IO percentage, workers, sessions, waits, and query behavior to know whether more bundled DTUs are the right fix.
vCore is the operator-friendly model for serious production. General Purpose, Business Critical, and Hyperscale are not just price tiers. They describe different storage architectures, IO ceilings, failover behavior, read-scale options, and maximum database sizes. The service tier is an availability and performance decision as much as a cost decision.
General Purpose is the balanced default for many production systems. It uses remote storage, keeps cost lower, and works well when latency requirements are normal and the app can tolerate the tier’s IO limits. Business Critical is for low-latency OLTP and higher transaction rates, with local SSD-backed storage and multiple replicas behind the service tier. Hyperscale separates compute and storage so large databases can grow far beyond the ordinary single-database storage ceiling and can scale compute and replicas differently.
| Model | Operator reads it as | Best fit | Watch for |
|---|---|---|---|
| DTU | Bundled CPU, memory, read, and write budget | Small predictable apps, simple billing, older deployments | Hidden bottleneck type, coarse migration planning, bundled storage assumptions |
| vCore General Purpose | Explicit compute plus remote storage | Most ordinary production databases | Data IO and log IO throttling, remote storage latency, under-sized compute |
| vCore Business Critical | Higher-cost low-latency tier with replicas | Write-heavy OLTP, lower IO latency, read scale-out needs | Higher compute cost, local SSD capacity, replica behavior during failover |
| vCore Hyperscale | Cloud-native storage and compute split | Very large DBs, fast scale, many read replicas | Replica billing, page-server behavior, unsupported edge cases, migration planning |
| Serverless | Autoscaling vCore range with possible pause | Intermittent single DB workloads | Resume delay, min vCore floor, memory trimming, unpredictable warm-up |
Provisioned compute means you choose a fixed amount of compute and pay for it while it exists. That is the right default for steady production workloads because capacity is ready when traffic arrives. Serverless means you configure a minimum and maximum vCore range, and the database can automatically scale within that range while billing for used compute; General Purpose serverless can also auto-pause so only storage is billed during inactivity serverless compute tier. Serverless is compelling for intermittent workloads, but it is not a free production autoscaler when resume latency or cold memory affects user traffic.
Hyperscale deserves its own mental model. Traditional database storage often feels like one engine process managing local data files, but Hyperscale splits the work into compute nodes, page servers, a log service, and Azure Storage Hyperscale architecture. The primary compute replica handles read-write workload, page servers own ranges of database pages and keep SSD caches, the log service processes transaction log records, and durable page storage sits in Azure Storage. That design is why Hyperscale can support very large databases, fast snapshot-style backups, rapid scale operations, and named replicas for read-heavy patterns.
+--------------------------- Hyperscale database ---------------------------+| || Client writes || | || v || +-------------------+ +-------------------+ || | Primary compute | -----> | Log service | || | SQL engine | | Durable log path | || +-------------------+ +-------------------+ || | | | | || v v v v || +----------------+ +----------------+ +----------------+ || | Page server A | | Page server B | | Page server C | || | Page ranges | | Page ranges | | Page ranges | || +----------------+ +----------------+ +----------------+ || | | | || v v v || +------------------------------------------------------------------+ || | Azure Storage durable page storage | || +------------------------------------------------------------------+ || || Read replicas attach to the same storage architecture for scale-out. |+--------------------------------------------------------------------------+Read replicas are not one universal
feature with one billing model.
Business Critical includes a
read-scale secondary pattern that can
serve read-only workloads with
ApplicationIntent=ReadOnly.
Hyperscale can add HA replicas and
named replicas, and those replicas
have cost and resource implications.
Active geo-replication creates
readable secondary databases in
another region for Azure SQL Database,
while failover groups wrap one or more
databases with listener endpoints and
failover policy.
When a workload exceeds an SKU’s IO limit, Azure SQL Database does not become magically faster because the query is important. Throughput is governed, operations wait, latency rises, and clients may see timeouts or retry storms. On DTU, the symptom may be high DTU percentage with one of CPU, data IO, or log IO driving the blended limit. On vCore, the operator can usually see the pressure more directly through CPU percentage, data IO percentage, log IO percentage, workers percentage, sessions percentage, waits, and Query Performance Insight.
Worked example: a service runs on General Purpose with 2 vCores and shows CPU at moderate levels, data IO near the limit, and log IO spiking during batch imports. Scaling to 4 vCores may help if the tier gives more IO with more vCores, but the real question is whether the workload is write-log limited, data-read limited, or query-plan inefficient. If the import is business-critical and latency-sensitive, Business Critical may be the right tier because the storage architecture changes. If the database is already several terabytes and restore or scale time dominates the decision, Hyperscale may be the better architecture even if the immediate symptom is IO.
Now you solve it: a database is slow only during a nightly import, CPU remains below half of the limit, log IO reaches the top of the chart, and client retries amplify the problem. Would you first raise vCores, reduce batch size, move to Business Critical, change application retry policy, or call a DBA for indexing and transaction design? The operator answer is to reduce blast radius and collect proof: cap or schedule the import, confirm log IO saturation, check waits, then decide whether tier movement or database design is the durable fix.
3. High Availability and Disaster Recovery
Section titled “3. High Availability and Disaster Recovery”Azure SQL Database gives you built-in high availability before you configure any cross-region disaster recovery. The service is designed so Azure can patch and recover infrastructure without asking you to maintain SQL Server failover clustering. That built-in HA protects against many local infrastructure failures, but it is not the same as a cross-region recovery plan. Operators must separate local high availability, zone redundancy, geo-replication, failover groups, automatic backups, point-in-time restore, geo-restore, and long-term retention.
Zone redundancy is a regional availability design. For supported tiers and regions, Azure can place database replicas across availability zones so a zone failure does not imply a database outage. Business Critical and Hyperscale are the common production tiers where operators discuss zone-redundant HA, while General Purpose zone redundancy is also available in supported vCore configurations. The decision belongs in the same conversation as application zone architecture, private endpoint DNS, regional dependencies, and whether the app is ready for transient connection drops.
Backups solve a different problem. Automated backups support point-in-time restore for short-term retention, with retention generally configured in the 7 to 35 day range for most production tiers and lower limits for some small DTU choices automated backups. Long-term retention copies selected full backups into long-retention storage for compliance and can retain backups for up to 10 years long-term retention. Backups are for data recovery after deletion, corruption, bad deployments, bad writes, and compliance evidence; they are not a low-latency failover mechanism.
Point-in-time restore creates a new database from backups at a selected timestamp. It is the operator’s path after a mistaken migration, accidental delete, application bug, or data corruption event where the original database should not be overwritten blindly. Geo-restore creates a database from geo-replicated backups in another region when the source region is unavailable or when you need regional recovery without preconfigured active geo-replication. Restore workflows should be rehearsed because permission, naming, firewall, identity, connection-string, and DNS assumptions often fail during a real incident.
Active geo-replication is per-database asynchronous replication to readable secondary databases. It can support regional redundancy and manual geo-failover for individual databases, but applications must understand the connection path and possible data loss because replication is asynchronous active geo-replication. Failover groups build on geo-replication and manage one or more databases as a group with stable listener endpoints for read-write and read-only connections failover groups. For operators, failover groups are usually the safer application abstraction because the connection string can point at a listener instead of a specific regional server.
| Recovery question | Use this feature | Operator warning |
|---|---|---|
| A zone has infrastructure trouble, but the region is alive | Zone-redundant HA where supported | The application still needs retry logic for transient disconnects |
| A deployment wrote bad data ten minutes ago | PITR to a new database | Restoring over the original is rarely the first safe move |
| A region is unavailable and no secondary exists | Geo-restore from geo-redundant backups | RTO is longer and capacity in target region is not guaranteed instantly |
| One important database needs a readable remote secondary | Active geo-replication | Failover is database-scoped and connection strings need planning |
| Several app databases must fail over together | Failover group | Secondary server security, firewall, identity, and DNS must be ready |
| Compliance requires historical backups beyond PITR | LTR | Retention can quietly become a large bill if no one owns policy |
flowchart TD A[Incident asks for recovery] --> B{Is the region healthy?} B -->|Yes| C{Is data logically damaged?} C -->|Yes| D[PITR to a new database, validate rows, repair or redirect] C -->|No| E{Is this capacity or local HA?} E -->|Capacity| F[Scale tier, reduce load, or fix query path] E -->|HA event| G[Let built-in HA recover, monitor retries, inspect service health] B -->|No| H{Was a failover group prepared?} H -->|Yes| I[Fail over group, validate listener, confirm app writes] H -->|No| J{Are geo-redundant backups available?} J -->|Yes| K[Geo-restore to target region, update app connection path] J -->|No| L[Escalate business continuity gap and restore from available exports]RPO is the amount of data loss the business can tolerate. RTO is the amount of time the business can tolerate before the service is usable again. Those numbers should drive the feature choice, not the other way around. If the RPO is near zero and RTO is minutes, relying only on geo-restore is not a serious design; if the workload is internal and can wait, a failover group may be unnecessary cost and complexity.
Worked example: a customer-facing API promises a 15 minute RTO and accepts up to 5 seconds of data loss during a regional disaster. PITR alone cannot meet that RTO because a restore creates a new database and requires application reconnection work. Geo-restore is also too slow and too manual for the promise. The operator should design a failover group, test planned failover, verify the read-write listener, keep security settings aligned on the secondary server, and document the expected data-loss window from asynchronous replication.
Pause and predict: A failover group exists, but after failover the app cannot connect from its private subnet in the secondary region. Which resource is most likely missing: the database, the login, the private DNS link, or the backup retention policy?
The most likely missing piece is the network path, especially private DNS and private endpoint alignment in the target region. Authentication and database existence can also fail, but private endpoint designs are regional and DNS-dependent. Good DR runbooks include a secondary server access checklist: Entra admin, contained users, firewall state, private endpoint, private DNS zone links, application connection string, Key Vault access, monitoring, and audit destination. Do not treat failover as only a database command.
4. Identity, Networking, and Security
Section titled “4. Identity, Networking, and Security”Azure SQL Database access has two planes. The Azure control plane creates servers, databases, private endpoints, firewall rules, identities, diagnostic settings, Defender plans, and backup policies through Azure Resource Manager. The SQL data plane accepts database connections, authenticates users or applications, enforces permissions, runs queries, writes audit events, and exposes DMVs. Operators need both planes because a user can have Azure Contributor permission and still be unable to query a database, or have database permission and still be blocked by networking.
Microsoft Entra authentication is the preferred identity direction for humans and services because it centralizes identity, conditional access, group management, and service principal patterns configure Microsoft Entra authentication. An Azure SQL logical server can have a Microsoft Entra administrator, and that administrator can create contained database users from external provider identities. Managed identities can connect to Azure SQL using Entra authentication when the database has a corresponding user and permission set managed identities. This is the clean operator pattern for App Service, Functions, AKS workload identity, and automation jobs because it avoids long-lived SQL passwords in application settings.
-- Run as the Microsoft Entra administrator in the target database.CREATE USER [app-orders-prod] FROM EXTERNAL PROVIDER;ALTER ROLE db_datareader ADD MEMBER [app-orders-prod];ALTER ROLE db_datawriter ADD MEMBER [app-orders-prod];That snippet is intentionally small. The operator decision is not “grant db_owner so the app works.” The operator decision is to map an Entra identity to the least database roles the application needs, then make the application use managed identity in its connection string. If the app needs schema migrations, use a separate deployment identity with controlled permissions rather than giving the runtime identity ownership of the database.
Networking is a separate gate. Azure SQL Database can expose a public endpoint guarded by IP firewall rules, use virtual network service endpoints in older patterns, or use private endpoints through Azure Private Link private endpoint overview. Server-level firewall rules apply to all databases on the logical server, while database-level firewall rules are created through T-SQL and apply only to one database firewall rules. Private endpoints are usually the production target for private workloads, but they add DNS and routing responsibilities that must be tested from every caller network.
| Access pattern | When it is reasonable | Operator risk |
|---|---|---|
| Public endpoint plus narrow client IP firewall | Developer workstation, controlled admin jump host, short-lived lab | Dynamic IP changes, overly broad ranges, false confidence from “Allow Azure services” |
| Service endpoint | Existing VNet-based access pattern where supported | Less preferred than Private Link for many new designs |
| Private endpoint | Production private workloads and hub-spoke networks | DNS misconfiguration, regional failover gaps, split-horizon surprises |
| Database-level firewall rule | Rare per-database exception | Harder inventory because it is not managed like server-level firewall |
| Allow Azure services toggle | Emergency or legacy integration only | Broad Azure-origin access, poor least-privilege story |
Encryption controls protect different layers. Transparent Data Encryption protects data at rest by encrypting database files and backups, and customer-managed TDE keys let the customer control the TDE protector through Azure Key Vault or Managed HSM TDE with customer-managed keys. Always Encrypted protects selected sensitive columns so encryption keys are not exposed to the database engine in ordinary operation Always Encrypted. Dynamic Data Masking and Row-Level Security are data-access controls, not substitutes for permission design, and they require application and DBA review before production rollout.
Always Encrypted is a great example of an operator-DBA boundary. The operator can make sure Key Vault access, managed identity, deployment pipeline permissions, and driver support exist. The DBA and application team must decide which columns can be encrypted without breaking query patterns, indexing assumptions, sorting, equality searches, or operational support workflows. If the operator turns it on as a checkbox without application design, the incident may be a broken release rather than a safer database.
Auditing and Defender answer different questions. Azure SQL auditing tracks database events and can write audit logs to Azure Storage, Log Analytics, or Event Hubs SQL auditing. Microsoft Defender for SQL adds vulnerability assessment and advanced threat protection, including alerts for possible SQL injection and anomalous access patterns Defender for SQL. Auditing is the durable trail; Defender is detection and recommendation; neither is a replacement for least-privilege identity, private networking, secure coding, or patch discipline.
Hypothetical scenario: Defender raises a possible SQL injection alert after a new search endpoint goes live. The operator should not dismiss it because the service is behind private networking; SQL injection is an application input problem, not only an internet exposure problem. The first response is to preserve the alert, identify the database and principal, check audit logs and application logs around the timestamp, confirm the endpoint and query pattern, and pull in the application team. If the alert matches expected synthetic testing, tune the alert process; if it matches untrusted input reaching dynamic SQL, escalate as a security incident.
5. Observability and Day-Two Operations
Section titled “5. Observability and Day-Two Operations”The operator’s first slow-database screen is Azure Monitor, not a query-plan deep dive. Azure SQL Database exposes metrics such as CPU percentage, Data IO percentage, Log IO percentage, Workers percentage, Sessions percentage, DTU percentage for DTU databases, storage percentage, successful connections, failed connections, and blocked connections monitoring metrics. These metrics tell you whether the database is near a service-tier limit, whether the incident is connection-related, whether worker or session limits are involved, and whether storage is becoming a capacity risk. They do not tell you by themselves which query or schema design caused the pressure.
Query Performance Insight is the operator’s first stop for “why slow?” It uses Query Store data to show top resource-consuming and long-running queries, CPU, duration, execution count, query text where permitted, and performance recommendations Query Performance Insight. The point is not that the operator becomes the query tuner. The point is that the operator can identify whether one query, many chatty queries, blocking, or a tier limit is the dominant shape before escalating with useful evidence.
Extended Events and automatic tuning recommendations live one layer deeper. Extended Events can capture targeted diagnostic events with lower overhead than older tracing habits, but it still requires discipline because “capture everything” is not an incident strategy. Automatic tuning recommendations can suggest index creation, index drops, and plan correction, but production teams should treat automated changes as policy decisions with rollback awareness. For many teams, the operator collects recommendations, checks whether automatic tuning is enabled, and brings the evidence to the DBA or data platform owner.
AzureMetrics| where ResourceProvider == "MICROSOFT.SQL"| where ResourceId has "/servers/" and ResourceId has "/databases/"| where MetricName in ("cpu_percent", "dtu_consumption_percent", "data_io_percent", "log_write_percent")| summarize max_value = max(Maximum) by MetricName, bin(TimeGenerated, 5m), ResourceId| order by TimeGenerated descThat KQL query is the Azure Monitor
metrics view of the same pressure
operators often inspect with
sys.dm_db_resource_stats. The DMV
itself is still useful when you are
connected to the database and need a
quick recent view. It returns CPU, IO,
and memory consumption rows for recent
intervals, with values expressed as
percentages of service-tier limits.
Use it like a database-local top
command, not as a permanent telemetry
store.
SELECT TOP (20) end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent, max_worker_percent, max_session_percentFROM sys.dm_db_resource_statsORDER BY end_time DESC;Database waits are the next operator
clue. The DMV sys.dm_db_wait_stats
shows completed waits for the current
database, and high lock waits, IO
waits, log waits, or worker-related
waits can point the investigation in
the right direction. The counters
reset after failover or database
movement, so do not compare them as if
they were a forever ledger. Use waits
to shape the next question: blocking,
IO, log pressure, worker starvation,
or application round trips.
SELECT TOP (15) wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_msFROM sys.dm_db_wait_statsWHERE wait_type NOT LIKE 'SLEEP%'ORDER BY wait_time_ms DESC;For a more telemetry-native KQL
approach, Database Watcher can stream
Azure SQL monitoring datasets into
Azure Data Explorer or Fabric
Real-Time Analytics. Its SQL database
datasets include resource utilization,
query wait statistics, and wait
statistics tables such as
sqldb_database_resource_utilization,
sqldb_database_query_wait_stats, and
sqldb_database_wait_stats database
watcher
datasets.
That makes KQL the right place for
fleet-level questions that do not
belong in one database connection. The
sample below assumes Database Watcher
is configured and data is landing in a
KQL database.
sqldb_database_resource_utilization| where sample_time_utc > ago(2h)| where logical_server_name == "prod-sql-eastus"| where database_name == "orders"| summarize cpu=max(avg_cpu_percent), data_io=max(avg_data_io_percent), log_io=max(avg_log_write_percent), workers=max(max_worker_percent) by bin(sample_time_utc, 5m)| render timechartsqldb_database_wait_stats| where sample_time_utc > ago(2h)| where logical_server_name == "prod-sql-eastus"| where database_name == "orders"| summarize wait_ms=max(wait_time_ms) by wait_type, bin(sample_time_utc, 5m)| top 10 by wait_ms descDefender alerts add security context to day-two operations. A possible SQL injection alert asks whether untrusted input reached a dangerous query pattern. An unusual location or unfamiliar principal alert asks whether access behavior changed because of a real deployment, a compromised credential, or a misconfigured integration. Operators should route these alerts to the same incident discipline as infrastructure alerts: preserve evidence, identify scope, correlate with deployment and identity changes, and escalate with a precise timeline.
Call in a DBA when the investigation points to query plans, indexing, locking design, isolation levels, transaction shape, schema design, statistics, parameter sensitivity, or data distribution. Handle inline when the issue is firewall, private endpoint DNS, connection-string target, tier saturation, runaway non-production workload, backup retention policy, diagnostic setting, identity mapping, or a known platform failover. The boundary is not ego; it is risk management. A good operator brings the DBA the timeline, metrics, waits, top queries, recent deployments, tier, connection path, and user impact instead of a vague “SQL is slow” ticket.
Patterns and Anti-Patterns
Section titled “Patterns and Anti-Patterns”Patterns turn managed database operations into repeatable decisions. The goal is not to create a perfect database platform on day one. The goal is to make the common failure paths boring: access changes are reviewed, failover has been rehearsed, capacity graphs have owners, backup retention has a business reason, and cost increases can be explained before finance opens a surprise ticket. Use the patterns below as production defaults unless your workload gives you a concrete reason to diverge.
| Pattern | When to use it | Why it works | Scaling considerations |
|---|---|---|---|
| Logical server per environment boundary | Separate production, staging, and development blast radius | Firewall, Entra admin, audit, private endpoint, and failover policies stay easier to reason about | More servers means more policy automation and naming discipline |
| Managed identity for application access | Azure-hosted workloads that support Entra authentication | Removes long-lived SQL passwords from app settings and rotates trust through identity platform | Requires database users, driver support, and deployment identity separation |
| Private endpoint plus tested DNS | Private production workloads in VNets | Keeps traffic on private addressing and removes broad public exposure | Every region, spoke, resolver, and failover path needs validation |
| Failover group for multi-database apps | Apps with several databases that must move together | Listener endpoint avoids per-database connection-string rewrites | Secondary server access, cost, lag, and failover tests become recurring operations |
| Query Performance Insight before scaling | Slow queries or capacity alarms with unclear root cause | Shows whether one query, many queries, or tier limits dominate | Query Store retention and permissions must be healthy |
| PITR to new database before repair | Bad write, bad migration, accidental delete | Preserves evidence and gives a safe comparison target | Application repair scripts and data reconciliation may need DBA ownership |
Anti-patterns are usually attractive because they work once. A broad firewall rule fixes the developer demo. Scaling up hides a bad import for a week. One shared SQL admin password gets the application online. Each of those shortcuts transfers risk to the next incident, where the team has less context and more pressure.
| Anti-pattern | What goes wrong | Better alternative |
|---|---|---|
| Treating the logical server as a full SQL Server instance | Operators expect instance features that Azure SQL Database does not provide | Choose Managed Instance or SQL Server on VM when instance scope is required |
| Scaling before reading IO and query signals | Cost rises while the bottleneck remains application or schema-driven | Check Azure Monitor, Query Performance Insight, and waits before resizing |
| Leaving public access broad because private endpoints are hard | Attack surface and accidental access grow quietly | Use private endpoints for production and narrow temporary public rules for admin access |
| Enabling LTR without an owner | Backup storage becomes a compliance-cost trap | Attach retention to a named policy, audit cadence, and deletion process |
| Using SQL admin credentials in applications | Password rotation and blast radius become painful | Use managed identity and separate deployment identities |
| Building failover groups but never testing them | DNS, identity, and firewall gaps appear during the disaster | Run planned failover tests and record exact application validation steps |
Decision Framework
Section titled “Decision Framework”Start with the operational ownership model. If the team wants a managed database and the application fits database-scoped features, Azure SQL Database is the default. If the application is SQL Server instance-shaped and needs broader compatibility without OS ownership, Managed Instance is the default migration candidate. If the application requires full SQL Server control, a certified VM deployment, custom agents, unsupported instance features, or OS access, SQL Server on Azure VM is the honest choice.
flowchart TD A[Choose Azure SQL landing zone] --> B{Need OS or full SQL Server control?} B -->|Yes| C[SQL Server on Azure VM] B -->|No| D{Need broad instance-level compatibility?} D -->|Yes| E[Azure SQL Managed Instance] D -->|No| F{Single database or SaaS database pattern?} F -->|Yes| G[Azure SQL Database] F -->|No| H{Many small variable databases?} H -->|Yes| I[Azure SQL Database elastic pool] H -->|No| J[Re-check requirements with DBA and app team]Choose DTU only when simplicity is the real goal. DTU is fine for small, predictable, low-risk databases where a bundled capacity choice is easier than independent compute and storage sizing. Move to vCore when you need transparent CPU and storage sizing, Azure Hybrid Benefit, reserved capacity, serious production modeling, Business Critical, Hyperscale, or Serverless. Do not let “DTU is simpler” become “we cannot explain why this database throttles.”
Choose Serverless when the workload is intermittent and can tolerate resume behavior. Development, test, occasional reporting, low-traffic internal tools, and unpredictable new databases are good candidates. Steady user-facing APIs, latency-sensitive services, and workloads with constant background activity usually belong on provisioned compute. Serverless saves money when idle periods are real; it disappoints when the database never pauses or when cold-start behavior violates the user experience.
Choose Hyperscale when the architecture solves a real problem. Very large databases, fast scale up or down independent of data size, rapid snapshot-style backups, and read-scale patterns are strong reasons. Do not choose Hyperscale only because the name sounds safer. Its replicas and storage model change the cost surface, and some migration or feature limitations may matter for legacy applications.
| Decision | Choose this | Avoid this |
|---|---|---|
| New cloud app with one operational database | Azure SQL Database vCore General Purpose | Managed Instance just because it feels familiar |
| Existing SQL Server app with instance features | Azure SQL Managed Instance | Rewriting the app around single-database constraints under time pressure |
| Vendor app requiring OS access | SQL Server on Azure VM | Unsupported PaaS deployment that breaks vendor support |
| Small predictable internal app | DTU Standard or small vCore | Hyperscale or Business Critical without evidence |
| Intermittent dev or test database | Serverless General Purpose | Provisioned compute that runs idle all month |
| Very large single database with scale pressure | Hyperscale | Repeatedly raising General Purpose size without a restore and scale plan |
| Low-latency OLTP with high write rate | Business Critical | General Purpose if IO latency is the recurring incident |
| Legacy on-prem pattern with strict control | SQL Server on Azure VM | PaaS if kernel, agent, file, or edition control is mandatory |
Worked example: a customer database is 100 GB, serves a steady API, has ordinary latency needs, and peaks during business hours. General Purpose vCore is a sensible first design because the workload is not huge, not obviously low-latency OLTP, and not intermittent enough for Serverless. If the team later sees sustained log IO pressure and low-latency requirements, Business Critical becomes a performance architecture discussion rather than a panic scale button. If the database grows into many terabytes and restore time becomes a product risk, Hyperscale becomes a storage architecture discussion.
Now you solve it: a reporting workload runs for two hours each weekday, can wait after idle periods, and is inactive most nights and weekends. Would Serverless or Provisioned compute be the first cost model you test? The correct operator instinct is to test Serverless with a realistic resume and query schedule, then compare the bill against provisioned compute plus any scheduling or pause strategy the team can actually operate.
Cost Lens
Section titled “Cost Lens”Azure SQL Database cost is a design input, not a finance afterthought. For DTU, you pay for the chosen DTU service objective and bundled storage behavior. For vCore, the main bill is compute hours, provisioned or allocated data storage, backup storage beyond the included amount, optional long-term retention, replicas, Defender, and network or monitoring costs around the database. Reserved capacity, Dev/Test pricing, and Azure Hybrid Benefit can materially change the compute line, but they do not erase poor tier selection or unlimited retention.
The vCore pricing surface is usually easiest to explain as a formula. Provisioned compute is vCore-hour rate multiplied by hours in the month. Data storage is GB-month multiplied by the selected tier’s storage rate. PITR backup storage has an included amount tied to database size, with excess backup storage charged by GB-month, while LTR storage is separate. Geo-replication and failover groups add secondary compute and storage because the secondary database is a real database, not a free checkbox.
Hyperscale changes the mental model. Compute is charged per replica, storage is charged for allocated data storage that grows in increments, and additional replicas change the monthly number. Page servers and Azure Storage are part of the architecture, so the operator should read Hyperscale cost as “primary compute plus replica compute plus allocated storage plus backup storage,” not as “one database price.” Hyperscale can be cost-effective for the right workload, but it punishes vague replica habits.
Reserved capacity is a commitment discount for steady production. One-year and three-year reservations can reduce compute cost when the database is expected to run continuously. Azure Hybrid Benefit can apply existing SQL Server licenses to eligible vCore Azure SQL Database deployments, reducing compute cost when licensing terms are met. Dev/Test pricing can reduce non-production cost for eligible subscriptions, but it should not be used to hide production workloads in the wrong subscription.
Worked pricing example, using East US Pay-As-You-Go retail prices sampled from the Azure pricing page and Retail Prices API on May 19, 2026. Assume a 730-hour month, 100 GB maximum data size, no LTR, and PITR backup consumption within the included 100 percent database-size allowance. The numbers are rounded and should be recalculated in the Azure Pricing Calculator for your region, currency, reservation, licensing, backup redundancy, and support plan. The example compares General Purpose and Business Critical because the tier choice is often the biggest operator-controlled difference.
| Example shape | Compute assumption | Compute monthly | Data storage monthly | PITR backup monthly | Approximate total |
|---|---|---|---|---|---|
| General Purpose, 8 vCore, provisioned (Standard-series hardware) | $1.217736 per hour | $889.95 | $11.50 | $0.00 when within included allowance | $901.45 |
| Business Critical, 8 vCore, provisioned (Standard-series hardware) | $2.43548 per hour | $1,778.90 | $25.00 | $0.00 when within included allowance | $1,803.90 |
Zone-redundant configuration (where supported, including General Purpose and Business Critical) adds a premium of roughly 20-30% on the compute meter to account for the redundant deployment across availability zones. Use the Azure Pricing Calculator to model the exact ZR rate for your chosen hardware family — published per-hour rates vary across Standard-series, Fsv2-series, and DC-series options and the public meter names change as new hardware families are introduced.
Read the table above carefully. It is not a universal quote, and it does not include Defender for SQL, cross-region replicas, excess backup storage, LTR, Log Analytics ingestion, support, private endpoint data processing, or reservations. It shows the operator method: list the tier, compute hours, storage GB-month, backup assumption, and optional features separately. If the bill is surprising, break it back into those lines before changing the database.
Cost spikes usually come from operational shortcuts. An under-sized DTU or vCore database causes throttling, so the team over-provisions to silence alerts instead of fixing query shape or batch timing. Unbounded LTR retains backups for years without a named compliance owner. Cross-region geo-replication doubles much of the database footprint because the secondary exists and runs. Hyperscale replicas multiply compute cost when teams add read replicas without retiring experiments.
Defender for SQL is valuable, but it is still a billable plan. For a small low-risk development database, leaving Defender enabled may be less important than using a cheaper dev/test pattern and stronger subscription policy. For production data, the security signal is often worth the cost, especially when audit logs and incident response are mature enough to use the alerts. The operator decision is not “security costs too much”; it is “which subscriptions and data classes require this control, and who reviews the findings?”
Did You Know?
Section titled “Did You Know?”- Azure SQL Database Serverless bills compute per second when active and can auto-pause in General Purpose, but storage billing continues while the database is paused.
- Long-term retention for Azure SQL Database can retain selected backups for up to 10 years, which is useful for compliance and dangerous when no one owns deletion policy.
- Hyperscale separates compute from storage with page servers and a log service, so backup and scale behavior is not tied to copying one giant local data file.
- In the vCore model, Microsoft documents that Business Critical pricing is higher partly because additional replicas are allocated automatically behind the service tier.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Choosing Managed Instance for every migration | Teams equate “more compatible” with “always safer” | Start with feature requirements, then choose SQL Database when the app is database-scoped |
| Treating DTU percentage as a root cause | DTU is a blended signal, so it hides the dominant constrained resource | Break it into CPU, data IO, log IO, workers, sessions, waits, and top queries |
| Using public firewall exceptions as permanent production access | It is fast during setup and avoids DNS work | Move production callers to private endpoints and expire temporary public rules |
| Forgetting secondary-region security during failover design | The database replication setup succeeds, so teams assume access replicated too | Validate Entra admin, contained users, firewall, private endpoint, DNS, Key Vault, audit, and app settings |
| Enabling LTR with “keep everything” language | Compliance asks for retention and no owner translates it into policy | Map retention to a regulation, owner, review date, and deletion process |
| Scaling up to hide a bad batch job | More vCores reduce symptoms quickly | Schedule, batch, index, or redesign the job; scale only after the bottleneck is understood |
| Giving the runtime app db_owner | It avoids migration and permission errors during release | Split runtime identity from deployment identity and grant least privilege |
| Ignoring Query Store health | Query Performance Insight looks empty, so operators assume no data exists | Confirm Query Store is enabled, has space, and has collected enough workload history |
Your team runs a new checkout API on Azure SQL Database General Purpose. CPU is moderate, log IO is consistently near the service limit during imports, and the app sees timeout spikes. What do you check before scaling?
Start by confirming the pressure shape
with Azure Monitor metrics and
sys.dm_db_resource_stats, then look
at waits and Query Performance Insight
for the import workload. If log IO is
the dominant limit, scaling vCores may
help only if the tier provides more
log throughput at the new size. You
should also check batch size,
transaction scope, retry behavior, and
whether the import should be scheduled
or redesigned. Call in a DBA if
transaction design, indexing, or query
plans are part of the sustained
problem.
A vendor app requires SQL Agent jobs, linked-server behavior, and a certification matrix that names SQL Server on Windows. Which Azure SQL option is the safest starting point and why?
SQL Server on Azure VM is the safest starting point when the vendor requires OS-level or full SQL Server control. Azure SQL Managed Instance may be worth evaluating if the vendor supports it and the feature set fits, but certification language matters in production. Azure SQL Database is likely the wrong first choice because it is database-scoped and does not provide a full SQL Server instance surface. The operator should document which features force IaaS so the decision is not dismissed as preference.
A failover group test succeeds, but the application cannot connect from the secondary region after failover. The database exists and the listener resolves. What is your next investigation path?
Check the secondary-region network and identity path, not only database replication. Private endpoints, private DNS zone links, firewall rules, Entra admin, contained users, Key Vault access, and application configuration all need to work after failover. The listener can point to the new primary while the client subnet still resolves or routes incorrectly. A good failover test validates application writes, reads, secrets, audit, and monitoring, not just the Azure failover operation.
A development database is used a few hours per week and has no latency-sensitive users. The monthly bill is mostly idle compute. Which compute model do you evaluate first?
Evaluate Serverless first because the workload is intermittent and can likely tolerate resume delay. Set a realistic minimum and maximum vCore range, test auto-pause and resume behavior, and compare billed active compute against provisioned compute. If background jobs keep the database active all month, Serverless may not save much. The operator should measure real idle periods rather than assuming the service will pause.
A security team asks you to enable Always Encrypted for customer identifiers by Friday. What is the operator response?
Treat it as a joint application, DBA, and platform change rather than a simple portal toggle. The operator can prepare Key Vault access, managed identity, driver compatibility checks, and deployment permissions, but the application and DBA teams must validate query patterns, indexing, encryption type, and release safety. Always Encrypted changes how the application handles protected columns. Rushing it without a test plan can create an availability incident while trying to solve a confidentiality requirement.
An executive asks why a 100 GB database costs much more after adding geo-replication. How do you explain it?
Geo-replication creates a secondary database that consumes compute and storage, so it is not a free backup copy. The bill can roughly double major database lines depending on tier, region, replicas, backup redundancy, and configuration. Explain the cost in terms of primary compute, secondary compute, data storage, backup storage, and any monitoring or Defender charges. Then tie the added spend to the RPO and RTO that geo-replication or failover groups are meant to satisfy.
An alert shows high workers percentage and blocked connections, but CPU is not saturated. Should the operator immediately move to Business Critical?
No, not immediately. High workers and blocked connections can come from blocking, long-running transactions, connection storms, poor pooling, or workload shape rather than storage architecture. The operator should inspect sessions, waits, Query Performance Insight, application connection behavior, and recent deployments. Business Critical may be appropriate later, but moving tiers before identifying blocking or worker exhaustion can increase cost without fixing the cause.
Hands-On Exercise
Section titled “Hands-On Exercise”In this exercise you will create an
Azure SQL Database, configure access,
load data, generate pressure, observe
metrics, scale compute, and perform
point-in-time restore. The exercise is
designed for a solo operator with an
Azure subscription, Azure CLI,
sqlcmd, and permission to create a
resource group, SQL logical server,
database, firewall rule, and optional
private endpoint. Use a disposable
subscription or lab resource group,
because even short-lived databases can
cost money. Delete the resource group
at the end unless your instructor
tells you to keep it for review.
Setup assumptions
Section titled “Setup assumptions”Use East US in the examples so the commands are concrete. Change the region if your subscription policy requires another location. The private endpoint step is optional because not every learner has an existing VNet; the exercise still works with a narrow client IP firewall rule. For production, private endpoint plus correct DNS should replace workstation public access.
az account show --query "{subscription:name, tenant:tenantId}" -o tablesqlcmd -?Step 1: Create the resource group, logical server, and database
Section titled “Step 1: Create the resource group, logical server, and database”This step creates the management
boundary and a small provisioned vCore
database. The server name must be
globally unique under
database.windows.net, so the command
includes random shell values. Do not
paste a real production password into
the module or a ticket. Read it into
an environment variable and let the
shell keep it out of command history
where possible.
export RG="rg-kubedojo-sql"export LOCATION="eastus"export SQL_SERVER="dojo-sql-$RANDOM-$RANDOM"export SQL_DB="dojoapp"export SQL_ADMIN="dojooperator"
read -r -s -p "SQL admin password: " SQL_ADMIN_PASSWORDprintf '\n'
az group create \ --name "$RG" \ --location "$LOCATION"
az sql server create \ --resource-group "$RG" \ --name "$SQL_SERVER" \ --location "$LOCATION" \ --admin-user "$SQL_ADMIN" \ --admin-password "$SQL_ADMIN_PASSWORD"
az sql db create \ --resource-group "$RG" \ --server "$SQL_SERVER" \ --name "$SQL_DB" \ --edition GeneralPurpose \ --family Gen5 \ --capacity 2 \ --compute-model Provisioned \ --backup-storage-redundancy Local- Resource group exists.
- Logical server exists and shows the expected region.
- Database exists on General Purpose provisioned compute.
- You recorded the server FQDN as
$SQL_SERVER.database.windows.net.
Solution notes for Step 1
Use az sql db show --resource-group "$RG" --server "$SQL_SERVER" --name "$SQL_DB" -o table to confirm the
database. If server creation fails,
check global name uniqueness, password
complexity, and subscription policy.
If database creation fails because the
selected region or family is
unavailable, choose an allowed region
or use the portal to identify
available Azure SQL Database options.
Do not continue until the database
reaches Online state.
Step 2: Configure access with a narrow firewall rule and optional private endpoint
Section titled “Step 2: Configure access with a narrow firewall rule and optional private endpoint”The fastest solo path is to allow your current public client IP. That is acceptable for a lab and poor as a permanent production pattern. The optional private endpoint commands show the production direction when you already have a VNet and subnet. Private endpoint DNS must resolve the SQL server FQDN to the private address from the calling network.
CLIENT_IP="$(curl -fsS https://api.ipify.org)"
az sql server firewall-rule create \ --resource-group "$RG" \ --server "$SQL_SERVER" \ --name AllowMyClientIp \ --start-ip-address "$CLIENT_IP" \ --end-ip-address "$CLIENT_IP"Optional private endpoint path, only if you have a VNet and a dedicated subnet available:
export VNET_RG="rg-existing-network"export VNET_NAME="vnet-platform-lab"export SUBNET_NAME="snet-private-endpoints"export PE_NAME="pe-$SQL_SERVER"export DNS_ZONE="privatelink.database.windows.net"
SQL_SERVER_ID="$(az sql server show \ --resource-group "$RG" \ --name "$SQL_SERVER" \ --query id \ -o tsv)"
az network private-dns zone create \ --resource-group "$VNET_RG" \ --name "$DNS_ZONE"
az network private-dns link vnet create \ --resource-group "$VNET_RG" \ --zone-name "$DNS_ZONE" \ --name "link-$VNET_NAME" \ --virtual-network "$VNET_NAME" \ --registration-enabled false
az network private-endpoint create \ --resource-group "$VNET_RG" \ --name "$PE_NAME" \ --vnet-name "$VNET_NAME" \ --subnet "$SUBNET_NAME" \ --private-connection-resource-id "$SQL_SERVER_ID" \ --group-id sqlServer \ --connection-name "conn-$SQL_SERVER"
az network private-endpoint dns-zone-group create \ --resource-group "$VNET_RG" \ --endpoint-name "$PE_NAME" \ --name default \ --private-dns-zone "$DNS_ZONE" \ --zone-name "$DNS_ZONE"- A server-level firewall rule allows only your current client IP for the lab.
- If using Private Link, the private endpoint is approved and has a private IP.
- If using Private Link, the VNet is linked to
privatelink.database.windows.net. - You can explain why the lab firewall rule should be removed after cleanup.
Solution notes for Step 2
Use az sql server firewall-rule list --resource-group "$RG" --server "$SQL_SERVER" -o table to confirm the
rule. For private endpoints, test DNS
from a VM or runner inside the VNet
with nslookup "$SQL_SERVER.database.windows.net".
If DNS still returns public addresses
from the private network, fix the
private DNS zone link before blaming
SQL authentication. If you are working
only from your laptop, continue with
the narrow firewall rule.
Step 3: Connect with sqlcmd and load a small schema
Section titled “Step 3: Connect with sqlcmd and load a small schema”This step verifies that data-plane
authentication and network access
work. It also creates enough data to
make metrics and query views
interesting. The insert uses a
recursive CTE pattern to generate rows
without needing an external data file.
If sqlcmd cannot connect, fix that
before loading data because the rest
of the exercise depends on a stable
connection path.
sqlcmd \ -S "$SQL_SERVER.database.windows.net" \ -d "$SQL_DB" \ -U "$SQL_ADMIN" \ -P "$SQL_ADMIN_PASSWORD" \ -N -C \ -Q "SELECT DB_NAME() AS database_name, SUSER_SNAME() AS login_name;"cat > seed-operator-events.sql <<'SQL'SET NOCOUNT ON;
DROP TABLE IF EXISTS dbo.operator_events;
CREATE TABLE dbo.operator_events( event_id int IDENTITY(1,1) NOT NULL PRIMARY KEY, tenant_id int NOT NULL, event_time datetime2 NOT NULL, severity varchar(12) NOT NULL, payload varchar(200) NOT NULL);
WITH n AS( SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b)INSERT dbo.operator_events (tenant_id, event_time, severity, payload)SELECT rn % 25, DATEADD(second, -rn, SYSUTCDATETIME()), CASE WHEN rn % 10 = 0 THEN 'warning' ELSE 'info' END, CONCAT('operator-path-event-', rn)FROM n;
SELECT COUNT(*) AS inserted_rows FROM dbo.operator_events;SQL
sqlcmd \ -S "$SQL_SERVER.database.windows.net" \ -d "$SQL_DB" \ -U "$SQL_ADMIN" \ -P "$SQL_ADMIN_PASSWORD" \ -N -C \ -i seed-operator-events.sql-
sqlcmdconnects to the expected database. - Table
dbo.operator_eventsexists. - The load reports 10000 inserted rows.
- You can see the database in the Azure portal Overview pane.
Solution notes for Step 3
If sqlcmd reports a login failure,
confirm the admin user, password, and
database name. If it reports a network
error, check the firewall rule,
private endpoint DNS, and whether your
current public IP changed. If the
insert fails because an object already
exists, rerun the script because it
drops and recreates the lab table. If
the row count is lower than expected,
check whether the script was
interrupted.
Step 4: Trigger load and watch database metrics
Section titled “Step 4: Trigger load and watch database metrics”The load script is intentionally blunt. It updates rows and runs aggregate reads in a loop so you can watch CPU, data IO, log IO, workers, and sessions move. The goal is not to benchmark Azure SQL Database. The goal is to connect workload behavior to Azure Monitor metrics and identify which limit appears first.
cat > load-operator-events.sql <<'SQL'SET NOCOUNT ON;
DECLARE @i int = 0;
WHILE @i < 800BEGIN UPDATE TOP (250) dbo.operator_events SET event_time = SYSUTCDATETIME() WHERE severity = 'info';
SELECT COUNT_BIG(*) AS warning_events FROM dbo.operator_events WHERE severity = 'warning' AND payload LIKE 'operator-path-event-%';
SET @i += 1;END;SQL
for worker in 1 2 3 4do sqlcmd \ -S "$SQL_SERVER.database.windows.net" \ -d "$SQL_DB" \ -U "$SQL_ADMIN" \ -P "$SQL_ADMIN_PASSWORD" \ -N -C \ -i load-operator-events.sql > "load-worker-$worker.log" 2>&1 &done
waitOpen the database in the Azure portal while the load runs. Go to Monitoring, Metrics, and chart CPU percentage, Data IO percentage, Log IO percentage, Workers percentage, Sessions percentage, and successful or failed connections. If the load completes too quickly, rerun it or increase the loop count modestly. If you see timeouts, keep the logs because they are part of the learning goal.
- You generated concurrent SQL activity.
- You observed at least CPU, data IO, and log IO metrics.
- You identified whether CPU, data IO, log IO, workers, or sessions looked closest to the limit.
- You opened Query Performance Insight and found the lab query shape or confirmed that it needs more collection time.
Solution notes for Step 4
Azure Monitor metrics may lag by a few
minutes, and Query Performance Insight
may need Query Store history before
useful charts appear. If no metric
changes, confirm that the load workers
actually ran and inspect
load-worker-1.log. If the database
throttles, the visible symptom may be
slower completion rather than a clean
error. The important output is your
observation of which metric saturates
first.
Step 5: Scale the database compute and compare signals
Section titled “Step 5: Scale the database compute and compare signals”Scaling is an operator action, not a permanent conclusion. You are testing whether more provisioned compute changes the observed pressure. In production, you would pair this with cost approval, a rollback plan, and a follow-up review of whether the workload should be tuned instead. For the lab, scale from 2 vCores to 4 vCores and rerun the load.
az sql db update \ --resource-group "$RG" \ --server "$SQL_SERVER" \ --name "$SQL_DB" \ --capacity 4
az sql db show \ --resource-group "$RG" \ --server "$SQL_SERVER" \ --name "$SQL_DB" \ --query "{name:name, status:status, edition:edition, capacity:currentServiceObjectiveName}" \ -o tableRerun the load after the update completes. Compare CPU, data IO, log IO, duration, and any timeout behavior against the first run. If the dominant bottleneck moved, write that down. If nothing improved, the bottleneck may be query shape, locking, log design, or another tier limit rather than raw compute.
- Database scale operation completed.
- The database reports the larger service objective.
- You reran the load and compared metrics.
- You can state whether scaling changed the bottleneck or only changed the bill.
Solution notes for Step 5
Some scale operations briefly disconnect sessions, so production applications need retry logic. If the update is blocked by subscription policy or regional limits, use the portal to select the next available size. If metrics improve, do not stop thinking; ask whether the new size is the right steady-state cost. If metrics do not improve, bring the top queries and waits to a DBA instead of repeatedly increasing size.
Step 6: Perform point-in-time restore to a new database
Section titled “Step 6: Perform point-in-time restore to a new database”Point-in-time restore should create a new database, not overwrite the original during a live incident. This step restores the database to a timestamp five minutes in the past. The restore time must be inside the available PITR window, and very new databases may need a little time before an earlier restore point is available. If the restore time fails, wait several minutes and retry with a slightly later timestamp.
export SQL_DB_RESTORED="${SQL_DB}-restore"
RESTORE_TIME="$(.venv/bin/python - <<'PY'from datetime import datetime, timedelta, timezone
print((datetime.now(timezone.utc) - timedelta(minutes=5)).strftime("%Y-%m-%dT%H:%M:%SZ"))PY)"
printf 'Restoring to %s\n' "$RESTORE_TIME"
az sql db restore \ --resource-group "$RG" \ --server "$SQL_SERVER" \ --name "$SQL_DB" \ --dest-name "$SQL_DB_RESTORED" \ --time "$RESTORE_TIME"
az sql db list \ --resource-group "$RG" \ --server "$SQL_SERVER" \ --query "[].{name:name,status:status,sku:currentServiceObjectiveName}" \ -o tableVerify the restored database through
sqlcmd. Use the restored database
name and check that the table exists.
The row count may differ depending on
the restore timestamp and load timing.
That difference is the point: PITR
gives you a database at a prior point
so you can compare, repair, or
redirect safely.
sqlcmd \ -S "$SQL_SERVER.database.windows.net" \ -d "$SQL_DB_RESTORED" \ -U "$SQL_ADMIN" \ -P "$SQL_ADMIN_PASSWORD" \ -N -C \ -Q "SELECT COUNT(*) AS restored_rows FROM dbo.operator_events;"- Restore command completed successfully.
- A new database with the restored name exists.
-
sqlcmdcan connect to the restored database. - The restored database contains
dbo.operator_events. - You can explain how PITR differs from failover.
Solution notes for Step 6
If the restore fails because the timestamp is too early, wait until backups catch up and choose a later timestamp. If the restored database cannot be queried, confirm firewall and login access on the same logical server. If the restored database exists but lacks expected rows, compare the restore timestamp with when the seed and load scripts ran. In production, this restored copy becomes the safe analysis target for data repair.
Cleanup
Section titled “Cleanup”Delete the resource group when you are done. This removes the logical server, databases, firewall rule, and lab artifacts in that resource group. If you created a private endpoint in a shared network resource group, remove that private endpoint and DNS zone group separately if your organization does not want to keep it. Do not leave scaled-up lab databases running.
az group delete \ --name "$RG" \ --yes \ --no-wait- Lab resource group deletion started.
- Any optional private endpoint resources in another resource group were removed or assigned to an owner.
- Local files
seed-operator-events.sql,load-operator-events.sql, andload-worker-*.logwere reviewed or deleted.
Sources
Section titled “Sources”- What is Azure SQL? — Defines the Azure SQL family and compares SQL Database, Managed Instance, and SQL Server on Azure VM.
- What is Azure SQL Database? — Documents Azure SQL Database as managed PaaS and frames single databases and elastic pools.
- Compare vCore and DTU purchasing models — Explains DTU versus vCore, compute costs, storage costs, and backup storage billing.
- DTU-based purchasing model — Provides DTU tier behavior and throttling context for bundled resource limits.
- General Purpose and Business Critical service tiers — Covers vCore tier architecture and Business Critical tradeoffs.
- Hyperscale service tier — Documents Hyperscale capabilities, storage scale, replicas, and pricing model.
- Hyperscale architecture — Explains compute nodes, page servers, log service, and Azure Storage architecture.
- Serverless compute tier — Documents autoscaling, auto-pause behavior, and serverless billing.
- Automated backups — Covers PITR retention, backup storage behavior, and backup cost mechanics.
- Restore from backup — Documents restore workflows, PITR, deleted database restore, and geo-restore.
- Long-term retention — Documents LTR policy and the 10-year retention ceiling.
- Active geo-replication — Explains per-database geo-replication and geo-secondary behavior.
- Failover groups — Documents failover group behavior, listener endpoints, and failover policy.
- Private endpoints — Covers Azure Private Link behavior for Azure SQL Database.
- Firewall rules — Documents server-level and database-level firewall rules.
- Microsoft Entra authentication — Covers Entra users, contained database users, and authentication setup.
- Managed identities for Azure SQL — Documents system-assigned and user-assigned managed identity behavior.
- TDE with customer-managed keys — Covers BYOK and customer-managed TDE protectors.
- Always Encrypted — Explains column encryption and key exposure boundaries.
- Azure SQL auditing — Documents audit log destinations and audit purpose.
- Microsoft Defender for SQL — Covers vulnerability assessment, SQL injection alerts, and anomalous access alerts.
- Azure SQL metrics and alerts — Lists key Azure Monitor metrics for Azure SQL Database operations.
- Query Performance Insight — Documents top query analysis, Query Store dependency, and operator workflow.
- Database Watcher datasets — Documents resource utilization and wait statistics datasets for KQL-based monitoring.
- Azure SQL Database pricing — Provides the pricing surface used for the cost lens and worked example.
Next Module
Section titled “Next Module”Return to the Azure Essentials index until the next ordered module is assigned; the next operator-path topic should continue from managed data services into production integration and data movement.