Skip to content

Latest commit

 

History

History
316 lines (205 loc) · 30.7 KB

auto-failover-group-sql-mi.md

File metadata and controls

316 lines (205 loc) · 30.7 KB
title description author ms.author ms.reviewer ms.date ms.service ms.subservice ms.topic ms.custom
Auto-failover groups overview & best practices
Auto-failover groups let you manage geo-replication and automatic / coordinated failover of all user databases on a managed instance in Azure SQL Managed Instance.
strahinjas
sstefanovic
mathoma
12/28/2022
sql-managed-instance
high-availability
conceptual
azure-sql-split

Auto-failover groups overview & best practices (Azure SQL Managed Instance)

[!INCLUDEappliesto-sqlmi]

[!div class="op_single_selector"]

The auto-failover groups feature allows you to manage the replication and failover of all user databases in a managed instance to another Azure region. This article focuses on using the Auto-failover group feature with Azure SQL Managed Instance and some best practices.

To get started, review Configure auto-failover group. For an end-to-end experience, see the Auto-failover group tutorial.

Note

This article covers auto-failover groups for Azure SQL Managed Instance. For Azure SQL Database, see Auto-failover groups in SQL Database.

Overview

[!INCLUDE auto-failover-groups-overview]

Terminology and capabilities

  • Failover group (FOG)

    A failover group allows for all user databases within a managed instance to fail over as a unit to another Azure region in case the primary managed instance becomes unavailable due to a primary region outage. Since failover groups for SQL Managed Instance contain all user databases within the instance, only one failover group can be configured on an instance.

    [!IMPORTANT]
    The name of the failover group must be globally unique within the .database.windows.net domain.

  • Primary

    The managed instance that hosts the primary databases in the failover group.

  • Secondary

    The managed instance that hosts the secondary databases in the failover group. The secondary cannot be in the same Azure region as the primary.

  • DNS zone

    A unique ID that is automatically generated when a new SQL Managed Instance is created. A multi-domain (SAN) certificate for this instance is provisioned to authenticate the client connections to any instance in the same DNS zone. The two managed instances in the same failover group must share the DNS zone.

  • Failover group read-write listener

    A DNS CNAME record that points to the current primary. It's created automatically when the failover group is created and allows the read-write workload to transparently reconnect to the primary when the primary changes after failover. When the failover group is created on a SQL Managed Instance, the DNS CNAME record for the listener URL is formed as <fog-name>.<zone_id>.database.windows.net.

  • Failover group read-only listener

    A DNS CNAME record that points to the current secondary. It's created automatically when the failover group is created and allows the read-only SQL workload to transparently connect to the secondary when the secondary changes after failover. When the failover group is created on a SQL Managed Instance, the DNS CNAME record for the listener URL is formed as <fog-name>.secondary.<zone_id>.database.windows.net.

[!INCLUDE auto-failover-group-terminology]

Failover group architecture

The auto-failover group must be configured on the primary instance and will connect it to the secondary instance in a different Azure region. All user databases in the instance will be replicated to the secondary instance. System databases like master and msdb will not be replicated.

The following diagram illustrates a typical configuration of a geo-redundant cloud application using managed instance and auto-failover group:

:::image type="content" source="media/auto-failover-group-sql-mi/auto-failover-group-mi.png" alt-text="Auto-failover group diagram for SQL MI":::

If your application uses SQL Managed Instance as the data tier, follow the general guidelines and best practices outlined in this article when designing for business continuity.

Create the geo-secondary instance

To ensure non-interrupted connectivity to the primary SQL Managed Instance after failover, both the primary and secondary instances must be in the same DNS zone. It will guarantee that the same multi-domain (SAN) certificate can be used to authenticate client connections to either of the two instances in the failover group. When your application is ready for production deployment, create a secondary SQL Managed Instance in a different region, and make sure it shares the DNS zone with the primary SQL Managed Instance. You can do it by specifying an optional parameter during creation. If you're using PowerShell or the REST API, the name of the optional parameter is DNSZonePartner. The name of the corresponding optional field in the Azure portal is Primary Managed Instance.

Important

The first managed instance created in the subnet determines DNS zone for all subsequent instances in the same subnet. This means that two instances from the same subnet cannot belong to different DNS zones.

For more information about creating the secondary SQL Managed Instance in the same DNS zone as the primary instance, see Create a secondary managed instance.

Use paired regions

Deploy both managed instances to paired regions for performance reasons. SQL Managed Instance failover groups in paired regions have better performance compared to unpaired regions.

Enable and optimize geo-replication traffic flow between the instances

Connectivity between the virtual network subnets hosting primary and secondary instance must be established and maintained for uninterrupted geo-replication traffic flow. There are multiple ways to provide connectivity between the instances that you can choose among based on your network topology and policies:

Important

Global virtual network peering (VNet peering) is the recommended way for establishing connectivity between two instances in a failover group. It provides a low-latency, high-bandwidth private connection between the peered virtual networks using the Microsoft backbone infrastructure. No public Internet, gateways, or additional encryption is required in the communication between the peered virtual networks. Global virtual network peering is supported for instances hosted in subnets created since September 22, 2020. To be able to use global virtual network peering for SQL managed instances hosted in subnets created before September 22, 2020, consider configuring non-default maintenance window on the instance, as it will move the instance into a new virtual cluster that supports global virtual network peering.

Regardless of the connectivity mechanism, there are requirements that must be fulfilled for geo-replication traffic to flow:

  • The Network Security Group (NSG) rules on the subnet hosting primary instance allow:
    • Inbound traffic on port 5022 and port range 11000-11999 from the subnet hosting the secondary instance.
    • Outbound traffic on port 5022 and port range 11000-11999 to the subnet hosting the secondary instance.
  • The Network Security Group (NSG) rules on the subnet hosting secondary instance allow:
    • Inbound traffic on port 5022 and port range 11000-11999 from the subnet hosting the primary instance.
    • Outbound traffic on port 5022 and port range 11000-11999 to the subnet hosting the primary instance.
  • IP address ranges of VNets hosting primary and secondary instance must not overlap.
  • There's no indirect overlap of IP address range between the VNets hosting primary and secondary instance and any other VNets they are peered with via local virtual network peering or other means

Additionally, if you're using other mechanisms for providing connectivity between the instances than the recommended global virtual network peering, you need to ensure the following:

  • Any networking device used, like firewalls or network virtual appliances (NVAs), do not block the traffic described above.
  • Routing is properly configured, and asymmetric routing is avoided.
  • If you deploy auto-failover groups in a hub-and-spoke network topology cross-region, replication traffic should go directly between the two managed instance subnets rather than directed through the hub networks. It will help you avoid connectivity and replication speed issues.

Important

Alternative ways of providing connectivity between the instances involving additional networking devices may make troubleshooting process in case of connectivity or replication speed issues very difficult and require active involvement of network administrators and significantly prolong the resolution time.

Initial seeding

When establishing a failover group between managed instances, there's an initial seeding phase before data replication starts. The initial seeding phase is the longest and most expensive part of the operation. Once initial seeding completes data is synchronized, and only subsequent data changes are replicated. The time it takes for the initial seeding to complete depends on the size of data, number of replicated databases, workload intensity on the primary databases, and the speed of the link between the virtual networks hosting primary and secondary instance that mostly depends on the way connectivity is established. Under normal circumstances, and when connectivity is established using recommended global virtual network peering, seeding speed is up to 360 GB an hour for SQL Managed Instance. Seeding is performed for a batch of user databases in parallel - not necessarily for all databases at the same time. Multiple batches may be needed if there are many databases hosted on the instance.

If the speed of the link between the two instances is slower than what is necessary, the time to seed is likely to be noticeably impacted. You can use the stated seeding speed, number of databases, total size of data, and the link speed to estimate how long the initial seeding phase will take before data replication starts. For example, for a single 100-GB database, the initial seed phase would take about 1.2 hours if the link is capable of pushing 84 GB per hour, and if there are no other databases being seeded. If the link can only transfer 10 GB per hour, then seeding a 100-GB database will take about 10 hours. If there are multiple databases to replicate, seeding will be executed in parallel, and, when combined with a slow link speed, the initial seeding phase may take considerably longer, especially if the parallel seeding of data from all databases exceeds the available link bandwidth.

Important

In case of an extremely low-speed or busy link causing the initial seeding phase to take days the creation of a failover group can time out. The creation process will be automatically canceled after 6 days.

Manage geo-failover to a geo-secondary instance

The failover group will manage geo-failover of all databases on the primary managed instance. When a group is created, each database in the instance will be automatically geo-replicated to the geo-secondary instance. You can't use failover groups to initiate a partial failover of a subset of databases.

Important

If a database is dropped on the primary managed instance, it will also be dropped automatically on the geo-secondary managed instance.

Use the read-write listener (primary MI)

For read-write workloads, use <fog-name>.zone_id.database.windows.net as the server name. Connections will be automatically directed to the primary. This name doesn't change after failover. The geo-failover involves updating the DNS record, so the new client connections are routed to the new primary only after the client DNS cache is refreshed. Because the secondary instance shares the DNS zone with the primary, the client application will be able to reconnect to it using the same server-side SAN certificate. The existing client connections need to be terminated and then recreated to be routed to the new primary. The read-write listener and read-only listener cannot be reached via the public endpoint for managed instance.

Use the read-only listener (secondary MI)

If you have logically isolated read-only workloads that are tolerant to data latency, you can run them on the geo-secondary. To connect directly to the geo-secondary, use <fog-name>.secondary.<zone_id>.database.windows.net as the server name.

In the Business Critical tier, SQL Managed Instance supports the use of read-only replicas to offload read-only query workloads, using the ApplicationIntent=ReadOnly parameter in the connection string. When you have configured a geo-replicated secondary, you can use this capability to connect to either a read-only replica in the primary location or in the geo-replicated location:

  • To connect to a read-only replica in the primary location, use ApplicationIntent=ReadOnly and <fog-name>.<zone_id>.database.windows.net.
  • To connect to a read-only replica in the secondary location, use ApplicationIntent=ReadOnly and <fog-name>.secondary.<zone_id>.database.windows.net.

The read-write listener and read-only listener can't be reached via public endpoint for managed instance.

Potential performance degradation after failover

A typical Azure application uses multiple Azure services and consists of multiple components. The automatic geo-failover of the failover group is triggered based on the state of the Azure SQL components alone. Other Azure services in the primary region may not be affected by the outage and their components may still be available in that region. Once the primary databases switch to the secondary region, the latency between the dependent components may increase. Ensure the redundancy of all the application's components in the secondary region and fail over application components together with the database so that application's performance is not affected by higher cross-region latency.

Potential data loss after failover

If an outage occurs in the primary region, recent transactions may not be able to replicate to the geo-secondary. Failover is deferred for the period you specify using GracePeriodWithDataLossHours. If you configured the automatic failover policy, be prepared for data loss. In general, during outages, Azure favors availability. Setting GracePeriodWithDataLossHours to a larger number, such as 24 hours, or disabling automatic geo-failover lets you reduce the likelihood of data loss at the expense of database availability.

DNS update

The DNS update of the read-write listener will happen immediately after the failover is initiated. This operation won't result in data loss. However, the process of switching database roles can take up to 5 minutes under normal conditions. Until it's completed, some databases in the new primary instance will still be read-only. If a failover is initiated using PowerShell, the operation to switch the primary replica role is synchronous. If it's initiated using the Azure portal, the UI will indicate completion status. If it's initiated using the REST API, use standard Azure Resource Manager's polling mechanism to monitor for completion.

Important

Use manual planned failover to move the primary back to the original location once the outage that caused the geo-failover is mitigated.

Save costs with a license-free DR replica

You can save on SQL Server license costs by configuring your secondary managed instance to be used for disaster recovery (DR) only. To set this up, see Configure free DR replica.

As long as the secondary instance is not used for read-workloads, Microsoft provides you with a free number of vCores to match the primary instance. You're still charged for compute and storage used by the secondary instance. Auto-failover groups support only one replica - the replica must either be a readable replica, or designated as a DR-only replica.

Enable scenarios dependent on objects from the system databases

System databases are not replicated to the secondary instance in a failover group. To enable scenarios that depend on objects from the system databases, make sure to create the same objects on the secondary instance and keep them synchronized with the primary instance.

For example, if you plan to use the same logins on the secondary instance, make sure to create them with the identical SID.

-- Code to create login on the secondary instance
CREATE LOGIN foo WITH PASSWORD = '<enterStrongPasswordHere>', SID = <login_sid>;

To learn more, see Replication of logins and agent jobs.

Synchronize instance properties and retention policies instances

Instances in a failover group remain separate Azure resources, and no changes made to the configuration of the primary instance will be automatically replicated to the secondary instance. Make sure to perform all relevant changes both on primary and secondary instance. For example, if you change backup storage redundancy or long-term backup retention policy on primary instance, make sure to change it on secondary instance as well.

Scale instances

You can scale up or scale down the primary and secondary instance to a different compute size within the same service tier. When scaling up, we recommend that you scale up the geo-secondary first, and then scale up the primary. When scaling down, reverse the order: scale down the primary first, and then scale down the secondary. When you scale instance to a different service tier, this recommendation is enforced.

The sequence is recommended specifically to avoid the problem where the geo-secondary at a lower SKU gets overloaded and must be reseeded during an upgrade or downgrade process.

Use failover groups and virtual network service endpoints

If you're using Virtual Network service endpoints and rules to restrict access to your SQL Managed Instance, note that each virtual network service endpoint applies to only one Azure region. The endpoint does not enable other regions to accept communication from the subnet. Therefore, only the client applications deployed in the same region can connect to the primary database.

Prevent loss of critical data

Due to the high latency of wide area networks, geo-replication uses an asynchronous replication mechanism. Asynchronous replication makes the possibility of data loss unavoidable if the primary fails. To protect critical transactions from data loss, an application developer can call the sp_wait_for_database_copy_sync stored procedure immediately after committing the transaction. Calling sp_wait_for_database_copy_sync blocks the calling thread until the last committed transaction has been transmitted and hardened in the transaction log of the secondary database. However, it doesn't wait for the transmitted transactions to be replayed (redone) on the secondary. sp_wait_for_database_copy_sync is scoped to a specific geo-replication link. Any user with the connection rights to the primary database can call this procedure.

To prevent data loss during user-initiated, planned geo-failover, replication automatically and temporarily changes to synchronous replication, then performs a failover. Replication then returns to asynchronous mode after the geo-failover is complete.

Note

sp_wait_for_database_copy_sync prevents data loss after geo-failover for specific transactions, but does not guarantee full synchronization for read access. The delay caused by a sp_wait_for_database_copy_sync procedure call can be significant and depends on the size of the not yet transmitted transaction log on the primary at the time of the call.

Failover group status

Auto-failover group reports its status describing the current state of the data replication:

  • Seeding - Initial seeding is taking place after creation of the failover group, until all user databases are initialized on the secondary instance. Failover process cannot be initiated while auto-failover group is in the Seeding status, since user databases aren't copied to secondary instance yet.
  • Synchronizing - the usual status of auto-failover group. It means that data changes on the primary instance are being replicated asynchronously to the secondary instance. This status doesn't guarantee that the data is fully synchronized at every moment. There may be data changes from primary still to be replicated to the secondary due to asynchronous nature of the replication process between instances in the auto-failover group. Both automatic and manual failovers can be initiated while the auto-failover group is in the Synchronizing status.
  • Failover in progress - this status indicates that either automatically or manually initiated failover process is in progress. No changes to the failover group or additional failovers can be initiated while the auto-failover group is in this status.

Permissions

Permissions for a failover group are managed via Azure role-based access control (Azure RBAC).

Azure RBAC write access is necessary to create and manage failover groups. The SQL Managed Instance Contributor has all the necessary permissions to manage failover groups.

For specific permission scopes, review how to configure auto-failover groups in Azure SQL Managed Instance.

Limitations

Be aware of the following limitations:

  • Failover groups can't be created between two instances in the same Azure region.
  • Failover groups can't be renamed. You will need to delete the group and re-create it with a different name.
  • A failover group contains exactly two managed instances. Adding additional instances to the failover group is unsupported.
  • An instance can participate only in one failover group at any moment.
  • A failover group can't be created between two instances belonging to different Azure tenants.
  • A failover group between two instances belonging to different Azure subscriptions can't be created using Azure portal or Azure CLI. Use Azure PowerShell or REST API instead to create such a failover group. Once created, cross-subscription failover group is regularly visible in Azure portal and all subsequent operations including failovers can be inititiated from Azure portal or Azure CLI.
  • Database rename isn't supported for databases in failover group. You will need to temporarily delete failover group to be able to rename a database.
  • System databases aren't replicated to the secondary instance in a failover group. Therefore, scenarios that depend on objects from the system databases such as Server Logins and Agent jobs, require objects to be manually created on the secondary instances and also manually kept in sync after any changes made on primary instance. The only exception is Service master Key (SMK) for SQL Managed Instance that is replicated automatically to secondary instance during creation of failover group. Any subsequent changes of SMK on the primary instance however will not be replicated to secondary instance. To learn more, see how to Enable scenarios dependent on objects from the system databases.
  • Failover groups can't be created between instances if any of them are in an instance pool.

Programmatically manage failover groups

Auto-failover groups can also be managed programmatically using Azure PowerShell, Azure CLI, and REST API. The following tables describe the set of commands available. Active geo-replication includes a set of Azure Resource Manager APIs for management, including the Azure SQL Database REST API and Azure PowerShell cmdlets. These APIs require the use of resource groups and support Azure role-based access control (Azure RBAC). For more information on how to implement access roles, see Azure role-based access control (Azure RBAC).

Cmdlet Description
New-AzSqlDatabaseInstanceFailoverGroup This command creates a failover group and registers it on both primary and secondary instances
Set-AzSqlDatabaseInstanceFailoverGroup Modifies configuration of a failover group
Get-AzSqlDatabaseInstanceFailoverGroup Retrieves a failover group's configuration
Switch-AzSqlDatabaseInstanceFailoverGroup Triggers failover of a failover group to the secondary instance
Remove-AzSqlDatabaseInstanceFailoverGroup Removes a failover group
Command Description
az sql instance-failover-group create This command creates a failover group and registers it on both primary and secondary servers
az sql instance-failover-group delete Removes a failover group from the server
az sql instance-failover-group show Retrieves a failover group configuration
az sql instance-failover-group update Modifies a failover group's configuration and/or adds one or more databases to a failover group
az sql instance-failover-group set-primary Triggers failover of a failover group to the secondary server
API Description
Create or Update Failover Group Creates or updates a failover group's configuration
Delete Failover Group Removes a failover group from the instance
Failover (Planned) Triggers failover from the current primary instance to this instance with full data synchronization.
Force Failover Allow Data Loss Triggers failover from the current primary instance to the secondary instance without synchronizing data. This operation may result in data loss.
Get Failover Group retrieves a failover group's configuration.
List Failover Groups - List By Location Lists the failover groups in a location.

Next steps