MSSQL AlwaysOn Failover Scenarios


Notes on the topology above...
  • In this configuration, if you are running Enterprise Edition with Software Assurance (or equivalent) then you only need to license the Primary.
  • Total loss of Primary datacentre requires manual DR invocation.
  • 4 nodes + witness, or 5 or more nodes, is better (i.e. the cluster can survive two simultaneous server node failures without manual failover to DR), but each extra node incurs SQL Server license cost.
  • Witness can be local or cloud. Note that a cloud witness needs internet access from your SQL Servers.

Automatic Failover

For an automatic failover to happen the following criteria must be met...

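NOTE: The WSFC failure threshold (the maximum number of failures the cluster tolerates for the availability group's role within a given period) is an important consideration for changes that require SQL Server or OS restarts. If your patch windows are too close together, the threshold may be exceeded and automatic failover will not happen.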
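
One precondition that can be checked from T-SQL is whether each database on a secondary is currently synchronized and ready to fail over. The query below is a minimal sketch using the standard AlwaysOn catalog views and DMVs; it assumes it is run on the primary replica, where state for all replicas is visible.

```sql
-- Run on the primary replica.
-- is_failover_ready = 1 means the database is synchronized on that replica
-- and could fail over without data loss.
SELECT ar.replica_server_name,
       ar.availability_mode_desc,
       ar.failover_mode_desc,
       drcs.database_name,
       drcs.is_failover_ready
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drcs.replica_id
ORDER BY ar.replica_server_name, drcs.database_name;
```

A replica that shows AUTOMATIC failover mode but is_failover_ready = 0 for any database would not be a candidate for automatic failover at that moment.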

Failure-Condition Levels

1 - OnServerDown

2 - OnServerUnresponsive

3 - OnCriticalServerError

4 - OnModerateServerError

5 - OnAnyQualifiedFailureConditions

To set the Failure Condition Level...

ALTER AVAILABILITY GROUP AG1 SET (FAILURE_CONDITION_LEVEL = 1);

To set the Health Check Timeout (relevant for Level 2)...

ALTER AVAILABILITY GROUP AG1 SET (HEALTH_CHECK_TIMEOUT = 60000);

Default is 30000 milliseconds (30 seconds)
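
To confirm the values currently in effect, the availability group catalog view can be queried. A short sketch, assuming the group is named AG1 as in the statements above:

```sql
-- Shows the current failure-condition level and health check timeout for AG1.
SELECT name,
       failure_condition_level,
       health_check_timeout   -- milliseconds
FROM sys.availability_groups
WHERE name = N'AG1';
```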

Recovery Scenarios

2 Nodes

All scenarios assume that the WSFC failure threshold has not been breached. 
In a two node cluster one node's vote is automatically zeroed.
Failure of primary node. The Secondary does not have a vote (0/1) so the cluster goes down.
Failure of secondary node. The Primary is the only voting node (1/1) so service continues.
Simultaneous failure of both nodes

2 Nodes + Witness

If you have two nodes, Microsoft state that a witness is required.
All scenarios assume that the WSFC failure threshold has not been breached. 
Failure of primary node. "Dynamic quorum behaviour" means that the witness won't vote after the Primary node goes down. As the only node with a vote (1/1), the Secondary can become the Primary without human intervention.
Failure of secondary node. "Dynamic quorum behaviour" means that the witness won't vote after the Secondary node goes down. As the only node with a vote (1/1), the Primary can remain the Primary without human intervention.
Simultaneous failure of both nodes
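
To see which quorum model the cluster is actually using (and therefore whether a witness is configured), SQL Server exposes a cluster-level DMV. This is a minimal sketch; it can be run on any node that hosts an AlwaysOn-enabled instance.

```sql
-- quorum_type_desc distinguishes node majority from the witness-based models
-- (disk, file share or cloud witness); quorum_state_desc shows whether
-- the cluster currently has quorum.
SELECT cluster_name,
       quorum_type_desc,
       quorum_state_desc
FROM sys.dm_hadr_cluster;
```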

3 Nodes

All scenarios assume that the WSFC failure threshold has not been breached. In order to avoid a performance impact on the databases in the Primary Data Centre, Microsoft recommend Async commit to the DR data centre (which means that DR failover must be invoked manually).
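
The Async commit arrangement described above can be sketched in T-SQL as follows. The replica name SQLDR01 is a placeholder for the DR node and is not from the original topology; AG1 is the group name used earlier. The statements are assumed to be run on the primary replica.

```sql
-- SQLDR01 is a hypothetical name for the DR Secondary.
-- An asynchronous-commit replica only supports manual failover,
-- so the failover mode is set first.
ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLDR01' WITH (FAILOVER_MODE = MANUAL);

ALTER AVAILABILITY GROUP AG1
    MODIFY REPLICA ON N'SQLDR01' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

-- Confirm the resulting configuration of every replica.
SELECT replica_server_name,
       availability_mode_desc,
       failover_mode_desc
FROM sys.availability_replicas;
```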

3-Node: Single Node Failure

Failure of Primary node. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case, the DR Secondary) will lose its vote after the Primary node goes down. As the only node with a vote, the Secondary can become the Primary without human intervention.
Failure of Primary node. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case, the Secondary) will lose its vote after the Primary node goes down. As the Secondary doesn't have a vote and the DR Secondary isn't valid for automatic failover, the Secondary will need human intervention in order to become the Primary.
Failure of Secondary node. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Secondary node goes down. As the only node with a vote, the Primary can continue to be the Primary without human intervention.
Failure of DR Secondary node. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the DR Secondary node goes down. As the only node with a vote, the Primary can continue to be the Primary without human intervention.

3-Node: Simultaneous Double Node Failure

Simultaneous failure of Primary and DR Secondary. The surviving Secondary does not have a quorum (i.e. it doesn't have more than 50% of the votes) so cannot become the Primary without human intervention (forcing quorum).
Simultaneous failure of Secondary and DR Secondary. The surviving Primary does not have a quorum (i.e. it doesn't have more than 50% of the votes) so shuts down to avoid a split brain situation (i.e. it assumes the other two nodes could have formed a quorum without it).
Simultaneous failure of Primary and Secondary. The DR Secondary is not valid for automatic failover due to Async commit so cannot become the Primary without human intervention (forcing quorum).
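
Where the scenarios above call for human intervention, the SQL Server side of that step is typically a forced failover, issued once cluster quorum has been forced on the surviving node through the WSFC tooling (not shown here). A sketch, assuming the group is named AG1 and that the statement is run on the surviving replica that is to become the Primary:

```sql
-- Run on the surviving secondary replica after quorum has been forced.
-- On an asynchronous-commit replica this can lose transactions that had
-- not yet been hardened there, hence the option name.
ALTER AVAILABILITY GROUP AG1 FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

Because of the potential data loss, this is a decision for the DR invocation process rather than something to script unattended.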

3-Node: Consecutive Double Node Failure

Consecutive failure of Primary followed by Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the Primary node goes down. After the Secondary failure, the DR Secondary has a vote but is not valid for automatic failover due to Async commit.
Consecutive failure of Primary followed by Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Primary node goes down. After the Secondary failure, the DR Secondary does not have a vote so cannot become Primary.
Consecutive failure of Secondary followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the Secondary node goes down. After the Primary failure, the DR Secondary has a vote but is not valid for automatic failover due to Async commit.
Consecutive failure of Secondary followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Secondary node goes down. After the Primary failure, the DR Secondary does not have a vote and is not valid for automatic failover due to Async commit.
Consecutive failure of Primary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Primary node goes down. After the DR Secondary failure, as the only node with a vote, the Secondary can become the Primary without human intervention.
Consecutive failure of Primary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the Primary node goes down. At this point the Secondary will become the new Primary. After the DR Secondary failure, the Secondary does not have a vote so shuts down to avoid a split brain scenario.
Consecutive failure of Secondary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the DR Secondary) will lose its vote after the Secondary node goes down. As the only node with a vote, the Primary can continue to be the Primary without human intervention.
Consecutive failure of Secondary followed by DR Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the Secondary node goes down. After the DR Secondary failure, the Primary does not have a vote so shuts down to avoid a split brain scenario.
Consecutive failure of DR Secondary, followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the DR Secondary node goes down. As the only node with a vote, the Secondary can become the Primary without human intervention.
Consecutive failure of DR Secondary, followed by Primary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the DR Secondary node goes down. After the Primary failure, the Secondary does not have a vote so cannot become the Primary.
Consecutive failure of DR Secondary, followed by Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Secondary) will lose its vote after the DR Secondary node goes down. After the Secondary failure, the Primary has the remaining vote and can continue without intervention.
Consecutive failure of DR Secondary, followed by Secondary. "Dynamic quorum behaviour" means that one of the remaining nodes (in this case the Primary) will lose its vote after the DR Secondary node goes down. After the Secondary failure, the Primary does not have a vote so goes down to avoid a split brain scenario.

3 Nodes + Witness

If you have 3 nodes, Microsoft strongly recommend a witness.
All scenarios assume that the WSFC failure threshold has not been breached. 
In order to avoid a performance impact on the databases in the Primary Data Centre, Microsoft recommend Async commit to the DR data centre (which means that DR failover must be invoked manually).
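It is also a good idea to prevent the DR Secondary from getting a vote. Note that votes are a property of the WSFC cluster nodes (the node weight) rather than of the Availability Group itself, so this is configured on the cluster.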
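
Whether the DR Secondary has actually been left without a vote can be checked from SQL Server. The query below is a minimal sketch using the cluster members DMV; the vote count it reports is the one assigned to each member and may not reflect adjustments made on the fly by dynamic quorum.

```sql
-- Lists every quorum member (cluster nodes and any witness) with its state
-- and assigned vote. The DR node should show number_of_quorum_votes = 0
-- if its vote has been removed.
SELECT member_name,
       member_type_desc,
       member_state_desc,
       number_of_quorum_votes
FROM sys.dm_hadr_cluster_members
ORDER BY member_type_desc, member_name;
```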

3-Node + Witness: Single Node Failure

Failure of Primary node. The Secondary and the Witness have a quorum so the Secondary can become the Primary without human intervention.
Failure of Secondary node. The Primary and the Witness have a quorum so the Primary can remain the Primary without human intervention.
Failure of DR Secondary node. The Primary, the Secondary and the Witness have a quorum so the Primary can remain the Primary without human intervention.
Failure of Witness. As only two voting nodes are left, either the Primary or the Secondary will have its voting rights revoked. The cluster will be able to continue as normal unless there is another failure before the Witness is restored (see consecutive double failure scenarios below).

3-Node + Witness: Simultaneous Double Node Failure

Simultaneous failure of Primary and DR Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Primary and DR Secondary nodes go down. The Secondary has the remaining vote and can become the Primary without intervention.
Simultaneous failure of Secondary and DR Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Secondary and DR Secondary nodes go down. The Primary has the remaining vote and can continue to be the Primary without intervention.
Simultaneous failure of Primary and Secondary. Only the Witness remains with a vote. The DR Secondary is not valid for automatic failover due to Async commit.

3-Node + Witness: Consecutive Double Node Failure

Consecutive failure of Primary followed by Secondary. The witness is the only remaining voting member. The DR Secondary is not valid for automatic failover due to Async commit.
Consecutive failure of Primary followed by DR Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Primary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Secondary has the remaining vote and can become the Primary without intervention.
Consecutive failure of Secondary followed by Primary. The witness is the only remaining voting member. The DR Secondary is not valid for automatic failover due to Async commit.
Consecutive failure of Secondary followed by DR Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Secondary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Primary has the remaining vote and can continue without intervention.
Consecutive failure of DR Secondary, followed by Primary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Primary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Secondary has the remaining vote and can become the Primary without intervention.
Consecutive failure of DR Secondary, followed by Secondary. "Dynamic quorum behaviour" means that the witness will lose its vote after the Secondary node goes down (remember, the DR Secondary is ignored as we have configured it without voting rights). The Primary has the remaining vote and can continue as the Primary without intervention.

Bibliography