Skype for Business – Standard Edition Pool Failover Disaster [UPDATED]

Last week I blogged about my experience of trying to simulate a DR event by performing a planned failover of a paired Standard Edition pool deployment. Those of you who read that post will know it was not a pleasant experience. For those who have not read it, here is the link: http://wp.me/p5atJy-cz

Over the weekend, I spent a considerable amount of time trying to find a safe workaround for the problem without being too destructive. I decided to start from scratch and install a brand new domain with Skype for Business RTM, to eliminate any customer variables and any CUs that could contribute to this problem. Unfortunately, I was able to replicate the issue all too easily, even with zero configuration applied (literally an out-of-the-box vanilla config), and had to revert the deployment using the steps in my previous post.

However, I was able to work out a procedure that will produce less blood, sweat and tears for Skype admins.

The problem, as I understand it, is that the Invoke-CsManagementServerFailover cmdlet does not update the copy of the CMS that is synchronised to the backup server's RTCLOCAL database (the local copy) to make the backup server the master. Instead, the local CMS copy still expects the primary server to be the active master of the CMS database, and this is where the verification fails and replication stops.
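A quick way to see this mismatch is to compare what the CMS reports as its active master with where the AD SCP points. The following is only a minimal sketch: the function name is mine, the property names ActiveMasterFqdn and SqlServer are what I would expect from the default output of the two cmdlets (confirm them with Get-Member), and the direct comparison assumes a Standard Edition server where the CMS SQL instance lives on the Front End itself.

```powershell
# Sketch only: compares the active master recorded in the CMS with the
# SQL server that the AD service connection point (SCP) points at.
# Assumptions: ActiveMasterFqdn / SqlServer property names, and a Standard
# Edition server where the Front End FQDN is also the SQL server FQDN.
function Test-CmsMasterMatchesScp {
    param(
        $CmsStatus,    # output of Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
        $Connection    # output of Get-CsManagementConnection
    )
    # An empty status (as seen mid-failover) means no active master is registered.
    if ([string]::IsNullOrWhiteSpace($CmsStatus.ActiveMasterFqdn)) {
        return $false
    }
    # -eq on strings is case-insensitive in PowerShell, which suits FQDNs.
    return ($CmsStatus.ActiveMasterFqdn -eq $Connection.SqlServer)
}

# Usage on a Front End server with the Skype for Business module loaded:
# $status = Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus
# $scp    = Get-CsManagementConnection
# Test-CmsMasterMatchesScp -CmsStatus $status -Connection $scp
```

A result of False after running the failover cmdlet is consistent with the half-completed state described above, although it does not by itself prove that this is the cause.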

This is a definite bug / issue with Skype for Business and not a PICNIC (Problem In Chair Not In Computer) issue.

The process that I have found to work is as follows:

Failover Process

  1. First, make sure you have an up-to-date backup of the CMS and Location databases by running Export-CsConfiguration -FileName c:\cms.zip and Export-CsLisConfiguration -FileName c:\lis.zip
  2. Make sure that you have no replication issues by forcing replication with Invoke-CsManagementStoreReplication
  3. Next, confirm replication was successful by running Get-CsManagementStoreReplicationStatus and checking that both Front End servers report UpToDate as True
    Also check Event Viewer on both servers for informational event ID 3013 to ensure replication has completed
  4. Next, run a manual synchronisation of the CMS database using Invoke-CsBackupServiceSync -PoolFqdn fe1.domain.local -Confirm:$false
    On the master server you should see four events that confirm the synchronisation was successful: event IDs 4066, 4090, 2038, and 3013 again

    On the backup server you should see an informational event ID 3013 to confirm the data was received.
  5. If you have an Edge pool, now change its next hop to the backup server by running Set-CsEdgeServer -Identity edge.domain.local -Registrar fe2.domain.local
  6. From the backup server, run Invoke-CsManagementServerFailover -BackupSqlServerFqdn fe2.domain.local -BackupSqlInstanceName RTC -Force

    We expect this to fail as “normal” during validation. You can press CTRL + C to cancel it, or it will keep retrying for 15 minutes
  7. Make sure the AD SCP has been updated by running Get-CsManagementConnection

    If it has not, run Set-CsConfigurationStoreLocation -SqlServerFqdn fe2.domain.local -SqlInstanceName rtc
  8. At this point the CMS won’t be in a happy state. Running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus will return nothing

    and so will Get-CsManagementStoreReplicationStatus

  9. You will also notice that the File Transfer Agent is in a stopped state on the primary server. Start this service.
  10. Next, on the backup server, install the local configuration store again using the deployment wizard, pulling the information from the CMS database

  11. Once complete, run the command Enable-CsTopology

  12. Now run Step 2 from the deployment wizard, or bootstrapper.exe from the command line, to reinstall any missing components
  13. After this has run, restart the following services on the backup server: Skype for Business File Transfer Agent, Skype for Business Master Replicator Agent, Skype for Business Replica Replicator Agent
  14. Check that the CMS has now become properly active on the backup server by running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus

    Then run Get-CsManagementStoreReplicationStatus to make sure the backup server and all other servers are replicating
  15. Check that the SCP is now pointing at the backup server by running Get-CsManagementConnection

  16. Finally, check that the topology accurately shows the backup server as the active CMS server by running Get-CsService -CentralManagement
  17. Now we can run Invoke-CsPoolFailOver -PoolFqdn fe1.domain.local to fail the users and services over
  18. Once complete, change your internal DNS records to point to the backup server and change the reverse proxy rules as required.

This now completes the failover process.
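Steps 2, 3 and 14 all come down to "force replication, then wait until every replica is up to date". A minimal sketch of how that check could be scripted is below. The function names, timeout and polling interval are my own illustrative choices, and it assumes each row returned by Get-CsManagementStoreReplicationStatus carries a boolean UpToDate property.

```powershell
# Sketch only. Returns $true when at least one replica is listed and every
# listed replica reports UpToDate = $true (property name is an assumption
# based on the cmdlet's default output).
function Test-AllReplicasUpToDate {
    param([object[]] $Replicas)
    if ($null -eq $Replicas -or $Replicas.Count -eq 0) { return $false }
    $stale = @($Replicas | Where-Object { -not $_.UpToDate })
    return ($stale.Count -eq 0)
}

# Forces replication, then polls until all replicas are up to date or the
# timeout expires. Timeout and poll interval are arbitrary example values.
function Wait-CmsReplication {
    param(
        [int] $TimeoutSeconds = 300,
        [int] $PollSeconds = 15
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    Invoke-CsManagementStoreReplication
    do {
        Start-Sleep -Seconds $PollSeconds
        $replicas = @(Get-CsManagementStoreReplicationStatus)
        if (Test-AllReplicasUpToDate -Replicas $replicas) { return $true }
    } while ((Get-Date) -lt $deadline)
    return $false
}

# Usage on a Front End server:
# if (-not (Wait-CmsReplication)) { Write-Warning 'Replication has not converged' }
```

This does not replace checking Event Viewer for event ID 3013; it only saves re-running the status cmdlet by hand.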

Failback Process

To fail the CMS, users and service back to the primary server, perform the following steps.

  1. If the primary server is still powered on after failover, stop the remaining Skype for Business services that are running with Stop-CsWindowsService -Force
  2. Now run Start-CsPool –PoolFqdn fe1.domain.local –Confirm:$false
  3. Once started run the Invoke-CsManagementStoreReplication command
  4. Then run the Invoke-CsBackupServiceSync –PoolFqdn fe2.domain.local command
  5. Check Event Viewer on both servers for event IDs 4066, 4090, 2038, and 3013 (same as step 4 in the failover process)
  6. Next, change the Edge server's next hop pool back to the primary server using Set-CsEdgeServer -Identity edge.domain.local -Registrar fe1.domain.local
  7. Next, run the Invoke-CsManagementServerFailover -BackupSqlServerFqdn fe1.domain.local -BackupSqlInstanceName rtc -Force command

    Again, we expect this to fail verification
  8. Check the AD SCP has updated by running the following command Get-CsManagementConnection (should show fe1)
  9. After this has run, restart the following services on both servers: Skype for Business File Transfer Agent, Skype for Business Master Replicator Agent, Skype for Business Replica Replicator Agent
  10. Next, run the Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus command. This should show the primary server back as the active master and active file transfer agent
  11. Then run the Get-CsManagementStoreReplicationStatus command to check replication between servers. All should return true
  12. Now fail the users and services back to the primary server using the Invoke-CsPoolFailBack -PoolFqdn fe1.domain.local command
  13. Once completed, readjust your DNS records and reverse proxy rules to point back to the primary server.

This completes the failback process.
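Both procedures restart the same three replication services, so that step can be wrapped in a small helper. This is a sketch rather than a supported tool: the function name is mine, and it matches services by display-name wildcard because the exact display names and short service names may differ between builds.

```powershell
# Sketch only: restarts the CMS replication services on the local server.
# The wildcard patterns are assumptions derived from the service names used
# in this post; adjust them if Get-Service shows different display names.
function Restart-CmsReplicationServices {
    param(
        [string[]] $DisplayNamePatterns = @(
            '*File Transfer Agent*',
            '*Master Replic*',
            '*Replica Replicator Agent*'
        )
    )
    foreach ($pattern in $DisplayNamePatterns) {
        $services = @(Get-Service -DisplayName $pattern -ErrorAction SilentlyContinue)
        if ($services.Count -eq 0) {
            Write-Warning "No service matched display name pattern '$pattern'"
            continue
        }
        # Restart-Service also starts a service that is currently stopped.
        $services | Restart-Service -Force -PassThru |
            Select-Object DisplayName, Status
    }
}

# Usage, run locally on each Front End server in an elevated session:
# Restart-CmsReplicationServices
```

Matching on display name keeps the sketch independent of the short service names, at the cost of possibly matching more than one service if the patterns are too loose.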

Recovery Procedure

There may be scenarios where this does not work as expected. In that case the active master and active file transfer agent will refuse to be set, and we have to revert to the CMS backup we took before performing the failover.

If this is the case, the problem is as follows. In the CMS database both servers are listed as masters and file transfer agents, but only one server is registered (and that is the backup server).

How it should look is this: [screenshot not reproduced]

I attempted to edit this table and found that the dbo.Task table relies on the ComponentId field from the dbo.Component table to assign the master and file transfer agent tasks to the right CMS server. So I edited both tables to set the backup server as the registered master and FTA host using the following SQL commands:

UPDATE dbo.Component SET Registered = 1, LastUnregister=NULL WHERE ComponentId = 1

UPDATE dbo.Component SET Registered = 1, LastUnregister=NULL WHERE ComponentId = 2

I then updated the dbo.Task table to reflect the registered master and FTA:

UPDATE dbo.Task SET ComponentID = 1 WHERE TaskName = 'Master Replication Task'

UPDATE dbo.Task SET ComponentID = 2 WHERE TaskName = 'File Transfer Task'

And then removed the old primary server from the dbo.Component table

After restarting the services on both servers I found that this process DID NOT WORK!! I am posting this step to show that editing the CMS database directly is:

  1. Not Supported
  2. Only gives you more headaches!
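Looking at these tables is a different matter from editing them. If you only want to see which components are registered, a read-only query is enough. The sketch below assumes the Invoke-Sqlcmd cmdlet is available (it ships with the SQL Server PowerShell tooling, not with Skype for Business), that the CMS database is named xds, and that the instance is fe2.domain.local\RTC. The function name is mine, and the column names are taken from the UPDATE statements above.

```powershell
# Sketch only, read-only: lists component registrations and task assignments.
# Assumptions: Invoke-Sqlcmd is installed, the CMS database is called 'xds',
# and the column names match those used in the UPDATE statements in this post.
function Get-CmsRegistrationSnapshot {
    param(
        [string] $ServerInstance = 'fe2.domain.local\RTC',
        [string] $Database = 'xds'
    )
    $componentQuery = 'SELECT ComponentId, Registered, LastUnregister FROM dbo.Component;'
    $taskQuery      = 'SELECT TaskName, ComponentID FROM dbo.Task;'

    [pscustomobject]@{
        Components = @(Invoke-Sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $componentQuery)
        Tasks      = @(Invoke-Sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $taskQuery)
    }
}

# Usage:
# $snapshot = Get-CmsRegistrationSnapshot
# $snapshot.Components | Format-Table
# $snapshot.Tasks | Format-Table
```

Nothing here writes to the database, which is deliberate given how the edit attempt above turned out.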

If you are at this stage, the only procedure left is to restore the CMS from the backup you took at the beginning of the process. Here is how to do it:

  1. On both servers, stop the backup, file transfer agent and master replica agent services
  2. On the primary server run Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe1.domain.local -SqlInstanceName RTC -Clean
  3. On the primary server run Import-CsConfiguration -FileName c:\cms.zip
  4. On the primary server run Import-CsLisConfiguration -FileName c:\lis.zip
  5. On the backup server run Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe2.domain.local -SqlInstanceName RTC -Clean
  6. On each server run the install local configuration store step from the deployment wizard
  7. Start the services on both servers from step 1
  8. Check CMS has been restored to the primary server by running Get-CsManagementStoreReplicationStatus –CentralManagementStoreStatus

Summary

To summarise this post: CMS failover can be done, but not by following the method documented on TechNet. It may still fail, and if it does you need to revert your CMS to the state it was in before initiating the failover. So please, please take a backup of the CMS. It takes literally two seconds and will save you days of pain.
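Since the backup is the one step that rescues everything else, here is a minimal sketch of taking it with timestamped file names so that earlier exports are never overwritten. The folder C:\CmsBackup and both function names are illustrative assumptions. The two export cmdlets are the same ones used in step 1 of the failover process.

```powershell
# Sketch only: builds timestamped file names for the CMS and LIS exports.
function New-CmsBackupPaths {
    param(
        [string]   $Folder = 'C:\CmsBackup',    # example folder, change as needed
        [datetime] $Timestamp = (Get-Date)
    )
    $stamp = $Timestamp.ToString('yyyyMMdd-HHmmss')
    [pscustomobject]@{
        Cms = Join-Path $Folder "cms-$stamp.zip"
        Lis = Join-Path $Folder "lis-$stamp.zip"
    }
}

# Exports the CMS and LIS configuration to the timestamped paths.
function Backup-CmsAndLis {
    param([string] $Folder = 'C:\CmsBackup')
    if (-not (Test-Path $Folder)) {
        New-Item -ItemType Directory -Path $Folder -Force | Out-Null
    }
    $paths = New-CmsBackupPaths -Folder $Folder
    Export-CsConfiguration -FileName $paths.Cms
    Export-CsLisConfiguration -FileName $paths.Lis
    $paths    # return the paths so they can be logged or copied elsewhere
}

# Usage on a Front End server:
# Backup-CmsAndLis -Folder 'C:\CmsBackup'
```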

9 thoughts on “Skype for Business – Standard Edition Pool Failover Disaster [UPDATED]”

  1. The trick is to run Invoke-CsManagementServerFailover WITHOUT parameters from the destination pool. This way the command runs fine and you see no issues.

  2. I’ve encountered this issue on a pair of SfB EE pools running “CU0”.

    After a couple of goes at it I determined that my issue was because I hit ^-C at Failover Step 6 (when the text says you can!). In a later repeat test I waited and it failed-over correctly.

    I later found this article and I was wondering if AD replication might be contributing to the problem:
    https://ucvnext.org/2016/01/active-directory-site-considerations-in-lync-server-and-skype4b-server-pool-pairing/

    1. I’m also encountering this issue on a pair of EE pools running latest CU. It has nothing to do with stopping at step 6, it just doesn’t work for us. I’m forcing replication of course but it’s not the problem. The problem is exactly the one described in this blog and unfortunately, it seems that MS is not really aware of it (otherwise it would have been fixed i guess).
      I will open a ticket to make sure they are working on it.
      But you did a great job Mark. Thank you.

  3. It’s not listed as a fix in the June 2016 CU, but I tried an Invoke-CsManagementServerFailover at a client with paired SEs running that CU and it worked fine.

    Have just been trying it at a client who has to stay on the March 2016 CU (3rd party compliance software) and it fails exactly as described here.
