Skype for Business – WARNING: Standard Edition Pool Failover Disaster

I have been setting up a Standard Edition pool pair for disaster recovery for a customer and wanted to share my experiences around failover. The deployment and migration of services, users and data from the legacy Lync installation went without issues. I successfully paired the two Skype for Business Standard Edition servers, and both the backup and replication services were happily synchronising data.

A few prudent PowerShell commands to prove correct replication returned all values as expected. Once I was happy with the configuration, I wanted to perform a controlled failover from the primary to the backup pool, including users and the CMS (Central Management Store), to prove the failover process worked as expected. At this point I would like to thank Chris Hayward (@WeakestLync) for warning me of a potential issue during failover that screws up your CMS.

It turns out that when performing the failover, Skype for Business leaves the CMS active on both servers! However, this is not immediately apparent, so I wanted to detail how I identified the problem and what I had to do to resolve it. I don’t have any screenshots because I was too busy trying to fix it, so I will do my best to explain.

Failing Over

To perform the failover, I followed the steps listed on TechNet (https://technet.microsoft.com/en-us/library/jj204678(v=ocs.15).aspx), as they have worked fine in previous versions and the process is the same for Skype for Business.

When I ran the Invoke-CsManagementServerFailover cmdlet with the -WhatIf parameter, the results correctly showed that the CMS was on the primary server and would be failed over to the backup pool server.

Running Get-CsManagementStoreReplicationStatus returned TRUE for every server in the topology.

Running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus returned the primary server as the Active Master and Active File Transfer Agent, with the backup server listed in the Active Replicas list, as expected.

Running Get-CsService -CentralManagement showed the primary server as active for the CMS and the backup server as not active (False), as expected.

Downloading the current topology showed the primary server as the active CMS.

Running Get-CsBackupServiceStatus -PoolFqdn fe1.domain.local returned the server as being in a Normal state, and the same for the backup server.

To ensure that the CMS was properly up to date on both servers, I then ran Invoke-CsBackupServiceSync -PoolFqdn fe1.domain.local and checked for any replication issues in Event Viewer and with CLS logging using the HADR scenario. Everything looked positive.

One last step to ensure the servers were up to date was to force replication to the RTCLOCAL databases on each server by running the Invoke-CsManagementStoreReplication command.
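
The replication check above lends itself to a scripted pre-failover gate. The following is only a sketch: Get-StaleReplica is a hypothetical helper name, not part of the Skype for Business module, and it assumes the objects returned by Get-CsManagementStoreReplicationStatus expose ReplicaFqdn and UpToDate properties. It lists any replica that is not reported as up to date.

```powershell
# Hypothetical helper: returns the FQDNs of replicas that are not up to date.
# Input objects are assumed to carry ReplicaFqdn and UpToDate properties.
function Get-StaleReplica {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object[]]$ReplicationStatus
    )
    process {
        foreach ($replica in $ReplicationStatus) {
            if ($replica.UpToDate -ne $true) {
                $replica.ReplicaFqdn
            }
        }
    }
}

# Possible usage on a server with the SkypeForBusiness module loaded:
# $status = @(Get-CsManagementStoreReplicationStatus)
# if ($status.Count -eq 0) { throw 'No replication status was returned.' }
# $stale = @($status | Get-StaleReplica)
# if ($stale.Count -gt 0) { throw "Replication is behind on: $($stale -join ', ')" }
```

The empty-result check in the usage comments matters: an empty result would otherwise pass the gate silently, because there would be no replicas to flag.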

Once I had re-run the Get commands above to triple-check everything and was absolutely sure I was in a position to test, I decided, on Chris’s advice, to take a backup of the XDS and LIS databases, just in case.

Export-CsConfiguration -FileName c:\cms.zip

Export-CsLisConfiguration -FileName c:\lis.zip

Now I went ahead and followed the TechNet procedure, setting the Edge server next hop to the backup server by running Set-CsEdgeServer -Identity edgepool.domain.local -Registrar fe2.domain.local

Next, I ran Invoke-CsManagementServerFailover -BackupSqlServerFqdn fe2.domain.local -BackupSqlInstanceName RTC -Force

Here is where the problems started…

During the failover, the verification process failed to verify the CMS on the backup server, with the following error:

“Backup Central Management Store state is Active, the expected status is Backup. Note that if the local replica is out of date, the topology document may be obsolete. Ensure that the local replica is up to date, and run Test Management Server Cmdlet. Central management server verification failed. Verification execution will be retried once a minute for 14 more minutes. Since Failover has already finished, the user can press Ctrl + c to end the current verification task at any time, and Failover will not be affected”

I let all the retries complete, but none succeeded.

I then ran the following commands to see what had actually happened and what state the CMS was in.

Running Get-CsManagementStoreReplicationStatus did not return any values at all.

Running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus did not return any values at all.

Running Get-CsService -CentralManagement showed that the backup server was the ACTIVE server for the CMS.

Running Get-CsManagementConnection returned the primary server as the ACTIVE CMS.

Downloading the current topology showed the primary server STILL as the ACTIVE CMS.

Running Get-CsBackupServiceStatus -PoolFqdn fe1.domain.local returned the server as being in an Error state, and the same for the backup server.

So I double-checked the properties of the Active Directory Service Connection Point (SCP) for Skype for Business using ADSI Edit, under the Configuration context:

CN=<topology guid>,CN=Topology Settings,CN=RTC Service,CN=Services,CN=Configuration,DC=domain,DC=local

The msRTCSIP-BackEndServer attribute was set to the primary server fe1.domain.local/RTC
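
The disagreement between these sources is the clearest symptom of the problem, so it can help to record each answer and compare them in one place. A minimal sketch, assuming the values are copied in by hand from the checks above; Test-CmsSourceAgreement is a hypothetical helper, not a Skype for Business cmdlet.

```powershell
# Hypothetical helper: compares the active CMS host reported by several sources.
# $Sources maps a label (for example 'SCP') to the server that source reports.
function Test-CmsSourceAgreement {
    param(
        [Parameter(Mandatory)]
        [hashtable]$Sources
    )
    # Normalise each value: lower-case it and drop any '\instance' or '/instance' suffix.
    $names = foreach ($key in $Sources.Keys) {
        ([string]$Sources[$key]).ToLowerInvariant() -replace '[\\/].*$', ''
    }
    $distinct = @($names | Sort-Object -Unique)
    [pscustomobject]@{
        Agree       = ($distinct.Count -eq 1)
        ActiveHosts = $distinct
    }
}

# Values as reported in this post after the failed failover.
$result = Test-CmsSourceAgreement -Sources @{
    'Get-CsService'              = 'fe2.domain.local'
    'Get-CsManagementConnection' = 'fe1.domain.local'
    'Topology document'          = 'fe1.domain.local'
    'SCP'                        = 'fe1.domain.local/RTC'
}
$result.Agree        # False: two different hosts claim the active CMS
$result.ActiveHosts
```

If Agree comes back False before a failover has even been attempted, that is a good reason to stop and investigate first.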

At this point I did a lot of panicking and head scratching, trying various commands, restarting services and so on to get the backup server to show as the active server and to restart replication. By restarting the Replica Replicator Agent and File Transfer Agent services on both front end servers, I managed to get some results back from the following commands.

Running Get-CsManagementStoreReplicationStatus returned the replication status of all servers as FALSE.

Running Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus returned values for the Active Replicas, but nothing for the Active Master FQDN or Active File Transfer Agent FQDN, so replication was never going to work.

Although running Set-CsManagementServer -Identity fe2.domain.local did update the SCP in AD, it did not set this server as the Active Master or Active File Transfer Agent.

At this point there were no errors being reported in the Lync application log and users had full feature access.

I then decided to take a look at the XDS database in SQL Server Management Studio to see what it was reporting as the master server, so I opened the database and the dbo.Component table.

This table showed 3 entries. I was expecting only 2, as I have only 2 CMS servers! The entries were as follows:

Fqdn              Component  Registered
Fe1.domain.local  Master     0
Fe2.domain.local  Master     1
Fe1.domain.local  Fta        1

How it should have looked:

Fqdn              Component  Registered
Fe1.domain.local  fta        1
Fe2.domain.local  Master     1
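
The comparison between the observed and expected tables can be expressed as a small check. This is only a sketch, under the assumption that the healthy layout is exactly one registered Master row and one registered Fta row, as in the expected table; it does not check which server each row names. Test-CmsComponentRows is a hypothetical helper, and the rows are typed in by hand rather than read from SQL.

```powershell
# Hypothetical helper: checks rows copied from dbo.Component against the
# layout this post expects (one registered Master and one registered Fta).
function Test-CmsComponentRows {
    param(
        [Parameter(Mandatory)]
        [object[]]$Rows
    )
    $problems = @()
    foreach ($role in 'Master', 'Fta') {
        # -eq on strings is case-insensitive, so 'fta' and 'Fta' both match.
        $matching   = @($Rows | Where-Object { $_.Component -eq $role })
        $registered = @($matching | Where-Object { [int]$_.Registered -eq 1 })
        if ($registered.Count -ne 1) {
            $problems += "$role has $($registered.Count) registered rows, expected 1"
        }
        if ($matching.Count -gt 1) {
            $problems += "$role has $($matching.Count) rows in total, expected 1"
        }
    }
    # Emits one string per problem; emits nothing when the layout matches.
    $problems
}

# The three rows observed after the failed failover.
$observed = @(
    [pscustomobject]@{ Fqdn = 'fe1.domain.local'; Component = 'Master'; Registered = 0 }
    [pscustomobject]@{ Fqdn = 'fe2.domain.local'; Component = 'Master'; Registered = 1 }
    [pscustomobject]@{ Fqdn = 'fe1.domain.local'; Component = 'Fta';    Registered = 1 }
)
Test-CmsComponentRows -Rows $observed
```

A check like this only reads and reports. It is deliberately not a tool for editing the table.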

So at this point it looked as though the XDS database was ACTIVE ON BOTH NODES. Knowing I already had a backup, I decided to try to manipulate this table back into the expected state. What a bad move that was: it only made things worse by adding a new row, like so:

Fqdn              Component  Registered
Fe1.domain.local  Master     0
Fe2.domain.local  Master     1
Fe1.domain.local  Fta        1
Fe1.domain.local  Master     1

Now faced with the total loss of the CMS database, I had no choice but to revert my changes and restore the CMS from the backup. The process below details my recovery steps:

  1. On the primary server, ran Set-CsManagementServer -Identity fe1.domain.local to update the SCP back to the primary server
  2. On the primary server, ran Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe1.domain.local -SqlInstanceName RTC -Clean
  3. On the backup server, ran Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe2.domain.local -SqlInstanceName RTC -Clean
  4. Stopped the replication services and backup service on both servers
  5. On the primary server, ran Import-CsConfiguration -FileName c:\cms.zip to import the CMS data from my backup
  6. On the primary server, ran Import-CsLisConfiguration -FileName c:\lis.zip to import the LIS data from my backup
  7. Ran Enable-CsTopology
  8. Launched the Skype for Business Deployment Wizard and then ran Step 1 to reinstall the Local Configuration Store using the data from the CMS on the primary server
  9. Ran Step 2 Install / Remove components on the primary server
  10. Launched the Skype for Business Deployment Wizard and then ran Step 1 to reinstall the Local Configuration Store using the data from the CMS on the backup server
  11. Ran Step 2 Install / Remove components on the backup server
  12. Ran Get-CsManagementConnection, which showed the primary server as the active node
  13. Ran Get-CsService -CentralManagement, which showed the primary server as the active node and the backup as false (expected)
  14. Started the backup and replica services on both front end servers
  15. Ran Invoke-CsBackupServiceSync -PoolFqdn fe1.domain.local
  16. Ran Invoke-CsManagementStoreReplication
  17. Ran Get-CsManagementStoreReplicationStatus, and the results returned TRUE
  18. Ran Get-CsManagementStoreReplicationStatus -CentralManagementStoreStatus, and the active master and active file transfer agent were now set to the primary server
  19. Event Viewer showed no errors and replication was now happening OK

So the biggest lesson learned here: take a backup of the CMS before failing over the pool, just in case this happens to you. Without it, I am not sure I would still have a job!

Workaround Theory

As I am not the only one who has experienced this issue, it could be a problem with Skype for Business itself. I feel that if I try to fail over the CMS again, the same problem will occur. So I have come up with a theory that I am going to attempt to validate in a lab, but I welcome any suggestions.

1. Create a daily backup of the XDS and LIS databases and store it on the backup pool server (done with PowerShell). Something like this gives me 5 points of recovery:

# CMS backup script (workaround)
Import-Module SkypeForBusiness

# Backup location (the UNC path must be quoted)
$backupfolder = '\\fe2.domain.local\CMS_BACKUP'
# Days of backups to keep
$retention = 5
# Backup file names, stamped with today's date
$date = Get-Date -Format 'dd-MM-yy'
$cmsfilename = "CMS-$($date).zip"
$lisfilename = "lis-$($date).zip"

# Export first and stop on failure, so old backups are never pruned without a fresh one
Export-CsConfiguration -FileName "$($backupfolder)\$($cmsfilename)" -ErrorAction Stop
Export-CsLisConfiguration -FileName "$($backupfolder)\$($lisfilename)" -ErrorAction Stop

# Backup store cleanup: remove files older than the retention period
$limit = (Get-Date).AddDays(-$retention)
Get-ChildItem -Path $backupfolder -Recurse -Force |
    Where-Object { -not $_.PSIsContainer -and $_.CreationTime -lt $limit } |
    Remove-Item -Force

(Add Export-CsRgsConfiguration too if you have Response Groups set up.)

2. When failing over to the backup pool, set the Edge server(s) next hop and run Invoke-CsPoolFailover to fail the users across.

3. Then repeat the steps to reinstall the CMS on the backup server in a clean state, and reset the SCP. At this point Skype for Business should (in my mind) treat the backup server as the master.

4. When failing back, repeat the process on the primary.

I guess the best method here is to move the CMS database to a SQL cluster away from the Standard Edition servers, and that is probably going to be my recommendation to customers moving forward.

Anyway, the moral of this story is: make sure you have a backup, and make sure you test failover in a controlled manner (being aware of this issue) before having to rely on it for real. If anyone has suggestions, wants to share their experiences, or receives information from Microsoft about this, please share in the comment section below.

22 thoughts on “Skype for Business – WARNING: Standard Edition Pool Failover Disaster”

  1. Glad to hear I’m not the only one that had this happen. I was able to resolve it with this:
    invoke-csmanagementserverfailover -backupsqlserverFQDN -backupsqlinstancename rtc -force
    YMMV, but it’s what fixed it for me.. after a very panicked two hours of banging my head against it.

  2. Same here on an EE pool pair – we have an always on availability group in site A and a single SQL server in site B (DR) with our CMS DB. Same type of issues happened during testing failover. This might not just be SE pools, and likely needs a patch somewhere IMO. I am surprised there isn’t more chatter about this given how long the product has been out.

  3. I had the exact same issue yesterday. First time I set up SE pool pairing on skype4b so far, everything else I have done was enterprise edition. Your article allowed me to recover it to a working state, since I could invoke-csmanagementserverfailover until I was blue in the face but it’ll always end up saying that the SCP in AD for the active CMS didn’t match what it expected, and that both copies were active. Thankfully I had taken a full backup in the morning of everything before I attempted failover (by pure chance, not because I thought this would fail).

  4. You shouldn’t have used the -Force parameter when invoking the CMS failover, because the primary CMS was still active and reachable. -Force must only be used when the primary pool that hosts the CMS is unavailable. (TechNet: Invoke-CsManagementServerFailover -Force: You should not use the Force parameter if you are running the cmdlet for purposes other than disaster recovery, as it will not account for replication during the failover. When the parameter is not used, the cmdlet will first make sure all replications are done, then set the source DB to read-only mode.)

  5. Has anyone actually logged a case with MS on this? I thought I was off my nut when I hit this about 4 months ago. Assuming my memory is working correctly, my resolution was to 1) shut down the FE nodes of the failed over pool, and 2) shut down failed pool SQL. Wait a bit and then bring SQL back up again, and then the FE’s, and normality restored. Was hoping CU1 was going to have resolved this major road block, but nothing about it in the KB’s, and as if I’m going to retest my process, whatever it was, I may have just been lucky.

    1. I reported it to the product team, but it fell on deaf ears. People say not to use the -backupsqlserver parameter when both servers are online, but there are a couple of issues with that: 1) the command wants us to specify the parameter (it is mandatory), and 2) even killing the primary server and failing over only sometimes worked for me; most of the time it did not. Failing back is the issue, when the backup (now primary) and the backup (was primary) are both online. That fails 100% of the time for me.

      1. That’s also my experience, I’ve been able to reproduce the same issue on different pools 100% of the time. As things stand I’m not even recommending pool pairing anymore, as it’s a total hit and miss to recover from a failover at the moment. Funny thing is that it used to work fine in Lync 2013…

    1. Hi

      Yes, I checked the TechNet article, but TechNet is not always correct. When trying the command, it would not execute without specifying the back-end SQL destination (as in, it prompted you for it and would not continue without it); that’s the point. I haven’t tried in CU1 yet. Hopefully it is fixed.

      thanks

  6. I performed an Invoke-CsManagementServerFailover in a controlled failover (not sure why you were using the -sql switches; they’re only supposed to be used in a DR event where the primary is down), and it was successful. Woohoo. So I’m thinking something was fixed in CU1. After failing over (technically, failing back) I also had to rerun “Setup or Remove Skype for Business Server Components” in the deployment wizard. I tempted fate yet again and invoked back, and was successful again. It still took 4-6 minutes, with warnings that the CMS in AD doesn’t match the one in the topology and that it will retry every minute for another 14 minutes. You can press Ctrl+C to end this “verification” process; I chose not to do that…

    I treat this as a good sign. Just wished they mentioned it in the CU KB’s somewhere.

  7. Pushed my luck, tried it again in order to test actual user functionality. Failing over to the DR site was fine, but when I was done testing, the failback of the CMS failed like above. I tried your process unsuccessfully a couple of times before figuring out that I needed to break the SQL mirror for the XDS and LIS databases. (XDS for sure; I couldn’t run the -Clean job with it in a mirrored state.) Too much fun for 4 am…

    So it’s hit or miss that a controlled fail over will work or not, or the failback. The only thing I maybe did differently is after I failed over, I didn’t run an Enable-CsTopology, which I had in the past for giggles.

  8. I had something similar on an Enterprise paired pool where for some reason I ended in a similar state for FTA and Master agents.

    I noticed that the Pool State wasn’t active for my pool that I brought up after the DR test so I just tried it this way and it worked:

    Set-CsRegistrarConfiguration -Identity XXXXX -PoolState active

    The Master and FTA agents picked it up from there and properly marked the pools as primary and backup afterwards.

  9. Hello Guys,

    This is pretty great, and I think I’m in the same situation. However, my issue is closer to the instance, as I have the default instance instead of RTC.

    Let’s assume you are using the default instance in this scenario: what would your -BackupSqlInstanceName value be? Is it default, or do I leave it blank as default? MSSQLServer doesn’t work either way.

    Invoke-CsManagementServerFailover -BackupSqlServerFqdn fe2.domain.local -BackupSqlInstanceName "" -Force

    Many Thanks

  10. I hit the same problem. I’ve used your process to restore from backup (fortunately we have a script that makes nightly backups using the export commands on all customer servers).

    Once restored I had a look in the working databases and I think I could have recovered if I had a working one to compare with:

    dbo.Component — ‘how it should have looked’ — rtc on your fe2 (what is now active as it’s failed over to it) should have the two entries but *both pointing to itself*. On fe1 that table has *no* entries.

    That might have been enough, don’t know, but I also got messages about it being in migrating state so checked some of the other tables. dbo.DbConfigInt has a value CurrentState. That needs to be 0 on the active server (FE2 in your scenario) and 3 on the backup server. In mine it was 1 on one and 3 on the other. Wish I’d tried 0!

    Also dbo.Batch might be relevant, the PartialVersion value had one entry at 2 and the other at 3.

    I edited the SCP manually via ADSIEdit, that’s easy enough.

    Thanks for your article.

    Rick

    1. Thanks Rick for your input here, I noticed a few of your findings in my initial troubleshooting, but wasn’t brave enough to go ahead and edit the DB directly. I also believe though not tried yet that CU4 for SfB resolves this issue?
