Skype for Business – [FIXED] [BUG] LS Backup Service Error 4069

This week I have been installing a Skype for Business Standard Edition in a pool pair for disaster recovery / manual failover. The project began with an existing mixed topology of Lync 2010 and 2013 servers that had been left in an unfinished migration coexistence state. The customer wanted to move from this to a Skype for Business platform that gave them a degree of resiliency. So with a crack of the knuckles and stretch of the toes I began to work this out for them.

Everything went smoothly and I successfully migrated all users, devices and services from Lync 2013 to Skype for Business. However, when enabling pool pairing for resiliency I noticed a strange issue with the backup service: the backup pool server was logging an “LS Backup Service” error, Event ID 4069, every 10 to 11 minutes. So naturally, I turned to my faithful online support manual (Google) for ideas. To my sheer panic I soon realised that this error must be unique, because there was literally nothing documented on it. Therefore, cue the coffee and let’s dig in and try to resolve this the old-fashioned way.

Firstly I analysed the error message.

From this message the issue appears to be SQL related, in that the service cannot perform an update to the dbo.Batch table in the rtcxds database. Before jumping straight in, I wanted to explore a theory that the data may be in an incorrect format. A similar error is produced when SIP addresses contain invalid characters, but that is Event ID 4070. So I audited all SIP addresses and changed any that had special characters as part of the SIP address, e.g. john.jones-smith@domain.com and john.O’Smith@domain.com, to valid SIP address formats. I then unpaired the pools and reinstalled the CMS databases on the backup pool server using the following command:

Install-CsDatabase -CentralManagementDatabase -SqlServerFqdn fe2.domain.local -SqlInstanceName RTC -Clean
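Going back to the SIP address audit mentioned above, it could be sketched in the Skype for Business Management Shell roughly as follows. The pattern is my own assumption about what counts as a "special character" (anything other than letters, digits, dots and underscores before the @); adjust it to whatever you consider suspect.

```powershell
# Assumption: only letters, digits, "." and "_" are treated as safe in the user part.
# Get-CsUser returns SipAddress in the form "sip:user@domain.com".
$safePattern = '^sip:[A-Za-z0-9._]+@'

# List every enabled user whose SIP address falls outside the safe pattern
Get-CsUser -ResultSize Unlimited |
    Where-Object { $_.SipAddress -notmatch $safePattern } |
    Select-Object DisplayName, SipAddress
```

This only lists candidates for review; it does not change any addresses.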

Once this had been reinstalled, I paired the pools once again for resiliency. However, the issue still persisted. I then decided to try and get hold of the XML file referenced in the event log. Navigating to the backup directory \\fe2.domain.local\skypeshare\2-backupservice-7\backupstore\userservices\presencefocus\data, I could not see this XML file. However, I did notice a backup.zip file being created each time this event occurred and then being deleted quickly. Copying this zip file using Windows Explorer is almost impossible because of how quickly it appears and disappears. So cue some PowerShell and robocopy foo. I created a simple script (can’t really call it a script) that copied this location in an infinite loop using:

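# Loop forever so the short-lived backup.zip is caught whenever it appears
while ($true) {
    # /ZB = restartable mode, falling back to backup mode; /W:0 = no wait between retries
    robocopy 'C:\SkypeShare\2-BackupService-7\BackupStore\UserServices\PresenceFocus\Data' 'C:\Capture' /ZB /W:0
}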

Once I was able to capture the XML file, I studied it to see if there were any invalid characters I had missed. There weren’t. I also noticed that the XML name changed each time the event occurred and the data inside the XML referenced different users each time. Therefore this proved a false lead and dead end.

I also checked the backup state of each pool with the following commands:

Get-CsBackupServiceStatus -PoolFqdn fe1.domain.com

and

Get-CsBackupServiceStatus -PoolFqdn fe2.domain.com

FE1 was showing an Overall Export Status of NonSteadyState and FE2 was showing FinalState. (This blog describes the states in detail: http://blogs.technet.com/b/jenstr/archive/2012/11/06/what-does-the-status-reported-by-get-csbackupservicestatus-mean.aspx)

At least at this point I was able to prove that the issue was not caused by malformed data, so the problem had to be within the backup service process. I revisited the event and looked more closely at its description. The following sentence stood out for me:

“Snapshot isolation transaction aborted due to update conflict. You cannot use snapshot isolation to access table ‘dbo.Batch’ directly or indirectly in database ‘rtcxds’ to update, delete, or insert the row that has been modified or deleted by another transaction.”

So here the explanation says that the backup service is executing a SQL statement that sets the transaction isolation level to SNAPSHOT. Transaction isolation levels control the locking and row versioning behaviour of Transact-SQL statements issued by a connection to SQL Server. The default level is “READ COMMITTED”, and that default cannot be changed server-wide; a different level has to be requested by the connection itself. This means that one or more statements in Skype for Business are switching the level to “SNAPSHOT” when they run.
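As a quick sanity check, one could confirm whether snapshot isolation is enabled on the rtcxds database at all. This is a minimal sketch, assuming Invoke-Sqlcmd is available (from the SqlServer or SQLPS module) and that the backup pool's instance is fe2.domain.local\RTC as in the earlier command:

```powershell
# Assumption: Invoke-Sqlcmd is provided by the SqlServer or SQLPS module.
# sys.databases exposes the snapshot settings for every database on the instance.
$query = @'
SELECT name,
       snapshot_isolation_state_desc,
       is_read_committed_snapshot_on
FROM sys.databases
WHERE name = 'rtcxds';
'@

Invoke-Sqlcmd -ServerInstance 'fe2.domain.local\RTC' -Database 'master' -Query $query
```

A snapshot_isolation_state_desc of ON means sessions are allowed to request SNAPSHOT, which is consistent with the error in the event description.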

The description of SNAPSHOT from MSDN is:

“Specifies that data read by any statement in a transaction will be the transactionally consistent version of the data that existed at the start of the transaction. The transaction can only recognize data modifications that were committed before the start of the transaction. Data modifications made by other transactions after the start of the current transaction are not visible to statements executing in the current transaction. The effect is as if the statements in a transaction get a snapshot of the committed data as it existed at the start of the transaction.”

So, Skype for Business is trying to work from a consistent point-in-time copy of the rows it edits, which on the face of it is a good thing. However, as the event description says, a SNAPSHOT transaction is aborted if another transaction has modified or deleted the same row in the meantime, and that appears to be what is happening to dbo.Batch.

At this point I believed I had identified a bug. However, being inquisitive by nature, I decided to try and find a workaround while waiting for Microsoft. Hoping that the SQL statement at fault lived in a stored procedure, I opened the rtcxds database on the backup pool server using SQL Server Management Studio.

After some investigation and trial and error I found the procedures causing this error and changed their transaction isolation levels from SNAPSHOT to SERIALIZABLE. Doing this changed how the tables were accessed when the statements were executed.
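I found the procedures by hand, but a search of the procedure definitions could narrow things down faster. The following is a sketch only, again assuming Invoke-Sqlcmd and the fe2.domain.local\RTC instance; the LIKE pattern is my guess at how the statement is written inside the procedures, so different spacing would need a different pattern.

```powershell
# Assumption: the procedures contain the literal text "ISOLATION LEVEL SNAPSHOT".
# sys.sql_modules holds the source text of stored procedures, functions and views.
$query = @'
SELECT OBJECT_NAME(object_id) AS ProcedureName
FROM sys.sql_modules
WHERE definition LIKE '%ISOLATION LEVEL SNAPSHOT%'
ORDER BY ProcedureName;
'@

Invoke-Sqlcmd -ServerInstance 'fe2.domain.local\RTC' -Database 'rtcxds' -Query $query
```

The query is read-only. It may return more procedures than the ones that actually trigger the 4069 error, so treat the result as a list of candidates.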

The description of SERIALIZABLE from MSDN is:

  • Statements cannot read data that has been modified but not yet committed by other transactions.
  • No other transactions can modify data that has been read by the current transaction until the current transaction completes.
  • Other transactions cannot insert new rows with key values that would fall in the range of keys read by any statements in the current transaction until the current transaction completes.

Changing the transaction isolation level to SERIALIZABLE tells SQL Server to lock the rows involved, edit them in place and then release them, and not to allow any other statement or procedure to edit those rows until the lock has been released.

Below are the stored procedures to edit. In each one, change the transaction isolation level from SNAPSHOT to SERIALIZABLE and press Execute to save the procedure:

  • XdsBackupAllItems
  • XdsDeleteItems
  • XdsUpdateReplicaData
  • XdsResyncItems

Once these changes had been saved to the stored procedures, I checked the event viewer on the backup pool server for Event ID 4069 again.

The event was no longer being logged. To make sure I hadn’t broken anything, I then checked the backup sync status again to confirm it was healthy.

Result!!
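The event log check can also be done from PowerShell rather than Event Viewer. This is a minimal sketch to run on the backup pool server, assuming the events are written to the "Lync Server" log under the "LS Backup Service" source; the one-hour window is an arbitrary choice that should cover several of the 10 to 11 minute cycles.

```powershell
# Assumption: Skype for Business Server logs to the "Lync Server" event log.
$filter = @{
    LogName      = 'Lync Server'
    ProviderName = 'LS Backup Service'
    Id           = 4069
    StartTime    = (Get-Date).AddHours(-1)
}

# Get-WinEvent raises an error when nothing matches, so suppress it and count instead
$events = @(Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue)
"Event 4069 occurrences in the last hour: $($events.Count)"
```

A count of zero an hour or so after the change suggests the error has stopped recurring.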

I should warn that this workaround is my own and has not been vetted or suggested by Microsoft, so use it at your own risk. I would imagine any update to the CMS database from a cumulative update (CU) will revert these settings back to the default.

Hopefully, this will help you and not take up 2 days of your time like it did mine :)

4 thoughts on “Skype for Business – [FIXED] [BUG] LS Backup Service Error 4069”

  1. Hi, I am running into the exact same error, where I see the snapshot error in the Event Viewer (error 4069). This happens when I am trying to fail over the CMS from the Prod pool to the DR pool. The failover fails with a different error (a network-related or instance-specific error etc.). However, in the FE event logs I see this error message coming up. Would it be wise to change SNAPSHOT to SERIALIZABLE on the SQL side? I have done quite some troubleshooting on the file share permissions, firewall, backup service and FE server, as well as removing and re-syncing the resiliency and uninstalling and reinstalling the DB.

    1. Hi
      There are some serious issues with the failover of the CMS with Skype for Business. So much so that I would advise at this point that you make sure you have a backup of the XDS and LIS databases. Failover can error and cause massive disruption and corruption to the CMS. The 4069 error for me did not occur during a failover, but was seen before one. Having relived my experiences, I can confirm that changing the SQL procedures does not affect the failover process itself. Therefore, if you change the procedures as shown, the 4069 error should not recur, but this may not fix your failover problem.
      Thanks
