Skype for Business–Time after Time driving you up the PSOM wall

I was troubleshooting a problem with a customer where they had a pair of Skype for Business Edge Servers but where experiencing intermittent issues with sharing whiteboards, polls and PowerPoints externally. For a portion of users this problem did not occur, but for the others no matter what they tried, the external party could not connect to the whiteboard, poll or PowerPoint. The external party received a “could not connect to the presenting server right now” error message. Federation, Desktop Sharing and AV were fine, just Whiteboards, Polls and PowerPoints where problematic.

Naturally, the first port of call in the unwritten troubleshooting guide was to test connectivity to both Web Conference Edge servers. This means obtaining the public IPs of both web conference edge services by using nslookup and preceding to telnet to port 443. Telnet at least allows me to prove a connection at TCP level, and sure enough I could create a TCP connection to both Edge servers. So maybe not a firewall problem yet. The next item on my list was to see if I could eliminate an edge server from the investigation. As some users had no problems, and others did, it seemed that perhaps only one Edge server was to blame for this. Before I waded in and started stopping services I wanted to perform some due diligence and investigated the event logs on both Edge servers for any obvious errors. Both servers reported no errors to worry about during the time where whiteboards etc. could not be displayed to external users.

Next I tried to use centralized logging to see if I could identify the problem. Running from a front end server, selecting the AlwaysOn scenario and both Front and Edge pools, when I clicked “Start Scenario” I noticed that CLS failed to connect to both Edge servers. Again breaking out Telnet I attempted TCP connection to ports 50001,50002 and 50003 to both inside interfaces of the Edge servers and returned a successful connection.

Returning to the Edge servers I decided to double check the certificates of both the external and internal interfaces. Running CertUtil.exe –VerifyStore My command returned a successful pass for both external and internal certificate chains, so the certs were good. I decided at this point that perhaps I needed to see what was going on at packet level, so enter WireShark. Starting a packet capture on the external interfaces at first on both Edge servers and attempting to connect to a whiteboard from an external source to generate some traffic enlightened the issue a little for me.

It seemed that for one Edge server I was seeing the incoming SYN packet from the external source to the web conference edge service. The server was replying with an ACK, then the TLS key exchange succeeded followed by a successful TLS handshake. Then the application data packet requested by the external source was immediately replied to with a TCP RESET packet from the server. I decided that since we had 2 Edge servers in the pool we could temporarily stop the Skype for Business services on this Edge server and attempt the same thing on the second. After stopping these services, attempting whiteboard sharing externally worked for everyone. So now at least I can prove it is one edge server causing this.

One thing that bugged me was the fact I could not use centralized logging on either edge server, the connection was simply not established. On the working edge server, I decided to capture the internal interface using WireShark and sure enough, I could see a similar behavior in that I saw the connection come in from the front end server and immediately hit with a TCP RESET response from the internal interface.

So now I starting to think of what it could be that is causing this. So I spent a lot of time comparing the configuration between the working Edge and the problem edge server, from certificates, to group policy, to driver & firmware versions, to windows time services. Everything was identical. Now I should say that when troubleshooting you lose all sense of time and space, so the problem wasn’t immediately obvious to me.

I was just about to suggest that perhaps we need to rebuild the OS on this server when I noticed that the system time on my laptop was one hour ahead of the time on BOTH edge servers! Checking the region settings on the Edge servers they were set to UTC London time zone as expected and daylight saving was being observed. but these where one hour behind. I changed the time to the “working” edge server first and tested centralized logging. It works, great! I wonder if this is the problem for Edge 1 that will not allow whiteboards?

Changing the time on the non-working edge server to the correct time instantly brought back the whiteboard functionality.

However, the story does not end there. There where still unanswered questions, like how come Edge 2 that also had the wrong time continued to allow whiteboards, polls and PowerPoints while the first edge didn’t? The answer to this I am still yet to find, and only have an assumption that the first edge server was the CSR generating server and the first edge server introduced to the edge pool. Perhaps as a result it is more sensitive to the effects of time slippage? Then the time slipped again, back one hour after about 15 minutes. Before I waded in to w32tm to set the time server, I wanted to prove that in fact it was the action of me setting the correct time that resolved the issue. So attempted whiteboard sharing again with just Edge 1 services started to force the connections out through that server. Low and behold, whiteboard sharing works…. OK, I was expecting it to fail. I decided to wait a bit and after about 2 hours I attempted whiteboard sharing again. This time the problem returned and sharing failed. Ok, lets set the time again, just to make sure that action is the fix. Setting the time forward an hour to the correct time, instantly resolved the sharing issue. So 100% the problem is down to time. But why does it take 2 hours after the time slips for this problem to present itself? Again, not got a definitive answer here, and assuming there is some scheduled service validation happening within Skype for Business that runs periodically? Centralized logging was not much use to me as I could only see the front end servers and as the incoming request from the external user got reset on the external side of the edge server, I wasn’t ever going to see anything in the log.

To try and determine why the time kept slipping back, it was obvious that the servers where obtaining a time synch from an incorrect source. Usually I advise customers (if they are not .gov) to use the time service provided by ntp.org, and to test time outbound to the internet on these edge servers I attempted to connect to the time servers at ntp.org. I found that UDP 123 for NTP was blocked on the external firewall. So the next step was to see where the servers where getting the time from. w32tm /query /source showed that the servers where getting their time from the local CMOS clock. As these servers where physical, this was not emulated by any integration services normally found on virtual machines.

Checking the CMOS clock, the time was indeed set incorrectly. These where HP servers that had been set to the UTC London time zone in the BIOS. Manually changing the time here seemed to work, but then the CMOS time reset itself back one hour. It seems that if you set the time zone in the CMOS to observe a specific time zone, the BIOS will attempt to synchronize from an external source. If this is blocked, it cannot update daylight saving. The fix was to select local zone in the HP BIOS settings in order to manually force the time to set at the required time.

If you ever experience this problem, be sure to check the time service first. It was a simple oversight for me because I was comparing two servers with the same time issue, and did not think immediately that they were both wrong because one was working and the other not. I still have unanswered questions to why the time slip caused PSOM data to reset on one edge but not the other, but the important thing here is that time is absolutely critical.

2 thoughts on “Skype for Business–Time after Time driving you up the PSOM wall

  1. Excellent post Mark. After you fixed the time issue were you able to run CLS on the edge servers? I ask because I always get the following error: Centralized Logging Service Agent Unexpected expected exception processing a synchronous command.
    Value cannot be null.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s