
High Write Latency on Azure Local

  • alynpeden
  • May 28
  • 2 min read

We recently ran into an issue on an Azure Local cluster in switchless storage mode, where we were getting reports of random sluggish performance (always a fun one). The first job was to identify where the sluggishness was coming from, so we used the typical performance counters: CPU, memory, storage, RDP RTT and a few others. Very quickly we noticed random but severe spikes in write latency when we placed some write load on C:\.
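As a rough illustration of that first step, the sketch below samples a handful of standard Windows performance counters while a write load is running. The counter set, the 60-second window and the table layout are assumptions for illustration, not the exact counters we captured, and the counter paths assume an English-language OS.

```powershell
# Sample CPU, memory and C: write activity once a second for 60 seconds.
# 'Avg. Disk sec/Write' is reported in seconds, so 3.5 means 3500 ms.
$counters = @(
    '\Processor(_Total)\% Processor Time',
    '\Memory\Available MBytes',
    '\LogicalDisk(C:)\Avg. Disk sec/Write',
    '\LogicalDisk(C:)\Disk Writes/sec'
)

Get-Counter -Counter $counters -SampleInterval 1 -MaxSamples 60 |
    ForEach-Object {
        $row = [ordered]@{ Time = $_.Timestamp }
        foreach ($sample in $_.CounterSamples) {
            # Use the last path segment (the counter name) as the column name
            $name = ($sample.Path -split '\\')[-1]
            $row[$name] = [math]::Round($sample.CookedValue, 4)
        }
        [pscustomobject]$row
    } | Format-Table
```

Run it on the machine where the load is being generated. A healthy system should show write latency in the low milliseconds, so anything in whole seconds stands out immediately.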



The next step was to work out why, as there were no obvious errors anywhere on the cluster and RDMA was showing as up and operational.


We verified that RDMA was enabled and SMB Direct was operational using the following commands:


Get-SmbConnection -ServerName <ServerName> | Select ServerName, ShareName, Dialect, RdmaCapable
Get-SmbMultichannelConnection -ServerName <ServerName> | Format-Table ServerName, ClientRdmaCapable, ServerRdmaCapable, Selected, FailedCount

Get-NetIntentStatus

RDMA appeared to be connected and active, so we then reviewed the storage network intent in Windows Admin Center, which appeared to be OK and matched our own test cluster.
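For anyone who prefers to cross-check the same things from PowerShell, here is a minimal sketch. It assumes Network ATC is managing the intents (as it does on Azure Local), and the selected columns are just the ones we would look at first.

```powershell
# Per-adapter RDMA state as the OS sees it
Get-NetAdapterRdma | Format-Table Name, InterfaceDescription, Enabled

# Whether the SMB client considers each interface RDMA capable
Get-SmbClientNetworkInterface |
    Format-Table FriendlyName, RdmaCapable, RssCapable, Speed

# Network ATC intent status per node
Get-NetIntentStatus |
    Format-Table IntentName, Host, ConfigurationStatus, ProvisioningStatus
```

Keep in mind that all of these report the OS view. As this issue showed, everything here can look healthy while the adapter itself is doing something different.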



Next, we checked Priority Flow Control (PFC), which again was correct and matched our working test cluster.
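The OS-side DCB configuration can be dumped with the standard NetQos cmdlets. This is a sketch of the kind of check involved rather than the exact commands we ran, and the comments describe what we would expect on a typical storage intent, not captured output.

```powershell
# Which priorities have PFC enabled in the OS configuration
Get-NetQosFlowControl | Format-Table Priority, Enabled

# ETS traffic classes and their bandwidth reservations
Get-NetQosTrafficClass

# Policies that tag traffic (for example SMB Direct) with a priority
Get-NetQosPolicy

# Whether the OS is willing to accept DCB settings from a DCBX peer
Get-NetQosDcbxSetting
```

Running the same commands on a known-good cluster and comparing the output side by side is the quickest way to spot a difference.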



We then ran Get-NetAdapterQos to check what the adapters were actually doing, which gave us a big clue.



The Hardware column showed the adapter supports IEEE DCB with 3 traffic classes. But the Current (operational) column showed DcbxSupport: None and NumTCs: 0/0/0. Zero traffic classes, zero ETS, zero PFC. The adapter was completely ignoring the OS DCB configuration. This meant that despite the OS correctly configuring PFC on priority 3, the Broadcom adapters were not enforcing it. RoCEv2 was running over 25Gb with absolutely no flow control.
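To make the hardware versus operational comparison across every node in one go, something like the sketch below can help. It assumes the FailoverClusters module is available, PowerShell remoting is enabled between nodes, and it is run from a cluster node with suitable rights.

```powershell
# Collect the Get-NetAdapterQos view from every cluster node.
# Out-String keeps the Hardware / Current capability columns readable
# after the output has travelled back over remoting.
Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
    "==== $env:COMPUTERNAME ===="
    Get-NetAdapterQos |
        Format-List Name, Enabled, Capabilities, OperationalTrafficClasses, OperationalFlowControl |
        Out-String
}
```

The thing to look for is any adapter where the Current column under Capabilities disagrees with the Hardware column, as it did here.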


For context, this is what a working cluster should look like:



The next step was to get into the BIOS and verify the adapter settings. After draining a node and rebooting it into the BIOS, we found the smoking gun.
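Draining a node can be sketched as below. The node name is a hypothetical placeholder, and this is a generic failover clustering sequence rather than the exact procedure we followed, so check it against your own maintenance process before using it.

```powershell
$node = 'NODE01'   # hypothetical node name, replace with your own

# Make sure no storage repair jobs are running before taking a node down
Get-StorageJob

# Pause the node and move its roles to the other nodes
Suspend-ClusterNode -Name $node -Drain -Wait

# Reboot the node. Entering the BIOS setup still has to be done from the
# console or BMC during POST; this command only restarts the OS.
Restart-Computer -ComputerName $node -Force

# Once the node is back up, bring it into the cluster again
Resume-ClusterNode -Name $node -Failback Immediate

# Wait for any storage resync to finish before touching the next node
Get-StorageJob
```

Doing one node at a time and waiting for the storage jobs to finish keeps the cluster resilient while the settings are checked.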



We enabled IEEE and restarted the nodes, then ran Get-NetAdapterQos again and got the following output:



The next step was to run the same test again, placing write load on C:\. We still got some occasional spikes of 20 ms but, most importantly, nothing like 3500 ms.



I also ran an IOPS test using EUCScore and got a score of 1.23 over a 300-second run, which is probably quicker than the laptop you are reading this on :)



How did this happen? We believe the settings were changed automatically during a firmware update on the cluster itself. We have seen this happen in the field before, so please be aware of it when doing firmware updates.
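One cheap way to catch this kind of silent change is to snapshot the adapter QoS state before a firmware update and compare it afterwards. The sketch below is one possible approach, not something built into the update process, and the file paths are hypothetical.

```powershell
$before = 'C:\Temp\qos-before.txt'   # hypothetical paths
$after  = 'C:\Temp\qos-after.txt'

# Run on each node BEFORE the firmware update
Get-NetAdapterQos | Out-String -Width 200 | Set-Content -Path $before

# Run on each node AFTER the firmware update
Get-NetAdapterQos | Out-String -Width 200 | Set-Content -Path $after

# Any output here means a line changed between the two snapshots
Compare-Object (Get-Content $before) (Get-Content $after)
```

If the comparison returns nothing, the adapter is reporting the same DCB state as before the update. If DcbxSupport or NumTCs shows up in the differences, check the adapter settings in the BIOS before putting load back on the node.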


Hopefully this has been of help.
