High Write Latency on Azure Local

alynpeden
May 28
2 min read

We recently ran into an issue on an Azure Local cluster in switchless storage mode, where we were getting reports of random sluggish performance (always a fun one). First we had to identify where the sluggishness was coming from, so we looked at the usual performance counters: CPU, memory, storage, RDP RTT and a few others. Very quickly we noticed random but severe spikes in write latency whenever we placed some load on C:\
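If you want to watch for the same kind of spike yourself, a minimal sketch along these lines might help. It assumes the standard Windows LogicalDisk counters on an English-language OS; the one-second interval and 300 samples are arbitrary choices for illustration, not the exact counter set we used.

```powershell
# Sketch: sample write latency and write IOPS on C: once a second for five minutes.
# Counter paths are the standard LogicalDisk counters (names are localized on
# non-English systems). Interval and sample count are illustrative only.
$counters = @(
    '\LogicalDisk(C:)\Avg. Disk sec/Write',   # average write latency, in seconds
    '\LogicalDisk(C:)\Disk Writes/sec'        # write IOPS
)

Get-Counter -Counter $counters -SampleInterval 1 -MaxSamples 300 |
    ForEach-Object {
        # Pick each counter out of the sample set by the end of its path.
        $latency = ($_.CounterSamples | Where-Object Path -like '*sec/write').CookedValue
        $iops    = ($_.CounterSamples | Where-Object Path -like '*writes/sec').CookedValue

        [pscustomobject]@{
            Time           = $_.Timestamp
            WriteLatencyMs = [math]::Round($latency * 1000, 2)
            WriteIops      = [math]::Round($iops, 0)
        }
    }
```

Run it while the load is being generated and look for rows where WriteLatencyMs jumps well above the rest.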

The next step was to work out why, as there were no obvious errors anywhere on the cluster and RDMA was showing as up and operational.

We verified that RDMA was enabled and SMB Direct was operational using the following commands:

Get-SmbConnection -ServerName <ServerName> | Select-Object ServerName, ShareName, Dialect, RdmaCapable

Get-SmbMultichannelConnection -ServerName <ServerName> | Format-Table ServerName, ClientRdmaCapable, ServerRdmaCapable, Selected, FailedCount

Get-NetIntentStatus

RDMA appeared to be connected and active. We then reviewed the storage network intent in Windows Admin Center, which appeared to be OK and matched our own test cluster.

Next, we checked Priority Flow Control (PFC), which again was correct and matched our working test cluster.
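For anyone wanting to repeat the PFC check, the OS-side DCB configuration can be inspected with the in-box cmdlets below. This is a sketch of the typical commands, on the assumption that the standard NetQos cmdlets are available on the node; the post does not record exactly which ones were run.

```powershell
# Assumption: these are the in-box DCB cmdlets one would typically use for this check.

# PFC state per 802.1p priority. For SMB Direct it should be enabled only on
# the storage priority (priority 3 in our case).
Get-NetQosFlowControl

# QoS policies, including the one that tags SMB Direct (NetDirect port 445)
# traffic with the storage priority.
Get-NetQosPolicy

# ETS traffic classes and their bandwidth reservations.
Get-NetQosTrafficClass
```

Bear in mind these only show what the OS has been told to do, not what the adapter is actually enforcing.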

We then ran Get-NetAdapterQoS to check what the adapters were actually doing, which gave us a big clue.

The Hardware column showed the adapter supports IEEE DCB with 3 traffic classes. But the Current (operational) column showed DcbxSupport: None and NumTCs: 0/0/0. Zero traffic classes, zero ETS, zero PFC. The adapter was completely ignoring the OS DCB configuration. This meant that despite the OS correctly configuring PFC on priority 3, the Broadcom adapters were not enforcing it. RoCEv2 was running over 25Gb with absolutely no flow control.
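To run the same check on every node at once rather than one at a time, something like the following should work. It is a sketch that assumes the FailoverClusters module is available and that you have remote CIM access to each node.

```powershell
# Sketch: show the operational DCB state of every adapter on every cluster node.
# Assumes the FailoverClusters module and remote CIM (WinRM) access to the nodes.
foreach ($node in (Get-ClusterNode).Name) {
    Write-Host "=== $node ==="

    # Compare the Hardware and Current columns under Capabilities.
    # Current showing NumTCs 0/0/0 means the adapter is not applying DCB.
    Get-NetAdapterQos -CimSession $node | Out-Host
}
```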

For context, this is what a working cluster should look like:

The next step was to get into the BIOS and check the adapter settings. After draining a node and rebooting it into the BIOS, we found the smoking gun.
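For reference, draining and rebooting a node can be done roughly as follows. Treat it as a sketch rather than a full maintenance procedure: NODE01 is a hypothetical node name, and getting into the BIOS itself still needs console or out-of-band access during boot.

```powershell
# 'NODE01' is a hypothetical node name used for illustration.
$node = 'NODE01'

# Check that no storage repair jobs are still running before taking a node down.
Get-StorageJob

# Pause the node and move its roles to the other nodes, waiting until the drain completes.
Suspend-ClusterNode -Name $node -Drain -Wait

# Reboot the node, then enter the BIOS / device settings from the console during POST.
Restart-Computer -ComputerName $node -Force

# Once the node is back up, resume it and fail its roles back.
Resume-ClusterNode -Name $node -Failback Immediate
```

Let any storage resync jobs finish (Get-StorageJob again) before moving on to the next node.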

We enabled IEEE and restarted the nodes, then ran Get-NetAdapterQoS again and got the following output:

The next step was to run the same test again, placing write load onto C:\ once more. This time we saw some occasional spikes of 20ms but, most importantly, nothing like 3500ms.

I also ran an IOPS test using EUCScore and got a score of 1.23 over a 300-second run, which is probably quicker than the laptop you are reading this on :)

How did this happen? We believe the settings were changed automatically during a firmware update on the cluster, which is something we have seen happen in the field before, so please be aware of this when doing firmware updates.
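One way to catch this kind of silent change is to snapshot the adapter QoS state before a firmware update and compare it afterwards. The sketch below is only a suggestion; the file paths are hypothetical, the folder must already exist, and the two capture steps are meant to be run separately on each node.

```powershell
# Hypothetical file paths; the folder must already exist.
$before = 'C:\Temp\qos-before.txt'
$after  = 'C:\Temp\qos-after.txt'

# Step 1, BEFORE the firmware update: capture the current adapter QoS state as text.
Get-NetAdapterQos | Out-String -Width 200 | Set-Content -Path $before

# Step 2, AFTER the update and reboot: capture it again.
Get-NetAdapterQos | Out-String -Width 200 | Set-Content -Path $after

# Step 3: any output here means the reported DCB state changed across the update.
Compare-Object (Get-Content $before) (Get-Content $after)
```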

Hopefully this has been of help.
