Introduction
VMware recently published a paper titled Scalable Storage Performance that delivers a wealth of information on storage in the ESX Server architecture. This paper contains details about the storage queues that remain a mystery to many of VMware's customers and partners. I wanted to start a wiki article on some aspects of this paper that may be interesting to storage enthusiasts and performance freaks.
Two Important Queues
Let's use the following figure as a starting point for this discussion.
For the purposes of this article, I'm going to call the two different queue types the "kernel queue" and the "device driver queue". The device driver queue depth is set in the driver itself and has historically been configured through Linux-like module commands in the console operating system. More on that in "Changing Queue Depth" below. The kernel queue should be thought of as infinitely long, for all practical purposes. Any time the device driver queue fills up, commands to the storage queue up in the kernel.
Note that each LUN gets its own queue. This means that when you change the queue depth in the device driver, you're changing the depth of many queues at once. The underlying device (HBA) also has a hard limit on the number of active commands it will allow at one time, and this should be considered when setting queue depth. If your HBA can support only 2,000 active commands but it is addressing 40 LUNs, a queue depth of 64 cannot be satisfied for every LUN at once, because 64 * 40 = 2,560, which is more than the 2,000-command maximum. In practice this is rarely a concern, though: it is unusual for that many LUNs to be simultaneously addressed through a single HBA with that many outstanding commands issued to them.
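If you want to sanity-check this arithmetic on a live host, esxtop on the ESX console (or resxtop from the vMA) exposes the adapter's limit directly. This is just a sketch; the keystroke and the AQLEN column name are as I remember them and may vary a bit between ESX versions, and <esx_host> is simply a placeholder for your host name:
resxtop --server <esx_host>    # or just run esxtop in the ESX service console
# press 'd' for the disk adapter view
# AQLEN is the adapter's maximum number of active commands
# compare it against (LUNs behind the HBA) x (device driver queue depth),
# e.g. 40 x 64 = 2,560, which would exceed an AQLEN of 2,000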
Device Driver Queue Function
The device driver queue is used for a low-level interaction with the storage device. It controls how many active, or "in flight", commands there can be at any one time. This is effectively the concurrency of the storage stack. Set the device queue to 1 and each storage command becomes sequential: each one must complete before the next starts.
But if the device queue is left at its default of 32, as an example, up to 32 commands will be processed concurrently by the storage system. All 32 are shipped off to the storage device by the kernel, and new commands are issued to the device as completions arrive.
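You can watch this concurrency in action in esxtop's disk views. A rough sketch, with the counter names as I recall them (they vary slightly across ESX versions):
esxtop
# press 'u' for the disk device (LUN) view
# DQLEN/LQLEN = the configured device driver queue depth for the LUN
# ACTV        = commands currently "in flight" at the device
# QUED        = commands waiting behind them in the kernel queue
# when ACTV reaches the queue depth, new commands start piling up in QUED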
Kernel Queue Function
The kernel queue can be thought of as a kind of overflow queue for the device driver queues. But it's not just an overflow queue. ESX Server contains all kinds of cool optimizations to get the most out of your storage, and these features apply only to commands sitting in the kernel queue. Here are some examples of features provided to commands in the kernel queue:
- Multi-pathing for failover and load balancing.
- Prioritization of storage activities based on VM and cluster shares.
- Optimizations to improve efficiency for long sequential operations.
There are others, as well.
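As a concrete example of the first item in that list, the multi-pathing state that ESX Server manages for kernel-queued commands can be inspected from the console, or from the vMA with the vicfg- equivalent. Treat the exact flag as an assumption on my part:
esxcfg-mpath -l    # vicfg-mpath -l from the vMA
# lists every path to each LUN, which path is active, and the path policy in use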
Impacts of Queue Depths
So, increasing the queue depth in the device driver can greatly improve the performance of the storage at the device level. Decreasing the device driver queue depth increases usage of the kernel queues. This decreases device efficiency, but introduces opportunities for optimizations across multiple VMs and devices. So, what's the right balance between these two queues? We think the sweet spot is a device driver queue depth of 32. That's why we've set 32 as the default device driver queue length.
But your configuration and workloads may benefit from a change to this default queue depth. I'll refer you to the aforementioned storage paper for information on when you might want to change the driver queue depth. I'll just point out a couple of broad observations here:
- With a few very high-IO VMs on a host, larger queues at the device driver will improve performance.
- As the VM count grows and storage performance features--like shares, load balancing, and failover--become more important, the default queue depth is best.
- If too many servers each use very large device queues, your storage array can easily be overloaded and its performance will suffer.
Improving Storage Performance
Now that we've covered how storage queuing works, you may be wondering how you can monkey around with these queue sizes for optimal performance. I can tell you, as someone who has been involved with many, many performance analysis projects, that changing queue size is rarely the fix for an acute storage performance problem. You should first go through the analysis techniques in Storage Performance Analysis and Monitoring. That may or may not lead to changing queue depths.
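As a first pass, the latency counters in esxtop will usually tell you whether the device or the kernel queue is the bottleneck before you touch any queue depths. Again, a sketch with counter names from memory:
esxtop
# press 'u' for the disk device view, then watch the per-LUN latencies:
# DAVG/cmd = time spent at the device/array (high DAVG -> the array is the bottleneck)
# KAVG/cmd = time spent in the VMkernel (high KAVG with QUED > 0 -> the device queue is full)
# GAVG/cmd = what the guest sees, roughly DAVG + KAVG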
But, in the event that you do end up changing queue depths...
Changing Queue Depth
We have a helpful knowledge base article that describes the process of changing the device driver queue depth on ESX. For ESXi you will need to modify the queue from the vMA. First find the HBA module name (as the first command below does, shown here for the QLogic driver), then change the queue depth for the matching module name using the second command:
esxcfg-module -l | grep qla
vicfg-module -s ql2xmaxqdepth=64 <module_name>
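Assuming vicfg-module supports the same -g/--get-options flag as its esxcfg-module counterpart, you can confirm the option took:
vicfg-module -g <module_name>
# should echo ql2xmaxqdepth=64 once the option has been applied
Keep in mind that driver module options like this one are generally read at load time, so expect to reboot the host before the new depth shows up in esxtop.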