
Interpreting esxtop Statistics


Table of Contents

Section 1. Introduction

Section 2. CPU

Section 2.1 Worlds and Groups

Section 2.2 PCPUs

Section 2.3 Global Statistics

Section 2.4 World Statistics

Section 3. Memory

Section 3.1 Machine Memory and Guest Physical Memory

Section 3.2 Global Statistics

Section 3.3 Group Statistics

Section 4 Disk

Section 4.1 Adapter, Device, VM screens

Section 4.2 Disk Statistics

Section 4.2.1 I/O Throughput Statistics

Section 4.2.2 Latency Statistics

Section 4.2.3 Queue Statistics

Section 4.2.4 Error Statistics

Section 4.2.5 PAE Statistics

Section 4.2.6 Split Statistics

Section 4.3 Batch Mode Output

Section 5 Network

Section 5.1 Port

Section 5.2 Port Statistics

Section 6. Interrupt

Section 7. Batch Mode

 

Section 1. Introduction

Esxtop allows monitoring and collection of data for all system resources: CPU, memory, disk and network. When used interactively, this data can be viewed on different types of screens: one each for CPU statistics, memory statistics, network statistics and disk adapter statistics. In addition to the disk adapter statistics available in earlier versions, starting with ESX 3.5, disk statistics at the device and VM level are also available. Starting with ESX 4.0, esxtop has an interrupt statistics screen. In batch mode, data can be redirected to a file for offline use.

 

Many esxtop statistics are computed as rates, e.g., the CPU statistic %USED. A rate is computed based on the refresh interval, the time between successive snapshots. For example, %USED = ( CPU used time at snapshot 2 - CPU used time at snapshot 1 ) / time elapsed between snapshots. The default refresh interval can be changed with the command-line option "-d" or the interactive command 's'. The return key can be pressed to force a refresh.
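
As an illustration, here is a minimal Python sketch (with made-up snapshot values) of how such a rate is derived from two successive snapshots:

# Hypothetical illustration of how esxtop derives a rate such as %USED
# from two successive counter snapshots (values are made up).

def used_percent(used_time_1, used_time_2, elapsed_seconds):
    """Return %USED given the CPU used time (seconds) at two snapshots."""
    return (used_time_2 - used_time_1) / elapsed_seconds * 100.0

# Example: 3.2 s of CPU used over a 5 s refresh interval -> 64.0 (%USED)
print(used_percent(100.0, 103.2, 5.0))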

 

In each screen, data is presented at different levels of aggregation. It is possible to drill down to expanded views of this data. Each screen provides different expansion options.

 

It is possible to select all or only some of the fields for which data collection is done. When esxtop is used interactively, the order in which the selected fields are displayed can also be chosen.

 

In the following sections, this document will describe the esxtop statistics shown by each screen and their usage.

 

Section 2. CPU

Section 2.1 Worlds and Groups

Esxtop uses worlds and groups as the entities to show CPU usage. A world is an ESX Server VMkernel schedulable entity, similar to a process or thread in other operating systems. A group contains multiple worlds.

 

Let's use a VM as an example. A powered-on VM has a corresponding group, which contains multiple worlds. In ESX 4.0, there is one vcpu (hypervisor) world corresponding to each VCPU of the VM. The guest activities are represented mostly by the vcpu worlds. (In ESX 3.5, esxtop shows a vmm world and a vcpu world for each VCPU. The guest activities are represented mostly by the vmm worlds.) Besides the vcpu worlds, there are other assisting worlds, such as a MKS world and a VMX world. The MKS world assists mouse/keyboard/screen virtualization. The VMX world assists the vcpu worlds (the hypervisor). The usage of the VMX world is out of the scope of this document. In ESX 4.0, there is only one vmx world. (In ESX 3.5, there are two vmx worlds for each VM.)

 

There are other groups besides VM groups. Let's go through a few examples:

 

  • The "idle" group is the container for the idle worlds, each of which corresponds to one PCPU.

  • The "system" group contains the VMKernel system worlds.

  • The "helper" group contains the helper worlds that assist VMKernel operations.

  • In classic ESX, the "console" group is for the console OS, which runs the ESX management processes. In ESXi, these ESX management processes run as user worlds directly on VMKernel. So, on an ESXi box you can see many more groups than on classic ESX, but not the "console" group.

 

Note that groups can be organized in a hierarchical manner in ESX. However, esxtop shows, in a flat form, the groups that contain some worlds. A more detailed discussion of groups is out of the scope of this document.

 

Q: Why can't we find any vmm worlds for a VM in ESX 4.0?

A: Before ESX 4.0, each VCPU had two worlds, "vmm" and "vcpu". In ESX 4.0, the CPU scheduler merges their statistics into one vcpu world, so the CPU stats no longer show vmm worlds. This is not a problem.

 

Section 2.2 PCPUs

In esxtop, a PCPU refers to a physical hardware execution context, i.e., a physical CPU core if hyper-threading is unavailable or disabled, or a logical CPU (aka LCPU or SMT thread) if hyper-threading is enabled.

  • When hyper-threading is unavailable or disabled, a PCPU is the same as a core. (So, esxtop does not show the "CORE UTIL(%)").

  • When hyper-threading is used, a PCPU is a logical CPU (aka a LCPU or SMT thread). So, there are two PCPUs on each core, i.e. PCPU 0 and PCPU 1 on Core 0, PCPU 2 and PCPU 3 on Core 1, etc.

 

Section 2.3 Global Statistics

  • "up time"

The elapsed time since the server has been powered on.

 

  • "number of worlds"

The total number of worlds on ESX Server.

 

  • "CPU load average"

The arithmetic means of the CPU load over 1 minute, 5 minutes, and 15 minutes, based on 6-second samples. CPU load accounts for the run time and ready time of all the groups on the host.

 

  • "PCPU UTIL(%)"

The percentage of unhalted CPU cycles per PCPU, and its average over all PCPUs.

 

Q: What does it mean if PCPU UTIL% is high?

A: It means that you are using a lot of CPU resources. (a) If all of the PCPUs are near 100%, it is possible that you are overcommitting your CPU resources. You need to check RDY% of the groups in the system to verify CPU overcommitment. Refer to RDY% below. (b) If some PCPUs stay near 100% but others do not, there might be an imbalance issue. Note that it is better to monitor the system for a few minutes to verify whether the same PCPUs are using ~100% CPU. If so, check the VM CPU affinity settings.

 

  • "CORE UTIL(%)" (only displayed when hyper-threading is enabled)

The percentage of CPU cycles per core during which at least one of the PCPUs in the core is unhalted, and its average over all cores. It is the complement of the "CORE IDLE" percentage, which is the percentage of CPU cycles during which both PCPUs in the core are halted.

 

It is displayed only when hyper-threading is used.

 

Note that, in batch mode, we show the corresponding "CORE UTIL(%)" of each PCPU. So, PCPU 0 and PCPU 1 have the same "CORE UTIL(%)" number, i.e. the "CORE UTIL(%)" of Core 0.

 

Q: What is the difference between "PCPU UTIL(%)" and "CORE UTIL(%)"?

A: A core is utilized if either or both of the PCPUs on the core are utilized. The percentage utilization of a core is not the sum of the percentage utilizations of both PCPUs. Let's use a few examples to illustrate this.

'+' means busy, '-' means idle.
(1) PCPU 0:   +++++----- (50%)    PCPU 1:   -----+++++ (50%)    Core 0:   ++++++++++ (100%)
(2) PCPU 0:   +++++----- (50%)    PCPU 1:   +++++----- (50%)    Core 0:   +++++----- (50%)
(3) PCPU 0:   +++++----- (50%)    PCPU 1:   ---+++++-- (50%)    Core 0:   ++++++++-- (80%)

 

In all three scenarios above, each PCPU is utilized 50%. But, depending on how often they run at the same time, the core utilization is between 50% and 100%. Generally speaking,

Max(PCPU0_UTIL%, PCPU1_UTIL%) <= CORE0_UTIL% <= Min(PCPU0_UTIL% + PCPU1_UTIL%, 100%)

 

Q: How do I retrieve the average core UTIL% no matter whether hyper-threading is used?

A: If hyper-threading is used, get the average "CORE UTIL(%)" directly. Otherwise (hyper-threading is unavailable or disabled), a PCPU is a core, so we can just use the average "PCPU UTIL(%)". Based on esxtop batch output, the logic looks like the following.

     if ("Physical Cpu(_Total)\% Core Util Time" exists) // Indicating hyper-threading is used        return "Physical Cpu(_Total)\% Core Util Time";     else        return "Physical Cpu(_Total)\% Util Time";

 

  • "PCPU USED(%)"

The percentage CPU usage per PCPU, and its average over all PCPUs.

 

Q: What is the difference between "PCPU UTIL(%)" and "PCPU USED(%)"?

A: While "PCPU UTIL(%)" indicates how much time a PCPU was busy (unhalted) in the last duration, "PCPU USED(%)" shows the amount of "effective work" that has been done by this PCPU. The value of "PCPU USED(%)" can be different from "PCPU UTIL(%)" mainly for the following two reasons:

 

(1) Hyper-threading

The two PCPUs in a core share many hardware resources, including the execution units and cache. Thus, the "effective work" done by a PCPU when the other PCPU in the core is busy is usually much less than when the other PCPU is idle. Based on this observation, the CPU scheduler charges each PCPU half of the elapsed duration when both PCPUs are busy. If only one PCPU is busy during a time period, that PCPU is charged for the whole period. Let's use some examples to illustrate this.

'+' means busy, '-' means idle.
(1) PCPU 0:   +++++----- (UTIL: 50% / USED: 50%)    PCPU 1:   -----+++++ (UTIL: 50% / USED: 50%)
(2) PCPU 0:   +++++----- (UTIL: 50% / USED: 25%)    PCPU 1:   +++++----- (UTIL: 50% / USED: 25%)
(3) PCPU 0:   +++++----- (UTIL: 50% / USED: 40%, i.e. 30% + 20%/2)    PCPU 1:   ---+++++-- (UTIL: 50% / USED: 40%, i.e. 20%/2 + 30%)

 

In all three scenarios above, each PCPU is utilized 50%. But, depending on whether they are busy at the same time, the PCPU USED(%) is between 25% and 50%. Generally speaking,

                                   /- PCPU0_UTIL%/2,                                if PCPU0_UTIL% < PCPU1_UTIL%
    PCPU0_UTIL% >= PCPU0_USED% >=  |
                                   \- (PCPU0_UTIL% - PCPU1_UTIL%) + PCPU1_UTIL%/2,  otherwise

Please note that the above inequalities may not hold due to frequency scaling, which is discussed next.

 

(2) Power Management

The frequency of a PCPU may be changed due to power management. Obviously, a PCPU does less "effective work" (in a unit of time) when the frequency is lower. The CPU scheduler adjusts the "PCPU USED(%)" based on the frequency of the PCPU.

          PCPU_USED% = PCPU_UTIL% * Effective_Frequency / Nominal_Frequency

 

Suppose that UTIL% is 80% and the nominal frequency is 2 GHz. If the effective frequency is 1.5 GHz, USED% would be 80% * 1.5 / 2 = 60%. Please note that since the CPU frequency may change often, you can go to the esxtop power screen, by pressing 'p', to see how much time the PCPU spends in each power state, which can help estimate the effective frequency.

 

Please also note that turbo mode may make the effective frequency higher than the nominal frequency. In that case, USED% would be higher than UTIL%.

 

If we take both factors into account, just to make it more complicated, we get something like this:

                   PCPU0_USED% * Nominal_Frequency       /- PCPU0_UTIL%/2,                                if PCPU0_UTIL% < PCPU1_UTIL%
    PCPU0_UTIL% >= -------------------------------   >=  |
                         Effective_Frequency             \- (PCPU0_UTIL% - PCPU1_UTIL%) + PCPU1_UTIL%/2,  otherwise

 

Q: Why do I see ~100% for the average "PCPU UTIL(%)", but the average "PCPU USED(%)" is ~50%?

A: It is very likely that hyper-threading is enabled. A PCPU is only charged half the time when both PCPUs are busy. Typically,

        0 <= PCPU0_USED% + PCPU1_USED% <= 100% * Effective_Frequency / Base_Frequency

 

Suppose that the CPU frequency is fixed at the base frequency (e.g., power management features are not used); then the sum of PCPU USED% for the two PCPUs on the same core would be less than 100%. So, the average PCPU USED(%) won't be higher than 50%.

 

Q: Why is average CPU usage in vSphere client ~100%, but, average "PCPU USED(%)" in esxtop is ~50%?

A: Same as above. It is likely due to hyper-threading. The average CPU usage in vSphere client is deliberately doubled when hyper-threading is used; while esxtop does not double the average "PCPU USED(%)", which would otherwise mean the average USED% of all the cores.

 

Q: How do I retrieve the average core USED% no matter whether hyper-threading is used?

A: If hyper-threading is used, USED% for a core is the sum of USED% for the corresponding PCPUs on that core, so the average core USED% is double the average PCPU USED%. Otherwise (hyper-threading is unavailable or disabled), a PCPU is a core, so we can just use the average "PCPU USED(%)". Based on esxtop batch output, the logic looks like the following.

     if ("Physical Cpu(_Total)\% Core Util Time" exists) // Indicating hyper-threading is used        return "Physical Cpu(_Total)\% Processor Time" * 2;     else        return "Physical Cpu(_Total)\% Processor Time";

 

  • "CCPU(%)"

Percentages of total CPU time as reported by the ESX Service Console. "us" is for percentage user time, "sy" is for percentage system time, "id" is for percentage idle time and "wa" is for percentage wait time. "cs/sec" is for the context switches per second recorded by the ESX Service Console.

 

Q: What's the difference of CCPU% and the console group stats?

A: CCPU% is measured by the COS. The "console" group CPU stats are measured by VMKernel. The stats are related, but not the same.

 

Section 2.4 World Statistics

A group's statistics are the sum of the world statistics for all the worlds contained in that group. So, this section focuses on worlds. The descriptions apply to groups as well, unless stated otherwise.

 

ESX can make use of hyper-threading technology, so the performance counters take hyper-threading into consideration as well. But, to simplify this document, we will ignore HT-related issues. Please refer to the "Resource Management Guide" for more details.

 

  • "%USED"

The percentage of physical CPU time accounted to the world. If a system service runs on behalf of this world, the time spent by that service (i.e., %SYS) is charged to this world. If a system service interrupts this world but runs on behalf of some other world, that time (i.e., %OVRLP) is not charged to this world. See the notes on %SYS and %OVRLP.

 

%USED = %RUN + %SYS - %OVRLP

 

Q: Is it possible that %USED of a world is greater than 100%?

A: Yes, if a system service runs on a different PCPU on behalf of this world. It may happen when your VM has heavy I/O.

 

Q: For an SMP VM, why does VCPU 0 have higher CPU usage than the others?

A: The system services are accounted to VCPU 0. You may see higher %USED on VCPU 0 than on the others, although the run time (%RUN) is balanced across all the VCPUs. This is not a problem for CPU scheduling; it is just the way VMKernel does the CPU accounting.

 

Q: What is the maximum %USED for a VM group?

A: The group stats are the sum of the world stats. So, the maximum %USED = NWLD * 100%, where NWLD is the number of worlds in the group.

 

Typically, worlds other than the VCPU worlds are waiting for events most of the time and do not consume many CPU cycles. Among all the worlds, the VCPU worlds best represent the guest. Therefore, %USED for a VM group usually does not exceed Number of VCPUs * 100%.

 

Q: What does it mean if %USED of a VM is high?

A: The VM is using a lot of CPU resources. You may expand the group to see which worlds are using most of it.

 

  • "%SYS"

The percentage of time spent by system services on behalf of the world. The possible system services are interrupt handlers, bottom halves, and system worlds.

 

Q: What does it mean if %SYS is high?

A: It usually means that your VM has heavy I/O.

 

Q: Are %USED and %SYS similar to user time and system time in Linux?

A: No. They are totally different. For Linux OS, user (system) time for a process is the time spent in user (kernel) mode. For ESX, %USED is for the accounted time and %SYS is for the system service time.

 

  • "%OVRLP"

The percentage of time spent by system services on behalf of other worlds. In more detail, let's use an example.

 

When World 'W1' is running, a system service 'S' interrupts 'W1' and services World 'W2'. The time spent by 'S', annotated as 't', is included in the run time of 'W1'. We use %OVRLP of 'W1' to show this time. This time 't' is accounted to %SYS of 'W2', as well.

 

Again, let's take a look at "%USED = %RUN + %SYS - %OVRLP". For 'W1', 't' is included in %RUN and %OVRLP, not in %SYS. By subtracting %OVRLP from %RUN, we do not account 't' in %USED of 'W1'. For 'W2', 't' is included in %SYS, not in %RUN or %OVRLP. By adding %SYS, we account 't' to %USED of 'W2'.
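
To make the accounting concrete, here is a small Python sketch with made-up percentages for the W1/W2 example above:

# Made-up numbers: during the interval, W1 ran 60% of the time, of which
# t = 10% was actually a system service working for W2 (W1's %OVRLP).
# W2 ran 30% and is charged that same 10% as its %SYS.
w1 = {'RUN': 60.0, 'SYS': 0.0,  'OVRLP': 10.0}
w2 = {'RUN': 30.0, 'SYS': 10.0, 'OVRLP': 0.0}

def used(w):
    return w['RUN'] + w['SYS'] - w['OVRLP']

print(used(w1))  # 50.0 -- t is not charged to W1
print(used(w2))  # 40.0 -- t is charged to W2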

 

Q: What does it mean if %OVRLP of a VM is high?

A: It usually means the host has heavy I/O. So, the system services are busy handling I/O. Note that %OVRLP of a VM group may or may not be spent on behalf of this VM. It is the sum of %OVRLP for all the worlds in this group.

 

  • "%RUN"

The percentage of total scheduled time for the world to run.

 

Q: What is the difference between %USED and %RUN?

A: %USED = %RUN + %SYS - %OVRLP. (%USED takes the system service time into account.) See details above.

 

Q: What does it mean if %RUN of a VM is high?

A: The VM is using a lot of CPU resources. It does not necessarily mean the VM is under resource constraint. Check the description of %RDY below for determining CPU contention.

 

  • "%RDY"

The percentage of time the world was ready to run.

 

A world in a run queue is waiting for CPU scheduler to let it run on a PCPU. %RDY accounts the percentage of this time. So, it is always smaller than 100%.

 

Q: How do I know CPU resource is under contention?

A: %RDY is a main indicator. But, it is not sufficient by itself.

 

+If a "CPU Limit" is set to a VM's resource settings, the VM will be deliberately held from scheduled to a PCPU when it uses up its allocated CPU resource. This may happen even when there is plenty of free CPU cycles. This time deliberately held by scheduler is shown by "%MLMTD", which will be describe next. Note that %RDY includes %MLMTD. For, for CPU contention, we will use "%RDY - %MLMTD". So, if "%RDY - %MLMTD" is high, e.g., larger than 20%, you may experience CPU contention.+

 

What is the recommended threshold? Well, it depends. As a starting point, we could use 20%. If your application speed in the VM is OK, you may tolerate a higher threshold; otherwise, use a lower one.
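
As a rough aid, here is a minimal Python sketch of the contention check described above; the 20% threshold is just the suggested starting point and should be tuned per application:

# Flag a world/group when "%RDY - %MLMTD" exceeds the starting threshold
# of 20% suggested above. Field names follow the esxtop CPU screen.
def cpu_contention(rdy_pct, mlmtd_pct, threshold=20.0):
    return (rdy_pct - mlmtd_pct) > threshold

print(cpu_contention(rdy_pct=25.0, mlmtd_pct=2.0))   # True  -> likely CPU contention
print(cpu_contention(rdy_pct=25.0, mlmtd_pct=20.0))  # False -> mostly limit-induced ready time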

 

Q: How do we break down 100% of a world's time into states?

A: A world can be in different states: scheduled to run, ready to run but not scheduled, or not ready to run (waiting for some events).

100% = %RUN + %RDY + %CSTP + %WAIT.

Check the descriptions of %CSTP and %WAIT below.

 

Q: What does it mean if %RDY of a VM is high?

A: It means the VM is possibly under resource contention. Check "%MLMTD" as well. If "%MLMTD" is high, you may raise the "CPU limit" setting for the VM. If "%RDY - %MLMTD" is high, the VM is under CPU contention.

 

  • "%MLMTD"

The percentage of time the world was ready to run but deliberately wasn't scheduled because that would violate the "CPU limit" settings.

 

Note that %MLMTD is included in %RDY.

 

Q: What does it mean if %MLMTD of a VM is high?

A: The VM cannot run because of the "CPU limit" setting. If you want to improve the performance of this VM, you may increase its limit. However, keep in mind that it may reduce the performance of others.

 

  • "%CSTP"

The percentage of time the world spent in the ready, co-deschedule state. This co-deschedule state is only meaningful for SMP VMs. Roughly speaking, the ESX CPU scheduler deliberately puts a VCPU in this state if it advances much farther than the other VCPUs.

 

Q: What does it mean if %CSTP is high?

A: It usually means the VM workload does not use VCPUs in a balanced fashion. The VCPU with high %CSTP is used much more often than the others. Do you really need all those VCPUs? Do you pin the guest application to the VCPUs?

 

  • "%WAIT"

The percentage of time the world spent in wait state.

 

This %WAIT is the total wait time, i.e., the time the world is waiting for some VMKernel resource. This wait time includes I/O wait time and idle time, among others. Idle time is presented as %IDLE.

 

Q: How do I know the VCPU world is waiting for I/O events?

A: %WAIT - %IDLE can give you an estimate of how much time is spent waiting for I/O events. This is an estimate only, because the world may be waiting for resources other than I/O. Note that we should only do this for the VCPU worlds, not the other kinds of worlds, because the VCPU worlds best represent the guest behavior. For disk I/O, another alternative is to read the disk latency stats, which we will explain in the disk section.

 

Q: How do I know the VM group is waiting for I/O events?

A: For a VM, there are other worlds besides the VCPUs, such as an MKS world and a VMX world. Most of the time, those other worlds are waiting for events, so you will see ~100% %WAIT for them. If you want to know whether the guest is waiting for I/O events, it is better to expand the group and analyze the VCPU worlds as stated above.

 

Since %IDLE makes no sense for worlds other than VCPUs, we may use the group stats to estimate the guest I/O wait as "%WAIT - %IDLE - 100% * (NWLD - NVCPU)". Here, NWLD is the number of worlds in the group and NVCPU is the number of VCPUs. This is a very rough estimate, for two reasons: (1) the worlds may be waiting for resources other than I/O; (2) we assume the other assisting worlds are not active, which may not be true.
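
A rough Python sketch of this group-level estimate, subject to the same two caveats:

# Rough estimate of guest I/O wait for a VM group:
#   %WAIT - %IDLE - 100% * (NWLD - NVCPU)
# Group %WAIT can exceed 100% because it sums all worlds in the group.
def group_io_wait_estimate(wait_pct, idle_pct, nwld, nvcpu):
    return max(wait_pct - idle_pct - 100.0 * (nwld - nvcpu), 0.0)

# Example: a 2-VCPU VM group with 5 worlds, %WAIT = 420, %IDLE = 90 -> 30.0
print(group_io_wait_estimate(420.0, 90.0, nwld=5, nvcpu=2))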

 

Again, for disk I/O, another alternative is to read the disk latency stats which we will explain in the disk section.

 

Q: Why do I always see a high %WAIT for VMX/mks worlds?

A: This is normal. It means there is not much activity in those worlds.

 

Q: Why do I see a high %WAIT for a VM group?

A: For a VM, there are other worlds besides the VCPUs, such as an MKS world and a VMX world. These worlds are waiting for events most of the time.

 

  • "%IDLE"

The percentage of time the VCPU world is in the idle loop. Note that %IDLE is included in %WAIT. Also note that %IDLE only makes sense for VCPU worlds. The other worlds do not have idle loops, so %IDLE is zero for them.

 

  • "%SWPWT"

The percentage of time the world is waiting for the ESX VMKernel to swap memory. The %SWPWT (swap wait) time is included in the %WAIT time. This is a new statistic added in ESX 4.0.

 

Q: Why do I see a high %SWPWT for a VM group?

A: The VM's memory is being swapped by the VMKernel; the VM is waiting for swapped-out pages to be read back in.

 

Section 3. Memory

Section 3.1 Machine Memory and Guest Physical Memory

It is important to note that some statistics refer to guest physical memory while others refer to machine memory. "Guest physical memory" is the virtual-hardware physical memory presented to the VM. "Machine memory" is actual physical RAM in the ESX host. Let's use the following figure to explain. In the figure, two VMs are running on an ESX host, where each block represents 4 KB of memory and each color represents a different set of data on a block.

 

http://communities.vmware.com/servlet/JiveServlet/downloadImage/102-9279-1-4857/memory.JPG

 

Inside each VM, the guest OS maps its virtual memory to its physical memory. The ESX kernel maps the guest physical memory to machine memory. Due to the ESX page sharing technology, guest physical pages with the same content can be mapped to the same machine page.

 

Section 3.2 Global Statistics

  • "MEM overcommit avg"

Average memory overcommit level over 1 minute, 5 minutes, and 15 minutes (exponentially weighted moving averages).

 

Memory overcommit is the ratio of total requested memory to the "managed memory", minus 1. VMKernel computes the total requested memory as a sum of the following components: (a) VM configured memory (or the memory limit setting if set), (b) the user world memory, (c) the reserved overhead memory. (Overhead memory will be discussed in more detail for "OVHD" and "OVHDMAX" in Section 3.3.)

 

"managed memory" will be defined in "VMKMEM" section.

 

Q: What does it mean if overcommit is not 0?

A: It means that total requested guest physical memory is more than the machine memory available. This is fine, because ballooning and page sharing allows memory overcommit.

 

This metric does not necessarily mean that you will have performance issues. Use "SWAP" and "MEMCTL" to find whether you are experiencing memory problems.

 

Q: What's the meaning of overcommit?

A: See above description for details. Roughly speaking, it reflects the ratio of requested memory and the available memory.

 

  • "PMEM" (MB)

The machine memory statistics for the host.

 

"total": the total amount of machine memory in the server. It is the machine memory reported by BIOS.

 

"cos" : the amount of machine memory allocated to the ESX Service Console.

 

"vmk" : the amount of machine memory being used by the ESX VMKernel. "vmk" includes kernel code section, kernel data and heap, and other VMKernel management memory.

 

"other": the amount of machine memory being used by everything other than the ESX Service Console and ESX VMKernel. "other" contains not only the memory used by VM but also the user worlds that run directly on VMKernel.

 

"free" : the amount of machine memory that is free.

 

Q: Why is total not the same as RAM size plugged in my memory slots?

A: This is because some memory range is not available for use. It is fine, if the difference is small. If the difference is big, there might be some hardware issue. Check your BIOS.

 

Q: Why can't I find the cos part?

A: COS is only available in classic ESX. You are using ESXi.

 

Q: How do I break down the total memory?

A: total = cos + vmk + other + free

 

Q: Which one contains the memory used by VMs?

A: "other" contains the machine memory that backs guest physical memory of VMs. Note that "other" also includes the overhead memory.

 

Q: How do I know my "free" memory is low? Is it a problem if it is low?

A: You could use the "state" field, which will be explained next, to see whether the free memory is low. Basically, it is fine if you do not experience memory swapping or ballooning. Check "SWAP" and "MEMCTL" to find whether you are experiencing memory problems.

 

  • "VMKMEM" (MB)

The machine memory statistics for VMKernel.

 

"managed": the total amount of machine memory managed by VMKernel. VMKernel "managed" memory can be dynamically allocated for VM, VMKernel, and User Worlds.

 

"minfree": the minimum amount of machine memory that VMKernel would like to keep free. This is because VMKernel needs to keep some amount of free memory for critical uses.

 

"rsvd" : the amount of machine memory that is currently reserved. "rsvd" is the sum of three parts: (a) the reservation setting of the groups; (b) the overhead reservation of the groups; (c) "minfree".

 

"ursvd" : the amount of machine memory that is currently unreserved. It is the memory available for reservation.

 

Please note that VM admission control is done at the resource pool level, so this statistic is not used directly by admission control. "ursvd" can be used as a system-level indicator.

 

"state" : the free memory state. Possible values are high, soft, hard and low. The memory "state" is "high", if the free memory is greater than or equal to 6% of "total" - "cos". If is "soft" at 4%, "hard" at 2%, and "low" at 1%. So, high implies that the machine memory is not under any pressure and low implies that the machine memory is under pressure.

 

While the host's memory state is not used to determine whether memory should be reclaimed from VMs (that decision is made at the resource pool level), it can affect what mechanisms are used to reclaim memory if necessary. In the high and soft states, ballooning is favored over swapping. In the hard and low states, swapping is favored over ballooning.

 

Please note that "minfree" is part of "free" memory; while "rsvd" and "ursvd" memory may or may not be part of "free" memory. "reservation" is different from memory allocation.

 

Q: Why is "managed" memory less than the sum of "vmk", "other" and "free" in the PMEM line? Is it normal?

A: It is normal; it is just the way we do accounting. A more precise definition of "managed" is the free memory after VMKernel initialization. So, this amount of memory can be dynamically allocated for use by VMs, VMKernel, and user worlds. "managed" = "some part of vmk" + "other" + "free".

 

+So, "managed" &lt; "vmk" + "other" + "free". Or, in an equivalent form, "managed" &lt; "total" - "cos".+

 

Q: How do I break down the managed memory in terms of reservation?

A: "managed" = "rsvd" + "ursvd" + "vmkernel usage"

 

The VMKernel machine memory manager needs to use some part of memory, which should not be subject to reservation, so it is not in "rsvd" or "ursvd". In the above equation, we put this part under "vmkernel usage". Unfortunately, it is not shown directly in esxtop.

 

Note that the vmkernel usage in managed memory is part of "vmk".

 

Q: What does it mean if "ursvd" is low?

A: VMKernel admission control prohibits a VM PowerOn operation if it cannot meet the memory reservation of that VM. The memory reservation includes the reservation setting, a.k.a. "min", and the monitor overhead memory reservation. Note that even if "min" is not set, VMKernel still needs to reserve some amount of memory for monitor uses.

 

So, it is possible that even though you have enough free memory, a new VM cannot power on due to the violation of memory reservation.

 

Q: Why do I fail admission control even though "ursvd" is high?

A: The VM admission control is done at resource pool level. Please check the "min" setting of all its parent resource pools.

 

Q: Why is "managed" greater than the sum of "rsvd" and "ursvd"? Is it normal?

A: It is normal. See above question. VMKernel may use some of the managed memory. It is not accounted in "rsvd" and "ursvd".

 

Q: What is the meaning of "state"?

A: See the description of "state" above.

 

Q: How do I know my ESX box is under memory pressure?

A: It is usually safe to say the ESX box is under memory pressure if "state" is "hard" or "low". But you also need to check "SWAP" and "MEMCTL" to find whether you are experiencing memory problems. Basically, if there is not enough free memory and ESX is experiencing swapping or ballooning, the ESX box is under memory pressure.

 

Note that ballooning does not have as big a performance hit as swapping does. Ballooning may cause guest swapping. ESX swapping means host swapping.

 

Also note that a VM may be swapping or ballooning even though there is enough free memory. This is due to the resource settings, e.g., a "limit" on the VM or its resource pools.

 

  • "COSMEM" (MB)

The memory statistics reported by the ESX Service Console.

 

"free" : the amount of idle machine memory.

 

"swap_t": the total swap configured.

 

"swap_f": the amount of swap free.

 

"r/s" : the rate at which memory is swapped in from disk.

 

"w/s" : the rate at which memory is swapped out to disk.

 

Note that these stats essentially come from the COS proc nodes.

 

Q: What does it mean if I see a high r/s or w/s?

A: Your console OS is swapping. It is highly likely that your COS free memory is low. You may either configure more memory for COS and restart your ESX box, or stop some programs running inside your COS.

 

Q: Why can't I see this COSMEM line?

A: You are using ESXi not classic ESX.

 

  • "NUMA" (MB)

The ESX NUMA statistics. For each NUMA node there are two statistics: (1) the "total" amount of machine memory managed by ESX; (2) the amount of machine memory currently "free".

 

Note that the ESX NUMA scheduler optimizes the use of the NUMA architecture to improve guest performance. Please refer to the "Resource Management Guide" for details.

 

Q: Why can't I see this NUMA line?

A: You are not using a NUMA machine, or your BIOS disables it.

 

Q: Why is the sum of NUMA memory not equal to "total" in the PMEM line?

A: The PMEM "total" is the memory reported by BIOS, while the NUMA "total" is the memory managed by VMKernel machine memory manager. There are two major parts of memory seen by BIOS but not given to machine memory manager: (1) COS uses, and (2) VMKernel uses during early initialization.

 

So, Sum("NUMA total") &lt; "PMEM total" - "cos".

 

Note that the free memory across all the nodes adds up to the "free" memory in the PMEM line.

 

  • "PSHARE" (MB)

The ESX page-sharing statistics.

 

"shared": the amount of guest physical memory that is being shared.

 

"common": the amount of machine memory that is common across World(s).

 

"saving": the amount of machine memory that is saved due to page-sharing.

 

The monitor maps guest physical memory to machine memory. VMKernel chooses to map guest physical pages with the same content to the same machine page. In other words, those guest physical pages share the same machine page. This kind of sharing can happen within the same VM or among VMs.

 

Since each VM's "shared" memory measures guest physical memory, the host's "shared" memory may be larger than the total amount of machine memory if memory is overcommitted. "saving" illustrates the effectiveness of page sharing for saving machine memory.

 

"shared" = "common" + "saving".

 

Note that esxtop only shows the pshare stats for VMs, excluding the pshare stats for user worlds.

 

  • "SWAP" (MB)

The ESX swap usage statistics.

 

"curr" : the current swap usage. This is the total swapped machine memory of all the groups. So, it includes VMs and user worlds.

 

"target": the swap usage expected to be. This is the total swap target of all the groups. So, it includes VMs and user worlds.

 

"r/s" : the rate at which machine memory is swapped in from disk.

 

"w/s" : the rate at which machine memory is swapped out to disk.

 

Note that swap here is host swap, not guest swap inside the VM.

 

Q: What does it mean if "curr" is not the same as "target"?

A: It means ESX will swap memory to meet the swap target. Note that the actual swapping is done at the group level. So, you should check "SWCUR" and "SWTGT" for each group. We will discuss this in the next section.

 

Q: Is it bad if "r/s" is high?

A: Yes, it is very bad. This usually means that you have memory resource contention. Because swapin is synchronous, it will hurt guest performance a lot.

 

Do two things: (1) Check your "free" memory or "state" as mentioned above. If free memory is low, you need to move VMs to other hosts or add more memory to the host. (2) If free memory is not low, check your resource setting of your VMs or user worlds. You may have set a low "limit", which causes swapping.

 

Q: Is it bad if "w/s" is high?

A: Yes, it is also very bad. This usually means that you have memory resource contention. Do the similar actions as mentioned above.

 

  • "MEMCTL" (MB)

The memory balloon statistics.

 

"curr" : the total amount of physical memory reclaimed by balloon driver. This is the total ballooned memory by the VMs.

 

"target": total amount of ballooned memory expected to be. This is the total ballooned targets of the VMs.

 

"max" : the maximum amount of physical memory reclaimable.

 

Note that ballooning may or may not lead to guest swapping, which is decided by the guest OS.

 

Q: What does it mean if "curr" is not the same as "target"?

A: It means ESX will balloon memory to meet the balloon target. Note that the actual ballooning is done for the VM group. So, you should check "MCTLSZ" and "MCTLTGT" for each group. We will discuss this in the next section.

 

Q: How do I know the host is ballooning memory?

A: If the "curr" is changing, you can know it is ballooning. Since ballooning is done at VM level, a better way is to monitor "MCTLSZ" for each group. We will discuss this in the next section.

 

Q: Is it bad if we have lots of ballooning activities?

A: Usually it is fine. Ballooning tends to take unused memory from one VM and make it available for others. The possible side effects are (a) reducing the memory cache used by the guest OS, and (b) guest swapping. In either case, it may hurt guest performance. Please note that (a) and (b) may or may not happen, depending on your workload inside the VM.

 

On the other hand, under memory contention, ballooning is much better than swapping in terms of performance.

 

Section 3.3 Group Statistics

Esxtop shows the groups that use memory managed by VMKernel memory scheduler. These groups can be used for VMs or purely for user worlds running directly on VMKernel. You may see many pure user world groups on ESXi, not on classic ESX.

 

Tip: use 'V' command to show only the VM groups.

 

  • "MEMSZ" (MB)

For a VM, it is the amount of configured guest physical memory.

 

For a user world, it includes not only the virtual memory that is backed by the machine memory, but also the reserved backing store size.

 

Q: How do I break down "MEMSZ" of a VM?

A: A VM's guest physical memory could be mapped to machine memory, reclaimed by the balloon driver, swapped to disk, or never touched. The guest physical memory can be "never touched" because (1) the VM has never used it since power-on, or (2) it was reclaimed by the balloon driver before and has not been used since the balloon driver released it last time. This part of memory is not measured directly by VMKernel.

 

"MEMSZ" = "GRANT" + "MCTLSZ" + "SWCUR" + "never touched"

 

Please refer to "GRANT", "MCTLSZ", "SWCUR".

 

  • "GRANT" (MB)

For a VM, it is the amount of guest physical memory granted to the group, i.e., mapped to machine memory. The overhead memory, "OVHD", is not included in GRANT. The shared memory, "SHRD", is part of "GRANT". This statistic was added to esxtop in ESX 4.0.

 

The consumed machine memory for the VM, not including the overhead memory, can be estimated as "GRANT" - "SHRDSVD". Please refer to "SHRDSVD".

 

For a user world, it is the amount of virtual memory that is backed by machine memory.

 

Q: Why is "GRANT" less than "MEMSZ"?

A: Some guest physical memory has never been used, or is reclaimed by balloon driver, or is swapped out to the VM swap file. Note that this kind of swap is host swap, not the guest swap by the guest OS.

 

"MEMSZ" = "GRANT" + "MCTLSZ" + "SWCUR" + "never touched"

 

Q: How do I know how much machine memory is consumed by this VM?

A: GRANT accounts for guest physical memory; it may not be the same as the mapped machine memory, due to page sharing.

 

The consumed machine memory can be estimated as "GRANT" - "SHRDSVD". Please note that this is an estimate. Please refer to "SHRDSVD".

 

Note that overhead memory, "OVHD", is not part of the above consumed machine memory.

 

  • "SZTGT" (MB)

The amount of machine memory to be allocated. (TGT is short for "target".) Note that "SZTGT" includes the overhead memory for a VM.

 

This is an internal counter, which is computed by the ESX memory scheduler. Usually, there is no need to worry about this. Roughly speaking, "SZTGT" of all the VMs is computed based on the resource usage, available memory, and the "limit/reservation/shares" settings. This computed "SZTGT" is compared against the current memory consumption plus overhead memory for a VM to determine the swap and balloon targets, so that VMKernel may balloon or swap an appropriate amount of memory to meet its memory demand. Please refer to the "Resource Management Guide" for details.

 

Q: How come my "SZTGT" is larger than "MEMSZ"?

A: "SZTGT" includes the overhead memory, while "MEMSZ" does not. So, it is possible for "SZTGT" be larger than "MEMSZ".

 

Q: How do I use "SZTGT"?

A: This is an internal counter. You don't need to use it.

 

This counter is used to determine future swapping and ballooning activities. Check "SWTGT" and "MCTLTGT".

 

  • "TCHD" (MB)

The amount of guest physical memory recently used by the VM, which is estimated by VMKernel statistical sampling.

 

VMKernel estimates active memory usage for a VM by sampling a random subset of the VM's memory resident in machine memory to detect the number of memory reads and writes. VMKernel then scales this number by the size of the VM's configured memory and averages it with previous samples. Over time, this average will approximate the amount of active memory for the VM.

 

Note that ballooned memory is considered inactive, so, it is excluded from "TCHD".

 

Because sampling and averaging takes time, "TCHD" won't be exact, but becomes more accurate over time.

 

VMKernel memory scheduler charges the VM by the sum of (1) the "TCHD" memory and (2) idle memory tax. This charged memory is one of the factors that memory scheduler uses for computing the "SZTGT".

 

Q: What is the difference between "TCHD" and working set estimate by guest OS?

A: "TCHD" is the working set estimated by VMKernel. This number may be different from guest working set estimate. Sometimes the difference may be big, because (1) guest OS uses a different working set estimate algorithm, (2) guest OS has a different view of active guest physical memory, due to ballooning and host swapping,

 

Q: How is "TCHD" used?

A: "TCHD" is a working set estimate, which indicates how actively the VM is using its memory. See above for the internal use of this counter.

 

  • "%ACTV"

Percentage of active guest physical memory, current value.

 

"TCHD" is actually computed based on a few parameters, coming from statistical sampling. The exact equation is out of scope of this document. Esxtop shows some of those parameters, %ACTV, %ACTVS, %ACTVF, %ACTVN. Here, this document provides simple descriptions without further discussion.

 

%ACTV reflects the current sample.

%ACTVS is an EWMA of %ACTV for long term estimate.

%ACTVF is an EWMA of %ACTV for short term estimate.

%ACTVN is a prediction of what %ACTVF will be at the next sample.

 

Since they are very internal to VMKernel memory scheduler, we do not discuss their usage here.

 

  • "%ACTVS"

Percentage of active guest physical memory, slow moving average. See above.

 

  • "%ACTVF"

Percentage of active guest physical memory, fast moving average. See above.

 

  • "%ACTVN"

Percentage of active guest physical memory in the near future. This is an estimated value. See above.

 

  • "MCTL?"

Whether the memory balloon driver is installed in the VM or not.

 

If not, install VMware Tools, which contains the balloon driver.

 

  • "MCTLSZ" (MB)

The amount of guest physical memory reclaimed by the balloon driver.

 

This can be called "balloon size". A large "MCTLSZ" means lots of this VM's guest physical memory is "stolen" to decrease host memory pressure. This usually is not a problem, because balloon driver tends to smartly steal guest physical memory that cause little performance problems.

 

Q: How do I know the VM is ballooning?

A: If "MCTLSZ" is changing, balloon driver is actively reclaiming or releasing memory. I.e., the VM is ballooning. Please note that the ballooning rate for a short term can be estimated by the change of "MCTLSZ", assuming it is either increasing or decreasing. But, for a long term, we cannot do it this way, because that monotonically increase/decrease assumption may not hold.

 

Q: Does ballooning hurt VM performance?

A: If guest working set is smaller than guest physical memory after ballooning, guest applications won't observe any performance degradation. Otherwise, it may cause guest swapping and hurt guest application performance.

 

Please check what causes ballooning and take appropriate actions to reduce memory pressure. There are two possible reasons: (1) The host does not have enough machine memory for use. (2) Memory used by the VM reaches the "limit" setting of itself or "limit" of the resource pools that contain this VM. In either case, ballooning is necessary and preferred over swapping.

 

  • "MCTLTGT" (MB)

The amount of guest physical memory to be kept in balloon driver. (TGT is short for "target".)

 

This is an internal counter, which is computed by ESX memory scheduler. Usually, there is no need to worry about this.

 

Roughly speaking, "MCTLTGT" is computed based on "SZTGT" and current memory usage, so that the VM can balloon appropriate amount of memory. If "MCTLTGT" is greater than "MCTLSZ", VMKernel initiates inflating the balloon immediately, causing more VM memory to be reclaimed. If "MCTLTGT" is less than "MCTLSZ", VMKernel will deflate the balloon when the guest is requesting memory, allowing the VM to map/consume additional memory if it needs it. Please refer to "Resource Management Guide" for details.

 

Q: Why is it possible for "MCTLTGT" to be less than "MCTLSZ" for a long time?

A: If "MCTLTGT" is less than "MCTLSZ", VMKernel allows the balloon to deflate. But, balloon deflation happens lazily until the VM requests new memory. So, it is possible for "MCTLTGT" to be less than "MCTLSZ" for a long time, when the VM is not requesting new memory.

 

  • "MCTLMAX" (MB)

The maximum amount of guest physical memory reclaimable by balloon driver.

 

This value can be set via vmx option "sched.mem.maxmemctl". If not set, it is determined by the guest operating system type. "MCTLTGT" will never be larger than "MCTLMAX".

 

If the VM suffers from ballooning, "sched.mem.maxmemctl" can be set to a smaller value to reduce this possibility. Remember that doing so may result in host swapping during resource contention.

 

  • "SWCUR" (MB)

Current swap usage.

 

For a VM, it is the current amount of guest physical memory swapped out to the backing store. Note that it is the VMKernel swapping not the guest OS swapping.

 

It is the sum of swap slots used in the vswp file or system swap, and migration swap. Migration swap is used for a VMotioned VM to hold swapped out memory on the destination host, in case the destination host is under memory pressure.

 

Q: What does it mean if "SWCUR" of my VM is high?

A: It means that part of the VM's guest physical memory is not resident in machine memory but on disk. If that memory will not be used in the near future, it is not an issue. Otherwise, it will be swapped in for the guest's use; in that case, you will see some swap-in activity via "SWR/s", which may hurt the VM's performance.

 

  • "SWTGT" (MB)

The expected swap usage. (TGT is short for "target".)

 

This is an internal counter, which is computed by ESX memory scheduler. Usually, there is no need to worry about this.

 

Roughly speaking, "SWTGT" is computed based on "SZTGT" and current memory usage, so that the VM can swap appropriate amount of memory. Again, note that it is the VMKernel swapping not the guest swapping. If "SWTGT" is greater than "SWCUR", VMKernel starts swapping immediately, causing more VM memory to be swapped out. If "SWTGT" is less than "SWCUR", VMKernel will stop swapping. Please refer to "Resource Management Guide" for details.

 

Q: Why is it possible for "SWTGT" to be less than "SWCUR" for a long time?

A: Since swapped memory stays swapped until the VM accesses it, it is possible for "SWTGT" to be less than "SWCUR" for a long time.

 

  • "SWR/s" (MB)

Rate at which memory is being swapped in from disk. Note that this statistic refers to VMKernel swapping, not guest swapping.

 

When a VM is requesting machine memory to back its guest physical memory that was swapped out to disk, VMKernel reads in the page. Note that the swap-in operation is synchronous.

 

Q: What does it mean if SWR/s is high?

A: It is very bad for VM's performance. Because swap-in is synchronous, the VM needs to wait until the requested pages are read into machine memory. This happens when VMKernel swapped out the VM's memory before and the VM needs them now. Please refer to "SWW/s".

 

  • "SWW/s" (MB)

Rate at which memory is being swapped out to disk. Note that this statistic refers to VMKernel swapping, not guest swapping.

 

As discussed in "SWTGT", if "SWTGT" is greater than "SWCUR", VMKernel will swap out memory to disk. It happens usually in two situations. (1) The host does not have enough machine memory for use. (2) Memory used by the VM reaches the "limit" setting of itself or "limit" of the resource pools that contain this VM.

 

Q: What does it mean if SWW/s is high?

A: It is very bad for VM performance. Please check the above two reasons and fix your problem accordingly.

 

If this VM is swapping out memory due to resource contention, it usually means VMKernel does not have enough machine memory to meet memory demands from all the VMs. So, it will swap out mapped guest physical memory pages to make room for the recent requests.

 

  • "SHRD" (MB)

Amount of guest physical memory that is shared.

 

VMKernel page sharing module scans and finds guest physical pages with the same content and backs them with the same machine page. "SHRD" accounts the total guest physical pages that are shared by the page sharing module.

 

  • "ZERO" (MB)

Amount of guest physical zero memory that is shared. This is an internal counter.

 

A zero page is simply the memory page that is all zeros. If a zero guest physical page is detected by VMKernel page sharing module, this page will be backed by the same machine page on each NUMA node. Note that "ZERO" is included in "SHRD".

 

  • "SHRDSVD" (MB)

Estimated amount of machine memory that is saved due to page sharing.

 

Because a machine page is shared by multiple guest physical pages, we only charge "1/ref" page as the consumed machine memory for each of the guest physical pages, where "ref" is the number of references. So, the saved machine memory will be "1 - 1/ref" page. "SHRDSVD" estimates the total saved machine memory for the VM.

 

The consumed machine memory by the VM can be estimated as "GRANT" - "SHRDSVD".
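
A small Python sketch of this per-page accounting, assuming a 4 KB page size:

# Each shared guest page with "ref" references is charged 1/ref of a page,
# so it saves (1 - 1/ref) of a page. Page size assumed to be 4 KB.
PAGE_KB = 4

def saved_kb(ref_counts):
    """ref_counts: number of references for each shared guest page."""
    return sum((1.0 - 1.0 / ref) * PAGE_KB for ref in ref_counts)

# Three guest pages sharing machine pages referenced 2, 4 and 8 times:
print(saved_kb([2, 4, 8]))  # 2.0 + 3.0 + 3.5 = 8.5 KB saved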

 

  • "COWH" (MB)

Amount of guest physical hint pages for page sharing. This is an internal counter.

 

  • "OVHDUW" (MB)

Amount of overhead memory reserved for the vmx user world of a VM group. This is an internal counter.

 

"OVHDUW" is part of "OVHDMAX".

 

  • "OVHD" (MB)

Amount of overhead memory currently consumed by a VM.

 

"OVHD" includes the overhead memory consumed by the monitor, the VMkernel and the vmx user world.

 

  • "OVHDMAX" (MB)

Amount of reserved overhead memory for the entire VM.

 

"OVHDMAX" is the overhead memory a VM wants to consume in the future. This amount of reserved overhead memory includes the overhead memory reserved by the monitor, the VMkernel, and the vmx user world. Note that the actual overhead memory consumption is less than "OVHDMAX". "OVHD" &lt; "OVHDMAX".

 

"OVHDMAX" can be used as a conservative estimate of the total overhead memory.

 

Section 4 Disk

Section 4.1 Adapter, Device, VM screens

The ESX storage stack adds a few layers of code between a virtual machine and bare hardware. All virtual disks in virtual machines are seen as virtual SCSI disks. The ESX storage stack allows these virtual disks to be located on any of the multiple storage options available.

 

 

For performance analysis purposes, an IO request from an application in a virtual machine traverses through multiple levels of queues, each associated with a resource, in the guest OS, the VMkernel and the physical storage. (Note that physical storage could be an FC- or IP- SAN or disk array.) Each queue has an associated latency, dictated by its size and whether the IO load is low or high, which affects the throughput and latency seen by applications inside VMs.

 

Esxtop shows the storage statistics in three different screens: adapter screen, device screen, and vm screen. Interactive command 'd' can be used to switch to the adapter screen, 'u' for the device screen, and 'v' for the vm screen.

 

The main difference in the data seen in these three screens is the level at which it is aggregated, even though these screens have similar counters. By default, data is rolled up to the highest level possible for each screen. (1) On the adapter screen, by default, the statistics are aggregated per storage adapter, but they can also be expanded to display data per storage channel, target, path or world using a LUN. See the interactive commands 'e', 'E', 'P', 'a', 't', 'l', for the expand operations. (2) On the device screen, by default, statistics are aggregated per storage device. Statistics can also be viewed per path, world, or partition. See the interactive commands 'e', 'p', 't', for the expand operations. (3) On the VM screen, statistics are aggregated on a per-group basis by default. One VM has one corresponding group, so these are equivalent to per-VM statistics. You can use the interactive command 'V' to show only statistics related to VMs. Statistics can also be expanded so that a row is displayed for each world or on a per-world-per-device basis. See the interactive commands 'e' and 'l'.

 

Please refer to esxtop man page for the details of the interactive commands.

 

Section 4.2 Disk Statistics

Due to the similarities in the counters of the three disk screens, this section discusses the counters without distinguishing the screens. Similar to other esxtop screens, the storage counters are also organized in different sets, each of which contains related counters. The counters can be selected as a set by selecting the appropriate field option in esxtop. If esxtop is used in batch mode, make sure that the esxtop configuration file includes all counters of interest.

 

Each group of counters in the following subsections corresponds to a particular field option.

 

Section 4.2.1 I/O Throughput Statistics

  • CMDS/s

Number of commands issued per second.

 

  • READS/s

Number of read commands issued per second.

 

  • WRITES/s

Number of write commands issued per second.

 

  • MBREAD/s

Megabytes read per second.

 

  • MBWRTN/s

Megabytes written per second.

 

Section 4.2.2 Latency Statistics

This group of counters reports latency values measured at three different points in the ESX storage stack. In the context of the figure below, the latency counters in esxtop report the Guest, ESX Kernel and Device latencies. These are under the labels GAVG, KAVG and DAVG, respectively. Note that GAVG is the sum of the DAVG and KAVG counters.

 

http://communities.vmware.com/servlet/JiveServlet/downloadImage/102-9279-1-4856/latency.JPG

 

Note that esxtop shows the latency statistics for different objects, such as adapters, devices, paths, and worlds. They may not perfectly match with each other, since their latencies are measured at the different layers of the ESX storage stack. To do the correlation, you need to be very familiar with the storage layers in ESX Kernel, which is out of our scope.

 

Latency values are reported for all IOs, read IOs, and write IOs. All values are averages over the measurement interval.

  • All IOs: KAVG/cmd, DAVG/cmd, GAVG/cmd, QAVG/cmd

  • Read IOs: KAVG/rd, DAVG/rd, GAVG/rd, QAVG/rd

  • Write IOs: KAVG/wr, DAVG/wr, GAVG/wr, QAVG/wr

 

  • GAVG

This is the round-trip latency that the guest sees for all IO requests sent to the virtual storage device.

 

GAVG should be close to the R metric in the figure.

 

Q: What is the relationship between GAVG, KAVG and DAVG?

A: GAVG = KAVG + DAVG

 

  • KAVG

These counters track the latencies added by the ESX Kernel's command processing.

 

The KAVG value should be very small in comparison to the DAVG value and should be close to zero. When there is a lot of queuing in ESX, KAVG can be as high as, or even higher than, DAVG. If this happens, please check the queue statistics, which will be discussed next.

 

  • DAVG

This is the latency seen at the device driver level. It includes the roundtrip time between the HBA and the storage.

 

DAVG is a good indicator of performance of the backend storage. If IO latencies are suspected to be causing performance problems, DAVG should be examined. Compare IO latencies with corresponding data from the storage array. If they are close, check the array for misconfiguration or faults. If not, compare DAVG with corresponding data from points in between the array and the ESX Server, e.g., FC switches. If this intermediate data also matches DAVG values, it is likely that the storage is under-configured for the application. Adding disk spindles or changing the RAID level may help in such cases.

 

  • QAVG

The average queue latency. QAVG is part of KAVG.

 

Response time is the sum of the time spent in queues in the storage stack and the service time spent by each resource in servicing the request. The largest component of the service time is the time spent in retrieving data from physical storage. If QAVG is high, another line of investigation is to examine the queue depths at each level in the storage stack.
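
Putting the latency counters together, here is a hedged Python sketch of the interpretation described in this section; the thresholds are illustrative assumptions, not official guidance:

# GAVG should be roughly KAVG + DAVG, KAVG (and QAVG, which is part of it)
# should stay near zero, and a high DAVG points at the backend storage.
def diagnose_latency(davg_ms, kavg_ms, gavg_ms, qavg_ms):
    notes = []
    if abs(gavg_ms - (davg_ms + kavg_ms)) > 1.0:
        notes.append('GAVG does not match KAVG + DAVG; re-check the sample')
    if kavg_ms > 2.0 or qavg_ms > 2.0:
        notes.append('queuing inside ESX: inspect the queue statistics')
    if davg_ms > 20.0:
        notes.append('slow backend storage: compare with array-side latency')
    return notes or ['latency looks healthy']

print(diagnose_latency(davg_ms=25.0, kavg_ms=0.3, gavg_ms=25.3, qavg_ms=0.1))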

 

Section 4.2.3 Queue Statistics

  • AQLEN

The storage adapter queue depth. This is the maximum number of ESX Server VMKernel active commands that the adapter driver is configured to support.

 

  • LQLEN

The LUN queue depth. This is the maximum number of ESX Server VMKernel active commands that the LUN is allowed to have. (Note that, in this document, the terms LUN and storage device are used interchangeably.)

 

  • WQLEN

The World queue depth. This is the maximum number of ESX Server VMKernel active commands that the World is allowed to have. Note that this is a per LUN maximum for the World.

 

  • ACTV

The number of commands in the ESX Server VMKernel that are currently active. This statistic is only applicable to worlds and LUNs.

 

Please refer to %USD.

 

  • QUED

The number of commands in the VMKernel that are currently queued. This statistic is only applicable to worlds and LUNs.

 

Queued commands are commands waiting for an open slot in the queue. A large number of queued commands may be an indication that the storage system is overloaded. A sustained high value for the QUED counter signals a storage bottleneck which may be alleviated by increasing the queue depth. Check that LOAD < 1 after increasing the queue depth. This should also be accompanied by improved performance in terms of increased cmd/s.

 

Note that there are queues in different storage layers. You might want to check the QUED stats for both devices and worlds.

 

  • %USD

The percentage of queue depth used by ESX Server VMKernel active commands. This statistic is only applicable to worlds and LUNs.

 

%USD = ACTV / QLEN * 100%

 

For world stats, WQLEN is used as the denominator. For LUN (aka device) stats, LQLEN is used as the denominator.

 

%USD is a measure of how many of the available command queue "slots" are in use. Sustained high values indicate the potential for queueing; you may need to adjust the queue depths for the system's HBAs if QUED is also found to be consistently > 1 at the same time. Queue sizes can be adjusted in a few places in the IO path and can be used to alleviate performance problems related to latency. For detailed information on this topic please refer to the VMware whitepaper entitled "Scalable Storage Performance".

 

  • LOAD

The ratio of the sum of VMKernel active commands and VMKernel queued commands to the queue depth. This statistic is only applicable to worlds and LUNs.

 

The sum of the active and queued commands gives the total number of outstanding commands issued by that virtual machine. The LOAD counter value is the ratio of this sum to the queue depth. If LOAD > 1, check the value of the QUED counter.
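A minimal Python sketch (not part of esxtop) with hypothetical counter values, showing how %USD and LOAD are derived from ACTV, QUED, and the queue depth:

actv = 30     # ACTV: active commands in the VMkernel
qued = 4      # QUED: commands queued in the VMkernel
lqlen = 32    # LQLEN: LUN queue depth (use WQLEN instead for per-world stats)

usd_pct = actv / lqlen * 100.0      # %USD = ACTV / QLEN * 100%
load = (actv + qued) / lqlen        # LOAD = (ACTV + QUED) / QLEN

print(f"%USD = {usd_pct:.0f}%, LOAD = {load:.2f}")
if load > 1.0:
    print("LOAD > 1: sustained queuing -- consider increasing the queue depth")

With these numbers %USD is about 94% and LOAD is 1.06, which is exactly the situation the QUED and LOAD discussions above describe.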

 

Section 4.2.4 Error Statistics

  • ABRTS/s

The number of commands aborted per second.

 

A high abort rate can indicate that the storage system is unable to meet the demands of the guest operating system. Abort commands are issued by the guest when the storage system has not responded within an acceptable amount of time, e.g. 60 seconds on some Windows OSes. Also, resets issued by a guest OS on its virtual SCSI adapter will be translated to aborts of all the commands outstanding on that virtual SCSI adapter.

 

  • RESETS/s

The number of commands reset per second.

 

Section 4.2.5 PAE Statistics

  • PAECMD/s

The number of PAE commands per second.

 

A high value may point to hardware misconfiguration. When the guest allocates a buffer, the VMkernel assigns machine memory, which might come from a “highmem” region. If the adapter driver is not PAE-aware, accesses to this memory region force the VMkernel to copy the data to a lower memory location before issuing the request to the adapter, and this counter is updated. If you do not populate the DIMMs with low memory first, you may artificially cause “highmem” memory accesses.

 

  • PAECP/s

The number of PAE copies per second.

 

Section 4.2.6 Split Statistics

  • SPLTCMD/s

The number of split commands per second.

 

Commands can be split when they reach the vmkernel, which might increase the latency perceived by the guest. The guest may be issuing commands with large block sizes that have to be broken down by the vmkernel. For ESX 3.0.x, guest requests greater than 128KB are split into 128KB chunks. Since few applications issue ops larger than 128KB, this is unlikely to be an issue. Splitting can also occur when IOs fall across partition boundaries, but these are easily differentiated from splits caused by IO size.

 

  • SPLTCP/s

The number of split copies per second.

 

Section 4.3 Batch Mode Output

Esxtop batch mode output can be loaded in perfmon directly. It uses a csv (comma separated values) format. The instance type can be identified via its name. Because there are quite a number of instances related to disk statistics, a few examples are listed below (a short parsing sketch follows the list). You can easily match these formats in your own environment.

 

  • LUN (aka device): "<host>\Physical Disk(DEV-vmhba0:0:0)\<counter>"

  • Partition: "<host>\Physical Disk(PN-vmhba0:0:0-1)\<counter>"

  • Path: "<host>\Physical Disk(PH-vmhba0:C0:T0:L0)\<counter>"

  • Per-World-Per-Device: "<host>\Physical Disk(WD-vmhba0:0:0-1024)\<counter>"

  • Adapter: "<host>\Physical Disk(vmhba0)\<counter>"
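As a small illustration of how these instance names can be consumed outside of perfmon, the following Python sketch (not a supported VMware tool) pulls the per-device ("DEV-") disk counters out of a batch-mode file. It assumes a file named esxtop.csv produced with "esxtop -b > esxtop.csv"; the exact header text can vary between ESX versions.

import csv

with open("esxtop.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)      # first line: counter instance names
    rows = list(reader)        # remaining lines: one snapshot per line

# Columns whose instance name marks a LUN/device ("DEV-") disk counter.
dev_cols = [i for i, name in enumerate(header) if "Physical Disk(DEV-" in name]

print(f"{len(dev_cols)} per-device disk counters, {len(rows)} snapshots")
for i in dev_cols[:5]:         # show a few of the counter names
    print(header[i])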

 

Section 5 Network

Section 5.1 Port

We arrange the network stats per port of a virtual switch. "PORT-ID" identifies the port and "DNAME" shows the virtual switch name. A port can be linked to a physical NIC as an uplink, or can be connected by a virtual NIC. "UPLINK" indicates whether the port is an uplink.

 

If the port is an uplink, i.e., "UPLINK" is 'Y', "USED-BY" shows the physical NIC name.

 

If the port is connected to a virtual NIC, i.e., "UPLINK" is 'N', "USED-BY" shows the port client name. (a) If the port is used by a virtual machine, the client name contains a world id and the VM name. The world id identifies the leader world of the VM group. Note that "vswif" is used by the COS (on classic ESX). (b) If the port is used by the VMKernel system, there is no world id. The client name can be used to identify the use of the port. To give two examples:

 

  • "vmk" is a port used by vmkernel. Users can create vmk NICs for their uses, such as VMotion. On ESXi, there will be at least one vmk NIC to communicate with outside of the host.

  • "Management" is a management port for a portset. This is internal. Usually no need to worry about it.

 

For each non-uplink port, the NIC teaming policy determines which physical NIC is in charge of the port. "TEAM-PNIC" shows the physical NIC name, if valid. Please refer to NIC teaming documentation for details.

 

Section 5.2 Port Statistics

  • "SPEED" (Mbps)

The link speed in Megabits per second. This information is only valid for a physical NIC.

 

  • "FDUPLX"

'Y' implies the corresponding link is operating at full duplex. 'N' implies it is not. This information is only valid for a physical NIC.

 

  • "UP"

'Y' implies the corresponding link is up. 'N' implies it is not. This information is only valid for a physical NIC.

 

  • "PKTTX/s"

The number of packets transmitted per second.

 

  • "PKTRX/s"

The number of packets received per second.

 

  • "MbTX/s" (Mbps)

The MegaBits transmitted per second.

 

  • "MbRX/s" (Mbps)

The MegaBits received per second.

 

Q: Why does MbRX/s not match PKTRX/s for different workloads?

A: This is because the average packet size may differ between workloads. The average packet size can be computed as follows: average_packet_size = MbRX/s / PKTRX/s . A large packet size may improve the CPU efficiency of processing the packets. However, it may potentially increase latency.
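A quick Python sketch with made-up numbers (not esxtop output) showing the arithmetic:

mb_rx_per_s = 940.0      # MbRX/s, in megabits per second
pkt_rx_per_s = 81000.0   # PKTRX/s, in packets per second

avg_packet_bits = mb_rx_per_s * 1_000_000 / pkt_rx_per_s
avg_packet_bytes = avg_packet_bits / 8
print(f"average receive packet size ~ {avg_packet_bytes:.0f} bytes")  # ~1450 bytes here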

 

  • "%DRPTX"

The percentage of transmit packets dropped.

 

"%DRPTX" = "dropped Tx packets" / ("success Tx packets" + "dropped Tx packets")

 

Q: What does it mean if %DRPTX is high?

A: This usually means the network transmit performance is bad. Check whether the physical NICs are already running at full capacity. You may need physical NICs with better performance, or you can add more physical NICs and use a good NIC teaming load balancing policy.

 

  • "%DRPRX"

The percentage of receive packets dropped.

 

"%DRPRX" = "dropped Rx packets" / ("success Rx packets" + "dropped Rx packets")

 

Q: What does it mean if %DRPRX is high?

A: This usually means the network receive performance is bad. Try to give more CPU resources to the impacted VM, or increase the ring buffer size.
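For completeness, a tiny Python sketch with hypothetical packet counts showing how %DRPTX and %DRPRX are derived over one interval:

tx_ok, tx_drop = 50_000, 25
rx_ok, rx_drop = 48_000, 960

drptx = tx_drop / (tx_ok + tx_drop) * 100
drprx = rx_drop / (rx_ok + rx_drop) * 100
print(f"%DRPTX = {drptx:.2f}%, %DRPRX = {drprx:.2f}%")
# A high %DRPRX points at the receiver (VM CPU or ring buffer size);
# a high %DRPTX points at saturated physical NICs.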

 

  • "ACTN/s"

Number of actions per second. The actions here are VMkernel actions. It is an internal counter. We won't discuss it further here.

 

Section 6. Interrupt

Interrupt screens are under development for our next release.

 

Section 7. Batch Mode

Esxtop batch mode output uses a csv (comma separated values) format. The first line contains the names of the performance counters and their instances. Each of the following lines contains the performance data for those counter instances in one snapshot.

 

One way to read the batch mode output file is to load it in Windows perfmon. (1) Run perfmon; (2) Type "Ctrl + L" to view log data; (3) Add the file to the "Log files" and click OK; (4) Choose the counters to show the performance data. Each batch mode counter has a category name (listed as a performance object in perfmon) and a counter name (listed in the counter list in perfmon).

 

The counter names in esxtop batch mode are different from the ones in interactive mode listed in the sections above. The tables below describe their relationships. The first column is the interactive mode counter name; the second column is the batch mode counter category; the last column is the batch mode counter name.

 

  • Table 7-1 CPU Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
CPU load average | Physical Cpu Load | Cpu Load (1 Minute Avg), Cpu Load (5 Minute Avg), Cpu Load (15 Minute Avg)
PCPU USED(%) | Physical Cpu | % Processor Time
PCPU UTIL(%) | Physical Cpu | % Util Time
CORE UTIL(%) | Physical Cpu | % Core Util Time
CCPU(%) us | Console Physical Cpu | % User Time
CCPU(%) sy | Console Physical Cpu | % System Time
CCPU(%) id | Console Physical Cpu | % Idle Time
CCPU(%) wa | Console Physical Cpu | % I/O Wait Time
CCPU(%) cs/sec | Console Physical Cpu | % Context Switches/sec
%USED | Group Cpu (or Vcpu) | % Used
%SYS | Group Cpu (or Vcpu) | % System
%OVRLP | Group Cpu (or Vcpu) | % Overlap
%RUN | Group Cpu (or Vcpu) | % Run
%RDY | Group Cpu (or Vcpu) | % Ready
%MLMTD | Group Cpu (or Vcpu) | % Max Limited
%CSTP | Group Cpu (or Vcpu) | % CoStop
%WAIT | Group Cpu (or Vcpu) | % Wait
%IDLE | Group Cpu (or Vcpu) | % Idle
%SWPWT | Group Cpu (or Vcpu) | % Swap Wait

 

  • Table 7-2 Memory Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
MEM overcommit avg | Memory | Memory Overcommit (1 Minute Avg), Memory Overcommit (5 Minute Avg), Memory Overcommit (15 Minute Avg)
PMEM total | Memory | Machine MBytes
PMEM cos | Memory | Console MBytes
PMEM vmk | Memory | Kernel MBytes
PMEM other | Memory | NonKernel MBytes
PMEM free | Memory | Free MBytes
VMKMEM managed | Memory | Kernel Managed MBytes
VMKMEM minfree | Memory | Kernel MinFree MBytes
VMKMEM rsvd | Memory | Kernel Reserved MBytes
VMKMEM ursvd | Memory | Kernel Unreserved MBytes
VMKMEM state | Memory | Kernel State (0: high, 1: soft, 2: hard, 3: low)
COSMEM free | Console Memory | Free MBytes
COSMEM swap_t | Console Memory | Swap Total MBytes
COSMEM swap_f | Console Memory | Swap Free MBytes
COSMEM r/s | Console Memory | Swap MBytes Read/sec
COSMEM w/s | Console Memory | Swap MBytes Write/sec
NUMA | Numa Node | Total MBytes, Free MBytes
PSHARE shared | Memory | PShare Shared MBytes
PSHARE common | Memory | PShare Common MBytes
PSHARE saving | Memory | PShare Savings MBytes
SWAP curr | Memory | Swap Used MBytes
SWAP target | Memory | Swap Target MBytes
SWAP r/s | Memory | Swap MBytes Read/sec
SWAP w/s | Memory | Swap MBytes Write/sec
MEMCTL curr | Memory | Memctl Current MBytes
MEMCTL target | Memory | Memctl Target MBytes
MEMCTL max | Memory | Memctl Max MBytes
MEMSZ | Group Memory | Memory Size MBytes
GRANT | Group Memory | Memory Granted Size MBytes
SZTGT | Group Memory | Target Size MBytes
TCHD | Group Memory | Touched MBytes
%ACTV | Group Memory | % Active Estimate
%ACTVS | Group Memory | % Active Slow Estimate
%ACTVF | Group Memory | % Active Fast Estimate
%ACTVN | Group Memory | % Active Next Estimate
MCTL? | Group Memory | Memctl?
MCTLSZ | Group Memory | Memctl MBytes
MCTLTGT | Group Memory | Memctl Target MBytes
MCTLMAX | Group Memory | Memctl Max MBytes
SWCUR | Group Memory | Swapped MBytes
SWTGT | Group Memory | Swap Target MBytes
SWR/s | Group Memory | Swap Read MBytes/sec
SWW/s | Group Memory | Swap Written MBytes/sec
SHRD | Group Memory | Shared MBytes
ZERO | Group Memory | Zero MBytes
SHRDSVD | Group Memory | Shared Saved MBytes
COWH | Group Memory | Copy On Write Hint MBytes
OVHDUW | Group Memory | Overhead UW MBytes
OVHD | Group Memory | Overhead MBytes
OVHDMAX | Group Memory | Overhead Max MBytes

 

  • Table 7-3 Disk Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
CMDS/s | Physical Disk | Commands/sec
READS/s | Physical Disk | Reads/sec
WRITES/s | Physical Disk | Writes/sec
MBREAD/s | Physical Disk | MBytes Read/sec
MBWRTN/s | Physical Disk | MBytes Written/sec
KAVG/cmd | Physical Disk | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk | Average Queue MilliSec/Command
KAVG/rd | Physical Disk | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk | Average Driver MilliSec/Read
GAVG/rd | Physical Disk | Average Guest MilliSec/Read
QAVG/rd | Physical Disk | Average Queue MilliSec/Read
KAVG/wr | Physical Disk | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk | Average Driver MilliSec/Write
GAVG/wr | Physical Disk | Average Guest MilliSec/Write
QAVG/wr | Physical Disk | Average Queue MilliSec/Write
AQLEN | Physical Disk | Adapter Q Depth
LQLEN | Physical Disk | Lun Q Depth
DQLEN | Physical Disk | Device Q Depth
WQLEN | Physical Disk | World Q Depth
ACTV | Physical Disk | Active Commands
QUED | Physical Disk | Queued Commands
%USD | Physical Disk | % Used
LOAD | Physical Disk | Load
ABRTS/s | Physical Disk | Aborts/sec
RESETS/s | Physical Disk | Resets/sec
PAECMD/s | Physical Disk | PAE Commands/sec
PAECP/s | Physical Disk | PAE Copies/sec
SPLTCMD/s | Physical Disk | Split Commands/sec
SPLTCP/s | Physical Disk | Split Copies/sec

 

  • Table 7-4 Network Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
SPEED | Network Port | Link Speed (Mb/s)
FDUPLX | Network Port | Full Duplex?
UP | Network Port | Link Up?
PKTTX/s | Network Port | Packets Transmitted/sec
PKTRX/s | Network Port | Packets Received/sec
MbTX/s | Network Port | MBits Transmitted/sec
MbRX/s | Network Port | MBits Received/sec
%DRPTX | Network Port | % Outbound Packets Dropped
%DRPRX | Network Port | % Received Packets Dropped
ACTN/s | Network Port | Actions Posted/sec

 


Best Practices for IBM Lotus Domino


Introduction

This page provides the best practices for virtualizing IBM Lotus Domino using VMware Infrastructure. This list is based on my VMworld 2008 session EA2348.

 

General Recommendations

Use newer hardware

  • Supports latest hardware assist technologies, larger on-processor cache

  • 64-bit may not perform better with older hardware

 

Use VMware ESX, which uses a bare-metal (hypervisor) architecture

  • You can start with VMware ESXi - the free version.

  • Do not use VMware Workstation or VMware Server, which use the hosted architecture.

 

VMware ESX allows you the choice of virtualization technology best suited for your workload

  • Hardware Assist (AMD, Intel) (both CPU and MMU virtualization) if your hardware supports it

  • Paravirtualization (if you use SLES for your Domino deployment)

  • Binary Translation

 

Migrate to latest version of ESX

  • E.g. ESX 3.5 defaults to 2nd-generation hardware assist if available and has several I/O performance improvements

 

Lotus Domino: Plan to migrate to version 8.0

  • Significant performance improvements, especially disk I/O

 

Provide Redundancy to the ESX host

  • Power supplies, HBAs, NICs, Network and SAN switches

  • E.g. NIC teaming, HBA multi-pathing

 

Leverage VMotion, Storage VMotion, DRS and HA for higher Domino availability

 

VM configuration

64-bit OS recommended

  • VI3 supports all x86 OSs that Domino supports: Windows, SLES, RHEL

  • Larger memory limits in a 64-bit OS help cache more data and thus avoid disk IO, reducing response times and hence increasing the number of users supported

  • Increase VM memory when running in 64-bit guest OS

 

64-bit may not perform better with older hardware

  • E.g. 64-bit Windows more sensitive to onboard L2/L3 chip caches

  • Microsoft reports 10-15% degradation with older hardware

 

Guest Operating System:

  • Windows: Use 2003 SP2

    • Microsoft eliminated most APIC TPR accesses, which improves virtualized performance

  • Linux: Use 2.6.18-53.1.4 kernel or later to use divider patch

    • Some older Linux versions have a 1 kHz timer rate

    • Put divider=10 on the end of the kernel line in grub.conf and reboot

 

VM Time Synchronization

  • Use VMware Tools time synchronization within the virtual machine

  • Enable ESX server NTP daemon to sync with external stratum NTP source (VMware Knowledge Base ID# 1339)

  • Disable OS Time Service

    • Windows: w32time service

    • Linux: NTP daemon

 

Storage

Storage configuration is absolutely critical; most performance problems are traced to it

  • Number of spindles, RAID configuration, drive speed, controller cache settings, queue depths – all make a big difference

 

Align partitions

 

Use separate, dedicated LUNs for OS/Domino, data and transaction logs

  • Separate the IO at physical disk level, not simply logical LUNs

  • Make sure these LUNs have enough spindles to support the IO demands

  • Fewer spindles or too many VMDK files on single VMFS LUN can substantially increase disk IO latencies

  • Check Scalable Storage Performance to understand the details

 

RAID configuration

  • RAID 1+0 for Data, RAID 0 for Log

 

Cache settings

  • Set the write policy to "write back" and the read policy to "read ahead"

 

Queue Depths

  • Increase to 255

 

Storage Protocol: Fibre Channel or iSCSI

 

Storage Partition: VMFS or RDM

 

VI3 supports latest storage technologies: leverage these if you have already invested or plan to invest

  • Fibre channel – 8Gbps connectivity

  • ISCSI – 10GigE network connectivity, Jumbo Frames

  • Infiniband support

 

Virtual CPUs

The number of vCPUs per VM depends on the number of users to be supported

  • Start with uni-processor, may be enough

  • Try not to over-provision vCPUs in the guest

 

Verify CPU compatibility for VMotion

 

Memory

Increasing memory to avoid disk I/O is the most effective technique to improve performance

More available memory = more Lotus Domino Cache

 

Increase NSF_DbCache_Maxentries value

 

Leverage the higher 64GB per-VM memory limit in VI 3.5 when using a 64-bit guest OS for Domino

  • 64-bit OSs can take advantage of larger memory limits for file caching

 

Leverage NUMA optimizations in VI3

  • When using NUMA, try to fit the VM within a single node to avoid latencies accessing memory on remote nodes

 

Networking

Use dedicated NICs based on the network traffic

  • E.g. separate NICs for mail and replication traffic

 

Use NIC Teaming & VLAN Trunking

 

Use Enhanced VMXNET driver with TSO and Jumbo Frames support

 

Enable TCP transmit coalescing

 

Network traffic between co-located VMs can outperform the physical 1Gbps network speed

 

Resource Management

Use proportional and absolute mechanisms to control VM priorities

  • Shares, reservations, and limits for CPU and memory

  • Shares for virtual disks

  • Traffic shaping for network

 

Faster migration, resulting in better load balancing, when using

  • Smaller VMs

  • Smaller memory reservations for VMs

 

Affinity rules for VM placement

  • E.g. Directory, Mail Server VMs on same ESX

 

Deployment

Virtualization Assessment

  • Capacity Planner

  • Benchmark against Information Warehouse

 

Easy migration

  • VMware Converter – both hot and cold cloning

  • Start with RDM to point to existing data/ transaction log LUNs, but move to VMFS later

 

Easier change management and quicker provisioning

  • Templates and clones for easy provisioning

 

Storage Performance: VMFS and Protocols


Introduction

VMware's customers are always asking us about the storage stack.  Without exception, the two most common questions about our storage system performance are:

 

  1. Which storage protocol performs best?

  2. Does VMFS scale to meet the demands of many servers and VMs?

 

This document covers a few of the points needed to understand these issues.

 

Storage Protocols

VMware published a paper comparing storage protocols in 2008.  This paper detailed the two key characteristics of ESX's storage stack:

 

  1. The hypervisor is easily able to drive the storage connection to link speed.

  2. Configurations where protocol management happens in the HBA (Fibre Channel and HW iSCSI) are more CPU efficient.

 

On the first note, take the following graph, taken from page three of the paper:

 

http://communities.vmware.com/servlet/JiveServlet/downloadImage/5618/protocol_throughput.png 

 

Note that in this case all four test cases drive the storage to link speed.  That's 2 Gb/s with the Fibre Channel HBA and 1 Gb/s with the other three.  In short, if throughput is your goal, make decisions based on link speed.  If you check through the rest of the paper, you'll see that response time is similar for all of the configurations, as well.  But you will see slight differences in throughput in some of the protocols.
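As a rough back-of-the-envelope check (these numbers are not from the paper), a short Python sketch of the wire-rate ceilings implied by those link speeds:

def ceiling_mb_per_s(link_gbps):
    """Approximate wire-rate ceiling in MB/s, ignoring protocol overhead."""
    return link_gbps * 1000 / 8

for name, gbps in [("2 Gb/s Fibre Channel", 2), ("1 Gb/s iSCSI/NFS link", 1)]:
    print(f"{name}: ~{ceiling_mb_per_s(gbps):.0f} MB/s")
# Roughly 250 MB/s and 125 MB/s respectively, before protocol overhead.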

 

This brings us to the second point from above: less work is done by the CPU when protocol management can be off-loaded to the HBA.  This means that FC and HW iSCSI HBAs will have additional CPU cycles for the VMs' work.  It can also explain the slight differences in throughput in the other graphs in the paper.  The efficiency results quoted in the paper are here:

 

 

 

The increased overheads of running software iSCSI or NFS are due to the VMkernel managing those protocols.  It's worth noting that the proliferation of iSCSI in the enterprise has led VMware to spend considerable effort to improve the efficiency of SW iSCSI.  Expect its efficiency to improve dramatically in the following releases.

 

VMFS Scalability

Many in the industry erroneously believe that VMFS won't scale as storage demands grow.  Often SCSI reservations and disk locking are cited as the technical-sounding but vaguely-supported reason for this claim.  It's worth sampling data from our scalable storage performance paper to debunk this myth.

 

 

 

This chart is a favorite in our world-wide tours as we address VMFS scalability.  It was first introduced in a VMFS scalability blog article that went live in February of 2008.  It shows the results of using 64 hosts to generate a variety of traffic on a single VMFS volume.  And it's a wealth of information on VMFS and storage access patterns.  For instance:

 

  • The aggregate number of random writes, in cyan in the middle, maintains perfectly flat linear scalability as the host count grows from 1 to 64.

  • The aggregate number of random reads is initially limited by the few disks being accessed but ultimately matches the throughput of random writes as many disks come to bear to serve the large number of random reads.

  • The sequential write activity, which highlights the strengths of today's arrays, demonstrates the largest total throughput, which only slightly drops as the array manages so many connections.

  • The sequential read activity, however, drops off dramatically as hosts are added.

 

This last example showing degradation in aggregate sequential read capability is an artifact of the workload that is very important to database administrators: multiple sequential reads approximate random activity.  Why is this?  As many hosts request more and more sequential data, the array interleaves these requests to maintain response times.  This means that the sequential accesses get "shuffled" which results in a random access pattern.

 

In short, VMFS has no scalability problems as many hosts drive tremendous amounts of traffic to a single volume.  If the data isn't convincing enough, consider the following: there are no SCSI reservations used during normal data access.  This means that there are no scalability limitations as a result of virtual machine storage access.  A word of caution, though: the file system is locked during administrative operations that change the metadata on the volume.  This means that virtual machine creation or destruction can result in file system locks.  Perform these operations during off-peak hours.

ESX Monitor Modes


VMware has supported Intel and AMD's virtualization assist since 2006.  Long before then we were using an all-software approach that we call binary translation (BT).  With the benefit of years of development and optimization, BT outperformed the early versions of hardware assist.  But as hardware assist evolved the use of these new features became more attractive.

 

Because our support for hardware assist is rich and BT is heavily optimized, the monitor can benefit from using either technology in different situations.  The following tables detail the defaults in ESX 4.0, which can be changed through VM settings if desired.

 

Monitor Defaults with Intel Processors

VM Configuration | Core-i7 (Nehalem) | 45nm Core2 with VT-x | 65nm Core2 with VT-x and FlexPriority | 65nm Core2 with VT-x and No FlexPriority | P4 with VT-x | EM64T without VT-x | No EM64T
FT enabled | VT-x + SPT | VT-x + SPT | VT-x + SPT | VT-x + SPT | Not runnable | Not runnable | Not runnable
64-bit guests | VT-x + EPT | VT-x + SPT | VT-x + SPT | VT-x + SPT | VT-x + SPT | Not runnable | Not runnable
VMI enabled | BT + SPT | BT + SPT | BT + SPT | BT + SPT | BT + SPT | BT + SPT | BT + SPT
OpenServer, UnixWare, OS/2 | VT-x + EPT | VT-x + SPT | VT-x + SPT | VT-x + SPT | VT-x + SPT | BT + SPT | BT + SPT
32-bit Linux and 32-bit FreeBSD | VT-x + EPT | VT-x + SPT | BT + SPT (*) | BT + SPT (*) | BT + SPT (*) | BT + SPT | BT + SPT
32-bit Windows XP, Windows Vista, Windows Server 2003, Windows Server 2008 | VT-x + EPT | VT-x + SPT | VT-x + SPT | BT + SPT (*) | BT + SPT (*) | BT + SPT | BT + SPT
Windows 2000, Windows NT, DOS, Windows 95, Windows 98, Netware, 32-bit Solaris | BT + SPT (*) | BT + SPT (*) | BT + SPT (*) | BT + SPT (*) | BT + SPT (*) | BT + SPT | BT + SPT
All other 32-bit guests | VT-x + EPT | VT-x + SPT | VT-x + SPT | VT-x + SPT | VT-x + SPT | BT + SPT | BT + SPT

 

(*) When we use BT on an Intel system with VT-x capability, we dynamically switch to VT-x if the guest enters long mode.

 

Monitor Defaults with AMD Processors

Configuration | Barcelona, Phenom, and Newer | AMD64 pre-Barcelona | No AMD64
FT enabled | AMD-V + SPT | Not runnable | Not runnable
64-bit guests | AMD-V + RVI | BT + SPT | Not runnable
VMI enabled | BT + SPT | BT + SPT | BT + SPT
OpenServer, UnixWare, OS/2 | AMD-V + RVI | BT + SPT | BT + SPT
32-bit Linux and 32-bit FreeBSD | AMD-V + RVI | BT + SPT | BT + SPT
32-bit Windows XP, Windows Vista, Windows Server 2003, Windows Server 2008 | AMD-V + RVI | BT + SPT | BT + SPT
Windows 2000, Windows NT, DOS, Windows 95, Windows 98, Netware, 32-bit Solaris | BT + SPT | BT + SPT | BT + SPT
All other 32-bit guests | AMD-V + RVI | BT + SPT | BT + SPT

 

Legend

  • VT-x: Intel's virtualization hardware assist.

  • EPT: Extended Page Tables.  Intel's on-board, virtualization-aware memory management unit (MMU).

  • EM64T: Intel's 64-bit extensions to the x86 architecture.

  • SPT: Shadow page tables.  ESX's software memory management unit (i.e., not EPT or RVI.)

  • BT: Binary translation.  ESX's software virtualization capability (i.e., not VT or AMD-V)

  • AMD-V: AMD's virtualization hardware assist.

  • RVI: Rapid Virtualization indexing.  AMD's on-board, virtualization-aware memory management unit (MMU).

 

Meet the Engineer Series: VMware Performance Advancements


Real people, real faces, discussing VMware vSphere topics...

 



In this video, VMware's Chief Performance Architect discusses why you should seriously consider virtualizing all of your applications on VMware:

 

Using vscsiStats for Storage Performance Analysis


Introduction

esxtop is a great tool for performance analysis of all types.  However,  with only latency and throughput statistics, esxtop will not provide the  full picture of the storage profile.  Furthermore, esxtop only provides  latency numbers for Fibre Channel and iSCSI storage.  Latency analysis  of NFS traffic is not possible with esxtop.

 

Since ESX 3.5, VMware has provided a tool specifically for profiling  storage: vscsiStats.  vscsiStats collects and reports counters on  storage activity.  Its data is collected at the virtual SCSI device  level in the kernel.  This means that results are reported per VMDK (or  RDM) irrespective of the underlying storage protocol.  The following  data are reported in histogram form:

 

 

  • IO size
  • Seek distance
  • Outstanding IOs
  • Latency (in microseconds)
  • More!

 

Running vscsiStats

vscsiStats collection and analysis requires two steps:

  1. Start statistics collection.
  2. View accrued statistics.

 

Documentation on command-line parameters are available when running '/usr/lib/vmware/bin/vscsiStats -h'.

 

 

Starting and Stopping vscsiStats Collection

The tool is started with the following command:

/usr/lib/vmware/bin/vscsiStats -s -w <world_group_id>

 

 

This command starts the process that will accrue statistics.  The world  group ID must be set to a running virtual machine.  The running VMs' IDs  can be obtained by running '/usr/lib/vmware/bin/vscsiStats -l'.

 

 

After about 30 minutes vscsiStats will stop running.  If the analysis is needed for a longer period, the start command above should be repeated within this window.  That will defer the timeout and termination by another 30 minutes.

 

 

Since results are accrued and reported out in summary, the histograms  will include data since collection was started.  To reset all counters  to zero, run '/usr/lib/vmware/bin/vscsiStats -r'.

 

 

Viewing Statistics

Counters are displayed by using the following command:

/usr/lib/vmware/bin/vscsiStats -p <histo_type> [-c]

 

 

The histogram type is used to specify either all of the statistics or  one group of them.  Options include all, ioLength, seekDistance,  outstandingIOs, latency, interarrival.

 

 

Results can be produced in a more compact comma-delimited list by adding the optional "-c" above.
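If you want to post-process the compact output, here is a small Python sketch. It assumes you have reduced the "-c" output to simple "bucket_upper_bound_us,count" pairs in a file called latency_histo.csv (the raw compact format carries additional columns that vary by release); it then reports what fraction of IOs completed under a chosen latency threshold.

import csv

threshold_us = 15_000          # 15 ms
total = under = 0
with open("latency_histo.csv", newline="") as f:
    for upper_bound_us, count in csv.reader(f):
        c = int(count)
        total += c
        if int(upper_bound_us) <= threshold_us:
            under += c

if total:
    print(f"{under}/{total} IOs ({100 * under / total:.1f}%) completed within {threshold_us} us")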

 

 

Using vscsiStats Results

Use Case 1: Identifying Sequential IO

Storage arrays can process sequential IO much faster than random IO.   You can therefore improve the performance of a sequential workload by  placing it on a dedicated LUN to allow the array to optimize access.   vscsiStats can help you identify your sequential workloads even if you  don't understand anything about the application in the VM.

 

Take the following graph as example, which I generated by running '/usr/lib/vmware/bin/vscsiStats -p seekDistance':

 

 

random_write_histo.png



This graph shows that most of the commands are being issued a great  distance from the previous command.  It looks like all of the commands  were 50,000 or more logical blocks away from the previous command.  When  I looked at the raw data, I saw that over 99% of the commands were more  than 128 blocks away from the previous command.  That's random access  if I've ever seen it.  Here's the opposite example:

sequential_write_histo.png

In this case the logical block number (LBN) of each command is most  frequently exactly one larger than the previous command.  That's the  signature of a heavily sequential workload.  It shouldn't surprise you  to learn that both of these profiles were generated by Iometer using  random and sequential writes, respectively.

Use Case 2: Optimizing for IO Sizes

The IO size is an important characteristic of storage profiles.  A  variety of best practices have been provided by storage vendors to  enable customers to tune their storage to a particular IO size.  As an  example, it may make sense to optimize an array's stripe size to its  average IO size.  vscsiStats can provide a histogram of IO sizes to help  this process.  The following graph was generated by  '/usr/lib/vmware/bin/vscsiStats -p ioLength':

io_size_4k.png

From these results I can see that about a quarter of the commands came  in IOs smaller than 4k.  About half of the commands were sized to 4k  commands.  The minute number of remaining IOs were larger than 4k.  This  signature is common of a VMDK formatted to 4k blocks and supporting OS  and application execution.  The storage array should be optimized for 4k  blocks if this disk's performance is a priority.

Use Case 3: Storage Latency Analysis (Including NFS!)

esxtop is a terrific tool for latency-based storage analysis.  Fibre  Channel and iSCSI HBAs have device and kernel latencies in esxtop's  storage panel.  Software iSCSI initiators will show up as vmhba32 (ESX  3.5 and earlier) and vmhba33 (ESX 4.0 and later.)  But esxtop does not  provide latency statistics for NFS stores.

Because vscsiStats collects its results where the guest interacts with  the hypervisor, it is unaware of the storage implementation.  Latency  statistics can be collected for all storage configurations with this  tool.

latency.png

The above graph shows that the server in my office with a single  direct-attached SCSI disk is performing as I would expect.  About half  of all the operations are completing in under 5 ms.  The other half take  5-15 ms to complete.  A few commands took longer than 15 ms, but the  number is so small that it doesn't concern me.  Similar results can be  seen with NFS arrays.

vscsiStats on ESXi

vscsiStats can be installed on ESXi hosts after putting the host into tech support mode.  More information on this process is available in Scott's blog post on the subject at vPivot.

Additional Resources

My colleagues Ajay Gulati, Chethan Kumar, and Irfan Ahmad presented Storage Workload Characterization and Consolidation in Virtualized Environments at VPACT '09.  This paper serves as an excellent example of vscsiStats in action.

I learned vscsiStats by reviewing Irfan's VMworld 2007 presentation (vscsiStats: Fast and Easy Disk Workload Characterization on VMware ESX Server) and playing with the tool.  Check out his presentation if you'd like more detail.

vscsiStats: Fast and Easy Disk Workload Characterization on VMware ESX Server

Storage Workload Characterization and Consolidation in Virtualized Environments


Performance Troubleshooting for VMware vSphere 4 and ESX 4.0


Performance problems can arise in any computing environment. Complex application behaviors, changing demands, and shared infrastructure can lead to problems arising in previously stable environments. Troubleshooting performance problems requires an understanding of the interactions between the software and hardware components of a computing environment. Moving to a virtualized computing environment adds new software layers and new types of interactions that must be considered when troubleshooting performance problems.

 

The attached document is the first installment in a guide covering performance troubleshooting in a vSphere environment. It uses a guided approach to lead the reader through the observable manifestations of complex hardware/software interactions in order to identify specific performance problems. For each problem covered, it includes a discussion of the possible root-causes and solutions. Topics covered include performance problems arising from issues in the CPU, memory, storage, and network subsystems, as well as in the VM and ESX host configuration.  Guidance is given on relevant performance metrics to observe using the vSphere Client and esxtop in order to isolate specific performance issues. 

 

This first installment of Performance Troubleshooting for VMware vSphere 4 covers performance troubleshooting on a single VMware ESX 4.0 host. It focuses on the most common performance problems which affect an ESX host. Future updates will add more detailed performance information, including troubleshooting information for more advanced problems and multi-host vSphere deployments.

 

This is a living document. Reader comments, questions, and suggestions are encouraged.

Memory Performance Chart Metrics in the vSphere Client


The vSphere Client exposes several memory performance statistics for users to identify VM memory usage.

 

Some of the important memory performance metrics follow. Each metric name appears under the Measurement column of the Performance Chart Legend, as shown in the following screenshot:

 

http://communities.vmware.com/servlet/JiveServlet/downloadImage/6421/Performance_Charts_VM.png

 

 

  • Active: The amount of guest physical memory that is being used by the VM. Active memory may be different from what is seen inside the guest operating system. This is because the guest operating system generally has a more precise view about what memory is “active” than the hypervisor because it knows when applications allocate or deallocate memory. In addition, the sampling technique used by ESX often takes time to converge, so the memory usage measured in the guest operating system may be more accurate when the workload memory usage is fluctuating.

  • Shared: The amount of guest physical memory shared through transparent page sharing. This includes the memory shared with other VMs and the memory shared within the VM.

  • Consumed: The amount of host physical memory allocated to the VM, accounting for savings from memory sharing with other VMs. When multiple VMs share a host memory region, each VM is charged for the shared memory in proportion to its references to that host memory. For example, if a VM has 100MB of host memory equally shared with three other VMs, only 25MB is accounted as Consumed for this VM. If the 100MB memory is shared only within the VM, the Consumed memory accounts for the full 100MB. (A short worked sketch follows this list.)
     
         Note that for a host that is not memory overcommitted, the Consumed memory represents a “high water mark” of the memory usage by the VM. It is possible that in the past, the VM was actively using a large amount of host physical memory but currently it is not. Because host memory is not overcommitted, the Consumed memory will not be shrunk through ballooning or swapping. Hence, the Consumed memory could be much higher than the Active memory when host memory is not overcommitted.

  • Granted: The amount of guest physical memory currently backed by host physical memory. Due to memory sharing, the Granted memory is greater than or equal to the Consumed memory. For instance, if a guest allocates 100MB of memory that is entirely zeroed, then once all the zeroed pages are shared, the VM’s Granted memory is 100MB but the VM’s Consumed memory is only 4KB.

  • Overhead: The extra host physical memory used by ESX to run a VM. The Overhead memory has two components: 1) system-wide overhead from the VMkernel; 2) additional overhead for each VM, including the space reserved for the VM frame buffer and various virtualization data structures. Since the Overhead memory always resides in host memory, ESX must reserve memory for it. Thus a VM’s memory reservation has two individual components: the user-specified memory reservation and the overhead memory reservation. For example, if the user specifies a 1GB reservation and the Overhead memory for the VM is 100MB, the VM’s memory reservation when powered on would be 1.1GB.

  • Balloon: The amount of guest physical memory that is currently reclaimed through the balloon driver.

  • Swapped: The amount of guest physical memory swapped out to the VM’s swap device by ESX.

  • Swapped in rate: The rate at which the host physical memory is being swapped in from the host swap device.

  • Swapped out rate: The rate at which the host physical memory is being swapped out to the host swap device.
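A short Python sketch with hypothetical numbers, mirroring the Consumed example noted in the list above: shared host memory is charged to each VM in proportion to its references to that memory.

region_mb = 100
vms_sharing_region = 4          # this VM plus three others sharing the region

consumed_from_shared_mb = region_mb / vms_sharing_region
print(f"Consumed charge for the shared region: {consumed_from_shared_mb:.0f} MB")  # 25 MB

# If the same 100 MB were shared only within this VM, the full 100 MB
# would be charged to its Consumed counter.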

 


Advanced Networking Performance Options


Some of the advanced networking options available in vSphere 4.0 are reviewed in this paper. Many of these options control trade-offs between latency, throughput, CPU utilization, and reliability (e.g., dropped packets). It is not possible to optimize all of these at the same time, so option defaults are chosen to be suitable for the vast majority of applications. These options are provided to meet the stricter requirements of other applications. Advanced options often have subtle side effects, or merely move an issue from one area to another. Therefore it is recommended that VMware Support be engaged before changing such options, especially for production machines.

 

There are over 100 options that can be set under Configuration → Advanced Settings → Net. Of these, the ones listed below are most likely to be useful for tuning networking performance. Many of the others are for internal testing or enable unreliable features.

 

All of the options listed here take integer values. For the “Boolean” ones only the default value is shown: 0 for “false”, and 1 for “true”. Other parameters are shown with their default, minimum, and maximum values.

 

  • MaxPortRxQueueLen (80, 1, 500): Maximum length of the Rx queue for virtual ports whose clients support queueing. Consider increasing it if Rx packet drops are seen on the port connected to a VM. Relevant only for e1000 vNICs used with Fault Tolerance (FT) and VLANs.
  • MaxNetifTxQueueLen (500, 1, 1000): Maximum length of the Tx queue for the physical NICs. Increase if Tx packet drops are seen on the uplink port to the pNIC.
  • GuestTxCopyBreak (64, 60, 4294967295): Transmits whose packet headers are smaller than this (in bytes) are copied rather than mapped. More security and functionality than performance implications.
  • VmxnetTxCopySize (256, 0, 4294967295): Transmits smaller than this (in bytes) are copied rather than mapped. Copying costs CPU but puts less pressure on the Tx queue and does not require completion.
  • VmxnetWinUDPTxFullCopy (1): Enable full copy of Windows vmxnet UDP Tx packets. Might be disabled to save CPU, especially for jumbo frames, at the cost of risking more packet drops.
  • NetTxDontClusterSize (0, 0, 8192): Tx packets smaller than this (in bytes) are transmitted immediately (coalescing options are over-ruled for these packets). Used to ensure good latency for small packets.
  • CoalesceTxTimeout (4000, 1, 4294967295): The coalesce timeout in microseconds, or effectively the maximum latency without transmitting. Smaller values can reduce packet latency at the cost of CPU. Risky to go below 1000.
  • CoalesceDefaultOn (1): Enable dynamic coalescing. Disable to test whether issues are related to coalescing.
  • CoalesceHandlerPcpu (1, 0, 128): pCPU that the coalesce timeout handler runs on. May be important to set this if VM CPU pinning is used.
  • CoalesceTxQDepthCap (40, 0, 80): Maximum number of “normalized” Tx packets to coalesce. Reduce if Tx coalescing appears to be too aggressive.
  • CoalesceRxQDepthCap (40, 0, 80): Maximum number of “normalized” Rx packets to coalesce. Reduce if Rx coalescing appears to be too aggressive.
  • vmxnetThroughputWeight (0, 0, 255): How far to favor Tx throughput for vmxnet 2 & 3. “0” is dynamic; otherwise this is a weight where a lower value favors latency and a higher value favors throughput.
  • TcpipHeapSize (24, 24, 120): Initial size of the TCP/IP module heap in megabytes. May need to increase if there are many vmkernel connections (NFS, iSCSI, etc.).
  • TcpipDefLROMaxLength (16000, 1, 65535): Maximum length of the LRO-aggregated packet for vmkernel connections. Increasing this reduces the number of acknowledgments, which improves efficiency but may increase latency.
  • E1000TxZeroCopy (0): If disabled, copy UDP or non-TSO Tx packets for e1000.
  • E1000TxTsoZeroCopy (1): If enabled, do not copy TSO Tx packets for e1000.
  • E1000IntrCoalesce (1): Enable interrupt coalescing for e1000. Disabling can improve latency at the expense of CPU.
  • MaxPktRxListQueue (3500, 0, 200000): Maximum number of packets queued in the vmkernel. Increasing this can reduce the number of dropped packets, but at the cost of increased vmkernel memory and queuing latency.
  • Vmxnet3RSSHashCache (1): Enable RSS hash cache for vmxnet3 in Windows guests.
  • VmklnxLROEnabled (0): Enable large packets for recent Linux guests with vmxnet 2 & 3. Most likely to benefit hosts with a small number of VMs with few sessions each, where each session has a heavy Rx load (more than 1 MB/sec). This is an experimental feature and has not been tested extensively.
  • VmklnxLROMaxAggr (6, 0, 24): Maximum aggregation count in number of packets for vmklinux LRO.

 


Achieving High Web Throughput Scaling with VMware vSphere 4 on Intel Xeon 5500 series (Nehalem) servers


We just published a SPECweb2005 benchmark score of 62,296 -- the highest result published to date on a virtual configuration. This result was obtained on an HP ProLiant DL380 G6 server running VMware vSphere 4 and featuring Intel Xeon 5500 series (Nehalem) processors and Intel 82598EB 10 Gigabit AF network interface cards. While driving the network throughput from a single host to just under 30 Gbps, this benchmark score still stands at 85% of the level achieved in native (non-virtualized) execution on equivalent hardware configurations. These results clearly demonstrate that VMware software works very efficiently with HP systems and Intel processors to provide high performance virtualization solutions that meet the performance and scaling needs of modern data centers.

 

The benchmark result just published includes the following distinctive characteristics:

 

  • Use of VMDirectPath for virtualizing network I/O that builds upon Intel's VT-d technology


  • High performance and linear scaling with the addition of virtual machines


  • A highly simplified setup that does not require binding of interrupts to CPUs


  • 85% of native performance while driving ~30 Gbps of network traffic


 

We will elaborate upon each of these characteristics. We focus first on the use of VMDirectPath for this publication. We will then describe the workload configuration before discussing the remaining three aspects listed above.

 

 

 

 

 

Use of VMDirectPath

 

 

In VMware vSphere, network I/O can be virtualized using device emulation, paravirtualization, or the VMDirectPath capability. The result we just published is notably different from our previous results in that this time we used the VMDirectPath feature to benefit from the higher performance that it makes possible. To explain why this is the case, let us first describe how the three methods of network I/O virtualization work, and their implications in a customer environment.

 

 

Emulation: Under emulation, the hypervisor presents to the guest a virtual device such as an e1000 NIC. A potent benefit of the emulation approach is that the guest does not need to be modified, as it already has the driver support for the commonly emulated devices. The guest remains unaware of the actual hardware through which the hypervisor conducts the I/O, and it runs merely as though it were running in a physical platform that had the emulated device in it, even if the actual physical platform had some radically different type of communication gear for which the guest had no driver available. This flexibility and simplicity comes at some performance cost: each interaction of the guest with the emulated device causes a transition into hypervisor, which incurs overhead at the CPU. Even so, we would like to note that improvements in virtualization technology in modern processors as well as in the algorithms that VMware employs have been effective in ensuring that the overhead from emulation is low or tolerable in the majority of customer environments.

 

 

Paravirtualization: Under paravirtualization, the operating system in a guest uses a device driver that is explicitly designed to drive traffic through a virtual device such as the vmxnet2 or the vmxnet3 implementations; these implementations are made available by VMware for nearly all popular guest operating systems. This technique permits the guest to interact with the hypervisor through a send-receive interface designed specifically for optimal mediation by the hypervisor. Compared to emulation, paravirtualization reduces the number of transitions through the hypervisor, reducing latency and CPU usage.

 

 

Paravirtualization in combination with VMware NetQueue is a proven high performance network I/O virtualization technique that is applicable and recommended in most customer environments. Note that NetQueue and paravirtualization complement each other in 10 Gigabit Ethernet consolidation environments. Let us briefly amplify upon NetQueue. Ordinarily, I/O operations from different VMs have to be multiplexed over a single I/O channel in a NIC adapter. VMware's NetQueue capability takes advantage of multiple I/O channel capability in such NICs as Intel's 82598EB to organize the traffic from different VMs into separate queues and bypasses the need for the hypervisor to perform such multiplexing.

 

 

We have used paravirtualization in combination with VMware NetQueue in our prior world record SPECweb2005 results that featured fifteen virtual machines. Together they handled close to sixteen Gigabits per second web traffic on a single ESX host.

 

 

VMDirectPath: VMDirectPath is a more recent technique in vSphere 4 that builds upon the Intel VT-d (Virtualization Technology for Directed I/O) capability engineered into recent Intel processors. The technique allows guest operating systems to directly access an I/O device, bypassing the virtualization layer. This direct path, or pass-through, can improve performance in situations that require driving large amounts of network traffic from a single VM.

 

 

We note that most customers whose performance requirements are met by either emulation or paravirtualization would not find the VMDirectPath option compelling, since VMotion, Fault-tolerance, VMSafe, and Memory-Overcommit features rely on hypervisor controlled abstraction to properly decouple virtual machines from the hardware infrastructure.

 

 

However, for the high bandwidth requirement that we targeted in this benchmark test, in which each of the VMs drove close to eight Gigabits per second of traffic, it was critical to employ VMDirectPath technology. The vmxnet3 implementation presently faces a potential interrupt handling bottleneck in cases where a guest must deal with a large amount of network traffic, because receive side scaling (RSS) remains to be implemented for certain guest OSes. While such a situation is not a frequent one, VMDirectPath can certainly help in it. In the more common case where the objective is to satisfy the aggregate network I/O requirements of multiple gigabits across multiple VMs, paravirtualization is generally successful. In such cases VMDirectPath provides the means for near-complete avoidance of the hypervisor overheads to reach peak results, as we demonstrated in our publication.

 

 

We proceed in the next section to describe the workload and configuration details.

 

 

Workload

 

 

The SPECweb2005 benchmark consists of three workloads: Banking, Ecommerce, and Support, each with different workload characteristics representing the three widespread usages of web servers. Each workload measures the number of simultaneous user sessions a web server can support while still meeting stringent quality-of-service and error-rate requirements. The aggregate metric reported by the SPECweb2005 benchmark is a normalized metric based on the performance scores obtained on all three workloads.

 

 

Benchmark Configuration

 

 

In our test configuration, the system under test was an HP ProLiant DL380 G6 server with dual-socket, quad-core Intel Xeon X5570 2.933 GHz processors and 96 GB memory. The SUT was configured with VMware vSphere 4 that hosted four virtual machines. Each virtual machine was configured with 4 vCPUs, and 21 GB memory. Each of the four virtual machines used a separate Intel 82598EB 10 Gigabit AF NIC (configured with VMDirectPath) for client traffic.

 

 

We used the 64-bit SuSE Linux Enterprise Server (SLES) 11 release as the guest operating system in the four virtual machines. The Linux kernel in SLES 11 is based on 2.6.27.x kernel source, which incorporates TX multi-queue support and MSI-X improvements. In conjunction with VMDirectPath, these improvements in SLES 11 kernel further reduce the interactions between guest OS and hypervisor, and thereby help improve performance and scaling.

 

 

The web serving software consisted of the Rock Webserver and the Rock JSP server. The same web serving software was used in native benchmark submissions.

 

 

We next describe three distinctive aspects of our SPECWeb2005 publication: (1) high performance with linear scaling, (2) highly simplified setup, and (3) competitiveness of the virtualized system's performance with that of prior native results on equivalent hardware.

 

 

High Performance with Linear Scaling

 

 

In a consolidated server environment, one can expect multiple virtual machines with high network I/O demands. Although VMDirectPath bypasses the virtualization layer to a large extent for the network interactions, we still face a measurable number of guest OS and hypervisor interactions, such as those needed for the hypervisor to vector the interrupts to the guests that own each of the physical network adapters. The possibility exists, therefore, that the hypervisor could become a scaling limiter in a multi-VM environment. The scaling data obtained in our tests and charted in Figure 1 removes this concern.

 

 

Figure 1 shows the aggregate throughput of 1, 2, 3, and 4 virtual machines for each of the three SPECweb2005 workloads. As depicted in Figure 1, performance scales linearly as we add more VMs.

 

 

 

 

Highly Simplified Setup

 

 

A technique commonly employed in SPECweb2005 testing is to bind device interrupts to specific processors, which maximizes performance by removing the overhead and scaling hurdles from unbalanced interrupt loads. Results published at the SPECweb2005 website reveal the complexity of the "interrupt pinning" that is common in native configurations, generally employed in order to make full use of all the cores in today's multicore processors.

 

 

By comparison, our results show that the virtualization approach can dramatically simplify the networking configuration by dividing the load among multiple VMs, each of which is smaller and therefore easier to keep core-efficient.

 

 

Virtualization Performance

 

 

VMware vSphere 4 is designed for high performance. With a number of superior optimizations, even the most I/O intensive applications perform well when deployed on vSphere 4.

 

 

The table below compares the performance of the industry standard SPECweb2005 workload on a virtualized system with the prior native results published on equivalent hardware that featured Intel Xeon X5570 2.933 GHz processors, 96 GB memory and Intel 82598EB 10 Gigabit NICs. Note that SPEC® and SPECweb® are registered trademarks of the Standard Performance Evaluation Corp. (SPEC). Competitive numbers shown below reflect results published on www.spec.org as of 03/01/2010.  For the latest SPECweb2005 results visit http://www.spec.org/web2005.

 

 

 

 

As shown in the table, the aggregate performance obtained on a virtualized environment was close to 85% of the scores obtained on equivalent native configurations. For more details concerning the test configuration, tuning, and performance results, please refer to the full disclosure reports published at the SPECweb2005 website.

 

 

Conclusion

 

 

The 10 Gigabit networks and the increasing number of cores in today's systems pose new challenges in a virtualized server environment. Our SPECweb2005 result shows that VMware, together with its partners Intel and HP, is able to provide innovative virtualization solutions that can, in this instance, achieve a network throughput of 30 Gbps and reach a highly respectable performance level of 85% of the best reported native results on an equivalent physical configuration. In addition, the simplification achieved through such consolidation contributes to easing the cost of setting up and administering the software environment.

Looking for training centre in Chennai, India

$
0
0

I'm looking for a training centre in Chennai where I can undergo training for VMware certification. Please advise me: which centre in Chennai is the best to go to?

thanks,

Kumar T


esx-storage-latency

vFilters: Maximum Performance Tweaks

Interpreting esxtop 4.1 Statistics

$
0
0

Table of Contents

Section 1. Introduction

Section 2. CPU

Section 2.1 Worlds and Groups

Section 2.2 PCPUs

Section 2.3 Global Statistics

Section 2.4 World Statistics

Section 3. Memory

Section 3.1 Machine Memory and Guest Physical Memory

Section 3.2 Global Statistics

Section 3.3 Group Statistics

Section 4 Disk

Section 4.1 Adapter, Device, VM screens

Section 4.2 Disk Statistics

Section 4.2.1 I/O Throughput Statistics

Section 4.2.2 Latency Statistics

Section 4.2.3 Queue Statistics

Section 4.2.4 Error Statistics

Section 4.2.5 PAE Statistics

Section 4.2.6 Split Statistics

Section 4.2.7 Clone Statistics

Section 4.2.8 ATS Statistics

Section 4.2.9 Zero Statistics

Section 4.2.10 Reservation Statistics

Section 4.3 Batch Mode Output

Section 5 Network

Section 5.1 Port

Section 5.2 Port Statistics

Section 6. Interrupt

Section 7. Batch Mode

 

Section 1. Introduction

Esxtop allows monitoring and collection of data for all system resources: CPU, memory, disk and network. When used interactively, this data can be viewed on different types of screens; one each for CPU statistics, memory statistics, network statistics, disk adapter statistics, disk device statistics, disk VM statistics and interrupt statistics. In the batch mode, data can be redirected to a file for offline uses.

 

Many esxtop statistics are computed as rates, e.g. CPU statistics %USED. A rate is computed based on the refresh interval, the time between successive snapshots. For example, %USED = ( CPU used time at snapshot 2 - CPU used time at snapshot 1 ) / time elapsed between snapshots. The default refresh interval can be changed by the command line option "-d", or the interactive command 's'. The return key can be pressed to force a refresh.

 

In each screen, data is presented at different levels of aggregation. It is possible to drill down to expanded views of this data. Each screen provides different expansion options.

 

It is possible to select all or some fields for which data collection is done. In the case of interactive use of esxtop, the order in which the selected fields are displayed can be selected.

 

In the following sections, this document will describe the esxtop statistics shown by each screen and their usage.

 

Section 2. CPU

Section 2.1 Worlds and Groups

Esxtop uses worlds and groups as the entities to show CPU usage. A world is an ESX Server VMkernel schedulable entity, similar to a process or thread in other operating systems. A group contains multiple worlds.

 

Let's use a VM as an example. A powered-on VM has a corresponding group, which contains multiple worlds. There is one vcpu (hypervisor) world corresponding to each VCPU of the VM. The guest activities are represented mostly by the vcpu worlds. Besides the vcpu worlds, there are other assisting worlds, such as an MKS world and a VMX world. The MKS world assists mouse/keyboard/screen virtualization. The VMX world assists the vcpu worlds (the hypervisor). The usage of the VMX world is out of the scope of this document. There is only one vmx world for each VM.

 

There are other groups besides VM groups. Let's go through a few examples:

 

  • The "idle" group is the container for the idle worlds, each of which corresponds to one PCPU.

  • The "system" group contains the VMKernel system worlds.

  • The "helper" group contains the helper worlds that assist VMKernel operations.

  • In classic ESX, the "console" group is for the console OS, which runs ESX management processes. In ESXi, these ESX management processes run as user worlds directly on VMKernel. So, on an ESXi box you can see many more groups than on classic ESX, but not the "console" group.

 

Note that groups can be organized in a hierarchical manner in ESX. However, esxtop shows, in a flat form, the groups that contain some worlds. A more detailed discussion of groups is out of the scope of this document.

 

Q: Why can't we find any vmm worlds for a VM?

A: The CPU scheduler merges the "vmm" and "vcpu" statistics into one vcpu world, so the CPU stats won't show vmm worlds. This is not a problem.

 

Section 2.2 PCPUs

In esxtop, a PCPU refers to a physical hardware execution context, i.e., a physical CPU core if hyper-threading is unavailable or disabled, or a logical CPU (aka LCPU or SMT thread) if hyper-threading is enabled.

  • When hyper-threading is unavailable or disabled, a PCPU is the same as a core. (So, esxtop does not show the "CORE UTIL(%)").

  • When hyper-threading is used, a PCPU is a logical CPU (aka a LCPU or SMT thread). So, there are two PCPUs on each core, i.e. PCPU 0 and PCPU 1 on Core 0, PCPU 2 and PCPU 3 on Core 1, etc.

 

Section 2.3 Global Statistics

  • "up time"

The elapsed time since the server has been powered on.

 

  • "number of worlds"

The total number of worlds on ESX Server.

 

  • "CPU load average"

The arithmetic mean of the CPU loads over 1 minute, 5 minutes, and 15 minutes, based on 6-second samples. CPU load accounts for the run time and ready time of all the groups on the host.

 

  • "PCPU UTIL(%)"

The percentage of unhalted CPU cycles per PCPU, and its average over all PCPUs.

 

Q: What does it mean if PCPU UTIL% is high?

A: It means that you are using a lot of CPU resources. (a) If all of the PCPUs are near 100%, you may be overcommitting your CPU resources. Check RDY% of the groups in the system to verify CPU overcommitment; refer to RDY% below. (b) If some PCPUs stay near 100% but others do not, there might be an imbalance issue. It is best to monitor the system for a few minutes to verify whether the same PCPUs stay near 100%; if so, check the VM CPU affinity settings.

 

  • "CORE UTIL(%)" (only displayed when hyper-threading is enabled)

The percentage of CPU cycles per core when at least one of the PCPUs in this core is unhalted, and its average over all cores. It is the complement of the "CORE IDLE" percentage, which is the percentage of CPU cycles when both PCPUs in this core are halted.

 

It is displayed only when hyper-threading is used.

 

Note that, in batch mode, we show the corresponding "CORE UTIL(%)" of each PCPU. So, PCPU 0 and PCPU 1 have the same "CORE UTIL(%)" number, i.e. the "CORE UTIL(%)" of Core 0.

 

Q: What is the difference between "PCPU UTIL(%)" and "CORE UTIL(%)"?

A: A core is utilized, if either or both of the PCPUs on this core are utilized. The percentage utilization of a core is not the sum of the percentage utilization of both PCPUs. Let's use a few examples to illustrate this.

'+' means busy, '-' means idle.
(1) PCPU 0:   +++++----- (50%)    PCPU 1:   -----+++++ (50%)    Core 0:   ++++++++++ (100%)
(2) PCPU 0:   +++++----- (50%)    PCPU 1:   +++++----- (50%)    Core 0:   +++++----- (50%)
(3) PCPU 0:   +++++----- (50%)    PCPU 1:   ---+++++-- (50%)    Core 0:   ++++++++-- (80%)

 

 

In all three scenarios above, each PCPU is 50% utilized. But, depending on how often they run at the same time, the core utilization is between 50% and 100%. Generally speaking,

 

 

 

 

 

Max(PCPU0_UTIL%, PCPU1_UTIL%) <= CORE0_UTIL% <= Min(PCPU0_UTIL% + PCPU1_UTIL%, 100%)

 

 

Q: How do I retrieve the average core UTIL%, regardless of whether hyper-threading is used?

A: If hyper-threading is used, get the average "CORE UTIL(%)" directly. Otherwise, i.e., when hyper-threading is unavailable or disabled, a PCPU is a core, so we can just use the average "PCPU UTIL(%)". Based on esxtop batch output, we can use something like the pseudocode below.

 

 

 

 

 

     if ("Physical Cpu(_Total)\% Core Util Time" exists) // Indicating hyper-threading is used        return "Physical Cpu(_Total)\% Core Util Time";     else        return "Physical Cpu(_Total)\% Util Time";

 

 

 

 

  • "PCPU USED(%)"

The percentage CPU usage per PCPU, and its average over all PCPUs.

 

Q: What is the difference between "PCPU UTIL(%)" and "PCPU USED(%)"?

A: While "PCPU UTIL(%)" indicates how much time a PCPU was busy (unhalted) in the last duration, "PCPU USED(%)" shows the amount of "effective work" that has been done by this PCPU. The value of "PCPU USED(%)" can be different from "PCPU UTIL(%)" mainly for the following two reasons:

 

(1) Hyper-threading

The two PCPUs in a core share many hardware resources, including the execution units and cache. Thus, the "effective work" done by a PCPU when the other PCPU in the core is busy is usually much less than when the other PCPU is idle. Based on this observation, the CPU scheduler charges each PCPU half of the elapsed time when both PCPUs are busy. If only one PCPU is busy during a time period, that PCPU is charged for the entire period. Let's use some examples to illustrate this.

'+' means busy, '-' means idle.
(1) PCPU 0:   +++++----- (UTIL: 50% / USED: 50%)    PCPU 1:   -----+++++ (UTIL: 50% / USED: 50%)
(2) PCPU 0:   +++++----- (UTIL: 50% / USED: 25%)    PCPU 1:   +++++----- (UTIL: 50% / USED: 25%)
(3) PCPU 0:   +++++----- (UTIL: 50% / USED: 40%, i.e. 30% + 20%/2)    PCPU 1:   ---+++++-- (UTIL: 50% / USED: 40%, i.e. 20%/2 + 30%)

 

 

In all three scenarios above, each PCPU is 50% utilized. But, depending on whether they are busy at the same time, the PCPU USED(%) is between 25% and 50%. Generally speaking,

 

 

 

 

 

                                    /  PCPU0_UTIL%/2,                                   if PCPU0_UTIL% < PCPU1_UTIL%
     PCPU0_UTIL% >= PCPU0_USED% >= |
                                    \  (PCPU0_UTIL% - PCPU1_UTIL%) + PCPU1_UTIL%/2,     otherwise

 

 

Please note that the above inequalities may not hold due to frequency scaling, which is discussed next.

 

(2) Power Management

The frequency of a PCPU may be changed due to power management. Obviously, a PCPU does less "effective work" (in a unit of time) when the frequency is lower. The CPU scheduler adjusts the "PCPU USED(%)" based on the frequency of the PCPU.

 

 

 

 

 

          PCPU_USED% = PCPU_UTIL% * Effective_Frequency / Nominal_Frequency

 

 

Suppose that UTIL% is 80% and the nominal frequency is 2 GHz. If the effective frequency is 1.5 GHz, USED% would be 80% * 1.5 / 2 = 60%. Please note that since the CPU frequency may change often, you can go to the esxtop power screen (by pressing 'p') to see how much time the PCPU spends in each power state, which can help estimate the effective frequency.

 

Please also note that turbo mode may make the effective frequency higher than the nominal frequency. In that case, USED% would be higher than UTIL%.

 

If we take both reasons into account, just to make it more complicated, we get something like this.

 

 

 

 

 

                    PCPU0_USED% * Nominal_Frequency       /  PCPU0_UTIL%/2,                                   if PCPU0_UTIL% < PCPU1_UTIL%
     PCPU0_UTIL% >= --------------------------------- >= |
                         Effective_Frequency              \  (PCPU0_UTIL% - PCPU1_UTIL%) + PCPU1_UTIL%/2,     otherwise

 

 

Q: Why do I see ~100% for the average "PCPU UTIL(%)", but the average "PCPU USED(%)" is ~50%?

A: It is very likely that hyper-threading is enabled. A PCPU is only charged half the time when both PCPUs are busy. Typically,

 

 

 

 

 

        0 <= PCPU0_USED% + PCPU1_USED% <= 100% * Effective_Frequency / Base_Frequency

 

 

Suppose that the CPU frequency is fixed at the base frequency (e.g., power management features are not used); then the sum of PCPU USED% for the two PCPUs on the same core would be less than 100%. So, the average PCPU USED(%) won't be higher than 50%.

 

Q: Why is the average CPU usage in vSphere client ~100%, but the average "PCPU USED(%)" in esxtop ~50%?

A: Same as above; it is likely due to hyper-threading. The average CPU usage in vSphere client is deliberately doubled when hyper-threading is used, while esxtop does not double the average "PCPU USED(%)" (doubling it would give the average USED% of all the cores).

 

Q: How do I retrieve the average core USED%, regardless of whether hyper-threading is used?

A: If hyper-threading is used, USED% for a core is the sum of USED% for the corresponding PCPUs on that core. So, the average core USED% is double the average PCPU USED%. Otherwise, i.e., when hyper-threading is unavailable or disabled, a PCPU is a core, so we can just use the average "PCPU USED(%)". Based on esxtop batch output, we can use something like the pseudocode below.

 

 

 

 

 

     if ("Physical Cpu(_Total)\% Core Util Time" exists) // Indicating hyper-threading is used        return "Physical Cpu(_Total)\% Processor Time" * 2;     else        return "Physical Cpu(_Total)\% Processor Time";

 

 

 

 

  • "CCPU(%)"

Percentages of total CPU time as reported by the ESX Service Console. "us" is for percentage user time, "sy" is for percentage system time, "id" is for percentage idle time and "wa" is for percentage wait time. "cs/sec" is for the context switches per second recorded by the ESX Service Console.

 

Q: What's the difference between CCPU% and the console group stats?

A: CCPU% is measured by the COS, while the "console" group CPU stats are measured by the VMKernel. The stats are related, but not the same.

 

Section 2.4 World Statistics

A group's statistics are the sum of the world statistics for all the worlds contained in that group. So, this section focuses on worlds. The descriptions apply to groups as well, unless stated otherwise.

 

ESX can make use of hyper-threading technology, so the performance counters take hyper-threading into consideration as well. But, to simplify this document, we will ignore HT-related issues. Please refer to the "Resource Management Guide" for more details.

 

  • "%USED"

The percentage of physical CPU time accounted to the world. If a system service runs on behalf of this world, the time spent by that service (i.e., %SYS) is charged to this world. If a system service runs on behalf of another world while this world is running, that time (i.e., %OVRLP) is not charged to this world. See the notes on %SYS and %OVRLP.

 

%USED = %RUN + %SYS - %OVRLP

 

Q: Is it possible that %USED of a world is greater than 100%?

A: Yes, if system services run on a different PCPU on behalf of this world. This may happen when your VM has heavy I/O.

 

Q: For an SMP VM, why does VCPU 0 have higher CPU usage than the others?

A: The system services are accounted to VCPU 0. You may see a higher %USED on VCPU 0 than on the others, although the run time (%RUN) is balanced across all the VCPUs. This is not a problem for CPU scheduling; it is only the way VMKernel does the CPU accounting.

 

Q: What is the maximum %USED for a VM group?

A: The group stats are the sum of those of its worlds. So, the maximum %USED = NWLD * 100%, where NWLD is the number of worlds in the group.

 

Typically, worlds other than the VCPU worlds are waiting for events most of the time and do not consume many CPU cycles. Among all the worlds, the VCPU worlds best represent the guest. Therefore, %USED for a VM group usually does not exceed Number of VCPUs * 100%.

 

Q: What does it mean if %USED of a VM is high?

A: The VM is using a lot of CPU resources. You may expand the group to see which worlds are using most of them.

 

  • "%SYS"

The percentage of time spent by system services on behalf of the world. The possible system services are interrupt handlers, bottom halves, and system worlds.

 

Q: What does it mean if %SYS is high?

A: It usually means that your VM has heavy I/O.

 

Q: Are %USED and %SYS similar to user time and system time in Linux?

A: No, they are totally different. In Linux, the user (system) time of a process is the time spent in user (kernel) mode. In ESX, %USED is the accounted time and %SYS is the system service time.

 

  • "%OVRLP"

The percentage of time spent by system services on behalf of other worlds. In more detail, let's use an example.

 

When World 'W1' is running, a system service 'S' interrupts 'W1' and services World 'W2'. The time spent by 'S', annotated as 't', is included in the run time of 'W1'. We use %OVRLP of 'W1' to show this time. This time 't' is accounted to %SYS of 'W2', as well.

 

Again, let's take a look at "%USED = %RUN + %SYS - %OVRLP". For 'W1', 't' is included in %RUN and %OVRLP, not in %SYS. By subtracting %OVRLP from %RUN, we do not account 't' in %USED of 'W1'. For 'W2', 't' is included in %SYS, not in %RUN or %OVRLP. By adding %SYS, we account 't' to %USED of 'W2'.
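
As a hypothetical worked example, suppose that over an interval %RUN(W1) = 50%, of which t = 5% was overlap spent servicing W2, %SYS(W1) = 0%, %RUN(W2) = 30%, %SYS(W2) = 5% and %OVRLP(W2) = 0%. Then:

     %USED(W1) = %RUN + %SYS - %OVRLP = 50% + 0% - 5% = 45%
     %USED(W2) = %RUN + %SYS - %OVRLP = 30% + 5% - 0% = 35%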

 

Q: What does it mean if %OVRLP of a VM is high?

A: It usually means the host has heavy I/O, so the system services are busy handling I/O. Note that %OVRLP of a VM group may or may not be spent on behalf of this VM; it is the sum of %OVRLP for all the worlds in this group.

 

  • "%RUN"

The percentage of total scheduled time for the world to run.

 

Q: What is the difference between %USED and %RUN?

A: %USED = %RUN + %SYS - %OVRLP. (%USED accounts for the system service time.) See the details above.

 

Q: What does it mean if %RUN of a VM is high?

A: The VM is using a lot of CPU resources. It does not necessarily mean the VM is under a resource constraint. Check the description of %RDY below for determining CPU contention.

 

  • "%RDY"

 

The percentage of time the world was ready to run.

 

A world in a run queue is waiting for the CPU scheduler to let it run on a PCPU. %RDY accounts for the percentage of this time, so it is always smaller than 100%.

 

Q: How do I know whether the CPU resource is under contention?

A: %RDY is the main indicator, but it is not sufficient by itself.

 

If a "CPU limit" is set in a VM's resource settings, the VM will be deliberately held back from being scheduled on a PCPU when it uses up its allocated CPU resource. This may happen even when there are plenty of free CPU cycles. The time deliberately held by the scheduler is shown by "%MLMTD", which is described next. Note that %RDY includes %MLMTD. So, for CPU contention, use "%RDY - %MLMTD". If "%RDY - %MLMTD" is high, e.g., larger than 20%, you may be experiencing CPU contention.

 

What is the recommended threshold? Well, it depends. As a start, try 20%. If the application speed in the VM is OK, you may tolerate a higher threshold; otherwise, use a lower one.
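
A minimal Python sketch of this check, assuming %RDY and %MLMTD have been read per group from the batch output; the 20% default is only the starting point suggested above, not a hard rule:

     def cpu_contention(rdy_pct, mlmtd_pct, threshold_pct=20.0):
         # %RDY includes %MLMTD; subtract the time the scheduler deliberately held the world.
         return (rdy_pct - mlmtd_pct) > threshold_pct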

 

Q: How do we break down 100% for the world state times?

A: A world can be in different states: either scheduled to run, ready to run but not scheduled, or not ready to run (waiting for some event).

 

100% = %RUN + %RDY + %CSTP + %WAIT

 

Check the descriptions of %CSTP and %WAIT below.

 

Q: What does it mean if %RDY of a VM is high?

A: It means the VM is possibly under resource contention. Check "%MLMTD" as well. If "%MLMTD" is high, you may raise the "CPU limit" setting for the VM. If "%RDY - %MLMTD" is high, the VM is under CPU contention.

 

  • "%MLMTD"

The percentage of time the world was ready to run but deliberately wasn't scheduled because that would violate the "CPU limit" settings.

 

Note that %MLMTD is included in %RDY.

 

Q: What does it mean if %MLMTD of a VM is high?

A: The VM cannot run because of its "CPU limit" setting. If you want to improve the performance of this VM, you may increase its limit. However, keep in mind that this may reduce the performance of other VMs.

 

  • "%CSTP"

The percentage of time the world spent in the ready, co-descheduled state. This co-deschedule state is only meaningful for SMP VMs. Roughly speaking, the ESX CPU scheduler deliberately puts a VCPU in this state if it advances much farther than the other VCPUs. A VCPU with a high %CSTP is "stopped" from executing so that another VCPU in the same virtual machine can run and "catch up".

 

  • "%WAIT"

The percentage of time the world spent in wait state.

 

%WAIT is the total wait time, i.e., the time the world is waiting for some VMKernel resource. This wait time includes I/O wait time, idle time, and time waiting for other resources. Idle time is presented as %IDLE.

 

Q: How do I know the VCPU world is waiting for I/O events?

A: %WAIT - %IDLE gives an estimate of how much CPU time is spent waiting for I/O events. This is only an estimate, because the world may be waiting for resources other than I/O. Note that we should only do this for the VCPU worlds, not the other kinds of worlds, because the VCPU worlds best represent the guest behavior. For disk I/O, another alternative is to read the disk latency stats, which we explain in the disk section.

 

Q: How do I know the VM group is waiting for I/O events?

A: For a VM, there are other worlds besides the VCPUs, such as an MKS world and a VMX world. Most of the time, those other worlds are waiting for events, so you will see ~100% %WAIT for them. If you want to know whether the guest is waiting for I/O events, it is better to expand the group and analyze the VCPU worlds as stated above.

 

Since %IDLE makes no sense for the worlds other than the VCPUs, we can use the group stats to estimate the guest I/O wait as "%WAIT - %IDLE - 100% * (NWLD - NVCPU)", where NWLD is the number of worlds in the group and NVCPU is the number of VCPUs. This is a very rough estimate (see the sketch below), for two reasons: (1) the world may be waiting for resources other than I/O; (2) we assume the other assisting worlds are not active, which may not be true.
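
A minimal Python sketch of this rough estimate, with all arguments assumed to come from the VM group's stats (the helper name is ours):

     def group_io_wait_estimate(wait_pct, idle_pct, nwld, nvcpu):
         # Subtract idle time and assume each non-VCPU world spends ~100% of its time waiting.
         return max(wait_pct - idle_pct - 100.0 * (nwld - nvcpu), 0.0)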

 

Again, for disk I/O, another alternative is to read the disk latency stats which we will explain in the disk section.

 

Q: Why do I always see a high %WAIT for VMX/mks worlds?

A: This is normal. It means there is not much activity in them.

 

Q: Why do I see a high %WAIT for a VM group?

A: For a VM, there are other worlds besides the VCPUs, such as an MKS world and a VMX world. These worlds are waiting for events most of the time.

 

  • "%IDLE"

The percentage of time the VCPU world is in the idle loop. Note that %IDLE is included in %WAIT. Also note that %IDLE only makes sense for VCPU worlds; the other worlds do not have idle loops, so %IDLE is zero for them.

 

  • "%SWPWT"

The percentage of time the world is waiting for the ESX VMKernel to swap memory. The %SWPWT (swap wait) time is included in the %WAIT time.

 

Q: Why do I see a high %SWPWT for a VM group?

A: The VM is swapping memory.

 

Section 3. Memory

Section 3.1 Machine Memory and Guest Physical Memory

It is important to note that some statistics refer to guest physical memory while others refer to machine memory. "Guest physical memory" is the virtual hardware physical memory presented to the VM. "Machine memory" is actual physical RAM in the ESX host. Let's use the following figure to explain. In the figure, two VMs are running on an ESX host, where each block represents 4 KB of memory and each color represents a different set of data on a block.

 

http://communities.vmware.com/servlet/JiveServlet/downloadImage/102-9279-1-4857/memory.JPG

 

Inside each VM, the guest OS maps the virtual memory to its physical memory. ESX Kernel maps the guest physical memory to machine memory. Due to ESX Page Sharing technology, guest physical pages with the same content can be mapped to the same machine page.

 

Section 3.2 Global Statistics

  • "MEM overcommit avg"

Average memory overcommit level in 1-min, 5-min, 15-min (EWMA).

 

Memory overcommit is the ratio of the total requested memory to the "managed memory", minus 1. VMKernel computes the total requested memory as a sum of the following components: (a) VM configured memory (or the memory limit setting, if set), (b) the user world memory, (c) the reserved overhead memory. (Overhead memory will be discussed in more detail for "OVHD" and "OVHDMAX" in Section 3.3.)
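
As a hypothetical example, if the VMs and user worlds request a total of 96 GB (configured memory, or limits where set, plus reserved overhead) and the VMKernel manages 64 GB, then:

     MEM overcommit = 96 / 64 - 1 = 0.5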

 

"managed memory" will be defined in "VMKMEM" section.

 

Q: What does it mean if overcommit is not 0?

A: It means that the total requested guest physical memory is more than the machine memory available. This is fine, because ballooning and page sharing allow memory overcommitment.

 

This metric does not necessarily mean that you will have performance issues. Use "SWAP" and "MEMCTL" to find whether you are experiencing memory problems.

 

Q: What's the meaning of overcommit?

A: See above description for details. Roughly speaking, it reflects the ratio of requested memory and the available memory.

 

  • "PMEM" (MB)

The machine memory statistics for the host.

 

"total": the total amount of machine memory in the server. It is the machine memory reported by BIOS.

 

"cos" : the amount of machine memory allocated to the ESX Service Console.

 

"vmk" : the amount of machine memory being used by the ESX VMKernel. "vmk" includes kernel code section, kernel data and heap, and other VMKernel management memory.

 

"other": the amount of machine memory being used by everything other than the ESX Service Console and ESX VMKernel. "other" contains not only the memory used by VM but also the user worlds that run directly on VMKernel.

 

"free" : the amount of machine memory that is free.

 

Q: Why is total not the same as RAM size plugged in my memory slots?

A: This is because some memory ranges are not available for use. It is fine if the difference is small. If the difference is big, there might be a hardware issue; check your BIOS.

 

Q: Why can't I find the cos part?

A: COS is only available in classic ESX. You are using ESXi.

 

Q: How do I break down the total memory?

A: total = cos + vmk + other + free

 

Q: Which one contains the memory used by VMs?

A: "other" contains the machine memory that backs guest physical memory of VMs. Note that "other" also includes the overhead memory.

 

Q: How do I know my "free" memory is low? Is it a problem if it is low?

A: You could use the "state" field, which will be explained next, to see whether the free memory is low. Basically, it is fine if you do not experience memory swapping or ballooning. Check "SWAP" and "MEMCTL" to find whether you are experiencing memory problems.

 

  • "VMKMEM" (MB)

The machine memory statistics for VMKernel.

 

"managed": the total amount of machine memory managed by VMKernel. VMKernel "managed" memory can be dynamically allocated for VM, VMKernel, and User Worlds.

 

"minfree": the minimum amount of machine memory that VMKernel would like to keep free. This is because VMKernel needs to keep some amount of free memory for critical uses.

 

"rsvd" : the amount of machine memory that is currently reserved. "rsvd" is the sum of three parts: (a) the reservation setting of the groups; (b) the overhead reservation of the groups; (c) "minfree".

 

"ursvd" : the amount of machine memory that is currently unreserved. It is the memory available for reservation.

 

Please note that VM admission control is done at the resource pool level. So, this statistic is not used directly by admission control; "ursvd" can be used as a system-level indicator.

 

"state" : the free memory state. Possible values are high, soft, hard and low. The memory "state" is "high", if the free memory is greater than or equal to 6% of "total" - "cos". If is "soft" at 4%, "hard" at 2%, and "low" at 1%. So, high implies that the machine memory is not under any pressure and low implies that the machine memory is under pressure.

 

While the host's memory state is not used to determine whether memory should be reclaimed from VMs (that decision is made at the resource pool level), it can affect what mechanisms are used to reclaim memory if necessary. In the high and soft states, ballooning is favored over swapping. In the hard and low states, swapping is favored over ballooning.

 

Please note that "minfree" is part of "free" memory; while "rsvd" and "ursvd" memory may or may not be part of "free" memory. "reservation" is different from memory allocation.

 

Q: Why is "managed" memory less than the sum of "vmk", "other" and "free" in the PMEM line? Is it normal?

A: It is normal; it is just the way we do accounting. A more precise definition of "managed" is the free memory after VMKernel initialization. This amount of memory can be dynamically allocated for use by VMs, the VMKernel, and user worlds. "managed" = "some part of vmk" + "other" + "free".

 

So, "managed" < "vmk" + "other" + "free". Or, in an equivalent form, "managed" < "total" - "cos".

 

Q: How do I break down the managed memory in terms of reservation?

A: "managed" = "rsvd" + "ursvd" + "vmkernel usage"

 

The VMKernel machine memory manager needs to use some memory itself, which is not subject to reservation, so it is neither in "rsvd" nor in "ursvd". In the above equation, we put this part under "vmkernel usage". Unfortunately, it is not shown directly in esxtop.

 

Note that the vmkernel usage in managed memory is part of "vmk".

 

Q: What does it mean if "ursvd" is low?

A: VMKernel admission control prohibits a VM PowerOn operation, if it cannot meet the memory reservation of that VM. The memory reservation includes the reservation setting, a.k.a. "min", and the monitor overhead memory reservation. Note that even if "min" is not set, VMKernel still needs to reserve some amount  of memory for monitor uses.

 

So, it is possible that even though you have enough free memory, a new VM cannot power on due to the violation of memory reservation.

 

Q: Why do I fail admission control even though "ursvd" is high?

A: The VM admission control is done at resource pool level. Please check the "min" setting of all its parent resource pools.

 

Q: Why is "managed" greater than the sum of "rsvd" and "ursvd"? Is it normal?

A: It is normal. See above question. VMKernel may use some of the managed memory. It is not accounted in "rsvd" and "ursvd".

 

Q: What is the meaning of "state"?

A: See the description of "state" above.

 

Q: How do I know my ESX box is under memory pressure?

A: It is usually safe to say the ESX box is under memory pressure if "state" is "hard" or "low". But you also need to check "SWAP" and "MEMCTL" to find out whether you are experiencing memory problems. Basically, if there is not enough free memory and ESX is experiencing swapping or ballooning, the ESX box is under memory pressure.

 

Note that ballooning does not have as big a performance hit as swapping does. Ballooning may cause guest swapping; ESX swapping means host swapping.

 

Also note that a VM may be swapping or ballooning even though there is enough free memory. This is due to the reservation setting.

 

  • "COSMEM" (MB)

The memory statistics reported by the ESX Service Console.

 

"free" : the amount of idle machine memory.

 

"swap_t": the total swap configured.

 

"swap_f": the amount of swap free.

 

"r/s" : the rate at which memory is swapped in from disk.

 

"w/s" : the rate at which memory is swapped out to disk.

 

Note that these stats essentially come from the COS proc nodes.

 

Q: What does it mean if I see a high r/s or w/s?

A: Your console OS is swapping. It is highly likely that your COS free memory is low. You may either configure more memory for COS and restart your ESX box, or stop some programs running inside your COS.

 

Q: Why can't I see this COSMEM line?

A: You are using ESXi not classic ESX.

 

  • "NUMA" (MB)

The ESX NUMA statistics. For each NUMA node there are two statistics: (1) the "total" amount of machine memory managed by ESX; (2) the amount of machine memory currently "free".

 

Note that ESX NUMA scheduler optimizes the uses of NUMA feature to improve guest performance. Please refer to "Resource Management Guide" for details.

 

Q: Why can't I see this NUMA line?

A: You are not using a NUMA machine, or your BIOS disables it.

 

Q: Why is the sum of NUMA memory not equal to "total" in the PMEM line?

A: The PMEM "total" is the memory reported by BIOS, while the NUMA "total" is the memory managed by VMKernel machine memory manager. There are two major parts of memory seen by BIOS but not given to machine memory manager: (1) COS uses, and (2) VMKernel uses during early initialization.

 

So, Sum("NUMA total") &lt; "PMEM total" - "cos".

 

Note that the free memory across all the nodes adds up to the "free" memory in the PMEM line.

 

  • "PSHARE" (MB)

The ESX page-sharing statistics.

 

"shared": the amount of guest physical memory that is being shared.

 

"common": the amount of machine memory that is common across World(s).

 

"saving": the amount of machine memory that is saved due to page-sharing.

 

The monitor maps guest physical memory to machine memory. VMKernel selects to map guest physical pages with the same content to the same machine page. In other words, those guest physical pages are sharing the same machine page. This kind of sharing can happen within the same VM or among the VMs.

 

Since each VM's "shared" memory measures guest physical memory, the host's "shared" memory may be larger than the total amount of machine memory if memory is overcommitted. "saving" illustrates the effectiveness of page sharing for saving machine memory.

 

"shared" = "common" + "saving".

 

Note that esxtop only shows the pshare stats for VMs, excluding the pshare stats for user worlds.

 

  • "SWAP" (MB)

The ESX swap usage statistics.

 

"curr" : the current swap usage. This is the total swapped machine memory of all the groups. So, it includes VMs and user worlds.

 

"target": the swap usage expected to be. This is the total swap target of all the groups. So, it includes VMs and user worlds.

 

"r/s" : the rate at which machine memory is swapped in from disk.

 

"w/s" : the rate at which machine memory is swapped out to disk.

 

Note that swap here is host swap, not guest swap inside the VM.

 

Q: What does it mean if "curr" is not the same as "target"?

A: It means ESX will swap memory to meet the swap target. Note that the actual swapping is done at the group level. So, you should check "SWCUR" and "SWTGT" for each group. We will discuss this in the next section.

 

Q: Is it bad if "r/s" is high?

A: Yes, it is very bad. This usually means that you have memory resource contention. Because swapping is synchronous, it will hurt guest performance a lot.

 

Do two things: (1) Check your "free" memory or "state" as mentioned above. If free memory is low, you need to move VMs to other hosts or add more memory to the host. (2) If free memory is not low, check your resource setting of your VMs or user worlds. You may have set a low "limit", which causes swapping.

 

Q: Is it bad if "w/s" is high?

A: Yes, it is also very bad. This usually means that you have memory resource contention. Do the similar actions as mentioned above.

 

  • "MEMCTL" (MB)

The memory balloon statistics.

 

"curr" : the total amount of physical memory reclaimed by balloon driver. This is the total ballooned memory by the VMs.

 

"target": total amount of ballooned memory expected to be. This is the total ballooned targets of the VMs.

 

"max" : the maximum amount of physical memory reclaimable.

 

Note that ballooning may or may not lead to guest swapping, which is decided by the guest OS.

 

Q: What does it mean if "curr" is not the same as "target"?

A: It means ESX will balloon memory to meet the balloon target. Note that the actual ballooning is done for the VM group. So, you should check "MCTLSZ" and "MCTLTGT" for each group. We will discuss this in the next section.

 

Q: How do I know the host is ballooning memory?

+A: If the "curr" is changing, you can know it is ballooning. Since

ballooning is done at VM level, a better way is to monitor "MCTLSZ" for

each group. We will discuss this in the next section.+

 

Q: Is it bad if we have lots of ballooning activities?

A: Usually it is fine. Ballooning tends to take unused memory from one VM and make it available to others. The possible side effects are (a) reducing the memory cache used by the guest OS, and (b) guest swapping. In either case, it may hurt guest performance. Please note that (a) and (b) may or may not happen, depending on the workload inside the VM.

 

On the other hand, under memory contention, ballooning is much better than swapping in terms of performance.

 

Section 3.3 Group Statistics

Esxtop shows the groups that use memory managed by VMKernel memory scheduler. These groups can be used for VMs or purely for user worlds running directly on VMKernel. You may see many pure user world groups on ESXi, not on classic ESX.

 

Tip: use 'V' command to show only the VM groups.

 

  • "MEMSZ" (MB)

For a VM, it is the amount of configured guest physical memory.

 

For a user world, it includes not only the virtual memory that is backed by the machine memory, but also the reserved backing store size.

 

Q: How do I break down "MEMSZ" of a VM?

A: A VM's guest physical memory can be mapped to machine memory, reclaimed by the balloon driver, swapped to disk, or never touched. Guest physical memory can be "never touched" because (1) the VM has never used it since power-on, or (2) it was reclaimed by the balloon driver before and has not been used since the balloon driver last released it. This part of memory is not measured directly by the VMKernel.

 

"MEMSZ" = "GRANT" + "MCTLSZ" + "SWCUR" + "never touched"

 

Please refer to "GRANT", "MCTLSZ", "SWCUR".

 

  • "GRANT" (MB)

For a VM, it is the amount of guest physical memory granted to the group, i.e., mapped to machine memory. The overhead memory, "OVHD" is not included in GRANT. The shared memory, "SHRD", is part of "GRANT".

 

The consumed machine memory for the VM, not including the overhead memory, can be estimated as "GRANT" - "SHRDSVD". Please refer to "SHRDSVD".

 

For a user world, it is the amount of virtual memory that is backed by machine memory.

 

Q: Why is "GRANT" less than "MEMSZ"?

A: Some guest physical memory has never been used, or is reclaimed by balloon driver, or is swapped out to the VM swap file. Note that this kind of swap is host swap, not the guest swap by the guest OS.

 

"MEMSZ" = "GRANT" + "MCTLSZ" + "SWCUR" + "never touched"

 

Q: How do I know how much machine memory is consumed by this VM?

A: GRANT accounts for guest physical memory; it may not be the same as the mapped machine memory, due to page sharing.

 

The consumed machine memory can be estimated as "GRANT" - "SHRDSVD". Please note that this is an estimate. Please refer to "SHRDSVD".

 

Note that overhead memory, "OVHD", is not part of the above consumed machine memory.

 

  • "SZTGT" (MB)

The amount of machine memory to be allocated. (TGT is short for "target".) Note that "SZTGT" includes the overhead memory for a VM.

 

This is an internal counter, which is computed by ESX memory scheduler. Usually, there is no need to worry about this. Roughly speaking, "SZTGT" of all the VMs is computed based on the resource usage, available memory, and the "limit/reservation/shares" settings. This computed "SZTGT" is compared against the current memory consumption plus overhead memory for a VM to determine the swap and balloon target, so that VMKernel may balloon or swap appropriate amount  of memory to meet its memory demand. Please refer to "Resource Management Guide" for details.

 

Q: How come my "SZTGT" is larger than "MEMSZ"?

A: "SZTGT" includes the overhead memory, while "MEMSZ" does not. So, it is possible for "SZTGT" be larger than "MEMSZ".

 

Q: How do I use "SZTGT"?

A: This is an internal counter. You don't need to use it.

 

This counter is used to determine future swapping and ballooning activities. Check "SWTGT" and "MCTLTGT".

 

  • "TCHD" (MB)

The amount of guest physical memory recently used by the VM, which is estimated by VMKernel statistic sampling.

 

VMKernel estimates active memory usage for a VM by sampling a random subset of the VM's memory resident in machine memory to detect the number of memory reads and writes. VMKernel then scales this number by the size of VM's configured memory and averages it with previous samples. Over time, this average will approximate the amount of active memory for the VM.

 

Note that ballooned memory is considered inactive, so, it is excluded from "TCHD".

 

Because sampling and averaging takes time, "TCHD" won't be exact, but becomes more accurate over time.

 

The VMKernel memory scheduler charges the VM by the sum of (1) the "TCHD" memory and (2) the idle memory tax. This charged memory is one of the factors that the memory scheduler uses for computing "SZTGT".

 

Q: What is the difference between "TCHD" and working set estimate by guest OS?

A: "TCHD" is the working set estimated by VMKernel. This number may be different from guest working set estimate. Sometimes the difference may be big, because (1) guest OS uses a different working set estimate algorithm, (2) guest OS has a different view of active guest physical memory, due to ballooning and host swapping,

 

Q: How is "TCHD" used?

A: "TCHD" is a working set estimate, which indicates how actively the VM is using its memory. See above for the internal use of this counter.

 

  • "%ACTV"

Percentage of active guest physical memory, current value.

 

"TCHD" is actually computed based on a few parameters, coming from statistical sampling. The exact equation is out of scope of this document. Esxtop shows some of those parameters, %ACTV, %ACTVS, %ACTVF, %ACTVN. Here, this document provides simple descriptions without further discussion.

 

%ACTV reflects the current sample.

%ACTVS is an EWMA of %ACTV for long term estimate.

%ACTVF is an EWMA of %ACTV for short term estimate.

%ACTVN is a prediction of what %ACTVF will be at the next sample.

 

Since they are very internal to VMKernel memory scheduler, we do not discuss their usage here.

 

  • "%ACTVS"

Percentage of active guest physical memory, slow moving average. See above.

 

  • "%ACTVF"

Percentage of active guest physical memory, fast moving average. See above.

 

  • "%ACTVN"

Percentage of active guest physical memory in the near future. This is an estimated value. See above.

 

  • "MCTL?"

Whether the memory balloon driver is installed or not.

 

If not, install VMware Tools, which contains the balloon driver.

 

  • "MCTLSZ" (MB)

The amount of guest physical memory reclaimed by balloon driver.

 

This can be called the "balloon size". A large "MCTLSZ" means that a lot of this VM's guest physical memory has been "stolen" to decrease host memory pressure. This usually is not a problem, because the balloon driver tends to steal guest physical memory that causes few performance problems when reclaimed.

 

Q: How do I know the VM is ballooning?

A: If "MCTLSZ" is changing, balloon driver is actively reclaiming or releasing memory. I.e., the VM is ballooning. Please note that the ballooning rate for a short term can be estimated by the change of "MCTLSZ", assuming it is either increasing or decreasing. But, for a long term, we cannot do it this way, because that monotonically increase/decrease assumption may not hold.

 

Q: Does ballooning hurt VM performance?

A: If guest working set is smaller than guest physical memory after ballooning, guest applications won't observe any performance degradation. Otherwise, it may cause guest swapping and hurt guest application performance.

 

Please check what causes ballooning and take appropriate actions to reduce memory pressure. There are two possible reasons: (1) The host does not have enough machine memory for use. (2) Memory used by the VM reaches the "limit" setting of itself or "limit" of the resource pools that contain this VM. In either case, ballooning is necessary and preferred over swapping.

 

  • "MCTLTGT" (MB)

The amount of guest physical memory to be kept in balloon driver. (TGT is short for "target".)

 

This is an internal counter, which is computed by ESX memory scheduler. Usually, there is no need to worry about this.

 

Roughly speaking, "MCTLTGT" is computed based on "SZTGT" and current memory usage, so that the VM can balloon appropriate amount of memory. If "MCTLTGT" is greater than "MCTLSZ", VMKernel initiates inflating the balloon immediately, causing more VM memory to be reclaimed. If "MCTLTGT" is less than "MCTLSZ", VMKernel will deflate the balloon when the guest is requesting memory, allowing the VM to map/consume additional memory if it needs it. Please refer to "Resource Management Guide" for details.

 

Q: Why is it possible for "MCTLTGT" to be less than "MCTLSZ" for a long time?

A: If "MCTLTGT" is less than "MCTLSZ", VMKernel allows the balloon to deflate. But, balloon deflation happens lazily until the VM requests new memory. So, it is possible for "MCTLTGT" to be less than "MCTLSZ" for a long time, when the VM is not requesting new memory.

 

  • "MCTLMAX" (MB)

The maximum amount of guest physical memory reclaimable by balloon driver.

 

This value can be set via vmx option "sched.mem.maxmemctl". If not set, it is determined by the guest operating system type. "MCTLTGT" will never be larger than "MCTLMAX".

 

If the VM suffers from ballooning, "sched.mem.maxmemctl" can be set to a smaller value to reduce this possibility. Remember that doing so may result in host swapping during resource contention.

 

  • "SWCUR" (MB)

Current swap usage.

 

For a VM, it is the current amount of guest physical memory swapped out to the backing store. Note that it is the VMKernel swapping not the guest OS swapping.

 

It is the sum of swap slots used in the vswp file or system swap, and migration swap. Migration swap is used for a VMotioned VM to hold swapped out memory on the destination host, in case the destination host is under memory pressure.

 

Q: What does it mean if "SWCUR" of my VM is high?

A: It means that part of the VM's guest physical memory is not resident in machine memory but on disk. If that memory will not be used in the near future, it is not an issue. Otherwise, it will be swapped in for the guest's use; in that case, you will see swap-in activity via "SWR/s", which may hurt the VM's performance.

 

  • "SWTGT" (MB)

The expected swap usage. (TGT is short for "target".)

 

This is an internal counter, which is computed by ESX memory scheduler. Usually, there is no need to worry about this.

 

Roughly speaking, "SWTGT" is computed based on "SZTGT" and current memory usage, so that the VM can swap appropriate amount of memory. Again, note that it is the VMKernel swapping not the guest swapping. If "SWTGT" is greater than "SWCUR", VMKernel starts swapping immediately, causing more VM memory to be swapped out. If "SWTGT" is less than "SWCUR", VMKernel will stop swapping. Please refer to "Resource Management Guide" for details.

 

Q: Why is it possible for "SWTGT" to be less than "SWCUR" for a long time?

A: Since swapped memory stays swapped until the VM accesses it, it is possible for "SWTGT" to be less than "SWCUR" for a long time.

 

  • "SWR/s" (MB)

Rate at which memory is being swapped in from disk. Note that this stats refers to the VMKernel swapping not the guest swapping.

 

When a VM is requesting machine memory to back its guest physical memory that was swapped out to disk, VMKernel reads in the page. Note that the swap-in operation is synchronous.

 

Q: What does it mean if SWR/s is high?

A: It is very bad for VM's performance. Because swap-in is synchronous, the VM needs to wait until the requested pages are read into machine memory. This happens when VMKernel swapped out the VM's memory before and the VM needs them now. Please refer to "SWW/s".

 

  • "SWW/s" (MB)

 

Rate at which memory is being swapped out to disk. Note that this stats refers to the VMKernel swapping not the guest swapping.

 

As discussed in "SWTGT", if "SWTGT" is greater than "SWCUR", VMKernel will swap out memory to disk. It happens usually in two situations. (1) The host does not have enough machine memory for use. (2) Memory used by the VM reaches the "limit" setting of itself or "limit" of the resource pools that contain this VM.

 

Q: What does it mean if SWW/s is high?

A: It is very bad for VM performance. Please check the above two reasons and fix your problem accordingly.

 

If this VM is swapping out memory due to resource contention, it usually means VMKernel does not have enough machine memory to meet memory demands from all the VMs. So, it will swap out mapped guest physical memory pages to make room for the recent requests.

 

  • "SHRD" (MB)

Amount of guest physical memory that is shared.

 

The VMKernel page sharing module scans for guest physical pages with the same content and backs them with the same machine page. "SHRD" counts the total guest physical pages that are shared by the page sharing module.

 

  • "ZERO" (MB)

Amount of guest physical zero memory that is shared. This is an internal counter.

 

A zero page is simply a memory page that is all zeros. If a zero guest physical page is detected by the VMKernel page sharing module, it is backed by the same machine page on each NUMA node. Note that "ZERO" is included in "SHRD".

 

  • "SHRDSVD" (MB)

Estimated amount of machine memory that is saved due to page sharing.

 

Because a machine page is shared by multiple guest physical pages, we only charge "1/ref" page as the consumed machine memory for each of the guest physical pages, where "ref" is the number of references. So, the saved machine memory is "1 - 1/ref" page. "SHRDSVD" estimates the total saved machine memory for the VM.

 

The consumed machine memory by the VM can be estimated as "GRANT" - "SHRDSVD".
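
As a hypothetical example, if four guest physical pages of a VM are backed by one machine page (ref = 4), each page is charged 1/4 page and saves 3/4 page, so together they contribute:

     saved = 4 * (1 - 1/4) = 3 pages counted in "SHRDSVD"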

 

  • "COWH" (MB)

Amount of guest physical hint pages for page sharing. This is an internal counter.

 

  • "OVHDUW" (MB)

Amount of overhead memory reserved for the vmx user world of a VM group. This is an internal counter.

 

"OVHDUW" is part of "OVHDMAX".

 

  • "OVHD" (MB)

Amount of overhead memory currently consumed by a VM.

 

"OVHD" includes the overhead memory consumed by the monitor, the VMkernel and the vmx user world.

 

  • "OVHDMAX" (MB)

Amount of reserved overhead memory for the entire VM.

 

"OVHDMAX" is the overhead memory a VM wants to consume in the future. This amount of reserved overhead memory includes the overhead memory reserved by the monitor, the VMkernel, and the vmx user world. Note that the actual overhead memory consumption is less than "OVHDMAX". "OVHD" &lt; "OVHDMAX".

 

"OVHDMAX" can be used as a conservative estimate of the total overhead memory.

 

Section 4 Disk

Section 4.1 Adapter, Device, VM screens

The ESX storage stack adds a few layers of code between a virtual machine and bare hardware. All virtual disks in virtual machines are seen as virtual SCSI disks. The ESX storage stack allows these virtual disks to be located on any of the multiple storage options available.

 

For performance analysis purposes, an IO request from an application in a virtual machine traverses through multiple levels of queues, each associated with a resource, in the guest OS, the VMkernel and the physical storage. (Note that physical storage could be an FC- or IP- SAN or disk array.) Each queue has an associated latency, dictated by its size and whether the IO load is low or high, which affects the throughput and latency seen by applications inside VMs.

 

Esxtop shows the storage statistics in three different screens: adapter screen, device screen, and vm screen. Interactive command 'd' can be used to switch to the adapter screen, 'u' for the device screen, and 'v' for the vm screen.

 

The main difference in the data seen in these three screens is the level at which it is aggregated, even though the screens have similar counters. By default, data is rolled up to the highest level possible for each screen. (1) On the adapter screen, by default, the statistics are aggregated per storage adapter, but they can also be expanded to display data per storage path; see the interactive command 'e' for the expand operation. (2) On the device screen, by default, statistics are aggregated per storage device; an NFS client is also considered a storage device. NFS statistics are appended after all the LUN devices; you can tell whether a device is NFS by checking whether the device name starts with "(NFS)". Non-NFS statistics can also be viewed per path, world, or partition; see the interactive commands 'e', 'P', and 't' for the expand operations. (3) On the VM screen, statistics are aggregated on a per-group basis by default. One VM has one corresponding group, so these are equivalent to per-VM statistics. Statistics can also be expanded so that a row is displayed per vSCSI device; see the interactive command 'e'.

 

Please refer to esxtop man page for the details of the interactive commands.

 

Section 4.2 Disk Statistics

Due to the similarities in the counters of the three disk screens, this section discusses the counters without distinguishing the screens. Similar to other esxtop screens, the storage counters are also organized in different sets, each of which contains related counters. The counters can be selected as a set by selecting the appropriate field option in esxtop. If esxtop is used in batch mode, make sure that the esxtop configuration file includes all counters of interest.

 

Each group of counters in the following subsections corresponds to a particular field option.

 

Section 4.2.1 I/O Throughput Statistics

  • CMDS/s

Number of commands issued per second.

 

  • READS/s

Number of read commands issued per second.

 

  • WRITES/s

Number of write commands issued per second.

 

  • MBREAD/s

Megabytes read per second.

 

  • MBWRTN/s

Megabytes written per second.

 

Section 4.2.2 Latency Statistics

This group of counters reports latency values measured at three different points in the ESX storage stack. In the context of the figure below, the latency counters in esxtop report the Guest, ESX Kernel and Device latencies, under the labels GAVG, KAVG and DAVG, respectively. Note that GAVG is the sum of the DAVG and KAVG counters.

 

http://communities.vmware.com/servlet/JiveServlet/previewBody/12495-102-1-13125/esx-storage-diagram.JPG

 

Note that esxtop shows the latency statistics for different objects, such as adapters, devices, paths, and worlds. They may not perfectly match with each other, since their latencies are measured at the different layers of the ESX storage stack. To do the correlation, you need to be very familiar with the storage layers in ESX Kernel, which is out of our scope.

 

Latency values are reported for all IOs, read IOs and all write IOs. All values are averages over the measurement interval.

  • All IOs: KAVG/cmd, DAVG/cmd, GAVG/cmd, QAVG/cmd

  • Read IOs: KAVG/rd, DAVG/rd, GAVG/rd, QAVG/rd

  • Write IOs: KAVG/wr, DAVG/wr, GAVG/wr, QAVG/wr

 

  • LAT

This is the round-trip VSCSI latency for all IO requests sent to the storage device.

 

  • GAVG

This is the round-trip latency for the device and the storage layer of VMKernel.

 

GAVG should be close to the R metric in the figure.

 

Q: What is the relationship between GAVG, KAVG and DAVG?

A: GAVG = KAVG + DAVG
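
As a hypothetical numeric example:

     GAVG = KAVG + DAVG = 2 ms + 20 ms = 22 ms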

 

  • KAVG

This is the round-trip latency in the VMKernel storage stack.

 

The KAVG value should be very small in comparison to the DAVG value and should be close to zero. When there is a lot of queuing in ESX, KAVG can be as high as, or even higher than, DAVG. If this happens, please check the queue statistics, which are discussed next.

 

  • DAVG

This is the latency seen at the device driver level. It includes the roundtrip time between the HBA and the storage.

 

DAVG is a good indicator of performance of the backend storage. If IO latencies are suspected to be causing performance problems, DAVG should be examined. Compare IO latencies with corresponding data from the storage array. If they are close, check the array for misconfiguration or faults. If not, compare DAVG with corresponding data from points in between the array and the ESX Server, e.g., FC switches. If this intermediate data also matches DAVG values, it is likely that the storage is under-configured for the application. Adding disk spindles or changing the RAID level may help in such cases.

 

  • QAVG

The average queue latency. QAVG is part of KAVG.

 

Response time is the sum of the time spent in queues in the storage stack and the service time spent by each resource in servicing the request. The largest component of the service time is the time spent in retrieving data from physical storage. If QAVG is high, another line of investigation is to examine the queue depths at each level in the storage stack.

 

Section 4.2.3 Queue Statistics

  • AQLEN

The storage adapter queue depth. This is the maximum number of ESX Server VMKernel active commands that the adapter driver is configured to support.

 

  • LQLEN

The LUN queue depth. This is the maximum number of ESX Server VMKernel active commands that the LUN is allowed to have. (Note that, in this document, the terminologies of LUN and Storage device can be used interchangeably.)

 

  • WQLEN

The World queue depth. This is the maximum number of ESX Server VMKernel active commands that the World is allowed to have. Note that this is a per LUN maximum for the World.

 

  • ACTV

The number of commands in the ESX Server VMKernel that are currently active. This statistic is only applicable to worlds and LUNs.

 

Please refer to %USD.

 

  • QUED

The number of commands in the VMKernel that are currently queued. This statistic is only applicable to worlds and LUNs.

 

Queued commands are commands waiting for an open slot in the queue. A large number of queued commands may be an indication that the storage system is overloaded. A sustained high value for the QUED counter signals a storage bottleneck which may be alleviated by increasing the queue depth. Check that LOAD < 1 after increasing the queue depth. This should also be accompanied by improved performance in terms of increased cmd/s.

 

Note that there are queues at different storage layers. You might want to check the QUED stats for both devices and worlds.

 

  • %USD

The percentage of queue depth used by ESX Server VMKernel active commands. This statistic is only applicable to worlds and LUNs.

 

%USD = ACTV / QLEN * 100%

 

For world stats, WQLEN is used as the denominator. For LUN (aka device) stats, LQLEN is used as the denominator.

 

%USD is a measure of how many of the available command queue "slots" are in use. Sustained high values indicate the potential for queuing; you may need to adjust the queue depths for the system's HBAs if QUED is also found to be consistently > 1 at the same time. Queue sizes can be adjusted in a few places in the IO path and can be used to alleviate performance problems related to latency. For detailed information on this topic, please refer to the VMware whitepaper entitled "Scalable Storage Performance".

 

  • LOAD

The ratio of the sum of VMKernel active commands and VMKernel queued commands to the queue depth. This statistic is only applicable to worlds and LUNs.

 

The sum of the active and queued commands gives the total number of outstanding commands issued by that virtual machine. The LOAD counter is the ratio of this sum to the queue depth. If LOAD > 1, check the value of the QUED counter.
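
The derived counters above can be reproduced from ACTV, QUED, and the queue depth. A minimal Python sketch of the formulas, with hypothetical sample values:

# Minimal sketch of the derived queue counters described above.
# For a LUN, qlen is LQLEN; for a world, qlen is WQLEN.
def pct_used(actv, qlen):
    """%USD = ACTV / QLEN * 100%"""
    return actv / qlen * 100.0

def load(actv, qued, qlen):
    """LOAD = (active commands + queued commands) / queue depth"""
    return (actv + qued) / qlen

# Hypothetical LUN with LQLEN=32, 30 active and 8 queued commands:
print(pct_used(30, 32))   # ~93.8 -> queue almost full
print(load(30, 8, 32))    # ~1.19 -> LOAD > 1, check QUED and consider a larger queue depth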

 

Section 4.2.4 Error Statistics

  • ABRTS/s

The number of commands aborted per second.

 

It can indicate that the storage system is unable to meet the demands of the guest operating system. Abort commands are issued by the guest when the storage system has not responded within an acceptable amount of time, e.g. 60 seconds on some Windows operating systems. Also, resets issued by a guest OS on its virtual SCSI adapter will be translated to aborts of all the commands outstanding on that virtual SCSI adapter.

 

  • RESETS/s

The number of commands reset per second.

 

Section 4.2.5 PAE Statistics

  • PAECMD/s

The number of PAE commands per second.

 

A non-zero value may point to hardware misconfiguration. When the guest allocates a buffer, the VMkernel assigns machine memory for it, which might come from a "highmem" region. If the adapter driver is not PAE-aware, this counter is updated whenever the VMkernel has to copy data from that region into a lower memory location before issuing the request to the adapter. If the DIMMs are not populated with low memory first, "highmem" accesses may be caused artificially.

 

  • PAECP/s

The number of PAE copies per second.

 

Section 4.2.6 Split Statistics

  • SPLTCMD/s

The number of split commands per second.

 

Commands can be split when they reach the VMkernel. This might impact the latency perceived by the guest. The guest may be issuing commands of large block sizes which have to be broken down by the VMkernel. For ESX 3.0.x, guest requests larger than 128KB are split into 128KB chunks. Since few applications issue IOs larger than 128KB, this is unlikely to be an issue. Splitting can also occur when IOs cross partition boundaries, but these cases are easily distinguished from splitting caused by large IO sizes.
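
As a rough illustration of the 128KB splitting described above for ESX 3.0.x (a sketch, not an exact model of the I/O path), the number of VMkernel commands generated by a single large guest request can be estimated as follows:

import math

SPLIT_SIZE_KB = 128  # maximum transfer size before splitting on ESX 3.0.x, per the text above

def split_count(guest_io_kb):
    """Estimated number of VMkernel commands issued for one guest request."""
    return max(1, math.ceil(guest_io_kb / SPLIT_SIZE_KB))

print(split_count(64))    # 1 -> no splitting
print(split_count(1024))  # 8 -> a 1MB guest IO becomes eight 128KB commands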

 

  • SPLTCP/s

The number of split copies per second.

 

Section 4.2.7 Clone Statistics

  • CLONE_RD

The number of CLONE commands successfully completed where this device was a source.

 

  • CLONE_WR

The number of CLONE commands successfully completed where this device was a destination.

 

  • CLONE_F

The number of failed CLONE commands.

 

A high number of failed CLONE commands indicates that the array does not support this command.

 

  • MBC_RD/s

Megabytes clone read per second.

 

  • MBC_WR/s

Megabytes clone written per second.

 

  • CAVG/suc

The average clone latency per successful command.

 

Clone latency can be affected by VM size, storage type, and network traffic.

 

  • CAVG/f

The average clone latency per failed command.

 

Section 4.2.8 ATS Statistics

  • ATS

The number of ATS (atomic test-and-set) commands successfully completed.

 

The ATS operation atomically compares an on-disk sector to a given buffer, and, if the two are identical, writes new data into the on-disk sector. The ATS primitive obviates the need for SCSI reservations for the purpose of disk-locking.

 

  • ATSF

The number of ATS commands failed.

 

ATS fails if there is a mismatch between the old image and the current image on disk, most likely because of an intervening ATS command from another ESX host interested in the same lock. It can also fail because of reservation conflicts, if another host has the LUN reserved using a legacy SCSI reservation.

 

A high number of failed ATS commands indicates that the array does not support this command.

 

  • AAVG/suc

The average ATS latency per successful command.

 

  • AAVG/f

The average ATS latency per failed command.

 

Section 4.2.9 Zero Statistics

  • ZERO

The number of ZERO_BLOCKS commands successfully completed.

 

When a virtual disk is created, depending on the VMDK type, any data remaining on the physical disk or LUN can be zeroed out during creation of the virtual disk, or zeroed out later during virtual machine read and write operations.

 

  • ZERO_F

The number of ZERO commands failed.

 

A high number of failed ZERO commands indicates that the array does not support this command.

 

  • MBZERO/s

Megabytes zeroed per second.

 

  • ZAVG/suc

The average zero latency per successful command.

 

  • ZAVG/f

The average zero latency per failed command.

 

Section 4.2.10 Reservation Statistics

A SCSI reservation indicates that the host is performing a metadata operation on the VMFS volume (such as file creation, allocation of disk space, or a file rename), which by itself is benign. However, if another host issues I/O to the LUN while the SCSI reservation is in progress, it can encounter reservation conflicts.

 

During the conflict period, any I/O going to the LUN will fail with a BUSY status and will have to be retried. Reservations are usually held for a very short time (a few hundred microseconds). A high number of reservation conflicts is clearly undesirable, since I/O latencies go up when there are retries. The likelihood of reservation conflicts increases with the number of metadata operations and with the number of ESX hosts sharing the same LUN and doing I/O at the same time.

 

  • RESV/s

The number of SCSI reservations per second.

 

  • CONS/s

The number of SCSI reservation conflicts per second.

 

  • FRESV/s

The number of reserve commands failed per second.

 

Section 4.3 Batch Mode Output

Esxtop batch mode output can be loaded into perfmon directly. It uses a csv (comma-separated values) format. The instance type can be identified from its name. Because there are quite a number of instance types related to disk statistics, a few examples are listed below; you can easily match the format in your own environment (a small parsing sketch follows the list).

 

  • LUN (aka device): "<host>\Physical Disk(DEV-vmhba0:0:0)\<counter>"

  • Partition: "<host>\Physical Disk(PN-vmhba0:0:0-1)\<counter>"

  • Path: "<host>\Physical Disk(PH-vmhba0:C0:T0:L0)\<counter>"

  • Per-World-Per-Device: "<host>\Physical Disk(WD-vmhba0:0:0-1024)\<counter>"

  • Adapter: "<host>\Physical Disk(vmhba0)\<counter>"

  • VSCSI: "<host>\Virtual Disk(win-64)\<counter>"

  • NFS: "<host>\Physical Disk NFS Volume(vmfs_nas1)\<counter>"
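
Because the object type is encoded in the instance name, a post-processing script can classify disk instances from the batch output. A minimal Python sketch, using the example instance names above (the patterns are assumptions based on those examples; adjust them for your environment):

import re

# Minimal sketch: classify esxtop disk instances in batch mode output by their name prefix.
PATTERNS = [
    (re.compile(r"^DEV-"), "LUN (device)"),
    (re.compile(r"^PN-"),  "Partition"),
    (re.compile(r"^PH-"),  "Path"),
    (re.compile(r"^WD-"),  "Per-World-Per-Device"),
]

def classify(instance):
    for pattern, kind in PATTERNS:
        if pattern.match(instance):
            return kind
    return "Adapter or other"   # e.g. plain "vmhba0"

for name in ["DEV-vmhba0:0:0", "PN-vmhba0:0:0-1", "PH-vmhba0:C0:T0:L0", "WD-vmhba0:0:0-1024", "vmhba0"]:
    print(name, "->", classify(name))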

 

Section 5 Network

Section 5.1 Port

We arrange the network stats per port of a virtual switch. "PORT-ID" identifies the port and "DNAME" shows the virtual switch name. A port can be linked to a physical NIC as an uplink, or can be connected by a virtual NIC. "UPLINK" indicates whether the port is an uplink.

 

If the port is an uplink, i.e., "UPLINK" is 'Y', "USED-BY" shows the physical NIC name.

 

If the port is connected by a virtual NIC, i.e., "UPLINK" is 'N', "USED-BY" shows the port client name. (a) If the port is used by a virtual machine, the client name contains a world id and the VM name. The world id identifies the leader world of the VM group. Note that "vswif" is used by COS (on classic ESX). (b) If the port is used by the VMKernel system, there is no world id. The client name can be used to identify the use of the port. To give two examples:

 

  • "vmk" is a port used by vmkernel. Users can create vmk NICs for  their uses, such as VMotion. On ESXi, there will be at least one vmk  NIC to communicate with outside of the host.

  • "Management" is a management port for a portset. This is internal. Usually no need to worry about it.

 

For each non-uplink port, the NIC teaming policy determines which physical NIC is in charge of the port. "TEAM-PNIC" shows the physical NIC name, if valid. Please refer to NIC teaming documentation for details.

 

Section 5.2 Port Statistics

  • "SPEED" (Mbps)

The link speed in Megabits per second. This information is only valid for a physical NIC.

 

  • "FDUPLX"

'Y' implies the corresponding link is operating at full duplex. 'N' implies it is not. This information is only valid for a physical NIC.

 

  • "UP"

'Y' implies the corresponding link is up. 'N' implies it is not. This information is only valid for a physical NIC.

 

  • "PKTTX/s"

The number of packets transmitted per second.

 

  • "PKTRX/s"

The number of packets received per second.

 

  • "MbTX/s" (Mbps)

The MegaBits transmitted per second.

 

  • "MbRX/s" (Mbps)

The MegaBits received per second.

 

Q: Why does MbRX/s not match PKTRX/s for different workloads?

A: Because the average packet size may differ between workloads. It can be computed as follows: average_packet_size = MbRX/s / PKTRX/s. A larger packet size may improve the CPU efficiency of packet processing; however, it may potentially increase latency.
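
For example, a minimal Python sketch of this computation, converting megabits to bytes (the sample values are hypothetical):

# Minimal sketch: estimate the average received packet size from the two counters above.
def avg_packet_bytes(mb_rx_per_s, pkt_rx_per_s):
    """MbRX/s is megabits per second; convert to bytes per packet."""
    return (mb_rx_per_s * 1_000_000 / 8) / pkt_rx_per_s

print(avg_packet_bytes(950.0, 80_000))   # ~1484 bytes -> close to a standard 1500-byte MTU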

 

  • "%DRPTX"

The percentage of transmit packets dropped.

 

"%DRPTX" = "dropped Tx packets" / ("success Tx packets" + "dropped Tx packets")

 

Q: What does it mean if %DRPTX is high?

A: This usually means that network transmit performance is poor. Check whether the physical NICs are running at full capacity. You may need faster physical NICs, or you can add more physical NICs and use a suitable NIC teaming load-balancing policy.

 

  • "%DRPRX"

The percentage of receive packets dropped.

 

"%DRPRX" = "dropped Rx packets" / ("success Rx packets" + "dropped Rx packets")

 

Q: What does it mean if %DRPRX is high?

A: This usually means that network receive performance is poor. Try to give more CPU resources to the impacted VM, or increase the ring buffer size.
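
Both drop percentages can be recomputed from the raw packet counters using the formulas above; a minimal Python sketch with hypothetical counts:

# Minimal sketch of the %DRPTX / %DRPRX formulas above (sample counts are hypothetical).
def drop_pct(dropped, succeeded):
    return dropped / (succeeded + dropped) * 100.0

print(drop_pct(dropped=120, succeeded=99_880))   # 0.12% of packets dropped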

 

  • "ACTN/s"

Number of actions per second. The actions here are VMkernel actions. It is an internal counter. We won't discuss it further here.

 

Section 6. Interrupt

  • "DEVICES"

The devices that use the interrupt vector.

 

It is the list of devices, separated by commas. If the interrupt vector is not enabled for a device, its name is enclosed in "<>", e.g. "<VMK device>".

 

Tip: If "DEVICES" is cut off due to its length, use interactive command "L" to change the length of column "DEVICES".

 

  • "COUNT/s"

The total number of interrupts per second across all the CPUs.

 

For example, if you have 2 CPUs, "COUNT/s" = "COUNT_0" + "COUNT_1". "COUNT_x" is described below.

 

This counter measures how often an interrupt is raised on the "DEVICE".

 

  • "COUNT_x"

The number of interrupts per second on CPU 'x'.

 

This is a per-CPU counter. Comparing "COUNT_x" for the same interrupt vector on different CPUs shows how evenly the interrupts are distributed across the CPUs.
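
A quick way to quantify that balance from the per-CPU counters is sketched below in Python (the counter values are hypothetical):

# Minimal sketch: check how evenly an interrupt vector is spread across CPUs
# from its COUNT_x values (hypothetical numbers for a 4-CPU host).
counts = {"COUNT_0": 1200, "COUNT_1": 1150, "COUNT_2": 90, "COUNT_3": 60}

total = sum(counts.values())            # corresponds to COUNT/s for this vector
for cpu, value in counts.items():
    print(f"{cpu}: {value / total:6.1%} of interrupts")
# A heavily skewed distribution (as above) means most interrupts land on two CPUs.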

 

  • "TIME/int"

The average processing time in microseconds per interrupt.

 

  • "TIME_x"

The average processing time in microseconds per interrupt on CPU 'x'.

 

"TIME/int" is the average for all the interrupts of the same vector,  while "TIME_x" averages only the interrupts raised on CPU 'x'.

 

Section 7. Batch Mode

Esxtop batch mode output uses a csv (comma separated values) format. The first line contains the names of the performance counters and their instances. Each of the following lines contains the performance data for those counter instances in one snapshot.

 

One way to read the batch mode output file is to load it in Windows perfmon. (1) Run perfmon; (2) Type "Ctrl + L" to view log data; (3) Add the file to the "Log files" and click OK; (4) Choose the counters to show the performance data. Each batch mode counter has a category name (listed as a performance object in perfmon) and a counter name (listed in the counter list in perfmon).
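
If perfmon is not convenient, the same csv file can also be inspected with a short script. A minimal Python sketch (the file name and the counter substrings are placeholders, not fixed esxtop names):

import csv

# Minimal sketch: list some counter instances in an esxtop batch mode file and
# show their first few values.  "esxtop.csv" and the substrings below are placeholders.
with open("esxtop.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)               # first line: counter instance names
    rows = list(reader)                 # remaining lines: one snapshot each

wanted = [i for i, name in enumerate(header)
          if "Physical Disk" in name and "Commands/sec" in name]
for i in wanted[:5]:
    print(header[i], "->", [row[i] for row in rows[:3]])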

 

The counter names in esxtop batch mode are different from the ones in interactive mode listed in the sections above. The tables below describe their relationships. The first column is the interactive mode counter name; the second column is the batch mode counter category; the last column is the batch mode counter name.

 

  • Table 7-1 CPU Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
CPU load average | Physical Cpu Load | Cpu Load (1 Minute Avg), Cpu Load (5 Minute Avg), Cpu Load (15 Minute Avg)
PCPU USED(%) | Physical Cpu | % Processor Time
PCPU UTIL(%) | Physical Cpu | % Util Time
CORE UTIL(%) | Physical Cpu | % Core Util Time
CCPU(%) us | Console Physical Cpu | % User Time
CCPU(%) sy | Console Physical Cpu | % System Time
CCPU(%) id | Console Physical Cpu | % Idle Time
CCPU(%) wa | Console Physical Cpu | % I/O Wait Time
CCPU(%) cs/sec | Console Physical Cpu | % Context Switches/sec
%USED | Group Cpu (or Vcpu) | % Used
%SYS | Group Cpu (or Vcpu) | % System
%OVRLP | Group Cpu (or Vcpu) | % Overlap
%RUN | Group Cpu (or Vcpu) | % Run
%RDY | Group Cpu (or Vcpu) | % Ready
%MLMTD | Group Cpu (or Vcpu) | % Max Limited
%CSTP | Group Cpu (or Vcpu) | % CoStop
%WAIT | Group Cpu (or Vcpu) | % Wait
%IDLE | Group Cpu (or Vcpu) | % Idle
%SWPWT | Group Cpu (or Vcpu) | % Swap Wait

 

  • Table 7-2 Memory Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
MEM overcommit avg | Memory | Memory Overcommit (1 Minute Avg), Memory Overcommit (5 Minute Avg), Memory Overcommit (15 Minute Avg)
PMEM total | Memory | Machine MBytes
PMEM cos | Memory | Console MBytes
PMEM vmk | Memory | Kernel MBytes
PMEM other | Memory | NonKernel MBytes
PMEM free | Memory | Free MBytes
VMKMEM managed | Memory | Kernel Managed MBytes
VMKMEM minfree | Memory | Kernel MinFree MBytes
VMKMEM rsvd | Memory | Kernel Reserved MBytes
VMKMEM ursvd | Memory | Kernel Unreserved MBytes
VMKMEM state | Memory | Kernel State (0: high, 1: soft, 2: hard, 3: low)
COSMEM free | Console Memory | Free MBytes
COSMEM swap_t | Console Memory | Swap Total MBytes
COSMEM swap_f | Console Memory | Swap Free MBytes
COSMEM r/s | Console Memory | Swap MBytes Read/sec
COSMEM w/s | Console Memory | Swap MBytes Write/sec
NUMA | Numa Node | Total MBytes, Free MBytes
PSHARE shared | Memory | PShare Shared MBytes
PSHARE common | Memory | PShare Common MBytes
PSHARE saving | Memory | PShare Savings MBytes
SWAP curr | Memory | Swap Used MBytes
SWAP target | Memory | Swap Target MBytes
SWAP r/s | Memory | Swap MBytes Read/sec
SWAP w/s | Memory | Swap MBytes Write/sec
MEMCTL curr | Memory | Memctl Current MBytes
MEMCTL target | Memory | Memctl Target MBytes
MEMCTL max | Memory | Memctl Max MBytes
MEMSZ | Group Memory | Memory Size MBytes
GRANT | Group Memory | Memory Granted Size MBytes
SZTGT | Group Memory | Target Size MBytes
TCHD | Group Memory | Touched MBytes
%ACTV | Group Memory | % Active Estimate
%ACTVS | Group Memory | % Active Slow Estimate
%ACTVF | Group Memory | % Active Fast Estimate
%ACTVN | Group Memory | % Active Next Estimate
MCTL? | Group Memory | Memctl?
MCTLSZ | Group Memory | Memctl MBytes
MCTLTGT | Group Memory | Memctl Target MBytes
MCTLMAX | Group Memory | Memctl Max MBytes
SWCUR | Group Memory | Swapped MBytes
SWTGT | Group Memory | Swap Target MBytes
SWR/s | Group Memory | Swap Read MBytes/sec
SWW/s | Group Memory | Swap Written MBytes/sec
SHRD | Group Memory | Shared MBytes
ZERO | Group Memory | Zero MBytes
SHRDSVD | Group Memory | Shared Saved MBytes
COWH | Group Memory | Copy On Write Hint MBytes
OVHDUW | Group Memory | Overhead UW MBytes
OVHD | Group Memory | Overhead MBytes
OVHDMAX | Group Memory | Overhead Max MBytes

 

  • Table 7-3 Disk Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
CMDS/s | Physical Disk | Commands/sec
READS/s | Physical Disk | Reads/sec
WRITES/s | Physical Disk | Writes/sec
CMDS/s | Virtual Disk | Commands/sec
READS/s | Virtual Disk | Reads/sec
WRITES/s | Virtual Disk | Writes/sec
CMDS/s | Physical Disk Adapter | Commands/sec
READS/s | Physical Disk Adapter | Reads/sec
WRITES/s | Physical Disk Adapter | Writes/sec
CMDS/s | Physical Disk Path | Commands/sec
READS/s | Physical Disk Path | Reads/sec
WRITES/s | Physical Disk Path | Writes/sec
CMDS/s | Physical Disk SCSI Device | Commands/sec
READS/s | Physical Disk SCSI Device | Reads/sec
WRITES/s | Physical Disk SCSI Device | Writes/sec
CMDS/s | Physical Disk Partition | Commands/sec
READS/s | Physical Disk Partition | Reads/sec
WRITES/s | Physical Disk Partition | Writes/sec
CMDS/s | Physical Disk Per-Device-Per-World | Commands/sec
READS/s | Physical Disk Per-Device-Per-World | Reads/sec
WRITES/s | Physical Disk Per-Device-Per-World | Writes/sec
CMDS/s | Physical Disk NFS Volume | Commands/sec
READS/s | Physical Disk NFS Volume | Reads/sec
WRITES/s | Physical Disk NFS Volume | Writes/sec
CLONE_RD | Physical Disk SCSI Device | CReads
CLONE_WR | Physical Disk SCSI Device | CWrites
CLONE_F | Physical Disk SCSI Device | CFailed
CLONE_RD | Physical Disk NFS Volume | CReads
CLONE_WR | Physical Disk NFS Volume | CWrites
CLONE_F | Physical Disk NFS Volume | CFailed
ATS | Physical Disk SCSI Device | ATS
ATSF | Physical Disk SCSI Device | ATS Failed
ATS | Physical Disk NFS Volume | ATS
ATSF | Physical Disk NFS Volume | ATS Failed
ZERO | Physical Disk SCSI Device | Zeros
ZERO_F | Physical Disk SCSI Device | Zeros Failed
ZERO | Physical Disk NFS Volume | Zeros
ZERO_F | Physical Disk NFS Volume | Zeros Failed
MBREAD/s | Physical Disk | MBytes Read/sec
MBWRTN/s | Physical Disk | MBytes Written/sec
MBREAD/s | Virtual Disk | MBytes Read/sec
MBWRTN/s | Virtual Disk | MBytes Written/sec
MBREAD/s | Physical Disk Adapter | MBytes Read/sec
MBWRTN/s | Physical Disk Adapter | MBytes Written/sec
MBREAD/s | Physical Disk Path | MBytes Read/sec
MBWRTN/s | Physical Disk Path | MBytes Written/sec
MBREAD/s | Physical Disk SCSI Device | MBytes Read/sec
MBWRTN/s | Physical Disk SCSI Device | MBytes Written/sec
MBREAD/s | Physical Disk Partition | MBytes Read/sec
MBWRTN/s | Physical Disk Partition | MBytes Written/sec
MBREAD/s | Physical Disk Per-Device-Per-World | MBytes Read/sec
MBWRTN/s | Physical Disk Per-Device-Per-World | MBytes Written/sec
MBREAD/s | Physical Disk NFS Volume | MBytes Read/sec
MBWRTN/s | Physical Disk NFS Volume | MBytes Written/sec
MBZERO/s | Physical Disk SCSI Device | MBytes Zeroed/sec
MBC_RD/s | Physical Disk SCSI Device | MBytes CReads/sec
MBC_WR/s | Physical Disk SCSI Device | MBytes CWrites/sec
MBZERO/s | Physical Disk NFS Volume | MBytes Zeroed/sec
MBC_RD/s | Physical Disk NFS Volume | MBytes CReads/sec
MBC_WR/s | Physical Disk NFS Volume | MBytes CWrites/sec
KAVG/cmd | Physical Disk | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk | Average Queue MilliSec/Command
KAVG/cmd | Physical Disk Adapter | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk Adapter | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk Adapter | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk Adapter | Average Queue MilliSec/Command
KAVG/cmd | Physical Disk Path | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk Path | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk Path | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk Path | Average Queue MilliSec/Command
KAVG/cmd | Physical Disk SCSI Device | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk SCSI Device | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk SCSI Device | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk SCSI Device | Average Queue MilliSec/Command
KAVG/cmd | Physical Disk Partition | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk Partition | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk Partition | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk Partition | Average Queue MilliSec/Command
KAVG/cmd | Physical Disk Per-Device-Per-World | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk Per-Device-Per-World | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk Per-Device-Per-World | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk Per-Device-Per-World | Average Queue MilliSec/Command
KAVG/cmd | Physical Disk NFS Volume | Average Kernel MilliSec/Command
DAVG/cmd | Physical Disk NFS Volume | Average Driver MilliSec/Command
GAVG/cmd | Physical Disk NFS Volume | Average Guest MilliSec/Command
QAVG/cmd | Physical Disk NFS Volume | Average Queue MilliSec/Command
KAVG/rd | Physical Disk | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk | Average Driver MilliSec/Read
GAVG/rd | Physical Disk | Average Guest MilliSec/Read
QAVG/rd | Physical Disk | Average Queue MilliSec/Read
KAVG/rd | Physical Disk Adapter | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk Adapter | Average Driver MilliSec/Read
GAVG/rd | Physical Disk Adapter | Average Guest MilliSec/Read
QAVG/rd | Physical Disk Adapter | Average Queue MilliSec/Read
KAVG/rd | Physical Disk Path | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk Path | Average Driver MilliSec/Read
GAVG/rd | Physical Disk Path | Average Guest MilliSec/Read
QAVG/rd | Physical Disk Path | Average Queue MilliSec/Read
KAVG/rd | Physical Disk SCSI Device | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk SCSI Device | Average Driver MilliSec/Read
GAVG/rd | Physical Disk SCSI Device | Average Guest MilliSec/Read
QAVG/rd | Physical Disk SCSI Device | Average Queue MilliSec/Read
KAVG/rd | Physical Disk Partition | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk Partition | Average Driver MilliSec/Read
GAVG/rd | Physical Disk Partition | Average Guest MilliSec/Read
QAVG/rd | Physical Disk Partition | Average Queue MilliSec/Read
KAVG/rd | Physical Disk Per-Device-Per-World | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk Per-Device-Per-World | Average Driver MilliSec/Read
GAVG/rd | Physical Disk Per-Device-Per-World | Average Guest MilliSec/Read
QAVG/rd | Physical Disk Per-Device-Per-World | Average Queue MilliSec/Read
KAVG/rd | Physical Disk NFS Volume | Average Kernel MilliSec/Read
DAVG/rd | Physical Disk NFS Volume | Average Driver MilliSec/Read
GAVG/rd | Physical Disk NFS Volume | Average Guest MilliSec/Read
QAVG/rd | Physical Disk NFS Volume | Average Queue MilliSec/Read
KAVG/wr | Physical Disk | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk | Average Driver MilliSec/Write
GAVG/wr | Physical Disk | Average Guest MilliSec/Write
QAVG/wr | Physical Disk | Average Queue MilliSec/Write
KAVG/wr | Physical Disk Adapter | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk Adapter | Average Driver MilliSec/Write
GAVG/wr | Physical Disk Adapter | Average Guest MilliSec/Write
QAVG/wr | Physical Disk Adapter | Average Queue MilliSec/Write
KAVG/wr | Physical Disk Path | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk Path | Average Driver MilliSec/Write
GAVG/wr | Physical Disk Path | Average Guest MilliSec/Write
QAVG/wr | Physical Disk Path | Average Queue MilliSec/Write
KAVG/wr | Physical Disk SCSI Device | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk SCSI Device | Average Driver MilliSec/Write
GAVG/wr | Physical Disk SCSI Device | Average Guest MilliSec/Write
QAVG/wr | Physical Disk SCSI Device | Average Queue MilliSec/Write
KAVG/wr | Physical Disk Partition | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk Partition | Average Driver MilliSec/Write
GAVG/wr | Physical Disk Partition | Average Guest MilliSec/Write
QAVG/wr | Physical Disk Partition | Average Queue MilliSec/Write
KAVG/wr | Physical Disk Per-Device-Per-World | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk Per-Device-Per-World | Average Driver MilliSec/Write
GAVG/wr | Physical Disk Per-Device-Per-World | Average Guest MilliSec/Write
QAVG/wr | Physical Disk Per-Device-Per-World | Average Queue MilliSec/Write
KAVG/wr | Physical Disk NFS Volume | Average Kernel MilliSec/Write
DAVG/wr | Physical Disk NFS Volume | Average Driver MilliSec/Write
GAVG/wr | Physical Disk NFS Volume | Average Guest MilliSec/Write
QAVG/wr | Physical Disk NFS Volume | Average Queue MilliSec/Write
LAT/rd | Virtual Disk | Average MilliSec/Read
LAT/wr | Virtual Disk | Average MilliSec/Write
CAVG/suc | Physical Disk SCSI Device | Average Success Latency ms/Clone
CAVG/f | Physical Disk SCSI Device | Average Failure Latency ms/Clone
AAVG/suc | Physical Disk SCSI Device | Average Success Latency ms/ATS
AAVG/f | Physical Disk SCSI Device | Average Failure Latency ms/ATS
ZAVG/suc | Physical Disk SCSI Device | Average Success Latency ms/Zero
ZAVG/f | Physical Disk SCSI Device | Average Failure Latency ms/Zero
CAVG/suc | Physical Disk NFS Volume | Average Success Latency ms/Clone
CAVG/f | Physical Disk NFS Volume | Average Failure Latency ms/Clone
AAVG/suc | Physical Disk NFS Volume | Average Success Latency ms/ATS
AAVG/f | Physical Disk NFS Volume | Average Failure Latency ms/ATS
ZAVG/suc | Physical Disk NFS Volume | Average Success Latency ms/Zero
ZAVG/f | Physical Disk NFS Volume | Average Failure Latency ms/Zero
AQLEN | Physical Disk | Adapter Q Depth
DQLEN | Physical Disk SCSI Device | Device Q Depth
WQLEN | Physical Disk SCSI Device | World Q Depth
DQLEN | Physical Disk Partition | Device Q Depth
WQLEN | Physical Disk Partition | World Q Depth
DQLEN | Physical Disk Per-Device-Per-World | Device Q Depth
WQLEN | Physical Disk Per-Device-Per-World | World Q Depth
DQLEN | Physical Disk Path | Device Q Depth
WQLEN | Physical Disk Path | World Q Depth
ACTV | Physical Disk SCSI Device | Active Commands
QUED | Physical Disk SCSI Device | Queued Commands
%USD | Physical Disk SCSI Device | % Used
LOAD | Physical Disk SCSI Device | Load
ACTV | Physical Disk NFS Volume | Active Commands
ABRTS/s | Physical Disk | Aborts/sec
RESETS/s | Physical Disk | Resets/sec
ABRTS/s | Physical Disk Adapter | Aborts/sec
RESETS/s | Physical Disk Adapter | Resets/sec
ABRTS/s | Physical Disk Path | Aborts/sec
RESETS/s | Physical Disk Path | Resets/sec
ABRTS/s | Physical Disk SCSI Device | Aborts/sec
RESETS/s | Physical Disk SCSI Device | Resets/sec
ABRTS/s | Physical Disk Per-Device-Per-World | Aborts/sec
RESETS/s | Physical Disk Per-Device-Per-World | Resets/sec
ABRTS/s | Physical Disk NFS Volume | Aborts/sec
RESETS/s | Physical Disk NFS Volume | Resets/sec
PAECMD/s | Physical Disk | PAE Commands/sec
PAECP/s | Physical Disk | PAE Copies/sec
PAECMD/s | Physical Disk Adapter | PAE Commands/sec
PAECP/s | Physical Disk Adapter | PAE Copies/sec
PAECMD/s | Physical Disk Path | PAE Commands/sec
PAECP/s | Physical Disk Path | PAE Copies/sec
SPLTCMD/s | Physical Disk | Split Commands/sec
SPLTCP/s | Physical Disk | Split Copies/sec
SPLTCMD/s | Physical Disk Adapter | Split Commands/sec
SPLTCP/s | Physical Disk Adapter | Split Copies/sec
SPLTCMD/s | Physical Disk Path | Split Commands/sec
SPLTCP/s | Physical Disk Path | Split Copies/sec
RESV/s | Physical Disk | Reserves/sec
CONS/s | Physical Disk | Conflicts/sec
FRESV/s | Physical Disk | Failed Reserves/sec

 

  • Table 7-4 Network Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
SPEED | Network Port | Link Speed (Mb/s)
FDUPLX | Network Port | Full Duplex?
UP | Network Port | Link Up?
PKTTX/s | Network Port | Packets Transmitted/sec
PKTRX/s | Network Port | Packets Received/sec
MbTX/s | Network Port | MBits Transmitted/sec
MbRX/s | Network Port | MBits Received/sec
%DRPTX | Network Port | % Outbound Packets Dropped
%DRPRX | Network Port | % Received Packets Dropped
ACTN/s | Network Port | Actions Posted/sec

 

  • Table 7-5 Interrupt Batch Mode Counters

Counter Name | Batch Mode Category | Batch Mode Counter Name
COUNT/s | Interrupt Vector | Interrupts/second
TIME/int | Interrupt Vector | Processing Time MicroSec/Interrupt

 


Monitoring Hardware Performance Events on ESXi 5.0 with vmkperf


Introduction

Many processors allow monitoring performance events that occur in hardware using the hardware performance counters. Some examples of these events are cache misses, TLB flushes, and unhalted clock cycles. ESXi provides a command-line utility “vmkperf” that can be used to monitor these events on a system-wide basis.  Based on the specified configuration, the tool will configure a hardware performance counter to measure a given event and then read this counter periodically to report the number of events that occurred in a given time interval. This document describes how to use the vmkperf utility to configure and monitor these performance events. This utility works on AMD Athlon 64, AMD Opteron, and Intel CORE2 architecture–based processors.

Vmkperf Syntax

vmkperf command [eventName] <options>

 

Configure a performance counter to monitor a new event and start counting

vmkperf  start <eventname> -e <eventselect>  -u <unitmask>

Event name can be any name you choose that easily describes the event being configured. Event select describes an event group, and the unit mask further qualifies that group. Event select and unit mask values are a maximum of 32 bits in length and must be specified in hexadecimal format. The event select and unit mask values for every event that the hardware supports are described in the processor's manual. Vmkperf takes this input, calculates the complete 64-bit event select register value, and then configures the performance counter.

 

Alternatively, you can specify a complete 64-bit event select register value instead of event select and unit masks.

 

vmkperf start <eventname> -r <event select register value>

 

Read current event count

vmkperf  read <eventname>

This will print per-physical CPU cumulative counts for the event since the event was started.

 

Monitor an event periodically

vmkperf  poll <eventname>  -i <interval> -f <format> -n <iterations>

This command will poll the performance counter periodically and print the event rate. The default polling interval is 5 seconds. The default rate format is “avgPerSecond,” which is the average number of events per second per physical CPU. Another available format is “avgPerMillionCycles,” which is the number of events that occurred in a million CPU cycles per physical CPU.

 

Stop monitoring an event

vmkperf  stop <eventname>

This will stop event monitoring. Note that there are some events ESXi configures and these cannot be stopped using vmkperf.

 

Get an event configuration

 

vmkperf  getconfig <eventname>

 

Read the counters for all events

vmkperf  readall

This will print the current count for all configured events.

 

Stop monitoring all events

vmkperf  stopall

 

This will stop monitoring all the events that were configured by the user.

 

List predefined events

vmkperf  listevents

 

There are some predefined events that can be listed using the command “listevents”.  The predefined events have a fixed event select value but you can provide a unitmask at runtime. Note that these events may or may not be active. Use the start command to activate any of the predefined events.

Vmkperf Usage Example

The following shows an example of measuring a last level cache miss event on an Intel Nehalem system.

 

vmkperf start last_level_cache_miss -e 0x1b7 -u 0x0

This command will start event monitoring with event select "0x1b7" and unit mask "0x0". The event will be named "last_level_cache_miss".


vmkperf read last_level_cache_miss

Output:

pcpuID counterVal timeStamp counterNum
0 130084327520 408384367696497 0
1 72577456163 408384367679023 0
2 116851812207 408384367682868 0
3 79918703630 408384367688954 0
4 119799647768 408384367699670 0
5 81561804987 408384366222173 0
6 111652423132 408384367677871 0
7 90228202334 408384367693134 0
8 226104621554 408384367694230 0
9 140914377472 408384367681360 0
10 214133735035 408384367686560 0
11 152284828464 408384367684685 0
12 209715250335 408384367690309 0
13 157156654479 408384367692089 0
14 180435423619 408384367695287 0
15 84368812495 408384367679910 0

 

vmkperf getconfig last_level_cache_miss

Output:
eventSel=0x1b7 unitMask=0x0 eventSelReg=0x0

 

vmkperf poll last_level_cache_miss

Output:
last_level_cache_miss per second per cpu

---------------------
pcpu0    pcpu1    pcpu2    pcpu3    pcpu4    pcpu5    pcpu6    pcpu7    pcpu8    pcpu9    pcpu10    pcpu11    pcpu12    pcpu13    pcpu14    pcpu15   
51600.80 4969.00 14694.60 991.40 10977.20 12249.20 3643.20 4598.60 11015.80 15591.40 17977.20 7620.00 4382.00 15729.40 36800.60 16594.40
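
The "vmkperf read" output shown earlier can also be post-processed to compute per-CPU event rates between two reads. A minimal Python sketch (it assumes the column layout shown above, and that the timestamps are in microseconds; adjust if your build reports different units):

# Minimal sketch: turn two successive "vmkperf read" outputs into per-CPU event rates.
# Column layout follows the sample output above: pcpuID counterVal timeStamp counterNum.
def parse(output):
    rows = {}
    for line in output.strip().splitlines()[1:]:        # skip the header line
        pcpu, counter, timestamp, _ = line.split()
        rows[int(pcpu)] = (int(counter), int(timestamp))
    return rows

def rates(before, after):
    b, a = parse(before), parse(after)
    return {
        cpu: (a[cpu][0] - b[cpu][0]) / ((a[cpu][1] - b[cpu][1]) / 1_000_000)
        for cpu in a
        if cpu in b and a[cpu][1] > b[cpu][1]
    }
# Usage: capture the command output twice (e.g. via subprocess) and pass both strings to rates().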

Frequently Asked Questions

Why does the stop or stopall command sometimes not stop an event?

You can only stop an event using the stop or stopall commands if that event was started using the vmkperf tool. On some systems, ESX/ESXi may be using an event internally, and that event cannot be stopped using vmkperf. If the event was not started using vmkperf, you may see the warning "operation not permitted". However, reading an event is always permitted.

 

How do I monitor fixed performance counters on Intel core architecture?

Fixed performance counters are already defined by vmkperf. These show up as event names "fixed_xxx" in the "vmkperf listevents" output. Fixed performance counters can be enabled using the "vmkperf start" command and providing the same name as printed in the listevents output. For fixed events, only the unit mask is required; the event select is already defined.

 

Why, when I execute a start command, do I get the error “out of resources?”

This happens when there are no more free performance counters available. No new event can be configured unless you stop an event.

 

How do I monitor multiple events simultaneously?

A processor has a limited number of performance counters and these vary based on the processor model. Hence, only a limited number of events can be monitored simultaneously.  In addition, the BIOS and ESX/ESXi use some performance counters and this further limits the number of available counters.
