skills/vmware-esxi-performance/SKILL.md
VMware ESXi performance troubleshooting for network and storage issues. Use when troubleshooting vmxnet3 TX hangs, NETDEV WATCHDOG errors, ring buffer exhaustion, TSO issues, high storage latency (KAVG/DAVG/GAVG), queue depth tuning, DSNRO configuration, iSCSI performance, PVSCSI optimization, or CPU overcommit symptoms. Covers both guest OS and ESXi host-level diagnostics.
Network issues (TX hangs, watchdog errors):
- `dmesg -T | grep -i watchdog` for timeout messages
- `ethtool -g <interface>`
- `ethtool -S <interface> | grep -E "ring.full|tx.timeout"`
- `ethtool -k <interface>`

Storage issues (high latency):

- `esxtop`, press `u` for device view
- `esxcli storage core device list -d <naa.id>`

- `NETDEV WATCHDOG: <interface> (vmxnet3): transmit queue X timed out` in dmesg
- tx hang messages

# Driver and version
ethtool -i <interface>
# Ring buffer sizes
ethtool -g <interface>
# Adapter statistics - look for "ring full" and "tx timeout count"
ethtool -S <interface>
# Offload features
ethtool -k <interface> | grep -E 'tcp-segmentation|generic-segmentation'
# Kernel messages with timestamps
dmesg -T | grep -i -E 'watchdog|tx.hang|timeout'
# Current uptime in seconds (compare to dmesg timestamps)
awk '{print $1}' /proc/uptime
# Increase ring buffers (immediate, non-persistent)
ethtool -G <interface> rx 4096 tx 4096
# Disable TSO if hangs continue (temporary test)
ethtool -K <interface> tso off
# Make ring buffer change persistent (RHEL/AlmaLinux)
# Add to /etc/NetworkManager/dispatcher.d/ or udev rule
| Metric | Meaning | Action if high |
|--------|---------|----------------|
| ring full | TX ring was completely full | Increase TX ring buffer |
| tx timeout count | Watchdog fired | Indicates sustained ring exhaustion |
| pkts tx err | Hardware TX errors | Check physical/virtual NIC |
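The counters in the table above can be checked mechanically. The following is a minimal sketch, assuming the counter labels `ring full` and `tx timeout count` appear in `ethtool -S` output exactly as written in the table (labels can differ between driver versions). `flag_tx_counters` is a hypothetical helper name; it reads statistics from stdin so it can be tried on saved output.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -S <interface>` output on stdin and
# print every "ring full" or "tx timeout count" counter that is non-zero.
# Counter labels are assumed to match the table above; adjust the pattern
# if your vmxnet3 driver version names them differently.
flag_tx_counters() {
    awk -F': *' '
        /ring full|tx timeout count/ {
            # Strip leading whitespace from the counter name
            name = $1
            sub(/^[ \t]+/, "", name)
            if ($2 + 0 > 0) {
                printf "%s=%d\n", name, $2
                found = 1
            }
        }
        # Print a single "ok" when no watched counter was non-zero
        END { if (!found) print "ok" }
    '
}

# Typical use on a live guest (not run here):
#   ethtool -S <interface> | flag_tx_counters
```

On a multi-queue adapter each queue reports its own counters, so one line is printed per non-zero counter.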
# In esxtop, press 'c' for CPU view
# Look at %RDY (ready time) and %CSTP (co-stop)
# > 5% is concerning, > 10% is problematic
# On ESXi host
grep -E "H:0x5|H:0x7|H:0x8" /var/log/vmkernel.log
grep -i -E "APD|PDL|lost.access" /var/log/vmkernel.log
# NIC statistics
esxcli network nic stats get -n vmnicX
- `H:0x5` - Command aborted (host status ABORT)
- `H:0x7` - Internal adapter or driver error
- `H:0x8` - Bus or adapter reset

VM issues I/O
↓
Guest OS disk scheduler
↓
Virtual SCSI adapter (PVSCSI/LSI)
↓
VMkernel I/O scheduler ← KAVG measured here (includes QAVG)
↓
Device queue (DQLEN) ← QAVG measured here
↓
HBA/iSCSI initiator
↓
Network/Fabric
↓
Storage Array ← DAVG measured here
GAVG = KAVG + DAVG (what the guest actually experiences)
QAVG is a subset of KAVG. Both being high and equal indicates queue depth saturation.
| KAVG | QAVG | Interpretation |
|------|------|----------------|
| High | High (equal) | Queue depth limit reached |
| High | High (QAVG higher) | Array overwhelmed or QoS throttling |
| High | Low/Zero | DSNRO throttling or CPU contention |
| Low | Low | Healthy |
KAVG measures VM I/O only. QAVG includes all I/O (VM + hypervisor metadata). During contention, hypervisor I/O gets deprioritized, inflating QAVG average while VM I/O (KAVG) stays lower.
| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| DAVG | < 5ms | 5-15ms | > 15ms |
| KAVG | < 1ms | 1-2ms | > 2ms |
| GAVG | < 10ms | 10-20ms | > 20ms |
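The thresholds above can be applied to readings copied from esxtop. This is a minimal sketch that only encodes the table; `classify_latency` is a hypothetical helper name, and the boundary handling (a value exactly on the upper bound counts as warning) is an assumption, since the table does not say.

```shell
#!/bin/sh
# Hypothetical helper: classify an esxtop latency reading (milliseconds)
# against the general thresholds in the table above.
# Usage: classify_latency <DAVG|KAVG|GAVG> <value_ms>
classify_latency() {
    metric=$1
    value=$2
    case "$metric" in
        DAVG) warn=5;  crit=15 ;;
        KAVG) warn=1;  crit=2  ;;
        GAVG) warn=10; crit=20 ;;
        *) echo "unknown metric: $metric" >&2; return 1 ;;
    esac
    # awk is used because esxtop reports fractional milliseconds
    awk -v v="$value" -v w="$warn" -v c="$crit" 'BEGIN {
        if (v > c)       print "critical"
        else if (v >= w) print "warning"
        else             print "healthy"
    }'
}

# Example: classify_latency KAVG 1.5
```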
For databases (SQL, SingleStore, etc.):
| Metric | Meaning | Healthy Value |
|--------|---------|---------------|
| DAVG | Device/array latency | < 10ms |
| KAVG | Kernel latency | < 1ms |
| QAVG | Queue wait time | < 1ms |
| GAVG | Guest observed (DAVG+KAVG) | < 15ms |
| ACTV | Active commands in flight | < DQLEN |
| QUED | Commands waiting in queue | 0 |
| DQLEN | Device queue depth limit | Varies |
| DAVG | KAVG | QAVG | Likely Cause |
|------|------|------|--------------|
| High | Low | Low | Array/storage problem |
| Low | High | High | Queue depth limit hit |
| Low | High | Low | DSNRO throttling or CPU contention |
| High | High | High | Array slow + queue backing up |
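The matrix above is a plain lookup, which can be sketched as a `case` statement. `diagnose_latency` is a hypothetical helper that takes `high` or `low` for DAVG, KAVG, and QAVG in that order. Deciding what counts as high is left to the caller, and combinations not listed in the table are reported as unmatched rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: map high/low flags for DAVG, KAVG, QAVG to the
# "Likely Cause" column of the matrix above.
# Usage: diagnose_latency <davg> <kavg> <qavg>   (each "high" or "low")
diagnose_latency() {
    case "$1:$2:$3" in
        high:low:low)   echo "Array/storage problem" ;;
        low:high:high)  echo "Queue depth limit hit" ;;
        low:high:low)   echo "DSNRO throttling or CPU contention" ;;
        high:high:high) echo "Array slow + queue backing up" ;;
        *)              echo "No matching row in the matrix" ;;
    esac
}

# Example: diagnose_latency low high low
```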
High DAVG = storage array is slow. Check:
This is the tricky case. I/O is being held in the kernel but not in the device queue.
Check esxcli storage core device list -d <naa.id>:
No of outstanding IOs with competing worlds: 32
If you have multiple VMDKs on the same datastore, each VMDK counts as a "competing world." With 8 VMDKs and DSNRO=32, each is limited to 32 outstanding I/Os.
Fix: esxcli storage core device set -d <naa.id> -O 256
Is Shared Clusterwide: true triggers defensive throttling even if only one host is active. ESXi doesn't know other hosts are idle.
Fix: Same DSNRO increase.
Is SSD: false causes ESXi to apply HDD scheduling policies.
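Fix: tag the device as flash with a SATP claim rule that carries the `enable_ssd` option (`esxcli storage nmp satp rule add --satp <satp> --device <naa.id> --option enable_ssd`), then run `esxcli storage core claiming reclaim -d <naa.id>`. The `-m` flag of `esxcli storage core device set` is `--max-queue-depth` and does not mark a device as SSD.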
Even on dedicated hosts, software iSCSI threads or storage driver processing can get delayed.
Check: esxtop → press c → look at %RDY for VMkernel threads.
Queue depth is saturated. Commands are waiting in line.
Check in esxtop:
Fixes:
Key columns:
Shows HBA-level statistics:
Shows per-VM disk statistics:
# Batch mode capture for 60 seconds, 2-second intervals
esxtop -b -d 2 -n 30 > /tmp/esxtop_capture.csv
# Then analyze with Excel or perfmon
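A batch capture can also be summarized on the command line. The sketch below assumes the capture is a CSV with one quoted header row and quoted values, and that latency counters have names such as `Average Guest MilliSec/Command`. Check the header of your own capture, since counter names are an assumption here. `esxtop_max` is a hypothetical helper that prints the peak value of every column whose name contains the given text.

```shell
#!/bin/sh
# Hypothetical helper: print the maximum value of each column in an
# esxtop batch CSV whose header contains a given substring.
# Usage: esxtop_max <counter-substring> <capture.csv>
# The counter substring must not contain backslashes (awk -v would
# interpret them as escape sequences).
esxtop_max() {
    awk -F'","' -v pat="$1" '
        NR == 1 {
            # Header row: remember the columns whose name contains pat
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/^"|"$/, "", name)
                if (index(name, pat) > 0) cols[i] = name
            }
            next
        }
        {
            # Data rows: track the largest value seen per selected column
            for (i in cols) {
                val = $i
                gsub(/"/, "", val)
                if (val + 0 > max[i]) max[i] = val + 0
            }
        }
        END {
            for (i in cols) printf "%s\t%.2f\n", cols[i], max[i]
        }
    ' "$2"
}

# Example (assumed counter name):
#   esxtop_max 'Average Guest MilliSec/Command' /tmp/esxtop_capture.csv
```

Output is one tab-separated line per matching column, so a capture with several devices yields one peak per device.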
┌─────────────────────────────────────────────────────────────────┐
│ Guest VM │
│ PVSCSI queue: 64 default, 254 max per device │
│ PVSCSI adapter: 256 default, 1024 max total │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ VMkernel │
│ ┌─────────────────────┐ ┌──────────────────────┐ │
│ │ Scheduler Queue │ │ Device Queue │ │
│ │ (DSNRO throttle) │ ──→ │ (DQLEN) │ │
│ │ │ │ │ │
│ │ Per-world limit │ │ Per-device limit │ │
│ │ Default: 32 │ │ HBA-dependent │ │
│ └─────────────────────┘ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ HBA / iSCSI Initiator │
│ Queue depth: 32-255 depending on vendor │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ Storage Array │
│ Port queue depth: 1600-2048 typical │
└─────────────────────────────────────────────────────────────────┘
DSNRO limits outstanding I/Os per "world" to a shared device. A "world" is:
DSNRO throttling activates when:
Wrong: "One VM per datastore means no DSNRO throttling"

Right: "One VMDK per datastore means no DSNRO throttling"
A single VM with 8 VMDKs on one datastore = 8 competing worlds = DSNRO applies.
# Check current setting per device
esxcli storage core device list -d <naa.id> | grep "outstanding"
# Output: No of outstanding IOs with competing worlds: 32
# Change per device (immediate, persists until reboot)
esxcli storage core device set -d <naa.id> -O 256
# Valid range: 1-256
Per-device DSNRO changes do not persist across reboots. Options:
# Apply to all Compellent devices on host
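for naa in $(esxcli storage core device list | grep -B20 "COMPELNT" | grep "^naa\." | awk '{print $1}'); do
    # Raise DSNRO for this device
    esxcli storage core device set -d "$naa" -O 256
    # Note: -m on "device set" is --max-queue-depth, not an SSD flag;
    # SSD tagging needs a SATP claim rule with the enable_ssd option
done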
DQLEN is the maximum commands the device queue can hold. Set by:
| Adapter | Default DQLEN |
|---------|---------------|
| QLogic FC | 64 |
| Emulex FC | 32 |
| Software iSCSI | 128 |
| Hardware iSCSI | Varies |
# Check current HBA module parameters
esxcli system module parameters list -m <driver>
# Example for QLogic
esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=128"
# Requires reboot
Datastore A
├── VM1 (1 VMDK) ← World 1
├── VM2 (1 VMDK) ← World 2
└── VM3 (1 VMDK) ← World 3
DSNRO applies: Each VM limited to 32 outstanding I/Os
Datastore A
└── VM1
├── disk1.vmdk ← World 1
├── disk2.vmdk ← World 2
├── disk3.vmdk ← World 3
└── disk4.vmdk ← World 4
DSNRO applies: Each VMDK limited to 32 outstanding I/Os
Total VM I/O capped at 4 × 32 = 128, but per-disk only 32
Datastore A
└── VM1
└── disk1.vmdk ← Only world
DSNRO does NOT apply: Limited only by device queue depth
Datastore A (shared by 6 hosts)
└── VM1 on Host1
└── disk1.vmdk ← Only active world
DSNRO still applies: ESXi sees "Is Shared Clusterwide: true"
Defensive throttling even though other hosts are idle
Raw Device Mappings (RDMs) are NOT subject to DSNRO. Each RDM gets full device queue depth.
When Storage I/O Control (SIOC) is enabled:
| Vendor | DSNRO | HBA Queue |
|--------|-------|-----------|
| Dell/EMC SC Series | 64 | 255 |
| Pure Storage | 256 | 256 |
| NetApp | 64-128 | 128 |
| HPE Nimble | 256 | 256 |
Always check vendor best practices documentation.
Using Little's Law:
Queue Depth = IOPS × Latency (seconds)
Example:
But for bursts, multiply by 3-5x headroom: 30-50 queue depth needed.
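The Little's Law estimate can be sketched as a small calculation. The workload figures in the example (10,000 IOPS at 1 ms) are hypothetical, chosen only because they give a base depth of 10 and therefore match the 30-50 range above. `required_queue_depth` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: estimate queue depth with Little's Law.
# Queue depth = IOPS x latency (seconds), optionally times a headroom factor.
# Usage: required_queue_depth <iops> <latency_ms> [headroom]
required_queue_depth() {
    awk -v iops="$1" -v ms="$2" -v h="${3:-1}" 'BEGIN {
        # Divide last to convert milliseconds to seconds
        qd = iops * ms * h / 1000
        # Round up: a fractional outstanding I/O still needs a slot
        r = int(qd)
        if (r < qd) r++
        print r
    }'
}

# Hypothetical workload: 10,000 IOPS at 1 ms average latency
required_queue_depth 10000 1      # steady state
required_queue_depth 10000 1 3    # 3x burst headroom
required_queue_depth 10000 1 5    # 5x burst headroom
```

Compare the result against the smallest limit in the I/O path (DSNRO, DQLEN, or the guest adapter queue) rather than against any single setting.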
PVSCSI (Paravirtualized SCSI) is VMware's high-performance virtual storage adapter.
| Adapter | Queue/Device | Queue/Adapter | CPU Overhead |
|---------|--------------|---------------|--------------|
| LSI Logic | 32 | 128 | Higher |
| LSI Logic SAS | 32 | 128 | Higher |
| PVSCSI | 64 (default) | 256 (default) | Lowest |
| PVSCSI (tuned) | 254 (max) | 1024 (max) | Lowest |
# Per-device queue depth
cat /sys/module/vmw_pvscsi/parameters/cmd_per_lun
# Ring pages (affects adapter queue)
cat /sys/module/vmw_pvscsi/parameters/ring_pages
# Verify PVSCSI is in use
lspci | grep -i vmware
# Should show: VMware PVSCSI SCSI Controller
# Check registry
Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Services\pvscsi\Parameters\Device" -Name DriverParameter
Create /etc/modprobe.d/pvscsi.conf:
options vmw_pvscsi cmd_per_lun=254 ring_pages=32
Rebuild initramfs:
# RHEL/CentOS/AlmaLinux
dracut -f
# Ubuntu/Debian
update-initramfs -u
# Then reboot
Cannot change at runtime. Requires module reload or reboot.
Path: HKLM\SYSTEM\CurrentControlSet\Services\pvscsi\Parameters\Device
Value Name: DriverParameter
Value Type: REG_SZ
Value Data: RequestRingPages=32,MaxQueueDepth=254
Then reboot.
Ring pages control the PVSCSI adapter's command ring buffer:
Increasing ring pages allows more concurrent adapter-level I/O.
Three queue depths interact:
I/O flows through all three. Lowest limit wins.
Guest PVSCSI: cmd_per_lun=254
ESXi DSNRO: 32
Array: plenty of capacity
Result: Guest sends 254, ESXi throttles to 32 per world
KAVG increases even though PVSCSI is tuned
Must tune both guest and host settings.
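The "lowest limit wins" rule is a minimum over the queue depths in the path. A minimal sketch, where `effective_queue_depth` is a hypothetical helper and the third value in the example (128) is an assumed device queue depth used only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: the effective per-disk queue depth is the smallest
# limit along the I/O path.
# Usage: effective_queue_depth <depth> [<depth> ...]
effective_queue_depth() {
    min=$1
    for depth in "$@"; do
        if [ "$depth" -lt "$min" ]; then
            min=$depth
        fi
    done
    echo "$min"
}

# Guest PVSCSI cmd_per_lun=254, ESXi DSNRO=32, assumed device queue=128
effective_queue_depth 254 32 128
```

Raising only the guest value leaves the result unchanged, which is why the host-side DSNRO setting has to be tuned as well.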
Maximum 4 PVSCSI controllers per VM, 64 devices per controller.
| Controller | Devices | Purpose |
|------------|---------|---------|
| SCSI 0 | OS disk | System |
| SCSI 1 | Data disks | Database data files |
| SCSI 2 | Log disks | Transaction logs |
| SCSI 3 | Temp disks | TempDB / scratch |
In vSphere:
# Verify module parameters
cat /sys/module/vmw_pvscsi/parameters/cmd_per_lun
# Should show: 254
cat /sys/module/vmw_pvscsi/parameters/ring_pages
# Should show: 32
# Check disk queue depth visible to block layer
cat /sys/block/sd*/device/queue_depth
In esxtop (v view for VM disks):
- `dmesg | grep pvscsi` for driver messages