Since the fall of 2021, when I upgraded from ESXi 7U1c to 7U2a, my host has lost access to its internal SD card several times. My setup is pretty standard: Dell's custom ESXi image installed on my R830's internal dual SD module (IDSDM).
After around 16-20 days of uptime, the host would lock up: the ESXi API would become unresponsive (busted vmware-hostd and/or vmware-vpxa services?), meaning any attempt to manage VMs via the ESXi web UI would time out or sit there indefinitely, as would any attempt via esxcli. With the host managed by vCenter, dispatched vCenter jobs would time out too, with similar UI symptoms.
Symptoms include:
- Unable to manage the power state of VMs (the operation sits pending indefinitely)
- Unable to vMotion/migrate VMs
- Web/remote console does not work
Some host functionality persists, such as managing host services and enabling SSH, but what you can do at this point is very limited.
Fortunately, running workloads (VMs, vApps, resource pools) seem to be unaffected.
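If SSH still works, a quick way to confirm it's the management agents that are wedged (rather than the whole host) is to poke their init scripts from an SSH session; a minimal check along these lines:

/etc/init.d/hostd status
/etc/init.d/vpxa status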
What’s going on?
This is a known issue per VMware KB 2149257, where high-frequency read operations against ESXi's SD card (either a single card or an IDSDM) cause 'SD card corruption'.
This is attributed to the new partition layout in ESXi 7, where the scratch partition lives on the same media and experiences high I/O. Hence, in the same KB, VMware suggests using a Ramdisk as the VMware Tools repository.
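If you want to see where scratch currently lives on your host, both the configured and active locations are exposed as advanced settings; assuming a stock install, something like this should show them:

esxcli system settings advanced list -o /ScratchConfig/ConfiguredScratchLocation
esxcli system settings advanced list -o /ScratchConfig/CurrentScratchLocation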
There are other theories as well, hinting at a bug in the vmkusb driver, an issue VMware engineers are still looking into.
It's important to note that this has only been observed on hosts where ESXi is deployed to an SD card, an IDSDM, or a crappy flash drive. Those who have deployed ESXi to a proper disk (HDD/SSD) have not experienced this issue.
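A quick (and rough) way to check whether your install is in the affected camp is to see whether the vmkusb driver is loaded and where /bootbank actually points:

vmkload_mod -l | grep vmkusb   # is the vmkusb driver loaded?
ls -l /bootbank                # which volume the bootbank symlink resolves to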
You can follow along with the VMware community discussion here.
What can I do?
You’ve got a few options, all of which have varying success and suck in production.
Ordered from safest to most disruptive:
- Restart the vmware-hostd and vmware-vpxa services, which seems to re-establish connectivity to the ESXi filesystem on your SD card.
- Move the scratch partition to a disk or datastore, and use a Ramdisk for VMware Tools (done automatically in ESXi 7U3); see the example commands after this list.
- Downgrade to ESXi 7U1c (the last known-good 7.0 release before these issues).
- Export an ESXi config bundle backup, re-install ESXi, and restore from the config bundle (also sketched below).
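For the scratch-relocation option, it's just an advanced setting plus a reboot, and for the backup/restore option ESXi has built-in commands. A rough sketch - the datastore path is a placeholder for your own, and the restore expects the host to be in maintenance mode on a matching build:

# point scratch at persistent storage (path is an example), then reboot
esxcli system settings advanced set -o /ScratchConfig/ConfiguredScratchLocation -s /vmfs/volumes/datastore1/.locker-esxi01

# export the host configuration bundle (prints a download URL),
# later copy the bundle back to the host and restore it
vim-cmd hostsvc/firmware/backup_config
vim-cmd hostsvc/firmware/restore_config /tmp/configBundle.tgz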
Depending on which of the two suspected causes (scratch I/O on the SD card or the vmkusb bug) you're actually hitting, YMMV.
How I temporarily remediated
Until we see this resolved for good, I've outlined a remediation plan, based on feedback from other awesome individuals in the VMware community, which should address both faults.
1. Stop vmware-hostd and vmware-vpxa
/etc/init.d/hostd stop
/etc/init.d/vpxa stop
2. Wait 60 seconds, attempt to unload vmkusb, and wait another 60 seconds
sleep 60
vmkload_mod -u vmkusb
sleep 60
3. Start vmware-hostd and vmware-vpxa
/etc/init.d/hostd start
/etc/init.d/vpxa start
After a moment you should find your host is once again responsive.
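One quick way to verify hostd is answering again (beyond the web UI loading) is to ask it for the VM inventory:

vim-cmd vmsvc/getallvms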
4. Enter maintenance mode
esxcli system maintenanceMode set --enable true
You can also check the maintenance mode state, which is useful after the last step.
esxcli system maintenanceMode get
5. Enable Ramdisk for VMware Tools repo
esxcfg-advcfg -A ToolsRamdisk --add-desc "Use VMware Tools repository from /tools ramdisk" --add-default "0" --add-type 'int' --add-min "0" --add-max "1"
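Note that the command above only defines the advanced option with a default of 0; depending on which revision of the KB you're following, you may also need to flip it to 1 before rebooting. User-defined options land under /UserVars/, so the path here is my assumption:

# enable the Tools ramdisk (assumed /UserVars/ path for the option created above)
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1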
6. Reboot host
reboot
7. Exit maintenance mode
esxcli system maintenanceMode set --enable false
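After the reboot, you can sanity-check that the Tools ramdisk actually took; something along these lines should show the option set to 1 and a tools ramdisk mounted:

esxcli system settings advanced list -o /UserVars/ToolsRamdisk
esxcli system visorfs ramdisk list | grep -i tools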