Nanny
We are now witnessing boot-related problems more frequently with ESXi 7.x
How exactly are they witnessing it? Through "telemetry"?
The nanny syndrome is creeping in...
VMware has warned users it will end support for non-persistent removable storage as a boot medium for its flagship vSphere VM-wrangler. A post last week delivered the news. "ESXi Boot configuration with only SD card or USB drive, without any persistent device, is deprecated with vSphere 7 Update 3," the post states. "In …
<i"...Had to roll back twenty-seven hosts to 6.7 after boot issues on 7.0.
Absolute f****** shamble costing me and my team a full weekend..."</i>
Pre-deployment testing? Rollback plan? Testing in between deployments?
I would be genuinely interested how you got to 27 hosts before you noticed any issues. Surely you tested a limited number some time before a mass rollout?
It's one of those delightful issues that may not show obvious symptoms for days or weeks, so it's actually easy for people to do all the right testing and still miss this. If you're looking at the vmkernel log you *might* notice some extra churn before host issues pop up.
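For what it's worth, that "extra churn" can be looked for by hand before a host falls over. A minimal sketch of the idea — the sample log lines below are invented for illustration, and real vmkernel.log messages vary by build:

```shell
# Sketch: scan an ESXi vmkernel log for the USB/SD symptoms discussed in
# this thread. The sample lines are illustrative, not verbatim from any build.
cat > /tmp/vmkernel.sample.log <<'EOF'
2021-09-20T10:01:02Z cpu4: Bootbank cannot be found at path '/bootbank'
2021-09-20T10:01:05Z cpu2: vmw_usb: device stopped responding
2021-09-20T10:02:11Z cpu1: ScsiDeviceIO: normal I/O completed
EOF

# Count suspicious lines; a steadily growing count over days is the kind of
# "extra churn" that shows up before the host becomes unresponsive.
grep -Eic 'bootbank|usb' /tmp/vmkernel.sample.log
```

On a real host you would point the grep at /var/log/vmkernel.log instead of the sample file.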
And yes, they are "noticing" because of people opening support cases.
It takes weeks or months to show up, depending upon how good / bad your SD cards are.
We have many servers running ESXi on SD cards; they have been running ESXi since v4 (can't update any further as it's no longer supported). They have worked for 10+ years, but for some reason ESXi 7 will kill these SD cards within weeks. They screwed up by logging far too much (no doubt due to the changes for Kubernetes).
As this is how we have run things for a while, and it has worked well — zero problems in over 10 years of doing it — we purchased replacement R7515s from Dell two years ago, with the SD boot option for ESXi. But now we need to replace/purchase new drives for these servers, as they can't run ESXi 7.
They are probably getting support calls from their customers. They don't need telemetry to see the problem.
As far as "nanny" goes... If people call VMware for support that costs them money. I mean, one call makes no difference to them, but lots of calls mean they have to hire more support staff. So they are banning a whole category of support calls, which will save VMware money.
It also gives IT departments a stick to beat the beancounters with: "VMware changed the rules, our old configuration is no longer supported" is a much stronger argument than "I'm worried the SD cards might fail soon". That argument also shifts blame nicely - it's all VMware's fault that they have to spend money and time reconfiguring systems, everyone in the company is blameless. Yes, this paragraph is full of nonsense big company bullsh*t, but that doesn't make it wrong.
I have already seen direct evidence of this.
I gave up on USB boot/SD-card boot of any server (including many VMware hypervisors) after so many failures over time. I experienced this with all other OSs as well.
Sure, it works at first, and if you only have a few servers you probably won't notice it that much. But if you have a lot of servers, you will see large numbers of failures over time.
Most "appliance" PCs come with Flash DOM modules, which are a bit more robust, but I have still had to replace many a DOM module as well.
Full-on SSDs have had a normal small number of failures across a large fleet of servers, well within my expected range. SD-card and USB flash boot failures are well over 50% over enough time in my environment.
I second that. My old place of work used to have USB sticks for ESXi instances, but after the second death we ripped out the CD drive and hooked up a small but good-quality SATA SSD instead.
No loss of SAS drives or hard disk space, just lose the DVD drive (which can be conveniently replaced with a usb dvd drive in a pinch).
The biggest issue was getting hold of an adapter cable for HP's combined SATA data+power connector.
Definitely not nanny syndrome. More like customers screaming down the phone and sending SWAT teams to their homes.
We've had to open support cases with VMware in relation to this. Expensive Dell hardware (FX2 / FC630) that had been rock solid for 3+ years on 6.x started to show problems with 7.x after an unspecified period. The problem was not present during pre-rollout testing and took weeks to show up. The net result is no rollback, as our entire estate was over to 7.x before we knew the extent of the issue.
The problem is down to a change in VMware 7.x which results in far more frequent writes/reads, choking USB/SD storage, rendering the hypervisor unresponsive and requiring a full reboot to fix. VMware recently released Update 2c, after much pressure, to address this.
Short version? Seems VMware 7.x was not adequately tested prior to release to the customer base and this should have been picked up.
Rather than unpick some of the changes in 7.x the PHBs at VMware have simply decided to use the Unsupported Hardware billy club to make the problem crawl away and die.
There is a VMware product that auto sends them your logs which allegedly can save them time when trying to diagnose an issue.
vRealize may have some insight into the issues, but I'd assume that an SD or USB card issue would be obvious.
It looks like their change to the multi-partition layout concentrates reads/writes in a constrained area of those cards, preventing the cards' hardware from doing wear levelling and causing accelerating wear to the most frequently used locations.
Moving those frequent read/writes off card would be the best answer and looks like what they are proposing.
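For anyone who can't get boot off the card straight away, ESXi does let you point its logging and scratch area at persistent storage in the meantime. A hedged sketch — the datastore name and paths below are placeholders for your own environment, so check the options against your build before relying on them:

```shell
# Redirect the syslog output directory to a persistent datastore.
# "datastore1" and "esxi01" are placeholders for your environment.
esxcli system syslog config set --logdir=/vmfs/volumes/datastore1/logs/esxi01
esxcli system syslog reload

# Point the scratch location at the same datastore; takes effect after reboot.
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string \
    /vmfs/volumes/datastore1/.locker-esxi01
```

This doesn't move the bootbank reads off the card, so it reduces rather than eliminates the wear problem.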
Some kind of remote boot may be an answer, like going back to the future.
SSDs will last a lot longer. I had to put them in our two main hosts urgently a while back, as it was clear that the SD cards were all failing at once (two SD cards in each host for supposed redundancy).
The bigger question is why major server manufacturers ever thought it was a good idea to use SD cards as boot devices.
There is the added advantage that they now boot much faster too.
The entire point of vmware allowing SD cards/USB sticks for boot was to allow HP and IBM to ship systems with the vmware boot image on the internal sd card, prepped and ready to go. The *idea* was that admins would run a proper install once racked and cabled, but folks is lazy and tend to run off the sd card. Basically, getting around the 'you can't sell our software man' legal rider they had in place. Later, vmware would audit the site and find x unregistered, unlicensed installs, and could oracle all over the client.
Fishes, lures, hooks and gaffs.
(And yes, I found over 150 of those when we decided to actually hunt down the under-desk, internal-dev, "but it works for production" systems.)
I (ab)use SD/USB boot routinely on my machines now. Why? Because I pass-through the HBA and run ZFS on the actual drives.
Taking this away is just stupid. Sure, SD cards die if you write to them excessively. VMware shouldn’t be doing that, VMware are just too lazy to fix their crap. If you have GB of RAM to run VMs, there’s no reason it can’t carve out a few MB to make a RAM disk for the logs. This is exactly what SmartOS does!
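The RAM-disk idea isn't hypothetical — ESXi's own tooling can carve one out. A sketch, with the name, size, and paths as arbitrary examples rather than recommended values:

```shell
# Create a 256 MiB RAM disk and mount it at /tmp/ramlogs
# (name, sizes and target path here are arbitrary examples).
esxcli system visorfs ramdisk add --name ramlogs --min-size 0 \
    --max-size 256 --permissions 0755 --target /tmp/ramlogs

# Logs written here never touch the SD card - but they also vanish on
# reboot, so pair this with remote syslog if you need them to survive.
esxcli system syslog config set --logdir=/tmp/ramlogs
esxcli system syslog reload
```

The volatility is the trade-off SmartOS accepts too: you keep the flash alive at the cost of losing local logs on a crash.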
I remember back when ESXi first came out and everyone was touting SD card and USB drive booting. Servers started coming with internal(?) SD card slots and the like. The company I was at at the time deployed some using USB sticks, I think, and had failures pretty quickly (one within 4-6 months). At that point I realised I really didn't like the thought of the boot device for a $10-30k+ server relying on such a cheap piece of crap for a boot drive.
I looked a few times but could never find reviews or rankings of higher endurance usb drives/sd cards (perhaps that changed in recent years). HP (and probably others too) came out with a dual micro SD(?) USB stick at one point, I inherited 4 servers that ran that, and went through the associated recall (https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c05369827). Add to that as far as I could tell it was not possible to tell the status of the individual SD cards, you'd only know if both failed. Dell has a BOSS(?) card that sits in a PCI slot I think that uses NVMe drives, sounds pretty neat.
I realize it worked fine for many people for many years. Past 10 years all of my hosts have been fibre channel boot from SAN. Except my personal esxi hosts which use local SSD storage.
"Dell has a BOSS(?) card that sits in a PCI slot I think that uses NVMe drives, sounds pretty neat."
That is until it fails and then you need to open it up to replace the drives.
If it's an ESXi server that is not a vSAN node using all its drive bays, then a BOSS card probably shouldn't be used.
Use normal hot-swap SAS/SATA SSDs: no need to migrate your VMs, shut down the server, open it up and replace the drives — just pull the drive and put in a new one. It's also cheaper and doesn't use up a PCI slot that you may want for something else.
Does my head in... I've used USB to boot dozens of ESXi for 10+ years... I've never had a single problem. I even have 5 at home running ESXi 7 with zero issues!
Rather than dictating, they should simply advise, and note that it has limited support. This all stems from "two customers had problems, so we're going to screw you all over."
I may look to see if PXE is feasible... I ain't wasting any more money on storage.
We've been running ESXi 5.x, 6.x for close to 10 years w/ the internal SD card booting. NO issues.
But, on every machine, any tmp/temp files, all log files get redirected to external (Enterprise) storage where possible. This is done to reduce the exact problem they have (finally?) started thinking about: SD card writes.
Rather than making such an idiotic move (eliminating/not supporting SD boot), they should provide some guidance on how to reduce writes to the SD card.
Really VMWARE? Stop being lazy.
An alternative is to copy/replace the SD card every once-in-a-while (every year or two?) during a maintenance cycle.
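That periodic copy/replace can be done with plain dd from any Linux box. A rough sketch — the device names are placeholders, and getting them wrong with dd is catastrophic, so verify with lsblk first:

```shell
# Sketch: clone a worn ESXi boot SD card onto a fresh one from a Linux
# machine. /dev/sdX (old card) and /dev/sdY (new card) are placeholders;
# confirm the right devices with `lsblk` before running anything.
dd if=/dev/sdX of=/dev/sdY bs=4M conv=fsync status=progress
sync
```

A block-level clone preserves the bootbank partitions exactly, so the host shouldn't notice the swap.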
Although, I recently moved to SSD boot for RPi4s for the exact same reason, now that this capability is easier to set up and reversible; there is also no convenience penalty, like there is on an enterprise system.
apparently writes aren't the only issue
https://kb.vmware.com/s/article/2149257
"High frequency of read operations on VMware Tools image may cause SD card corruption (2149257)"
dates back to 6.0 and 6.5
other issues
https://kb.vmware.com/s/article/83376 Connection to the /bootbank partition intermittently breaks when you use USB or SD devices
(note applies to 6.7 too with no resolution available)
https://kb.vmware.com/s/article/83963 Bootbank cannot be found at path '/bootbank' errors being seen after upgrading to ESXi 7.0 U2
Probably others too; just three that I saw in a recent thread elsewhere.
Also vmware seems to be suggesting, perhaps requiring over 100GB of disk space for the boot disk, which probably factors into their decision to stop SD/USB support:
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.esxi.install.doc/GUID-DEB8086A-306B-4239-BF76-E354679202FC.html
* A local disk of 138 GB or larger. The disk contains the boot partition, ESX-OSData volume and a VMFS datastore.
* A device that supports the minimum of 128 Terabytes Written (TBW).
* A device that delivers at least 100 MB/s of sequential write speed.
* To provide resiliency in case of device failure, a RAID 1 mirrored device is recommended.
Not sure if there are any USB drives or SD cards that have 128 TBW lifespans and 100 MB/s sequential write speed.
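To put that 128 TBW floor in perspective, some back-of-the-envelope arithmetic. The write rate and the consumer-card rating below are illustrative assumptions, not measured figures for any specific product:

```python
# Back-of-the-envelope endurance check. All figures here are illustrative
# assumptions, not measured values for any specific product.

def years_to_wear_out(tbw: float, gb_written_per_day: float) -> float:
    """Years until a device rated for `tbw` terabytes written wears out
    at a sustained write rate of `gb_written_per_day` GB/day."""
    days = (tbw * 1000) / gb_written_per_day  # treating 1 TB as 1000 GB
    return days / 365

# VMware's stated minimum: 128 TBW. A consumer SD card, by contrast,
# might plausibly be rated an order of magnitude lower.
print(round(years_to_wear_out(128, 50), 1))  # 128 TBW at 50 GB/day of logs
print(round(years_to_wear_out(15, 50), 1))   # a hypothetical 15 TBW card
```

At the assumed 50 GB/day, the 128 TBW device lasts about seven years, while the hypothetical consumer card is gone in under one — which lines up with the "weeks to months" failures reported in this thread once 7.x increased the write rate further.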
I have yet to touch vsphere 7 myself, maybe next year.
Anon because my job is supporting ESXi for one of the Big Name OEMs....
"They wear out" is not the whole story. It's the easy cop-out answer, the "face saving" move if you will. From where I sit the problem is mainly that VMware doesn't want to put any effort into making their USB-storage driver reliable. This has very, VERY little to do with how much I/O you push or with the type or 'grade' of flash involved. It looks like their _driver_ chokes and then they blame the hardware.
On the flip side, as the hardware distributor we have no serious way to check or validate the health of the SD card subsystem. We can boot to a Linux live ISO and hammer it with 'dd'...usually works fine! But that doesn't really solve anything. The customer wants it to _work_, so saying "the hardware is fine" just sets up a finger-pointing blame game.
I don't think this is purely apathy... probably VMware Engineering got word from TPTB to focus on some new shiny and let USB/SD die. I think it's a good move long-term, but it was horribly communicated to everyone involved.
Anyone else have their data centres go unresponsive as their systems stopped talking to the Host's SD cards?
They're just scrapping that functionality now rather than dealing with it. As my datacentres have no installed storage other than the SD cards and the SANs, I have the empty drive bays to spare.
Was always a bit dubious when the salesperson recommended using an SD card in the first place though tbh.
How easy is it to migrate from SD card to a conventional drive anyway?
That would depend on whether you also got your servers with RAID controllers. If you are purchasing without drives, there's usually no reason to get a controller either.
Otherwise it's just a case of installing ESXi again on the host and adding it back to the cluster.