I don't quite understand how this article starts off talking about a kernel patch then segues into kexec and then into an Ubuntu package not working.
Linux kernel patch from Google speeds up server shutdowns
A new Linux kernel patch from a Google engineer resolves a problem caused by a condition that many of us might quite like to experience – having too many NVMe drives. The problem is caused by the relatively long time it takes to properly shut down a drive: apparently, as much as four-and-a-half seconds. Remember Sun's X4500 …
COMMENTS
-
-
Wednesday 30th March 2022 13:52 GMT DoContra
There is a connecting thread between all three which is faster reboots:
- This patch speeds up reboots on machines with many physical block devices/NVMe drives
- kexec speeds up reboots by skipping the BIOS (which tends to take a while on x86/x86-64 servers at least)
- kexec on some version of Ubuntu seems borked, according to the article's author
I used kexec on my desktop machine for a while when I had to reboot; I stopped in part thanks to nvidia jank, and in part because the time I was saving wasn't significant enough to justify not resetting the hardware.
PS: I never used the script mentioned in the article/StackExchange question when I used kexec; I'd just load the new kernel (with kexec -l) and then reboot the system (systemctl kexec, or kexec -e before I started using systemd). I know Ubuntu lets you configure the system so that the normal reboot machinery actually does a kexec (on package installation/dpkg-reconfigure), but I never used that.
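A minimal sketch of that two-step sequence (stage the kernel, then jump into it), written as a dry run that only prints the commands. The kernel and initrd paths are placeholders, and the `--initrd` and `--reuse-cmdline` options are additions for illustration; the post itself only mentions `kexec -l`, `kexec -e` and `systemctl kexec`.

```python
import shlex
import subprocess


def kexec_commands(kernel, initrd, use_systemd=True):
    """Build the two steps: stage the new kernel, then boot into it."""
    # --initrd and --reuse-cmdline are assumptions, not taken from the post.
    load = ["kexec", "-l", kernel, f"--initrd={initrd}", "--reuse-cmdline"]
    # systemctl kexec goes through the normal shutdown path first;
    # kexec -e jumps into the loaded kernel directly.
    jump = ["systemctl", "kexec"] if use_systemd else ["kexec", "-e"]
    return [load, jump]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # needs root and reboots the box


# Placeholder paths; adjust to whatever the distribution actually installs.
run(kexec_commands("/boot/vmlinuz", "/boot/initrd.img"))
```

Left at `dry_run=True`, nothing is executed, so the sketch is safe to try on any machine.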
-
-
Tuesday 29th March 2022 20:21 GMT jake
Speed up shutdowns?
And here I thought the object of the game was not to have any shutdowns at all ... I'm perfectly happy waiting a minute or two on the very odd occasion that I need to shut down a server. I guess Alphagoo anticipates many line-stopper moments out there in their cloud. I'm sure all their clients are quite happy about this.
-
-
Tuesday 29th March 2022 22:09 GMT jake
Re: Speed up shutdowns?
"But 1 minute each for 100,000 servers... that's a lot of downtime."
I dunno 'bout you, but my servers don't shut down and reboot sequentially. They shut down and reboot in parallel, leaving just enough running to carry the load (if a trifle slowly sometimes). Downtime is already nil, at least in a properly run service.
Example: My email system hasn't been down, ever, since the NCP to TCP/IP transition (so-called "flag day", 1st Jan, '83). It would have stayed up through even that, but I chose to bring it all back up from scratch.
-
Tuesday 29th March 2022 22:22 GMT Anonymous Coward
Re: Speed up shutdowns?
At work there are a lot of systems that need to come up sequentially for everything to work - should we ever need to shut down everything for whatever reason.
Security monitoring, DNS, AD, LDAP, logging and so forth are booted one at a time before the rest can be brought up in parallel. Certain production systems consist of several servers where there's a startup sequence as well (DB first...UI last).
The shut down sequences follow the boot sequences backwards, although not as strictly.
-
-
Tuesday 29th March 2022 23:34 GMT bazza
Re: Speed up shutdowns?
There's nothing in the AC's post to suggest that those things are hosted on singular hardware.
The order makes sense; there has to be some DNS up before AD will work. The AD has to be up at least somewhere before the LDAP can kick in (I'm making assumptions about their LDAP and what's hosting it). And in a zero trust setting, you'd want security and logging up from the get-go.
Zero Trust makes start up orders interesting.
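The ordering discussed here can be sketched as a dependency graph, with the shutdown order as the reverse of the start order. The edges below are assumptions pieced together from the two posts (security and logging first, DNS before AD, AD before LDAP, DB before UI); the service names are placeholders, not anyone's real inventory.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical map: each service lists what must already be up before it.
DEPENDS_ON = {
    "security-monitoring": set(),
    "logging": set(),
    "dns": {"security-monitoring", "logging"},
    "ad": {"dns"},
    "ldap": {"ad"},
    "db": {"ldap"},
    "ui": {"db"},
}


def boot_order(depends_on):
    """Return one valid start order; raises CycleError on circular deps."""
    return list(TopologicalSorter(depends_on).static_order())


start = boot_order(DEPENDS_ON)
stop = list(reversed(start))  # shut down in the opposite order

print("start:", " -> ".join(start))
print("stop: ", " -> ".join(stop))
```

Services with no dependency between them (security monitoring and logging here) may come out in either order, which matches the idea that only some steps are strictly sequential.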
-
-
Wednesday 30th March 2022 21:42 GMT Anonymous Coward
Re: Speed up shutdowns?
...there should never ever be a time...
Oh, I read it like that as well.
It would be ideal to have every service running eternally. Just like in an ideal world you never have to resort to backups, hardware doesn't break and no-one forgets to fill the generator diesel tank.
But any sane company prepares for DR, so we have a start/stop list that is updated whenever servers or routers/appliances are added or removed.
Designing for nonstop operations with multiple DC sites sounds great but it comes with an added cost. We're also running air-gapped production networks (with their own AD, DNS and such services), so can't just chuck it all to Azure in case of a disaster.
-
Wednesday 30th March 2022 09:09 GMT Tom 38
Re: Speed up shutdowns?
"I dunno 'bout you, but my servers don't shut down and reboot sequentially. They shut down and reboot in parallel, leaving just enough running to carry the load (if a trifle slowly sometimes). Downtime is already nil, at least in a properly run service."
Imagine you rent out your server's compute capacity by the second, and you need to reboot 100,000 servers. You don't shut down all your servers in parallel because there would not be enough capacity to relocate workloads, so you do a rolling restart.
You don't overly care about the wall time it takes to shut down and restart all these servers, but you absolutely care about the total number of server-hours of downtime that you experience. After all, if each server is down for 2 minutes, that's ~139 server-days of capacity that you can't rent out. If you can reduce that to 15 seconds per server, you can potentially rent out 121 server-days more compute.
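For anyone who wants to check those figures, a quick back-of-the-envelope sketch; the only inputs are the numbers from the post (100,000 servers, 2 minutes versus 15 seconds each), and the function name is made up for illustration.

```python
SERVERS = 100_000
SECONDS_PER_DAY = 24 * 60 * 60


def server_days_lost(servers, seconds_down_each):
    """Total downtime summed over all servers, expressed in server-days."""
    return servers * seconds_down_each / SECONDS_PER_DAY


slow = server_days_lost(SERVERS, 120)  # 2 minutes per server
fast = server_days_lost(SERVERS, 15)   # 15 seconds per server

print(f"2 min each: {slow:.1f} server-days")         # 138.9
print(f"15 s each:  {fast:.1f} server-days")         # 17.4
print(f"recovered:  {slow - fast:.1f} server-days")  # 121.5
```

The result does not depend on whether the restart is rolling or parallel: the sum of per-server downtime is the same either way, only the wall time changes.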
-
Wednesday 30th March 2022 13:34 GMT Ace2
Re: Speed up shutdowns?
Also your local substation would go up in flames if you tried to power on 100,000 servers simultaneously.
I noticed yesterday that HPE iLO has an option to add a random power-on delay. I’d seen delay settings for HDDs (we always staggered them) but not for whole servers.
-
Thursday 31st March 2022 13:32 GMT Richard 12
Re: Speed up shutdowns?
When you're renting out your servers, downtime is defined as the sum of all time when each individual server can't be rented.
It's not a house or office, it's a hotel.
If 10 hotel rooms aren't available tonight, that's 10 nights of lost (potential) revenue.
The hotel is still open, of course.
Same with server billable seconds.
-
Wednesday 30th March 2022 12:49 GMT TeeCee
Re: I've figured out even quicker shutdowns
If you stop the cleaner doing that, things can get worse.
Finding "Do not remove this plug" labels on everything around the room, the cleaner hunted around for a bit and found some unused sockets handily placed at waist height behind a small door. The cleaner whacked in the Numatic "Henry" and switched it on.
The clean PSU handling the cabinet in question comprehensively shat itself on the spot, taking down the comms rack and causing a Europe-wide outage.
If you really want to know how resilient your systems are, just let the cleaning staff work unsupervised.
-
-