Linux kernel patch from Google speeds up server shutdowns

A new Linux kernel patch from a Google engineer resolves a problem caused by a condition that many of us might quite like to experience – having too many NVMe drives. The problem is caused by the relatively long time it takes to properly shut down a drive: apparently, as much as four-and-a-half seconds. Remember Sun's X4500 …
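To put the quoted figure in perspective: if each drive can take up to 4.5 seconds and the drives are shut down one after another, the waits add up linearly. This is a toy model, not the patch itself. The "overlapped" case is a hypothetical ideal in which every drive is told to shut down at once, so you only wait for the slowest one.

```python
# Worst-case per-drive shutdown time quoted in the article, in seconds.
PER_DRIVE_SHUTDOWN_S = 4.5


def serial_shutdown_s(drives: int, per_drive: float = PER_DRIVE_SHUTDOWN_S) -> float:
    """One drive at a time: every wait is added to the total."""
    return drives * per_drive


def overlapped_shutdown_s(drives: int, per_drive: float = PER_DRIVE_SHUTDOWN_S) -> float:
    """Hypothetical ideal: all drives shut down concurrently, so only the slowest counts."""
    return per_drive if drives > 0 else 0.0


if __name__ == "__main__":
    for n in (1, 8, 24):
        print(f"{n:>2} drives: serial {serial_shutdown_s(n):5.1f} s, "
              f"overlapped {overlapped_shutdown_s(n):.1f} s")
```

With an arbitrary 24 drives, the serial case already comes to 108 seconds per shutdown.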

  1. Jim Mitchell

    I don't quite understand how this article starts off talking about a kernel patch then segues into kexec and then into an Ubuntu package not working.

    1. VoiceOfTruth

      It's Linux

      So if you don't like the way this article is written, fork it and do it your own way.

      1. Jim Mitchell

        Re: It's Linux

        I can't find the El Reg GitHub repo.

      2. jake Silver badge

        Re: It's Linux

        It's not Linux, nor is it licensed like Linux.

        It's an article on ElReg, and copyright exists accordingly.

        1. Robert Carnegie Silver badge

          Re: It's Linux

          Comments are patchy :-)

          1. jake Silver badge

            Re: It's Linux

            Ever read the source for httpd?

            1. Paul Smith

              Re: It's Linux

              Isn't that illegal?

              1. khjohansen
                Devil

                Re: It's Linux

                Depends on your jurisdiction: invoking daemons is fine where I live

    2. DoContra
      Boffin

      There is a connecting thread between all three: faster reboots.

      - This patch speeds up reboots on machines with many physical block devices/NVMe drives

      - kexec speeds up reboots by skipping the BIOS (which tends to take a while on x86/x86-64 servers, at least)

      - kexec on some version of Ubuntu seems borked, according to the article's author

      I used kexec on my desktop machine for a while when I had to reboot. I stopped, partly because of Nvidia jank and partly because the time I was saving wasn't significant enough to justify not resetting the hardware.

      PS: When I used kexec I never used the script mentioned in the article/StackExchange question; I'd just load the new kernel (with kexec -l) and then reboot into it (with systemctl kexec, or kexec -e before I started using systemd). I know Ubuntu lets you configure the system so that the normal reboot machinery actually does a kexec (at package installation or via dpkg-reconfigure), but I never used that.
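      For reference, the two-step routine above looks roughly like this as a dry run. The /boot/vmlinuz and /boot/initrd.img paths are assumptions (Ubuntu usually provides them as symlinks to the newest kernel and initrd), and --reuse-cmdline keeps the running kernel's command line. The script only prints the commands, since actually running them needs root and replaces the running kernel.

```python
import shlex
from pathlib import PurePosixPath

# Assumed paths: Ubuntu usually ships these as symlinks to the newest kernel and
# initrd, but check /boot on your own system before relying on them.
KERNEL = PurePosixPath("/boot/vmlinuz")
INITRD = PurePosixPath("/boot/initrd.img")


def kexec_commands(kernel: PurePosixPath = KERNEL, initrd: PurePosixPath = INITRD,
                   use_systemd: bool = True) -> list[list[str]]:
    """Return the two steps: load the new kernel, then switch to it."""
    # --reuse-cmdline keeps the running kernel's command line for the new one.
    load = ["kexec", "-l", str(kernel), f"--initrd={initrd}", "--reuse-cmdline"]
    # `systemctl kexec` stops services first; bare `kexec -e` jumps straight
    # into the loaded kernel without a clean shutdown.
    execute = ["systemctl", "kexec"] if use_systemd else ["kexec", "-e"]
    return [load, execute]


if __name__ == "__main__":
    # Dry run only: executing these needs root and replaces the running kernel.
    for cmd in kexec_commands():
        print(shlex.join(cmd))
```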

  2. Nate Amsden

    HDDs too?

    Is this an issue with HDDs too? (I assume it would be.) Over the years I've never noticed anyone with lots of hard disks (assuming they are not abstracted by a RAID controller) complaining about slow reboots.

    1. SImon Hobson Silver badge

      Re: HDDs too?

      Yes, it'll affect HDs too. And yes, some people have complained about reboot times even though for many it's not an everyday occurrence.

      However, if you have a server farm with millions of servers, even an occasional reboot can consume a lot of time.

  3. chuckufarley Silver badge
    Joke

    Why even...

    ...let the kernel handle this stuff? Isn't this what systemd is for?

    1. Anonymous Coward
      Anonymous Coward

      Re: Why even...

      I was going to tell you to wash your mouth out with soap and water, then I noticed the joke icon....

    2. Stuart Castle Silver badge
      Joke

      Re: Why even...

      Yep. Systemd does pretty much everything: rolls over logs, runs various system maintenance tasks, washes the car, walks the dog..

      1. PC Paul

        Re: Why even...

        The real question, then: is Emacs going to implement systemd, or is systemd going to implement Emacs?

    3. Pirate Dave Silver badge
      Pint

      Re: Why even...

      You earned this =>

    4. anothercynic Silver badge

      Re: Why even...

      Ohhhh, you're funny! Are you here all week? :-)

  4. jake Silver badge

    Speed up shutdowns?

    And here I thought the object of the game was not to have any shutdowns at all... I'm perfectly happy waiting a minute or two on the very odd occasion that I need to shut down a server. I guess Alphagoo anticipates many line-stopper moments out there in their cloud. I'm sure all their clients are quite happy about this.

    1. Richard 12 Silver badge

      Re: Speed up shutdowns?

      Nope, the idea is to have the smallest possible amount of downtime.

      1 minute for a single server is trivial, nobody really cares.

      But 1 minute each for 100,000 servers... that's a lot of downtime.

      1. jake Silver badge

        Re: Speed up shutdowns?

        "But 1 minute each for 100,000 servers... that's a lot of downtime."

        I dunno 'bout you, but my servers don't shut down and reboot sequentially. They shut down and reboot in parallel, leaving just enough running to carry the load (if a trifle slowly sometimes). Downtime is already nil, at least in a properly run service.

        Example: My email system hasn't been down, ever, since the NCP to TCP/IP transition (so-called "flag day", 1st Jan, '83). It would have stayed up through even that, but I chose to bring it all back up from scratch.
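        The "leaving just enough running" approach boils down to batching: with N hosts and a minimum of `needed` that must stay up, at most N - needed can reboot at a time. A minimal sketch, assuming every host carries an equal share of the load; the host names and numbers are hypothetical.

```python
def rolling_restart_batches(hosts: list[str], needed: int) -> list[list[str]]:
    """Split hosts into reboot batches so that at least `needed` stay in service."""
    spare = len(hosts) - needed
    if spare < 1:
        raise ValueError("no spare capacity: any reboot would drop below `needed`")
    # Each slice is one batch; the next batch starts only after this one is back up.
    return [hosts[i:i + spare] for i in range(0, len(hosts), spare)]


if __name__ == "__main__":
    fleet = [f"mx{i:02d}" for i in range(10)]  # hypothetical mail hosts
    for n, batch in enumerate(rolling_restart_batches(fleet, needed=7), 1):
        print(f"batch {n}: {' '.join(batch)}")
```

        A real rollout would also wait for health checks between batches; that part is left out here.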

        1. Anonymous Coward
          Anonymous Coward

          Re: Speed up shutdowns?

          At work there are a lot of systems that need to come up sequentially for everything to work, should we ever need to shut down everything for whatever reason.

          Security monitoring, DNS, AD, LDAP, logging and so forth are booted one at a time before the rest can be brought up in parallel. Certain production systems consist of several servers where there's a startup sequence as well (DB first...UI last).

          The shutdown sequences follow the boot sequences backwards, although not as strictly.
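          A start/stop list like this is essentially a topological sort of a dependency graph: start a service once everything it needs is up, run independent services in parallel, and stop in reverse. A minimal sketch using Python's standard graphlib module; the service names and dependency edges are hypothetical, loosely following the order described in this thread (DNS before AD, AD before LDAP, DB before UI).

```python
from graphlib import TopologicalSorter

# Hypothetical map: each service lists what must be up before it can start.
DEPENDS_ON = {
    "dns": set(),
    "logging": set(),
    "security_monitoring": set(),
    "ad": {"dns"},
    "ldap": {"ad"},
    "db": {"ldap", "logging"},
    "app": {"db"},
    "ui": {"app"},
}


def boot_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group services into waves; everything in one wave can start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


def shutdown_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Reverse the boot waves, so dependants stop before what they rely on."""
    return list(reversed(boot_waves(deps)))


if __name__ == "__main__":
    for i, wave in enumerate(boot_waves(DEPENDS_ON), 1):
        print(f"boot wave {i}: {', '.join(wave)}")
    for i, wave in enumerate(shutdown_waves(DEPENDS_ON), 1):
        print(f"shutdown wave {i}: {', '.join(wave)}")
```

          Reversing whole waves is stricter than it needs to be: a service could stop as soon as everything depending on it is down, which fits the "not as strictly" remark.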

          1. jake Silver badge

            Re: Speed up shutdowns?

            If they are all that important, those systems should be on redundant hardware with auto-fallback, and preferably on redundant OSes and geographically diverse. Shutting one bit of hardware down should not be able to cripple the overall service. This isn't the 1970s.

            1. bazza Silver badge

              Re: Speed up shutdowns?

              There's nothing in the AC's post to suggest that those things are hosted on singular hardware.

              The order makes sense; there has to be some DNS up before AD will work. The AD has to be up at least somewhere before the LDAP can kick in (I'm making assumptions about their LDAP and what's hosting it). And in a zero trust setting, you'd want security and logging up from the get go.

              Zero Trust makes start up orders interesting.

              1. PC Paul

                Re: Speed up shutdowns?

                I read Jake's post as saying that there should never ever be a time when DNS, logging, AD etc is not running. One or more of the servers hosting them may be down right now but the _service_ is still running on the other redundant servers hosting it.

                1. This post has been deleted by its author

                2. Anonymous Coward
                  Anonymous Coward

                  Re: Speed up shutdowns?

                  ...there should never ever be a time...

                  Oh, I read it like that as well.

                  It would be ideal to have every service running eternally. Just like in an ideal world you never have to resort to backups, hardware doesn't break and no-one forgets to fill the generator diesel tank.

                  But like any sane company we prepare for DR, so we have a start/stop list that is updated whenever servers or routers/appliances are added or removed.

                  Designing for nonstop operations with multiple DC sites sounds great but it comes with an added cost. We're also running air-gapped production networks (with their own AD, DNS and such services), so can't just chuck it all to Azure in case of a disaster.

        2. Tom 38

          Re: Speed up shutdowns?

          I dunno 'bout you, but my servers don't shut down and reboot sequentially. They shut down and reboot in parallel, leaving just enough running to carry the load (if a trifle slowly sometimes). Downtime is already nil, at least in a properly run service.

          Imagine you rent out your server's compute capacity by the second, and you need to reboot 100,000 servers. You don't shut down all your servers in parallel because there would not be enough capacity to relocate workloads, so you do a rolling restart.

          You don't overly care about the wall time it takes to shut down and restart all these servers, but you absolutely care about the total number of server-hours of downtime that you experience. After all, if each server is down for 2 minutes, that's ~139 server-days of downtime during which you can't rent out a server. If you can reduce that to 15 seconds per server, you can potentially rent out 121 more server-days of compute.
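          The server-day figures check out. Here is the arithmetic as a quick sketch, counting downtime as the sum of each server's unavailable time, with the 100,000 servers, 2 minutes and 15 seconds used above.

```python
from datetime import timedelta

SECONDS_PER_DAY = 86_400


def fleet_downtime_days(servers: int, per_server: timedelta) -> float:
    """Total server-days lost when every server is down for `per_server` once."""
    return servers * per_server.total_seconds() / SECONDS_PER_DAY


if __name__ == "__main__":
    fleet = 100_000
    before = fleet_downtime_days(fleet, timedelta(minutes=2))
    after = fleet_downtime_days(fleet, timedelta(seconds=15))
    print(f"2 min each: {before:.1f} server-days")
    print(f"15 s each:  {after:.1f} server-days")
    print(f"recovered:  {before - after:.1f} server-days")
```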

          1. Ace2 Silver badge

            Re: Speed up shutdowns?

            Also your local substation would go up in flames if you tried to power on 100,000 servers simultaneously.

            I noticed yesterday that HPE iLO has an option to add a random power-on delay. I’d seen delay settings for HDDs (we always staggered them) but not for whole servers.
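            The random power-on delay idea is easy to sketch: give each server a random start delay within a window, and the inrush is spread out instead of landing in the same instant. This is a toy model with hypothetical numbers (100,000 servers, a 300-second window, whole-second buckets), not how iLO actually implements the option.

```python
import random
from collections import Counter


def staggered_power_on(servers: int, window_s: int, seed: int = 0) -> list[int]:
    """Give each server a random whole-second power-on delay in [0, window_s)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    return [rng.randrange(window_s) for _ in range(servers)]


def peak_starts_per_second(delays: list[int]) -> int:
    """Largest number of servers switching on in the same second."""
    return max(Counter(delays).values())


if __name__ == "__main__":
    n = 100_000
    print("no delay:       ", peak_starts_per_second([0] * n))
    print("random 0-299 s: ", peak_starts_per_second(staggered_power_on(n, 300)))
```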

      2. Bill Michaelson

        Re: Speed up shutdowns?

        Downtime could be defined as the period during which production stops, not the period during which particular systems reboot. Reboot all day with zero downtime.

        1. Richard 12 Silver badge

          Re: Speed up shutdowns?

          When you're renting out your servers, downtime is defined as the sum of all time when each individual server can't be rented.

          It's not a house or office, it's a hotel.

          If 10 hotel rooms aren't available tonight, that's 10 nights of lost (potential) revenue.

          The hotel is still open, of course.

          Same with server billable seconds.

  5. Gene Cash Silver badge

    I've figured out even quicker shutdowns

    Just pull the plug.

    I'll be submitting a pull request tonight.

    1. Boris the Cockroach Silver badge

      Re: I've figured out even quicker shutdowns

      So you're the cleaner who keeps doing that.....

      1. TeeCee Gold badge

        Re: I've figured out even quicker shutdowns

        If you stop the cleaner doing that, things can get worse.

        Finding "Do not remove this plug" labels on everything around the room, the cleaner hunted around for a bit and found some unused sockets handily placed at waist height behind a small door. The cleaner whacked in the Numatic "Henry" and switched it on.

        The clean PSU handling the cabinet in question comprehensively shat itself on the spot, taking down the comms rack and causing a Europe-wide outage.

        If you really want to know how resilient your systems are, just let the cleaning staff work unsupervised.

        1. Alan Brown Silver badge

          Re: I've figured out even quicker shutdowns

          In such environments you label the sockets the cleaners MAY use, and make sure they're kept unobstructed.

          It's far safer that way

      2. bombastic bob Silver badge
        Unhappy

        Re: I've figured out even quicker shutdowns

        at least he didn't plug the vacuum cleaner into the UPS...

  6. bldrco

    Heh, I remember solving this 8 years ago... https://lkml.org/lkml/2014/5/8/512

    1. waldo kitty
      Childcatcher

      Heh, I remember solving this 8 years ago... https://lkml.org/lkml/2014/5/8/512

      interestingly enough, that original patch author responded in the latest thread about this ;)

      https://lore.kernel.org/lkml/YkO7d7Eel4BVQOy4@kbusch-mbp.dhcp.thefacebook.com/
