Cloudflare says it has automated empathy to avoid fixing flaky hardware too often

Cloudflare has revealed a little about how it maintains the millions of boxes it operates around the world – including the concept of an "error budget" that enacts "empathy embedded in automation." In a Tuesday post titled "Autonomous hardware diagnostics and recovery at scale," the internet-taming biz explains that it built …
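The "error budget" idea is easy to picture in code. A minimal sketch in Python, assuming (my reading of the article, not Cloudflare's actual implementation) that a budget means tolerating a few automated recoveries per node per window before escalating to a human:

```python
import time
from collections import defaultdict, deque

WINDOW = 7 * 24 * 3600  # look-back window: one week (illustrative)
BUDGET = 3              # automated recoveries tolerated per node per window

recoveries: dict[str, deque] = defaultdict(deque)  # node -> recovery timestamps

def record_recovery(node: str, now: float | None = None) -> str:
    """Log an automated recovery and decide whether the node has spent its budget."""
    now = now if now is not None else time.time()
    events = recoveries[node]
    events.append(now)
    while events and events[0] < now - WINDOW:
        events.popleft()  # drop events that have aged out of the window
    if len(events) > BUDGET:
        return "escalate"     # flaky hardware: stop auto-fixing, page a human
    return "auto-recover"     # within budget: let the automation retry
```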

  1. Lee D Silver badge

    Sounds pretty basic and obvious to be honest.

    3 days seems a long time to notice a node is broken.

    Booting into a recovery/rescue/testing mode isn't ground-breaking.

    Keeping it turned off until it's scheduled for repair is basic automation, isn't it?

    Every day I'm surprised by just how "un-clever" all these large scale systems are, especially when it comes to the basic back-end operations.

    1. samzeman

      If this works flawlessly, though, it would be the best system I have ever worked with.

      Manual versions of all of this are the second best.

      Automated solutions have never been anything but more trouble to me.

    2. Kevin McMurtrie Silver badge

      There's a large number of people in the workforce who only know cloud computing. Processes for tracking host health are old school at this point.

      A common practice is for every client to log its state into a database and also check what its state should be. If a server misses too many check-ins, you know something is wrong. If a server awakens from a long coma, it can see that it is out of date and should shut back down.
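      In rough Python, that check-in pattern looks something like this (SQLite as a stand-in state store; the schema, thresholds, and generation counter are all invented for illustration):

```python
import sqlite3
import time

STALE_AFTER = 180  # seconds without a check-in before a host is flagged (arbitrary)

db = sqlite3.connect("fleet.db")
db.execute("""CREATE TABLE IF NOT EXISTS hosts (
    name        TEXT PRIMARY KEY,
    last_seen   REAL,        -- unix timestamp of the last check-in
    config_gen  INTEGER,     -- generation this host last applied
    desired_gen INTEGER)""") -- generation the fleet should be running

def check_in(name: str, my_gen: int) -> bool:
    """Record a heartbeat; return False if this host should shut back down."""
    db.execute("""INSERT INTO hosts (name, last_seen, config_gen, desired_gen)
                  VALUES (?, ?, ?, ?)
                  ON CONFLICT(name) DO UPDATE
                  SET last_seen = excluded.last_seen,
                      config_gen = excluded.config_gen""",
               (name, time.time(), my_gen, my_gen))
    db.commit()
    desired = db.execute("SELECT desired_gen FROM hosts WHERE name = ?",
                         (name,)).fetchone()[0]
    # A server waking from a long coma sees it is generations behind and bows out.
    return my_gen >= desired

def missing_hosts() -> list[str]:
    """Hosts that have stopped checking in -- candidates for recovery."""
    cutoff = time.time() - STALE_AFTER
    return [row[0] for row in
            db.execute("SELECT name FROM hosts WHERE last_seen < ?", (cutoff,))]
```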

  2. ChoHag Silver badge

    Have they reinvented Nagios again?

  3. Anonymous Coward

    I like Cloudflare so I don't want to dis them, but they didn't spot the single fibre between racks in one DC that caused a massive, near-global outage. They need a tool which can count the number of diverse links between multiple objects; now that would be innovative!
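    For what it's worth, counting diverse links is the classic edge-connectivity problem on the link graph, so such a tool is not far off. A toy sketch in Python with networkx (the topology and the alarm threshold are made up):

```python
import networkx as nx

# Toy topology: nodes are racks and aggregation switches, edges are physical links.
G = nx.Graph()
G.add_edges_from([
    ("rack-a", "agg-1"), ("rack-a", "agg-2"),  # rack-a has two diverse uplinks
    ("rack-b", "agg-1"),                       # rack-b hangs off a single fibre
    ("agg-1", "core"), ("agg-2", "core"),
])

# Edge connectivity between two nodes = number of link-disjoint paths between them.
# Anything with a value of 1 is a single point of failure.
for node in G.nodes:
    if node == "core":
        continue
    k = nx.edge_connectivity(G, node, "core")
    if k < 2:
        print(f"WARNING: {node} has only {k} disjoint path(s) to core")
```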

  4. John Smith 19 Gold badge
    Coat

    Goes back to Bell System ESS1

    Where "Auditor" programs swept the exchanges memory looking for corrupted data structures (which was how it kept track of the calls and services) and returned them to the free storage pool, possibly terminating a call in the process.

    It appears Cloudflare have generalised this to their whole data centres. LSI used to supply large drive arrays with fail-soft capabilities at the turn of the century.

    Honestly I don't get why, in the 3rd decade of the 21st century, most (all?) major infrastructure suppliers haven't automated their processes.

    All routine tasks should be at least semi-automated by now using one or other available tools.

    Experienced, skilled staff are the major asset of such businesses.

    Mine's the one with a copy of Glenford Myers' "Software Reliability" in the (oversize) side pocket.

    1. heyrick Silver badge

      Re: Goes back to Bell System ESS1

      "Experienced, skilled staff are the major asset of such businesses."

      Unfortunately it is often the beancounters who get to call the shots. The more automation there is, the fewer experienced, skilled staff need to be kept on the payroll.

  5. Bitsminer

    LSI used to supply large drive arrays with fail-soft capabilities...

    The user interface was some Java-based app.

    Whenever a (SATA) drive had some little hiccup, it was switched to "offline" mode.

    To bring it back online, no kidding, you pressed the "Resurrect" button.

    I wonder what their Buddhist customers thought?

  6. Stuart Castle

    Interesting article, but I'd be surprised if they didn't already run something like this. Monitoring systems are not new, even in smaller deployments. I doubt Cloudflare employs so many engineers that they can afford to have one out of action for several hours tracking down a dead server. I also thought this sort of thing was a major part of the reason IPMI was invented?
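    For anyone who hasn't met it, IPMI is exactly that out-of-band view: the BMC has its own network interface and answers even when the host OS is dead. A minimal sketch in Python shelling out to the stock ipmitool client (hostname and credentials are placeholders):

```python
import subprocess

def ipmi(host: str, user: str, password: str, *args: str) -> str:
    """Run an ipmitool command against a server's BMC over the network."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Power state and the hardware event log, readable with the host OS down.
print(ipmi("bmc.example.net", "admin", "secret", "chassis", "power", "status"))
print(ipmi("bmc.example.net", "admin", "secret", "sel", "list"))
```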

  7. Mike 16

    Some servers are more equal?

    I first read this to be a selection criterion for "Who do we probe today":

    "probes up to two datacenters known to house broken boxen."

    So, why are only housebroken boxen privy to the list? :-)

    I've wondered why I sometimes get served by a yet to be potty-trained server.

  8. CapeCarl

    "Shut down all the garbage mashers on the detention level!"

    I spent almost a decade working in the primary data center for a NYC-based HFT firm...About 5K servers (80/20 cattle vs pets)...Keeping at least 98% of the nodes of a large research/training cluster (about 4,000 servers) online was the minimum goal...Versus a max of 5 on-site employees (tasked with "pet" management, monitoring, server builds, internal customer support and misc projects).

    Eventually an automated tool (Python and shell based) was developed to help with the basic stuff. Over time we would add more tests to this originally simple "bot" as we noticed patterns or issues with a given server model...For example, it turned out that one model of 2U/4-node server would throttle back its CPUs if a power supply failed, and said model had rather more PSU failures than expected for a new beasty...So two things: 1) figure out why we couldn't detect said system's PSU failures automatically, and 2) upon every reboot of a server, run a quick CPU benchmark, compare the result against the expected figure for that type of Xeon/EPYC, and add it to the "Hey humans: check this out" list if it was subpar.
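    That reboot-time check boils down to a few lines. A stripped-down sketch in Python (the score table, threshold, and hostnames are invented for illustration):

```python
# Expected benchmark scores per CPU model -- invented numbers, for illustration.
EXPECTED_SCORE = {"Xeon Gold 6148": 1000.0, "EPYC 7502": 1400.0}
ALARM_RATIO = 0.85  # flag anything scoring under 85% of par

def check_cpu(host: str, cpu_model: str, score: float) -> None:
    """Run after every reboot: compare a quick benchmark result to par for the CPU."""
    par = EXPECTED_SCORE.get(cpu_model)
    if par is None:
        return  # unknown model: nothing to compare against
    if score < par * ALARM_RATIO:
        # e.g. CPUs throttled because one PSU in a 2U/4-node chassis died silently
        print(f"Hey humans: check this out -- {host} ({cpu_model}) "
              f"scored {score:.0f}, expected ~{par:.0f}")

check_cpu("node-17", "Xeon Gold 6148", 612.0)  # lands on the review list
```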

    Not rocket science...A major productivity enhancer.

    1. John Smith 19 Gold badge
      Unhappy

      Not rocket science...A major productivity enhancer.

      No. That's the depressing part about this.

      It's the failure to recognise that a small bit of time spent setting up some automation could save X hours a week, for the rest of the life of the equipment.

      Depending on the system it might not even mean buying in a product to do it; either a download or the tools already on site can give some leverage.

      Just sad in the 3rd decade of the 21st century.

  9. Anonymous Coward

    Cloudflare’s backend has had an outage all afternoon. Check cloudflarestatus.com; it is a sight to behold.

  10. Toni the terrible

    Cloudflare

    I am ambivalent about Cloudflare. I gave up using a subscription website (Crunchyroll) in the USA because Cloudflare locked me out too many times for too long, and I had problems with Just Eat in the UK for a while. Never found out why.
