Cloudflare explains how it managed to break the internet

A large chunk of the web (including your own Vulture Central) fell off the internet this morning as content delivery network Cloudflare suffered a self-inflicted outage. The incident began at 0627 UTC (2327 Pacific Time) and it took until 0742 UTC (0042 Pacific) before the company managed to bring all its datacenters back …

  1. Ben Tasker

    > This morning was a wake-up call for the price we pay for over-reliance on big cloud providers. It is completely unsustainable for an outage at one provider to be able to bring vast swathes of the internet offline.

    Multi-CDN is relatively easy to set up nowadays, and isn't even that expensive.

    Unfortunately, if you want to use Cloudflare then you need to have your DNS with them - at least, unless you're willing to pay $200/month for their business tier in order to unlock support for CNAMEing to them rather than giving them control of your zone.

    There's not a *lot* of point in setting up multi-CDN if your authoritatives are tied to one of the providers that you're trying to mitigate against.

    It's a business choice by Cloudflare, but it's part of the reason that outages there are so severe. If something happens to CloudFront, Akamai, Fastly etc. then there's the option to flip traffic away from them (it can even be done automatically) and serve via a different CDN until things settle down.

    It's a core part of why I neither use nor recommend Cloudflare: they might be huge, but they're still a single basket and mistakes happen. Not having an option to move traffic away during longer outages isn't really acceptable.
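    To make the multi-CDN point concrete, here's a minimal sketch of the kind of health-check-and-repoint logic involved. The hostnames and the update_cname() call are hypothetical placeholders; in practice you'd use your DNS provider's API or a traffic-steering service such as Cedexis:

    ```python
    # Rough sketch only: probe each CDN edge hostname and pick a healthy one
    # to point the site's CNAME at. The hostnames and update_cname() are
    # hypothetical placeholders, not any real provider's API.
    import urllib.request
    import urllib.error

    CDN_TARGETS = [
        "site.cdn-a.example.net",  # hypothetical primary CDN edge hostname
        "site.cdn-b.example.net",  # hypothetical secondary CDN edge hostname
    ]

    def cdn_is_healthy(hostname: str, timeout: float = 3.0) -> bool:
        """Return True if the CDN edge answers a simple HTTPS probe."""
        try:
            with urllib.request.urlopen(f"https://{hostname}/health", timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def update_cname(target: str) -> None:
        """Placeholder: a real version would call the DNS provider's API
        (or a traffic-steering service) to repoint the site's CNAME."""
        print(f"would point the www CNAME at {target}")

    def failover() -> None:
        # Walk the candidate CDNs in preference order; repoint at the first healthy one.
        for target in CDN_TARGETS:
            if cdn_is_healthy(target):
                update_cname(target)
                return
        print("no healthy CDN target found; leaving DNS untouched")

    if __name__ == "__main__":
        failover()
    ```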

    1. Michael Wojcik Silver badge

      I have mixed feelings about Cloudflare, but they are generally quite good about explaining what went wrong. They also publish a lot of good technical content in general.

      Mark Boost, on the other hand, sounds like a spoiled brat. "Everything isn't perfect! My gratification isn't immediate! How dare you!"

      I've been using the public Internet since a few years after Flag Day, and I've managed to avoid panicking when I can't "access the online services that are part of the fabric of all our lives". Sometimes there are, y'know, network interruptions. Or power interruptions. Grow the fuck up, Mark.

  2. KOST

    "Automation: There are several opportunities in our automation suite that would mitigate some or all of the impact seen from this event."

    Translation: Dave in QA's just lost his job to a robot

    1. Anonymous Coward
      Anonymous Coward

      Dave in QA

      Got promoted and is working 60 hours wrangling the farm of bots doing automated regression testing. Being able to automate more just means being able to cover slightly more of your test cases. This is one of those cases where adding automation just adds another layer for the humans to manage. The laws governing math and primary logic make automating all of your automation problematic, so unless your company was the rare unicorn that actually had complete case coverage, it just makes it possible to do more of the work you weren't getting to with each release.

      Sure, slacker outfits might be able to squeeze headcount a little with better tools, but Cloudflare has to put on its big boy pants every day or it will blow up the internet (again). I do not want that job.

      1. Eclectic Man Silver badge

        Re: Dave in QA

        E. M. Forster got there a while ago (published in 1909):

        'The Machine Stops.' https://www.bbc.co.uk/sounds/play/m0018fs6

        (Note: login required to listen.)

        1. 1752

          Re: Dave in QA

          Someone (you?) linked a text version the other day; I read it. Recommended.

          1. Eclectic Man Silver badge

            Re: Dave in QA

            The BBC Radio adaptation differs from what I recall of the original story, but is well done, and Tamsin Greig is always good. Apparently Forster wrote 'The Machine Stops' as a response to H. G. Wells's claim that new technology would make the world wonderful. (Big debate on that one, I suspect.) One of the only things on which I actually agree with F. Nietzsche is his claim, in the late 19th century, that humanity was treating nature with contempt.* (However, I totally disagree with his approval of genocide and the subjugation of women as stated in the same book.)

            * 'On the Genealogy of Morals'

        2. PRR Silver badge
          Mushroom

          Re: Dave in QA

          > E. M. Forster ....1909: 'The Machine Stops.' https://www.bbc.co.uk/sounds/play/m0018fs6 (Note: login required to listen.)

          Free version: here

          Free PDF: here

          Free AudioBook: here

          Wiki explanation: here

          Literary analysis: here

          MAD Magazine(!) parody(?) from MAD #1(!) 1952: here (open this full-screen, it is worth it) (and skip to slideshow page 4.)

          There is a Keith Laumer short (Cocoon, 1962) which echoes this plot, but with a Laumer-esque protagonist.

      2. CrazyOldCatMan Silver badge

        Re: Dave in QA

        make automating all of your automation problematic

        Automation is fine for the regular run-of-the-mill stuff (do X and you get Y, feed Y to Z and get expected result). What it's really, really not good at is edge cases - and in a lot of cases it can make the issue worse.

        And over-reliance on automation also makes techie skills atrophy - there's a reason why 'practice makes perfect' is a reasonable saying. If your techies[1] don't know how the guts of things work (and not just from a 1-day course on the infrastructure, but from an 'I herd this stuff all day' level of familiarity) then any root-cause analysis is going to take longer, because the people responsible for fixing the mess are not as intimately acquainted with the guts of the system and have to explore as well as fix.

        [1] And not just the 'hero' types but the lower levels too. Having only one or two staff who know how everything works is a real risk in itself. Especially if (as most hero types are) they are spectacularly averse to documentation...

    2. Snake Silver badge

      Re: automation

      This is why I intentionally keep my 4 disparate home IoT automation systems independent, regardless of what internet forum PFYs/BOFHs tell you to do otherwise. My 2 separate heating systems back one another up rather than creating a single point of failure.

  3. Mike 137 Silver badge

    More than just their users affected

    DNS lookup for our web sites and email failed to resolve while this was going on, despite the hosting service we use not relying on Cloudflare. But the widely used 1.1.1.1 DNS is hosted by them, so that could at least partially explain the extent of the problem.

    1. alain williams Silver badge

      Re: More than just their users affected

      An important idea within DNS is to have multiple servers, so you should have had slave servers with other providers. 8.8.8.8 would do nicely, or even run a few of your own; it is not hard.
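      As a rough illustration of the multiple-resolvers point, here's a minimal sketch that asks several public resolvers for the same record, so a wobble at 1.1.1.1 is easy to spot. It assumes the third-party dnspython package, and example.com is just a placeholder name:

      ```python
      # Minimal sketch: query the same name against several public resolvers,
      # so an outage at any one of them (e.g. 1.1.1.1) doesn't leave you blind.
      # Requires the third-party dnspython package: pip install dnspython
      import dns.exception
      import dns.resolver

      RESOLVERS = {
          "Cloudflare": "1.1.1.1",
          "Google": "8.8.8.8",
          "Quad9": "9.9.9.9",
      }

      def lookup(name, server):
          """Resolve an A record via one specific resolver; empty list on failure."""
          resolver = dns.resolver.Resolver(configure=False)
          resolver.nameservers = [server]
          try:
              answer = resolver.resolve(name, "A", lifetime=3)
              return sorted(rdata.address for rdata in answer)
          except dns.exception.DNSException:
              return []

      if __name__ == "__main__":
          for label, server in RESOLVERS.items():
              ips = lookup("example.com", server)
              print(f"{label} ({server}): {', '.join(ips) if ips else 'no answer'}")
      ```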

      1. Lockwood

        Re: More than just their users affected

        I have a pihole with unbound - I used to use 8.8.8.8, then 1.1.1.1

        Not the ideal solution for everyone, but it has some advantages

        1. Hubert Cumberdale Silver badge

          Re: More than just their users affected

          I like NextDNS. It's also the last layer of defence in my DNS-based ad blocking.

  4. TRT Silver badge

    The more they overthink the plumbing,

    the easier it is to stop up the drain.

    1. David 132 Silver badge
      Thumb Up

      Re: The more they overthink the plumbing,

      Automatic upvote for any Scotty quote.

      And it's closely followed in the movie by one of my favourite Bones/Kirk exchanges:

      "Nice of you to tell me in advance."

      "That's what you get for missing Staff meetings, Doctor..."

  5. Coastal cutie
    Facepalm

    "One can imagine the panic at Cloudflare towers, although we cannot imagine a controlled process that resulted in a scenario where 'network engineers walked over each other's changes.'"

    Umm, left hand, right hand, have you been introduced? And have you been taught how to tell your derriere from your elbow?

    1. CrazyOldCatMan Silver badge

      Umm, left hand, right hand, have you been introduced?

      It all shows a pretty bad case of immature change control. JDI is *not* a valid change methodology.

      (I suspect that, in all their testing, they had not tested the effects of new and old topologies mixing.)

  6. dvvdvv

    Indeed

    "we cannot imagine a controlled process that resulted in a scenario where 'network engineers walked over each other's changes.'"

    It's the most sarcastic thing I've heard this morning.

    1. Anonymous Coward
      Anonymous Coward

      Re: Indeed

      I think they meant "network engineers REVIEWED each other's changes."

      As in "they did a walk through, a dry run" to see what might have gone wrong. Doesn't sound like a bad way of understanding the problem, but could probably have been described better.

      1. Anonymous Coward
        Anonymous Coward

        Re: Indeed

        They clearly weren't talking about reviews.

        The full quote says that this resulted in reverts being reverted and the issues recurring as a result.

        It's a sign of a lack of proper incident management discipline - everyone scrambles to fix the issue, but with no one actually coordinating the response you get chaos and conflicting changes.

  7. Claptrap314 Silver badge

    I'm curious

    How many of the stone-throwers here have ever worked in, let alone managed, an operation of Cloudflare's scale?

    I've not been inside, nor have I had any extensive dealings with them, but I will say this: getting this stuff right is **** hard. I've not read their post-mortem, but just from the article, they are showing FAR more transparency than we generally see.

    To address two specific points, one from the article and one from the comments. First, if $200/mo is enough to matter, then you are not spending enough on resilience for me to care what your opinion is. Resilience happens at every level of the stack, and it is a sick joke to suggest that it can be achieved on a shoestring budget.

    Second, as for that blowhard CEO's gripes: what is your company again? This is the sound of a small competitor whose many failings have not been in the news because they don't matter.

    Certainly, from just the article, it is clear that "mistakes were made". But the nature of these mistakes--insufficient QA before a change, unclear responsibilities during a major incident--are relatively easy to fix, especially when compared to, say, Google's inability to get quotas right, or Microsoft's inability to even have an inventory of their internal DNS servers.

    Yes, yes: this incident is another reminder that resilience happens at EVERY level of the stack. No one said otherwise. And big does not mean "perfect". No one said otherwise.

    1. gratou

      Re: I'm curious

      You missed the useful bit. It's not the 200 dollars. It's that there's not a *lot* of point in setting up multi-CDN if your authoritatives are tied to one of the providers that you're trying to mitigate against.

    2. Sam Liddicott

      Re: I'm curious

      Simply paying the $200 won't help if you don't understand what problem he was talking about.

    3. Ben Tasker

      Re: I'm curious

      > How many of the stone-throwers here have ever worked in, let alone managed, an operation of Cloudflare's scale?

      As you've specifically called out my comment (whilst missing the point), I'll answer.

      I have.

      In fact, I've also worked with customers who considered themselves too big to deliver via Cloudflare, and as well as building and managing global CDNs, I've built and integrated against a number of multi-CDN solutions, working with customers that you've definitely heard of (and on a balance of probability will likely interact with sometime today).

      I don't really do calls to authority, but if you're concerned I'm griping with no view into the industry, I'm not.

      > First, if $200/mo is enough to matter, then you are not spending enough on resilience for me to care what your opinion is.

      You've completely missed the point.

      It's not just the cost, it's the fact that their solution is architecturally flawed for no reason other than commercial gain. Cloudflare is the only provider who charges extra to be able to CNAME in.

      The true market leader, where real money is spent (Akamai) offers DNS services but does not mandate them: to do so would mean forcing your customers onto a possible SPOF.

      That $200/mo, by the way, may not give you much else extra that you care about (depending on your requirements); you're just paying a premium for something that's a basic feature on basically every other CDN.

      It'd be much better to be able to spend that $200 on Cedexis or similar so that you can move traffic about.

      > Resilience happens at every level of the stack, and it is a sick joke to suggest that it can be achieved on a shoestring budget.

      There's a cost, sure, but that doesn't justify Cloudflare's commercial choice.

      Perfect resilience costs a lot, but that's not what we're talking about here, we're talking about building multi-CDN: resilience against single provider failures.

      That absolutely can be achieved on a shoestring budget (though I wouldn't recommend it).

      You're letting perfect be the enemy of good, and even a few years within that segment of the industry, seeing the things that big companies actually use/do, would show you how wrong you are.

      > they are showing FAR more transparency than we generally see.

      Cloudflare generally do; their post-mortems are open and honest, something for which they should rightly be praised.

      1. Cowards Anonymous

        Re: I'm curious

        You forgot the mic drop at the end. Great reply though.

      2. Claptrap314 Silver badge

        Re: I'm curious

        It's $200/mo for something that their competition does for free. You attacked their business model over this. And I'm doubling down here--you're demonstrating an utter lack of proper prioritization on this point.

        You know (apparently first hand) that the monthly cost of building a resilient app runs at least 6 figures a year. Against that level of spending, you're going to weigh $2500 priced as an add-on? What if the cost of the base contract is $5000 less?

        Look, if you just don't like them, that's fine. But any evaluation of a solution has to be based on total cost, and you're talking about a rounding error.

        1. Ben Tasker

          Re: I'm curious

          > You know (apparently first hand) that the monthly cost of building a resilient app runs at least 6 figures a year.

          Again, we're talking about different things here.

          You're talking about end-to-end resiliency in an application context, I'm talking about a feature that makes it possible to cope with a failure in an edge provider (Cloudflare in this case).

          You're talking in the context of an application server, whilst a CDN's primary domain is serving (and caching) static content (the big money is in serving game artefacts and video). We're talking about completely separate problem domains with very different solutions.

          It's not nearly as expensive as you make out. In fact, it can even be achieved for as little as £60/year (though that price point doesn't come without issues; the risk is still lower than with your authoritatives tucked away inside CF).

          Fuck, you can do it nearly free if you don't want it automated:

          - Have DNS across providers

          - CDN1 goes down, update CNAME to point to CDN2

          Not that I'd recommend it, but your recovery rate will still be faster than without the ability to switch away.
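          To make that manual flip concrete, here's a tiny sketch (dnspython again, and the hostnames are made-up placeholders) that just reports which CDN a site's CNAME currently points at - i.e. the record you'd be changing by hand:

          ```python
          # Sketch of the manual approach above: check where the site's CNAME
          # currently points, so you can confirm the by-hand flip from CDN1 to
          # CDN2 has taken effect. Hostnames are hypothetical placeholders.
          # Requires dnspython: pip install dnspython
          import dns.resolver

          SITE = "www.example.com"                # hypothetical site hostname
          CDN_TARGETS = {
              "site.cdn-a.example.net.": "CDN1",  # hypothetical CDN edge names
              "site.cdn-b.example.net.": "CDN2",
          }

          def current_cdn(site):
              """Return which known CDN the site's CNAME currently points at."""
              answer = dns.resolver.resolve(site, "CNAME", lifetime=3)
              target = str(answer[0].target)      # e.g. "site.cdn-a.example.net."
              return CDN_TARGETS.get(target, f"unknown target {target}")

          if __name__ == "__main__":
              print(f"{SITE} is currently served via {current_cdn(SITE)}")
          ```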

          Something more production ready like Cedexis or Constellix doesn't break the bank either.

          The days of CDN being an expensive bespoke thing are long gone; it's a commodity nowadays. Multi-CDN support isn't much different in that respect.

          I've focused on resiliency because that's what the article is about, but there's a bunch of other knock-on effects too. In the early stage of the relationship it knackers a customer's ability to do A/B testing.

          Big customers also tend to use different CDNs in different parts of the world - if Cloudflare's coverage is pants in parts of Asia, you might route those users via AliCDN or similar. The lack of CNAME support hampers this.

          The key thing with both examples is that you generally want it to be transparent to the user, which generally means CNAMEs (I've seen DNAMEs get used too... the horror)

          > Against that level of spending, you're going to weigh $2500 priced as an addon?

          I think the thing you're missing is that that $2500 is an otherwise unnecessary charge unless you also want support from CF.

          It also comes on top of whatever you're paying to implement failover. If you're using something simple (like DNSMadeEasy's fairly limited auto failover) then that $2500 increases your costs many times over.

          If Cloudflare were head and shoulders above the competition you might ignore it, but in many locations, they're not.

          > But any evaluation of a solution has to be based on total cost

          I disagree.

          Evaluation of a solution is based on *value* not cost.

          CDN customers tend to have a spectrum of criteria. Some will pay the earth for the provider with the lowest latency (in whatever market they care about), some want to spend as little as possible.

          Most, obviously, fall somewhere in the middle.

          If you're charging extra for something basic that your competitors offer, you need to be able to justify it, either through sheer speed (attracting the left of the spectrum) or by being able to explain that its omission makes lower tiers cheaper (a hard sell with this particular issue).

          That's not something Cloudflare have achieved IMO.

          > You attacked their business model over this

          I think you've overlooked the context I posted under.

          My original comment quoted a bit from the article that said it was unacceptable that a single provider could take so much of the net down.

          The point in my comment wasn't to attack CF's business model, but to point out that that SPOF was there not because of technical reasons, but because of a business decision by CF.

          The guy quoted in the article is a little breathless, but he's right: this could and should have been avoided.

    4. CrazyOldCatMan Silver badge

      Re: I'm curious

      insufficient QA before a change, unclear responsibilities during a major incident--are relatively easy to fix

      Not in my experience - they speak of the root culture at a place and that's really, really not simple to fix.

      All those things are needed for operations at scale and, if CF don't have them, it's a miracle that this hasn't happened before.

  8. Ken Moorhouse Silver badge

    Knocked out a lot of stock market research sites...

    During the critical pre-LSE market start of trading period.

    1. Short Fat Bald Hairy Man
      Pint

      Re: Knocked out a lot of stock market research sites...

      Lovely

    2. IGotOut Silver badge

      Re: Knocked out a lot of stock market research sites...

      You mean it took the bots offline?

  9. JibberX

    The Internet...? Or the World Wide Web?

    I know they talk about BGP shenanigans, but still, it just affects people looking at animated GIFs in Netscape, right?

    1. jake Silver badge

      I was online and doing work in that timeframe. I didn't notice anything amiss.

      I was not using that new-fangled WWW-thingy. Do with that what you will.

    2. Michael Wojcik Silver badge

      Cloudflare is used for things that aren't exclusively web-related, such as DNS. Per comments above.

      But, yeah, if you weren't caught out by using Cloudflare-backed DNS, you probably wouldn't have observed too many issues with your SSH connections or whatever.

      I missed the outage, thanks to my time zone and working hours, but it probably wouldn't have troubled me too much since I'd still have the corporate network and there's no shortage of things I can be doing.

  10. AlanSh
    Pint

    It's not that big a deal

    OK - so we lost part of the internet for a couple of hours. At a time when half the world was asleep. Get a life, guys - it's not that big a deal. There are MUCH more important and worrying things going on in the world today. Cloudflare fixed it - and it won't happen again like that.

    So, sit and enjoy a beer.

    Alan

    1. First Light

      Re: It's not that big a deal

      This is a site reporting on the industry and the story is an example of an industry screwup. It's legitimate to report on and comment about it. At what point does it become a big deal? Is there a threshold number of hours for it to become one?

      It was daytime in India and the outage lasted until 1pm, so it had a bigger effect on that country of 1.2bn people. Not exactly a nothingburger.

      https://economictimes.indiatimes.com/markets/stocks/news/retail-investors-hit-by-global-cloudflare-outage-as-services-of-zerodha-other-brokers-affected/articleshow/92357392.cms

    2. Ken Moorhouse Silver badge
      Pint

      Re: So, sit and enjoy a beer.

      With a bit of imagination you could say this is a story about fermenting hops.

      Addressing the other points in the post:-

      Half the world might be asleep (which I doubt), but that means the other half is awake.

      How can anyone say this won't happen again? Are the theory and methods of calculating Risk dead? It is very worrying if people think this.

      1. jake Silver badge

        Re: So, sit and enjoy a beer.

        "With a bit of imagination you could say this is a story about fermenting hops."

        That'd take quite the imagination indeed ... not a lot of fermentables in hops.

  11. hayzoos

    CDNs are evil!

    Okay, got your attention. But the Internet's core design philosophy is completely subverted by global oligopolistic CDNs. Because of CDNs, your use of the internet is insulated from the real Internet. The servers people intended to contact were not down, yet they were down. The Internet design was to be able to route around "failures". I submit that CDNs as implemented are breaking The Internet.

    1. EBG

      same answer

      that I give to those who are shocked that bitcoin mining consolidated into a few groups in China. Economics 101. That's what commodity supply chains do.

    2. Wayland

      Re: CDNs are evil!

      IPFS, or Inter Planetary File System would have routed round the problem if used as a CDN.

    3. cyberdemon Silver badge
      Devil

      Re: CDNs are evil!

      The worst example of this problem, of course, being Google AMP. Pages auto-mangled by Google, and the original website doesn't even know that you looked at it.

      To Room 101 with it

    4. Michael Wojcik Silver badge

      Re: CDNs are evil!

      Well, yeah. And the web was ruined by Javascript (and arguably by CSS, and even graphical browsers, though those are useful for viewing graphic assets like charts and plots). And the web ruined Usenet. And really, with a few improvements to Gopher or WAIS, we could have dispensed with the web in the first place. And GUIs ruined UIs. And so on.

      As Nick said in Metropolitan, I'm not entirely joking.

      But, curmudgeonly as I am, even I make use of some online services that wouldn't be feasible without CDNs or some other edge-delivery mechanism. (Someone else mentioned IPFS, but I'm dubious.) Could I live without them? Absolutely. Would I miss them if they vanished? Eh, a bit, but to be honest I'd miss good old fashioned paper books far more if I lost those.

      Still, I can't pretend I get no value from CDNs. And I suspect that's true for the vast majority of people who do anything online.

  12. jake Silver badge
    Pint

    "I submit that CDNs as implemented are breaking The Internet."

    Change that from "are breaking" to "have broken". But quite.

    1. Phones Sheridan Silver badge

      I actually found this post helpful back when I was deciding to use Cloudflare or not.

      https://www.devever.net/~hl/cloudflare#:~:text=Websites%20should%20avoid%20using%20Cloudflare,the%20state%20of%20the%20web.

  13. jake Silver badge

    " ... Large cloud providers have to manage a vast degree of complexity and moving parts, significantly increasing the risk of an outage."

    To say nothing of the vastly increased size and scope of potential attack vectors.

    Clouds are snake-oil.

    1. M.V. Lipvig Silver badge

      You forgot to mention that the larger the cloud company, the further down you're likely to be on the recovery list when something bad happens that requires backups be loaded. Or how missing a payment or two because there was a problem with payments going through (a surprisingly large number of smaller companies use a credit card for such things) means your business space gets sold to someone else, and overwritten. There went your business.

      Cloud computing may have its uses, but nobody cares about your business data but you. Relying on another company to take care of it for you might be saving you an IT guy's salary and a couple of servers, but will that be enough when you walk into work in the morning to find that you have no business, and the cloud company says "Sorry, should have paid your bill, we stopped being responsible when your account lapsed."? I know my own company will email late notices, then will shut your service down without a care in the world about whether your business will run, and when you call to complain we tell you flat out "You were disconnected for nonpayment, call 800-givenofuck if you want it back." Someone may answer, if you call between 8:00 and 8:05 on Feb 30 of any year.

  14. The Morgan Doctrine

    Which means a bad actor could TAKE DOWN THE FREAKING INTERNET

    Holy Mother of Mercy! I'm surprised Russia hasn't pulled the plug on us all.

    1. jake Silver badge

      Re: Which means a bad actor could TAKE DOWN THE FREAKING INTERNET

      Well, yes.

      To be perfectly honest, perhaps not "take down the entire Internet", but it would be fairly easy to balkanize it for a while.

      Some say this has already happened, especially to the WWW.

  15. Toni the terrible Bronze badge
    Flame

    IT's a Pain

    Cloudflare is a pain in the proverbial. It has prevented access, when using Firefox (and even Chrome etc.), to websites that access has been paid for, beginning 3 months ago. Which is worse when the only way to contact the vendor is via the website, e.g. Crunchyroll.

    It has also prevented me from getting my takeaway via Just Eat, grumble, grumble - of course going direct is doable so I do not need a takeaway aggregator - but still, it is not convenient and it prevents actual business from happening.

  16. Anonymous Coward
    Anonymous Coward

    To be honest this isn't news - Cloudflare seems to have so many outages! Either these major foobared "self-inflicted" ones, or the random Cloudflare error pages you see around the internet when its servers don't seem to be able to cope! Often hidden by the fact that it is static content that's missing...

  17. 1752

    https://xkcd.com/908/

  18. Potemkine! Silver badge

    From my point of view, I would congratulate the engineers who were able to locate and fix the problem in such a short period, while under enormous pressure. Shit happens and will continue to happen.

    An interesting comment is the one showing how we are more and more dependent on an internet connection. Public works severing optical fibres leaves many unable to work, and that can last for several days and cause huge losses. I wonder if the bean counters took that into consideration before deciding to migrate everything to the lowest bidder?

  19. Version 1.0 Silver badge

    Internet vs Renitent

    The Internet was created to renitent (resist and survive) and it worked very well for years, but now we have "upgraded" so many things that this incident is no big surprise. This environment applies to everything I work in and with: new "features" making data easy are sold as "So Great", but then something stupid happens and we see this style of issue ... it's normal these days. We're busy creating problems and solving them, but originally the Internet was designed and built just to avoid problems, and it worked like that for years.

    1. jake Silver badge

      Re: Internet vs Renitent

      "The Internet was created to renitent (resist and survive)"

      Oft repeated, but simply not true. The (D)ARPANET was just a research network designed to research networking. The "survives nukes" myth came about much later ... The cold, sad reality is that the only reason it was built to be resilient is that the available hardware was really, really flaky.

      The networks that were designed to survive nuclear attack included the "Minimum Essential Emergency Communications Network", or MEECN, and the prior "Survivable Low Frequency Communications System" or SLFCS. If you use an ounce of common sense, it only stands to reason ... no military would design a command and control system that inherently wasn't securable, and the Internet was not then, and still isn't securable.

  20. IGotOut Silver badge

    Network engineer issue.

    The issue of stepping on each other is easy to explain, as I've had first hand experience (VoIP).

    Networks make a change. As far as they are concerned all is good, push it out....

    It hits VoIP and all hell breaks loose. Telecoms report the issue. They spot the issue and start emergency changes, getting back up and running in a few minutes, while networks run round like headless chickens with fingers up their arses.

    Networks panic without talking to anyone and revert changes (after asking if you can ping it 3000 times).

    Revert breaks telecoms fix. Telecoms revert back to old config.

    Rinse and repeat.

    Yes, it happened too many times to remember.
