Fastly 'fesses up to breaking the internet with 'an undiscovered software bug' triggered by a customer

Fastly has explained how it managed to black-hole big chunks of the internet yesterday: a customer triggered a bug. The customer, Fastly points out in a post titled Summary of June 8 outage, was blameless. "We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a …


  1. Korev Silver badge

    The company has therefore resolved to do four things:

    We’re deploying the bug fix across our network as quickly and safely as possible.

    We are conducting a complete post mortem of the processes and practices we followed during this incident.

    We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.

    We’ll evaluate ways to improve our remediation time.

    Assuming these things actually happen, I can't think of a much better way to respond to a screw-up.

    1. John Robson Silver badge

      There is one step missing - we'll update our processes to make sure that *similar* bugs get caught (not just this one, but anything in this class).

      1. Anonymous Coward
        Anonymous Coward

        Agree - understanding what went wrong is one thing, fixing the current fault is another, but you need to take the further step of implementing (and verifying) processes to prevent a recurrence.

      2. DCdave

        I'd add another step - we will work on limiting the scope of any change so it can't cause such a widespread issue. At most, a customer should only be able to affect their own systems.

        1. happyuk

          Agreed.

          Also, any blaming, no matter how subtle or indirect, should be noted.

          That would raise a red flag for me, and would indicate a dysfunctional environment.

          In this case the dirty stick is being pointed at the customer somewhat.

          1. John Robson Silver badge

            Pretty sure they have said it was a *valid* customer config....

          2. Ken Moorhouse Silver badge

            Re: In this case the dirty stick is being pointed at the customer somewhat.

            Sounds like Badge of Honour is deserved, rather than a dirty stick.

            Would you point the dirty stick at Microsoft's entire customer base?

      3. Beeblebrox

        make sure that *similar* bugs get caught

        Please do note that non-similar bugs will continue to go undetected under these updated processes, just as before.

    2. AW-S

      Simple things in a message, that most understand. Line drawn under the issue. Next.

      (like you, I can't think of a better way to put it).

      1. Denarius Silver badge

        things missing

        techie thrown under bus outside company

        bonuses all round for CEO, board and pals.

        Snarking aside, full points for discovering the cause in a minute. Seems like their monitoring code produces meaningful error messages, unlike some in the IT game.

        1. Nifty Silver badge

          Re: things missing

          'While scrambling through our logs for an hour we found the root cause was there in the first minute of the log'.

        2. This post has been deleted by its author

        3. 6491wm

          Re: things missing

          "Seems like their monitoring code produces meaningful error messages"

          Or it was a bug they were aware of and were just waiting to roll out the fix in a scheduled maintenance window.................

      2. Roopee

        Aspberger’s

        “Simple things in a message, that most understand.” - that’s the basic problem for Aspies, and I reckon there are a lot of Aspberger sufferers on El Reg. Usually it is couched in terms of non-verbal cues such as inflection and facial expression, but it includes idiomatic language too.

        It runs in the male line of my family, I’m one of the less autistic members...

        1. Roopee

          Re: Aspberger’s

          Oops, that should be Asperger’s!

        2. bombastic bob Silver badge
          Facepalm

          Re: Aspberger’s

          non-verbal cues

          are highly overrated...

          (except for icons)

    3. Lunatic Looking For Asylum

      ...and fire (the scapegoat) who let it through as soon as we find out who to blame....

    4. rcxb Silver badge

      I'd want to see a second layer of protection against misbehavior, not just trying to make their software perfect and bug-free.

  2. John Robson Silver badge

    And the bug was?

    See title.

    1. Terafirma-NZ

      Re: And the bug was?

      Exactly my thoughts. Cloudflare are very good at giving deep details of what went wrong, e.g. last time they had a big one they went as far as publishing their BGP filters to show how it happened.

      At times they have even shown the bad code.

      Yes, I'm a customer, and not advocating for them above others, but the level of openness is defiantly much better over there.

      1. John Robson Silver badge

        Re: And the bug was?

        I am somewhat sympathetic until they have finished pushing the patch around - but a deep dive is what really boosts confidence after such misfortunes.

      2. Colonel Mad

        Re: And the bug was?

        I do not think that any one country is better than another; openness depends on the organisation's culture. And it's "definitely" BTW.

    2. Anonymous Coward
      Anonymous Coward

      Re: And the bug was?

      Mr. Robson?

    3. Ken Moorhouse Silver badge

      Re: And the bug was? See title.

      Careful: Diodesign will be on here complaining that they had to reboot TheRegister due to the endless loop you created there.

    4. penx

      Re: And the bug was?

      I'm thinking it was a timezone (GMT+13) issue, given their explanation, and that it was around 11am in the UK

      https://twitter.com/penx/status/1402908009253199877

      1. bombastic bob Silver badge
        Thumb Up

        Re: And the bug was?

        hmmm - that actually makes a LOT of sense depending on how the date/time math was being done.

        more reason to ALWAYS store and work with date+time info as time_t (as GMT), or something very similar, to avoid [most] date+time math issues (then just tolerate any others).

        I've done a LOT of date+time kinds of calculations with databases, etc. over the years, for decades even (from business analysis tools to capturing electric power waveform data to millisecond motion data capture) and the idea that a date+time calculation that crosses 0:00 might be responsible for a system-wide outage sounds VERY plausible.

        I think AWS (and others) had a similar problem once (or maybe MORE than once) due to a leap second and its effect on the world-wide synchronization of data...
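
        To make the speculation concrete, here's a minimal Python sketch of that class of bug - the 11:30 UTC instant, the GMT+13 zone from the tweet above, and the "compare by calendar date" rule are all assumptions for illustration, not anything taken from Fastly's write-up.

            # Hypothetical illustration only -- not Fastly's actual code.
            from datetime import datetime, timezone, timedelta

            UTC13 = timezone(timedelta(hours=13))   # the GMT+13 zone from the tweet

            # One single instant, expressed in two zones.
            now_utc = datetime(2021, 6, 8, 11, 30, tzinfo=timezone.utc)
            now_nz = now_utc.astimezone(UTC13)

            print(now_utc.date())   # 2021-06-08
            print(now_nz.date())    # 2021-06-09 -- the local calendar has already rolled over

            # Any logic keyed on the *calendar date* now disagrees across machines:
            pushed_on = datetime(2021, 6, 8, tzinfo=timezone.utc).date()
            print(pushed_on == now_utc.date())   # True on a UTC box
            print(pushed_on == now_nz.date())    # False on a GMT+13 box

            # Arithmetic on epoch seconds (Python's analogue of time_t) never sees
            # a calendar date or a zone, so it gives the same answer everywhere:
            age = now_utc.timestamp() - datetime(2021, 6, 8, tzinfo=timezone.utc).timestamp()
            print(age)   # 41400.0 seconds, on every box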

  3. Greybearded old scrote Silver badge
    Pint

    Credit where it's due

    50 minutes to fix a hair-on-fire emergency? I'd call that a good performance under great stress. I'm sure I couldn't have done it. Bet the stressed engineers indulged in a few of the icon afterwards.

    More generally, there's a reason the internet has a decentralised design. Why do all these numpties keep rushing to any company that centralises it? (Looks sideways at the employer's GitHub repository.)

    1. Tom 7 Silver badge

      Re: Credit where it's due

      Something this close to the atom-wide sharp bit of the pointy end should tell you pretty much exactly WHAT went wrong a small fraction of a second after it did. All power to the engineers for being able to read the log files through the Niagara Falls of sweat this would induce in most people. Once you've done that, the WHY should be pretty clear, though the HTF-do-we-fix-it might take a couple of minutes going over the pre-written disaster recovery plan, which should include a big 'make sure this can't get in again' post mortem procedure that explicitly excludes bean counters.

      1. John Robson Silver badge

        Re: Credit where it's due

        When the failure is a customer config triggering a bug that was introduced months earlier.... spotting it might not be that easy and obvious

      2. Charlie van Becelaere

        Re: Credit where it's due

        "post mortem procedure which should explicitly exclude bean counters."

        Thumbs up specifically for this.

        1. File Not Found

          Re: Credit where it's due

          Exclude the ‘bean counters’? Of course. That will make sure they are kept in the dark, uninformed and unable to either support or understand the work being carried out. And what you want in a well-run organisation are uninformed, excluded and ignorant people to provide and manage your budgets, don’t you? I’ve worked as an IT budget holder in those sorts of firms/orgs, and in the other more enlightened variety, and I know which works better (in both company and personal outcomes). Lay off this ‘bean counter’ bollocks.

          1. Anonymous Coward
            Anonymous Coward

            Re: Credit where it's due

            Downvoted because even though I work in a non-IT company, the wonderful beancounters are half the problem. They don't understand a fucking thing other than "that's too expensive". So something spec'ed as 'X' gets replaced by some cheaper flimsier crap that doesn't even manage to last a fraction of the time that the desired object would have managed. Over the long term, their decisions actually cost more time, more money, and a lot more employee unhappiness (and in one case, the loss of a contract, but that was conveniently whitewashed)...funny how the bean counters are most interested in the short term gains. Numbers and statistics can be massaged, and that's what they are best at isn't it?

            I get that somebody has to look after the budget else the employees would all want their own personal coffee machines, but that shouldn't take more than one on site accountant (on-site so they have hands-on experience of what they're talking about). As for all the handsomely-paid bean counters in head office? Fucking parasites, the lot of them.

            [ought to be pretty obvious why anon]

        2. Confuciousmobil

          Re: Credit where it's due

          Exclude the bean counters? Are you sure they weren’t the ones responsible for this bug? Have you seen their share price since the incident?

          I’m sure the bean counters couldn’t be happier.

    2. katrinab Silver badge
      Meh

      Re: Credit where it's due

      The whole point of Fastly and similar services is to provide decentralised design?

      Obviously it failed.

      1. Greybearded old scrote Silver badge

        Re: Credit where it's due

        Well Fastly's internal architecture may qualify as decentralised. But such a large slice of the web using the same company? Not so much.

        1. Jamie Jones Silver badge

          Re: Credit where it's due

          Yes, the blind rollout of any config / new software to all nodes is its own single point of failure.

          1. John Robson Silver badge

            Re: Credit where it's due

            Hardly a blind rollout - the bug had been rolled out months before; it wasn't until a *customer* configuration tripped a specific set of circumstances that it caused an issue.

            And the fix is still being rolled out, so hardly a rollout to all nodes...

      2. Anonymous South African Coward Silver badge

        Re: Credit where it's due

        My thinking as well.

        If one site managed to create a bork this big, then something's wrong with our designs.

      3. Anonymous Coward
        Anonymous Coward

        Re: Credit where it's due

        Without being able to provide examples specific to this incident in the absence of a deep dive, there are whole classes of failures that can and do wipe out decentralized systems without needing a single point of failure. These are in fact the bane of large-scale systems, as they are often quite subtle as well.

        Redundancy is only one step, and has a price you pay in complexity and reliability that you trade for availability. Running a datacenter full of these systems is already pushing hard on the limits of manageable complexity. Now scale that up to a global cloud.

        The scarier class of bugs are the ones that show up when all of that redundant gear starts talking to each other. One of the nastier ones I saw involved a message passing library passing an error as a message. There was a bug that caused that message to crash the receiving machine, generating another error message. In a non-HA environment this would propagate around and probably crash a third of the message daemons in the cluster. In an HA environment, as the system tried to restart the failing processes, waves of crashes were constantly being broadcast, till not only was the message passing code wrecked but a bunch of the HA stuff flooded, triggering a second, similar problem on the management network that took down most of the stuff across the DC till someone started dumping traffic on the network core.
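
        That feedback loop is easy to reproduce in miniature. The toy Python sketch below is purely illustrative - the node names, the "one error message is fatal" rule and the broadcast-on-restart behaviour are all made up rather than taken from any real HA stack - but it shows how a single bad message becomes a storm once the recovery path itself emits the thing that kills receivers.

            # Toy simulation of an error-as-message cascade; everything here is invented.
            from collections import deque

            NODES = ["node%d" % i for i in range(6)]

            def run(inbox, max_events=50):
                """Drain the queue; every ERROR delivered crashes its receiver, and the
                'HA layer' then broadcasts that crash to every peer as a new ERROR."""
                crashes = {}
                events = 0
                while inbox and events < max_events:
                    target, msg = inbox.popleft()
                    events += 1
                    if msg == "ERROR":
                        crashes[target] = crashes.get(target, 0) + 1
                        for peer in NODES:
                            if peer != target:
                                inbox.append((peer, "ERROR"))   # the crash report is itself fatal
                return events, crashes

            print(run(deque([("node0", "ERROR")])))   # one bad message; every node crashes repeatedly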

        As a consequence of a lack of those kaiju-sized problems, my onsite deployment has had better downtime numbers than most of the major cloud services we use, including Google, for 8 of the last 10 years. That said, some of our systems aren't even redundant, and the org can tolerate them being offline long enough for us to grab a spare server chassis, rack it, and restore its services.

        Going on a SPOF hunt isn't always the best use of resources. Now at the scale Fastly is at, there really shouldn't be many (for example, they probably only have one roof on their datacenters).

        1. Martin an gof Silver badge

          Re: Credit where it's due

          Isn't this why decentralised / redundant / fail-over / whatever used also to involve multiple completely independent vendors? Like the aircraft fly-by-wire systems with three independently-designed units so that an inherent bug is highly unlikely to be present in either of the others (though I suppose it depends how many standard libraries they pull in in common!).

          Or the practice with a multi-disc NAS of buying discs at the very least from different production batches, or preferably from different manufacturers? Or putting the redundant servers in a physically separate location, not in the (much more convenient) next rack along (holds hand up in guilt)?

          I'm sure it would add major amounts of complexity and cost to an outfit as vast as Fastly (or Cloudflare or Akamai) - when you need to make a configuration or software change, you can't just make it once and "deploy to all", you have to make it twice or three times (develop and test twice or three times) and "deploy to subset", but the alternative involves the clients somehow making Fastly and Cloudflare work nicely together and as someone with next to no network engineering experience I don't even know if that's possible.

          And it was fixed within the hour. I was looking for information about a particular network security outfit and got an error. I did something else for a bit, and when I came back the error had gone.

          M.

        2. heyrick Silver badge
          Happy

          Re: Credit where it's due

          "there are whole classes of failures that can and do wipe out decentralized systems without needing a singe point failure"

          Indeed - didn't a ship passing under power lines in Germany take out most of Europe's electric grid? 2006 or something?

          I went dark as a result of that (northwest France). I cycled up to the neighbour to check that their power was off (so not just me) and noticed that there were no lights in a nearby town. So I went back home, lit a candle, and made a tea by boiling water in a saucepan.

    3. Xalran

      Re: Credit where it's due

      To answer your question, one single word can be used:

      Cost.

      Now for the details:

      It costs less for the numpties to use a 3rd party to host their stuff than to buy the metal, own the datacenter and manage the whole shebang (especially since the metal will be obsolete in no time and will have to be regularly replaced).

      In the end it all boils down to the opex/capex Excel sheet (where you can have high opex, but capex should be kept as low as possible).

      1. Anonymous Coward
        Anonymous Coward

        Re: Credit where it's due

        It's cost vs risk.

        Clearly they need to move the cutoff point here.

      2. sammystag

        Re: Credit where it's due

        To answer your answer in one single word - bollocks

        While it has its risks, cloud providers have massive advantages of scalability, expertise, etc. An in-house team of sysadmins is unlikely to be able to keep up in the same way, be available around the clock, not be ill, not go on holiday, not be a bit shit at their jobs, and be up to date on every patch and new feature. I've had far less trouble on AWS than I used to have with teams of incompetent DBAs, Unix admins stuck in the eighties, etc. - all blaming the devs of course and putting up roadblocks to getting to the bottom of things.

        1. Anonymous Coward
          Anonymous Coward

          Re: Credit where it's due

          ... but if they fail, they don't just rip a hole in your operations, but in those of a lot of other companies too.

          In addition, if something doesn't work in my server room I can immediately dispatch someone to get it moving again. The support of most of these cloud companies tends to be a tad, well, nebulous in case of calamity, because they use every trick in the book to cut down on client interaction as it represents a cost.

          To decide if the cloudy stuff is for you, assume first that things will go t*ts up and evaluate if you can handle the outage. If yes, go ahead. If not, avoid. Basic risk assessment: get your criteria, stick a number on them and work it out. It also forces you to face the risks upfront instead of having them lie dormant in your infrastructure until things go wrong.

    4. steamdesk_ross

      Re: Credit where it's due

      "(Looks sideways at the employer's GitHub repository.)"

      But the whole point of using GitHub is that everything is distributed and generally recoverable - you can lose the master but not lose your history. Of course, if you're the kind of person who pushes *out* to production from the repository server instead of pulling *in* from the production servers themselves, then it could make life a bit trickier for a short while.
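
      For what it's worth, a pull-based deploy is only a few lines. The sketch below is a rough illustration (the repo path, branch and polling interval are placeholders, and a real setup would add locking, health checks and rollback); the point is that a dead central repo only stops new deploys, not the servers already running.

          # Rough sketch of a pull-based deploy loop; paths and branch are placeholders.
          import subprocess, time

          REPO_DIR = "/srv/app"        # assumed checkout on the production box
          BRANCH = "origin/main"       # assumed branch to track

          def git(*args):
              out = subprocess.run(["git", "-C", REPO_DIR, *args],
                                   check=True, capture_output=True, text=True)
              return out.stdout.strip()

          while True:
              git("fetch", "origin")                   # production pulls *in*
              if git("rev-parse", "HEAD") != git("rev-parse", BRANCH):
                  git("reset", "--hard", BRANCH)       # update the working copy
                  # restart or reload the service here
              time.sleep(60)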

      I thought it was a pretty fair response. As I read yesterday, attributed to A C Clarke, "You might be able to think of a dozen things that can go wrong. The problem is, there's always a thirteenth." It's not whether things go wrong that matters, it's how you deal with them when they do. Of course, you should avoid like the plague a software developer who gets more than their fair share of things going wrong ("You make your own luck").

      Still, "best practise" and all that. Real programmers know that hiding behind a framework like ISO9000 multiplies your dev time and costs by a huge factor but usually doesn't actually decrease the "incidents" count as much as hiring better developers in the first place does (and ensuring that the client listens when the devs say something can't or shouldn't be done...) We have a running joke in our office that "best practise" is just whatever the next wannabe has tried to sell as the unique factor in their company's operations - then we wait 18 months for the client to come back to us, tail between legs, after one too many long duration outages, or dealing with a pile of complaints that all say "the new system is so much slower than the old one" or "I *need* this feature that is no longer there".

    5. Michael Wojcik Silver badge

      Re: Credit where it's due

      Looks sideways at the employer's GitHub repository.

      Ah, GitHub.

      Linus: OK, here's an open-source distributed source-control system. It has a data representation few people understand and a user-hostile interface that periodically requires arcane incantations like "git gc --prune=now" to continue functioning under perfectly normal use. But the important thing is that it's decentralized, which is exactly what we want for the Linux kernel.

      Everyone else: Great! This would be perfect if only we centralized it.

  4. cantankerous swineherd Silver badge

    internet unavailable

    not many dead

  5. wolfetone Silver badge
    Coat

    If StackOverflow didn't use Fastly

    Then the Fastly guys may have been able to fix it quicker?

    1. Peter X

      Re: If StackOverflow didn't use Fastly

      Flip-side, they could've been called Slowly. So all things considered... ;-)
