Fastly 'fesses up to breaking the internet with 'an undiscovered software bug' triggered by a customer

Fastly has explained how it managed to black-hole big chunks of the internet yesterday: a customer triggered a bug. The customer, Fastly points out in a post titled Summary of June 8 outage, was blameless. "We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a …

  1. Korev Silver badge

    The company has therefore resolved to do four things:

    We’re deploying the bug fix across our network as quickly and safely as possible.

    We are conducting a complete post mortem of the processes and practices we followed during this incident.

    We’ll figure out why we didn’t detect the bug during our software quality assurance and testing processes.

    We’ll evaluate ways to improve our remediation time.

    Assuming these things actually happen, then I can't think of a much better way to respond to a screw up.

    1. John Robson Silver badge

      There is one step missing - we'll update our processes to make sure that *similar* bugs get caught (not just this one, but anything in this class).

      1. Anonymous Coward
        Anonymous Coward

        Agree - understanding what went wrong is one thing, fixing the current fault is another, but you need to take the further step of implementing (and verifying) processes to prevent a recurrence.

      2. DCdave

I'd add another step - we will work on limiting the scope of any change so that it can't cause such a widespread issue. A customer should at most only be able to affect their own systems.

        1. happyuk


Also any blaming, no matter how subtle or indirect, should be noted.

          That would raise a red flag for me, and would indicate a dysfunctional environment.

          In this case the dirty stick is being pointed at the customer somewhat.

          1. John Robson Silver badge

            Pretty sure they have said it was a *valid* customer config....

          2. Ken Moorhouse Silver badge

            Re: In this case the dirty stick is being pointed at the customer somewhat.

            Sounds like Badge of Honour is deserved, rather than a dirty stick.

            Would you point the dirty stick at Microsoft's entire customer base?

      3. Beeblebrox

        make sure that *similar* bugs get caught

        Please do note that non-similar bugs will continue to be undetected under these updated processes as previously.

    2. AW-S

      Simple things in a message, that most understand. Line drawn under the issue. Next.

      (like you, I can't think of a better way to put it).

      1. Denarius

        things missing

        techie thrown under bus outside company

        bonuses all round for CEO, board and pals.

Snarking aside, full points for discovering the cause in a minute. Seems like their monitoring code produces meaningful error messages, unlike some in the IT game.

        1. Nifty Silver badge

          Re: things missing

          'While scrambling through our logs for an hour we found the root cause was there in the first minute of the log'.

        2. This post has been deleted by its author

        3. 6491wm

          Re: things missing

          "Seems like their monitoring code produces meaningful error messages"

          Or it was a bug they were aware of and were just waiting to roll out the fix in a scheduled maintenance window.................

      2. Roopee Bronze badge


“Simple things in a message, that most understand.” - that’s the basic problem for Aspies, and I reckon there are a lot of Aspberger sufferers on El Reg. Usually it is couched in terms of non-verbal cues such as inflection and facial expression, but it includes idiomatic language too.

        It runs in the male line of my family, I’m one of the less autistic members...

        1. Roopee Bronze badge

          Re: Aspberger’s

          Oops, that should be Asperger’s!

        2. bombastic bob Silver badge

          Re: Aspberger’s

          non-verbal cues

          are highly overrated...

          (except for icons)

    3. Lunatic Looking For Asylum

      ...and fire (the scapegoat) who let it through as soon as we find out who to blame....

    4. rcxb Silver badge

      I'd want to see a second layer of protection against misbehavior, not just trying to make their software perfect and bug-free.

  2. John Robson Silver badge

    And the bug was?

    See title.

    1. Terafirma-NZ

      Re: And the bug was?

Exactly my thoughts. Cloudflare are very good at giving deep details of what went wrong, e.g. last time they had a big one they went as far as publishing their BGP filters to show how it happened.

      At times they have even shown the bad code.

Yes, a customer, and not advocating for them above others, but the level of openness is defiantly much better over there.

      1. John Robson Silver badge

        Re: And the bug was?

        I am somewhat sympathetic until they have finished pushing the patch around - but the deep dive is what really boosts confidence about said misfortunes.

      2. Colonel Mad

        Re: And the bug was?

I do not think that any one company is better than another; openness depends on the organisation's culture. And it's "definitely", BTW.

    2. Anonymous Coward
      Anonymous Coward

      Re: And the bug was?

      Mr. Robson?

    3. Ken Moorhouse Silver badge

      Re: And the bug was? See title.

      Careful: Diodesign will be on here complaining that they had to reboot TheRegister due to the endless loop you created there.

    4. penx

      Re: And the bug was?

      I'm thinking it was a timezone (GMT+13) issue, given their explanation, and that it was around 11am in the UK

      1. bombastic bob Silver badge
        Thumb Up

        Re: And the bug was?

        hmmm - that actually makes a LOT of sense depending on how the date/time math was being done.

        more reason to ALWAYS store and work with date+time info as time_t (as GMT), or something very similar, to avoid [most] date+time math issues (then just tolerate any others).

        I've done a LOT of date+time kinds of calculations with databases, etc. over the years, for decades even (from business analysis tools to capturing electric power waveform data to millisecond motion data capture) and the idea that a date+time calculation that crosses 0:00 might be responsible for a system-wide outage sounds VERY plausible.

        I think AWS (and others), had a similar problem once (or maybe MORE than once) due to a leap second and its effect on the world-wide synchronization of data...
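Bob's point about doing date+time arithmetic on epoch seconds rather than wall-clock fields can be seen in a tiny Python sketch (the timestamps are made up): comparing only the time-of-day fields across midnight yields a nonsense negative interval, while time_t-style epoch maths stays correct.

```python
from datetime import datetime, timezone

# Two events 10 minutes apart, straddling midnight UTC.
start = datetime(2021, 6, 7, 23, 55, tzinfo=timezone.utc)
end = datetime(2021, 6, 8, 0, 5, tzinfo=timezone.utc)

# Naive approach: compare only the time-of-day fields.
# 00:05 looks "earlier" than 23:55, so the delta goes negative.
naive_minutes = (end.hour * 60 + end.minute) - (start.hour * 60 + start.minute)

# time_t-style approach: do the maths on epoch seconds (always UTC).
epoch_minutes = (end.timestamp() - start.timestamp()) / 60

print(naive_minutes)   # -1430
print(epoch_minutes)   # 10.0
```

Whether something like this was Fastly's actual bug is pure speculation, of course; the sketch only illustrates the class of mistake being discussed.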

  3. Greybearded old scrote Silver badge

    Credit where it's due

    50 minutes to fix a hair-on-fire emergency? I'd call that a good performance under great stress. I'm sure I couldn't have done it. Bet the stressed engineers indulged in a few of the icon afterwards.

    More generally, there's a reason the internet has a decentralised design. Why do all these numpties keep rushing to any company that centralises it? (Looks sideways at the employer's GitHub repository.)

    1. Tom 7

      Re: Credit where it's due

Something this close to the atom-wide sharp bit of the pointy end should tell you pretty much exactly WHAT went wrong a small fraction of a second after it did. All power to the engineers being able to read the log files through the Niagara Falls of sweat this would induce in most people. Once you've done that, the WHY should be pretty clear, though the HTF-do-we-fix-it might take a couple of minutes going over the pre-written disaster recovery plan, which should include a big 'make sure this can't get in again' post mortem procedure which should explicitly exclude bean counters.

      1. John Robson Silver badge

        Re: Credit where it's due

        When the failure is a customer config triggering a bug that was introduced months earlier.... spotting it might not be that easy and obvious

      2. Charlie van Becelaere

        Re: Credit where it's due

        "post mortem procedure which should explicitly exclude bean counters."

        Thumbs up specifically for this.

        1. File Not Found

          Re: Credit where it's due

          Exclude the ‘bean counters’? Of course. That will make sure they are kept in the dark, uninformed and unable to either support or understand the work being carried out. And what you want in a well-run organisation are uninformed, excluded and ignorant people to provide and manage your budgets, don’t you? I’ve worked as an IT budget holder in those sorts of firms/orgs, and in the other more enlightened variety, and I know which works better (in both company and personal outcomes). Lay off this ‘bean counter’ bollocks.

          1. Anonymous Coward
            Anonymous Coward

            Re: Credit where it's due

            Downvoted because even though I work in a non-IT company, the wonderful beancounters are half the problem. They don't understand a fucking thing other than "that's too expensive". So something spec'ed as 'X' gets replaced by some cheaper flimsier crap that doesn't even manage to last a fraction of the time that the desired object would have managed. Over the long term, their decisions actually cost more time, more money, and a lot more employee unhappiness (and in one case, the loss of a contract, but that was conveniently whitewashed)...funny how the bean counters are most interested in the short term gains. Numbers and statistics can be massaged, and that's what they are best at isn't it?

            I get that somebody has to look after the budget else the employees would all want their own personal coffee machines, but that shouldn't take more than one on site accountant (on-site so they have hands-on experience of what they're talking about). As for all the handsomely-paid bean counters in head office? Fucking parasites, the lot of them.

            [ought to be pretty obvious why anon]

        2. Confuciousmobil

          Re: Credit where it's due

          Exclude the bean counters? Are you sure they weren’t the ones responsible for this bug? Have you seen their share price since the incident?

          I’m sure the bean counters couldn’t be happier.

    2. katrinab Silver badge

      Re: Credit where it's due

      The whole point of Fastly and similar services is to provide decentralised design?

      Obviously it failed.

      1. Greybearded old scrote Silver badge

        Re: Credit where it's due

        Well Fastly's internal architecture may qualify as decentralised. But such a large slice of the web using the same company? Not so much.

        1. Jamie Jones Silver badge

          Re: Credit where it's due

Yes, the blind rollout of any config / new software to all nodes is its own single point of failure.

          1. John Robson Silver badge

            Re: Credit where it's due

            Hardly a blind roll out - the bug had been rolled out months before, it wasn't until *customer* configuration tripped a specific set of circumstances that it caused an issue.

            And the fix is still being rolled out, so hardly a rollout to all nodes...

      2. Anonymous South African Coward Bronze badge

        Re: Credit where it's due

        My thinking as well.

        If one site managed to create a bork this big, then something's wrong with our designs.

      3. Anonymous Coward
        Anonymous Coward

        Re: Credit where it's due

Without being able to provide examples specific to this incident in the absence of a deep dive, there are whole classes of failures that can and do wipe out decentralized systems without needing a single point of failure. These are in fact the bane of large-scale systems, as they are often quite subtle as well.

        Redundancy is only one step, and has a price you pay in complexity and reliability that you trade for availability. Running a datacenter full of these systems is already pushing hard on the limits of manageable complexity. Now scale that up to a global cloud.

The scarier class of bugs are the ones that show up when all of that redundant gear starts talking to each other. One of the nastier ones I saw involved a message-passing library passing an error as a message. There was a bug that caused that message to crash the receiving machine, generating another error message. In a non-HA environment this would propagate around and probably crash a third of the message daemons in the cluster. In an HA environment, as the system tried to restart the failing processes, waves of crashes were constantly being broadcast, till not only was the message-passing code wrecked, a bunch of the HA stuff flooded, triggering a second, similar problem on the management network that took down most of the stuff across the DC till someone started dumping traffic on the network core.

As a consequence of a lack of those kaiju-sized problems, my on-site deployment has had better downtime numbers than most of the major cloud services we use, including Google, for 8 of the last 10 years. That said, some of our systems aren't even redundant, and the org can tolerate them being offline long enough for us to grab a spare server chassis, rack it, and restore its services.

        Going on a SPOF hunt isn't always the best use of resources. Now at the scale Fastly is at, there really shouldn't be many (for example, they probably only have one roof on their datacenters).

        1. Martin an gof Silver badge

          Re: Credit where it's due

Isn't this why decentralised / redundant / fail-over / whatever used also to involve multiple completely independent vendors? Like the aircraft fly-by-wire systems with three independently-designed units, so that an inherent bug is highly unlikely to be present in any of the others (though I suppose it depends how many standard libraries they pull in in common!).

          Or the practice with a multi-disc NAS of buying discs at the very least from different production batches, or preferably from different manufacturers? Or putting the redundant servers in a physically separate location, not in the (much more convenient) next rack along (holds hand up in guilt)?

          I'm sure it would add major amounts of complexity and cost to an outfit as vast as Fastly (or Cloudflare or Akamai) - when you need to make a configuration or software change, you can't just make it once and "deploy to all", you have to make it twice or three times (develop and test twice or three times) and "deploy to subset", but the alternative involves the clients somehow making Fastly and Cloudflare work nicely together and as someone with next to no network engineering experience I don't even know if that's possible.

          And it was fixed within the hour. I was looking for information about a particular network security outfit and got an error. I did something else for a bit, and when I came back the error had gone.
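The fly-by-wire arrangement described above is essentially triple modular redundancy: run independently built units and take a majority vote, so a bug inherent to one implementation gets outvoted by the other two. A toy sketch in Python (all the unit implementations are hypothetical):

```python
from collections import Counter

def majority_vote(results):
    """Return the value most of the redundant units agree on.
    A single buggy unit is outvoted by the other two."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority - units disagree")
    return value

# Three independently implemented units compute the same control output;
# unit_b has an inherent bug and returns the wrong answer.
unit_a = lambda x: x * 2
unit_b = lambda x: x * 2 + 1   # buggy implementation
unit_c = lambda x: x * 2

print(majority_vote([u(21) for u in (unit_a, unit_b, unit_c)]))  # 42
```

As the comment notes, this only helps if the implementations really are independent; a bug in a shared library defeats the vote.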


        2. heyrick Silver badge

          Re: Credit where it's due

"there are whole classes of failures that can and do wipe out decentralized systems without needing a single point of failure"

          Indeed - didn't a ship passing under power lines in Germany take out most of Europe's electric grid? 2006 or something?

          I went dark as a result of that (northwest France). I cycled up to the neighbour to check that their power was off (so not just me) and noticed that there were no lights in a nearby town. So I went back home, lit a candle, and made a tea by boiling water in a saucepan.

    3. Xalran

      Re: Credit where it's due

To answer your question, one single word can be used:


Now for the details:

      It costs less for the numpties to use a 3rd party to host their stuff than buy the metal, own the datacenter and manage the whole shebang. ( especially since the metal will be obsolete in no time and will have to be regularly replaced ).

      In the end it all boils down to the opex/capex excel sheet ( where you can have high opex, but capex should be kept as low as possible. )

      1. Anonymous Coward
        Anonymous Coward

        Re: Credit where it's due

        It's cost vs risk.

        Clearly they need to move the cutoff point here.

      2. sammystag

        Re: Credit where it's due

        To answer your answer in one single word - bollocks

While it has its risks, cloud providers have massive advantages of scalability, expertise, etc. An in-house team of sysadmins is unlikely to be able to keep up in the same way, be available around the clock, not be ill, not go on holiday, not be a bit shit at their jobs, and be up to date on every patch and new feature. I've had far less trouble on AWS than I used to frequently have with teams of incompetent DBAs, Unix admins stuck in the eighties, etc - all blaming the devs of course and putting up roadblocks to getting to the bottom of things.

        1. Anonymous Coward
          Anonymous Coward

          Re: Credit where it's due

          .. but if they fail they don't just rip a hole in your operations, but those of a lot of other companies too.

In addition, if something doesn't work in my server room I can immediately dispatch someone to get it moving again. The support of most of these cloud companies tends to be a tad, well, nebulous in case of calamity, because they use every trick in the book to cut down on client interaction as it represents a cost.

To decide if the cloudy stuff is for you, assume first that things will go t*ts up and evaluate if you can handle the outage. If yes, go ahead. If not, avoid. Basic risk assessment: get your criteria, stick a number on them and work it out. It also forces you to face the risks upfront instead of having them lying dormant in your infrastructure until it goes wrong.

    4. steamdesk_ross

      Re: Credit where it's due

      "(Looks sideways at the employer's GitHub repository.)"

      But the whole point of using GitHub is that everything is distributed and generally recoverable - you can lose the master but not lose your history. Of course, if you're the kind of person who pushes *out* to production from the repository server instead of pulling *in* from the production servers themselves then it could make life a bit trickier for a short while.

      I thought it was a pretty fair response. As I read yesterday, attributed to A C Clarke, "You might be able to think of a dozen things that can go wrong. The problem is, there's always a thirteenth." It's not whether things go wrong that matters, it's how you deal with them when they do. Of course, you should avoid like the plague a software developer who gets more than their fair share of things going wrong ("You make your own luck").

Still, "best practice" and all that. Real programmers know that hiding behind a framework like ISO9000 multiplies your dev time and costs by a huge factor but usually doesn't actually decrease the "incidents" count as much as hiring better developers in the first place does (and ensuring that the client listens when the devs say something can't or shouldn't be done...). We have a running joke in our office that "best practice" is just whatever the next wannabe has tried to sell as the unique factor in their company's operations - then we wait 18 months for the client to come back to us, tail between legs, after one too many long-duration outages, or dealing with a pile of complaints that all say "the new system is so much slower than the old one" or "I *need* this feature that is no longer there".

    5. Michael Wojcik Silver badge

      Re: Credit where it's due

      Looks sideways at the employer's GitHub repository.

      Ah, GitHub.

      Linus: OK, here's an open-source distributed source-control system. It has a data representation few people understand and a user-hostile interface that periodically requires arcane incantations like "git gc --prune=now" to continue functioning under perfectly normal use. But the important thing is that it's decentralized, which is exactly what we want for the Linux kernel.

      Everyone else: Great! This would be perfect if only we centralized it.

  4. cantankerous swineherd

    internet unavailable

    not many dead

  5. wolfetone Silver badge

    If StackOverflow didn't use Fastly

    Then the Fastly guys may have been able to fix it quicker?

    1. Peter X

      Re: If StackOverflow didn't use Fastly

Flip-side, they could've been called Slowly. So all things considered... ;-)

    2. Michael Wojcik Silver badge

      Re: If StackOverflow didn't use Fastly

      Headline: StackOverflow Down; Software Development Comes to a Halt

      (To be fair, SO is handy in the same way Wikipedia is: when you don't really care about the accuracy of the answer because it's just idle curiosity, or when you just can't think of some specific term and you need your memory jogged, or as a starting point to find what you need to look up in more-reliable sources. Unfortunately it will always be popular with the copypasta types.)

  6. Primus Secundus Tertius

    Design "reviews"

    It seems to me that so-called design reviews are just a box-ticking exercise so that an activity in the management plan can be marked as completed. Nothing really happens until the whole system falls over.

  7. Vulture@C64

It looked and sounded like a BGP fat-finger error again . . . but the cover-up sounds much better than 'an engineer mis-typed a subnet mask'.

    1. Robert Grant

      I assumed it was VCL.

    2. Cynic_999

Nah. If that were the case I don't see any advantage in making up a story that amounts to a similar (or worse) amount of culpability, but which risks being exposed as a lie. Their explanation sounds perfectly plausible to me, and also nothing that I can fault as involving a large amount of inadequate planning, carelessness, recklessness or stupidity. They also responded promptly and corrected the issue within a perfectly reasonable time. Of course, hindsight always allows us to do things better.

    3. Jellied Eel Silver badge

      Be Gone Protocol

      Not sure it would have been BGP. Thinking being that-

a) They're depressingly common, so avoidable

      b) It was a customer configuration that borked things

      So I don't know how Fastly built/runs their network, but generally it's a ReallyGoodIdea(tm) to limit any BGP advertisements outside of their domain/instance to prevent routing fun.

  8. tip pc Silver badge

    95% restoration in 49mins

    I think that should be applauded and beers all round.

    In complex environments it can take 49 minutes just to establish what’s wrong, especially if it’s a bug, let alone resolve and bring systems back on line.

    1. Anonymous Coward

      Re: 95% restoration in 49mins (plus uptime)

      I agree and applaud Fastly for the speedy correction.

But lost in the other posts is the fact I can find nothing (searching El Reg) about Fastly being down since a 2016 DDoS attack that downed managed DNS provider Dyn, which affected Fastly.

So we're talking 0.0001% downtime (1/(24*365)) which is to say 99.9999% uptime. This reliability is why customers pay for the service.

      1. Anonymous Coward
        Anonymous Coward

        Re: 95% restoration in 49mins (plus uptime)

        Isn’t the downtime 1/(24 x 365 x 5)

        The 5 coming from the time since 2016
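Taking the correction into account, the back-of-the-envelope works out like this (assuming, per the thread, a single 49-minute outage in roughly five years):

```python
# Figures from the thread: one 49-minute outage in ~5 years.
outage_minutes = 49
period_minutes = 5 * 365 * 24 * 60

downtime_fraction = outage_minutes / period_minutes
uptime_percent = 100 * (1 - downtime_fraction)

print(f"{downtime_fraction:.7%} downtime")   # roughly 0.002% downtime
print(f"{uptime_percent:.4f}% uptime")       # roughly 99.998% uptime
```

So "four nines and change" rather than six nines, but still a record most in-house operations would envy.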


  9. Mike 125

    Fastly 'fesses up'

    Jeeez all the congrats on here??? WTF????

    "Even though there were specific conditions that triggered this outage, we should have anticipated it," he wrote.

    What utter gibberish. Show me an outage without specific conditions. Everything is a specific condition.

    The shares haven't budged. There's a significant perverse incentive for an outfit like this to occasionally 'prove' just how 'critical' it is. It needs to really hurt in the wallet when this happens.

    Also, where is the testing? Where is the evidence of the ridiculous uptime guarantees they make? Redundancy is a thing. Reliability engineering is a thing. Fault tree analysis is a thing. And if hardware can do it, so can software.

    Slowing down, Ok. Dropping out of the air stone dead- not acceptable.

These outfits are a joke. If we're serious about critical infrastructure uptime, they need to buck the 'uck up.

    1. Cynic_999

      Re: Fastly 'fesses up'

      The whole infrastructure has reached a level of complexity where it is unreasonable to expect to anticipate and/or test every eventuality. The best we can hope for is for a swift reaction to and fix of unforeseen problems - which I believe was achieved here. Yes, of course all the things you mention are "a thing" - but they do not guarantee 100% reliability by any means. Things are of course a lot more obvious with hindsight.

      1. Mike 125

        Re: Fastly 'fesses up'


        Then these outfits must stop claiming the whole edifice is anything other than a flaky, fingers-crossed, hackjob.

        1. Gene Cash Silver badge

          Re: Fastly 'fesses up'

          > the whole edifice is ... a flaky, fingers-crossed, hackjob.

          Um, isn't that the internet in a nutshell these days?

        2. logicalextreme

          Re: Fastly 'fesses up'

          I think you'd be hard-pushed to find anybody that's good at IT that didn't claim that their entire career output is a flaky, fingers-crossed hackjob. That's why it's fun.

      2. Adelio

        Re: Fastly 'fesses up'

        Any changes should be rolled out in phases over a period of months.

        A Pain I know but then if a bug IS introduced it hopefully will affect less of the system.

        I know more testing could have been done, but it is almost impossible to test every possible permutation, especially in a reasonable amount of time.

        1. John Robson Silver badge

          Re: Fastly 'fesses up'

          This wasn't a change they made - it was a customer configuration change which triggered a bug which had been rolled out gently and tested... months before.

          1. Anonymous Coward
            Anonymous Coward

            "triggered a bug which had been rolled out gently and tested... "


            rolled out with self-evidently insufficient design review and rolled out with self-evidently insufficient testing.


            Also, lots of software/hardware mishaps show symptoms long before catastrophic failure. Often these symptoms are conveniently swept under the carpet - "yeah, it always does that, wish we knew why, reboot and it'll be fine" sound familiar at all?

        2. YetAnotherLocksmith Silver badge

          Re: Fastly 'fesses up'

That you don't realise that what you're suggesting is impossible is worrying.

          "rollout patches over months" can't work in a web environment where patches are reverse engineered to find the core vulnerability within hours, and the threat du jour is hammering a million firewalls and gateways within the hour via automated weakness scanners.

          Oh, and there's a patch or update every single day on any complex system these days - Windows itself does it once a week on Tuesday, what day does every other service, daemon and driver get updated? And you want to wait a week?

    2. logicalextreme

      Re: Fastly 'fesses up'

      Is the infrastructure actually that critical though? As far as I can tell this CDN was supporting numerous websites, and I can't think of many websites that are critical in a life-or-death sense — especially given the timeframe in which this was fixed. DNS issues often hit harder and for longer.

      I'd expect any site or service which was so critical as to require uninterrupted availability to have contingency measures in place that meant that there was tolerance for the primary CDN going down. Sure, this could maybe have been avoided given enough resources (cash), but one of the safest ways to keep stuff running is to never allow any deployments ever again — which is fairly untenable. All any of us can do is to mitigate risk the best we can, and know how to lessen the impact when we fail to preempt something.

      At a certain point I don't see there being much difference between stuff like this happening and e.g. the pandemic, or a natural (geological) disaster, or even just heavy snowfall — except as far as I'm aware this didn't kill anybody, it just presented mild inconvenience to people trying to read the news or buy tat from Amazon on their coffee break.

      I don't wish to come across as condescending or a know-it-all; if there are truly critical services that were impacted as a result of this then I'm up for hearing about them — I'd just expect such services to know how to react to situations beyond their control. We got by pretty well without the web before it existed.

      1. logicalextreme

        Re: Fastly 'fesses up'

        And just to add to that — when it comes to uptime guarantees, that's what contractual SLAs are for. If they aren't met, a penalty is usually incurred. If there's no penalty written into the contract, or no SLA, then that's on the folks signing the contract. It's a fairly crucial concept in IT that I first learned about as a first line service desk bod, though I imagine it existed in other industries (tech and engineering, say) earlier.

        If Fastly consistently failed to hit their SLAs, I'd expect that people would stop using them and use somebody else. If all the alternatives were just as bad, I'd expect some irate customers to step up and provide their own alternative that learned from the mistakes they'd seen. It gets a bit different when a provider has a monopoly, but we've (nominally) got laws to protect against such things.

        1. batfink

          Re: Fastly 'fesses up'

          Exactly. If there's an SLA, and an outage breaches it, that should cost money, which prompts the delivery company to spend enough on prevention to make it unlikely.

          If there's an outage but still within SLA, well you're getting what you paid for, so that's fine (although see the Galileo debacle a couple of years back*).

          If there's no SLA, then STFU. You'll get what you're given. If you're too tight to pay for one then you have no redress and should not have any expectations of a reliable service.

          *Galileo: whole global navigation service completely down for a week, and this was still within SLA. Genius work by whichever service manager negotiated that one.
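The SLA mechanism described above can be made concrete: a contract typically maps monthly uptime to a service-credit tier, so a breach automatically costs the provider money. A sketch with invented tiers (real SLAs vary per contract):

```python
def service_credit(uptime_percent: float) -> float:
    """Return the fraction of the monthly fee refunded as credit.
    Tiers are hypothetical - every real contract defines its own."""
    if uptime_percent >= 99.99:
        return 0.0    # SLA met: no credit
    if uptime_percent >= 99.9:
        return 0.10
    if uptime_percent >= 99.0:
        return 0.25
    return 1.0        # month effectively lost: full credit

# A 49-minute outage in a 30-day month:
uptime = 100 * (1 - 49 / (30 * 24 * 60))
print(round(uptime, 3), service_credit(uptime))  # 99.887 0.25
```

Which is the point: if the penalty tiers are toothless (as in the Galileo example), the provider never feels the outage in the wallet.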

      2. Anonymous Coward
        Anonymous Coward

        Re: Fastly 'fesses up'

        Organisations supporting the rights of EU citizens in the UK have pointed out that, among the sites that were offline, was the UK Government's "proof of settled status" site.

        Perhaps not "life or death" but certainly pretty critical for someone entering the UK at the time, or having a job or housing tenancy application (or whatever other things are also on the list that the "papers please" zealots have dreamed up) processed by someone at that time. "Sorry, you have been thrown in a holding cell/deported/your application has been rejected because the internets was down" is really not a good look.

        1. logicalextreme

          Re: Fastly 'fesses up'

That's the sort of thing I was after, thanks. According to the Graun they can switch to AWS manually, but didn't choose to do so as the outage was short. I know our Government's a bit wank, but I'd hope that in the event that someone's proof of settled status was unavailable due to an obvious network issue they'd default to "wait" rather than "deport".

          1. Anonymous Coward
            Anonymous Coward

            Re: Fastly 'fesses up'

            You've not met our Home Secretary then? Once she's got the laws changed how she wants, "they" will be lucky if they get shot in the street for littering while a suspected Immigrant.

    3. AndrueC Silver badge

      Re: Fastly 'fesses up'

      "Jeeez all the congrats on here??? WTF????"

      Ah, Mr Perfect has shown up to tell us How It Should Be Done.

      Well done, sir.

    4. DS999 Silver badge

      Let me guess

      You're a PHB who denies budget requests randomly to be seen "cutting the fat" by your boss, then screams at your subordinates for issues that could have been prevented if that line item had been approved and tells them all the ways they should have done quality work despite your micromanagement and constant meeting invites where you can tell everyone how smartly you would have done things.

      1. logicalextreme

        Re: Let me guess

        If you're not claiming that £5…

    5. maffski

      Re: Fastly 'fesses up'

      I notice they don't bother to mention what the 'customer configuration change' was. If I was cynical I might be inclined to think it was something so simple they'd be embarrassed by revealing it.

      1. Dave559 Silver badge

        Re: Fastly 'fesses up'

        "I notice they don't bother to mention what the 'customer configuration change' was. If I was cynical I might be inclined to think it was something so simple they'd be embarrassed by revealing it."

        If that's the case, it could easily be any one or more of [missing|extra] [semi-colon|comma|space between parameters|line-continuation character], unquoted or otherwise not-properly-escaped string containing spaces or other special characters, etc, etc, etc.

        With the best will in the world and trying to always get these details right, I'm sure most of us here have probably made that sort of typo/mistake at least once?

  10. I Am Spartacus

    Fastly cost savings

    I thought I would look at Fastly to see if they could host my company's website.

    eCommerce - Check

    Small business - Check

    Savings of £150,000 over a three-year period. Sounds good: that's £35,000 savings per year.

    WAIT ONE - we aren't spending anything like that to manage a web server, three app servers and a database server, with associated firewalls, routers, etc. It's costing us a LOT less than this.

    People need to understand that the "cloud" should really be pronounced "someone else's computers".

    1. John Riddoch

      Re: Fastly cost savings

      Companies like Fastly don't just do web hosting, there's a whole bunch of other stuff going on:

      - DoS protection

      - DDoS protection

      - WAF filtering

      - Content delivery/caching all round the globe

      Can you do all of that for the same price as these companies and scale when some script kiddie takes a dislike to your website and tries to DDoS it? For MOST companies, the answer is no and the vast majority of them couldn't recover within an hour if it failed.

      Yes, a lot of websites fell over at the same time, but I'd bet that if they didn't use Fastly or another CDN company, they'd have fallen over repeatedly over the last year or two for other issues; it just wouldn't have been as visible.

    2. yetanotheraoc Silver badge

      Re: Fastly cost savings

      "Savings of £150,000 over a three year period. Sound good, that £35,000 savings per year,

      WAIT ONE - we aren't spending anything like that to manage a web server, three app servers and a database server. With associated firewalls, routers, etc. Its costing us a LOT less than this."

      Putting on my amateur bean-counter hat, I think you didn't include personnel costs. Even one FTE is more than £35,000 per year, so how can your cost be less? I have no idea if Fastly can deliver the promised savings, though, and I don't think they know either.

    3. John Robson Silver badge

      Re: Fastly cost savings

      Yeah - that saving figure doesn't apply to you...

      That saving figure applies to larger companies.

      You might even be best off doing self hosting and just accepting downtime and complete loss of weekend when it occurs.

      But when an organisation is significantly larger... then having someone else manage the hardware can be a price worth paying - of course you *could* do it cheaper... but you are paying for them to take the risk of overtime and failed hardware, for them to source and maintain UPS, generators, diverse power and data feeds. For them to maintain the building all these things are housed in... the list goes on.

    4. Bonegang

      Re: Fastly cost savings

      Except it's not just "a web server, three app servers and a database server"...

      To meet a CDN's global delivery performance (which is what a CDN is ALL about) you'd need that kit in *every* geography you do business in, supported 24x7.

      If you're only doing business in a single region then arguably you do not need a CDN.

  11. amanfromMars 1 Silver badge

    If queasy, take it easy and just check to make sure it is nothing to do with you.

    Do you know what you introduced and were doing on the internet on 12 May and triggered on 8 June? Have you checked? Do you have accurate records?

    Do you recognise anything strange or untoward and likely to cause any outage of services? If you could, would you realise it and await to see if there were any negative consequences, or pregnant pauses in the case of there being no direct indication or engaging notification of one being thought involved and/or instrumental in the seeding and feeding of an "undiscovered software bug"/novel disruptive means of creative information distribution and vice versa?

    One just never rightly knows nowadays how things are going to pan out whenever so much can be done remotely and practically out of sight in the Harry Limelight.

    1. Logiker72

      Re: If queasy, take it easy and just check to make sure it is nothing to do with you.

      Hi Mars, are you still trying to find patterns in the randomness of half-baked software?

      Why don't you try to find patterns in nature instead? Much more complex and interesting.

      I always find it extremely refreshing to walk in the forest for some hours.

      Cotswolds for you ?

    2. Anonymous Coward
      Anonymous Coward

      Re: If queasy, take it easy and just check to make sure it is nothing to do with you.

      Oh gods, from those links it seems that amfM has a website, and it looks about as comprehensible as his/its utterings here...

      1. Martin an gof Silver badge

        Re: If queasy, take it easy and just check to make sure it is nothing to do with you.

        You mean you actually clicked on a link to a domain called "ur2die4"? Sounds at the very least like a scam "dating" site. Much as I love AMFM's postings and would dearly love to meet the deranged PhD student who created him, and see the hardware he runs on in the corner of El Reg HQ, I wouldn't go near that link, at least, not until a few days have passed without reports of multiple Reg Readers being pwned.


        1. Anonymous Coward
          Anonymous Coward

          Re: If queasy, take it easy and just check to make sure it is nothing to do with you.

          Yes, I did notice that the domain name certainly looked a bit dodgy (or possibly that of an angsty teenager), but curiosity about a site which (from the format of the links) would seem to contain an almanac-like entry for each day (and what wild ramblings such may contain) got the better of me.

          And there's always (always) NoScript to ward away any evil spirits which might reside.

    3. Danny 2

      Re: If queasy, take it easy and just check to make sure it is nothing to do with you.

      "Do you know what you introduced and were doing on the internet "

      When did you last see your father / grandfather backups?

  12. Jason Hindle

    Fastly? We now know who they are, and where they live

    Which, I'm guessing, is a level of exposure Fastly never wanted. Their greatest error was, erm, making a mistake :-/.

    1. Logiker72

      Re: Fastly? We now know who they are, and where they live

      Fastly sounds like a Donald Word.

  13. Logiker72

    DSL Modem, RPI, DynDNS

    That is in most cases more than good enough to run your own Web server.

    No need for a megacorp to control you.

    1. John Robson Silver badge

      Re: DSL Modem, RPI, DynDNS

      Ah yes... assuming you only want to serve a pretty static/small page to a relatively small audience.

      Note, those people aren't the customers of Fastly.

      1. Martin an gof Silver badge

        Re: DSL Modem, RPI, DynDNS

        Of course, there are people who do try to serve large numbers of dynamic web pages from Raspberry Pis, even if it isn't necessarily the "best" way of doing it. Try Mythic Beasts, for example!


        1. John Robson Silver badge

          Re: DSL Modem, RPI, DynDNS

          The Pi can probably do a fair amount, but the DSL line is a second limit on the system.

        2. doublelayer Silver badge

          Re: DSL Modem, RPI, DynDNS

          It's not the Raspberry Pi that's the main problem in the suggestion. Everything involved is prone to failure and indicates a misunderstanding of Fastly. As compared to Mythic's deployment, the suggested system is a lot more fragile.

    2. IGotOut Silver badge

      Re: DSL Modem, RPI, DynDNS

      Sure. I'm sure Reddit will be happy to know that.

      Also, I'm sure my 512k upload will present no issues at all. Just so long as some ass hat doesn't try to attack it a couple of hundred times a day.

    3. doublelayer Silver badge

      Re: DSL Modem, RPI, DynDNS

      That would be much more reliable, wouldn't it? No way the stuff could go down unless it was emphatically your fault, right? Nobody else could fail you. Well, except for your DNS provider, which has also had a period of not working after an attack, your ISP, which could break your service for any number of technical or financial reasons or could cut your connection for running a server they haven't approved (depending on your contract), your power, which could fail because a transformer responsible for your circuit decided it's tired, your HTTP server which could run out of threads pretty fast when someone decided to spam a login page with credentials, your storage because people have been requesting a lot of different files which the small amount of memory can't cache so it keeps going to the storage and wearing it out, or the board itself when it overheats and throttles performance so often that the server isn't running very well anymore.

  14. TheRealRoland

    If it wasn't for those meddlesome customers...

    We would have gotten away with it!

  15. Kev99 Silver badge

    eh-yup. Posting everything you own on the internet is perfectly safe and secure. Nothing to see here, folks. Please move along.

  16. cd

    Fastly Over-rated?

    See title

  17. Boris the Cockroach Silver badge


    there'll be a very nice "Who? Me?" story out of this followed by a nicer "On-call" story

    The internet was designed to avoid single points of failure...... dearie dearie me

    1. EBG


      The internet was designed by people who didn't do Business Studies 101. Commodity supply chains naturally consolidate into oligopolies (typically 3 suppliers). Equally amusing to listen to the crypto libertarians complaining that BC mining is controlled by a handful of consortia.

  18. debater

    Missing the Point Completely

    I feel that headlines such as 'Fastly broke the internet' are a typical case of the media (El Reg excluded) firmly grasping completely the wrong end of the stick.

    In this case what happened---tell me if I'm wrong---is that the CDN that Fastly runs went down for an hour, and that caused thousands of popular websites to fail. Now please go back and read that sentence again carefully. It caused **thousands of popular websites to fail**.

    The fault is not with Fastly at all! The whole tacit deal with CDNs is, and always was, that they are not guaranteed to be up all the time. Some websites have the ability to fall back when fetching anything from a CDN: they fetch what they need from some other source if they can't get it from the CDN. The fault is **entirely** with all the thousands of websites that do not have a fallback mechanism. Presumably they still don't, and nobody is going to fix it.

    Admittedly, HTML5 currently does not mandate any such built-in fallback mechanism in the browser, so website developers have to roll their own or use a JavaScript library. That is a serious failing of HTML5, but it doesn't excuse the publishers of major prestige websites.
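    The roll-your-own fallback described above can be sketched roughly like this (a minimal illustration only: the URLs and the `fetchWithFallback` name are hypothetical, and the fetch function is injectable purely so the logic can be exercised without a network):

    ```javascript
    // Hedged sketch of a client-side CDN fallback: try each source in order
    // and return the body of the first successful response. `fetcher`
    // defaults to the standard `fetch`, but can be swapped out for testing.
    async function fetchWithFallback(urls, fetcher = fetch) {
      let lastError = new Error("no sources given");
      for (const url of urls) {
        try {
          const res = await fetcher(url);
          if (!res.ok) throw new Error(`${url} returned ${res.status}`);
          return await res.text();
        } catch (err) {
          lastError = err; // remember the failure, try the next source
        }
      }
      throw lastError; // every source failed, including the origin fallback
    }
    ```

    A page would call it with the CDN copy first and an origin-hosted copy last, e.g. `fetchWithFallback(['https://cdn.example.com/app.js', '/local/app.js'])`; when the CDN is down, the asset is simply served more slowly from origin instead of not at all.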

    1. Anonymous Coward
      Anonymous Coward

      Re: Missing the Point Completely

      JavaScript is definitely not the tool you want for this job.

  19. Stuart Halliday

    Let's be thankful that decent well trained IT staff found and sorted the bug.

    1. arachnoid2

      Maybe it was a bot that fixed it

      Just sayin

  20. bigtreeman

    measure twice cut once

    We will test our code before we deploy to a critical system.

    For a carpenter, measure twice, cut once.

    1. AndrueC Silver badge

      Re: measure twice cut once

      Measure twice...

  21. Pat Harkin

    it was triggered by a valid customer configuration change


  22. Jeff 11

    Those harping on about the "poor reliability of cloud vs on-prem" don't appreciate what a CDN does and why it's not a simple binary of on-prem vs cloud.

    A CDN is generally in place for two purposes: to mitigate traffic issues for static assets, and to keep those assets as geographically close to customers as possible so as to minimise latency for your worldwide audience. You might be able to achieve the first one yourself with a lot of on-prem racks and fat pipes in a small set of DCs, but the second is *absolutely infeasible* to achieve for even the largest orgs; you would have to have kit in as many locations as the CDNs themselves. Given how critical low latency is (or is perceived to be) in the modern browsing experience, that's why even Google, MS and Amazon use them and were partially affected by the Fastly outage.

    And CDNs like Akamai vastly predate the advent of cloud computing.
