UTC
I think you mean reports started at 09:58 UTC, or 10:58 BST (10:58 UTC is 9 minutes in the future as I write this)...
A not-inconsiderable chunk of the World Wide Web, including news sites, social networks, developer sites, and even the UK government's primary portal, has been knocked offline by an apparent outage at edge cloud specialist Fastly – though your indefatigable The Register remains aloft. Mid-morning UK time (09:58 UTC) today, …
> It's only people who want access to all those services at once that experience it as a single point of failure.
No, that's not really accurate.
Even if you're only focused on Reddit (to pick one), Reddit is (or was) down - the reason? They built a single point of failure into their setup by using a single CDN vendor rather than a multi-CDN setup (or, alternatively, have just realised there are some metrics their multi-CDN status checker should have been considering).
What you're talking about - the fact that it broke a wide range of services - is a *common* SPOF.
Fastly is still a SPOF for each of these services, regardless of whether any other service was using them. That a large proportion of the internet seems to be down to users is because there's a common SPOF that's just failed.
It's not reddit's job to avoid *common* SPOFs, but it is reddit's job (if they care about service availability) to avoid/mitigate SPOFs in the first place (though, really, cost comes into it too - you can mitigate most things, but it may not be worth the cost to mitigate the edge cases).
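To put rough numbers on that cost/benefit point, here's a minimal back-of-envelope sketch, assuming (purely for illustration) two CDNs that each hit 99.9% availability and fail independently of one another - these aren't anyone's real SLA figures:

```python
# Back-of-envelope sketch: if two CDNs fail independently, a multi-CDN setup
# is only "down" when both are down at once. Figures are illustrative only.
single_cdn_availability = 0.999                      # assumed 99.9%
multi_cdn_unavailability = (1 - single_cdn_availability) ** 2
multi_cdn_availability = 1 - multi_cdn_unavailability

print(f"Single CDN:           {single_cdn_availability:.4%} available")
print(f"Two independent CDNs: {multi_cdn_availability:.6%} available")
# The switcher itself (and any shared dependencies) eats into this, which is
# exactly where the "is it worth the cost?" question above comes in.
```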
AFAIK, Fastly don't offer a white label service (the thing that allows other businesses to present it as their "own" CDN), so from a delivery point of view it shouldn't have affected any others.
But, anyone sane should be serving things like status pages through a separate route, so it's quite possible that some others served theirs via Fastly, and the status page went down while service continued. Fairly small impact.
But, those small impacts can get quite fun once you start thinking about the spread of lots of them - how many companies' build/test pipelines failed because they rely on assets that get pulled down from a site/service that's fronted by Fastly's CDN?
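As a minimal sketch of the sort of mitigation a pipeline could use (URLs entirely made up for illustration): try the CDN-fronted source first, and fall back to a mirror rather than failing the whole build:

```python
# Sketch of a build step that doesn't hard-fail when a CDN-fronted asset host
# is having a bad day. The URLs are invented placeholders.
import urllib.error
import urllib.request

ASSET_SOURCES = [
    "https://cdn.example.com/assets/widget.min.js",             # CDN-fronted primary
    "https://origin-mirror.example.net/assets/widget.min.js",   # fallback mirror
]

def fetch_asset(sources, timeout=10):
    last_error = None
    for url in sources:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.URLError as exc:   # 503s and connection failures land here
            last_error = exc
    raise RuntimeError(f"all asset sources failed, last error: {last_error}")

# asset = fetch_asset(ASSET_SOURCES)
```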
The first twitter post I saw was someone querying github because a dependency was broken.
The BBC posted a link to a breaking news item that I couldn't see because of the errors (503s and connection failures). But after a brief "bbc.com does not exist" it started working. Someone said they'd switched away from Fastly, so I guess they had fallback measures in place, unlike most other places.
I loved The Verge using a Google Docs page to get the news out, and forgetting to disallow editing....
One of the drivers for CDNs is the end user experience. If they are working, websites load more quickly and you've only got seconds before the user goes somewhere else.
The Beeb is big enough and considers itself important enough to spend extra for multiple CDNs and a CDN switcher. Most websites don't. They are content with their users seeing quick page loads for all but a few hours of the year.
It will be interesting to see if Fastly's transparency extends to letting us know what actually happened, but whether or not they do, I don't think it will affect their business much.
Their communications were rather _terse_, weren't they? I'd say, based on the examples I have seen from here and elsewhere, they were so light on detail as to be almost useless.
While I don't think it's fair to expect a running tweet storm during an outage, if they don't post a really good post mortem, it will ding their brand in my view.
Good news is there are other CDNs, and anyone using Fastly should consider being able to use more than one CDN anyway, as some of the others pointed out.
The less they say the less they are liable for.
Also, it’s likely those who were dealing with it were busy dealing with it, and those whose job it is to notify punters didn’t understand the details. Typically when you outsource it’s manager-to-manager comms, and managers typically don’t do techy, which is another reason for the lack of detail.
Who cares now, it’s up :)
re: "Their communications were rather _terse_ weren't they? I'd say based on the example I have seen from here and elsewhere they were so light on detail as to be almost useless."
I suspect their comms were aimed at the customer, who, when it comes down to it, probably just wants their website to work, not really caring about the details behind it. Also, they may not want the customer to know the technical details of their operations.
We don't use Fastly directly, but a few of the sites and systems we use do rely on it. While our systems techs wanted to know the ins and outs of why and how it happened (although I am unsure if they got it), the management wanted to know when it was fixed, so they could send out comms to the users to that effect.
> I suspect their comms were aimed at the customer, who, when it comes down to it, probably just wants their website to work, not really caring about the details behind it.
Yup, it'll almost certainly be that.
Their update on the cause overnight was rather lacking in detail in my view - "it was a software bug" is very vague.
But... the average customer doesn't care that you accidentally introduced an off-by-one or something, so actually their update is probably just fine for the majority of their customers. Whereas I want to be sure it's not something we might accidentally do later, so I want more technical details.
"If they are working, websites load more quickly and you've only got seconds before the user goes somewhere else."
That's actually a misfeature for me. I genuinely thought the web was a much more fun place when you had to wait a few seconds for the interlaced GIFs and progressive JPEGs, that you had carefully created, to do their rendering thing before your very eyes… «wistful sob»
And, of course, such constraints meant there was none of the third-party bloatware and spyware stuffed into web pages then that we have to put up with now.
Yeah, I might grumble if a web page were to take 30 seconds or more to load, but I really couldn't care less about 5 - 10 seconds (streaming content being one of the few exceptions, and even then any sensible browser or app should be caching a good dollop before trying to play it).
If you're watching something load then that can be entertaining in itself. However what usually happens these days is a delay with nothing happening then it all arrives in an instant. If you're staring at nothing happening then you think it's broken and move on. Those gifs resolving themselves would keep you on the page because you could see it was getting there.
"However what usually happens these days is a delay with nothing happening then it all arrives in an instant."
Yes, it's very annoying that browsers these days don't seem to be capable of on-the-fly re-rendering like in the old days. I get that it is probably because most connectivity is way faster than a 28.8kbps modem, so dropping the continual rendering can help save power and stop flickering and such; however, it's not a help at all when the entire site has stalled waiting on pulling some rubbish from this one stone-age site, and all that's on screen is an empty white rectangle. I mean, surely it isn't hard to say "if nothing has happened in half a second, do a render"?
Often that's because what the browser downloaded was an HTML document with no visible content and a horde of scripts which are now busy XHR'ing a dozen different servers for tens of megabytes of pointless crap.
Microsoft (who invented XHR), Google (who popularized it), and designers (who jumped on it like flies on shit) largely ruined the web. RIAs and SPAs are horrible ideas.
Saw first-hand the Beeb's response to the Fastly failure, and the CDN switcher isn't a third-party solution but rather internal DNS that gets managed by change control. The Beeb has a mixture of in-house CDN running from 2 DCs and 2 separate cloud providers, so has 3-way redundancy.
Anon for obvious reasons.
It's really quite difficult and expensive to build your own multi-CDN system and, given that configuration issues become increasingly likely to be the SPOF, even that won't always help, as evinced by the occasional Google SNAFU: Google effectively does run multiple CDNs.
CDNs do take an enormous amount of risk out of the equation by filtering nearly all the aggressive traffic out there, and there is a lot of that!
> It's really quite difficult and expensive to build your own multi-CDN system
It is - at least if you want a sufficiently functional one.
Which is why, just as you'd buy turnkey CDN services, you'd buy a turnkey multi-CDN setup (Cedexis being the obvious example, but far from the only one).
With a *good* CDN switcher, you essentially just have another CNAME in the DNS chain - all CDNs respond to the same host header, and your switcher just routes traffic to different CDNs based on observed/learned status as well as your preconfigured ruleset.
The cost of those switchers is far, far less than the cost you'd incur self-implementing.
There is, though, still a cost associated with that - at a certain point it becomes a business decision: do we spend $xxx, or accept that $edge_case might occur?
> that configuration issues become increasingly likely to be the SPOF
They are, but your CDN switching config shouldn't be changing anywhere nearly as regularly as your edge config (which itself probably isn't changing as regularly as the stack below it, etc).
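For anyone who wants to see what "just another CNAME in the DNS chain" looks like in practice, a rough sketch - it assumes the dnspython library is installed, and the hostname is a placeholder you'd swap for a site you're curious about:

```python
# Walk the CNAME chain for a hostname; a DNS-based CDN switcher sits in that
# chain between your hostname and the chosen CDN's edge hostname.
# Requires dnspython (pip install dnspython). Hostname below is a placeholder.
import dns.resolver

def cname_chain(name, max_hops=10):
    chain = [name]
    for _ in range(max_hops):
        try:
            answer = dns.resolver.resolve(chain[-1], "CNAME")
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            break   # end of the chain: this label has A/AAAA records (or nothing)
        chain.append(str(answer[0].target).rstrip("."))
    return chain

# Typically: your hostname -> switcher hostname -> the currently selected CDN
print(" -> ".join(cname_chain("www.example.com")))
```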
It does, but a smaller and less complex one.
Your CDN service has direct communication with end-users (because they connect in and request content) so:
- If something goes wrong, users are directly (and immediately) impacted
- It has to take a lot of load (obvs)
- Lots of load increases the chance you'll get either cascade failures, or a thundering herd leading to cascade failures
- The whole thing relies on TCP (HTTP/3/QUIC notwithstanding)
- You (and maybe even your sales dept) make config changes semi-regularly
Things aren't quite the same for Cedexis:
- Users query their resolver, there's widespread downstream caching, so load is much, much lower
- That caching also means user visibility of issues is delayed (but, conversely, can be prolonged - it's a double edged sword)
- There's no TCP overhead, everything's UDP (unless you've fucked up)
- Config changes are likely to be quite rare
There's obviously some common scope for screw-ups - software releases are a potential bugbear for each.
But, CDNs also serve a wide variety of business types - small file, VoD, linear video, etc. - it's all HTTP, but optimal caching and delivery approaches (and often, desired reporting) differ greatly between them (even before we get onto built-in optimisers and WAFs).
CDN switchers on the other hand have one main focus - DNS. Whilst the status checkers etc might have more complexity, if you've got a sane ruleset (i.e. some default fallback) the worst case scenario of an issue there _should_ be that you fall through and send all your traffic to the default - still not great, but at least there's some service.
So there's greater exposure to potential bugs on CDNs because they're more complex (software release improving performance for VoD customers just screwed your e-commerce site, sorry).
There is also the matter of trust - are you better moving from trusting Cloudflare to some CDN switcher run by a single guy out of his garage? Hell no. You're going to want a reputable org with established support lines - just as you'd do due diligence on your CDN provider, you'd do it on your switcher provider.
In the years I've been in CDN, I've only once seen a situation where the CDN switcher itself was an issue - https://www.bentasker.co.uk/blog/security/670-spamhaus-still-parties-like-it-s-1999 - even then it wasn't really an issue with the switcher so much as a 3rd party's understanding of modern flows.
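To make the "sane ruleset (i.e. some default fallback)" point above a bit more concrete, a toy sketch - the CDN names and config shape are entirely invented, and a real switcher has its own config format and far more inputs:

```python
# Toy switcher decision: prefer CDNs in order while they look healthy, and
# fall through to a default rather than serving nothing when status data is
# missing or everything looks down. All names here are made up.
RULESET = {
    "preferred_order": ["cdn_a", "cdn_b"],
    "default": "cdn_b",
}

CNAME_TARGETS = {
    "cdn_a": "customer123.cdn-a.example.net",
    "cdn_b": "customer123.cdn-b.example.org",
}

def pick_target(health):
    """health maps CDN name -> True/False/None (None = no status data)."""
    for cdn in RULESET["preferred_order"]:
        if health.get(cdn) is True:
            return CNAME_TARGETS[cdn]
    # Worst case: fall through to the default - not great, but some service.
    return CNAME_TARGETS[RULESET["default"]]

print(pick_target({"cdn_a": False, "cdn_b": True}))   # healthy secondary wins
print(pick_target({}))                                # no data at all: default
```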
Good plan ... except all this does is to shift the SPOF from CDN to CDN selector, so it doesn't solve the problem.
And it introduces significant additional complexity which experience shows is often the cause of problems. If you are serving a simple static page that's not an issue (but then you wouldn't be quoted here). Larger sites need to consider for example test coverage, troubleshooting, logging etc. which are all far more complex on multi-CDN.
Thinking you can just re-point a CNAME is, well, wishful thinking.
And that's not even touching on large services that need to do capacity planning with the CDNs and take selector decisions based on load.
> except all this does is to shift the SPOF from CDN to CDN selector so doesn't solve the problem.
True, except we've just moved the SPOF from a complex TCP stack (your HTTP(s) service) receiving direct user-connections, to a much less complex UDP stack that benefits from widespread downstream caching.
Is it still a SPOF? Yes. But it's also one that is much less complex, and less likely to go wrong.
> Larger sites need to consider for example test coverage, troubleshooting, logging etc. which are all far more complex on multi-CDN.
They are more complex, but they are not "far" more complex.
There might, of course, be an engineering cost in getting to the point that you can do multi-CDN properly, safely, but ultimately your tests should be CDN agnostic, and you should have well-defined troubleshooting workflows that minimise/mitigate the complexity there.
Logging - if pulling logs from different sources is an issue for you, then it's not multi-CDN you've screwed up, it's your logging pipeline. If you can find a CDN provider that doesn't expose an API for you to pull logs via, then you've just found a CDN provider that you shouldn't be using.
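In case it helps, a quick sketch of what "fix your logging pipeline" tends to mean in practice: normalise each CDN's log records into one schema at ingest. The input field names below are invented - every vendor's log format differs:

```python
# Normalise per-CDN log records into a common schema before they reach your
# analytics store, so downstream tooling never cares which CDN served a hit.
# Input field names are invented for illustration.
def normalise_cdn_a(record):
    return {
        "ts": record["timestamp"],
        "status": int(record["resp_status"]),
        "bytes": int(record["resp_body_size"]),
        "cdn": "cdn_a",
    }

def normalise_cdn_b(record):
    return {
        "ts": record["time"],
        "status": int(record["sc-status"]),
        "bytes": int(record["sc-bytes"]),
        "cdn": "cdn_b",
    }

NORMALISERS = {"cdn_a": normalise_cdn_a, "cdn_b": normalise_cdn_b}

def ingest(source, records):
    return [NORMALISERS[source](r) for r in records]
```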
> Thinking you can just re-point a CNAME is, well, wishful thinking.
Now you're misstating what I said.
'With a *good* CDN switcher, you essentially just have another CNAME in the DNS chain - all CDNs respond to the same host header, and your switcher just routes traffic to different CDNs based on observed/learned status as well as your preconfigured ruleset.'
That's not saying "you can just repoint a CNAME".
> And that's not even touching on large services that need to do capacity planning with the CDNs and take selector decisions based on load.
The ability to configure that is built into pretty much every good CDN switcher.
Yes, you need to do some planning with the CDNs themselves to discuss what load you're going to send, as well as getting details of how they expose metrics for your switcher to consume. If you're big enough to need a switcher though, then you're more than big enough to be having those conversations.
This stuff _really_ isn't as complicated as you seem to think - it just needs a bit of planning and forethought.
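For the load-based selector decisions mentioned above, a toy sketch - the capacity and load numbers are invented, and a real switcher would consume them from the CDNs' metrics APIs rather than hard-coding them:

```python
# Weight new traffic towards whichever CDN has the most headroom against its
# committed capacity. Numbers are made up for illustration.
import random

CDNS = {
    # name: (committed capacity in Gbps, current load in Gbps)
    "cdn_a": (400, 390),   # nearly at its committed ceiling
    "cdn_b": (200, 80),
}

def weighted_choice(cdns):
    headroom = {name: max(cap - load, 0) for name, (cap, load) in cdns.items()}
    total = sum(headroom.values())
    if total == 0:
        # Everyone is at capacity: fall back to an even split rather than nothing.
        return random.choice(list(cdns))
    return random.choices(list(headroom), weights=list(headroom.values()))[0]

# Most new traffic lands on cdn_b until cdn_a's load drops back.
print(weighted_choice(CDNS))
```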
I find it amusingly ironic that ElReg currently has a user survey asking folks why they may or may not use "the cloud" yet. It's stories such as this one that answer their questions far better than anything I could post.
Why don't I use the cloud? Because the moment my data is in the hands of a third party, it's no longer my data.
"Why don't I use the cloud? Because the moment my data is in the hands of a third party, it's no longer my data."
So you build your own connections to every single user rather than relying on existing infrastructure? Wow, that must cost a pretty penny.
Pretty sure that for most users of these services it is much cheaper, and probably better, than rolling their own. Building a CDN is hard, and having to do so for just one site is prohibitive.
It means that a failure becomes really visible, but it probably makes those failures less likely than a whole bunch of disparate CDNs which aren't learning from each other's mistakes.
It's not really price that moves websites to CDNs but things like bandwidth and traffic filtering. Getting more bandwidth (and there are now attacks that have enough bandwidth to take out entire data centres) can be expensive but basically it's the filtering that brings the biggest returns.
For the only thing I do to which that might apply--a convention registration system--360 days out of the year, I am the only user. The other 5 days, it's physically moved to the convention hotel and entering new con members into the system is done by volunteers under my direction.
So why would I put it in the cloud and risk having the system go down because the databases can't be accessed?
I certainly won't claim to have eliminated all SPOFs, but I've gotten down to relatively few and for most of those I have spare equipment to swap in to cover.
The Cloud can be handy for many reasons but mostly because you get excellent performance at a decent price. However you really can put in your own cables and run your own Internet servers from your own office. It's not difficult, you just buy a bit more bandwidth and a few extra servers. After that there is no difference between Cloud hosted or Office hosted servers, they are interchangeable.
We ran our own web site, email, and storage via SharePoint all on-site for many years, at least the 14 I've worked here. The management decided that we needed to shrink budgets, so my dual-internet, dual-site, dual-storage, dual stretch cluster was too expensive. Whilst dual redundant, we only had power outages on both sites to consider for loss of connectivity - this happened once that I can remember. Cloud DNS updated the records should we have an ISP failure, and a lowish TTL meant this wasn't too much of an issue.
Management wanted me to move to the cloud, and we decided that 365 was the best option (since migrating on-site SharePoint to 365 was supposed to be painless). Covid hit, backs were patted as we were already cloudy (which meant nothing as we could already access all the resources externally anyway); however, 365 started to have a few hiccups and outages over the year. Throttling reared its head a few times, other dropouts were noticed. Basically we have had more loss of access in the last 6 months alone than we did over the entire preceding period on-site.
Cloud is definitely not more reliable.
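For what it's worth, the "cloud DNS plus lowish TTL" failover described above can be as simple as a watchdog along these lines - a rough sketch only, with documentation-range IPs and a placeholder where your DNS provider's API call would go:

```python
# Probe the primary ISP's public endpoint; after a few consecutive failures,
# repoint the DNS record at the secondary. The low TTL keeps the switch quick.
# IPs are documentation-range placeholders; update_dns_record() is deliberately
# a stub, since every DNS provider's API is different.
import socket
import time

PRIMARY_IP = "198.51.100.10"
SECONDARY_IP = "203.0.113.10"

def reachable(ip, port=443, timeout=3):
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def update_dns_record(ip):
    # Placeholder: call your DNS provider's API here.
    print(f"would repoint the public record at {ip}")

def watch(interval=60, failures_needed=3):
    failures = 0
    while True:
        failures = failures + 1 if not reachable(PRIMARY_IP) else 0
        if failures >= failures_needed:
            update_dns_record(SECONDARY_IP)
            failures = 0
        time.sleep(interval)
```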
"Cloud is definitely not more reliable."
And neither is on-premise. And I'm speaking as someone who has run their own kit (including in data centres) a lot over the years. What one hand giveth, and all that...
Example: for hundreds of small(er) businesses with limited or no IT budget, or aging kit that keeps falling over, or where there is no-one IT-literate to watch over it, the cloud *may* be more reliable for them *for their needs*.
Just because it isn't for *you* - even if you're doing very sensible stuff because you have knowledge and budget - doesn't make a hard and fast rule. As ever, quoting a particular specific to prove a general point is, well, a bit arse really? Also, if you suffered shrinking budgets then even your on-premise gear may have ended up less reliable over time...
A/C because I'm not arguing this. Everyone's situation is different.
Sometimes when you share you have the thing all to yourself. If you're the only one in the park then it's much bigger than your own back garden. However when everyone turns up to the park your back garden may offer more space.
The thing with Covid is everyone started using the shared service. Internet bandwidth may still have been a problem with your on-site servers, but at least the servers themselves would have been fine.
If your website goes down at the same time as a bunch of other websites, your IT department can say "ah well, happens to the best of us".
If your website goes down but the rest of the internet is working, all fingers are pointing at your IT department.
So clearly, from their point of view, cloud services are a good thing...
Oh, and extra irony: the last XKCD I still had open in a tab, before Fastly took it down along with everything else, was The Cloud, after having followed a link from a Register comment earlier today.
How many more times are we going to see one major CDN issue take out a considerable chunk of websites? Is the justification purely that it saves money and "only happens now and again", or is there no way to have two of these things?
I look at the Fastly status page and it says this, which does not inspire confidence.
"Fastly’s network has built-in redundancies and automatic failover routing to ensure optimal performance and uptime. But when a network issue does arise, we think our customers deserve clear, transparent communication so they can maintain trust in our service and our team."
What happened with the failover? Backup services? I know nothing of any of this so I know I'm being overly simple. Anyone care to educate me?
Well, if they are as 'transparent' as they say they are, maybe wait until we get a post-mortem rather than randomly wonder about it? For all we know their backup and failover systems may have worked brilliantly the last 99 out of 100 times they were needed and we just never knew because they just worked, but just didn't this one time (although if failover had been needed that often maybe something is wrong, but you get my point).
- all eggs in one basket?
- did they get hacked?
- related to FBI action?
- was it Russians, Chinese, Iranians, or North Korea? Or just a mouse that chewed through a cable?
The BBC reported Amazon is down. I then shuddered as my world suddenly collapsed. What am I to do? Gov.uk can be down, and all data spilled, but AMAZON?! We're doomed!
Second appearance in this comment thread: https://xkcd.com/908/
OS 3.2 has just been released, maybe the sites went down while they were updating Workbench.
So I assume it's connected to the Fastly outage.
Hmmm. So this might be a good time to do that emergency maintenance we've been putting off and which will take down the network for a bit.
"Oh Yes, yes I'm sorry the network has dropped off at the moment. Yes it's due to this Fastly outage. Yeah it took down most of the internet and yeah we got hit as well. Dont worry I'm sure it will be up again soon..."
Seems someone tripped on the network cable ...
I started to notice issues this morning, then realised I was using a VPN with presence in the UK.
When I disconnected from the VPN to use the direct local ISP connection in Munich, Germany, everything worked and continued to work all morning.
I checked all the sites reported, like FB, Independent, Guardian, NY Times, etc. and all their sites were up.
Was this a case of inflated-ego little British journos confusing the UK for the world, with their reports of a global outage???
Translation: We made a change during the night in the US and thought no-one would notice if we f**ked it up. Turns out it wasn't the night in other places in the world, who knew?
I bet there were some handsomely paid CIOs shopping for new underwear today...
I'm glad you found one of them has some mental activity going on. Whenever I've tried such logic on people who believe in that or similarly bizarre theories, electrical or technical limitations are always dismissed. Either the conspirators are much smarter than me and know how to do impossible things or I'm just stupid and can't realize that technical things are much better than I thought they were.