Twitter inaccessible
Always a silver lining
Internet services provider Cloudflare is suffering a major outage that has knocked chunks of the web offline – including The Register. The company acknowledged problems at 1148 UTC on November 18, stating: "Some services may be intermittently impacted." After a long half-hour, it reckoned systems were returning to normal, but …
Hell, I don't even use Cloudflare DNS, but the one I do use - ControlD - won't connect with my VPN this morning for any new connections.
So my phone is still online. But I can't get my PC connected after booting it up.
Yes.
https://www.mobileread.com/forums/showthread.php?p=4549332#post4549332
Solar flares are a risk too if strong enough.
To be clear, there is no evidence that this was the result of an attack or caused by malicious activity.
Indeed the hypothesis in "No Silver Lining" is that it's two separate automatic updates rushed out on a Friday evening. Major Internet disasters are almost always stupidity / accident rather than malice. Except when the next big solar flare hits. That shouldn't affect the internet, but it will, because of people thinking GPS and similar is good for timing. Beancounters saving small amounts. It should only be used for navigation, because it's vulnerable to space weather and human attack.
GPS time is a good NTP source, but you have to ensure your configuration knows what to do when the source goes insane (the mechanisms are built in; there's an assumption that every clock source will break at some point, and that's why you always check multiple sources in your local server).
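In chrony terms that boils down to something like this, a minimal sketch assuming a gpsd-fed shared-memory refclock plus a handful of independent network sources, so a lying GPS gets outvoted (the pool hostnames are placeholders, tune for your own setup):

    # Local GPS reference via gpsd shared memory (assumes gpsd is running)
    refclock SHM 0 refid GPS precision 1e-1 offset 0.0

    # Several independent network sources so a lying GPS gets outvoted
    server 0.pool.ntp.org iburst
    server 1.pool.ntp.org iburst
    server 2.pool.ntp.org iburst

    # Only step the clock for large errors during the first few updates
    makestep 1.0 3

    # Ignore any source that drifts too far from the consensus
    maxdistance 3.0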
You funnel all the investment capital in the tech industry through Silicon Valley, allow them to build moats, then underinvest everywhere else. That's how.
I could build a service similar to Cloudflare and I'm sure a lot of folks here could, Cloudflare is essentially just a sprawling NGINX setup...but we'd never get the funding to get it off the ground.
When you happen to be pointed by a search engine to a far distant site that doesn't pay for a CDN, you soon know it. So the question isn't why sites pay for CDN service, it's why Cloudflare has such a large share of the CDN market. And the answer to that question is simple: unregulated capitalism. I believe that Dr Marx and Mr Engels pointed this out some years ago.
Have one company gating access to a chunk of the net
One?
Given what happened yesterday, and when Amazon went down a few weeks ago, and the Microsoft thing a few months ago there are at least three independent single points of failure which will take down a good chunk of the net. For some of it, any one of the three would take them down!
Routing traffic to your site via Cloudflare has always seemed odd to me. What's the fallback option? Is it easy to switch back to routing requests directly to your service when Cloudflare is glitching or unavailable? If it's as simple as changing an entry or two in your DNS zone then I suppose it's not too much of a problem. Busy sites though may not be able to support the load of doing that if they were using Cloudflare's content distribution.
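For what it's worth, if your authoritative DNS isn't also hosted by the CDN, the fallback really can be an entry or two. A rough BIND-style sketch (the names and the 203.0.113.10 origin address are invented; the short TTL is the important bit):

    ; normal operation: www is a CNAME to the CDN edge
    www    300    IN    CNAME    your-site.cdn-provider.example.

    ; outage fallback: swap it for a direct record to the origin
    ; www    300    IN    A        203.0.113.10

The catch, as noted above, is that the origin then has to survive the unproxied load, and none of this helps if the provider that's down is also hosting the zone.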
> Is it easy to switch back to routing requests directly to your service when Cloudflare is glitching or unavailable?
Well, El Reg was up and down synchronously with some other sites for an hour or two; but now it seems to be consistently up, while the other sites are still down. So maybe they did exactly that?
"but it's gonna cost and require expensive skills"
You must be one of them JS developers.
That measly two grand a month for an NGINX contractor could save you millions in potential lost business and prevent you looking like a wanker when half the Internet is down.
This area of tech is an area that is valued based on how much business you lose in a year to stupid tech outages. Two grand a month to stave off hundreds of thousands or even millions in lost trading, broken SLA terms, compensation payments etc seems like a pretty cheap trade off to me. Where people fuck up is when they forget to value the skillset in this way and base the value on the time they get out of the contractor.
It's mental...deciding to lay off your infra guys is like deciding to lay off the guy who deploys a net every time you stray off the edge of a cliff because you're too busy staring at your phone, and who springs you back to your feet...just because you haven't died stepping off a cliff and you didn't notice the net, doesn't mean they weren't there deploying nets. If nothing is going wrong and they seem to be idle, they are doing their jobs...if your infra team is constantly on it, running around, flapping and busy as fuck...you've got serious problems, not a good infra team.
The money you pay for this skillset is worth multiples of itself because the value is not in what it produces, the value is in what it prevents...it's the single best value technical skillset a business can invest in...developers and sysadmins are ten a penny, the University production lines crank them out by the truck load...you can find loads of them down your couch at a moment's notice, just open your window and throw a rock, whoever it hits is highly likely to be a JS developer, including his dog...good infra guys though, they're worth many times their weight in gold and they are difficult to find...especially the ones willing to pick the phone up after 5pm.
I just wish the folks doing the hiring understood that technical skillsets, whilst part of the same whole are not all the same thing...it's like taking the guy that makes pizza at your local takeaway and splitting his skills in two...you've got the guy flipping and spinning the dough...precisely measuring things out, providing a solid base for the product (infrastructure) and the moron looking at the picture on the wall with his tongue out putting the toppings on, usually 18 massive scoops of sauce and a few strands of cheese if it's NodeJS (development)...then an even bigger fuckwit with a massive paddle standing at the oven waiting to shovel it in, hopefully not dropping it on the floor (devops).
With this in mind, if the base is fucked, none of the rest of the operation works. The developer has nothing to draw on with his crayons, and DevOps can't do anything because they've now got a half-arsed product to deal with, it's all sauce, very little cheese and toppings and no base...can't get it on his paddle.
The worst part is, if DevOps drops the pizza on the floor or if the developers miss some slices of pepperoni...nobody notices...but if the base is shit, everyone complains.
£2k/month?! That's £24k/year and barely more than the minimum wage. You're hired!
But realistically, 3hrs is a blip. There's a good chance that most (but not all) of the lost sales will happen once the internet is back. (If people wanted X at midday, they'll still want X at 3pm.) Most firms aren't making "hundreds of thousands" in that time. And when half the internet is down, nobody is going to blame you too much.
If it keeps happening, then it might become a competitive advantage. But it's not worth it for once every few months.
"And when half the internet is down, nobody is going to blame you too much"
Yeah but not everyone is terminally online...they may not visit enough services to determine that it's a problem affecting multiple sites...as far as they know, it's your fault...and if it happens several times in succession...because you use three providers that had failures in a row...you're just going to look bad.
One week your DNS on Amazon might fail because of US East.
Then your email backend might fall over because of Microsoft.
Finally, your site becomes inaccessible because of Cloudflare.
The man on the street isn't going to blame these individual massive corps, because they likely don't know that they're involved or even care...they're going to blame you. This is what your infrastructure guys are for, to ensure that this doesn't happen by managing the infrastructure at times of failure...implementing redundant systems, switching over infrastructure etc etc.
3 hours is a blip, but to some it's massive...my mother in law was massively pissed off during the Cloudflare outage because she couldn't access some portal or other to book an appointment...she came kicking my door in to talk about the "bloody shit wifi in this place!".
You have an excellent way of insulting everybody. Of course, you clearly intended to insult everyone except infra admins, but you're also insulting them. Do you actually expect that a single infra person can be hired for £24k per year (I'm assuming pounds because $24k or €24k would be lower) and can fix all the problems you would have when CloudFlare goes down if you had originally decided to put them in front of your main systems? Those I work with cost more and have to plan a lot more ahead, including buying other expensive infrastructure in order to do that. You say nothing about how expensive having the infrastructure for that admin to manage would be, since a server swamped by everyone's requests now that there's no CDN to handle it isn't going to be much different from one that can't be reached.
Most businesses that lose millions if their connection goes down for a few hours do have redundant paths, and it costs them a lot more than that to have it. A lot of the internet doesn't have that business model. For example, how much do you think El Reg's outage with CloudFlare's wobbling cost Situation Publishing? A lot of readers would just come back later anyway. That even goes for many companies that involve lots of money. For example, if a video streaming service went down, they've already got subscriptions. Unless it happens so often that they lose subscribers, their customers feel the pain and they don't have a reason to care, and since they can blame CloudFlare this time, they have an excuse that will get rid of most complainants.
But I don't want to get too much in the way of you wanting to believe that you're worth your weight in gold and save millions by your every hallowed key press. One would hope that you recognize that a pizza that's nothing but raw dough is not really what people are going to buy, so though you can't manage the pizza without it, you also can't do so without the toppings or the oven.
Actually, CloudFlare and other CDNs are popular because they reduce the risk of a single point of failure. 10-15 years ago, DDoS attacks launched by script kiddies were starting to cause so much trouble for sites, and even the data centres they were hosted at, that CDNs were a godsend. And things have only got worse since then. Given CloudFlare's track record and the fact that it's really just a thin proxy and not something that runs your applications, I think that's a reasonable risk/benefit approach.
I remember the pre-CloudFlare days, DDoS was pretty much a fact of life for every website. I don't host myself, but I hear it's so bad now that if you're not under a big CDN you're basically guaranteed to get buried in DDoS attacks (and now probably AI data harvesters, too). Not so much CloudFlare's fault for existing as it is that there aren't really many other good options.
I've been suggesting that for years. If you're only aiding them, we turn your power off for a week; if you're a hosting provider... ha, I hope you have tanks of diesel and BP's number, because your electricity won't be back on any time soon.
You'd have to do it in California, because the only judge who (publicly) appears to know anything about computers and networks is Judge Alsup. I expect it would take 6-8 hearings and 2-3 days (if we had the use of private helicopters and jets). We could walk in through the miscreant's front door, or drop a bomb on his apartment block.
This has always surprised me, because unless the GCHQ/Mossad/KGB/DGSE/NSA/CIA/FBI are truly incompetent, they already know who's responsible for every DDoS attack even before the attack ends.
That's an approach we could take. Just wondering, who are your neighbors and do you think any one of them might have committed the smallest of criminal offenses? You see, I really don't like speeding, I consider that terrorism, so I'm planning to drop a bomb on your neighborhood if one of them broke a speed limit. That's not going to cause any problems for you, right? Oh, in case we live in different countries, would you mind telling your government that yes, this does look like a very overt act of war, but it is just normal law enforcement now and they should feel free to drop bombs on anyone over here they think committed a crime?
Some of my neighbors are murderers. Well, families of murderers, because the person who committed murder is actually still in prison.
A different neighbor murdered my former roommate. That person is also still in prison.
Speed limits are set because requiring a very high standard of driving would lower the number of licensed drivers by more than 50%. Since you can see the car speeding (it's rather large, after all), I would prefer that you blow it up while it's speeding.
If you are in a different country and are performing a DDoS on a service in my country...well, that sounds like war to me. Your government will probably drop a bomb on you to avoid having us do it, since we would be much more indiscriminate. What do you think?
When someone is using a half million compromised wifi routers sitting on modern fast fiber home internet connections? You gonna cut them off, on the theory that if you have your equipment p0wned you deserve to be punished? Or do you want to go after whoever is behind the DDoS itself?
Can you PROVE with certainty who compromised the routers? How do you know that's also who sent the command to that botnet to initiate the DDoS command, as opposed to someone who compromised whoever compromised the routers and took over the botnet six months ago? Does it matter to you if that person is just some gun for hire type who takes his orders from someone who he isn't in a position to refuse, like a Putin or a crime lord of some sort? And if you do trace it back to Putin, or Xi, or Kim Jong, or Elon Musk on a ketamine bender, how exactly are you going to put your "take away their electricity" plan into action?
Sure, your plan makes sense if it is some guy sitting in a big datacenter sending out all these packets. You cut off his datacenter. That's not how these attacks work though, not ever.
Let's ignore the obvious implementation problems like not knowing who installed malware on the computers that are doing the attacking. I'm sure that, at great expense, we can take all the law enforcement people who investigate technology-related crime away from their investigations of ransomware and financial theft and crypto scams and put them on to investigating botnets that are DDOSing a technology blog, and then we'll find the culprits. Problem solved.
But are you really advocating the technique of treating a comparatively minor crime as terrorism because we don't like it? What would your opinion be when that got only slightly modified to be a crime you don't think should be a crime, like accessing a site without submitting your identity information as required by the Online Safety Act? Some people probably think that can be treated as terrorism too. Are you willing to go along with that? If you're not, and if you are we have much bigger problems, then this is not an acceptable replacement. DDOS is annoying and should be treated as a crime, but it is not terrorism.
Sorry, with the right tools, I could tell you exactly who installed any malware on any computer on the planet*. Do I have those tools? No. Do they exist? Yes. Groups like the EFF, and some governments, don't like people like me to have those types of tools.
* - Or who is currently paying to use it; I don't think this makes any difference.
I know my little obscure website is under an almost constant barrage of syns, ack back, dead air. They shift around, usually like 45.21.20.23, 45.21.21.25, 45.21.22.100, 45.21.23.220. So a group of 4 in the same /16, and that group of 4 is shifting all over the IP range. Could be on 160.20.xx.xx next time, 179.50.xx.xx, ... LatAm seems to be the latest general origin. So my little machine reacts with a temp block on them that gets released around a day after they stop. The latest /16 I blocked has, I think, "knocked" around a million times over the past 3 days. Just burning up bandwidth. I've never gotten an extortion email, although it could have been bounced; I bounce a lot of obvious crap. And I do get, those knocks may be spoofed packets and might even be someone trying to get me to send ack's to some other target. If it is, I'm doing my part by not sending the acks, I guess. This is all why we can't have nice things. People that suck take advantage.
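For anyone wanting the same kind of temp block that expires on its own, a rough nftables sketch (the table and set names are invented, and the timeout just matches the "around a day" above; the 45.21.20.23 address is the example from the post):

    # one-off setup: a set whose entries expire on their own
    nft add table inet filter
    nft add chain inet filter input '{ type filter hook input priority 0; }'
    nft add set inet filter graylist '{ type ipv4_addr; flags timeout; }'
    nft add rule inet filter input ip saddr @graylist drop

    # when an address starts hammering, park it for a day
    nft add element inet filter graylist '{ 45.21.20.23 timeout 24h }'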
And I do get, those knocks may be spoofed packets and might even be someone trying to get me to send ack's to some other target
A lot of attacks are amplification attacks, where x amount of data sent in a certain way somewhere results in 100x as much data sent to your target. Or maybe they are trying to goad their targets into attacking some middleman they are setting up, and if instead of blocking them you decided to attack them back you'd be playing into their hands.
You never know, because as you say some people suck.
"if you're not under a big CDN you're basically guaranteed to get buried in DDoS attacks"
Yeah not quite. I host a lot of stuff myself on my own servers and haven't ever been DDoS'd. I do see a lot of bot traffic, but that doesn't occupy a lot of bandwidth or resource, it just makes my logs a pain to read.
If you're a larger business not using a CDN...perhaps there is an increased risk of DDoS...but for the most part, not really...though I suppose it depends on how you set up your own hosting...everything I host is behind an NGINX proxy configured with fail2ban and various other things...so as soon as even a whiff of spam hits my server, IPs get blacklisted permanently.
As it stands my blacklist contains around 500 IP addresses, mostly Chinese and Russian, quite a bit from Cyprus (Russians) and a light sprinkling of Eastern Europe.
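For the curious, that sort of setup is roughly a jail.local along these lines; a sketch using the stock nginx filters that ship with fail2ban, with bantime = -1 for the permanent bans mentioned above (log paths and thresholds will vary):

    # auth failures hitting the proxy
    [nginx-http-auth]
    enabled  = true
    port     = http,https
    logpath  = /var/log/nginx/error.log
    bantime  = -1

    # bots probing for wp-login, phpmyadmin and friends
    [nginx-botsearch]
    enabled  = true
    port     = http,https
    logpath  = /var/log/nginx/access.log
    maxretry = 2
    bantime  = -1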
To be fair the primary (perhaps outdated now, given the other services they offer) reason for using Cloudflare was historically very effective DDoS protection. There are very few networks with the traffic capacity and automatic response to handle the largest bursts of traffic. If you handle traffic directly and are a target it can be exponentially more expensive, complicated and time consuming to defend against this.
Well you don't have to use just one CDN...you can use several.
The reason you use something like Cloudflare is to reduce your attack surface and provide some level of edge caching to speed up your site delivery and take load off your backend servers. It is a useful service, however, using just one CDN is not a great plan.
Personally, I think you're better off self-hosting your own content delivery servers using NGINX...you don't need a lot of resources for a good NGINX setup, just bandwidth...a couple of PowerEdge R250 servers set up as a Proxmox failover cluster is more than sufficient. You can handle a lot of traffic with just a couple of NGINX boxes.
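To give a feel for how little config that actually takes, here's a minimal caching reverse-proxy sketch (hostnames, addresses, paths and cache sizes are made up; logging, rate limiting and the fail2ban bits are left out for brevity):

    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=edge:100m
                     max_size=10g inactive=60m use_temp_path=off;

    upstream backend {
        server 10.0.0.11:8080;   # app servers behind the proxy pair
        server 10.0.0.12:8080;
    }

    server {
        listen 443 ssl;
        server_name example.com;

        ssl_certificate     /etc/nginx/tls/example.com.crt;
        ssl_certificate_key /etc/nginx/tls/example.com.key;

        location / {
            proxy_pass http://backend;
            proxy_cache edge;
            proxy_cache_valid 200 301 10m;
            # keep serving stale content if the backend is having a bad day
            proxy_cache_use_stale error timeout updating http_500 http_502 http_503;
        }
    }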
The trouble is, engineers willing to manage NGINX servers in a reverse proxy setup are hard to come by because they aren't cheap. I've done this kind of work in the past but because it's not a tangible kind of work, you quickly become seen as someone that does nothing...despite the fact, you are the first line of defense when it comes to certain types of attack...so even though most of the time, it's all just routine maintenance, that one time in the year when your site is under attack...you really need that guy. That's the point at which he earns his money.
90% of the time, this guy is checking logs and running updates. <-- this is what gets them laid off.
9% of the time, this guy is installing TLS certificates. <-- this is the first thing to fall over when you lay off your NGINX guy.
1% of the time he saved you millions in lost revenue that time you got DDoS'd or a Chinese bot went rampant and prevented your company looking stupid. Increasing customer confidence and preventing cancellations. <-- this is what you were paying him for but were too dumb to notice.
He's not taking the piss. If NGINX is in play, it's probably not a small system and probably has quite a few certificates involved.
I've worked on a lot of systems where automatic renewal just wasn't possible. It's not unusual for an NGINX reverse proxy to be accessible from the internet but to have no access out to the internet, which makes automatic renewals impossible. There is no reason for an NGINX instance to have access to the internet. Not even for updates. You would typically have a separate staging image for that sort of thing where you can roll out updates, test them, then when properly tested you create a new template with the updates applied. Destroy the existing proxies one by one, replacing each as you go and pulling in the configs from an internal git server or something.
This isn't to say you can't have some degree of automation, I have a custom built tool that I use for some customers that utilises a separate box in their infrastructure for issuing and renewing certs from external suppliers...because whilst LetsEncrypt is pretty good at automatically renewing certs on a single box / single entrypoint...it is useless if you have multiple entry points. You should also, always, make a point of inspecting your TLS certs when you know they've been renewed...typically there is some overlap with LetsEncrypt, you can renew a certificate several days before the current one expires, which gives you a window (by design) to check things before you push. CAs rotate their servers fairly regularly, and sometimes it can take client side devices a little time for their local trust databases to update...so you could find yourself in a position where you've rolled out a cert issued by a CA that isn't widely trusted by client devices yet...this has happened a few times for me with LetsEncrypt...particularly with older devices or devices that get infrequent updates.
Hopefully someone, suitably regomized, will be sending an email for the "Who, me?" column in the next few days or weeks about how they managed to thagomize half of the internet this time…
(Otherwise, they'll need to be keeping an eye out for freshly delivered rolls of carpet and poorly maintained lift doors for quite some time…)
That's something I'd love to see in a corporate statement. Instead of the boring "lessons have been learned" in an after-hack / downtime / breach report, the corporate-speak would change to: "New procedures have been introduced to avoid future breaches of our system. Including supplies of rolled-up carpets prominently on display, underneath signs saying Death to all employees who reveal their passwords!"
Thank you very much! It just seemed to fit quite perfectly there!
(In the probably very unlikely event that any commenters are unfamiliar with the thagomizer, all credit has, of course, to go to The Far Side cartoonist Gary Larson, and to the entire paleontology community, who quite clearly do know a funny bone when they see one…)
CEO: “Why’s the entire company offline?”
CTO: “Cloudflare sneezed.”
CEO: “But you said they were the gold standard.”
CTO: “They are. That’s why everyone uses them. That’s also why when they faceplant, the entire internet turns into Victorian London fog. It’s a feature.”
CEO: “Didn’t we have a risk session where you said something about avoiding single points of failure?”
CTO: “Yeah, yeah, but that’s for the little people running Raspberry Pis in their garage. Real enterprises consolidate all traffic through one giant benevolent megacorp, because… economies of scale. Or something. I skimmed the brochure.”
CEO: “But surely we architected a fallback?”
CTO: “Absolutely. If Cloudflare ever goes down, we… wait for Cloudflare to come back up. Solid plan. Industry standard.”
CEO: “So our customers can’t access anything?”
CTO: “Not unless they enjoy watching 522 errors in different fonts. On the bright side, this is the most distributed downtime we’ve ever had. Global reach. Brand consistency.”
CEO: “Should we reconsider relying on one vendor for DNS, CDN, WAF, analytics, TLS termination, routing, edge compute, zero-trust, VPN…?”
CTO: “Look, we put everything behind Cloudflare because we wanted simplicity. Now we have perfect simplicity. Nothing works, equally, everywhere.”
CEO: “So what do we tell the board?”
CTO: “Say it was a spike in ‘unusual traffic’. That phrase is magic. Makes it sound like we know what we’re talking about.”
«CEO: “Should we reconsider relying on one vendor for DNS, CDN, WAF, analytics, TLS termination, routing, edge compute, zero-trust, VPN…?”»
If anyone from the C-Suite spouted that, I would be checking myself in for an urgent psychiatric assessment, as clearly my admittedly tenuous grip on reality had finally failed. As plot dialogue it's not credible. These chaps have trouble working a light switch.
«CEO: “But you said they were the gold standard.”»
Clue: the gold standard was obsolete and counterproductive over fifty years ago.
And real useful, didn't leave me having to guess who was at fault, was very comforting to know that I didn't have a problem.
The only improvement would be "yes, we know", "it's not just your access point having problems", and some ETA for a fix or indication of the scale of the problem as they perceive it.
The worst part of any outage, internet or physical world, is not knowing how long it will last for so you can't make an informed decision on the most appropriate action.
...and then, there are those who believe they have mitigated disaster, having spent shedloads of money, only to find the redundant fiber (power, servers) have been combined by the contractors "to simplify things".
e.g.: both primary and backup fiber exit the building on opposite sides and are then combined in the same trench once they reach the roadside...
>>I thought the original idea of Darpanet was to automatically route packets around breakages...
Routing still does.
Trouble is that the re-routing fails when many routes point to Cloudflare and the target server is behind Cloudflare's infrastructure with no "real world" IP address, so then any re-route attempt will just wind up hitting a Cloudflare gateway... and consequently be, effectively, black-holed.
It was (and still is), but the emergence of the backbones (Tier 1 providers) in the 90s as telcos took over started concentrating the logical paths across the same physical circuits.
Ironically, most telcos had very robust routing systems for voice to ensure that backhoes and friends only ever caused local glitches. That all went out the window when they switched from being service oriented to profit maximisation after the AT&T breakup of 1982. Even non-USA telcos were affected by the change in industry mindset and by the early 1990s the MBAs and quantity surveyors had taken over (always keep your aardvark on a short leash)
Yep. Used to be that The Bell System was a strategic resource (literally...the AUTOVON switches were on the floor below the civilian ones at the nodes) and as such, was required to be robust.
That went out the window with deregulation and the breakup. We can debate the merits for quite a while. I'll just point out that Bell could never quite make the Picturephone work and now we have several choices for video calls. Packet networks, once the processing/switching power and fiber bandwidth arrived, blew away circuit-switched for almost everything, and at a considerably lower cost to the end-user. On the whole, I'm glad it worked out this way, but I do miss The Phone Company's attitude of reliability and quality work (but not the cost).
The original design was also for transmitting useful data using simple HTML, not cat videos with 8 different codecs and umpteen different back-end standards and hundreds of hops, while consolidating most of the resolving with just a handful of monopolies.
Yet here we are. ------------------------------>>>>>>>>>>>>
"Had to go highbrow and read Ars Technica instead."
Umm Arse Technical…Highbrow ?
You are a courageous chap, I give you that.
Eating lunch and taking the risk of reading one of Beth Mole's "challenging" articles and retaining it—your lunch I mean.
Yes I guess Ruined Lunchtime would cover that… quite umm nicely… and your keyboard.
This morning was the morning we identified which network properties have competent network and system administrators.
I was extremely surprised to find The Register was not in that group.
There was no need for a SPOF 25 years ago. The fact that people choose to have them now, blows my mind
YMMV
AAC
That all works fine until cloudflare goes titsup, at which point the competition does too thanks to the extra load
Network connectivity is an odd fish. Unlike analog systems, which get progressively noisier, it works OK right up to the point it doesn't, and then it.... doesn't.
Recovery always takes the load dropping back well below the original breaking point too
I've pulled the "I warned you" on a number of occasions when manglement ignored warnings that the system was getting to the cusp of overloading failure, then went into headless chicken mode when what they'd been handwaving away actually happened. It doesn't matter how much money you throw at it, it CANNOT be fixed "right now" and if the previous requests for upgrades were heeded, it would have cost 1/10 of what it's now going to cost.
The biggest risk from a MBA point of view is X hundred people not being able to do their jobs. If you paint the consequences in those terms they might get the message that spending £100k on upgraded kit is cheaper than losing £500k in an afternoon of people sitting around twiddling their thumbs, let alone any contractual damages that might happen.
Money. It's always money. The only language the C-suite understands. But then people forget that time = money (say £100 / hour / person) when deciding whether or not to spend £300 on something that requires three levels of signoff and four people to check everything about the supplier. Spending £300 just turned into burning £3000 or more.
Except when those providers go tits-up. Like AWS did the other week.
BTW El Reg's lead story today is Azure's been on the receiving end of a massive DDoS attack.
And if you use some other provider on a PAYG basis, how do you move your data/applications/whatever to that provider when your first choice is unreachable?
That would depend on what business you were in. Let's take two hypothetical businesses and see how it would work for them. The alternative in each case, proper redundancy, involves a couple million in expenses:
Stock trading platform:
Cost of being down for two hours: People who must trade now can't. The price changes significantly, so they lose an opportunity. Possible lawsuits, possible rich clients taking their valuable business elsewhere.
Is that higher than 2M currency: Yes.
Do they have backup: Yes.
Tech news website:
Cost of being down for two hours: People who want to read articles have to read them in the afternoon instead of the morning. Maybe a few of them go to some other tech site and read articles there instead, costing you fractions of a penny in advertising revenue you'll never get back, probably less than the amount of ad revenue they didn't get because I've got an ad blocker enabled.
Is that higher than 2M currency: No.
Do they have backup: No.
If your business loses a lot when you're offline, then you need more backup, and surprisingly enough, that's exactly what we see. A lot of online businesses don't lose much if they go offline occasionally as long as people don't expect it to happen very often. At that point, you do want your solution to be economical, because if implementing the ability to switch to something else, switching to it, and using it through the gap costs more than just letting the outage roll through, they'll do the latter. But, if you can make the alternative economical, more people will use it.
Is it me or does the Cloudflare press release message look AI generated?
"We do not yet know the cause of the spike in unusual traffic. We are all hands on deck to make sure all traffic is served without errors. After that, we will turn our attention to investigating the cause of the unusual spike in traffic."
It's not enough text for me to take a guess. The repetition of "spike in traffic" does sound unnatural, so maybe that suggests LLM usage, but since it's three sentences whose only purpose is "We're fixing it, then we'll debug the cause", I don't really know or care how they generated them.
I hope the dns root-server test server network isn't virtualised on VMware, since vmware now syncs time to the hypervisor whether you've configured it to do so, or not. This will cause their test network to return to the actual time instead of the +180days test network time.
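For reference, these are the .vmx options that have traditionally been used to switch off host time sync entirely; whether current VMware builds still honour all of them is exactly the worry above:

    tools.syncTime                  = "FALSE"
    time.synchronize.continue       = "FALSE"
    time.synchronize.restore        = "FALSE"
    time.synchronize.resume.disk    = "FALSE"
    time.synchronize.shrink         = "FALSE"
    time.synchronize.tools.startup  = "FALSE"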
vmware now syncs time to the hypervisor whether you've configured it to do so, or not.
Well damn, that is a mal-feature/bug!
Fortunately, we have no VMware instances (which we know about).
We did all our Y2K testing on real hardware, rather than VMs, to avoid the possibility of a Y2K-related lawsuit in which a solicitor could use the VM-vs-real-test-hardware-difference issue as an attack surface.
Just someone found out there was a limit on the file size and tipped it over the edge?
Being something of a tech know-nothing, what is it with restricting the size of what turned out to be a critical file? Did someone not think of what might happen in that event and put it on a bigger disk? Did we not learn anything from GOV and 64K spreadsheets?
"Given the importance of Cloudflare's services, any outage is unacceptable. "
That is a true statement, but outages will happen. The Internet is facing the issue that there are some very widely used infrastructure providers that are too big to fail. A larger number of providers that are competing would prevent these mass outages. Sure, some sites would be impacted when a mini-Cloudflare went down (sorry The Register) but it would not take down so many other sites at the same time.
Mentioned in the office today that I wouldn't be surprised to see half the internet go down before the end of the decade. Three major outages in what? Two months if that! I seem to recall a BGP cock-up a few years back which brought down the US east coast.
Makes me even more glad I hold an amateur radio licence. At least when the internet inevitably has a nervous breakdown I'll still be able to communicate with other people!
When was the last time the internet went down for a long time before it came back up, and did that affect everything else? Because from my experience, the answers are never and no. Stuff goes down. Unfortunately, a lot of services run on a small number of providers, and that means that small problems make big outages. But that doesn't tend to be a complete collapse of communications in an area. When one ISP fails, the others tend to still be working, and since I have both a wired connection and a mobile internet plan through different providers, at least one of them is likely to be working unless there's a widespread power issue. I have tons of systems through which I can communicate, from email (colo-hosted), Signal (AWS-hosted), Teams (Azure hosted), Google services (Google Cloud hosted), Jitsi (self-hosted in my house), and phone calls. I have never seen an outage that would take them all down.
And, when there are outages, they get fixed. If I can't talk to someone for a few hours because some system is in the way, that's probably not an emergency. None of these outages have tended to affect the things that really are emergencies. To have anything like what you're describing, you would need system failures that spread, even though that's not how tech failures tend to go, and you'd need nobody fixing them even though the companies that make the services need them running to make their profits and thus hire hundreds or thousands of people to fix them in that situation. The situation you describe could happen if there was a concerted attempt by attackers trying to break everything, but even that would likely be harder than you'd think and, if they were doing that, it would almost certainly be as part of an invasion which would be the bigger concern than not being able to call your friends.
Your amateur radio system is not much different. If you're using anything short-range, then you're likely relying on repeaters to get your signal to someone you want to talk to because otherwise you're only slightly longer range than a big megaphone would be. If power fails for those repeaters, you're disconnected from anyone not in line of sight distance. If you're using HF, then you have more ability to communicate directly with the person you want, though you will also need a lot more power at either end for that to work and I question whether you're operating backup generation for high-power HF transmission.
The Internet, designed at its inception as a robust landline communication system for the military, one automatically capable of routing messages around hubs destroyed by an enemy, has fallen into the hands of behemoths intent on filtering out malicious messages and on providing services which once upon a time were in-house for businesses, now situated in clusters of servers located in 'clouds'.
Centralised services, even when physically dispersed, are prone to widely spreading failures. No longer does inconvenience arise mainly from ISP glitches.
Assuming ISPs function normally, and that the physical backbone of the Internet remains robust, one can postulate a future wherein the only people able to use the network as intended will be those engaged with diffuse peer to peer darknets.
They couldn't route around Manchester, because it wasn't the only one down. I spent the morning seeing that Cloudflare Denver was down. And despite what Cloudflare says, it wasn't 'some traffic' it was all traffic for affected sites. It was intermittent at the start, then just stopped trying at all. And for more than just a couple hours.
And I'd like to point out that they haven't resolved all their problems yet. Just now trying to reply to your post I kept getting 'rate limited' error messages. After a few tries I then got a message that The Register had banned my IP for too many attempts. (5?) But I could access any other Reg URL that I tried.
The nature of today's Cloudflare errors points to a fundamental design flaw: a single configuration error was able to take out multiple services, BUT ALSO their admin dashboard which was my only means of mitigating the outage. I could have simply turned off the problem service, but no. Huge fail!
Cloudflare Timeline:
* 2025.07.26: announce big switch: Prod cutover to Rust!
* 2025.11.18: everything collapses
Their announcement is here: "Cloudflare just got faster and more secure, powered by Rust". (archive, also Internet Archive.)
Salted with the usual religious Rust stuff. (Although this was a new one on me -- heresy, surely?: "It’s easy to write code in Rust, which causes memory corruption.")
Ideology/Religion is no substitute for competence or battle-tested code.
And the error ("thread fl2_worker_thread panicked: called Result::unwrap() on an Err value") suggests a level of (first 2 weeks at work) basic coder slop that's sphincter-puckering: chaining without checking.
Recall what I pointed out recently re one aspect of the systemic risk being introduced by forced infestation of core tools by Rust code written by Rust Religious Zealots.
And the source for the Rust change being responsible for or even involved in this is where, exactly? Or might that be your religious objection to it? They've got a description of the cause up now. Maybe you'd like to read it before deciding whether to continue with this argument.
I quoted from their comms, so clearly had already read them...
If you yourself have now caught up, then you will have discovered I was correct: the errors were all in the Rust code and all derived from programmer carelessness, incompetence, or getting caught up in their dreams of Rust. They used a null as a flag of a known state -- failure of an explicit size limit check, where the limit had been deliberately set low to maintain performance by closing out CPU hogs -- then instead of coding the response to that (eg, log a message), they played in their dream of safety and specified to coredump the system if the "impossible" happened. All in Rust.
Triggering that limit was due to an external/nonRust error, true. But the Rust code both triggered that external call/code and "handled" the response. It should have been a no-brainer.
Ah, so your definition of "Rust's fault" is that code written in rust was running when the problem occurred. Not generally the definition most people use, but sure, let's go with it. Anyone who has ever written the wrong condition or failed to check for an error can blame their programming language, because somehow it was supposed to stop them. This is definitely not what a religious view on programming languages looks like at all, trying to crowbar any problem that happens into evidence of that language's inherent badness when it's patently obvious you have no basis for that but deeply want to have.
Strongly suggest you find out what the Rust Community claim re their Magic Wand Of Superior Virtue. Your claims here strongly contradict theirs.
It's like you don't actually know anything about the topic.
Regardless, I'll let you two fight it out.