It has to be
DNS cryptocurrency. It's always DNS cryptocurrency.
Cloudflare has admitted that one of its engineers stepped beyond the bounds of its policies and throttled traffic to a customer's website. The internet-grooming outfit has 'fessed up to the incident and explained it started on February 2 when a network engineer "received an alert for a congesting interface" between an Equinix …
In the days before Cloudflare existed - indeed, very nearly the days before services like Cloudflare's existed, being as it was almost exactly the time Prolexic got off the ground - my site was DDoSed by an organised crime gang, and our ISP's 'help' was black-holing us entirely, on the grounds that the DDoS against us was taking down all the other clients in the datacentre. To be honest, I understood their decision then and I'd understand it again now.
So while on the one hand, obviously it's bad that Cloudflare did this, all the "oh noes, I had no idea a network provider could do such a thing!" hand wringing is a nonsense. At the end of the day, what else do you expect to happen? All the other customers to salute and go down on the ship with you?
Enough with the conspiracy theories...
Hear hear. Quite unhelpful that the article makes this curious statement:
Actually throttling a customer without warning will likely fuel theories that Cloudflare, like its Big Tech peers, is an activist organization that does not treat all types of speech fairly.
I mean, I suppose some corner of Q or Parler will probably have a whinge, but since they weren't actually suppressing a political outlet, or in any way treating speech (nor suppressing traffic on any basis other than sheer volume), it's not a useful data-point in the general discussion of "Is too much of the internet going through CF, are they becoming a monopoly provider, is it a bad thing that the Venn diagram of 'the internet' and 'Cloudflare' has been trending towards a circle for a while now?" (they'll never get there entirely of course, but there's perhaps more overlap than is healthy in a supposedly diverse and distributed network).
my site was DDoSed by an organised crime gang, and our ISP's 'help' was black-holing us entirely, on the grounds that the DDoS against us was taking down all the other clients in the datacentre. To be honest, I understood their decision then and I'd understand it again now.
I agree to a point... but your datacentre will have had some sort of fair use policy and you were presumably billed on transit, or some portion of your hosting was related to data - whether port speed or data usage.
Cloudflare has a policy too... but you literally sign up with CF to avoid things like DDoS. That's the package (and the entire point) - CDN, DDoS-protection and all-you-can-eat bandwidth (notwithstanding the explicit exceptions for images/video if someone tries to build the next Flickr/Imgur/YouTube behind it). If CF struggles with capacity, that's rather their problem - not the customer's. Of course it seems to have wrong-footed them that this was huge amounts of legitimate traffic with large requests - not a DDoS that they could kill at the edge, which rather stressed their individual link to that DC. But it's still their problem to solve.
All that being said, 3000 requests/sec at 12MB per request comes to... 288Gb/sec, and $200/mo doesn't buy you a 10Gb port on most exchanges, so they were getting their money's worth whilst it was spiking!
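For anyone who wants to check that figure, here is a rough back-of-envelope sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 Gb = 10^9 bits) and takes the 3000 requests/sec and 12MB per request at face value; the function name is just illustrative.

```python
# Back-of-envelope check of the bandwidth figure quoted above.
# Assumption: decimal units, i.e. 1 MB = 10**6 bytes and 1 Gb = 10**9 bits.
REQUESTS_PER_SEC = 3000
MB_PER_REQUEST = 12


def throughput_gbps(requests_per_sec: int, mb_per_request: int) -> float:
    """Return the aggregate throughput in gigabits per second."""
    bytes_per_sec = requests_per_sec * mb_per_request * 10**6
    bits_per_sec = bytes_per_sec * 8
    return bits_per_sec / 10**9


gbps = throughput_gbps(REQUESTS_PER_SEC, MB_PER_REQUEST)
print(f"{gbps:.0f} Gb/sec")              # aggregate throughput
print(f"{gbps / 10:.1f}x a 10Gb port")   # how many 10Gb ports that would fill
```

With binary megabytes the number would come out a little higher, but the order of magnitude is the same.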
To be honest, I understood their decision then and I'd understand it again now.
If I were Cloudflare and had a customer effectively DDoSing the service I'd probably pull the plug on them until the issue is sorted; the needs of the many outweigh the needs of the few, or something like that.
Where things seem to have fallen below expectations is in liaising with the customer and Cloudflare having perhaps not known what had been done to mitigate the issue until the customer started bitching.
So, all-in, a case of the engineer having done the right thing, but not having checked with the higher-ups that it's the right thing to do first.
They owned the issue, admitted it was human error due to procedural omissions and stated they'll make changes to prevent it from happening again in the same manner. They haven't said the fix was completely wrong, but that how it was applied was.
If anything this actually gives some confidence in them and right now seeing the engineer punished wouldn't be worthwhile. If anything I'd say said engineer is now more qualified than any other to address similar incidents going forwards.
All in all, well done Cloudflare for putting this out there.
Teacher icon because every day is a school day.
Actually, I'd say it's just a learning experience. It's an example of the Frame Problem; you can think of the majority of things you need to know to stop something stopping dead in its tracks, but there's always the unexpected. If you decide you won't do anything until you know every edge case and have a solution to everything, you'll never actually get round to doing anything other than searching for ever more unlikely edge cases to evaluate.
Having to make a split second decision in the heat of the moment does not give you the benefit of hindsight, or that extended period of deliberation in calm, amongst many minds.
Aye — and the crucial thing is to make sure that when the unexpected (or even the mildly-expected-but-not-deemed-likely-enough-to-be-worth-coding-for) happens (or happens often enough), that you have processes, architecture, people and a codebase that can be adapted to deal with it using a reasonable amount of time and effort.
I've seen it happen all too often that a line is drawn (perfectly reasonably) under the number of cases that have been programmatically anticipated in order to avoid endless searching for edge cases without ever doing an actual release, but then adding in a new case (edge or otherwise) is so difficult due to rigid processes etc. that it never gets done.
All the cloud providers do crap like this. They say that they have established processes and procedures, but then in any perceived crunch, they go outside of their own established rules that the customers have been told to expect, but then their environment comes to a screeching halt because they did something that was outside of the expected error handling.
Well if I have read the article correctly, I suspect the customer initiated a restore from their cloud backup/archive provider.
I suspect storage providers typically have large amounts of inbound traffic, but infrequent requests for bulk exports and thus large and prolonged amounts of outbound traffic.
Well I mean at least they pledged to improve the process. Yes CF is a corporation and as a corporation has no sense of morality, but I truly believe that the people working there are mostly trying to improve the world. So kudos to them for improving. I just hope they don’t fire the engineers; prior to this there apparently wasn’t a clear procedure, so you can’t fault an employee for not following a rule that doesn’t exist.
That's a phrase that gets misused too often by companies. Hopefully not in this case though.
I've only worked at one place that officially had a Blame Free Culture, and plenty of others that just naturally didn't feel the need to blame individuals for faults and errors unless it was genuinely deserved.
The place with the official policy used to go to a lot of effort to identify exactly who it was that we shouldn't all blame, and made sure that everyone knew who it was that wasn't getting blamed. They even held high-level meetings to discuss the individuals who weren't being blamed, because, officially, they were so caring and people-focused.
jira.domain.com, now offers