It never gets any less true.
"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
Cloudflare has published a detailed and refreshingly honest report into precisely what went wrong earlier this month when its systems fell over and took a big wedge of the internet with it. We already knew from a quick summary published the next day, and our interview with its CTO John Graham-Cumming, that the 30-minute global …
Regex is wonderful and frightening. I use it to do complex edits and to find errors that proofreading can't catch. Obviously, saving first and Ctrl-Z are both good ideas.
I'm too much of a coward to put regex into production release code, or into a web page, unless it's very simple and doesn't loop. I test it on a test machine first. I've also used it to convert WS3 files to plain text, and sometimes that did unexpected things.
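A classic way a regex edit "does unexpected things" is greedy matching. A minimal Python sketch of the pitfall (nothing here is from the commenter's actual WS3 tooling, it just illustrates the trap):

```python
import re

markup = "<b>bold</b> and <i>italic</i>"

# Greedy: '.*' grabs as much as possible, so '<.*>' matches from the
# first '<' to the LAST '>' -- the entire string disappears.
stripped_greedy = re.sub(r"<.*>", "", markup)

# Lazy: '.*?' stops at the nearest '>', so only the tags are removed.
stripped_lazy = re.sub(r"<.*?>", "", markup)

print(repr(stripped_greedy))  # ''
print(repr(stripped_lazy))    # 'bold and italic'
```

On a one-line test string both versions "work"; on a real file the greedy one can silently eat whole paragraphs, which is exactly why saving first matters.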
I use perl and regular expressions all the time. Just like most other programming constructs, you need to engage brain and think what you're doing.
That doesn't mean I haven't made mistakes and had to debug WTF is going on when things have not worked quite as expected. Usually issues are due to missing some rare (or thought to be nonexistent) case in testing.
Debugging regex issues can be challenging, but that doesn't change the fact that fundamentally they are extremely useful.
TL;DR: you can pry regex out of my cold, dead hands.
Oh, and thanks for the honest explanation, Cloudflare; you certainly have gone up a few notches in my book. Very refreshing.
A regular expression is composed of lots of operations strung together, just like hand-coded assembler or compiler-generated code.
While we might say that compilers have withstood the test of time and that the code they generate is almost (99.999996%) correct, this has only been achieved through years of real-world testing.
True, regexes are hard to read for you mere mortals. But then so is assembler, firmware, C++, Ada, etc. Time to bring back MS BASIC!
And they're not fixing this by not using regexes: they're fixing this by switching to a NFA-based regex engine, so the nearly-exponential explosion no longer happens. (Instead, you can get an explosion in NFA states, but this is statically detectable at regex compile time rather than stabbing you in the face at runtime without warning. Much better.)
this is statically detectable at regex compile time rather than stabbing you in the face at runtime without warning. Much better.
I realise everything is relative but if the choice is blowing away a few toes or getting stabbed in the face I'd rather stay in bed.
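For anyone curious what that near-exponential explosion actually looks like, here is a minimal sketch in Python (whose `re` module is a backtracking engine). The pattern is a textbook pathological case, not Cloudflare's actual rule:

```python
import re
import time

# Nested quantifiers: a backtracking engine tries roughly 2^n ways to
# split the run of 'a's between the inner and outer '+' before giving up.
pathological = re.compile(r"(a+)+b")

def time_failure(n):
    text = "a" * n + "c"   # no 'b' -> the match must fail exhaustively
    start = time.perf_counter()
    assert pathological.match(text) is None
    return time.perf_counter() - start

# Each extra 'a' roughly doubles the work; by n ~ 30 this would hang.
for n in (10, 15, 20):
    print(n, f"{time_failure(n):.6f}s")
```

An NFA-based engine such as Go's regexp (RE2) or Rust's regex crate runs this same pattern in time linear in the input length, which is the trade the comment above is describing: state blow-up that is visible at compile time instead of runtime hangs.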
Compared to other cloud outages, this one is very minor. Not only was it detected and acknowledged quickly, it was also resolved extremely quickly, and the postmortem lets you know exactly what went wrong, in great detail.
Outages happen. If only they were all this pleasant to experience.
When something goes wrong in a complicated system it can sometimes be difficult to pinpoint the problem. On the Dartmouth Time Sharing System we would sometimes get bugs that would show up after months of flawless operation. When we finally tracked down the bug and analyzed what the effects of the bug would be we were surprised that the system worked at all. Checking recent code changes would, of course, not be profitable in such a case.
Debugging under pressure is hard.
Fair play to cloudflare for the openness
Sounds like their S.I.O.F. plan ("shit, it's on fire") worked. It could do with a polish, but a solid B+ in terms of response. To be fair, these sorts of plans are always best guesses, so finding delays in the 2FA email hitting inboxes isn't a massive problem, for example, although I would hope they invest in an additional factor like keyfobs to stop delays in inbox access from killing your network.
Of course I'm looking at this as though they were a normal company and not a pervasive part of net infrastructure, so I'm ignoring the damage done to customers, but even so the fact they kept heads under that pressure and stuck to a plan shows good discipline and training (as you would expect with the responsibility they have)
So yeah, my take is that the response is acceptable; they will clearly be patching holes in the process, and until the plan is put to the test under real conditions you can't know where its deficiencies lie.
The single thing which would have been useful is something more informative than "502 Gateway Error". It took a while to figure out it was "Cloudflare done bad" and not something else.
I am wondering how well an automated roll-back, triggered when something is deployed and the system goes down, would work. I expect there is potential for that to go wrong or to compound the issues. Had such a thing been in place and worked, it might have reduced the downtime to around a third of what it was.
Not that I think the half hour it took was unreasonable. And the near hour for the final fix was not that bad either considering 'we've just screwed up badly, and we don't want another' would have been at the forefront of everyone's minds.
And they certainly deserve credit for offering something more than mere platitudes about how their number one priority is serving customers, with complete disregard for having failed to do exactly that.
For at least being honest.
Usually you get PR fudge that may or may not point the finger at the company itself (holding up its hand and saying 'mea culpa, mea maxima culpa').
This is a good, refreshing post that is clear about 'we fucked up, and we're fixing it this way'.
It's weird that more companies don't do this. I've taught a bit of PR at a business college (even though I have only the slightest experience in the field - it was a shitty college, ok? Oh, and I mean college in the UK sense - not like in the US) and I used to run the sessions by looking at a bunch of case studies, good and bad, and getting the students to come up with lists of best practice for dealing with similar situations. The 'we fucked up' one invariably ended up looking like:
1) Admit something went wrong
2) Admit it was your fault, and say sorry
3) Explain what happened (potentially going so far as to give names, but don't throw people under the bus)
4) Explain what you did or are doing to make this particular mess better (possibly give compensation, but if you do the rest of this well enough, most of the time this isn't necessary)
5) Explain what you're doing to try to avoid it happening again
Obviously, you need to actually do the things you say in 4 and 5, but unless you really have no concern about losing business, you'll do that as a matter of course. Cloudflare did all of this (no idea whether they compensated anyone) and they're getting exactly the kind of forgiving response from their paying customers I'd expect. I really can't understand why any companies still go down the route of issuing vague apologies and payouts: most customers don't actually want compensation, or the meaningless apology. They want the bad thing not to have happened, and if you acknowledge that the bad thing did happen, but explain logically how you'll avoid it happening again, people are usually satisfied with that.
A good rule of thumb is that authorisation to roll out a change includes authorisation to roll it back in an emergency. It shouldn't need someone else to be consulted.
A second is that if things go pear-shaped promptly on rolling out a change it should be rolled back PDQ. Even if the problem was actually something else you're no worse off than you were before and at least you now know it wasn't the change.
However this is the way to handle the PR side - not the self-serving, transparently untrue boilerplate response we usually get. It actually raises Cloudflare's reputation.
We're tempted to use the phrase-du-jour "perfect storm,"
Resist the temptation, regardless of circumstances.
First, that phrase is, at best, du-2000, when the (overrated) eponymous film was released. It's been nearly two decades. Enough.
Second, it's a grating, idiotic expression that contributes nothing of value to the sentence in which it appears, even if it were novel.
Third, it has been direly overused, particularly by the sort of people who would do us all a favor by just shutting the hell up.
Fourth, it lends itself to a ranty debunking by random commenters that obfuscates any real content.
Fifth, some wag will take it upon himself to address said ranty debunking, thereby perpetuating the cycle.
...
<maximum recursion depth exceeded, please reboot universe>