"We have taken action with that particular employee"
Presumably involving a shovel, a roll of carpet and two 20kg bags of quicklime.
The sound of rumbling rubber could be heard today as Salesforce threw an engineer responsible for a change that knocked it offline under a passing bus. "We're not blaming one employee," said Chief Availability Officer Darryn Dieken after spending the first half hour of a Wednesday briefing on the outage doing pretty much that …
Might be worth keeping them onboard - they have just learned about a scripting issue, which could mean this never happens again because the tech will be damn careful next time! Replace them with a new tech and the chances are that it will happen again in a year or two.
I was in a situation a number of years ago when an engineer performed an upgrade which resulted in lots of problems. It took a little while to fully resolve, but we did. A good few months later the same system needed another upgrade and the same engineer was assigned - but so was I - I prefer one late night to a week of them.
During this work it was clear that the first device to upgrade should really be done on its own due to the nature of the changes it was making. The engineer then went to the second device and was about to start the upgrade there. I stopped him and said, "Is this what happened last time?" I pointed out that the first device should really finish, and then we could upgrade the others at the same time, once the backend upgrade had finished.
So, no, some do not learn from their mistakes and won't be more careful
"Replace them with a new tech and the chances are that this will happen again in a year or two."
It's OK, they had someone automate something so that if someone wants to do this again they can only do it through the automated system. They won't have the skill or experience to know what they are doing; they will just run some automation.
In the same vein, did this engineer who ran the script actually know what the script did?
Blindly running a script or automation for years that someone else somewhere else wrote or jerry'd up isn't the same as being a skilled operator with many years of experience and the capability to write said script or automation and understand what it does.
In the race to outsource to the lowest bidder, the skill sets of those doing the work plummet.
May also depend on whether they tracked down the engineer at fault (eg they were keeping their head down and hoping to avoid blame), or the engineer immediately put their hand up and admitted they were the cause of the fault.
Assuming they were otherwise competent at their job, IMHO someone who will admit to their mistakes rather than trying to cover their tracks is someone it's worth keeping around. Their replacement could be of the cover their tracks variety, and that could lead to even worse issues when something goes wrong.
Who approved the delivery of change under that emergency procedure?
They also have culpability in approving a change that they didn't understand in terms of impact.
Sometimes you need to look at what drove the behaviour in the first place rather than just blaming the guy at the sharp end.
I suspect that all the engineers use this "emergency" global rollout all the time to get things done quickly. In this case it went badly wrong, and now everyone's pretending they've never done it and just leaving the one person to take all the blame
This is definitely a case of needlessly ambiguous terms.
>Presumably involving a shovel, a roll of carpet and two 20kg bags of quicklime.
Well, that is the company's official emergency management procedure. Preferably implemented before the minion gets a chance to tell their side of the story.
I don't get it? I've been running DNS for about 25 years now, and in my experience it's super rare that a problem is DNS related. I certainly have had DNS issues over the years, but most often the problems tend to be bad config, bad application (this includes actual apps as well as software running on devices such as OSes, storage systems, network devices etc), or bad user. In my experience bad application wins the vast majority of the time. I have worked in SaaS-style environments (as in in-house developed core applications) since 2000.
But I have seen many people say "it's always DNS" - maybe DNS-related issues are much more common in Windows environments? I know DNS resolution can be a pain, such as dealing with various levels of browser and OS caching with split DNS (where DNS names resolve to different addresses depending on whether you are inside or outside the network/VPN). I don't classify those as DNS issues though: DNS is behaving exactly as it was configured and intended to; it was the unfortunate user who happened to perform a query whose results were then cached by possibly multiple layers in the operating system before switching networks, and the cache didn't get invalidated, resulting in a problem.
I know there have been some higher-profile DNS-related outages at some cloud providers (I think MS had one not long ago), but they still seem to be a tiny minority of the causes of problems.
It makes me feel like "it's always DNS" is like the folks who try to blame the network for every little problem when it's almost never the network either (speaking as someone who manages servers, storage, networking, apps, hypervisors etc, so I have good visibility into most everything except in-house apps).
I think the point of "it's always DNS" is that DNS is so fundamental that if you screw it up, bad things happen, and those bad things often take a few hours to resolve due to caching.
If a service vendor such as Salesforce screws up one of their applications, it's crap, but it won't usually take out their entire service; testing of such components is readily possible to reduce the chance, and the change is quickly reversible when bad things happen. As a result even Salesforce don't screw up this kind of thing that often. However, DNS changes are considerably harder to test safely (and, in particular, remotely), and when something goes wrong they are a PITA to fix.
Who sets a super long TTL these days though?
It used to be necessary to have a long TTL because DNS queries could eat up a lot of bandwidth on your own DNS server, but these days DNS traffic is largely trivial unless you're an ISP.
The longest TTL I've seen in recent times is about 10 minutes, whereas way back in the day you'd see up to 24 hours and occasionally even longer.
Even then, most public DNS providers ignore the TTL.
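If you want to sanity-check what TTL a record is actually being served with before you plan a change, a quick lookup does it. A rough sketch, assuming the third-party dnsjava library (example.com stands in for your own zone):

import org.xbill.DNS.Lookup;
import org.xbill.DNS.Record;
import org.xbill.DNS.Type;

public class CheckTtl {
    public static void main(String[] args) throws Exception {
        // Resolve the A records and print the TTL each answer was served with.
        Lookup lookup = new Lookup("example.com", Type.A);
        Record[] records = lookup.run();
        if (lookup.getResult() != Lookup.SUCCESSFUL || records == null) {
            System.err.println("Lookup failed: " + lookup.getErrorString());
            return;
        }
        for (Record r : records) {
            System.out.println(r.rdataToString() + "  TTL=" + r.getTTL() + "s");
        }
    }
}

Run it against your resolver and against the authoritative servers and you quickly see whether the low TTL you set is the one actually being honoured.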
It makes me feel like "it's always DNS" is like the folks who try to blame the network for every little problem when it's almost never the network either (speaking as someone who manages servers, storage, networking, apps, hypervisors etc, so I have good visibility into most everything except in-house apps).
I don't blame the network. DNS isn't the network, it's an app that allows wetware to make use of a network. If DNS is down, the Internet isn't, it's just those cat vids have gone into hiding.
But I have in the past had good cause to blame DNS. Usually when the pager went beep while I was dreaming of new ways to torment sysadmins. Mainly because they controlled the ping to beep software and decided that inability to ping the domain name of a DNS server meant it was a network problem. I'd ping the server by IP address, it'd respond and I'd get to try and wake a sysadmin. That particular issue was eventually solved by a combination of billing for overtime, and creating a shadow ping box that would ignore those, test for both network and app reachability and wake the right person. Sysadmins may like to think they run the network, but that's a neteng's job.
But I digress. DNS isn't my speciality, but I'm curious about a couple of bits. Like why it would be necessary to restart servers for a DNS change. AFAIK that's still a thing if you want to change a server's IP address, but I'd have hoped that in the 21st Century, that could be done on the fly. Then the good ol' push vs pull issue, like manglement not always understanding that DNS changes aren't pushed. So the usual routine of dropping TTL ahead of changes, and hoping resolver/client caches play nicely. And why stuff fell over under load. By stuff, I've seen issues where anti-DDOS systems have caused problems when DNS activity increases due to TTL being lowered, but a decently configured DNS setup should have been able to cope. If not, bit of an oops in Salesforce's capacity management.
Sysadmins again.. :p
. . . "Like why it would be necessary to restart servers for a DNS change"
Because I need to change the service my process connects to by name. The process/service is not stateless, so it starts up, gets the IP, connects and stays connected until you break it. NFS comes to mind. . . May as well push the "Big Red Switch".
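A minimal sketch of why that bites, in plain java.net (the host name and port are hypothetical): the name is resolved exactly once at start-up, and the process then sits on that connection, so a later DNS change does nothing until something forces a restart or reconnect:

import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;

public class StickyClient {
    public static void main(String[] args) throws IOException {
        // Resolved exactly once, at start-up.
        InetAddress addr = InetAddress.getByName("storage.internal.example");
        try (Socket conn = new Socket(addr, 2049)) {   // e.g. an NFS-style long-lived session
            // The process now holds this connection for days or weeks.
            // If the DNS record changes, nothing here notices: the old IP is
            // baked into 'conn' until the process is restarted or reconnects.
        }
    }
}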
I'm currently in an argument with some developers over "stateless" services behind a load balancer that need the load balancer to create affinity with a cookie. . . It's not stateless if the load balancer must create state with a cookie.
You can create this environment; it just adds a lot of complexity that nobody wants to deal with on the development side.
Partially because DNS is so critical for operations.
Partially because DNS is considered "simple", yet it requires a bit of operational knowledge to know that xxx is a valid configuration while yyy is an invalid configuration that can burn down the whole stack (let alone what it takes to actually implement a DNS server nowadays - hundreds of RFCs, is it now?).
Partially because DNS in a large enterprise can get much more complex than a small installation would ever see, and a large DNS deployment has different operational requirements than most people realize.
But I suspect the biggest issue here is that there was a config error, and instead of handling it gracefully the server just chucked itself out the window, as most do. And then on to the next one, and the next one, all crashing on the same config error, all dying the same way until nothing was left.
I've seen so many in-house hacks of a DNS management system. Almost none of them had any error checking or error handling. One error can bring the whole thing down. Apply the change - oops, it died, now what? We can't back out the change.
Now we have to figure out where our boxes are, and how to log in to them without DNS working. Oh, let's check the IPAM - but where is that kept without DNS working?
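That "can't back out the change" bit is exactly the guard rail the home-grown tools skip. A rough sketch of what it looks like (all file paths and helper names here are made up): keep a copy of the old state, apply, run a post-change check, and put the old state back if the check fails:

import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class GuardedZoneUpdate {

    public static void main(String[] args) throws IOException {
        applyZone(Path.of("/srv/dns/example.com.zone.new"),
                  Path.of("/srv/dns/example.com.zone"),
                  "canary.example.com");
    }

    // Apply a new zone file, but keep the old one and restore it if a
    // canary lookup stops working afterwards.
    static void applyZone(Path newZone, Path liveZone, String canaryName) throws IOException {
        Path backup = Path.of(liveZone + ".bak");
        Files.copy(liveZone, backup, StandardCopyOption.REPLACE_EXISTING);
        Files.copy(newZone, liveZone, StandardCopyOption.REPLACE_EXISTING);
        reloadDns();                      // hypothetical: run the name server's reload command
        if (!resolves(canaryName)) {      // post-change health check
            Files.copy(backup, liveZone, StandardCopyOption.REPLACE_EXISTING);
            reloadDns();
            throw new IOException("Change rolled back: " + canaryName + " stopped resolving");
        }
    }

    static boolean resolves(String name) {
        try {
            InetAddress.getByName(name);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    static void reloadDns() { /* left out: invoke the name server's reload */ }
}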
"It's always DNS"
There are two reasons for saying it.
- Most customers will deny ever having changed it.
- Any DNS failure (and you are right, these are normally configuration errors made by a person, not actual errors in DNS) will cause widespread and varied problems.
So whilst "It's always DNS" could be read as blaming a DNS server or DNS itself, it's more that the error that screws everything up is that the application doesn't get the response it needs from a DNS server.
The reason behind that is almost always human error, but usually a human who is quite disconnected from the chaos they have caused.
In my experience it's most often non-technical people including management who suggest "it'll be the DNS", often because that's the one technical thing they've heard of and they want to sound like they know what they're talking about.
Certainly among my colleagues whenever we refer to "it's always DNS" we're not seriously suggesting that it is, we're taking the piss out of former bosses who insisted that may be the cause of any issue.
It's always DNS because it's soooo easy to "Chicken and Egg" yourself if you are not paying attention.
BIND 9 is outrageously picky and likes to simply fail silently... ISC, please fix. As the backbone of DNS it spews hundreds of lines into your logs. There was an error on line 273 of 418 - did you catch that?
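For what it's worth, BIND ships its own validators, and a pre-flight step along these lines (a sketch; the config path, zone name and zone file are made up) catches most of those silent line-273 errors before anything gets reloaded:

import java.io.IOException;

public class BindPreflight {

    public static void main(String[] args) throws Exception {
        if (!check("/etc/named.conf", "example.com", "/var/named/example.com.zone")) {
            System.err.println("Config/zone failed validation; not reloading.");
            System.exit(1);
        }
        System.out.println("Looks sane; safe to reload.");
    }

    // Run BIND's own checkers before touching the running server.
    static boolean check(String conf, String zoneName, String zoneFile)
            throws IOException, InterruptedException {
        return run("named-checkconf", conf) == 0
            && run("named-checkzone", zoneName, zoneFile) == 0;
    }

    static int run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();  // pass the tool's output through
        return p.waitFor();
    }
}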
In general the biggest C&E is exactly the issue SF ran into. In particular, in a virtual environment VMware (for example) can be dependent on DNS to attach, say... its backend storage. Unfortunately, if the primary DNS for your VMware cluster is running on that same cluster... nothing comes up.
I get in these arguments all the time with some of my peers (mostly because I've bitten myself in the butt with this one too many times). Unless you have TWO (or preferably three) discrete clusters, IMNSHO vCenter should be a standalone machine, firewalls should be physical, and DNS and DHCP should be on something physical. . . Make sure the lowest-level items can all bootstrap without aid, or in the correct order. If this scales up, that can get very tedious. . . "Why on earth did you shut down both clusters at the same time? Don't you know they are inter-dependent? So let's start with the hosts files. . ."
Here's a great example of a bad app: Java. I first came across this probably in 2004. I just downloaded the "recommended" release of Oracle Java for Linux (from java.com), which is strangely 1.8 build 291 (thought there was Java 11 or 12 or 13 now?) anyway...
Peek inside the default java.security file (I've adjusted the formatting of the output to take fewer lines):
# The Java-level namelookup cache policy for successful lookups:
# any negative value: caching forever - any positive value: the number of seconds to cache an address for - zero: do not cache
# default value is forever (FOREVER). For security reasons, this caching is made forever when a security manager is set. When a security manager is not set, the default behavior in this implementation is to cache for 30 seconds.
# NOTE: setting this to anything other than the default value can have serious security implications. Do not set it unless you are sure you are not exposed to DNS spoofing attack.
#networkaddress.cache.ttl=-1
I don't think I need to explain how stupid that is. It caused major pain for us back in 2004 (till we found the setting), and again in 2011 (four companies later; we couldn't convince the payment processor to adjust this setting at the time and had to resort to rotating DNS names when we had IP changes), and well, it's the same default in 2021. Not sure what the default may be in anything newer than Java 8. DNS spoofing attacks are a thing of course (though I believe handling them in this manner is poor), but it's also possible to be under a spoofing attack when the JVM starts up, resulting in a bad DNS result which never expires under the default settings anyway.
At the end of the day it's a bad default setting. I'm fine if someone wants to for some crazy reason put this setting in themselves, but it should not be the default and in my experience not many people know that this setting even exists and are surprised to learn about it.
But again, not a DNS problem, bad application problem.
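For anyone bitten by the same default, the override itself is tiny. A minimal sketch (the two property names are the real JDK security properties; the 30/10-second values are just examples), with the caveat that it has to run before the JVM does its first lookup:

import java.net.InetAddress;
import java.security.Security;

public class DnsCacheTtlDemo {
    public static void main(String[] args) throws Exception {
        // Must be set before the first name lookup; once InetAddress has cached
        // an answer under the old policy, that answer keeps its old lifetime.
        Security.setProperty("networkaddress.cache.ttl", "30");           // cache successes for 30s
        Security.setProperty("networkaddress.cache.negative.ttl", "10");  // cache failures for 10s
        // (The same thing can be done in the java.security file itself, or via the
        // older sun.net.inetaddr.ttl system property.)

        InetAddress addr = InetAddress.getByName("example.com");
        System.out.println("Resolved to " + addr.getHostAddress());
    }
}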
The engineer followed a protocol and it got approved. Sure they may have used the "break/fix" option, but "Going down the EBF route meant fewer approvals" still means people higher up the chain did actually approve it.
Bad call by the engineer maybe, but someone should have spotted it if it did indeed go through a formal channel.
More likely is that they invented all this talk of procedures to help confidence in their product after it broke.
I doubt they really had layer after layer of procedures plotting out exactly this scenario that an engineer somehow failed to follow... but I can easily believe that a PR man in ass-covering mode decided to claim such layers of procedures existed.
DNS is just not that complex a thing to manage; it's simple, but it's nServers x simple. The complexity is in rolling it out in parallel to a large cluster of servers so they are not out of sync.
So the basic claim that they roll these out bit by bit 'for safety' sounds faulty. You would have to examine all the use cases for partial server changes in that case: what happens if some data comes from the new one and not from the old? The slower the rollout, the more likely that is - adding unwanted complexity.
Hence I think all this talk of layers of procedure from Salesforce is just PR bullshit.
I work for Salesforce and reviewed the internal RCA. I assure you everything Dieken said in this article is accurate. EBF has been a real thing for some time, and it does circumvent the normal process, which requires multiple approvals and staggered deployment and verification.
In fact, Salesforce has instituted such a rigorous process for change management in the last two years to prevent this kind of thing that it's a real challenge to preserve morale. Many engineers get fed up dealing with the red tape necessary to deploy their work.
In fact, Salesforce has instituted such a rigorous process for change management in the last two years to prevent this kind of thing that it's a real challenge to preserve morale. Many engineers get fed up dealing with the red tape necessary to deploy their work.
Which is exactly why the engineer resorted to EBF. My personal response to that kind of situation is to let my changes stack up until I get complaints that problems take too long. Once it reaches that point, the complainant gets the job of getting the approvals and verifications.
Tech: The new Hyperforce install is done.
Management: Let's go live!
Tech: It will take a few days to slow-roll the DNS changes.
Management: We need to show billing this week! Can we do the DNS changes faster?
Tech: Yes, but it is considered an emergency procedure.
Management: Is it risky?
Tech: Not really. I've done it several times.
Management: I'm giving you verbal authority to do it now!
Tech: OK.
Management: What just happened?
Tech: Oops. Something broke.
Upper Management: What just happened?
Management: THE TECH DID IT! THE TECH DID IT! IT'S ALL THE TECH'S FAULT!
Tech: Sigh.
You are spot-on, and I feel a lot of sympathy for the tech. I used to experience exactly this sort of crap while working for "A Well-known cloud Service provider" ... the company was dichotomous, one half demanding safety, the other demanding rapid deployment. And where would these two crash unceremoniously into each other? Yup, at the point where some poor tech had to deploy something. And when it went wrong (which it regularly did) it was all the tech's fault. And the outcome was always exactly what we've seen here: they're unable to deal with the REAL problem (which is the company's behaviour) so they go for a stand-in: they need more automated testing. And that's true as far as it goes, but it flies in the face of what they call it: "a root-cause analysis". Sigh.
I've been caught at that crash point more than once, but after the very first occasion it was always safety complaining that a fix hadn't been implemented yet, to which my answer was always that the fix had been ready for a while but not yet approved by them (with the documentation to prove it, after the second time). Somehow safety had a pretty high turnover of staff, all medical; the workplace seemed to induce ulcers in safety staff.
"In this case," he went on, "we found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active."
It is very typical: about 10 years ago, in one of the 9999.5 datacenters at Fu*u (with dual-power redundancy and multiple ISP connections), a 1ms glitch knocked out the DC for many hours just because there was a loop dependency between the NIS servers on a cold start... But a cold start had never been tested in this 9999.5 environment - it was just never considered a realistic case.
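The galling thing is how cheaply that sort of loop can be caught on paper: if the cold-start order is written down as a dependency map, a few lines of graph-walking will flag it long before a power glitch does. A sketch with entirely made-up service names, echoing the tool-needs-DNS loop in the article:

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BootOrderCheck {
    // "service" -> the services it needs up before it can start (made-up names)
    static final Map<String, List<String>> DEPS = Map.of(
        "prod-access-tool", List.of("dns"),
        "dns",              List.of("storage"),
        "storage",          List.of("prod-access-tool")   // oops: closes the loop
    );

    public static void main(String[] args) {
        for (String svc : DEPS.keySet()) {
            if (hasCycle(svc, new HashSet<>())) {
                System.out.println("Circular cold-start dependency involving: " + svc);
                return;
            }
        }
        System.out.println("Cold-start order is acyclic.");
    }

    // Depth-first walk; meeting a service already on the current path means a cycle.
    static boolean hasCycle(String svc, Set<String> path) {
        if (!path.add(svc)) return true;
        for (String dep : DEPS.getOrDefault(svc, List.of())) {
            if (hasCycle(dep, path)) return true;
        }
        path.remove(svc);
        return false;
    }
}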
Sorry, Chief Availability Officer. RTFM and all that. In this case, read the article properly. Still, it's an even weirder title. Is he the bloke that reporters ring up at 3.00am to check on the truth of a rumour? In that case he would be the Chief Sense of Humour Removed Officer.
Rather than firing the guy, a really smart employer would be able to 'encourage' said employee to sign Indenture papers (to the amount of the cost to SF of the outage/reputational damage), effectively tying said employee to the company for the rest of their life.
Said employee, now sadder, poorer and very much wiser, would be a voice of reason when some young script-kiddie, newly hired, says 'Oh, I'll fix that in a jiffy....'
Because mistakes are the only thing that we learn from, success hardly ever teaches us anything.
And smart employers should want employees who have *already* made all of the classic mistakes, and still remember how to avoid repeating them.
Voice of reason?
Said employee, now sadder, poorer and very much wiser, would be a voice of reason when some young script-kiddie, newly hired, says 'Oh, I'll fix that in a jiffy....'
I think you assume too much. No script-kiddie is going to listen to some old wiseguy with attitude.
The real issue is that the use of the emergency patch method is clearly not enforced correctly. Whoever authorized the use of this method decided it was OK rather than booting it back to said engineer and saying no - this is not an emergency, use the standard procedure.
A secondary issue is that the engineer clearly knew how to game the change control system and has perhaps done so before and been luckier - how many others do this?
Finally, there is a bug in a script; the only positive from the story is that it has been found. However, if the two things above hadn't happened it wouldn't have been an issue.
You build a monster to be proud of: 5x redundant everything, a towering monument to intellectual prowess, a work of art. But still, on page one of the procedures is the requirement that the super golden master box be online before anything else in a cold start. Which never happens, so it's never tested. You go on vacation, and someone decides to test the UPS/generators. Oops.
And yes, it is always DNS. After 30+ years at this, when things go all wibbly wobbly for no reason, check DNS first. I did not used to believe it, but you would be amazed if you start watching, how often it actually is.
I'm recently retired. 50 years in IT. Programmer, (what we used to call Systems Programmer), technical support staff/management for a DBMS vendor, management/technical lead in the DBA group of a multi-billion dollar company, operations Management (vendor-side SDM over IBM GS delivery team) at a large healthcare organization. ITIL certified, CompSci degree from U Cal.
If you find the processes you have to follow onerous, either try to get them changed or find another job. Don't assume you know the history of why those processes were put in place, or that you know better. Your job as an administrator is not to make changes, it is to ensure the safe and continued operation of those systems under your care. Sometimes that means making changes, but you need to be paranoid whenever you do. If you're not, find another line of work, as people depend on you to keep things running.
If you are pressured to do things in an unsafe manner, find another employer.
I have only found it necessary to fire one employee in my career... usually I can find a safe place for most people who have worked for me, until a DBA repeatedly failed to follow processes... even after coaching. If you think an admin can screw things up, consider what a DBA can do to a large database, and how long it can take to recover from data corruption or lost data. That's why ransomware is so lucrative... the pain can last for days.
So: no sympathy - except that yes, others likely need to be 'corrected' as well.
Did the tech ignore procedure or "ignore procedure"? I worked at a place like that for a while (not IT)... There were procedures to be followed (developed by some people who had not worked on the line at all), and a schedule to be met, but it was physically impossible to meet the schedule if the procedures were followed. So, in actuality there were on-paper procedures to be followed, and the procedures followed in reality (which saved about an hour a shift without affecting quality at all.)
It does make me wonder if this tech really kind of "went rogue", or if this was just routinely done until something went wrong. To me, it sounds like a middle ground would be good: having so much red tape around asking permission for a change they know they have to make that it's tempting to skip it is not great, and rolling out everywhere at once is not great either. A fair middle ground would be not to require all the multiple permissions and red tape (since they know DNS records must be rolled out, just a notice that they're doing it now seems like it should be adequate), but rolling out to 1-2 regions/partitions first and then the rest after hours to days is definitely a good idea to avoid a global outage.
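That middle ground is easy enough to sketch out; the partition names and the apply/verify helpers below are hypothetical, but the shape is: push to a small canary group, check it, let it soak, then do the rest, and stop dead the moment a check fails:

import java.util.List;

public class StagedRollout {
    // Made-up partition groups: a small canary set first, then everything else.
    static final List<List<String>> PHASES = List.of(
        List.of("na-1"),                        // canary
        List.of("na-2", "eu-1"),                // small second wave
        List.of("eu-2", "ap-1", "ap-2")         // the rest
    );

    public static void main(String[] args) throws InterruptedException {
        for (List<String> phase : PHASES) {
            for (String partition : phase) {
                applyDnsChange(partition);        // hypothetical: push the new records
                if (!healthy(partition)) {        // hypothetical: canary lookups, error rates
                    System.err.println("Halting rollout: " + partition + " looks unhealthy");
                    return;                       // nothing beyond this point gets touched
                }
            }
            Thread.sleep(30 * 60 * 1000L);        // soak time between phases (e.g. 30 minutes)
        }
        System.out.println("Rollout complete.");
    }

    static void applyDnsChange(String partition) { /* left out */ }
    static boolean healthy(String partition)     { return true;  /* left out */ }
}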