That Salesforce outage: Global DNS downfall started by one engineer trying a quick fix

The sound of rumbling rubber could be heard today as Salesforce threw an engineer responsible for a change that knocked it offline under a passing bus. "We're not blaming one employee," said Chief Availability Officer Darryn Dieken after spending the first half hour of a Wednesday briefing on the outage doing pretty much that …

  1. elkster88
    Joke

    "We have taken action with that particular employee"

    Presumably involving a shovel, a roll of carpet and two 20kg bags of quicklime.

    1. chivo243 Silver badge

      Re: "We have taken action with that particular employee"

      "We have taken action with that particular ex-employee"

      I doubt a cock-up of that magnitude will result in keeping your job...

      1. Psmo

        Re: "We have taken action with that particular employee"

        Well, lessons like that are expensive, so maybe keep them around?

        Ideally you want people who have experience of recovering from that sort of error on someone else's system, though.

        1. Strahd Ivarius Silver badge
          Joke

          Re: "We have taken action with that particular employee"

          Usually you keep them around as an example once you have reduced their head using ancestral techniques

        2. Imhotep

          Re: "We have taken action with that particular employee"

          I worked for a company that maintained a facility in the hinterlands of Mexico. People who incurred the CEO's displeasure were transferred there to think on their sins. Some came back, eventually.

      2. Anonymous Coward
        Anonymous Coward

        Re: "We have taken action with that particular employee"

        Either that or promotion to manager.....

    2. Version 1.0 Silver badge

      Re: "We have taken action with that particular employee"

      Might be worth keeping them onboard - they have just learned about a scripting issue, which could mean that they will never have this happen again because the tech will be damn careful next time! Replace them with a new tech and the chances are that this will happen again in a year or two.

      1. Julian 8 Silver badge

        Re: "We have taken action with that particular employee"

        I was in a situation a number of years ago where an engineer did an upgrade which resulted in lots of problems. It took a little while to fully resolve, but we did. A good few months later the same system needed another upgrade and the same engineer was assigned - but I was too - I prefer one late night to a week of them.

        During this work it was clear that the first device to upgrade should really be done on its own, due to the nature of the changes it was making. The engineer then went to the second device and was about to start the upgrade there. I stopped him and said "is this what happened last time?" I pointed out that the first device should really finish, and then we could upgrade the others at the same time, once the backend upgrade had finished.

        So, no, some do not learn from their mistakes and won't be more careful

      2. tip pc Silver badge

        Re: "We have taken action with that particular employee"

        "Replace them with a new tech and the chances are that this will happen again in a year or two."

        It's OK, they had someone automate something so that if someone wants to do this again they can only do it through the automated system. They won't have the skill or experience to know what they are doing, they will just run some automation.

        In the same vein, did this engineer who ran the script actually know what the script did?

        Blindly running a script or automation for years that someone else, somewhere else, wrote or jerry-rigged isn't the same as being a skilled operator with many years of experience and the capability to write said script or automation and understand what it does.

        In the race to outsource to the lowest bidder, the skill sets of those doing the work plummet.

      3. Keith Langmead

        Re: "We have taken action with that particular employee"

      May also depend on whether they had to track down the engineer at fault (e.g. they were keeping their head down and hoping to avoid blame), or the engineer immediately put their hand up and admitted they were the cause of the fault.

      Assuming they were otherwise competent at their job, IMHO someone who will admit to their mistakes rather than trying to cover their tracks is someone worth keeping around. Their replacement could be of the cover-their-tracks variety, and that could lead to even worse issues when something goes wrong.

      4. Imhotep

        Re: "We have taken action with that particular employee"

        The scripting issue aside, the person demonstrated a willingness to bypass procedures put in place to avoid just that sort of thing. Who needs that?

        1. David Neil

          Re: "We have taken action with that particular employee"

          Who approved the delivery of change under that emergency procedure?

          They also have culpability in approving a change that they didn't understand in terms of impact.

          Sometimes you need to look at what drove the behaviour in the first place rather than just blaming the guy at the sharp end.

        2. Wild Bill

          Re: "We have taken action with that particular employee"

          I suspect that all the engineers use this "emergency" global rollout all the time to get things done quickly. In this case it went badly wrong, and now everyone's pretending they've never done it and just leaving the one person to take all the blame

    3. Len
      Big Brother

      Re: "We have taken action with that particular employee"

      This is definitely a case of needlessly ambiguous terms.

    4. Chris Miller

      Re: "We have taken action with that particular employee"

      Deputy Heads will roll.

    5. zuckzuckgo Silver badge

      Re: "We have taken action with that particular employee"

      >Presumably involving a shovel, a roll of carpet and two 20kg bags of quicklime.

      Well, that is the company's official emergency management procedure. Preferably implemented before the minion gets a chance to tell their side of the story.

    6. macjules

      Re: "We have taken action with that particular employee"

      I might consider employing someone with "I brought down Salesforce for 5 hours, globally" on their CV. Worst UI and UX I have ever come across in my life.

  2. Nate Amsden

    wth is it with always dns?

    I don't get it. I've been running DNS for about 25 years now. It's super rare that a problem is DNS related in my experience. I've certainly had DNS issues over the years, but most often the problems tend to be bad config, bad application (this includes actual apps as well as software running on devices such as OSes, storage systems, network devices, etc.), or bad user. In my experience bad application wins the vast majority of the time. I have worked in SaaS-style shops (as in, in-house developed core applications) since 2000.

    But I have seen many people say "it's always DNS"; maybe DNS-related issues are much more common in Windows environments? I know DNS resolution can be a pain, such as dealing with various levels of browser and OS caching regarding split DNS (where DNS names resolve to different addresses depending on whether you are inside or outside the network/VPN). I don't classify those as DNS issues though; DNS is behaving exactly as it was configured/intended to. It was the unfortunate user who happened to perform an action that triggered a query whose results were then cached by possibly multiple layers in the operating system; they then switched networks, the cache didn't get invalidated, and a problem resulted.

    I know there have been some higher-profile DNS-related outages at some cloud providers (I think MS had one not long ago), but they still seem to be a tiny minority of the causes of problems.

    It makes me feel like "it's always DNS" is like the folks who try to blame the network for every little problem, when it's almost never the network either (speaking as someone who manages servers, storage, networking, apps, hypervisors, etc., so I have good visibility into most everything except in-house apps).

    1. Nick Ryan Silver badge

      Re: wth is it with always dns?

      I think the point of "it's always DNS" is that DNS is so fundamental that if you screw it up, bad things often happen, and those bad things often take a few hours to resolve due to caching.

      If a service vendor such as Salesforce screws up one of their applications, it's crap, but it won't usually take out their entire service; testing of such components should be readily possible to reduce the chance of failure, and the change should be quickly reversible when bad things happen. As a result, even Salesforce don't screw up that kind of thing very often. DNS changes, however, are considerably harder to test safely (in particular, remotely), and when something goes wrong they are a PITA to fix.

      1. Anonymous Coward
        Anonymous Coward

        Re: wth is it with always dns?

        Who sets a super long TTL these days though?

        It used to be necessary to have a long TTL because DNS queries could eat up a lot of bandwidth on your own DNS server, but these days DNS traffic is largely trivial unless you're an ISP.

        Longest TTL I've seen in recent times is about 10 mins, whereas way back in the day you'd see up to 24 hours and occasionally even longer.

        Even then, most public DNS providers ignore the TTL.

        1. A.P. Veening Silver badge

          Re: wth is it with always dns?

          Even then, most public DNS providers ignore the TTL.

          A good reason to use Unbound, preferably in combination with Pi-Hole.

    2. Jellied Eel Silver badge

      Re: wth is it with always dns?

      It makes me feel like "it's always DNS" is like the folks who try to blame the network for every little problem, when it's almost never the network either (speaking as someone who manages servers, storage, networking, apps, hypervisors, etc., so I have good visibility into most everything except in-house apps).

      I don't blame the network. DNS isn't the network, it's an app that allows wetware to make use of a network. If DNS is down, the Internet isn't; it's just that those cat vids have gone into hiding.

      But I have in the past had good cause to blame DNS. Usually when the pager went beep while I was dreaming of new ways to torment sysadmins. Mainly because they controlled the ping-to-beep software and decided that an inability to ping the domain name of a DNS server meant it was a network problem. I'd ping the server by IP address, it'd respond, and I'd get to try to wake a sysadmin. That particular issue was eventually solved by a combination of billing for overtime and creating a shadow ping box that would ignore those alerts, test for both network and app reachability, and wake the right person. Sysadmins may like to think they run the network, but that's a neteng's job.

      But I digress. DNS isn't my speciality, but I'm curious about a couple of bits. Like why it would be necessary to restart servers for a DNS change. AFAIK that's still a thing if you want to change a server's IP address, but I'd have hoped that in the 21st century that could be done on the fly. Then the good ol' push vs pull issue, like manglement not always understanding that DNS changes aren't pushed. So the usual routine of dropping TTL ahead of changes, and hoping resolver/client caches play nicely. And why stuff fell over under load. By stuff, I've seen issues where anti-DDoS systems have caused problems when DNS activity increases due to TTL being lowered, but a decently configured DNS setup should have been able to cope. If not, bit of an oops in Salesforce's capacity management.

      Sysadmins again.. :p

      1. f3ew

        Re: wth is it with always dns?

        Java is famous for caching valid DNS entries at startup and then never rechecking them. You change an IP, and then spend a whole bunch of time restarting all your long-running JVMs.

        1. Slabfondler

          Re: wth is it with always dns?

          Indeed, we had to write a script to restart logstash when the IP of an endpoint (an LB we don't control) changes. This after quite a few "why aren't we getting data from x anymore?" tickets were raised.
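
          The watcher itself doesn't have to be clever. A rough sketch of the idea in Java; the host name, poll interval and restart command below are made up for illustration, not anyone's actual setup:

            import java.net.InetAddress;

            public class EndpointWatcher {
                public static void main(String[] args) throws Exception {
                    String host = "lb.example.internal";   // hypothetical load balancer name
                    String lastIp = InetAddress.getByName(host).getHostAddress();

                    while (true) {
                        Thread.sleep(60_000);              // poll once a minute
                        // A real version would catch lookup failures rather than dying here.
                        String ip = InetAddress.getByName(host).getHostAddress();
                        if (!ip.equals(lastIp)) {
                            System.out.println(host + " moved " + lastIp + " -> " + ip + ", restarting consumer");
                            // Hypothetical restart hook; swap in whatever actually bounces the service.
                            new ProcessBuilder("systemctl", "restart", "logstash").inheritIO().start().waitFor();
                            lastIp = ip;
                        }
                    }
                }
            }

          Running it as its own process means it only sees the JVM's default short-lived cache, rather than whatever the long-running JVM resolved at startup.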

      2. Anonymous Coward
        Anonymous Coward

        Re: wth is it with always dns?

        You'd need an insanely low TTL for a DDoS to be possible.

        I think around 10-20 minutes is fine.

      3. KSM-AZ

        Re: wth is it with always dns?

        . . . "Like why it would be necessary to restart servers for a DNS change"

        Because I need to change, by name, the service my process connects to. The process/service is not stateless, so it starts up, gets the IP, connects, and stays connected until you break it. NFS comes to mind... May as well push the "Big Red Switch".

        I'm currently in an argument with some developers over "stateless" services behind a load balancer that need the load balancer to create affinity with a cookie... It's not stateless if the load balancer must create state with a cookie.

        You can create this environment, it just adds a lot of complexity nobody wants to deal with on the development side.

    3. Anonymous Coward
      Anonymous Coward

      Re: wth is it with always dns?

      Partially because DNS is so critical for operations.

      Partially because DNS is considered "simple", yet it requires a bit of operational knowledge to know that xxx is a valid configuration and yyy is an invalid configuration that can burn down the whole stack. (Let alone what it takes to actually implement a DNS server nowadays; hundreds of RFCs, is it now?)

      Partially because DNS in a large enterprise can get much more complex than a small installation would ever see. And large DNS deployments have different operational requirements than most people realize.

      But I suspect the biggest issue here is that there was a config issue, and instead of handling it gracefully the server just chucked itself out the window, as most do. And then on to the next one, and the next one, all crashing on the same config error, all dying the same way until nothing was left.

      I've seen so many in-house hacks of a DNS management system. Almost none of them had any error checking/error handling (a sketch of the kind of pre-apply check they skip is below). One error can bring the whole thing down. Apply the change, oops, it died, now what? We can't back out the change.

      Now we have to figure out where our boxes are, and how to log in to them without DNS working. Oh, let's check the IPAM, but where is that kept without DNS working?
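
      For what it's worth, the missing error handling doesn't have to be clever. Here's a minimal sketch of the sort of pre-apply check I mean, assuming BIND and its named-checkzone tool are on the box; the zone name and file path are made-up placeholders, not anyone's real setup:

        import java.io.IOException;

        public class ZonePreCheck {

            // Validate a candidate zone file with named-checkzone before any tooling
            // is allowed to push or reload it. True only on a clean exit code.
            static boolean zoneIsValid(String zoneName, String zoneFile)
                    throws IOException, InterruptedException {
                Process check = new ProcessBuilder("named-checkzone", zoneName, zoneFile)
                        .inheritIO()   // surface the checker's complaints instead of swallowing them
                        .start();
                return check.waitFor() == 0;
            }

            public static void main(String[] args) throws Exception {
                if (zoneIsValid("example.com", "/etc/bind/zones/db.example.com")) {
                    System.out.println("Zone validates cleanly; safe to apply/reload.");
                } else {
                    System.err.println("Zone failed validation; refusing to roll it out.");
                }
            }
        }

      Trivial, but it turns "apply, oops, it died, we can't back it out" into "refuse to apply it in the first place".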

      1. Anonymous Coward
        Anonymous Coward

        Re: wth is it with always dns?

        "Oh lets check the IPAM, but where is that kept without DNS working?"

        In a .CSV file on the USB drive in the top-left desk drawer, right? I mean, that's where I keep mine.

    4. Anonymous Coward
      Paris Hilton

      Re: wth is it with always dns?

      Calm down dear.

      It's the *misuse* of DNS not DNS itself that is the problem. If you've been running DNS for 25 years then you know this already.

      1. Anonymous Coward
        Anonymous Coward

        Re: wth is it with always dns?

        >It's the *misuse* of DNS not DNS itself that is the problem.

        DNS configuration is like porn: as long as we let people have access to it, they will fiddle with it.

    5. John Robson Silver badge

      Re: wth is it with always dns?

      "It's always DNS"

      There are two reasons for saying it.

      - Most customers will deny ever having changed it.

      - Any DNS failure (and you are right, these are normally configuration errors made by a person, not actual errors in DNS) will cause widespread and varied problems.

      So whilst "It's always DNS" could be read as blaming a DNS server or DNS itself, it's more that the error that screws everything up is that the application doesn't get the response it needs from a DNS server.

      The reason behind that is almost always human error, but usually a human who is quite disconnected from the chaos they have caused.

    6. Keith Langmead

      Re: wth is it with always dns?

      In my experience it's most often non-technical people including management who suggest "it'll be the DNS", often because that's the one technical thing they've heard of and they want to sound like they know what they're talking about.

      Certainly among my colleagues, whenever we refer to "it's always DNS" we're not seriously suggesting that it is; we're taking the piss out of former bosses who insisted it might be the cause of any issue.

    7. KSM-AZ
      Facepalm

      Re: wth is it with always dns?

      It's always DNS because it's soooo easy to "chicken and egg" yourself if you are not paying attention.

      BIND 9 is outrageously picky and likes to simply fail silently... ISC, please fix me. As the backbone of DNS it spews hundreds of lines into your logs. There was an error on line 273 of 418; did you catch that?

      In general the biggest C&E is exactly the issue SF ran into. In particular, in a virtual environment, VMware (for example) can be dependent on DNS to attach, say... its backend storage. Unfortunately, if the primary DNS for your VMware cluster is running on that cluster itself... nothing comes up.

      I get into these arguments all the time with some of my peers (mostly because I've bitten myself in the butt with this one too many times). Unless you have TWO (or preferably three) discrete clusters, IMNSHO vCenter should be a standalone machine, firewalls should be physical, and DNS and DHCP should be on something physical... Make sure the lowest-level items can or do all bootstrap without aid, or in the correct order. If this scales up, that can get very tedious... "Why on earth did you shut down both clusters at the same time? Don't you know they are inter-dependent? So let's start with the hosts files..."

  3. Nate Amsden

    bad app and DNS

    Here's a great example of a bad app: Java. I first came across this probably in 2004. I just downloaded the "recommended" release of Oracle Java for Linux (from java.com), which is strangely 1.8 build 291 (I thought there was Java 11 or 12 or 13 now?). Anyway...

    Peek inside the default java.security file (formatting adjusted to take fewer lines):

    # The Java-level namelookup cache policy for successful lookups:

    # any negative value: caching forever - any positive value: the number of seconds to cache an address for - zero: do not cache

    # default value is forever (FOREVER). For security reasons, this caching is made forever when a security manager is set. When a security manager is not set, the default behavior in this implementation is to cache for 30 seconds.

    # NOTE: setting this to anything other than the default value can have serious security implications. Do not set it unless you are sure you are not exposed to DNS spoofing attack.

    #networkaddress.cache.ttl=-1

    I don't think I need to explain how stupid that is. It caused major pain for us back in 2004 (till we found the setting), and again in 2011 (four companies later; we couldn't convince the payment processor to adjust this setting at the time, so we had to resort to rotating DNS names when we had IP changes), and, well, it's the same default in 2021. Not sure what the default may be in anything newer than Java 8. DNS spoofing attacks are a thing of course (I believe handling them in this manner is poor), but it's also possible to be under a spoofing attack when the JVM starts up, resulting in a bad DNS result which never expires under the default settings anyway.

    At the end of the day it's a bad default setting. I'm fine if someone wants to put this setting in themselves for some crazy reason, but it should not be the default, and in my experience not many people know that this setting even exists; they are surprised to learn about it.
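
    For anyone who does want to override it, here's roughly what that looks like in code rather than in java.security. A minimal sketch, assuming an Oracle/OpenJDK 8-style runtime; the host name is only an example, and the property has to be set before the JVM does its first lookup or the old policy sticks:

      import java.net.InetAddress;
      import java.security.Security;

      public class DnsCacheTtlDemo {
          public static void main(String[] args) throws Exception {
              // Must be set before the first name lookup; once an address is cached
              // under the old policy, changing the property won't evict it.
              Security.setProperty("networkaddress.cache.ttl", "30");          // successful lookups: 30 seconds
              Security.setProperty("networkaddress.cache.negative.ttl", "10"); // failed lookups: 10 seconds

              // Legacy command-line equivalent: java -Dsun.net.inetaddr.ttl=30 DnsCacheTtlDemo

              InetAddress first = InetAddress.getByName("example.com");
              System.out.println("First lookup:  " + first.getHostAddress());

              Thread.sleep(31_000); // outlive the 30-second TTL so the cached entry expires

              InetAddress second = InetAddress.getByName("example.com");
              System.out.println("Second lookup: " + second.getHostAddress());
          }
      }

    Same effect as editing java.security itself; the point is just that almost nobody knows the knob is there until it has bitten them.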

    But again, not a DNS problem, bad application problem.

    1. J. Cook Silver badge

      Re: bad app and DNS

      Your problem is Java. :)

      OW! OW! Put down the pitchforks, I'm going!

  4. 9Rune5

    Four-year-old

    and the engineer also had a four-year-old script to do the job.

    FTFY!

  5. Anonymous Coward
    Anonymous Coward

    Having been on the receiving end of some of Salesforce's products, there's probably a fair few people who would happily buy the guy a pint for that period of blessed relief.

  6. DarkwavePunk

    So...

    The engineer followed a protocol and it got approved. Sure they may have used the "break/fix" option, but "Going down the EBF route meant fewer approvals" still means people higher up the chain did actually approve it.

    Bad call by the engineer maybe, but someone should have spotted it if it did indeed go through a formal channel.

    1. jake Silver badge
      Pint

      Re: So...

      Don't talk sense. It'll only get you into trouble.

      So can beer, but have one on me anyway.

    2. Anonymous Coward
      Anonymous Coward

      EBF is PR, it smells funky

      More likely, they invented all this talk of procedures to prop up confidence in their product after it broke.

      I doubt they really had layer after layer of procedures plotting out exactly this scenario that an engineer somehow failed to follow... but I can easily believe that a PR man in ass-covering mode decided to claim such layers of procedures existed.

      DNS is just not that complex a thing to manage. It's simple, but it's nServers x simple. The complexity is in rolling changes out in parallel to a large cluster of servers so they are not out of sync.

      So the basic claim that they roll these out bit by bit 'for safety' sounds faulty. In that case you would have to examine all the use cases for partial server changes: what happens if some data comes from the new one and not from the old? The slower the rollout, the more likely that is, adding unwanted complexity.

      Hence I think all this talk of layers of procedure from Salesforce is just PR bullshit.

      1. Anonymous Coward
        Anonymous Coward

        Re: EBF is PR, it smells funky

        I work for Salesforce and reviewed the internal RCA. I assure you everything Dieken said in this article is accurate. EBF has been a real thing for some time, and it does circumvent the normal process, which requires multiple approvals and staggered deployment and verification.

        In fact, Salesforce has instituted such a rigorous process for change management in the last two years to prevent this kind of thing that it's a real challenge to preserve morale. Many engineers get fed up dealing with the red tape necessary to deploy their work.

        1. A.P. Veening Silver badge

          Re: EBF is PR, it smells funky

          In fact, Salesforce has instituted such a rigorous process for change management in the last two years to prevent this kind of thing that it's a real challenge to preserve morale. Many engineers get fed up dealing with the red tape necessary to deploy their work.

          Which is exactly why the engineer resorted to EBF. My personal response to that kind of situation is to let my changes stack up until I get complaints that problems take too long. Once it reaches that point, the complainant gets the job of getting the approvals and verifications.

  7. BobC

    What probably happened...

    Tech: The new Hyperforce install is done.

    Management: Let's go live!

    Tech: It will take a few days to slow-roll the DNS changes.

    Management: We need to show billing this week! Can we do the DNS changes faster?

    Tech: Yes, but it is considered an emergency procedure.

    Management: Is it risky?

    Tech: Not really. I've done it several times.

    Management: I'm giving you verbal authority to do it now!

    Tech: OK.

    Management: What just happened?

    Tech: Oops. Something broke.

    Upper Management: What just happened?

    Management: THE TECH DID IT! THE TECH DID IT! IT'S ALL THE TECH'S FAULT!

    Tech: Sigh.

    1. Anonymous Coward
      Anonymous Coward

      Re: What probably happened...

      No. EBFs do not require manager approval, by design. It was a process error that the EBF didn't get more scrutiny, in addition to the judgment error to make it an EBF to begin with.

      1. Anonymous Coward
        Anonymous Coward

        Re: What probably happened...

        You mean that I can replace all the Windows boxes with Linux ones (or the reverse) using an EBF without management approval???

        On my way to see some more downtime then....

    2. Anonymous Coward
      Anonymous Coward

      Re: What probably happened...

      You are spot-on, and I feel a lot of sympathy for the tech. I used to experience exactly this sort of crap while working for "A Well-Known Cloud Service Provider"... the company was dichotomous: one half demanding safety, the other demanding rapid deployment. And where would these two crash unceremoniously into each other? Yup, at the point where some poor tech had to deploy something. And when it went wrong (which it regularly did) it was all the tech's fault. And the outcome was always exactly what we've seen here: they're unable to deal with the REAL problem (which is the company's behaviour), so they go for a stand-in: they need more automated testing. And that's true as far as it goes, but it flies in the face of what they call it: "a root-cause analysis". Sigh.

      1. A.P. Veening Silver badge

        Re: What probably happened...

        I've been caught at that crash point more than once, but after the very first occasion it was always safety complaining that a fix hadn't been implemented yet, to which my answer was always that the fix had been ready for a while but not yet approved by them (with the documentation to prove it, after the second time). Somehow safety had a pretty high turnover of staff, all for medical reasons; the workplace seemed to induce ulcers in safety staff.

  8. EarthDog

    Why was EBF used in the first place?

    There is nothing to indicate that there was an actual emergency. Someone wanted to cut corners to speed things up and someone, presumably a manager, decided EBF was appropriate sans an emergency. Several people are at fault here.

  9. The Oncoming Scorn Silver badge
    Pint

    Shiny Happy People

    The Salesforce team has tools to deal with sad servers.

  10. jake Silver badge

    Remind me again ...

    ... why I'm supposed to put any of my corporate eggs into the cloud basket?

  11. Anonymous Coward
    Anonymous Coward

    It is always DNS...

    It is always DNS, except when it is a certificate.

    1. Anonymous Coward
      Happy

      Re: It is always DNS...

      Good point. But I'm still going to make sure that our CR approval checklist contains a line item: does the change involve DNS?

  12. Netty

    Got to feel for the tech person, nice values - together and all that good stuff..

  13. Anonymous Coward
    Anonymous Coward

    typical loop dependency

    "In this case," he went on, "we found a circular dependency where the tool that we use to get into production had a dependency on the DNS servers being active."

    It is very typical: about 10 years ago, in one of the 9999.5 datacenters at Fu*u (with dual-power redundancy and multiple ISP connections), a 1ms glitch knocked out the DC for many hours just because there was a loop dependency between the NIS servers on a cold start... but a cold start had never been tested in this 9999.5 environment; that scenario just wasn't covered.

  14. dithomas

    Chief What?

    Chief Accessibility Officer!

    Now there's a title looking for a purpose; possibly accessing the engineer's internal organs.

    1. This post has been deleted by its author

  15. dithomas

    Chief What? Correction

    Sorry, Chief Availability Officer. RTFM and all that. In this case, read the article properly. However, an even weirder title. Is he the bloke that reporters ring up at 3.00 am to check on the truth of a rumour? In that case he would be the Chief Sense of Humour Removed Officer.

    1. zuckzuckgo Silver badge

      Re: Chief What? Correction

      Chief Availability Officer

      He's there to make sure there is always at least one red shirt around and available to take the fall.

  16. Potemkine! Silver badge

    PR BS

    "We're not blaming one employee," but "We have taken action with that particular employee"...

  17. anonymous boring coward Silver badge

    Ehh.. If it looks safe, has been done before many times, and won't kill anyone, just go for it!

    Implement it globally, all at once. People are such delicate flowers.

  18. MadAsHell

    Experience is the name that we give to our mistakes

    Rather than firing the guy, a really smart employer would be able to 'encourage' said employee to sign indenture papers (to the amount of the cost to SF of the outage/reputational damage), effectively tying said employee to the company for the rest of their life.

    Said employee, now sadder, poorer and very much wiser, would be a voice of reason when some young script-kiddie, newly hired, says 'Oh, I'll fix that in a jiffy....'

    Because mistakes are the only thing that we learn from; success hardly ever teaches us anything.

    And smart employers should want employees who have *already* made all of the classic mistakes, and still remember how to avoid repeating them.

    1. Bitsminer Silver badge

      Re: Experience is the name that we give to our mistakes

      Voice of reason?

      Said employee, now sadder, poorer and very much wiser, would be a voice of reason when some young script-kiddie, newly hired, says 'Oh, I'll fix that in a jiffy....'

      I think you assume too much. No script-kiddie is going to listen to some old wiseguy with attitude.

  19. Anonymous Coward
    Anonymous Coward

    The real issue is that the use of the emergency patch method is clearly not enforced correctly. Whoever authorized the use of this method decided it was OK to use it rather than booting it back to said engineer and saying no, this is not an emergency, use the standard procedure.

    A secondary issue is that the engineer clearly knew how to game the change control system and has perhaps done so before but been luckier. How many others do this?

    Finally, there is a bug in a script; the only positive from the story is that it has been found. However, if the two things above hadn't happened it wouldn't have been an issue.

    1. A.P. Veening Silver badge

      The real issue is the perceived need to use the emergency patch method; it says something (and nothing good) about the organization and its management.

  20. Oh Matron!

    "The engineer instead decided erroneously"

    Not for one second do I believe this. If this is true, the processes around change control at SF must be non-existent!

  21. LordHighFixer

    We have all been there

    You build a monster to be proud of, 5x redundant everything, a towering monument to intellectual prowess, a work of art. But still, on page one of the procedures is the requirement that the super golden master box be online before anything else in a cold start. Which never happens, so it's never tested. You go on vacation, and someone decides to test the UPS/generators. Oops.

    And yes, it is always DNS. After 30+ years at this, when things go all wibbly wobbly for no reason, check DNS first. I didn't use to believe it, but you would be amazed, if you start watching, how often it actually is.

  22. Anonymous Coward
    Anonymous Coward

    You're not going to like this

    I'm recently retired. 50 years in IT. Programmer (what we used to call a Systems Programmer), technical support staff/management for a DBMS vendor, management/technical lead in the DBA group of a multi-billion-dollar company, operations management (vendor-side SDM over an IBM GS delivery team) at a large healthcare organization. ITIL certified, CompSci degree from U Cal.

    If you find the processes you have to follow onerous, either try to get them changed or find another job. Don't assume you know the history of why those processes were put in place, or that you know better. Your job as an administrator is not to make changes; it is to ensure the safe and continued operation of those systems under your care. Sometimes that means making changes, but you need to be paranoid whenever you do. If you're not, find another line of work, as people depend on you to keep things running.

    If you are pressured to do things in an unsafe manner, find another employer.

    I have only found it necessary to fire one employee in my career... usually I can find a safe place for most people who have worked for me, until a DBA repeatedly failed to follow processes... even after coaching. If you think an admin can screw things up, consider what a DBA can do to a large database, and how long it can take to recover from data corruption/lost data. That's why ransomware is so lucrative... the pain can last for days.

    So: no sympathy. Except, yes, likely others need to be 'corrected'.

    1. Anonymous Coward
      Anonymous Coward

      Re: You're not going to like this

      Note: above rant applies to changes made to operational/production systems only, although I will also say that it SHOULD apply to operational/development systems as well, as they are 'production' for development groups.

  23. Henry Wertz 1 Gold badge

    Did they ignore procedure or "ignore procedure"?

    Did the tech ignore procedure or "ignore procedure"? I worked at a place like that for a while (not IT)... There were procedures to be followed (developed by people who had not worked on the line at all), and a schedule to be met, but it was physically impossible to meet the schedule if the procedures were followed. So, in actuality, there were the on-paper procedures to be followed, and the procedures followed in reality (which saved about an hour a shift without affecting quality at all).

    It does make me wonder if this tech really kind of "went rogue", or if this was just routinely done until something went wrong. To me, it sounds like a middle ground would be good: having so much red tape and permission-asking for a change they know they have to make that it's tempting to skip it is not great, and rolling out everywhere at once is not great either. A fair middle ground would be not to require all the multiple permissions and red tape (since they know DNS records must be rolled out, just a notice that they're doing it now seems like it should be adequate), but rolling out to 1-2 regions/partitions first, then the rest after hours to days, is definitely a good idea to avoid a global outage.
