Everybody knows that
million-to-one chances crop up nine times out of ten. (© Sir Terry Pratchett)
Welcome again to On-Call, our regular Friday tale in which we ease you into the weekend with readers' tales of the ridiculous things they've been asked to do on evenings, weekends and the other odd times techie folk are asked to go out and fix things up. This week's contribution comes from reader “JF”, who tells us that “About …
Right now there's several major transportation links in a UK city that have their CCTV, entire SCADA systems including Fire Detection, radio and data comms and pretty much every other life-critical system running through a single toggle switch.
Anyone want to hazard a guess at the Risk Level here....?
AC for obvious reasons
I think that toggle switch was hit by accident for my ISP a few years back. BT trainee accidentally deleted another *provider* off the system.
Was a laugh, as of course the "it can't happen" attitude meant it took a couple more hours to fix as they refused to even check if the claim was true.
The problem is that impact is very hard to figure till it happens, and the rarer the event, the less likely that the impact will be properly appreciated.
We've just had "1 in 100" years flooding 6 years apart in the Lake District and no proper plan in place, because politicians seem to think "1 in 100" means it is going to happen after the 99th year, not that the probability is exactly the same for every one of the next hundred years, and so it needs to be prepared for right now.
Exactly - the problem is that virtually nobody is taught real statistics these days. If the chance that it will happen today is 1:100 then the chance that it will happen tomorrow is still 1:100 even if it happens today - and happened the day before.
The bottom line - in the flooding/climate change issues at least - is that we do not have enough samples to make an accurate estimate so, like JF, we are all pulling numbers out of thin air when the question comes up.
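To put rough numbers on the "1 in 100" point, here is a minimal Python sketch. It assumes each year is independent with the same 1% chance, which is a simplification (and one that is disputed for real systems with dependent components). The function name is made up for illustration.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Chance of one or more events in n periods.

    Assumes every period is independent and has the same probability p.
    """
    return 1.0 - (1.0 - p) ** n


annual = 1 / 100  # the "1 in 100 years" figure from the discussion

for years in (1, 6, 50, 100):
    print(f"{years:>3} years: {prob_at_least_one(annual, years):.1%}")
```

Under those assumptions the chance of at least one such flood somewhere in a 100-year span comes out at roughly 63%, not 100%, and the chance in any single year stays at 1% no matter what happened the year before.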
Nope - you are making a common mistake and are assuming that the probabilities are independent. Now if you are talking about an idiot tax such as the lottery, you are right. If you are talking about complex installations with multiple dependent components, all of which have some differing probabilistic failure mode then you are incorrect. If you are talking about a standardised 'move' process with identical kit and to/fro locations then the probability might diminish as you get better.
Back to school I think.
Actually not necessarily so - we have plenty of good data on flooding risk and frequency in many watersheds - the risks are not invented as guesses. Also some phenomena are independent and some are dependent; you need to understand the underlying processes and causes to make the judgment about that.
Which is why having a parliament filled with Law, Business, and Politics grads is a bad idea.
Seems like a basic understanding of statistics and technology should be required for public office (or functioning as an informed adult in today's world)
Well, no. I'd rather not defend politicians because most of them are oxygen thieves, but the real reasons are politics, and the realisation estimates might be wrong.
It's the same reason the UK has such poor infrastructure for dealing with snow: weather serious enough to cause problems is infrequent enough that spending money to defend against it costs substantially more than dealing with it at the time.
The planning in the Lake District was just fine - the flood defenses were improved, and they did not fail - they were overwhelmed in a few areas with exceptional weather. Again it comes down to the fact that improving defenses against an event that is that unlikely costs more than dealing with it at the time. The small number of years between the last flood is politically embarrassing but does not invalidate that point.
What will worry people is the possibility the estimates are wrong. The flood defenses were improved somewhat beyond the high watermark point at the last flood, and they did not break as far as I'm aware. If the estimates are right, people will grumble, get on with their lives, and it was correct not to spend substantially more on flood defenses. If this is a climate change issue affecting probabilities, the country is in trouble.
Sure, and it's horrible for the people involved, but everyone needs to realise that government only cares about people on a general basis. If the impact costs substantially less than coping with truly exceptional weather, or being cynical, the cost of losing all the voters in the area, they're probably not going to see it as a realistic use of money. If you're absolutely, cast iron sure that this is truly exceptional unlikely to be repeated weather, then it isn't sensible to spend on it. The question is : are the projections accurate.
This is not the same as prior floods when either the money was not spent on defenses and failed precisely because of that, or worse, one particular council refused flood defense improvements as the residents didn't like the proposed visual impact : end result, the counties with improvements were fine, and the one without flooded. muppets..
Actually that's an effect of flood defences.
Normally a watercourse will gradually spread out over its flood plain, causing a very shallow and very large lake. By bottling this up behind flood defences, you ensure that when, not if, they fail you get a sudden surge and a rather deeper lake in the place where the failure occurs.
The real answer is not to build houses on "prime building land in a picturesque setting" or "a flood plain" as I like to call it.
I suspect for my friends whose houses were underwater, it doesn't matter if the defences "did not fail" or were "overwhelmed in a few areas" - when you've got to move out while repairs are made when there was something there to prevent that happening, then that something failed!
Managers: fuck 'em. They're paid The Big Bucks so presumably they are supposed to know the right questions to ask. I don't see it as any part of any flunky's job to correct their superior understanding. [sarcasm alert]
The interesting question is this: how on earth do so many stupid people make it into the ranks of management?
...by employing good change practice. Plan the change carefully under the control of the Change Manager, include the "business" in the planning process, document it, test it if possible and implement it. In the plan make the assumption that it will go wrong and in that case figure out what you would do to then recover - this should have been a major part of the plan. This is basic major change management practice that neither JF nor the IT management people seemed to employ.
Exactly the right question sir. This year I did 5x enterprise class Arrays (literally everything the company had in data form) which required their inputs being pulled and re-routed (turned out the guy who put the Arrays in had bypassed the UPS!). Lordly that was one long CAB I can tell you. In the end I got what I wanted, which was every support team on standby and backups completed prior, as the risk to business was small, but the impact if it did go FUBAR would be huge. The changes went as planned, no one died and afterwards everyone wondered why I'd gone to "all that trouble"... yes, why indeed ;-)
JF should not have just stood there when the boss said "You said there was only a 1 percent chance..."
He should have pointed out that once an x% chance event occurs, it's 100%. Even math literate people often don't appreciate this.
BTW: Unless JF had valid statistics on the electrical company doing this exact same job, his 1% odds estimate is bogus.
"BTW: Unless JF had valid statistics on the electrical company doing this exact same job, his 1% odds estimate is bogus"
True, but it raises an interesting question.
How exactly DO you answer a risk-odds question? (Without being able to pass the buck to someone else so the odds become their problem?)
How exactly DO you answer a risk-odds question?
I never do this. The reason people ask is because they seek "experts permission" to cut corners for their own benefit (they "save money", if all goes well) and your loss (You! said it would be OK!!).
By giving "odds", you are just betting against yourself!
Unless your job 9-5, Mon-Fri, 365d is bypassing UPS, you can't accurately predict. These types of change are unique (well how many times are you going to do this change in your environment?) and therefore stats on outcome will be non-existent.
The question isn't one of risk if the poop might hit the fan, it's one of impact if it does. And for every person involved in the change who says "no worries", add another 5% to the risk rating, because everyone should be very worried with this type of change.
To estimate the risk, you have to ask how many times the engineers have removed a UPS from a live data centre. If the answer is never or less than a couple of times in the last year, then the estimate is a 1:1 chance they'll cock up the procedure.
At least you got it back up in 36 hours, I'd have said a minimum of 48.
I'd agree and suggest that as 'the coordinator' he should have suggested, 'let's have a meeting with the team doing the electrical work and get their take on the situation'.
In answering that question with a guess, he took entire responsibility.
And besides, Rule #1 in his job is; hope for the best, plan for the worst.
Given enough attempts, the probability of one of them going wrong is near enough 100% - so the manager should not be asking what the likelihood of it going wrong is, but working out the most efficient and effective way of mitigating the expected failure. If you are not planning for a disaster, you are planning for a disaster (as my BCP* colleague once said).
If you reckon that out of 100 attempts at doing something, one of them will fail, then you really need to have a plan in place for coping with the failure. Would you play Russian roulette with a (really big) revolver that had room for 100 cartridges where a single cartridge had been loaded and the cylinder spun? Not having a plan for dealing with the live cartridge being selected will have messy consequences.
*You can probably work out when this was said by the use of BCP (Business Continuity Planning). It is odd how there are fads in nomenclature - Staff Management - Personnel - Human Resources; or in this case, Disaster Recovery - Business Continuity Planning - Incident Response
The risk was trusting incompetent workers (and therefore bad management), not hardware failure. Nothing to do with beancounters or even IT.
And thinking that "1-in-a-100 chance of failure" would put someone off is inexcusable; would a "99/100 chance of success" put anyone off? Do you always pull numbers out of your arse like that?
Yeah. Looking back over the years I've been doing IT, any kind of data centre power-feed/UPS work (by electricians) seems to have about 1/3 chance of massive failure.
There's no way I'd have low-balled the estimate like "JF" did, thereby causing his company to roll the dice on it (badly in this instance). :(
1/3 chance of failure - that brought up an old memory - I was working on the Westgate Shopping Center in Oxford when they were installing the backup generator. A nice big three-phase beast, delivered to the roof of the building via a crane - now when you wire that up, what's the chance of getting those three wires connected correctly?
A single 3-phase supply?
The actual phases don't matter, only the phase rotation.
So if Neutral is correctly identified, you have a 50% chance of getting phase rotation right.
Get it wrong and the motors spin backwards - bad for aircon and lifts for obvious reasons...
If neutral is swapped with any of the other three, instant boom.
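The 50% figure can be checked by brute force. The sketch below assumes neutral is already correctly identified and that any cyclic shift of the reference sequence gives the right rotation; the labels L1, L2, L3 are just illustrative names.

```python
from itertools import permutations

# Reference phase sequence; the labels are illustrative only.
PHASES = ("L1", "L2", "L3")


def same_rotation(order):
    """True if `order` is a cyclic shift of PHASES, i.e. same rotation."""
    i = PHASES.index(order[0])
    return tuple(order) == PHASES[i:] + PHASES[:i]


orders = list(permutations(PHASES))
correct = [o for o in orders if same_rotation(o)]
print(f"{len(correct)} of {len(orders)} ways give the right rotation")
```

Three of the six possible orderings keep the rotation, and the other three reverse it, which is where the coin-flip odds come from.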
My fave ever support phone call, from the factory floor, (a Romanian chap known as Fred because he looked like Fred Flintstone) said on the phone in broken English "My light - she is too bright". (Don't you love the way any problem becomes an IT problem?)
I quickly decided that walking down there would be quicker than trying to make sense of his phone call.
I found a normal 3 phase isolator, wired with old pre-harmonised colours (red, yellow, blue, for phases, black for neutral.) with a Myford 7 lathe connected. All ok. Then I spotted a suspiciously bright and shiny white twin core (no earth) flex that disappeared into the isolator, looked inside and found brown joined to red and blue joined to blue. Then I followed the other way with a sense of dread and found a DIY 3 pin double socket on the back of his lathe with a lamp plugged into it.
Yep, a 400v 3 pin socket with no earth or RCD protection. Knocked up in his lunch hour and praised by the line leader for his 'initiative'. How that light bulb didn't explode in his face is still a mystery to me. I managed to stay calm enough to request a private meeting with the line leader and point out that he could have easily had a death on his hands, and that there is a reason things have 400v labels on them.
You have three phase and 220/240V confused. Three-phase has 3 hots. Connect between two of the hots in North America and you get 208V. Connect between one hot and a neutral (the fourth wire) and you get 120V. And you're correct about motor phase. Even a three-phase motor needs correct phase for desired rotation.
"would a "99/100 chance of success" put anyone off?"
Anyone who's thinking in terms of datacenter risk scaling, yes. If you have a contractual obligation to 9 9s uptime, then 99/100 chance of success is horrifyingly risky. Think about it by converting it into the number of days you are allowed per single day of downtime.
99/100 means about three and a half days of downtime in a year.
99.9 means nearly 9 hours downtime in a year.
99.99 means about 53 minutes downtime in a year
99.999 means 5 minutes downtime in a year - this is the minimum level any serious hosting data centre would ever claim to.
By the time you get to 9 9s, you have about 30 milliseconds - as in, your customer won't notice the downtime in the middle of a ping test.
So, when the IT guy says 'there's only a 99% chance of success', what he's saying is 'this is ten million times more risky than our uptime SLA allows for, do not do this under any circumstances'. You can then schedule downtime which is excluded from your SLA uptime target.
Beancounters really ought to understand this, since shoveling risk around is part of their job.
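For anyone who wants to check the nines arithmetic, a small Python sketch follows. It assumes a 365-day year and treats "N nines" as an allowed unavailability of 10^-N; the function name is invented for this example.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; leap years ignored


def downtime_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    Two nines is 99%, three nines is 99.9%, and so on.
    """
    return SECONDS_PER_YEAR * 10 ** -nines


for n in (2, 3, 4, 5, 9):
    secs = downtime_seconds(n)
    print(f"{n} nines: {secs:>12.3f} s  = {secs / 3600:.3f} h")

# How much riskier a 1% failure chance is than a nine-nines allowance.
print(f"ratio: {10 ** -2 / 10 ** -9:.0f}")
```

Two nines works out to about 3.65 days a year, five nines to a little over 5 minutes, and nine nines to roughly 31.5 milliseconds, so the "ten million times" ratio holds.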
"So, when the IT guy says 'there's only a 99% chance of success', what he's saying is 'this is ten million times more risky than our uptime SLA allows for, do not do this under any circumstances'" --- Naselus
That's what he is saying to a fellow techie. What the same sentence says to management is "yeah, it's definitely going to work" Remember, many of these people not only think that ninety nine point nine recurring is not exactly equal to a hundred (a little bit stupid) but are prepared to argue it with someone who does know (a little bit more stupid) and to not even change their mind when it's proved to them (unbelievably stupid).
My answer would have been "It's not a risk I would be happy to take: I think the chances of anything going wrong are small but the consequences, especially if we don't plan a mitigation strategy, would be fairly disastrous"
would a "99/100 chance of success" put anyone off?
Yes, it would in anyplace that has a 25,000 square foot data center. At least the IT guys would be put off by it. I'm not saying management types are dumb, but they don't seem to get that 99% success means that there's a very real chance of failure. When you're dealing with tens of millions of dollars worth of equipment which probably holds hundreds of millions of dollars worth of data a 1% risk is far too much to willingly accept.
In cases like this I try to keep away from giving hard numbers for a probability of some failure or other happening, instead expanding on the effects of the various failures possible (and an estimated time-to-fix) and leaving the pulled-out-of-arse probabilities to the department, company or contractor(s) who are going to do the actual job.
- Nothing goes wrong, no effects, zero time-to-fix
- 10kV across power feed, all systems emit magic smoke, business continuity plans need to be taken out of filing cabinet in disused lavatory etc, and put into action.
I've listened to bean counters talking.
Setting aside the fact that they do rather forget that someone has to produce the beans before they can be counted, they do seem to have plenty of horror stories about organisations with no idea of cost base, or no controls over expenditure. Of management treating company accounts like personal ones and so on.
I can do better; a company whose initials are "S" and "T" spent a LOT of money on a super-duper machine that could reline water and sewage pipes from the inside - without having to dig up the road.
They then trained THREE people to operate it, and at the end of the course they decided to fire one of the guys, let a 2nd have early retirement, and transferred the 3rd to a supervisors role elsewhere in the company.
Last I heard, the machine was still sat there unused.
Same company that bought the lasers had a factory in Swindon (I was originally hired with the plan of moving over to the Swindon site as an industrial engineer). They spent several million on a bespoke manufacturing system (somehow they were sold a manufacturing system by a company that said it was completely configurable, but my best guess is it was really vapourware). Every few months they'd demo the system and management would go "that's great, can you make it do this as well?" Well as you can guess after 18 months they had a factory with no manufacturing system as every change made it less user friendly (NEVER let management design a system that's to be used by other people). The entire system was dropped for a quick Access database system that I knocked up in a week that did everything the engineers and operators needed (including stock control and batch management). I basically took the functionality of the system I'd used at a previous job and built that into the database.
Whilst I do appreciate the point you are making the major takeaway I got from your statement is that you managed to create an Access database which stopped you needing to move to Swindon?? If that is indeed the case then you are a very clever person and I would like to read your book...
A.C of Swindon...
AC of Swindon:
The Access database just meant the operators and engineers (primarily myself as I was the one tasked with running the reports from the data) had a usable system that did what was needed. The collapse of the telecoms industry was what prevented me from having to move to Swindon. It was the choice of redundancy or moving to Northampton that persuaded me to move jobs (plus the nice redundancy package)...
These days, before any major change, risk of failure isn't the only variable taken into account; it's just a multiplier. The impact of failure is the other factor and in this case, the impact is the highest it can be, even with the risk being small. Bad management for not knowing this and bad JF for not making the impact clear.
No "employee of the month" for anyone involved. :-)
I had to produce written risk assessments for pretty much anything we did that had any possible significant risk. Things like taking kids with serious behaviour difficulties on a school trip, crossing major roads, I did RA refresher courses every few years.
The principle is simple enough, even if the methods vary. You gauge probability - not usually a number, because that's impossible. There are too many variables in too many combinations.
And you multiply by the severity of outcomes.
In this case I'd maybe put likelihood of loss of power as Low based on that 1:100 figure (not my own guess, which would have been at least a cynical Medium).
But consequence as Very High. (Loss of valuable equipment, data, business and lives)
On the basis that Low x High = Medium I'd have said this was a Medium to High risk.
And that's how I always produced my RAs. Which also had to contain a plan to minimise risk.
As you say, it's a "Risk matrix" thing
Ranking the issues by taking the likelihood of something happening "multiplied" by severity of event if it happens.
As the British HSE says "It is suitable for many assessments but in particular to more complex situations. However, it does require expertise and experience to judge the likelihood of harm accurately. Getting this wrong could result in applying unnecessary control measures or failing to take important ones"
In this case, probability of bad thing happening 1/100 (estimate but sounds like it was based on wrong premise - that the work would be carried out absolutely best practice) times the severity of an incident (Whoa! Overtime forms out. Is my CV up to date?) = really serious risk.
Outcome - decide to reduce the likelihood of the event, or the severity of the event.
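A risk matrix of the kind described can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the five-level scale, the numeric scores and the banding thresholds are invented, and real schemes (including any HSE-style matrix) will differ.

```python
# Hypothetical 1-5 scale shared by likelihood and severity.
LEVELS = {"Very Low": 1, "Low": 2, "Medium": 3, "High": 4, "Very High": 5}


def risk_rating(likelihood: str, severity: str) -> str:
    """Multiply likelihood by severity and band the score.

    The thresholds (6 and 15) are illustrative, not from any standard.
    """
    score = LEVELS[likelihood] * LEVELS[severity]
    if score >= 15:
        return "High"
    if score >= 6:
        return "Medium"
    return "Low"


# The UPS job as discussed: unlikely to go wrong, disastrous if it does.
print(risk_rating("Low", "Very High"))
# The more cynical likelihood estimate.
print(risk_rating("Medium", "Very High"))
```

With these made-up bands, the 1:100 estimate lands in Medium and the more cynical Medium likelihood tips it into High, which matches the "Medium to High" verdict above. Either way the rating calls for a mitigation plan, not a go-ahead.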
there are many ways of attempting to quantify the likelihood of [sumfing] happening, plus the significance thereof [if it does].
When I used to do risk consulting, I always sought a narrative example of how the resulting mess would be cleared up if the sh!t really hit the fan, by an amazing coincidence this often resulted in the aforementioned being re-appraised...
Disclaimer, I used to do risk mgmt on big infrastructure progs and pre/post risk consulting for big outsource-related programmes, both predominantly in the UK.
And yes the sales fuckwits really took a dislike to me - which I thought indicated I was probably doing a reasonable job.
"Anything involving major work to the power of a datacentre should be regarded as having a high-likelyhood of failure."
No, it shouldn't. The likelihood of the failure is low, provided you have a competent electrician. The impact of the failure should be considered alarmingly high, though. It's like a meteor impact - the chance is very, very low, but the impact is very, very high, so the risk level is somewhere in between.
The problem in the story is that they asked him about likelihood, which he honestly answered as being about 1%, but they took it to mean RISK, which is a matter of chance*impact. They made no serious assessment of what the impact would be (in this case, all SLAs broken, complete business disruption for clients for over a day and a half, massive financial penalties and legal liability, and probably chapter 11 bankruptcy proceedings within 6 months or so). If you came to me and said 'what are the chances that this will completely destroy the company?' and I said 'about 1%', then you should not take that chance if there's an alternative way to get the same result.
Hmm, it needs more than a competent electrician. I've never had anything involving industrial electric work go right enough not to justify bringing down services in a controlled manner.
It needs the install to be 100% documented and understood. A lot of operators do the work, file the map and ... forget about it. Then you have someone plug in other stuff etc etc etc.
It should not happen but it does. Understaffing, personnel churn.
The main point is that this business runs on electricity. Anything that goes wrong whacks the business.
You can't blame the sparkys without knowing the full story; more than once I have turned up to a job and found out that some prior idiot had wired it ass-backwards.
The first time was as a green apprentice, I was asked to isolate a gantry crane and pull the fuses (300A 600V 3 phase), which I did; however when the sparky in charge came to put them back he went white as a sheet; the isolator had been wired up back-to-front (isolator before fuses), so I had unbolted the fuses "Live".
Another time, someone replaced a transformer the wrong way round and the £1,000,000 PLC cabinet got fed 1,600V instead of 110V; the transformer wasn't labelled input/output, so no-one knew for a fortnight. (Siemens make good stuff, it survived without damage!!).
Then there is the GUS/KAY&Co factory where 24 picking cranes had been fitted with every single circuit breaker bypassed, and single phase isolators on 2 phase equipment, and no one had noticed for 17 years. (I wrote up a further eight pages of serious violations for the HSE for that one factory, not one was ever corrected).
What? You believed that pulling a fuse isolated something?
1) Test that it's live.
2) Pull the fuse/breaker and lock.
3) Test that it's dead.
If someone tells you it's dead, do not believe them!
On the bright side, you clearly have good "No touchy shiny bits" skills on account of not being dead yet.
I was a 2nd year apprentice on my first shop floor excursion, and I followed the instructions given.
1/ Isolator to "OFF"
2/ Open Isolator/Fuse board and unbolt the fuses.
I said 100A, but thinking back they may have been 300A fuses, fricking HUGE, I have seen smaller thermos flasks.
(This was some Italian gear with no way to lock off).
For some reason it had been wired directly from the sub-station, so the Isolator/Fuse board for the machine was it; they had to get the sub-station switched off to make it safe.
Yeah, good sparkys do it with one hand behind their back, although in this case I think my particularly dry skin helped (I can usually hold 240V mains and not feel it).
16kV hurts though, luckily it was low current.
I didn't actually witness this myself but I was told about it by the guy who did, and I believe him.
Someone owned two businesses which shared a building. He then sold both of them, but to different people.
During the changeover period it was noticed that over the years, cross wiring had happened, which meant power from A going to B and vice versa (the distribution boards were apparently adjacent in a room). The electricians came in and carefully separated the two systems, which took quite a long time.
At the end of this process a fair sized isolator was found that took power from company A to a cable that literally disappeared into the floor. Rather than try to trace it, the electricians decided simply to turn off the isolator and then go round looking for stuff that had no power. They couldn't find any problems, so they isolated the circuit and disconnected the cable for safety.
Later that day the local golf club, a quarter of a mile away, telephoned to ask if the factory was also experiencing a power cut.
At some time in the past, when the clubhouse was built, a naughty man had run a cable and powered it for free from factory A.
As an electrical apprentice in Germany the first thing you get tattooed on your forehead is
The Five Safety Rules
1. separate completely (isolate the installation from all possible sources of electrical power);
2. fix (protect against reconnection) in the open position all the breaker components or switching device in the on position, or adopt preventative measures when that is not feasible;
3. verify there is no electrical power, after having previously identified the place of work and the installation which has been placed without electrical power;
4. ground and connect in a short circuit;
5. protect against nearby power sources and delimit the working zone.
Well, in most of the rest of the world one can call himself an electrician once he managed to identify the working end of a screwdriver.
Apprenticeship as electrician lasts 3.5 years in an electrical company, half of the time spent in schools of the chamber of crafts. One even learns metalwork like operating a lathe, welding and blacksmithing.
At the end you undergo theory and practice exams. Well, that certifies as engineer in the rest of makeshift wrigglers world. To certify as engineer in Germany you have to pack up another 2.5 years of studying and internships. Spending at least 5 years on site after that, you might be accepted as a real engineer by your fellow work mates.
If you then get your arse up to work abroad you're in for a culture and probably electric shock.
Same in the UK if you want to call yourself a real engineer with Eur Ing on your business card.
One of my kids is a real engineer. The usual 4 years at U, a year working for Network Rail and a lot of work experience in Japan and Africa. Followed by 5 years at a consultancy. Work can be seen at places like Canary Wharf.
The problem isn't British engineers but British clients, who are quite capable of having serious construction work done by a Polish guy and his mates because it's cheap, only to complain later when things collapse.
Need to replace a socket, pull labelled fuse and cos Mrs Anonymous didn't have any stupid children plugged a lamp into the socket to check it's dead.
Take off the cover and get a shock
Discover that in the colonies they wire the top and bottom socket in each outlet from different fscking lines and by chance I had plugged the lamp into the same top one that had blown - but touched the bottom one.
I remember talking to a store manager while they were going through a remodel. One of the major problems was none of the electrical drawings were remotely correct unless by complete accident. And the plumbing drawings were almost as bad. So it is possible for the sparkys to be very competent but be totally misinformed about the actual wiring.
"The likelihood of the failure is low, provided you have a competent electrician."
Bollocks. The likelihood of problems is going to be medium, at best, in anything other than a brand-new datacentre.
Being a competent electrician means you can rely on them not doing anything 'directly dangerous' it says nothing about their ability to understand a complex system that has been modified, probably outside of the applicable regulations, in places that are simply not visible. For example, a competent electrician will know that mixing phases from a three-phase supply is 'a bad thing'TM but there is no easy way that I am aware of to test whether some incompetent has actually done this.
As with anything, you have to ask whether the operation is routine for that electrician, or is there any chance of a mis-understanding? I don't blame the sparkies for this - clearly they were just chucked in there and told to get on with it - no doubt without anyone mentioning how important it was.
"a competent electrician will know that mixing phases from a three-phase supply is 'a bad thing'TM but there is no easy way that I am aware of to test whether some incompetent has actually done this."
It is dead easy and I have done it. You just put the multimeter between two live terminals on different outlets. If they are different phases, there will be line voltage between them.
Then your competent electrician goes round carefully checking outlets and colour coding them.
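Why the multimeter check works can be shown with a little phasor arithmetic. This Python sketch assumes an ideal balanced supply with the three phases exactly 120 degrees apart; the function name and the phase numbering 0, 1, 2 are made up for the example.

```python
import cmath
import math


def voltage_between(phase_voltage: float, phase_a: int, phase_b: int) -> float:
    """RMS voltage between two live terminals on an ideal balanced supply.

    phase_voltage is the live-to-neutral voltage; phases are numbered 0-2.
    """
    a = cmath.rect(phase_voltage, math.radians(120 * phase_a))
    b = cmath.rect(phase_voltage, math.radians(120 * phase_b))
    return abs(a - b)


for nominal in (230, 120):  # typical UK and North American phase voltages
    same = voltage_between(nominal, 0, 0)
    different = voltage_between(nominal, 0, 1)
    print(f"{nominal} V: same phase {same:.0f} V, different phases {different:.0f} V")
```

Two outlets on the same phase read close to zero between their lives, while two on different phases read the phase voltage times the square root of three: about 400 V on a 230 V system, or about 208 V on a 120 V one.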
Expect the best, plan for the worst.
The mistake here is with the risk assessment. There might only have been a 1 in a 100 chance of this happening (quite high I'd say) but the cost of it happening was huge and therefore it's a risk that should have been addressed. How you would go about addressing it though is another matter. Someone not thinking for a moment and cutting the wrong wire is a mistake that's going to happen occasionally.
You see, there's about a 1 in 100 chance that whilst on bypass, the bypass supply might fail. That's why you have a UPS to begin with. JF assumed that the electricians would follow a staged work plan that wouldn't kill the bypass supply. That they didn't understand the wiring was a BAD THING. Mind you, I've done similar - powered off one of the two UPS for a battery swap knowing that the RP would pick up off the other UPS, forgetting that the backplane switch doesn't have an RP and mistaking which of the two UPS it was on.
An example of a committee coming to a formally idiotic conclusion. In a sensible world positive powers of 10 would be capitalised and negative ones lower case, and Greek letters would be omitted so the very common micro prefix didn't require a special key combo. But no, make the upper/lower case split at 10^3. And use really hard to confuse terms like yotta and yocto (in many other cases like maths oct- is the root of the word for 8, otto is the Italian version). Yet it would actually make no problems whatsoever to allow K and H as alternatives for k and h.
Science is supposed to help us make sense of the universe, but often special nomenclature and confusing reuse of terms in different disciplines suggests that obfuscation plays its part in there too.
These are historical artefacts. No biggie, as any science graduate can easily keep them apart.
Given that most people can't distinguish between bits and bytes (as in kb and kB) anyway, I don't think it would help much if a few prefixes were changed.
Lack of planning. The electricians should have made sure that proper steps would be applied. Sounds like they were not prepared for the big task, like reading electrical blueprints or labelling everything beforehand. Management should have asked the electricians for the risk of failure. Movies about cutting a wire to defuse a bomb and making that decision come to mind.
Yep. Any major work like that requires certificates to proceed, like a 'hot works' permit or 'permission to enter a closed vessel' in a brewery (CO2 asphyxiation risk). Plus prior labelling of the kit to be fiddled with and a checklist of steps taken.
I've even gone to the point of not entering dangerous machine areas (Mixers and conveyors) unless I had the fuses from the starter panel in my pockets..
My father works in similar industrial installations and he's supplied with what I can only describe as a swiss cheese padlock. It's a hoop with two plates full of holes to close it, which you then thread a padlock through, locking the isolator in the safe/off position.
If someone else was working on another part of the installation that used the same isolator they'd thread a second padlock into the plate so that the first person couldn't inadvertently re-start the unit.
In the US Navy, it is (or was, 15yrs ago) red-tagging all relevant controls and logging them. Since each bit of work got its own round of red tags, and it's helpful to combine different maintenance into a single window (work aloft on the masts, when any ships nearby would ALSO have to tag out their radar/HF emitters), you could sometimes get a piece of equipment with over a dozen red tags on it, and woe be to the sailor who left their tag up after their maintenance was completed.
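The multi-padlock hasp and the red-tag rounds follow the same rule: the equipment can only be re-energised once every person who locked or tagged it has removed their own lock. A toy sketch of that rule in Python, with a made-up class and worker names purely for illustration (it is not how any real interlock is built):

```python
class Isolator:
    """Toy model of an isolator fitted with a multi-lock hasp."""

    def __init__(self):
        self.energised = True
        self.locks = set()  # names of workers whose padlocks are on the hasp

    def lock_off(self, worker):
        # Switching off and fitting a personal padlock go together.
        self.energised = False
        self.locks.add(worker)

    def unlock(self, worker):
        # Each worker removes only their own padlock.
        self.locks.discard(worker)

    def switch_on(self):
        # The handle cannot move while any padlock remains on the hasp.
        if self.locks:
            return False
        self.energised = True
        return True


if __name__ == "__main__":
    iso = Isolator()
    iso.lock_off("first fitter")
    iso.lock_off("second fitter")
    iso.unlock("first fitter")
    print(iso.switch_on())  # False: the second padlock is still on
    iso.unlock("second fitter")
    print(iso.switch_on())  # True
```

The point of the design is that nobody can clear someone else's lock on their behalf, which is exactly what spare fuses in a workbench drawer defeat.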
"I've even gone to the point of not entering dangerous machine areas (Mixers and conveyors) unless I had the fuses from the starter panel in my pockets.."
I used to work (as a subcontractor) in industrial environments and did just the same, but also made sure there was a shortcut in place after the removed fuses, because there's always an in-house technician wandering around with some spare fuses somewhere in the drawers of his workbench...
"but also made sure there was a shortcut in place after the removed fuses, because there's always an in-house technician wandering around with some spare fuses somewhere in the drawers of his workbench..."
The levels of "logic" of some of these supposedly intelligent people often beggar belief. Fuse is missing. Either it blew with such force that the entire fuse block vapourised (or was ejected from its holder), or someone removed it for a purpose. You either have an incredibly major fault in your equipment, or someone nearby is expecting to work on disconnected equipment.
(With any work I do in future, I'll use whatever I can fit in the box to make sure someone else doesn't replace a fuse that I've taken out.. What's the fault isolation speed of a fuse-shaped block of C4?)
We had a few occasions where truck drivers tried to drive off with our forklift drivers still in the trailer unloading... after discussing lots of high-tech and convoluted processes to stop it happening we solved it for £5. By putting a hook on each of the roller-doors in the loading bay.
When the truck comes in, their keys are hung on the hook before the roller-door is opened, leaving the keys hanging 15ft in the air. Then they physically can't get their keys back (or drive off) until the door's closed again.
In the summer my neighbour mentioned that the company he works for has precisely this system to ensure the safety of forklift operations. He also mentioned one of the London depots had a refrigerator trailer decide to immolate itself whilst under their canopy when there were six other vehicles alongside. They managed to get one away by quickly getting the shutter down and retrieving the keys, but the rest were burned out.
Risk always seems to sneak back in somehow.
Of course there was a disaster recovery plan. It was even detailed in the article.
“Hurry!” JF exhorted, before calling everyone else in the business who could possibly help, such as folks on the application, sysadmin, network admin, and DBA teams. The business hadn't bothered to have any of them on-call, because with just a one percent chance of failure why did it need to bother?
Perhaps, in afterthought, it was not a particularly _good_ disaster recovery plan, but that word didn't appear anywhere in the requirements document.
The only criticism I have of JF is that he didn't properly scope out the ability of the electricians - even that is assuming he was the one who employed them. Most ordinary electricians I have had to deal with are quite frankly numpties.
Also, it is the management's job to assess possible impact on the company not his. Especially when they went directly against his advice on replacing the kit. It seems they didn't even want to know what the risks were, let alone how to manage them.
> The only criticism I have of JF is that he didn't properly scope out the ability of the electricians - even that is assuming he was the one who employed them. Most ordinary electricians I have had to deal with are quite frankly numpties.
The question he was asked was not what chance is there that a cyclone will flood our factory, making the only hard drives in the world, which would obviously ensure that the globe does or does not trip the invention of modern USBs.
He was asked literally: "what are the chances of a manager with my self esteem hiring incompetents?"
And the answer to that meant he felt it wise to resign after the fact.
The obvious conclusion is that his job satisfaction was one in one hundred.
And if you view it from the other end of the spectrum he felt his own level of competence was above one in one hundred which in fact turned out to be true.
He got the place up and running despite further mismanagement bollocks and realised that someone had to fall on their sword, doing so himself.
That's not the joke that you think it is.
Assume that in a theoretical process there may be 100 actions taking place, if there is a 1% risk across the process the risk of one going wrong becomes 1/100 x 100 .
Or to put it another way, as the complexity of an operation increases the probability of an error tends towards certainty. So you plan for that.
The real risk factor is based on which of the actions may lead to serious consequences.
"Assume that in a theoretical process there may be 100 actions taking place, if there is a 1% risk across the process the risk of one going wrong becomes 1/100 x 100 ."
No, the actual answer is as follows:
For each action, a 0.01 probability (i.e. 1%) of something going wrong means a probability of 0.99 of it turning out OK. For 100 actions, the probability of them all turning out OK is 0.99 ** 100 = 0.366. The probability of at least one action failing is 1 - 0.366 = 0.634, i.e. a 63.4% chance of something going wrong.
The basic point is correct though: even someone who is fairly clueless about percentages will probably be scared by a percentage risk that is in double digits.
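The arithmetic above is easy to check in a few lines of Python. This is only a sketch: it assumes, as the comment does, that every action fails independently with the same probability, and the function name is made up for illustration.

```python
def prob_at_least_one_failure(p_fail: float, n_actions: int) -> float:
    """Chance that at least one of n independent actions fails.

    Assumes every action has the same failure probability and that
    failures are independent of each other (an idealisation).
    """
    p_all_ok = (1.0 - p_fail) ** n_actions
    return 1.0 - p_all_ok


if __name__ == "__main__":
    risk = prob_at_least_one_failure(0.01, 100)
    print(f"{risk:.3f}")  # 0.634
```

If the failures are not independent (one botched step making the next more likely, say), the real figure can be higher or lower than this.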
Indeed, or to use an analogy..
Ok, this plane has a 1% chance of exploding, but will get you there in 4/5 time and 1/2 the cost. Sounds like a good bet?
Fine, here is your itinerary, erm... please note the 30 flights (26% chance of high altitude fiery death), still happy with those odds?
This is the difference between an I.T professional and an I.T professional with business sense, it is indeed more important to consider the consequences of if it goes wrong over the *if* it happens.
I appreciate many think it's not an I.T professional's place to have honed business sense about them, but experience teaches me the last place you can trust to exercise business sense is the management :)
It's like the higher up the pole you go the more incompetent the people you find.
a bit of business sense means you know which buttons to press on the management.
Naively dropping into discussions " but we'll be covered anyway under our insurance, won't we?" or casual (but distinctly audible) cooler conversations "there was a news item in the paper about an accountant in a company who got fired because of some mixup - and apparently if they'd held backups he'd have been OK"
Ah yes, I still do this with the Legionella regs.
Oh, you don't want to bother with a risk assessment? Sure, that's OK. You remember the Barrow-in-Furness case don't you? Oh yes, she was acquitted of manslaughter. After 4 years, a retrial, a couple of hundred grand of legal costs and a £70,000 fine.
People tend to go a bit silent on me after I say this. But it often has the desired effect.
In the Barrow-in-Furness clusterfuck the council replaced the engineer in charge of their pool with an architect. Who knew nothing about water quality. So she apparently sacked the legionella testing contractors, as she didn't know why they were spending this money. Controlling the water quality in swimming pools and cooling towers is a bugger of a job - whereas architects are best left to draw pretty pictures.
Forget about management ever taking an engineer (of whatever persuasion) seriously (the sane exception proving the rule here). And then, most IT staff isn't paid enough to take responsibility for the failure of HR/the board/company owners to hire competent management. If you would want me to fix all that, I'm now interim CEO of your outfit, with Carte Blanche to fire nitwits on sight and completely revamp the entire operation to my fancy. Short of that, you will have to make do with reports that contain technical language and require some interpretation...
I remember our entire system spontaneously going down midday in the insurance company I worked for years ago.
They were getting some kind of maintenance work done in the server room, and one of the guys propped a piece of panelling up against the wall - failing to notice it was also leaning on the big red emergency shutdown button...
The simplest ones are the worst. Because the panel was covering the button, I bet no-one saw it and thought that was the problem. In the machine printing room I used to work in, there were so many emergency stops and printing presses that the conduit to each one had, at ceiling level, a three-state indicator, green, amber or flashing red. The conduit was painted day glow yellow onto a grey wall. Green, the supply was on, amber the supply was off, flashing red the emergency stop had been pushed. Useful when you couldn't hear above the din if someone shouted.
I myself had an incident when studying engineering at college. As part of the electrics module we had to go to the college's main electrical plant room (a quite small cupboard, for want of a better description) and draw the things we saw in there and say what they were, etc. Any road up, there must have been 7 or 8 of us in there and I spied the main breaker for the whole college. It was a kind of one-armed-bandit style thing with a chuffing great handle you pulled to cut the breaker. Anyway I knew what it was and not to touch. But on the side of it was a small little toggle switch, which (to this day I don't know why) I decided to press. CLUNK, the entire campus went into darkness, lecturers coming out of rooms wondering what was going on, etc. Turns out said switch was the test button for the breaker! We were all good mates and stuck together and said my A4 binder had knocked it. The college clock and bell was about 5mins out for the rest of term.
Our sysadmin was in the server room putting some stuff on a high shelf. He was (foolishly, he agreed) using a stepladder that was a bit short, and as he overreached himself it wobbled. He grabbed the nearest available object to steady himself, which turned out to be the main breaker lever. He said afterwards that the sound of his footsteps echoing across the silent room was even more unnerving than the initial "clunk".
I've been on a project for about 8 months now, objective, retire an hierarchical storage system that is EOL/EOSL at both hardware and software layers, and the vendor has advised that there will no longer be support for the implementation. Data stored? Financial accounting records for the core business, and *cough* billing data that the government can require as part of customer tax records.
Right off the top management believed that they could throw away 12 years of this data as "They can't possibly need this". It took me three or four weeks to get the two managers making this claim to go back to legal and get a comprehensive response about that data. Initially *legal* stated that they only needed 5 years. I had a quiet discussion with legal pointing at a series of tax regulation changes in the last three years that affected financial statements retroactively for 10 years. Legal went and reviewed things and finally came back with a sane answer.
Currently there is data on disk and on tape, and there are three copies of tape data, one permanently offsite.
Initial estimates of cost - well -- there was only one implementation, in one site. Missed on current vs projected storage volumes.
We re-wrote the proposal that the management put together to include 2 years of growth, a georemote copy and the bandwidth to support data replication and the price of course went up - the management team lost it and denied the request.
I got a copy of the proposal done to the financial committee - NOWHERE did it mention the risk to the business if said data was not available. I've since re-written it with corrected risk analysis including the point that if the historical corporate financial data is unavailable, there is no substantiation of the corporate stock values, and if the historical billing data is missing the company is liable to the feds for up to $500k *per day* of fines. Changed the financial committee's perspective. Sadly however, they still don't want to spend enough on the solution ..... I've explained why "good enough" isn't a final solution. It's all I can do.
There is no way to paraphrase your comment to just print the kernel. What your company problem is turns out to mean that closing up and rebranding would save them bothering with keeping records. Maybe that is really why companies are sent to the wall all the time tax breaks be blown. It is not fiddling the tax when it is cheaper to throw away the company.
Ah, lovely! We once had a back-hoe dig up the power line for our main building during construction of an extension. What a noise it made! All of our servers came back without issue, but our 3Com (yes, many years ago) switches didn't like it. Only one refused to boot properly, but they started falling like dominoes in the following weeks.
Well that's what we used to call Facilities Management where I used to work. It was an ongoing battle, such as finding fire doors propped open with a door stop (done by lifting the floor panel and rotating it). Whenever I found this I would put it back, as the electrical room was next along the corridor and our offices the other side. Eventually I got hauled up by FM and told to leave it alone. I pointed out it was a fire door; they said that due to the fact they'd moved the UPS batteries outside to a new building it was no longer a fire door. I suggested they remove the label saying it was. I'm sure it never happened.
Next up was the fireman's breaker switch on the building panel in the hallway. I pointed out it was the wrong colour (red), denoting an alarm break box rather than a power break box (yellow I think). Again they did nothing about it. Lo and behold, one weekend when we were doing a massive upgrade and had the whole department in, the majority of which were contractors and costing a fortune, FM decided to test the alarm boxes. It was annoying enough having the fire alarm going off every few minutes, until CLUNK, no power, even the computer room which had a UPS was off. Work stopped and everyone started to panic as we had tight deadlines to deliver our upgrade.
The only light bulb working was the one in my head. I knew pretty much instantly what they had done: they had opened the red box on the building control panel (despite it being labelled fireman's breaker) and cut the power, hence why the UPS wasn't working.
No issue just power it back up. Oh that's odd where has all the kit gone from the room where we were shown it was and trained to use. Ah, FM built a new building outside and moved everything, which is why it turned out we had a special film on our windows now in case of an explosion and why the fire door was no longer a fire door.
FM had no idea where anything was in the new building or how to use it. So I had to go through all the building outside with the head of FM, who had not a scrap of paperwork. Eventually I found the breaker I remembered from training hidden in a false panel at the bottom of the electrical cab, but the head of FM wasn't prepared to throw it so I had to do it.
I pointed out he really should ensure the breaker box was changed as per my request, he had plans of the 'new' electrics, instructions on how to use them and that he needed to retrain the operations team. Doubt it ever happened.
Funny thing was I was working for a major power company at the time and our building was constantly short of power. We had frequent requests to power off kit until they eventually added an extra supply.
Proper Planning and Practice Prevents Piss Poor Performance...
A walk through of the required isolations could have identified this potential failure prior to actually doing it, ideally by someone who knew what they were doing. An electrician doesn't necessarily understand the 'process' the electrical supplies are feeding, no matter what his level of competence.
It's pretty clear the task just wasn't planned effectively.
It's not just for programmers.
This is just a massive catalog of errors and really there should be some rolling of heads. Mainly management. As usual. I am speaking as a manager.
It has already been mentioned in the comments that there were a number of issues not dealt with, which are the responsibility of management. IT management, that is. One of the golden rules of IT is that any activity in the machine room must absolutely be regulated. To that end I removed our company President off the mag lock door ACL. The guy has no reason to be there. If he does then he'll be escorted.
Any power works within a DC is not minor. I have to ask where the work orders were from the sparkies. Had they been dry run? Amazing what you remember when you go through each step.
No roll back scheme either?
I am pretty speechless but not at all surprised that this happened. Too many utterly incompetent people in IT these days.
Of the time of the Hurricane in '87. I was at home wondering how to get the tree off my roof. Got a phone call from the office to tell me that one of the data centres was down (yeah, that's the one that had no electrical backup).
I told the lowly IT bod that I could not make it, and that he should go round the systems and turn all the power switches to off. And to only turn them on, one by one, when the power came back on. Of course, he was overruled because "when the power comes back we need all the systems back on line quickly".
Well, the power did come back on. Fortunately no-one was in the computer suite at the time. This was the time of Vax's, with those old disk systems that look like top loading washing machines. They all power up at the same time. About 40 of them. And there was a huge power surge that the power switch board really didn't like. So much so that it leapt three feet off the wall and fused the whole building.
It took us three weeks to get that lot wired up again!
I resigned shortly after this.
"This was the time of Vax's, with those old disk systems that look like top loading washing machines. They all power up at the same time."
Well, some disk drives of a certain age might. For a large part of the life of DEC disks and VAX systems, large disk drives (some of which might be up to 3kW steady state and briefly more at startup) had sequence control capabilities built in.
But you had to connect the sequence control cable to the drives in question, otherwise they just came on when the power came on.
Lots of people forgot or didn't bother, with the consequences as described.
At the time of the 1987 UK hurricane (badly predicted as I remember), I was called out to a call centre that was reported as being off air. I arrived to find the equipment rack very wet and a hole in the ceiling above dripping - well more than dripping, more running, water over said rack. I gently powered down each server/driver/etc then covered the rack in a sheet of plastic I found. I then proceeded to inform control about the situation and to divert services until said rack had dried out. Have you seen the maintenance man, control asked. No, where is he? He's on the roof looking for a leak. Really, I replied, in the hurricane? Alas I never met the maintenance man or ever found out if he got down from the roof via the ladder or in free fall. Call centre was off air for 2 weeks but, good news, the entire rack came back with no problem, once dry.
A few years back I got some useless sparks to decommission a redundant 20KVA UPS for me - they assured me they knew what they were doing. Having disconnected the input power, it was beyond their comprehension that it was all still live and bleeping. Just managed to stop them killing themselves. Yep - it's a UPS!
I bet JF is glad all these armchair statisticians / electrical supply / datacentre experts are here to give him the benefit of their wisdom after the fact and tell him what he did wrong.
I thought my post was quite clearly aimed at management, not the poster.
I am a manager. That his manager was not capable of dealing with this issue is not the fault of JF. If I were JF's manager I would be mortified to have taken the attitude of his manager.
I like to assume a 1 in a 100 chance of any operation NOT failing terribly.
Then the only thing that can be done to avert certain disaster, is "everything".
Treat every single element and action that is part of the project as if it were a sleeping dragon, only waiting for a minute change in the environment to begin spewing lightning and sulphuric acid into every connector it can find, to unleash catastrophic failure upon your entire operation.
Take inventory of the many many ways disaster can strike; known or unknown, worst case and best case scenarios. Management can deal with the numbers: we now have 13.768 scenarios for catastrophic failure (a made up number, but realistic for a datacenter IMO.), and we only have 768 ways to stop it. Now management can break their pretty heads over the 13.000 things that can go wrong for which there is no solution, and the impossibility of preventing every possible doom-scenario, and realize the inevitable: we probably need to invest a lot more to make this work: more money, more time, more preparation, more people, and more effort on the part of management to ensure business continuity, even in the face of certain disaster. And yes, we are out of here at 5 and won't be back until Wednesday.
With a good plan and good people, it can be done.
If either of those are missing, it cannot.
I know of one site that moved over 500 circuits from old system to new during an overnight shutdown, with no faults. The new system took over in the morning, and the site ran perfectly. Still is.
And another that moved two circuits and literally vaporised the contents of the electrical cabinet.
One was planned, rehearsed and undertaken by very good electricians, the other wasn't.
Nope. There are permutations, combinations and variations e.g. cut neither wire; cut one only; cut the other only; cut one wire then the other; cut the other wire then the one; cut both wires together; fail to sever a wire completely; ... it's really not hard to think of many more possibilities.
It was a mistake to quote a probability figure, and a further mistake to ignore the business consequences ... and it's hard not to think of more mistakes in this scenario, with the benefit of 20/20 hindsight (or is that 50/50?).
That boss needs shooting..
Had a job on a customer's business machine. Minor 2 min fix. Probably a 1 in 1,000 chance something would go wrong. I took a full backup just in case, wasting a little over an hour of my time.
I needed the backup. Something a little more broken than I had detected, and a straightforward fix became a slightly different fix. 1st rule of computing, quick and easy jobs always turn into a disaster, especially if it's business critical or nearly home time/weekend etc.
(2nd rule - 1 in a million chances 'n all that..)
I tend not to do it for home computers, as most say "I have nothing on there", but I agree that for everything else, there is a chance the HDD fails and you lose everything.
Even if you're just there to press the on switch the same time a lightning strike hits. It was not your fault, but you were the closest person to get the blame!
... to assess risk. The Annualised Loss Expectancy calculation, although rough cut, would have prevented the acceptance of this risk. https://www.langtonblue.com/2015/03/information-security-budget-planning-donkey-tail/
The donkey was not harmed in that article, I am assured.
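For anyone who hasn't met it, Annualised Loss Expectancy is conventionally the cost of a single incident multiplied by how many times a year you expect it to happen. A rough sketch in Python; the function name and the figures are invented for illustration and are not taken from JF's story:

```python
def annualised_loss_expectancy(single_loss: float, occurrences_per_year: float) -> float:
    """ALE = single loss expectancy (SLE) x annualised rate of occurrence (ARO)."""
    return single_loss * occurrences_per_year


if __name__ == "__main__":
    # Invented figures: a $200,000 outage expected once every four years.
    print(annualised_loss_expectancy(200_000, 0.25))  # 50000.0
```

Even a rough-cut figure like this gives management a number to set against the cost of doing the job properly.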
Our electricians were working on the DC doing a minor job and didn't realise the risk of unbalanced phases. The guy was very badly burned. The service? Well, we lost a zone but nothing more. Lesson learnt: whole DC shutdown every six months, with works done once isolated. The first cycles were so painful; after a year or so we got slick, with all load being transferred and a solid shutdown / black start process. We were complacent and needed the pain to learn the lesson.
Risk assessment requires cost as part of the evaluation. So loss equals a million dollars and a one in one hundred chance means you should as a minimum spend one hundredth of the potential loss to ensure it does not happen. So million dollar loss with a one in one hundred chance, should get $10,000 minimum spent to specifically ensure it does not happen.
The bigger the loss and the bigger the chance, the more you spend making sure it does not happen. Where that investment exceeds the gain, you simply do not do what you were intending to do; quite simply, it is a stupid and fiscally irresponsible thing to do.
Our DR site was in the Data Centre of another division as part of a "gentlemen's agreement" (cue major stress and anxiety over service delivery once the "gentlemen" concerned both left).
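That rule of thumb can be written down in a few lines of Python. This is a sketch of the commenter's reasoning only, not a proper risk model; the function names and the gain figures are invented for illustration.

```python
def expected_loss(potential_loss: float, probability: float) -> float:
    """Probability-weighted cost of the failure."""
    return potential_loss * probability


def worth_doing(gain: float, potential_loss: float, probability: float) -> bool:
    """Apply the rule of thumb: skip the job if the minimum
    mitigation spend (the expected loss) exceeds the gain."""
    return expected_loss(potential_loss, probability) <= gain


if __name__ == "__main__":
    # Figures from the comment: $1,000,000 loss at a 1 in 100 chance.
    print(round(expected_loss(1_000_000, 0.01)))  # 10000
    # The gain figures below are invented for illustration.
    print(worth_doing(50_000, 1_000_000, 0.01))   # True
    print(worth_doing(5_000, 1_000_000, 0.01))    # False
```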
To replace the Data Centre power bar (or link, or something) took 3 complete attempts involving full outages, over different weekends, because the sparky either had bad plans, inaccurate documentation, or forgot to take a panel off to look beforehand. Rumour had it the overloaded component was GLOWING! under normal load. I pitied the poor sysadmin who had to coordinate this debacle, he was off on stress leave for a month later that year.
The Business usually does not see itself as owning IT, but sees IT as owning IT. This leads to the cognitive bias of the "Empathy Gap" type.
Instead of raising serious concern with the business about the effect on them, warnings by IT tend to be viewed as "overly concerned techies" who need a "reassuring fatherly talk".
Seems more like: If we do nothing wrong it's a 0% chance of failure. If we THROW THE WRONG SWITCH, it's 100% chance of failure. I have no idea where the 1% came from... Somehow he/she knew that the odds of someone making a big mistake was 1%? Making a big human error could probably have been prevented by marking things that mustn't be touched, and making a checklist for the correct procedure before performing it.
A colleague phoned a school in the wilds of Scotland to diagnose an IT fault and wanted the guy to reset the modem. "Turn the power off, count 5 then turn it on again." The guy puts the phone down then my colleague heard a distant voice "but he told me to turn it off." Luckily it was a small one-room school!
Once worked in a media broadcast centre where everything ran off DC. We got a job to check the float battery was OK. This was a big room filled with bathtub sized lead-acid cells and to check each was giving 2V we first had to isolate the battery from the rectifier unit in a room across the corridor. We had told everyone what we were going to be doing and to ignore the big blue POWER FAIL light... well we thought we had. We had checked a couple of cells when my mate got that "Oh $#!t!" feeling... he rushed across the corridor to discover the rectifier had been tripped! The one guy we hadn't spoken to had seen the alarm, got to the rectifier and, in a panic, pressed everything in sight... including the one that took the rectifier offline. At the 'inquest' they said the power had been off for 27s... another 3s and the mega-bucks compensation clause kicked in!
Then there was the day that London Electricity managed to connect 2 phases across live/neutral and we had to go up and down Edgware Rd with the petty cash buying up all their stocks of 20mm fuses
between the laughter (& tears)... remembering - I was the 'Designated Old Fart' on the install of a new Production Studio at an elderly TV station located on a hill so as to increase coverage for the antenna (ooh, Analog eh!), when a storm 'struck' literally - with a lightning hit, that took out the main power. 'lights out' fershure but stayed out! with panicked people proliferating I ran to the back-up gennie to hear it going 'rur rur rur rur rur..' due to yep = empty fuel tank. "oops/oh-that..." ha - odds of a lightning strike on a hill ? yes