
If it got interrupted...
Then it's not a UPS. It's a DHL. Dumbass High-risk Liability.
An IT bod from a data centre consultancy has been fingered as the person responsible for killing wannabe budget airline British Airways' Boadicea House data centre – and an explanation has emerged as to what killed the DC. Earlier this week Alex Cruz, BA's chief exec, said a major "power surge" at 0930 on Saturday 27 May …
First we know everything was slowing down. Maybe they decided they couldn't fix it live and wanted to force a failover.
The Times suggests a big red button was pressed in the data centre by a contractor and the power went down. That might be when BA claimed there was a power failure.
That would be the point when the failover failed. Perhaps that is why the CEO said something about there being millions of messages, although he seems to have stopped saying that now, maybe because it suggests there's something wrong with their IT.
Then I guess they tried to bring the data centre back up, and it looked like the bridge of the Enterprise, shaking about, staff falling to the floor, and smoke everywhere. That would be the power surge.
I wonder how long it was since power and switching to secondary or backup data centres were tested.
Electrical installation is rarely done in-house and is quite a specialised task. You'd have to be special to cock this up.
I can see it now.
Electrician's apprentice - Whoops, why has it gone quiet?
Foreman - Quick, restart everything and get out before anyone notices.
Oh, that takes me back some years. New standby generator went in on Friday night/Saturday morning. We're on site, shut down everything, new generator in place, all well, so stuff gets brought back up. 9:30 am, apprentice sparky cut a wire that caused the whole thing to shut down. I got a call from security, arrived at 9:45 am (I stay close) and the place was like the Marie Celeste. Open cans of diesel for the generator, warm cups of tea and not a bloody person on site.
Similar to where I worked, except the backup power gen came on, picked up the load and then promptly died. Seems there wasn't much fuel in the tank. Maintenance guy was fingered for it as it was in his job description to fuel the generator and keep it topped off. The lesson was that firing off the generator once a week for 10 minutes to test uses fuel... duh!!!!!!
The electrical generators are almost certainly three-phase generators ... the trick here is connecting the three phases in the right order. I saw a generator test years ago in Oxford fail on the initial installation test after the phases were connected incorrectly. The generator spun up, and as the power switched over there was one heck of a bang and a lot of smoke ... and no more electricity.
> "I saw a generator test years ago in Oxford fail"
Yeah... I was told of a similar incident, but in a power station. When the station is powered on it needs to sync to the grid before linking as it is vital that not only the frequency matches exactly, but also the phase. The traditional way to do this was with a dial showing the phase-error. Apparently when the plant was down for maintenance they also had cleaners in to give the control room a going over. One of these cleaners discovered that it was possible to unscrew the glass fronts of the dials to clean the glass. In the process they knocked off the needle, and replaced it... 180 degrees out of phase. When the power station was brought back online the generators apparently detached themselves from the floor... with considerable (i.e. demolition-grade) force!
IT staff rarely go near the electrical stuff, it's far too dangerous for that.
Er, IT staff work on things that are run off the very same electrical stuff. I do hope you are not implying that data centre grade equipment is too dangerous ?
Having said that, even a complete Muppet can hurt themselves with nothing more than a mildly sharp stick or an LR44 button cell battery that looks like a sweetie. Like everything, it's all down to training and understanding the job and the risks of the job. Take a look on YouTube for the chaps that work on the live 500 kV power lines, or the guys that maintain the bulb at the top of the radio towers.
Everyone should have seen all the warning signs on the way into the facility (that tick the boxes in the H&S assessment) and had the prerequisite training about safe escape routes if gas discharge occurs (no, not that gas, the other one), the presence of 3-phase power, the presence of UPS power, various classes of laser optics, automated equipment such as tape libraries that can move without warning and, of course, the data centre troll who's not been seen for a couple of weeks now. Oh, and of course the ear defenders due to the noise, plus the phone that you can't hear as it's too noisy.
My point is that data centres are no worse than any other environment - like maintaining a car engine or running a mower in your garden.
I bet he was called in and told to pull the plug as a consequence of the system grinding slowly to a halt yet not switching over to secondary.
If you really want to force a failover that way, you do so by shutting down the small number of systems that would cause the monitoring system to detect a "critical services in DC1 down, let's switch to DC2". If you can't log in to those systems because of system or network load you connect to their ILO/DRAC/whatever, which is on a separate network, and just kill those machines. If the monitoring system itself has gone gaga because of the problems, you restart that, then pull the rug out from under those essential systems. Or you cut connectivity between DC1 and the outside world (including DC2), triggering DC2 to become live, because that would be a failure mode that the failover should be able to cope with.
You. Do. Not. Push. The Big. Red. Button. To. Do. So.
Ever.
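To make the distinction concrete, here's a rough Python sketch of the "make the monitoring declare DC1 dead" approach described above - the hostnames, ports and failover action are all invented for illustration, not anything BA actually runs:

import socket

# Hypothetical "critical services" whose loss should make monitoring fail over.
CRITICAL_DC1 = [("db1.dc1.example", 5432), ("mq1.dc1.example", 5672)]

def is_up(host, port, timeout=3):
    # True if a TCP connection to host:port can be opened.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def dc1_critical_services_down():
    # The condition a monitoring system might use to declare DC1 dead.
    return not any(is_up(host, port) for host, port in CRITICAL_DC1)

if dc1_critical_services_down():
    # In real life: the monitoring system promotes DC2, or an operator shuts
    # down DC1's essential hosts via their out-of-band management (iLO/DRAC).
    # Never the EPO button.
    print("DC1 critical services down: initiate failover to DC2")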
At one facility I worked at & I don't have the full story of why......
The offshore support insisted on the former plant Sysadmin hitting the plant's BRB, pictures were sent to the remote guy via email, he confirmed that was the button he wanted to be pushed & goodness gracious me it was going to be pushed. He was advised again of what it was that would be pushed & the consequences, the plant manager dutifully informed of what was required, what the offshore wanted & what would be the fallout.
& so it came to pass that the BRB was pushed on the word of the Technically Competent Support representative.
(Paraphrasing here........)
"Goodness gracious me, Why your plant disappearing from network?"
"Because the BRB you insisted that was the button you wanted pushing, despite my telling you that it was never to be pushed under pain of death has just shut down the entire plant."
I think it took until about 15 minutes before production was due to commence the following day to get everything back up & running.
Usually near the door in each DC hall...
But not so near that they can be mistaken for a door opener button by the dimmest of dimwits. At chest/shoulder height and at least a few steps away from the door appears to me the most sensible location.
That said, I've seen a visitor who shouldn't have had access to the computer room in the first place look around, totally fail to see the conveniently located, hip-height blue button at least as large as a BRB next to the exit door, and kill the computer room because a Big Red Button high up the wall and well away from the exit is obviously the one to push to open the door for you.
Unfortunately, tar, feathers and railroad rails are not common inventory items in today's business environment; rackmount rails are too short and flimsy for carrying a person.
"The Times suggests a big red button"
These exist in many data centers. But they are not intended for normal, sequenced shutdowns or to initiate failover to backups. They are usually placed near the exits and intended to be hit in the event of a serious problem like a fire. They trip off all sources _Right_Now_ and don't allow time for software to complete backups or mirroring functions.
*Usually for events that dictate personnel get out immediately.
My last employer had the big red shutdown button conveniently located next to the exit door. Unfortunately just in the position that the door open button would be. One lunchtime a visiting engineer carrying a few boxes of spare parts accidentally pressed it trying to open the door...
I think the best solution to prevent accidental use is to have 2 big red buttons. And to require both to be simultaneously pushed to trigger the power shutdown.
In fact I saw a UPS product having exactly that feature, two EPO buttons that you needed to push simultaneously to shut it down.
Having two buttons doesn't mean a single person can't operate them. You can place the two buttons close enough for that. But it ensures the person operating them really knows what they're doing, and isn't randomly pressing buttons.
Of course, if the purpose of having such buttons is to allow even untrained people to shut everything down in case of emergency, it would complicate things. But a large warning message, for example "In case of fire, you need to press these two buttons at the same time!" should take care of that as well.
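A toy model of that two-button interlock, just to show the timing idea (the one-second window and the trigger are invented, not from any real EPO product):

import time

WINDOW_SECONDS = 1.0
press_times = {"button_a": None, "button_b": None}

def emergency_power_off():
    print("EPO: dropping power to the room")  # stand-in for the real trip

def press(button):
    press_times[button] = time.monotonic()
    a, b = press_times["button_a"], press_times["button_b"]
    # Only fire if both buttons were pressed within the window of each other.
    if a is not None and b is not None and abs(a - b) <= WINDOW_SECONDS:
        emergency_power_off()

press("button_a")   # nothing happens
press("button_b")   # second press within the window -> EPO fires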
My guess about the initial failure: ATS left in bypass or failed on power transfer. 15mins for someone authorised to manually switch ATS to good power source or bypass failed ATS. UPS/generators not specified (or too much new kit added to the DC) for full startup load of the whole data center which then failed again.
Some systems probably started up and began a re-sync then the high load crapped out the generators returning everything to silence again, leaving the replication in an unknown state when systems were manually restarted over a longer period to manage the initial load.
My bet would be that a cluster failover was initiated by the power failure, then fail-back was manually triggered, but the primary site failed again with the power surge just as the secondary systems were starting. With a manual fail-back an engineer would be needed to fail over again, and not just a bargain-basement operator.
If the people that manage the servers are from TCS and they were unable to recover from the power failure in a reasonable amount of time, then I deduce that they are at fault. Maybe not for the initial outage, but for the subsequent problems. They would also be responsible for the disaster recovery procedures, so the fact it all failed in the first place also lies with them.
So we've got an explanation that fits part of the information released so far (i.e. The power issues), but there seem to be large gaps between what we are being told (active-active or active-active-passive DCs that provide fault tolerance) and the two-plus days of outage.
In addition, saying that UK staff ran UK data centres AND covering off the actions they took leaves a lot of questions about who runs the systems that stopped, and about the slow or problematic attempts to recover them.
My question for BA would be "if you had not outsourced, would you have expected to experience an outage, and if it had, would it have taken two days to recover?" Given the impact to BA's business, I would hope the answer is no and TCS screwed up...
At a previous company, we did have a full-time electrician. When he wasn't fixing something, he was supervising an upgrade or replacement, designing future electrical buildouts, meeting with DC tenants to be sure we didn't overcommit the electrical supply, fixing stuff around the offices, and generally being Very Useful.
To be fair, we had more than 1 data center, with complex power requirements. Much like BA, come to think....
I've worked both as a direct employee and as a contractor. A company has much more control over direct employees.
"A contractor isn't the same as outsourcing."
CBRE are a little vague as to what they do. It seems to come down to "advising" but it certainly doesn't sound like electrical contracting. Maybe they were overseeing electrical contractors. See their site at http://www.cbre.co.uk/uk-en/services/global_corporate_services/data_centre_solutions where they claim "With Data Centres, knowledge is power".
This is surely going to go down in the history books as a lesson in how to turn a crisis into a disaster, on every level from technology design and implementation, to financial impact, public relations and more.
Do BA/IAG management not realise that what they're saying isn't plausible, not even to people with only very basic clue [1], and what BA/IAG management are doing and not doing isn't helping?
[1] Anyone who is publicly supporting BA/IAG's version of events might want to consider what has been possible in IT (and in PR) for a few decades, and look at the definition of "useful idiot".
So poor old sparky pulled the wrong plug, panicked, hit the wrong button and everything went boom in DC1. All expected scenarios so far.
Why did DC2 blow up? If they're active-active it's not a failover scenario, it's simply a reduction in capacity. What's the betting they've gone active-active for cost reasons (i.e. less kit needed than for active-passive) but woefully underspecced the whole thing for a failover? Or simply allowed the amount of applications to exceed the load of one DC alone because "this will never happen"?
To switch things off *before* powering back up ?
Otherwise you're just risking it tripping again ...
and again ...
and again ...
The proper procedure is:
1) power goes
2) power everything off (physical switches if needs be)
3) fix underlying problem
4) begin power-up sequence. Which is laid out somewhere in the DR manual.
But then I am 50, so have some smarts you'd have to pay for.
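For anyone who hasn't written one, step 4 usually boils down to something like this (a minimal Python sketch; the stage names and health checks are placeholders, not BA's runbook):

# Stages in dependency order, each with a check that must pass before moving on.
POWER_UP_STAGES = [
    ("core network",         lambda: True),  # replace the lambdas with real checks
    ("storage arrays",       lambda: True),
    ("virtualisation hosts", lambda: True),
    ("application servers",  lambda: True),
]

def staged_power_up():
    for name, healthy in POWER_UP_STAGES:
        print(f"Powering up: {name}")
        # power_on(name) would go here - a PDU or iLO call in a real runbook.
        if not healthy():
            raise RuntimeError(f"{name} failed its check - stop and fix before continuing")

staged_power_up()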
That would only explain the loss of DC2 if the initial fault in DC1 caused it to go Byzantine and corrupt the state of DC2.
That's the missing link here. We're all professionals, we know fuck ups will happen, we know even relatively minor errors can bring down something as complex as a datacentre if proper procedures are not followed. What hasn't been explained is how the death of DC1 brought down DC2.
Options include:
1) BA are lying and this is purely an RBS/Natwest style fuckup at the app/data level.
2) BA are telling the truth about the root cause but the DCs weren't truly independent at the infra level causing one power fault to screw both
3) Same as 2 but incompetent applications management/implementation caused the failure to be incorrectly handled
4) It is a genuine accident and a dumb contractor screwed DC2 by hitting the wrong reset button, resulting in multiple days to replace damaged kit and replay lost data from backups.
5) It was an attempted forced-failover from DC1 to DC2 that has broken DC2 and left a crippled DC1 covering all the load on the hottest, busiest day of the year.
Place your bets.
Quite.
We had a rather unscheduled event once, where the fire brigade threw the Big Red Switch in the outside feed. During the time the cleanup was done (mopping up the water and ventilating the building) we worked out the startup sequence for the stuff present: network gear and standalone servers that wouldn't care about connectivity, servers that would need network or else their network configuration would be totally bonkers, and servers that would need to see particular other servers, otherwise they would be in a bind with the best way to recover being rebooting once the other end became available.
Energy was told to switch off all local circuit breakers before restoring power to each of the computer rooms, so that we could switch off all systems before the racks got powered.
With that crib sheet, things went just about flawlessly.
Good effort, but why didn't you have a plan prepared in advance?
Good question. Next question please.
What it comes down to is that that particular information just wasn't there, and basically the best thing to do in the time available was to distil it from a connectivity matrix, combined with noting whether systems were essential, auxiliary or 'meh, can wait'.
We have detailed info on all systems, including how to start them up from zero. Like for the main VMS cluster basically: 1) network, 2) storage management, 3) storage shelves and controllers, 4) the cluster nodes themselves, but only few of those documents describe the how and why of interaction with other systems. That info tends to be in another class of documents, not the system operation manuals. There are sections that describe what has to be done to neighboring systems in case of a total system shutdown, but that assumes you can log in to those systems to shut down the affected comms channels and such. With the power cut having done so for you, site-wide, there was a certain 'fingers crossed' involved, but the hardware proved surprisingly robust (two minor errors over the entire site, AFAIR), and the software only needed minimal prodding to get the essential bits working again.
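Distilling an order from a connectivity matrix is basically a topological sort. A small Python sketch of the idea, with invented system names rather than the actual site's:

from graphlib import TopologicalSorter  # Python 3.9+

# Each system maps to the systems it depends on.
depends_on = {
    "network":      [],
    "storage_mgmt": ["network"],
    "storage":      ["storage_mgmt"],
    "vms_cluster":  ["network", "storage"],
    "applications": ["vms_cluster"],
}

# Dependencies come out before the things that need them.
print(list(TopologicalSorter(depends_on).static_order()))
# ['network', 'storage_mgmt', 'storage', 'vms_cluster', 'applications']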
Why aren't the different components integrated?
Hysterical raisins, for a large part. Security comes into play too: systems that have outside connections in whatever form are not allowed to even communicate with the main clusters directly, let alone have processes running on those clusters. And there is stuff like data conversion systems supplied by third parties, hardware and software, therefore not integrated as a matter of fact.
Furthermore, you don't have your monitoring system integrated with whatever you're monitoring, do you? Or your storage management system integrated with the systems you're providing storage to?
This is the biggest load of BS ever. There's no way this started with a power failure at all.
What's more likely is there was a data/application error that they'd never encountered (planned for, or tested) and someone decided to kill the power ("have you tried turning it off and on again"). Because the systems at the other sites mirror the one that had the problem, they will then have wondered why killing the power did absolutely nothing to fix the problem. So then they'll have killed the power at the other sites and tried to power it back up. As the applications came back online they may have been faced with loads of data corruption which were possibly fixed either manually and/or with a combination of tools built into their applications.
The article quotes someone as saying a data problem is easier to fix than a hardware one. No idea where you got that total bullshit from. It depends on the circumstances. Even if you had to replace some hardware, that can generally be done faster than trying to fix a set of applications with corrupt or otherwise invalid files that are all trying to talk to one another.
And as for "this all happened in the UK and isn't outsourced" - who developed and tested the applications? Oh yeah, outsourced Indian workers. *Slow clap*
"Do you know something we don't? There are a thousand ways and more this could have started with a power failure."
No, I don't. However consider all of this....
1. Assume there was a power failure at the primary DC.
2. The primary DC has backup/UPS power - why doesn't that work? The article suggests *maybe* the main power and backup were applied simultaneously causing the servers to use 480V. Fair enough.
3. How does (2) affect what happens at the secondary DC? Why does exactly the same thing happen on a redundant system which is designed to mitigate against such problems occurring at one DC?
If the power management is also controlled via software, that is a data/application problem, which hasn't been tested - if you are sending the "wrong" data to the secondary DC it will only have the same results, and replicate the problem there!
I can't see how this would just come down to a "freak" power incident (nobody else in the area has reported it either) that knocked out two physically separate data centres, whereby the UPS also failed to work. It's just too coincidental and convenient.
It's more a case of whatever happened at DC1 was mirrored at DC2 - either by humans - or by data sent from one site to the other.
2. The primary DC has backup/UPS power - why doesn't that work? The article suggests *maybe* the main power and backup were applied simultaneously causing the servers to use 480V. Fair enough.
That (getting 480V fed into the racks) suggests a more than grave wiring error that would have caused one or more of 1) seriously frying the output side of the UPS, 2) seriously frying the generator, 3) causing an almighty bang, 4) causing parts leaving their position at high velocity, 5) the electrician(s) that did the wiring leaving their place of employment at high velocity, and 6) one or more electrical certification agencies not previously involved in certifying and testing this setup taking a long, hard look at the entire process from commissioning the installation to the aforementioned result.
3. How does (2) affect what happens at the secondary DC? Why does exactly the same thing happen on a redundant system which is designed to mitigate against such problems occurring at one DC?
Did it? Or was the second DC karking the result of the primary DC splaffing corrupted data as it went down, and thus corrupting the failover?
2) 480 volts. BS: There's no way that any UPS I've ever seen could operate or be wired so that could happen. The only possible thing that might have happened is that grid power was restored out of phase with the UPS inverter output, which could cause a big bang. However, UPSs are specifically designed to deal with that, with state change inhibited and the UPS phase adjusted until it's in sync with the grid input. No UPS could work without that logic.
UPS systems are usually either true online, or switched. With true online, the UPS inverter always supplies the load, with the grid input powering the inverter and float charging the batteries. With switched, the grid normally supplies the load directly and the batteries are idle, on float charge. For the latter, when the grid goes down, the inverter starts up from batteries to power the load, which is switched over from grid input to UPS inverter within a few cycles at most. All these things have smart microprocessor control, which continually samples mains quality and has lockout delays to ensure that brief transients don't cause an unnecessary switchover. There are also delays to ensure that grid power is clean and stable before reconnecting to the load.
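The inhibit logic amounts to something like this (a Python sketch with made-up tolerances, purely to show the shape of the check, not any vendor's firmware):

FREQ_TOLERANCE_HZ = 0.5
PHASE_TOLERANCE_DEG = 10.0

def safe_to_transfer(grid_freq_hz, inverter_freq_hz, phase_error_deg):
    # Only allow the state change once frequency and phase are close enough.
    freq_ok = abs(grid_freq_hz - inverter_freq_hz) <= FREQ_TOLERANCE_HZ
    phase_ok = abs(phase_error_deg) <= PHASE_TOLERANCE_DEG
    return freq_ok and phase_ok

print(safe_to_transfer(50.0, 50.1, 3.0))    # True: transfer allowed
print(safe_to_transfer(50.0, 50.1, 180.0))  # False: inhibit (the big-bang case)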
We've now had several stories about this, none of which seem real, and all ignoring the giant elephant in the room, which is that their disaster recovery failed completely. The tech to do this has been around for decades, so how could it go so badly wrong in a company of that stature?...
Never underestimate the power of human arrogance and stupidity - if it can happen, it will.*
*recent events surrounding NSA having dangerous tools taken from them and used to bitchslap corporates and gov agencies that should have known better prove that pretty well...
"their disaster recover failed completely. The tech to do this has been around for decades, so how could it go so badly wrong, in a company of that stature ?..."
That is the $128+M question. The rest has by now been repeated so often that it's starting to get a little less entertaining than it was.
A power surge occurred at Redbus 8/9 Harbour Exchange back in 2004 ish. Hundreds of power supplies had to be replaced after going bang due to the same thing delivering 480v to all rack servers. At the time the MD personally paid for a plane shipment from china full of server power supplies to get all customers back up and running. Seems strange BA had enough hot spares for this task.
Lost neutral is seriously bad juju.
One or two phases will go hot, and once any phase goes over about 250VAC it'll rapidly kill things.
And it doesn't take a big imbalance to do that.
I've seen a few major installs with a rusty neutral - they tested out fine initially, then blew things up a few weeks or months later.
I think this has some credence. I have experienced a faulty earth on a newly rewired three-phase system. This can cause voltages of 280V+ (interphase voltages). This could cause higher voltages in the logic circuits and thence zapping of chips and insidious faults. Furthermore, the higher voltages in the logic circuits could pass to the connected DC2 via the data links (NOT POWER CIRCUITS). That could take out DC2. Just a hypothesis.
480 Volts is unlikely, as is the suggestion that supplies were connected in series to give 480V. Entirely possible, when a fault or mistake happens during synchronising three-phase generators to the mains or another 3-phase supply, is a lost or floating neutral, which usually results in phases being substantially under or over voltage by anything from near zero change to near 240V. If you are lucky you just fry equipment, but less than a month ago such an incident also burnt down a pair of semi-detached houses.
The article quotes someone as saying a data problem is easier to fix than a hardware one. No idea where you got that total bullshit from. It depends on the circumstances. Even if you had to replace some hardware, that can generally be done faster than trying to fix a set of applications with corrupt or otherwise invalid files that are all trying to talk to one another.
Indeed. And even if half your hardware is fried, it should be possible to bring up the other half with a reduced set of applications in a way that core functionality can be restored. And your DR plan should have tables of what machines can be reallocated to other tasks in a case like that.
Corrupted data is another matter entirely. Can you fix it by rebuilding a few database indexes or zeroing some data fields, do you need to restore a backup or is the 3rd line support tiger team huddled over their monitors amid mountains of empty coffee cups, alternately muttering lines of logging or code, and obscenities?
If... the power control software controlling the multiple feeds into the data centre didn’t get the switch between battery and backup generator supply right at the critical moment ... it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480v instead of 240v...
There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load. If there isn't, someone has seriously screwed up on safety and you've got a *very* dangerous situation on your hands.
Failover from utility to generator power is fully automated in our DC. You'd need to be in the plant room with wellies, gloves and a big set of jump leads to override it. Even if you succeeded it would trip out the breakers PDQ.
Anyway, "a power surge broke the computer" is totally a "dog ate my homework" excuse, so I say they're bluffing to cover for an even more embarrassing reason.
Indeed, all the backup systems I have worked with have both electrical and mechanical locks to prevent this happening. It's a Very Bad Thing (TM).
I have seen a photo of what happens if you connect 1MW of diesel gen-set to the grid without syncing the phase. The result is that the stator (there's a clue in the name) rotates by the phase difference.
"There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load."
There was an Open University "System Design" course programme about such things. One of the real-world examples was a local tram system. At the end of the line the operator had to throw a switch to reverse the electric motor power for the return journey.
The switch had three positions - with the centre one being "off". What happened several times was that the operator went through "off" so fast that the reverse power was applied to a motor that was still moving. This produced a destructive surge.
The simple modification was to have a separate key-lock "on/off" switch for each of the two directions. The switches were separated by a large distance on the panel. There was only one key. This gave sufficient time delay while both were in the "off" position.
"There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load."
Back in the 1960s a university laid on a tour for prospective students. They demonstrated winding up a small generator - and watching the phase lights to time when to sync adding its output to the national power grid.
The lecturer told the story of someone getting their timing wrong - with the result that the generator was momentarily competing against the national grid. The generator stopped dead - and disintegrated.
Good comments. Proper power "Transfer Switches" should automatically "break" before they "make". However, there are times when you want to isolate a UPS and run direct from mains for maintenance. This requires a sequence of switching to occur. Get it wrong and you lose power. This still doesn't address the lack of DC2 coming on stream.
"There should be some serious interlocking going on to prevent two completely different power inputs being connected at the same time to the same load."
As an electrician, the simplest ways I can think of to get 415 V where 240 V should be are either to swap over one phase and the neutral, or to accidentally disconnect the neutral on a 3-phase supply.
The former is almost always an installation error (I took out a bank of lighting when I was an apprentice), while the latter is generally poor design (inappropriately switched neutral) or an equipment failure.
I also seem to recall, (but cannot find*), a news story where a disgruntled employee damaged part of a DC by temporarily disconnecting the neutral.
* I think it was on El Reg [citation needed]
The classic way to do the latter is to forget hooking up the neutral after insulation testing. Can have very fun results, depending on how skewed the load is across the phases.
There have also been a number of cases of lost neutrals from overhead wires torn down, people crashing into cabinets, etc, with buildings literally catching fire as a result.
WTF are you even touching the neutral for any testing?
That's not needed for any form of electrical testing that I recognise. Low N to PE resistance on the load side is easily found by other means, and on the supply side they're either bonded together or they aren't.
In fact, if you disconnected the neutral to any system while any live was still connected I'd throw you straight off site and end your career.
We used them for our website hosting, and SQL backend to that.
Came in Monday, noticed SQL agent had stopped. Bit more digging and discovered an unplanned outage.
Spoke to UK FAST who - without any hint of shame - admitted that both the primary and backup SQL boxes had been connected to the same power bus, so when they had an outage we lost both.
A note to anyone who is using these jokers: they genuinely didn't understand why it was a problem.
When our small company outgrew personal printers and "floppynet" we installed a Novell server (3.12 IIRC) in what was then the warehouse.
As a non-IT person, but the closest we had, I was nominated "IT Mgr".
I had the presence of mind to have a non-switched box on the wall for the IEC lead that powered the server and all seemed good.
That was until our kettle in the kitchen started to fur up and trip out the kitchen power.
Sadly the switched socket that I had replaced with a sealed box was on the same circuit so the server crashed unceremoniously every time this happened.
We had that changed promptly and ordered a small but suitable UPS and commenced negotiations with the ISP next door for a feed from their Diesel Genny.
Good lesson to learn early.
Still not an IT Mgr and from the tales I read here I'm bloody glad I'm not. PP
I did think it might be something like this. Maybe the UPS was put into bypass so work could be done on it, then someone hit the main breaker and, Bob's your uncle, Fanny's your aunt, the whole lot goes! Then in the panic power gets restored whilst the other DC is part way into taking over and you have a right old shit storm happening. Now we only run a small DC at our site, with something like 24 racks, 4 CRAC units and a UPS supplying the room, but that could happen with our setup; we don't have a genny. To put the UPS into bypass you need a key to move the bypass switch lever.
Had a similar situation with a planned building shutdown, I was scheduled to be on-site for the testing of all desktop equipment 9am Sunday morning at our office suite.
Nobody had switched the bypass on the UPS before the Friday shutdown (servers powered down safely though), so the UPS was dead, and India thought I would be at my desk 40 miles away from where the building was at 3am & left lots of messages about why the servers were not coming back up remotely.
I turned up on schedule & in the dark (figuratively), electrical contractors for the UPS had to turn up & test each battery in turn just in case one "blew" when power was restored.
"no, we're not going to pinpoint it on a map for you"
Is this some sort of; we don't want to be accused of helping them darned terrorists thing?
I am rather surprised Caliphate Inc. haven't claimed responsibility, as they seem to claim they have inspired or are responsible for everything bad which ever happens.
"After a few minutes of this shutdown of power, it was turned back on in an unplanned and uncontrolled fashion"
That's got to count as one of the most ambitious attempts to switch something back on in the hope that no one will notice it ever got switched off in the first place.
Whoops!
I worked in a data centre once when a power contractor disconnected a live cable he should not have. The cable then proceeded to wave about in the air like a snake on fire and arced across several items of power supply kit, including UPS batteries.
After the immediate problem was solved it took more than a few hours to determine what kit was damaged. This included many server PSUs, many network interfaces and many other less important items.
Luckily this was a facility that was not live, but because the kit was sprayed over a period of time (perhaps a minute or 2 ) there was always a chance data would be out of sync or corrupted.
I am extremely suspicious that nobody from the inside has tipped off the Reg with the real story. it suggests an unfeasible level of loyalty (given all the outsourcing) or that nobody involved reads this esteemed organ which, in turn, suggests a paucity of people who actually know anything.
"nobody from the inside has tipped off the Reg with the real story. it suggests an unfeasible level of loyalty (given all the outsourcing) or that nobody involved reads this esteemed organ which, in turn, suggests a paucity of people who actually know anything."
The stories (and comments) reaching the public about what allegedly happened are almost infinitely improbable, which (as you and others have noted) is quite remarkable.
Perhaps the BA datacentres have turned into the Marie Celeste of airline IT: perhaps nobody was actually in operational control? That's as likely as most of the other stories circulated so far, starting with the original "a big boy blew my power supply and ran away".
I have known cases where everyone concerned has been very careful not to pin down the exact cause too precisely, just in case it turns out to be their side of the fence.
This may well be the case. If it was indirectly due to inadequacies in the outsourced service then the person who points it out will be making top executives look foolish, and pointing out the state of the emperor's clothes is usually followed by a man with a large weapon telling you what you didn't see. If it were in house then you're making in house liable to be outsourced, and if it's a UK contractor jobs will be at risk as well. So it may well be in no-one's interests to define the root cause too accurately to senior management.
Maybe if BA were running active:active they have just found a single point of failure that they did not notice before in the design.
For example maybe taking out the power completely knackered the Enterprise Service Bus (or access to it). If messages could not get to the backup then it fits in with what Cruz said about a power supply issue causing a network problem that meant messaging failed between all the systems.
The backup data centre might have just been twiddling its thumbs.
Unless of course you do something REALLY stupid.
The only time I have ever come across anything this stupid was while working in a GUS centre; are they and BA linked in any way??
They were running 240volts via 2 phases of a 3 phase power line, but fitted with SINGLE pole isolators, so in the "off" position, all the equipment was still "live" - and someone got fried.
To compound the issue, every single CB was bypassed at all 25 stations, so even if you tripped all the switches in the control cabinets, it was still all live.
Installed in 1973, I found it the first time I opened a cabinet box to try and find out why the guy had been fried - in 1992.
Even if they managed to use two different phases as "Live" and "Neutral", you would still only get 380volts.
It sounds to me like a scapegoat has been chosen, and they are making up a story to use.
https://www.theregister.co.uk/2017/04/11/british_airways_website_down/
As you stare at the dead British Airways website, remember the hundreds of tech staff it laid off
The staff who knew the quirks and oddities of the system were dismissed and the remaining staff had no clue. The rest is nothing but shouting.
I've seen this happen a few times. Quite a few times.
Manglement always thinks a complex network is plug and play and so are the staff. And this is always the result.
So much schadenfreude here.
Reminds me of one occasion a few years back. Had installed an active-active firewall pair across two sites with each handling one leg of a 2MB leased-line to the internet (see, I said it was a few years back!) and automatic config hot-synchronized between the pair.
One firewall sends alarms to say fan playing up. No problem, will just power it down, replace with a good unit then bring it back up again. Power down the offending unit, traffic fails seamlessly over to the secondary and all good. Re-cable from dead unit to the new one, power up. Still all good. Give the "re-synch" command and hear someone outside wondering why the internet has stopped working. Upon closer inspection I determine that the re-synch has indeed happened, with the configuration being dutifully copied across - but I now have two units each with a blank config. Heart stopped and blood ran cold as I realise what I've done...
Fortunately it was only a 15 min sprint to the other site where I had a laptop with all the backed up configs. A lesson learned the hard way indeed!
But TBH in BA's case I can't believe that the Friday of a Bank Holiday weekend wasn't considered a change freeze with nobody allowed to do anything other than emergency work in their DCs.
Too many years ago (37 , if you want to know) I talked to a customer who had just tested his backup process ("just to get the feel of the syntax")
At that time , some utilities had a syntax of "$Copy {input device} to {output device}" while others had a "$backup {target device} from {source}".
This chap managed to copy an 8" floppy to his 100MB database drive (and it worked... perfectly.... database was now about 300KB! )
Could BA have just run a backup / recovery in the WRONG direction?
Jc
In any building which consumes anything like a megawatt the power supply will be three-phase, as will be the backup generators. The UPS (or, rather, presumably a number of them) in an installation of that sort of size will be supplied directly from the three-phase supply and will provide 240V single-phase AC only at the UPS outputs.
So we know straight away that any talk about 240V+240V=480V is nonsense, because to get 240V in the first place you had to come up with some sort of 'neutral' which is NOT provided by the generator and divide the, er, real numbers by the square root of 3. This is why you'll see mention of 415V all over the place in serious power circuits. That's the phase-to-phase voltage. There isn't any 'neutral' and you can't wire three-phase circuits 'in series' - the suggestion makes no sense.
I'm not a Chartered Electrical Engineer for nothing.
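For anyone who wants the arithmetic behind that square root of 3, here it is with nominal UK figures (back of an envelope only):

V_{\text{line-line}} = \sqrt{3}\, V_{\text{line-neutral}} \approx 1.732 \times 240\,\mathrm{V} \approx 415\,\mathrm{V},
\qquad
V_{\text{line-neutral}} = \frac{415\,\mathrm{V}}{\sqrt{3}} \approx 240\,\mathrm{V}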
240 Volts??
In my young day it was 220V. Then, ca 1968, it was upped to 240V. That fried a lot of light bulbs, making work for the working man.
Then, ca 1996, the EU standardised on 230V. That left a lot of light bulbs running until eternity, so they had to invent an eco-scare to force people to keep replacing light bulbs.
The UK used to be 240VAC +/-6% (225-254VAC), and various EU countries at 220VAC and 230VAC nominal.
Then the EU harmonised at 230VAC +10%/-6% (216-253VAC)
This range was carefully chosen to ensure that no EU install had to change anything at all.
Today, most sites in the UK still get 240VAC.
New builds often get 245-250VAC, as they tap high to allow for future additional load without having to touch anything.
"The UPS (or, rather, presumably a number of them) in an installation of that sort of size will be supplied directly from the three-phase supply and will provide 240V single-phase AC only at the UPS outputs."
What about computer systems that require three-phase?
Even if such beasties are no longer common, I would have thought that someone the size of BA would have some legacy kit using three-phase.
Even if such beasties are no longer common, I would have thought that someone the size of BA would have some legacy kit using three-phase.
Total bollocks. The kind of UPS that are installed in even the most modestly sized data centers have three phase inputs and three phase outputs.
I once worked at the main telephone exchange of the capital of a particular country. The power could not fail - they had a huge room of lead acid batteries constantly float-charging to supply the 50V power (@ circa 4000 amps) to the electromagnetic exchange. Any one of the 3 container-sized power supplies feeding the batteries could handle the full load, and those were fed from two different sections of the national grid. Two enormous diesel generators (primary and backup) each in a separate room would take over in the unlikely event that both grids went down.
A bush fire 100 miles away weakened a few 400kV pylons which collapsed and took out one arm of the grid. The additional load on the other arm promptly tripped it offline.
The primary diesel generator started up automatically within seconds but soon made horrible noises and stopped due to the fact that the maintenance guys had drained the oil the previous week as routine maintenance, then realised they had no oil in stock so it had been left dry while they ordered some (but had been put back online so the boss wouldn't notice).
The manager quickly went to the secondary generator and started it manually, but while it roared into life it generated no power at all. The batteries could power the exchange for about 30 minutes and time was running out fast. A loose terminal on the excitation winding of the secondary generator was found 5 minutes before we were about to undertake a complete shutdown as the battery voltage was falling to below 45V, and a hasty repair was made. Just as power was about to be switched to the now functional generator, the grid came back on, making it unnecessary.
A very close call - bringing up an electromagnetic exchange is not straightforward. You cannot simply power it all up in one go, as almost all the solenoids would energise at power-up and overload the whole system. Hundreds of fuses must first be removed, thousands of electromechanical switches manually moved to their home positions, and then the fuses replaced in a particular sequence.
A lesson in Sod's law.
I likewise worked in a company in the travel industry that had an enormous 5 (IIRC) generator UPS and a car park on top of the diesel tanks. Every Friday it was tested and the A/C went bang. It was unfashionably modern for its time.
One night I was on shift on site first line support when the A/C went bang and the lights dimmed. Hmmm. The lights shouldn't dim, we're on UPS. Then the alarms started :) The two fellows from the energy centre, one as white as a sheet, came in shortly after. One half of the UPS (for it was twinned, with the fifth a spare) had blown up after it had kicked in following a half second power drop due to expected engineering work. The chappie had walked into the generator room as some switch thing had arced onto the floor right in front of him drawing a scar into the concrete.
We lost half the UPS, which, as it turned out, powered all the DASD. They all had to be cold started and it also turned out that we had two large floppies to IML them... it took a while.
I was 24... nobody died... happy days... :)
edit: PS no power supplies were damaged by being switched off and on. And this was the days when humidity, not just temperature was important. I remember placing saucers of water round a small DC when the humidifier in the single AC failed. But that was some generations ago.
I built and installed a system on a government owned, contractor operated site a few years back. It was the kind of place that really, really didn't want to lose power, and their UPS set up was pretty impressive. Then, a few months afterwards there was a lot of flooding in the region and the grid went down in the area. I held my breath, I happened to be spending Christmas in the area and I was waiting for the phone call asking me to go in and sort out the system after the inevitable borkage that had no doubt occurred.
The call never came. The entire site's UPS had worked flawlessly, it never missed a cycle as it failed over first to battery and then generators, and ran quite happily all week on the fuel they'd got in stock. When the grid came back they'd still got another month's worth of fuel in the holding tank and a bowser on standby, just in case.
I was very impressed as, i) this was GOCO, ii) it worked, iii) it wasn't costing the tax payer an arm and a leg, iv) someone in government had managed to get the contract right
BA
It sounds like BA have got years of engineering debt built up over decades of doing their own IT. The thing about doing your own IT is that you have to invest in it, otherwise these sorts of problems will occur and get worse. BA need to get serious about either building a new infrastructure, or moving onto someone else's. It sounds like things are pretty marginal in their current (only?) building, and a "simple" thing like a fire will kill their business stone dead, permanently; BA ceases to exist.
That's a corporate wipe out risk they're running. Compared to the cost of a planned replacement / duplication of their current infrastructure (a guessed hundred million-ish if they do it themselves, considerably less if they use someone else's?), corporate wipe out is veeery expensive.
I wonder how the shareholders feel about that?
Oil
The oil industry is the same. A friend worked in one major company, he reported that their IT was in a hideous mess. It was the sum result of decades of projects that had been started during good years for the oil industry, and aborted half way through when the next slump came along.
See,I always thought BA stood for bloody amateurs..
Looks like I was right all along !!!
All I can say is that it couldn't happen to a more deserving bunch of incompetent, expensive morons..
Oh I do so hope their insurance says no, you caused it, you pay for it..
That should put a dent in any manglement bonuses this year...
There are a few such "world-class" data centres around Docklands that I have experience of. Sure, I've seen the acres of batteries and the generators, but I have never found one that was genuinely UPS.
In one case a FTSE 100 company fitted its own UPS next to its servers, despite already paying a huge rent to a facility that was supposed to provide it. That's mad right?
Or was the international bank that relied on the sales brochure, contract and SLA, mad? It's pretty hopeless waving your contract around when your European systems have no power.
"CBRE Global Workplace Solutions is a facilities management company for commercial property."
Synchronicity? Karma?
http://www.cbre.co.uk/uk-en/services/global_corporate_services currently says
"A big boy eat my power supply and ran away".
Well, actually it says:
"An error occurred while processing the request. Try refreshing your browser. If the problem persists contact the site administrator"
but they're both equally helpful.
This one works though:
http://www.cbre.co.uk/uk-en
[edit, 9 minutes later: both "working as expected"]
These things definitely happen. I was overseeing the recommissioning of the main propulsion switchboard on one of HM's Submarines. 720V DC stuff (shiver). Jack forgot the DC Shore supply was connected and switched to Batteries-in-Series. Big bang. Magnetised Submarine. Cost £5M to degauss. Embarrassment all round. It was a long, long time ago.
The power-supplies are servers themselves, with a network connection and complex software. There is a big kill-switch, but it is all controlled through software. Possibly a software fault on the power-supply network caused trouble, and the power-supply was killed abruptly, resulting in damaged hardware, and possibly software. Restarting can break some hardware itself.
When our troubles come, they come not single spies, but in battalions...
Well Willie, the first rule of Outsourcing is you can't outsource responsibility. The whole Outsourcing thing is simply a way to reduce pay and conditions for workers. Also, companies who outsourced IT Operations and Infrastructure (sure, it’s only tin) now realise it’s neither cost efficient nor a better service. When something goes wrong, it takes much longer to recover when IT is scattered to the four winds.
I am not sure why most of the comments and the UK press are so negative on BA. I am from the Netherlands and just want to know what happened. Reading a lot of nonsense in the newspapers.
Human error based on a single source. Denied by the contractor. Does any newspaper check facts, or do they not care and just publish?
There are many documented failures of power in datacenters. So BA is certainly not alone. An overview here http://up2v.nl/2017/06/02/datacenter-complete-power-failures/
And a detailed post about what went wrong here. Hope someone will learn from it
http://up2v.nl/2017/05/29/what-went-wrong-in-british-airways-datacenter/
Active:active without estrangement between the two is a flying farce. In other words, your secondary live site must be autonomous, not interlinked. You move to it with no electrical interplay. What kind of collective ignorance is at play here?
Nobody loves the smell of fried computers in the morning. Or do they?
"Nobody loves the smell of fried computers in the morning. Or do they?"
Not too sure about that. One especially warm weekend a customer's aircon failed and the computer system got so hot that the only sensible thing to do was condemn it, claim from the insurance, and buy a new replacement.
Our hardware salesman was not displeased with an extra sale landing in his lap.
"One especially warm weekend a customer's aircon failed and the computer system got so hot that the only sensible thing to do was condemn it"
How many people running datacentres DON'T have a crowbar set to drop on the power if room temps go over a set limit? (usually 35C)
"How many people running datacentres DON'T have a crowbar set to drop on the power if room "
Almost everyone running anything mission critical? You would start manually turning less critical stuff off first if you had a heat build up.
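That graduated approach is easy enough to sketch in Python (invented thresholds and host groups; only the 35C hard limit comes from the post above):

# Shed the least critical load first, hard-cut only at the crowbar limit.
SHUTDOWN_PLAN = [
    (30.0, ["batch workers", "test rigs"]),
    (33.0, ["secondary app servers"]),
    (35.0, ["everything else"]),   # the crowbar point
]

def what_to_shut_down(room_temp_c):
    to_shut = []
    for threshold, hosts in SHUTDOWN_PLAN:
        if room_temp_c >= threshold:
            to_shut.extend(hosts)
    return to_shut

print(what_to_shut_down(31.5))  # ['batch workers', 'test rigs']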
With an industrial electrical background I can say that only a fool would have a system that could in any way "combine" in series to supply 480v.
It doesn't seem logical either, because for that to happen the systems would need to be in series to begin with, that just seems highly unlikely, only someone who doesn't understand the basics of electricity would do something like that.
What I see as more likely to have taken place is a failure of the generator/battery system and someone instead throwing mains feeds straight into the barn. The load is excessive, and the voltage being so far out causes some of the supply units to go pop or breakers to get thrown off again; they cycle the power again trying to fix the issue but it doesn't help. Hook up enough SMPS units and try to power them all at once on a leg that can't support the starting current and you will see it happen every time!..
Inrush current of 100 amps for one computer is bad, about 20x max load, so if you imagine this in a data center with some fool turning everything on at once, melted lines and dead equipment does seem possible.
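Back of the envelope for that, using the 100 A inrush / roughly 5 A running figures from above and an invented breaker rating, just to show why staggered power-on matters:

INRUSH_A, RUNNING_A = 100, 5      # per machine, illustrative figures only
SERVERS = 40
BREAKER_A = 400                   # invented feed rating

all_at_once = SERVERS * INRUSH_A                       # 4000 A: trips instantly
staggered   = INRUSH_A + (SERVERS - 1) * RUNNING_A     # 295 A: survivable peak
print(all_at_once > BREAKER_A, staggered > BREAKER_A)  # True False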
How to get to 480V?
Your incoming power feed at 240V plus your UPS/generator feed which is a separate 240V feed.
And a big switch between them, with software to control it and a manual override. Someone screwed that up, based on the article, although the engineer may be being thrown under a bus and the details may be more subtle.
BA are supposedly around 250 racks per DC - given the age of the DCs and the likely equipment (mainframe and comms heavy), they are likely around 1-1.5MW per DC. Nothing will be small..
Remember kids, connecting power sources in parallel increases the current, connecting them in series increases the voltage.
If you connect two AC feeds in parallel, which is what would most likely happen if a switch borked, you'd either get a short (if they were out of phase or the phase order different - many years ago, SUNET had a hilarious DC failure caused by a UPS bypass switch accidentally engaging), or nothing at all (in fact, big feeds are actually several parallel conductors already since a single one would get too big).
Connecting AC feeds in series? No, sorry, doesn't work like it does for DC...
At most, the latter could cause things monitoring the direction of current in the HV feed to trip. Or a generator to run backwards, which could very well spell disaster - for the generator, not for the server.
Otherwise the major concern would be this happening during a power failure and frying the poor lineman trying to fix it.
The typical ways to get overvoltage from wiring errors in a three-phase system would be to either
a) lose the neutral wire in which case your single phase loads would get 0-415V, depending on the load balance between the phases
or
b) hook up a single phase load across two phases instead of a single phase and neutral, which would give it 415V.
Neither case gives you 480V, although they certainly won't be good for the equipment regardless.
Both cases are highly unlikely to result from some sort of switch failure since the neutral is never, ever switched in the first place.
I guess you could somehow theoretically end up with two out-of-phase sources supplying your 3 phases, which could cause more than 415V between phases but still wouldn't hurt your single-phase loads (at least not unless the neutral wire gets very overloaded as a result).
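Case (a) above is just a voltage divider once the neutral is gone: two single-phase loads on different phases end up in series across the 415V line-to-line. A rough Python sketch, assuming purely resistive loads with invented values:

V_LINE_TO_LINE = 415.0

def lost_neutral_split(r_load_1, r_load_2):
    # Voltage each load sees when the shared neutral is lost.
    v1 = V_LINE_TO_LINE * r_load_1 / (r_load_1 + r_load_2)
    return v1, V_LINE_TO_LINE - v1

print(lost_neutral_split(100.0, 100.0))  # balanced: ~207 V each
print(lost_neutral_split(500.0, 50.0))   # unbalanced: the lightly loaded leg sees ~377 V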
Except for big financial organisations, a lot of companies don't really have a real DR plan. True, they might have a secondary data centre, but to actually get down and test the DR plan by flipping off the power to the primary DC? Not going to happen, even if it has the blessing of the CIO/CTO.
To divert traffic and data away from the primary DC, there is a lot of preparatory work that needs to happen. This alone defeats the purpose of redundancy. The sensitivity of data traffic has now gone to such a level of stupidity that it's not enough to just configure the primary and secondary path and hope that the client will be able to send the traffic down the secondary path in case the primary path is down or detected to be down.
Now that's the technology side of things. How about the financial side of the coin? How much money does BA have to spend (annually) to build and maintain a mirror of the BoHo? And now here comes the question akin to "what are the odds of winning the lottery?": How often will BA see a system-wide outage? And then throw in the equation of "how much will a system-wide outage cost BA?" and then one will come to a final conclusion that, with the event that just happened, it is still cheaper not to have a DR site.
Apologies for the long post.
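The financial side of that argument is a one-liner of expected-value arithmetic. Every number below is invented purely to show the shape of the calculation (nothing to do with BA's real costs):

annual_dr_cost     = 10_000_000    # running and testing a genuine mirror site
p_site_killer_year = 0.05          # assumed chance per year of a site-killing event
outage_cost        = 150_000_000   # assumed cost of a multi-day total outage

expected_loss_without_dr = p_site_killer_year * outage_cost
print(expected_loss_without_dr, annual_dr_cost)
# 7,500,000 vs 10,000,000: on these made-up numbers, skipping the DR site
# looks "cheaper" every year - right up until the year it isn't.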
"Although various people have speculated that operations and jobs outsourced to India's Tata Consulting Services (TCS) contributed to the cockup, both the airline and TCS vehemently deny it."
*He (They) would, wouldn't he (they)? Wikipedia Link.
US operator: not likely to be the Indian site; their primary power is so bad that they end up testing the fail-and-restore disaster plan several times a week.
Also, most server racks split the three-phase power service and route three single-phase hot-neutral feeds to single-phase power supplies, which produce lots of high-amperage 3, 5 and 12V DC. Electrical Codes have surge standards that cover "ordinary" inside-the-building shenanigans. They are useless against trees and construction equipment bridging two phases at 1300+ V (or one phase to the local distribution lines).
And now for something completely different: Will European standard electric cars protect themselves when plugged into bad supply voltage? Ah, not so completely different--I lied.
"Electrical Codes have surge standards to take "ordinary" inside the building shenanigans. They are useless to handle trees and construction equipment bridging two phases of 1300+ V (or one phase to the local distribution lines)."
But you design your switchgear on the incoming side for those possibilities regardless - including the obvious one in many countries: a 6kV or 11kV distribution line falling onto the 240V lines where the distribution is above ground and the poles are susceptible to cars.
Most DCs have a dedicated 11kV feed and local distribution transformers, but you'd be surprised at the kinds of _shit_ that comes up the power lines.
It's not just the USA. UK power feeds are far from clean, with 1920s power standards still being acceptable in terms of dropouts, short brownouts and spikes. Our 2 large online UPS systems see an average of 5 notifiable events PER DAY. The stored kinetic energy (flywheel UPS) is used several times per week and the diesels are run in anger 3-5 times per month - mainly due to incoming power being well out of spec, rather than a complete outage.
I’m surprised the comments on here contain so much speculation on the way the power issue may have caused the problem. I think BA are lying and it wasn’t anything to do with power and was an application error.
There have been reports of IT problems at BA going back months, if not years - system slowdowns and crashes bringing staff close to tears. Then the system goes down during the busiest period, on a bank holiday. That's not a coincidence.
This is a poorly conceived cover up to hide the fact that offshoring IT to India has resulted in unstable systems.
It was certainly an application failure. The failure to apply proper procedures and respond properly. Compounded by the PR failure and lack of control of the message.
I understand that senior management went into lock down to resolve the issue, but why didn't they have a front man/woman to communicate this?
They're not very reliable. And often *cause* problems, up to catching fire.
It's like having a guard dog that sleeps through burglaries, chews the furniture, poops on the floor, and occasionally eats one of your children.
Present UPSs often occupy the functional niche where something potentially useful and far less harmful should be.
Our building has had to be evacuated twice due to the smoldering UPS.
And because only the servers were "protected" by the UPS, and the 100+ PCs were not, data was still lost with each power outage.
UPSs are typically daft. An embodiment of human stupidity.
One could imagine a UPS done correctly, and they obviously do exist. Somewhere.
"because only the servers were "protected" by the UPS, and the 100+ PCs were not, data was still lost with each power outage."
Hmmm. Have your IT department been around long enough to have heard about "thin clients" (using servers and storage on the network), or even just ordinary desktop PCs with zero local file storage (just local apps, local display, local processor, and memory, and (presumably) local Windows licence)? Or is that too 1990s for them?
"And because only the servers were "protected" by the UPS, and the 100+ PCs were not, data was still lost with each power outage."
Modulo the point that leaving data on client machines is a "really bad idea"
In that case, you need a whole building UPS. If you've got £3-1100k to throw at the job (depends on sizing) I can point to a few suppliers. These come as 20-40 foot shipping containers (again, depends on size) so be prepared to lose a couple of car parks.
"They're not very reliable. And often *cause* problems, up to catching fire."
Only if you don't maintain the things properly. They carry significant energy and MUST be respected. Don't put them in the same rack (or room) as the computing equipment, they need their own environment.
And yes, I've had one "catch fire" - after being put into service following a 6 month furlough due to a blown power transistor.
PHB of the techs had insisted it be stored outside because it was cluttering up the workshop, where it'd gotten damp - it was only under a tarp instead of being properly wrapped up. About 12 hours after it was hooked up to the load, it decided it didn't like its environment anymore and would make that fact clear by smoking up a storm.
Staff reaction to the rancid smoke filling the building? They opened a few windows as they came in at 5am (This was a radio station). And it wasn't until management arrived at 8am that the building got evacuated.
Yeah, we had that. We had an idiot do just that: He deliberately pushed the RED button and everything just went dark and quiet very, very quickly. (The emergency shutdown button was clearly labeled and could not be accidentally pressed as it was inside an enclosed structure.)
On the other hand, it was also a good way to really, really test the redundancy of the systems and DR. Nevertheless, it failed. Completely.
Power was restored 45 minutes after the button was pushed, but the rest of the IT system took three to four hours to recover, and all of it required manual intervention.
If y'all think that was funny, the postmortem was even funnier. All the executives in charge of the different systems sat around the meeting table trying to explain to the chief of the IT department why the system took so long to recover after power was restored and why the redundancy didn't work. Guess what: the status quo was maintained. No one wanted to ask, or answer, why critical systems didn't have an automatic failover mechanism and required a large amount of manual intervention to get things moving. No one.
Please note that the people in charge of the systems were 40% full-time staff; the rest were highly paid contractors.
And I forgot to mention that the site was a 1200-bed hospital.
Asking the questions doesn't help. In my stint as Unix sysadmin about 20 years ago I pointed out that we couldn't really tell if our plans would really work and asked for funding to run a disaster recovery exercise. The request wasn't denied, just ignored.
"I forgot to mention that the site was a 1200-bed hospital."
Wanna name names/locations? I will - this sorry tale is from a decade or so ago, when I was a routine visitor to North Hants Hospital (Basingstoke, UK), which is now around a thousand beds and wasn't that different a decade ago. Could've been various other places too.
There's not even a big red button in this picture, and the whole hospital lost power.
There was particularly extensive construction work going on around the site, externally and internally. On the occasion in question it was late afternoon (just after visiting time) at a time of year when late afternoon means relative darkness.
The lights went out without notice, the sockets (and everything else that didn't have its own batteries) lost power, etc. Not good, but these things happen and are planned for, so staff were initially not too concerned.
Unfortunately power+lighting was not restored in the timescale the staff expected. Nor was there any power to the small number of "critical services" including a few sockets. Stuff that had its own batteries might still be OK (e.g. some portable or life-critical stuff). Pretty much everything else - zilch.
To add to the fun, the main corridor in the hospital was one of the areas being worked on, and a full height partition had been erected along much of its length. So it had its locally-powered emergency lighting, but the lighting was useless because various people had allowed an inappropriate partition to be built without preserving the emergency lighting facilities. So people couldn't even see where they were going.
I realised just how unprepared people (staff, management, contractors, etc) were for something like this, and left as quickly as I could.
Turns out from later leaks that an inadequately supervised digger working in the car park had taken out both the main grid incomer and the feed from the standby generators, which shared a common underground duct.
How many things had to go wrong, how many people had to *not* do their jobs properly, to enable something like that to happen?
Not to dismiss any of the technical possibilities discussed here, but the single most likely reason for such a catastrophic outage is that IT budgets have been shaved consistently over a period of years, to the point where all the senior IT managers understood that their staffing levels, processes and infrastructure were probably inadequate to survive a catastrophic failure or series of failures - but were equally aware that telling their finance and operational colleagues this would probably result in their being side-lined, fired, down-sized or moved to "special projects"...
For an organisation in this state of denial the quarterly bottom line is everything, and long-term means only the next quarter. You could reasonably argue that this sort of failure is possibly the only way BA's IT investment could ever increase enough to address the long-term failure to invest responsibly.
He may be right when he says that BA's system administration is not outsourced to another country. That does not mean it's not outsourced though, and it does not mean that it being outsourced did not contribute to the problem.
Regarding the comment someone made earlier about customers not getting compensation because BA outsourced the IT service: for the purposes of compensation (and any potential legal action), that may be irrelevant. The customer's contract (such as it is) is with BA. If an outside contractor is maintaining a system that BA relies on, and that system fails, preventing BA from providing a service, then it's up to BA to provide the compensation (and they will also face any legal action). They can launch whatever actions are needed to reclaim the money from their contractor.
"...our source said, the power control software controlling the multiple feeds into the data centre didn’t get the switch between battery and backup generator supply right at the critical moment – or, potentially, if someone panicked and interrupted the automatic switchover sequence – it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480v instead of 240v, causing a literal meltdown."
Hogwash.
No. No. NO. Any modern DC has a sequence where the utility feed comes into a box called an ATS (Automatic Transfer Switch). Under normal circumstances, ALL the equipment in the DC is powered from the UPS which is on-line at all times.
The UPS has two roles - to maintain the output voltage at a steady 230V, catering for input supply power fluctuations, and to load-balance the input currents across the three supply phases (most modern UPS systems don't care much about output load balancing, though it's good practice to try to balance the outputs to get the most out of the UPS).
On-line UPS systems take the input power feed, convert it to DC to charge the batteries, then they have inverters that take the DC from the battery and convert it into clean AC for the data center. Unless the UPS is in bypass mode for maintenance, this is the case at all times for all modern data center UPS systems.
If the utility feed fails, the ATS detects this and immediately sends a signal to the genset to start the generators.
The UPS should have about 10 minutes of run time with no utility input - this is to cover the start-up of the genset, and if the genset doesn't start, time to send a signal to the servers that they need to do an orderly shutdown (this is done by s/w agents on the servers - signalling normally by IP).
At NO TIME should there be a detectable fluctuation in the power to the data center - all that happens when the input utility supply fails is that the batteries in the UPS are no longer being charged and start to discharge (hence the 10 minute run-time), and the genset starts up.
As soon as the ATS detects that the genset is producing the correct voltages, the utility supply (which has failed anyway) is automatically disconnected from the UPS and the genset output connected in its place - this happens in a fraction of a second, automatically, and again, there is NO interruption to the supply to the data center, as the servers are, as always, running off the batteries, not the UPS input supply.
The UPS batteries are now being charged by the genset, and life continues as normal. Normally, gensets spin up and stabilise in a couple of minutes (our one, a 450kW jobbie on the roof, takes under two minutes). The genset should have at least 24 hours of fuel on-site.
When the ATS detects that the utility supply is restored AND STABLE, it disconnects the genset from the input to the UPS and connects the utility supply in its place. After a few minutes of stable running, the genset is switched off.
There is simply NO EXCUSE for a modern (read "last 10 years") DC to lose power the way BA did. Just unforgivable.
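To make that transfer sequence concrete, here's a toy Python sketch of the ATS logic - entirely my own simplification, with made-up timings and names, not any real controller's firmware. The point is that the UPS output never sees a break; the ATS only ever changes what feeds the UPS input.

UPS_RUNTIME_S = 10 * 60       # ~10 minutes of battery autonomy
GENSET_START_S = 120          # genset stable in under two minutes
UTILITY_STABLE_S = 5 * 60     # wait for mains to be stable before switching back

def ats_step(state, elapsed_s, utility_ok, genset_ready):
    """One poll of a very simplified ATS. Returns (new_state, action)."""
    if state == "on_utility" and not utility_ok:
        return "starting_genset", "start genset; UPS batteries carry the load"
    if state == "starting_genset":
        if genset_ready:
            return "on_genset", "transfer UPS input to genset (sub-second, no output break)"
        if elapsed_s > UPS_RUNTIME_S - 120:
            return "starting_genset", "tell servers (s/w agents, over IP) to shut down cleanly"
    if state == "on_genset" and utility_ok and elapsed_s > UTILITY_STABLE_S:
        return "on_utility", "transfer UPS input back to utility, then stop genset"
    return state, "nothing to do"

# Walk through a utility failure and recovery (times compressed for the demo).
state = "on_utility"
for t, mains_ok in [(0, False), (120, False), (121, False), (3600, True), (3901, True)]:
    state, action = ats_step(state, t, mains_ok, genset_ready=(t >= GENSET_START_S))
    print(f"t={t:5d}s  state={state:15s}  {action}")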
"On-line UPS systems take the input power feed, convert it to DC to charge the batteries, then they have inverters that take the DC from the battery and convert it into clean AC for the data center. Unless the UPS is in bypass mode for maintenance, this is the case at all times for all modern data center UPS systems."
For larger sites, Flywheel UPSes are the same (incoming mains drives the flywheel motor-generator), but allowable dropout time is usually in the region of 15-20 seconds.
The ATSes are as described and gennies are on the input side of the flywheel.
This gets around the _substantial_ issues associated with battery maintenance, but introduces its own dangers - a 2-ton magnetically levitated flywheel in a vacuum chamber with _that_ much stored kinetic energy is not something to mess with, lest it exit its chamber (and the building) at ~100mph.
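To put rough numbers on "that much stored kinetic energy" (my own back-of-the-envelope figures, ignoring conversion losses and the fact that you can never extract all of the stored energy): riding through a genset start means the flywheel has to hold the full load for the whole dropout window.

# Energy a flywheel UPS must hold to ride through a 20-second genset start,
# for a few plausible DC loads. Figures are illustrative only.
TNT_J_PER_KG = 4.184e6        # standard energy equivalent of 1 kg of TNT

def ride_through_energy_j(load_w, seconds):
    return load_w * seconds   # joules, ignoring losses

for load_kw in (250, 500, 1000):
    e = ride_through_energy_j(load_kw * 1e3, 20)
    print(f"{load_kw:4d} kW for 20 s: {e/1e6:4.0f} MJ  (~{e/TNT_J_PER_KG:.1f} kg of TNT)")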
"For larger sites, Flywheel UPSes are the same (incoming mains drives the flywheel motor-generator), but allowable dropout time is usually in the region of 15-20 seconds."
Dated technology these days. Gas fuel cells are usually the way to go: http://www.datacenterknowledge.com/archives/2012/09/17/microsoft-were-eliminating-backup-generators/
Once, at a secure US military installation which was key to all current wartime communications, the technical control facility manager decided to take the building's UPS offline and go direct to mains power - the unit being active:active at all times. The reason was simple and necessary: replacing a room full of dead UPS batteries.
Regrettably, he only skimmed the instruction manual, didn't want to wait for the installation electrician and flipped the twisty switch.
The servers all went down hard. When he put the switch right (he was one position off from the correct setting), one key rack didn't come back online and remained dark.
At the time, this BOFH was wearing the information assurance hat, but I'm an experienced BOFH and also a certified electronics technician in industrial automation and robotics, so reading industrial electrical blueprints is ancient news to me.
"Where is the electrical blueprint?"
I spread several blueprints out on the floor, kneeling, tossing the incorrect diagrams aside, and rapidly located it (paraphrased, to protect NDA information): "Ah! Circuit breaker 57A, in bank 12F. Where is it?"
Predictable look of confusion and consternation and disclaimers of such arcane knowledge.
A swift heel-and-toe express around the battery/UPS room located the breaker - conveniently placed behind a one-off bank of several hundred batteries, well out of view and out of the way of traffic. Sure as can be, the breaker was tripped.
There were three chances that I'd flip that breaker on my own authority, on a US military base, and worse, in wartime: slim, fat and none.
"OK, here's the culprit. *I* am not going to touch the damned thing, it's way outside of my job responsibilities and I won't accept responsibility. So, it's your ball. Wait for the installation electrician or push it yourself and *you* take any resultant heat for hardware failure."
The manager considered it: "It'll be two hours before the electrician gets here!" He switched the breaker off, then to the on position. The rack lit up.
It took nearly 12 hours and a very upset COMSEC custodian, to restore all services. Each crypto device required rekeying, requiring the presence of said custodian to provide the appropriate USB (and other devices) keys.
Six months before, we had a similar outage, due to a blown transformer and the aforementioned room full of dead batteries. A room that was ignored, right until a US General couldn't use his telephone, due to the outage.
Suddenly, we had the budget to replace that which we had complained of twice weekly.
As an IT professional, I don't believe any of the explanation. In the unlikely event that there is any truth in it then the CIO should be fired immediately as well as the head of IT.
Neither has happened which reinforces my belief.
For many years now (I worked on planning fail-over back in the late 90s, and it wasn't new then), no corporation that large and that reliant on IT systems has gone without a triangulation system.
Triangulation is never based on sites in close proximity so where are BA's three sites? How could all three fail?
It is unimaginable that BA does not have such a system in place but if it doesn't, Tata is almost certainly the consultancy that devised the backup plan.
Whatever the reality, the CIO and head of IT have to take full responsibility and either resign or be fired. Anything less and you have two problems: 1. BA has proven it has no commitment to client service. 2. There is a rat and no one is admitting it, which comes back to 1.
As reporters, you need to keep digging 'cause you're being sold a pup.
"it could have resulted in both the battery supply and the generator supply being briefly connected in series to the power bus feeding the racks. That would result in the data centre’s servers being fed 480v instead of 240v, causing a literal meltdown"
Has anyone here in the electrical/electronic/computer field ever heard of an automatic system being configured in such a manner? Power supplies capable of being connected in series? Parallel, maybe... but then with three-phase you don't get 480V. Somebody with a 1+1=2 understanding of AC electrics has come up with a bullshit "press release" from a "source".
How many months before the truth comes out?
Paris, very saucy (but probably knows more about electrics than the source).
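For what it's worth, here's the phasor arithmetic behind the scepticism - my own illustration, assuming two ideal 240V sources: a series connection only adds up to 480V if the two supplies happen to be exactly in phase, and paralleling them adds no voltage at all.

# Magnitude of two 240V AC sources connected in series, for various phase offsets.
import cmath, math

def series_voltage(v_rms, phase_deg):
    v1 = v_rms
    v2 = v_rms * cmath.exp(1j * math.radians(phase_deg))
    return abs(v1 + v2)

for deg in (0, 120, 180):
    print(f"offset {deg:3d} deg: {series_voltage(240, deg):5.1f} V")
# offset   0 deg: 480.0 V   <- only if the two sources were perfectly in phase
# offset 120 deg: 240.0 V
# offset 180 deg:   0.0 V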
"...two different data centres, both of which are no more than a mile from the eastern end of Heathrow's two runways. Neither is under the flightpaths....
... and from aerial views (no, we're not going to pinpoint it on a map for you) BoHo looks to be around about that size...."
oh i see... that secret location.. just a little bit east of the two runways... just between the runways... and doesn't have all that coolant plant on the roof... gotcha... ssshhhh don't tell anyone.
nothing like the article hinting at where it is located, lol.
https://tinyurl.com/ybk6b5hp
"Normal Accidents" contributed key concepts to a set of intellectual developments in the 1980s that revolutionized the conception of safety and risk. It made the case for examining technological failures as the product of highly interacting systems, and highlighted organizational and management factors as the main causes of failures. Technological disasters could no longer be ascribed to isolated equipment malfunction, operator error or acts of God.
Extracted from Wikipedia.
Previous post referral
Thanks to whoever referred us to this.
Outsourcing the IT to TCS may not have caused the outage, but may have contributed to the delay in the recovery of the IT systems, prolonging the outage.
Too often we see companies penny-pinching in the wrong areas, and it comes back to bite them. The accountants look at staff numbers on a spreadsheet, see one UK permanent resource costing X and one offshore resource costing less than half, and choose the offshore. I know it's not just about the daily cost - there are also the overheads that go along with permanent staff - but when you replace someone with 20+ years' experience of the airline, how it runs, and its IT systems in all their intricacy with a well-qualified offshore resource, you lose 20+ years of experience you can never get back.
I'm not saying this would definitely have made a difference, but "you do the maths".
BA / IAG seem hell bent on destroying what was once a great brand. BA is no longer a full service airline.
The profits may be up, for now, but how long will passengers put up with all the reductions in service but not in fare?
BA / IAG need a change of management and change of direction. Quality counts and is appreciated and remembered. Mention BA these days and they're used as an example of failure!
From the Grauniad today.
Willie Walsh
How bad a week was it? “It wasn’t a week,” says Walsh swiftly. “The thing you’ve got to recognise is that BA was back to normal pretty quickly. The critical thing was to get moving again and get customers looked after. I think some of the criticism has been unfair – but it’s easy for me to say that.”