More importantly
Is it just me, or does that picture look like David Cameron is about to be eaten by a dinosaur? Have we accidentally uncovered Angela Merkel's real identity?
One of the key principles of designing any high-availability system is to make sure that only vital apps or functions use it and everything else doesn't – sometimes referred to as KISS (Keep It Simple, Stupid). High availability or reliability is always technically challenging at whatever systems level it is achieved, be it hardware …
@boltar: "Thumbs down from me only because I'm bored sick of reading political posts in almost every comment section now."
You usually see these kind of posts when they're attempting to derail the subject and distract from the contents of the main article. That being, the crap IT system at BA and who is responsible for it.
In my experience, it has more to do with ignorance. I am always amazed at how poor most people are at judging risk. In particular, people tend to overestimate and agonize over vanishingly small risks and underestimate the mundane, everyday risks they face. Examples of each are being killed by a terrorist in the US versus being killed in a car accident. That can make for bad policy and decisions, in politics and in business.
From the article "Indeed, it was far from clear that even senior NASA management were actually capable of understanding the warnings their engineers were raising – often having neither an engineering or a scientific background."
This - in every company I've worked for. Even in the ones where they had some engineering experience it was so out of date as to be useless, or they actually only had a talent for climbing greasy poles. The best boss I ever had was an utter charlatan, but he had the sense to leave the engineering to those who knew about it.
I've used this before.
The greatest test of an engineer is not his technical ingenuity but his ability to persuade those in power who do not want to be persuaded and convince those for whom the evidence of their own eyes is anything but convincing.
Extract from "Plain Words" in The Engineer 2nd October 1959
>The greatest test of an engineer is not his technical ingenuity but his ability to persuade those in power who do not want to be persuaded and convince those for whom the evidence of their own eyes is anything but convincing.
Wise words for 1959, but in this post-fact world the engineer needs more in the way of seminary skills than logic or debate.
I am always amazed at how poor most people are at judging risk. In particular, people tend to overestimate and agonize over vanishingly small risks and underestimate the mundane, everyday risks they face.
Well spotted. Here's a bit more on that.
https://en.wikipedia.org/wiki/Risk_perception
In the words of Mr Monroe, "Six hours of fascinated clicking later..."
This sounds scarily like a standard blueprint for a service-oriented architecture gone horribly wrong. In financial markets it was common for "tib-storms" to crash a broadcast network with re-requests to sync topics, but capacity, tiering and investment addressed it.
My money is on a panicked recovery - like the RBS CA7 debacle.
The problem for many businesses is that their competition are cutting corners and cutting out as much spend as possible too. Customers then reward that behaviour - there's no point having the most reliable IT stack if you have no customers left to fund it all. You end up with a capture effect where the luckiest cheapskate has all the customers until their luck runs out, then people build resilient systems from scratch all over again, which immediately start getting cost reduced.
No, not ignorance and greed.
Try microservices in addition to having legacy systems in place, where it is cheaper to add another microservice into the chain than it is to rewrite the original service with the additional feature and test it.
The one advantage is that if you have only a certain class of travelers who have an additional process to check some sort of security... you don't have to run everyone through that process.
Note, I'm not suggesting that this is the case, or that this model is the best fit for BA, but it could be viable and it is what happens when you consider stream processing.
The issue is that at some point you run into a problem when the chain gets too long and it breaks in places and you don't know how to move forward or handle the errors.
We provide cloud services and connectivity is the key factor in the actual uptime of our services to our end users. IT managers regularly make the wrong decision based on perceived priorities. If something is working, then tasks related to that will go down the list and even are forgotten.
For example, keeping some services running over ADSL when you have a new leased line available and not prioritising the work to switch because everything is working. Statistics tell us leased lines have better availability and quality of service but the customer often only reacts when a failure happens.
I know that they said outsourcing is not to blame but getting rid of people that know how things work or are held together is a dangerous risk. How many companies know nothing about what all the boxes do, let alone how they are all dependent on each other?
getting rid of people that know how things work or are held together is a dangerous risk
This is an inevitable consequence of the regular "efficiency" pogroms that most companies undertake against their own support services (and that applies to functions like finance and procurement too). There is a vast amount of tacit knowledge in employees' heads that is never written down, and which the business places no value on until things go wrong. By then it is too late, because these pogroms are always selective - the people seen as whingers, the challenging, the "difficult", those simply so clever or well informed that they are a threat to management, all are first on the list to go. And unfortunately those are often the people who know how much string and sellotape holds everything together.
I work for a large company that has a home brew CRM of great complexity. It works pretty well, costs next to nothing in licence fees (cf SAP or Oracle), and we have absolute control - we're not beholden to a tech company who can force upgrade sales by "ending support". Over recent years we've outsourced many levels of IT to HPE, and each time something new goes over the fence, HPE waste no time in getting rid of the expensive talent that has been TUPE'd across. We did even have a CIO who understood the system - but he's been pushed out and replaced by a corporate SAP-head. You can guess what's going on now - the company is sleepwalking into replacing a low risk, stable CRM with a very high risk, high cost SAP implementation, and at the end of it will have a similarly complex CRM, except that it will cost us far more in annual licence fees, we'll have no control of the technology, and the costs of the changeover alone will total around £400m, judging by the serial screwups by all of our competitors.
"and which the business places no value on until things go wrong"
The only way to find out whether everyone still employed knows how to rebuild the system, is to provide them with an opportunity to do it. (It needn't be the actual system. You can let them assemble a clone.) Of course, that's expensive, but that is the cost of finding out whether the proposed efficiency drive is safe. My guess is that if that cost was included in the business case for the efficiency drive, the case would disappear.
Taking the argument a step further, it is easily seen that it isn't safe to let any of your staff go until you have reached the point where the system can be rebuilt by script. That's going to be an unpopular conclusion within management circles, but its unpopularity doesn't mean it is wrong.
"it is easily seen that it isn't safe to let any of your staff go until you have reached the point where the system can be rebuilt by script."
And even then, when the staff are let go you may find nobody knows what the script actually does and you will even more likely find that nobody knows why it does it.
Not only do you need to retain knowledgeable staff, you need to have succession planning in place.
I know that they said outsourcing is not to blame but getting rid of people that know how things work or are held together is a dangerous risk. How many companies know nothing about what all the boxes do, let alone how they are all dependent on each other?
Another factor that I see happening is feature sprawl, add-ons often being introduced as 'nice to have', with a low priority to fix if broken. Problem is, even if those features keep being handled at low prio[0], each of those features adds to the knowledge the first and second line support have to have at the ready, as well as simply adding to the workload as such. Having to not just physically but also mentally switch from one environment to another if a more urgent problem comes in and you have to suspend or hand off the first problem because you're the one who best understands the second one is another matter.
[0] and often they don't, because the additional info they provide allows for instance faster handling of processes, smoother workflow, better overview, etcetera, and after a while people balk at having to do without them. So even when they're still officially low prio, call handling often bumps them to medium or even high because "people can't work". Oh yes they can; how about remembering the workflow that doesn't rely on those add-ons? The workflow they were trained in?
Worked for a large firm and we switched from one provider to another for some fairly mission critical stuff. The previous system required essentially one program to be running and that was it. The new system required several add on programs (some TSR) to be running in addition to the main program on a users PC. The first time we noticed this wasn't long after deployment when someone couldn't start the main software on their machine. We tried various things with the vendor on the phone before they suggested that one of the other little progs might not be running or have stopped.
Spoke to someone else who used their software and he said that they didn't really do traditional updates to their software. If some functionality was additionally needed they'd just write another small add on program to provide this. Then after a few years release a completely new product complete with new name plus the bells and whistles added to the old one and the cycle restarts. Didn't exactly fill me with confidence.
TSR? That brings back memories. You could also understand this as a reaction to "creeping featurism" on the part of the client company.
In DR-DOS I used to use TSRs to achieve needed functionality on a work PC. Difference is, in that world, nobody ever produced a single program to provide the same functionality.
"The chance that an HTHTP pipe will burst is 10^-7." You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate. It was clear that the numbers for each part of the engine were chosen so that when you add everything together you get 1 in 100,000.
From "What Do You Care What Other People Think", Richard Feynman.
@TDog
"You can't estimate things like that; a probability of 1 in 10,000,000 is almost impossible to estimate."
It's also wildly meaningless. 1 in 10m whats? Messages through the system? Milliseconds? Times the life of the universe? (Remember six sigma events happened daily during the biggest move days of the financial crash... either the universe is impossibly old and all those events happened in a row, or these sorts of 1-in-x statistics are complete bunkum).
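For anyone who wants a number to go with that: under the usual (and, as 2008 showed, naive) normal-distribution assumption, a single six-sigma daily move should be vanishingly rare. A rough back-of-the-envelope sketch:

```python
import math

# Probability that a single daily move exceeds +6 standard deviations,
# assuming a normal distribution (the assumption that broke down in 2008).
p = 0.5 * math.erfc(6 / math.sqrt(2))

days = 1 / p                      # expected waiting time in trading days
print(f"P(move > 6 sigma) = {p:.2e}")          # ~9.9e-10
print(f"expected once every {days:.2e} trading days "
      f"(~{days / 252:.1e} years at 252 trading days/year)")
```

Roughly once every four million years per instrument, under that assumption - which is exactly why seeing several in one week tells you the model, not the universe, is wrong.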
Indeed.
But I could never get the image of him as a New York taxi driver chewing a cigar out of my head.
"How 'bout that Quantum Chrono Dynamics, huh? Virtual particles mediating force transfer in a vacuum. Tricky stuff. You in town on business?"
Joking aside the world is poorer, not just for his intellect and vision but also for his ability to explain complex ideas. His rubber band in a cup of ice water (modelling the root cause of the Challenger crash) was a classic. Simple enough for even the "I don't understand science" crowd to grasp.
I think he was born at the southern end of NY, but with that accent he should be from Noo Joizy.
I always fondly imagine him wearing a zoot suit and spats, carrying a violin case.
One unarguably great thing Bill Gates did was to buy the rights to the lecture series so we can all watch them for free.
" [feynman's] writing is hugely entertaining as well as educational."
Closer to home in the UK, there's a senior judge called Charles Haddon-Cave. He's a lawyer, not a scientist or engineer, but if you need an inquiry done properly, he seems like a good man to have on your side. His writing is also educational, and entertaining in a way.
See e.g. his talk(s) on "Leadership & Culture, Principles & Professionalism, Simplicity & Safety – Lessons from the Nimrod Review". From the abstract: "
RAF Nimrod XV230 suffered a catastrophic mid-air fire whilst on a routine mission over Helmand Province in Afghanistan on 2nd September 2006. This led to the total loss of the aircraft and the death of all 14 service personnel on board. It was the biggest single loss of life of British service personnel in one incident since the Falklands War. The cause was not enemy fire, but leaking fuel being ignited by an exposed hot cross-feed pipe. It was a pure technical failure. It was an accident waiting to happen.
The deeper causes were organizational and managerial. This presentation addresses:
(1) A failure of Leadership, Culture and Priorities
(2) The four States of Man (Risk Ignorant, Cavalier, Averse and Sensible)
(3) Inconvenient Truths
(4) The importance of simplicity
(5) Seven Steps to the loss of Nimrod (over 30 years)
(6) Seven Themes of Nimrod
(7) Ten Commandments of Nimrod
(8) The four LIPS Principles (Leadership, Independence, People and Simplicity)
(9) The four classic cultures (Flexible, Just, Learning and Reporting Cultures)
(10) The vital fifth culture (A Questioning Culture) "
See especially point 10: A Questioning Culture.
In various places, just search for it (I have to be elsewhere ASAP).
As well as the Nimrod enquiry, from memory he also did the inquiries into the Piper Alpha oil rig disaster and the Herald of Free Enterprise ferry disaster.
Another tangent: accident investigation reports can be very thought provoking, as well as interesting in their own right. Chernobyl, both Shuttle accidents, the Deepwater Horizon / Macondo 252, Piper Alpha, and all sorts of air accident investigation reports -- all have lessons, and describe similar patterns of organisational and system design or operation failures or accidents waiting to happen to those in many fellow commentards' workplaces. Recognising them doesn't necessarily help you stop them happening, because the root causes are often many pay grades above one's own, but it does make saying "I told you so" more fun.
"Recognising them doesn't necessarily help you stop them happening, because the root causes are often many pay grades above one's own."
True.
"it does make saying "I told you so" more fun,."
Please don't take this the wrong way, but how much fun is there when being ignored by management leads to e.g. a fatal incident which could easily have been avoided?
That'll do nicely, thanks. Charles Haddon-Cave's Piper Alpha 25 presentation session is a good place to start. It's nearly an hour long, but can mostly be treated as radio.
There is an almost identical script (or maybe transcript) at
https://www.judiciary.gov.uk/wp-content/uploads/JCO/Documents/Speeches/ch-c-speech-piper25-190613.pdf
Voyna i Mor - "(I would prefer that Daesh supporters continued to believe in miracles rather than science, thanks.)"
Really? If they believed in science, surely they'd stop supporting Daesh?
What is the scientific likelihood of enjoying 72 virgins (or white raisins) after death?
Everyone should read the Rogers Commission appendix by Richard Feynman at the very least:
"For a successful technology, reality must take precedence over public relations, for nature cannot be fooled."
https://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt
""For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.""
I can't recall if it was him or AC Clarke who commented "Against the laws of Physics there are no appeals."
The Universe does not care how rich, famous or powerful you are. If a meteorite comes through your roof, all that matters is whether you are in its path or not (yes, people really have died this way).
When working on fibre optics we used to work to an error rate of less than 1 bit in 10**14 bits. It's actually not that hard to work out at the theory level whether you are above or below that level. Sitting in the lab for however long is required to check that, on average, less than 1 bit every 3 days is wrong at 400Mb is another matter altogether.
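For the curious, the arithmetic behind that "one error every three days" figure is simple enough (a quick sketch, assuming a 400 Mb/s line rate):

```python
# How long you must watch a 400 Mb/s link to push 10**14 bits through it -
# i.e. the time to expect (on average) a single error at a BER of 1e-14.
bits_target = 1e14        # spec: fewer than 1 errored bit per 10**14 bits
rate_bps = 400e6          # 400 Mb/s

seconds = bits_target / rate_bps
print(f"{seconds:.0f} s  =  {seconds / 86400:.2f} days")   # 250000 s ~= 2.9 days
```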
This sounds like a situation where each worker aggressively defends his or her patch. "No, you can't possibly merge my legacy paper reporting system with Bob's new email reporting system, because [insert ridiculous reason here]." Given the chance, most of us will defend the systems we maintain (and by extension our jobs): it's human nature. A manager's job is to challenge the ridiculous reasons given.
BA's management are squarely to blame here.
VMs and containers might be making it even more complex - now everyone will want his or her system to work separately from everything else, yet they still need to share something (and in the worst situation you have a lot of duplication to keep in sync) and communicate with each other.
While there are often good reasons to have multiple separate systems, there may also sometimes be good reasons to consolidate them (to simplify the architecture) and then make them redundant and fail-safe (and it will be easier to achieve that when the architecture is simpler).
And why do you think it will be any different if every single one of them is perceived as cost to be shoveled off to TaTa?
I didn't say anything about outsourcing. Outsourcing doesn't solve the problem at all: it merely shifts the problem to another company, and conceals the complexity from the end client.
Rather, it's an internal problem of employees being allowed to take "possession" over their little piece of the system (or in BA's case, their 1 system out of the 200). It then becomes hard to move or replace that person, and they become very resistant to change. I've seen this happening in a lot of places, especially large government or quasi-government organisations. The way to avoid it is for management to rotate employees around different systems so that everyone knows a bit about how three or four systems work, rather than just knowing a single system in-depth. This also helps you recover if/when the critical employee leaves.
I don't have any specific knowledge of the BA situation; but 200 critical systems in an organisation with strong unions (making it hard to fire intransigent workers) suggests something like this may have happened.
"The way to avoid it is for management to rotate employees around different systems"
Ouch! This is how the Civil Service produces senior officials who can avoid responsibility for anything. Something goes wrong on A's watch and he immediately blames predecessor B who in turn blames predecessor C who immediately blames A and/or B.
> Ouch! This is how the Civil Service ...
Yes, fair point. But with developers, you only get rotated around 3-4 systems, so you eventually come back to code you previously worked on. The Civil Service path is one-way, so you never have a chance to apply lessons learned elsewhere to your previous mistakes.
Given the chance, most of us will defend the systems we maintain (and by extension our jobs): it's human nature.
I had a tour around the Mercedes engine factory in Stuttgart a few years ago. Our guide proudly told us that they'd never made anyone redundant since the 50s and it meant that employees were encouraged to suggest ideas that would mean they'd need fewer people without worrying about their jobs. If more companies took their attitude then who knows how more efficient they'd be.
IMHO, internally even religious organizations wish to be able to exploit their gullible devotees in this world, and to explore its sins fully, for a long, long time yet. It's no surprise that many prophets of doom asking people to repent and renounce worldly riches and pleasures were sent to the stake....
Once worked for a company which ran every service or app on a different server, and each of those servers was duplicated off site. Even SQL databases: one database per server (and then duplicated).
Rack upon rack of Compaq DL3x0 servers. Now I didn't work for the IT dept (I was pimped out to paying customers) but when you find out the IT manager's nickname was Kermit, it didn't need explaining what a bunch of muppets the IT dept was. It was as though no one knew servers could multi-task!!
Always seemed to me the vast majority of suppliers would only guarantee support for their systems, especially Windows systems, if they were the only one on the box. As boxes are cheap and support is expensive, one box per system, no matter how stupid, was the sensible option. If you have multiple software vendors on the same box the first response to any problems is almost guaranteed to be finger pointing.
And don't give me 'you should migrate to Linux'. Firstly Linux suppliers weren't much better, and secondly if the Windows system is the one that meets the users' perceived needs best then you're stuck with it, unless you run a remarkably dysfunctional, non-customer-focused shop.
Pretty much this.
"You need to be the only on the box? OK!" *builds VM for app* "Here you go!"
89.9% of the time, none's the wiser, and the other 10% of the time, the vendor is basically "Oh, it's a VM. We support that too!" (It's not *quite* 100%, because there's always that ONE VENDOR who INSISTS they be the only tenant on that host/set of hosts because their code sucks that badly and they tend to be of the 'throw more hardware resources at the problem' types.)
Running different workloads competing for resources on a single system may not be an easy task, especially on OSes that don't let you partition and assign resources easily - i.e. Windows; something can be done using Job Objects, but that requires application support. Otherwise a "rogue" application can exhaust all physical memory and/or CPU, and starve other applications.
But developers capable of writing software that "plays nicely with others" are rare, because it's far simpler and quicker to grab as many resources as you can find, instead of writing clever, optimized code. And even when the options exist, they are often overlooked and the default installation is designed for a standalone application, because other types of tuning need to be done on an ad-hoc basis.
Thus the simplest way became VMs, especially when the hypervisor can control the resources assigned to each VM.
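On the Unix side at least, the "stop one rogue process starving the whole box" idea can be sketched with plain POSIX resource limits. This is only an illustration of the principle – not the Windows Job Objects mentioned above, and a real deployment would more likely use cgroups, container limits or hypervisor shares:

```python
import resource
import subprocess
import sys

# Cap a child process's address space so a leaky or greedy app cannot
# exhaust the whole machine (POSIX only; a blunt stand-in for proper
# hypervisor/cgroup resource controls).
MEM_LIMIT = 512 * 1024 * 1024   # 512 MiB

def limit_memory():
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))

# Hypothetical workload command - substitute the real application here.
workload = [sys.executable, "-c", "print('workload running')"]
subprocess.run(workload, preexec_fn=limit_memory, check=True)
```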
I'd rather deal with a company like that than some bunch of smart arses who believe their own bullshit that they can make a few boxes do everything.
I wonder what the chances are of a setup like that, with similar-style power supplies, ending up doing a BA?
There is such a thing as the law of diminishing returns. If you run really vital services, doing it the simple but pricey way makes sense if you cannot afford or survive a single total failure, ever.
Did you leave or did they kick you out?
We do the same but with VMs on a clustered host. Management is simple with tools, and the only real extra cost is storage, as VM licensing is cheap enough. I don't duplicate offsite though, as one of the clusters is in a different part of the site.
Backups are different of course, they go to another part of the site.
I'd love to hear examples of really large companies that rip out their IT and start again to get genuine resilience back after x years of smooth operating.
I'd be willing to bet the only time this gets close to happening is when a government or parent company forces them to do so.
Thoughts?
I can think of one project during my career which was a complete rip-out-and-replace exercise...
A company I contracted to had a large system which they'd built up from scratch. They got bought out by a much larger company, who had their own corporate standard system for such things, and within a short time a decree was issued that company #1's system should be replaced.
We then had a very long and drawn out project (actually it was more like a programme of projects) to do a migration.
To be fair, the outcome was what had been requested, but I don't think that the amount spent on the migration could be paid back in terms of tangible benefits (at least not for the company involved - personally I scored quite a bit of overtime and an end-of-project bonus which sorted me out with a new fitted kitchen)
I have worked on one Data centre transformation program that had exactly that goal. Complete migration of all services to a new hosting provider and rearchitecting all services to give them the right level of resilience/recoverability.
I have worked on another similar program that didn't have quite the scope to rearchitect but we rebuilt to resilience/recovery patterns and rewrote and tested the IT Service Continuity plans.
Currently working on a smaller scale program of work to identify resilience gaps and close them. My current client would have a few critical services out of action for at least a couple of weeks if there was a proper 'doom' scenario, and it's debatable whether some of the other services that look like they should be recoverable actually are, as no-one has ever tested them. We are going to both remediate and test.
Apparently the former IT Director was a penny pincher.
How about an example of a large 24x7 company that has a resilient system with DR sites and multiple systems that have grown over many years, overseen by different people, where the senior tech or CTO has the balls to test it by walking up to the master power switch in Datacentre 1 to show it all switching over seamlessly while the system is live?
The risks could be catastrophic and if the systems are running just fine who would risk it? I certainly wouldn't.
Maybe that is what a CTO in BA was trying to prove - the resilience of the datacentre even on a bank holiday weekend!
I'm afraid that this is precisely what a CTO has to be brave enough to do.
At least if you throw the switch yourself you can choose the moment. You can tell all your front-line staff beforehand, so that they can tell the customers what is going on. You can have a manual back-up plan resourced and in place, so that your front-line staff can actually respond. Best of all, you can not do it on a Bank Holiday weekend.
If you don't flick the switch yourself, it is only a matter of time before Fate throws the switch for you. Then you lose at least two of the aforementioned benefits and probably all three because many failure modes are more likely at busy times.
You really don't want to do an unplanned failure unless you really have to. Not without a massive amount of planning anyway. The risks to your business are enormous, especially if the primary site comes back online in an unplanned way (that can be worse than the initial failure, and I suspect is what fucked BA up).
What you do need to do is plan assiduously the following:
- what actions to take to provide service continuity if one of your sites fails (this might be nothing - your architecture might be resilient to site failure)
- how to prove that you can do that in a planned way
- what to do if it's unplanned (it's different)
- how to bring the failed site back online safely
If you are ever going to try and prove things work as designed and tested in an unplanned scenario, you need to do it on a production-like test environment, and that gets very expensive. Lots of people do it - finance institutions mainly, though I have done it at a big energy firm as well.
I know of two. One was when I was a lowly apprentice at British Leyland: DAF mandated a new IT system. It was a monster with a robotic tape library (8 drives, lord knows what capacity). This replaced some weird bespoke-ish system.
The second was after I finished Uni and BAe were shutting down Strand Road in Preston. They amalgamated and partially replaced Warton systems to accommodate. That was a year's hands-on work after my HND. They paid for my final year BSc too, which was nice.
"Not without a massive amount of planning anyway."
You should have the massive amount of planning in place anyway. If you don't test it yourself on your own terms Murphy will do it for you and not at a time of your own choosing.
My DR test failover, at an operator of systems hosting financial trading:
The DR / failover plan existed on paper but hadn't been tested for years, since when enormous changes had been made to the code, systems and environments. Eventually management let ops spend a weekend testing it out. On paper, and in regulatory filings, it took 30 mins. The first time it was tried took 14 hours. After three months of working on all the issues that came to light, tried again: 2 hours this time. Another iteration of fixing and testing. Third time: 27 minutes. They now test it every quarter. They were in the fortunate position of having Friday night and most of the weekend to make changes with zero customer impact, but everything had to be fully operational by Sunday evening, ready for the start of trading; doing that if you're a bank, or an airline, or any other 24/7 operation must be enormously difficult, and of course the longer it's left untested, the harder and more dangerous it gets to test.
I had a client - small business, maybe a dozen employees - who did this in the run-up to Y2K.
His servers were Xenix with a fairly old version of Informix and custom applications. He did a rip and replace with SCO and a packaged system allegedly Informix compatible; he wanted various custom tweaks adding and there were more of these over the years. Also over the years I gradually discovered various "interesting" aspects to the alleged Informix compatibility that ended up with me directly amending the data in sysindexes so they reflected the actual indexes.
When he retired he sold the business to a group who presumably ripped and replaced with whatever they ran on as a group; certainly I never heard from them.
"I'd love to hear examples of really large companies that wrip-out their IT and start again to get genuine resilience back after x years of smooth operating."
It does happen. I know one company that did exactly that, building two new parallel DCs to replace the ageing tat with new, shiny, reliable kit. The problem was that retirement of the old DCs became a tangled and difficult process that took over five years to complete, leading to a doubling of costs for those five years. Even then it wasn't perfect. Decommissioning the last DC resulted in a massive outage because someone had forgotten something important.
I have to disagree with what you have just said. While 200 systems might seem a lot, for a big critical application it might be "just right". Example:
- load balancing web front end - 20 to 40 systems, all "critical" (if one goes down, the service stays up - but the down system is still critical and the situation has to be solved fast). From your BA example, 3 datacentres = they are split into 2 or 3 differently placed groups
- database high availability: at least 2 database servers if we speak about small numbers. We don't, so we speak about much more: a lot of data, a lot of connections/queries. 20? 40? All critical. One goes down, there is no problem (theoretically).
- batches, middleware and application servers: we add some more critical servers again
- virtualization high-availability: again we speak about big numbers, it is not a good idea to stuff too many critical machines on the same critical physical server
- storage: again, big numbers, all critical
And it is not over yet. Depending on the quantity of data to be treated, 200 critical devices might be just right. They are all critical; on the other hand, if some of them go down, the others should have absolutely no problem holding the load. At least in theory.
Agree - 200 doesn't sound unusual for an organisation as IT dependent as BA, bearing in mind their business went IT dependent very very early.
Whether it's good or not, I'd hazard a guess that it's NOT an outlier compared to other organisations its size.
Rather suspect the author hasn't worked anywhere as big and hasn't had to contend with the sheer inertia that creates.
Also rather suspect this was 1 rogue application (ESB or similar) spamming corruption further out.
Double suspect that there are a few architects & CIOs waiting for the full BA post mortem.
Having "200 systems in your critical path" is indeed a tad worrying. Being a large, distributed multi-national enterprise like BA etc. isn't really an excuse, either. We can drone on and on about "corporate inertia" and say that having a system built "layer upon layer upon layer" is to blame, making it just too large and unwieldy to be manageable or indeed fit for purpose.
Well, yes - I would have to agree.
I think the challenge is how to manage enterprise-level systems getting built layer upon layer and becoming fossilised through corporate inertia, making it hard to (a) evaluate their ongoing operational function in an objective way and (b) engineer an improved system that perhaps replaces entire chunks/layers with more modern, more performant solutions. This is about mapping where the problems are, finding out what the critical chunks are that *must* be improved and then building a simpler more maintainable system to perform the task in hand. In short, building a live, functioning system that is under continuous evolution. This goes way beyond continuous delivery.
"This is about mapping where the problems are, finding out what the critical chunks are that *must* be improved and then building a simpler more maintainable system to perform the task in hand. In short, building a live, functioning system that is under continuous evolution."
This. It's also easier to do as you go along. A good maxim would be to aim for a situation in which the result of each added development is that the system looks as if it were designed that way from the start.
Depends what they mean by "system". I read that as 200 applications, rather than interpreting as system = device/server.
In your interpretation, I'd completely agree, 200 devices/resources is not at all OTT.
But 200 applications, each with their own reliance on supporting systems and infrastructure, each application having their own data flows... I wouldn't fancy detailing that data model (unless it was for a fee, mind).
A few years ago I worked for a far more modern and competent airline than BA and I can assure you that 200 interdependent systems is not unusual.
Running an airline is a very complex business and there are a lot of factors that need to be accounted for. It's not just the airline itself either; it has to communicate with the systems in the airport too, and this is true for all of the outstations too.
"Depending on the quantity of data to be treated, 200 critical devices might be just right. They are all critical, on the other hand if some of them are going down, the others should have absolutely no problem to hold the load. At least in theory."
But that's the whole point: not that they are critical, but they are critical path. Which means that if any one of them goes down, the entire system goes down. These aren't redundant devices -- they're each one necessary for the whole to work.
I never took a statistics & probability class, but if I'm not mistaken, the chances of failure in this scenario increase exponentially with each device added. (If I am mistaken, I'm sure somebody will cheerfully point it out. I just want it to be the guy who actually took S&P classes and not the guy who thinks the probability of getting 10 heads in a row is 1 in 10.)
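For anyone who wants to check that intuition: with every system in series, the individual availabilities multiply, so the chain's availability decays geometrically with each extra box. A rough sketch:

```python
# Availability of a chain of N systems that must ALL be up for the whole
# thing to work (series, no redundancy) - purely illustrative numbers.
n = 200

for per_system in (0.9999, 0.99999):        # four nines, five nines each
    chain = per_system ** n
    downtime_h = (1 - chain) * 365 * 24
    print(f"per-system {per_system}: chain availability {chain:.4%}, "
          f"~{downtime_h:.0f} hours/year expected downtime")
```

So even 200 genuinely five-nines systems in a strict critical path would, on paper, give you the best part of a day's outage a year; at four nines each it's over a week.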
"But that's the whole point: not that they are critical, but they are critical path."
Nobody said that the 200 were critical path, that's an entire fabrication that the author made up in order to justify this article.
It wouldn't surprise me if there were though; the login app, the pilot hours app, the crew roster app, the passenger seat allocation app, the cargo load app, the weight and balance app, the weather app to get wind direction for the fuel prediction app which in turn drives the fuel load app, the emergency divert planning app, the terror checklist app, the crew hotel booking app...
Anyone who claims they can deliver five nines availability, even for discrete components let alone a complex web of hardware and software, is talking out of their arse. Five nines means you can have a maximum 0.864 second outage in any given 24 hour period. Of course you can start saying that the up time calculation should be done over a week, month or year but where do you stop - a decade? Up time stats only have real meaning over short periods.
So, hands up, who for any amount of money is going to guarantee less than 0.864 seconds of downtime over DC, comms, hardware, and 200 interdependent applications? And how do you even define what counts as "up"?
It's basically all finger in the air stuff.
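For reference, the downtime budgets behind the various "nines" are easy to tabulate (a quick sketch – and, as noted above, the definition of "up" is doing all the heavy lifting):

```python
# Maximum downtime permitted by an availability target, per day and per year.
for nines in range(3, 6):                  # three, four and five nines
    availability = 1 - 10 ** -nines
    per_day = 86400 * (1 - availability)        # seconds per day
    per_year = per_day * 365 / 60               # minutes per year
    print(f"{availability:.5f}:  {per_day:7.3f} s/day   {per_year:8.2f} min/year")
```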
"Anyone who claims they can deliver five nines availability, even for discrete components let alone a complex web of hardware and software, is talking out of their arse."
You may want to speak to people in Telecoms... our Meridian was up for well over a decade, even when a lightning strike melted 3 of the cards in it. Hot swapped out, job's a good 'un.
Or how about our ISDN to SIP converters: some of these have been up for over 4 years and never missed a beat.
I would think there are mainframes out there doing the same...
They have one job to do, and they do it well.
Not everyone relies on x86.
5*9's reliability isn't a measure of how long your systems could be down for in any given time period; it's a level of confidence that you are prepared to back with financial compensation should you not meet that target.
It has nothing to do with *actual* downtime that could happen.
Unless telecoms are an end in themselves to your business, their reliability is just one part of the overall reliability of the stack. When your ultrareliable comms is part of a flow that includes firewalls, load balancers, message brokers, buses, abstraction layers, transformation layers, warehouses, lakes, and all the rest of the shite that's layered on to what at the outset is a simple transaction, five 9s begins to look like a very high mountain.
No.
You are confusing telecoms meaning a bit of wire with a phone at each end.
What telecom systems are these days are large-scale, distributed computers that just happen to manage telecoms services, which may be circuit based (phone line) or packet based (4G). And they need to perform complex, realtime billing too.
Again, have a google around Erlang/OTP.
Anonymous coward obviously.
Many years ago when I was a young Technical Officer in BT, we had a situation where we lost the whole System X switch where Mr Disgusted lives.
There was a power failure and the batteries kicked in followed by the generator to keep the batteries charged. All went well and phones kept on working.
Trouble is no one was apparently trained to turn off the generators and it was done incorrectly. On the next power failure the whole town went quiet after the generator failed to start.
In my time on the PSTN, I witnessed numerous major service failures and the root cause was always human. The system was designed to be brilliantly resilient, but as soon as you let humans touch things then Murphy's law takes over. That is why we joked that the modern switches just needed one man and a dog to run them. The man to feed the dog and the dog to stop the man touching the equipment.
Fortunately I was never the PTB (person to blame).
Coming up with a 5 Nines design isn't that difficult.
The problem is that once these things are installed and running, someone usually buggers about with them. My good lady used to be the liaison engineer between a telecoms company and an IT supplier. The telecoms company had loads of HA applications running across a large number of clusters. When she took over the account she got worried about some of their admin practices and persuaded the company they really needed to have an audit done to see what their chances of surviving a failure were. To start with the company wasn't convinced: all the clusters and all the applications were tested and signed off when they went live*. Anyway, they agreed when she said the IT supplier would pay for the audit. The result? There wasn't a single application left which would switch over automatically in the event of a failure. Every one of them had been buggered up by people making quick online changes without understanding the HA implications of what they were doing. HA is not something you can buy off the shelf. It's mostly a mindset.
(*) Most HA projects I've been involved with haven't been tested properly.
Most projects over run.
The testing comes at the end
So when things are late the pressure is put on the time for testing. Management would rather see the project brought in on time than having everything tested properly. So once the application seems to work, they want to go live NOW. The managers then hope they'll get promoted (as a reward for being on time) out of the way before anything breaks.
"Erlang has hot code loading. Its very clever."
Clever me hoop. TPF has had hot code loading and fallback for, hmmm, thirty years. Down to the individual component of a program level (function as it once was, alas, individual strand of spaghetti might be more appropriate).
So if you want to rewrite everything from the ground up in Erlang and never touch it, and write all the firmware and the smart switches and the NAS and the DR replication software and the firewalls and the front ends and the web servers in it too, that'd be marvellous. Here's my budget for three years. And here are 12,000 servers and a mainframe written in thirty different languages by ten different national development centres. What, more than 3 years and more than the current annual budget?
All big organisations (indeed all organisations) accrete IT systems through natural growth, takeovers, changes of directions, changes in regulations, ego clashes, inertia, lethargy and general risk aversion, so it's probably a bit unfair to single BA out here. You look at the IT system diagram for any large organisation - if such a diagram even exists - and weep.
Even with systems now, the fashion is that writing from scratch is bad, so new developments are a hodgepodge of what looked cool in the open source world, what sort of works now, and developer glue - all held together with the sort of build systems that would once have been deemed systems themselves. I've seen people download huge open source frameworks just to use 5% or so of the functionality; hell, I've seen multiple frameworks in use at once, because no-one took the time or the money to properly integrate.
No traditional company spends enough time, money or skill on IT. That's because the accountants who run these companies don't understand what IT does, why it costs, and why really skilled (which usually means expensive) IT people are a valuable asset. By outsourcing and downskilling, accountants see easy cost savings - and they're probably correct as long as everything always works perfectly and no outside influences or unforeseen events change this. There are no black swans in an accountant's world.
So BA is certainly not alone here - not that this excuses them for an avoidable f**k-up of epic proportions, mind.
I think the analysis of failure is spot on. But we're missing the fact that one system is ‘high availability’, the other ‘safety critical’. Only reputation died at BA.
Ironically, the reasons for the failure are identical. But the bigger fault lies with NASA, by far. From what I remember, they *did* rip out and start again... which worked.... until Columbia.
Complexity is hard, and harder to manage, which is why managers don't do it.
Fundamentally, risk management is composed of two primary elements.
1. Chance of failure.
2. Impact of failure.
For example, a huge asteroid hitting the Earth has a low chance of occurring (relative to a human life-span for example) - but since the impact it would have would be fatal to the entire planetary ecosystem and all the life that needs that ecosystem, the overall risk assessment would be HIGH RISK.
On the other hand, if the chances of my pen failing is quite high it wouldn't be classed as HIGH RISK since the impact would be low* (I can just use another pen).
*Unless the pen in question was being used as a wedge that prevented a switch closing which would detonate the self-destruct device on my spaceship - but that's just bad design and probably a different conversation :)
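A minimal sketch of that likelihood × impact bookkeeping, with the band values and thresholds made up purely for illustration (and following the post's reasoning that a catastrophic impact rates HIGH regardless of likelihood):

```python
# Toy risk assessment: likelihood and impact are bands from 1 (lowest) to 5.
def assess(likelihood: int, impact: int) -> str:
    if impact >= 5:            # catastrophic impact: no likelihood is acceptable
        return "HIGH"
    score = likelihood * impact
    if score >= 12:
        return "HIGH"
    if score >= 6:
        return "MEDIUM"
    return "LOW"

print("asteroid strike (likelihood 1, impact 5):", assess(1, 5))   # HIGH
print("pen failure     (likelihood 5, impact 1):", assess(5, 1))   # LOW
```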
On the other hand, if the chances of my pen failing is quite high it wouldn't be classed as HIGH RISK since the impact would be low* (I can just use another pen).
You just failed risk management 101 :) Your initial risk analysis should not include the mitigating effect of countermeasures.
So if you were an old-fashioned author or a proof-reader and your pen failed, then it would have a high impact on the service that you provide. Fortunately, in this case there is a low-cost and effective countermeasure.
Just look at some websites that have over 2 dozen external 3rd party servers these days.
That's how and it's fucking bollocks. Utter fucking bollocks.
(yes I know it was a rhetorical question and a damn good one that needs to be asked. That is damn serious FAIL architecture.)
"That is damn serious FAIL architecture."
And we all saw the failure when that guy pulled his string-padding routine out of a widely used JS library a few years back and half the web fell over.
And I'm guessing that hardly anyone has changed their web-site since then.
Fortunately ... everyone knows that the web is as flaky as a box of Kellogs and only a complete numbskull would stake their business (or their continued good health) on a web service.
Another opportunity to plug this 30+ year old book about large scale system accidents (nuclear plants, air crashes, Apollo 13, oil tankers, etc) which, although computers aren't the focus of the book, taught me more about stability, reliability and security than many books supposedly on those topics. Pick up a copy from Abe Books or your favourite non-Amazon supplier today, you won't regret it.
https://en.wikipedia.org/wiki/Normal_Accidents
The biggest problem is that management fears testing of failover, because those even higher up the chain will then blame them (middle management) for taking "risks" that weren't needed. I look at the stuff Netflix has done with the Simian Army and wish I could do the same here. But... Netflix has a totally different model and application stack. They scale horizontally and completely. So it's trivial to add/remove nodes. And killing them off is just good testing because it does show you the problems.
But most applications are vertical. It's a DB -> App -> Web server -> users. At each level you try to scale horizontally, but it's hard, and takes testing and a willingness to break things and then fix them. Most management aren't willing to break things because they view that through the short term mindset of it's losing them money or costing them customers. Which are the easy things for them to measure.
But knowing that you have *tested* resiliency, and having a better understanding of bottlenecks - that's good, but damn hard to quantify.
I had an issue where we got a large engineered DB setup sold to us, with integrated storage of two different types. We started using the slower basic stuff for storage and DBs because it was there and we had already paid all this money, so why not use it? And it turned out that as the load crept up, the thing plateaued and then fell off a cliff due to contention. Once we moved stuff off to other storage, things were OK.
The point I'm trying to make here is that until you have an event that is quantifiable or business impacting, the tendency is to wring as much performance or utilization out of a system as possible. Having a spare system sitting around doing nothing costs money. Why not utilize the other 50% of your capacity when you have it available? So what if when one node fails in an HA pair you suddenly have 150% load on the system and it craps the bed? It meant you didn't have to buy another pair of systems to handle the growth! Which costs hard money, and the denial is easy to justify at the time.
Afterwards, you just bet the money will flow to make things better. For a bit. Until the next bean counter comes in and looks for cost savings. But of course it does go too far in the other direction, where you have lots and lots of idle systems sitting around wasting resources for an eventuality that doesn't come often.
But, back to my original argument: testing is hard on management because if it does go wrong, they're up the creek, which they don't like. If it goes well, then they don't care. So if you can, try to set up your systems so that you actually test your failover setup constantly, so you won't be surprised when it does happen.
And let me know how it goes, I want to learn too!
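For what it's worth, the "test your failover constantly" idea can start very small, run against a test environment on a schedule. The node names, health-check URL and stop action below are all hypothetical placeholders – a sketch of the shape of such a drill, not anyone's production tooling:

```python
import random
import urllib.request

# Hypothetical test-environment nodes and health-check endpoint.
NODES = ["app-test-01", "app-test-02", "app-test-03"]
HEALTH_URL = "http://service.test.internal/health"

def stop_node(node: str) -> None:
    # Placeholder: a real drill would stop the VM/container/service here.
    print(f"(pretend) stopping {node}")

def service_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

# Kill one randomly chosen node, then check the service still answers.
victim = random.choice(NODES)
stop_node(victim)
print(f"{victim} down; service still healthy: {service_healthy()}")
```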
I have seen, and suffered, this stupid idea of adding systems to the critical path.
I was the Project Leader of a reporting system (among others, I was working for the outsourcing company).
The client wanted to interconnect two systems on the cheap.. and hey, they had the data there.. so they could just make a few custom reports... in a couple of years, the system was providing critical data for billing two different types of projects plus giving data to the regulator.
Of course, as they had to "save money", a "low value" system like this old reporting one got no money. So little did they get that I had to fix it myself, in production, several times, being the external project leader, and potentially changing billing data and data being sent to the regulator.
Soon after I left, the system broke for more than two months.. years of neglect and technical debt had their toll.
Note: not even almost free upgrades in base SW were done as the client somehow expected us to assure (as in assure with money) that everything would work afterwards.
The PHBs that made all those brilliant decisions got rises, etc.
Not being able to bill some services, well, that is for the poor sod that inherited the job to explain.
This post has been deleted by its author
Nice article... but when your secondary site is still routed through the network devices on your primary site, power is not the issue, nor is it anything other than poor technical planning and cost cutting. Yeah, we will mirror our server/applications environment to a redundant site, but if it is still routed through your primary... when that fails, everything else is, well, rather pointless.
I read that as meaning it was defined from up on high a long time ago that the shuttle needed to be reliable to that level, so they just assumed it was / refused to listen to any contrary information. Because actually reaching that level would require a lot of expensive design reviews and fixes (if it is even possible to do so at our current level of technology).
If NASA required that level of reliability from all manned missions, we would still be waiting to put a man in space.
Microservices are in fashion now so it must be the right thing to do. Coworkers tell me it's how you isolate failures to a single system. I've been assured that it's not just multiplying points of failure and causing exponentially complex fault cycles like it would appear to.
I have a colleague who used to work at BA. From what he told us, they use an Enterprise Service Bus (ESB), which is a Java-based message queue system that can be in a publisher/subscriber or point-to-point configuration depending on the messaging needs.
And from experience, this ESB is sometimes a pain to get started, even after a scheduled power cycle by experienced people. Plus the fact that if you want the fastest possible message response, the messages are stored in memory with no physical backup; you can use a database as the message queue, but that slows things down a lot.
I am guessing that this piece of software failed to restart properly, and since it was configured for maximum speed to handle the load, all the messages from before the crash were lost.
Just a wild guess.
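The in-memory versus persisted trade-off described above is easy to picture. The sketch below is only an illustration of the general idea (a write-ahead journal buys you recovery at the cost of a disk write per message) – it is not BA's ESB or any particular product's API:

```python
import json
from collections import deque
from pathlib import Path
from typing import Optional

class ToyQueue:
    """Tiny illustration of 'fast but volatile' vs 'slower but durable' queues."""

    def __init__(self, journal: Optional[Path] = None):
        self.messages = deque()
        self.journal = journal      # None = in-memory only: fast, lost on a crash

    def publish(self, msg: dict) -> None:
        if self.journal:            # durable mode: journal to disk before queueing
            with self.journal.open("a") as f:
                f.write(json.dumps(msg) + "\n")
        self.messages.append(msg)

    def recover(self) -> int:
        """After a restart, reload whatever made it to the journal."""
        if self.journal is None or not self.journal.exists():
            return 0                # pure in-memory queue: everything is gone
        with self.journal.open() as f:
            self.messages = deque(json.loads(line) for line in f)
        return len(self.messages)

fast = ToyQueue()                               # maximum speed, nothing to recover
safe = ToyQueue(journal=Path("journal.log"))    # slower, survives a restart
```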
My money is still on the ESB availability failing and then trouble afterwards resyncing all the systems. Even in the article it says:
"However, within the comments of the BA chief executive there is one telling statement:
Tens of millions of messages every day that are shared across 200 systems across the BA network
and it actually affected all of those systems across the network."
So, according to the BBC, BA are now saying it wasn't a 'power surge' but an engineer who switched off the UPS.
http://www.bbc.co.uk/news/business-40159202
Doesn't explain of course the lack of failover and so on.
The Space Shuttle, as magnificent as that system was, was a primitive first-generation attempt at a "space truck" to get things into orbit. They were practically rebuilding those things after every flight, and every reasonable person knew that it was risky and complex.
So what REALLY puzzled me about the Challenger disaster was that NASA didn't have PR response to a crash that should have been seen as "inevitable someday". Why hadn't all the astronauts recorded video messages of the "If you're seeing this, then I've died doing what I loved and trying to advance the causes of humanity. But we cannot allow my tragedy to derail the effort" category? It would have seemed to be an OBVIOUS way to mitigate, in some way, the PR disaster that would accompany the loss of a shuttle.
NASA and the news media together conspired to do the unthinkable; they made space travel BORING.
– it was far from clear that even senior NASA management were actually capable of understanding the warnings their engineers were raising –
Five nines, is it? On a good day, you might get that out of a resistor.
You want it actually to run software and process data? This is the system that never fails, is it? Did you say "lowest bidder"? It was a really good bid?
Oh dear.
Once upon a time, I had the pleasure of ensuring compliance with IT audit comments at a medical charity. There were plenty, nothing special to talk about. What is really shocking is how the application works.
As each patient was processed, records were transferred server to server in a daisy chain. I couldn't do anything about the architecture, since there are massive workflows and a replacement is in the pipeline.
So I did the next best thing. I virtualized the 10-year-old servers onto a couple of ESXi hosts on new hardware and told the HoD to get their act into gear.
Apologies that I don't have time to read all comments, as I'm sure others will have made this point, better than I, already ... but it bears emphasis: notwithstanding the reference to "200" systems and the well-made points about criticality, this issue still boils down to a simple and appalling fact—that a power interruption (a power interruption, FFS!!) could be a single point of failure for the top-to-bottom minute-by-minute operations of a global billion-dollar business ... a business which, need I add, functions in one of the most safety-, security- and reliability-conscious regimes that exist on Earth.
I don't care if Mr Conveniently-Junior-Guy pulled and replaced the plug twenty times while widdling into a server cabinet and waving his EMP blaster around. This simply shouldn't be possible. It is a crushing indictment of business continuity and disaster recovery design and engineering.
That several senior executives haven't already been thrown into the ocean from 37,000 feet is unbelievable. Remember where the buck stops, and how they justify being BIG bucks? Whoever allowed this should return to running a kennel.
And one can only wonder what other atrocious penny-pinching corrosion awaits discovery.
ordered by Willie Walsh into computer mess. The Times this morning reminds us that BA has had 4 big outages in the last 10 months. IAG and Walsh have been criticised for not having enough (shewerly "any") IT expertise at board level: only one non-exec has IT experience.
Willie says inquiry will be peer-reviewed and "we will be happy to disclose details".
Who'd like to be the poor EE who kicked that first UPS domino? (Assuming we believe the story.)
I can't give BA too much crap. Amazon screwed up a few months ago by getting themselves in a catch-22 where there were circular dependencies because they hadn't actually tested to make sure they could do a full restart. And Amazon has had the luxury of modern software development tools and unimaginable financial resources.
https://aws.amazon.com/message/41926/
And the legacy code that airlines use to manage their fleets and reservations is horrendous. Think mainframes, COBOL, even *vacuum tubes* -- ugh! Those development teams are practically heroes. I cannot imagine the soul-sucking drudgery of maintaining *60 year old software* and trying to literally keep the planes in the air while they update their software underneath. And I have only seen ONE case of airline companies getting hacked (no credit card info was stolen in that one), whereas companies with much newer code have been hacked and credit card info was stolen (Home Depot, Michaels, Bank of America, Anthem, Premera, eBay).
https://www.linkedin.com/pulse/legacy-code-can-cost-you-billions-just-ask-airline-greg-leffler
"
SABRE is an especially interesting case in point, as SABRE (and its spinoffs for other airlines called PARS) has been around since 1960. Air travel has changed just the tiniest of bits in the interceding 57 years, but these systems haven’t changed as much. At their core, these systems are still based on these legacy operating models, data structures, and interfaces. To operate and maintain these services, expertise in the legacy backends (and legacy programming languages — COBOL, anyone?) simply must be found and retained by the airlines. The PARS installation used at Delta (called Deltamatic) runs on an IBM 7074, a system that is even still in use today by parts of the federal government.
"
bletch.
I had some fun at the Infosec exhibition at Olympia this week by going round the stalls, picking out those pushing their pet solutions for "Total Security" and/or "Incident Response" and grilling them about how their pet systems would have protected the system in a BA-type scenario (power outage causing failure of a single server, failed backup, and legacy systems of all ages dating back to the Wright brothers), had such a system been installed.
Not one vendor produced even a plausible reply.
The number one priority for BA is to make sure that their planes keep flying passengers from point to point. If they can't do that, nobody is going to be interested in buying tickets or planning a holiday on their "award winning" web site. With that priority in mind, it makes sense to maintain a fully redundant in-house system that isn't being asked to do anything else. If there is an IT staff that is spending a portion of their workday improving their table tennis, that would show that they are doing a good job, not that the company should make them redundant and outsource.
I have run across this type of scenario time after time. If there is something that can bring the company to its knees through failure or become a safety problem, it must be handled in house, be backed up and be staffed with good people. It cost BA more than twice the ticket prices and lost them the confidence of a good many people who were stranded somewhere or failed to get to a wedding/funeral/important function or a long-awaited holiday. They're not going to be booking with BA again unless they have no other choice.
Sometimes one has to spend the money. I work a lot as a contract engineer from home. My CAD/CAM PC only has the engineering software I use on a regular basis installed and does not connect to the internet. I use my Mac for internet and business tasks and usually handle code on a VM from my Mac. By being properly paranoid and suspicious of email attachments, even those purportedly from my mother, I haven't caught any nasties. I could do all my tasks from one Windows PC, but the cost of mitigating the risks of letting a Windows PC loose on the internet is cheap in comparison to losing work in progress and missing critical deadlines.