"about 0.0035% downtime for the last 17 years"
Which works out to roughly 99.9965% uptime - "four nines" in management-speak.
EUROCONTROL, the organisation that provides air traffic management for Europe, has apologised for an outage that made a mess of air transport across the continent yesterday. The problems with the organisation’s Enhanced Tactical Flow Management System (ETFMS) hit at around 13:00 UTC on Tuesday April 3rd. EUROCONTROL quickly …
Maybe it's just me, but I took that as sarcasm given the outstanding performance so far.
I've actually been at Eurocontrol - I used to live in the vicinity, had family working there and often visited the tower of the nearby airport.
Very impressive place.
"even if the products they use tout 5, 6 or 7, or more."
If you want to achieve 4x9 then you need to use products that individually achieve 6-7 or more - and insist on even higher factors in a lot of critical areas.
When you start multiplying the factors together you'll understand why.
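To put some illustrative numbers on that (the component figures below are made up for the sake of the arithmetic, not anything EUROCONTROL or any vendor actually quotes): chain a few components in series and their availabilities multiply, so a single four-nines link caps the whole stack below four nines.

```python
# Purely illustrative figures: the end-to-end availability of a chain of
# components is (roughly) the product of the individual availabilities,
# assuming independent failures and no redundancy between them.

components = {
    "network": 0.99999,    # five nines
    "storage": 0.99999,    # five nines
    "database": 0.999999,  # six nines
    "app": 0.9999,         # four nines
    "dns": 0.99999,        # five nines
}

availability = 1.0
for name, a in components.items():
    availability *= a

print(f"End-to-end availability: {availability:.4%}")
# ~99.9869% -- the one four-nines link drags the whole chain below four
# nines, even though every other component is five or six nines.
```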
I would have thought that as well, before I learned SRE at Google. In fact, building high reliability systems out of lower-reliability parts could be considered a major part of the job of an SRE at Google.
Suppose I have a service responding at http://myservice.google.com. DNS for myservice.google.com returns three IP addresses. If the server for each of these addresses has 99% resilience, it's not that hard at all to get 99.999%. All you have to do is ensure that the three servers are sufficiently isolated that their downtimes are uncorrelated. So...they are on different electrical substations. (If you are paranoid, they are on completely different electrical grids.) You ensure that the maintenance schedules for the three servers do not overlap. Stuff like that.
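A minimal sketch of that arithmetic, assuming the three servers really do fail independently (the 99% per-server figure is the hypothetical one from the comment above):

```python
# Sketch of the redundancy arithmetic, assuming the three servers fail
# independently (i.e. their downtimes are uncorrelated).

per_server_availability = 0.99  # each server has "99% resilience"
n_replicas = 3

# The service is only down when all replicas are down at the same time.
p_all_down = (1 - per_server_availability) ** n_replicas
service_availability = 1 - p_all_down

print(f"Single server:           {per_server_availability:.2%}")
print(f"Three isolated replicas: {service_availability:.4%}")
# 0.01 ** 3 = 1e-6, i.e. 99.9999% -- past five nines, but only while the
# failures really are uncorrelated (separate substations, staggered
# maintenance windows, and so on).
```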
In fact, SRE-supported services at Google pretty much only go down because someone screws up a configuration file. (Most of the time, a configuration-file screw-up only causes pagers to go off, not actual outages.) Most years, there are 0 outages caused by things like substations going down. This is the result of proper planning and engineering.
And the secret sauce involves how to do it while overprovisioning by 30% or less, instead of 300% as in the above example.
They've now identified the problem:
The trigger event for the outage has been identified and measures have been put in place to ensure no reoccurrence. The trigger event was an incorrect link between the testing of a new software release and the live operations system; this led to the deletion of all current flight plans on the live system. We are confident that there was no outside interference.
https://www.eurocontrol.int/press-releases/statement-etfms-outage-3-april
Bear in mind that they didn't shut down the airspace. Maybe they were in disaster mode within minutes, and it just took a few hours to sort out the problem and return to primary.
I'd suggest that having a standby system capable of handling 90% of the load is pretty good. You don't expect to lean on it very often, and it'll be a damn sight cheaper than one specced for 100%. I know - ideal world and all that. But realistically that's not bad.
How many readers have a DRP that allows implementation and return to normal within a day?
5 hours is pretty good; the thing that seems to be overlooked here is that the plan worked.
To misappropriate a famous saying: "Any landing you walk away from is a good landing."
5 hours is appalling for a mission critical system, even for a non-safety critical business.
That they only needed to cut airspace capacity by 10% to cope with the outage suggests that whatever they did got the job done. Planes don't drop out of the sky the moment those systems fail; the real test is how much time it took for whatever replacement system they had to take over. It appears to have been well within the otherwise rather tight margins of the airspace they manage.
That strikes me as a better metric to evaluate.
How many readers have a DRP that allows implementation and return to normal within a day?
I'll be honest, we've about 10 systems that we *could* flip over in about 12 hours, with minor data loss. Bringing them back would take roughly the same time. I've built 2 specialized environments that were designed to be actively moved from one DC to another in flight, losing only uncommitted transactions. Those, however, never saw the light of day in practice due to outsourcing changes.
I was travelling from Manchester to Helsinki when this happened. I had to transfer at Stockholm for an almost impossibly short 35-minute flight to Helsinki.
Having boarded the plane all ready to depart, the captain announced that there was a problem at Eurocontrol and we weren't allowed to take off. A delay of at least 20 minutes was to be expected and possibly longer.
At the time this was fairly frustrating: given an expected flight time of 35 minutes, a 20-minute delay felt like an eternity...
Having read this article however, I've a newfound love for Eurocontrol. That's some impressive resilience!
Went for an interview there sometime in the early 90s.
Worst interview of my career. As they explained the project to me (reusing some X front-end code base to front a legacy COBOL backend) I could not conceal my horror at what they were attempting.
I would have got the job, except the agent told me afterwards that one of the interviewers was partially deaf. I'm a softly spoken Scot and he could not hear me, hence his look of confusion throughout the process was a perfect match for mine.
Is this about right due to switching over to purely manual handling by Air Traffic Controllers?
Or perhaps national systems (if they still have them in Europe) couldn't automatically hand over at the border?
Noting the explanation posted upstream that a testing error ditched all the stored flight plans, a minor drop in capacity seems pretty damn good!