EUROCONTROL outage causes flight delays across Europe

EUROCONTROL, the organisation that provides air traffic management for Europe, has apologised for an outage that made a mess of air transport across the continent yesterday. The problems with the organisation’s Enhanced Tactical Flow Management System (ETFMS) hit at around 13:00 UTC on Tuesday April 3rd. EUROCONTROL quickly …

  1. Anonymous Coward

    "about 0.0035% downtime for the last 17 years"

    Which is "four nines" in management-speak.

    1. Anonymous Coward

      Re: "about 0.0035% downtime for the last 17 years"

      That's pretty crappy TBH, although most organisations aim for 4x9s even if the products they use tout 5, 6, 7 or more.

      Risk vs cost...

      1. Gordon Pryra

        Re: "about 0.0035% downtime for the last 17 years"

        Pretty crappy??

        From your statement I can see you have never worked for .Gov.uk, the NHS or any bank in the UK.

        Or any private company in the UK

        or any NGO

        probably anywhere actually.

        1. Anonymous Coward

          Re: "about 0.0035% downtime for the last 17 years"

          Maybe it's just me, but I took that as sarcasm given the outstanding performance so far.

          I've actually been at Eurocontrol - I used to live in the vicinity, had family working there and often visited the tower of the nearby airport.

          Very impressive place.

      2. Alan Brown Silver badge

        Re: "about 0.0035% downtime for the last 17 years"

        "even if the products they use tout 5, 6 or 7, or more."

        If you want to achieve 4x9 then you need to use products that individually achieve 6-7 or more - and insist on even higher factors in a lot of critical areas.

        When you start multiplying the factors together you'll understand why.
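
        A rough sketch of that multiplication in Python (the component counts and availability figures here are made up purely for illustration, not taken from any real stack):

            # Components in series: overall availability is the product of the
            # parts, so the chain is always worse than its weakest link.
            def series_availability(availabilities):
                total = 1.0
                for a in availabilities:
                    total *= a
                return total

            # Ten hypothetical components, each individually "four nines" (99.99%):
            print(series_availability([0.9999] * 10))    # ~0.99900 -> only about three nines

            # The same ten components at "six nines" each keep the chain near five nines:
            print(series_availability([0.999999] * 10))  # ~0.99999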

        1. Claptrap314 Silver badge

          Re: "about 0.0035% downtime for the last 17 years"

          I would have thought that as well, before I learned SRE at Google. In fact, building high reliability systems out of lower-reliability parts could be considered a major part of the job of an SRE at Google.

          Suppose I have a service responding at http://myservice.google.com. DNS for myservice.google.com returns three IP addresses. If the server for each of these addresses has 99% resilience, it's not that hard at all to get 99.999%. All you have to do is ensure that the three servers are sufficiently isolated that their downtimes are uncorrelated. So...they are on different electrical substations. (If you are paranoid, they are on completely different electrical grids.) You ensure that the maintenance schedules for the three servers do not overlap. Stuff like that.
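
          As a sketch of that arithmetic in Python (assuming fully independent failures, which is the hard part in practice):

              # The replicated service is only down when every replica is down at once.
              def parallel_availability(per_replica, replicas):
                  return 1 - (1 - per_replica) ** replicas

              print(parallel_availability(0.99, 3))   # 0.999999 -> about six nines on paper

          Correlated outages (a shared substation, overlapping maintenance windows) are exactly what drags that paper figure back towards the per-server number, which is why the isolation work matters more than the formula.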

          In fact, SRE-supported services at Google pretty much only go down because someone screws up a configuration file. (Most of the time, a configuration-file screw-up only causes pagers to go off, not actual outages.) Most years, there are 0 outages caused by things like substations going down. This is the result of proper planning and engineering.

          And the secret sauce involves how to do it while overprovisioning by 30% or less, instead of 300% as in the above example.

  2. Anonymous South African Coward Silver badge

    Would really like to hear what caused the hiccup.

    Probably a cleaner unplugging a critical server somewhere...

    1. David Tallboys

      Didn't that happen in a hospital with the result that one particular bed in a ward had a high death rate?

    2. This post has been deleted by its author

      1. Anonymous Coward

        DNS. It's always DNS.

        1. Anonymous Coward

          Because it's such a BIND

          1. This post has been deleted by its author

            1. rsjaffe

              They've now identified the problem:

              The trigger event for the outage has been identified and measures have been put in place to ensure no reoccurrence. The trigger event was an incorrect link between the testing of a new software release and the live operations system; this led to the deletion of all current flight plans on the live system. We are confident that there was no outside interference.

              https://www.eurocontrol.int/press-releases/statement-etfms-outage-3-april

              1. Mark 65

                So some fuckwit was able to connect test code to the prod database or thereabouts? Unbelievable. Zero segregation of networks. That reliability to date is clearly through luck not planning.

  3. dom_f

    Better than "four nines five" is pretty impressive TBH....

    1. Anonymous Coward

      4.46 nines actually.

      -l(.000035)/l(10) = 4.45593195564972436451
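
      (That appears to be bc, where l() is the natural log. The same figure in Python, treating the quoted 0.0035% downtime as an unavailability fraction of 0.000035:)

          from math import log10

          downtime_fraction = 0.000035        # 0.0035% downtime over 17 years
          nines = -log10(downtime_fraction)   # "number of nines" of availability
          print(nines)                        # ~4.4559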

  4. Bill M

    5 Hours to switch over to Disaster Recovery System

    Switching over to one's Disaster Recovery system should take < 1 hour:

    ** < 30 minutes to make the decision

    ** < 30 minutes to do the switch over

    1. Anonymous Coward

      Re: 5 Hours to switch over to Disaster Recovery System

      There's also the process to move from Disaster Recovery back to normal operations. In some setups, that is another couple of hours in itself, just to ensure you don't trigger a new problem.

      1. Anonymous Coward

        Re: 5 Hours to switch over to Disaster Recovery System

        Did they switch over due to a hardware fault? Or was it a software fault, in which case switching to the DR site may not help!

    2. defiler

      Re: 5 Hours to switch over to Disaster Recovery System

      Bear in mind that they didn't shut down the airspace. Maybe they were in disaster mode within minutes, and it just took a few hours to sort out the problem and return to primary.

      I'd suggest that having a standby system capable of handling 90% of the load is pretty good. You don't expect to lean on it very often, and it'll be a damn sight cheaper than one specced for 100%. I know - ideal world and all that. But realistically that's not bad.

      1. Korev Silver badge
        Coat

        Re: 5 Hours to switch over to Disaster Recovery System

        It's certainly a better plan than just winging it.

        1. phuzz Silver badge

          Re: 5 Hours to switch over to Disaster Recovery System

          That pun was just plane terrible.

          1. defiler

            Re: 5 Hours to switch over to Disaster Recovery System

            I'll confess I had to read it twice before I groaned. First time it flew right over my head... :-/

            1. Anonymous Coward

              Re: 5 Hours to switch over to Disaster Recovery System

              I don't understand why everyone is in such a flap about this.

    3. Mikey

      Re: 5 Hours to switch over to Disaster Recovery System

      Right, that's it... all of you that cracked those bad jokes... you're grounded.

      1. Gordon Pryra

        Re: 5 Hours to switch over to Disaster Recovery System

        Those last 4 posts covered my keyboard in coffee and spit :(

      2. hplasm
        Devil

        Re: 5 Hours to switch over to Disaster Recovery System

        "Right, that's it... all of you that cracked those bad jokes... you're grounded."

        bloody Helicopter Parent!

    4. Anonymous Coward

      Re: 5 Hours to switch over to Disaster Recovery System

      How many readers have a DRP that allows implementation and return to normal within a day?

      5 hours is pretty good; the thing that seems to be overlooked here is that the plan works.

      To misappropriate a famous saying: "Any landing you walk away from is a good landing."

      1. Rusty 1

        Re: 5 Hours to switch over to Disaster Recovery System

        5 hours is appalling for a mission critical system, even for a non-safety critical business.

        When life-threatening situations may arise as a result of the failure, the DR procedure really does need to be good AND TESTED.

        1. Anonymous Coward

          Re: 5 Hours to switch over to Disaster Recovery System

          5 hours is appalling for a mission critical system, even for a non-safety critical business.

          That they only needed to cut airspace capacity by 10% to cope with the outage suggests that whatever they did worked. Planes don't drop out of the sky the moment those systems fail, and the real test is how much time it took for whatever replacement system they had to take over. It appears to have been well within the otherwise rather tight margins of the airspace they manage.

          That strikes me as a better metric to evaluate.

      2. Alistair
        Windows

        Re: 5 Hours to switch over to Disaster Recovery System

        How many readers have a DRP that allows implementation and return to normal within a day?

        I'll be honest, we've about 10 systems that we *could* flip over in about 12 hours, with minor data loss. Bringing them back would take roughly the same time. I've built 2 specialized environments that were designed to be actively moved from one DC to another in flight, losing only uncommitted transactions. Those, however, never saw the light of day in practice due to outsourcing changes.

  5. Claptrap314 Silver badge

    After my time as an SRE with G, it does seem to me that 1) the capacity drop should have been avoidable and 2) 5 hours is a really long time for a recovery of this sort.

    I know, government efficiency and all, but I really suspect that this can be made a lot better.

  6. danbishop

    A changed man...

    I was travelling from Manchester to Helsinki when this happened. I had to transfer at Stockholm for an almost impossibly short 35 minute flight to Helsinki.

    Once we'd boarded the plane, all ready to depart, the captain announced that there was a problem at Eurocontrol and we weren't allowed to take off. A delay of at least 20 minutes was to be expected, possibly longer.

    At the time this was fairly frustrating: given an expected flight time of 35 minutes, a 20-minute delay felt like an eternity...

    Having read this article however, I've a newfound love for Eurocontrol. That's some impressive resilience!

  7. James Anderson Silver badge

    Omg

    Went for an interview there sometime in the early 90s.

    Worst interview of my career. As they explained the project to me (reusing an X front end code base to front a legacy COBOL backend) I could not conceal my horror at what they were attempting.

    I would have got the job, except the agent told me afterwards that one of the interviewers was partially deaf. As a softly spoken Scot he could not hear me, hence his look of confusion throughout the process was a perfect match for mine.

    1. Anonymous Coward

      a softly spoken Scot

      Ah, the mythical soft-spoken Scot. Quietly playing his bagpipes.

  8. TrumpSlurp the Troll

    Drop in capacity?

    Is this about right for a switch over to purely manual handling by Air Traffic Controllers?

    Or perhaps national systems (if they still have them in Europe) couldn't automatically hand over at the border?

    Noting the explanation posted upstream that a testing error ditched all the stored flight plans, a minor drop in capacity seems pretty damn good!
