Scared of flying? Good news! Software glitches keep aircraft on the ground

It has been a bad few days for anyone with a fear of flying or, perhaps more accurately, a fear of getting to an airport only to find that flying is the last thing that will happen. United Airlines was the latest to suffer problems after a software issue resulted in all its aircraft being briefly held at their origin airports …

  1. DS999 Silver badge

    I wonder if United's software update

    Was made during business hours, or was made overnight but caused reduced performance, meaning the daily load they would otherwise have been able to handle was too much? If the latter, kudos to them for figuring that out so quickly.

  2. Anonymous Coward
    Anonymous Coward

    Hmmm...

    Assuming the official version of events is genuine, are we seriously expected to believe that the previous 15 million flight plans were entirely error-free?

    Something doesn't add up here.

    1. Neil Barnes Silver badge

      Re: Hmmm...

      Perhaps not entirely error-free, but that this is the first time _this_ error has occurred?

      1. Fonant

        Re: Hmmm...

        I heard somewhere that this particular plane's planned route was most unusual, so it triggered the problem where well-worn commercial flight routes would not have done.

        ICAO are also trying to eliminate waypoints with duplicate names, but since this needs international agreement, it's taking some time...

        1. yoganmahew

          Re: Hmmm...

          So it's a known issue? A known issue that wasn't tested?!

          Every admission is a new scandal :)

          IIRC a dodgy flight plan caused the last outage. How did they not learn the last time that the first rule of resilience is to get back operational? Find the error data, poke it out, restart quickly. They're not saying a restart takes 4 hours, are they?

          1. Headley_Grange Silver badge

            Re: Hmmm...

            They're saying that as long as they didn't know what was going on then they were going to play it safe. All flight plans had to be processed manually, which takes time and adds workload to ATC, so they increased separation for safety and that meant fewer planes could fly. If the problem had been found and fixed in a few minutes, it's still feasible that hours of disruption could have resulted because of the scheduling knock-ons.

            The first rule of resilience in a safety-critical environment is to make it safe. If this had been a cyber attack, and not a simple database error, then binning the faulty flight plan and carrying on (which is what I'm inferring you meant in your post, so apologies if I've misunderstood) could have been pretty scary.

            I agree that it looks like NATS had some crap code, error handling, data integrity issues, and they deserve stick for it, but I don't think they can be criticised for playing safe until they knew what was going on.

          2. Dave@Home

            "They're not saying a restart takes 4 hours are they?"

            I'm sure I read that the buffer of the system is 4 hours and they had to wait till it had cleared the route before they could start processing

    2. Russ T

      Re: Hmmm...

      I wonder if the two identically named waypoints hadn't been in the same flight plan before, but since planes have been re-routing around Russian air space, this was a first occurrence?

  3. Inventor of the Marmite Laser Silver badge

    At least three systems are required

    If one has two systems offering two different results, the simple and embarrassing question is which one should you believe.

    At least with three systems one could hope to vote 2-oo-3.

    Dual systems? Forget it.
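
    (A minimal sketch of the 2-out-of-3 idea in Python, for illustration only; the function and channel names are invented, and nothing like this is claimed to be what NATS actually runs.)

        from collections import Counter

        def vote_2oo3(results):
            """Return the majority answer from three independent channels,
            or None if no two channels agree (no quorum: fail safe, go manual)."""
            assert len(results) == 3
            answer, count = Counter(results).most_common(1)[0]
            return answer if count >= 2 else None

        # Two channels agree, one dissents: the majority wins.
        print(vote_2oo3(["ACCEPT", "ACCEPT", "REJECT"]))   # ACCEPT
        # All three disagree: no quorum, hand the decision to a human.
        print(vote_2oo3(["ACCEPT", "REJECT", "QUERY"]))    # None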

    1. S4qFBxkFFg

      Re: At least three systems are required

      "If one has two systems offering two different results, the simple and embarrassing question is which one should you believe."

      neither

    2. Anonymous Coward
      Anonymous Coward

      Re: At least three systems are required

      Are you kidding? The bean counters around here only cough up enough for ONE system that doesn't meet spec. Well, it meets spec... if your spec is 5 years old.

      1. Annihilator

        Re: At least three systems are required

        It meets *a* spec. Just not yours.

    3. Tom Chiverton 1 Silver badge

      Re: At least three systems are required

      They're probably covering hardware failure, in active-passive failover...

    4. JulieM Silver badge

      Re: At least three systems are required

      First system sees duff data, says "Your problem, meatbag", copies duff data into backup system, hands over and shuts down in a sulk.

      Backup system sees duff data, says "Your problem, meatbag", copies duff data into second backup system, hands over and shuts down in a sulk.

      Second backup sees duff data, says "Your problem, meatbag", hasn't got another backup system to copy duff data into so just shuts down in a sulk.

      1. Martin Gregorie

        Re: At least three systems are required

        Any system that can't recognise and reject duff data WITH A CLEAR DESCRIPTION OF THE ERROR is poorly designed and should never have passed acceptance testing.

        Any system that crashes or fails over to a parallel backup process just because it receives data containing errors

        (a) was designed by somebody who does not deserve the title of 'system designer', and

        (b) should never have passed its design review before coding started.

  4. Tron Silver badge

    No problem.

    Just compensate everyone - passengers, airlines, freight shippers and airports - and things will be fine.

    1. adam 40

      Re: No problem.

      Absolutely. The bean counters decided it should be implemented to this quality.

      So, the liability rests with them, it's a simple equation.

      I'm thinking a class action against NATS would be appropriate here, because the airlines are shrugging off liability (and, who can blame them?)

  5. Richard 12 Silver badge
    Windows

    NATS crashed.

    The description given in public very obviously means it simply crashed on unexpected and/or bad data.

    And then the backup went ahead and crashed as well, as expected. Same code, same assumptions, same data, same crash.

    Worse, it clearly didn't create a useful log (or even core dump?)

    If it had done then the staff would have been able to figure out which flight plan crashed the system, remove it from the automated queue and try again before the four hour "major disruption" deadline.

    Or at least which small block of 10-100 flight plans contained the problem. Drop those out, continue.

    Then they could manually process the funky flight plan(s), and finally set someone to work on figuring out why that flight plan crashed the software - without a nasty deadline hanging over them.

    Asking someone to manually process ten flight plans in the knowledge that one of them made the automation fall over is also a very effective way of finding the flaw. Handing them 10,000 is a very effective way of making sure they ... don't.
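
    (Roughly that block-at-a-time idea as a Python sketch; process_plan and the block size are invented for illustration, not anything from the actual NATS system.)

        def process_in_blocks(plans, process_plan, block_size=100):
            """Process flight plans in small blocks; if a block makes the
            automation fall over, set the whole block aside for manual handling
            and carry on with the rest of the queue."""
            suspect_blocks = []
            for i in range(0, len(plans), block_size):
                block = plans[i:i + block_size]
                try:
                    for plan in block:
                        process_plan(plan)
                except Exception:
                    suspect_blocks.append(block)   # a human gets ~100 plans to check, not 10,000
            return suspect_blocks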

    1. Boris the Cockroach Silver badge
      Facepalm

      Re: NATS crashed.

      Missed something as simple as

      try
      {
          Process.NextFlightPlan();
      }
      catch
      {
          DeleteFlightPlanFromList();
          Console.WriteLine("Incorrect flight plan, please process manually");
      }

      Guess NATS will be adding a sanitation module to the code in the next week.....

      1. anothercynic Silver badge

        Re: NATS crashed.

        By the sounds of it in other media, the systems vendor was eventually called in because the engineers couldn't figure it out (but then again, if you're an engineer without the flight management experience that might have let you spot two identically-named waypoints with different coordinates, it's entirely possible to just go "WTF won't it work"). Apparently some changes have been applied since to stop this happening again (probably said sanitation/validation module).

        I would not be surprised if this is something that was brought up during acceptance testing at some point, but given low priority because the chances of this happening were very low (1:15 million is low to those who don't realise how many flight plans are filed daily) or non-existent. Now that it's happened and it ground the whole thing to a halt, someone over at the vendor's probably dug this out of their "low priority, won't fix" queue and added a fix that was probably documented already.

        This is yet another Swiss cheese "all holes align at the wrong time" moment. It happens. The system and the backup did what they were meant to - fail safe. That it then caused a massive backlog with manual re-entering/validation is more of a problem, especially when there are a lot of flights in a day that sort-of run by the skin of their teeth in terms of filing, getting clearance and flying out, and then repeating it at the other end.

        There's no doubt that there'll be a historical RCA to see why this slipped through or wasn't approved/wasn't mitigated against at the time this was raised, and NATS will find itself sailing along with no major incident in the future when it (inevitably) happens again.

        1. Headley_Grange Silver badge

          Re: NATS crashed.

          1/15 million is just based on the fact that NATS has processed 15 million flight plans since the Wright brothers with no problems and was a small/big-number-we-can-quote-to-make-us-sound-good. The probability of error in the database is dependent on how many opportunities for error there are and the likelihood of there being an error. The risk is the probability of it happening multiplied by the impact. Given the impact in this case - going to manual processing - was likely to cost hundreds of millions of pounds, fuck a lot of people and, more importantly for the high-paid-help, make them look crap in the news, then it was worth testing for. The fact that they've already got a potential solution shows it was an easy fix. Hindsight is great, I know, but this was foreseeable and testable in my view.
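
          (To put rough numbers on "risk = probability x impact": a back-of-envelope Python sum with entirely invented figures, just to show that even a one-in-15-million event can be worth testing for.)

              p_failure_per_plan = 1 / 15_000_000    # the quoted "1 in 15 million"
              plans_per_year = 2_500_000             # invented annual flight plan volume
              impact_cost = 100_000_000              # invented cost of one outage, in pounds

              # Probability of at least one failure in a year, times the cost of an outage.
              expected_annual_cost = (1 - (1 - p_failure_per_plan) ** plans_per_year) * impact_cost
              print(f"£{expected_annual_cost:,.0f}")  # roughly £15m a year, on these made-up numbers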

          1. yoganmahew

            Re: NATS crashed.

            For that route, the probability of error was 1/1...

            1. adam 40

              Re: NATS crashed.

              actually, it was 2 in 1 (because it crashed the backup system too!)

          2. anothercynic Silver badge

            Re: NATS crashed.

            Of course, but we are talking about semi-government here, which inevitably means T&M contracts, and which inevitably means beancounters somewhere along the line went "nah, that's ok".

            The end effect now has been FAFO (f*** around and find out). They sure as hell found out.

            1. Dan 55 Silver badge

              Re: NATS crashed.

              But in the end NATS won't pay for the downtime, the airlines will, so it's all good. Beancounters win again.

      2. Tom Chiverton 1 Silver badge

        Re: NATS crashed.

        Your error log fails to include relevant information, like the batch or plan number. Maybe much like the real code :-)

      3. Anonymous Coward
        Anonymous Coward

        Re: NATS crashed.

        If the point of origin (departing airfield) is available, have the system deny the flight plan order.

        Force the inconvenience back to where the anomaly would be introduced.

        If take off is denied - no problem.

    2. Martin Gregorie

      Re: NATS crashed.

      It seems to me that NATS's handling of mistakes in flight plans is just plain wrong. Given that a flight plan seems to be a self-contained item that must not clash with any previously submitted flight plan, it follows that mistakes in flight plans should never be treated as system errors but rather, that NATS should reject that flight plan for correction and re-submission by its originator.

      In a critical system like this, errors in flight plans should never cause standby or duplicate processes to crash.

      What should happen is that, if the flight plan validation process finds an error, the plan being validated is rejected and an error report returned to the originator so the plan can be corrected and re-submitted by its originators. If the flight plan was automatically generated by the originator's software, then it's up to them to manually correct and re-submit rejected flight plans while their support team fixes the error(s) in their software and/or database.
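
      (A Python sketch of that reject-and-resubmit shape; the field names and the duplicate-name check are made up for illustration, not a description of how NATS or IFPS actually validate plans.)

          def submit_flight_plan(plan, validate, notify_originator, accept):
              """Validate a submitted plan; accept it, or bounce it back to its
              originator with a readable list of problems. A bad plan becomes a
              rejection message, never an exception that takes the process down."""
              problems = validate(plan)
              if problems:
                  notify_originator(plan["originator"], plan["id"], problems)
                  return False    # rejected for correction and re-submission
              accept(plan)
              return True         # joins the live system

          def validate(plan):
              problems = []
              names = [wp["name"] for wp in plan["waypoints"]]
              if len(names) != len(set(names)):
                  problems.append("route references a duplicated waypoint name; "
                                  "disambiguate by coordinates or region")
              return problems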

      1. ITMA Silver badge
        Devil

        Re: NATS crashed.

        Precisely.

        The question is - what happens when other types of errors appear in flight plans that are submitted but fail validation?

        Surely this must happen.

        What does it currently do in that situation? I presume (hope!!!) it does something sensible!

      2. sgj100

        Re: NATS crashed.

        A problem with this is that the flight plan wasn't submitted directly to NATS. It was submitted to Eurocontrol’s Integrated Initial Flight Plan Processing System (IFPS), which is the central Flight Planning tool for the International Civil Aviation Organization (ICAO) European Region. This accepted the flight plan because it was correctly formatted. IFPS then distributes the plan to all relevant Air Navigation Service Providers (ANSPs), of which NATS is one. Presumably the software used by the other ANSPs was able to deal with the duplicated waypoint name!

        1. ITMA Silver badge
          Devil

          Re: NATS crashed.

          And what? NATS should just accept it and assume that because it is "properly formatted" that it isn't "properly formatted gibberish"?

    3. Hans Neeson-Bumpsadese Silver badge
      Boffin

      Re: NATS crashed.

      "And then the backup went ahead and crashed as well, as expected. Same code, same assumptions, same data, same crash."

      To my mind, "same code" means that's not a proper implementation of a backup system. On an aircraft, backup systems are different code written by different people* from the main system. If there's an issue with the data that the main system has a problem with, then the backup system should handle it differently. There's always the possibility that the data issue is so out there that it causes the backup system to have a problem as well, but at least you get two different bites of the cherry.

      * I used to work with a project manager who, in a previous life, had worked on systems for aircraft where 3-way redundancy was the norm. He said that he had to assemble teams who were truly independent, e.g. If someone from team 'A' had previously worked at 3rd party company 'X', then that precluded any other former company 'X' folk from working in teams 'B' or 'C', as they may have certain coding habits in common. In such an incestuous industry, that made his life rather difficult.

  6. elsergiovolador Silver badge

    More

    Expect more events like this due to Tory destruction of our IT services.

    No expert in their right mind is going to work in these conditions we have now (poor rates, extremely high taxes, no employment rights).

    1. Anonymous Coward
      Anonymous Coward

      Re: More

      And this is the mindless Tory bashing we get on this site. Even though it has nothing to do with Governments of any colour as it happens. People who are vehemently anti Tory are generally foaming at the mouth, this comment proves it. Best of luck with your Labour Government next year.

      1. Boris the Cockroach Silver badge
        IT Angle

        Re: More

        Don't worry..... when labour get in all will be sweetness and light............ until they start making the stupid decisions led by a stupid ideology

        <<<already loading the brick cannon for labour.... run out of bricks to throw at the tories due to them buying the lot and then dropping them on their feet

      2. elsergiovolador Silver badge

        Re: More

        Can't be any worse than this lot of corrupt incompetents.

        1. JulieM Silver badge

          Re: More

          That sounds suspiciously like an offer to hold someone's beer .....

        2. Intractable Potsherd

          Re: More

          You usually have more imagination than that @elsergio!

    2. anothercynic Silver badge

      Re: More

      As much as I detest the Tories, this has nothing to do with the Tories or "Tory destruction of our IT services".

      This is more to do with something that happens less often than one thinks slipping through and gumming up the works. It happens. It's been fixed. NATS and its systems vendor have learned from it. It (most likely) won't happen again in the future.

    3. Anonymous Coward
      Anonymous Coward

      Re: More

      We need a version of Godwin's law: any discussion of any topic on El Reg will end up with some moron blaming the tories and/or brexit.

      Disclaimer: I'm not a supporter of the tories or the other bunch of incompetents who occasionally get the opportunity to prove themselves equally capable of doing nothing but damage and lining the pockets of their chums.

  7. Primus Secundus Tertius

    Design for errors

    Designing systems to cope with errors, i.e. raw user input, is difficult. To give constructive error messages you have to parse a range of inputs that include error cases, not just the perfect working case. So system design is a bigger and more expensive task, not always appreciated by techies let alone by beancounters.

  8. Phil O'Sophical Silver badge
    Facepalm

    Forget DDOS attacks, the next time some unfriendly state wants to cause chaos, they just need to file a corrupt flight plan. I can't help but feel there will be a lot of random junk being fired into NATS from Russia and China over the next months.

    1. anothercynic Silver badge

      And chances are that none of that random junk will make any difference because that 'vector of attack' will have been closed/mitigated against.

      1. Anonymous Coward
        Anonymous Coward

        Well, the "2 waypoints with the same name" hole will have been plugged. How many others are waiting to be found?

        1. Mike 137 Silver badge

          Multiple levels of hazard

          Duplicate waypoint names are an intrinsic hazard and have previously caused accidents in their own right, e.g. American Airlines Flight 965 (20 December 1995). So there were two apparent sources of hazard at work here -- the duplicate waypoint name that triggered the shutdown and the faulty logic that responded in that manner to it. Just as in the case of the now famous Überlingen incident, there are almost always multiple layers of hazard that all have to be active together for the outcome to occur. Consequently, fixing the obvious notional 'root cause' is practically never a complete solution.

    2. Anonymous Coward
      Anonymous Coward

      Yup, I said that at the time.

      Absolutely not a comfortable feeling. OTOH, at least the planes don't drop out of the sky, that's apparently Boeing's job now, having taken over from Airbus (yes, they had problems too).

    3. Graham Cobb

      Is that what happened here? There is a lot of deliberately very vague talk about an "unusual" flight plan. How "unusual" was it? Was it a deliberate attack (with or without knowing whether it would cause an actual problem)? Have there been lots of other flight plans failing sanity checks recently?

      I am sure we won't get an answer - if it is an attack no one will want to comment, of course.

      1. Dan 55 Silver badge

        We already have the answer. It was in-spec but the flight loading program choked over it. The plan should have been moved to the "have a look at this" queue with a description of the problem.

        1. Graham Cobb

          Yes - but has someone been trying to generate just such a flight plan, knowing there is no "have a look at this" queue and that the system will crash?

  9. Michael Hoffmann Silver badge
    Windows

    Bah, humbug!

    Back in my days we didn't have all this software around airplanes!

    You tuned your NDB to the nearest country-and-hillbilly AM station and off you went into the wild blue yonder! IFR meant "I Follow Roads"!

    Your approach guidance system was the volume of the passengers screaming in the back! Dispatch was when the stewardess(!) served you donuts and coffee.

    Icon, as it's the closest I could find to a geezer ->

  10. Grey_Kiwi

    "The Register approached NATS for comment on how the software was purchased and validated."

    According to some of the aviation-related websites I've looked at, the core software was purchased from the U.S. Federal Aviation Administration in the 1970s - yes, fifty years ago - and currently runs in emulation mode on more recent, but by no means new, hardware. I have even heard that it runs in a virtual image of an older model mainframe, which is in turn emulating the original hardware. The software has of course been heavily patched, but there's nobody still working who was there at the beginning and who knows how it all actually works; they've all long since retired and/or shuffled off their mortal coil.

    The only solution is a complete rip-and-replace, but since NATS has been sort-of privatised, that prospect brings on an attack of the cold sweats in the beanie department and the Board.

    Mind you, I don't blame them, this would probably be a Major IT Procurement Project with a huge budget, maximum visibility, long timescales, many many stakeholders, a lot of politics and enormous technical risk.

    There are times when I'm glad I'm retired :)

    1. ChoHag Silver badge

      > The only solution is a complete rip-and-replace

      Found the developer.

      That is never the solution, especially when the system whose purpose is keeping people alive is working perfectly well as it is.

      1. Brewster's Angle Grinder Silver badge

        Your argument amounts to "This RAAC roof is intact and working. So we shouldn't replace it as doing so introduces risk that the new roof collapses."

        Their system is working now (except when it doesn't...) The problem is how long can that be sustained with ever more ridiculous levels of emulation...? That it hasn't collapsed, doesn't mean it's not going to. And when it does collapse, you're left with nothing. So, they should be developing a replacement system today, because it will likely take many years; just as we should have been replacing RAAC roofs over the last decade.

        (And for the record, I don't think a publicly owned service would be any more willing to spend than a private one.)

    2. anothercynic Silver badge

      Well, ideally, you would hope that 'rip and replace' would mean running the new system in parallel with the old system: use the live feed into the old system as a mirrored live feed into the new system too, and let the new system send its outputs somewhere that isn't live (but where they can be validated), so that any bugs that the old system mitigated against are also not present in the new system.

      Of course, that requires time and effort, but as you point out, will inevitably bring on an attack of the cold sweats in the beanie department and the board over the cost of running things in parallel ;-)
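
      (A toy Python sketch of that parallel-running idea; the feed and system names are invented. The new system sees the same live input, but its output only goes to a comparison log, never to controllers.)

          def shadow_run(feed, old_system, new_system, log_mismatch):
              """Mirror every live message into both systems. Only the old
              system's output drives operations; the new system's output is
              compared against it and any mismatch is logged for analysis."""
              for message in feed:
                  live_output = old_system(message)     # this one runs the show
                  candidate = new_system(message)       # this one is on probation
                  if candidate != live_output:
                      log_mismatch(message, live_output, candidate)
                  yield live_output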

  11. J.G.Harston Silver badge

    In other threads we have people insisting that all NHS providers and contractors should use the same identical monolithic system........

  12. anonymous boring coward Silver badge

    We could land on the moon in the 1960s, with kBytes of memory and a computer made from logic gates, but can't handle air traffic today, when a simple PC can do a billion operations in about 1/10 of a second? Guess we got our priorities wrong somehow.

    1. Anonymous Coward
      Anonymous Coward

      Most of the heavy lifting was done back in Mission Control prior to the flight, and that 'kBytes of memory and a computer made from logic gates' was just for guidance in the final few minutes prior to the landing.

      But it wasn't a flawless landing. That tiny guidance computer was giving out alarms and partially rebooting because it was getting flooded with spurious data from the radar... do they ignore the alarms and continue the descent or do they abort?

      Tune in next week to see if they survive... does Buzz lean out of the hatch and guide them down the last few feet?... does Neil just thump the thing with a wrench?... does a boom mic suddenly appear in shot?... does the film director get thrown off the set?

  13. Andy E
    Mushroom

    It's complicated....

    It's fascinating reading the comments but it's clear people don't have any idea how complicated air traffic control is or how it works. With long distance flights that terminate in or cross UK airspace, the flight plans are passed to NATS 4 hours before the aircraft enters UK airspace. So the plane may well be in the air by the time NATS gets the flight plan. The flight plans are submitted to a local or regional organisation, which does some validity checks before routing it to the authorities who manage the airspace the flight crosses. These organisations will have different systems, so there may be a lot of data translation involved. The software module that crashed validates the flight plan and adds UK-specific information to it for the air traffic controllers before passing it to the next system. That it caught the error is good, but that it failed in the way it did was not so good. In the bigger picture, things failed safely.

    For an even more embarrassing episode concerning NATS and software not behaving as expected, read the report of the 12th December 2014 air traffic fiasco.

  14. Anne Hunny Mouse

    "We're sure the software engineers among our readership will also be scratching their heads at how such a problem could happen and how it was acceptable that the result was to fall into maintenance mode and write to a log rather than simply record and flag the failure and move onto the following file."

    With the old Plessey / GPT / Siemens ISDX platform, if there was a call the system didn't know how to route, or a routing loop, it would crash or flip processors if it had dual processors.

    Even with 2 processors this situation still causes issues as all calls are dropped during a processor switchover.

    Newer systems tend to drop a call in such a situation, but it is possible to easily bring a Cisco CUBE to its knees if a voice routing loop has been accidentally configured on systems connected to the CUBE.

  15. Anne Hunny Mouse

    Suspect that it was never explained to the software developer creating the software that there are duplicated waypoints across the world.

    I've read that there are at least 4 duplicated waypoints, but they are very geographically separated to avoid issues.

    Wonder if the plane's flight / navigation computer coped with the duplicated waypoint name or the flight crew had elected to only enter part of the route into the plane?

    1. Anonymous Coward
      Anonymous Coward

      (have told this before...)

      I wanted to find out how long it should take to go from Stirling up the road to Perth.

      I went to google.co.uk/maps and entered 'stirling to perth' in the search box and it said 'no route found'. Things have now improved and it now says it's a 20-hour flight.

      Turns out that Google thinks that people in Stirling would prefer to visit Western Australia than travel 33 miles up the A9

      1. Benegesserict Cumbersomberbatch Silver badge

        It got easier when LHR-PER direct flights started.
