UK air traffic woes caused by 'invalid flight plan data'

Mystery still surrounds the technical issue at the UK's National Air Traffic Services (NATS) on Monday, which is being blamed on incorrect flight plan data being received, leading to the system reverting to manual processing and causing delays and cancellations of flights. You might be thinking the system and its operators …

  1. elsergiovolador Silver badge

    Expertise

    Always assume that the form may be filled by a dog or someone sitting on the keyboard.

    If someone submitted wrong data and managed to take down the whole system?

    That is reassuring.

    I think if we start paying engineers even less, then quality of work will certainly improve /s

    1. cyberdemon Silver badge
      Devil

      Re: Expertise

      A passenger manifest including Little Bobby Tables, perhaps?

      1. Yet Another Anonymous coward Silver badge

        Re: Expertise

        Ryanair fly to the new "London ^C" airport

      2. Anonymous Coward

        Re: Expertise

        That was my thought. Somebody filed a flight plan going to "EGLL'); DROP TABLE Flights;--"

        Little Bobby Waypoints, they call him.

        1. katrinab Silver badge
          Alert

          Re: Expertise

          I'm pretty sure they don't use anything as modern as this new-fangled SQL stuff.

          More likely it was a lower-case letter, an incorrect/missing newline character or something like that.

          1. Killfalcon Silver badge

            Re: Expertise

            Carriage return without a line feed.

            1. ITMA Silver badge
              Devil

              Re: Expertise

              RyanAir charge extra for submitting either of those.

              1. Ken Moorhouse Silver badge

                Re: RyanAir charge extra for submitting either of those.

                Legacy baggage handling?

          2. Tom 7

            Re: Expertise

            More likely a desperate squirrel to try and blame it on someone in the EU rather than go anywhere near the truth of the matter.

          3. Frank Bitterlich
            Holmes

            Re: Expertise

            They said the French did it. So it was probably a circumflex character....

        2. JulieM Silver badge

          Re: Expertise

          Unless (1) the programmer used 'single' speech marks in their naïve quoting and (2) that column is the last in the table Flights, the whole command line would fail, and `DROP TABLE Flights` would never be executed.
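The joke flight plan in this sub-thread only bites code that pastes input straight into the SQL text. As a minimal sketch, assuming Python's standard `sqlite3` module and a made-up `Flights` table with `callsign` and `destination` columns (none of this comes from NATS or the article), placeholder binding stores the hostile string as plain data:

```python
import sqlite3

# Hypothetical schema for illustration only; not anything NATS is known to use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Flights (callsign TEXT, destination TEXT)")

hostile = "EGLL'); DROP TABLE Flights;--"

# The "?" placeholders bind values as data, so the quote and semicolon
# inside `hostile` are never parsed as SQL.
conn.execute(
    "INSERT INTO Flights (callsign, destination) VALUES (?, ?)",
    ("BOBBY1", hostile),
)

stored = conn.execute("SELECT destination FROM Flights").fetchall()
table_still_exists = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type='table' AND name='Flights'"
).fetchone()[0] == 1
```

With binding, it no longer matters which quote style the programmer used or where the column sits in the table, because the input never becomes part of the statement.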

    2. UCAP Silver badge

      Re: Expertise

      If someone submitted wrong data and managed to take down the whole system?

      From what has been said so far, I think it is more a case of the system not being able to parse input data submitted by one of airlines and activating what can only be described as an extreme fail-safe strategy (i.e. shut down all automated processing until the problem has been rectified). The question I would like to ask is: was that strategy actually appropriate, or could another strategy (for example, rejecting the dodgy data set and forcing the airline responsible to resubmit a corrected one) have been employed that is more appropriate given the impact that a complete system shutdown will have?

      1. elsergiovolador Silver badge

        Re: Expertise

        as an extreme fail-safe strategy (i.e. shut down all automated processing until the problem has been rectified)

        In olden days you would call it brittle and something like this wouldn't get shipped to a customer.

        But you can certainly reframe poor error handling as a "fail-safe strategy".

        try {
            data = extractFormData(message);
        } catch (...) {
            // TODO: implement error handling, but for now call it fail-safe strategy
            volatile int32_t planeWankers = 1 / 0;
        }

        1. katrinab Silver badge
          Windows

          Re: Expertise

          How do you do that in S/360 assembly language?

          1. TimMaher Silver badge
            Coat

            Re: S/360

            First, you climb into the loft and dig out your ancient copy of “Principles of Operation”.

            To be continued…

            1. ChrisC Silver badge

              Re: S/360

              This could become the new weekly go-to read on El Reg...

            2. mikepren

              Re: S/360

              The old yellow cheat sheet. Amazing how much debugging assembler dumps translates into debugging Java stack dumps

          2. Someone Else Silver badge
            Holmes

            Re: Expertise

            How do you do that in S/360 assembly language?

            Seriously, why would anybody care?

        2. NeilPost

          Re: Expertise

          … are you sure it’s INT_32 and not crapping out on 65,535 resubmits ;-)

      2. James Anderson

        Re: Expertise

        It sounds like the poisoned message problem. You design a system so that no data is lost during the switchover.

        A message with invalid data crashes the primary system, the backup system starts up from where the main system left off and starts by processing the poisoned message............
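One common way out of the poisoned-message loop is to count failures per message and park repeat offenders for a human instead of replaying them forever. The following is a rough Python sketch of that idea only; the three-field plan format, `parse_plan`, `drain` and the retry limit are all invented for illustration and say nothing about how the NATS system actually works.

```python
from collections import deque


def parse_plan(raw: str) -> dict:
    """Hypothetical parser: expects 'CALLSIGN ORIGIN DEST'."""
    parts = raw.split()
    if len(parts) != 3:
        raise ValueError(f"malformed plan: {raw!r}")
    return {"callsign": parts[0], "origin": parts[1], "dest": parts[2]}


def drain(queue: deque, max_attempts: int = 2):
    """Process every message; quarantine one that keeps failing."""
    accepted, quarantined = [], []
    attempts = {}
    while queue:
        raw = queue.popleft()
        try:
            accepted.append(parse_plan(raw))
        except ValueError:
            attempts[raw] = attempts.get(raw, 0) + 1
            if attempts[raw] >= max_attempts:
                quarantined.append(raw)  # hand to a human, keep going
            else:
                queue.append(raw)  # retry later, behind other traffic
    return accepted, quarantined


accepted, quarantined = drain(
    deque(["BAW1 EGLL KJFK", "garbage", "EZY2 EGKK LFPG"])
)
```

The point of the design is that a backup taking over from the primary would inherit the attempt counts (or the quarantine list) along with the queue, so it does not start by choking on the same message.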

        1. Anonymous Coward

          Re: Expertise

          Novochik for Komputerz comrade. is very effektive.

        2. Mike Pellatt

          Re: Expertise

          That was the proximate cause of the last NATS major TITSUP

      3. Anonymous Coward

        Re: Expertise

        They were back up relatively quickly compared to us and the vendor from hell. Four days after the initial unexpected and unexplained shutdown the vendor had located the problem. There was one character, in one field! in one record, that their system didn’t like. We used a clause in the contract to say goodbye to them sharpish, something about performance being abysmal from memory.

        Anon for safety as the vendor wasn’t very happy.

        1. Anonymous Coward

          Re: Expertise

          There was that time when a young programmer wanted to print all 255 ASCII characters on paper.

          Some of them are not available on all printers, some unprintable.

          He never got them, because it crashed the IBM Mainframe computer.

          IBM supplied a fix and that was that.

          This happened more than 30 years ago.

          1. FIA Silver badge

            Re: Expertise

            255?

            Did he (sensibly) decide to omit the space?

        2. Denarius Silver badge

          Re: Expertise

          sounds like a problem in my late career. Mainframe process stopped processing input. The data, supposed to be vanilla EBCDIC looked OK in its ASCII incarnation on way thru Windows and unix middleware boxes. Then I ran od -c on input streams and found embedded ^Z chars from one company's emails. Who uses VAX end of file markers in 2010 from DOS computers? Easily fixed by using tr to filter all incoming messages but still a WTF moment encouraging a rigorous input cleaning attitude.

          1. Xalran

            Re: Expertise

            Those embedded ^Z remind me of a call worthy of *On Call*

            The short version is: backbone router upgrade failed, on-call engineer had to go on site as remote access was dead, he found that only half the configuration was loaded and had to reload the other half through the local serial port... post-mortem showed several ^D & ^Z hidden in the configuration file... Configuration reload had stopped at the first one. ISP was unhappy and requested a verification tool.

            (and putting the conf in vi and :set list was not an acceptable verification tool... despite the fact that it's how we found out)... Got a kludgy shell script written in 10 minutes that they probably still use.

            1. Brewster's Angle Grinder Silver badge

              Re: Expertise

              tr -d '\004\032'

              (Or sed, or awk, or perl, or any of the others...)
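For anyone who wants the count as well as the clean-up, the same filter can be sketched in Python. This assumes the two offending bytes are Control-D (0x04) and Control-Z (0x1A) as described in the thread; the sample configuration text and the function names are made up.

```python
# Control-D (EOT, 0x04) and Control-Z (SUB, 0x1A) as raw bytes.
UNWANTED = bytes([0x04, 0x1A])


def strip_controls(data: bytes) -> bytes:
    """Delete every occurrence of the unwanted bytes, like `tr -d`."""
    return data.translate(None, UNWANTED)


def count_controls(data: bytes) -> int:
    """Report how many offending bytes are present, for a verification log."""
    return sum(data.count(bytes([b])) for b in UNWANTED)


sample = b"hostname core1\x04\ninterface ge-0/0/0\x1a\n"
cleaned = strip_controls(sample)
```

Running `count_controls` before loading a file gives the yes/no check the ISP asked for, and `strip_controls` does the repair.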

              1. JulieM Silver badge

                Re: Expertise

                Ah, yes; but that doesn't produce a slowly-brightening-and-dimming map of the site on screen and a real-time count of the number of illegal characters detected like the computers in the movies.

              2. Xalran

                Re: Expertise

                I went for awk...

      4. Dan 55 Silver badge

        Re: Expertise

        activating what can only be described as an extreme fail-safe strategy (i.e. shut down all automated processing until the problem has been rectified).

        Also known as SIGSEGV.

        1. ITMA Silver badge
          Devil

          Re: Expertise

          "extreme fail-safe strategy"

          As I mentioned in another El-Reg item, because the fail-over backup system did exactly the same thing...

          Their fail-over fell over and failed.

          A strategy along the lines of:

          Mode change to "HUFF".

          State change to "SULK".

    3. Valeyard

      Re: Expertise

      probably caused by this poor bloke who decided on a change of career after taking down Fastly by submitting bad data

      https://www.theregister.com/2021/06/09/fastly_explains_web_blackout/

    4. smudge

      Re: Expertise

      Always assume that the form may be filled by a dog or someone sitting on the keyboard.

      In my first job, we had someone whom the dog would probably have outscored in an IQ test.

      I don't think he ever actually sat on the keyboard, but he certainly did things to systems which you wouldn't have thought possible.

      Invaluable, he was. If you want to make a system idiot-proof, first catch your idiot...

      1. Caver_Dave Silver badge
        Mushroom

        Re: Expertise

        We had two testers, Cheryl and Caroline, who were incredibly devious in the scenarios they could think of. e.g. "I want to put -1 in this children field as he had a child, but his wife got custody in the divorce." (A deceased child only took it down to zero. The bug was that the field filled an integer rather than an unsigned integer, and so would let them do that.)

        This was a trivial case that I am allowed to recount.

        Once you prised them from the keyboard, they were wonderful people.

        1. Yet Another Anonymous coward Silver badge

          Re: Expertise

          >I want to put -1 in this children field

          Checking what happens if you can put -1 in a uint field is a good test.

          Wasn't there a story here about a POS that let you enter a negative number as a tip and get a free meal?
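The negative-number cases above come down to parsing the field and then checking it against an explicit range before anything else touches it. A minimal Python sketch follows; the bounds of 0 to 20 and the names `parse_count` and `is_valid` are illustrative assumptions, not taken from any real system.

```python
def parse_count(text: str, lowest: int = 0, highest: int = 20) -> int:
    """Parse a whole-number field and reject anything outside [lowest, highest].

    The default bounds are illustrative assumptions only.
    """
    try:
        value = int(text.strip())
    except ValueError:
        raise ValueError(f"not a whole number: {text!r}") from None
    if not lowest <= value <= highest:
        raise ValueError(f"{value} is outside {lowest}..{highest}")
    return value


def is_valid(text: str) -> bool:
    """Convenience wrapper that turns the exception into a yes/no answer."""
    try:
        parse_count(text)
        return True
    except ValueError:
        return False


results = {text: is_valid(text) for text in ["0", "3", "-1", "2.5", "abc"]}
```

Storing the value in an unsigned type would also block -1, but an explicit range check additionally rejects values that are positive yet absurd.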

        2. John H Woods Silver badge

          Re: Expertise

          The old joke...

          A tester walks into a bar. They order 0 beers. They order 0.31 beers. They order -1 beers. They order 256 beers. They order asdfghjkl beers... &c

          1. Ken Moorhouse Silver badge

            Re: A tester walks into a bar.

            Brilliant, where's the 100x upvote button?

            Do they go into the toilets together to compare outputs?

            1. Fruit and Nutcase Silver badge
              Coat

              Re: A tester walks into a bar.

              The expected outputs are 0, 1, 2 or 1 and 2.

              Erroneous input can sometimes end up with a visit to the casualty department of the local hospital

          2. Ken Moorhouse Silver badge

            Re: A tester walks into a bar.

            What about the food?

            To test for buffet overflows.

          3. katrinab Silver badge
            Pint

            Re: Expertise

            If you go into most British pubs and ask for 0.5 beers, the bar person will generally give you a small glass of beer, or tell you that they don't sell that. But either way, it is a perfectly valid thing to ask for.

          4. A. Coatsworth Silver badge

            Re: Expertise

            ... then a user walks into the bar, asks for a payphone and the bar promptly catches fire

        3. Denarius Silver badge

          Re: Expertise

          I had a clued end user manager with a newbie staff member testing some client software. They found bugs no-one else could.

      2. Arthur the cat Silver badge

        Re: Expertise

        Invaluable, he was. If you want to make a system idiot-proof, first catch your idiot...

        In my case it was Pete. Pete could be totally relied upon to do the most improbable and ridiculous thing when faced with any situation. Absolutely brilliant for ensuring my code could keep running no matter what got chucked at it.

        Strangely, the one time all our kit got fried, Pete was in the room but not actually touching anything. He turned up in my office white as a sheet making "bu, bu, bu" noises. What nobody had known was that building management had for some reason run the lightning conductor down the wall outside the computer room. Pete was the only person in the room when a lightning strike hit the building and every piece of equipment turned into a plasma ball toy due to induced EMF while Pete cowered in the middle of the room thinking he was going to die.

  2. Howard Sway Silver badge

    Resiliency – we've heard of it

    Input validation - what's that?

    One wonders how bad the rest of the system must be if they couldn't even code simple checks to reject an incorrectly formatted flight plan.

    1. smudge
      Headmaster

      Re: Resiliency – we've heard of it

      I hadn't. I'd heard of resilience, though.

    2. John Sager

      Re: Resiliency – we've heard of it

      Not simple. Flight plans can be quite complex documents and doing a thorough input sanitation early on may not be feasible. I wonder if the whole truth of what happened will be released publicly. They may want to keep quiet about exactly what caused it to fail, if indeed it was a duff flight plan issue.

      1. Dave Pickles

        Re: Resiliency – we've heard of it

        There's a lot of Educated Guesswork by folks knowledgeable in the subject here:

        https://www.pprune.org/rumours-news/654461-u-k-nats-systems-failure-4.html

        Posts 61 and 80 are particularly interesting.

        1. Brewster's Angle Grinder Silver badge

          Re: Resiliency – we've heard of it

          Thanks for that. That's the sort of thing I was expecting.

          Given how many flights run every day without problem, it had to be something ridiculously obscure that nobody had managed to provoke before and where the most sensible thing to do was back out and ask for help. I'm surprised people are not more forgiving. If you find your invariants are broken, what else can code do but sound the alarm and wait for help?

          1. sabroni Silver badge

            Re: I'm surprised people are not more forgiving.

            Really? Haven't you met many?

          2. Someone Else Silver badge

            Re: Resiliency – we've heard of it

            If you find your invariants are broken, what else can code do but sound the alarm and wait for help?

            Constants aren't, variables won't.

        2. wsm

          Re: Resiliency – we've heard of it

          I'll believe post #77 until it's proven otherwise.

      2. Ken Moorhouse Silver badge

        Re: They may want to keep quiet

        I'd imagine if it was due to insufficient budget assigned then they will want it shouted from the rooftops.

      3. PerlyKing

        Re: Resiliency – we've heard of it

        Flight plans can be quite complex documents and doing a thorough input sanitation early on may not be feasible.

        As complex as 800MB+ XML documents that can be validated against a schema? That sounds unlikely.

        1. david 12 Silver badge

          Re: Resiliency – we've heard of it

          XML validation only validates the schema. The next step is execution of the content. That data-program may crash because you have coded a stack overflow, not because of syntax error. It may even be the case that the transform crashed because of an edge case in its physical model, rather than in the input program.

          1. bazza Silver badge

            Re: Resiliency – we've heard of it

            If you write an XML schema properly, and use proper tools that fully understand XML schema (a rarity), then you can in the schema define valid content. For example, you can constrain the valid values of an integer, or the length of a list, and parsers / serialisers generated by proper tools will generate code that checks that such constraints are honoured.

            So far, so good. As I hinted, most XML XSD code generators I've come across ignore the constraints and generate no code for them. Examples include Microsoft's xsd.exe, with a lame excuse (I found their reasoning buried deep in some docs) that amounted to "computers do not constrain the values of integers or the lengths of lists". Well duh, that's why we have software and write it to impose constraints when we want values and lengths constrained.

            However, so far as I know XML XSD is far from "complete". I'll explain what I mean. Whilst you can express constraints in an XSD file, you get nothing in the generated code to tell the program that uses the generated code what the constraints actually are. So, touching on your "That data-program may crash because you have coded a stack overflow" point, it's hard for the program or developer to know, for example, how much memory to allow for the length of a list when parsing some XML data. Perhaps not a problem on a server / desktop application where memory is bountiful and memory allocation is automatic in the generated code, more of a problem in an embedded system where perhaps the developer has had to decide how much heap a process will need.

            A serialisation technology that is "complete" is ASN.1, though not all the ASN.1 tools fulfil the whole standard. With ASN.1 you can define messages, setting value and length constraints as you do so, just like XML XSD, or JSON schemas. What is unusual is that you can also define static constants of messages, including integer constants, which can be used as the constraints in the definition of other messages, and these static constants also show up in generated code. So for the developer, the generated code contains 1) parsers / serialisers for objects (messages), 2) automatic constraint checking whilst parsing / serialising, 3) constants that can be used to understand the extents of constraints which can then be used for all sorts of purposes in the code the developer writes.

            One such purpose might be iterating over the length of a received list. If the list is defined as containing 10 entries, then the for loop can be from 0..listlen-1, where listlen is a constant that comes from the schema, not from the developer.

            The consequences can be quite profound. System behaviours related to the constraints on valid / invalid data can all be driven from the schema, not from developer-written source code. This means that all such constants have a single definition - irrespective of programming languages used across the whole system. Change the schema, rebuild the system, and the entire system is updated with the new constants and thus the new behaviour.

            This can have profound consequences on how you run projects. If you have a risk of stack overflow due to the amount of data to be received, you can have the stack size driven by the constants in the schema. It either works (there is enough memory), or the code throws an exception when it can't get enough stack in advance of needing it. If you need to extend the length of a list to contain more items, and you need programs to generate / process the extra items, so long as they're using the constraint constants from the schema-derived generated code a single change in the schema brings about the required code change system wide.

            That means you no longer need the developer to make the change, the schema author can safely make the change, at any point in the project life cycle, even quite late. You can be agile with the definition of messages in the system right throughout the project development cycle, because changes to message definitions do not have to result in any re-work.

            Pretty neat for a useful, old technology. Especially as it can emit / consume binary formats as well as JSON and XML...

            It's possible that JSON schema can, in some circumstances, pull off the same trick (I'm less familiar with JSON schemas, but I know that JSON is essentially executable JavaScript so who knows what can be done!). However, when I survey the vast array of serialisers out there it's remarkable how bad a lot of them are in terms of what they can actually do for developers. For example, Google Protocol Buffers is much lauded and widely used, but it does absolutely nothing to help developers validate inputs; there are no constraints (apart from an independent alpha/beta quality extension), developers (if they bother to validate messages at all) have to communicate by email or a word document or comments in the .proto file to understand what the valid range in a message can be. Most serialisers out there have not considered the role of serialisers in project development, or in project management.
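The single-definition idea described above can be imitated in miniature without any schema tooling. In the Python sketch below, two module-level constants stand in for values a schema compiler would generate; the names, the limits and the waypoint strings are all hypothetical, and this is not how an ASN.1 toolchain actually emits code.

```python
# Stand-ins for constants that a schema compiler would generate.
# Names and values are hypothetical.
MAX_WAYPOINTS = 10
ALTITUDE_RANGE_FT = (0, 60_000)


def validate_route(waypoints: list, altitude_ft: int) -> None:
    """Raise ValueError if the message breaks a schema constraint."""
    if not 1 <= len(waypoints) <= MAX_WAYPOINTS:
        raise ValueError(
            f"expected 1..{MAX_WAYPOINTS} waypoints, got {len(waypoints)}"
        )
    low, high = ALTITUDE_RANGE_FT
    if not low <= altitude_ft <= high:
        raise ValueError(f"altitude {altitude_ft} outside {low}..{high}")


def new_route_buffer() -> list:
    """Pre-size storage from the same constant the validator uses."""
    return [None] * MAX_WAYPOINTS


def load_route(waypoints: list, altitude_ft: int) -> list:
    """Validate first, then copy into a buffer that is known to be big enough."""
    validate_route(waypoints, altitude_ft)
    buffer = new_route_buffer()
    for i, name in enumerate(waypoints):
        buffer[i] = name
    return buffer


route = load_route(["AAA", "BBB", "CCC"], 35_000)
```

Because the validator and the buffer both read `MAX_WAYPOINTS`, changing that one value changes what is accepted and how much storage is reserved together, which is the property the comment attributes to schema-derived constants.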

            1. Anonymous Coward

              Re: Resiliency – we've heard of it

              what doesn't help is people defining schemas who don't understand schemas.

              I've had one idiot send me a schema that contained an optional entity, so I made my output to them have an optional entity, days later their system started rejecting sent data due to missing entity, they had made the optional entity required!

            2. david 12 Silver badge

              Re: Resiliency – we've heard of it

              The point is, that a flight plan is a program. It's not "developer written source code", it's generated source code, or "pilot written source code". That program is eventually executed by a pilot in an airplane, but it goes through a transform in the route scheduling system.

              Formal definition of programs is a thing, but it's not the same as schema validation.

              1. Anonymous Coward

                Re: Resiliency – we've heard of it

                you seem to have confused data and code. are you a pointy hair boss? or a mangler? that's the type of mistake manglers and bosses make and why we end up with shit like this.

      4. Richard Pennington 1
        FAIL

        Re: Resiliency – we've heard of it

        Input sanitation is when you know you're feeding it crap.

    3. vogon00

      Re: Resiliency – we've heard of it

      Input validation 'with extreme paranoia' is mandatory for me I'm afraid - for just this sort of reason.

      Just had to hook up our invoicing system with a carrier's JSON API ... and had to go through a suite of compliance testing with them on the test system before being signed off and getting the API key for 'prod'.

      If I can do it, why can't NATS?

      1. sabroni Silver badge

        Re: If I can do it, why can't NATS?

        Because a flight plan isn't the same thing as an invoice?

    4. Anonymous Coward

      Re: Resiliency – we've heard of it

      This is a precursor to a wonderful government contract to update the system (again I think) to include AI when it's just if and else statements parsing strings. I will bid $1 billion dollars for it but it will probably go to Infosys or Crapita.

    5. bazza Silver badge

      Re: Resiliency – we've heard of it

      Er, if you read the article it strongly suggests that the issue was caused by the system validating its input, not liking what it saw, rejecting incorrectly formatted flight plans, experiencing this a lot thus triggering some kind of "there's too much of this going on" exception and passing the problem straight back to the human operators. It sounds like it's operated exactly as designed, and exactly as someone has decided it should.
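To make that reading concrete, here is a toy Python model of "validate, reject, and give up on automation if rejections pile up". It is only a sketch of the behaviour this comment guesses at: the class name, the three-field validity rule and the threshold of three consecutive rejections are all invented, and nothing here is known about the real NATS logic.

```python
class PlanProcessor:
    """Toy model: reject bad plans one by one, but stop automation
    if too many consecutive plans are rejected."""

    def __init__(self, max_consecutive_rejects: int = 3):
        self.max_consecutive_rejects = max_consecutive_rejects
        self.consecutive_rejects = 0
        self.mode = "automatic"
        self.manual_queue = []

    def submit(self, plan: str) -> str:
        if self.mode == "manual":
            self.manual_queue.append(plan)
            return "queued for manual handling"
        if self._looks_valid(plan):
            self.consecutive_rejects = 0
            return "accepted"
        self.consecutive_rejects += 1
        if self.consecutive_rejects >= self.max_consecutive_rejects:
            self.mode = "manual"  # hand the problem back to the operators
        return "rejected"

    @staticmethod
    def _looks_valid(plan: str) -> bool:
        # Placeholder rule: three whitespace-separated fields.
        return len(plan.split()) == 3


processor = PlanProcessor()
outcomes = [processor.submit(p) for p in ["A B C", "bad", "bad", "bad", "D E F"]]
```

Note that once the threshold trips, even the perfectly valid final plan is pushed to the manual queue, which is the design trade-off being questioned in this thread.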

      I think one can question the design and the thinking behind it, but a lack of input validation does not seem to be the cause.

      1. Spazturtle Silver badge

        Re: Resiliency – we've heard of it

        They said that they would be able to fix it very quickly and it won't cause disruption if it happens again. So I suspect it might simply have been that nobody working there even knew they had to manually clear invalid flight plans and it took 4 hours to get in touch with the guy who retired 30 years ago.

    6. JulieM Silver badge

      Re: Resiliency – we've heard of it

      Literally the next lesson in any programming course after "accepting input from a user" is "Checking the user's input is within a sensible range".

      This story is an obvious fake. The system could simply never have worked for this long in such a broken state.

      There is something else going on, that we haven't been told about.

      1. Anonymous Coward

        Re: Resiliency – we've heard of it

        “There is something else going on, that we haven't been told about.”

        Agreed.

        I blame the cyclists.

      2. sabroni Silver badge

        Re: There is something else going on, that we haven't been told about.

        Never attribute to malice that which is adequately explained by stupidity https://en.wikipedia.org/wiki/Hanlon%27s_razor

        1. JulieM Silver badge

          Re: There is something else going on, that we haven't been told about.

          That's precisely the trouble: stupidity alone cannot adequately explain the fact that it worked for this long without a problem of this magnitude.

          This can best be described as a trans-Hanlonian moment.

  3. jonha
    FAIL

    It's getting harder and harder for those excitable papers to blame everything on the EU (not that some don't still try hard) so it's of course the turn of the French.

    As to the fail-safe strategy of shutting down everything on running into invalid data, it's hard to say whether that's appropriate or not without knowing a lot about the systems involved.

    What certainly IS strange though is that a backup system (that is there precisely in case No 1 fails) has apparently been fed the very same crap... which produced the same result. Resilience?

    1. Valeyard

      What certainly IS strange though is that a backup system (that is there precisely in case No 1 fails) has apparently been fed the very same crap... which produced the same result. Resilience?

      presumably though a backup would be useless without all current flight plans, this happening is probably less likely and preferable to a backup coming online and being next to useless because it's acting on information an hour old with planes flying around it doesn't know about

      1. steamnut
        FAIL

        Uh?

        So why bother with a backup if you can never use it?

        1. Anonymous Coward Silver badge
          Facepalm

          Re: Uh?

          In case the hardware dies

    2. Anonymous Coward

      Systems I've worked on with disaster recovery have generally had the DR recovery point current or within a few seconds, usually using disk block level replication (e.g. EMC SRDF) or data transfer (e.g. Oracle Dataguard). As pointed out, you don't want to lose data when your primary goes down, but the problem is that the GIGO (Garbage In, Garbage Out) process takes both down at once, just the same as any other kind of data corruption would.

      I've seen many comments on the Reg where people ask about disaster recovery/backup systems ignoring the fact that if the problem is the data, your DR system generally has the same problem as live.

      In this case, the data absolutely should have been validated earlier and avoided the lockup that the system achieved. Where that flaw lies will undoubtedly result in a chunk of blame application, bus throwing and scapegoat discovery.

    3. The Basis of everything is...
      Holmes

      Maybe I'm old-fashioned, but if I have two identical systems fed the same data, then I'd expect them both to give the same result. Especially if it's something safety critical like making sure airplanes that have been carefully shepherded into close proximity to each other don't actually bump heads.

      Or maybe there's a missing irony icon?

      (Not that you'll find much iron in an airplane of course, or these days much metal at all given the preference for making everything out of araldite and burnt string.)

      1. Inventor of the Marmite Laser Silver badge
        Alert

        The problem with having duplicated systems is that when they give different results, which one do you believe?

        1. Ken Moorhouse Silver badge

          Re: which one do you believe?

          This is bad enough in synchronous systems, where data changes occur at the same time, due to a system-wide clock pulse, or an SQL transaction execution, but the world is generally asynchronous in nature. So two systems that are not tightly synchronised will inevitably be lagging or leading slightly, resulting in transient anomalies.

      2. Someone Else Silver badge

        I would suspect that the redundant/backup/second system should have been powered up without being exposed to the troublemaking flight plan...

    4. Richard Pennington 1
      FAIL

      Backup system designed to fail in the same way.

      Common-mode failure. Not very resilient. But reproducible.

  4. Vikingforties
    Trollface

    If Only ...

    .. your old company was free of similar technical SNAFUs while on your watch Willie, but a lot of us know different. And then had to suffer the hassle of what passes for BA's refund "system".

    @El Reg, do you have an icon for pot and kettle?

    1. The Basis of everything is...
      FAIL

      Re: If Only ...

      And if I recall correctly, continually criticised NATS about their charges and blocking investment in the systems and processes that only exist to make sure his aeroplanes get to where he wants them to go so he can make money.

    2. Flicker

      Re: If Only ...

      Instructive to look at who actually owns NATS and is thus ultimately responsible for their investment priorities... so from their website:

      Our ownership

      A public private partnership

      NATS is a public private partnership between the Airline Group, which holds 42%, NATS staff who hold 5%, UK airport operator LHR Airports Limited with 4%, and the Government which holds 49% (the golden share).

      The Airline Group comprises:

      USS Sherwood Limited

      British Airways PLC

      Pension Protection Fund

      easyJet Airline Company Limited

      Virgin Atlantic Airways Limited

      Deutsche Lufthansa AG

      Thomson Airways Limited

      Thomas Cook Airlines Limited

      and the irony of then having rent-a-gob Wee Willie mouthing off as usual...

  5. adam 40 Silver badge
    WTF?

    Le Brexit

    I am incredulous that none of your regular commenters have blamed BoJo yet!!!

    - De'guste' de Tunbridge Wells

    1. xyz Silver badge

      Re: Le Brexit

      Well it was a full Boris level shitshow.

      Just checking to see if El Reg falls over if you use an apostrophe. O'Reilly.

    2. elsergiovolador Silver badge

      Re: Le Brexit

      BoJo brought Sunak who brought IR35 changes that culled experts from projects across multiple sectors.

      1. sabroni Silver badge
        Thumb Up

        Re: IR35 changes that culled experts from projects across multiple sectors.

        Aww, did your tax dodge stop working? Don't worry, I'm sure you're still vastly overpaid!

    3. martinusher Silver badge

      Re: Le Brexit

      BoJo was just one of the last in a whole series of 'somewhat less than competent' government figures. A fun one, erudite and so on but still not that good at his job. Unless his job was "to deliver anything remotely profitable into the hands of hedge funds". The result doesn't necessarily spell disaster overnight since entropy usually takes some time to do its work but the corrosive effects of undercapitalisation and undermanning eventually do the trick.

      (The problem is that these funds often make the real money by liquidating assets -- the corporate carcass is often worth a lot more dead than alive -- and in this case, as with other essential services like the railways, the carcass has to be kept on life support. (But that's what the government's for....))

      1. Lomax

        Re: Le Brexit

        > the carcass has to be kept on life support. (But that's what the tax payer is for....)

        FTFY

      2. NXM Silver badge

        Re: Le Brexit

        Depends on what you think the fluffy-headed imbecile's job was. If it was to distract attention from those actually pulling the strings, like Zaphod Beeblebrox, then he did his job perfectly. If it was to actually run the country then rather less so.

        1. sabroni Silver badge
          Thumb Up

          Re: he did his job perfectly

          At Putin's request, destabilise the western border of the EU while Russia fucks around on the east.

          If we were still part of the EU do you think Putin would be so keen to fuck with it?

      3. sabroni Silver badge
        Facepalm

        Re: A fun one

        You've got a fucked up idea of fun. The man is a lying, immoral scumbag.

        He had a funny haircut though and he looked hilarious on that zipwire so let's fuck the economy up!

  6. ChoHag Silver badge

    <BA> Fail safe? We've heard of it.

    This is why you don't put manglers in charge of anything important.

    How many deaths did this "chaos" cause? How long would it take to count the bodies if the system were designed by a CEO?

  7. Ken Moorhouse Silver badge

    How about unicode?

    One of those characters that looks like a regular one, but isn't?

    1. John Brown (no body) Silver badge

      Re: How about unicode?

      I'm fairly sure they aren't using OCR to read in printed copies of flight plans.

      A unicode character that "looks like" an ASCII or UTF-8 character only looks that way to a human. The computer sees the unique encoding and shouldn't get confused :-)

    2. c203

      Re: How about unicode?

      You mean things like é è etc? If so, the system was brought down by a bunch of French letters.

      1. Dan 55 Silver badge

        Re: How about unicode?

        Flight plans are in capital letters only according to this.

        So it can't be choking over a character which is not capital, or numeric, or one of a few symbols since this is so simple to validate.
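
        A minimal sketch of that sort of character check, for what it's worth. The whitelist here (capitals, digits, space, newline and a few symbols) is my assumption, not taken from the linked document, and `find_illegal_chars` is a made-up name:

```python
import string

# Assumed whitelist: capital letters, digits, space, newline and a few symbols.
# The real permitted set comes from the flight plan standard, not from this sketch.
ALLOWED_CHARS = set(string.ascii_uppercase + string.digits + " \n/-()")


def find_illegal_chars(message: str) -> list[tuple[int, str]]:
    """Return (position, character) for each character outside the whitelist."""
    return [(i, ch) for i, ch in enumerate(message) if ch not in ALLOWED_CHARS]
```

        Reporting the position as well as the character means a rejected plan can go back to whoever filed it with a note a human can act on. It only checks characters, of course, and says nothing about whether the fields make sense.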

        1. Ken Moorhouse Silver badge

          Re: How about unicode?

          Dan: Thanks for the informative link.

          "since this is so simple to validate"

          Haha, agree 100% but there is a saying about "woods and trees" which we should be mindful of.

          The validation of flight plan text, even if restricted to capital letters, can look surprisingly complex. No doubt there is a succinct regex* which will do the trick, but I would hand-craft the validation (to be sure!) and I can imagine some things falling through the validation if not careful. Why did they veer off-piste to mandate a slash character between fields 6 and 7?

          *any offers?

      2. Ken G Silver badge
        Coat

        Re: How about unicode?

        surely those provide input protection?

      3. Ken Moorhouse Silver badge

        Re: If so, the system was brought down by a bunch of French letters.

        aka penetration testing.

      4. Someone Else Silver badge

        Re: How about unicode?

        How about a non-breaking space (U+00A0), perhaps? Or a word joiner (U+2060), or maybe a figure space (U+2007)?

  8. Androgynous Cupboard Silver badge

    My new favourite euphemism

    "Our systems, both primary and the backups, responded by suspending automatic processing to ensure that no incorrect safety-related information could be presented to an air traffic controller or impact the rest of the air traffic system,"

    This guy is good. Very good. Sir Humphrey-level good. Chapeau, if that's not too inappropriate.

    On encountering the wall, the car responded by suspending forward motion to ensure that no further processing of incorrect input data would take place and impact the rest of the vehicle

    1. veti Silver badge

      Re: My new favourite euphemism

      Yes it was a fuckup, but give them some credit. Just think about what other contexts the word "impact" might have occurred in, in this story.

      That's the problem there. You can't "just reject" safety critical data. It's telling you something, and you can't afford not to know what that is. Else there'll soon be a mystery plane weaving through your crowded sky with nobody knowing what it is or where it wants to go.

  9. Tron Silver badge

    NATS should refund everyone - airlines and passengers.

    And no %$%ing bonuses for NATS staff this year.

    1. Anonymous Coward
      Anonymous Coward

      Re: NATS should refund everyone - airlines and passengers.

      Except for higher management of course.

  10. Lomax
    Holmes

    Before the horse, cart

    I would put the back-up system in front of the main system instead, and let it process (but not execute) incoming plans - should an incoming plan lead to >X number of alterations to existing plans, or whatever other safety limits you want to set, reject it and alert the wetware. There would be no need for a third back-up system; the "back-forward" system could still serve as a hot spare since it would have to be kept up to date with the state of the main system in order to perform such tests. This architecture would reduce the reliance on error checking the messages themselves (which has a near infinite number of blind spots), by including their effect(s) on the actual system.
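
    Roughly what I have in mind, as a sketch only. `apply_plan`, the dict-of-routes state and the `max_alterations` limit are all invented for illustration and bear no relation to how the real system stores anything:

```python
import copy


class PlanRejected(Exception):
    """Raised when the dry run fails or exceeds the safety limit."""


def apply_plan(state: dict, plan: dict) -> int:
    """Hypothetical stand-in for the real processing.

    Applies the plan's route changes to `state` and returns how many
    already-filed flights had their route altered.
    """
    altered = 0
    for flight_id, route in plan["changes"].items():
        if flight_id in state and state[flight_id] != route:
            altered += 1
        state[flight_id] = route
    return altered


def submit(main_state: dict, plan: dict, max_alterations: int) -> int:
    """Process the plan on a copy first; touch the main state only if that is safe."""
    shadow = copy.deepcopy(main_state)
    try:
        altered = apply_plan(shadow, plan)
    except Exception as exc:
        # The copy took the hit; the main state has not been touched.
        raise PlanRejected(f"dry run crashed: {exc!r}") from exc
    if altered > max_alterations:
        raise PlanRejected(f"{altered} existing plans altered, limit is {max_alterations}")
    return apply_plan(main_state, plan)
```

    A plan that blows up in processing, or that would rewrite too many existing plans, ends up as a `PlanRejected` for the wetware to look at while the main state carries on unchanged. The obvious cost is doing the work twice for every plan.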

  11. Stu J

    Validating the input...

    ...is all well and good until your validation code crashes for some reason that wasn't anticipated, and your error handling fails to DLQ the message for a related reason, and instead crashes causing the message to stay on the head of the queue to be processed.

    Having seen this very issue, also in an aviation environment, I'd have expected that they should have a way to manually junk messages off the head of the processing queue - will be interesting if we ever find out if that's the case.

    I do wonder if their decision to "go manual" may have actually made it harder to recover, if they then had to manually filter out messages which had already been manually processed, before turning the automation back on...?
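
    For anyone who hasn't met the stuck-at-the-head-of-the-queue problem, a minimal sketch of the usual defence. This is a generic in-memory illustration, nothing to do with NATS's actual queueing, and `drain`, `handler` and `dead_letters` are names I've made up:

```python
from collections import deque


def drain(queue: deque, handler, dead_letters: list) -> list:
    """Process every message; one that makes the handler raise is parked, not retried forever."""
    results = []
    while queue:
        message = queue.popleft()  # take it off the head before doing anything risky
        try:
            results.append(handler(message))
        except Exception as exc:
            # Park the raw message untouched. Nothing on this path parses it,
            # so the fault that broke the handler is less likely to break the parking.
            dead_letters.append((message, repr(exc)))
    return results
```

    The trade-off is that removing the message before processing means a hard crash of the whole process loses it, so a real system would want the parking step to be durable. The point is only that the error path should be simpler than the path that failed.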

    1. bazza Silver badge

      Re: Validating the input...

      Certainly, having a way to at least junk old, irrelevant data (plans for flights that would already have happened) would make sense. If a flight plan refers to a date/time in the past, it's never going to happen.

      Clearly there is an old and carefully thought out design decision lurking in the background to this.

      I've no idea what data and what format is in a flight plan, but I suspect that it's designed around a text file in an agreed format. That is a way in which the general / private aviation community could formulate a flight plan by hand and actually get up in the sky without the need for some specialised software. Clearly, some specialised software and thoroughly validated data exchange with organisations like NATS would be more reliable with less human overhead, but it would be bound to price the general / private aviationist out of the sky.

      No doubt there is much talk of "is it time for a change?" going on, but I kinda hope that they stay as is and don't let the interests of big business result in small businesses / private individuals being excluded.

  12. 5n0wcha1ns

    When a "." causes a FULL STOP

    How can they have left such a fatal error in such critical code? Fair enough it was probably all written in COBOL some 60 years ago, but all the same...... .

    1. Anonymous Coward
      Anonymous Coward

      Re: When a "." causes a FULL STOP

      are you kidding?

      it is written in an excel 4.2 sheet.

    2. david 12 Silver badge

      Re: When a "." causes a FULL STOP

      Fatal air systems failure?

  13. vapoureal

    Sign-off

    When I read these stories I always :-) and think back to the day: someone signed off this use case and design.

  14. spold Silver badge

    Reset it

    CTRL+Altitude+Delete

  15. Norman Nescio Silver badge

    Functional spec

    The functional specification for the American (FAA) system that converts Flight Plans into Routes is here:

    [pdf] FAA: NATIONAL AIRSPACE SYSTEM: En Route: CONFIGURATION MANAGEMENT DOCUMENT: COMPUTER PROGRAM FUNCTIONAL SPECIFICATIONS: ROUTE CONVERSION AND POSTING

    The UK system is based upon it, but has diverged.

    I posted it previously, but I think the conversation has moved on to the new article.

    Acknowledgement to CBSITCB of PPRuNe for pointing out the original information.

    Note that the system validates the input to check that it is properly formed, but in this case it looks like it failed to calculate a route from properly formed input: in other words, the form was likely filled in correctly, but the content of the form likely caused the processing to fail, possibly via an unhandled exception.

    I hope any report goes into sufficient technical detail, in a similar fashion to AAIB reports.

    1. PerlyKing

      Re: Functional spec

      So what you're saying is that it might have thrown a YouCantGetThereFromHereException ?

      1. Simon Ritchie

        Re: Functional spec

        When discussing the limitations of these systems, you need to remember that they may well have been designed and written in the 1960s. The NATS system may have at its core a COBOL program running in simulation, firmly convinced that it’s running on an IBM mainframe, reading input from 80-column punched cards and writing output to a line printer loaded with music paper. My guess is that in those days, reverting to manual may have been a sensible reaction to bad input data, but over the years, the throughput has increased a tad, and the humans who have to intervene to fix the data couldn’t do it fast enough on a day of high traffic such as August Bank Holiday Monday.

        Also remember that any change to the input protocol to make it more resilient would demand changes to all of the hundreds of pieces of software running back at the airlines and the airports, feeding the flight data into the NATS system. Any change would have to be agreed and then implemented by all of the players. The airlines that are now screaming for compensation may be the very same airlines that have resisted any change to the system for the last few decades, because it would involve them spending money on updates to their own software.

  16. Vader

    Does seem a bit odd that a dodgy flight plan would knock it out. Forget taking control, we can't even control our airspace without messing it up.

  17. Anonymous IV
    Happy

    Data exception

    Pretty clearly there must have been an S0C7 abend, which required the NATS technical staff to look at a core dump for several hours.

    [Oh, the 1970s are over now?]

  18. John_Ericsson

    Can we have a sweepstake on how many years and how much over budget the replacement system will be? How about we all meet up here every 5 years until we know the answer.

  19. John Robson Silver badge
    Black Helicopters

    Why do I get the feeling

    That they've put a check in for this particular failure case, and not thought about adjacent or otherwise related possible failures.

  20. Screwed

    My first ever program caused the IBM mainframe to snarl up. Early 1970s, at school, punched with a manual card punch and a simple Fortran program.

    Seems I made a mistake with a punched card but not sure if it was mis-punched or left out. Should have been something like:

    #IOCS NONCONTROL

    Got a curt note back with my stack of cards.

    While I never made that mistake again, one of the joys of computers is the seemingly infinite number of opportunities to screw things up.

  21. Binraider Silver badge

    Input validation. You know, that thing that you (hopefully) are writing a significant chunk of your code to do?

    To be fair to NATS, it's run for an awfully long time without incident. The results of the investigation will be fascinating to see, assuming they aren't swept under the carpet.

    Like the error on Ariane 5, it's better if the causes are published but I'm not sure responsible parties will see it that way.

  22. john.w

    Days of delays are their fall back.

    It is clear that NATS considers days of flight delays for tens of thousands of passengers an acceptable backup solution.

    1. Dave@Home

      Re: Days of delays are their fall back.

      Depends how you look at it.

      If safety is your number one priority, then suspending flights for a time is fine.

      As to the capacity to recover, well a lot of that is on the airlines. Ryanair et al. all run their crews close to the wire on operating hours already, add in airport curfews and the decision post pandemic to favour narrow body aircraft which means more flights per hundred heads and the capacity is pretty much redlining.

      1. wbqqq

        Re: Days of delays are their fall back.

        I think that the time to recover is actually more key than the 'failure' of the NATS system. From a system perspective - the issue of redundancy is more an issue for the airlines - granted the NATS stoppage caused an across-the-board 4 hour delay, but the fact that it takes a week (rather than a day) to recover from the shock is due to most airlines running at 90%+ efficiency. If the airlines ran at 75% efficiency, most of the impact would probably have been addressed in a day, and the costs to the airlines more like £20m or less. But then they'd have to have higher fares...

        Ultimately, Low Fares => high efficiency => low redundancy => long time to recover from a shock

        ATC to pay compensation => higher ATC fees => higher fares

        It really comes down to a trade-off between higher fares or risk of delays or compensation (cost<->time<->quality?)
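
        A back-of-envelope version of the efficiency argument, as a toy model only. It assumes demand arrives at a steady fraction of capacity, nothing moves during the stoppage, and afterwards only the spare capacity eats into the backlog; `recovery_hours` is a made-up name and real recovery is also limited by crew hours and curfews:

```python
def recovery_hours(stoppage_hours: float, utilisation: float) -> float:
    """Hours needed to clear the backlog from a full stoppage, in a toy model.

    Demand is `utilisation` x capacity, so the backlog after the stoppage is
    stoppage_hours x utilisation, and it is worked off at (1 - utilisation).
    """
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    backlog = stoppage_hours * utilisation
    spare = 1 - utilisation
    return backlog / spare
```

        For a 4 hour stoppage that comes out at about 36 hours at 90% and 12 hours at 75%. The absolute numbers are too optimistic, but the ratio of three to one shows how quickly recovery time grows as the slack disappears.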

  23. ScottishYorkshireMan

    you mean it wasn't a ....

    DRONE?

    I thought anything that went wrong nowadays with ATC was always the fault of a drone, somewhere.

    I suppose the interference story is, "no you cannot say Cyber Attack, look, just blame it on the French".

  24. Luiz Abdala
    Holmes

    Airspace quilt.

    Is the EU airspace still that hodgepodge of a dozen or more restricted zones that you must avoid on a single flight, or did they streamline things a little bit?

    Not saying the two are related, or the cause, this time...

    You know, if you have an extremely complex flight plan just because NATO, Russia, etc, you could prevent an automation meltdown in the first place if they were greatly simplified, perhaps?

  25. Anonymous Coward
    Anonymous Coward

    Here's the real problem..

    We now basically have a valid, defined and proven Denial of Service attack for all of UK's airspace :(.

    I hope their comms is secured - a bit of noise and fuzzing appears to be all it takes to take the whole shooting match down and there are enough malicious w*nkers out there who would do this for funzies.

    At least it failed safe.

  26. cantankerous swineherd

    safe to assume that matey doesn't know very much about the system.

  27. Fr. Ted Crilly Silver badge

    Wasn't

    Shapps in charge for a while? that might explain it...

  28. Nano nano

    test data ?!

    Not sure why the "excitable" tabloids should be pleased it was a French airline that provided the test data that showed up bugs in the UK NATS system ...

    1. Dan 55 Silver badge

      Re: test data ?!

      The chances of their readers working out the same flight plan didn't crash French ATC or Eurocontrol is low (if it really was a flight plan from a French airline). The chances of the British press making that point clear in their articles is non-existent.

    2. Potemkine! Silver badge

      Re: test data ?!

      Ever heard about Perfidious Albion? :-P

  29. BebopWeBop
    Facepalm

    Some of the UK's more excitable tabloid media outlets have already reported that a French airline's flight plan submission may be to blame.

    It didn't take too long for the French to be blamed. The Times did it as well - but then it is just another tabloid.

  30. Dan 55 Silver badge

    So it turns out the flight plan loader will crash if a flight plan includes two different waypoints with the same name in two different non-UK airspaces. This is, even in the head of NATS' own words, "perfectly compliant", as each airspace is responsible for naming its own waypoints.

    How they fixed this in less than a day is anyone's guess, but it sounds like a temporary bodge to me.

    But at least the French got blamed over this, that's the main thing.

    1. Seajay

      Yeah - so part of the problem stems from the fact that the world has duplicate waypoint names all over the place (despite ICAO and other bodies' work to remove them), and the standard just requires them to be geographically distant. In this instance they were apart by 4,000 nautical miles.

      Basically the software has to extract the UK part of the route from the (perfectly valid) flight plan - the failure here came in the logic that extracts the entry/exit waypoints in UK airspace, where the exit waypoint doesn't have to be specified, but can be searched for the next location. In this flight plan it appears the next waypoint was a designation which matched the UK entry waypoint (also outside UK airspace), a situation which the software appears to have had no handler for other than "last resort" log and halt. (Crash!) It then moved over to the backup system, which processed the same data and did the same thing. It required the manufacturer to be able to help pinpoint the specific flight plan that had caused the crash and then get it back into service.
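
      To illustrate the kind of trap a duplicate name sets, here is a sketch. It is emphatically not the real NATS logic: the route is just a list of made-up waypoint names, and the function and exception names are my own.

```python
class RouteExtractionError(Exception):
    """Raised for one bad plan, so the caller can set it aside and carry on."""


def naive_exit_index(route: list[str], exit_name: str) -> int:
    # Searches from the start of the route, so it returns the first match,
    # even when that match lies before the entry point.
    return route.index(exit_name)


def exit_index_after_entry(route: list[str], entry_index: int, exit_name: str) -> int:
    # Only accept a match that comes after the entry point.
    for i in range(entry_index + 1, len(route)):
        if route[i] == exit_name:
            return i
    raise RouteExtractionError(f"no {exit_name!r} after position {entry_index}")
```

      With a route like `["AAA", "DUP", "BBB", "CCC", "DUP", "EEE"]` and an entry at position 2, the naive search puts the exit at position 1, before the entry, which is nonsense. The second version finds position 4, and when it finds nothing it raises an error scoped to that one plan instead of halting everything.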

      Full preliminary report is here - fascinating reading if you're into that sort of thing.

      https://publicapps.caa.co.uk/docs/33/NERL%20Major%20Incident%20Investigation%20Preliminary%20Report.pdf

      1. Norman Nescio Silver badge

        Unique

        It sounds like:

        (a) Waypoint names are not globally unique (and this will not change quickly, if at all).

        (b) A certain feature of the UK software operated using waypoint names with an implicit assumption that they were sufficiently unique for its purposes. That implicit assumption turned out to be incorrect.

        It strikes me that a quick workaround is to reject flight plans with duplicate waypoint names, requiring them to be manually processed*.

        A longer workaround would be to apply a local remapping of non-unique waypoint names to unused strings in the same namespace to make them unique for further processing**. Unfortunately, it would require a post-processing of a route to convert the transformed waypoint identifiers back into the non-unique versions so the subsequent people using the generated route don't have to alter all their systems and processes.

        * It is perfectly possible to have valid flight plans with the same waypoint more than once. A simple way is to fly in a circle as part of your route.

        ** If you have run out of space in the variable (i.e. used all possible names that fit in the allocated storage), the problem gets trickier. Converting the waypoint identifier into a structure that, for example, holds the latitude and longitude of the waypoint as well as its identifier is a non starter, as that requires an extensive code rewrite. The geographic information might even be available to the system, but if it is not written to use it to make the waypoints unique for processing, the rewriting would be a huge job.
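
        A sketch of the quick workaround as a pre-check, under the large assumption that each waypoint in the plan arrives with a latitude and longitude as well as its name. The function names and the 1 nautical mile tolerance are invented. It flags only names that appear at clearly different positions, so flying a circle through the same waypoint is left alone:

```python
import math
from collections import defaultdict

EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles


def haversine_nm(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in nautical miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(h))


def conflicting_names(waypoints, tolerance_nm: float = 1.0) -> list[str]:
    """Names used for more than one distinct position within a single plan.

    `waypoints` is a sequence of (name, (lat, lon)) pairs in route order.
    """
    positions = defaultdict(list)
    for name, pos in waypoints:
        positions[name].append(pos)
    return sorted(
        name for name, seen in positions.items()
        if any(haversine_nm(seen[0], p) > tolerance_nm for p in seen[1:])
    )
```

        A plan for which this returns a non-empty list would be routed to manual processing with the offending names attached.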

        1. Ken Moorhouse Silver badge

          Re: a quick workaround is to reject flight plans with duplicate waypoint names

          Feeding all flight plans through a pre-processor would arguably remove the human element and no doubt provide hard statistics for management meetings to provide a budget to turn the pre-processor into one that automates* waypoint renaming.

          *That is too strong a word for the process I'm thinking of. Couldn't there be dummy waypoints which help nail an ambiguous waypoint to a hard location? Expand the flight plan to include the dummy.

  31. Richard Pennington 1

    Update on NATS data-driven outage

    It is being reported that the outage was due to a genuine data problem: a submitted flight plan included transit via two identically-named waypoint markers.

    https://www.bbc.co.uk/news/business-66723586

    So, if the reporting is correct, there are actually two problems:

    [1] There should not have been two identically-named waypoint markers, and so at least one waypoint marker needs to be renamed.

    [2] The NATS software responded incorrectly: it should have thrown out the offending flight plan, with a human-intelligible note stating why it was being thrown out. [Like a human would have done].

    Incidentally, was the NATS software running in the cloud?

  32. AndrewB57

    Interim Report available here: https://www.caa.co.uk/our-work/publications/documents/content/cap2981/

    Of interest to software engineers and system designers.

    Also to those who collect acronyms
