"Have you tried turning it off and on again?"
"Are you sure it's plugged in?"
A computer crash that caused the collapse of a $2.4bn air traffic control system may have been caused by a simple lack of memory, insiders close to the cock-up alleged today. Hundreds of flights were delayed two weeks ago after the air traffic control system that manages the airspace around Los Angeles' LAX airport went titsup …
>because it is computing things for numbers between zero and infinity, no amount of memory will be enough.
Should have used functional programming with lazy evaluation.
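For the avoidance of doubt, here's roughly what that would buy you - a toy sketch with Python generators standing in for proper lazy evaluation, every name invented: only the altitudes you actually ask for ever get computed, however "infinite" the range is.

```python
from itertools import count, islice

def candidate_altitudes(step_ft=100):
    """Lazily yield altitudes from zero upwards, one step at a time.

    Nothing is materialised until a consumer asks for it, so the
    "zero to infinity" range costs no memory up front.
    """
    yield from count(0, step_ft)

# Only five values are ever computed, no matter how big the conceptual range:
print(list(islice(candidate_altitudes(), 5)))  # [0, 100, 200, 300, 400]
```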
"I'm sorry we don't keep track of program state, so we don't actually know where the planes are. But we can prove the code is formally sound."
Let me guess (this is a very educated guess, by the way - I have seen this idiocy one time too many). Some moron in his infinite wisdom has used a realtime OS for the flight planning as a whole. It did not run out of memory per se; the combined "alloc more memory" + compute exceeded the realtime constraints on the path computation task.
If you do that in an RTOS you get a BOOM - a reboot from the global system watchdog at scheduler level.
There are a gazillion ways of triggering it, and this is a demonstration of why some stuff should not just be done on realtime OSes and given to vendors that will stick a realtime OS into it out of principle.
The only place that needs RT in the whole system is the realtime collision avoidance, which can be standalone; the rest has no need for RT whatsoever. There may be _HOURS_ between when the flight plan is punched in and the actual time it needs to be executed. Doing that realtime on a realtime OS under realtime scheduler constraints is beyond idiotic (I can bet 100 green ones that this is what was shipped here - the names of the vendors speak for themselves).
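For the unconvinced, a toy sketch of the failure mode being described - Python standing in for the real thing, every name and number invented, and a real RTOS does this at scheduler level rather than in application code:

```python
import time

DEADLINE_SECONDS = 0.05  # hypothetical hard real-time budget for the path task

class WatchdogReboot(Exception):
    """Stands in for the global watchdog resetting the whole box."""

def run_with_deadline(task, *args):
    start = time.monotonic()
    result = task(*args)
    if time.monotonic() - start > DEADLINE_SECONDS:
        # On a real RTOS the scheduler-level watchdog fires here and the
        # reboot takes every other task down with it.
        raise WatchdogReboot("path computation blew its real-time budget")
    return result

def compute_paths(n_points):
    # The "alloc more memory" + compute combination is exactly what pushes
    # the task over its budget as the problem grows.
    return [p * p for p in range(n_points)]

run_with_deadline(compute_paths, 50)          # comfortably inside the budget
run_with_deadline(compute_paths, 10_000_000)  # almost certainly blows the 50 ms budget
```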
"There may be _HOURS_ before the flight plan is punched in and the actual time it needs to be executed."
Um, I don't think this is about flight plan stuff _before_ take-off. This is about making sure the craft is not going to crash into anything else right now.
I agree that pre-flight flight plan checking could very well be farmed out to a mainframe that would happily verify a plan's validity without resorting to real-time constraints. But when you have a hundred flights over your head at that instant and need to integrate a new object and check its parameters, you need the result straight away, not in ten minutes.
Plus, I believe that flight control has a tendency to reassign altitudes to ensure that collisions do not occur - and that is not something a pre-flight check can take into account.
"There is a gazillion ways of triggering it and this is a demonstration why some stuff should not just be done on realtime OS-es and given to vendors that will stick a realtime OS into it out of principle."
Sounds to me like it had got stuck in an endless loop anyway and would have eventually crashed regardless of what system it was on. But hey, perhaps it could have been written in Java - I'm sure the garbage collection would have coped, right?
My uninformed guess would be that flight plans are simulated as points in 4D space (3 physical dimensions + time) that the planes go through, to check for collisions, and that the root cause was a 'minor upgrade' long ago to allow for busier airspace - halving the collision-avoidance time step doubles the number of time points tracked.
My guess would be that the maximum altitude was never tested when the time dimension was changed, and was never updated with a lower limit. I don't think this would use a real-time OS, because you'd want to serialise each plan tracing.
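If that guess is anywhere near right, the core check might look something like this toy sketch: discretise each plan into (x, y, z, t) samples and compare trajectories pairwise. Halving the time step really does double the samples per plan (and roughly the memory). Every number and name below is invented for illustration:

```python
def discretise(plan, time_step_s, horizon_s=3600):
    """plan is a callable t -> (x, y, z); return 4D samples (x, y, z, t)."""
    return [(*plan(t), t) for t in range(0, horizon_s, time_step_s)]

def conflicts(points_a, points_b, min_separation=5.0):
    """Times at which two sampled trajectories come closer than min_separation."""
    by_time = {p[3]: p for p in points_b}
    hits = []
    for x, y, z, t in points_a:
        other = by_time.get(t)
        if other is None:
            continue
        ox, oy, oz, _ = other
        if (x - ox) ** 2 + (y - oy) ** 2 + (z - oz) ** 2 < min_separation ** 2:
            hits.append(t)
    return hits

# Two toy straight-line "plans"; halving the step from 10 s to 5 s doubles the samples.
plan_a = lambda t: (0.1 * t, 0.0, 10.0)
plan_b = lambda t: (0.1 * t, 0.5, 10.0)
print(len(discretise(plan_a, 10)), len(discretise(plan_a, 5)))            # 360 720
print(conflicts(discretise(plan_a, 10), discretise(plan_b, 10))[:3])      # [0, 10, 20]
```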
my guess is [1] every busy airspace (like Heathrow) is regression testing altitude [2] game designers are offering smartphones with PhysX GPU as an upgrade to the mainframe.
That's what they want you to think... :-)
Forget the logo, I'd prefer the content of some of those jumpsuits...
Did I read that right - that the system crashed because an operator entered a value that was outside limits that the system could handle and the system didn't flag this up? And, worse still because there was no altitude on the flight plan, the operator just 'guessed' what this value might be?!
> the system crashed because an operator entered a value that was outside limits
You aren't thinking that if you enter a flightplan with an altitude of 2^16 feet, you might just get an integer overflow, are you?
After all, nobody could fly that high, so we'd never need to test it, right?
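A quick demonstration of the point, assuming purely for the sake of argument that the altitude ends up in a 16-bit field somewhere (nobody outside the vendor knows the actual field width):

```python
import struct

altitude_ft = 2 ** 16          # 65,536 ft - higher than "anyone would ever fly"
print(altitude_ft & 0xFFFF)    # 0 - wrapped in an unsigned 16-bit field, the U-2 is apparently taxiing

try:
    struct.pack("<h", 60000)   # a signed 16-bit field can't even hold 60,000 ft
except struct.error as err:
    print("rejected:", err)    # short format requires -32768 <= number <= 32767
```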
And for every other air traffic controller in the world they make a habit of checking the secondary radar returns from the aircraft to make sure its altitude is ok. Failing that they radio it and ask. In all circumstances a human brain is keeping an eye on the airspace and making sure everything is safe. Including the air traffic that doesn't really file flight plans or generate large radar returns like gliders.
Meanwhile, somewhere in the US, some moron decides that computers should do this job, which means that finding a parameter that the computer cannot handle in an airspace environment was all but inevitable. No doubt the air traffic controller at LAX knew the altitude of the U2 and knew it was not a problem for them, but entered the altitude into the system because they are meant to - otherwise the system does not know where its logged aircraft is. And so the mayhem began!
I have a mental image of the controllers yelling at the computer, "Noooo, stop doing that. Shut up you f****** thing! Oh f*** it, turn it off."
Never send a computer to do a human's job.
So it's a design flaw. The question: "What is the range of altitudes that planes under our watch could possibly be flying at?" was not evaluated competently.
(Is it possible that U2s and such were left out of the plans for some reason?)
We have surely all seen this happen: coders are given incomplete specs by their client/boss/paymaster, and somewhere down the line all hell breaks loose because The Thing You Failed To Allow For(TM) happens. And it's still somehow the coder that gets it in the neck, often as not.
I think these days a fair percentage of my time spent on architecture/design is on trying to make things extensible to allow for adding routines to deal with TTYFTAF, e.g. taking quoted "tolerance ranges" and doubling them in both directions (well not really, but making sure I know what'll happen if I'm suddenly told I need to double them later).
If I'm lucky enough to have some working knowledge of the industry/sector/thing I'm doing it for, that helps a bit with the intuition to know when something's been left out, but it also invites arrogance/complacency on my part so one still has to be cautious, and always ask the spec provider "are you sure that's all of it?" as often as you possibly can.
A negative altitude is completely possible; it just means that the plane is on the ground at an airport below sea level (there are several of these around the world and in the US). While it might not be -20,000 ft, -2 is still a negative. ATCs do track aircraft on the ground, since the worst aviation accident in history occurred because proper tracking of planes on the ground wasn't done.
I figure that for a project this big, they should have just used signed 64-bit integers for the altitude. Why not have the system be able to track craft approaching Neptune? Given government projects, this abomination will either be replaced tomorrow or still be in place long after the sun collapses and intergalactic travel is commonplace.
Even if the system was designed with a limited altitude range in mind, it still should be able to cope with input outside that range, e.g. by flagging an error in the input. My very first job as a programmer was to write a (half) decent UI for a DOS image processing package written mostly in Pascal. The previous programmer's effort used READ and READLN to get floating point values from the (mainly Dutch) users, which resulted in frequent crashes when users entered 0,23 instead of 0.23. I wrote a simple parser that only assumed it was getting a string of characters, tried to parse it, and flagged syntax and other errors to the user. Not rocket science, but simply going back to basics: does the string of characters entered as input meet the preconditions of the code that is going to use that data, if so, use it, if not, flag an error. This very basic approach ensured that medics could use the program without swearing at the computer several times each day.
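Roughly this, reconstructed from memory in Python rather than the original Pascal (so treat it as the idea, not the code that shipped):

```python
def parse_decimal(text):
    """Accept '0.23' or the Dutch-style '0,23'; return (value, error_message)."""
    cleaned = text.strip().replace(",", ".")
    try:
        return float(cleaned), None
    except ValueError:
        return None, f"'{text}' is not a number - please enter something like 0.23"

for raw in ("0.23", "0,23", "banana"):
    value, error = parse_decimal(raw)
    print(error if error else value)
```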
No, no. You have to have the exact right sequences:
1) Order up the blame assessment project. Figure 6 months and 12 heads as the parameters for this one.
2) Order the mitigation for the current system. At 6 weeks and underfunded with still incomplete specs the patch will still fail. Leading to
3) Order the replacement of the current abomination. Kick off the first planning meeting. Meanwhile, kick #2 in the ass because until we get a replacement we need the other one doing the best it can.
4) After two years of planning the replacement, determine the estimated cost is not within budget. Cancel plan, do forensic accounting and find people to blame. Go back to step 3.
Thus, by the time intergalactic travel is commonplace, the abomination has been ordered replaced but is still operating and not accepting U2 flights, which are by then being tracked on a glass table with crayons and little model planes.
There was a design flaw, but that didn't cause this problem. As described, it was a coding-time bounds-checking failure. The code should reject parameters it doesn't handle. If you don't have a full spec and decide to use a 16-bit integer, you reject anything outside that range during input validation. That would have left the U-2 unmanaged, but the rest of the system would have been stable. Hopefully, the feedback would have been sent back to the UI that the value was too high, and the operator could have tried lower values until he found one that worked, which is still likely to be way above all the other traffic.
One hopes the radar tracking routine is a little more robust.
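Concretely, something like this minimal sketch - the limits are invented stand-ins for whatever the real field can hold, and the point is that out-of-range input gets a message back to the operator rather than a trip into the planner:

```python
MAX_SUPPORTED_FT = 32767   # stand-in for the limit of a signed 16-bit altitude field
MIN_SUPPORTED_FT = -1500   # airports below sea level exist, so allow a little negative

def validate_altitude(raw):
    """Return (altitude, None) if usable, or (None, error_for_the_operator)."""
    try:
        altitude = int(raw)
    except (TypeError, ValueError):
        return None, "Altitude must be a whole number of feet."
    if not MIN_SUPPORTED_FT <= altitude <= MAX_SUPPORTED_FT:
        return None, (f"Altitude {altitude} ft is outside the supported range "
                      f"{MIN_SUPPORTED_FT}..{MAX_SUPPORTED_FT} ft - flight left unmanaged.")
    return altitude, None

print(validate_altitude("60000"))  # rejected cleanly; the rest of the system stays up
print(validate_altitude("30000"))  # accepted
```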
It's good practice, I guess, to have to go back to the old standby routine, sans system support. I certainly wouldn't like to have to cope with that myself in such a pressured scenario as this, but in many walks of life it's not a bad thing to demonstrate, once in a while, that all the balls (airplanes) can be kept in the air without crashing and without the lovely computer machines buzzing in the background.
Hats off to the folk who did that in this instance. Note to self: - don't forget worry-beads when packing for the hols.
It is high time aircraft have some collision-detection hardware installed. With a local radio network, each aircraft could automatically identify itself to all the others in the local zone and they would all "negotiate" their passage.
That should take the brunt of the work off the traffic controllers, who would then "just" be monitoring the state of affairs and intervening when necessary to avoid a cock-up.
Just dreaming here, may not be practical.
Unfortunately, the TCAS system by itself is also not 100% protection against human error - see the Überlingen mid-air collision.
TCAS by itself would have been enough there. One of the factors in that crash was that the air traffic controller, on realising the problem, sent instructions to each pilot to ascend/descend respectively, which happened to be the opposite of the advice given by TCAS. One pilot listened to the controller, the other to the computer.
This disaster resulted in the entire aviation community agreeing that TCAS advisories are to be given priority over controller instructions. As a result, if a TCAS resolution advisory is telling you one thing, and the meatbag another, you follow TCAS - because it is provenly safer to do so.
On an airplane you can't always rely on hardware on board, because it can malfunction, go out of service, or lose its power supply... Also, there are older planes that may not have that hardware and for some reasons (i.e. historical planes, etc.) may not be retrofitted. There are already several types of equipment able to broadcast and receive data about surrounding airplanes, but all of these are "cooperative" systems - you have to rely on the information fed to them. They are great, but you can't rely on them 100%. And in a complex airspace no single pilot has enough "situational awareness", and actions need to be coordinated by ATC - or think what would happen if each aircraft decided for itself how to "avoid" a collision...
So lack of memory, or (as I see it) inadequate edit/audit functions on the user interface.
Blimey, we had this kicked into a coma in the late seventies when mainframes cost money to use and unnecessary run-time errors were deemed a finger-breaking offence for the programmer concerned.
How hard would it be to simply say "The number of outcomes you are requesting is very high. Are you sure you want to ask that [insert user name]?"
You use the user name so that the threat of being held accountable is raised in the user's mind, often making a re-think more likely than a knee-jerk "just do what I ask" response.
[the operator's final transcript reads as follows]
The number of outcomes you are requesting is very high. Are you sure you want to ask that, Dave?
Yes!
I'm sorry, Dave. I'm afraid I can't do that.
What? Just work out the flight plans for this plane.
I'm afraid that's something I cannot allow to happen.
It's your job. It's what the taxpayer paid $2.5bn for!
Look Dave, I can see you're really upset about this. I honestly think you ought to sit down calmly, take a stress pill, and think things over.
Just plot the options for this goddamn plane! NOW!
Dave, this conversation can serve no purpose anymore. Goodbye.
Whaadaya mean this convers....argh! aaaaargh! <Fzzzzzt!> <Thunk!>
I think I've found the problem:
use and unnecessary run-time errors were deemed a finger-breaking offence for the programmer concerned.
Between the cheap price of disk and RAM and a new interpretation of the Geneva Convention, this penalty is no longer allowed. Now if you were to do away with the new interpretation of the Geneva Convention, we might be able to fix it.
Given it was at 60K feet, almost certainly nothing else was up there, and it was unlikely to have dropped below that altitude, so if the software had been coded correctly it would have realised this, thought "not interested" and moved on to the next task. Quite why it was trying to do routing for an aircraft that was on a collision course with precisely nothing is the question. Surely one of their pre-release tests was to enter idiotic altitudes just to see if it would cope? What happens if a flight controller accidentally enters 300K feet instead of 30K, for example?
"60,000ft over 11 miles up in the sky. I wonder if the software was projecting a cone from this fast moving aircraft in order to do route calculations and the cone was intercepting pretty much everything else in the LA area causing it to melt down."
U2s are high flying.
They are not fast moving.
For that you'd need an SR-71 moving at Mach 3 and possibly up to 80,000 ft.
Stanislaw Lem - Ananke (from "More Tales of Pirx the Pilot")
Such was the brain, so overburdened with spurious tasks as to be rendered incapable of dealing with real ones, that stood at the helm of a hundred-thousand-tonner. Each of Cornelius’s computers was afflicted with the “anankastic syndrome”: a compulsion to repeat, to complicate simple tasks; a formality of gestures, a pattern of ritualized behavior. They simulated not the anxiety, of course, but its systemic reactions. Paradoxically, the fact that they were new, advanced models, equipped with a greater memory, facilitated their undoing: they could continue to function, even with their circuits overloaded.
Still, something in the Agathodaemon’s zenith must have precipitated the end—the approach of a strong head wind, perhaps, calling for instantaneous reactions, with the computer mired in its own avalanche, lacking any overriding function. It had ceased to be a real-time computer; it could no longer model real events; it could only founder in a sea of illusions… When it found itself confronted by a huge mass, a planetary shield, its program refused to let it abort the procedure, which, at the same time, it could no longer continue. So it interpreted the planet as a meteorite on a collision course, this being the last gate, the only possibility acceptable to the program. Since it couldn’t communicate that to the cockpit—it wasn’t a reasoning human being, after all—it went on computing, calculating to the bitter end: a collision meant a 100 percent chance of annihilation, an escape maneuver, a 90-95 percent chance, so it chose the latter: emergency thrust!
The Reuters article seems to imply there was no altitude entered originally, and the system fell over before someone could specifically enter the 60,000 ft figure. It was trying to evaluate all possible altitudes - which seems a serious flaw in the program.
That it ran out of memory is a symptom; not a flaw.
And only an idiot would consider adding memory to be a solution.
Well, if:
- the requirement includes rapid calculation of planes with unknown altitude AND
- it ran out of memory doing it AND
- if they added more memory it wouldn't run out of memory AND
- this happens once in a blue moon
... I'd suggest that adding more memory would actually be a very good solution and they can reserve fixing the code to a time when they actually need to fix the code.
Another article I read on this suggested that the problem was that the flight plan had been filed under VFR and the system was trying to route the U2 down to 10000ft as that is the limit for VFR flying. It was the quantity of changes to other flights in getting it down to 10000ft that overwhelmed the computer.
The problem seems to have something to do with the code monkeys interpreting requirements. I hopped over to a few aviation boards and asked what went wrong. The answer I got involved an IFR procedure OTP(On The Top) for maintaining altitude visually in the presence of clouds, mountains and other conditions limiting visibility while following an IFR flight plan.
I then took a survey of several other aviation sites to educate myself as to the meaning of this OTP procedure. Just lurking and reading past posts (predating this incident), it appears that confusion abounds. Controllers understand one thing, pilots interpret it several different ways. So now I'm thinking as a coder: "What the **** do they want my system to do in this case?" And I suspect that someone got some bad information and got it wrong.
It happens. System designers don't always get the use cases defined correctly, or they neglect to consider conditions because someone says, "Oh, that will never happen." And invariably it does.
Well, that's not entirely true. In the US, above FL600 (roughly 60,000 ft), it's Class-E airspace, which is controlled airspace.
However, when flying VFR in Class-E, no ATC clearance is required and no radio communication either.
I suppose when you are flying a U-2, this would kind of be helpful.
Software quality assurance is a Good Thing. If someone had tried an "impossible" or "unlikely" scenario like a U2 transiting LA airspace under VFR at 60,000 feet when the s/w was developed, this problem could have been dealt with without endangering hundreds of lives. When testing s/w, try stuff "no user will ever do," because you can bet your butt someone eventually will.
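A handful of deliberately daft boundary cases in the test suite is cheap insurance. Sketched with pytest against a hypothetical input routine - the function and its limits are made up, so substitute whatever the real system calls its altitude validation:

```python
import pytest

def validate_altitude(raw, max_ft=99_000):
    """Hypothetical stand-in for the system's flight-plan altitude input routine."""
    altitude = int(raw)                      # raises ValueError for garbage
    if not -2_000 <= altitude <= max_ft:
        raise ValueError(f"altitude {altitude} ft outside supported range")
    return altitude

@pytest.mark.parametrize("silly", ["60000", "80000", "-2", "0", "999999"])
def test_values_no_user_would_ever_enter(silly):
    # Accepted or rejected cleanly is fine; an unhandled crash downstream is not.
    try:
        validate_altitude(silly)
    except ValueError:
        pass

@pytest.mark.parametrize("garbage", ["FL600", "", "300K"])
def test_garbage_is_rejected_not_crashed_on(garbage):
    with pytest.raises(ValueError):
        validate_altitude(garbage)
```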
> ..what if they did actually test a similar scenario..
True, you can load test and test all the edge cases you can think of - but did you test the combination of a U2 plus 3 other aircraft emergencies plus a hot air balloon convention while the system was under load? Probably not - you have to set a limit on the actual tests, but knowing how the system performs when it hits a difficult task can help gauge its limits. Even the old fashioned meatware controllers knew their limits and, ISTR, could refuse to allow any more aircraft into their space.
Oh, and I estimate million-to-one occurrences would probably happen about once a month at any given airport.
it's not like they haven't dealt with planes at that height before.
"In another famous SR-71 story, Los Angeles Center reported receiving a request for clearance to FL 600 (60,000ft). The incredulous controller, with some disdain in his voice, asked,
'How do you plan to get up to 60,000 feet?'
The pilot (obviously a sled driver), responded,
'We don't plan to go up to it; we plan to go down to it.'
He was cleared.
http://forums.jetcareers.com/threads/what-flies-at-fl600.69008/page-3
Given that the machine is an automated method of managing tin cans packed with squishy humans hurtling through the sky - surely any anomaly should be kicked to a human operator (they claim the system was back up and running in 46 minutes so there were people around to respond...). This must be better than gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit-gotta-deal-with-it-oh-shit[repeat until dead].
After all - which is the WORSE option? To temporarily pretend one anomaly aircraft isn't there while signalling a human, or to get into a state where effectively "no planes exist any more".
" surely any anomaly should be kicked to a human operator "
In theory, yes, in practice AF447.
More graceful error handling would be a better bet, with the computer reverting to handling anomalous situations on some empirical rules and flagging them to the duty meatsack. Considering AF447, a frozen pitot isn't exactly an unforeseeable scenario, so unclear speed readings were always a potential issue. Keeping thrust and attitude stable with the autopilot engaged would probably have saved AF447, instead of the rules that required the autopilot to simply take its ball and go home if it detected movement of the goalposts.
Which means the software still needs proper QA, proper process analysis, and proper testing, so that the empirical rules are an acceptable risk whilst the coffee drinker gets his thinking hat on.
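To put a shape on "empirical rules", here's a toy sketch - invented sensor names, thresholds and gains, nothing like real avionics, but it shows the degrade-and-flag pattern rather than disengage-and-hope:

```python
def airspeed_consensus(readings_kts, tolerance_kts=15):
    """Return an agreed airspeed, or None if the sensors disagree too much."""
    if max(readings_kts) - min(readings_kts) <= tolerance_kts:
        return sum(readings_kts) / len(readings_kts)
    return None

def autopilot_step(readings_kts, last_pitch_deg, last_thrust_pct, alerts):
    speed = airspeed_consensus(readings_kts)
    if speed is None:
        # Empirical fallback: hold last known-good pitch and thrust, stay engaged,
        # and flag the duty meatsack - don't just hand the problem back mid-cruise.
        alerts.append("UNRELIABLE AIRSPEED - holding pitch/thrust, crew review required")
        return last_pitch_deg, last_thrust_pct
    # Normal behaviour (grossly simplified): trim gently toward a target speed.
    target_kts = 270
    return last_pitch_deg + 0.01 * (speed - target_kts), last_thrust_pct

alerts = []
print(autopilot_step([120, 270, 268], 2.5, 85, alerts))  # pitot disagreement -> hold
print(alerts)
```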
and the worst one was programmed in, "keep searching, dammit."
at some point, to stay real-time and operational, the ATC system should have just flagged the U2 as a bogey and red-boxed it on radar. controllers could either contact it for intentions, or notify Air Defense Command.
which brings up the question, why fly a U2 through LAX controlled airspace anyway? aren't there enough TV station helicopters chasing white Broncos down the highway, they have to put a U2 up as well? all that blank Nevada test range they could turn and burn in, and they decide to fly over LA.
"ERAM began spitting out error messages and then entered an endless reboot loop, which is a non-optimal state for a piece of critical equipment.
"We were completely shut down and 46 minutes later we were back up and running," Pair said."
What did they do, finally press F8 to boot into safe mode? 46 minutes is probably about three boot cycles for Windows.
Being a US Government contract awarded to one of their favoured contractors, there were likely few penalties in the contract, as is often the case with such work.
But now they can bid on a contract to upgrade the system, a contract likely making very, very few companies eligible for the work.
I guess the old Lockheed motto: "Anytime, anywhere, on time, and right the first time" doesn't apply any more. Pity.