"...need to be hard rebooted after exactly 149 hours"
Not good if the 149 hours is up while on final approach. I'm pretty sure you wouldn't have to wait for the full 149 hours otherwise this would cause MAX sized problems with scheduling.
Some models of Airbus A350 airliners still need to be hard rebooted after exactly 149 hours, despite warnings from the EU Aviation Safety Agency (EASA) first issued two years ago. In a mandatory airworthiness directive (AD) reissued earlier this week, EASA urged operators to turn their A350s off and on again to prevent " …
It should reboot before 'Exactly' 149 hours.
I'm not sure I would fix the bug. This is a known feature, the planes all go through a maintenance schedule more often than this. It may be safer to make a reboot part of the maintenance rather than introducing a software change
So, let me get this straight. How many software engineers are going to be needed to accomplish this? 10? 500? This sounds like a great job: place a software engineer in every first-class cabin and let them soak up the alcohol till they are snookered. So then they can reboot while under the control of alcohol.
That would be a big IF the engines do a restart.
I can imagine making an Atlantic crossing in a storm, run out of time, everything goes black and the flight crew fighting to reboot the computers and get engines restarted in the rain while traveling @ 425kts.
Once the engines shut down, all heat for the control surfaces no longer exists, so the leading edges of the wing, horizontal and vertical stabilizers begin to freeze. This would not be a good day.
They collect multiple data sources (and probably provide any driving signals for things like strain gauges) and format the data.
However the MDMs didn't run programs (although I think the later models running on the ISS do)
As embedded flight avionics, that suggests they should be under the full DO-178B style development process. Nothing less.
Incidentally the first time I heard of one of these timer overflow bugs, it was related to Patriot missile batteries left running for extended periods of time in the 1990 Gulf War (I heard about it much later)
So it's not exactly an unknown failure mode.
"Whats wrong with analogue wiring to a steering motor and a couple of max lock cut out sensors?"
That depends if it can detect and warn that the nosewheel isn't straight before it touches down and if the sensitivity can be adjusted with speed. You do NOT want a sneeze on roll (landing or takeoff) putting you in the bushes or snapping the tyres off.
More importantly a separate control path to the nosewheel would mean yet _another_ set of controls in the cockpit and pilots on the ground suffer from operational overload half the time anyway. That's why those checklists are so critical.
So do all the pilots have to read the counter as part of their pre-flight checks in order to make sure that the plane's switched-on time doesn't exceed 149 hours? And what happens if the plane is delayed literally when it's on the runway - do the pilots have to recalculate while waiting, and if it's likely to exceed while they are in operational command do they have to turn the aircraft around, park again and then go through the switch-it-off-and-on-again procedure?
The system will switch itself off if they forget to check or run into a too long delay. The only problem may be the fireball that causes the switching off. There are no known sparks involved in the switching off procedure and the switching on procedure expires when switching off is done by the system itself. Obviously, YMMV is appropriate. Please ensure to cash-in those miles before the switching off occurs automatically.
"The remedy for the A350-941 problem is straightforward according to the AD: install Airbus software updates for a permanent cure, or switch the aeroplane off and on again."
As far as I can tell from this, the issue has been fixed - a patch is available and all airlines need to do is install it. So why is the "turn it off and on again" thing even being mentioned? Surely with a potentially safety critical problem like this, it should be a simple case of grounding all aircraft until the patch has been applied.
hmm… imagine you are waiting to board your plane when an announcement is made
“There will be a short delay to boarding as the technician carries out some maintenance. We appreciate your understanding and hope to continue boarding as quickly as possible”
Wait 20 minutes…
“I have an update on the delay to boarding. It seems the technician has bricked the plane!”
I would guess that the patching only becomes mandatory at the next planned service of the plane so that the process can be properly planned. Up until that point, cold rebooting every 100 hours should be sufficient.
Having been on a Fokker suffering software issues that required a call being passed to 2nd line phone support to figure it out (it was turned off and back on again but not everything came back up...) I can tell you it wasn't 20 minutes later that they'd bricked it and an hour later they finally got it to start up right.
Escape because they wouldn't let us get off..
“There will be a short delay to boarding as the technician carries out some maintenance. We appreciate your understanding and hope to continue boarding as quickly as possible”
Wait 20 minutes…
"The technician has found he doesn't have the right adapter cable. One will be sent from Toulouse at the earliest opportunity. We're sorry for the further delay."
"The technician has found he doesn't have the right adapter cable. One will be sent from Toulouse at the earliest opportunity. We're sorry for the further delay."
The technician has now begun the process of installing from floppies, 432 of them, but is unable to find the spacebar...
Believe it or not, one of the ARINC 429 dataloader boxes in my lab loads flight software for some of the avionics that my company supplies to Airbus and Boeing using... wait for it...
3-1/2" 1.44 MB floppies.
At least it uses "modern" floppies, not the 5-1/4" or 8" floppies used by some of the older (but still operational) equipment that's sitting on the next shelf.
My experience was a 787 Dreamliner at Heathrow (AA) where the captain announced that some systems would not come on line and so a complete power off and on was required. We had to all disembark back to departure lounge because the captain wasn’t happy for us to be on board an aircraft with no power.
I was on a flight out of Birmingham a couple of years ago that got held on the apron. We sat there for over an hour while engineers came and went. Eventually they closed the door and restarted the taxi. The captain announced that although they couldn't fix the fault the plane was cleared to take off.
So..worth waiting an hour to try and fix but not bad enough to stop us flying.
Hmmm.
I wouldn't be happy to be on an aircraft with no power either. They tend to fall out of the sky that way.
a) For them to fall out of the sky they have to be in it. Sitting on the apron or on a taxiway there's not really a big opportunity for that.
b) Sully, the Gimli Glider crew and the BA flight 9 crew, among others, may feel the urge to disagree.
hmm… imagine you are waiting to board your plane when an announcement is made
“There will be a short delay to boarding as the technician carries out some maintenance. We appreciate your understanding and hope to continue boarding as quickly as possible”
Wait 20 minutes…
Display shows:
Update is 100% complete. Please don't turn off your airplane
[some spinner on the display rotating]
2 hours later
The display is still showing same message.
Generally depending on the classification of the problem the authorities grant some leeway in applying the AD. E.g. within next 28 days which gives the airlines some flex to incorporate it into the next scheduled down time. In this case they probably believe that the interim measure is safe enough that it doesn't require the AD to be applied immediately.
The update (patch) would not be applied by either the airline or Airbus; the kit itself will have been designed and certified by a 3rd party avionics house who will also have done the low level software (which would implement the actual communication links) which is what seems to be the issue here.
To update, the equipment would need to be replaced and the older units sent back to the manufacturer for software load and testing.
It takes time to replace units so this would be done at the next scheduled maintenance point.
You know how much disruption it causes, to hundreds of thousands of people, when a whole fleet of planes is grounded? Even briefly (and we don't know how brief it would be)?
That's an order that only goes out when they find something really dangerous. This hazard is easy to manage, once you know about it. Indeed, if it's been in service for two years without anyone noticing, that suggests it's pretty easy to manage even if you don't know about it
If the hard reset works that should be preferable to a software fix. Why? Because, as anyone involved in the development cycle knows, fix one thing and that patch is liable to break something else that only shows up once the user has it hands on. It is well understood that no matter how many QA tests are run, in the field the user will ALWAYS execute some action that the testers, in their wildest imagination, would never consider happening.
Any ideas about what is overflowing? 149 hours of seconds doesn't seem to be that obvious a limit, but I guess they probably have rounded down a little to stop planes falling out of the sky.
I've seen issues similar to the Boeing one turn up in less critical places. Found by customers, since in internal testing no system was left up for long enough.
That makes it sound like they were trying to allocate individual header bits to different fields. So 28 bits of milliseconds would only give them 74 hours, but that wasn't enough, and 30 bits gave 300 which would never be needed, so they chose 29 bits.
I guess they then didn't write any test cases for overflow. I can imagine the problem is that they haven't wrapped the comparison operation correctly. So the newest data ends up looking very old.
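To put numbers on that guess, here is a minimal Python sketch. It assumes, purely for illustration, a free-running millisecond counter of a given bit width; nothing here is taken from the actual avionics software, and the function names are made up. It shows how long each width lasts before wrapping, and how a plain comparison makes the newest timestamp look very old right after the wrap while a modular comparison does not.

```python
MS_PER_HOUR = 3_600_000


def hours_until_wrap(bits: int) -> float:
    """Hours before a millisecond counter of the given width wraps to zero."""
    return (1 << bits) / MS_PER_HOUR


def naive_is_newer(a: int, b: int) -> bool:
    """Plain comparison: breaks as soon as the counter has wrapped."""
    return a > b


def wrapped_is_newer(a: int, b: int, bits: int = 29) -> bool:
    """Modular comparison: a is newer than b if it is less than half the
    counter range ahead of b, which tolerates a single wrap-around."""
    modulus = 1 << bits
    diff = (a - b) % modulus
    return 0 < diff < modulus // 2


for bits in (28, 29, 30):
    print(f"{bits} bits of ms: {hours_until_wrap(bits):.1f} hours")

# A timestamp taken just before the wrap, and one taken just after it.
before_wrap = (1 << 29) - 10
after_wrap = 5
print(naive_is_newer(after_wrap, before_wrap))    # the newer sample looks older
print(wrapped_is_newer(after_wrap, before_wrap))  # handled correctly
```

The modular comparison only works while the two timestamps are less than half the counter range apart, which is the usual trade-off with this kind of sequence arithmetic.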
ARINC 429 messages are 32 bits: 8 bits for the label, 2 bits for the SSM (sign/status matrix), 2 bits for the SDI (source/destination identifier) and 1 bit for parity. That leaves 19 bits for data, which is 524,288 values (2^19).
There are 536,400 seconds in 149 hours. So if they're sending time since power-up as a 19-bit value in seconds, it overflows at about 145.6 hours, a few hours before 149 hours has hit.
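A quick way to check this arithmetic is a few lines of Python. The field widths are the ones listed in the comment; the idea that the data field carries seconds since power-up is the commenter's guess, not something confirmed by the AD.

```python
# ARINC 429 word layout as listed in the comment (bits per field).
FIELD_BITS = {"label": 8, "sdi": 2, "data": 19, "ssm": 2, "parity": 1}

# The fields should add up to one 32-bit word.
word_bits = sum(FIELD_BITS.values())

# Number of distinct values a 19-bit data field can hold.
data_values = 1 << FIELD_BITS["data"]

# Seconds in 149 hours, and how long a 19-bit seconds counter lasts.
seconds_in_149h = 149 * 3600
hours_to_overflow = data_values / 3600

print(f"word size: {word_bits} bits")
print(f"2^19 = {data_values}")
print(f"seconds in 149 hours = {seconds_in_149h}")
print(f"19-bit seconds counter wraps after {hours_to_overflow:.1f} hours")
```

On these numbers a 19-bit seconds counter wraps a few hours short of 149, so the match with the AD is close but not exact.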
Many years ago I was a software programmer and the ICL 1904 I was using at a client site stopped working. We were much closer to the hardware in those days.
The engineers ran a test and I poked my nose in asking about the results. I deduced addition was faulty, and persuaded them to do an addition (on the switches !! ) they did and told me I was wrong.
8 hours later they replaced the addition unit and all worked. When I asked what they had added, it was 1 & 1. Had I specified FFFF and 1 it would have shown the problem - carry in bit 8 was faulty!
Still reminds me to specify exactly what I want when testing.
What am I missing? Why do the systems need to know the milliseconds since it was started, rather than the milliseconds past an arbitrary time? Something that could reset every time the umbilical was unplugged, 00:00 UTC went by, or the pilot got up to shag a steward(ess)?
The black box needs precise timings. The internal indicators just need to reply in a timely fashion. No?
It could be a counter that counts cycles on some CPU or bus somewhere to generate a unique 'event' timestamp, and if it happens to be clocked at about 8 kHz then it would overflow a 32 bit value in 149 hours (though that "exactly" is probably rounded down from 149.something)
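The relationship between tick rate, counter width and time to wrap is easy to sketch in Python. This is only the arithmetic; the function names are hypothetical and nothing here says what the real counter is or how it is clocked.

```python
SECONDS_PER_HOUR = 3600


def hours_until_wrap(bits: int, tick_hz: float) -> float:
    """Hours before a free-running counter of the given width wraps."""
    return (2 ** bits) / tick_hz / SECONDS_PER_HOUR


def tick_rate_for_wrap(bits: int, hours: float) -> float:
    """Tick rate (Hz) at which a counter of the given width wraps after `hours`."""
    return (2 ** bits) / (hours * SECONDS_PER_HOUR)


# Tick rate that would make a 32-bit counter wrap at 149 hours.
rate = tick_rate_for_wrap(32, 149)
print(f"32-bit counter wraps at 149 h if clocked at about {rate:.0f} Hz")

# For comparison, a hypothetical 1 MHz clock wraps a 32-bit counter much sooner.
print(f"at 1 MHz: {hours_until_wrap(32, 1_000_000):.2f} hours")
```

For a 32-bit counter the rate comes out at roughly 8 kHz, so anything clocked in the MHz range would need a wider counter or a prescaler to last 149 hours.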
"The remedy for the A350-941 problem is straightforward according to the AD: install Airbus software updates for a permanent cure, or switch the aeroplane off and on again."
The remedy for the A350-941 problem is straightforward according to anyone with morals & a brain (& a healthy fear of litigation): install free Airbus software updates for a permanent cure.
FIFY
Let's hope the patch doesn't just automatically turn the plane off & on at 148 hours.
Reminds me of one of my favourite buffer over run stories. A missile was being developed, possibly AMRAAM I can't remember off-hand, and they had a problem with over runs. So in a move of genius they installed twice as much memory as would be needed in the longest possible flight, solving the problem.
Some years later they produced an improved range variant of the missile, predictably they forgot why they'd installed so much memory in the first place...
If I'm reading this correctly, 6 days with no power down = a severely crippled A350.
It went to service in 2015, and still in 2019, there are planes with this flaw ??
It shouldn't even have passed QA, in this state ! I remember EMC was not shipping Symmetrix (significantly cheaper than an airliner) without 3 weeks running flawlessly, one in a cold room (0 degree C), one in a 20 C room and one in a 40 C room ! Do I understand correctly that airplanes, these days, don't even come close to the level of QA from EMC, 2 decades ago ?
I'm not sure I want to board any airliners anymore ...
remember EMC was not shipping Symmetrix (significantly cheaper than an airliner) without 3 weeks running flawlessly,
I'd expect disk drives to run flawlessly for years without a reboot, so three weeks testing isn't much. It wouldn't have found this EMC problem https://www.theregister.co.uk/2014/01/15/vnx2_reboot_issue/ for which the solution was, yes you guessed:
1. Reboot SPA
2. Wait 30 min
3. Reboot SPB
before 80 days had passed.
How many airliners never actually get powered down completely sometime in every 6 days?
The VNX is not Symmetrix, it is their midrange product. The VNX is to the Symmetrix what an ATR regional turboprop is to the A350.
I know, I have a lot of experience with Symmetrix boxes and they are great kit, but it doesn't change the principle. Three weeks untroubled testing for something that is expected to run continuously for years, whether VNX or DMX, doesn't really have any relation to the kind of testing an ATR or an A350 would have.
"a pile of Intersil 6100s (PDP8 on a chip) they bought in the 1970s."
Back in the day, sensible system builders used to like to be able to source their chips from more than one chip shop.
For the 6100 family, there was (as Simon mentioned already), the Intersil version.
And here we are, some years later, and Simon's post doesn't mention that the 2nd source for this particular chip was a company called Harris..
https://www.hb9aik.ch/computer/6120history.htm (and elsewhere).
Small world, innit :)
Why is this a 'news' story?
1) As mentioned, 149 hours is more than 6 days. Since there are no possible flights that long (even with all the possible delays factored in), it should be easy to work in a turn off/on cycle, albeit with some more ground time and financial loss.
2) As mentioned, the problem has been identified and fixed, but we are currently in the period where operators get some leeway to work the update into their maintenance schedules.
In short, there's nothing to see here, move along, folks. The only reason it gets newsworthy is to say "Boeing's baaaaaad, but look here, Airbus isn't that much better, either!", even though the criticalness of Boeing's fault massively overshadows the mentioned fault of Airbus.
Financial loss ? Well you're obviously not in charge of an airline, that's for sure.
Airlines are already running close to red, they really can't afford to just go around losing more money.
Honestly, given how difficult it apparently is to operate an airline, I'm surprised they don't just give up and quit. There must be more money in it than I think.
Why is this a 'news' story?
Because of the Boeing incidents, issues with aircraft are high up in current public perception, and stories that 12 months ago wouldn't have rated even a footnote are of interest to the general public.
It's like when there's an explosion at a factory. Generally people don't give 2 figs about how their toothbrush is made. But when some disaster, either just spectacular or that results in major tragedy, strikes, people become interested - curious - about how their toothbrush is made, what trials and tribulations surrounded the invention of it, and so on.
Therefore what happens around aircraft manufacturing, science, maintenance, piloting stories, and so on, becomes of interest.
Any moron knows this.
Yep, to be fair, known bugs with workarounds are sometimes better than introducing a fix and all the risk that comes with it.
Besides, not knowing enough about the usage of aircraft, but I'm pretty sure they don't leave them on when not in use, and aren't flying non-stop - even the ones RyanAir hammer the hell out of.
Reminds me of a flight I took back from Baltimore to LHR. The cabin plunged into darkness then the low level lights came back on. A brief moment later the Captain informed us that there was a problem with the cabin electrics so there would be no hot food or inflight entertainment. The options were turn around and land in New York or "just go for it"......
TBH I was glad the wine was plentiful on that flight across the atlantic...
Nope, the Patriot missile system had a software error which meant that for every hour it was left running, it would become less and less accurate. Hence why the SCUD hit the American base despite the sending up of Patriot missiles. It hadn't been rebooted for far too long and the error meant that the missiles completely missed the incoming SCUD (cannot remember if they detonated far too short or too long).
Inertial measurement is hard.
You tell it exactly where it is to start with, and then it tries to keep track.
Over time, noise adds up, rounding errors compound and/or the Earth rotates and moves around the Sun.
Mishandle any of those, and you get significant drift over time. Enough to trigger an emergency escape system at a launch hold...
To keep tracking an object with radar, you set a "gate" and try and acquire the target again in the next period within the updated gate. The underlying problem was that different bits of the software calculated the time slightly differently leading to the gate being about .3 of a second out of sync after 100 hours. (counting upward with fixed point or floating point numbers is not as simple as people expect)
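The size of that drift can be reproduced with a short Python sketch. It rests on the commonly cited analysis of the incident, which is an assumption here rather than anything stated in this thread: the clock counted tenths of a second, and the constant 0.1 was chopped to 23 fractional bits in a 24-bit fixed-point register.

```python
from fractions import Fraction

# Assumptions from the commonly cited analysis, not from this thread:
FRACTION_BITS = 23      # fractional bits kept when storing 0.1
TICKS_PER_SECOND = 10   # the clock counts tenths of a second

# 0.1 has no finite binary expansion, so the stored constant is chopped.
exact_tenth = Fraction(1, 10)
stored_tenth = Fraction(int(exact_tenth * 2 ** FRACTION_BITS), 2 ** FRACTION_BITS)
error_per_tick = exact_tenth - stored_tenth


def drift_seconds(hours_up: int) -> float:
    """Accumulated clock error after the system has been up for `hours_up` hours."""
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    return float(ticks * error_per_tick)


print(f"error per tick: {float(error_per_tick):.3e} s")
for hours in (8, 20, 100):
    print(f"after {hours:3d} h: {drift_seconds(hours):.4f} s of drift")
```

Under these assumptions the drift after 100 hours lands at roughly a third of a second, in line with the ".3 of a second" figure above.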
Why haven't they patched? Because it costs money and time.
And the regulators are wusses and wimps. Why aren't they raining fines on these people who care more about saving money than about whether their saving money kills people?
Maybe they are cousins of certain guy in the US?
actually easy to implement, although they should not have to be put in that position.
a/c can time out on certain items while in flight (meaning next flight puts it over hours, good example is batteries timing out) so upon landing a/c is grounded and no tdmi/mel issued until addressed.
so flight and mtx planning would stop a/c at a mtx base before that time came close and reboot it.
meanwhile engineers should be scrambling to fix this or be getting fired.
I prefer to fly long haul in a 747 precisely because the pilot's yoke is connected to the control surfaces by cables not of the electrical or fibre varieties.
That and the fact that you can roll a 747 over at 40,000', go into a high speed dive for 30,000', pull nearly 5g in the recovery and land without further incident bent wings, missing tail surfaces and permanently lowered main undercarriage notwithstanding. (https://en.wikipedia.org/wiki/China_Airlines_Flight_006)
No aircraft journey is going to last 6 days and 5 hours. That is between 3 and 4 times round the world.
If the time starts counting when the crew are starting up for their next flight, they will have plenty of time even with the best of British delays - bureaucracy management anti-union actions, or just the wrong type of rain. Then they can wait for a couple of days at the terminal, another one at the threshold before taking off. I don't know the maximum length of time one of these things can stay up but they can't be in-flight refueled so they can reboot when they get back down again.
So many faults on planes. Boeing has another problem: "Pilots reveal safety fears over Boeing’s fleet of Dreamliners. Company admits that fire extinguisher switch has failed a ‘small number’ of times" https://www.theguardian.com/business/2019/jun/15/boeing-dreamliner-b787-safety-fears.
149 hours is around 6 days. So I don't know how serious this actually is as it seems pretty unlikely an aircraft would remain sat there for 6 days, powered on and not being used. Unless the aircraft is put in some kind of "sleep" mode, or something like that and the time is still counting up.
Even so. Doesn't sound a hard one to fix with a routine that refuses to start the engines if the aircraft is on the ground and has over 120 hours since the last reboot.
We hardware engineers are always forgotten. Less money, less kudos and stature but we don't feck around.
Working on Dealing Room Systems (with our custom designed PCBs), many years ago, we had one board that would intermittently and infrequently crash.
Share prices/Currency/Commodity info would freeze on one of the Dealer's screens.
This peeved them somewhat ($ millions trades at stake).
Generally a dealer would have 4 screens and some hundreds of dealers per room. This board was used on each screen.
So 1200 boards per Room, for a 300 Dealer Room.
There was a Hard Reset switch on the board, but it required the sysadmin or on-site engineer to wander into the Machine Room (after an irate call from the Dealer) and find Cabinet x, Rack y and Board slot z. Off/On, Fixed. But that took too long.
Our software engineers spent weeks trying to track down the problem and gave up.
When they came to us, we found quickly and easily that we could put a simple hardware Watchdog Timer on the board.
If it wasn't reset every 5 seconds, the board was rebooted.
It worked well and no further complaints.
Obviously for planes, the logic might be a bit more complicated.
If not reset for 100'something hours and stationary on the ground then reboot.
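The watchdog idea above can be sketched in software terms, although the original was a hardware timer. This is a minimal Python illustration only: the `Watchdog` class, the `should_reboot_aircraft` helper and the 100-hour default are hypothetical names and values chosen to mirror the comment, not anything from a real board or aircraft.

```python
import time


class Watchdog:
    """Software stand-in for a hardware watchdog timer.

    The monitored code must call kick() regularly; if it stops doing so
    for longer than timeout_s, expired() becomes true and the supervisor
    is expected to reboot the board.
    """

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()

    def kick(self) -> None:
        """Record that the monitored code is still alive."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True if no kick has arrived within the timeout."""
        return self._clock() - self._last_kick > self._timeout_s


def should_reboot_aircraft(uptime_hours: float, on_ground: bool,
                           stationary: bool, limit_hours: float = 100.0) -> bool:
    """Aircraft variant from the comment: only reboot when the uptime limit
    is reached AND the aircraft is stationary on the ground."""
    return uptime_hours >= limit_hours and on_ground and stationary


# Demo with a fake clock so nothing actually has to wait 5 seconds.
fake_now = [0.0]
board_watchdog = Watchdog(timeout_s=5.0, clock=lambda: fake_now[0])
fake_now[0] = 4.0
board_watchdog.kick()                 # software still alive at t = 4 s
fake_now[0] = 9.5                     # then it hangs for 5.5 s
print(board_watchdog.expired())       # the supervisor would now reboot the board
print(should_reboot_aircraft(120.0, on_ground=True, stationary=False))
```

Passing the clock in as a parameter keeps the timeout logic testable without real delays, which matters when the real timeout is measured in hours.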
Option 1. Well understood (but annoying) procedure that must be run on a regular basis
Option 2. Single uploading of software patch.
But.
Does the hardware architecture support uploading and verification (packet corruption being sent through network to end box)?
If not it's a box removal exercise or a direct connection to a box deep in the bowels of the aircraft
How well has the patch been tested?
Has it added some new failure mode?
IOW from the airlines PoV the risk assessment is not quite as straightforward as it seems.
Of course if we assume that all software patches are perfect and have no unintended side effects then the course of action is obvious.
Anyone here who's written software believe that assumption?