Millisecond roll-over?
So, what is the probability that the timing for these events is stored as milliseconds in a 32 bit structure?
The US Federal Aviation Administration has ordered Boeing 787 operators to switch their aircraft off and on every 51 days to prevent what it called "several potentially catastrophic failure scenarios" – including the crashing of onboard network switches. The airworthiness directive, due to be enforced from later this month, …
Could well be something like that, the earlier 248 day issue is exactly the same duration that older Unix hands will recognise as the 'lbolt issue': a variable holding the number of clock ticks since boot overflows a signed 32 bit int after 248 days assuming clock ticks are at 100Hz as was usual back then and is still quite common.
See e.g. here. The issue has been known about and the mitigation well documented for at least 30 years. Makes you wonder about the monkeys they have coding this stuff.
I've run into that problem (32-bit millisecond timer rollover issues) with microcontrollers, solved by doing the math correctly
capturing the tick count
if((uint32_t)(Ticker() - last_time) >= some_interval)
and
last_time=Ticker(); // for when it crosses the threshold
[ alternately last_time += some_interval when you want it to be more accurate ]
using a rollover time
if((int32_t)(Ticker() - schedule_time) >= 0)
and
schedule_time += schedule_interval (for when it crosses the threshold)
(this is how the Linux kernel does its scheduled events internally, as I recall, except it compares against jiffies, which tick at the kernel's configured HZ rate; 1/100 of a second was the traditional value, if I remember correctly)
(examples in C of course, the programming lingo of choice of the gods!)
do the math like this, should work as long as you use uint32_t data types for the 'Ticker()' function and for the 'schedule_time' or 'last_time' vars.
If you are an IDIOT and don't do unsigned comparisons "similar to what I just demonstrated", you can predict uptime-related problems at about... 49.71 days [assuming milliseconds].
I think I remember a 'millis()' or similarly named function in VxWorks. It's been over a decade since I've worked with it though. VxWorks itself was pretty robust back then, used in a lot of routers and other devices that "stay on all the time". So its track record is pretty good.
So the most likely scenario is what you suggested - a millisecond timer rolling over (with a 32-bit var storing info) and causing bogus data to accumulate after 49.71 days, which doesn't (for some reason) TRULY manifest itself until about 51 days...
Anyway, good catch.
IIRC, we had to restart those at a minimum of every thirty something days or they would lock up. Fortunately<sarcasm font>, they tended to fall over quite a bit more frequently than that so we seldom ran into that particular bug.
WTH is Boeing doing re-using that particular bit of crusty code?
It was Windows95/98 and it took years in the wild before anybody noticed for that very reason.
“ - and this is the rock-solid principle on which the whole of the Corporation's Galaxywide success is founded - their fundamental design flaws are completely hidden by their superficial design flaws.”
I run FreeBSD, occasionally I reboot it every year or two ... just to check that the machine's power supply restarts.
We had a Nortel Meridian. Had to be rebooted after only 13 years uptime. Granted it did have a lightning strike which melted several of the boards (it was still running).
Also had to reboot several Audiocodes ISDN to SIP converter after 5 1/2 years.
You server boys do need to learn about reliability...
Those were the days before Nortel chucked their entire development systems and switched to "industry standard" WNT. (Of course, they had to do a boat-load of work to make WNT run as a subtask but that's another tale of woe.)
But I am surprised about VxWorks. It is a very robust OS with a high EAL rating. Seeing the real problem -- and I doubt it is time-connected -- would be instructive.
"Seeing the real problem"
someone already posted a valid suggestion - millisecond rollover, and an algorithm to test for periodic timing that was poorly written. [during rollover you might end up with a "storm" of data collection for a brief period of time, as one example, or NO DATA COLLECTED AT ALL - even worse]
Again, my working with microcontrollers has already gotten me to discipline myself with respect to these kinds of maths so that the controller can run for MONTHS unattended, as you would expect it to, and not have a rollover issue after 49.71 days, or anything reasonably close to that, depending on whether your millisecond timer is actually happening every 1.024 milliseconds...
Uptime is just a measure of how long since you last verified that your machine could boot successfully ;)
(Don't forget that a reboot doesn't give anything a chance to really stop. Problems are more likely to crop up after a machine has been powered off for more than a few minutes, and parts of it are cooling down).
So, re Windows 95 on a Boeing 787 :
If you are flying at the standard 30,000 feet, and you need to reboot, how long can the aircraft glide without any power? Because, somehow, if waiting the customary 2 minutes before rebooting, and then adding the time it takes to reboot, will the aircraft have crashed by then?
As a potential airline customer, these issues are of medium importance to me, as other ways of kicking the bucket could get me first.
This reminds me to ask another question: We are so used to the great name that Boeing used to have that no one even wonders anymore if it ever was a good idea to name an aircraft manufacturer
Boeing!
I wouldn't do that. I would not name an aircraft company "Kaboom!" or "Oopsidaisy Aircraft".
Not even the Russians have an airplane manufacturer called "Crashki-Burnski Planes-ky Factory"
Sorry, Antonov, Tupolev or Sukhoi all do not translate into anything funny at all.
Then there is Piper - US single engine aircraft - they seemed to have to pay the piper somewhere along the line. The Canadian company "Bombardier" at least gives you the feeling that you are going to be bombing someone else, which is only somewhat reassuring, because pesky SAM's might give you a little bump in mid-air. uuh, stop that!
"Embraer" does not provide any feeling one way or the other, so is that a good thing? I don't know
So, maybe it's time that Boeing renames itself to "Majestic Aluma-plastic Happy Flying Machines", or maybe "Sitting Vulture Soaring Eagle Planes". Just trying to help them out here.
Overall, the topic reminds me of Brian Eno's 1970s song entitled, a bit sarcastically:
"Burning Airlines give you so much more!"
Here ends the reading!
> I wouldn't do that. I would not name an aircraft company "Kaboom!" or "Oopsidaisy Aircraft".
Or a parcel delivery company called "oops", I mean, "ups"?
I once saw a car from a driving school called Impact School of Motoring. It had a large dent in the back. (I promise it's true, but I wish I had taken a photo)
The article mentions that, some years back, superseded backup flight plans could kick in mid-flight and the aircraft would try to change course to the old plan.
I was wondering if that could have been a part of the problem that led to the Boeing 777 on flight 370 going missing?
I think Boeing have for years been suffering from 'Too big to fail syndrome' , the US gov' is kind of obliged to bail them out and keep them running so they don't try that hard to produce a good product.
> I was wondering if that could have been a part of the problem that led to the Boeing 777 on flight 370 going missing?
I doubt it, because a) while also from Boeing, it was a different model and this FAA notification concerns only the 787, and b) a previous flightpath from Malaysia towards the middle of the Southern Indian Ocean would be quite unlikely.
"I would have thought that was reason enough to ground the entire fleet until the problem was clearly identified and a software patch released."
Most aircraft are power cycled in less than 48 days so the bug took some time to track down. It's often the first thing to try if you are having system issues so it could be any number of things. Granted, aircraft need to be far more robust, but every complex system has some sort of issue.
I think it might have been Matt Parker that talked about this issue in a presentation on math. Of course, he would have pointed out why it's a good idea to have mathematicians around checking these sorts of things.
The 787 I was on that needed a reboot to try and fix a locked fueling valve needed around 15 minutes to fully power cycle and have systems back up.
In the end I think they hit the valve with a wrench... you know what they should've done first :D
When Emirates introduced the first A380's I was on a flight where self-loading cargo was on-board, doors shut then Captain announced "Sorry Ladies and Gentlemen we are going to have to reboot the aircraft". This was after ground power was disconnected so no air-con. I can tell you that it takes a sweaty 18-20 mins before engine start.
typically VxWorks will come up really fast.
a) you compile it for your hardware - so no driver loading and/or hardware detection
b) it's an RTOS and not a monolithic kernel. Startup and scheduling are different. You could easily optimize restart times [let's say in-flight reboots being made possible].
c) the processes would all be compiled in, so no program loads either, as far as I can tell. This could be wrong, based on what they might be doing, but I suspect it'll be like it was for wifi routers with VxWorks, which is what I worked on - wireless, networking, WPA, asynchronous packet handling, stuff like that.
So yeah maybe it boots up in under 5 seconds? Possibly boots up even faster than THAT...
In the current world situation I imagine it won't be long before a goodly percentage of 787s are simply powered down somewhere out the way and left until there are lemon-soaked paper napkins again.
Of course, it might give RR a chance to catch up with engine rebuilds that have left some aircraft on the ground for a fair while in any case.
If it's Boeing, I'm not going.
Sounds like a number of designers are needed that come from this century and can resolve the endless-looking list of shortcuts and issues that seem to have been designed into this steaming pile of poo.
Why are they not thinking a bit more long-term, either by handling the rollover issue better (more bits, to make it a far longer duration) or, better still, by designing in the fact that things will roll over and expecting that in the platform design and software, so that it does work properly?
Alternately, work around it with rolling reboots until better software and firmware can be deployed. The whole idea of "turn it off and on again" is so dated now.
" If so, "popping" the breaker on the ground should reset it."
If you have to do a reset in flight, yes, you hope that will do it. If you want to do it right, it should be done on the ground and completely so everything in the sequence comes up the way it was designed.
VxWorks is neither a Windows nor a UNIX. It is also an RTOS. The commentards above who obviously have no experience with this OS and yet are attempting to appear knowledgeable on the subject are painfully obvious ...
"It is better to remain silent at the risk of being thought a fool, than to talk and remove all doubt of it." —Maurice Switzer, 1907
No. Plenty of experience developing for VxWorks and exactly the same issues apply with int32/time rollover, etc. Moreover, many products are using very old, heavily patched versions of the OS because it's considered too risky/expensive to migrate. Especially in aerospace. And the fact that it's only used in dedicated applications in relatively small volumes, means that many bugs remain undiscovered and unfixed for years. It may be a more deterministic OS, but that only gets you so far.
Which is why you probably shouldn't fix this 'bug'
Simply require it to be powered down every 28 days as part of the maintenance procedure.
I'm sure the engines need their oil replacing every X hundred hours; nobody is demanding that the plane contains enough oil for a 50-year service life.
I haven’t got that much experience with the OS. My experience is, though, that the development environment was a steaming pile of bits. If a coder’s attention goes into fighting with tools, quality of code will suffer for sure.
That was many years ago, maybe it works better now.
VxWorks? You mean the OS used in Mars probes and landers, which works for years on chips that were originally radiation-hardened variants of PowerPCs? That just works for years, a long way from tech support? Better coders in space work than in mere aviation. Seriously, if this has been a known issue for 30+ years, why does basic acceptance testing not catch the stuff-up? A whole bunch at Boeing need to be sacked and banned from ever going near aircraft, or any other job requiring coding.
Of course Jake. Because you have 12 airline pilot friends who flew Boeing 787 Max who all said there was no issue with that plane and Boeing had given plenty of training and it was pilots from third world countries with insufficient training that caused the issue (after the second crash)
And yet pretty much every pilot disagreed with that statement, including ones from most of the major US carriers and the unions and the FAA and the rest of the world and Boeing themselves.
So I would suggest that advice in the flying arena from yourself and your 12 pilot friends is not worth the screen real-estate it is written on.
It was the 737MAX, there is no 787MAX. Am I supposed to listen to, or reply, to someone who made such a basic error?
I did not say there was no issue with the plane. I said that properly trained pilots knew of the issue, and the work around. Am I supposed to listen to, or reply to, a coward who makes such egregious logic errors?
Consider that the day before the Lion Air Flight 610 crash, the exact same plane was kept from crashing by a third, off-duty pilot who happened to be in the cockpit when the exact same problem that brought the plane down the following day occurred. That's right, he stopped the plane from crashing. As could the pilots who were onboard the next day, if they had had the proper training, which clearly existed.
The fact that this information and training wasn't available to the pilots of Ethiopian Airlines Flight 302 over five months later is criminal, and IMO that airline should be at least partially, if not wholly, responsible. Blaming it all on Boeing is akin to blaming the loss of a team sporting event on a single play by a single player. It says more about the lobbying power of the airlines than it does Boeing's issues (which do exist, and I'm not saying otherwise).
Yes of course Jake. That's why the plane has spent over a year being unable to fly. Just because they needed better training.
Have you not read anything of the scandal and issues that have arisen since the investigation began? With pilot instructors saying the system was faulty and could easily lead to a crash. With people who have followed the proper technique with *the real software* in a simulator not being able to control the plane.
With Boeing having to cancel orders, having to find parking lots to try to find places while they have issue after issue with their software. With all the evidence that has been produced that Boeing purposefully did not produce correct simulator training as they didn't want to have to force retrains on pilots to re-certify (i.e. The proper training didn't exist and Boeing even tried their best to make sure people didn't think they needed proper training).
And frankly your claim that we shouldn't be blaming it all on 'Boeing' is disingenuous to those that died unnecessarily. This was a problem of Boeing's making and for money and profit they put the lives of passengers at risk to rush a plane out and ensure they didn't lose customers. Engineers at Boeing have testified to that.
"The fact that this information and training wasn't available to the pilots of Ethiopian Airlines Flight 302 over five months later is criminal". Yes and Boeing execs should go to court for that criminal act. If they had owned up to the issue, grounded the planes, changed the software and then recommended re-certification in a simulator with *real software* then it wouldn't have happened. It is criminal, we can agree, and Boeing should be suffering greater consequences.
Actually... It is a collective failure by the FAA and Boeing (and EASA subsequently). The FAA said "ok, we'll believe Boeing when they say 'we don't need to mention MCAS'" and EASA followed their lead. The Brazilian authorities on the other hand didn't and insisted that MCAS be mentioned.
Ethiopian is a more problematic case than LionAir because it happened shortly after takeoff at a much higher altitude above sea level (2355m). A minute after takeoff MCAS kicked in, which is a damn sight less than the 13 minutes that the LionAir crew had before theirs kicked in. I wouldn't want to have to resolve a software issue at that point (and Chesley Sullenberger also pointed this out when he attempted this in a simulator). Ethiopian policy to make a pilot with 200 flight hours an FO is something else, but arguably that airline is one of the safest in African airspace (along with Kenyan, SAA and Egyptair).
Boeing screwed the pooch, the aviation industry is broadly in agreement with that view. That this new issue comes to light is not really all that much of a surprise given that Airbus has made the same mistake (see https://www.theregister.co.uk/2019/07/25/a350_power_cycle_software_bug_149_hours/), but then again this is the *second* of such issues with Boeing (see https://www.theregister.co.uk/2015/05/01/787_software_bug_can_shut_down_planes_generators/ for the first). Boeing needs a lot of internal work to resolve the problem of ruling by accountant, whereas it *was* an engineering company first.
Just saying...
Performed by the lowest bidder, of course.
I'd much rather bet my life flying in ANY commercial aircraft (yes, including the 737-MAX) than being transported in an over-the-road vehicle driven by yourself, or pretty much any other licensed driver, on public roads. Don't take this personally. It's a math(s) thing.
Get some perspective, people. Paranoia is all very well and good, but there comes a point where you just look silly.
If you are honest enough to include yourself into that "any other licensed driver" category, I can't disagree with you, though I am a bit of a control freak, so I like to do my own steering, whether it be a bike, a motor bike, a boat, a car or a plane. I must confess I am not licensed to do the steering in a plane though, so by necessity I must leave that to others.
Of course I include myself! Furrfu.
Concur on preferring to be at the controls ... but I can sleep in the Peterbilt when my Wife or Daughter are taking their turn driving cross-country. Not cat-napping, real sleep. It's a trust thing. I also don't mind somebody else taking the controls of the aircraft when we're flying straight & level. Gives me a break. (I admit that I doubt I'd be quite as blasé about this if they weren't dual yoke ... )
"Dead people / Total flown people"
No. But then that number by itself is useless. You need miles and/or hours in that figure to take any real meaning from it. I don't have those numbers handy, either.
However, the MAX flew around half a million flights and only had two fatal crashes. Both of those crashes were avoidable (see my comments elsewhere). In my mind, the MAX could still be flown today, with properly trained pilots at the controls. It's not an inherently unsafe aircraft.
The court of public opinion says otherwise. ::shrugs::
The 787 has far more problems. They left metal shavings (from drilling holes and tightening bolts) inside, where they fell. That may be inside cable conduits, where they erode away the insulation...
Normal rules require such debris to be removed, but this was forgone at this one facility to meet deadlines and cost requirements. There was a quality inspection executive who tried to report it up, and subsequently became a whistleblower. He considered it bad enough to advise his family never to fly in one.
Which is why it is better for all if people speak up even at the cost of temporarily being thought a fool. Culture which encourages openness is much to be desired.
"It is better to remain silent at the risk of being thought a fool, than to talk and remove all doubt of it." —Maurice Switzer, 1907
Do 787s do short haul?
Having said that, adding in a 15-20 minute reboot time at each turnaround might not be an option. It depends on whether it can be done while other turnaround tasks are also being done, otherwise I doubt they'd want to do it except as required. Aircraft have to be turned around on time or risk losing their slot.
Probably all sorts of expensive consequences to actually fixing it with an update, while turning it off and on again can be built into the maintenance for free.
Not the right solution but when shareholders/your bonus package demands ever higher profits you've got to cut corners somewhe... everywhere.
I think it depends on what you mean by "rebooted". Are these systems that stay live in the aircraft when it isn't in use? If they are, then there might be a problem. If they shut down when the aircraft isn't in use, then they're going to get a reboot pretty regularly and I doubt they'd ever get to the end of the period.
Anyway it isn't hard to fix. All sorts of things need to be checked before/after a flight. So you just add it to that checklist. "Has the aircraft been rebooted in the last 40 days?". But that's so straightforward I reckon they'd have thought of it already.
Alternatively. Build it into the aircraft's own software to reboot (or refuse to operate without one!) automatically every so often. Again, seems too easy..
So I reckon there has to be more to this one.
I was told by a test engineer, from a major aircraft manufacturer, that one of their tests was to remove the main command and control network switch in-flight and replace it with a spare. I know that they had performed it many times on the ground, but they must have bigger dangly bits than me, to have done it in-flight!
[Anon. because, although it was a long time ago, it was my (not ARINC 664) software on that particular switch!]
I'd say that would be a bit alarming, I still clearly remember hearing something similar about a train rebooting (multiple times, luckily no more than once in a single journey, so max two times a day when commuting) and that took about 15 minutes every time with the train standing still. Usually it happened about 30 seconds prior to departure, but it also happened between stations and every single time the train came to a full stop (security reasons they claimed) before the train was rebooted.
From a cold start it takes about 15 minutes to get a 787 up and running. Most of that time, to be fair, is waiting for the inertial reference system to align. The common computing resource (which runs in software what would traditionally be handled by individual avionics computers) is online within 3 minutes. If the CCR is reset in flight (never needed to do so yet...) it's back up and running within 70 seconds. It is permitted to reset both CCRs (left and right) simultaneously in the event of the loss of all displays.
The bulletin for this particular problem is quite woolly - I think that when it says "expired" data it means the results of a calculation that didn't complete in the assigned compute cycle (so realistically milliseconds late); not, say, the values from last Tuesday. Merely a layman on RTOS, was never touched on in my Comp Sci degree.
Gareth...
writing afterwards is easy, but still required.
The problem of recoverable software and hardware for safety-critical systems was solved in 1978-87 and was the subject of my PhD, applied to Russian submarines, satellites and later the Sukhoi 27, 27I and C (India and China respectively). Later, in 94-95, I was preaching the same to British Aerospace. With no aircraft in design and no decision-making people in the UK, in 98-99 I pursued the same, and further development, in Seattle for both the military and commercial aviation departments of Boeing, by invitation from there and with the full support of the US Government.
Later we did a project for the EC called ONBASS, which ran in full swing in 2004-2009.
Russian government is still trying to steal (as far as they can understand) how to make next steps - see patent and patent defence case.
We have summarised our experience in four books published by Springer.
Rod Liddle in 2015 did interview for Sun
Some notes about this saga you can find below. Dan and Andy from WSJ, as well as their predecessor Jeff Cole are fully aware about this all, see attachment in email I sent you...
Unfortunately, instead of REDOING avionics as it should be, people elsewhere talk about functional safety (kind of bull shit; safety is only active and preventive, nothing else matters). Also popular is the nonsense called conditional maintenance, even at NASA (like they know how to measure the condition of the aircraft, on-board systems and environment, on the order of 2 to the power of 24 states). Nobody knows how to combine these and, of course, how to verify such a system: for a new Intel processor with 1.2mln lines of hardware code, the verification program exceeds 1.5 billion lines, so read a thousand times more errors in there...
Well enjoy reading, if you want to implement the solution that is proven - also by patent a the best one for flight mode, tracing and recovering in real time system on board - do ask, but first read our books. The answers are there, excited and proven.
rgds,
Igor
Notes on Active safety of aircrafts and people
in 1985-89 the first concept of dynamic safety for aviation (CoDySA) was introduced by ATLAB Ltd, Bristol, now IT-ACS LTD, also UK.
Fields tests of prototype took place and passed STATE TESTS see this:
https://www.academia.edu/30247663/ITACS_LTD_Devices_and_Results_Chronology
in 98-99 specially for Boeing, .Lockheed And Northrop results were presented:
https://www.academia.edu/7119860/The_Concept_of_Dynamic_Safety
and EXPLAINED for CA and Military sectors of Boeing in Seattle and Orlando.
Special Talk how to aggregate on-board and Air traffic controllers info was suggested and explained
https://www.academia.edu/7126686/Safelets_a_Software_Support_for_Dynamic_Safety_System
a special project ONBASS was funded by EC within FP5 to make it implemented for general and commercial aviation.
results of this project we presented for Eurocontrol, Airbus, and EASA,
sent to Boeing and Eurocontrol, as well as German, Swiss, Russian, French governments.
https://www.academia.edu/40602498/Principle_of_Active_System_Safety_-_Airbus_HQ_2006
during demonstration of functioning the whole software framework with prototyped devices were presented 18 November 2008 in London.
There are books published explaining all the hardware, systems, software and active system control designs:
https://www.academia.edu/28829113/Resilient_Computer_System_Design
https://www.academia.edu/39740322/Software_Design_for_Resilient_Computer_Systems_SpringerNature
Interview for Times, thanks to Rod Liddle and Bianca Britton:
https://www.academia.edu/18111652/Rod_Got_An_Issue_Where_Am_I_Safe_to_Fly
and patent was made in UK and stolen by Russian Government:
https://www.academia.edu/31065112/Patent_on_method_of_active_system_safety
https://www.academia.edu/31822260/Method_and_apparatus_for_active_system_safety
https://www.academia.edu/30247663/ITACS_LTD_Devices_and_Results_Chronology
https://www.academia.edu/36980437/Patent_breach_failed_https_www.ipo.gov.uk_p-challenge-decision-results_o35518.pdf
defended in Europe
https://www.academia.edu/38316652/UK_Patent_defended_in_Europe
and ... Nothing was implemented - instead of two lamps on a panel of 737 indicating loss of oxygen pressure?
Now world is doing nothing again, and only few super experts and enthusiasts of aviation ( CAPTIO team) are pushing investigation of MH370 getting the grip of what was that and what to do to avoid in the future.
http://mh370-captio.net
Codysa etc
Money on safety - you have it! but you need to learn how to get it:
https://www.academia.edu/35896810/110218.pdf
The book Russian Government (Rospatent) tried to endorse and even re-patent:
https://www.academia.edu/28342759/Active_System_Control
CAPTIO work the latest video is here:
https://www.youtube.com/watch?v=Go3K0UUt2Us
ITACS Follow-up
https://www.academia.edu/40899039/Mh370_follow-ups_for_RAES_Brussels_event
And if we do nothing today, tomorrow will be exactly the same as yesterday. Happy flights. And good luck! You need very good luck!