"We might require a spacecraft software change (which is a complex activity)"
Understatement of the year
The European Space Agency (ESA) is breathing easier after communications with Jupiter Icy Moons Explorer (Juice) were restored – the spacecraft is currently barreling toward Venus for a gravity-assist flyby on August 31.
Only problems and final mission success are news. "Nothing to report" status updates are not news.
Imagine the El Reg Headlines home page if every InfoSec team reported "nothing new today" every day. Instead, all El Reg bother to report on is the data breaches. Shame on them for not reporting every "nothing happened today" story. </sarc>
"The OBCP itself is used to schedule the transmitter amplifier on operations. We need this additional logic on-board to avoid radiation of the downlink through forbidden zones – there are scientific sensors in certain parts of the field of view of the antenna."
Boost on to communicate results to Earth, off to allow the instruments to function properly. There needs to be some way to decide when the downlink is on and when it is off. I reckon even the genius boffins at NASA may need to think a bit about this one.
... and it is subject to this problem once every 6 months, if the scheduling overlaps the timer reset. If there were another 8 bits in the counter, the reset would only happen long after the power ran out. That is also not a simple solution: it's probably a hardware counter, and even if it's not, adding another byte to a software counter is as complex as adding another counter. Which is why the genius boffins at ESA need to rethink this one.
> I realize that bits are expensive on spacecraft, but a 6 month timer rollover seems very limiting for no reason. How much bigger do you need so it exceeds the planned mission lifetime?
Mission lifetimes have a habit of being extended - just look at Voyager.
The solution is to make the timer rollover shorter: then it happens more often so bugs during rollover are more easily detected. Either that or invent a new propulsion technology that gets them there in less than 6 months. :-))
The size of the hardware timer would depend on the speed at which it increments as well as the duration before overflow. Not hard to calculate. Additionally, it isn't hard to combine the overflow of the timer with a software counter that would extend the duration even further. Lastly, as long as the "sleep" duration is less than the overflow duration, this would point to an incorrect code implementation of the test for whether the sleep duration has expired. So definitely a skill issue and a lack of adequate testing.
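For what it's worth, here is a minimal sketch of the "overflow plus software counter" idea. It assumes a 32-bit hardware tick counter that is polled at least once per wrap period; the class and method names are made up for illustration and have nothing to do with the actual flight software.

```python
MASK32 = 0xFFFFFFFF  # assumption: the hardware counter is 32 bits wide


class ExtendedCounter:
    """Widen a wrapping 32-bit tick counter in software (hypothetical helper)."""

    def __init__(self, first_reading=0):
        self.last = first_reading & MASK32
        self.overflows = 0  # number of hardware wraparounds seen so far

    def update(self, raw):
        """Feed in the latest raw hardware reading, get the extended count back."""
        raw &= MASK32
        if raw < self.last:
            # The raw value went backwards, so the hardware counter wrapped.
            self.overflows += 1
        self.last = raw
        return (self.overflows << 32) | raw


counter = ExtendedCounter(first_reading=0xFFFFFFF0)
before_wrap = counter.update(0xFFFFFFFF)
after_wrap = counter.update(5)
print(before_wrap < after_wrap)  # the extended count keeps increasing across the wrap
```

The catch is the polling requirement: if update() is called less often than once per wrap period, a wraparound can be missed.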
Though it also points to how a seemingly trivial function, is_timedout(t_now, t_start, duration), can pose significant challenges to organisations (Airbus, ESA and Boeing) that ought to know better and ought to have rigorous tests to detect counter timer overflow errors. Especially since it is a well-known and predictable problem.
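A wrap-safe is_timedout might look something like this. It is only a sketch: it assumes a 32-bit counter and a duration shorter than the wrap period, and the "naive" variant is my guess at the general class of bug, not the actual Juice code.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit wrapping tick counter


def is_timedout(t_now, t_start, duration):
    """Wrap-safe check: compare elapsed ticks, computed modulo 2**32."""
    elapsed = (t_now - t_start) & MASK32
    return elapsed >= duration


def is_timedout_naive(t_now, t_start, duration):
    """Buggy variant: compares against an absolute deadline that may have wrapped."""
    deadline = (t_start + duration) & MASK32
    return t_now >= deadline


# Sleep started 16 ticks before the wrap, for 32 ticks.
t_start, duration = 0xFFFFFFF0, 0x20

# 8 ticks in: only the wrap-safe version knows the sleep is not over yet.
print(is_timedout(0xFFFFFFF8, t_start, duration))        # False
print(is_timedout_naive(0xFFFFFFF8, t_start, duration))  # True (fires early)

# 32 ticks in, after the counter has wrapped to 0x10.
print(is_timedout(0x10, t_start, duration))              # True
```

The point is that the subtraction is done modulo the counter width, so the answer does not care where the wrap falls. A unit test that starts the sleep just before the wrap catches the naive version immediately.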
No. Engineering at its finest would have flagged this six month timer as a major issue, and put in place, at an absolute minimum, detection & recovery code for this scenario. This was NOT an "unavoidable fail". It was an utter failure to properly account for the time domain.
From the article:
>It [the timer] constantly counts up and restarts from zero once every six months.
and
>Fortunately, there are 15 months until the timer wraparound occurs again
So how do the engineers have fifteen months until a six-month timer wraps around again?
It is the interaction between wraparound AND boost being on, not every wraparound.
I was explaining just such a "what happens if A and B?" scenario to one of my testers only last week.
"Reset on timeout. Reset cannot occur whilst in the middle of transmitting. What happens to the reset when it coincides with transmitting?"
[Edit: not a space program - something much more mundane]
> So how do the engineers have fifteen months until a six-month timer wraps around again?
Because it's a sixteen month wrap around, not a six month one.
I do rather hope that a software engineer somewhere is now writing "I must unit test all my timer functions" a hundred times on a chalk board!
A back-of-envelope calculation suggests that they used a 32-bit int to count 10 ms ticks.
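The envelope, spelled out. The 10 ms tick and the unsigned 32-bit width are the guess from the comment above, not anything ESA has published:

```python
TICK_SECONDS = 0.010             # assumed tick length: 10 ms
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 365.25 / 12     # average month length


def rollover_months(bits, tick_seconds=TICK_SECONDS):
    """Months until an unsigned counter of the given width wraps around."""
    seconds = (2 ** bits) * tick_seconds
    return seconds / SECONDS_PER_DAY / DAYS_PER_MONTH


print(round(rollover_months(32), 1))  # 16.3
```

That is about 497 days, which fits a sixteen-month wraparound. Another 8 bits at the same tick rate would multiply that by 256, pushing the wrap out by centuries.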
Why doesn't the calling function test that the booster is actually on instead of just assuming it all went according to plan? If it had to receive an ack from earth after each transmission but never received one it should be trying to repeat the exercise.
I'm sure the engineers contemplate many "what if" scenarios so surely this should have been spotted.
This was as hard to miss as an elephant in the middle of a two-lane road. You have this rollover reset, and yet:
1) They did not audit their task list to see what tasks might be inconvenienced by this action.
2) They did not throw the reset into a task queue to ensure that nothing important was going on during the reset.
3) They did not raise a warning to prevent sensitive tasks from starting when the counter approached rollover.
4) They created an inadequate monitor for THE most critical function of the device.
5) Along with the inadequate monitor, their recovery software failed to account for the most basic failure mode, or to implement an OFF/ON process automatically.
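Point 3 is cheap to do. A rough sketch, assuming a 32-bit tick counter and task durations known in ticks; the function names and the margin parameter are invented for illustration:

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit wrapping tick counter


def ticks_until_rollover(t_now):
    """Ticks left before the counter wraps back to zero."""
    return (MASK32 - (t_now & MASK32)) + 1


def safe_to_start(t_now, task_ticks, margin_ticks=0):
    """True if the task (plus a safety margin) finishes before the rollover."""
    return ticks_until_rollover(t_now) > task_ticks + margin_ticks


print(safe_to_start(0, 1000))           # True: the rollover is a long way off
print(safe_to_start(0xFFFFFF00, 1000))  # False: only 256 ticks left
```

A scheduler could use a check like this either to hold a sensitive task until after the wrap or to raise the warning in the first place.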
The simple solution is:
1. When the action to restart the timer happens add an additional line/command
2. Add a "Launch the amplifier" command after the command to restart the timer has been issued
- If the amplifier is running then the command to start it will do nothing as it will simply error out
- If the amplifier is NOT running then the command to start it will do exactly that
This would require a minimal change to the code, and might actually be a simple shell script change if that's the way things are done and the OS is Linux or a variant.
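In code, the idea is just an idempotent start command issued after the timer restart. This is a toy model, not the OBCP: the Amplifier class, the function name and the downlink_scheduled flag are all hypothetical, and I have added the flag because the amplifier presumably must stay off outside its scheduled window.

```python
class Amplifier:
    """Toy stand-in for the transmitter amplifier (hypothetical)."""

    def __init__(self):
        self.running = False

    def start(self):
        """Idempotent start: returns False and does nothing if already running."""
        if self.running:
            return False
        self.running = True
        return True


def restart_timer_then_ensure_amplifier(amp, downlink_scheduled):
    """After a timer restart, re-issue the start command if a downlink is due."""
    # (the timer restart itself would happen here)
    if downlink_scheduled:
        return amp.start()
    return False


amp = Amplifier()
print(restart_timer_then_ensure_amplifier(amp, downlink_scheduled=True))  # True: it was off, now started
print(restart_timer_then_ensure_amplifier(amp, downlink_scheduled=True))  # False: already running, no-op
```

Whether the real amplifier command is safe to repeat like this is exactly the sort of thing that would need checking before uplinking anything.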