
ESA engineers trace anomaly in silent Juice spacecraft to a bug in the code

The European Space Agency (ESA) is breathing easier after communications with Jupiter Icy Moons Explorer (Juice) were restored – the spacecraft is currently barreling toward Venus for a gravity-assist flyby on August 31.

  1. Ace2 Silver badge
    Pint

    "We might require a spacecraft software change (which is a complex activity)"

    Understatement of the year

    1. KarMann Silver badge
      IT Angle

      The bog-standard 'have you tried turning it off and on again?' just might not cut it in this case.

  2. steamnut

    Again?

    ESA's software guys do seem a bit accident prone. Or is the stress testing department too complacent?

    1. anothercynic Silver badge

      Re: Again?

      Not any worse than a certain systems and aerospace vehicles vendor from Virginia^H^H^H^H^H^H^H^HChicago^H^H^H^H^H^H^HSeattle... ;-)

      At least ESA fixes their software and it works as appropriate... unlike that other bunch who need 346 deaths before they do something about it.

    2. John Brown (no body) Silver badge

      Re: Again?

      Only problems and final mission success are news. "Nothing to report" status updates are not news.

      Imagine the El Reg Headlines home page if every InfoSec team reported "nothing new today" every day. Instead, all El Reg bother to report on is the data breaches. Shame on them for not reporting every "nothing happened today" story. </sarc>

  3. Jim Mitchell
    Boffin

    I realize that bits are expensive on spacecraft, but a 6 month timer rollover seems very limiting for no reason. How much bigger do you need so it exceeds the planned mission lifetime?

    1. Eclectic Man Silver badge
      Boffin

      "The OBCP itself is used to schedule the transmitter amplifier on operations. We need this additional logic on-board to avoid radiation of the downlink through forbidden zones – there are scientific sensors in certain parts of the field of view of the antenna."

      Boost on to communicate results to Earth, off to allow the instruments to function properly. There needs to be some way to decide when the downlink is on and when it is off. I reckon even the genius boffins at NASA may need to think a bit about this one.

      1. david 12 Silver badge

        ... and it is subject to this problem once every 6 months, if the scheduling overlaps the timer reset. If there were another 8 bits in the counter, the reset would only happen long after the power ran out. That is also not a simple solution: it's probably a hardware counter, and even if it's not, adding another byte to a software counter is as complex as adding another counter. Which is why the genius boffins at ESA need to rethink this one.
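
        A rough sketch of the "how many bits" question, in case it helps. The 10 ms tick and the 20-year lifetime are assumptions for illustration only (neither is confirmed anywhere in the article), and bits_needed and rollover_years are made-up helper names.

```python
SECONDS_PER_YEAR = 31_557_600  # 365.25 days


def bits_needed(lifetime_seconds: int, tick_ms: int) -> int:
    """Smallest counter width (in bits) that does not wrap within the lifetime."""
    ticks = -(-lifetime_seconds * 1000 // tick_ms)  # ceiling division
    return max(1, (ticks - 1).bit_length())


def rollover_years(bits: int, tick_ms: int) -> float:
    """Time until a free-running counter of the given width wraps, in years."""
    return (2 ** bits) * tick_ms / 1000 / SECONDS_PER_YEAR


# Assumed figures: a 20-year lifetime counted in 10 ms ticks.
width = bits_needed(20 * SECONDS_PER_YEAR, tick_ms=10)
print(width, "bits needed")
print(round(rollover_years(40, tick_ms=10)), "years for a 40-bit counter")
```

        Under those assumptions, a 32-bit counter widened by one byte would comfortably outlast the mission, which is the point made above. Whether the real counter can be widened at all is a separate question.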

    2. Apocalypso - a cheery end to the world Bronze badge
      Happy

      > I realize that bits are expensive on spacecraft, but a 6 month timer rollover seems very limiting for no reason. How much bigger do you need so it exceeds the planned mission lifetime?

      Mission lifetimes have a habit of being extended - just look at Voyager.

      The solution is to make the timer rollover shorter: then it happens more often so bugs during rollover are more easily detected. Either that or invent a new propulsion technology that gets them there in less than 6 months. :-))

    3. Lipdorn

      The size of the hardware timer would depend on the speed at which it increments as well as the duration before overflow. Not hard to calculate. Additionally, it isn't hard to combine the overflow of the timer with a software counter that would extend the duration even further. Lastly, as long as the "sleep" duration is less than the overflow duration, this would point to an incorrect code implementation of the test for whether the sleep duration has expired. So definitely a skill issue and a lack of adequate testing.

      Though it also points to how a seemingly trivial function, is_timedout(t_now, t_start, duration), can pose significant challenges to organisations (Airbus, ESA and Boeing) that ought to know better and ought to have rigorous tests to detect counter timer overflow errors. Especially since it is a well-known and predictable problem.
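
      For anyone who hasn't met this one, a minimal sketch of a wrap-safe is_timedout next to a naive version. The 32-bit width is an assumption, and this is not the flight code, just the textbook pattern: compute elapsed ticks with modular subtraction instead of comparing against an absolute deadline.

```python
COUNTER_BITS = 32                  # assumed width; the real counter size is not stated
MASK = (1 << COUNTER_BITS) - 1


def elapsed(t_now: int, t_start: int) -> int:
    """Ticks from t_start to t_now, correct across a single wraparound."""
    return (t_now - t_start) & MASK


def is_timedout(t_now: int, t_start: int, duration: int) -> bool:
    """Wrap-safe: valid as long as duration is shorter than the rollover period."""
    return elapsed(t_now, t_start) >= duration


def is_timedout_naive(t_now: int, t_start: int, duration: int) -> bool:
    """Buggy: the deadline can exceed MASK, which the counter never reaches."""
    return t_now >= t_start + duration


# Start 5 ticks before rollover and wait 10 ticks.
t_start = MASK - 5
t_now = 4                          # counter has wrapped; 10 ticks have passed
print(is_timedout(t_now, t_start, 10))        # True
print(is_timedout_naive(t_now, t_start, 10))  # False, and it stays False
```

      The naive version only misbehaves when the wait straddles the rollover, which is why it can pass every test that doesn't deliberately start the clock near the maximum value.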

      1. John Brown (no body) Silver badge
        Joke

        Hell, the clock chip in a PC can run for years powered by a CR2032!! And they are even Y2K compliant

  4. CorwinX Silver badge

    You have to admire...

    ... how these guys and gals keep coming up with kludges to keep landers/probes online when things go wrong.

    Engineering at its finest.

    1. Claptrap314 Silver badge
      FAIL

      Re: You have to admire...

      No. Engineering at its finest would have flagged this six month timer as a major issue, and put in place, at an absolute minimum, detection & recovery code for this scenario. This was NOT an "unavoidable fail". It was an utter failure to properly account for the time domain.

  5. Bent Metal

    Wait, what am I missing...?

    From the article:

    >It [the timer] constantly counts up and restarts from zero once every six months.

    and

    >Fortunately, there are 15 months until the timer wraparound occurs again

    So how do the engineers have fifteen months until a six-month timer wraps around again?

    1. Caver_Dave Silver badge

      Re: Wait, what am I missing...?

      It is the interaction between wraparound AND boost being on, not every wraparound.

      I was explaining just such a "what happens if A and B?" scenario to one of my testers only last week.

      "Reset on timeout. Reset cannot occur whilst in the middle of transmitting. What happens to the reset when it coincides with transmitting?"

      [Edit: not a space program - something much more mundane]

    2. Red Ted
      Stop

      Re: Wait, what am I missing...?

      > So how do the engineers have fifteen months until a six-month timer wraps around again?

      Because it's a sixteen month wrap around, not a six month one.

      I do rather hope that a software engineer somewhere is now writing "I must unit test all my timer functions" a hundred times on a chalk board!

      A back of envelope calculation suggests that they used a 32bit int to count 10ms ticks.
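
      The envelope in question, spelled out. The 32-bit width and 10 ms tick are this comment's guess, not figures from ESA, and the 30.44-day month is just an average.

```python
TICK_SECONDS = 0.010               # guessed tick period
COUNTER_BITS = 32                  # guessed counter width

rollover_seconds = (2 ** COUNTER_BITS) * TICK_SECONDS
rollover_days = rollover_seconds / 86_400
rollover_months = rollover_days / 30.44   # average month length in days

# Roughly 497 days, or a little over 16 months.
print(f"{rollover_days:.1f} days, {rollover_months:.1f} months")
```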

  6. headrush

    Why doesn't the calling function test that the booster is actually on instead of just assuming it all went according to plan? If it had to receive an ack from earth after each transmission but never received one it should be trying to repeat the exercise.

    I'm sure the engineers contemplate many "what if" scenarios so surely this should have been spotted.
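
    Something along these lines is what I have in mind: command, read back the state, retry if it didn't take. FakeAmplifier, switch_on, is_on and ensure_on are all hypothetical names for the sketch; nothing here reflects how the real on-board procedure talks to the hardware.

```python
class FakeAmplifier:
    """Hypothetical stand-in that silently drops its first few switch-on commands."""

    def __init__(self, failures_before_success: int = 0):
        self._failures_left = failures_before_success
        self._on = False

    def switch_on(self) -> None:
        if self._failures_left > 0:
            self._failures_left -= 1   # command lost, state unchanged
            return
        self._on = True

    def is_on(self) -> bool:
        return self._on


def ensure_on(amp, max_attempts: int = 3) -> bool:
    """Command the amplifier on and verify by readback, retrying a bounded number of times."""
    for _ in range(max_attempts):
        amp.switch_on()
        if amp.is_on():
            return True
    return False                       # caller should escalate, not assume success


print(ensure_on(FakeAmplifier(failures_before_success=2)))  # True
print(ensure_on(FakeAmplifier(failures_before_success=5)))  # False
```

    The bounded retry matters: an unbounded loop would just swap a silent failure for a hung task.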

    1. MiguelC Silver badge
      Facepalm

      Why?

      because hindsight is 20/20, while foresight, well.... [see icon]

      1. Claptrap314 Silver badge

        Re: Why?

        This was as hard to miss as an elephant in the middle of a two lane road. You got this roll over reset.

        1) They did not audit their task list to see what tasks might be inconvenienced by this action.

        2) They did not throw the reset into a task queue to ensure that nothing important was going on during the reset.

        3) They did not raise a warning to prevent sensitive tasks from starting when the counter approached rollover.

        4) They created an inadequate monitor for THE most critical function of the device.

        5) Along with the inadequate monitor, their recovery software failed to account for the most basic failure mode, or to implement an OFFON process automatically.
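
        Point 3 is not much code. A minimal sketch, assuming a 32-bit free-running tick counter (an assumption, as is every name below): refuse to start a task that would still be running when the counter wraps, and defer it until just after rollover.

```python
COUNTER_BITS = 32                  # assumed counter width
ROLLOVER = 1 << COUNTER_BITS       # counter wraps to 0 on reaching this value


def safe_to_start(counter: int, task_ticks: int, margin_ticks: int = 0) -> bool:
    """True if the task would finish, with margin, before the counter wraps."""
    return counter + task_ticks + margin_ticks < ROLLOVER


def next_start(counter: int, task_ticks: int, margin_ticks: int = 0) -> int:
    """Counter value at which to start: now if safe, otherwise tick 0 after rollover."""
    return counter if safe_to_start(counter, task_ticks, margin_ticks) else 0


# 100 ticks before rollover: a 50-tick task fits, a 200-tick task is deferred.
print(next_start(ROLLOVER - 100, 50) == ROLLOVER - 100)  # True
print(next_start(ROLLOVER - 100, 200))                   # 0
```

        Deferring costs at most one task duration once per rollover period, which seems a fair price for not straddling the wrap at all.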

  7. DDiggler

    Easy fix

    The simple solution is:

    1. When the action to restart the timer happens add an additional line/command

    2. Add a "Launch the amplifier" command after the command to restart the timer has been issued

    - If the amplifier is running then the command to start it will do nothing as it will simply error out

    - If the amplifier is NOT running then the command to start it will do exactly that

    This would require the minimal change to code and might actually be a simple shell script change if that's the way things are done, if the OS is Linux or a variant
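
    As a sketch only (Python rather than shell, with a made-up Amplifier class and on_timer_restart hook, and no claim about what the spacecraft actually runs), the idea is an idempotent start command issued on every timer restart:

```python
class Amplifier:
    """Hypothetical amplifier whose start command is harmless when already running."""

    def __init__(self, running: bool = False):
        self.running = running

    def start(self) -> bool:
        """Return True if this call switched it on, False if it was already on."""
        if self.running:
            return False               # the "errors out" case: nothing changes
        self.running = True
        return True


def on_timer_restart(amp: Amplifier) -> bool:
    """Hook to run right after the timer restarts: always issue the start command."""
    return amp.start()


amp = Amplifier(running=False)
print(on_timer_restart(amp))   # True: it was off, now on
print(on_timer_restart(amp))   # False: already on, no effect
```

    One caveat with this approach: blindly starting the amplifier ignores the forbidden pointing zones mentioned in the article, so the real fix would still need the scheduling logic to decide whether it should be on at that moment.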
