for all those involved in making Mars science better value for the money.
As the European Space Agency flicked the standby switch on some of its long-lived spacecraft in response to the COVID-19 outbreak, The Register figured it was time for a look at how the agency has kept its fleet flying far beyond expectations. Today, the veteran Mars Express orbiter. ESA's Mars Express (MEX) spacecraft is an …
It's a headline. They're always compact in the extreme. In newspapers, the headline is normally written by a different team than the one which writes the articles. That's one reason why there's often a contradiction between the headline and the content. As for [The] Register, I'm not at all sure but we're likely to find out.
I'm coming to the conclusion that there are two sorts of people: ordinary people (me, Bill Gates, everyone) and people involved in the details of doing stuff in or about space, who are somehow ... better than everyone else. The people this article talks about have taken a spacecraft orbiting Mars and repurposed all sorts of bits of it to make it last hugely longer than it was meant to; the Apollo people ... well, OK, we know how amazing they were; and all space people seem to be like that. And then astronomers, whose purpose in life seems to be 'here's this absurd but theoretically possible idea ... which we have made work and have in production use, and in fact we're now embarking on some even more mad idea, which we will also make work': astronomical optical interferometry is absurd (you're making this huge machine which is accurate to a small fraction of a wavelength of light) ... but the EHT is just the exponential of absurd. And LIGO is just ridiculous.
All these people are amazing.
a lot of us on the team are British so it was mostly tea and biscuits.
the killer is the reboot to load it, as this performs a cold restart of the spacecraft and then immediately uses the software to kill any rates, find the sun, turn to earth and then turn on the transmitter.
so once that command leaves the ground there's no going back. it was 13 minutes to the spacecraft, then 13 minutes till we lost the signal, then about 30 minutes till we got the signal back.
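As a rough check on those numbers, the snippet below turns the quoted 13-minute one-way light time into an Earth-Mars distance and adds up the ground-side wait. It assumes (a reading of the comment, not something it states outright) that the second 13 minutes is the light time for the loss of signal to reach the ground, and that the ~30 minutes is counted from there; the function names are only for illustration.

```python
# Back-of-the-envelope timeline for the reboot described above.
# The 13-minute one-way light time and ~30-minute reacquisition are the
# figures quoted in the comment; how they add up is an assumption.

SPEED_OF_LIGHT_KM_S = 299_792.458  # exact, by SI definition
AU_KM = 149_597_870.7              # astronomical unit, exact by IAU definition


def distance_from_light_time(one_way_minutes: float) -> float:
    """Earth-Mars distance in km implied by a one-way light time."""
    return one_way_minutes * 60 * SPEED_OF_LIGHT_KM_S


def ground_wait_minutes(one_way_minutes: float, reacquire_minutes: float) -> float:
    """Time from sending the reboot command to seeing the carrier again.

    The command needs one light time to arrive; the resulting loss of
    signal needs another light time to be seen on the ground; then the
    ground waits for the restarted spacecraft to come back.
    """
    return 2 * one_way_minutes + reacquire_minutes


if __name__ == "__main__":
    d = distance_from_light_time(13)
    print(f"distance ~ {d / 1e6:.0f} million km ({d / AU_KM:.2f} AU)")
    print(f"total wait ~ {ground_wait_minutes(13, 30):.0f} minutes")
```

That prints roughly 234 million km (about 1.56 AU) and a 56-minute wait, which squares with "that hour of waiting".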
i think in that hour of waiting i must have got through 5 or 6 cups!
(i think there's a photo somewhere of one of the whiteboards where we were keeping score)
we worked through the night to recover the bulk of the spacecraft systems, and given we don't normally do this, staying as focused at 4am as you were 8 hours ago was going to be a challenge.
thankfully a friend had baked us a homemade take on golden crunch creams for good luck and they had so much sugar in them it turned out that staying awake was not an issue :-)
the beer was on the friday once the flight tests were complete.
Red, thanks for pointing that out.
Now, this requires a moderately large leap of faith.....but the 'orbit of another planet' part actually doesn't matter.
If you look at things from a system point of view, each node or functional entity in the system is remote from its neighbours/adjacencies to some degree. They may be separated by 5 Meters of utterly reliable Ethernet cable via a low-latency switch, or many millions of miles of comparatively 'risky' and stupidly-high-latency hops via the Deep Space Network.
Once you've appreciated that, and done the appropriate design, modelling, and completed a confidence-giving 'review and test' process, remote software update is not as scary as you may think. Of course, I've only ever done terrestrial upgrades myself*, and no doubt I would have a seriously twitchy bottom doing what these guys did, not to mention a seriously bad case of pride in what I had achieved when it worked.
Bottom line for me here is: muchos kudos and respect to the folks managing the mission software and science etc., but at least as much kudos is due to the folks in the DSN organisation.
You can't do jack without that bit of ethernet cable. Also, the connectors used at each end of the link aren't moving, and you can always see from one end of the fibre to the other with no interruptions.
Have a look at what spacecraft/missions the DSN enables.
Lower layers to the rescue, yet again :-)
[ * : Despite the handle ]
Now, this requires a moderately large leap of faith.....but the 'orbit of another planet' part actually doesn't matter.
You know, it really does matter. If the device you are blowing new software into is 'separated by 5 Meters of utterly reliable Ethernet cable via a low-latency switch' from whatever you are blowing the software with, then when you fuck it up you go to wherever the system is and change whatever jumper you need to change, or replace whatever EPROM you need to replace, and you're done. It's kind of embarrassing, but it's not catastrophic, and probably lots of people have needed to do this: certainly I have. If, on the other hand, the thing you are blowing software onto is in orbit around Mars (or, for that matter, in LEO), and you fuck it up, then it's gone, for good. Oh, and the machine you've just terminally written off cost a few hundred million dollars, not a few thousand to a few hundred thousand.
the truth is it's a bit of both: making changes like this can be risky given the limited information available on the ground.
for example, we have a suspicion about what the issue is with the SSMM, but without being able to take it apart and inspect the boards we can never be completely sure. we just know what the side effects are. that's part of the reason that software changes like this are usually a last resort. an 'operational work around' is often preferred as a first attempt.
but if you have to change it you test, you check like crazy and always make sure you have a plan-b in case it goes wrong.
once you're ready for uplink, it depends on exactly what it is you're changing.
firstly, if the change is small (e.g. just a single function) you could patch it in RAM directly without having to stop the software. this has advantages as no reboot is needed, and if there is an issue with the change then a restart clears it.
for gyroless the change was too big. so once we're happy with the new software, it's uplinked to the backup flight computer, on the backup part of the eeprom (each one contains 2 software images).
once done, the entire eeprom is dumped from the spacecraft and checked to confirm it is bit-for-bit identical to what it should be.
in addition, all the other eeproms were dumped and loaded onto the simulator so that it was bit-for-bit accurate to what was onboard, and the restart procedure was tested several times.
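The "bit for bit identical" check amounts to comparing the dumped image against the reference image byte by byte. A minimal sketch, assuming both images are available on the ground as raw byte strings; the function names are hypothetical and this is not ESA's actual tooling.

```python
import hashlib
from typing import Optional


def first_mismatch(expected: bytes, dumped: bytes) -> Optional[int]:
    """Offset of the first differing byte, or None if the images are identical.

    If one image is a prefix of the other, the mismatch is reported at the
    end of the shorter one (a truncated dump is still a failed dump).
    """
    for offset, (a, b) in enumerate(zip(expected, dumped)):
        if a != b:
            return offset
    if len(expected) != len(dumped):
        return min(len(expected), len(dumped))
    return None


def verify_dump(expected: bytes, dumped: bytes) -> str:
    """Raise if the dump differs; otherwise return a SHA-256 digest for the log."""
    offset = first_mismatch(expected, dumped)
    if offset is not None:
        raise ValueError(f"dump differs from reference at byte 0x{offset:X}")
    # The byte comparison above is the actual check; the digest is just a
    # compact way to record which image was confirmed.
    return hashlib.sha256(dumped).hexdigest()
```

The same comparison would apply to the simulator load: each EEPROM image fed to the simulator should pass `verify_dump` against what was dumped from the spacecraft.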
once that is confirmed, the spacecraft is configured so that at the next restart it will use the backup computer and, crucially, the subsequent reboot will use the prime. only then is the backup computer configuration updated to point from the current software image to the new one.
so we're in a state now where if the new software doesn't work or something else goes wrong, we know that on the second attempt it will boot what it's currently using and know to be good.
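The ordering in the last two paragraphs is what makes the update recoverable. Here is a toy model of it, assuming two flight computers that each hold two image slots; the class, the slot names and the version labels are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FlightComputer:
    name: str
    images: dict      # EEPROM slot -> software version stored there
    boot_slot: str    # slot this computer loads when it starts

    def boot_image(self) -> str:
        return self.images[self.boot_slot]


def prepare_update(prime: FlightComputer, backup: FlightComputer, new_slot: str) -> list:
    """Set up restarts so a bad new image falls back to known-good software.

    Mirrors the order described: first fix the restart sequence (backup
    next, prime after that), and only then point the backup computer at
    the new image. The prime computer is never touched.
    """
    restart_sequence = [backup, prime]
    backup.boot_slot = new_slot
    return restart_sequence


def boot_attempts(restart_sequence: list) -> list:
    """(computer, version) loaded on each successive restart."""
    return [(c.name, c.boot_image()) for c in restart_sequence]


prime = FlightComputer("prime", {"A": "old", "B": "old"}, boot_slot="A")
backup = FlightComputer("backup", {"A": "old", "B": "new"}, boot_slot="A")
print(boot_attempts(prepare_update(prime, backup, new_slot="B")))
```

This prints `[('backup', 'new'), ('prime', 'old')]`: if the first restart with the new software fails, the second one lands on the untouched prime computer and its known-good image.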
in addition to this, spacecraft have built-in protection to give them the best chance of recovering from a hardware or software failure, at least to the point where contact can be established.
if a hardware fault forces a full automated cold restart, you don't want the spacecraft continuously trying to boot up hardware that is broken. for units of lower criticality there are tables stored in eeproms that record which units are declared safe, and these are then selected by the software when it starts.
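For the lower-criticality units, the selection at start-up can be pictured as "take the first unit in the table that is declared safe". A sketch under that assumption; the table layout and unit names are hypothetical, since the comment doesn't describe what the real MEX tables look like.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical health table: for each function, redundant units in order
# of preference, each with a flag saying whether it is declared safe.
HealthTable = Dict[str, List[Tuple[str, bool]]]

HEALTH_TABLE: HealthTable = {
    "star_tracker": [("STR-A", False), ("STR-B", True)],
    "transponder": [("TX-1", True), ("TX-2", True)],
}


def select_units(table: HealthTable) -> Dict[str, Optional[str]]:
    """Pick the first unit declared safe for each function at software start."""
    selection: Dict[str, Optional[str]] = {}
    for function, units in table.items():
        # next() falls back to None when no unit of this function is declared safe
        selection[function] = next((unit for unit, safe in units if safe), None)
    return selection


print(select_units(HEALTH_TABLE))  # STR-A is skipped because it is flagged unsafe
```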
for something as crucial as the flight computers, the data bus, the clock etc - ie things that need to be chosen before the software starts, the firmware has a lookup table.
this table contains a list of pretty much every combination of cpu, bus, clock etc. in the event of a restart, the firmware sets a series of switches based on the current table entry, which control which units are turned on. then it moves the pointer to the next entry, so each restart cycle will start with a different combination - the idea being it'll keep going till it finds one that works.
there is a final entry in each row of the table with a flag stating whether the eeproms should be used. eventually the firmware pointer reaches a table entry where this is set to false. in this case the software on all flight eeproms is suspected to be corrupt, and it will instead load a basic version of the flight software burnt into the flight computer. this version isn't sufficient to fly the mission, but it'll allow contact to be established and give ground a fighting chance of figuring out what's gone wrong.
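The firmware behaviour in the last three paragraphs can be modelled as a pointer walking through a table. This is a toy sketch: the unit names, the table size and what happens at the very end of the table are all assumptions, not details of the MEX firmware.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ReconfigEntry:
    cpu: str
    bus: str
    clock: str
    use_eeprom: bool  # False -> load the basic software built into the flight computer


def build_table() -> List[ReconfigEntry]:
    """Every CPU/bus/clock combination with EEPROM software, then the same again without."""
    combos = list(product(["CPU-A", "CPU-B"], ["BUS-A", "BUS-B"], ["CLK-A", "CLK-B"]))
    with_eeprom = [ReconfigEntry(c, b, k, True) for c, b, k in combos]
    without_eeprom = [ReconfigEntry(c, b, k, False) for c, b, k in combos]
    return with_eeprom + without_eeprom


class ReconfigFirmware:
    def __init__(self, table: List[ReconfigEntry]):
        self.table = table
        self.pointer = 0

    def restart(self) -> Tuple[ReconfigEntry, str]:
        """Apply the current entry, then advance the pointer for the next restart."""
        entry = self.table[self.pointer]
        # Assumption: stay on the last entry rather than wrapping around.
        self.pointer = min(self.pointer + 1, len(self.table) - 1)
        software = "EEPROM flight software" if entry.use_eeprom else "built-in basic software"
        return entry, software


def restart_until_working(
    fw: ReconfigFirmware, works: Callable[[ReconfigEntry], bool]
) -> Tuple[int, ReconfigEntry, str]:
    """Keep restarting until `works(entry)` says the combination comes up."""
    for attempt in range(1, len(fw.table) + 1):
        entry, software = fw.restart()
        if works(entry):
            return attempt, entry, software
    raise RuntimeError("no working combination in the table")


# Example: CPU-A has failed, so the first four entries never come up.
print(restart_until_working(ReconfigFirmware(build_table()), lambda e: e.cpu != "CPU-A"))
```

With CPU-A dead, the fifth restart picks CPU-B and still loads the EEPROM software; if instead every EEPROM image were bad, the ninth restart would be the first to fall back to the built-in basic software.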
still, even with all of this, when you're rebooting a spacecraft as old as MEX, it's a long 45 mins waiting for that signal.
if the device you are reflashing a boot loader onto is entirely encased in plastic, it's kinda the same thing, but in my case those devices were only ~$100 each, and not in orbit around Mars. Still, for GPL compliance the end user needed to be able to do the same thing (even worse for breath-holding, I think) successfully, or create an expensive paperweight in the process. The only way to attach hardware to recover the device involved drilling holes accurately and soldering wires through those holes onto the board, probably defeating the purpose of potting it in plastic in the first place (waterproofing for ocean operation).
So yeah, major emphasis on "get it right the first time" and (somewhat) necessary nervous behavior and unusual religious activity might be involved.
Bit late to the comments here :-( but is it about time (groan) that we had a Reg unit of time for space mission extensions beyond the original mission plan?
The obvious name for the unit is a 'Voyager' - except that Voyager is still going and so is an undefined period of time, and it might also be better used as a unit of (undefined) distance. :-)
This unit is in fact a dimensionless number, the Opportunity. From the SI standard, appendix 4:
The Opportunity (symbol 'oppy') is defined as the exact ratio 5111/90, which is the number of sols that Opportunity survived on Mars divided by the number of sols the mission was planned to last. A Spirit, defined as 368/15 (equal to 2208/90) is a little less than half an Opportunity, and is used in some unit systems instead. The Opportunity is the preferred unit.
MEX has currently reached about 0.15 oppys.
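Taking the joke at face value, converting a mission's "actual / planned duration" ratio into oppys is a single division. A small sketch; the function names are made up, and the only inputs are the ratios quoted above.

```python
from fractions import Fraction

OPPY = Fraction(5111, 90)   # Opportunity: 5111 sols survived / 90 planned
SPIRIT = Fraction(368, 15)  # Spirit: 2208 / 90, reduced


def to_oppys(actual: float, planned: float) -> float:
    """Express actual/planned mission duration in oppys."""
    return (actual / planned) / OPPY


def lifetime_ratio(oppys: float) -> float:
    """Inverse: how many times its planned duration a mission has lasted."""
    return oppys * OPPY


print(f"Spirit = {float(SPIRIT / OPPY):.3f} oppys")
print(f"0.15 oppys = {lifetime_ratio(0.15):.1f} x planned duration")
```

So Spirit comes out at about 0.432 oppys, and MEX's 0.15 oppys means it has run roughly 8.5 times its planned duration.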
I forgot to mention in my previous reply that the unit actually used by practitioners is usually the 'nol': 'normalised oppy, logarithmic'. The value in nols is defined as the natural logarithm of the value in oppys, plus ln(5111/90), plus 1. It is easy to check that a mission with a nol value of 1 lasted for its expected time, a mission which was a total failure has a nol value of negative infinity, and Opportunity had a nol value of 1 + ln(5111/90). nols are particularly useful as they are around 1 for the expected mission duration, where the value in oppys is tiny (90/5111, about 0.018). nols are not SI units.
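The nol definition is easy to sanity-check in code. Note that ln(oppys) + ln(5111/90) + 1 simplifies to 1 + ln(actual/planned), which is why an on-schedule mission scores exactly 1. A sketch (the function name is invented):

```python
import math

OPPY_RATIO = 5111 / 90  # one oppy, as a lifetime ratio


def oppys_to_nols(oppys: float) -> float:
    """nol = ln(oppys) + ln(5111/90) + 1; a total failure (0 oppys) is -inf."""
    if oppys == 0:
        return -math.inf  # math.log(0) raises ValueError, so handle the limit explicitly
    return math.log(oppys) + math.log(OPPY_RATIO) + 1


print(oppys_to_nols(90 / 5111))  # mission lasted exactly as planned: ~1.0
print(oppys_to_nols(1))          # Opportunity: 1 + ln(5111/90), about 5.04
print(oppys_to_nols(0.15))       # MEX at 0.15 oppys: about 3.14
```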
I'm hugely impressed with Airbus and their ex-employees too.
Imagine the calls they made to retired engineers: "Hi Mary/Mike, you remember that MEX project you worked on back in xxx? Well, we're putting the team back together."
Where's an A-Team icon when you need it (black van driving through an explosion)?
Support means "we'll take the call".
Maintained means "we can provide a correction".
I support lots of products where the developers no longer want to compile the source.
My clients understand what box they're in, but they're still paying the yearly tithe to be able to call me.
it depends on the mission,
quite often, the manufacturer provides support during the primary phase of the mission. the length of this varies from spacecraft to spacecraft, but it's not too different from any industrial software project.
you'll have people available to investigate issues that crop up, fix bugs (under warranty if there is one), and make updates if new features are needed. the main difference with spacecraft compared to more terrestrial systems is that if it's a hardware fault, you need to come up with a software solution to work around it.
this front line support lasts either as long as it's needed or, if the prime mission is done and the spacecraft keeps getting extended, as long as it's practically and financially viable from both the provider's and the end users' point of view. (though i don't think anyone ever really throws anything out, just in case...)
Absolutely amazing tale of amazing engineers pulling off some amazing stuff.
Yes I have written quite some code, and some of it is even used in space research, but these people are absolutely amazing.
I'll raise a glass this evening, and doff my hat right now (grey Tilley, it's sunny).
Biting the hand that feeds IT © 1998–2020