
Perhaps space eggheads should formally verify their programs.
NASA has explained what caused communication issues with its CAPSTONE spacecraft: a bug in the code. CAPSTONE (Cislunar Autonomous Positioning System Technology Operations and Navigation Experiment) was launched atop a Rocket Lab Electron in June and on July 4 the company's Photon spacecraft deployed CAPSTONE for a several …
Assuming you can formally prove the software*, all that does is show that the software matches the specifications. Not that the specifications are right in the first place.
[*] Formally proving software isn't easy. It's not as if, in your IDE, you go Tools -> Prove and get a simple box back saying "Proved Correct".
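To put that concretely, here is a minimal Lean 4 sketch (the names mySort and mySort_spec are invented for illustration): the implementation does no sorting at all, yet it gets a machine-checked proof of "correctness" because the spec only asks that the output be the same length as the input.

    -- The "spec" only demands a length-preserving output, so a function that
    -- does no sorting at all is still formally "proved correct" against it.
    def mySort (xs : List Nat) : List Nat := xs   -- doesn't sort anything

    theorem mySort_spec (xs : List Nat) : (mySort xs).length = xs.length := rfl

The proof itself is perfectly sound; the guarantee is only as strong as the theorem you chose to state, which is exactly the point above.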
Sure, but it costs two arms, a leg, an eye, a kidney and a large bit of the liver...
In telecom 30 years ago, unit tests of each and every path in the code were the norm (and they were mostly automated) on testbeds (read: PSTN or MSC exchanges used as testbeds). Then a lighter 'Integration' campaign was run on the first exchange carrying the new software (and eventually the new hardware tied to it), which did include traffic overload tests. And even then there were still funky, weird cases of bugs that showed up months after the software was deployed everywhere.
The whole process took around two years: one year of development and testbed tests, six months of integration/commissioning on a new exchange, three months of local, on-call babysitting by $TELCO equipment builder techs, and three months to roll the new software out to the rest of the exchanges in a country.
It beggars belief that seemingly basic errors still happen today.
Surely the writers of the test harnesses should cover all possible states and use cases without exception? And then another team, which only has sight of the requirements specification, should write a second test suite.
The cost of a lost mission makes a few software engineers' salaries chicken feed.
I agree, but I wonder if it is sometimes the case that there are so many trillions of possible permutations and combinations of circumstances that it isn't possible to test for all of them. Rather like the Swiss cheese model, where many different factors need to line up for a fatal error to occur:
https://en.wikipedia.org/wiki/Swiss_cheese_model
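To put a number on that combinatorial explosion (a back-of-the-envelope Python sketch, nothing to do with CAPSTONE specifically):

    # With just 50 independent yes/no conditions there are 2**50 combinations.
    conditions = 50
    print(f"{2 ** conditions:,}")  # 1,125,899,906,842,624 - over a quadrillion
    # Even at a million test cases per second, that is decades of wall-clock time:
    print(2 ** conditions / 1_000_000 / (3600 * 24 * 365))  # roughly 35.7 years

So exhaustively testing every combination of circumstances really is off the table; you have to prioritise and layer defences instead.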
Possible errors in any system are subject to chaos theory: the slightest difference in a starting condition can produce a totally different result. I am always amazed by how many digital electronics people are ignorant of the potential for metastability conditions - particularly when an asynchronous signal is fed into a clocked gate.
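The sensitivity point is easy to demonstrate with a toy (this is just the textbook logistic map in Python, nothing to do with spacecraft code):

    # Two starting values differing by one part in a billion end up
    # completely unrelated after a few dozen iterations of a chaotic map.
    def logistic(x, r=4.0, steps=50):
        for _ in range(steps):
            x = r * x * (1 - x)
        return x

    print(logistic(0.200000000))  # one trajectory...
    print(logistic(0.200000001))  # ...and a wildly different one

Metastability is the hardware version of the same sensitivity: an asynchronous edge arriving a whisker earlier or later at a clocked element can decide which way the output eventually settles.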
To me, the biggest takeaway is that they agree on what the software is supposed to do from the outset, before they even think of writing code. How many of us have been involved with software projects where the specs keep on changing? If you can have agreed specs, it makes life so much easier.
(The "don't blame the person, blame the process" culture is important too: It allows everyone to learn from mistakes)
There is usually a "V&V" process invoked as part of system development.
V for verification: did they build it right?
V for validation: did they build the right system?
The failsafes like radios seeing no action for a few hours or inactive attitude control are probably lessons learned from long ago.
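That kind of command-loss failsafe is simple enough to sketch. A hedged Python toy (not CAPSTONE's actual flight code; the class name, callback and threshold are all made up) of the basic idea - if nothing is heard from the ground for too long, assume the problem is at our end and power-cycle the radio:

    import time

    COMMAND_LOSS_TIMEOUT_S = 4 * 3600   # assumed threshold: a few hours of silence

    class CommandLossWatchdog:
        """Toy command-loss timer; illustrative only."""
        def __init__(self, reboot_radio):
            self.reboot_radio = reboot_radio       # callback that power-cycles the radio
            self.last_contact = time.monotonic()

        def note_ground_contact(self):
            # Call this whenever a valid command arrives from the ground.
            self.last_contact = time.monotonic()

        def check(self):
            # Call this periodically from the flight software's main loop.
            if time.monotonic() - self.last_contact > COMMAND_LOSS_TIMEOUT_S:
                self.reboot_radio()                # assume the silence is our fault
                self.last_contact = time.monotonic()

The appeal of such a rule is that it needs no diagnosis at all: prolonged silence by itself is treated as evidence that something on board has gone wrong.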
Space is hard.
My friends often laugh at me for having at least a Plan B when I do anything. My career in IT taught me that you cater for the specific things you know can go wrong - and also try to handle the contingency of a generic failure caused by an unexpected condition: what Donald Rumsfeld correctly called "the unknown unknowns".
I always worked with three plans: Plan A; contingency Plan B, where there was a chance that the work could still be completed with some additional steps; and the back-out plan.
Of course, when I had a problem, there was always a conflict between Plan B and the back-out plan. Where you have a time-critical service, the service managers don't really like using the contingency plan if it eats into the time needed to restore the system to the condition it was in before the work.
This is not quite so easy when your asset is in a remote (in this case, a really remote) location.
"Whats Happens If
The Fault Detection System develops a fault?"
You jest, but when I was working at a job fixing audio equipment, too many times the protection circuitry in a power amp was the cause of the problem. While blowing up speakers isn't good, they were often cheaper to fix than an amplifier with some weird problem.
What I can glean from the primary sources - Advanced Space and NASA Ames:
The problem started with a ground controller sending a misformatted query to CAPSTONE.
The radio software detected this and shut down, which was probably intended.
The fault detection software didn't perceive the radio shutdown as a fault, which it should have. It is unclear whether the radio software ever told it about the fault, so the misprogramming could have been in either the radio software or the fault software.
The flight software eventually cleared the fault, probably because it couldn't contact Earth, probably by instituting the reboot that the fault software was expected to do, and probably via the fault software. While it was doing that, it kept CAPSTONE on course.
Note that the boffins at mission control didn't do anything to make any of this happen (other than fat-fingering their query).
All in all, this is an example of good software, with redundancies and recovery plans baked in.
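For anyone who thinks better in code, here is a hedged Python sketch of that chain of events (every name and threshold is invented; CAPSTONE's real flight software isn't public):

    # Mirrors the sequence described above, not the actual architecture.

    class FaultMonitor:
        def radio_went_down(self):
            # The bug lived somewhere around here: either the radio never made
            # this call, or this handler failed to treat the shutdown as a
            # fault and kick off recovery.
            pass

    class Radio:
        def __init__(self, fault_monitor):
            self.fault_monitor = fault_monitor
            self.powered = True

        def handle_command(self, packet):
            if not self.is_valid(packet):
                self.powered = False               # shut down on the malformed query
                self.fault_monitor.radio_went_down()

        @staticmethod
        def is_valid(packet):
            return isinstance(packet, bytes) and len(packet) > 0   # stand-in check

    class FlightSoftware:
        """Autonomy layer: after too long without ground contact, reboot the radio."""
        SILENT_HOURS_LIMIT = 24                    # made-up threshold

        def __init__(self, radio):
            self.radio = radio
            self.silent_hours = 0

        def hourly_tick(self, heard_from_ground):
            self.silent_hours = 0 if heard_from_ground else self.silent_hours + 1
            if self.silent_hours >= self.SILENT_HOURS_LIMIT:
                self.radio.powered = True          # the reboot that finally cleared the fault
                self.silent_hours = 0
            # attitude control and trajectory kept running all the while

Once the malformed packet has knocked the radio out, nothing recovers it until the autonomy rule times out, which lines up with the prolonged outage described.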
What they know is in the middle of the article:
https://advancedspace.com/capstone-tcm1-success/