Reply to post: Using COTS instead of rad-hard devices.

HPE supercomputer is still crunching numbers in space after 340 days

BobC

Using COTS instead of rad-hard devices.

Electronic components hardened to tolerate radiation exposure are unbelievably expensive. Even cheap "rad-hard" parts can easily cost 20x their commercial relatives. And 1000x is not at all uncommon!

There have been immense efforts to find ways to make COTS (commercial off-the-shelf) parts and equipment better tolerate the rigors of use in space. This has been going on ever since the start of the world's space programs, especially after the Van Allen radiation belts were discovered in 1958.

I was fortunate to have participated in one small project in 1999-2000 to design dirt-cheap avionics for a small set of tiny (1 kg) disposable short-lived satellites. I'll describe a few highlights of that process.

First, it is important to understand how radiation affects electronics. There are two basic kinds of damage radiation can cause: Bit-flips and punch-throughs (I'm intentionally not using the technical terms for such faults). First bit-flips: If you have ECC (error-correcting) memory, which many conventional servers do, bit-flips there can be automatically detected and fixed. However, if a bit-flip occurs in bulk logic or CPU registers, a software fault can result. The "fix" here is to have at least 3 processors running identical code, then performing multi-way comparison "voting" of the results. If a bit-flip is found or suspected, a system reset will generally clear it.

Then there are the punch-throughs, where radiation creates an ionized path through multiple silicon layers that becomes a short-circuit between power and ground, quickly leading to overheating and the Release of the Sacred Black Smoke. The fix here is to add current monitoring to each major chip (especially the MCU) and also to collections of smaller chips. This circuitry is analog, which is inherently less sensitive to radiation than digital circuits. When an abnormal current spike is detected, the power supply will be temporally turned off long enough to let the ionized area recombine (20ms-100ms), after which power may then be safely restored and the system restarted.

Second, we must know the specific kinds of radiation we need to worry about. In LEO (Low Earth Orbit), where our satellites would be, our biggest concern was Cosmic Rays, particles racing at near light-speed with immense energies, easily capable of creating punch-throughs. (The Van Allen belts shields LEO from most other radiation.) We also need to worry about less energetic radiation, but it's less than the level of a dental X-Ray.

With that information in hand, next came system design and part selection. Since part selection influences the design (and vice-versa), these phases occur in parallel. However, CPU selection came first, since so many other parts depend on the specific CPU being used.

Here is where a little bit of sleuthing saved us tons of time and money. We first built a list of all the rad-hard processors in production, then looked at their certification documents to learn on which semiconductor production lines they were being produced. We then looked to see what other components were also produced on those lines, and checked if any of them were commercial microprocessors.

We lucked out, and found one processor that not only had great odds of being what we called "rad-hard-ish" (soon shortened to "radish"), but also met all our other mission needs! We did a quick system circuit design, and found sources for most of our other chips that were also "radish". We had further luck when it turned out half of them were available from our processor vendor!

Then we got stupid-lucky when the vendor's eval board for that processor also included many of those parts. Amazingly good fortune. Never before or since have I worked on a project having so much luck.

Still, having parts we hoped were "radish" didn't mean they actually were. We had to do some real-world radiation testing. Cosmic Rays were the only show-stopper: Unfortunately, science has yet to find a way to create Cosmic Rays on Earth! Fortunately, heavy ions accelerated to extreme speeds can serve as stand-ins for Cosmic Rays. But they can't easily pass through the plastic chip package, so we had to remove the top (a process called "de-lidding") to expose the IC beneath.

Then we had to find a source of fast, heavy ions. Which the Brookhaven National Labs on Long Island happens to possess in their Tandem Van de Graaf facility (https://en.wikipedia.org/wiki/Tandem_Van_de_Graaff). We were AGAIN fantastically lucky to arrange to get some "piggy-back' time on another company's experiments so we could put our de-lidded eval boards into the vacuum chamber for exposure to the beam. Unfortunately, this time was between 2 and 4 AM.

Whatever - we don't look gift horses in the mouth. Especially when we're having so much luck.

I wrote test software that exercised all parts of the board and exchanged its results with an identical eval board that was running outside the beam. Whenever the results differed (meaning a bit-flip error was detected), both processors would be reset (to keep them in sync). We also monitored the power used by each eval board, and briefly interrupted power when the current consumption differed by a specific margin.

The tests showed the processor wasn't very rad-hard. In fact, it was kind of marginal, despite being far better than any other COTS processor we were aware of. Statistically, in the worst case we could expect to see Cosmic Ray no more than once per second during the duration of the mission. Our software application needed to complete its main loop AT LEAST once every second, and in the worst case took 600 ms to run. But a power trip took 100 ms, and a reboot took 500 ms. We were 200 ms short! Missing even a single processing loop iteration could cause the satellite to lose critical information, enough to jeopardize the mission.

All was not lost! I was able to use a bunch of embedded programming tricks to get the cold boot time down to less than 100 ms. The first and most important "trick" was to eliminate the operating system and program to the "bare metal": I wrote a nano-RTOS that provided only the few OS services the application needed.

When the PCBs were made, the name "Radish" was prominently displayed in the top copper layer. We chose to keep the source of the name confidential, and invented an alternate history for it.

Then we found we had badly overshot our weight/volume budget (which ALWAYS happens during satellite design), and wouldn't have room for three instances of the processor board. A small hardware and software change allowed us to go with just a single processor board with only a very small risk increase for the mission. Yes, we got lucky yet again.

I forgot to mention the project had even more outrageous luck right from the start: We had a free ride to space! We were to be ejected from a Russian resupply vehicle after it left the space station.

Unfortunately, the space station involved was Mir (not ISS), and Mir was deactivated early when Russia was encouraged to shift all its focus to the ISS. The US frowned on "free" rides to the ISS, and was certainly not going to allow any uncertified micro-satellites developed by a tiny team on an infinitesimal budget (compared to "real" satellites) anywhere near the ISS.

We lost our ride just before we started building the prototype satellite, so not much hardware existed when our ride (and thus the project) was canceled. I still have a bunch of those eval boards in my closet: I can't seem to let them go!

It's been 18 years, and I wonder if it would be worth reviving the project and asking Elon Musk or Jeff Bezos for a ride...

Anyhow, I'm not at all surprised a massively parallel COTS computer would endure in LEO.

POST COMMENT House rules

Not a member of The Register? Create a new account here.

  • Enter your comment

  • Add an icon

Anonymous cowards cannot choose their icon