Artemis II fault tolerance

Communications of the ACM had a fascinating post about how NASA built Artemis II’s fault-tolerant computer. Three excerpts stood out.

(1) Eight CPUs with several backup scenarios: “Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self-checking pair of processors.”

Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a “fail-silent” design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds.

“We can lose three FCMs in 22 seconds and still ride through safely on the last FCM,” said Uitenbroek. A silenced FCM doesn’t become dead weight, however; the system is designed to reset, re-synchronize its state with the operating modules, and re-join the group mid-flight.
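To make the fail-silent idea concrete for myself, here is a minimal Python sketch of a self-checking pair. It is a toy model under my own assumptions, not NASA's implementation: `SelfCheckingPair`, `inject_upset`, `control_law`, and `first_valid_output` are all hypothetical names, and the "radiation event" is simulated by flipping one bit in one lane's result.

```python
from typing import Callable, Iterable, Optional


class SelfCheckingPair:
    """Toy model of one FCM: two lanes compute the same step and compare."""

    def __init__(self, compute: Callable[[int], int]) -> None:
        self.compute = compute
        self.silenced = False
        self._upset_pending = False

    def inject_upset(self) -> None:
        """Simulate a one-off bit flip in lane B on the next step (test hook)."""
        self._upset_pending = True

    def step(self, value: int) -> Optional[int]:
        """Return the agreed output, or None if the module is silent."""
        if self.silenced:
            return None
        lane_a = self.compute(value)
        lane_b = self.compute(value)
        if self._upset_pending:
            lane_b ^= 0b100  # corrupt a single bit in one lane only
            self._upset_pending = False
        if lane_a != lane_b:
            # Disagreement: emit nothing rather than risk a wrong command.
            self.silenced = True
            return None
        return lane_a

    def reset(self) -> None:
        """Stand-in for reset and re-synchronization before rejoining."""
        self.silenced = False


def control_law(x: int) -> int:
    """Hypothetical flight computation; any deterministic function works."""
    return 2 * x + 1


def first_valid_output(modules: Iterable[SelfCheckingPair], value: int) -> Optional[int]:
    """Use the output of the first module that is not silent."""
    for module in modules:
        out = module.step(value)
        if out is not None:
            return out
    return None
```

With four such modules, injecting an upset into three of them still yields the correct output from the fourth, and a silenced module produces output again after `reset()`. The design choice worth noticing is that a module never tries to guess which lane was right; it simply goes quiet and lets the others carry on.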

(2) Multiple redundancies with deterministic error-checking: “This architecture ensures that each FCM sees the same inputs, runs the same application code, and produces the same outputs,” said Uitenbroek. Every second, the drift of any individual FCM is measured and its local clock is recalibrated to the network’s ‘true’ time. If an application fails to meet its strict deadline, the module is automatically silenced, reset, and re-synchronized.
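Here is a rough sketch of those two checks, drift correction and deadline enforcement, as I understand them from the excerpt. Everything in it is an assumption for illustration: the `Module` class, the `drift_per_s` field, and the example timings are mine and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    drift_per_s: float  # hypothetical: seconds gained per second of real time
    local_time: float = 0.0
    silenced: bool = False

    def tick(self, dt: float) -> None:
        """Advance the local clock by dt seconds, including its drift."""
        self.local_time += dt * (1.0 + self.drift_per_s)

    def resync(self, network_time: float) -> float:
        """Measure the offset from network time, then correct the local clock."""
        offset = self.local_time - network_time
        self.local_time = network_time
        return offset

    def run_task(self, elapsed_s: float, deadline_s: float) -> bool:
        """Silence the module if a task overran its deadline.

        Returns True while the module is still allowed to produce output.
        """
        if elapsed_s > deadline_s:
            self.silenced = True
        return not self.silenced
```

For example, a module with `drift_per_s=1e-4` that ticks for one second reports an offset of about 0.0001 s at the next `resync(1.0)`, and a task that takes 12 ms against a 10 ms deadline silences the module. In the real system a silenced module would then be reset and re-synchronized, which this sketch leaves out.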

The hardware itself is also reinforced. The system employs triple-modular-redundant memory that self-corrects single-bit errors on every read. Even the network interface cards utilize two lanes of traffic that are constantly compared, ensuring that a bit flip in the communication fabric results in a fail-silent event rather than a corrupted command. The network itself is triple redundant with three separate planes, and all network switches employ self-checking strategies.
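The memory and network checks map onto two small, well-known techniques: a bitwise majority vote across three copies, and a straight comparison between two lanes. The sketch below shows both in Python. The function names are hypothetical, and the real hardware does this in logic rather than software.

```python
def tmr_read(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three stored copies of a word.

    Each output bit takes the value that at least two copies agree on,
    so a flipped bit in any single copy is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


def lanes_agree(lane_a: bytes, lane_b: bytes) -> bool:
    """Two-lane compare: accept a frame only if both lanes match exactly."""
    return lane_a == lane_b
```

If `word = 0b10110010` and one copy has bit 3 flipped, `tmr_read(word, word ^ 0b1000, word)` still returns `word`. The two mechanisms behave differently on purpose: three copies allow correction, while two lanes can only detect a mismatch, which is why a network bit flip leads to a fail-silent event rather than a repaired frame.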

(3) Dissimilar redundancies: While the four-FCM primary system is robust, NASA must still account for common mode failures—software bugs or catastrophic events that could theoretically impact all primary channels simultaneously.

To mitigate this, Orion carries a completely independent Backup Flight Software (BFS) system. This is a prime example of dissimilar redundancy. It is implemented on different hardware, runs a different operating system, and utilizes independently developed, simplified flight software.
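A small sketch helped me see why identical copies are not enough here. In the Python below, every primary channel runs the same code and therefore shares the same bug, while the backup is written differently and kept simpler. The functions, the bug, and the numbers are all invented for illustration and have nothing to do with Orion's actual flight software.

```python
from typing import Callable, Optional, Sequence, Tuple


def primary_channel(x: float) -> Optional[float]:
    """Hypothetical primary software with a common-mode bug at x == 0."""
    if x == 0:
        return None  # every identical copy fails silent on the same input
    return 100.0 / x


def backup_channel(x: float) -> float:
    """Independently written, simplified logic: clamp the input first."""
    return 100.0 / max(x, 1.0)


def command_with_backup(
    primaries: Sequence[Callable[[float], Optional[float]]],
    backup: Callable[[float], float],
    sensor_value: float,
) -> Tuple[str, float]:
    """Prefer any live primary channel; fall back to the dissimilar backup."""
    for channel in primaries:
        out = channel(sensor_value)
        if out is not None:
            return ("primary", out)
    return ("backup", backup(sensor_value))
```

With four copies of `primary_channel`, an input of `4.0` is handled by a primary and returns `("primary", 25.0)`. An input of `0.0` silences all four at once, and only the differently written backup still produces a command, `("backup", 100.0)`. Adding a fifth identical copy would not have helped.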

Even in a total power loss scenario—called a “dead bus”—Orion is designed to survive. If power is restored, the spacecraft enters a safe mode, in which the vehicle first stabilizes itself and then points its solar arrays at the Sun to recover power. Then, it orients its tail toward the Sun for thermal stability before attempting to re-establish communication with Earth. During such a failure, the crew can also take manual action to configure life support systems or don space suits.

Of course, building this level of redundancy into a technical architecture is expensive. Those costs make sense on a space mission.

That said, there’s a lot we can learn here about making space for redundancy planning that is appropriate to our own use cases.