Computers in Spaceflight: The NASA
Experience
- - Chapter Four -
- - Computers in the Space Shuttle
Avionics System -
-
- Computer synchronization and
redundancy management
- [100] One key goal
shaping the design of the Shuttle was "autonomy." Multiple
missions might be in space at the same time, and large crews, many
with nonpilot passengers, were to travel in space in craft much
more self-sufficient than ever before. These circumstances, the
desire for swift turnaround time between launches, and the need to
sustain mission success through several levels of component
failure meant that the Shuttle had to incorporate a large measure
of fault tolerance in its design. As a result, NASA could do what
would have been unthinkable 20 years earlier: put men on the
Shuttle's first test flight. The key factor in enabling NASA to
take such a risk was the redundancy built into the
orbiter60.
-
- Fault tolerance on the Shuttle is achieved
through a combination of redundancy and backup. Its five
general-purpose computers have reliability through redundancy,
rather than the expensive quality control employed in the Apollo
program61. Four of the computers, each loaded with identical
software, operate in what is termed the "redundant set" during
critical mission phases such as ascent and descent. The fifth,
since it only contains software to accomplish a "no frills" ascent
and descent, is a backup. The four actuators that drive the
hydraulics at each of the aerodynamic surfaces are also redundant,
as are the pairs of computers that control each of the three main
engines.
-
- Management of redundancy raised several
difficult questions. How are failures detected and certified?
Should the system be static or dynamic? Should the computers run
separately without communication and be used to replace the
primary computer one by one as failures occur? Could the
computers, if running together, stay in step? Should redundancy
management of the actuators be at the computer or subsystem level?
Fortunately, NASA experience on other aircraft and spacecraft
programs could provide data for making the final decisions.
-
-
- Redundant Precursors
-
-
- Several systems that incorporated
redundancy preceded the Shuttle. The computer used in the Saturn
booster instrument unit that contained the rocket's guidance
system used triple modular redundant (TMR) circuits, which means
that there was one computer with redundant components.
Disadvantages to using such circuits in larger computers
[101]
are that they are expensive to produce, and an event such as the
explosion on Apollo 13 could damage enough of the computer that it
ceases to function. By spreading redundancy among several simplex
circuit computers scattered in various parts of the spacecraft,
the effects of such catastrophic failures are
minimized62.
-
- Skylab's two computers each could perform
all the functions required on its mission. If one failed, the
other would automatically take over, but both computers were not
up and running simultaneously. The computer taking over would have
to find out where the other had left off by using the contents of
the 64-bit transfer register located in the common section built
with TMR circuits. The Skylab computers were able to have such a
relatively leisurely switch-over system because they were not
responsible for navigation or high-frequency flight control
functions. If there were a failure, it would be possible for the
Skylab to drift in its attitude without serious danger; the
Shuttle would have no such margin of safety.
-
-
-
-
- Figure 4-3. The F-8 aircraft that
proved the redundant set configuration planned for the Shuttle
would work. (NASA photo ECN-6988)
-
-
-
- The need for the redundant computers on
the Shuttle to process information simultaneously, while still
staying closely synchronized for rapid switch-over, seriously
challenged the designers of the system. Such a close
synchronization between computers had not been done before, and
its feasibility would have to be proven before NASA could make a
full commitment to a particular design. Most of the
[102]
necessary confidence resulted from a digital fly-by-wire testing
program NASA started at the Dryden Flight Research Center in the
early 1970s63. The first computer used in the F-8 "Crusader"
aircraft chosen for the program was a surplus AGC in simplex, with
an electronic analog backup. Later, the project engineers wanted a
duplex system using a more advanced computer. Johnson Space Center
avionics people noted the similarities between the digital
fly-by-wire program and the Shuttle. Dr. Kenneth Cox of JSC
suggested that Dryden go with a triplex system to move beyond
simple one-for-one redundancy. By coordinating procurement, NASA
outfitted both the F-8 aircraft and the Shuttle with AP-101
processors. Draper Laboratory produced software for the F-8, and
its flight tests proved the feasibility of computers operating in
synchronization, as it suffered several single point computer
failures but successfully flew on without loss of control. This
flight program did much to convince NASA of the viability of the
synchronization and redundancy management schemes developed for
the Shuttle.
-
-
- How Many Computers?
-
-
- One key question in redundancy planning is
how many computers are required to achieve the level of safety
desired. Using the concept of fail operational/fail
operational/fail-safe, five computers are needed. If one fails,
normal operations are still maintained. Two failures result in a
fail-safe situation, since the three remaining prevent the feared
standoff possible in dual computer systems (one is wrong, but
which?). Due to cost considerations of both equipment and time,
NASA decided to lower the requirement to fail
operational/fail-safe, which allowed the number of computers to be
reduced to four. Since five were already procured and designed
into the system, the fifth computer evolved into a backup system,
providing reduced but adequate functions for both ascent and
descent in a single memory load. NASA's decision to use four
computers has a basis in reliability projections done for
fly-by-wire aircraft. Triplex computer system failures were
expected to cause loss of aircraft three times in a million
flights, whereas quadruple computer system failures would cause
loss of aircraft only four times in a thousand million
flights64.
-
- At first the backup flight system computer
was not considered to be a permanent fixture. When safety level
requirements were lowered, some IBM and NASA people expected the
fifth computer to be removed after the Approach and Landing Test
phase of the Shuttle program and certainly after the flight test
phase (STS-1 through 4)65. However, the utility of the backup system as
insurance against a generic software error in the primary system
outweighed considerations of the savings in weight, power, and
complexity to be made by....
-
-
-
[103]
-
-
- Figure 4-4. The intercommunication
system used in the F-8 triplex computer system.
-
-
-
- [104] ....eliminating
it66. In fact, as the first Shuttle flights approached,
Arnold Aldrich, Director of the Shuttle Office at Johnson Space
Center, circulated a memo arguing for a sixth computer to be
carried along as a spare67! He pointed out that since 90% of avionics
component failures were expected to be computer failures and that
since a minimum of three computers and the backup should exist for
a nominal re-entry, aborts would then have to take place after one
failure. By carrying a spare computer preloaded with the entry
software, the primary system could be brought back to full
strength. The sixth computer was indeed carried on the first few
flights. In contrast with this "suspenders and belt" approach,
John R. Garman of the Johnson Space Center Spacecraft Software
Division said that "we probably did more damage to the system as a
whole by putting in the backup"68. He felt that the institution of the backup took
much of the pressure off the developers of the primary system. No
longer was their software solely responsible for survival of the
crew. Also, integrating the priority-interrupt-driven operating
system of the primary computers with the time-slice system of the
backup caused compromises to be made in the primary.
-
-
- Synchronization
-
-
- Computer synchronization proved to be the
most difficult task in producing the Shuttle's avionics.
Synchronizing redundant computers and comparing their current
states is the best way to decide if a failure has occurred. There
are two types of synchronization used by the Shuttle's computers
in determining which of them has failed: one for the redundant set
of computers established for ascent to orbit and descent from
orbit, and one for synchronizing a common set while in orbit. It
took several years in the early 1970s to discover a way to
accomplish these two synchronizations.
-
- The essence of Shuttle redundancy is that
each computer in the redundant set could do all the functions
necessary at a particular mission phase. For true redundancy to
take place, all computers must listen to all traffic on all buses,
even though they might be commanding just a few. That way they
know about all the data generated in the current phase. They must
also be processing that data at the same time the other computers
do. If there is a failure, then the failed computer could drop out
of the set without any functional degradation whatever. At the
start, the Shuttle's designers thought it would be possible to run
the redundant computers separately and then just compare answers
periodically to make sure that the data and calculations
matched69. As it turned out, small differences in the
oscillators that acted as clocks within the computers caused the
computers to get out of step fairly [105] quickly. The
Spacecraft Software Division formed a committee, headed by Garman,
made up of representatives from Johnson Space Center, Rockwell
International, Draper Laboratory and IBM Corporation, to study the
problem caused by oscillator drift70. Draper's people made the suggestion that the
computers be synchronized at input and output
points71.This concept was later expanded to also place
synchronization points at process changes, when the system makes a
transition from one software module to another. The decision to
put in the synchronization points "settled everyone's mind" on the
issue72.
-
- Intercomputer communication is what makes
the Shuttle's avionics system uniquely advanced over other forms
of parallel computing. The software required for redundancy
management uses just 3K of memory and around 5% or 6% of each
central processor's resources, which is a good trade for the
results obtained78. An increasing need for redundancy and fault
tolerance in non-avionics systems such as banks, using automatic
tellers and nationwide computer networks, proves the usefulness of
this system. But this type of synchronization is so little known
or understood by people outside the Shuttle program that carryover
applications will be delayed.
-
- One reason why the redundancy management
software was able to be kept to a minimum is that NASA decided to
move voting to the actuators, rather than to do it before commands
are sent on buses. Each actuator is quadruple redundant. If a
single computer fails, it continues to send commands to an
actuator until the crew takes it out of the redundant set. Since
the Shuttle's other three computers are sending apparently correct
commands to their actuators, the failed computer's commands are
physically out-voted79. Theoretically, the only serious possibility is
that three computers would fail simultaneously, thus negating the
effects of the voting. If that occurs, and if the proper warnings
are given, the crew can then engage the backup system simply by
pressing a button located on each of the forward rotational hand
controllers.
-
- Does the redundant set synchronization
work? As described, the F-8 version, with redundancy management
identical to the Shuttle, survived several in-flight computer
failures without mishap. On the first Shuttle Approach and Landing
Test flight, a computer failed just as the Enterprise was released
from the Boeing 747 carrier; yet the landing was still successful.
That incident did a lot to convince the astronaut pilots of the
viability of the concept.
-
- Synchronization and redundancy together
were the methods chosen to ensure the reliability of the Shuttle
avionics hardware. With the key hardware problems solved, NASA
turned to the task of specifying the most complex flight software
ever conceived.
-
-
- [106] Box 4-2:
Redundant Set Synchronization: Key to Reliability
-
- Synchronization of the redundant set works
like this: When the software accepts an input, delivers an output,
or branches to a new process, it sends a 3-bit discrete signal on
the intercomputer communication (ICC) buses, then waits up to 4
milliseconds for similar discretes from the other computers to
arrive. The discretes are coded for certain messages. For example,
010 means an I/O is complete without error, but 011 means that an
I/O is complete with error73. This allows more information other than just "here
I am" to be sent. If another computer either sends the wrong
synchronization code, or is late the computer detecting either of
these conditions concludes that the delinquent computer has
failed, and refuses from then on to listen to it or acknowledge
its presence. Under normal circumstances, all three good computers
should have detected the single computer's error. The bad computer
is announced to the crew with warning lights, audio signals, and
CRT messages. The crew must purposely kill the power to the failed
computer, as there is no provision for automatic powerdown. This
prevents a generic software failure causing all the computers to
be automatically shut off.
-
- This form of synchronization creates a
tightly coupled group of computers constantly certifying that they
are at the same place in the software. To certify that they are
achieving the same solutions, a "sumword" is used. While computers
are in a redundant set, a sumword is exchanged 6.25 times every
second on the ICC buses74. A sumword typically consists of a 64 bits of data,
usually the least significant bits of the last outputs to the
solid rocket boosters, orbital maneuvering engines, main engines,
body flap, speed brake, rudder, elevons, throttle, the system
discretes, and the reaction control system75. If there are three straight miscomparisons of a
sumword, the detecting computers declare the computer involved to
be failed76.
-
- Both the 3-bit synchronization code and
sumword comparison are characteristics of the redundant set
operations. During noncritical mission phases such as on-orbit,
the computers are reconfigured. Two might be left in the redundant
set to handle guidance and navigation functions, such as
maintaining the state vector. A third would run the systems
management software that controls life support, power, and the
payload. The fourth would be loaded with the descent software and
powered down, or "freeze dried," to be instantly ready to descend
in an emergency and to protect against a failure of the two MMUs.
The fifth contains the backup flight system. This configuration of
computers is not tightly coupled, as in the redundant set. All
active computers, however, do continue the 6.25/second exchange of
sumwords, called the common set
synchronization77.
-
-
-
[107]
-
Figure 4-5. The various computer
configurations used during a Shuttle mission. The names of the
operational sequences loaded into the machines are shown.

