Fault Tolerance Computing Lecture Note
1.0 INTRODUCTION
Technology scaling allows the realization of more and more complex systems on a
single chip. This high level of integration leads to increased current and power
densities and causes early device and interconnect wear-out. In addition, there
are failures caused neither by wear-out nor by escaped manufacturing defects, but
by the increased susceptibility of transistors to high-energy particles from the
atmosphere or from within the packaging. Devices operating at reduced supply
voltages are more prone to charge-related phenomena caused by high-energy
particle strikes, referred to as Single Event Effects (SEEs). They experience
particle-induced voltage transients, called Single Event Transients (SETs), or
particle-induced bit-flips in memory elements, known as Single Event
Upsets (SEUs). Selecting the ideal trade-off between reliability and the cost
associated with a fault-tolerant architecture generally involves an extensive
design space exploration. Employing state-of-the-art reliability estimation
methods makes this exploration unscalable as design complexity grows.
OBJECTIVES
At the end of this unit you should be able to:
• Explain fault tolerance in computing.
• Explain redundancy and identify hardware and software fault
tolerance issues
• Explain the relationship between security and fault tolerance
• Identify different fault tolerance architectures
Three central terms in fault-tolerant design are fault, error, and failure.
There is a cause-and-effect relationship between faults, errors, and failures.
Specifically, faults are the cause of errors, and errors are the cause of
failures. The term failure is often used interchangeably with the term
malfunction; however, failure is rapidly becoming the more commonly
accepted term. A fault is a physical defect, imperfection, or
flaw that occurs within some hardware or software component.
Essentially, the definition of a fault, as used in the fault tolerance
community, agrees with the definition found in the dictionary: a fault is
a blemish, weakness, or shortcoming of a particular hardware or software
component. An error is the manifestation of a fault. Specifically, an error
is a deviation from accuracy or correctness. Finally, if the error results in
the system performing one of its functions incorrectly, then a system
failure has occurred. Essentially, a failure is the nonperformance of some
action that is due or expected; it is also the performance of some
function in a subnormal quantity or quality.
The concepts of faults, errors, and failures can best be presented by the use of
a three-universe model, which is an adaptation of the four-universe model:
• The first universe is the physical universe in which faults occur. The physical
universe contains the semiconductor devices, mechanical elements,
displays, printers, power supplies, and other physical entities that make
up a system. A fault is a physical defect or alteration of some component
within the physical universe.
• The second universe is the informational universe. The informational
universe is where the error occurs. Errors affect units of information
such as data words within a computer or digital voice or image
information. An error has occurred when some unit of information
becomes incorrect.
• The final universe is the external or user’s universe. The external
universe is where the user of a system ultimately sees the effect of faults
and errors. The external universe is where failures occur. The failure is
any deviation that occurs from the desired or expected behavior of a
system. In summary, faults are physical events that occur in the physical
universe. Faults can result in errors in the informational universe, and
errors can ultimately lead to failures that are witnessed in the external
universe of the system.
The cause-effect relationship implied in the three-universe model leads to the
definition of two important parameters; fault latency and error latency.
• Fault latency is the length of time between the occurrence of a fault and
the appearance of an error due to that fault.
• Error latency is the length of time between the occurrence of an error
and the appearance of the resulting failure. Based on the three-universe
model, the total time between the occurrence of a physical fault and the
appearance of a failure will be the sum of the fault latency and the error
latency.
Characteristics of Faults
Faults could be classified based on the following parameters:
a) Causes/Source of Faults
b) Nature of Faults
c) Fault Duration
d) Extent of Faults
e) Value of Faults
Sources of faults: Faults can be the result of a variety of things that occur
within electronic components, external to the components, or during the
component or system design process. Problems at any of several points within
the design process can result in faults within the system.
• Specification mistakes, which include incorrect algorithms, architectures,
or hardware and software design specifications.
• Implementation mistakes. Implementation, as defined here, is the process
of transforming hardware and software specifications into the physical
hardware and the actual software. The implementation can introduce
faults because of poor design, poor component selection, poor
construction, or software coding mistakes.
• Component defects. Manufacturing imperfections, random device
defects, and component wear-out are typical examples of component
defects. Electronic components simply become defective sometimes. The
defect can be the result of bonds breaking within the circuit or corrosion
of the metal. Component defects are the most commonly considered cause
of faults.
• External disturbance; for example, radiation, electromagnetic
interference, battle damage, operator mistakes, and environmental
extremes.
Fault Duration. The duration specifies the length of time that a fault is active.
• A permanent fault remains in existence indefinitely if no corrective
action is taken.
• A transient fault appears and disappears within a very short
period of time.
• An intermittent fault appears, disappears, and then reappears repeatedly.
Fault Extent. The extent of a fault specifies whether the fault is localized to a
given hardware or software module or globally affects the hardware, the
software, or both.
Testability. Testability is simply the ability to test for certain attributes within
a system. Measures of testability allow one to assess the ease with which
certain tests can be performed. Certain tests can be automated and provided as
an integral part of the system to improve the testability. Many of the techniques
that are so vital to the achievement of fault tolerance can be used to detect and
locate problems in a system for the purpose of improving testability.
Testability is clearly related to maintainability because of the importance of
minimizing the time required to identify and locate specific problems.
An extensive methodology has been developed in this field over the past thirty
years, and a number of fault-tolerant machines have been developed, most
dealing with random hardware faults, while a smaller number deal with
software, design, and operator faults to varying degrees. A large amount of
supporting research has been reported, and efforts to attain software that can
tolerate software design faults (programming errors) have made use of static
and dynamic redundancy approaches similar to those used for hardware faults.
One such approach, N-version programming, uses static redundancy in the
form of independently written programs (versions) that perform the same
functions, and their outputs are voted at special checkpoints. Here, of course,
the data being voted may not be exactly the same, and a criterion must be used
to identify and reject faulty versions and to determine a consistent value
(through inexact voting) that all good versions can use. An alternative dynamic
approach is based on the concept of recovery blocks. Programs are partitioned
into blocks and acceptance tests are executed after each block. If an acceptance
test fails, a redundant code block is executed.
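As a rough illustration of N-version programming with inexact voting, the sketch below runs three independently written "versions" and reconciles their results within a tolerance. It is a minimal model under assumptions of my own: the version functions, the tolerance value, and the choice of the mean as the consistent value are all illustrative, not taken from any particular system.

```python
# Minimal sketch of N-version programming with inexact (tolerance-based) voting.
# The three "versions" and the tolerance value are illustrative placeholders.

def version_a(x):
    return x ** 0.5                    # primary algorithm

def version_b(x):
    return x ** 0.5 + 1e-9             # independently written, slightly different rounding

def version_c(x):
    return -1.0                        # a faulty version, for demonstration

def inexact_vote(results, tol=1e-6):
    """Return a value agreed on by a majority of versions, to within tol."""
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tol]
        if len(agreeing) > len(results) // 2:
            # Determine a consistent value (here: the mean of the agreeing results).
            return sum(agreeing) / len(agreeing)
    raise RuntimeError("No majority agreement: voting failed")

x = 2.0
results = [version_a(x), version_b(x), version_c(x)]
print(inexact_vote(results))           # the faulty version_c is outvoted
```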
A transient fault is one that causes a component to malfunction for some time;
it goes away after that time and the functionality of the component is fully
restored. As an example, think of a random noise interference during a
telephone conversation. Another example is a memory cell with contents that
are changed spuriously due to some electromagnetic interference. The cell
itself is undamaged: it is just that its contents are wrong for the time being, and
overwriting the memory cell will make the fault go away.
An intermittent fault never quite goes away entirely; it oscillates between being
quiescent and active. When the fault is quiescent, the component functions
normally; when the fault is active, the component malfunctions. An example
for an intermittent fault is a loose electrical connection. Another classification
of hardware faults is into benign and malicious faults.
A fault that just causes a unit to go dead is called benign. Such faults are the
easiest to deal with. Far more insidious are the faults that cause a unit to produce
reasonable-looking, but incorrect, output, or that make a component
“act maliciously” and send differently valued outputs to different receivers.
Think of an altitude sensor in an airplane that reports a 1000-foot altitude to one
unit and an 8000-foot altitude to another unit. These are called malicious (or
Byzantine) faults.
If an FT system and an HA cluster have the same fault rate, but the
FT system can recover in 3 seconds and the HA cluster takes 5
minutes (300 seconds) to recover from the same fault, then the HA
cluster will be down 100 times as long as the FT system and will have
an availability which is two 9s less. That glorious five 9s claim
becomes three 9s (as reported in several industry studies), at least so
far as software faults are concerned.
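The arithmetic behind that claim can be sketched as follows. The 3-second and 300-second recovery times come from the example above, while the fault rate (one fault per 1000 hours) is an assumed figure chosen only to make the numbers concrete.

```python
# Availability as a function of recovery time, for the same fault rate.
# MTBF of 1000 hours is an assumed figure; the recovery times (3 s vs 300 s)
# come from the example in the text.

import math

def availability(mtbf_s, mttr_s):
    return mtbf_s / (mtbf_s + mttr_s)

def nines(a):
    return -math.log10(1.0 - a)

mtbf = 1000 * 3600          # assumed mean time between faults, in seconds
for name, mttr in [("FT system", 3), ("HA cluster", 300)]:
    a = availability(mtbf, mttr)
    print(f"{name}: availability = {a:.7f} (~{nines(a):.1f} nines)")
# The HA cluster is down 100x longer, so its availability has two fewer nines.
```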
So, the secret to high availability is in the recovery time. This is what
the Tandem folks worked so hard on for two decades before
becoming the Nonstop people. Nobody else has done it. Today,
Nonstop servers are the only fault-tolerant systems out-of-the-box in
the marketplace, and they hold the high ground for availability.
Redundancy
All of fault tolerance is an exercise in exploiting and managing redundancy.
Redundancy is the property of having more of a resource than is minimally
necessary to do the job at hand. As failures happen, redundancy is exploited
to mask or otherwise work around these failures, thus maintaining the desired
level of functionality.
There are four forms of redundancy that we will study: hardware, software,
information, and time. Hardware faults are usually dealt with by using
hardware, information, or time redundancy, whereas software faults are
protected against by software redundancy.
Computing nodes can also exploit time redundancy through re-execution of the
same program on the same hardware. As before, time redundancy is effective
mainly against transient faults. Because the majority of hardware faults are
transient, it is unlikely that the separate executions will experience the same
fault.
Time redundancy can thus be used to detect transient faults in situations in
which such faults may otherwise go undetected. Time redundancy can also be
used when other means for detecting errors are in place and the system is
capable of recovering from the effects of the fault and repeating the
computation. Compared with the other forms of redundancy, time redundancy
has much lower hardware and software overhead but incurs a high performance
penalty.
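A minimal sketch of time redundancy through re-execution is given below. The computation and the fault-injection step are purely illustrative assumptions used to show the detect-by-disagreement idea; a real system would re-execute actual workloads on the same hardware.

```python
# Sketch of time redundancy: run the same computation twice on the same hardware
# and compare. A transient fault is unlikely to corrupt both executions the same
# way, so a mismatch triggers a third, tie-breaking run. The injected fault below
# is purely illustrative.

import random

def compute(x):
    result = x * x
    if random.random() < 0.01:          # simulated transient fault (illustrative)
        result ^= 1 << random.randrange(16)
    return result

def run_with_time_redundancy(x):
    first, second = compute(x), compute(x)
    if first == second:
        return first
    # Disagreement: assume a transient fault hit one run; re-execute and vote.
    third = compute(x)
    if third == first or third == second:
        return third
    raise RuntimeError("Persistent disagreement: suspect a permanent fault")

print(run_with_time_redundancy(12))
```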
Software redundancy is used mainly against software failures. It is a reasonable
guess that every large piece of software that has ever been produced has
contained faults (bugs). Dealing with such faults can be expensive: one way is
to independently produce two or more versions of that software (preferably by
disjoint teams of programmers) in the hope that the different versions will not
fail on the same input. The secondary version(s) can be based on simpler and
less accurate algorithms (and, consequently, less likely to have faults) to be
used only upon the failure of the primary software to produce acceptable
results. Just as for hardware redundancy, the multiple versions of the program
can be executed either concurrently (requiring redundant hardware as well) or
sequentially (requiring extra time, i.e., time redundancy) upon a failure
detection.
Techniques of Redundancy
The concept of redundancy implies the addition of information, resources, or
time beyond what is needed for normal system operation. The redundancy can
take one of several forms, including hardware redundancy, software
redundancy, information redundancy, and time redundancy. The use of
redundancy can provide additional capabilities within a system. In fact, if fault
tolerance or fault detection is required then some form of redundancy is also
required. But it must be understood that redundancy can have a very important
impact on a system in the areas of performance, size, weight, power
consumption, reliability, and other attributes.
Hardware Redundancy
The physical replication of hardware is perhaps the most common form of
redundancy used in systems. As semiconductor components have become
smaller and less expensive, the concept of hardware redundancy has become
more common and more practical. The costs of replicating hardware within a
system are decreasing simply because the costs of hardware are decreasing.
There are three basic forms of hardware redundancy. First, passive
techniques use the concept of fault masking to hide the occurrence of faults
and prevent the faults from resulting in errors. Passive approaches are designed
to achieve fault tolerance without requiring any action on the part of the system
or an operator. Passive techniques, in their most basic form, do not provide for
the detection of faults but simply mask the faults. The most common passive
technique is Triple Modular Redundancy (TMR), in which three identical modules
perform the same computation and a majority voter selects the result (Figure 1).
The primary challenge with TMR is obviously the voter; if the voter fails, the
complete system fails. In other words, the reliability of the simplest form of
TMR, as shown in Figure 1, can be no better than the reliability of the voter.
Any single component within a system whose failure will lead to a failure of
the system is called a single-point-of-failure. Several techniques can be used
to overcome the effects of voter failure. One approach is to triplicate the voters
and provide three independent outputs, as illustrated in Figure 2. In Figure 2,
each of three memories receives
data from a voter which has received its inputs from the three separate
processors. If one processor fails, each memory will continue to receive a
correct value because its voter will correct the corrupted value. A TMR system
with triplicated voters is commonly called a restoring organ because the
configuration will produce three correct outputs even if one input is faulty. In
essence, the TMR with triplicated voters restores the error-free signal. A
generalization of the TMR approach is the N-modular redundancy (NMR)
technique. NMR applies the same principle as TMR but uses N of a given
module as opposed to only three. In most cases, N is selected as an odd number
so that a majority voting arrangement can be used. The advantage of using N
modules rather than three is that more module faults can often be tolerated.
For example, a 5MR system contains five replicated modules and a voter. A
majority voting arrangement allows the 5MR system to produce correct results
in the face of as many as two module faults. In many critical-computation
applications, two faults must be tolerated to allow the required reliability and
fault tolerance capabilities to be achieved. The primary tradeoff in NMR is the
fault tolerance achieved versus the hardware required. Clearly, there must be
some limit in practical applications on the amount of redundancy that can be
employed. Power, weight, cost, and size limitations very often determine the
value of N in an NMR system.
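A majority voter for an NMR arrangement can be sketched as below. The sketch assumes the module outputs are exact digital copies (inexact voting would be needed for analog-like values), and the example values are illustrative.

```python
# Majority voter for an NMR configuration (TMR when three modules are used).

from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or raise."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise RuntimeError("No majority: too many module faults")

# TMR: tolerates one faulty module.
print(majority_vote([0xA5, 0xA5, 0x00]))        # -> 165 (0xA5)

# 5MR: tolerates up to two faulty modules.
print(majority_vote([7, 7, 7, 3, 9]))           # -> 7
```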
Voting within NMR systems can occur at several points. For example, an
industrial controller can sample the temperature of a chemical process from
three independent sensors, perform a vote to determine which of the three
sensor values to use, calculate the amount of heat or cooling to provide to the
process (the calculations being performed by three or more separate modules),
and then vote on the calculations to determine a result. The voting can be
performed on both analog and digital data. The alternative, in this example,
might be to sample the temperature from three independent sensors, perform
the calculations, and then provide a single vote on the final result. The primary
difference between the two approaches is fault containment. If voting is not
performed on the temperature values from the sensors, then the effect of a
sensor fault is allowed to propagate beyond the sensors and into the primary
calculations. Voting at the sensors, however, will mask, and contain, the effects
of a sensor fault. Providing several levels of voting, however, does require
additional redundancy, and the benefits of fault containment must be compared
to the cost of the extra redundancy.
Figure 2: Triplicated voters in a TMR configuration
One approach that alleviates the problem of the previous paragraph is called
the mid-value select technique. Basically, the mid-value select approach
chooses a value from the three available in a TMR system by selecting the
value that lies between the remaining two. If three signals are available, and
two of those signals are uncorrupted and the third is corrupted, one of the
uncorrupted results should lie between the other uncorrupted result and the
corrupted result. The mid-value select technique can be applied to any
system that uses an odd number of modules such that one signal must lie in
the middle of the others. The major difficulty with most techniques that use
some form of voting is that a single result must ultimately be produced,
thus creating a potential point where one failure can cause a system failure.
Clearly, single points of failure are to be avoided if a system is to be truly
fault-tolerant.
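A minimal sketch of mid-value selection for three redundant signals is shown below; the sensor readings used are illustrative values.

```python
# Mid-value select for a TMR arrangement: pick the signal that lies between the
# other two, rather than voting for exact equality on possibly noisy values.

def mid_value_select(a, b, c):
    return sorted([a, b, c])[1]          # the median of the three signals

# Two uncorrupted sensor readings and one corrupted one (illustrative values).
print(mid_value_select(101.2, 100.9, 432.0))     # -> 101.2
```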
Active Hardware Redundancy. Active hardware redundancy techniques
attempt to achieve fault tolerance by fault detection, fault location, and fault
recovery. In many designs faults can be detected because of the errors they
produce, so in many instances error detection, error location and error
recovery are the appropriate terms to use. The property of fault masking,
however, is not obtained in the active redundancy approach. In other words,
there is no attempt to prevent faults from producing errors within the
system. Consequently, active approaches are most common in applications
where temporary, erroneous results are acceptable as long as the system
reconfigures and regains its operational status in a satisfactory length of
time. Satellite systems are good examples of applications of active
redundancy. Typically, it is not catastrophic if satellites have infrequent,
temporary failures. In fact, it is usually preferable to have temporary
failures than to provide the large quantities of redundancy necessary to
achieve fault masking.
In active redundancy, a fault produces an error that is either detected or not
detected. If the error remains undetected, the result will be a system failure. The failure will
occur after a latency period has expired. If the error is detected, the source of
the error must be located, and the faulty component removed from operation.
Next, a spare component must be enabled, and the system brought back to an
operational state. It is important to note that the new operational state may be
identical to the original operational state of the system or it may be a degraded
mode of operation. The processes of fault location, fault containment, and fault
recovery are normally referred to simply as reconfiguration.
It is clear from this description that active approaches to fault tolerance require
fault detection and location capabilities.
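The detect-locate-reconfigure cycle described above can be illustrated with the behavioural sketch below. The module behaviour, the comparison-based detection, and the known-answer diagnosis step are simplified assumptions of this sketch, not a description of any specific system.

```python
# Behavioural sketch of active redundancy: duplicated modules with comparison for
# error detection, plus a spare that is switched in during reconfiguration.

class Module:
    def __init__(self, name, faulty=False):
        self.name, self.faulty = name, faulty
    def compute(self, x):
        return x + 1 if not self.faulty else 0   # faulty module returns garbage

class ActivePair:
    def __init__(self, primary, secondary, spare):
        self.active = [primary, secondary]
        self.spare = spare
    def step(self, x):
        r0, r1 = (m.compute(x) for m in self.active)
        if r0 == r1:
            return r0                              # no error detected
        # Error detected: locate the faulty module (here via a known-answer test),
        # remove it from operation, and switch in the spare (reconfiguration).
        for i, m in enumerate(self.active):
            if m.compute(0) != 1:
                self.active[i] = self.spare
                break
        return self.step(x)                        # repeat the computation

pair = ActivePair(Module("A"), Module("B", faulty=True), Module("spare"))
print(pair.step(41))                               # -> 42 after reconfiguration
```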
Information Redundancy
Information redundancy is the addition of redundant information to data to
allow fault detection, fault masking, or possibly fault tolerance. Good examples
of information redundancy are error detecting and error correcting codes,
formed by the addition of redundant information to data words, or by the
mapping of data words into new representations containing redundant
information.
Before beginning the discussions of various codes, we will define
several basic terms that will appear throughout this section. In general, a code
is a means of representing information, or data, using a well-defined set of
rules. A code word is a collection of symbols, often called digits if the symbols
are numbers, used to represent a particular piece of data based upon a specified
code. A binary code is one in which the symbols forming each code word
consist of only the digits 0 and 1. For example, a Binary Coded Decimal (BCD)
code defines a 4-bit code word for each decimal digit; the BCD code is clearly
a binary code. A code word is said to be valid if the code
word adheres to all of the rules that define the code; otherwise, the code word
is said to be invalid.
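As a concrete illustration of these definitions, the sketch below uses a single even-parity bit, one of the simplest error-detecting codes; the data word chosen is illustrative.

```python
# A single even-parity bit as an error-detecting code: the code word is the data
# word plus a bit chosen so that the total number of 1s is even. Any single
# bit-flip makes the code word invalid.

def encode_even_parity(data_bits):
    parity = sum(data_bits) % 2
    return data_bits + [parity]          # code word = data + parity bit

def is_valid(code_word):
    return sum(code_word) % 2 == 0       # valid iff the parity rule holds

word = encode_even_parity([1, 0, 1, 1])
print(is_valid(word))                    # True: valid code word

word[2] ^= 1                             # a single-event upset flips one bit
print(is_valid(word))                    # False: error detected (but not located)
```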
Benefits of FTA
Fault Tree Analysis (FTA) is a convenient means to logically think through the many
ways a failure may occur. It provides insights that may lead to product improvements or
process controls. It is a logical, graphical diagram that organizes the possible
element failures and combination of failures that lead to the top level fault
being studied. The converse, the success tree analysis, starts with the
successful operation of a system, for example, and examines in a logical,
graphical manner all the elements and combinations that have to work
successfully.
With every product, there are numerous ways it can fail, some more likely than
others. The FTA permits a team to think through and organize the sequences or
patterns of faults that have to occur to cause a specific top-level fault. The
top-level fault may be a specific type of failure, say the car will not start, or
it may be focused on a serious safety-related failure, such as the starter motor
overheating and starting a fire. A complex system may have numerous FTAs, each
exploring a different failure mode.
The primary benefit of FTA is that the analysis provides unique insight into the
operation and potential failure of a system. This allows the development team
to explore ways to eliminate or minimize the occurrence of product failure. By
examining the ways a failure mode can occur, down to the individual failure
causes and mechanisms, the resulting changes address the root causes of the
potential failures.
Several overlapping taxonomies of the field exist. Some are oriented more
toward control-engineering approaches, others toward mathematical, statistical,
and AI approaches. A number of representative fault detection and fault
tolerance architectures are described below:
PAIR-AND-A-SPARE
Pair-and-a-Spare (PaS) redundancy was first introduced as an
approach that combines duplication with comparison and standby sparing.
In this scheme each module copy is coupled with a Fault Detection (FD)
unit to detect hardware anomalies within the scope of the individual module.
A comparator is used to detect inequalities in the results from the two active
modules. In the case of inequality, a switch decides which of the two
active modules is faulty by analyzing the reports from the FD units and
replaces it with a spare. This scheme was intended to prevent hardware
faults from causing system failures. The scheme fundamentally lacks
protection against transient faults, and it incurs a large hardware overhead
to accomplish the identification of faulty modules.
Figure 4: Pair-and-A-Spare
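The behaviour of the scheme can be sketched as below. The module and FD-unit behaviour are placeholder assumptions (an ideal FD unit that always flags its own module), used only to show how the comparator and switch interact.

```python
# Behavioural sketch of Pair-and-a-Spare: two active modules are compared; on a
# mismatch, the switch consults each module's fault-detection (FD) report and
# replaces the module flagged as faulty with the spare.

class PaSModule:
    def __init__(self, name, faulty=False):
        self.name, self.faulty = name, faulty
    def compute(self, x):
        return x * 2 if not self.faulty else -1
    def fd_report(self):
        return self.faulty               # idealized FD unit: reports its own fault

def pair_and_a_spare_step(active, spare, x):
    r0, r1 = active[0].compute(x), active[1].compute(x)
    if r0 == r1:
        return r0, active
    # Comparator mismatch: the switch uses the FD reports to pick the faulty module.
    faulty_idx = 0 if active[0].fd_report() else 1
    active[faulty_idx] = spare           # the spare takes over
    return active[1 - faulty_idx].compute(x), active

active = [PaSModule("M1"), PaSModule("M2", faulty=True)]
result, active = pair_and_a_spare_step(active, PaSModule("spare"), 21)
print(result)                            # -> 42, with the spare now active
```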
RAZOR
Razor is a well-known solution for achieving timing error resilience by
using a technique called timing speculation. The principal idea
behind this architecture is to employ temporally separated
double-sampling of input data using Razor FFs placed on critical
paths. The main FF takes an input sample on the rising edge of the clock,
while a time-shifted clock (clk-del) to the shadow latch is used to take
a second sample. By comparing the data of the main FF and the
shadow latch, an error signal is generated. As an example of how the
architecture detects timing errors, a timing fault in CL A may cause the
data to arrive late enough that the main FF captures wrong data, while
the shadow latch always captures the input data correctly.
The error is signaled by the XOR gate and propagates through the
OR-tree so that corrective action can be taken. Error recovery in Razor is
possible either by clock-gating or by rollback recovery. Razor also
uses a Dynamic Voltage Scaling (DVS) scheme to optimize the energy
vs. error rate trade-off.
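The double-sampling comparison can be sketched behaviourally as below. The signal values and the way the delayed sample is modelled are purely illustrative; the sketch abstracts away all circuit-level detail and is not the Razor implementation itself.

```python
# Behavioural sketch of Razor-style double sampling: the main flip-flop samples
# the combinational output at the clock edge, the shadow latch samples the same
# signal after a delay (clk-del), and an XOR of the two flags a timing error.

def razor_ff(sample_at_clk, sample_at_clk_del):
    """Return (captured_value, error_flag) for one Razor flip-flop."""
    error = sample_at_clk != sample_at_clk_del      # XOR of main FF and shadow latch
    # On error, recovery (clock-gating or rollback) restores the shadow value.
    return (sample_at_clk_del if error else sample_at_clk), error

# Normal case: the data settled before the clock edge.
print(razor_ff(1, 1))        # (1, False)

# Timing fault: the main FF captured stale data; the shadow latch, sampling
# later, captured the correct value.
value, error = razor_ff(0, 1)
print(value, error)          # 1 True -> the error propagates through the OR-tree
```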
CPipe
The CPipe (Conjoined Pipeline) architecture uses spatial
and temporal redundancy to detect and recover from timing and
transient errors. It duplicates the CL blocks as well as the FFs to form
two pipelines interlinked together. The primary or leading pipeline is
overclocked to speed up execution, while the replicated or shadow
pipeline has sufficient speed margin to be free from timing errors.
Comparators placed across the leading pipeline registers, in a manner
similar to the Razor scheme, detect any metastable state of a leading
pipeline register and SETs reaching the registers during the latching
window. Error recovery is achieved by stalling the pipelines and
using data from the shadow pipeline registers for rollback, and it takes
three cycles to complete.
DARA-TMR
DARA-TMR triplicates the entire pipeline but uses only two pipeline
copies to run identical process threads in Dual Modular Redundancy
(DMR) mode. The third pipeline copy is disabled using power gating
and is only engaged for diagnosis purposes in the case of very frequent
errors reported by the detection circuitry. Once the defective pipeline
is identified, the system returns to DMR mode by switching the
defective pipeline off. Error recovery follows the same mechanism as
pipeline branch misprediction, making use of existing architectural
components. DARA-TMR treats permanent fault occurrence as a very
rare phenomenon and undergoes a lengthy reconfiguration procedure
to isolate such faults.
δt = the amount of time between the CLK capture edge and the lapse of the comparison window
tccq = FF clk-to-output delay
tcdm = demultiplexer contamination delay
tcm = multiplexer contamination delay
2. Error Detection
Error Detection is a fault tolerance technique where the program
locates every incidence of error in the system. This technique is
practically implemented using two attributes, namely, self-protection
and self-checking. The Self-Protection attribute of error detection is
used for spotting the errors in the external modules, whereas the
Self-Checking attribute of error detection is used for spotting the errors
in the internal module.
3. Exception Handling
Exception Handling is a technique used for redirecting the execution
flow towards the route to recovery whenever an error occurs in the
normal functional flow. As a part of fault tolerance, this activity is
performed through three different software components, namely the
Interface Exception, the Local Exception and the Failure Exception.
4. Checkpoint and Restart
This is one of the commonly used recuperation methods for single-
version software systems. The Checkpoint and Restart fault
tolerance technique can be used for events like run-time
exceptions, that is, when a malfunction takes place during run-time
and, once execution is complete, there is no record of the error
having happened. For this case, the programmer can place checkpoints in
the program and instruct the program to restart immediately from
the point where the error occurred.
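A minimal sketch of checkpoint and restart is given below. The workload, the per-step checkpoint placement, and the single injected transient error are assumptions of the sketch, used only to show restart from the last saved state.

```python
# Minimal checkpoint-and-restart sketch for single-version software: the program
# periodically saves its state and, when an error is detected, restarts from the
# last checkpoint instead of from the beginning.

import copy

def checkpointed_run(items, work, max_retries=3):
    state = {"index": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    retries = 0
    while state["index"] < len(items):
        try:
            state["total"] += work(items[state["index"]])
            state["index"] += 1
            checkpoint = copy.deepcopy(state)     # checkpoint after each step
        except Exception:
            if retries == max_retries:
                raise
            retries += 1
            state = copy.deepcopy(checkpoint)     # restart from the last checkpoint
    return state["total"]

calls = {"n": 0}
def flaky_square(x):
    calls["n"] += 1
    if calls["n"] == 3:                           # one simulated transient failure
        raise RuntimeError("transient error")
    return x * x

print(checkpointed_run([1, 2, 3, 4], flaky_square))   # -> 30
```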
5. Process Pairs
Process Pair technique is a method of using the same software in
two different hardware units and validating the functional
differences in order to capture the faulty areas. This technique
functions on top of the checkpoint and restart technique, as similar
checkpoints and restart instructions are placed in both systems.
6. Data Diversity
Data Diversity technique is typically a process where the
programmer passes a set of input data, and places checkpoints for
detecting the slippage. The commonly used Data Diversity models
are ‘Input Data Re-Expression’ model, ‘Input Data Re-Expression
with Post-Execution Adjustment’ model, and ‘Re-Expression via
Decomposition and Recombination’ model.
7. Recovery Blocks
The Recovery Block technique for multiple-version software fault
tolerance involves the checkpoint and restart method, where the
checkpoints are placed before the fault occurrence, and the system
is instructed to move on to the next version to continue the flow. It is
carried out in three areas, that is, the main module, the acceptance
tests, and the swap module.
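A minimal sketch of a recovery block is shown below. The primary and alternate modules and the acceptance test are illustrative placeholders; restoring the checkpoint between attempts is noted in a comment rather than modelled in full.

```python
# Sketch of a recovery block: run the primary module, check its result with an
# acceptance test, and fall back to an alternate (simpler) module if the test fails.

def primary_sqrt(x):
    return -3.0                          # buggy primary version, for demonstration

def alternate_sqrt(x):
    return x ** 0.5                      # simpler, less efficient alternate

def acceptance_test(x, result):
    return result >= 0 and abs(result * result - x) < 1e-6

def recovery_block(x, versions, test):
    for version in versions:             # a checkpoint would be restored between tries
        result = version(x)
        if test(x, result):
            return result
    raise RuntimeError("All versions failed the acceptance test")

print(recovery_block(9.0, [primary_sqrt, alternate_sqrt], acceptance_test))  # -> 3.0
```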
8. N – Version Programming
The N-Version programming technique for multi-version fault
tolerance is the commonly used method when there is a provision
for testing multiple code editions. Recovery is made by
executing all the versions and comparing the outputs from each of
the versions. This technique also involves the acceptance test flow.
9. N Self–Checking Programming
N Self – Checking Programming is a combination technique of both
the Recovery block and the N – version Programming techniques,
which also calls for the acceptance test execution. It is performed by
the sequential and the parallel execution of various versions of the
software.
In web services, there are many fault-tolerant techniques that can be
applied, such as replication. Replication is an efficient technique for
handling exceptions in a distributed application. Services can resume more
effectively by maintaining the global state of the application. For instance,
assume one service needs the assistance of another service to
provide the desired result to the customer; the service then needs to
communicate with the other service. If, while communicating with the
other service, a fault occurs in a service at a certain point in time, then
there is no need to continue the service with faults. The state manager then
has to roll back the state of the application to the point where the fault
occurred, so that the service can resume without the fault and the response
can be given to the consumer more effectively.
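A rough sketch of this idea is given below. The replicas and the state manager are placeholder assumptions used only to illustrate trying replicas in turn and rolling the application state back to the last consistent point before each retry.

```python
# Sketch of replication with state rollback for a service call: try the request
# against replicas in turn, rolling the application state back before each retry.

import copy

class StateManager:
    def __init__(self, state):
        self._state, self._saved = state, copy.deepcopy(state)
    def save(self):
        self._saved = copy.deepcopy(self._state)
    def rollback(self):
        self._state = copy.deepcopy(self._saved)
        return self._state

def call_with_replication(request, replicas, state_mgr):
    state_mgr.save()                      # consistent point before the call
    for replica in replicas:
        try:
            return replica(request)       # first replica that succeeds answers
        except Exception:
            state_mgr.rollback()          # undo partial effects, try the next one
    raise RuntimeError("All replicas failed")

def bad_replica(req):
    raise ConnectionError("replica down")

def good_replica(req):
    return f"handled {req}"

mgr = StateManager({"orders": []})
print(call_with_replication("order-42", [bad_replica, good_replica], mgr))
```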
The Fault Tolerance methods can be applied to cloud computing in three levels:
• At hardware level: if the attack on a hardware resource causes the system
failure, then its effect can be compensated by using additional hardware
resources.
• At software (s/w) level: Fault tolerance techniques such as checkpoint
restart and recovery methods can be used to progress system execution in
the event of failures due to security attacks.
• At system level: At this level, fault tolerance measures can compensate
for failures in system amenities and guarantee the availability of the network
and other resources.
CONCLUSION
Fault-tolerance is achieved by applying a set of analysis and design techniques
to create systems with dramatically improved dependability. As new
technologies are developed and new applications arise, new fault-tolerance
approaches are also needed. In the early days of fault-tolerant computing, it
was possible to craft specific hardware and software solutions from the
ground up, but now chips contain complex, highly integrated functions, and
hardware and software must be crafted to meet a variety of standards to be
economically viable. Thus, a great deal of current research focuses on
implementing fault tolerance using Commercial-Off-The-Shelf (COTS)
technology.
Recent developments include the adaptation of existing fault-tolerance
techniques to RAID disks where information is striped across several disks to
improve bandwidth and a redundant disk is used to hold encoded information
so that data can be reconstructed if a disk fails. Another area is the use of
application-based fault-tolerance techniques to detect errors in high
performance parallel processors. Fault-tolerance techniques are expected to
become increasingly important in deep sub-micron VLSI devices to combat
increasing noise problems and improve yield by tolerating defects that are
likely to occur on very large, complex chips.
SUMMARY
The ability of a system to continue operation despite a failure of any single
element within the system implies the system is not in a series configuration.
There is some set of redundant elements or alternative means to continue
operation. The system may use multiple redundant elements, or be resilient
to changes in the system’s configuration. The appropriate solution to create a
fault-tolerant system often requires careful planning and an understanding of
how elements fail and of the impact of a failure on surrounding elements.
Fault-tolerance techniques will become even more important in the coming
years. The ideal, from an application writer’s point of view, is total hardware
fault-tolerance. Trends in the market, e.g. Stratus and Sun Netra, show that
this is the way systems are going at the moment. There is also, fortunately, reason to
believe that such systems will become considerably cheaper than today.
Technology in general, and miniaturization in particular (which leads to
physically smaller and in general cheaper systems) contributes to this. Much
research is also being done with clusters of commercial general-purpose
computers connected with redundant buses. In that case, the software has to
handle the failures. However, as shown with the HA Cluster and Sun Netra,
that could also be done without affecting the user programs and applications.