
Fault Tolerance Computing

1.0 INTRODUCTION
Technology scaling allows the realization of more and more complex systems on a
single chip. This high level of integration leads to increased current and power
densities and causes early device and interconnect wear-out. In addition, there
are failures caused neither by wear-out nor by escaped manufacturing defects, but
by the increased susceptibility of transistors to high-energy particles from the
atmosphere or from within the packaging. Devices operating at reduced supply
voltages are more prone to charge-related phenomena caused by high-energy
particle strikes, referred to as Single Event Effects (SEE). They experience
particle-induced voltage transients, called Single Event Transients (SET), or
particle-induced bit-flips in memory elements, known as Single Event
Upsets (SEU). Selecting the ideal trade-off between reliability and the cost
associated with a fault-tolerant architecture generally involves an extensive
design space exploration. Employing state-of-the-art reliability estimation
methods makes this exploration unscalable as design complexity grows.

Fault tolerance in software is the ability of the software to fix itself or to
continue normal operation when a glitch or error occurs in the system, provided
that full coverage of the functionality specified in the requirements
documentation is maintained. The faults in a software system can originate from
within the software itself, from other integrated systems, from downstream
applications, or from external aspects such as the system hardware or the
network. Fault tolerance is one of the factors by which the quality of software
is judged; hence it is important that every software program incorporates
fault tolerance.
Fault-tolerant computing is the art and science of building computing systems
that continue to operate satisfactorily in the presence of faults. A fault-tolerant
system may be able to tolerate one or more fault types, including: i) transient,
intermittent or permanent hardware faults; ii) software and hardware design
errors; iii) operator errors; or iv) externally induced upsets or physical damage.
An extensive methodology has been developed in this field over the past thirty
years, and a number of fault-tolerant machines have been developed, most
dealing with random hardware faults, while a smaller number deal with
software, design and operator faults to varying degrees.

OBJECTIVES
At the end of this unit you should be able to:
• Explain fault tolerance in computing.
• Explain redundancy and identify hardware and software fault tolerance issues.
• Explain the relationship between security and fault tolerance.
• Identify different fault tolerance architectures.

FAULT TOLERANT COMPUTING


What is Fault Tolerance
Fault tolerance has been part of the computing community for quite a
long time. To build our understanding of it, we should know that fault
tolerance is the art and science of building computing systems that
continue to operate satisfactorily in the presence of faults. An operating
system designed along these lines cannot be disrupted by a single point of
failure. It ensures business continuity and the high availability of
crucial applications and systems regardless of any failures.

Fault tolerance and dependable systems research covers a wide
spectrum of applications ranging across embedded real-time systems,
commercial transaction systems, transportation systems, and military and
space systems, to name a few. The supporting research includes system
architecture, design techniques, coding theory, testing, validation, proof
of correctness, modeling, software reliability, operating systems,
parallel processing, and real-time processing. These areas often involve
widely diverse core expertise ranging from formal logic and the mathematics
of stochastic modeling to graph theory, hardware design and software
engineering.

Basic Terms of Fault Tolerance Computing

Fault tolerance can be built into a system to remove the risk of it having
a single point of failure. To do so, the system must have no single
component that, if it were to stop working effectively, would result in
the entire system failing. Fault tolerance relies on aspects like load
balancing and failover, which remove the risk of a single point of failure.
It will typically be part of the operating system's interface, enabling
programmers to check the performance of data throughout a transaction.

Three central terms in fault-tolerant design are fault, error, and failure.
There is a cause-and-effect relationship between faults, errors, and failures:
faults are the cause of errors, and errors are the cause of failures. The term
failure is often used interchangeably with the term malfunction; however,
failure is rapidly becoming the more commonly accepted term. A fault is a
physical defect, imperfection, or
flaw that occurs within some hardware or software component.
Essentially, the definition of a fault, as used in the fault tolerance
community, agrees with the definition found in the dictionary. A fault is
a blemish, weakness, or shortcoming of a particular hardware or software
component. An error is the manifestation of a fault. Specifically, an error
is a deviation from accuracy or correctness. Finally, if the error results in
the system performing one of its functions incorrectly then a system
failure has occurred. Essentially, a failure is the nonperformance of some
action that is due or expected. A failure is also the performance of some
function in a subnormal quantity or quality.

The concepts of faults, errors, and failures can best be presented by the use of
a three-universe model, which is an adaptation of the four-universe model:
• The first universe is the physical universe in which faults occur. The physical
universe contains the semiconductor devices, mechanical elements,
displays, printers, power supplies, and other physical entities that make
up a system. A fault is a physical defect or alteration of some component
within the physical universe.
• The second universe is the informational universe. The informational
universe is where the error occurs. Errors affect units of information
such as data words within a computer or digital voice or image
information. An error has occurred when some unit of information
becomes incorrect.
• The final universe is the external or user’s universe. The external
universe is where the user of a system ultimately sees the effect of faults
and errors. The external universe is where failures occur. The failure is
any deviation that occurs from the desired or expected behavior of a
system. In summary, faults are physical events that occur in the physical
universe. Faults can result in errors in the informational universe, and
errors can ultimately lead to failures that are witnessed in the external
universe of the system.
The cause-effect relationship implied in the three-universe model leads to the
definition of two important parameters: fault latency and error latency.
• Fault latency is the length of time between the occurrence of a fault and
the appearance of an error due to that fault.
• Error latency is the length of time between the occurrence of an error
and the appearance of the resulting failure. Based on the three-universe
model, the total time between the occurrence of a physical fault and the
appearance of a failure will be the sum of the fault latency and the error
latency.

Characteristics of Faults
Faults can be classified based on the following parameters:
a) Causes/sources of faults
b) Nature of faults
c) Fault duration
d) Extent of faults
e) Value of faults

Sources of faults: Faults can be the result of a variety of things that occur
within electronic components, external to the components, or during the
component or system design process. Problems at any of several points within
the design process can result in faults within the system.
• Specification mistakes, which include incorrect algorithms, architectures,
or hardware and software design specifications.
• Implementation mistakes. Implementation, as defined here, is the process
of transforming hardware and software specifications into the physical
hardware and the actual software. The implementation can introduce
faults because of poor design, poor component selection, poor
construction, or software coding mistakes.
• Component defects. Manufacturing imperfections, random device
defects, and component wear-out are typical examples of component
defects. Electronic components simply become defective sometimes. The
defect can be the result of bonds breaking within the circuit or corrosion
of the metal. Component defects are the most commonly considered cause
of faults.
• External disturbance; for example, radiation, electromagnetic
interference, battle damage, operator mistakes, and environmental
extremes.

Nature of faults: this specifies the type of fault; for example, whether it is a
hardware fault, a software fault, a fault in the analog circuitry, or a fault in the
digital circuitry.

Fault Duration. The duration specifies the length of time that a fault is active.
• A permanent fault remains in existence indefinitely if no corrective
action is taken.
• A transient fault appears and disappears within a very short period of time.
• An intermittent fault appears, disappears, and then reappears repeatedly.

Fault Extent. The extent of a fault specifies whether the fault is localized to a
given hardware or software module or globally affects the hardware, the
software, or both.

Fault value. The value of a fault can be either determinate or indeterminate. A
determinate fault is one whose status remains unchanged throughout time
unless externally acted upon. An indeterminate fault is one whose status at
some time, T, may be different from its status at some increment of time
greater than or less than T.

There are three primary techniques for maintaining a system's normal performance
in an environment where faults are of concern: fault avoidance, fault masking,
and fault tolerance.
• Fault avoidance is a technique that is used in an attempt to prevent the
occurrence of faults. Fault avoidance can include such things as design
reviews, component screening, testing, and other quality control methods.
• Fault masking is any process that prevents faults in a system from
introducing errors into the informational structure of that system.
• Fault tolerance is the ability of a system to continue to perform its tasks
after the occurrence of faults. The ultimate goal of fault tolerance is to
prevent system failures from occurring. Since failures are directly caused
by errors, the terms fault tolerance and error tolerance are often used
interchangeably.

Approaches for Fault Tolerance.


• Fault masking is one approach to tolerating faults.
• Reconfiguration is the process of eliminating a faulty entity from a
system and restoring the system to some operational condition or state. If
the reconfiguration technique is used then the designer must be concerned
with fault detection, fault location, fault containment, and fault recovery.
• Fault detection is the process of recognizing that a fault has occurred.
Fault detection is often required before any recovery procedure can be
implemented.
• Fault location is the process of determining where a fault has occurred so
that an appropriate recovery can be implemented.
• Fault containment is the process of isolating a fault and preventing the
effects of that fault from propagating throughout a system. Fault
containment is required in all fault-tolerant designs.
• Fault recovery is the process of remaining operational or regaining
operational status via reconfiguration even in the presence of faults.

Goals of Fault Tolerance


Fault tolerance is an attribute that is designed into a system to achieve
design goals such as dependability, reliability, availability, safety,
performability, maintainability, and testability; fault tolerance is one system
attribute capable of fulfilling such requirements.

Dependability. The term dependability is used to encapsulate the concepts of
reliability, availability, safety, maintainability, performability, and testability.
Dependability is simply the quality of service provided by a particular system.
Reliability, availability, safety, maintainability, performability, and testability,
are examples of measures used to quantify the dependability of a system.

Reliability. The reliability of a system is a function of time, R(t), defined as
the conditional probability that the system performs correctly throughout the
interval of time, [t0,t], given that the system was performing correctly at time
t0. In other words, the reliability is the probability that the system operates
correctly throughout a complete interval of time. The reliability is a conditional
probability in that it depends on the system being operational at the beginning
of the chosen time interval. The unreliability of a system is a function of time,
F(t), defined as the conditional probability that a system begins to perform
incorrectly during the interval of time, [t0,t], given that the system was
performing correctly at time t0. The unreliability is often referred to as the
probability of failure.

Reliability is most often used to characterize systems in which even momentary
periods of incorrect performance are unacceptable, or it is impossible to repair
the system. If repair is impossible, such as in many space applications, the time
intervals being considered can be extremely long, perhaps as many as ten years.
In other applications, such as aircraft flight control, the time intervals of
concern may be no more than several hours, but the probability of working
correctly throughout that interval may be 0.9999999 or higher. It is a common
convention when reporting reliability numbers to use 0.9^i to represent the
fraction that has i nines to the right of the decimal point. For example,
0.9999999 is written as 0.9^7.
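As a simple illustration of this convention, the following Python sketch assumes the common exponential failure law R(t) = e^(-λt), which is not stated in the text above, and converts a reliability figure into its number of nines; the failure rate and mission time are hypothetical values chosen only for the example.

```python
import math

def reliability(failure_rate, t):
    """R(t) under the assumed exponential failure law: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)

def nines(r):
    """Express a reliability (or availability) figure as a number of nines, i.e. the i in 0.9^i."""
    return -math.log10(1.0 - r)

# Hypothetical flight-control example: failure rate of 1e-9 per hour over a 10-hour flight.
r = reliability(1e-9, 10)
print(f"R(10 h) = {r:.10f}  (about {nines(r):.1f} nines, i.e. 0.9^8)")
```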

Availability. Availability is a function of time, A(t), defined as the probability
that a system is operating correctly and is available to perform its functions at
the instant of time, t. Availability differs from reliability in that reliability
involves an interval of time, while availability is taken at an instant of time. A
system can be highly available yet experience frequent periods of inoperability
as long as the length of each period is extremely short. In other words, the
availability of a system depends not only on how frequently it becomes
inoperable but also how quickly it can be repaired. Examples of high-
availability applications include time-shared computing systems and certain
transaction processing applications, such as airline reservation systems.
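A common steady-state approximation, not stated explicitly above, computes availability from the mean time between failures (MTBF) and the mean time to repair (MTTR). The small Python sketch below uses hypothetical numbers to show that availability depends on how quickly the system can be repaired, not only on how often it fails.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Steady-state availability as the long-run uptime fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Same failure frequency, different repair times (illustrative values only):
print(steady_state_availability(1000, 10))    # ~0.9901: slow repair, lower availability
print(steady_state_availability(1000, 0.1))   # ~0.9999: fast repair, high availability
```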
Safety. Safety is the probability, S(t), that a system will either perform its
functions correctly or will discontinue its functions in a manner that does not
disrupt the operation of other systems or compromise the safety of any people
associated with the system. Safety is a measure of the failsafe capability of a
system; if the system does not operate correctly, it is desired to have the system
fail in a safe manner. Safety and reliability differ because reliability is the
probability that a system will perform its functions correctly, while safety is
the probability that a system will either perform its functions correctly or will
discontinue the functions in a manner that causes no harm.

Performability. In many cases, it is possible to design systems that can
continue to perform correctly after the occurrence of hardware and software
faults, but the level of performance is somehow diminished. The
performability of a system is a function of time, P(L,t), defined as the
probability that the system performance will be at, or above, some level, L, at
the instant of time, t. Performability differs from reliability in that reliability is
a measure of the likelihood that all of the functions are performed correctly,
while performability is a measure of the likelihood that some subset of the
functions is performed correctly.

Graceful degradation is an important feature that is closely related to
performability. Graceful degradation is simply the ability of a system to
automatically decrease its level of performance to compensate for hardware
and software faults. Fault tolerance can certainly support graceful degradation
and performability by providing the ability to eliminate the effects of hardware
and software faults from a system, therefore allowing performance at some
reduced level.
Maintainability. Maintainability is a measure of the ease with which a system
can be repaired, once it has failed. In more quantitative terms, maintainability
is the probability, M(t), that a failed system will be restored to an operational
state within a period of time t. The restoration process includes locating the
problem, physically repairing the system, and bringing the system back to its
operational condition. Many of the techniques that are so vital to the
achievement of fault tolerance can be used to detect and locate problems in a
system for the purpose of maintenance. Automatic diagnostics can
significantly improve the maintainability of a system because a majority of the
time used to repair a system is often devoted to determining the source of the
problem.

Testability. Testability is simply the ability to test for certain attributes within
a system. Measures of testability allow one to assess the ease with which
certain tests can be performed. Certain tests can be automated and provided as
an integral part of the system to improve the testability. Many of the techniques
that are so vital to the achievement of fault tolerance can be used to detect and
locate problems in a system for the purpose of improving testability.
Testability is clearly related to maintainability because of the importance of
minimizing the time required to identify and locate specific problems.

3.0.1.2 Fault Tolerant Systems


Fault tolerance is a process that enables an operating system to respond to a
failure in hardware or software. This fault-tolerance definition refers to the
system's ability to continue operating despite failures or malfunctions. A fault-
tolerant system may be able to tolerate one or more fault types, including:
• Transient, Intermittent or Permanent Hardware Faults,
• Software and Hardware Design Errors,
• Operator Errors,
• Externally Induced Upsets or Physical Damage.

An extensive methodology has been developed in this field over the past thirty
years, and a number of fault-tolerant machines have been developed, most
dealing with random hardware faults, while a smaller number deal with
software, design and operator faults to varying degrees. A large amount of
supporting research has been reported and efforts to attain software that can
tolerate software design faults (programming errors) have made use of static
and dynamic redundancy approaches similar to those used for hardware faults.
One such approach, N-version programming, uses static redundancy in the
form of independently written programs (versions) that perform the same
functions, and their outputs are voted at special checkpoints. Here, of course,
the data being voted may not be exactly the same, and a criterion must be used
to identify and reject faulty versions and to determine a consistent value
(through inexact voting) that all good versions can use. An alternative dynamic
approach is based on the concept of recovery blocks. Programs are partitioned
into blocks and acceptance tests are executed after each block. If an acceptance
test fails, a redundant code block is executed.
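The recovery-block idea described above can be sketched in a few lines of Python; the square-root routines and the acceptance test below are hypothetical stand-ins chosen only for illustration.

```python
def recovery_block(primary, alternates, acceptance_test, *args):
    """Run the primary block; if its result fails the acceptance test,
    execute the alternate (redundant) blocks in turn."""
    for block in (primary, *alternates):
        try:
            result = block(*args)
        except Exception:
            continue                              # a crashing block counts as failed
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all blocks failed the acceptance test")

# Hypothetical example: a buggy primary square-root routine and a simpler alternate.
buggy_sqrt  = lambda x: x ** 0.5 - 1e-3           # primary version with a small bug
simple_sqrt = lambda x: x ** 0.5                  # simpler alternate version
test        = lambda r, x: abs(r * r - x) < 1e-9  # acceptance test on the result
print(recovery_block(buggy_sqrt, [simple_sqrt], test, 2.0))   # falls back to the alternate
```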

An approach called design diversity combines hardware and software
fault tolerance by implementing a fault-tolerant computer system using
different hardware and software in redundant channels. Each channel is
designed to provide the same function, and a method is provided to identify if
one channel deviates unacceptably from the others. The goal is to tolerate both
hardware and software design faults. This is a very expensive technique, but it
is used in very critical aircraft control applications.
Major Building Blocks of a Fault-Tolerant System
The key benefit of fault tolerance is to minimize or avoid the risk of systems
becoming unavailable due to component errors. This is particularly
important in critical systems that are relied on to ensure people's safety, such
as air traffic control, and in systems that protect and secure critical data and
high-value transactions. The core components to improving fault tolerance
include:

Diversity: If a system's main electricity supply fails, potentially due to a storm
that causes a power outage or affects a power station, it will not be possible to
access alternative electricity sources. In this event, fault tolerance can be
sourced through diversity, which provides electricity from sources like backup
generators that take over when a main power failure occurs.
• Some diverse fault-tolerance options result in the backup not having the
same level of capacity as the primary source. This may, in some cases,
require the system to ensure graceful degradation until the primary
power source is restored.
Redundancy: Fault-tolerant systems use redundancy to remove the single point of
failure. The system is equipped with one or more power supply units
(PSUs), which do not need to power the system while the primary PSU
functions as normal. In the event the primary PSU fails or suffers a fault,
it can be removed from service and replaced by a redundant PSU, which
takes over system function and performance.
• Alternatively, redundancy can be imposed at a system level, which
means an entire alternate computer system is in place in case a failure
occurs.
Replication: Replication is a more complex approach to achieving fault
tolerance. It involves using multiple identical versions of systems and
subsystems and ensuring their functions always provide identical results. If
the results are not identical, then a democratic procedure is used to identify
the faulty system. Alternatively, a procedure can be used to check for a
system that shows a different result, which indicates it is faulty.
• Replication can either take place at the component level, which involves
multiple processors running simultaneously, or at the system level,
which involves identical computer systems running simultaneously.

Basic Characteristics of Fault Tolerant Systems


A fault tolerant system may have one or more of the following
characteristics:
• No Single Point of Failure: This means if a capacitor, block of
software code, a motor, or any single item fails, then the system does
not fail. As an example, many hospitals have backup power systems in
case the grid power fails, thus keeping critical systems within the
hospital operational. Critical systems may have multiple redundant
schemes to maintain a high level of fault tolerance and resilience.

• No Single Point Repair Takes the System Down: Extending the
single-point-of-failure idea, effecting a repair of a failed component does
not require powering down the system. It also means the system remains
online and operational during repair. This may pose challenges for both
the design and the maintenance of a system. Hot-swappable power
supplies are an example of a repair action that keeps the system operating
while a faulty power supply is replaced.
• Fault isolation or identification: The system is able to identify when
a fault occurs within the system and does not permit the faulty element
to adversely influence the functional capability (e.g., losing data or
making logic errors in a banking system). The faulty elements are
identified and isolated. Portions of the system may have the sole
purpose of detecting faults; built-in self-test (BIST) is an example.

Fault containment to prevent propagation of failure


• When a failure occurs it may result in damage to other elements within the
system, thus creating a second or third fault and system failure.
• For example, if an analog circuit fails it may increase the current across
the system damaging logic circuits unable to withstand high current
conditions. The idea of fault containment is to avoid or minimize collateral
damage caused by a single point failure.

Robustness or Variability Control


• When a system experiences a single point failure, the system changes.
• The change may cause transient or permanent changes affecting how the
working elements of the system respond and function. Variation occurs,
and when a failure occurs there often is an increase in variability. For
example, when one of two power supplies fails, the remaining power supply
takes on the full load of the power demand. This transition should occur
without impacting the performance of the system. The ability to design and
manufacture a robust system may involve design for six sigma, design of
experiment optimization, and other tools to create a system able to operate
when a failure occurs.
Availability of Reversion Mode
• There are many ways a system may alter its performance when a failure
occurs, enabling the system to continue to function in some fashion.
• For example, if part of a computer's cooling system fails, the central
processor unit (CPU) may reduce its speed or command execution rate,
effectively reducing the heat the CPU generates. The failure causes a
loss of cooling capacity, and the CPU adjusts to accommodate it, avoiding
overheating and failing. Other reversion schemes may include a roll back
to a prior working state, or a switch to a prior or safe-mode software set.
• In some cases, the system may be able to operate with no or only
minimal loss of functional capability; in other cases, the reversion operation
significantly restricts the system operation to a critical few functions.

Hardware and Software Fault Tolerant Issues


In everyday language, the terms fault, failure, and error are used
interchangeably. In fault-tolerant computing parlance, however, they have
distinctive meanings. A fault (or failure) can be either a hardware defect or a
software i.e. programming mistake (bug). In contrast, an error is a
manifestation of the fault, failure and bug. As an example, consider an adder
circuit, with an output line stuck at 1; it always carries the value 1
independently of the values of the input operands. This is a fault, but not (yet)
an error. This fault causes an error when the adder is used and the result on that
line is supposed to have been a 0, rather than a 1. A similar distinction exists
between programming mistakes and execution errors. Consider, for example,
a subroutine that is supposed to compute sin(x) but owing to a programming
mistake calculates the absolute value of sin(x) instead. This mistake will result
in an execution error only if that particular subroutine is used and the correct
result is negative.
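A minimal Python sketch of this distinction (the routine below is a hypothetical illustration, not code from the text): the programming mistake is a fault that is always present, but an execution error appears only on inputs whose correct result is negative.

```python
import math

def buggy_sin(x):
    """Programming mistake (a fault): returns |sin(x)| instead of sin(x)."""
    return abs(math.sin(x))

# The fault produces an execution error only when the correct result is negative.
for x in (0.5, 2.0, 4.0):                         # sin(4.0) is negative
    correct, observed = math.sin(x), buggy_sin(x)
    print(x, "error" if observed != correct else "ok")
```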
Both faults and errors can spread through the system. For example, if a chip
shorts out power to ground, it may cause nearby chips to fail as well. Errors
can spread because the output of one unit is used as input by other units. To
return to our previous examples, the erroneous results of either the faulty adder
or the sin(x) subroutine can be fed into further calculations, thus propagating
the error.

To limit such contagion, designers incorporate containment zones into
systems. These are barriers that reduce the chance that a fault or error in one
zone will propagate to another. For example, a fault-containment zone can be
created by ensuring that the maximum possible voltage swings in one zone are
insulated from the other zones, and by providing an independent power supply
to each zone. In other words, the designer tries to electrically isolate one zone
from another. An error-containment zone can be created, as we will see in some
detail later on, by using redundant units, programs and voting on their output.

Hardware faults can be classified according to several aspects. Regarding their
duration, hardware faults can be classified into permanent, transient, or
intermittent. A permanent fault is just that: it reflects the permanent going out
of commission of a component. As an example of a permanent fault think of a
burned-out light bulb.

A transient fault is one that causes a component to malfunction for some time;
it goes away after that time and the functionality of the component is fully
restored. As an example, think of a random noise interference during a
telephone conversation. Another example is a memory cell with contents that
are changed spuriously due to some electromagnetic interference. The cell
itself is undamaged: it is just that its contents are wrong for the time being, and
overwriting the memory cell will make the fault go away.

An intermittent fault never quite goes away entirely; it oscillates between being
quiescent and active. When the fault is quiescent, the component functions
normally; when the fault is active, the component malfunctions. An example
for an intermittent fault is a loose electrical connection. Another classification
of hardware fault is into benign and malicious faults.

A fault that just causes a unit to go dead is called benign. Such faults are the
easiest to deal with. Far more insidious are the faults that cause a unit to produce
reasonable-looking, but incorrect, output, or that make a component
“act maliciously” and send differently valued outputs to different receivers.
Think of an altitude sensor in an airplane that reports a 1000-foot altitude to one
unit and an 8000-foot altitude to another unit. These are called malicious (or
Byzantine) faults.

Fault Tolerance vs. High Availability


Why is it that we see industry-standard servers advertising five 9s of
availability while NonStop servers acknowledge four 9s? Are these
high-availability industry-standard servers really ten times more
reliable than fault-tolerant NonStop servers? Of course not.

To understand this marketing discrepancy, let's take a look at the
factors which differentiate fault-tolerant systems from high-availability
systems. To start with, there is no reason to assume that a single NonStop
processor is any more or less reliable than an industry-standard
processor. In fact, a reasonable assumption is that a processor will be
up about 99.5% of the time (that is, it will have almost three 9s of
availability) whether it be a NonStop processor or an industry-standard
processor.

So how do we get four or five 9s out of components that offer less
than three 9s of availability? Through redundancy, of course.
NonStop servers are inherently redundant and are fault tolerant (FT)
in that they can survive any single fault. In the high-availability (HA)
world, industry-standard servers are configured in clusters of two or
more processors that allow for re-configuration around faults.
FT systems tolerate faults; HA clusters re-configure around faults.

If you provide a backup, you double your 9s. Thus, in a two-processor
configuration, each of which has an availability of .995, you can be
dreaming of five 9s of hardware availability. But dreams they are.
True, you will have at least one processor up 99.999% of the time;
but that does not mean that your system will be available for that
proportion of time. This is because most system outages are not
caused by hardware failures.

The causes of outages have been studied by many (Standish Group,
IEEE Computer, Grey, among others), and they all come up with
amazingly similar breakdowns:
- Hardware 10% – 20%
- Software 30% – 40%
- People 20% – 40%
- Environment 10% – 20%
- Planned 20% – 30%

These results are for single processor systems. However, we are
considering redundant systems which will suffer a hardware failure
only if both systems fail. Given a 10-20% chance that a single system
will fail due to a hardware failure, an outage due to a dual hardware
failure is only 1% to 4%. Thus, we can pretty much ignore hardware
failures as a source of failure in redundant systems. (This is a gross
understatement for the new Nonstop Advanced Architecture, which
is reaching toward six or seven 9s for hardware availability.)

So, what is left that can be an FT/HA differentiator? Environmental
factors (air conditioning, earthquakes, etc.) and people factors
(assuming good system management tools) are pretty much
independent of the system. Planned downtime is a millstone around
everyone’s neck, and much is being done about this across all
systems. This leaves software as the differentiator.
Software faults are going to happen, no matter what. In a single
system, 30-40% of all single-system outages will be caused by
software faults. The resultant availability of a redundant system is
going to depend on how software faults are handled. Here is the
distinction between fault-tolerant systems and high-availability
systems. A fault-tolerant system will automatically recover from a
software fault almost instantly (typically in seconds) as failed
processes switch over to their synchronized backups. The state of
incomplete transactions remains in the backup disk process and
processing goes on with virtually no delay. On the other hand, a high-
availability (HA) cluster will typically require that the applications
be restarted on a surviving system and that in-doubt transactions in
process be recovered from the transaction log. Furthermore, users
must be switched over before the applications are once again
available to the users. This can all take several minutes. In addition,
an HA switchover must often be managed manually.

If an FT system and an HA cluster have the same fault rate, but the
FT system can recover in 3 seconds and the HA cluster takes 5
minutes (300 seconds) to recover from the same fault, then the HA
cluster will be down 100 times as long as the FT system and will have
an availability which is two 9s less. That glorious five 9s claim
becomes three 9s (as reported in several industry studies), at least so
far as software faults are concerned.
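The arithmetic behind this comparison can be sketched as follows; the fault rate below is an assumed, purely illustrative figure, and only the 100:1 ratio of recovery times from the text matters.

```python
# Same software-fault rate, different recovery times (illustrative numbers only).
faults_per_year  = 12                    # assumed fault rate, identical for both systems
seconds_per_year = 365 * 24 * 3600

for name, recovery_s in (("FT system", 3), ("HA cluster", 300)):
    downtime = faults_per_year * recovery_s
    availability = 1 - downtime / seconds_per_year
    print(f"{name}: {downtime:>5} s/yr of downtime, availability ~ {availability:.6f}")
# The HA cluster is down 100 times as long, i.e. roughly two 9s less availability.
```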

So, the secret to high availability is in the recovery time. This is what
the Tandem folks worked so hard on for two decades before
becoming the Nonstop people. Nobody else has done it. Today,
Nonstop servers are the only fault-tolerant systems out-of-the-box in
the marketplace, and they hold the high ground for availability.
Redundancy
All of fault tolerance is an exercise in exploiting and managing redundancy.
Redundancy is the property of having more of a resource than is minimally
necessary to do the job at hand. As failures happen, redundancy is exploited
to mask or otherwise work around these failures, thus maintaining the desired
level of functionality.

There are four forms of redundancy that we will study: hardware, software,
information, and time. Hardware faults are usually dealt with by using
hardware, information, or time redundancy, whereas software faults are
protected against by software redundancy.

Hardware redundancy is provided by incorporating extra hardware into the
design to either detect or override the effects of a failed component. For
example, instead of having a single processor, we can use two or three
processors, each performing the same function. By having two processors, we
can detect the failure of a single processor; by having three, we can use the
majority output to override the wrong output of a single faulty processor. This
is an example of static hardware redundancy, the main objective of which is
the immediate masking of a failure. A different form of hardware redundancy
is dynamic redundancy, where spare components are activated upon the failure
of a currently active component. A combination of static and dynamic
redundancy techniques is also possible, leading to hybrid hardware
redundancy.

Hardware redundancy can thus range from a simple duplication to complicated
structures that switch in spare units when active ones become faulty. These
forms of hardware redundancy incur high overheads, and their use is therefore
normally reserved for critical systems where such overheads can be justified.
In particular, substantial amounts of redundancy are required to protect against
malicious faults.
The best-known form of information redundancy is error detection and
correction coding. Here, extra bits (called check bits) are added to the original
data bits so that an error in the data bits can be detected or even corrected. The
resulting error-detecting and error-correcting codes are widely used today in
memory units and various storage devices to protect against benign failures.
Note that these error codes (like any other form of information redundancy)
require extra hardware to process the redundant data (the check bits).
Error-detecting and error-correcting codes are also used to protect data
communicated over noisy channels, which are channels that are subject to
many transient failures. These channels can be either the communication links
among widely separated processors (e.g., the Internet) or among locally
connected processors that form a local network. If the code used for data
communication is capable of only detecting the faults that have occurred (but
not correcting them), we can retransmit as necessary, thus employing time
redundancy.

In addition to transient data communication failures due to noise, local and
wide-area networks may experience permanent link failures. These failures
may disconnect one or more existing communication paths, resulting in a
longer communication delay between certain nodes in the network, a lower
data bandwidth between certain node pairs, or even a complete disconnection
of certain nodes from the rest of the network. Redundant communication links
(i.e., hardware redundancy) can alleviate most of these problems.

Computing nodes can also exploit time redundancy through re-execution of the
same program on the same hardware. As before, time redundancy is effective
mainly against transient faults. Because the majority of hardware faults are
transient, it is unlikely that the separate executions will experience the same
fault.
Time redundancy can thus be used to detect transient faults in situations in
which such faults may otherwise go undetected. Time redundancy can also be
used when other means for detecting errors are in place and the system is
capable of recovering from the effects of the fault and repeating the
computation. Compared with the other forms of redundancy, time redundancy
has much lower hardware and software overhead but incurs a high
performance penalty.
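A minimal sketch of time redundancy in Python: the same computation is executed twice on the same hardware and the results compared, so a transient fault that corrupts one execution is detected; the retry policy shown is an assumption made for illustration.

```python
def run_with_time_redundancy(computation, *args, retries=1):
    """Execute the same computation twice and compare; a mismatch indicates a
    likely transient fault, in which case the pair of executions is repeated."""
    for _ in range(retries + 1):
        first = computation(*args)
        second = computation(*args)        # re-execution on the same hardware
        if first == second:
            return first                   # results agree: accept the result
    raise RuntimeError("persistent disagreement: suspect a permanent fault")

print(run_with_time_redundancy(sum, [1, 2, 3]))   # -> 6
```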
Software redundancy is used mainly against software failures. It is a reasonable
guess that every large piece of software that has ever been produced has
contained faults (bugs). Dealing with such faults can be expensive: one way is
to independently produce two or more versions of that software (preferably by
disjoint teams of programmers) in the hope that the different versions will not
fail on the same input. The secondary version(s) can be based on simpler and
less accurate algorithms (and, consequently, less likely to have faults) to be
used only upon the failure of the primary software to produce acceptable
results. Just as for hardware redundancy, the multiple versions of the program
can be executed either concurrently (requiring redundant hardware as well) or
sequentially (requiring extra time, i.e., time redundancy) upon a failure
detection.

Techniques of Redundancy
The concept of redundancy implies the addition of information, resources, or
time beyond what is needed for normal system operation. The redundancy can
take one of several forms, including hardware redundancy, software
redundancy, information redundancy, and time redundancy. The use of
redundancy can provide additional capabilities within a system. In fact, if fault
tolerance or fault detection is required then some form of redundancy is also
required. But, it must be understood that redundancy can have a very important
impact on a system in the areas of performance, size, weight, power
consumption, reliability, and others.

Hardware Redundancy
The physical replication of hardware is perhaps the most common form of
redundancy used in systems. As semiconductor components have become
smaller and less expensive, the concept of hardware redundancy has become
more common and more practical. The costs of replicating hardware within a
system are decreasing simply because the costs of hardware are decreasing.
There are three basic forms of hardware redundancy. First, passive
techniques use the concept of fault masking to hide the occurrence of faults
and prevent the faults from resulting in errors. Passive approaches are designed
to achieve fault tolerance without requiring any action on the part of the system
or an operator. Passive techniques, in their most basic form, do not provide for
the detection of faults but simply mask the faults.

The second form of hardware redundancy is the active approach, which is
sometimes called the dynamic method. Active methods achieve fault tolerance
by detecting the existence of faults and performing some action to remove the
faulty hardware from the system. In other words, active techniques require that
the system perform reconfiguration to tolerate faults. Active hardware
redundancy uses fault detection, fault location, and fault recovery in an attempt
to achieve fault tolerance. The final form of hardware redundancy is the hybrid
approach. Hybrid techniques combine the attractive features of both the
passive and active approaches. Fault masking is used in hybrid systems to
prevent erroneous results from being generated. Fault detection, fault location,
and fault recovery are also used in the hybrid approaches to improve fault
tolerance by removing faulty hardware and replacing it with spares. Providing
spares is one form of providing redundancy in a system.

Hybrid methods are most often used in the critical-computation applications
where fault masking is required to prevent momentary errors, and high
reliability must be achieved. Hybrid hardware redundancy is usually a very
expensive form of redundancy to implement.
Passive Hardware Redundancy. Passive hardware redundancy relies upon
voting mechanisms to mask the occurrence of faults. Most passive approaches
are developed around the concept of majority voting. As previously mentioned,
the passive approaches achieve fault tolerance without the need for fault
detection or system reconfiguration; the passive designs inherently tolerate the
faults. The most common form of passive hardware redundancy is called triple
modular redundancy (TMR). The basic concept of TMR is to triplicate the
hardware and perform a majority vote to determine the output of the system. If
one of the modules becomes faulty, the two remaining fault free modules mask
the results of the faulty module when the majority vote is performed. The basic
concept of TMR is illustrated in Figure 1. In typical applications, the replicated
modules are processors, memories, or any hardware entity. A simple example
of TMR is shown in Figure 1 where data from three independent processors is
voted upon before being written to memory. The majority vote provides a
mechanism for ensuring that each memory contains the correct data, even if a
single faulty processor exists. A similar voting process is provided at the output
of the memories, so that a single memory failure will not corrupt the data
provided to any one processor. Note that in Figure 2 there are three separate
voters so that the failure of a single voter cannot corrupt more than one memory
or more than one processor.

Figure 1: Basic TMR Model

The primary challenge with TMR is obviously the voter; if the voter fails, the
complete system fails. In other words, the reliability of the simplest form of
TMR, as shown in Figure 1, can be no better than the reliability of the voter.
Any single component within a system whose failure will lead to a failure of
the system is called a single point of failure. Several techniques can be used
to overcome the effects of voter failure. One approach is to triplicate the voters
and provide three independent outputs, as illustrated in Figure 2. In Figure 2,
each of three memories receives data from a voter which has received its inputs from the three separate
processors. If one processor fails, each memory will continue to receive a
correct value because its voter will correct the corrupted value. A TMR system
with triplicated voters is commonly called a restoring organ because the
configuration will produce three correct outputs even if one input is faulty. In
essence, the TMR with triplicated voters restores the error-free signal. A
generalization of the TMR approach is the N-modular redundancy (NMR)
technique. NMR applies the same principle as TMR but uses N of a given
module as opposed to only three. In most cases, N is selected as an odd number
so that a majority voting arrangement can be used. The advantage of using N
modules rather than three is that more module faults can often be tolerated.

For example, a 5MR system contains five replicated modules and a voter. A
majority voting arrangement allows the 5MR system to produce correct results
in the face of as many as two module faults. In many critical-computation
applications, two faults must be tolerated to allow the required reliability and
fault tolerance capabilities to be achieved. The primary tradeoff in NMR is the
fault tolerance achieved versus the hardware required. Clearly, there must be
some limit in practical applications on the amount of redundancy that can be
employed. Power, weight, cost, and size limitations very often determine the
value of N in an NMR system.
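A majority voter of the kind used in TMR and NMR can be sketched in a few lines of Python, assuming exact agreement between module outputs (inexact values are discussed further below):

```python
from collections import Counter

def majority_vote(outputs):
    """NMR majority voter: return the value produced by a majority of the N
    redundant modules, or raise if no majority exists."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    raise RuntimeError("no majority: too many faulty modules")

print(majority_vote([42, 42, 17]))       # TMR (N=3): one faulty module is masked -> 42
print(majority_vote([7, 7, 7, 0, 3]))    # 5MR (N=5): up to two faulty modules masked -> 7
```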

Voting within NMR systems can occur at several points. For example, an
industrial controller can sample the temperature of a chemical process from
three independent sensors, perform a vote to determine which of the three
sensor values to use, calculate the amount of heat or cooling to provide to the
process (the calculations being performed by three or more separate modules),
and then vote on the calculations to determine a result. The voting can be
performed on both analog and digital data. The alternative, in this example,
might be to sample the temperature from three independent sensors, perform
the calculations, and then provide a single vote on the final result. The primary
difference between the two approaches is fault containment. If voting is not
performed on the temperature values from the sensors, then the effect of a
sensor fault is allowed to propagate beyond the sensors and into the primary
calculations. Voting at the sensors, however, will mask, and contain, the effects
of a sensor fault. Providing several levels of voting, however, does require
additional redundancy, and the benefits of fault containment must be compared
to the cost of the extra redundancy.
Figure 2: Triplicated voters in a TMR configuration

In addition to a number of design tradeoffs on voting, there are several
problems with the voting procedure, as well. The first is deciding whether a
hardware voter will be used, or whether the voting process will be implemented
in software. A software voter takes advantage of the computational capabilities
available in a processor to perform the voting process with a minimum amount
of additional hardware. Also, the software voter provides the ability to modify
the manner in which the voting is performed by simply modifying the software.
The disadvantage of the software voter is that the voting can require more time
to perform simply because the processor cannot execute instructions and
process data as rapidly as a dedicated hardware voter. The decision to use
hardware or software voting will typically depend upon:
• the availability of a processor to perform the voting,
• the speed at which voting must be performed,
• the criticality of space, power, and weight limitations,
• the number of different voters that must be provided, and
• the flexibility required of the voter with respect to future changes in the
system.

The concept of software voting is shown in Figure 3. Each processor
executes its own version of task A. Upon completion of the tasks, each
processor shares its results with processor 2, who then votes on the results
before using them as input to task B. If necessary, each processor might
also execute its version of the voting routine and receive data from the other
processors.
A second major problem with the practical application of voting is that the
three results in a TMR system, for example, may not completely agree,
even in a fault-free environment. The sensors that are used in many control
systems can seldom be manufactured such that their values agree exactly.
Also, an analog-to-digital converter can produce quantities that disagree in
the least-significant bits, even if the exact signal is passed through the same
converter multiple times. When values that disagree slightly are processed,
the disagreement can propagate into larger discrepancies. In other words,
small differences in inputs can produce large differences in outputs that can
significantly affect the voting process. Consequently, a majority voter may
find that no two results agree exactly in a TMR system, even though the
system may be functioning perfectly.

One approach that alleviates the problem of the previous paragraph is called
the mid-value select technique. Basically, the mid-value select approach
chooses a value from the three available in a TMR system by selecting the
value that lies between the remaining two. If three signals are available, and
two of those signals are uncorrupted and the third is corrupted, one of the
uncorrupted results should lie between the other uncorrupted result and the
corrupted result. The mid-value select technique can be applied to any
system that uses an odd number of modules such that one signal must lie in
the middle of the others. The major difficulty with most techniques that use
some form of voting is that a single result must ultimately be produced,
thus creating a potential point where one failure can cause a system failure.
Clearly, single points of failure are to be avoided if a system is to be truly
fault-tolerant.
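For three inexact values, mid-value select reduces to taking the median; a minimal Python sketch, with hypothetical sensor readings, is shown below.

```python
def mid_value_select(a, b, c):
    """Mid-value select for TMR with inexact values: choose the value that
    lies between the other two."""
    return sorted((a, b, c))[1]

# Two healthy sensors that disagree slightly and one corrupted sensor:
print(mid_value_select(20.1, 20.3, 87.5))   # -> 20.3 (the corrupted 87.5 is ignored)
```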
Active Hardware Redundancy. Active hardware redundancy techniques
attempt to achieve fault tolerance by fault detection, fault location, and fault
recovery. In many designs faults can be detected because of the errors they
produce, so in many instances error detection, error location and error
recovery are the appropriate terms to use. The property of fault masking,
however, is not obtained in the active redundancy approach. In other words,
there is no attempt to prevent faults from producing errors within the
system. Consequently, active approaches are most common in applications
where temporary, erroneous results are acceptable as long as the system
reconfigures and regains its operational status in a satisfactory length of
time. Satellite systems are good examples of applications of active
redundancy. Typically, it is not catastrophic if satellites have infrequent,
temporary failures. In fact, it is usually preferable to have temporary
failures than to provide the large quantities of redundancy necessary to
achieve fault masking.

Figure 3: A model of the active approach to fault tolerance.

The basic operation of an active approach to fault tolerance is shown in Figure
3. During the normal operation of a system a fault can obviously occur. After
the fault latency period, the fault will produce an error, which is either
detected or not detected. If the
error remains undetected, the result will be a system failure. The failure will
occur after a latency period has expired. If the error is detected, the source of
the error must be located, and the faulty component removed from operation.
Next, a spare component must be enabled, and the system brought back to an
operational state. It is important to note that the new operational state may be
identical to the original operational state of the system or it may be a degraded
mode of operation. The processes of fault location, fault containment, and fault
recovery are normally referred to simply as reconfiguration.

It is clear from this description that active approaches to fault tolerance require
fault detection and location capabilities.

Information Redundancy
Information redundancy is the addition of redundant information to data to
allow fault detection, fault masking, or possibly fault tolerance. Good examples
of information redundancy are error detecting and error correcting codes,
formed by the addition of redundant information to data words, or by the
mapping of data words into new representations containing redundant
information.
Before beginning the discussions of various codes, we will define
several basic terms that will appear throughout this section. In general, a code
is a means of representing information, or data, using a well-defined set of
rules. A code word is a collection of symbols, often called digits if the symbols
are numbers, used to represent a particular piece of data based upon a specified
code. A binary code is one in which the symbols forming each code word
consist of only the digits 0 and 1. For example, a Binary Coded Decimal (BCD)
code defines a 4-bit code word for each decimal digit and is clearly a
binary code. A code word is said to be valid if the code
word adheres to all of the rules that define the code; otherwise, the code word
is said to be invalid.

The encoding operation is the process of determining the corresponding code
word for a particular data item. In other words, the encoding process takes an
original data item and represents it as a code word using the rules of the code.
The decoding operation is the process of recovering the original data from the
code word. In other words, the decoding process takes a code word and
determines the data that it represents. Of primary interest here are binary codes.
In many binary code words, a single error in one of the binary digits will cause
the resulting code word to no longer be correct, but, at the same time, the code
word will be valid. Consequently, the user of the information has no means of
determining the correctness of the information. It is possible, however, to
create a binary code for which the valid code words are a subset of the total
number of possible combinations of 1s and 0s. If the code words are formed
correctly, errors introduced into a code word will force it to lie in the range of
illegal, or invalid code words, and the error can be detected.
This is the basic concept of the error detecting codes. The basic concept of the
error correcting code is that the code word is structured such that it is possible
to determine the correct code word from the corrupted, or erroneous, code
word. Typically, the code is described by the number of bit errors that can be
corrected. For example, a code that can correct single-bit errors is called a
single error correcting code. A code that can correct two-bit errors is called a
double-error correcting code, and so on.
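The simplest illustration of these ideas is a single even-parity check bit, which detects (but cannot correct) any single-bit error; the Python sketch below is a toy example, not one of the codes actually used in memories or storage devices.

```python
def add_parity(data_bits):
    """Encode: append an even-parity check bit so the total number of 1s is even."""
    return data_bits + [sum(data_bits) % 2]

def is_valid(code_word):
    """Check: a code word with an odd number of 1s is invalid, i.e. an error is detected."""
    return sum(code_word) % 2 == 0

code_word = add_parity([1, 0, 1, 1])   # -> [1, 0, 1, 1, 1]
print(is_valid(code_word))             # True: valid code word

code_word[2] ^= 1                      # introduce a single-bit error
print(is_valid(code_word))             # False: error detected (but not correctable)
```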

Relationship Between Security and Fault Tolerance


Security plays an increasingly important role for software systems. Security
concerns must inform every phase of software development, from the problem
domain to the solution domain. Software security estimation helps gauge the
degree of protection and assess the impact. Microsoft has stated that more than
50% of the security-related problems found in any firm originate at the design
level of the software development process. Software security touchpoints are
based on good software engineering and involve explicitly considering security
throughout the software lifecycle. Security estimation may heavily affect the
security of the final product. Experts have tried, in this regard, to develop
security estimation guidelines, views and concepts. There is some probability
that the original code segments have security flaws or anomalies that may
influence security at different phases.

Security assessment is helpful for software developers, the risk management
team and the executives of the company. It requires a thoughtful, subtle
understanding of security, including security measurements, classifications and
security attributes. Security attributes may decrease the cost and bridge the gap
between the problem domain and the solution domain at each phase of the
development life cycle. Software security is an external software attribute that
reduces the faults and effort required for secured software. Security must
encompass dependable protection of the software system against all relevant
concerns, including confidentiality, integrity, availability, non-repudiation,
survivability and accessibility despite attempted compromises, preventing
misuse and reducing the consequences of unforeseen threats. Fault tolerance is
directly associated with security attributes such as confidentiality, integrity,
availability, non-repudiation, and survivability, and considering fault tolerance
will efficiently improve security. Fault tolerance is frequently essential, but it
can be risky and error-prone because of the added effort that must be invested
in the programming process. A consistent quantitative estimate of security is
highly desirable at an early stage of the software development life cycle. Fault
tolerance is directly associated with reliability and security. Fault prevention
and fault tolerance aim to provide the ability to deliver a correct service.
Controlling and monitoring can work together to enforce the security policy.
Fault tolerance is the ability of a system to continue to secure the software
modules in the presence of software faults. Fault tolerance attributes such as
fault masking, fault detection and fault containment are effective in supporting
the security policy. Fault tolerance implies savings in development time, cost
and effort; it also reduces the number of components that must be originally
developed.

METHODS FOR FAULT TOLERANT COMPUTING


Fault Tree Analysis (FTA) is a convenient means to logically think through the
many ways a failure may occur. It provides insights that may lead to product
improvements or process controls. FTA provides a means to logically and
graphically display the paths to failure for a system or component. One way to
manage a complex system is to start with a reliability block diagram (RBD)
and then create a fault tree for each block in the RBD. Whether the analysis
targets a single block or a top-level fault for a system, creating a fault tree
follows a basic pattern comprising eight steps:
• Define the system. This includes the scope of the analysis including
defining what is considered a failure. This becomes important when a
system may have an element fail or a single function fails and the
remainder of the system still operates.
• Define top-level faults. Whether for a system or a single block, define
the starting point for the analysis by detailing the failure of interest for
the analysis.
• Identify causes for the top-level fault. What events could cause the top-
level fault to occur? Use the logic gate symbols to organize which causes
lead to the failure alone (OR) and which require multiple events to
occur together before the failure occurs (AND).
• Identify next level of events. Each event leading to the top level failure
may also have precipitating events.
• Identify root causes. For each event above continue to identify
precipitating events or causes to identify the root or basic cause of the
sequence of events leading to failure.
• Add probabilities to events. When possible, add the actual or relative
probability of occurrence of each event (a small worked sketch follows
this list).
• Analyse the fault tree. Look for the most likely events that lead to
failure, for single events that initiate multiple paths to failure, or for
patterns related to stresses, use, or operating conditions. Identify means
to resolve or mitigate paths to failure.
• Document the FTA. Beyond the fault tree graphic, include salient
notes from the discussion and action items.
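To make the probability step concrete, the Python sketch below shows how
basic-event probabilities can combine through OR and AND gates, assuming the
events are independent. The event names and probabilities are entirely
hypothetical and used only for illustration.

def or_gate(probabilities):
    # P(A or B or ...) = 1 - product of (1 - P) for independent events.
    result = 1.0
    for p in probabilities:
        result *= (1.0 - p)
    return 1.0 - result

def and_gate(probabilities):
    # P(A and B and ...) = product of P for independent events.
    result = 1.0
    for p in probabilities:
        result *= p
    return result

# Hypothetical example: top-level fault "car will not start"
p_main_battery_flat   = 0.02
p_backup_battery_flat = 0.10
p_starter_fault       = 0.01
p_empty_tank          = 0.005

# AND gate: no electrical power only if both batteries are flat.
p_no_power = and_gate([p_main_battery_flat, p_backup_battery_flat])

# OR gate: any one of these events alone causes the top-level fault.
p_top_event = or_gate([p_no_power, p_starter_fault, p_empty_tank])
print(round(p_top_event, 5))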

Benefits of FTA
FTA is a convenient means to logically think through the many ways a failure
may occur. It provides insights that may lead to product improvements or
process controls. It is a logical, graphical diagram that organizes the possible
element failures and combination of failures that lead to the top level fault
being studied. The converse, the success tree analysis, starts with the
successful operation of a system, for example, and examines in a logical,
graphical manner all the elements and combinations that have to work
successfully.

With every product, there are numerous ways it can fail, some more likely
than others. FTA permits a team to think through and organize the sequences
or patterns of faults that have to occur to cause a specific top-level fault.
The top-level fault may be a specific type of failure, say the car will not
start, or it may focus on a serious safety-related failure, such as the
starter motor overheating and starting a fire. A complex system may have
numerous fault trees, each exploring a different failure mode.

The primary benefit of FTA is that the analysis provides a unique insight into
the operation and potential failure of a system. This allows the development
team to explore ways to eliminate or minimize the occurrence of product
failure. By examining how a failure mode can occur through its individual
failure causes and mechanisms, design changes can be targeted at the root
causes of the potential failures.

The benefits include:


• Identify failures deductively. Using the logic of a detailed failure analysis
and tools like ‘5 whys’, FTA helps the team focus on the causes of each
event in a logical sequence that leads to the failure.
• Highlight the important elements of the system related to system failure.
The FTA process may point to a single component or material that causes
many paths to failure; improving that one element may therefore minimize
the possibility of many failures.
• Create a graphical aid for system analysis and management.
• Apparently managers like graphics, and for a complex system, it helps to
focus the team on critical elements.
• Provides an alternative way to analyse the system. While FMEA, RBD and
other tools permit a way to explore system reliability, FTA provides a tool
that focuses on failure modes one at a time. Sometimes a shift in the frame
of reference illuminates new and important elements of the system.
• Focus on one fault at a time. The FTA can start with an overall failure
mode, like the car not starting, or it can focus on one element of the vehicle
failing, like the airbag not inflating as expected within a vehicle. The team
chooses the area for focus at the start of the analysis.
• Expose system behavior and possible interactions. FTA allows the
examination of the many ways a fault may occur and may expose
nonobvious paths to failure that other analysis approaches miss.
• Account for human error. FTA includes hardware, software, and human
factors in the analysis as needed. The FTA approach includes the full range
of causes for a failure.
• Just another tool in the reliability engineering toolbox. For complex
systems and with many possible ways that a significant fault may occur,
FTA provides a great way to organize and manage the exploration of the
causes. The value comes from the insights created that lead to design
changes to avoid or minimize the fault.

Fault Detection Methods


Fault detection plays an important role in high cost and safety-critical
processes. Early detection of process faults can help avoid abnormal event
progression. Fault detection determines the occurrence of fault in the
monitored system. It consists of detection of faults in the processes, actuators
and sensors by using dependencies between different measurable signals.
Related tasks are also fault isolation and fault identification. Fault isolation
determines the location and the type of fault whereas fault identification
determines the magnitude (size) of the fault. Fault isolation and fault
identification are together referred to as fault diagnosis. The task of fault
diagnosis consists of the determination of the type of the fault, with as many
details as possible such as the fault size, location and time of detection.

Several overlapping taxonomies of the field exist. Some are oriented more
toward a control engineering approach, others toward mathematical,
statistical and AI approaches. A useful division of fault detection methods
is given below (a minimal limit-checking sketch follows the lists):

A. Data Methods and Signal Models


• Limit checking and trend checking
• Data analysis (PCA)
• Spectrum analysis and parametric models
• Pattern recognition (neural nets)
B. Process Model Based Methods
• Parity equations
• State observers
• Parameter estimation
• Nonlinear models (neural nets)
C. Knowledge Based Methods
• Expert systems
• Fuzzy logic
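As an example of the simplest data-driven method listed above, limit checking,
the following Python sketch flags a fault whenever a monitored signal leaves
its allowed band. The sensor readings and thresholds are assumed values for
illustration only.

def limit_check(samples, lower, upper):
    # Return the indices at which the signal violates its allowed band.
    return [i for i, value in enumerate(samples)
            if value < lower or value > upper]

# Hypothetical temperature readings and limits
readings = [71.2, 72.0, 73.5, 88.9, 72.8]
faults = limit_check(readings, lower=60.0, upper=80.0)
print(faults)   # [3] - sample 3 exceeds the upper limit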

Fault Tolerance Architecture


Several fault-tolerant architectures have been proposed in the literature
to address circuit reliability concerns. A few of the relevant
solutions include Partial-TMR, Full-TMR, DARA-TMR, PaS,
CPipe, STEM and Razor, which are discussed below. We select this set of
architectures because it includes a representative of each class of the
broad spectrum of fault-tolerant architectures.

PAIR-AND-A-SPARE
Pair-and-A-Spare (PaS) redundancy was first introduced as an
approach that combines duplication with comparison and standby sparing.
In this scheme each module copy is coupled with a Fault Detection (FD)
unit to detect hardware anomalies within the scope of the individual module.
A comparator is used to detect inequalities in the results from the two active
modules. In the case of an inequality, a switch decides which of the two
active modules is faulty by analyzing the reports from the FD units and
replaces it with a spare one. This scheme was intended to prevent hardware
faults from causing system failures. The scheme fundamentally lacks
protection against transient faults and incurs a large hardware overhead
to accomplish the identification of faulty modules.

Figure 4: Pair-and-A-Spare
RAZOR
Razor is a well-known solution to achieve timing error resilience by
using a technique called timing speculation. The principal idea
behind this architecture is to employ temporally separated
double-sampling of input data using Razor FFs placed on critical
paths. The main FF takes an input sample on the rising edge of the clock,
while a time-shifted clock (clk-del) to the shadow latch is used to take
a second sample. By comparing the data of the main FF and the
shadow latch, an error signal is generated. A timing diagram illustrates how
the architecture detects timing errors. In this example a timing fault in
CL A causes the data to arrive late enough that the main FF captures
wrong data, but the shadow latch always captures the input data correctly.
The error is signaled by the XOR gate and propagates through the
OR-tree so that corrective action can be taken. Error recovery in Razor is
possible either by clock-gating or by rollback recovery. Razor also
uses a Dynamic Voltage Scaling (DVS) scheme to optimize the energy
vs. error rate trade-off.

Figure 5: RAZOR Architecture


STEM
The STEM cell architecture takes Razor a step further by incorporating the
capability to deal with transient faults as well. The STEM cell architecture
incorporates power saving and performance enhancement mechanisms such as
Dynamic Frequency Scaling (DFS) to operate circuits beyond their worst-case
limits. Similar to Razor FFs, STEM cells replace the FFs on the circuit
critical paths, but instead of taking two temporally separated samples, a
STEM cell takes three samples using two delayed clocks. Mismatches are
detected by the comparators and the error signals are used to select the
sample which is most likely to be correct for rollback.
Figure 6: STEM Architecture

CPipe
The CPipe, or Conjoined Pipeline, architecture uses spatial
and temporal redundancy to detect and recover from timing and
transient errors. It duplicates the CL blocks and the FFs to form
two pipelines interlinked together. The primary or leading pipeline is
overclocked to speed up execution, while the replicated or shadow
pipeline has sufficient speed margin to be free from timing errors.
Comparators placed across the leading pipeline registers, in a somewhat
similar way to the Razor scheme, detect any metastable state of the leading
pipeline registers and SETs reaching the registers during the latching
window. Error recovery is achieved by stalling the pipelines and
using data from the shadow pipeline registers for rollback, and it takes
3 cycles to complete.

Figure 7: CPipe Architecture


TMR
TMR is one of the most popular fault tolerant architectures. In a basic
TMR scheme called Partial-TMR, there are three implementations of the
same logic function and their outputs are voted on by a voter circuit. This
architecture can tolerate all single faults occurring in the CL block,
but faults in the voter or pipeline registers cause the system to fail. Full-
TMR, on the other hand, triplicates the entire circuit including the FFs
and can tolerate all single faults in any part of the circuit except the voter
and the signals to the input pipeline register, which may result in
common-mode failure.

Figure 8: TMR Architecture
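A bit-level majority voter is the core of any TMR scheme. The Python sketch
below is purely illustrative (it is not tied to any particular hardware
description or to the architectures above); it shows how voting masks a
single faulty copy.

def majority_vote(a, b, c):
    # Bitwise majority of three redundant outputs:
    # each output bit is 1 only if at least two of the copies agree on 1.
    return (a & b) | (a & c) | (b & c)

correct = 0b1011_0101
faulty  = correct ^ 0b0000_1000   # one copy suffers a single bit-flip

print(bin(majority_vote(correct, correct, faulty)))  # 0b10110101 - error masked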

DARA-TMR
DARA-TMR triplicates the entire pipeline but uses only two pipeline
copies to run identical process threads in Dual Modular Redundancy
(DMR) mode. The third pipeline copy is disabled using power gating
and is only engaged for diagnosis purposes in case of very frequent
errors reported by the detection circuitry. Once the defective pipeline
is identified, the system returns to DMR mode by powering off the
defective pipeline. Error recovery follows the same mechanism as
pipeline branch misprediction, making use of architectural components
for error recovery. DARA-TMR treats permanent fault occurrence as a
very rare phenomenon and undergoes a lengthy reconfiguration mechanism
to isolate such faults.

Figure 9: DARA-TMR Architecture

Hybrid Fault-Tolerant Architecture


HyFT architecture employs information redundancy (as duplication
with comparison) for error detection, timing redundancy (in the form
of re-computation/rollback) for transient and timing error correction
and hardware redundancy (to support reconfiguration) for permanent
error correction. The HyFT architecture employs triplication of the CL
blocks. A set of multiplexers and demultiplexers is used to select two
primary CL copies and to put the third CL copy in standby mode
during normal operation. The HyFT architecture is driven by a control logic
module that generates the necessary control signals. HyFT architecture
uses the pseudo-dynamic comparator for error detection to achieve
better glitch detection capability and to reduce the power consumption.
The HyFT architecture uses a concurrent error detection mechanism.
A pseudo-dynamic comparator compares the outputs of two active CL
copies. It can be seen that the comparator is placed across the output
register such that it gets to compare the output of the output register
Sout, which is a synchronous signal, with the output of the secondary
running copy Aout, which is an asynchronous signal. This orientation
of the pseudo-dynamic comparator also offers marginal protection
against the errors due to faults in the output register and allows it to
remain off the critical path. Thus, it does not impact the temporal
performance of the circuit. The error recovery scheme uses stage-level
granularity reconfigurations and single-cycle deep rollbacks. The
shadow latches incorporated in pipeline registers keep one clock cycle
preceding state of the pipeline register FFs. The comparison takes
place after every clock cycle. Thus, error detection can invoke a
reconfiguration and a rollback cycle, confining the error and
preventing it from affecting the computation in the following cycles.
The comparison takes place only during brief intervals of time referred
to as comparison window. The timing of comparison window is
defined by the high phase of a delayed clock signal DC, which is
generated from CLK using a delay element. These brief comparisons
allow keeping the switching activity in OR-tree of the comparator to a
minimum, offering a 30% power reduction compared with a static
comparator. The functioning of the pseudo-dynamic comparator
requires specific timing constraints to be applied during synthesis of
the CL blocks, as defined below.

Timing Constraints: In typical pipeline circuits the contamination delay
of the CL should respect the hold-time of the pipeline register latches.
However, in the HyFT architecture, as the CL also feeds the
pseudo-dynamic comparator, the CL outputs need to remain stable during
the comparison. Since the comparison takes place just after a clock edge,
any short paths in the CL can cause the input signals of the comparator
to start changing before the lapse of the comparison-window. Thus, the
CL copies have to be synthesized with minimum delay constraints governed by:

t_cd > δ_t − t_ccq − t_cdm − t_cm

where:
t_cd  = CL contamination delay
δ_t   = the amount of time between the CLK capture edge and the lapse of
        the comparison-window
t_ccq = FF clock-to-output delay
t_cdm = demultiplexer contamination delay
t_cm  = multiplexer contamination delay
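As a worked example of this constraint, the small Python sketch below checks
whether a synthesized CL copy satisfies the minimum-delay requirement. The
delay values (in nanoseconds) are purely illustrative and not taken from the
source.

def min_required_contamination_delay(delta_t, t_ccq, t_cdm, t_cm):
    # The CL contamination delay t_cd must exceed this value.
    return delta_t - t_ccq - t_cdm - t_cm

# Hypothetical delays (ns)
delta_t, t_ccq, t_cdm, t_cm = 0.60, 0.15, 0.10, 0.10
t_cd = 0.30   # contamination delay reported for a CL copy

print(t_cd > min_required_contamination_delay(delta_t, t_ccq, t_cdm, t_cm))  # True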

Figure 10: HyFT Architecture

The associated timing constraints can be explained with the help of a
timing diagram. Besides CLK and DC, the timing diagram shows the
signals at the two inputs of the comparator, labeled Aout and Sout. The
remaining two signals are the inputs of the CL, labeled CLin, and the
outputs of the CL, labeled CLout. The grey shaded regions indicate the
allowed margins of the corresponding signals. The timing allowance for
the start of the comparison-window depends on the clk-to-output delay
of the output register. This implies that the comparison should not begin
until the output of the output register stabilizes.

3.0.2.3 Fault Models


A fault model attempts to identify and categorize the faults that
may occur in a system, in order to provide clues as to how to fine-tune
the software development environment and how to improve error
detection and recovery. A question that needs to be asked is: is the
traditional distributed systems fault model appropriate for Grid
computing, or are refinements necessary?
The development of fault models is an essential part of the
process in determining the reliability of a system. A fault model
describes the types of faults that a system can develop, specifying where
and how they will occur in it. However, faults become more difficult to
formulate sensibly as a system is viewed at an increasingly abstract
level, especially the definition of how a fault manifests itself. The entities
listed in a fault model need not necessarily physically exist, but may be
abstractions of real-world objects. In general, a fault model is an
abstracted representation of the physical defects which can occur in a
system, such that it can be employed to usefully, and reasonably
accurately, simulate the behaviour of the system over its intended
lifetime with respect to its reliability. Four major goals exist when
devising a fault model:
1. The abstract faults described in the model should adequately cover
the effects of the physical faults which occur in the real-world
system.
2. The computational requirements for simulation should be
satisfiable.
3. The fault model should be conceptually simple and easy to use.
4. It should provide an insight into introducing fault tolerance in a
design.

Fault Tolerance Methods


A fault in a software system usually happens because of gaps left
unnoticed during the design phase. Based on this, fault tolerance
techniques are classified into two different groups, that is, the Single
Version techniques and the Multi-Version techniques. There can be
plenty of techniques implemented under each of these categories; a
few of the techniques often used by programmers are:
1. Software Structure and Actions
When the software system is one single block of code, it is logically
more vulnerable to failure, because when one tiny error occurs in the
program, the whole system will be brought down. Hence, it is crucial
that the software system be structured in a modular form, where
the functionality is divided among separate modules. In the case of failure,
each module should hold specific instructions on how to handle it and
let the other modules run as usual, instead of passing the failure on
from module to module.

2. Error Detection
Error Detection is a fault tolerance technique where the program
locates every incidence of error in the system. This technique is
practically implemented using two attributes, namely, self-protection
and self-checking. The self-protection attribute of error detection is
used for spotting errors in the external modules, whereas the
self-checking attribute of error detection is used for spotting errors
in the internal module.
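One simple way to realize the self-checking attribute in code is to guard a
module with executable assertions on its inputs and outputs. The Python sketch
below is a minimal illustration; the function and its invariants are
hypothetical, not taken from the source.

def compute_average(values):
    # Self-protection: validate data arriving from external modules.
    assert isinstance(values, list) and len(values) > 0, "invalid input"

    total = sum(values)
    average = total / len(values)

    # Self-checking: verify the module's own result before returning it.
    assert min(values) <= average <= max(values), "internal computation error"
    return average

print(compute_average([2, 4, 6]))   # 4.0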

3. Exception Handling
Exception Handling is a technique used for redirecting the execution
flow towards the route to recovery whenever an error occurs in the
normal functional flow. As part of fault tolerance, this activity is
performed using three different exception types, namely the
Interface Exception, the Local Exception and the Failure Exception.
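In a language with structured exception handling, these three components can
be mapped onto distinct exception types. The Python sketch below is purely
illustrative; the class names, the process function and the error scenario are
hypothetical.

class InterfaceException(Exception):   # invalid request reaching the module
    pass

class LocalException(Exception):       # error detected and signaled inside the module
    pass

class FailureException(Exception):     # module cannot deliver its service at all
    pass

def process(request):
    if not isinstance(request, dict):
        raise InterfaceException("malformed request")
    try:
        return 100 / request.get("divisor", 1)
    except ZeroDivisionError:
        # Convert the low-level error into a local exception so the caller
        # can redirect execution towards its recovery route.
        raise LocalException("divisor was zero") from None

try:
    process({"divisor": 0})
except (InterfaceException, LocalException, FailureException) as exc:
    print("recovered from:", exc)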
4. Checkpoint and Restart
This is one of the commonly used recovery methods for single
version software systems. The Checkpoint and Restart fault
tolerance technique can be used for events like run-time
exceptions, that is, when a malfunction takes place during run time
and, once execution is complete, there is no record of the error
having happened. For this case, the programmer can place checkpoints in
the program and instruct the program to restart immediately
from the point where the error occurred.
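A minimal checkpoint-and-restart loop might look like the Python sketch below.
The processing step, the simulated transient fault and the checkpoint file
format are all hypothetical; the point is that after a run-time failure the
program resumes from the last saved state rather than from the beginning.

import pickle, os, random

CHECKPOINT = "state.pkl"

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_item": 0, "results": []}

def save_checkpoint(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def process(item):
    if random.random() < 0.1:            # simulated transient fault
        raise RuntimeError("transient failure")
    return item * item

state = load_checkpoint()
while state["next_item"] < 10:
    try:
        state["results"].append(process(state["next_item"]))
        state["next_item"] += 1
        save_checkpoint(state)           # checkpoint after each completed step
    except RuntimeError:
        state = load_checkpoint()        # restart from the last checkpoint
print(state["results"])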

5. Process Pairs
Process Pair technique is a method of using the same software in
two different hardware units and validating the functional
differences in order to capture the faulty areas. This technique
functions on top of the checkpoint and restart technique, as similar
checkpoints and restart instructions are placed in both systems.

6. Data Diversity
Data Diversity technique is typically a process where the
programmer passes a set of input data, and places checkpoints for
detecting the slippage. The commonly used Data Diversity models
are ‘Input Data Re-Expression’ model, ‘Input Data Re-Expression
with Post-Execution Adjustment’ model, and ‘Re-Expression via
Decomposition and Recombination’ model.

7. Recovery Blocks
The Recovery Block technique for multiple-version software fault
tolerance builds on the checkpoint and restart method, where a
checkpoint is taken before a module executes, and the system
is instructed to move on to the next version to continue the flow if
that module fails. It is carried out using three parts, that is, the main
module, the acceptance tests, and the swap (alternate) module.
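The structure of a recovery block, a primary module, an acceptance test and an
alternate (swap) module, can be sketched as below. The square-root example,
the deliberately faulty primary and the tolerance are illustrative only, and
state checkpointing/rollback between versions is omitted for brevity.

def recovery_block(primary, alternates, acceptance_test, x):
    # Try the primary first, then each alternate version, keeping the
    # first result that passes the acceptance test.
    for version in [primary] + alternates:
        result = version(x)
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")

def fast_sqrt(x):          # primary module (deliberately faulty here)
    return x / 2.0

def newton_sqrt(x):        # alternate (swap) module
    guess = x or 1.0
    for _ in range(20):
        guess = 0.5 * (guess + x / guess)
    return guess

def accept(x, result):     # acceptance test: result squared must match x
    return abs(result * result - x) < 1e-6

print(recovery_block(fast_sqrt, [newton_sqrt], accept, 9.0))   # 3.0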

8. N – Version Programming
The N-Version Programming technique for multi-version fault
tolerance is the commonly used method when there is a provision
for testing multiple code editions. Recovery is achieved by
executing all the versions and comparing the outputs from each of
the versions. This technique also involves the acceptance test flow.

9. N Self–Checking Programming
N Self-Checking Programming is a technique combining both
the Recovery Block and the N-Version Programming techniques,
which also calls for acceptance test execution. It is performed by
the sequential and the parallel execution of various versions of the
software.

10. Consensus Recovery Blocks


This method combines the Recovery Block and the N-Version
Programming techniques, adding a decision algorithm for
handling and recovering from inaccuracies in the system.
This combination of efficient fault tolerance techniques gives a
much more consistent method of fault tolerance.
Major Issues in Modelling and Evaluation

• Interference with fault detection in the same component. In the
passenger vehicle example, with either of the fault-tolerant systems it
may not be obvious to the driver when a tire has been punctured. This
is usually handled with a separate "automated fault-detection system".
In the case of the tire, an air pressure monitor detects the loss of pressure
and notifies the driver. The alternative is a "manual fault-detection
system", such as manually inspecting all tires at each stop.

• Interference with fault detection in another component. Another
variation of this problem is when fault tolerance in one component
prevents fault detection in a different component. For example, if
component B performs some operation based on the output from
component A, then fault tolerance in B can hide a problem with A. If
component B is later changed (to a less fault-tolerant design) the system
may fail suddenly, making it appear that the new component B is the
problem. Only after the system has been carefully scrutinized will it
become clear that the root problem is actually with component A.
• Reduction of priority of fault correction. Even if the operator is aware
of the fault, having a fault-tolerant system is likely to reduce the
importance of repairing the fault. If the faults are not corrected, this will
eventually lead to system failure, when the fault-tolerant component
fails completely or when all redundant components have also failed.
• Test difficulty. For certain critical fault-tolerant systems, such as a
nuclear reactor, there is no easy way to verify that the backup
components are functional. The most infamous example of this is
Chernobyl, where operators tested the emergency backup cooling by
disabling primary and secondary cooling. The backup failed, resulting
in a core meltdown and massive release of radiation.
• Cost. Both fault-tolerant components and redundant components tend
to increase cost. This can be a purely economic cost or can include other
measures, such as weight. Manned spaceships, for example, have
so many redundant and fault-tolerant components that their weight is
increased dramatically over unmanned systems, which don't require the
same level of safety.
• Inferior components. A fault-tolerant design may allow for the use of
inferior components, which would have otherwise made the system
inoperable. While this practice has the potential to mitigate the cost
increase, use of multiple inferior components may lower the reliability
of the system to a level equal to, or even worse than, a comparable
non-fault-tolerant system.

Fault Tolerance for Web Applications


When a fault occurs in a web service, handling it passes through various
stages. When an error occurs, the service should trace the error or fault
through various fault detection mechanisms to determine the failure causes,
so that the failed components can be repaired or recovered from the error.
The flow of web service failure responses is shown in Figure 11.
Figure 11: Failure Stages of Web services

A) Error Confinement: The error confinement stage prevents an error
from affecting the web service. It can be achieved with the help of error
detection within a service through multiple checks.
B) Error Detection: The error detection stage helps in identifying
unexpected errors in a web service.
C) Error Diagnosis: The error diagnosis stage helps to diagnose the fault
that has been traced in the error detection stage. It comes into
the picture when error detection does not provide enough
information about the fault location.
D) Reconfiguration: Reconfiguration comes into the picture when an
error has been detected and located in the error detection and error
diagnosis stages.
E) Recovery: Recovery is used to recover the web service from the fault
using retry and rollback approaches.
F) Restart: Restart comes into the picture after the recovery of the web
service. Restart can be done using either a hot start or a cold start.
G) Repair: In repair, the failed component has to be changed in order to
work properly.
H) Reintegration: In the reintegration stage the repaired component has
to be reintegrated into the service.

In web services, there are many fault tolerant techniques that can be
applied, such as replication. Replication is an efficient technique for
handling exceptions in a distributed application. Services can resume more
effectively by maintaining the global state of the application. For instance,
assume that one service needs the assistance of another service to
provide the desired result to the customer, so the services need to
communicate with each other. Suppose that, while they are communicating,
a fault occurs in one service at a certain point in time; there is no need to
continue the service with faults. Instead, the state manager has
to roll back the state of the application to the point where the fault
occurred, so that the service can resume without the fault and a response
can be given to the consumer more effectively.
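A hedged sketch of this retry-and-rollback idea for a web service call is
shown below in Python. The partner service call, the saved state, the failure
probability and the retry limit are all hypothetical stand-ins rather than a
real web service API.

import random

def call_partner_service(payload):
    # Stand-in for a remote call that may fail transiently.
    if random.random() < 0.3:
        raise ConnectionError("partner service unavailable")
    return {"status": "ok", "echo": payload}

def invoke_with_recovery(payload, saved_state, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            return call_partner_service(payload)
        except ConnectionError:
            # Roll back to the state captured before the call, then retry.
            payload = dict(saved_state)
    return {"status": "failed", "echo": payload}   # report failure to consumer

state_before_call = {"order_id": 42}
print(invoke_with_recovery(dict(state_before_call), state_before_call))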

Fault Tolerance Implementation in Cloud Computing


A cloud is a type of parallel and distributed system containing a set of
interconnected and virtualized computers that are dynamically provisioned and
presented as one or more unified computing resources based on service-level
agreements established through negotiation between the service provider and
consumers. It is a style of computing where service is provided across the
Internet using different models and layers of abstraction, It refers to the
applications delivered as services to the masses, ranging from end-users
hosting their personal documents on the Internet to enterprises outsourcing
their entire IT infrastructure to external data centers. A simple example of
cloud computing service is Yahoo email or Gmail etc. Although cloud
computing has been widely adopted by the industry, still there are many
research issues to be fully addressed like fault tolerance, workflow scheduling,
workflow management, security etc. Fault tolerance is one of the key issues
amongst all. It is concerned with all the techniques necessary to enable a
system to tolerate software faults remaining in the system after its
development. When a fault occurs, these techniques provide mechanisms to
the software system to prevent system failure occurrence. The main benefits of
implementing fault tolerance in cloud computing include failure recovery,
lower cost, improved performance metrics, etc. A cloud infrastructure consists
of the following broad components:
• Servers – The physical machines that act as host machines for one or
more virtual machines.
• Virtualization – Technology that abstracts physical components such as
servers, storage, and networking and provides these as logical resources.
• Storage – In the form of Storage Area Networks (SAN), network attached
storage (NAS), disk drives etc. Along with facilities as archiving and
backup.
• Network – To provide interconnections between physical servers and
storage.
• Management – Various software for configuring, management and
monitoring of cloud infrastructure including servers, network, and storage
devices.
• Security – Components that provide integrity, availability, and
confidentiality of data and security of information, in general.
• Backup and recovery services.

Cloud computing, as a fast-advancing technology, is increasingly being
used to host many business and enterprise applications. However, the extensive
use of the cloud-based services for hosting business or enterprise applications
leads to service reliability and availability issues for both service providers and
users. These issues are intrinsic to cloud computing because of its highly
distributed nature, heterogeneity of resources and the massive scale of
operation. Consequently, several types of faults may occur in the cloud
environment leading to failures and performance degradation. The major types
of faults are listed as follows:
• Network fault: Since cloud computing resources are accessed over a
network (Internet), a predominant cause of failures in cloud computing
is network faults. These faults may occur due to partitions in the
network, packet loss or corruption, congestion, failure of the destination
node or link, etc.
• Physical faults: These are faults that mainly occur in hardware resources,
such as faults in CPUs, in memory, in storage, failure of power etc.
• Process faults: Faults may occur in processes because of resource shortage,
bugs in software, insufficient processing capabilities, etc.
• Service expiry fault: If a resource’s service time expires while an
application that leased it is using it, it leads to service failures.

The Fault Tolerance methods can be applied to cloud computing in three levels:
• At hardware level: if the attack on a hardware resource causes the system
failure, then its effect can be compensated by using additional hardware
resources.
• At software (s/w) level: Fault tolerance techniques such as checkpoint
restart and recovery methods can be used to progress system execution in
the event of failures due to security attacks.
• At system level: At this level, fault tolerance measures can compensate
failure in system amenities and guarantee the availability of network and
other resources.

Challenges of Implementing Fault Tolerance in Cloud Computing


Providing fault tolerance requires careful consideration and analysis because
of the complexity and inter-dependency of the techniques involved, and for
the following reasons:
• There is a need to implement autonomic fault tolerance technique for
multiple instances of an application running on several virtual machines
• Different technologies from competing vendors of cloud infrastructure
need to be integrated for establishing a reliable system
• The new approach needs to be developed that integrate these fault
tolerance techniques with existing workflow scheduling algorithms
• A benchmark based method can be developed in cloud environment for
evaluating the performances of fault tolerance component in comparison
with similar ones
• To ensure high reliability and availability, multiple cloud computing
providers with independent software stacks should be used
• Autonomic fault tolerance must react to synchronization among various
clouds

CONCLUSION
Fault-tolerance is achieved by applying a set of analysis and design techniques
to create systems with dramatically improved dependability. As new
technologies are developed and new applications arise, new fault-tolerance
approaches are also needed. In the early days of fault-tolerant computing, it
was possible to craft specific hardware and software solutions from the
ground up, but now chips contain complex, highly-integrated functions, and
hardware and software must be crafted to meet a variety of standards to be
economically viable. Thus, a great deal of current research focuses on
implementing fault tolerance using Commercial-Off-The-Shelf (COTS)
technology.
Recent developments include the adaptation of existing fault-tolerance
techniques to RAID disks where information is striped across several disks to
improve bandwidth and a redundant disk is used to hold encoded information
so that data can be reconstructed if a disk fails. Another area is the use of
application-based fault-tolerance techniques to detect errors in high
performance parallel processors. Fault-tolerance techniques are expected to
become increasingly important in deep sub-micron VLSI devices to combat
increasing noise problems and improve yield by tolerating defects that are
likely to occur on very large, complex chips.

SUMMARY
The ability of a system to continue operation despite a failure of any single
element within the system implies the system is not in a series configuration.
There is some set of redundancy or some alternative means to continue
operation. The system may use multiple redundant elements, or be resilient
to changes in the system's configuration. The appropriate solution to create a
fault tolerant system often requires careful planning, an understanding of how
elements fail, and of the impact of a failure on surrounding elements.
Fault-tolerance techniques will become even more important in the next few
years. The ideal, from an application writer's point of view, is total hardware
fault-tolerance. Trends in the market, e.g. Stratus and Sun Netra, show that
this is the way systems are going at the moment. There is also, fortunately,
reason to believe that such systems will become considerably cheaper than
today. Technology in general, and miniaturization in particular (which leads
to physically smaller and generally cheaper systems), contributes to this. Much
research is also being done with clusters of commercial general-purpose
computers connected with redundant buses. In that case, the software has to
handle the failures. However, as shown with the HA Cluster and Sun Netra,
that can also be done without affecting the user programs and applications.

6.0 TUTOR QUESTIONS


1. What Is Fault Tolerance?
2. Explain the different fault tolerance architectures that you know.
3. Explain the Methods for Fault Tolerant Computing
4. Describe the different forms of hardware Redundancy
5. Explain the properties of Fault Tolerant Systems
