
Introduction To Fault Tolerance

Fault tolerance is a non-functional (QoS) requirement that requires a system to continue to operate, even in the presence of faults. Fault tolerance should be achieved with minimal involvement of users or system administrators.

Uploaded by

ankitbhattt

BY:

ANKIT BHATT
ME-VLSI & EMBEDDED

What Is Failure?
A system is said to fail when it cannot meet its promises.
A failure is brought about by the existence of errors in the system.
The cause of an error is called a fault.

Concept of Fault Tolerance


Hardware, software and networks cannot be totally free
from failures
Fault tolerance is a non-functional (QoS) requirement
that requires a system to continue to operate, even in the
presence of faults
Fault tolerance should be achieved with minimal
involvement of users or system administrators
Distributed systems can be more fault tolerant than
centralized systems, but because they have more processor
hosts, individual faults are likely to occur more
frequently

Attributes, Consequences and Strategies

What is a dependable system?
Attributes
Availability
Reliability
Safety
Confidentiality
Integrity
Maintainability

How to distinguish faults?
Consequences
Fault
Error
Failure

How to handle faults?
Strategies
Fault prevention
Fault tolerance
Fault recovery
Fault forecasting

Distributed Systems

Terminology of Fault Tolerance


Fault --causes--> Error --results in--> Failure

Fault is a defect within the system
Error is observed as a deviation from the expected
behaviour of the system
Failure occurs when the system can no longer perform as
required (does not meet spec)
Fault tolerance is the ability of a system to provide a service,
even in the presence of errors

Strategies to Handle Faults


Fault avoidance
Techniques that aim to prevent faults from entering the system during the design stage
Fault removal
Methods that attempt to find faults within a system before it enters service
Fault detection
Techniques used during service to detect faults within the operational system
Fault tolerance
Techniques designed to tolerate faults, i.e. to allow the system to operate correctly in the presence of faults

Actions to identify and remove errors:
Design reviews
Testing
Use certified tools
Analysis:
Hazard analysis
Formal methods proof & refinement

No non-trivial system
can be guaranteed free
from error
Must have an
expectation of failure
and make appropriate
provision

Fault Models
A fault model identifies targets for testing
A fault model makes analysis possible
Effectiveness measurable by experiments
Different types

Stuck-at faults
Multiple stuck-at faults
Bridging faults

Single Stuck-At Fault

Single (line) stuck-at fault
The given line has a constant value (0/1) independent of other signal values in the circuit
Properties
o Only one line is faulty
o The faulty line is permanently set to 0 or 1
o The fault can be at an input or output of a gate
o Simple logical model is independent of technology
o It reduces the complexity of fault-detection

Example:
XOR circuit has 12 fault sites and 24 single stuck-at faults
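
The properties above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the circuit `z = (a AND b) OR c`, its line names, and the helper functions are hypothetical and are not the XOR circuit mentioned in the example. Each line is either fault-free or forced to a constant, and a test vector detects a fault when the faulty output differs from the fault-free output.

```python
from itertools import product

# Hypothetical example circuit (not from the slides): z = (a AND b) OR c.
# "d" is the internal line carrying the AND output.
LINES = ["a", "b", "c", "d", "z"]


def simulate(a, b, c, fault=None):
    """Return output z. `fault` is None or a (line, stuck_value) pair."""

    def apply(line, value):
        # A stuck-at fault forces the line to a constant, ignoring its driver.
        if fault is not None and fault[0] == line:
            return fault[1]
        return value

    a, b, c = apply("a", a), apply("b", b), apply("c", c)
    d = apply("d", a & b)
    return apply("z", d | c)


def single_stuck_at_faults():
    """Two faults (stuck-at-0, stuck-at-1) per line."""
    return [(line, value) for line in LINES for value in (0, 1)]


def detecting_vectors(fault):
    """All input vectors whose output differs from the fault-free output."""
    return [vec for vec in product((0, 1), repeat=3)
            if simulate(*vec) != simulate(*vec, fault=fault)]


if __name__ == "__main__":
    for f in single_stuck_at_faults():
        print(f, detecting_vectors(f))
```

With 5 lines this toy circuit has 10 single stuck-at faults, the same two-faults-per-site ratio as the 12 sites and 24 faults quoted for the XOR circuit.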

Multiple Stuck-At Faults


Multiple stuck-at fault

Several single stuck-at faults occur at the same time


Multiple stuck-at faults are usually not considered in
practice for two reasons:
o The number of multiple stuck-at faults in a circuit
with k lines is 3^k - 1, which is too large a number
even for circuits of moderate size
o Tests for single stuck-at faults are known to cover a
very high percentage (greater than 99.6%) of multiple stuck-at
faults when the circuit is large and
has several outputs
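
To see how quickly the fault count grows, here is a small sketch (function names are illustrative). Each of the k lines can be fault-free, stuck-at-0, or stuck-at-1, which gives 3^k combinations; removing the single all-fault-free combination leaves 3^k - 1 faulty ones. Single stuck-at faults, by contrast, grow only linearly as 2k.

```python
def single_stuck_at_count(k):
    """Each of the k lines can be stuck-at-0 or stuck-at-1."""
    return 2 * k


def multiple_stuck_at_count(k):
    """Each line is fault-free, s-a-0 or s-a-1; exclude the all-fault-free case."""
    return 3 ** k - 1


if __name__ == "__main__":
    for k in (4, 12, 20):
        print(k, single_stuck_at_count(k), multiple_stuck_at_count(k))
```

For a circuit with 12 lines this gives 24 single faults against 531440 multiple-fault combinations.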

Bridging Fault
Two or more normally distinct points (lines) are
shorted together
Two types of bridging faults:
Input bridging
Can form wired logic or voting model.
Feedback (input-to-output) bridging
Can introduce feedback.
Can cause oscillation or latching.
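
A minimal sketch of input bridging under the wired-logic model, assuming a hypothetical 2-input OR gate whose two inputs are shorted together. Under the wired-AND assumption both lines take the value `a AND b`; under wired-OR both take `a OR b`. Which model applies depends on the technology, so both are shown.

```python
from itertools import product


def or_gate(a, b):
    return a | b


def bridged_inputs(a, b, model):
    """Return the values seen on two shorted lines under a wired-logic model."""
    if model == "wired-and":
        v = a & b
    elif model == "wired-or":
        v = a | b
    else:
        raise ValueError("model must be 'wired-and' or 'wired-or'")
    return v, v


def bridging_tests(gate, model):
    """Input vectors for which the bridged gate output differs from the good one."""
    return [(a, b) for a, b in product((0, 1), repeat=2)
            if gate(a, b) != gate(*bridged_inputs(a, b, model))]


if __name__ == "__main__":
    print(bridging_tests(or_gate, "wired-and"))
    print(bridging_tests(or_gate, "wired-or"))
```

In this toy case the wired-AND bridge is exposed by any vector with unequal inputs, while the wired-OR bridge never changes the OR gate's output and so cannot be detected at this gate.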

Transistor Fault
o MOS transistor is considered an ideal switch.
o Two types of faults are modeled:
Stuck-open - A single transistor is permanently stuck in the open state. This turns the circuit into a sequential one, so a sequence of at least two test vectors is needed to detect a single fault.
Stuck-on - A single transistor is permanently shorted irrespective of its gate voltage.
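
The need for two vectors can be illustrated with a switch-level sketch. The gate chosen here (a 2-input CMOS NAND with the pMOS transistor on input A stuck open) and all names are assumptions made for illustration. When neither the pull-up nor the pull-down network conducts, the output node floats and keeps its previous value, which is what makes the faulty gate behave sequentially.

```python
def cmos_nand(a, b, prev_out, p_a_stuck_open=False):
    """Switch-level model of a 2-input CMOS NAND gate.

    prev_out is the value retained on the output node when it floats.
    """
    # Pull-up network: two pMOS transistors in parallel, each on when its input is 0.
    p_a_on = (a == 0) and not p_a_stuck_open
    p_b_on = (b == 0)
    pull_up = p_a_on or p_b_on
    # Pull-down network: two nMOS transistors in series, on when both inputs are 1.
    pull_down = (a == 1) and (b == 1)
    if pull_up:
        return 1
    if pull_down:
        return 0
    return prev_out  # output floats and retains its previous charge


def run_sequence(vectors, initial_out, p_a_stuck_open=False):
    """Apply input vectors in order and collect the output after each one."""
    out = initial_out
    outputs = []
    for a, b in vectors:
        out = cmos_nand(a, b, out, p_a_stuck_open)
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    test = [(1, 1), (0, 1)]  # first vector initialises the output to 0
    print(run_sequence(test, initial_out=1))
    print(run_sequence(test, initial_out=1, p_a_stuck_open=True))
```

The first vector (1, 1) drives the output to 0; the second vector (0, 1) should pull it to 1 through the faulty transistor. A single vector (0, 1) alone would not be a reliable test, because the floating output might already hold 1.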

Example of Transistor Stuck-Open Fault

Hardware Faults Classification


Three types of faults:
Transient Faults - disappear after a relatively short time
Example - a memory cell whose contents are changed spuriously due to some electromagnetic interference.
Overwriting the memory cell with the right content will make the fault go away.
Permanent Faults - never go away; the component has to be repaired or replaced.
Intermittent Faults - cycle between active and benign states
Example - a loose connection

Fault Tolerance Techniques


Hardware Redundancy
Software Redundancy
Information Redundancy
Time Redundancy

Hardware Redundancy
Extra hardware is added to override the effects of a failed component
Static Hardware Redundancy - for immediate masking of a failure
Example: Use three processors and vote on the result. The wrong output of a single faulty processor is masked
Dynamic Hardware Redundancy - Spare components are activated upon the failure of a currently active component
Hybrid Hardware Redundancy - A combination of static and dynamic redundancy techniques
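
The three-processor voting example can be sketched as a majority voter. The module functions below are hypothetical stand-ins for the processors; the sketch assumes outputs can be compared for exact equality.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among module outputs")
    return value


def tmr(modules, x):
    """Run every module on the same input and vote on the results."""
    return majority_vote([m(x) for m in modules])


if __name__ == "__main__":
    good = lambda x: x * x       # hypothetical correct processor
    faulty = lambda x: -1        # hypothetical faulty processor
    print(tmr([good, faulty, good], 3))
```

One faulty module is masked because the other two agree. If two modules fail with different wrong values, no majority exists and the voter reports an error instead of masking it.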

Software Redundancy
Multiple teams of programmers write different versions of software for the same function
The hope is that such diversity will ensure that not all
the copies will fail on the same set of input data
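
A small sketch of this idea, assuming three hypothetical, independently written versions of a "median of three numbers" routine. Version C contains a deliberate bug that only shows on some inputs; the wrapper returns the majority answer and also reports which versions disagreed, so a failing version can be flagged.

```python
from collections import Counter


def median_a(a, b, c):
    # Version A: sort and take the middle element.
    return sorted([a, b, c])[1]


def median_b(a, b, c):
    # Version B: remove the smallest and largest from the sum.
    return a + b + c - min(a, b, c) - max(a, b, c)


def median_c(a, b, c):
    # Version C (deliberately buggy): misses some orderings, e.g. (2, 3, 1).
    if b <= a <= c:
        return a
    if a <= b <= c:
        return b
    return c


def n_version(versions, *args):
    """Return (majority result, indices of versions that disagreed)."""
    results = [v(*args) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("versions do not agree on a majority result")
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters


if __name__ == "__main__":
    versions = [median_a, median_b, median_c]
    print(n_version(versions, 1, 2, 3))
    print(n_version(versions, 2, 3, 1))
```

The scheme only helps when the versions fail on different inputs; if two versions share the same bug, the vote returns the wrong answer.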

Information Redundancy
Add check bits to original data bits so that an error in
the data bits can be detected and even corrected
Error detecting and correcting codes have been developed and are being used
Information redundancy often requires hardware redundancy to process the additional check bits
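
As one concrete instance of check bits, here is a sketch of the Hamming(7,4) code, chosen for illustration; the slides do not name a specific code. Three parity bits are added to four data bits, and the syndrome computed on reception gives the position (1 to 7) of a single flipped bit, or 0 when no error is seen.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, error position); position 0 means no error detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4
    if pos:
        c[pos - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], pos


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    corrupted = list(word)
    corrupted[4] ^= 1  # flip the bit at position 5
    print(word, corrupted, hamming74_decode(corrupted))
```

This code corrects any single-bit error in the 7-bit word; two simultaneous bit errors would be miscorrected, which is why stronger codes add further check bits.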

Time Redundancy
Provide additional time during which a failed
execution can be repeated
Most failures are transient - they go away after some time
If enough slack time is available, failed unit can recover and redo affected computation
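
Repeating a failed execution can be sketched as a bounded retry loop. The `TransientFault` class is a hypothetical task that fails a fixed number of times before succeeding, and `max_attempts` stands in for the available slack time.

```python
def run_with_retries(task, max_attempts=3):
    """Re-execute `task` until it succeeds or the attempt budget is used up.

    Returns (result, number of attempts used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as exc:  # a real system would catch specific errors
            last_error = exc
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error


class TransientFault:
    """Hypothetical task that fails for the first n_failures calls, then succeeds."""

    def __init__(self, n_failures):
        self.remaining = n_failures

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise IOError("transient fault")
        return 42


if __name__ == "__main__":
    print(run_with_retries(TransientFault(2)))
```

Retrying only masks transient faults. A permanent fault exhausts the attempt budget and has to be handled by another technique, such as switching to a spare.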

THANK YOU
