BY:
ANKIT BHATT
ME-VLSI & EMBEDDED
What Is Failure?
A system is said to fail when it cannot meet its promises.
A failure is brought about by the existence of errors in
the system.
The cause of an error is called a fault.
Concept of Fault Tolerance
Hardware, software and networks cannot be totally free
from failures
Fault tolerance is a non-functional (QoS) requirement
that requires a system to continue to operate, even in the
presence of faults
Fault tolerance should be achieved with minimal
involvement of users or system administrators
Distributed systems can be more fault tolerant than
centralized systems, but because they have more processor
hosts, individual faults are likely to occur more
frequently
Attributes, Consequences and Strategies
What is a dependable system? (Attributes)
Availability
Reliability
Safety
Confidentiality
Integrity
Maintainability
How to distinguish faults? (Consequences)
Fault
Error
Failure
How to handle faults? (Strategies)
Fault prevention
Fault tolerance
Fault recovery
Fault forecasting
Distributed Systems
Terminology of Fault Tolerance
A fault causes an error, which results in a failure
Fault is a defect within the system
Error is observed by a deviation from the expected
behaviour of the system
Failure occurs when the system can no longer perform as
required (does not meet spec)
Fault tolerance is the ability of a system to provide a service,
even in the presence of errors
Strategies to Handle Faults
Fault avoidance
Techniques aim to prevent
faults from entering the
system during design stage
Fault removal
Methods attempt to find
faults within a system before
it enters service
Fault detection
Techniques used during
service to detect faults within
the operational system
Fault tolerance
Techniques designed to tolerate
faults, i.e. to allow the system to
operate correctly in the presence of
faults.
Actions to identify and
remove errors:
Design reviews
Testing
Use certified tools
Analysis:
Hazard analysis
Formal methods: proof and refinement
No non-trivial system
can be guaranteed free
from error
Must have an
expectation of failure
and make appropriate
provision
Fault Models
A fault model identifies targets for testing
A fault model makes analysis possible
Effectiveness measurable by experiments
Different types
Stuck-at faults
Multiple stuck-at faults
Bridging faults
Single Stuck At Fault
Single (line) stuck-at fault
The given line has a constant value (0/1)
independent of other signal values in the circuit
Properties
o Only one line is faulty
o The faulty line is permanently set to 0 or 1
o The fault can be at an input or output of a gate
o Simple logical model is independent of technology
o It reduces the complexity of fault-detection
Example:
XOR circuit has 12 fault sites and 24 single stuck-at faults
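The single stuck-at model can be tried out in a few lines of Python. The following is a minimal sketch, not the XOR circuit from the example above: it assumes a hypothetical fanout-free circuit y = (a AND b) OR c with five lines (a, b, c, d, y), which gives 5 fault sites and 10 single stuck-at faults. The names simulate and detecting_vectors are illustrative only.

```python
from itertools import product

# Hypothetical circuit: d = a AND b, y = d OR c (no fanout, so one fault site per line)
LINES = ["a", "b", "c", "d", "y"]

def simulate(a, b, c, fault=None):
    """Evaluate y; fault is None or a (line, stuck_value) pair."""
    def drive(line, value):
        # A stuck-at fault forces the line to a constant, ignoring its driver.
        if fault is not None and fault[0] == line:
            return fault[1]
        return value
    a = drive("a", a)
    b = drive("b", b)
    c = drive("c", c)
    d = drive("d", a & b)
    return drive("y", d | c)

def detecting_vectors(fault):
    """Input vectors whose faulty output differs from the fault-free output."""
    return [v for v in product((0, 1), repeat=3)
            if simulate(*v) != simulate(*v, fault=fault)]

# Two faults (stuck-at-0 and stuck-at-1) per line
faults = [(line, value) for line in LINES for value in (0, 1)]
for line, value in faults:
    print(f"{line} stuck-at-{value}: detected by {detecting_vectors((line, value))}")
```

A vector detects a fault only if it both activates the fault (drives the line to the opposite value) and propagates the difference to the output, which is why some faults have a single detecting vector.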
Multiple Stuck-At Faults
Multiple stuck-at fault
Several single stuck-at faults occur at the same time
Multiple stuck-at faults are usually not considered in
practice because of two reasons
o The number of multiple stuck-at faults in a circuit
with k lines is 3^k - 1 (each line is fault-free,
stuck-at-0 or stuck-at-1, excluding the all-fault-free
case), which is too large a number
even for circuits of moderate size
o Tests for single stuck-at faults are known to cover a
very high percentage (greater than 99.6%) of multiple stuck-at
faults when the circuit is large and
has several outputs
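To see how quickly the multiple stuck-at fault count grows compared with the single stuck-at count, the two formulas can be evaluated directly. This is a small sketch; the function names are illustrative, and the brute-force enumeration is only there to cross-check the formula for a tiny k.

```python
from itertools import product

def single_stuck_at_count(k):
    """Each of the k lines can be stuck-at-0 or stuck-at-1."""
    return 2 * k

def multiple_stuck_at_count(k):
    """Each line is fault-free, stuck-at-0 or stuck-at-1; drop the all-fault-free case."""
    return 3 ** k - 1

def enumerate_multiple_faults(k):
    """Brute-force list of fault combinations (None means the line is fault-free)."""
    states = (None, 0, 1)
    return [combo for combo in product(states, repeat=k)
            if any(s is not None for s in combo)]

for k in (3, 12, 30):
    print(k, single_stuck_at_count(k), multiple_stuck_at_count(k))
```

For a circuit with 12 fault sites this gives 24 single faults against 531440 fault combinations.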
Bridging Fault
Two or more normally distinct points (lines) are
shorted together
Two types of bridging faults:
Input bridging
Can form wired logic or voting model.
Feedback (input-to-output) bridging
Can introduce feedback.
Can cause oscillation or latching.
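The wired-logic model for input bridging can be sketched as follows. This is an illustration under assumptions: the circuit y = (a OR b) AND c is hypothetical, inputs a and b are taken to be shorted, and the bridged lines are assumed to behave as a wired-AND or a wired-OR of the two driven values.

```python
from itertools import product

def circuit(a, b, c):
    """Hypothetical fault-free circuit: y = (a OR b) AND c."""
    return (a | b) & c

def bridged(a, b, model):
    """Value seen on both shorted lines under the chosen wired-logic model."""
    if model == "wired-and":
        v = a & b
    elif model == "wired-or":
        v = a | b
    else:
        raise ValueError(f"unknown bridging model: {model}")
    return v, v

def detecting_vectors(model):
    """Input vectors for which the bridged circuit differs from the good one."""
    tests = []
    for a, b, c in product((0, 1), repeat=3):
        fa, fb = bridged(a, b, model)
        if circuit(a, b, c) != circuit(fa, fb, c):
            tests.append((a, b, c))
    return tests

print("wired-AND:", detecting_vectors("wired-and"))
print("wired-OR: ", detecting_vectors("wired-or"))
```

In this particular circuit the wired-OR bridge is masked by the OR gate that the two lines feed, so no vector detects it, while the wired-AND bridge is detected by any vector with a != b and c = 1. Which model applies depends on the technology.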
Transistor Fault
o The MOS transistor is modeled as an ideal switch.
o Two types of faults are modeled:
Stuck-open - A single transistor is permanently stuck in
the open (non-conducting) state. This can turn a combinational
circuit into a sequential one, so detecting a single fault
requires a sequence of at least two test vectors.
Stuck-on - A single transistor is permanently
shorted (conducting) irrespective of its gate voltage.
Example of Transistor Stuck-Open Fault
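A stuck-open fault can also be illustrated in software. The following is a minimal sketch, not the slide's own example: it assumes a 2-input CMOS NAND gate with parallel pMOS pull-ups (PA, PB) and series nMOS pull-downs (NA, NB), ideal switches, and an output node that keeps its previous value when neither network conducts. The class and transistor names are hypothetical.

```python
class Nand2:
    """Switch-level model of a 2-input CMOS NAND with an optional stuck-open transistor."""

    def __init__(self, stuck_open=None):
        self.stuck_open = stuck_open  # "PA", "PB", "NA", "NB" or None
        self.out = None               # unknown until the output node is first driven

    def _on(self, name, conducting):
        # A stuck-open transistor never conducts, whatever its gate voltage.
        return conducting and self.stuck_open != name

    def apply(self, a, b):
        pull_up = self._on("PA", a == 0) or self._on("PB", b == 0)
        pull_down = self._on("NA", a == 1) and self._on("NB", b == 1)
        if pull_up:
            self.out = 1
        elif pull_down:
            self.out = 0
        # Otherwise the node floats and retains its previous value.
        return self.out

def run(gate, vectors):
    return [gate.apply(a, b) for a, b in vectors]

two_vector_test = [(1, 1), (0, 1)]   # initialise output to 0, then try to pull it up via PA
print("good  :", run(Nand2(), two_vector_test))
print("faulty:", run(Nand2("PA"), two_vector_test))
```

The first vector drives the output to 0; the second should pull it to 1 through PA. With PA stuck open the output floats and stays at 0, so the fault is observed. Applying (0, 1) alone, or after a vector that leaves the output at 1, would not expose it.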
Hardware Faults Classification
Three types of faults:
Transient Faults - disappear after a relatively short time
Example - a memory cell whose contents are changed spuriously
due to electromagnetic interference.
Overwriting the memory cell with the right content will make
the fault go away.
Permanent Faults - never go away, component has to be
repaired or replaced.
Intermittent Faults - cycle between active and benign states
Example - a loose connection
Fault Tolerance Techniques
Hardware Redundancy
Software Redundancy
Information Redundancy
Time Redundancy
Hardware Redundancy
Extra hardware is added to override the effects of a failed
component
Static Hardware Redundancy - for immediate masking of a
failure
Example: Use three processors and vote on the
result. The wrong output of a single faulty processor is masked
Dynamic Hardware Redundancy - Spare components are
activated upon the failure of a currently active component
Hybrid Hardware Redundancy - A combination of static and
dynamic redundancy techniques
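The three-processor voting example (static redundancy) can be sketched as below. This is an illustrative model only: the "processors" are plain Python functions, the injected fault is a hypothetical bit flip, and the voter is a bitwise 2-of-3 majority as it might be built from AND/OR gates.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three integer words."""
    return (a & b) | (b & c) | (a & c)

def tmr(modules, x):
    """Run three redundant modules on the same input and vote on the results."""
    r0, r1, r2 = (m(x) for m in modules)
    return majority(r0, r1, r2)

def good(x):
    # Fault-free module: 8-bit increment
    return (x + 1) & 0xFF

def faulty(x):
    # Hypothetical faulty module: same function with output bit 2 flipped
    return ((x + 1) & 0xFF) ^ 0x04

print(tmr([good, faulty, good], 41))    # single faulty module is outvoted
print(tmr([faulty, faulty, good], 41))  # two identical faults win the vote
```

The voter masks one faulty module immediately, without any detection or reconfiguration step. It cannot mask two modules that fail in the same way, and the voter itself remains a single point of failure in this sketch.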
Software Redundancy
Multiple teams of programmers
Write different versions of software for the same
function
The hope is that such diversity will ensure that not all
the copies will fail on the same set of input data
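The idea of running independently written versions and voting on their results can be sketched as follows. This is an assumption-laden toy: the three integer square root versions are hypothetical, and the third contains a deliberate bug so that the versions disagree on some inputs.

```python
from collections import Counter
import math

def isqrt_v1(n):
    # Version 1: standard library
    return math.isqrt(n)

def isqrt_v2(n):
    # Version 2: binary search for the largest r with r*r <= n
    lo, hi = 0, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo

def isqrt_v3(n):
    # Version 3: deliberately buggy (uses < instead of <=), wrong on perfect squares
    r = 0
    while (r + 1) * (r + 1) < n:
        r += 1
    return r

def n_version(versions, n):
    """Run every version and return the strict-majority result."""
    results = [v(n) for v in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError(f"no majority among {results}")
    return value

print(n_version([isqrt_v1, isqrt_v2, isqrt_v3], 16))
```

The scheme only helps when the versions fail on different inputs; if two versions share the same design mistake, the vote returns the wrong answer.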
Information Redundancy
Add check bits to original data bits so that an error in
the data bits can be detected and even corrected
Error-detecting and error-correcting codes (e.g. parity,
Hamming, CRC) are widely used for this purpose
Information redundancy often requires hardware
redundancy to process the additional check bits
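As one concrete illustration of check bits, the sketch below uses a Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single-bit error. The slides do not name a specific code, so the choice of Hamming(7,4) and the function names are assumptions made for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 code bits laid out in positions 1..7: p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # checks positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # checks positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # checks positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(code):
    """Return (data_bits, syndrome); syndrome 0 means no error was seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of a single flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1          # correct the flipped bit
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1                      # inject a single-bit error at position 5
print(hamming74_decode(received))
```

The decoder recomputes the three parity checks; together they spell out the position of the flipped bit. This code assumes at most one bit error per word; two flipped bits would be miscorrected.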
Time Redundancy
Provide additional time during which a failed
execution can be repeated
Most faults are transient - they go away after some
time
If enough slack time is available, the failed unit can
recover and redo the affected computation
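Re-execution after a transient fault can be sketched as a bounded retry loop. This is a minimal illustration: TransientFault is a hypothetical unit that fails a fixed number of times before succeeding, and max_attempts stands in for the available slack time.

```python
def run_with_retries(task, max_attempts=3):
    """Repeat a failed execution; return (result, attempts_used)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError as exc:
            last_error = exc          # assume the fault may be transient and try again
    raise RuntimeError(f"failed after {max_attempts} attempts") from last_error

class TransientFault:
    """Hypothetical unit that fails on its first `bad_runs` executions."""

    def __init__(self, bad_runs):
        self.remaining = bad_runs

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise RuntimeError("transient fault")
        return 6 * 7

print(run_with_retries(TransientFault(bad_runs=2)))
```

Retrying only helps when the fault disappears within the allowed attempts; a permanent fault exhausts the slack and still has to be handled by repair or by one of the other redundancy techniques.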
THANK YOU