Algorithm and Architectural Level Methodologies for Low Power

By: BHOGESHRAO (2SD24LDE02)
Introduction
• With ever increasing integration levels, power has
become a critical design parameter.
• Consequently, a lot of effort has gone into achieving
lower dissipation at all levels of the design process.
• It has been demonstrated by several researchers
that algorithm and architecture level design
decisions can have a dramatic impact on power
consumption.
• In this chapter we explore some of the known
synthesis, optimization and estimation techniques
applicable at the algorithm and architectural levels.
Introduction
• The techniques mentioned in this chapter are
targeted for DSP applications but can readily
be adapted for more general applications.
• Two examples, a vector quantizer encoder and an FIR filter, are used throughout the chapter to illustrate how the methodologies may be applied.
• While the former is evaluated and optimized
for ASIC design, the latter is targeted for a
programmable processor.
Design Flow
• A design environment oriented towards power
minimization must embody optimization and estimation
tools at all levels of the design flow. A top-down approach,
with examples of associated tools, is illustrated in Figure
11.1.
• The most effective design decisions derive from choosing
and optimizing algorithms at the highest levels.
• However, implementation details cannot be accurately
modeled or estimated at this level of abstraction so
relative metrics must be judiciously used in making design
selections.
• More information is available at the architectural level,
hence estimates are more accurate and the effectiveness
of optimizations can be more accurately quantified.
Design Example 1: Vector Quantization, Introduction
• Throughout this chapter we will be using the
design of a video vector quantizer to illustrate
design flow at the different levels.
• Vector Quantization (VQ) is a data compression
method used in voice recognition and video systems.
• This example implements 16-to-1 video compression.
• In this approach to vector quantization, an image is
broken up into a sequence of 4x4 pixel images (Figure
11.2)
• Each pixel is represented by an 8-bit word indicating
luminance.
Design Example 1: Vector Quantization, Introduction

• The 4x4 image, therefore, can be thought of as a vector of 16 words, each 8 bits in length.
• Each of these vectors is compared with a previously generated codebook of, in this case, 256 different vectors.
• This codebook is generated a priori with the
intention of covering enough of the vector space to
give a good representation of all probable vectors.
• After compression, an 8-bit word is generated indicating the address of the codevector that best approximates the original 4x4 image vector.
• This corresponds to a compression ratio of 16:1, since sixteen 8-bit words are now represented by a single 8-bit word.
• Our design is directed toward a 240x128 pixel grey-scale display, i.e., 1920 4x4 blocks per frame. Processing the standard 30 frames/sec moving picture therefore requires that one 4x4 pixel vector be compressed every 17.3 µs (1 / (1920 × 30) s).
• Distortion calculation: each input block is compared with all 256 code words stored in the codebook, and the closest match is determined by computing the Mean Square Error (MSE) between the input and each code word.
Design Example 2: FIR Filter, Introduction
• The second example that will be used throughout this chapter is a 14-tap, low-pass Finite Impulse Response (FIR) filter. The algorithm will be optimized for targeted architectures, and various implementations of the filter (using dedicated and programmable hardware) will be analyzed in terms of their power consumption characteristics.
Algorithm level: Analysis and Optimization
• Estimation (Analysis)
• The sources of power consumption on a CMOS chip
can be classified as dynamic power, short circuit
currents and leakage.
• At the algorithm level, it makes sense to only consider
the dynamic power.
• The contributions of short-circuit currents and leakage are mostly determined at the circuit level and are only marginally affected by algorithm-level decisions.
• The power dissipated can be described by the following well-known equation for dynamic power:

P = Ceff · V² · f

where f is the frequency of operation, V is the supply voltage, and Ceff is the effective capacitance switched.
• Ceff combines two factors: C, the physical capacitance being charged/discharged, and α, the corresponding switching probability:

Ceff = α · C
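To make the relationship concrete, here is a minimal Python sketch of this equation; all numeric values are illustrative assumptions, not measured data.

```python
# Minimal sketch of the dynamic power equation P = Ceff * V^2 * f,
# with Ceff = alpha * C. All numbers below are illustrative.

def dynamic_power(alpha, c_physical, v_supply, freq):
    """Return dynamic power in watts.

    alpha      -- switching probability (0..1)
    c_physical -- physical capacitance switched per cycle, in farads
    v_supply   -- supply voltage, in volts
    freq       -- operating frequency, in hertz
    """
    c_eff = alpha * c_physical          # effective switched capacitance
    return c_eff * v_supply ** 2 * freq

# Example: 100 pF switched with 30% activity at 5 V, 20 MHz.
print(dynamic_power(alpha=0.3, c_physical=100e-12, v_supply=5.0, freq=20e6))
# -> 0.015 W; halving the voltage to 2.5 V cuts this by 4x,
# which is why voltage reduction is the dominant low-power lever.
```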
• For the purpose of estimation, we can divide
power dissipation into two components:
1) algorithm-inherent dissipation, and
2) implementation overhead.
• The algorithm-inherent dissipation comprises
the power of the execution units and memory.
This is "inherent" in the sense that it is
necessary to achieve the basic functionality of
the algorithm, and cannot be avoided
irrespective of the implementation.
• On the other hand, the implementation
overhead includes control, interconnect and
registers. The power consumed by this
component depends largely on the choice of
architecture/implementation.
1. Estimating the algorithm-inherent dissipation -The
algorithm-inherent dissipation refers to the power
consumed by the execution units and memory.
• This component is fundamental to a given algorithm
and is the prime factor for comparisons between
different algorithms as well as for quantifying the
effect of algorithm level design decisions.
• Its dissipation can be estimated by a weighted sum of the number of operations in the algorithm.
• The weights used for the different operations must reflect the respective capacitances switched.
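As a minimal sketch of such a weighted sum, the Python fragment below uses hypothetical per-operation capacitance weights; real weights would come from characterizing the target hardware library.

```python
# Sketch: algorithm-inherent switched capacitance as a weighted sum
# of operation counts. The weights (pF switched per operation) are
# assumed placeholders, not characterized library values.

CAP_WEIGHTS_PF = {
    "memory_access": 50.0,   # memory accesses tend to dominate
    "multiply":      40.0,
    "add":            5.0,
}

def inherent_capacitance_pf(op_counts):
    """Weighted sum of operation counts, in pF switched per task."""
    return sum(CAP_WEIGHTS_PF[op] * n for op, n in op_counts.items())

# Illustrative example: one output sample of a 14-tap FIR filter,
# counting 14 coefficient accesses, 14 multiplies and 13 additions.
fir_ops = {"memory_access": 14, "multiply": 14, "add": 13}
print(inherent_capacitance_pf(fir_ops), "pF per output sample")
```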
2 .Estimating the implementation overhead –
• The implementation overhead consists of the control,
interconnect and implementation related
memory/register power.
• The power consumed by these components depends
on the specific architecture platform chosen and on
the mapping of the algorithm onto the hardware
• Since this overhead is not essential to the basic
functionality of a given algorithm, several estimation
tools ignore its effect for algorithm level comparisons
• However, the power consumed by these components is often comparable to, if not greater than, the algorithm-inherent dissipation.
• It is clear, therefore, that it is important to get reasonable estimates of the implementation overhead for realistic comparisons between algorithms and to guide high-level decisions.
• This is a formidable task without a complete architecture description.
• Fortunately, it is possible to produce first-order predictions of the
overhead component given some properties of both the algorithm and
the targeted hardware platform or architecture.
• Two structural properties of an algorithm are relevant here: regularity and locality.
• Regularity refers to consistent and repeatable computation or data-access patterns (e.g., matrix multiplication). Regular algorithms are easy to map onto hardware and are efficient for parallelization and hardware sharing.
• Spatial locality: accessing data elements stored near each other (e.g., sequential traversal of an array).
• Temporal locality: reusing data within short intervals of time.
• Both forms of locality improve cache usage, reduce memory power consumption and minimize bus interconnect.
• A spatially-local algorithm lends itself more easily to efficient partitioning on hardware, allowing highly capacitive global buses to be used sparingly.
• A temporally-local algorithm tends to require less temporary storage and smaller register files, leading to lower capacitances.
• In terms of memory/register access, spatial locality
refers to distance between the addresses of items
referenced close together in time and temporal locality
refers to the probability of future accesses to items
referenced in the recent past. A spatially-local memory
access pattern allows partitioning of memory into
smaller blocks that require less power per access.
• Given a targeted hardware platform and a number of
algorithm properties, techniques can be developed for
early prediction of the implementation overhead.
• Consider the interconnect power in a custom ASIC implementation. It has been established that, in general, the average length and, hence, the physical capacitance of the buses are proportional to the predicted die area.
• In turn, the active area is a function of algorithmic parameters such as the number of operations to be performed and their concurrency pattern.
• The switching activity can be derived from the number
of bus accesses, which is proportional to the number
of edges in the computational graph of the algorithm.
Design Example 1: Vector Quantization, Algorithmic
Estimation
• We have established that at this high level of design
description, there is no means to accurately estimate
absolute power. However by using such metrics as
operation count and first-order estimates of the critical
path, design decisions can be made.
• To illustrate this level of estimation we use the most straightforward method of coding the vector, a full search through the entire codebook (FSVQ), combined with the standard Mean Square Error (MSE) distortion measure:

MSE = Σᵢ (Xᵢ - Cᵢ)²,  i = 1, ..., 16

Design Example 1: Vector Quantization, Algorithmic Estimation
• where C is the codebook code vector, X is the original 4x4 vector representation, and i is the index of the individual pixel word.
• The computational complexity per vector can be quantified by enumerating the operations (e.g. memory accesses, multiplications, additions) required to search the codebook.
• This gives a reasonable first order approximation of relative
power consumption. Computing the MSE between two vectors
requires 16 memory accesses, 16 subtractions, 16 multiplies and
16 additions.
• In FSVQ, this is done for each of the 256 vectors in the codebook, and each result is compared with the leading (lowest) MSE candidate found so far.
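A minimal Python sketch of the full-search encoder described above, assuming a 256-entry codebook held as a list of length-16 vectors (the names and data layout are illustrative):

```python
# Sketch of Full Search Vector Quantization (FSVQ) with MSE distortion.
# codebook: 256 codevectors, each a length-16 sequence of 8-bit values.
# block:    the input 4x4 block flattened to a length-16 sequence.

def mse(x, c):
    """Mean square error between input vector x and codevector c:
    16 memory accesses, 16 subtractions, 16 multiplies, 16 additions."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def fsvq_encode(block, codebook):
    """Return the 8-bit index of the codevector closest to the block."""
    best_index, best_mse = 0, float("inf")
    for index, codevector in enumerate(codebook):   # 256 comparisons
        d = mse(block, codevector)
        if d < best_mse:                            # leading candidate
            best_index, best_mse = index, d
    return best_index                               # the compressed word
```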
• Algorithm-inherent dissipation - Operation count can now be used to estimate the switching capacitance inherent to the algorithm if the targeted hardware library is known.
• Using black-box capacitance models of the hardware and making assumptions on the bit-widths of each operator, a first-order estimate of capacitance can be made.

• Knowing that memory accesses and multiplications are power hungry, the first-order analysis produces the insight that these are the functions most in need of optimization. This picture can be refined by introducing some architectural constraints.
• Assume for instance that a single one-ported memory
is used to store the codebook.
• This sets the maximum concurrency of the memory accesses to one, i.e. it imposes a sequential execution of the algorithm.
• A full search requires 256 × 16 = 4096 codebook accesses per block, so to meet the real-time constraint of 17.3 µs per block the memory access time must be smaller than 17.3 µs / 4096 ≈ 4.2 ns.
• For a power efficient implementation, it is obvious
that either a more complex memory architecture
(dual port memory/ multi port memory), or a revised
algorithm is necessary.
Design Example 2: FIR Filter, Algorithmic Exploration
• Consider a direct-form structure (Figure 11.4) of the FIR filter and assume a throughput constraint of 3.125 MHz. As the voltage is reduced, it is necessary to choose faster hardware to meet the required time constraint.
• Availability of power and area estimates of the different modules in the cell library allows us to evaluate the power and area trade-offs involved in using different library cells.
• Though the ripple adder dissipates less power than the carry select adder (CSA), it fails to meet the required throughput below 5 V, whereas the CSA continues to meet the throughput requirement down to 3 V.
• Table 11.3 summarizes the energy and area estimates
obtained using the techniques described in Section
11.3.1. Using the carry select adder reduces the power
consumption to a third of its original value with
minimal area penalty.
• Design space exploration (estimation tool) provides
an interactive environment giving quick feedback to
the designer about the effect of design choices on
specified performance metrics and allows the user
to make intelligent decisions.
• It provides guidance for the selection of algorithms, serves as a cost function for transformations, and aids hardware selection, resulting in large power savings.
Power Minimization Techniques at the Algorithm Level
• After examining methods for estimating power
consumption at the algorithm level, the next
logical step is to examine power minimization
techniques at this level.
• We will start by mentioning some of the
general approaches for power minimization
and then look at specific techniques that can
be used for minimization of both the
algorithm-inherent dissipation and the
implementation overhead.
• The recurring theme in low power design at all levels
of abstraction is voltage reduction. At the algorithm
level, functional pipelining, retiming, algebraic
transformations and loop transformations can be
used to increase speed and allow lower voltages.
• Be aware that these approaches often translate into
larger silicon area implementations, hence the
approach has been termed trading area for power.
• Estimation and exploration tools help us decide how
much we can drop the voltage while still meeting the
required performance constraints, as well as the
associated area penalty.
• Another technique for low power design is
avoiding wasteful activity. At the algorithm
level, the size and complexity of a given
algorithm (e.g. operation counts, word lengths)
determine the activity.
• If there are several algorithms for a given task, the one with the fewest operations is generally preferable.
• Reducing the algorithm-inherent dissipation –
• Important transformations in this category
include operation reduction and strength
reduction.
• Operation reduction includes common sub-expression elimination, algebraic transformations (e.g. reverse distributivity) and dead-code elimination.
• Strength reduction refers to replacing energy consuming
operations by a combination of simpler operations.
• The most common in this category is expansion of
multiplications by constants into shift and add
operations. Though this transformation typically results
in lower power, it may sometimes have the opposite
effect if it results in an increase in critical path.
• Another drawback is that it introduces extra overhead in
the form of registers and control.
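As an illustration of strength reduction, a multiplication by a constant can be expanded into shifts and adds; the sketch below uses the constant 10, chosen arbitrarily.

```python
# Sketch of strength reduction: replace a constant multiplication
# with shift-and-add operations. 10 = 8 + 2 = (1 << 3) + (1 << 1),
# so x * 10 becomes (x << 3) + (x << 1).

def times_10_strength_reduced(x):
    """Compute x * 10 using only two shifts and one addition."""
    return (x << 3) + (x << 1)

assert times_10_strength_reduced(7) == 70
# The shift/add chain may lengthen the critical path, which is why
# this transformation can occasionally increase power instead.
```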
• Another important component of the algorithm-inherent dissipation is the memory power. Transformations that reduce it include conversion of background memory to foreground register files, and reduction of memory size using loop-reordering and loop-merging transformations.
• Minimization of the implementation overhead is a more challenging problem.
• It was explained earlier how certain algorithms have potentially less overhead than others because they possess structural properties such as locality (which reduces data movement) and regularity (a predictable, repetitive structure that reduces complexity and promotes resource sharing).
• For selection of algorithms, therefore, we must be able to
detect these properties.
• Optimizations on the algorithm level should
enhance and preserve them.
• Spatial locality can be detected and used to
guide partitioning.
• Regular algorithms typically require less
control and interconnect overhead.
• One other way to reduce the implementation overhead is to reduce the chip area, as this typically translates into reduced bus capacitances.
Design Example 1: Vector Quantization, Algorithmic
Optimization
• Continuing with this design example, the properties of operation count and critical path will be used to aid in the choice and optimization of algorithms. Using the algorithm-inherent dissipation estimates of Table 11.1, memory access has been identified as the main hurdle to achieving a low-power design. To achieve significant power savings, other algorithms are investigated.
• Tree Search Vector Quantization (TSVQ) - TSVQ encoding [5] requires far less computation.
• TSVQ performs a binary search of the vector space instead of a full search.
• As a result, the computational complexity is proportional to log2(N) rather than N, where N is the number of vectors in the codebook.
• Figure 11.5 diagrams the structure of the tree search. At each level of the tree, the input vector is compared with two codebook entries.
• If at level 1, for example, the input vector is closer to the left entry, then the right branch of the tree is not analyzed further and an index bit 0 is transmitted. This process is repeated until a leaf of the tree is reached. Hence only 2 · log2(256) = 16 distortion calculations have to be made, compared to 256 distortion calculations in the FSVQ.
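A minimal sketch of this binary tree search, assuming the codebook tree is stored as a complete binary tree in a list (the storage layout and names are illustrative, not from the original text):

```python
# Sketch of Tree Search VQ (TSVQ): a binary search over a depth-8 tree
# instead of a full search. tree[n] holds the codevector at node n;
# the children of node n are nodes 2n+1 and 2n+2.

def mse(x, c):
    """Mean square error between input vector x and codevector c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def tsvq_encode(block, tree, depth=8):
    """Return the 8-bit index built from the left/right decisions,
    one bit per tree level: log2(256) = 8 levels, 16 MSE evaluations."""
    node, index = 0, 0
    for _ in range(depth):
        left, right = 2 * node + 1, 2 * node + 2
        if mse(block, tree[left]) <= mse(block, tree[right]):
            node, bit = left, 0        # closer to the left entry
        else:
            node, bit = right, 1       # closer to the right entry
        index = (index << 1) | bit     # transmit one index bit per level
    return index
```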
• Mathematical Optimizations - In TSVQ, there is a large computational reduction available by mathematically rearranging the computation of the difference between the input vector X and the two code vectors Ca and Cb, originally given by

d = Σᵢ (Xᵢ - Caᵢ)² - Σᵢ (Xᵢ - Cbᵢ)²

• Since a given node in the comparison tree always compares the same two code vectors, the calculation of the errors can be combined under one summation.
• With the quadratics expanded, this yields

d = Σᵢ (Caᵢ² - Cbᵢ²) + Σᵢ 2(Cbᵢ - Caᵢ)·Xᵢ

• The first summation can be precomputed once the codebook is known and stored in a single memory location. The quantities 2(Cbᵢ - Caᵢ) may also be calculated and pre-stored.
• Therefore, at each level of the tree the number of multiplications/additions/subtractions is reduced by almost 50%.
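A minimal sketch of the resulting per-node test using the precomputed constants (the helper names are illustrative); the sign of d tells which child is closer:

```python
# Sketch of the optimized TSVQ node test. For each tree node, once the
# codebook is known, we precompute:
#   const_term = sum(ca_i**2 - cb_i**2)          (one stored scalar)
#   diff2      = [2 * (cb_i - ca_i) for each i]  (16 stored words)

def precompute_node(ca, cb):
    const_term = sum(a * a - b * b for a, b in zip(ca, cb))
    diff2 = [2 * (b - a) for a, b in zip(ca, cb)]
    return const_term, diff2

def closer_to_left(block, const_term, diff2):
    """True if the input block is closer to codevector Ca than to Cb.

    d = MSE(X, Ca) - MSE(X, Cb) = const_term + sum(diff2_i * x_i);
    only 16 multiplies and 16 adds remain, roughly half the original
    per-node work of two full MSE computations.
    """
    d = const_term + sum(w * x for w, x in zip(diff2, block))
    return d <= 0
```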
• The impact of the algorithm selection and the mathematical transformations is summarized in Table 11.4 for a 256-vector codebook.
• Therefore, the Optimized Search is chosen as the preferred algorithm.
Design Example 2: FIR Filter, Algorithmic Optimization
• The algorithmic transformations described in this section represent one of the most powerful and widely applicable classes of optimization techniques.
• We revisit our FIR example to demonstrate their advantages.
• As mentioned before, the throughput required is 3.125 MHz.
• The direct form has 13 additions and 1 multiplication in the
critical path and cannot meet the throughput constraint
below 3 V for the given hardware library.
• We use retiming to reduce the critical path. Figure 11.6
shows the structure of the retimed version.
• The critical path (after retiming) is now reduced to only 1
multiplication and 1 addition operation.
• This allows for a reduction in supply voltage below 3 V while maintaining the same throughput.
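To make the critical-path difference concrete, the sketch below contrasts a direct-form FIR with a transposed form, which is one common way a retimed structure like Figure 11.6 is realized; the coefficients are placeholders, and the mapping to the figure's exact structure is an assumption.

```python
# Sketch contrasting the critical path of a direct-form FIR filter with
# a transposed (retimed) form. The 14 coefficients are placeholders; a
# real low-pass design would use computed tap values.

COEFFS = [0.1] * 14                       # hypothetical 14-tap filter

def fir_direct(x, state):
    """Direct form; state holds the last 14 inputs (init: [0.0] * 14).
    One output needs a chain of 13 additions after the multiplies,
    so the critical path is 1 multiply + 13 additions."""
    state.insert(0, x)
    state.pop()
    return sum(c * s for c, s in zip(COEFFS, state))

def fir_transposed(x, delays):
    """Transposed form; delays holds 13 partial sums (init: [0.0] * 13).
    Each delay register absorbs one partial sum per sample, so the
    critical path is only 1 multiply + 1 addition, permitting a lower
    supply voltage at the same throughput."""
    y = COEFFS[0] * x + delays[0]
    for i in range(1, len(COEFFS) - 1):
        delays[i - 1] = COEFFS[i] * x + delays[i]
    delays[-1] = COEFFS[-1] * x
    return y
```

Both functions compute the same output sequence; only the order in which the partial sums are accumulated (and hence the critical path) differs.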
• The area-energy trade-offs for both versions as the supply voltage varies, as generated by the algorithmic estimation tools, are shown in Figure 11.7.
• The retimed version allows the voltage to be reduced to 1.5 V, thus reducing the power consumption drastically.
• However, the area penalty may be prohibitive.
• The designer can choose the voltage that best suits the design, simultaneously taking into account area, throughput and energy.
THANK YOU
