
Embedded Systems – 16CS402
Department of Computer Science and Engineering, Dayananda Sagar University, Bengaluru
COMPONENTS FOR EMBEDDED
PROGRAMS
▪ Consider code for three structures or components that are commonly used in
embedded software: the state machine, the circular buffer, and the queue. State
machines are well suited to reactive systems such as user interfaces; circular buffers
and queues are useful in digital signal processing.
▪ State Machines: The reaction of most systems can be characterized in terms of the
input received and the current state of the system. This leads naturally to a finite-state
machine style of describing the reactive system’s behavior.
▪ The state machine style of programming is also an efficient implementation of such
computations.
▪ Finite-state machines are usually first encountered in the context of hardware design.
Programming
▪ Example 5.1 shows how to write a finite-state machine in a high-level programming
language.
State machine example

[State diagram: seat belt controller — inputs seat, belt, timer; output buzzer]
idle:   no seat/- stays in idle; seat/timer on → seated
seated: no seat/- → idle; belt/- → belted; no belt and timer/buzzer on → buzzer
belted: no seat/- → idle; no belt/timer on → seated
buzzer: no seat/buzzer off → idle; belt/buzzer off → belted
COMPONENTS FOR EMBEDDED
PROGRAMS
▪ The controller’s job is to turn on a buzzer if a person sits in a seat and does not fasten
the seat belt within a fixed amount of time.

▪ This system has three inputs and one output. The inputs are a sensor for the seat to
know when a person has sat down, a seat belt sensor that tells when the belt is
fastened, and a timer that goes off when the required time interval has elapsed.

▪ The output is the buzzer. Appearing below is a state diagram that describes the seat
belt controller’s behavior.

▪ The idle state is in force when there is no person in the seat. When the person sits
down, the machine goes into the seated state and turns on the timer.

▪ If the timer goes off before the seat belt is fastened, the machine goes into the buzzer
state. If the seat belt goes on first, it enters the belted state. When the person leaves
the seat, the machine goes back to idle
C implementation
#define IDLE 0
#define SEATED 1
#define BELTED 2
#define BUZZER 3

switch (state) {
case IDLE: if (seat) { state = SEATED; timer_on = TRUE; }
    break;
case SEATED: if (belt) state = BELTED;
    else if (timer) state = BUZZER;
    break;
case BELTED: if (!seat) state = IDLE;
    else if (!belt) state = SEATED;
    break;
case BUZZER: if (belt) state = BELTED;
    else if (!seat) state = IDLE;
    break;
}
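A minimal, runnable sketch of the same controller, wrapping the switch in a step function so it can be unit-tested on a host. The enum values follow the slide; the function name and the extra BELTED/BUZZER transitions are illustrative, derived from the state diagram's prose description.

```c
#include <assert.h>
#include <stdbool.h>

enum state { IDLE, SEATED, BELTED, BUZZER };

/* One reaction of the seat-belt controller: given the current state and
   the three inputs, return the next state and report the buzzer output. */
enum state seatbelt_step(enum state s, bool seat, bool belt, bool timer,
                         bool *buzzer_on) {
    switch (s) {
    case IDLE:
        if (seat) s = SEATED;            /* person sat down: timer starts */
        break;
    case SEATED:
        if (!seat) s = IDLE;             /* person left the seat */
        else if (belt) s = BELTED;       /* belt fastened in time */
        else if (timer) s = BUZZER;      /* timer expired first */
        break;
    case BELTED:
        if (!seat) s = IDLE;
        else if (!belt) s = SEATED;      /* belt released: time again */
        break;
    case BUZZER:
        if (!seat) s = IDLE;
        else if (belt) s = BELTED;       /* belt finally fastened */
        break;
    }
    *buzzer_on = (s == BUZZER);
    return s;
}
```

Keeping the reaction in a pure function of (state, inputs) makes the machine easy to test exhaustively off-target before it ever runs against real sensors.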
Stream-Oriented Programming and
Circular Buffers
▪ The data stream style makes sense for data that comes in regularly and must
be processed on the fly.

▪ Without streaming, we would process the samples over a given interval by reading
them all in from a file and then computing the results all at once in a batch process.

▪ The circular buffer is a data structure that lets us handle streaming data in an
efficient way. Figure 5.1 illustrates how a circular buffer stores a subset of the
data stream.

▪ At each point in time, the algorithm needs a subset of the data stream that
forms a window into the stream.

▪ The window slides with time as we throw out old values no longer needed and
add new values. Since the size of the window does not change, we can use a
fixed-size buffer to hold the current window. Appearing below are the
declarations for the circular buffer and filter coefficients, assuming that N, the
number of taps in the filter, has been previously defined.
Signal processing and circular buffer

▪ Commonly used in signal processing:


▪ new data constantly arrives;
▪ each datum has a limited lifetime.

▪ Use a circular buffer to hold the data stream.

[Figure: data stream x1, x2, x3, ... arrives over times t1, t2, t3, ...; the circular
buffer holds a sliding window (d1–d4 at time t, d2–d5 at time t+1), with each new
sample overwriting the oldest one (x5 replaces x1, x6 replaces x2, x7 replaces x3).]
Circular buffers

▪ Indexes locate the currently used data and the current input data:

[Figure: at time t1 the input index points to d1 and the use index to d4;
at time t1+1 the input index points to d2 and the use index to d5.]


Circular buffer implementation: FIR
filter

int circ_buffer[N], circ_buffer_head = 0;
int c[N]; /* coefficients */

int f, ibuf, ic;
for (f = 0, ibuf = circ_buffer_head, ic = 0;
     ic < N;
     ibuf = (ibuf == N-1 ? 0 : ibuf + 1), ic++)
    f = f + c[ic] * circ_buffer[ibuf];
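A self-contained sketch of the same FIR computation, adding an insert routine that overwrites the oldest sample. The value of N, the coefficients, and the helper names are illustrative, not from the original.

```c
#include <assert.h>

#define N 4                       /* number of filter taps */

int circ_buffer[N];               /* holds the last N samples */
int circ_buffer_head = 0;         /* index of the oldest sample */
int c[N] = {1, 2, 3, 4};          /* example coefficients */

/* Overwrite the oldest sample with a new one and advance the head. */
void circ_insert(int sample) {
    circ_buffer[circ_buffer_head] = sample;
    circ_buffer_head = (circ_buffer_head == N - 1) ? 0 : circ_buffer_head + 1;
}

/* One FIR output: walk the buffer from oldest to newest, wrapping at N. */
int fir(void) {
    int f, ibuf, ic;
    for (f = 0, ibuf = circ_buffer_head, ic = 0;
         ic < N;
         ibuf = (ibuf == N - 1 ? 0 : ibuf + 1), ic++)
        f = f + c[ic] * circ_buffer[ibuf];
    return f;
}
```

Because insertion and the filter walk both wrap modulo N, no data is ever copied when the window slides — only the head index moves.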
Queues

▪ Queues are used whenever data may arrive and depart at somewhat
unpredictable times or when variable amounts of data may arrive.

▪ A queue is often referred to as an elastic buffer. One way to build a queue is


with a linked list. This approach allows the queue to grow to an arbitrary size.

▪ But in many applications we are unwilling to pay the price of dynamically


allocating memory. Another way to design the queue is to use an array to
hold all the data.
Buffer-based queues

#define Q_SIZE 32
#define Q_MAX (Q_SIZE-1)

int q[Q_SIZE], head, tail;

void initialize_queue() { head = tail = 0; }

void enqueue(int val) {
    if (((tail+1) % Q_SIZE) == head) error();
    q[tail] = val;
    if (tail == Q_MAX) tail = 0; else tail++;
}

int dequeue() {
    int returnval;
    if (head == tail) error();
    returnval = q[head];
    if (head == Q_MAX) head = 0; else head++;
    return returnval;
}
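A compact, testable version of the array-based queue, with error handling reduced to a flag for the sketch (Q_SIZE and the routine names follow the slides; note the array needs Q_SIZE elements, and one slot is kept empty so that a full queue can be distinguished from an empty one).

```c
#include <assert.h>

#define Q_SIZE 8                  /* small size for the demo */
#define Q_MAX (Q_SIZE - 1)

int q[Q_SIZE], head = 0, tail = 0;
int q_error = 0;                  /* set on overflow or underflow */

void enqueue(int val) {
    if (((tail + 1) % Q_SIZE) == head) { q_error = 1; return; }
    q[tail] = val;
    if (tail == Q_MAX) tail = 0; else tail++;
}

int dequeue(void) {
    int returnval;
    if (head == tail) { q_error = 1; return -1; }
    returnval = q[head];
    if (head == Q_MAX) head = 0; else head++;
    return returnval;
}
```

The sacrificed slot means the queue holds at most Q_SIZE-1 items; the alternative is a separate count variable, which costs memory and an extra update on every operation.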
Models of programs

▪ Source code is not a good representation for programs:


▪ clumsy;
▪ leaves much information implicit.

▪ Compilers derive intermediate representations to manipulate and optimize


the program.

▪ The fundamental model for programs is the control/data flow graph (CDFG).

▪ As the name implies, the CDFG has constructs that model both data
operations (arithmetic and other computations) and control operations
(conditionals).

▪ Part of the power of the CDFG comes from its combination of control and
data constructs.
Data flow graph

▪ DFG: data flow graph.


▪ A data flow graph is a model of a program with no conditionals.

▪ In a high-level programming language, a code segment with no conditionals


—more precisely, with only one entry and exit point—is known as a basic
block.
▪ Figure 5.2 shows a simple basic block. As the C code is executed, we would
enter this basic block at the beginning and execute all the statements in sequence.
▪ Does not represent control.
▪ Models a basic block: code with a single entry and a single exit point.
▪ Describes the minimal ordering requirements on operations.
Single assignment form

x = a + b; x = a + b;
y = c - d; y = c - d;
z = x * y; z = x * y;
y = b + d; y1 = b + d;

original basic block single assignment form


Data flow graph

x = a + b;
y = c - d;
z = x * y;
y1 = b + d;

(single assignment form)

[Figure: DFG — inputs a, b, c, d; a + node computes x and a - node computes y;
a * node combines x and y into z; a second + node combines b and d into y1.]
DFGs and partial orders

▪ The DFG defines only a partial order: a+b and c-d must precede x*y, while b+d
is independent of them.
▪ Can do pairs of operations in any order.

[Figure: the single-assignment DFG annotated with this partial order.]
Control-data flow graph

▪ CDFG: represents control and data.


▪ Uses data flow graphs as components.
▪ Two types of nodes:
▪ decision;
▪ data flow.
Data flow node

▪ Encapsulates a data flow graph:
▪ Write operations in basic block form for simplicity.

x = a + b;
y = c + d;

Control

[Figure: two equivalent decision-node forms — a two-way node testing cond with
T and F exits (v1, v2), and a multiway node testing value with one exit per
value (v3, v4).]
CDFG example

if (cond1) bb1();
else bb2();
bb3();

switch (test1) {
    case c1: bb4(); break;
    case c2: bb5(); break;
    case c3: bb6(); break;
}

[Figure: decision node cond1 with T edge to bb1() and F edge to bb2(), both
flowing into bb3(); decision node test1 with edges labeled c1, c2, c3 leading
to bb4(), bb5(), bb6().]
for loop

for (i=0; i<N; i++)
    loop_body();

equivalent:

i = 0;
while (i < N) {
    loop_body(); i++;
}

[Figure: CDFG — i=0; decision i<N with F edge exiting the loop and T edge to
loop_body() and i=i+1, which loops back to the test.]
Assembly and linking

▪ Last steps in compilation:

[Figure: compilation flow — each HLL file is compiled to assembly; each assembly
file is assembled to object code; the linker combines the object files into an
executable.]
Assembly and linking
▪ The assembler’s job is to translate symbolic assembly language statements into
bit-level representations of instructions known as object code.

▪ The assembler takes care of instruction formats and does part of the job of
translating labels into addresses.

▪ However, since the program may be built from many files, the final steps in
determining the addresses of instructions and data are performed by the linker,
which produces an executable binary file.

▪ That file may not necessarily be located in the CPU’s memory, however, unless
the linker happens to create the executable directly in RAM.

▪ The program that brings the program into memory for execution is called a
loader.

▪ The simplest form of the assembler assumes that the starting address of the
assembly language program has been specified by the programmer. The
addresses in such a program are known as absolute addresses.
Multiple-module programs

▪ Programs may be composed from several files.


▪ Addresses become more specific during processing:
▪ relative addresses are measured relative to the start of a module;
▪ absolute addresses are measured relative to the start of the CPU address
space.
Relative address generation

▪ Some label values may not be known at assembly time.


▪ Labels within the module may be kept in relative form.
▪ Must keep track of external labels---can’t generate full binary for
instructions that use external labels.
Assemblers

▪ Major tasks:
▪ generate binary for symbolic instructions;
▪ translate labels into addresses;
▪ handle pseudo-ops (data, etc.).

▪ Generally one-to-one translation.


▪ Assembly labels:
ORG 100
label1 ADR r4,c
Two-pass assembly

▪ Pass 1:
▪ generate symbol table

▪ Pass 2:
▪ generate binary instructions
Symbol table

Assembly code:

        ADD r0,r1,r2
xx      ADD r3,r4,r5
        CMP r0,r3
yy      SUB r5,r6,r7

Symbol table:

xx  0x8
yy  0x10
Symbol table generation

▪ Use program location counter (PLC) to determine address of each


location.
▪ Scan program, keeping count of PLC.
▪ Addresses are generated at assembly time, not execution time.
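Pass 1 of a two-pass assembler can be sketched as a PLC scan over the statements. The fixed 4-byte instruction length, the struct layout, and the function name are assumptions for this sketch, not from the original.

```c
#include <assert.h>
#include <string.h>

#define INSTR_LEN 4               /* assume fixed 4-byte instructions */

struct stmt { const char *label; };           /* label, or NULL if none */
struct sym  { const char *name; unsigned addr; };

/* Pass 1: walk the program, recording each label at the current PLC. */
int build_symtab(const struct stmt *prog, int n, unsigned org,
                 struct sym *tab) {
    unsigned plc = org;           /* program location counter */
    int nsyms = 0;
    for (int i = 0; i < n; i++) {
        if (prog[i].label) {
            tab[nsyms].name = prog[i].label;
            tab[nsyms].addr = plc;
            nsyms++;
        }
        plc += INSTR_LEN;         /* each statement advances the PLC */
    }
    return nsyms;
}
```

With pass 1 complete, pass 2 can emit binary for every instruction because any label it encounters is already in the table.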
Symbol table example

▪ The PLC advances by one instruction length (4 bytes here) per statement; each
label is recorded at the current PLC.

        ADD r0,r1,r2    ; PLC = 0x4
xx      ADD r3,r4,r5    ; PLC = 0x8
        CMP r0,r3       ; PLC = 0xc
yy      SUB r5,r6,r7    ; PLC = 0x10

▪ Resulting entries: xx = 0x8, yy = 0x10.
Pseudo-operations

▪ Pseudo-ops do not generate instructions:


▪ ORG sets program location.
▪ EQU generates symbol table entry without advancing PLC.
▪ Data statements define data blocks.

Overheads for Computers as Components, 2nd ed., © 2008 Wayne Wolf
Linking

▪ Combines several object modules into a single executable module.


▪ Jobs:
▪ put modules in order;
▪ resolve labels across modules.
Externals and entry points

[Figure: two modules. Module 1 — label xxx (an entry point) on ADD r1,r2,r3;
B a (a is an external reference); label yyy on data word %1. Module 2 — label a
on ADR r4,yyy (yyy is its external reference); ADD r3,r4,r5.]
Module ordering

▪ Code modules must be placed in absolute positions in the memory


space.
▪ Load map or linker flags control the order of modules.

module1

module2

module3
Dynamic linking

▪ Some operating systems link modules dynamically at run time:


▪ shares one copy of library among all executing programs;
▪ allows programs to be updated with new versions of libraries.
Program design and analysis

▪ Compilation flow.
▪ Basic statement translation.
▪ Basic optimizations.
▪ Interpreters and just-in-time compilers.
Compilation

▪ Compilation strategy (Wirth):


▪ compilation = translation + optimization

▪ Compiler determines quality of code:


▪ use of CPU resources;
▪ memory access scheduling;
▪ code size.

Basic compilation phases

HLL

parsing, symbol table

machine-independent
optimizations

machine-dependent
optimizations

assembly
Statement translation and optimization

▪ Source code is translated into intermediate form such as CDFG.


▪ CDFG is transformed/optimized.
▪ CDFG is translated into instructions with optimization decisions.
▪ Instructions are further optimized.
Arithmetic expressions

a*b + 5*(c-d)

[Figure: DFG for the expression — a * node for a*b; a - node for c-d feeding a
* node with the constant 5; a final + node sums the two products.]
Arithmetic expressions, cont’d.

[Figure: the DFG with its nodes numbered 1 (*), 2 (-), 3 (*), 4 (+) in
evaluation order.]

ADR r4,a        ; get address of a
MOV r1,[r4]     ; load a
ADR r4,b
MOV r2,[r4]     ; load b
MUL r3,r1,r2    ; node 1: a*b
ADR r4,c
MOV r1,[r4]     ; load c
ADR r4,d
MOV r5,[r4]     ; load d
SUB r6,r1,r5    ; node 2: c-d
MUL r7,r6,#5    ; node 3: 5*(c-d)
ADD r8,r7,r3    ; node 4: final sum
Control code generation

if (a+b > 0)
    x = 5;
else
    x = 7;

[Figure: CDFG — decision a+b>0 with T edge to x=5 and F edge to x=7.]
Control code generation, cont’d.
[Figure: CDFG blocks numbered 1 (test a+b>0), 2 (x=5), 3 (x=7).]

        ADR r5,a
        LDR r1,[r5]     ; load a
        ADR r5,b
        LDR r2,[r5]     ; load b
        ADDS r3,r1,r2   ; compute a+b, set condition flags
        BLE label3      ; branch if a+b <= 0
        LDR r3,#5
        ADR r5,x
        STR r3,[r5]     ; x = 5
        B stmtend
label3  LDR r3,#7
        ADR r5,x
        STR r3,[r5]     ; x = 7
stmtend ...
Procedure linkage

▪ Need code to:


▪ call and return;
▪ pass parameters and results.

▪ Parameters and returns are passed on stack.


▪ Procedures with few parameters may use registers.
Procedure stacks

proc1(int a) {
    proc2(5);
}

[Figure: stack grows downward — proc1's frame with the frame pointer (FP) at
its base; below it proc2's frame containing the parameter 5, accessed relative
to the stack pointer (SP).]
ARM procedure linkage

▪ APCS (ARM Procedure Call Standard):


▪ r0-r3 pass parameters into procedure. Extra parameters are put on stack
frame.
▪ r0 holds return value.
▪ r4-r7 hold register variables.
▪ r11 is frame pointer, r13 is stack pointer.
▪ r10 holds limiting address on stack size to check for stack overflows.
Data structures

▪ Different types of data structures use different data layouts.


▪ Some offsets into data structure can be computed at compile time,
others must be computed at run time.
One-dimensional arrays

▪ C array name points to 0th element:

[Figure: a points to a[0]; a[1] = *(a + 1); a[2] = *(a + 2).]
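The pointer identity the figure describes can be checked directly; the helper function and array contents are illustrative.

```c
#include <assert.h>

/* a[i] is defined as *(a + i): the array name decays to a pointer to
   element 0, and indexing is pointer arithmetic on that pointer. */
int second_element(const int *a) {
    return *(a + 1);              /* exactly the same access as a[1] */
}
```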
Two-dimensional arrays

▪ Row-major layout (as in C):

[Figure: an N x M array stored row by row — a[0][0], a[0][1], ..., a[0][M-1],
a[1][0], a[1][1], ...; element a[i][j] is at linear offset i*M + j.]
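The offset computation for a two-dimensional array can be sketched as a tiny function (M here is an illustrative row length):

```c
#include <assert.h>

#define M 4                       /* row length: number of columns */

/* In row-major layout, element [i][j] of an N x M array sits at linear
   offset i*M + j from the start of the array: i full rows, then j. */
int offset(int i, int j) {
    return i * M + j;
}
```

The compiler emits exactly this multiply-and-add (or a strength-reduced form of it) for every a[i][j] access.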
Structures

▪ Fields within structures are static offsets:

struct mystruct {
    int field1;     /* at offset 0, 4 bytes */
    char field2;    /* at offset 4: *(aptr+4) */
};

struct mystruct a, *aptr = &a;

[Figure: aptr points to the base of a; field1 occupies the first 4 bytes;
field2 follows at *(aptr+4).]


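Because field offsets are fixed at compile time, the compiler can address any field as base pointer plus a constant; the standard offsetof macro exposes those constants. The layout asserted below assumes a typical ABI where int is 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Field offsets within a struct are static: known at compile time, so
   a field access compiles to a load/store at (base + constant offset). */
struct mystruct {
    int field1;                   /* offset 0 */
    char field2;                  /* offset sizeof(int) on typical ABIs */
};
```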
Expression simplification

▪ Constant folding:
▪ 8+1 = 9

▪ Algebraic:
▪ a*b + a*c = a*(b+c)

▪ Strength reduction:
▪ a*2 = a<<1
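The three simplifications preserve value for the cases shown, which can be spot-checked (the helper names are illustrative; a real compiler applies these rewrites internally):

```c
#include <assert.h>

/* Constant folding: the compiler evaluates 8 + 1 once, at compile time. */
int folded(void) { return 8 + 1; }

/* Algebraic rewriting: a*b + a*c == a*(b + c) saves one multiply. */
int factored(int a, int b, int c) { return a * (b + c); }

/* Strength reduction: multiply by 2 becomes a cheaper left shift. */
int doubled(int a) { return a << 1; }
```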
Dead code elimination

▪ Dead code:

#define DEBUG 0
if (DEBUG) dbg(p1);

▪ Can be eliminated by analysis of control flow and constant folding: DEBUG
folds to 0, so the call to dbg(p1) can never execute.
Procedure inlining

▪ Eliminates procedure linkage overhead:

int foo(a,b,c) { return a + b - c; }

z = foo(w,x,y);

becomes:

z = w + x - y;
Loop transformations

▪ Goals:
▪ reduce loop overhead;
▪ increase opportunities for pipelining;
▪ improve memory system performance.
Loop unrolling

▪ Reduces loop overhead, enables some other optimizations.

for (i=0; i<4; i++)


a[i] = b[i] * c[i];


for (i=0; i<2; i++) {
a[i*2] = b[i*2] * c[i*2];
a[i*2+1] = b[i*2+1] * c[i*2+1];
}
Loop fusion and distribution

▪ Fusion combines two loops into 1:

for (i=0; i<N; i++) a[i] = b[i] * 5;


for (j=0; j<N; j++) w[j] = c[j] * d[j];
 for (i=0; i<N; i++) {
a[i] = b[i] * 5; w[i] = c[i] * d[i];
}
▪ Distribution breaks one loop into two.
▪ Changes optimizations within loop body.
Loop tiling

▪ Breaks one loop into a nest of loops.


▪ Changes order of accesses within array.
▪ Changes cache behavior.
Loop tiling example

for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        c[i] = a[i][j]*b[i];

tiled version (tile size 2):

for (i=0; i<N; i+=2)
    for (j=0; j<N; j+=2)
        for (ii=i; ii<min(i+2,N); ii++)
            for (jj=j; jj<min(j+2,N); jj++)
                c[ii] = a[ii][jj]*b[ii];
Array padding

▪ Add array elements to change mapping into cache:

[Figure: before — rows a[0][0..2] and a[1][0..2] stored back to back; after —
one unused pad element appended to each row, changing how the rows map into
cache lines.]
Register allocation

▪ Goals:
▪ choose register to hold each variable;
▪ determine lifespan of variable in the register.

▪ Basic case: within basic block.


Register lifetime graph

w = a + b;    /* t=1 */
x = c + w;    /* t=2 */
y = c + d;    /* t=3 */

[Figure: lifetime graph over times 1-3 — a and b live at t=1; c from t=2 to
t=3; d at t=3; w from t=1 to t=2; x at t=2; y at t=3.]
Instruction scheduling

▪ Non-pipelined machines do not need instruction scheduling: any


order of instructions that satisfies data dependencies runs equally
fast.
▪ In pipelined machines, execution time of one instruction depends on
the nearby instructions: opcode, operands.
Reservation table

▪ A reservation table relates instructions/time to CPU resources:

Time/instr   A   B
instr1       X
instr2       X   X
instr3           X
instr4       X
Software pipelining

▪ Schedules instructions across loop iterations.


▪ Reduces instruction latency in iteration i by inserting instructions
from iteration i+1.
Instruction selection

▪ May be several ways to implement an operation or sequence of


operations.
▪ Represent operations as graphs, match possible instruction
sequences onto graph.

[Figure: an expression DFG with a + node above a * node; instruction templates
MUL, ADD, and MADD — the multiply-add subgraph can be covered either by
separate MUL and ADD instructions or by a single MADD.]
Using your compiler

▪ Understand various optimization levels (-O1, -O2, etc.)


▪ Look at mixed compiler/assembler output.
▪ Modifying compiler output requires care:
▪ correctness;
▪ loss of hand-tweaked code.
Interpreters and JIT compilers

▪ Interpreter: translates and executes program statements on-the-fly.


▪ JIT compiler: compiles small sections of code into instructions during
program execution.
▪ Eliminates some translation overhead.
▪ Often requires more memory.
Program design and analysis

▪ Program-level performance analysis.


▪ Optimizing for:
▪ Execution time.
▪ Energy/power.
▪ Program size.

▪ Program validation and testing.


Program-level performance analysis

▪ Need to understand
performance in detail:
▪ Real-time behavior, not just
typical.
▪ On complex platforms.

▪ Program performance ≠ CPU performance:
▪ Pipeline, cache are windows
into program.
▪ We must analyze the entire
program.
Complexities of program performance

▪ Varies with input data:


▪ Different-length paths.

▪ Cache effects.
▪ Instruction-level performance variations:
▪ Pipeline interlocks.
▪ Fetch times.
How to measure program performance

▪ Simulate execution of the CPU.


▪ Makes CPU state visible.

▪ Measure on real CPU using timer.


▪ Requires modifying the program to control the timer.

▪ Measure on real CPU using logic analyzer.


▪ Requires events visible on the pins.
Program performance metrics

▪ Average-case execution time.


▪ Typically used in application programming.

▪ Worst-case execution time.


▪ A component in deadline satisfaction.

▪ Best-case execution time.


▪ Task-level interactions can cause best-case program behavior to result in
worst-case system behavior.
Elements of program performance

▪ Basic program execution time formula:


▪ execution time = program path + instruction timing
▪ Solving these problems independently helps
simplify analysis.
▪ Easier to separate on simpler CPUs.
▪ Accurate performance analysis requires:
▪ Assembly/binary code.
▪ Execution platform.
Data-dependent paths in an if
statement

if (a || b) {        /* T1 */
    if (c)           /* T2 */
        x = r*s+t;   /* A1 */
    else y = r+s;    /* A2 */
    z = r+s+u;       /* A3 */
} else {
    if (c)           /* T3 */
        y = r-t;     /* A4 */
}

a b c   path
0 0 0   T1=F, T3=F: no assignments
0 0 1   T1=F, T3=T: A4
0 1 0   T1=T, T2=F: A2, A3
0 1 1   T1=T, T2=T: A1, A3
1 0 0   T1=T, T2=F: A2, A3
1 0 1   T1=T, T2=T: A1, A3
1 1 0   T1=T, T2=F: A2, A3
1 1 1   T1=T, T2=T: A1, A3
Paths in a loop

for (i=0, f=0; i<N; i++)
    f = f + c[i] * x[i];

[Figure: flow graph — i=0; f=0; test i==N with Y edge exiting the loop and N
edge to f = f + c[i]*x[i]; i = i+1; back to the test.]
Instruction timing

▪ Not all instructions take the same amount of time.


▪ Multi-cycle instructions.
▪ Fetches.

▪ Execution times of instructions are not independent.


▪ Pipeline interlocks.
▪ Cache effects.

▪ Execution times may vary with operand value.


▪ Floating-point operations.
▪ Some multi-cycle integer operations.
Measurement-driven performance
analysis

▪ Not so easy as it sounds:


▪ Must actually have access to the CPU.
▪ Must know data inputs that give worst/best case performance.
▪ Must make state visible.

▪ Still an important method for performance analysis.

Feeding the program

▪ Need to know the desired input values.


▪ May need to write software scaffolding to generate the input values.
▪ Software scaffolding may also need to examine outputs to generate
feedback-driven inputs.

Trace-driven measurement

▪ Trace-driven:
▪ Instrument the program.
▪ Save information about the path.

▪ Requires modifying the program.


▪ Trace files are large.
▪ Widely used for cache analysis.

Physical measurement

▪ In-circuit emulator allows tracing.


▪ Affects execution timing.
▪ Logic analyzer can measure behavior at pins.
▪ Address bus can be analyzed to look for events.
▪ Code can be modified to make events visible.
▪ Particularly important for real-world input
streams.

CPU simulation

▪ Some simulators are less accurate.


▪ Cycle-accurate simulator provides accurate clock-cycle timing.
▪ Simulator models CPU internals.
▪ Simulator writer must know how CPU works.

SimpleScalar FIR filter simulation

int x[N] = {8, 17, … };
int c[N] = {1, 2, … };

main() {
    int i, k, f = 0;
    for (k=0; k<COUNT; k++)
        for (i=0; i<N; i++)
            f += c[i]*x[i];
}

N        total sim cycles   sim cycles per filter execution
100      25854              259
1,000    155759             156
10,000   1451840            145
Performance optimization motivation

▪ Embedded systems must often meet deadlines.


▪ Faster may not be fast enough.

▪ Need to be able to analyze execution time.


▪ Worst-case, not typical.

▪ Need techniques for reliably improving execution time.


Programs and performance analysis

▪ Best results come from analyzing optimized instructions, not high-level
language code:
▪ non-obvious translations of HLL statements into instructions;
▪ code may move;
▪ cache effects are hard to predict.
Loop optimizations

▪ Loops are good targets for optimization.


▪ Basic loop optimizations:
▪ code motion;
▪ induction-variable elimination;
▪ strength reduction (x*2 -> x<<1).
Code motion

for (i=0; i<N*M; i++)
    z[i] = a[i] + b[i];

becomes:

X = N*M;
for (i=0; i<X; i++)
    z[i] = a[i] + b[i];

[Figure: flow graph — X = N*M hoisted before the loop; i=0; test i<X; body
z[i] = a[i] + b[i]; i = i+1; back to the test.]
Induction variable elimination

▪ Induction variable: loop index.


▪ Consider loop:
for (i=0; i<N; i++)
for (j=0; j<M; j++)
z[i,j] = b[i,j];

▪ Rather than recompute i*M+j for each array in each iteration, share
induction variable between arrays, increment at end of loop body.
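The rewrite described above can be sketched as follows; the function name, the flattened arrays, and the shared index name zbi are illustrative.

```c
#include <assert.h>

#define N 3
#define M 4

/* Before: z[i*M+j] = b[i*M+j] recomputes the index expression for every
   access. After: a single induction variable zbi is shared by both arrays
   and incremented once per inner iteration. */
void copy_2d(int *z, const int *b) {
    int zbi = 0;                  /* shared induction variable */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            z[zbi] = b[zbi];      /* replaces z[i*M+j] = b[i*M+j] */
            zbi++;
        }
}
```

The multiply in i*M+j disappears entirely; only an increment remains in the inner loop.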
Cache analysis

▪ Loop nest: set of loops, one inside other.


▪ Perfect loop nest: no conditionals in nest.
▪ Because loops use large quantities of data, cache conflicts are
common.
Array conflicts in cache

[Figure: a[0][0] at main-memory address 1024 and b[0][0] at address 4099 map to
conflicting locations in the cache.]


Array conflicts, cont’d.

▪ Array elements conflict because they are in the same line, even if not
mapped to same location.
▪ Solutions:
▪ move one array;
▪ pad array.
Performance optimization hints

▪ Use registers efficiently.


▪ Use page mode memory accesses.
▪ Analyze cache behavior:
▪ instruction conflicts can be handled by rewriting code, rescheduling;
▪ conflicting scalar data can easily be moved;
▪ conflicting array data can be moved, padded.
Energy/power optimization

▪ Energy: ability to do work.


▪ Most important in battery-powered systems.

▪ Power: energy per unit time.


▪ Important even in wall-plug systems---power becomes heat.
Measuring energy consumption

▪ Execute a small loop, measure current:

while (TRUE)
a();
Sources of energy consumption

▪ Relative energy per operation (Catthoor et al):


▪ memory transfer: 33
▪ external I/O: 10
▪ SRAM write: 9
▪ SRAM read: 4.4
▪ multiply: 3.6
▪ add: 1
Cache behavior is important

▪ Energy consumption has a sweet spot as cache size changes:


▪ cache too small: program thrashes, burning energy on external memory
accesses;
▪ cache too large: cache itself burns too much power.
Cache sweet spot

[Li98] © 1998 IEEE


Optimizing for energy

▪ First-order optimization:
▪ high performance = low energy.

▪ Not many instructions trade speed for energy.


Optimizing for energy, cont’d.

▪ Use registers efficiently.


▪ Identify and eliminate cache conflicts.
▪ Moderate loop unrolling eliminates some loop overhead instructions.
▪ Eliminate pipeline stalls.
▪ Inlining procedures may help: reduces linkage, but may increase
cache thrashing.
Efficient loops

▪ General rules:
▪ Don’t use function calls.
▪ Keep loop body small to enable local repeat (only forward branches).
▪ Use unsigned integer for loop counter.
▪ Use <= to test loop counter.
▪ Make use of compiler---global optimization, software pipelining.
Single-instruction repeat loop
example
    STM #4000h,AR2    ; load pointer to source
    STM #100h,AR3     ; load pointer to destination
    RPT #(1024-1)
    MVDD *AR2+,*AR3+  ; move
Optimizing for program size

▪ Goal:
▪ reduce hardware cost of memory;
▪ reduce power consumption of memory units.

▪ Two opportunities:
▪ data;
▪ instructions.
Data size minimization

▪ Reuse constants, variables, data buffers in different parts of code.


▪ Requires careful verification of correctness.

▪ Generate data using instructions.


Reducing code size

▪ Avoid function inlining.


▪ Choose CPU with compact instructions.
▪ Use specialized instructions where possible.
Program validation and testing

▪ But does it work?


▪ Concentrate here on functional verification.
▪ Major testing strategies:
▪ Black box doesn’t look at the source code.
▪ Clear box (white box) does look at the source code.
Clear-box testing

▪ Examine the source code to determine whether it


works:
▪ Can you actually exercise a path?
▪ Do you get the value you expect along a path?
▪ Testing procedure:
▪ Controllability: provide program with inputs.
▪ Execute.
▪ Observability: examine outputs.
Controlling and observing programs

firout = 0.0;
for (j=curr, k=0; j<N; j++, k++)
    firout += buff[j] * c[k];
for (j=0; j<curr; j++, k++)
    firout += buff[j] * c[k];
if (firout > 100.0) firout = 100.0;
if (firout < -100.0) firout = -100.0;

▪ Controllability: must fill the circular buffer with the desired N values;
other code governs how we access the buffer.
▪ Observability: want to examine firout before limit testing.
Execution paths and testing

▪ Paths are important in functional testing as well as performance


analysis.
▪ In general, an exponential number of paths through the program.
▪ Show that some paths dominate others.
▪ Heuristically limit paths.
Choosing the paths to test

▪ Possible criteria:
▪ Execute every statement at least once.
▪ Execute every branch direction at least once.
▪ Equivalent for structured programs.
▪ Not true for gotos.

[Figure: a CDFG in which statement coverage alone leaves one branch direction
not covered.]
Basis paths

▪ Approximate CDFG
with undirected graph.
▪ Undirected graphs
have basis paths:
▪ All paths are linear
combinations of basis
paths.
Cyclomatic complexity

▪ Cyclomatic complexity
is a bound on the size
of basis sets:
▪ e = # edges
▪ n = # nodes
▪ p = number of graph
components
▪ M = e – n + 2p.
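The bound can be computed directly from the graph counts; the if-then-else numbers in the test (4 nodes, 4 edges, one component) are a worked example, not from the original.

```c
#include <assert.h>

/* Cyclomatic complexity M = e - n + 2p bounds the size of the basis
   path set: e = # edges, n = # nodes, p = # connected components. */
int cyclomatic(int e, int n, int p) {
    return e - n + 2 * p;
}
```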
Branch testing

▪ Heuristic for testing branches.


▪ Exercise true and false branches of conditional.
▪ Exercise every simple condition at least once.
Branch testing example

▪ Correct: if (a || (b >= c)) { printf("OK\n"); }
▪ Incorrect: if (a && (b >= c)) { printf("OK\n"); }
▪ Test: a = F, (b >= c) = T.
▪ Example: correct code gives [0 || (3 >= 2)] = T; incorrect code gives
[0 && (3 >= 2)] = F.
Another branch testing example

▪ Correct: if ((x == good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
▪ Incorrect: if ((x = good_pointer) && (x->field1 == 3)) { printf("got the value\n"); }
▪ The incorrect code changes the pointer: assignment returns the new LHS value in C.
▪ Test that catches the error: (x != good_pointer) && (x->field1 == 3).
Domain testing

▪ Heuristic test for linear


inequalities.
▪ Test on each side +
boundary of inequality.
Def-use pairs

▪ Variable def-use:
▪ Def when value is
assigned (defined).
▪ Use when used on right-
hand side.
▪ Exercise each def-use
pair.
▪ Requires testing correct
path.
Loop testing

▪ Loops need specialized tests to be tested efficiently.


▪ Heuristic testing strategy:
▪ Skip loop entirely.
▪ One loop iteration.
▪ Two loop iterations.
▪ # iterations much below max.
▪ n-1, n, n+1 iterations where n is max.
Black-box testing

▪ Complements clear-box testing.


▪ May require a large number of tests.

▪ Tests software in different ways.


Black-box test vectors

▪ Random tests.
▪ May weight distribution based on software specification.

▪ Regression tests.
▪ Tests of previous versions, bugs, etc.
▪ May be clear-box tests of previous versions.
How much testing is enough?

▪ Exhaustive testing is impractical.


▪ One important measure of test quality---bugs
escaping into field.
▪ Good organizations can test software to give very low
field bug report rates.
▪ Error injection measures test quality:
▪ Add known bugs.
▪ Run your tests.
▪ Determine % injected bugs that are caught.
Program design and analysis

▪ Software modem.
Theory of operation

▪ Frequency-shift keying:
▪ separate frequencies for 0 and 1.

[Figure: two sinusoidal bursts over time — a lower frequency representing 0 and
a higher frequency representing 1.]
FSK encoding

▪ Generate waveforms based on current bit:

0110101 bit-controlled
waveform
generator
FSK decoding

[Figure: the A/D converter output feeds two filter-detector chains — a zero
filter and detector producing the 0 bit, and a one filter and detector
producing the 1 bit.]
Transmission scheme

▪ Send data in 8-bit bytes. Arbitrary spacing between bytes.


▪ Byte starts with 0 start bit.
▪ Receiver measures length of start bit to synchronize itself to
remaining 8 bits.

start (0) bit 1 bit 2 bit 3 ... bit 8


Requirements
Inputs Analog sound input, reset button.

Outputs Analog sound output, LED bit display.

Functions Transmitter: Sends data from memory


in 8-bit bytes plus start bit.
Receiver: Automatically detects bytes
and reads bits. Displays current bit on
LED.
Performance 1200 baud.

Manufacturing cost Dominated by microprocessor and


analog I/O
Power Powered by AC.

Physical Small desktop object.


size/weight
Specification

[Class diagram: Line-in* (sample-in(), input()) associated 1-to-1 with Receiver
(bit-out()); Transmitter (bit-in()) associated 1-to-1 with Line-out*
(sample-out(), output()).]
System architecture

▪ Interrupt handlers for samples:


▪ input and output.

▪ Transmitter.
▪ Receiver.
Transmitter

▪ Waveform generation by table lookup.


▪ float sine_wave[N_SAMP] = { 0.0, 0.5, 0.866, 1, 0.866, 0.5, 0.0, -0.5,
-0.866, -1.0, -0.866, -0.5, 0 };

[Figure: the stored samples traced over time form one period of a sine wave.]
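A sketch of bit-controlled waveform generation by table lookup, using the sine table above. The step sizes for the two frequencies and the function name are illustrative assumptions.

```c
#include <assert.h>

#define N_SAMP 13
float sine_wave[N_SAMP] = { 0.0, 0.5, 0.866, 1, 0.866, 0.5, 0.0,
                            -0.5, -0.866, -1.0, -0.866, -0.5, 0 };

int pos = 0;                      /* current position in the table */

/* Advance through the table at different rates to get the two FSK tones:
   a larger step per output sample yields a higher frequency. */
float next_sample(int bit) {
    int step = bit ? 2 : 1;       /* illustrative steps for 1 and 0 */
    float s = sine_wave[pos];
    pos = (pos + step) % N_SAMP;
    return s;
}
```

Calling next_sample once per D/A output period gives a continuous waveform whose frequency switches with the current data bit.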
Receiver

▪ Filters (FIR for simplicity) use circular buffers to hold data.


▪ Timer measures bit length.
▪ State machine recognizes start bits, data bits.
Hardware platform

▪ CPU.
▪ A/D converter.
▪ D/A converter.
▪ Timer.
Component design and testing

▪ Easy to test transmitter and receiver on host.


▪ Transmitter can be verified with speaker outputs.
▪ Receiver verification tasks:
▪ start bit recognition;
▪ data bit recognition.
System integration and testing

▪ Use loopback mode to test components against each other.


▪ Loopback in software or by connecting D/A and A/D converters.
Component design and testing

▪ Must test performance as well as functionality.

▪ Compression time shouldn't dominate other tasks.

▪ Test for error conditions:


▪ memory overflow;
▪ try to delete empty message set, etc.
System integration and testing

▪ Can test partial integration on host platform; full testing requires


integration on target platform.
▪ Simulate phone line for tests:
▪ it’s legal;
▪ easier to produce test conditions.
