08 Architecture

Cyber-Physical Systems
Embedded Architecture
Introduction to Microcontrollers
Ø A microcontroller (MCU) is a small
computer on a single integrated circuit
consisting of a relatively simple central
processing unit (CPU) combined with
peripheral devices such as memories,
I/O devices, and timers.
§ By some accounts, more than half of all CPUs
sold worldwide are microcontrollers.
§ Such a claim is hard to substantiate because
the difference between microcontrollers and
general-purpose processors is indistinct.
Microcontrollers
Ø An Embedded Computer System on a Chip
§ A CPU
§ Memory (Volatile and Non-Volatile)
§ Timers
§ I/O Devices
Ø Typically intended for limited energy usage
§ Low power when operating plus sleep modes
Ø Where might you use a microcontroller?
What is Control?
Ø Sequencing operations
§ Turning switches on and off
Ø Adjusting continuously (or at least finely)
variable quantities to influence a process
Microcontroller vs Microprocessor
Ø A microcontroller is a small computer on a single
integrated circuit containing a processor core,
memory, and programmable input/output
peripherals.
Ø A microprocessor incorporates the functions of a
computer’s central processing unit (CPU) on a
single integrated circuit.
Types of Processors
Ø In general-purpose computing, the variety of
instruction set architectures today is limited, with the
Intel x86 architecture overwhelmingly dominating all.
Ø There is no such dominance in embedded computing.
On the contrary, the variety of processors can be
daunting to a system designer.
Ø Do you want the same microprocessor for your watch,
autonomous vehicle, and industrial sensor?
How to choose?
Ø How do you choose a microprocessor or microcontroller?
Ø Things that matter
§ Peripherals
§ Concurrency & Timing
§ Clock Rates
§ Memory sizes (SRAM & flash)
§ Package sizes
Types of Microcontrollers
DSP Processors
Ø Processors designed specifically to support numerically
intensive signal processing applications are called DSP
processors, or DSPs (digital signal processors).
Ø Signal Processing Applications: interactive games; radar,
sonar, and LIDAR (light detection and ranging) imaging
systems; video analytics (the extraction of information from
video, for example for surveillance); driver-assist systems for
cars; medical electronics; and scientific instrumentation.
A Common Signal Processing Algorithm
Ø Finite impulse response (FIR) filtering
Ø N is the length of the filter
Ø $a_i$ are the tap values
Ø x(n) is the input
FIR filter formula: $y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i)$
FIR Filter Implementation
Ø $z^{-1}$ is a unit delay
Ø Suppose N = 4 and $a_0 = a_1 = a_2 = a_3 = 1/4$.
Ø Then for all n ∈ ℕ,
$y(n) = (x(n) + x(n-1) + x(n-2) + x(n-3))/4$.
Ø Each tap requires one multiply-accumulate operation
Figure: Tapped delay line implementation of the FIR filter
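The four-tap averaging filter above can be sketched in Python as follows. This is a minimal illustration rather than production DSP code; the function name `fir_filter` and the convention that samples before n = 0 are zero are assumptions made for this example.

```python
def fir_filter(taps, samples):
    """Direct-form FIR filter: y(n) = sum over i of taps[i] * x(n - i).

    Assumption for this sketch: x(n) = 0 for n < 0.
    """
    output = []
    for n in range(len(samples)):
        acc = 0.0  # accumulator, as in a multiply-accumulate loop
        for i, a in enumerate(taps):
            if n - i >= 0:
                acc += a * samples[n - i]  # one multiply-accumulate per tap
        output.append(acc)
    return output


taps = [0.25, 0.25, 0.25, 0.25]  # N = 4, a0 = a1 = a2 = a3 = 1/4
y = fir_filter(taps, [4.0, 8.0, 12.0, 16.0, 20.0])
```

Once four samples have arrived, each output is the average of the four most recent inputs.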

Multiply-Accumulate Instructions
Ø Digital Signal Processors provide a fast and efficient
multiply-accumulate (MAC) instruction
§ Typically including a relatively large accumulator
Ø They also typically use a Harvard memory access
architecture
Ø They may include auto-increment addressing modes
Ø They may support circular buffer addressing
§ Efficient implementation of delay lines
Ø They may support zero-overhead loops
Comparison
[Figure: Frequency Response Comparison. Amplitude versus frequency for the digital and analog filters.]
Digital Filter Critique
Ø The filter pole is at about ¼ of the sampling rate
§ We have only 4 samples of the impulse response
§ This makes the FIR filter simple: only 4 taps
§ This also degrades the filter performance
Ø We may be able to improve the filter performance
somewhat by using a different design technique
§ The filter coefficients would differ
Ø A higher sampling rate with respect to the filter
corner frequency could also help
FIR Filter Delay Implementation
Ø Circular Buffer
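A circular buffer lets the delay line absorb a new sample by overwriting the oldest one, instead of shifting every stored sample. The sketch below is one possible arrangement; the class name `CircularFir` and its fields are hypothetical, and the delay line is assumed to start out filled with zeros.

```python
class CircularFir:
    """FIR filter whose delay line is a fixed-size circular buffer."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.buffer = [0.0] * len(self.taps)  # delay line, initially zero
        self.head = 0  # index of the most recent sample

    def step(self, sample):
        n_taps = len(self.taps)
        # Advance the head and overwrite the oldest sample in place.
        self.head = (self.head + 1) % n_taps
        self.buffer[self.head] = sample
        acc = 0.0
        for i in range(n_taps):
            # Walk backwards in time, wrapping around the buffer boundary.
            acc += self.taps[i] * self.buffer[(self.head - i) % n_taps]
        return acc


fir = CircularFir([0.25, 0.25, 0.25, 0.25])
outputs = [fir.step(x) for x in [4.0, 8.0, 12.0, 16.0, 20.0]]
```

The modulo operations here play the role that circular addressing hardware plays on a DSP.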
Programmable Logic Controller (PLC)
Ø A microcontroller system for industrial automation
§ Continuous operation
§ Hostile environments
§ Originated as replacements for control circuits that used electrical relays to
control machinery
Ø PLCs are frequently programmed using ladder logic

§ This notation was developed to specify logic constructed with relays and
switches
GPUs
Ø A graphics processing unit (GPU) is a specialized
processor designed especially to perform the
calculations required in graphics rendering.
Ø Originally used mostly for gaming
Ø Common programming platform: CUDA
Parallelism vs Concurrency
Ø Embedded computing applications typically do
more than one thing “at a time.”
Ø Tasks are said to be “concurrent” if they
conceptually execute simultaneously
Ø Tasks are said to be “parallel” if they physically
execute simultaneously
§ Typically on multiple processors or cores at the same time
Imperative Language
Ø Non-concurrent programs specify a sequence of
instructions to execute.
Ø Imperative Language: expresses a computation as
a sequence of operations
§ Examples: C, Java
Ø How to write concurrent programs in an imperative
language?
§ Use a thread library
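As a small illustration of the thread-library approach, the sketch below uses Python's standard `threading` module. The helper name `run_workers` and the shared-counter workload are assumptions chosen for this example; the lock keeps each update to the shared state atomic.

```python
import threading


def run_workers(num_threads, increments):
    """Start num_threads threads that each add to a shared counter."""
    total = {"count": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:  # protect the read-modify-write on shared state
                total["count"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread to finish
    return total["count"]


result = run_workers(4, 1000)
```

The threads are concurrent in the sense used above; whether they also run in parallel depends on the hardware and the runtime.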
Dependency – Sequential Consistency
Ø No dependency
between lines 3 and 4
Ø Line 4 is dependent
on Line 3
Thread Mapping on Processor
Ø OS Dependent Scheduler
§ Static Mapping
§ Basic Lowest Load (fill in Round Robin fashion)
§ Extended Lowest Load
Performance Improvement
Ø Various current architectures seek to improve
performance by finding and exploiting opportunities
for parallel execution
§ This frequently improves processing throughput
§ It does not always improve processing latency
§ It frequently makes processing time less predictable
Ø Many embedded applications rely on results being
produced at predictable regular rates
§ Embedded results must be available at the right time
Parallelism
Ø Temporal Parallelism – Pipelining
Ø Spatial Parallelism
§ Superscalar (instruction and data level parallelism)
§ VLIW
§ Multicore
RISC and CISC Architectures
Ø CISC – Complex Instruction Set Computer
§ Complex instructions that take multiple clock cycles
Ø RISC – Reduced Instruction Set Computer
§ Simple instructions that can be executed within one cycle
Five Cycles of a RISC Instruction
Ø Instruction fetch cycle (IF)
§ Fetch the instruction from the memory address held in the PC, then increment the PC
Ø Instruction decode/register fetch cycle (ID)
§ Decode the instruction
Ø Execution/effective address cycle (EX)
§ ALU operates on the operands
Ø Memory access (MEM)
§ Load and store instructions access data memory
Ø Write-back cycle (WB)
§ Register-register ALU and load instructions write their result to the register file
Pipelining in RISC
[Figure: Pipelined datapath with the stages fetch, decode, execute, memory, and writeback. Labelled components: PC, instruction memory, decode, register bank, ALU, data memory, adder, zero test, and multiplexers. Annotated hazards: data hazard (computed branch), control hazard (conditional branch), and data hazard (memory read or ALU result).]
Simple RISC Pipeline
Pipelining Hazard
Ø Data Hazard: RAW (read after write), WAW
(write after write), WAR (write after read)
§ Pipeline bubble (no op)
§ Interlock
§ Out-of-order Execution
Ø Control Hazard
§ Out-of-order Execution
§ Speculative Execution
Interlocks
Ø Instruction B reads a register written by instruction A
[Figure: Reservation Table and Reservation Table with Interlocks for instructions A to E. Hardware resources: instruction memory, register bank read 1, register bank read 2, ALU, data memory, register bank write. The cycle axis runs from 1 to 9 in the first table and from 1 to 12 in the table with interlocks.]
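The effect of an interlock on the schedule can be estimated with a small model. The sketch below assumes a five-stage pipeline that issues in order, has no forwarding, and lets a dependent instruction read its registers (decode stage) only in a cycle after the producer's writeback; the function names and the `dependencies` mapping are hypothetical.

```python
def decode_cycles(num_instructions, dependencies):
    """Return the cycle in which each instruction reads its registers.

    dependencies maps a consumer's index to the index of the instruction
    whose result it reads. Assumed model: fetch in cycle 1, decode one
    cycle after fetch, writeback three cycles after decode.
    """
    cycles = []
    for i in range(num_instructions):
        earliest = 2 if i == 0 else cycles[-1] + 1  # in-order issue
        producer = dependencies.get(i)
        if producer is not None:
            writeback = cycles[producer] + 3
            earliest = max(earliest, writeback + 1)  # interlock stalls the read
        cycles.append(earliest)
    return cycles


def total_cycles(num_instructions, dependencies):
    """Cycle in which the last instruction completes its writeback."""
    return decode_cycles(num_instructions, dependencies)[-1] + 3


no_hazard = total_cycles(5, {})
with_interlock = total_cycles(5, {1: 0})  # B (index 1) reads what A (index 0) writes
```

Under these assumptions the five instructions finish in 9 cycles without the dependency and in 12 cycles with it, which agrees with the cycle axes of the two reservation tables.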
CISC
Ø DSPs are typically CISC machines
Ø Instructions support
§ FIR filtering
§ FFTs
§ Viterbi decoding
CISC Instruction
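Ø Texas Instruments TMS320c54x family of DSP processors
Ø Code
§ RPT numberOfTaps - 1
§ MAC *AR2+, *AR3+, A
Ø RPT: zero-overhead loop that repeats the following instruction
Ø MAC: multiply-accumulate
§ a := a + x ∗ y
§ AR2 and AR3 are registers that point to the operands; the trailing + auto-increments each pointer after the access
§ A is the accumulator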
Symmetric FIR Filter
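Ø The coefficients of an FIR filter are often symmetric
§ N is even and $a_i = a_{N-i-1}$
Ø If the hardware has two ALUs, the symmetry can be exploited:
the two samples that share a coefficient are added before a single multiplication
Ø Requires half the time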
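The saving from symmetry can be seen in a short sketch. It assumes N is even, stores only the first N/2 coefficients, and treats samples before n = 0 as zero; the function name `fir_symmetric` is hypothetical.

```python
def fir_symmetric(half_taps, samples, n):
    """One output y(n) of a symmetric FIR filter with N = 2 * len(half_taps).

    Because a_i = a_(N-i-1), the two samples that share a coefficient are
    added first, so only N/2 multiplications are needed per output.
    Assumption for this sketch: x(k) = 0 for k < 0.
    """
    num_taps = 2 * len(half_taps)

    def x(k):
        return samples[k] if 0 <= k < len(samples) else 0.0

    acc = 0.0
    for i, a in enumerate(half_taps):
        acc += a * (x(n - i) + x(n - (num_taps - i - 1)))
    return acc


# Full tap vector would be [1, 2, 2, 1]; only the first half is stored.
y3 = fir_symmetric([1.0, 2.0], [1.0, 2.0, 3.0, 4.0], 3)
```

With two ALUs, the addition of the paired samples and the multiply-accumulate could proceed side by side, which is where the time saving comes from.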
Example DSP Library from TI:
http://processors.wiki.ti.com/index.php/C674x_DSPLIB
VLIW Instruction Set
Ø Used for DSP and other
embedded applications
Ø Multiple independent
instructions per cycle,
packed into single
large "instruction
word" or "packet"
Multicore Architecture
Ø Combination of several processors in a single chip
Ø Real-time and Safety critical tasks can have
dedicated processors
Ø Heterogeneous multicore
§ CPU and GPUs together
FPGAs
Ø Field Programmable
Gate Arrays
§ Set of logic gates and RAM
blocks
§ Reconfigurable /
Programmable
§ Precise timing
Ø System on Chip design
[Figure: Zynq]
Bits to represent data
Ø Range and Resolution Tradeoff
§ More bits
o Better precision
o More flip-flops
§ Fewer bits
o Less precision
o Fewer flip-flops → lower footprint, lower power
Ø Fixed Point Representation

§ The complete design must be simulated to determine the dynamic range of
each parameter
Fixed and Floating Point Numbers
Ø Programs may use float or double
Ø Many embedded processors do not have floating
point arithmetic hardware
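Ø Floating-point operations must then be converted to integer operations in software, which makes them slow
Ø Fixed-point computation assumes an imaginary binary point
§ The binary point separates integer bits from fraction bits
§ just as the decimal point separates integer digits from fraction digits
Ø The format x.y indicates
§ x bits to the left and y bits to the right of the binary point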
Fixed Point Numbers
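Ø $01101.101_2 = 1 \times 2^{3} + 1 \times 2^{2} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-3} = 13.625$
Ø The word is split at the radix point into $A_h$ (m integer bits) and $A_l$ (n fraction bits):
$f = A_h + A_l \times 2^{-n}$
Ø Example: $10101.101_2 = A_h + A_l \times 2^{-3} = 21 + 5 \times 2^{-3} = 21.625$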
Unsigned Fixed Point Representation
Ø Example: Convert $f = 3.141593$ to unsigned fixed-point UQ4.12
format.
Ø Calculate $f \times 2^{12} = 12867.964928$
Ø Round the result to an integer: $\mathrm{round}(12867.964928) = 12868$
Ø Convert the integer to binary: $12868 = 11\_0010\_0100\_0100_2$
Ø Organize into UQ4.12: $0011.0010\_0100\_0100_2$
Ø Final result in hex: 0x3244
Ø Error: $\frac{12868}{2^{12}} - f = 8.5625 \times 10^{-6}$
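The UQ4.12 conversion steps can be checked with a few lines of Python. This is only a sketch: the helper names `to_unsigned_fixed` and `from_fixed` are hypothetical, no range check is performed, and Python's built-in `round` resolves exact ties to the even integer, which does not matter for this value.

```python
def to_unsigned_fixed(value, frac_bits):
    """Scale by 2**frac_bits and round to the nearest integer."""
    return round(value * (1 << frac_bits))


def from_fixed(raw, frac_bits):
    """Interpret a raw integer as a value with frac_bits fraction bits."""
    return raw / (1 << frac_bits)


f = 3.141593
raw = to_unsigned_fixed(f, 12)      # UQ4.12: 12 fraction bits
error = from_fixed(raw, 12) - f     # quantization error of the stored value
```

Here `raw` is the integer 12868 (0x3244) and `error` is about 8.5625e-6, well under the format's resolution of 2^-12.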
Signed Fixed Point Representation
Ø Word layout: sign bit s, then m integer bits, the radix point, and n fraction bits
$A = -1 \times b_{N-1} \times 2^{N-1} + \sum_{i=0}^{N-2} b_i \times 2^{i}$
$f = \frac{A}{2^{n}}$
where $N = m + n + 1$
Signed Fixed Point Representation
Ø Example: Convert $f = -3.141593$ to signed fixed-point Q3.12 format.
Ø Calculate $f \times 2^{12} = -12867.964928$
Ø Round the result to an integer: $\mathrm{round}(-12867.964928) = -12868$
Ø Convert the absolute value of the integer to binary: $12868 = 11\_0010\_0100\_0100_2$
(Note that the integer is represented in two’s complement.)
Ø Extend the result to 16 bits: $0011\_0010\_0100\_0100_2$
Ø Find the two’s complement: $1100\_1101\_1011\_1100_2$
Ø Final result in hex: 0xCDBC
Ø Error: $\frac{-12868}{2^{12}} - f = -8.5625 \times 10^{-6}$
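The signed Q3.12 example can be reproduced the same way. In this sketch the helper names are hypothetical, the word is assumed to be 16 bits (1 sign bit, 3 integer bits, 12 fraction bits), and masking with `2**16 - 1` yields the two's complement bit pattern; no range check is performed.

```python
def to_signed_fixed(value, frac_bits, word_bits):
    """Return the two's complement bit pattern of the scaled, rounded value."""
    raw = round(value * (1 << frac_bits))
    return raw & ((1 << word_bits) - 1)  # wrap negatives into word_bits bits


def from_signed_fixed(pattern, frac_bits, word_bits):
    """Decode a two's complement bit pattern back to a real value."""
    if pattern & (1 << (word_bits - 1)):  # sign bit set: value is negative
        pattern -= 1 << word_bits
    return pattern / (1 << frac_bits)


pattern = to_signed_fixed(-3.141593, 12, 16)
decoded = from_signed_fixed(pattern, 12, 16)
```

The pattern comes out as 0xCDBC, and decoding it gives -3.1416015625, so the error relative to f is about -8.5625e-6.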
Range and Resolution
Ø Range of unsigned UQm.n (m+n bits)
§ Unsigned integer → $[0, 2^{m+n} - 1]$
§ Unsigned fixed point → $[0, 2^{m+n} - 1] \times 2^{-n} = [0, 2^{m} - 2^{-n}]$
Ø Range of signed fixed point Qm.n (m+n+1 bits)
§ Range of signed integers: $[-2^{m+n}, 2^{m+n} - 1]$
§ Range of signed fixed-point numbers: $[-2^{m+n}, 2^{m+n} - 1] \times 2^{-n} = [-2^{m}, 2^{m} - 2^{-n}]$
Ø Resolution/Precision (UQm.n and Qm.n) = $2^{-n}$
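The range and resolution formulas translate directly into a helper function. The name `fixed_point_range` and its return layout (minimum, maximum, resolution) are assumptions made for this sketch.

```python
def fixed_point_range(m, n, signed):
    """Return (minimum, maximum, resolution) for UQm.n or Qm.n."""
    resolution = 2.0 ** -n
    maximum = 2.0 ** m - resolution
    minimum = -(2.0 ** m) if signed else 0.0
    return minimum, maximum, resolution


uq4_12 = fixed_point_range(4, 12, signed=False)
q3_12 = fixed_point_range(3, 12, signed=True)
```

For example, UQ4.12 covers 0 to 16 - 2^-12 and Q3.12 covers -8 to 8 - 2^-12, both in steps of 2^-12.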
Addition and Subtraction
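Assume UQ16.16, so that
$I_a = f_a \times 2^{16}$, $I_b = f_b \times 2^{16}$, $I_c = f_c \times 2^{16}$
$f_a = I_a \times 2^{-16}$, $f_b = I_b \times 2^{-16}$, $f_c = I_c \times 2^{-16}$
Addition
$f_c = f_a + f_b$
$= I_a \times 2^{-16} + I_b \times 2^{-16}$
$= (I_a + I_b) \times 2^{-16}$
$I_c \times 2^{-16} = (I_a + I_b) \times 2^{-16}$
$I_c = I_a + I_b$
Subtraction
$f_c = f_a - f_b$
$I_c = I_a - I_b$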
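Because both operands share the same scale factor, fixed-point addition and subtraction are plain integer operations on the raw values. The sketch below assumes UQ16.16 and uses hypothetical helper names; Python integers do not overflow, so a real 32-bit register would additionally need a range check.

```python
FRAC_BITS = 16  # UQ16.16


def to_raw(value):
    """Scale a real value to its UQ16.16 raw integer."""
    return round(value * (1 << FRAC_BITS))


def to_real(raw):
    """Scale a UQ16.16 raw integer back to a real value."""
    return raw / (1 << FRAC_BITS)


raw_a = to_raw(1.5)
raw_b = to_raw(2.25)
raw_sum = raw_a + raw_b    # raw sum equals the sum of the raw operands
raw_diff = raw_b - raw_a   # raw difference, non-negative here
```

Converting back gives 3.75 for the sum and 0.75 for the difference, with no rescaling step.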
Multiplication
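With $f = I \times 2^{-16}$ for each operand (UQ16.16):
$f_c = f_a \times f_b$
$= (I_a \times 2^{-16}) \times (I_b \times 2^{-16})$
$= (I_a \times I_b) \times 2^{-32}$
Since $f_c = I_c \times 2^{-16}$:
$I_c = (I_a \times I_b) \times 2^{-16}$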
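The derivation says that the raw product carries 32 fraction bits and must be scaled by 2^-16 to return to UQ16.16. A minimal sketch, with the hypothetical name `fixed_mul` and a right shift that simply drops the low 16 bits:

```python
FRAC_BITS = 16  # UQ16.16


def fixed_mul(raw_a, raw_b):
    """Multiply two UQ16.16 raw integers and rescale the result.

    The full product is scaled by 2**-32; shifting right by 16 bits
    brings it back to the 2**-16 scale (low-order bits are dropped).
    """
    return (raw_a * raw_b) >> FRAC_BITS


raw_a = 98304    # 1.5  * 2**16
raw_b = 147456   # 2.25 * 2**16
raw_c = fixed_mul(raw_a, raw_b)
```

Here `raw_c` is 221184, which is 3.375 × 2^16, the expected product of 1.5 and 2.25.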
Law of Conservation of Bits
Ø When multiplying two x-bit numbers with
formats n.m and p.q, the result has format (n +
p).(m + q)
Ø Processors might support full-precision
multiplications
Ø The result must finally be converted back to x bits to fit a data register
Overflow Example
Ø Multiply 0.5 × 0.5
Ø Fixed-point representation of 0.5 = $2^{30}$ (a format with 31 fraction bits)
Ø Result of multiplication = $2^{60}$
Ø Discarding the higher-order bits results in an error
Ø Remedy: shift right before multiplying
Ø Result = $0.01_2$, interpreted as 0.25
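The overflow and its remedy can be traced numerically. This sketch assumes a 32-bit register, emulated by masking to the low 32 bits, and a pre-shift of 16 bits per operand; the shift amount is an assumption, since the slide does not state it.

```python
WORD_MASK = 0xFFFFFFFF  # keeps the low 32 bits, as a 32-bit register would

half = 1 << 30                         # 0.5 with 31 fraction bits
full_product = half * half             # 2**60, too wide for 32 bits
truncated = full_product & WORD_MASK   # higher-order bits discarded

# Remedy: shift each operand right by 16 bits before multiplying.
shifted_product = (half >> 16) * (half >> 16)  # 2**14 * 2**14 = 2**28
# Each shifted operand has 15 fraction bits, so the product has 30.
result = shifted_product / (1 << 30)
```

Keeping only the low 32 bits of the full product leaves 0, which is wrong, while the pre-shifted product fits in the word and is interpreted as 0.25.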
Programmers need to guard against
Ø Overflow – higher-order bits are discarded
Ø Underflow – lower-order bits are discarded
Ø Truncation – the bits to keep are chosen before the operation
Ø Rounding – the operation runs at full precision and the
result is then rounded to the nearest representable value
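One way to read the difference between truncation and rounding is sketched below for a multiplication with 16 fraction bits. Both function names are hypothetical, and splitting the discarded bits evenly between the two operands is an assumption of this example.

```python
def mul_truncate_first(raw_a, raw_b, frac_bits):
    """Drop low-order bits of each operand, then multiply.

    Assumes frac_bits is even so the product keeps frac_bits fraction bits.
    """
    shift = frac_bits // 2
    return (raw_a >> shift) * (raw_b >> shift)


def mul_round_after(raw_a, raw_b, frac_bits):
    """Multiply at full precision, then round to the nearest value."""
    product = raw_a * raw_b  # carries 2 * frac_bits fraction bits
    return (product + (1 << (frac_bits - 1))) >> frac_bits


raw_a = 0x00010001  # 1 + 2**-16 with 16 fraction bits
raw_b = 0x00018000  # 1.5 with 16 fraction bits
truncated = mul_truncate_first(raw_a, raw_b, 16)
rounded = mul_round_after(raw_a, raw_b, 16)
```

The exact raw product is 98305.5; truncating first gives 98304, because the small fraction of the first operand is lost before the multiply, while rounding afterwards gives 98306.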
