Lecture 05 ARM Processors

ARM Processors and Architectures
Outline
• Introduction
• ARM Architecture Overview
• ARMv7-AR Architecture
• Programmer’s Model
• Memory Systems
• ARM System Design
• AMBA bus protocol
ARM Ltd
• ARM was founded in November 1990
• Advanced RISC Machines
• Company headquarters in Cambridge, UK
• Processor design centers in Cambridge, Austin, and Sophia Antipolis
• Sales, support, and engineering offices all over the world
• Best known for its range of RISC processor core designs
• Other products (fabric IP, software tools, models, cell libraries) help partners develop and ship ARM-based SoCs
• ARM does not manufacture silicon


ARM in the market
According to ARM Holdings, in 2010 alone, producers of chips based on ARM architectures reported shipments of 6.1 billion ARM-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes and 10% of mobile computers.
Source: http://en.wikipedia.org/wiki/ARM_architecture
RISC Design Philosophy
“The architectural simplicity of ARM processors leads to very
small implementations, and small implementations mean
devices can have very low power consumption.
Implementation size, performance, and very low power
consumption are key attributes of the ARM architecture.”
ARM Architecture Reference Manual ARMv7-A edition

RISC Design Philosophy
ARM is RISC
• Uniform register file
• Load/store architecture
• Simple addressing
Outline
• Introduction
• Memory Systems
Development of the ARM Architecture
• v4: halfword and signed halfword / byte support; System mode; Thumb instruction set (v4T)
• v5: improved interworking; CLZ; saturated arithmetic; DSP MAC instructions; Extensions: Jazelle (5TEJ)
• v6: SIMD instructions; multi-processing; v6 memory architecture; unaligned data support; Extensions: Thumb-2 (6T2), TrustZone (6Z), Multicore (6K), Thumb only (6-M)
• v7: Thumb-2; Architecture Profiles: 7-A (Applications), 7-R (Real-time), 7-M (Microcontroller)
• v8: 64-bit architecture; 32- and 64-bit registers; Architecture Profiles: 8-A (Applications)
• Note that implementations of the same architecture can be different
• Cortex-A8 - architecture v7-A, with a 13-stage pipeline
• Cortex-A9 - architecture v7-A, with an 8-stage pipeline
Architecture ARMv7 profiles
• Application profile (ARMv7-A)
• Memory management support (MMU)
• Highest performance at low power
• Influenced by multi-tasking OS system requirements
• TrustZone and Jazelle-RCT for a safe, extensible system
• e.g. Cortex-A5, Cortex-A9
• Real-time profile (ARMv7-R)

• Protected memory (MPU)
• Low latency and predictability ‘real-time’ needs
• Evolutionary path for traditional embedded business
• e.g. Cortex-R4
• Microcontroller profile (ARMv7-M, ARMv7E-M, ARMv6-M)

• Lowest gate count entry point
• Deterministic and predictable behavior a key priority
• Deeply embedded use
• e.g. Cortex-M3
Which architecture is my processor?
Outline
• Introduction
• Cortex A9 processor
• Memory Systems
Cortex A9
• The ARM Cortex-A9 processor is the high performance choice in a family of low power, cost-sensitive devices.
• The Cortex-A9 microarchitecture is delivered either as a Cortex-A9 single core processor or as a scalable multicore processor: the Cortex-A9 MPCore™ processor
What are its specs?
• The Cortex A9 core:
• Gives 2.50 DMIPS/MHz/core (Dhrystone MIPS)
• Generally clocked between 800MHz and 2GHz
• Possible to run > 1GHz and < 250mW
• 1 – 4 cores
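The figures above combine into a rough peak-throughput estimate. The sketch below simply multiplies DMIPS/MHz/core by the clock and the core count; the function name is hypothetical, and perfect linear scaling across cores is an assumption that real workloads rarely reach.

```c
/* Rough peak Dhrystone estimate: DMIPS/MHz/core x clock (MHz) x cores.
   2.5 is the per-core figure quoted on the slide; linear scaling across
   cores is an optimistic assumption made only for illustration. */
double cortex_a9_peak_dmips(double clock_mhz, int cores)
{
    const double dmips_per_mhz_per_core = 2.5;
    return dmips_per_mhz_per_core * clock_mhz * (double)cores;
}
```

For example, one core at 800 MHz gives 2000 DMIPS under this model, and four cores at 1 GHz give 10000 DMIPS.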
Where is it used?
• Apple A5 SoC (iPhone 4S, iPad 2, iPad mini)
• NVIDIA Tegra 2 SoC (Motorola XOOM)
• Zynq 7020 SoC (Xilinx ZedBoard FPGA)
Cortex-A9
• ARMv7-A Architecture
• Thumb-2, Thumb-2EE
• TrustZone support
• Variable-length multi-issue pipeline
• Register renaming
• Speculative data prefetching
• Branch prediction and return stack
• 64-bit AXI instruction and data interfaces
• TrustZone extensions
• L1 Data and Instruction caches
• 16-64KB each
• 4-way set-associative
Optional features:
• PTM instruction trace interface
• IEM power saving support
• Full Jazelle DBX support
• VFPv3-D16 Floating-Point Unit (FPU) or NEON™ media processing engine
CortexA9 Microarchitecture
Figure: pipeline stages: Instruction Fetch, Decode, Rename, Issue, Execute, Writeback, Memory
Instruction Fetch
• Instruction cache size: 16KB, 32KB, or 64KB

• Superscalar pipeline: fetching two instructions at once
• Branch Prediction:
• Global History Buffer: 1K ~ 16K entries
• Branch-Target Address Cache: 512 ~ 4K entries
• Return stack of 4 x 32 bits
• Fast-loop mode: instruction loops smaller than 64 bytes often complete without additional instruction cache accesses
Instruction Decode
• Super Scalar Decoder

- Capable of decoding two full instructions per cycle
Rename
• Register Renaming
- Resolves data dependencies and unrolls small loops in hardware
Issue
• Issue (aka Dispatch) can be fed a maximum of 2 instructions per cycle
• Issue can dispatch up to 4 instructions per cycle
• Out of order selection of instructions from queue
Execute (1)
• Variable-length execute stage (1 ~ 3 cycles)
- Most instructions finish within 1 cycle
- Instructions that fold in shifts and rotates can take up to 3 cycles
• ADD r0, r1, r2 (1 cycle)
• ADD r0, r1, r2, LSL #2 (2 cycles)
• Corresponds to a = b + (c << 2);
• ADD r0, r1, r2, LSL r3 (3 cycles)
• Corresponds to a = b + (c << d);
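The three ADD forms above can be sketched in C as one function for the arithmetic and one for the cycle counts quoted on the slide. All names here are hypothetical. The shift amount is masked to 0..31 because a C shift by 32 or more is undefined, whereas the ARM register-shift form takes the amount from the bottom byte of the register; that difference is an assumption of this sketch, not part of the slide.

```c
#include <stdint.h>

/* The three Operand2 forms shown above (names are illustrative only). */
enum operand2_kind {
    OP2_REGISTER,            /* ADD r0, r1, r2          */
    OP2_SHIFT_BY_IMMEDIATE,  /* ADD r0, r1, r2, LSL #2  */
    OP2_SHIFT_BY_REGISTER    /* ADD r0, r1, r2, LSL r3  */
};

/* a = b + (c << shift); the shift is masked so the C code stays defined. */
uint32_t add_shifted(uint32_t b, uint32_t c, unsigned shift)
{
    return b + (c << (shift & 31u));
}

/* Execute-stage cycle counts as quoted on the slide for the Cortex-A9. */
int execute_cycles(enum operand2_kind kind)
{
    switch (kind) {
    case OP2_REGISTER:           return 1;
    case OP2_SHIFT_BY_IMMEDIATE: return 2;
    case OP2_SHIFT_BY_REGISTER:  return 3;
    }
    return -1; /* unknown form */
}
```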
Execute (2)
• NEON Media Processing Engine
- NEON technology supports instructions targeted primarily at audio, video, 3D graphics, image and speech processing.
Execute (3)
• What is NEON?
• NEON is a wide SIMD data processing architecture
• 32 registers, 64 bit wide or 16 registers, 128 bit wide
• NEON instructions perform “Packed SIMD” processing
• Registers can be considered as “vector” of same data type
• Instructions perform the same operation in all lanes
Execute (4)
• NEON Media Processing Engine supports vector computations on:
- half-precision (16-bit), single-precision (32-bit), double-precision (64-bit) floating-point numbers
- 8, 16, 32 and 64-bit signed and unsigned integers
• Supported Operations Include:

- addition, subtraction, multiplication
- maximum or minimum of a vector of operands
- Inverse square-root approximation (y = x^-(1/2))
- many more
Memory
• Dependent load-store instructions are forwarded for resolution within the memory system
• 2-level TLB structure
– micro TLB
• 32 entries on data side and 32 or 64 entries on instruction side
• to reduce power consumed in translation and protection look-ups
– main DTLB
Memory (2)
• Data prefetcher
• monitors cache line requests by the processor and cache misses to determine how much data to prefetch
• can prefetch up to 8 independent data streams
• prefetches and allocates data in the L1 data cache, as long as it keeps hitting in the prefetched cache line
• When does prefetching stop?
Memory Hierarchy
Figure: Cortex-A9 MPCore memory hierarchy. Four CPUs, each with its own instruction cache and data cache, connect through the Snoop Control Unit (SCU), which also has an Accelerator Coherence Port, to the L2 cache and main memory.
L1 caches
• Non-unified
- 32 bytes line length
- can be disabled independently
• 16, 32 or 64KB
• 4-way associative
• support for Security Extensions
• I cache: VIPT
• D cache: PIPT
- reduces the number of cache flushes and refills and saves energy
L2 cache
• shared, unified
• On-chip in Zedboard
• 128KB to 8MB
• 4 to 16-way associative
Snoop Control Unit (1)
• Integral part of cache memory systems
• Connects processors to memory system through AXI interfaces
Snoop Control Unit (2)
• SCU functions:
- maintains data cache coherency
- initiates L2 memory accesses
- arbitrates between processors' simultaneous requests for L2 accesses
- manages accesses from the ACP
- provides access to on-chip ROM and RAM
• does not support instruction cache coherency
Accelerator Coherence Port
• optional AXI 64-bit slave port
• allows non-cached system-mastering peripherals and accelerators to be connected
- For example, a DMA engine or cryptographic accelerator
• SCU enforces memory coherency
Multi-Core
Outline
• Introduction
• Memory Systems
Data Sizes and Instruction Sets
• ARM is a 32-bit load / store RISC architecture
• The only memory accesses allowed are loads and stores
• Most internal registers are 32 bits wide
• Most instructions execute in a single cycle
• When used in relation to ARM cores

• Halfword means 16 bits (two bytes)
• Word means 32 bits (four bytes)
• Doubleword means 64 bits (eight bytes)
• ARM cores implement two basic instruction sets

• ARM instruction set – instructions are all 32 bits long
• Thumb instruction set – instructions are a mix of 16 and 32 bits
• Thumb-2 technology added many extra 32- and 16-bit instructions to the original 16-bit Thumb instruction set
• Depending on the core, may also implement other instruction sets

• VFP instruction set – 32 bit (vector) floating point instructions
• NEON instruction set – 32 bit SIMD instructions
• Jazelle-DBX - provides acceleration for Java VMs (with additional software support)
• Jazelle-RCT - provides support for interpreted languages
Processor Modes
• ARM has seven basic operating modes
• Each mode has access to its own stack space and a different subset of registers
• Some operations can only be carried out in a privileged mode
Mode         Description                                                                       Class
Supervisor   Entered on reset and when a Supervisor Call (SVC) instruction is executed         Exception mode, privileged
FIQ          Entered when a high priority (fast) interrupt is raised                           Exception mode, privileged
IRQ          Entered when a normal priority interrupt is raised                                Exception mode, privileged
Abort        Used to handle memory access violations                                           Exception mode, privileged
Undef        Used to handle undefined instructions                                             Exception mode, privileged
System       Privileged mode using the same registers as User mode                             Privileged mode
User         Mode under which most applications / OS tasks run                                 Unprivileged mode
The ARM Register Set
• ARM has 37 registers, all 32 bits long
• A subset of these registers is accessible in each mode
• Note: System mode uses the User mode register set.
Mode         Registers
User mode    r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr
FIQ          banked r8-r12, r13 (sp), r14 (lr), spsr
IRQ          banked r13 (sp), r14 (lr), spsr
Undef        banked r13 (sp), r14 (lr), spsr
Abort        banked r13 (sp), r14 (lr), spsr
SVC          banked r13 (sp), r14 (lr), spsr
Figure legend: registers of the current mode versus banked out registers

Program Status Registers
Bit layout:
Bits 31:28   N Z C V
Bit 27       Q
Bits 26:25   IT[de]
Bit 24       J
Bits 19:16   GE[3:0]
Bits 15:10   IT[abc]
Bit 9        E
Bit 8        A
Bit 7        I
Bit 6        F
Bit 5        T
Bits 4:0     mode
Byte fields: f (bits 31:24), s (bits 23:16), x (bits 15:8), c (bits 7:0)
• Condition code flags
• N = Negative result from ALU
• Z = Zero result from ALU
• C = ALU operation Carried out
• V = ALU operation oVerflowed
• Sticky Overflow flag - Q flag
• Indicates if saturation has occurred
• SIMD Condition code bits – GE[3:0]
• Used by some SIMD instructions
• IF THEN status bits – IT[abcde]
• Controls conditional execution of Thumb instructions
• T bit
• T = 0: Processor in ARM state
• T = 1: Processor in Thumb state
• J bit
• J = 1: Processor in Jazelle state
• Mode bits
• Specify the processor mode
• Interrupt Disable bits
• I = 1: Disables IRQ
• F = 1: Disables FIQ
• E bit
• E = 0: Data load/store is little endian
• E = 1: Data load/store is big endian
• A bit
• A = 1: Disable imprecise data aborts
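The status register fields can be picked apart with shifts and masks. The sketch below is a minimal decoder; the struct and function names are hypothetical, and the mode encodings (0x10 User, 0x11 FIQ, 0x12 IRQ, 0x13 Supervisor, 0x17 Abort, 0x1B Undef, 0x1F System) come from the ARMv7-A architecture rather than from the slide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a program status register (names are illustrative). */
struct psr_fields {
    bool n, z, c, v;   /* condition code flags, bits 31..28 */
    bool q;            /* sticky overflow, bit 27 */
    bool j;            /* Jazelle state, bit 24 */
    unsigned ge;       /* GE[3:0], bits 19..16 */
    bool e, a, i, f;   /* endianness, abort/IRQ/FIQ disable, bits 9..6 */
    bool t;            /* Thumb state, bit 5 */
    unsigned mode;     /* mode field, bits 4..0 */
};

struct psr_fields decode_psr(uint32_t psr)
{
    struct psr_fields out;
    out.n = (psr >> 31) & 1u;
    out.z = (psr >> 30) & 1u;
    out.c = (psr >> 29) & 1u;
    out.v = (psr >> 28) & 1u;
    out.q = (psr >> 27) & 1u;
    out.j = (psr >> 24) & 1u;
    out.ge = (psr >> 16) & 0xFu;
    out.e = (psr >> 9) & 1u;
    out.a = (psr >> 8) & 1u;
    out.i = (psr >> 7) & 1u;
    out.f = (psr >> 6) & 1u;
    out.t = (psr >> 5) & 1u;
    out.mode = psr & 0x1Fu;
    return out;
}

/* Mode encodings taken from the ARMv7-A architecture, not from the slide. */
const char *mode_name(unsigned mode)
{
    switch (mode) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "Supervisor";
    case 0x17: return "Abort";
    case 0x1B: return "Undef";
    case 0x1F: return "System";
    default:   return "unknown";
    }
}
```

Decoding the value 0x600001D3, for instance, yields Z and C set, A, I and F set, ARM state, and Supervisor mode.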
Instruction Set basics
• The ARM Architecture is a Load/Store architecture

• No direct manipulation of memory contents
• Memory must be loaded into the CPU to be modified, then written back out
• Cores are either in ARM state or Thumb state

• This determines which instruction set is being executed
• An instruction must be executed to switch between states
• The architecture allows programmers and compilation tools to reduce branching through the use of conditional execution
• Method differs between ARM and Thumb, but the principle is that most (ARM) or all (Thumb) instructions can be executed conditionally.
Data Processing Instructions
• These instructions operate on the contents of registers
• They DO NOT affect memory
                              arithmetic                logical               move
manipulation                  ADC SBC ADD SUB RSB RSC   BIC ORR AND EOR ORN   MVN MOV
(has destination register)
comparison                    CMN (ADDS), CMP (SUBS)    TST (ANDS), TEQ (EORS)
(set flags only)
• Syntax:
<Operation>{<cond>}{S} {Rd,} Rn, Operand2
• Examples:
• ADD r0, r1, r2 ; r0 = r1 + r2
• TEQ r0, r1 ; if r0 = r1, Z flag will be set
• MOV r0, r1 ; copy r1 to r0
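The comparison instructions behave like their flag-setting counterparts with the result thrown away. As a sketch of that idea, the code below computes the N, Z, C and V flags the way ADDS (and therefore CMN) would, and the Z result of TEQ. The function and struct names are hypothetical, and only the flag arithmetic is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

struct nzcv { bool n, z, c, v; };

/* Flags produced by ADDS Rd, Rn, Rm (CMN sets the same flags and
   discards the sum). */
struct nzcv flags_after_adds(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;                       /* wraps modulo 2^32 */
    struct nzcv f;
    f.n = (r >> 31) & 1u;                     /* negative: top bit of result */
    f.z = (r == 0u);                          /* zero result */
    f.c = (r < a);                            /* unsigned carry out */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;  /* signed overflow */
    return f;
}

/* TEQ r0, r1 sets Z when the exclusive OR of the operands is zero,
   that is, when the two registers are equal. */
bool teq_sets_z(uint32_t a, uint32_t b)
{
    return (a ^ b) == 0u;
}
```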
Single Access Data Transfer
• Used to move data between one or two registers and memory
Load     Store    Size
LDRD     STRD     Doubleword
LDR      STR      Word
LDRB     STRB     Byte
LDRH     STRH     Halfword
LDRSB             Signed byte load
LDRSH             Signed halfword load
Figure: a value narrower than a word is loaded from memory into the bottom of Rd (bits 31..0); the upper bits are zero filled or sign extended on load
• Syntax:
• LDR{<size>}{<cond>} Rd, <address>
• STR{<size>}{<cond>} Rd, <address>
• Example:
• LDRB r0, [r1] ; load bottom byte of r0 from the byte of memory at address in r1
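The difference between zero filling and sign extension on a narrow load can be sketched in C. The function names below mirror the mnemonics but are otherwise hypothetical; memory is modelled as a pointer to the addressed byte or halfword.

```c
#include <stdint.h>

/* LDRB: upper 24 bits of the destination are zero filled. */
uint32_t ldrb(const uint8_t *addr)
{
    return (uint32_t)*addr;
}

/* LDRSB: bit 7 of the byte is copied into the upper 24 bits. */
uint32_t ldrsb(const uint8_t *addr)
{
    uint32_t v = *addr;
    return (v ^ 0x80u) - 0x80u;      /* portable sign extension from 8 bits */
}

/* LDRH: upper 16 bits of the destination are zero filled. */
uint32_t ldrh(const uint16_t *addr)
{
    return (uint32_t)*addr;
}

/* LDRSH: bit 15 of the halfword is copied into the upper 16 bits. */
uint32_t ldrsh(const uint16_t *addr)
{
    uint32_t v = *addr;
    return (v ^ 0x8000u) - 0x8000u;  /* portable sign extension from 16 bits */
}
```

Loading the byte 0x80 gives 0x00000080 with LDRB and 0xFFFFFF80 with LDRSB.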
Multiple Register Data Transfer
• These instructions move data between multiple registers and memory
• Syntax
• <LDM|STM>{<addressing_mode>}{<cond>} Rb{!}, <register list>
• 4 addressing modes: IA (default), IB, DA, DB
• Increment after/before
• Decrement after/before
Figure: for base register (Rb) r10, the memory locations holding r0, r1 and r4 under each of the four addressing modes, with addresses increasing up the page
• Example
• LDM r10, {r0,r1,r4} ; load registers, using r10 base
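The default increment-after form can be modelled in a few lines of C. This is a sketch with hypothetical names: memory is a word array indexed by byte address divided by 4, the register list is a 16-bit mask, and the lowest-numbered register is always transferred from the lowest address.

```c
#include <stdint.h>

/* Model of LDMIA Rb, {register list}.
   regs      : the 16 core registers r0..r15
   mem       : word-addressed memory (byte address / 4 is the index)
   base_addr : byte address held in the base register
   reglist   : bit n set means rn is in the list
   Returns the address after the last word loaded, which is the value
   written back to the base register when "!" is used. */
uint32_t ldmia(uint32_t regs[16], const uint32_t *mem,
               uint32_t base_addr, uint16_t reglist)
{
    uint32_t addr = base_addr;
    for (int r = 0; r < 16; r++) {
        if (reglist & (1u << r)) {
            regs[r] = mem[addr / 4u];  /* lowest register, lowest address */
            addr += 4u;                /* increment after each transfer */
        }
    }
    return addr;
}
```

With the register list {r0,r1,r4} and three consecutive words at the base address, r0 receives the first word, r1 the second and r4 the third.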
Subroutines
• Implementing a conventional subroutine call requires two steps
• Store the return address
• Branch to the address of the required subroutine
• These steps are carried out in one instruction, BL
• The return address is stored in the link register (lr/r14)
• Branch to an address (range dependent on instruction set and width)
• Return is by branching to the address in lr
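Figure: func1 calls func2 with BL; func2 returns with BX lr
void func1 (void)
{
    :
    func2();    /* compiled to: BL func2 */
    :
}
func2:
    :
    BX lr       ; return to the address held in lr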
Supervisor Call (SVC)
SVC{<cond>} <SVC number>
• Causes an SVC exception
• The SVC handler can examine the SVC number to decide what operation has been requested
• But the core ignores the SVC number
• By using the SVC mechanism, an operating system can implement a set of privileged operations (system calls) which applications running in user mode can request
• Thumb version is unconditional
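A handler typically recovers the SVC number from the instruction that caused the exception and uses it to select a service. The sketch below assumes the ARM-state encoding, where the number occupies the low 24 bits of the instruction; the service functions and the table are purely hypothetical stand-ins for real system calls.

```c
#include <stddef.h>
#include <stdint.h>

/* In ARM state the SVC number is the low 24 bits of the instruction. */
uint32_t svc_number_arm(uint32_t insn)
{
    return insn & 0x00FFFFFFu;
}

/* Hypothetical services standing in for real system calls. */
static int svc_double(int arg) { return arg * 2; }
static int svc_negate(int arg) { return -arg; }

typedef int (*svc_fn)(int);

/* Dispatch on the SVC number; returns -1 for an unknown service. */
int svc_dispatch(uint32_t svc_insn, int arg)
{
    static const svc_fn table[] = { svc_double, svc_negate };
    uint32_t n = svc_number_arm(svc_insn);
    if (n >= sizeof table / sizeof table[0])
        return -1;
    return table[n](arg);
}
```

Here 0xEF000001 is the encoding of an unconditional SVC #1, so dispatching it with argument 5 calls the second table entry.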
Exception handling process
1. Save processor status
• Copies CPSR into SPSR_<mode>
• Stores the return address in LR_<mode>
• Adjusts LR based on exception type
2. Change processor status for exception
• Mode field bits
• ARM or Thumb state
• Interrupt disable bits (if appropriate)
• Sets PC to vector address
3. Execute exception handler
• <user's code>
4. Return to main application
• Restore CPSR from SPSR_<mode>
• Restore PC from LR_<mode>
• 1 and 2 performed automatically by the core
• 3 and 4 responsibility of software
Figure: control passes from the main application to the exception handler and back
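The four steps can be walked through with a small software model of an IRQ. Everything here is a sketch under stated assumptions: the struct and function names are hypothetical, the vector table sits at address 0 (IRQ vector 0x18), handlers run in ARM state, and the IRQ link value is the address of the next instruction plus 4, so the return subtracts 4.

```c
#include <stdint.h>

#define CPSR_MODE_MASK 0x1Fu
#define CPSR_MODE_IRQ  0x12u
#define CPSR_T_BIT     (1u << 5)
#define CPSR_I_BIT     (1u << 7)
#define IRQ_VECTOR     0x00000018u  /* assumes low vectors */

/* Only the state touched by an IRQ is modelled. */
struct cpu_model {
    uint32_t pc;
    uint32_t cpsr;
    uint32_t lr_irq;
    uint32_t spsr_irq;
};

/* Steps 1 and 2: what the core does automatically on an IRQ.
   next_insn is the address of the instruction to resume at. */
void irq_entry(struct cpu_model *cpu, uint32_t next_insn)
{
    cpu->spsr_irq = cpu->cpsr;                 /* save processor status */
    cpu->lr_irq = next_insn + 4u;              /* adjusted return address */
    cpu->cpsr = (cpu->cpsr & ~CPSR_MODE_MASK) | CPSR_MODE_IRQ;
    cpu->cpsr &= ~CPSR_T_BIT;                  /* assume ARM-state handler */
    cpu->cpsr |= CPSR_I_BIT;                   /* disable further IRQs */
    cpu->pc = IRQ_VECTOR;                      /* branch to the vector */
}

/* Step 4: what software does to return (SUBS pc, lr, #4 on real hardware). */
void irq_return(struct cpu_model *cpu)
{
    cpu->pc = cpu->lr_irq - 4u;
    cpu->cpsr = cpu->spsr_irq;
}
```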
What is NEON?
• NEON is a wide SIMD data processing architecture
• Extension of the ARM instruction set (v7-A)
• 32 x 64-bit wide registers (can also be used as 16 x 128-bit wide registers)
• NEON instructions perform “Packed SIMD” processing
• Registers are considered as vectors of elements of the same data type
• Data types available: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
• Instructions usually perform the same operation in all lanes
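Figure: source registers Dn and Dm are divided into elements; the operation is applied to each pair of elements (one lane) and the results are written to the destination register Dd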
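Packed SIMD processing can be imitated in portable C to make the lane idea concrete. The sketch below treats one 64-bit D register as four 16-bit lanes and adds two registers lane by lane, with each lane wrapping independently. The type and function names are hypothetical; this is an emulation for illustration, not NEON code.

```c
#include <stdint.h>

/* One 64-bit D register viewed as four 16-bit elements. */
typedef struct {
    uint16_t lane[4];
} d_reg_u16;

/* Lane-wise add: the same operation is applied to every lane, and a
   carry out of one lane never reaches its neighbour. */
d_reg_u16 lanewise_add_u16(d_reg_u16 dn, d_reg_u16 dm)
{
    d_reg_u16 dd;
    for (int i = 0; i < 4; i++)
        dd.lane[i] = (uint16_t)(dn.lane[i] + dm.lane[i]);
    return dd;
}
```

Adding {1, 2, 3, 65535} and {10, 20, 30, 1} gives {11, 22, 33, 0}: the last lane wraps without disturbing the others.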
NEON Coprocessor registers
• NEON has a 256-byte register file
• Separate from the core registers (r0-r15)
• Extension to the VFPv2 register file (VFPv3)
• Two different views of the NEON registers
• 32 x 64-bit registers (D0-D31)
• 16 x 128-bit registers (Q0-Q15)
• Enables register trade-offs
• Vector length can be variable
• Different registers available
Figure: each Q register overlays a pair of D registers: Q0 = D0 and D1, Q1 = D2 and D3, ..., Q15 = D30 and D31
NEON vectorizing example
• How does the compiler perform vectorization?
void add_int(int * __restrict pa,
             int * __restrict pb,
             unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < (n & ~3); i++)
        pa[i] = pb[i] + x;
}
1. Analyze each loop:
• Are pointer accesses safe for vectorization?
• What data types are being used? How do they map onto NEON vector registers?
• Number of loop iterations
2. Unroll the loop to the appropriate number of iterations, and perform other transformations like pointerization
void add_int(int *pa, int *pb,
             unsigned n, int x)
{
    unsigned int i;
    for (i = ((n & ~3) >> 2); i; i--)
    {
        *(pa + 0) = *(pb + 0) + x;
        *(pa + 1) = *(pb + 1) + x;
        *(pa + 2) = *(pb + 2) + x;
        *(pa + 3) = *(pb + 3) + x;
        pa += 4; pb += 4;
    }
}
3. Map each unrolled operation onto a NEON vector lane, and generate corresponding NEON instructions
Figure: x is added to each of the four lanes of a 128-bit vector (bits 127..0) loaded from pb, and the result is stored to pa
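Step 3 can also be written by hand with NEON intrinsics. The sketch below is one possible mapping, not the compiler's actual output: the function name add_int_vec is hypothetical, the intrinsics come from arm_neon.h, and a scalar fallback is compiled when NEON is not available so the code builds on any target. Like the original loop, it processes only the first n & ~3 elements.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds x to the first (n & ~3) elements of pb and stores them in pa. */
void add_int_vec(int32_t *restrict pa, const int32_t *restrict pb,
                 unsigned n, int32_t x)
{
    unsigned i;
#if defined(__ARM_NEON)
    int32x4_t vx = vdupq_n_s32(x);            /* x copied into all 4 lanes */
    for (i = (n & ~3u) >> 2; i; i--) {
        int32x4_t vb = vld1q_s32(pb);         /* load 4 ints from pb */
        vst1q_s32(pa, vaddq_s32(vb, vx));     /* add lane-wise, store to pa */
        pa += 4;
        pb += 4;
    }
#else
    /* Scalar fallback with the same result, for targets without NEON. */
    for (i = 0; i < (n & ~3u); i++)
        pa[i] = pb[i] + x;
#endif
}
```

Any tail elements beyond the last multiple of four are left untouched, so a caller with an arbitrary n would need a separate scalar loop for them.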
Outline
• Introduction
• Memory Systems
Memory Types
• Each defined memory region will specify a memory type
• The memory type controls the following:

• Memory access ordering rules
• Caching and buffering behaviour
• There are 3 mutually exclusive memory types:

• Normal
• Device
• Strongly Ordered
• Normal and Device memory allow additional attributes for specifying

• The cache policy
• Whether the region is Shared
• Normal memory allows you to separately configure Inner and Outer cache policies (discussed in the Caches and TCMs module)
L1 and L2 Caches
Figure: memory levels L1, L2 and L3, showing the ARM core, I-Cache RAM, D-Cache RAM, MMU/MPU, BIU, on-chip SRAM, L2 cache and off-chip memory
• Typical memory system can have multiple levels of cache

• Level 1 memory system typically consists of L1-caches, MMU/MPU and TCMs
• Level 2 memory system (and beyond) depends on the system design
• Memory attributes determine cache behavior at different levels
• Controlled by the MMU/MPU (discussed later)
• Inner Cacheable attributes define memory access behavior in the L1 memory system
• Outer Cacheable attributes define memory access behavior in the L2 memory system (if external) and beyond (as signals on the bus)
• Before caches can be used, software setup must be performed
ARM Cache Features
• Harvard Implementation for L1 caches
• Separate Instruction and Data caches
• Cache Lockdown
• Prevents line Eviction from a specified Cache Way (discussed later)
• Pseudo-random and Round-robin replacement strategies

• Unused lines can be allocated before considering replacement
• Non-blocking data cache

• Cache Lookup can hit before a Linefill is complete (also checks Linefill buffer)
• Streaming, Critical-Word-First
• Cache data is forwarded to the core as soon as the requested word is received in the Linefill buffer
• Any word in the cache line can be requested first using a ‘WRAP’ burst on the bus
• ECC or parity checking

Example 32KB ARM cache
Address fields:
Field           Bits     Width
Tag             31:13    19
Set (= Index)   12:5     8
Word            4:2      3
Byte            1:0      2
Figure: four ways, each holding lines 0 to 255; every cache line stores a tag, a valid bit (v), eight data words (7..0) and dirty bit(s) (d); a victim counter selects the line to replace
• Cache has 8 words of data in each line
• Each cache line contains Dirty bit(s)
• Indicates whether a particular cache line was modified by the ARM core
• Each cache line can be Valid or invalid
• An invalid line is not considered when performing a Cache Lookup
v - valid bit, d - dirty bit(s)
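The address split used by this example cache can be expressed directly with shifts and masks. The sketch below assumes the field boundaries given for the 32KB example (19-bit tag, 8-bit set index, 3-bit word, 2-bit byte); the struct and function names are hypothetical.

```c
#include <stdint.h>

struct cache_addr {
    uint32_t tag;   /* bits 31..13 */
    uint32_t set;   /* bits 12..5, selects one of 256 sets */
    uint32_t word;  /* bits 4..2, selects one of 8 words in the line */
    uint32_t byte;  /* bits 1..0, selects a byte within the word */
};

struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr f;
    f.tag  = addr >> 13;
    f.set  = (addr >> 5) & 0xFFu;
    f.word = (addr >> 2) & 0x7u;
    f.byte = addr & 0x3u;
    return f;
}

/* Number of ways implied by the geometry: size / (sets x line size). */
unsigned cache_ways(unsigned cache_bytes, unsigned sets, unsigned line_bytes)
{
    return cache_bytes / (sets * line_bytes);
}
```

For a 32KB cache with 256 sets and 32-byte lines, cache_ways returns 4, which matches a 4-way set-associative organisation.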
Interrupt Controller
• MPCore processors include an integrated Interrupt Controller (IC)
• Implementation of the Generic Interrupt Controller (GIC) architecture
• The IC provides:
• Configurable number of external interrupts (max 224)
• Interrupt prioritization and pre-emption
• Interrupt routing to different cores
• Enabled per CPU
• When not enabled, that CPU will use legacy nIRQ[n] and nFIQ[n] signals
Figure: external interrupt sources, the legacy IRQ and FIQ signals, the global timer, and the private timer and private watchdog feed the Interrupt Controller, which drives nIRQ and nFIQ to CPU {n}
Outline
• Introduction
• Memory Systems
Example ARM-based system
• ARM core deeply embedded within an SoC
• Design can have both external and internal memories
• Varying width, speed and size, depending on system requirements
• Can include ARM peripherals
• Can include on-chip memory from ARM Artisan Physical IP Libraries
• Elements connected using AMBA (Advanced Microcontroller Bus Architecture)
• External debug and trace via JTAG or CoreSight interface
Figure: ARM based SoC containing the ARM processor core, clocks and reset controller, DMA port, DEBUG, external memory interface (FLASH, SDRAM), on-chip memory, CoreLink interrupt controller (nIRQ, nFIQ), AMBA AXI, APB bridge, AMBA APB, CoreLink peripherals, and other and custom peripherals
Buses 101
• A bus is a multiwire path on which related information is delivered
– Address, data, and control buses
• Processor and peripherals communicate through buses
• Peripherals may be classified as:
– Arbiter, master, slave, or master/slave (bridge)
Figure: example bus topologies built from arbiters, masters, slaves and a master/slave bridge

Buses 101
• Bus masters have the ability to initiate a bus transaction
• Bus slaves can only respond to a request
• Bus arbitration is a three-step process:
– A device requesting to become a bus master asserts a bus request signal
– The arbiter continuously monitors the requests and outputs an individual grant signal to each master according to the priority scheme and the state of the other master requests at that time
– The requesting device samples its grant signal until it is granted access. It then initiates a data transfer with a slave once the current bus master releases the bus
• Arbitration mechanisms
– Fixed priority, round-robin, hybrid
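Two of the arbitration mechanisms can be sketched as pure functions over a request bitmask. The names and the convention that master 0 has the highest fixed priority are assumptions made for illustration; a real arbiter would also track bus release and timing.

```c
/* requests: bit m set means master m is asserting its bus request.
   Both functions return the master to grant, or -1 if nobody requests. */

/* Fixed priority: the lowest-numbered requesting master always wins. */
int arbitrate_fixed(unsigned requests, int n_masters)
{
    for (int m = 0; m < n_masters; m++)
        if (requests & (1u << m))
            return m;
    return -1;
}

/* Round-robin: search starts just after the master granted last time,
   so every requester is eventually served. */
int arbitrate_round_robin(unsigned requests, int n_masters, int last_granted)
{
    for (int step = 1; step <= n_masters; step++) {
        int m = (last_granted + step) % n_masters;
        if (requests & (1u << m))
            return m;
    }
    return -1;
}
```

With masters 1 and 2 both requesting, fixed priority always grants master 1, while round-robin alternates between them depending on who was served last.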
AMBA bus
• Advanced Microcontroller Bus Architecture (AMBA)
• Open-standard, on-chip interconnect specification for the connection and management of functional blocks in system-on-a-chip (SoC) designs.
• The most widely-used bus protocol
• Multiple bus protocols under AMBA.
• Advanced Extensible Interface 4 (AXI4) (available in Xilinx FPGA)
• Advanced Extensible Interface 4 – Lite (AXI4-Lite) (available in Xilinx FPGA)
• Advanced Extensible Interface 4 Stream (AXI4-Stream) (available in Xilinx FPGA)
• Advanced High-performance Bus (AHB)
• Advanced Peripheral Bus (APB)
• ….
AXI protocols
• AMBA 4.0 (2010) includes the latest version of AXI known as AXI4
• AXI4: High-performance interface. Can support multiple masters
• AXI4-Lite: A light-weight variant of the interface, used for memory mapped single transactions. Single master.
• AXI4-Stream: A light-weight variant of the interface, used for streaming data
• Characteristics:
• Read and Write data channels are separate
• For each channel, address/control phases are separate from data phases
• byte strobes enable unaligned data transfers
• multiple outstanding addresses can be issued
• transactions can be completed out-of-order
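Byte strobes mark which byte lanes of the write data bus carry valid data, which is what makes unaligned transfers possible. The sketch below computes the strobes for the first beat of a write on a 32-bit (4-lane) data bus; the function name and the fixed bus width are assumptions for illustration.

```c
#include <stdint.h>

/* Strobe bits for the first beat of a write on a 4-byte-wide data bus.
   Bit k set means byte lane k carries valid data.
   addr   : byte address of the first byte written
   nbytes : number of bytes requested, 1..4
   Lanes that would fall beyond the bus width belong to the next beat. */
uint8_t first_beat_wstrb(uint32_t addr, unsigned nbytes)
{
    unsigned lane = addr & 3u;                     /* starting byte lane */
    unsigned mask = ((1u << nbytes) - 1u) << lane; /* nbytes ones, shifted */
    return (uint8_t)(mask & 0xFu);                 /* keep the 4 bus lanes */
}
```

An aligned word write drives all four strobes (0xF), while a word write starting at an address ending in 1 drives only lanes 1 to 3 (0xE) in its first beat.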
An Example AMBA System
Figure: a high-performance ARM processor, a high-bandwidth external memory interface, high-bandwidth on-chip RAM and a DMA bus master share the AXI4 bus; an APB bridge connects it to the APB bus with UART, Timer, Keypad and PIO
• AXI4: high performance, pipelined, burst support, multiple bus masters
• APB: low power, non-pipelined, simple interface
AXI Multi-Master System Design
Figure: two masters (an ARM core and Master 2) connect through their master interfaces to the inter-connection architecture, which connects to the slave interfaces of Slave #1 to Slave #4
AXI protocols- Write and Read Channels
• Separate Write and Read Channels between master and slaves
• Separate Address and Data Buses
• Out of order completion
• Data buses 8-1024 bits
• Bursts 1-16 data transfers
AXI4 transaction (1)
• AXI4 write burst transaction

• AWVALID (M->S)
• AWREADY (S->M)
AXI4 transaction (2)
• AXI4 read burst transaction

• ARVALID (M->S)
• ARREADY (S->M)
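Each AXI channel uses a VALID/READY pair like the ones listed above: the source asserts VALID, the destination asserts READY, and a transfer happens on a clock edge where both are high. The sketch below checks that rule over sampled signal traces; the function names and the array-based trace format are hypothetical.

```c
#include <stdbool.h>

/* A transfer takes place in a cycle only when VALID and READY are both high. */
bool channel_transfer(bool valid, bool ready)
{
    return valid && ready;
}

/* Given per-cycle samples of VALID and READY, return the index of the
   first cycle in which the transfer completes, or -1 if it never does. */
int first_transfer_cycle(const bool *valid, const bool *ready, int n_cycles)
{
    for (int cycle = 0; cycle < n_cycles; cycle++)
        if (channel_transfer(valid[cycle], ready[cycle]))
            return cycle;
    return -1;
}
```

For example, if the master raises ARVALID in cycle 1 and the slave raises ARREADY again only in cycle 3, the address is accepted in cycle 3.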
