ARM An ARMv8.1-M Performance Monitoring User Guide
Version 1.186
Document ID: ARM051-799564642-251
Non-Confidential-Published
This document is protected by copyright and other related rights and the practice or implementation of the information contained in this
document may be protected by one or more patents or pending patent applications. No part of this document may be reproduced in any
form by any means without the express prior written permission of Arm. No license, express or implied, by estoppel or otherwise to any
intellectual property rights is granted by this document unless specifically stated.
Your access to the information in this document is conditional upon your acceptance that you will not use or permit others to use the
information for the purposes of determining whether implementations infringe any third party patents.
THIS DOCUMENT IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY,
INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR
FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE DOCUMENT. For the avoidance of doubt, Arm makes no representation
with respect to, and has undertaken no analysis to identify or understand the scope and content of, patents, copyrights, trade secrets, or
other rights.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY
DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE
THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF ARM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
This document consists solely of commercial items. You shall be responsible for ensuring that any use, duplication or disclosure of this
document complies fully with any relevant export laws and regulations to assure that this document or any portion thereof is not
exported, directly or indirectly, in violation of such export laws. Use of the word “partner” in reference to Arm’s customers is not
intended to create or refer to any partnership relationship with any other company. Arm may make changes to this document at any time
and without notice.
If any of the provisions contained in these terms conflict with any of the provisions of any click through or signed written agreement
covering this document with Arm, then the click through or signed written agreement prevails over and supersedes the conflicting
provisions of these terms. This document may be translated into other languages for convenience, and you agree that if there is any
conflict between the English version of this document and any translation, the terms of the English version of the Agreement shall prevail.
The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the
US and/or elsewhere. All rights reserved. Other brands and names mentioned in this document may be the trademarks of their
respective owners. Please follow Arm’s trademark usage guidelines at [Link]
Copyright © 2020 Arm Limited (or its affiliates). All rights reserved.
LES-PRE-20349
Release Information
Document History
Contents
1 About this document
1.1. References
1.2. Terms and Abbreviations
1.3. Scope
Conventions and Feedback
Feedback on this product
Feedback on documentation
Other information
2 Introduction
2.1. Profiling Overview
2.2. Profiling Armv8-M systems
2.2.1 DWT profiling
2.2.2 ITM profiling
2.2.3 ETM profiling
2.2.4 SysTick timer
2.3. The PMU Profiling Feature introduced in the Armv8.1-M Architecture
2.3.1 Cycle counter and event counters
2.3.2 Counting cycles
2.3.3 Counting events
2.4 Performance Monitoring in other Arm Systems
2.4.1 Performance monitoring in A-profile and R-profile systems
2.4.2 Performance monitoring in Arm Neural Processing Units
2.4.3 Performance monitoring in Arm Graphics and Multimedia Processors
Term Meaning
1.3. Scope
This document describes how to use the PMU as defined by the Armv8.1-M Architecture.
Convention Meaning
monospace denotes text that can be entered at the keyboard, such as commands, file and program names, and source
code.
monospace denotes a permitted abbreviation for a command or option. The underlined text can be entered instead of
the full command or option name.
monospace italic denotes arguments to commands and functions where the argument is to be replaced by a specific value.
monospace bold denotes language keywords when used outside example code.
italic highlights important notes, introduces special terminology, denotes internal cross-references, and
citations.
bold highlights interface elements, such as menu names. Also used for emphasis in descriptive lists, where
appropriate, and for Arm® processor signal names.
Feedback on documentation
If you have comments on the documentation, e-mail errata@[Link]. Give:
The title.
The number, [Document ID Value], [Issue].
If viewing online, the topic names to which your comments apply.
If viewing a PDF version of a document, the page numbers to which your comments apply.
A concise explanation of your comments.
Arm also welcomes general suggestions for additions and improvements.
Arm periodically provides updates and corrections to its documentation on the Arm Information Center, together with knowledge articles
and Frequently Asked Questions (FAQs).
Other information
Arm Developer, [Link]
Arm Documentation, [Link]
Arm Support and Maintenance, [Link]
Arm Glossary, [Link]
2 Introduction
2.1. Profiling Overview
There are several reasons why a user might want to profile their application. These reasons include:
The M-profile architecture provides different features to help users carry out such tasks, including the Performance Monitoring Unit
(PMU), introduced in the Mainline variant of the Armv8.1-M Architecture. This application note demonstrates a variety of use cases for
the PMU, as described in section 7 Using the PMU in your application and section 8 PMU Profiling Example.
Prior to the Armv8.1-M PMU, the M-profile architecture provided the following profiling features:
1
PMU_CYCCNT is an alias of DWT_CYCCNT.
2
Counts additional cycles required to execute multicycle instructions and instruction fetch stalls.
Debug tools and the application itself can use these counters for profiling.
The ITM requires a debugger to be connected to the system to retrieve and decode the information contained in the trace packets. As
described in the CoreSight Components Technical Reference Manual, the ITM and Serial Wire Output (SWO) can be used to form a Serial
Wire Viewer (SWV). Debug tools are capable of displaying instrumentation trace data packets transmitted via a SWV, like in Figure 2-1.
Although this type of instrumentation profiling or trace capture is not very intrusive compared with halting the system, it does require
some additional code to be added to the program, which could affect timing and measurements.
3
Increments on the additional cycles required to execute all load or store instructions.
Similar to the ITM, the ETM generates trace packets with timestamp information for different operations that have executed on the M-
profile target system. Debug tools that support the ETM can retrieve, decode, and display the execution history of the application. When
debug information is present in the application, a debug tool can deduce further profiling information such as thread or function
execution time statistics, and code coverage. The Keil MDK Trace Data window is an example of such a debug tool, which is described in
more detail on the Keil website in the µVision User's Guide:
[Link]
As described in the Whitepaper, Introduction to the Armv8.1-M Architecture, the Armv8.1-M architecture is an enhancement to the
original Armv8-M architecture and brings many additional features, including an M-Profile Vector Extension (MVE) for signal processing
and machine learning applications, also known as Helium. New features like MVE, the Low Overhead Branch (LOB) Extension, and half-
precision floating-point instructions, can significantly improve the performance of applications running on an Armv8.1-M implementation
such as the Cortex-M55 processor.
An Armv8.1-M PMU includes counters for counting cycles and a wide range of other events while an application is running on the target
platform. The PMU counters can be read by software at runtime or when the processor is being debugged. These counters are a useful
and convenient resource for measuring the performance of an M-profile system, including the M-profile features added in Armv8.1-M.
Note
An implementation that does not include the Main Extension does not support the Performance Monitoring Extension.
• A 32-bit Cycle Counter Register that is hard-wired to count CPU cycles. This register is an alias of the Cycle Counter Register in
the DWT (DWT_CYCCNT) and is always present in an Armv8.1-M implementation, like Cortex-M55, configured with the PMU.
• Up to 31 Event Counters. A minimum number of two Event Counters is permitted for an implementation that includes the PMU.
The Cortex-M55 processor includes eight Event Counters.
• Use a 16-bit Event Counter to count an event related to cycles. One such event is CPU_CYCLES, which, like CYCCNT, increments
every CPU cycle. There are also other more specific events relating to cycle counting. For example, there are different events
that can be used to count certain stalls in the processor pipeline. Additionally, there is an event that counts bus cycles. It is also
possible to form a 32-bit cycle counter by chaining 16-bit counters together (see section 7.6 Chaining Event Counters to create a
32-bit Counter).
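As a rough illustration of the chaining idea, the following host-side sketch shows how the values of a chained pair of 16-bit counters combine into one 32-bit count. It assumes the even-numbered counter supplies the low halfword and the odd-numbered counter, which counts overflows of its even neighbour, supplies the high halfword. The function name is hypothetical and is not part of CMSIS-Core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a CMSIS-Core function): combine a chained pair of
 * 16-bit event counter values into one 32-bit count.
 *   low16  - value read from the even-numbered counter
 *   high16 - value read from the odd-numbered counter that counts the
 *            overflows of the even-numbered counter */
static uint32_t chained_count(uint32_t low16, uint32_t high16)
{
    /* Event counters are 16 bits wide, so both inputs must fit a halfword. */
    assert(low16 <= 0xFFFFu && high16 <= 0xFFFFu);
    return (high16 << 16) | low16;
}
```

For instance, a high halfword of 2 and a low halfword of 5 represent 2 x 65536 + 5 counted events.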
The Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile describes the latest version of the PMU specification,
PMUv3, and further information about the PMU relevant to Armv8-R architecture can be found in the ARM Architecture Reference
Manual Supplement ARMv8, for the ARMv8-R AArch32 architecture profile.
Some of the Armv8-M Performance Monitoring Extension is based on the A-Profile and R-Profile PMU specifications, but the Armv8-M
Performance Monitoring Extension specification is a standalone specification that does not belong to the PMUv3 specification.
M-profile implementations might be used in a larger SoC that includes A-profile or R-profile implementations. In such implementations,
separate programming APIs will need to be used for programming the respective PMUs.
The Arm Streamline Performance Analyzer tool can be programmed to generate a series of charts for visualizing information generated
by A-profile and R-profile PMU counters. These charts can help with various aspects of Cortex-A and Cortex-R system profiling such as
checking the effectiveness of the caches and identifying how well the system bus is being utilized. This assistance helps programmers to
profile their code quicker and to get the best performance out of the Arm processors they are working with.
Ethos-U55 has its own self-contained PMU with a maximum of four 32-bit event counters and one 48-bit cycle counter. This is a separate
IP block and has no interaction with an Armv8.1-M implementation’s PMU.
There is a separate API for programming the Ethos-U55 PMU, which works in a similar way to the CMSIS-Core API for Armv8.1-M and
Cortex-M55. However, the list of PMU events is very different.
Configuring the Ethos-U55 PMU registers can be achieved from software running on an Armv8.1-M implementation, or from a host
computer using Arm Mbed DAPLink or the Arm Streamline Performance Analyzer tool.
A typical use case for the Ethos-U55 PMU would be to measure the performance related to the AXI bus interface. For example, a user
might clear the PMU counters immediately before a network operation begins, and then read them again after the execution has
finished. The counters would be used to measure the performance and detect potential bottlenecks.
The Arm Streamline Performance Analyzer tool can be programmed to generate a series of charts for visualizing information generated
by a GPU’s performance counters. These charts can help with various aspects of GPU profiling such as identifying the cause of heavy
rendering loads or workload inefficiencies. This assistance helps GPU programmers to profile their code quicker and to get the best
performance out of the graphics processor they are working with.
Arm also offers several different Armv8.1-M and Cortex-M55 platforms and simulation models that support the PMU.
3.2 Hardware Platform and Simulation Model Support for the PMU
As of August 2020, the following platforms and simulation models support the Armv8.1-M PMU.
Some FVPs are packaged as part of software development tools like Arm Development Studio and Keil MDK.
These toolkits provide connection dialogs to allow the user to connect to the FVP through an IDE.
The Fast Models tool provides an environment to design and create custom virtual platforms, like FVPs, for early
software development.
The Fast Models tool provides different types of ready-made M-profile Fast Models that support the PMU:
• Armv8.1-M Architecture Fast Model available in Fast Models 11.6 and later
• Cortex-M55 CPU Fast Model available in Fast Models 11.10 and later.
[Link]
There is also a Corstone-300 Ecosystem FVP available, which includes support for the PMU:
[Link]
platforms-software/arm-ecosystem-fvps
Note
FVPs support a limited number of PMU events and are not cycle accurate.
Cycle Models
Cycle Models are 100% cycle accurate models of Arm IP for performance analysis and IP selection.
[Link]
[Link]

RTL Simulators from Arm EDA tool vendors
See the Release Note and Integration and Implementation information for your Armv8.1-M implementation for
further information on RTL simulator support. The Arm Cortex-M55 Release Note and Arm Cortex-M55
Processor Integration and Implementation Manual are confidential documents that are only available to
licensees.

Arm MPS3 FPGA Prototyping Board
The Arm MPS3 platform provides a way to load pre-built Arm sub-system images into its FPGA. More
information about the Arm MPS3 platform is available at:
[Link]
prototyping-boards/mps3
An SSE-300 FPGA image is due to be released sometime in 2020. SSE-300 is a CoreLink subsystem that
includes a Cortex-M55 processor. More information about SSE-300 is available at:
[Link]
sse-300-subsystem

Third party platforms
Arm works closely with its partners who license Arm technology. Arm partners who have licensed Armv8.1-M
technology, such as Cortex-M55, typically develop their own platforms. Such platforms might or might not be
publicly available.
Table 3-1 Hardware Platform and Simulation Support for the PMU
CMSIS-Core is part of the Cortex Microcontroller Software Interface Standard (CMSIS) and provides a standardized API for different
aspects of software development for the Cortex-M devices, including:
CMSIS-Core source code and documentation are available from the following CMSIS GitHub repository:
• [Link]
• [Link]
• [Link]
• [Link]
The following two macros must be set appropriately before using these headers to program the PMU:
• __PMU_PRESENT
• __PMU_NUM_EVENTCNT
These macros are defined by the CMSIS-Core device header file, which is normally provided by Arm microcontroller device vendors. The
device header file is typically available as part of a CMSIS Device Family Pack (DFP) that includes other files, such as the startup and
initialization code mentioned in the list above, enabling the user to develop a CMSIS-compliant embedded application. The DFP, which is
essentially an archive file, is created by the device vendor.
Arm also acts as a device vendor by providing some device headers and DFPs targeted at its models and platforms described in section
3.2 Hardware Platform and Simulation Model Support for the PMU. Two generic Armv8.1-M/Cortex-M55 CMSIS-Core device headers can
be found at:
• [Link]
software/CMSIS_5/blob/develop/Device/ARM/ARMv81MML/Include/ARMv81MML_DSP_DP_MVE_FP.h
• [Link]
DFPs are supported by different embedded development tools such as Keil MDK and Arm DS. These packs/archives can be downloaded
from the following repository on the Arm website.
• [Link]
Also, some IDEs provide an integrated ‘Pack Installer’ to make it even easier to download, install, and use the DFPs and the CMSIS-Core
files contained within.
• Arm Compiler 6.
• GNU Arm Embedded Toolchain.
• IAR C/C++ Compiler.
Also, the PMU can issue an event counter trace packet each time the lower 8 bits of a counter overflow. This only occurs when a counter
increments naturally and not when it is written to directly by software or using a debugger. Additionally, the PMU can serve as an event
source for the Cross Trigger Interface (CTI), which might be useful for debugging, tracing, and profiling systems with multiple processors.
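To make the trace packet rule concrete, the sketch below is a simple host-side model that counts how often the lower 8 bits of a counter wrap while the counter increments naturally. It only models the arithmetic described above; the function name is hypothetical and the model says nothing about packet format or timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model: number of times the low 8 bits of a counter
 * wrap from 0xFF to 0x00 while the counter is incremented 'increments' times
 * starting from 'start'. Direct writes to the counter are not modelled. */
static uint32_t low_byte_overflows(uint16_t start, uint32_t increments)
{
    /* Keeps the addition below from wrapping a 32-bit value. */
    assert(increments <= 0xFFFFFF00u);
    return ((uint32_t)(start & 0xFFu) + increments) >> 8;
}
```

Under this model, a counter starting at 0x00FE produces its first wrap of the low byte on the second increment.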
The Registers index section of the Armv8-M architecture provides a complete list of registers that can be implemented in an Armv8-M
implementation. This section shows that the block of system memory for the PMU registers begins at address 0xE0003000. The PMU
registers and their associated addresses are listed with links to more detailed register descriptions.
0xE0003FF0 PMU_CIDR0 Performance Monitoring Unit Component Identification Register 0
0xE0003FF4 PMU_CIDR1 Performance Monitoring Unit Component Identification Register 1
0xE0003FF8 PMU_CIDR2 Performance Monitoring Unit Component Identification Register 2
0xE0003FFC PMU_CIDR3 Performance Monitoring Unit Component Identification Register 3
Table 4-1 PMU Registers and Address Mappings
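The address mapping can be checked with a little arithmetic. The sketch below assumes only what is stated above, namely the PMU base address 0xE0003000 and the four PMU_CIDRn addresses, and derives each address as base plus offset. The macro and function names are hypothetical and are not the CMSIS-Core definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Start of the PMU register block, as stated in the text. */
#define PMU_BASE_ADDR 0xE0003000u

/* Hypothetical helper: address of PMU_CIDRn for n = 0..3.
 * PMU_CIDR0 sits at offset 0xFF0 and the registers are one word apart. */
static uint32_t pmu_cidr_addr(unsigned n)
{
    assert(n < 4u);
    return PMU_BASE_ADDR + 0xFF0u + 4u * n;
}
```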
The PMU Type Register, PMU_TYPE, helps software identify information about a device’s PMU configuration. For example, PMU_TYPE
can be read to find out the number of counters available in the PMU.
Another important register is the PMU Control Register, PMU_CTRL. PMU_CTRL can be used to enable/disable the PMU and reset the
PMU counters.
A PMU Event Type Register, PMU_EVTYPERn, can be programmed by software to determine which event a specific counter is monitoring.
The PMU Count Enable Set Register, PMU_CNTENSET, and PMU Count Enable Clear Register, PMU_CNTENCLR, can be used to enable and
disable individual event counters.
A PMU Event Counter Register, PMU_EVCNTRn, can be read by software to determine the current count of an event associated with that
counter. There is also the PMU Cycle Count Register, PMU_CYCCNT, which is dedicated to counting cycles. These registers can be reset to
zero and can also be written to so that they have a starting value.
Some of the other PMU registers are covered throughout this document such as the Performance Monitoring Unit Software Increment
Register, PMU_SWINC, as well as registers related to overflow and interrupt generation.
The PMU counters count upwards. When a PMU counter overflows, the action taken depends on how the PMU is configured. For
example, it can be configured to generate an interrupt upon an overflow. These use cases are described in more detail in section 7 Using
the PMU in your application.
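Because the counters count upwards and can be given a starting value, an overflow can be arranged to occur after a chosen number of events. The sketch below shows the arithmetic for a 16-bit event counter; it is a host-side illustration under that assumption, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: starting value that makes a 16-bit up-counter
 * overflow (wrap from 0xFFFF to 0x0000) after exactly n increments. */
static uint16_t preload_for_overflow_after(uint32_t n)
{
    /* A 16-bit counter can count between 1 and 65536 events before wrapping. */
    assert(n >= 1u && n <= 0x10000u);
    return (uint16_t)(0x10000u - n);
}
```

For example, a starting value of 0xFFFF overflows on the very next event, while a starting value of zero overflows after 65536 events.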
Before using the PMU, software needs to ensure that trace is enabled via the Debug Exception and Monitor Control Register, DEMCR.
Figure 4-1 shows a typical usage flow for configuring the PMU. Section 7 provides code examples on how to use the PMU in your
application.
[Figure 4-1: Typical usage flow for configuring the PMU]
1. Start.
2. Configure DEMCR to enable trace.
3. Program PMU_EVTYPERn to monitor a specific event for event counter n.
4. Configure PMU_CNTENSET to enable counter n.
5. Configure PMU_CTRL to enable PMU and start counting.
6. Execute code sequence for profiling/monitoring.
7. Stop.
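The ordering in Figure 4-1 can be sketched on a host with a plain structure standing in for the registers. This is only a model: the structure, macro, and function names are hypothetical, eight event counters are assumed, and on a real target the memory-mapped registers would be accessed through the CMSIS-Core PMU support code instead. The bit positions used are DEMCR.TRCENA (bit 24) and the PMU_CTRL enable bit (bit 0).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for the registers used in the flow. */
typedef struct {
    uint32_t demcr;       /* models DEMCR */
    uint32_t evtyper[8];  /* models PMU_EVTYPERn (eight counters assumed) */
    uint32_t cntenset;    /* models PMU_CNTENSET */
    uint32_t ctrl;        /* models PMU_CTRL */
} pmu_model_t;

#define MODEL_DEMCR_TRCENA (1u << 24) /* global trace enable */
#define MODEL_PMU_CTRL_E   (1u << 0)  /* PMU enable */

/* Apply the configuration steps of the usage flow, in order, to the model. */
static void pmu_model_start(pmu_model_t *m, unsigned n, uint16_t event)
{
    assert(m != 0 && n < 8u);
    m->demcr |= MODEL_DEMCR_TRCENA; /* enable trace */
    m->evtyper[n] = event;          /* select the event for counter n */
    m->cntenset |= (1u << n);       /* enable counter n */
    m->ctrl |= MODEL_PMU_CTRL_E;    /* enable the PMU and start counting */
}
```

Calling pmu_model_start() with counter 2 and event number 0x0008 (INST_RETIRED) leaves the model with trace enabled, counter 2 selected and enabled, and the PMU enable bit set. The code to be profiled would run after this point.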
Furthermore, the CMSIS-Core header file pmu_armv8.h and relevant core_<cpu>.h header contain macros corresponding to each
supported event with some brief descriptions. These macros can be used by software, as shown in section 7 Using the PMU in your
application and section 8 PMU Profiling Example.
The section titled List of supported architectural and microarchitectural events in the Armv8-M Architecture Reference Manual
(DI0553B.k) provides a full list and descriptions of the supported events that can be counted.
Note
The number of supported events may change in future revisions of the architecture.
For example, the architectural event, EXC_TAKEN (Exception taken), can be used to count each time any implementation, such as Cortex-
M55, takes an exception. The behavior of this event would work the same on any other Armv8.1-M implementation.
Good examples of microarchitectural events are any events relating to caches or branch prediction, since these features may vary across
different implementations. For example, Cortex-M55 does not support branch prediction, but it is possible for another Armv8.1-M
implementation to support it. Also, even if features like these are supported in a particular implementation, they might not be
configured/enabled in the RTL or by software. Additionally, even if such features are implemented, they may be configured differently.
For example, there are different RTL configuration options for the sizes of the caches in Cortex-M55.
There are also some architectural and microarchitectural events that are not supported by Cortex-M55 (see section 4.3.3 Unsupported
architectural and microarchitectural events).
• Instruction execution.
• Instruction speculation.
• Operation execution.
• Operation speculation.
• MVE instruction execution.
• MVE instruction speculation.
• External memory accesses.
• Tightly Coupled Memory (TCM) accesses.
• Cache behavior.
• Branch prediction.
• Exceptions.
• Security state transitions.
• Pipeline stalls.
• Chaining counters.
• CPU Cycles.
• Debug and trace events.
• Software increment.
• Memory errors.
• Error Correcting Code (ECC) in the TCM or Cache memories - events beginning with ECC_.
• No Write-Allocate mode - event prefixed with NWA (NWAMODE_ENTER and NWAMODE).
• S-AHB accesses - SAHB_ACCESS.
• P-AHB accesses - PAHB_ACCESS.
• M-AXI accesses - AXI_WRITE_ACCESS and AXI_READ_ACCESS.
• Data cache prefetching - PF_LINEFILL, PF_CANCEL and PF_DROP_LINEFILL.
• Internal watchdog - DOSTIMEOUT_DOUBLE and DOSTIMEOUT_TRIPLE.
On Cortex-M55:
- PLD and PLDW are both fully supported on Cortex-M55. (PLDW requests a line-fill for a cache miss in a Write-allocate region).
- PLI is not operationally supported and acts like a NOP. PLI has no effect on LD_RETIRED.
0x0007 A ST_RETIRED 5
0x0008 AR INST_RETIRED 6
0x0009 A EXC_TAKEN 7
Exception taken
0x000A A EXC_RETURN 8
On Cortex-M55:
PMU <x>_RETIRED events only count operations which
complete so they don’t include any cases where an
instruction is interrupted by an exception or debug event
(including BKPT). However, SVC is included in this
event as it functionally behaves like software-controlled
branch.
On Cortex-M55:
This event only counts true immediate branches i.e. B
#imm, CB{N}Z #imm
0x000E A BR_RETURN_RETIRED 11
0x000F A UNALIGNED_LDST_RETIRED 12
0x0013 M MEM_ACCESS 16
IMPLEMENTATION DEFINED.
0x001A M MEMORY_ERROR See also sections 4.4.2 Level 1 cache events and 4.4.3 20
TCM events, and section 4.3.2 Implementation-defined
Local memory error events.
0x001D M BUS_CYCLES 22
Bus cycle
0x001E A CHAIN 23
For an odd numbered counter, increment when an overflow occurs on the preceding even-numbered counter on the same PE
See section 7.6 Chaining Event Counters to create a 32-bit Counter.
Instruction architecturally executed, mispredicted branch
Therefore, on Cortex-M55, this event counts all retired not-taken branches.
On Cortex-M55:
If there are no instructions available from the fetch stage
of the processor pipeline (into the main decode/execution
stages), the processor considers the front-end of the
processor pipeline as being stalled.
On Cortex-M55:
If there is an instruction available from the fetch stage of
the pipeline but it cannot be accepted by the decode stage
of the processor pipeline, the processor considers the
back-end of the processor pipeline as being stalled.
0x0036 M LL_CACHE_RD 29
Last level data cache read
The last level cache on Cortex-M55 is the level 1 cache. See section 4.3.2.
0x0037 M LL_CACHE_MISS_RD 30
Last level data cache read miss
The last level cache on Cortex-M55 is the level 1 cache. See section 4.4.2 Level 1 cache events.
Operation retired
Operation speculated
0x003C M STALL 34
No operation sent for execution
This general case stall event counts when there is no instruction executing this cycle. This could be due to STALL_FRONTEND or
STALL_BACKEND, or simply a register/memory hazard.
0x0040 M L1D_CACHE_RD 38
0x0100 M LE_RETIRED 39
0x0108 M LE_CANCEL 43
0x0114 A SE_CALL_S 45
0x0115 A SE_CALL_NS 46
0x0118 A DWT_CMPMATCH0 47
DWT comparator 0 match
See section 7.7.8 Triggering an overflow after the core has executed code ‘N’ times.
0x0119 A DWT_CMPMATCH1 48
DWT comparator 1 match
See section 7.7.8 Triggering an overflow after the core has executed code ‘N’ times.
0x011A A DWT_CMPMATCH2 49
DWT comparator 2 match
See section 7.7.8 Triggering an overflow after the core has executed code ‘N’ times.
0x011B A DWT_CMPMATCH3 50
DWT comparator 3 match
See section 7.7.8 Triggering an overflow after the core has executed code ‘N’ times.
• Level 1 instruction and data caches can be configured for different sizes.
• The level 1 instruction cache can be enabled using the CMSIS-Core function SCB_EnableICache().
• The level 1 data cache can be enabled using the CMSIS-Core function SCB_EnableDCache().
• The Cortex-M55 implementation-defined events, PF_LINEFILL, PF_CANCEL and PF_DROP_LINEFILL, refer to the level 1 data
cache prefetcher.
The Cortex-M55 also has a bit in its Auxiliary Control Register (ACTLR) to disable write allocation. Disabling write allocation is generally
worse for performance but can improve performance in some situations where allocating on writes is undesirable, such as executing the
C standard library memset() or whilst initializing memory before the main() program begins.
• TCMs are implementation defined features of Cortex-M55 and are not described by the Armv8-M architecture.
• The instruction and data TCMs can be configured for different sizes.
• The instruction TCM can be configured to be enabled or disabled out-of-reset.
• The TCMs can also be enabled or disabled by software by writing to the Cortex-M55's ITCM Control Register and DTCM Control
Registers.
The EPU is disabled at reset. Software can typically enable the EPU in the SystemInit() function of the CMSIS-Core device file
system_<device>.c. See section 3.3 CMSIS Programming API for the PMU for further information about CMSIS.
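\[ \mathrm{IPC} = \frac{\mathrm{INST\_RETIRED}}{\mathrm{CPU\_CYCLES}} \]
MIPS (retired)
\[ \mathrm{MIPS}_{\mathrm{retired}} = \frac{\mathrm{INST\_RETIRED}}{t_{\mathrm{elapsed}} \times 10^{6}} \]
Where \(t_{\mathrm{elapsed}}\) is the elapsed time, in seconds.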
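The two metrics above are straightforward to compute once the counter values have been read. The sketch below takes the INST_RETIRED and CPU_CYCLES counts and the elapsed time as plain arguments; the function names are hypothetical, and how the counts and the elapsed time are obtained is left to the application.

```c
#include <assert.h>
#include <stdint.h>

/* Instructions per cycle: INST_RETIRED divided by CPU_CYCLES. */
static double ipc(uint32_t inst_retired, uint32_t cpu_cycles)
{
    assert(cpu_cycles != 0u); /* avoid division by zero */
    return (double)inst_retired / (double)cpu_cycles;
}

/* Millions of retired instructions per second:
 * INST_RETIRED divided by (elapsed time in seconds x 10^6). */
static double mips_retired(uint32_t inst_retired, double t_elapsed_s)
{
    assert(t_elapsed_s > 0.0);
    return (double)inst_retired / (t_elapsed_s * 1e6);
}
```

For example, 500 retired instructions over 1000 cycles gives an IPC of 0.5, and 2,000,000 retired instructions in one second gives 2 MIPS.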
Note that the default value is not valid for a Cortex-M55, so this value
should be changed to either 8 or 0 to match the RTL configuration
options.
Table 5-1 Cortex-M55 FVP and Armv8.1-M AEM PMU parameters
For example, when using cpu0 on the Cortex-M55 FVP, start the fast model with the following parameters to ensure the PMU is present
with eight counters:
Arm recommends using CMSIS-Core PMU support code, which is written in C, to access the PMU registers. If accessing the PMU registers
in assembly language, please note that these registers are word accessible only. Halfword and byte accesses are UNPREDICTABLE.
6.2 Debug
The PMU counters do not increment when:
The Armv8-M Architecture Reference Manual also lists some restrictions when the PMU is used at the same time as the Armv8.0-M DWT
Performance Monitors.
6.3 Security
The PMU registers are not banked between Security states. This means that both secure privileged and non-secure privileged code can
access the PMU registers.
The PMU registers are accessible to unprivileged DAP requests when either DAUTHCTRL_S.UIDAPEN or
DAUTHCTRL_NS.UIDAPEN is set.
The PMU_CTRL.DP bit is an alias of the DWT_CTRL.CYCDISS bit, which is set to zero on a Cold reset. When PMU_CTRL.DP is zero, the PMU
cycle counter increments regardless of the Security state of the PE. Therefore, to ensure that the cycle counter does not count in Secure
state, set PMU_CTRL.DP (or DWT_CTRL.CYCDISS) to 1. This can be achieved with the following CMSIS-Core compliant code:
PMU->CTRL |= PMU_CTRL_CYCCNT_DISABLE_Msk;
Software can simply use the CMSIS-Core API for the PMU (in pmu_armv8.h) to find out the number of available counters, for
example:
uint32_t num_event_counters;
num_event_counters = ((PMU->TYPE) & PMU_TYPE_NUM_CNTS_Msk);
The PMU can be disabled again with the following CMSIS-Core function call:
The user also needs to ensure that trace is enabled inside the processor in order to make use of the PMU. The following CMSIS-Core code
can be used as a global enable for the DWT, PMU, and ITM features:
/* Enable Trace */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
A user might also find it useful to keep an incremental count, so that each time the code being measured is run, the program keeps track
of the overall combined count. For example:
The PMU Cycle Counter is set to an unknown value on a Warm reset. Therefore, the Cycle Counter should be reset before it is used for
the first time. Whether the cycle counter needs resetting more than once will depend on the application. For example, reset the cycle
counter to measure the performance of a new segment of code.
Note: Some project development environments make use of the Cycle Counter Register for debug features. For example, Keil MDK uses
the Cycle Counter register for its Event Recorder and the States register:
[Link]
[Link]
Therefore, using the CMSIS PMU API to modify the Cycle Counter Register may affect the usability of such debug features.
/*
Configure Event Counter Register 0 to count instructions retired
Configure Event Counter Register 1 to count L1 D-Cache misses
*/
ARM_PMU_Set_EVTYPER(0, ARM_PMU_INST_RETIRED);
ARM_PMU_Set_EVTYPER(1, ARM_PMU_L1D_CACHE_MISS_RD);
/* Get number of instructions retired and number of L1 D-Cache misses (on read) */
instructions_retired_count = ARM_PMU_Get_EVCNTR(0);
l1_dcache_miss_count = ARM_PMU_Get_EVCNTR(1);
Use cases will vary. This example simply reads Event Counters 0 and 1 into the variables instructions_retired_count
and l1_dcache_miss_count. These variables provide one-time-only counts for whatever code runs between
enabling and disabling the counters. There is a wide range of ways to analyze these values. For example, the
values could be printed to a display or saved to a file on a storage device. The user could also add the variables to a Watch Window in a
debugger and configure the debugger to break program execution when they reach a certain data value range.
A user might also find it useful to keep an incremental count, so that each time the code being measured is run, the program keeps track
of the overall counts combined. For example:
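A minimal sketch of such a running total follows. It assumes the event counters are reset before each measured run and do not overflow within a run, so each sample fits in 16 bits. The type pmu_totals_t and the function pmu_totals_add() are hypothetical names for illustration and are not part of the CMSIS-Core API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical running totals, wide enough to hold many 16-bit samples. */
typedef struct {
    uint64_t instructions_retired_total;
    uint64_t l1_dcache_miss_total;
} pmu_totals_t;

/* Add the counts from one measured run to the running totals. */
static void pmu_totals_add(pmu_totals_t *totals,
                           uint16_t instructions_retired_count,
                           uint16_t l1_dcache_miss_count)
{
    assert(totals != NULL);
    totals->instructions_retired_total += instructions_retired_count;
    totals->l1_dcache_miss_total += l1_dcache_miss_count;
}
```

On a target, the two sample arguments would typically be the values returned by ARM_PMU_Get_EVCNTR(0) and ARM_PMU_Get_EVCNTR(1) after the counters have been disabled.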
The PMU Event Counters are set to an unknown value on a Warm reset. Therefore, the user should reset the Event Counters before using
them for the first time. Whether the counters need resetting more than once, or disabling, will depend on the application. For example, it
might be desirable to reset the counters when a new thread becomes active. When reading multiple counter values, slightly more
accurate results might be observed by disabling the counters before reading their current value.
Configuring and enabling an event counter so that it can be incremented by software can be achieved by using similar code as in section
7.4 Using 16-bit Event Counters. The ARM_PMU_CNTR_Increment() function can then be used to write to the relevant bit in the
Software Increment Register to increment the counter, before it’s read by software again sometime later. For example:
The description from the Armv8-M Architecture Reference Manual for SW_INCR says:
“If the PE performs two Architecturally executed writes to the PMU_SWINC register without an intervening Context synchronization
event, then the counter is incremented twice.”
What this means is that a Context synchronization event, e.g., Instruction Synchronization Barrier (ISB), is not required between two
writes to the Software Increment Register to guarantee that the related Event Counter increments twice.
To make it less likely that you need to handle a counter overflow, it is possible to chain an odd-numbered counter with a preceding even-
numbered counter to form a 32-bit counter. For example, software could chain together Event Counter 7 with Event Counter 6 to form a
32-bit counter. This also means that the system can be configured by software to have a mixture of 16-bit and 32-bit counters.
The example below shows how you can create another 32-bit cycle counter from two Event Counter registers:
/*
Initialize variables for:
- lower 16 bits of cycle count
- upper 16 bits of cycle count
- cycle count (concatenated)
*/
uint32_t cycle_count_lower = 0;
uint32_t cycle_count_upper = 0;
uint32_t cycle_count_combined = 0;
/*
Configure Event Counter Register 6 to count CPU Cycles
Configure Event Counter Register 7 to chain together with Event Counter Register 6
*/
ARM_PMU_Set_EVTYPER(6, ARM_PMU_CPU_CYCLES);
ARM_PMU_Set_EVTYPER(7, ARM_PMU_CHAIN);
If your device implements the DSP Extension you might notice that a compiler translates the above logical OR operation into a single
PKHBT instruction.
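As a host-side illustration of concatenating the two halves of a chained counter, the sketch below combines an upper and a lower 16-bit value with a shift and a logical OR. The function name pmu_combine_chained() and the local mask EVCNTR_VALUE_MASK are hypothetical and are used here instead of any CMSIS macro. On a target, the two inputs would typically come from ARM_PMU_Get_EVCNTR(7) and ARM_PMU_Get_EVCNTR(6).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local mask: each event counter holds a 16-bit count. */
#define EVCNTR_VALUE_MASK 0xFFFFu

/* Place the odd-numbered (upper) counter in bits [31:16] and the
   even-numbered (lower) counter in bits [15:0]. */
static uint32_t pmu_combine_chained(uint32_t upper, uint32_t lower)
{
    return ((upper & EVCNTR_VALUE_MASK) << 16) | (lower & EVCNTR_VALUE_MASK);
}

/* Usage example: upper half 0x0001 and lower half 0x0002 give 0x00010002. */
static void pmu_combine_chained_example(void)
{
    assert(pmu_combine_chained(0x0001u, 0x0002u) == 0x00010002u);
}
```

Masking both halves with an explicit 0xFFFF keeps the result independent of any stray upper bits in the values passed in.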
Note that there is a known issue in CMSIS v5.7.0 with PMU_EVCNTR_CNT_Msk. It should be set to 0xFFFFUL, but instead it is
incorrectly set to 16UL. Use one of the following workarounds to avoid this issue:
Figure 7-1 Armv8-M Architecture Reference Manual Snapshot of PMU Overflow Status Set Register Bit Descriptions
A user might only be interested in whether a particular counter has overflowed. For example, to find out whether Event Counter 3 has
overflowed a user could mask all other overflow bits except bit 3 (the bit that corresponds to Event Counter 3):
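The masking step can be sketched as a small helper that tests one bit of a value previously read from the overflow status register. This is a host-side sketch under stated assumptions: the function name pmu_event_counter_overflowed() is hypothetical, and the status value is assumed to have been obtained on the target with ARM_PMU_Get_CNTR_OVS().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the overflow bit for event counter n is set. */
static bool pmu_event_counter_overflowed(uint32_t overflow_status, unsigned int n)
{
    assert(n < 31u); /* bits 0..30 are event counters, bit 31 is the cycle counter */
    return (overflow_status & (UINT32_C(1) << n)) != 0u;
}
```

For Event Counter 3 the helper masks the status with 0x8, so any other overflow bits that happen to be set are ignored.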
Figure 7-2 Armv8-M Architecture Reference Manual Snapshot of PMU Overflow Status Clear Register Bit Descriptions
The following CMSIS-Core PMU code clears overflow status of Event Counter Register 3:
ARM_PMU_Set_CNTR_OVS(PMU_OVSCLR_CNT3_STATUS_Msk);
Depending on the application, it might also be acceptable to clear all counter overflow status bits, rather than just one bit.
ARM_PMU_Set_CNTR_IRQ_Enable(PMU_INTENSET_CNT3_ENABLE_Msk);
There’s also a corresponding ‘Disable counter overflow interrupt request’ function named ARM_PMU_Set_CNTR_IRQ_Disable().
The interrupt associated with a PMU counter overflow is the DebugMonitor exception. Unlike Halting debug, the DebugMonitor
exception is traditionally used as a method of debugging without putting the core into Debug state. Instead the processor carries on
running, which is useful for debugging systems with hard real-time requirements when it is not a viable option to halt the processor’s
clock. Handling PMU counter overflows is a new usage model for the DebugMonitor exception in M-profile systems. The following code
enables Monitor debug:
CoreDebug->DEMCR |= CoreDebug_DEMCR_MON_EN_Msk;
Privileged software can use the System Handler Priority Register 3 (SHPR3) to set the priority of the DebugMonitor
exception. To ensure that the DebugMonitor exception is taken, it is important to give it an appropriate
priority level.
Note: there are secure and non-secure versions of the DebugMonitor exception and SHPR3.
The user is responsible for writing the DebugMonitor exception handling routine. When the DebugMonitor exception is generated on a
counter overflow, the associated handler code could use a variable to count how many times a counter has overflowed, for example:
void DebugMon_Handler(void)
{
overflow_count++;
return;
}
The current counter value can be concatenated with the overflow count variable (similar to concatenating chained counters in section 7.6
Chaining Event Counters to create a 32-bit Counter) to form a larger sized counter. Let’s say that event counter 3 is being used to count
the number of retired instructions. To form a larger 48-bit counter, an application could execute something similar to the following code:
/* Concatenate current value for Event Counter 3 with its overflow count */
count_combined = ((unsigned long long)overflow_count << 16) |
(PMU_EVCNTR_CNT_Msk & instructions_retired_count);
These steps are simple to carry out if only a single counter has overflowed. The following example shows one way of handling this
scenario within the DebugMon_Handler() exception handling routine.
void DebugMon_Handler(void)
{
/* Read PMU overflow status */
uint32_t pmu_overflow_status = ARM_PMU_Get_CNTR_OVS();
/* Count leading zeroes to find the highest bit position that is set */
pmu_overflow_status = __CLZ(pmu_overflow_status);
/* Convert to a bit index: subtract the no. of leading zeroes from the index of the top bit (31) */
pmu_overflow_status = (32 - pmu_overflow_status)-1;
return;
}
This example uses the __PMU_NUM_EVENTCNT macro to create an array with an element for each event counter, plus the cycle
counter, that can be used to count how many times a counter has overflowed. A Cortex-M55 implementation that includes a PMU has
eight event counters, plus one cycle counter, so such an array in a piece of Cortex-M55 software would be nine words deep.
One further issue to consider is that more than one counter can potentially overflow at the same time, and therefore, multiple overflow
bits could be set. Although this might be an unlikely scenario, for accurate information on counter overflow the DebugMonitor handler
would need to carry out steps 1-3 again, but this time loop through each bit of the overflow status to check which bits are set. For
example:
void DebugMon_Handler(void)
{
uint32_t temp;
while(pmu_overflow_status)
{
/* Count leading zeroes to find out the highest bit position set */
temp = __CLZ(pmu_overflow_status);
/* Calculate trailing zeroes: take away no. of leading zeroes from no. of reg bits */
temp = (32 - temp)-1;
Note: The PMU handler code itself could affect various counters. Therefore, it might be a good idea to temporarily disable the counter(s)
that caused the interrupt at the beginning of the handler routine and enable the counter(s) again before returning to the main
application.
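The idea of looping through every set overflow bit can be sketched on a host as follows. This is an illustration under stated assumptions, not the handler from this guide: clz32() is a portable stand-in for the CMSIS __CLZ() intrinsic, pmu_tally_overflows() and PMU_OVS_BITS are hypothetical names, and the tally array has one slot per status bit rather than one per implemented counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMU_OVS_BITS 32u /* hypothetical: one tally slot per overflow status bit */

/* Portable stand-in for __CLZ(): number of leading zero bits, 32 for zero. */
static unsigned int clz32(uint32_t value)
{
    unsigned int zeros = 0u;
    if (value == 0u) {
        return 32u;
    }
    while ((value & UINT32_C(0x80000000)) == 0u) {
        value <<= 1;
        zeros++;
    }
    return zeros;
}

/* Increment the tally for every bit set in overflow_status, highest bit first.
   Returns the mask of bits handled so that the caller can clear them. */
static uint32_t pmu_tally_overflows(uint32_t overflow_status,
                                    uint32_t tally[PMU_OVS_BITS])
{
    uint32_t handled = 0u;
    assert(tally != NULL);
    while (overflow_status != 0u) {
        unsigned int bit = 31u - clz32(overflow_status);
        uint32_t mask = UINT32_C(1) << bit;
        tally[bit]++;
        handled |= mask;
        overflow_status &= ~mask; /* drop this bit and look for the next one */
    }
    return handled;
}
```

In a DebugMon_Handler() on a target, the status would typically be read with ARM_PMU_Get_CNTR_OVS() and the returned mask passed to ARM_PMU_Set_CNTR_OVS() to clear the handled bits.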
PMU->CTRL |= PMU_CTRL_FRZ_ON_OV_Msk;
The user can check whether freeze-on-overflow support is available by reading the PMU Type Register. The following CMSIS-Core macros
can be used by software: PMU_TYPE_FRZ_OV_SUPPORT_Pos and PMU_TYPE_FRZ_OV_SUPPORT_Msk.
Note: setting freeze-on-overflow will cause the chaining of event counters to stop working, because the overflow of the odd-numbered
counter freezes counting.
Note: when writing to the DHCSR, 0xA05F must be written to the DEBUGKEY field - bits [31:16] – otherwise the write will be ignored.
Software can set the C_PMOV field using the following code:
DCB->DHCSR = 0xA05F0040;
Note: software cannot write to C_DEBUGEN and can only read this bit to see whether Halting debug has been enabled by a debugger.
A debugger must write 0xA05F0041 to set both C_PMOV and C_DEBUGEN in the DHCSR, to halt the core when a PMU counter
overflows.
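The write values quoted above can be derived rather than hard-coded. The sketch below composes a DHCSR write value from the debug key and the required control bits. It is a host-side illustration: the macro and function names are hypothetical local definitions, C_PMOV is taken to be bit 6 (matching the 0x40 in the value above), and C_DEBUGEN is taken to be bit 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local definitions for illustration only. */
#define DHCSR_DBGKEY     UINT32_C(0xA05F)   /* key required in bits [31:16] */
#define DHCSR_DBGKEY_POS 16u
#define DHCSR_C_DEBUGEN  (UINT32_C(1) << 0) /* writable by a debugger only */
#define DHCSR_C_PMOV     (UINT32_C(1) << 6)

/* Build the 32-bit value to write to DHCSR for the given control bits. */
static uint32_t dhcsr_write_value(uint32_t control_bits)
{
    /* Control bits must stay below the key field. */
    assert((control_bits >> DHCSR_DBGKEY_POS) == 0u);
    return (DHCSR_DBGKEY << DHCSR_DBGKEY_POS) | control_bits;
}
```

With these definitions, dhcsr_write_value(DHCSR_C_PMOV) yields 0xA05F0040, and adding DHCSR_C_DEBUGEN yields 0xA05F0041.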
PMU->CTRL |= PMU_CTRL_TRACE_ON_OV_Msk;
Note that the user can check whether trace-on-overflow support is available by reading the PMU Type Register. The following CMSIS-Core
macros can be used by software: PMU_TYPE_TRACE_ON_OV_SUPPORT_Pos and PMU_TYPE_TRACE_ON_OV_SUPPORT_Msk.
The Armv8-M Architecture Reference Manual describes the trace packet information as follows:
Figure 7-3 Armv8-M Architecture Reference Manual Snapshot of PMU Overflow Packet Description
Figure 7-2 Armv8-M Architecture Reference Manual Snapshot of PMU Overflow Packet
7.7.8 Triggering an overflow after the core has executed code ‘N’ times
The PMU can also be used in conjunction with the DWT to halt the processor after a code sequence of interest, such as a loop, has
executed a given number of iterations. The following sequence may be used to achieve this:
1. Decide how many times (N) you would like the loop (L) to execute before generating an overflow and initialize a 16-bit loop limit
variable. N is set to 10 in the example below.
2. Configure DWT comparator <n> to match an Instruction Address, for example, a location in memory at the end of a loop.
3. Configure PMU event counter <m> to count on DWT_CMPMATCH<n>.
4. Set PMU event counter <m> to -N.
5. Decide how to handle the overflow and take appropriate action (see previous sections 7.7.x).
6. Enable event counter <m>.
7. Execute loop (L).
After the instruction (loop) being watched executes ‘N’ times, an overflow will occur.
This mechanism can also be used with chained event counters, as described in section 7.6 Chaining Event Counters to create a 32-bit
Counter to trigger an overflow of the even numbered counter.
The following example code shows how to achieve this scenario where N is set to 10:
ARM_PMU_CNTR_Enable(PMU_CNTENSET_CNT0_ENABLE_Msk);
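Step 4 of the sequence, setting the counter to -N, relies on 16-bit wraparound: a counter preloaded with 0x10000 - N wraps to zero, and therefore overflows, on the Nth event. The following host-side sketch checks that arithmetic with a simple software model of a 16-bit counter. The function names are hypothetical, and the model only mimics the wrap behavior, not the PMU itself.

```c
#include <assert.h>
#include <stdint.h>

/* Value to write to a 16-bit event counter so that it overflows
   after n further events (that is, -n in 16-bit arithmetic). */
static uint16_t pmu_preload_for(uint16_t n)
{
    assert(n != 0u);
    return (uint16_t)(0x10000u - n);
}

/* Software model: count events until a 16-bit counter wraps to zero. */
static uint32_t events_until_overflow(uint16_t start)
{
    uint16_t counter = start;
    uint32_t events = 0u;
    do {
        counter = (uint16_t)(counter + 1u);
        events++;
    } while (counter != 0u);
    return events;
}
```

For N set to 10 the preload value is 0xFFF6, and the model reports an overflow after exactly 10 events.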
• A basic scalar example that uses a traditional counting down loop: strcpy_scalar().
• A basic scalar example that uses a low-overhead loop: strcpy_scalar_lol().
The example is written in GNU assembly language syntax, which is supported by GCC and Arm Compiler 6.
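/* strcpy.s */
.global strcpy_scalar
.type strcpy_scalar, %function
.global strcpy_scalar_lol
.type strcpy_scalar_lol, %function
strcpy_scalar: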
loopStart:
LDRB R3, [R1], #1
STRB R3, [R0], #1
SUBS R2, R2, #1
BNE loopStart
BX LR
strcpy_scalar_lol:
PUSH {R0,LR}
WLS LR, R2, lolEnd /* While Loop Start */
lolStart:
LDRB R3, [R1], #1
STRB R3, [R0], #1
LE LR, lolStart /* Loop End */
lolEnd:
POP {R0,PC}
.end
Both functions have been exported using the .global and .type keywords so that code from other source files may reference them.
The program uses the CMSIS device header file by including CMSIS_device_header and RTE_Components.h. The CMSIS device
header includes the Cortex-M55 processor core header file, which provides the user with access to the CMSIS PMU API. More
information about using CMSIS in an application can be found online:
• [Link]
• [Link]
6-without-an-ide
The printf() routines may need to be retargeted to the device that you’re working with.
/* main.c */
#include <stdio.h>
#include "RTE_Components.h" // include information about project configuration
#include CMSIS_device_header // include <device>.h file
int main(void)
{
/* Reset count variables for cycle count and retired Loop End instructions */
uint32_t cycle_count = 0;
uint32_t le_retired_count = 0;
/* Enable Trace */
CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
/* Print results */
printf("Cycles for strcpy_scalar = %lu\n"
"Loop End instructions retired = %lu\n", (unsigned long)cycle_count, (unsigned long)le_retired_count);
/* Reset count variables for cycle count and retired Loop End instructions */
cycle_count = 0;
le_retired_count = 0;
return 0;
}
Running this code on Cortex-M55 RTL where instructions and data are stored in TCMs prints something similar to the following:
Running the same code on a Cortex-M55 FVP prints something similar to the following:
The Fast Model results show that the FVP correctly counted the number of retired Loop End instructions, showing that it can provide a
quick and convenient way to run functionally accurate simulations. The cycle count information is approximate and is more aligned with
the total number of instructions retired, rather than the actual cycle count shown by an RTL simulator or hardware platform. A cycle
accurate model would be an alternative solution where accuracy is required.
Software can manage PMU counter overflows by reading the PMU Overflow Flag Status Set Register, PMU_OVSSET, to check for overflows. Developers can mask specific bits to determine which counter overflowed and use the PMU Overflow Status Clear Register, PMU_OVSCLR, to clear overflow status bits. To ensure that an overflow does not go unnoticed, configure an interrupt to be generated when a specific counter overflows by using the PMU Interrupt Enable Set Register, PMU_INTENSET. Together, these steps allow software to respond to overflows, for example by logging an event or adjusting counts.
If software does not promptly clear a PMU event counter overflow in an Armv8.1-M system, it cannot tell how many times the counter has overflowed, which makes the performance metrics ambiguous. Software can avoid this by checking and clearing the overflow status regularly using PMU_OVSCLR, or by generating an interrupt on overflow so that the status is actively managed. This helps keep performance monitoring accurate for debugging and optimization.
The Cortex-M55 processor supports only a subset of microarchitectural (µArch) PMU events because of its in-order, non-speculative pipeline. Unsupported µArch events include branch prediction events, certain cache events, and speculation events. As a result, developers working on performance optimizations do not have access to speculative execution metrics that can be important on other architectures. Developers must rely on the supported events and might need alternative strategies to diagnose performance issues related to the unsupported metrics.
In Arm Cortex-M systems, PMU counter overflows are handled using the DebugMonitor exception. Unlike Halting debug, the DebugMonitor exception allows the processor to continue executing rather than halting. This is useful for debugging systems with real-time requirements, where stopping the processor would disrupt operation. The DebugMonitor exception therefore provides a way to handle PMU overflows while the system keeps running.
Because the Cortex-M55 does not support branch prediction events, developers have no direct visibility into branch prediction behavior through the PMU. This can make it harder to fine-tune branch-heavy workloads, and developers must rely on the supported events to infer performance issues associated with branching.
The MVE unaligned instruction events that the Cortex-M55 does not support limit visibility into some memory access patterns and vectorized performance characteristics. This makes it harder to optimize code that relies on these instructions, particularly in applications that use MVE for vector operations. Developers have to depend on alternative metrics or profiling methods to understand the performance impact of such instructions.
Chaining event counters creates a 32-bit counter from two 16-bit counters, which allows a much larger count to accumulate before an overflow occurs. This is particularly useful for applications where event counts exceed the range of a 16-bit counter. To chain counters, software configures an odd-numbered counter to chain with the preceding even-numbered counter, and then resets, enables, and reads the counters using the PMU registers and the CMSIS-Core macros.
The PMU Cycle Counter Register, PMU_CCNTR, is dedicated to counting processor cycles. Software can both read and write this register, so it can initialize the register with a starting value and reset it to zero as needed. By counting the processor cycles consumed while a code segment executes, it provides a way to measure execution time for performance analysis and optimization. The PMU must be enabled and the appropriate control registers configured before this functionality can be used.
Reconfiguring the PMU involves setting several registers, such as PMU_EVTYPERn, PMU_CNTENSET, and PMU_CTRL, to monitor specific events accurately. Each configuration requires precise control over the event type and over which counters are enabled or disabled. Managing this while keeping the performance impact low and the collected data accurate requires careful planning and a thorough understanding of how the PMU operates.
The Debug Exception and Monitor Control Register (DEMCR) is essential for PMU operation. Its TRCENA bit enables trace, which must be set before the PMU can be used. This allows the PMU to collect metrics while the processor continues to execute, which is especially valuable for performance analysis and tuning in real-time systems where halting the processor is not feasible.