Embedded Systems - ARM Programming Techniques
Programming Techniques
Beta Draft
Document Number: ARM DUI 0021A Issued: June 1995 Copyright Advanced RISC Machines Ltd (ARM) 1995
EUROPE
Advanced RISC Machines Limited Fulbourn Road Cherry Hinton Cambridge CB1 4JN Telephone: +44 1223 400400 Facsimile: +44 1223 400410 Email: info@armltd.co.uk
JAPAN
Advanced RISC Machines K.K. KSP West Bldg, 3F 300D, 3-2-1 Sakado, Takatsu-ku, Kawasaki-shi Kanagawa, 213 Japan Telephone: +81 44 850 1301 Facsimile: +81 44 850 1308 Email: info@armltd.co.uk
USA
ARM USA Suite 5, 985 University Avenue Los Gatos California 95030 Telephone: +1 408 399 5199 Facsimile: +1 408 399 8854 Email: info@arm.com
Proprietary Notice
ARM, the ARM Powered logo and EmbeddedICE are trademarks of Advanced RISC Machines Ltd. Neither the whole nor any part of the information contained in, or the product described in, this datasheet may be adapted or reproduced in any material form except with the prior written permission of the copyright holder. The product described in this datasheet is subject to continuous developments and improvements. All particulars of the product and its use contained in this datasheet are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded. This datasheet is intended only to assist the reader in the use of the product. ARM Ltd shall not be liable for any loss or damage arising from the use of any information in this datasheet, or any error or omission in such information, or any incorrect use of the product.
Change Log
Issue   Date      By            Change
A       June 95   PB/BH/EH/AP   Created
Contents
1 Introduction
   1.1 About this manual
   1.2 Feedback
2 Getting Started
   2.1 Introducing the Toolkit
   2.2 The Hello World Example
3 Programmer's Model
   3.1 Introduction
   3.2 Memory Formats
   3.3 Instruction Length
   3.4 Data Types
   3.5 Processor Modes
   3.6 Processor States
   3.7 The ARM Register Set
   3.8 The Thumb Register Set
   3.9 Program Status Registers
   3.10 Exceptions
4 ARM Assembly Language Basics
   4.1 Introduction
   4.2 Structure of an Assembler Module
   4.3 Conditional Execution
   4.4 The ARM's Barrel Shifter
   4.5 Loading Constants Into Registers
   4.6 Loading Addresses Into Registers
   4.7 Jump Tables
   4.8 Using the Load and Store Multiple Instructions
5 Exploring ARM Assembly Language
6
7
8 Advanced Linking
   8.1 Using Overlays
   8.2 ARM Shared Libraries
9
10 The ARMulator
   10.1 The ARMulator
   10.2 Using the ARMulator Rapid Prototype Memory Model
   10.3 Writing Custom Serial Drivers for ARM Debuggers
   10.4 Rebuilding the ARMulator
11 Exceptions
   11.1 Overview
   11.2 Entering and Leaving an Exception
   11.3 The Return Address and Return Instruction
   11.4 Writing an Exception Handler
   11.5 Installing an Exception Handler
   11.6 Exception Handling on Thumb-Aware Processors
12 Implementing SWIs
   12.1 Introduction
   12.2 Implementing a SWI Handler
   12.3 Loading the Vector Table
   12.4 Calling SWIs from your Application
   12.5 Development Issues: SWI Handlers and Demon
   12.6 Example SWI Handler
13 Benchmarking, Performance Analysis, and Profiling
14
1 Introduction
This chapter introduces the Programming Techniques manual.
   1.1 About this manual
   1.2 Feedback
1.1 About this manual
1.1.1
You should use this book in conjunction with the ARM Software Development Toolkit, as most of the example programs are available on-line in the toolkit's examples directory. You will need to refer to the ARM Software Development Toolkit Reference Manual (ARM DUI 0020) for full details of the software tools. Also, the relevant ARM Datasheet will give you specific details about the device with which you are working.
1.1.2 Conventions
Typographical conventions
The following typographical conventions are used in this manual:
   typewriter          denotes text that may be entered at the keyboard: commands, file and program names and assembler and C source.
   typewriter-italic   shows text which must be substituted with user-supplied information: this is most often used in syntax descriptions.
   Oblique             is used to highlight important notes and ARM-specific terminology.
Filenames
Unless otherwise stated, filenames are quoted in Unix format, for example:
   examples/basicasm/gcd1.s
If you are using the PC platform, you must translate them into their DOS equivalent:
   EXAMPLES\BASICASM\GCD1.S
1.2 Feedback
1.2.1 Feedback on the Software Development Toolkit
If you have feedback on the Software Development Toolkit, please contact either your supplier or ARM Ltd. You can send feedback via e-mail to: xdevt@armltd.co.uk.
In order to help us give a rapid and useful response, please give:
   details of which hosting and release of the ARM software tools you are using
   a small sample code fragment which reproduces the problem
   a clear explanation of what you expected to happen, and what actually happened
1.2.2
2 Getting Started
This chapter introduces the components of the ARM software development toolkit, and takes you through compiling, linking and running a simple ARM program.
   2.1 Introducing the Toolkit
   2.2 The Hello World Example
2.1 Introducing the Toolkit
The ARM software development toolkit is a collection of utilities for producing programs written in ARM code. The tools include emulators so that programs can be run even when real ARM hardware is unavailable to the developer.
Re-targetable libraries
Full documentation
Utilities
It comprises a set of command line tools and, in the case of the IBM PC platform, a pair of applications which provide an interactive development environment in the Windows desktop. The tools are used for two main purposes:

Software development
   This involves building either C, C++, or ARM assembler source code into ARM object code, which is then debugged using the ARM source level debugger. The debugger has facilities which include single stepping, setting breakpoints and watchpoints, and viewing registers. Testing and debugging can be carried out on code running in a real ARM processor, or using the integrated ARM processor emulator.

Benchmarking
   Once application code has been built, it can be benchmarked either on an ARM processor attached to the host system, or under software emulation. The ARM emulator can also be used to simulate the memory environment.
2.1.1 Tools
The ARM software development toolkit consists of the following core command-line tools:

armcc     The ARM C cross compiler. This is a mature, industrial-strength compiler, tested against Plum Hall C Validation Suite for ANSI conformance. It supports both Unix and PCC compatible modes. It is highly optimising, with options to optimise for code size or execution speed. The compiler is very fast, compiling 500 lines per second on a SPARC 10/41. The compiler can also produce ARM assembly language source.
tcc       The Thumb C cross compiler. This is based on the ARM C compiler but produces 16-bit Thumb instructions instead of 32-bit ARM instructions.
armasm    The ARM cross assembler. This compiles ARM assembly language source into ARM object format object code.
tasm      The Thumb and ARM cross assembler. This compiles both ARM assembly and Thumb assembly language source into object code. An assembler directive dictates whether the code following is ARM (32 bits) or Thumb (16 bits).
armlink   The Thumb and ARM linker. This combines the contents of one or more object files (the output of a compiler or assembler) with selected parts of one or more object libraries, to produce an executable program.
decAOF    The Thumb and ARM object file decoder/disassembler. This is used to extract information from object files, such as the code size.
armsd     The Thumb and ARM symbolic debugger. This is used to emulate ARM processors, allowing ARM and Thumb executable programs to be run on non-ARM hardware. It also allows source level debugging of programs that have been compiled with debug information. This consists of single stepping either C source or assembler source, setting breakpoints and watchpoints, and so on. armsd can also connect to real hardware and allow source level debugging on that hardware.
These tools are documented in The ARM Software Development Toolkit Reference Manual: Chapter 1, Introduction.
On the IBM PC platform, the toolkit also comprises:

APM       The ARM Project Manager. This is an integrated development environment, which provides all the functions of a traditional make file, along with source editing facilities and a link to the ARM debugger.
Windbg    The ARM windowed debugger. This is the Windows version of armsd which integrates with the ARM Project Manager.
These applications are documented in the ARM Windows Toolkit Guide (ARM DUI 0022).
2.2 The Hello World Example
This example shows you how to write, compile, link and execute a simple C program that prints Hello World and a carriage return on the screen. The code will be written in a text editor, compiled and linked using armcc, and run on armsd.
2.2.1
2.2.2 Timing
To find out how many microseconds this would take to run on real hardware, type the following:
   print $clock
You can change the memory model and clock speed of the hardware being simulated. For more information, see Chapter 13, Benchmarking, Performance Analysis, and Profiling.
To load and run the program again, enter:
   reload
   go
To quit the debugger, enter:
   quit
2.2.3 Debugging
Next, re-compile the program to include high-level debugging information, and use the debugger to examine the code. Compile the program using:
   armcc -g hello.c -o hello2
where the -g option instructs the compiler to add debug information. Load hello2 into armsd:
   armsd hello2
and set a breakpoint on the first statement in main by entering:
   break main
at the armsd: prompt. To execute the program up to the breakpoint, enter:
   go
The debugger reports that it has stopped at breakpoint #1, and displays the source line. To view the ARM registers, enter:
   reg
To list the C source, enter:
   type
This displays the whole source file. type can also display sections of code: for example if you enter:
   type 1,6
lines 1 to 6 of the source will be displayed.
To show the assembly code rather than the C source, type:
   list
This will produce the assembly around the current position in the program. You can also list memory at a given address:
   list 0x8080
2.2.4
2.2.5
; generated by Norcroft ARM C vsn 4.65 (Advanced RISC Machines) [May 23 1995]

        AREA |C$$code|, CODE, READONLY
|x$codeseg| DATA

main
        MOV     ip,sp
        STMDB   sp!,{fp,ip,lr,pc}
        SUB     fp,ip,#4
        CMP     sp,sl
        BLMI    __rt_stkovf_split_small
        ADD     a1,pc,#L000024-.-8
        BL      _printf
        MOV     a1,#0
        LDMDB   fp,{fp,sp,pc}
L000024
        DCB     0x48,0x65,0x6c,0x6c
        DCB     0x6f,0x20,0x77,0x6f
        DCB     0x72,0x6c,0x64,0x0a
        DCB     00,00,00,00

        AREA |C$$data|,DATA
|x$dataseg|

        EXPORT main
        IMPORT _printf
        IMPORT __rt_stkovf_split_small
        END

Note
Your code may differ slightly from the above, depending on the version of armcc in use.
2.2.6
3 Programmer's Model
This chapter describes the features of the ARM Processor which are of special interest to the programmer.
   3.1 Introduction
   3.2 Memory Formats
   3.3 Instruction Length
   3.4 Data Types
   3.5 Processor Modes
   3.6 Processor States
   3.7 The ARM Register Set
   3.8 The Thumb Register Set
   3.9 Program Status Registers
   3.10 Exceptions
3.1 Introduction
This chapter gives an overview of the ARM from the programmer's point of view, and is designed to provide you with some general background for the discussions in this book.
3.1.1
Version 2a introduced an Atomic Load and Store instruction (SWP) and the use of Coprocessor 15 as a system control coprocessor. Versions 1, 2 and 2a all supported a 26-bit address bus and combined in register 15 a 24-bit Program Counter (PC) and 8 bits of processor status.

Architecture 3
Version 3 of the architecture extended the addressing range to 32 bits, defining a 30-bit Program Counter value in register 15. The status information was moved from register 15 to a new 11-bit status register (the Current Program Status Register or CPSR). Version 3 also added two new privileged processing modes (Version 2 has just three: Supervisor, IRQ and FIQ). The new modes, Undefined and Abort, allowed coprocessor emulation and virtual memory support in Supervisor mode. In addition, a further five status registers (the Saved Program Status Registers or SPSRs) were defined, one for each privileged processor mode, in which the CPSR contents are preserved when the corresponding exception is taken.
A variant of the Version 3 architecture, Version 3M, added multiply and multiply accumulate instructions that produce a 64-bit result (SMULL, UMULL, SMLAL, UMLAL).

Architecture 4
Version 4 added halfword load and store instructions and sign-extended byte and halfword load instructions. It also reserved some SWI instruction space for architecturally defined operations, added a new privileged processor mode called System (that uses the User mode registers) and defined several new undefined instructions.
A variant of Version 4 called 4T incorporates an instruction decoder for a 16-bit subset of the ARM instruction set (known as Thumb). Processors which have this decoder are referred to as being Thumb-aware.
3.2 Memory Formats
The ARM views memory as a linear collection of bytes numbered upwards from zero. Bytes 0 to 3 hold the first stored word, bytes 4 to 7 the second and so on. The ARM can treat words in memory as being stored either in Big Endian or Little Endian format.
3.2.1
Higher Address
   Bits 31-24   Bits 23-16   Bits 15-8   Bits 7-0   Word Address
   8            9            10          11         8
   4            5            6           7          4
   0            1            2           3          0
Lower Address

   Most significant byte is at lowest address.
   Word is addressed by byte address of most significant byte.
3.2.2
   Least significant byte is at lowest address.
   Word is addressed by byte address of least significant byte.
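The difference between the two formats can be sketched in C by assembling a word from the four bytes that start at a word address. This is only an illustration: the function names are hypothetical and are not part of the toolkit.

```c
#include <stdint.h>

/* Big Endian: the byte at the lowest address is the most significant. */
uint32_t load_word_big_endian(const uint8_t *mem, uint32_t word_address)
{
    const uint8_t *p = mem + word_address;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Little Endian: the byte at the lowest address is the least significant. */
uint32_t load_word_little_endian(const uint8_t *mem, uint32_t word_address)
{
    const uint8_t *p = mem + word_address;
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Given the same four bytes 0x11, 0x22, 0x33, 0x44 at ascending addresses, the first function yields 0x11223344 and the second 0x44332211.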
3.3 Instruction Length
ARM instructions are exactly one word (32 bits), and are aligned on a four-byte boundary. Thumb instructions are exactly one halfword, and are aligned on a two-byte boundary.
3.4 Data Types
The ARM supports the following data types:
   Byte       8 bits
   Halfword   16 bits; halfwords must be aligned to two-byte boundaries (Architecture 4 only)
   Word       32 bits; words must be aligned to four-byte boundaries

Load and store operations can transfer bytes, halfwords and words to and from memory. Signed operands are in two's complement format.
3.5 Processor Modes
There are a number of different processor modes. These are shown in the following table:

   Processor mode       Description
   1 User (usr)         the normal program execution mode
   2 FIQ (fiq)          designed to support a high-speed data transfer or channel process
   3 IRQ (irq)          used for general-purpose interrupt handling
   4 Supervisor (svc)   a protected mode for the operating system
   5 Abort (abt)        used to implement virtual memory and/or memory protection
   6 Undefined (und)    used to support software emulation of hardware coprocessors
   7 System (sys)       used to run privileged operating system tasks (Architecture Version 4 only)
3.6 Processor States
Note
3.6.1 Switching state
Entering Thumb state
Entry into Thumb state occurs on execution of a BX instruction with the state bit (bit 0) set in the operand register. Transition to Thumb state also occurs automatically on return from an exception (IRQ, FIQ, RESET, UNDEF, ABORT, SWI etc.) if the exception was entered from Thumb state.

Entering ARM state
Entry into ARM state happens:
1 On execution of the BX instruction with the state bit clear in the operand register.
2 On the processor taking an exception (IRQ, FIQ, RESET, UNDEF, ABORT, SWI etc.). In this case, the PC is placed in the exception mode's link register, and execution commences at the exception's vector address. See 3.10 Exceptions on page 3-12 and Chapter 11, Exceptions.
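As a rough model of how BX selects the state, the sketch below splits an operand register value into the state bit and the branch target. The type and function names are hypothetical, and the masking of bit 0 out of the target address is an assumption about the branch behaviour rather than something stated in this section.

```c
#include <stdint.h>

typedef enum { STATE_ARM = 0, STATE_THUMB = 1 } cpu_state;

/* State selected by BX: bit 0 of the operand register is the state bit. */
cpu_state bx_target_state(uint32_t operand)
{
    return (operand & 1u) ? STATE_THUMB : STATE_ARM;
}

/* Address at which execution continues (assumption: the state bit is
   stripped and is not part of the address). */
uint32_t bx_target_address(uint32_t operand)
{
    return operand & ~1u;
}
```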
3.7 The ARM Register Set
The ARM processor has a total of 37 registers, comprising:
   30 general-purpose registers
   6 status registers
   a program counter

However, not all of these registers can be seen at once. Depending on the processor mode, fifteen general-purpose registers (R0 to R14), one or two status registers and the program counter will be visible. The registers are arranged in partially overlapping banks with a different register bank for each processor mode:
   Table 3-2: The ARM register set on page 3-7 shows how the registers are arranged; the banked registers are those named with a mode suffix.
   Table 3-4: The mode bits on page 3-11 lists which registers are visible in which mode.
3.7.1 Register roles
Registers 0-12 are always free for general-purpose use. Registers 13 and 14, although available for general use, also have specific roles:

Register 13 (also known as the Stack Pointer or SP) is banked across all modes to provide a private Stack Pointer for each mode (except System mode which shares the user mode R13).

Register 14 (also known as the Link Register or LR) is used as the subroutine return address link register. R14 is also banked across all modes (except System mode which shares the user mode R14). When a subroutine call (Branch and Link instruction) is executed, R14 is set to the subroutine return address. The banked registers R14_SVC, R14_IRQ, R14_FIQ, R14_ABORT and R14_UNDEF are used similarly to hold the return address when exceptions occur (or a subroutine return address if subroutine calls are executed within interrupt or exception routines). R14 may be treated as a general-purpose register at all other times.

Register 15 is used specifically to hold the Program Counter (PC). When R15 is read, bits [1:0] are zero and bits [31:2] contain the PC. When R15 is written, bits [1:0] are ignored and bits [31:2] are written to the PC. Depending on how it is used, the value of the PC is either the address of the instruction plus n (where n is 8 for ARM state and 4 for Thumb state) or is unpredictable.

CPSR is the Current Program Status Register. This is accessible in all processor modes, and contains the condition code flags, interrupt enable flags, and current processor mode. In Architecture 4T, the CPSR also holds the processor state. See 3.9 Program Status Registers on page 3-10 for more information.
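The "address of the instruction plus n" rule for reading R15 can be written down as a one-line C function. This is a sketch only; the function name and the integer flag for the state are illustrative choices.

```c
#include <stdint.h>

/* Value read from R15 by an instruction located at instruction_address:
   the instruction's address plus 8 in ARM state, plus 4 in Thumb state. */
uint32_t pc_read_value(uint32_t instruction_address, int thumb_state)
{
    return instruction_address + (thumb_state ? 4u : 8u);
}
```

This offset is why PC-relative expressions in ARM code, such as the `#L000024-.-8` operand in the Hello World listing, subtract 8 from the distance to the target.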
3.7.2 The FIQ banked registers
FIQ mode has banked registers R8 to R12 (as well as R13 and R14). Registers R8_FIQ, R9_FIQ, R10_FIQ, R11_FIQ and R12_FIQ are provided to allow very fast interrupt processing (without the need to preserve register contents by storing them to memory), and to preserve values across interrupt calls (so that register contents do not need to be restored from memory).

Table 3-2: The ARM register set

   User/System   Supervisor   Abort        Undefined    Interrupt   Fast interrupt
   R0            R0           R0           R0           R0          R0
   R1            R1           R1           R1           R1          R1
   R2            R2           R2           R2           R2          R2
   R3            R3           R3           R3           R3          R3
   R4            R4           R4           R4           R4          R4
   R5            R5           R5           R5           R5          R5
   R6            R6           R6           R6           R6          R6
   R7            R7           R7           R7           R7          R7
   R8            R8           R8           R8           R8          R8_FIQ
   R9            R9           R9           R9           R9          R9_FIQ
   R10           R10          R10          R10          R10         R10_FIQ
   R11           R11          R11          R11          R11         R11_FIQ
   R12           R12          R12          R12          R12         R12_FIQ
   R13           R13_SVC      R13_ABORT    R13_UNDEF    R13_IRQ     R13_FIQ
   R14           R14_SVC      R14_ABORT    R14_UNDEF    R14_IRQ     R14_FIQ
   PC            PC           PC           PC           PC          PC
   CPSR          CPSR         CPSR         CPSR         CPSR        CPSR
                 SPSR_SVC     SPSR_ABORT   SPSR_UNDEF   SPSR_IRQ    SPSR_FIQ
3.8 The Thumb Register Set
Note
   Supervisor   Abort        Undefined    Interrupt   Fast interrupt
   R0           R0           R0           R0          R0
   R1           R1           R1           R1          R1
   R2           R2           R2           R2          R2
   R3           R3           R3           R3          R3
   R4           R4           R4           R4          R4
   R5           R5           R5           R5          R5
   R6           R6           R6           R6          R6
   R7           R7           R7           R7          R7
   SP_SVC       SP_ABORT     SP_UNDEF     SP_IRQ      SP_FIQ
   LR_SVC       LR_ABORT     LR_UNDEF     LR_IRQ      LR_FIQ
   PC           PC           PC           PC          PC
   CPSR         CPSR         CPSR         CPSR        CPSR
   SPSR_SVC     SPSR_ABORT   SPSR_UNDEF   SPSR_IRQ    SPSR_FIQ
This relationship is shown in Figure 3-3: Mapping of Thumb state registers onto ARM state registers.
   Thumb state            ARM state
   R0 to R7               R0 to R7
                          R8 to R12
   Stack Pointer (SP)     Stack Pointer (R13)
   Link Register (LR)     Link Register (R14)
   Program Counter (PC)   Program Counter (R15)
   CPSR                   CPSR
   SPSR                   SPSR

   Lo registers: R0 to R7. Hi registers: R8 to R15.

Figure 3-3: Mapping of Thumb state registers onto ARM state registers
3.8.1
3.9 Program Status Registers
The ARM contains a Current Program Status Register (CPSR), plus five Saved Program Status Registers (SPSRs) for use by exception handlers. The CPSR:
   holds information about the most recently performed ALU operation
   controls the enabling and disabling of interrupts
   sets the processor operating mode
   sets the processor state (Architecture 4T only)

The CPSR is saved to the appropriate SPSR when the processor enters an exception. The arrangement of bits in these registers is shown in Figure 3-4: Program Status Register format, below.

   Bit     31   30   29   28   27 to 8   7   6   5   4    3    2    1    0
   Field   N    Z    C    V              I   F   T   M4   M3   M2   M1   M0

Figure 3-4: Program Status Register format
3.9.1
3.9.2 The control bits
The bottom 8 bits of a PSR (incorporating I, F, T and M[4:0]) are known collectively as the control bits. These change when an exception arises, and can be altered by software only when the processor is in a privileged mode.

Interrupt disable bits
The I and F bits are the interrupt disable bits. When set, these disable the IRQ and FIQ interrupts respectively.

The state bit
Bit T is the processor state bit. When the state bit is set to 0, this indicates that the processor is in ARM state (i.e. executing 32-bit ARM instructions). When it is set to 1, this indicates that the processor is in Thumb state (executing 16-bit Thumb instructions). The state bit is only implemented on Thumb-aware processors (Architecture 4T). On non Thumb-aware processors the state bit will always be zero.

The mode bits
The M4, M3, M2, M1 and M0 bits (M[4:0]) are the mode bits. These determine the mode in which the processor operates, as shown in Table 3-4: The mode bits, below. Not all combinations of the mode bits define a valid processor mode. Only those explicitly described can be used.

Table 3-4: The mode bits

   M[4:0]   Mode     Accessible Registers
   10000    User     PC, R14 to R0, CPSR
   10001    FIQ      PC, R14_fiq to R8_fiq, R7 to R0, CPSR, SPSR_fiq
   10010    IRQ      PC, R14_irq, R13_irq, R12 to R0, CPSR, SPSR_irq
   10011    SVC      PC, R14_svc, R13_svc, R12 to R0, CPSR, SPSR_svc
   10111    Abort    PC, R14_abt, R13_abt, R12 to R0, CPSR, SPSR_abt
   11011    Undef    PC, R14_und, R13_und, R12 to R0, CPSR, SPSR_und
   11111    System   PC, R14 to R0, CPSR (Architecture 4 only)
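The control bits can be picked apart with a few masks. The sketch below assumes the bit positions implied by the text (I in bit 7, F in bit 6, T in bit 5, M[4:0] in bits 4 to 0); the macro and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PSR_I         (1u << 7)   /* IRQ disable */
#define PSR_F         (1u << 6)   /* FIQ disable */
#define PSR_T         (1u << 5)   /* state bit (Architecture 4T only) */
#define PSR_MODE_MASK 0x1Fu       /* M[4:0] */

/* Returns the mode name for M[4:0], or NULL if the combination does not
   define a valid processor mode. */
const char *psr_mode_name(uint32_t psr)
{
    switch (psr & PSR_MODE_MASK) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "SVC";
    case 0x17: return "Abort";
    case 0x1B: return "Undef";
    case 0x1F: return "System";
    default:   return NULL;
    }
}

int psr_irq_disabled(uint32_t psr) { return (psr & PSR_I) != 0; }
int psr_fiq_disabled(uint32_t psr) { return (psr & PSR_F) != 0; }
int psr_thumb_state(uint32_t psr)  { return (psr & PSR_T) != 0; }
```

For example, a PSR whose bottom byte is 0xD3 decodes as SVC mode, ARM state, with both IRQ and FIQ disabled.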
3.10 Exceptions
Note
This section is a brief overview of the ARM's exceptions. For a detailed explanation of how they operate and how to handle them please refer to Chapter 11, Exceptions.

Exception type
   Reset
   Undefined instructions
   Software Interrupt (SWI)
   Prefetch Abort (Instruction fetch memory abort)
   Data Abort (Data Access memory abort)
   IRQ (Interrupt)
   FIQ (Fast Interrupt)
3.10.2 Action on entering an exception
When an exception occurs, the ARM makes use of the banked registers to save state, by:
1 copying the address of the next instruction into the appropriate Link Register
2 copying the CPSR into the appropriate SPSR
3 forcing the CPSR mode bits to a value corresponding to the exception
4 forcing the PC to fetch the next instruction from the relevant vector

It may also set the interrupt disable flags to prevent otherwise unmanageable nestings of exceptions from taking place. If the processor is Thumb-aware (Architecture 4T) and is operating in Thumb state, it will automatically switch into ARM state.
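The four entry steps can be modelled as a pure function on a small state record. This is a simplified sketch, not a description of the hardware: the `cpu_model` type and its fields are hypothetical, only the registers of the one exception mode being entered are represented, and the model always sets the I bit, whereas the text says only that the interrupt disable flags may be set.

```c
#include <stdint.h>

#define PSR_I         (1u << 7)
#define PSR_T         (1u << 5)
#define PSR_MODE_MASK 0x1Fu

typedef struct {
    uint32_t pc;
    uint32_t cpsr;
    uint32_t lr_banked;    /* Link Register of the exception mode */
    uint32_t spsr_banked;  /* SPSR of the exception mode */
} cpu_model;

/* Returns the state after taking an exception. The caller supplies the
   mode bits and vector address that correspond to the exception. */
cpu_model enter_exception(cpu_model cpu, uint32_t next_instruction,
                          uint32_t mode_bits, uint32_t vector_address)
{
    cpu.lr_banked = next_instruction;                 /* step 1 */
    cpu.spsr_banked = cpu.cpsr;                       /* step 2 */
    cpu.cpsr = (cpu.cpsr & ~PSR_MODE_MASK)
             | (mode_bits & PSR_MODE_MASK);           /* step 3 */
    cpu.cpsr |= PSR_I;                                /* simplification: always disable IRQ */
    cpu.cpsr &= ~PSR_T;                               /* switch to ARM state */
    cpu.pc = vector_address;                          /* step 4 */
    return cpu;
}
```

Because the old CPSR, including its T bit, is kept in the SPSR, the state in force before the exception can be recovered from the saved copy.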
If the processor is Thumb-aware (Architecture 4T), it will restore the operating state (ARM or Thumb) which was in force at the time the exception occurred.
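4 ARM Assembly Language Basics
   4.1 Introduction
   4.2 Structure of an Assembler Module
   4.3 Conditional Execution
   4.4 The ARM's Barrel Shifter
   4.5 Loading Constants Into Registers
   4.6 Loading Addresses Into Registers
   4.7 Jump Tables
   4.8 Using the Load and Store Multiple Instructions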
The banking of registers gives rapid context switching for dealing with exceptions and privileged operations: see Chapter 3, Programmer's Model for a summary of the ARM register set.

Flexible load and store multiple instructions
The ARM's multiple load and store instructions allow any set of registers from a single bank to be transferred to and from memory by a single instruction.
4.2.1
4.2.2
4.2.3
General layout
The general form of lines in an assembler module is:
4.2.5
4.3 Conditional Execution
4.3.1 The ARM's ALU status flags
The ARM's Program Status Register contains, among other flags, copies of the ALU status flags:
   N   Negative result from ALU flag
   Z   Zero result from ALU flag
   C   ALU operation Carried out
   V   ALU operation oVerflowed

See Figure 3-4: Program Status Register format on page 3-10 for details. Data processing instructions change the state of the ALU's N, Z, C and V status outputs, but these are latched in the PSR's ALU flags only if a special bit (the S bit) is set in the instruction.
4.3.2 Execution conditions
Every ARM instruction has a 4-bit field that encodes the conditions under which it will be executed. These conditions refer to the state of the ALU N, Z, C and V flags as shown in Table 4-1: Condition codes on page 4-7. If the condition field indicates that a particular instruction should not be executed given the current settings of the status flags, the instruction will simply soak up one cycle but will have no other effect. If the current instruction is a data processing instruction, and the flags are to be updated by it, the instruction must be postfixed by an S. The exceptions to this are CMP, CMN, TST and TEQ, which always update the flags (since this is their only effect).

Examples
   ADD    r0, r1, r2    ; r0 = r1 + r2, don't update flags
   ADDS   r0, r1, r2    ; r0 = r1 + r2 and UPDATE flags
   ADDEQS r0, r1, r2    ; If Z flag set then r0 = r1 + r2,
                        ; and UPDATE flags
   CMP    r0, r1        ; Update flags based on r0 - r1
4.3.3
Not only has code size been reduced from seven words to four, but execution time has also decreased, as can be seen from Table 4-2: Only branches conditional and Table 4-3: All instructions conditional on page 4-9. These show the execution times for the simple case where r0 equals 1 and r1 equals 2. In this case, replacing branches with conditional execution of all instructions has given a saving of three cycles. With all inputs to the gcd algorithm, the conditional version of the code will execute in the same number of cycles (when both inputs are the same), or fewer cycles.
4.3.4
Table 4-3: All instructions conditional

   r0: a   r1: b   Instruction         Cycles
   1       2       CMP r0, r1          1
   1       2       SUBGT r0, r0, r1    Not executed - 1
   1       1       SUBLT r1, r1, r0    1
   1       1       BNE gcd             3
   1       1       CMP r0, r1          1
   1       1       SUBGT r0, r0, r1    Not executed - 1
   1       1       SUBLT r1, r1, r0    Not executed - 1
   1       1       BNE gcd             Not executed - 1
                                       Total = 10
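For comparison, the loop traced in the table corresponds roughly to the following C function. This is a sketch of the subtraction-based algorithm with the matching instructions noted in comments; it assumes both inputs are positive.

```c
/* Greatest common divisor by repeated subtraction.
   Both a and b must be greater than zero. */
int gcd(int a, int b)
{
    while (a != b) {        /* CMP r0, r1 ... BNE gcd */
        if (a > b)
            a = a - b;      /* SUBGT r0, r0, r1 */
        else
            b = b - a;      /* SUBLT r1, r1, r0 */
    }
    return a;
}
```

With a equal to 1 and b equal to 2, the loop body runs once (b becomes 1) and the function returns 1, matching the register values in the table.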
LSR
ASR   arithmetic shift right by n bits. The bits fed into the top end of the operand are copies of the original top (or sign) bit. (signed division by 2^n)

[Diagrams: shift operations, showing the Destination register and the carry flag (CF)]
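The sign-bit replication performed by ASR can be sketched portably in C, where right-shifting a negative signed value is implementation-defined. The function name is hypothetical, and the final conversion back to a signed type assumes the usual two's complement representation.

```c
#include <stdint.h>

/* Arithmetic shift right by n bits (0 <= n <= 31): the vacated top bits
   are filled with copies of the original sign bit. */
int32_t asr(int32_t value, unsigned n)
{
    uint32_t bits = (uint32_t)value;
    uint32_t shifted = bits >> n;               /* logical shift right */
    if (n > 0 && (bits & 0x80000000u))
        shifted |= ~(0xFFFFFFFFu >> n);         /* replicate the sign bit */
    return (int32_t)shifted;
}
```

Note that the result rounds towards minus infinity: shifting -7 right by one bit gives -4, whereas the C expression -7 / 2 gives -3.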
The barrel shifter can be used in several of the ARM's instruction classes. The options available in each case are described below.
4.4.1
Note that in the second example the assembler is left to work out how to split the constant 0xFC000003 into an 8-bit constant and an even rotate (in this case #0xFC000003 could be replaced by #0xFF, 6). For more information, see 4.5 Loading Constants Into Registers on page 4-14.

a register (optionally) shifted or rotated either by a 5-bit constant or by another register
For example:
   ADD   r0, r1, r2
   SUB   r0, r1, r2, LSR #10
   CMP   r1, r2, ROR r5
   MVN   r3, r2, RRX

Note that in the last example, the rotate right extended does not take a parameter, but rather rotates right by only a single bit. RRX is actually encoded by the assembler as ROR #0.

Example: Constant multiplication
The ARM core provides a powerful multiplication facility in the MUL and MLA instructions (plus UMULL, UMLAL, SMULL and SMLAL on processors that implement ARM Architectures 3M and 4M). These instructions make use of Booth's Algorithm to perform integer multiplication, taking up to 17 cycles to complete for MUL and MLA and up to 6 or 7 cycles to complete for UMULL, UMLAL, SMULL and SMLAL. In cases where the multiplication is by a constant, it can be quicker to make use of the barrel shifter, as the operations it provides are effectively multiply / divide by powers of two.
Using a move/add/subtract combined with a shift, all multiplications by a constant which is a power of two, or a power of two +/- 1, can be carried out in a single cycle. See Chapter 5, Exploring ARM Assembly Language.
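The three cases can be illustrated in C as follows. The function names are hypothetical, and the ARM instructions quoted in the comments are one possible single-instruction form for each case, given here as an illustration rather than taken from this manual.

```c
#include <stdint.h>

/* x * 8, a power of two:            MOV r0, r0, LSL #3     */
uint32_t times_8(uint32_t x) { return x << 3; }

/* x * 9, a power of two plus one:   ADD r0, r0, r0, LSL #3 */
uint32_t times_9(uint32_t x) { return x + (x << 3); }

/* x * 7, a power of two minus one:  RSB r0, r0, r0, LSL #3 */
uint32_t times_7(uint32_t x) { return (x << 3) - x; }
```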
4.4.2
In pre-indexed instructions, the offset is calculated and added to the base, and the resulting address is used for the transfer. If writeback is selected, the transfer address is written back into the base register. In post-indexed instructions the offset is calculated and added to the base after the transfer. The base register is always updated by post-indexed instructions.

Example: Addressing an entry in a table of words
The following fragment of code calculates the address of an entry in a table of words and then loads the desired word:

        ; r0 holds the entry number [0,1,2,...]
        LDR   r1, =StartOfTable
        MOV   r3, #4
        MLA   r1, r0, r3, r1
        LDR   r2, [r1]
        ...
StartOfTable
        DCD   <table data>

It first loads the start address of the table, then moves the immediate constant 4 into a register, uses the multiply and accumulate instruction to calculate the address, and finally loads the entry.
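The address arithmetic performed by the fragment is simply the table start plus four times the entry number. A minimal C sketch, with hypothetical function names, shows both the multiply form and an equivalent shift form (a shift left by two bits is the same as multiplying by four):

```c
#include <stdint.h>

/* Address of entry number `entry` in a table of 32-bit words:
   start + entry * 4, as computed by the MLA in the fragment. */
uint32_t word_entry_address(uint32_t start_of_table, uint32_t entry)
{
    return start_of_table + entry * 4u;
}

/* The same address computed with a shift instead of a multiply. */
uint32_t word_entry_address_shift(uint32_t start_of_table, uint32_t entry)
{
    return start_of_table + (entry << 2);
}
```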
4.4.3
The assembler will attempt to generate a shifted 8-bit value to match the expression, the top four bits of which can be loaded into the top four bits of the PSR. This will not disturb the control bits. The flag bits are the only part of the CPSR which can be modified while in User mode (when no SPSRs are visible).
4.5.2
and so on, plus their bitwise complements. We can therefore load constants directly into registers using instructions such as:

   MOV   r0, #0xFF        ; r0 = 255
   MOV   r0, #0xFF, 30    ; r0 = 1020
   MOV   r0, #0xFF, 28    ; r0 = 4080
   MOV   r0, #0x40, 26    ; r0 = 4096

However, converting a constant into this form is an onerous task. The assembler therefore attempts the conversion itself. If the supplied constant cannot be expressed as a shifted 8-bit value or its bitwise complement, the assembler will report this as an error. The following example illustrates how this works. The left-hand column lists the ARM instructions entered by the user, while the right-hand column shows the assembler's attempts to convert the supplied constants to an acceptable form.
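The search that has to be carried out for each constant can be sketched in C: try every even rotation and see whether the constant fits in 8 bits once the rotation is undone. This is an illustration of the encoding rule only, not the assembler's actual algorithm, and the function names are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is reduced modulo 32). */
static uint32_t rotate_left(uint32_t value, unsigned n)
{
    n &= 31u;
    return n ? (value << n) | (value >> (32u - n)) : value;
}

/* Returns the smallest even rotation r (0 to 30) such that constant equals
   an 8-bit value rotated right by r, or -1 if no such rotation exists.
   Rotating left by r undoes a rotate right by r. */
int immediate_rotation(uint32_t constant)
{
    unsigned r;
    for (r = 0; r < 32; r += 2) {
        if (rotate_left(constant, r) <= 0xFFu)
            return (int)r;
    }
    return -1;
}

/* A constant is acceptable if it, or its bitwise complement, can be
   expressed as a rotated 8-bit value. */
int loadable_as_immediate(uint32_t constant)
{
    return immediate_rotation(constant) >= 0 ||
           immediate_rotation(~constant) >= 0;
}
```

For instance, `immediate_rotation(0xFC000003)` returns 6, corresponding to the form #0xFF, 6, while 0x101 has no valid rotation and is rejected.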
The above code is available as loadcon1.s in directory examples/basicasm. To assemble it, first copy it into your current working directory and then issue the command:

        armasm loadcon1.s -o loadcon1.o

To confirm that the assembler has produced the correct code, you can disassemble it using the ARM Object format decoder:

        decaof -c loadcon1.o
4.5.3
        MOV     pc, lr
LargeTable
        %       4200
        END
Note that the literal pools must be placed outside sections of code, since otherwise they would be executed by the processor as instructions. This will typically mean placing them between subroutines, as is done here, if more pools than the default one at END are required.

The above code is available as loadcon2.s in directory examples/basicasm. To assemble this, first copy it into your current working directory and then issue the command:

        armasm loadcon2.s

To confirm that the assembler has produced the correct code, the code area can be disassembled using the ARM Object format decoder:

        decaof -c loadcon2.o
4.6.1
4.6.2
The above code is available as loadcon4.s in directory examples/basicasm. To assemble this, first copy it into your current working directory and then issue the command:

        armasm loadcon4.s

To confirm that the assembler produced the correct code, the code area can be disassembled using the ARM Object format decoder:

        decaof -c loadcon4.o
4.6.3
main
srcstr  DCB     "This is my first (source) string",0
dststr  DCB     "This is my second (destination) string",0
        ALIGN                       ; realign address to word boundary
strcopy
        LDRB    r2, [r1], #1        ; load byte, then update address
        STRB    r2, [r0], #1        ; store byte, then update address
        CMP     r2, #0              ; check for zero terminator
        BNE     strcopy             ; keep going if not
        MOV     pc, lr              ; return
        END
ADR is used to load the addresses of the two strings into registers r0 and r1, for passing to strcopy. These two strings have been stored in memory using the assembler directive DCB (Define Constant Byte). The first string is 33 bytes long, so the ADR offset to the second string is not word-aligned. Non-word-aligned ADR offsets are limited to 255 bytes, so the second string is well within reach.
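For comparison, the strcopy loop corresponds closely to the following C sketch. The argument order (destination first, as in r0, then source, as in r1) follows the register usage above; the C function itself is an illustration and is not part of the example directory.

```c
/* Copy the zero-terminated string at src to dst, terminator included.
   Each statement mirrors one instruction of the assembly loop. */
void strcopy(char *dst, const char *src)
{
    char c;
    do {
        c = *src++;      /* LDRB r2, [r1], #1 : load byte, post-increment  */
        *dst++ = c;      /* STRB r2, [r0], #1 : store byte, post-increment */
    } while (c != 0);    /* CMP r2, #0 / BNE strcopy                       */
}
```

As in the assembly version, the terminator is copied before the loop exits, and the destination buffer is assumed to be large enough.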
but takes only one cycle to execute rather than two.

The above code is available as strcopy1.s in directory examples/basicasm. Copy this into your current working directory and assemble it, with debug information included:

        armasm strcopy1.s -g

Then link it and load it into the debugger:

        armlink strcopy1.o -o strcopy1 -d
        armsd strcopy1

You can now view the source and destination strings using:

        print/s @srcstr
        print/s @dststr

Run the program and check that the destination string has been updated:

        go
        print/s @srcstr
        print/s @dststr

Also in the examples directory is a version of this program called strcopy2.s, which uses LDR Rd,=PC-relative expression rather than ADR. Assemble this and compare the code and the code size with that of strcopy1.s, using the ARM Object format decoder:

        armasm strcopy2.s
        decaof -c strcopy2.o
        decaof -c strcopy1.o

It is preferable to use ADR wherever possible, both because it results in shorter code (no storage space is required for addresses to be placed in the literal pool) and because the resulting code will run more quickly (a non-sequential fetch from memory to get the address from the literal pool is not required).
Values outside this range will have the same effect as value 0.

        AREA    ArithGate, CODE     ; name this block of code
        ENTRY                       ; mark the first instruction to call
main
        MOV     r0, #2              ; set up three parameters
        MOV     r1, #5
        MOV     r2, #15
        BL      arithfunc           ; call the function
        SWI     0x11                ; terminate
arithfunc                           ; label the function
        CMP     r0, #4              ; Treat code as unsigned integer
        BHI     ReturnA1            ; If code > 4 then return first argument
        ADR     r3, JumpTable       ; Load address of the jump table
        LDR     pc, [r3,r0,LSL #2]  ; Jump to appropriate routine
JumpTable
        DCD     ReturnA1
        DCD     ReturnA2
        DCD     DoAdd
        DCD     DoSub
        DCD     DoRsb
ReturnA1
        MOV     r0, r1              ; Operation 0, >4
        MOV     pc, lr
ReturnA2
        MOV     r0, r2              ; Operation 1
        MOV     pc, lr
The ADR pseudo instruction loads the address of the jump table into r3. The following LDR then multiplies the function code in r0 by 4 (using the barrel shifter) and adds this onto the address of the jump table to give the address of the required entry within the jump table. The jump table itself is set up using the DCD directive, which stores the address of the relevant routine (placed there by the linker).

The above code is available as jump.s in directory examples/basicasm. Copy this into your current working directory and assemble and link it:

        armasm jump.s
        armlink jump.o -o jump

Then load the resulting program into the debugger:

        armsd jump

If you now execute the program:

        go

and display the registers:

        reg

the value of r0 should be 0x14.
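The same dispatch structure can be sketched in C with an array of function pointers. The bodies of DoSub and DoRsb are not shown in the listing above, so the subtract and reverse-subtract operand orders used here are assumptions; DoAdd is assumed to be a plain add, which is consistent with the expected result of 0x14 for the arguments 5 and 15.

```c
#include <stdint.h>

typedef int32_t (*arith_fn)(int32_t a1, int32_t a2);

static int32_t return_a1(int32_t a1, int32_t a2) { (void)a2; return a1; }
static int32_t return_a2(int32_t a1, int32_t a2) { (void)a1; return a2; }
static int32_t do_add(int32_t a1, int32_t a2) { return a1 + a2; }
static int32_t do_sub(int32_t a1, int32_t a2) { return a1 - a2; } /* assumed */
static int32_t do_rsb(int32_t a1, int32_t a2) { return a2 - a1; } /* assumed */

/* Equivalent of the DCD entries: one code address per function code. */
static const arith_fn jump_table[] = {
    return_a1, return_a2, do_add, do_sub, do_rsb
};

int32_t arithfunc(uint32_t code, int32_t a1, int32_t a2)
{
    if (code > 4u)                    /* CMP r0, #4 / BHI ReturnA1       */
        return a1;
    return jump_table[code](a1, a2);  /* LDR pc, [r3, r0, LSL #2]        */
}
```

Treating the function code as unsigned means a single comparison rejects both negative and too-large values, exactly as the BHI does.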
4.8.2
4.8.3
Increment/decrement, before/after
The base address for the transfer can either be incremented or decremented between register transfers, and this can happen either before or after each register transfer:

        STMIA   r10, {r1, r3-r5, r8}

The suffix IA could also have been IB, DA or DB, where I indicates increment, D decrement, A after and B before. In all cases the lowest numbered register is transferred to or from the lowest memory address, and the highest numbered register to or from the highest address. The order in which the registers appear in the register list makes no difference. Also, the ARM always performs sequential memory accesses in increasing memory address order. Therefore decrementing transfers actually perform a subtraction first and then increment the transfer address register by register.
4.8.5
Stack notation
Since the load and store multiple instructions have the facility to update the base register (which for stack operations can be the stack pointer), these instructions provide single instruction push and pop operations for any number of registers (LDM being pop, and STM being push). The Load and Store Multiple Instructions can be used with several types of stack:
ascending or descending
        A stack is able to grow upwards, starting from a low address and progressing to a higher address (an ascending stack), or downwards, starting from a high address and progressing to a lower one (a descending stack).
empty or full
        The stack pointer can either point to the top item in the stack (a full stack), or the next free space on the stack (an empty stack).

As stated above, pop and push operations for these stacks can be implemented directly by load and store multiple instructions. To make it easier for the programmer, special stack suffixes can be added to the LDM and STM instructions (as an alternative to Increment/Decrement and Before/After suffixes) as follows:

        STMFA   r13!, {r0-r5}       ; Push onto a Full Ascending Stack
        LDMFA   r13!, {r0-r5}       ; Pop from a Full Ascending Stack
        STMFD   r13!, {r0-r5}       ; Push onto a Full Descending Stack
        LDMFD   r13!, {r0-r5}       ; Pop from a Full Descending Stack
        STMEA   r13!, {r0-r5}       ; Push onto an Empty Ascending Stack
        LDMEA   r13!, {r0-r5}       ; Pop from an Empty Ascending Stack
        STMED   r13!, {r0-r5}       ; Push onto an Empty Descending Stack
        LDMED   r13!, {r0-r5}       ; Pop from an Empty Descending Stack
Note the use of r13 as the base pointer here. By convention r13 is used as the system stack pointer (sp). In addition, the system stack will usually be Full Descending. The addressing modes are summarised in Table 4-4: Stack addressing modes, below.
5
5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
The example used can be found in file utoa1.s in directory examples/explasm. Its dtoa entry point converts a signed integer to a string of decimal digits (possibly with a leading '-'); its utoa entry point converts an unsigned integer to a string of decimal digits.
5.2.1
Algorithm
To convert a signed integer to a decimal string, generate a '-' and negate the number if it is negative; then convert the remaining unsigned value. To convert a given unsigned integer to a decimal string, divide it by 10, yielding a quotient and a remainder. The remainder is in the range 0-9 and is used to create the last digit of the decimal representation. If the quotient is non-zero it is dealt with in the same way as the original number, creating the leading digits of the decimal representation; otherwise the process has finished.
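The algorithm can be written down directly in C before looking at the assembly. The sketch below is illustrative only: the names utoa_dec and dtoa_dec are hypothetical (chosen to avoid clashing with library routines), and the convention of returning a pointer just past the last digit, without writing a terminator, is an assumption modelled on the post-incremented buffer pointer in the assembly version.

```c
#include <stdint.h>

/* Write the decimal digits of n at buf and return a pointer just past
   the last digit written. No terminator is stored. */
char *utoa_dec(char *buf, uint32_t n)
{
    uint32_t quotient  = n / 10u;
    uint32_t remainder = n - quotient * 10u;  /* n - 8*q - 2*q */

    if (quotient != 0)
        buf = utoa_dec(buf, quotient);        /* leading digits first */
    *buf++ = (char)('0' + remainder);         /* then the final digit */
    return buf;
}

/* Signed variant: emit '-' and negate, then convert the magnitude. */
char *dtoa_dec(char *buf, int32_t n)
{
    uint32_t magnitude = (uint32_t)n;

    if (n < 0) {
        *buf++ = '-';
        magnitude = 0u - magnitude;  /* unsigned negate: safe for INT32_MIN */
    }
    return utoa_dec(buf, magnitude);
}
```

Because the recursion emits the leading digits before storing its own digit, the string comes out in the correct order without a reversal pass.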
5.2.2
Implementation
utoa
        STMFD   sp!, {v1, v2, lr}   ; function entry - save some v-registers
                                    ; and the return address.
        MOV     v1, a1              ; preserve arguments over following
        MOV     v2, a2              ; function calls
        MOV     a1, a2
        BL      udiv10              ; a1 = a1 / 10
        SUB     v2, v2, a1, LSL #3  ; number - 8*quotient
        SUB     v2, v2, a1, LSL #1  ; - 2*quotient = remainder
        CMP     a1, #0              ; quotient non-zero?
        MOVNE   a2, a1              ; quotient to a2...
        MOV     a1, v1              ; buffer pointer unconditionally to a1
        BLNE    utoa                ; conditional recursive call to utoa
        ADD     v2, v2, #'0'        ; final digit
        STRB    v2, [a1], #1        ; store digit at end of buffer
        LDMFD   sp!, {v1, v2, pc}   ; function exit - restore and return
5.2.4
The -li and -apcs 3/32bit options can be omitted if the tools are configured appropriately.
The action of this instruction is as follows:
1       Subtract 4 * number-of-registers from sp
2       Store the registers named in {...} in ascending register number order to memory at [sp], [sp,4], [sp,8] ...

The matching pop instruction was:

        LDMFD   sp!, {v1, v2, pc}

Its action is:
1       Load the registers named in {...} in ascending register number order from memory at [sp], [sp,4], [sp,8] ...
2       Add 4 * number-of-registers to sp.
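The two numbered sequences can be modelled in C on an ordinary array of words. This is a sketch for illustration only; push_fd and pop_fd are hypothetical names, and the stack pointer is represented as a pointer to 32-bit words rather than a byte address.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of STMFD sp!, {...}: lower sp by n words, then store the
   registers in ascending order starting at the new sp. */
uint32_t *push_fd(uint32_t *sp, const uint32_t *regs, size_t n)
{
    sp -= n;                          /* step 1: sp -= 4 * n bytes */
    for (size_t i = 0; i < n; i++)
        sp[i] = regs[i];              /* step 2: [sp], [sp,4], ... */
    return sp;
}

/* Model of LDMFD sp!, {...}: load the registers in ascending order
   from sp upwards, then raise sp by n words. */
uint32_t *pop_fd(uint32_t *sp, uint32_t *regs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        regs[i] = sp[i];              /* step 1: [sp], [sp,4], ... */
    return sp + n;                    /* step 2: sp += 4 * n bytes */
}
```

On a Full Descending stack the pointer always addresses the last item pushed, which is why the push moves the pointer before storing and the pop moves it after loading.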
Many, if not most, register-save requirements in simple assembly language programs can be met using this approach to stacks. A more complete treatment of run-time stacks requires a discussion of:
        stack-limit checking (and extension)
        local variables and stack frames
In the utoa program, you must assume the stack is big enough to deal with the maximum depth of recursion, and in practice this assumption will be valid. The biggest 32-bit unsigned integer is about four billion, or ten decimal digits. This means that at most 10 x 3 registers = 30 words = 120 bytes have to be stacked. Because the ARM Procedure Call Standard guarantees that there are at least 256 bytes of stack available when a function is called, and because we can guess (or know) that udiv10 uses no stack space, we can be confident that utoa is quite safe if called by an APCS-conforming caller such as a compiled C test harness.

The stacking technique illustrated here conforms to the ARM Procedure Call Standard only if the function using it makes no function calls. Since utoa calls both udiv10 and itself, it really ought to establish a proper stack frame; see The ARM Software Development Toolkit Reference Manual: Chapter 19, ARM Procedure Call Standard. If you really want to write functions that can 'plug and play together' you will have to follow the APCS exactly.
So the utoa example is APCS compatible, even though it is not APCS conforming. Be aware however that if you call any function whose stack use is unknown (but which is believed to be APCS-conforming), you court disaster unless you establish a proper APCS call frame and perform APCS stack limit checking on function entry. Please refer to The ARM Software Development Toolkit Reference Manual: Chapter 19, ARM Procedure Call Standard for further details.
Throughout the following discussion, registers are referred to using the names Rd, Rm, and Rs, but when trying the examples out for yourself you should use the default register names r0-r15, or names which have been declared using the RN assembler directive.
This section explains how to construct a sequence of ARM instructions to multiply by a constant. For some applications in which speed is essential (Digital Signal Processing, for example), multiply is used extensively. In many cases where a multiply is used, one of the values is a constant (eg. weeks*7). A naive programmer would assume that the only way to calculate this would be to use the MUL instruction, but there is an alternative. This section demonstrates how to improve the speed of multiply-by-constant by using a sequence of arithmetic instructions instead of the general-purpose multiplier.
5.3.1
Introduction
The MUL instruction has the following syntax:

        MUL     Rd, Rm, Rs

The timing of this instruction depends on the value in Rs. The ARM6 datasheet specifies that for Rs between 2^(2m-3) and 2^(2m-1)-1 inclusive, the multiply takes 1S + mI cycles.

Note
ARM 7M family processors have a different implementation of MUL. This leads to a different relationship of cycle counts to values of Rs.

When multiplying by a constant value, it is possible to replace the general multiply with a fixed sequence of adds and subtracts that have the same effect. For instance, multiply by 5 could be achieved using a single instruction:

        ADD     Rd, Rm, Rm, LSL #2  ; Rd = Rm + (Rm * 4) = Rm * 5

This is obviously better than the MUL version:

        MOV     Rs, #5
        MUL     Rd, Rm, Rs
The cost of the general multiply includes the instructions needed to load the constant into a register (up to four may be needed, or an LDR from a literal pool) as well as the multiply itself.
The second method is the optimal solution (fairly easy to find for small values such as 105). However, the problem of finding the optimum becomes much more difficult for larger constant values. A program can be written to search exhaustively for the optimum, but it may take a long time to execute. There are no known algorithms which solve this problem quickly. Temporary registers can be used to store intermediate results to help achieve the shortest sequence. For a large constant, more than one temporary may be needed, otherwise the sequence will be longer. The C compiler restricts the amount of searching it performs in order to minimise the impact on compilation time. The current version of armcc has a cut-off so that it uses a normal MUL if the number of instructions used in the multiply-by-constant sequence exceeds some number N. This is to avoid the sequence becoming too long.
The optimal multiply-by-constant sequence consists of just four data-processing instructions:

        ADD     Rd, Rm, Rm, LSL #1  ; Rd = Rm*3
        RSB     Rd, Rd, Rd, LSL #4  ; Rd = Rd*15 = Rm*45
        ADD     Rd, Rm, Rd, LSL #8  ; Rd = Rm + Rd*256 = Rm*11521
        ADD     Rd, Rd, Rm, LSL #6  ; Rd = Rd + Rm*64 = Rm*11585

The following table shows a comparison of these methods:

        Method                  Cycles
        MUL instruction         3 instructions + MUL internal cycles
        Multiply by constant    4 instructions
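The arithmetic behind such sequences is easy to check in C, where each statement stands for one shift-and-add instruction. This is a sketch for verification only; mul5 and mul11585 are hypothetical names, and the results wrap modulo 2^32 just as the 32-bit register results do.

```c
#include <stdint.h>

/* ADD Rd, Rm, Rm, LSL #2 : x + 4x = 5x */
uint32_t mul5(uint32_t x)
{
    return x + (x << 2);
}

/* The four-instruction sequence for multiplying by 11585. */
uint32_t mul11585(uint32_t x)
{
    uint32_t d = x + (x << 1);  /* ADD Rd, Rm, Rm, LSL #1 : x*3        */
    d = (d << 4) - d;           /* RSB Rd, Rd, Rd, LSL #4 : d*15 = x*45 */
    d = x + (d << 8);           /* ADD Rd, Rm, Rd, LSL #8 : x*11521     */
    d = d + (x << 6);           /* ADD Rd, Rd, Rm, LSL #6 : x*11585     */
    return d;
}
```

Comparing the result against an ordinary multiplication for a handful of inputs is a quick way to validate a hand-derived sequence before committing it to assembly.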
5.4.1
5.4.2
The lines marked with a # are the special cases 2^n, which have already been dealt with. The lines marked with a * have a simple repeating pattern.
For the repeating patterns, it is a relatively easy matter to calculate the product by using a multiply-by-constant method. The result can be calculated in a small number of instructions by taking advantage of the repetition in the pattern. This corresponds to the optimal solution in the multiply-by-constant problem (see 5.3 Multiplication by a Constant on page 5-8). The actual multiply is slightly unusual due to the need to return the top 32 bits of the 64-bit result. It is efficient to calculate just the top 32 bits. This can be achieved by modifying the multiply-by-constant sequence so that the input value is shifted right rather than left. Consider this fragment of the divide-by-ten code (x is the input dividend as used in the above equations):

        SUB     a1, x, x, lsr #2    ; a1 = x*%0.11000000000000000000000000000000
        ADD     a1, a1, a1, lsr #4  ; a1 = x*%0.11001100000000000000000000000000
        ADD     a1, a1, a1, lsr #8  ; a1 = x*%0.11001100110011000000000000000000
        ADD     a1, a1, a1, lsr #16 ; a1 = x*%0.11001100110011001100110011001100
        MOV     a1, a1, lsr #3      ; a1 = x*%0.00011001100110011001100110011001
The SUB calculates (for example):

        a1 = x - x/4 = x - x*%0.01 = x*%0.11

Therefore, just five instructions are needed to perform the multiply.
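The five-instruction multiply can be followed step by step in C. The first five statements mirror the fragment above; the final remainder check is not part of that fragment and is an assumption added here so that the function returns an exact quotient (the truncated shifts can leave the estimate one too small).

```c
#include <stdint.h>

/* Unsigned divide by ten using shifts and adds only. */
uint32_t udiv10(uint32_t x)
{
    uint32_t q = x - (x >> 2);  /* SUB: x * %0.11              */
    q += q >> 4;                /* ADD: x * %0.110011          */
    q += q >> 8;                /* ADD: x * %0.11001100110011  */
    q += q >> 16;               /* ADD: roughly x * 0.8        */
    q >>= 3;                    /* MOV: roughly x / 10         */

    /* Assumed correction step: fix up an estimate that is one short. */
    uint32_t r = x - q * 10u;
    if (r >= 10u)
        q++;
    return q;
}
```

For example, x = 100 gives an estimate of 9 with remainder 10, which the correction turns into the exact quotient 10.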
This section will only be of interest to designers working with Architecture 3 ARM cores and devices (eg. ARM6, ARM60, ARM610).
ARM processors designed using ARM Architecture 4 have instructions for loading and storing halfword values. ARM processors designed using version 3 of the architecture, while lacking halfword support, are still capable of handling 16-bit data efficiently, as this section will demonstrate. This section covers several different approaches to 16-bit data manipulation on ARM processors which do not have halfword support:
        Converting the 16-bit data to 32-bit data, and from then on treating it as 32-bit data
        Converting 16-bit data into 32-bit data when loading and storing, but using 32-bit data within ARM's registers
        Loading 16-bit data into the top 16-bits of ARM registers, and processing it as 16-bit data (ie. keeping the bottom 16-bits clear at all times)

Useful code fragments are given which can be used to help implement these different approaches efficiently.
5.5.1
5.5.2
Little-endian loading
Code fragments in this section which transfer a single 16-bit data item transfer it to the least significant 16 bits of an ARM register. The byte offset referred to is the byte offset within a word at the load address. eg. the address 0x4321 has a byte offset of 1.

One data item - any alignment (byte offsets 0,1,2 or 3)

The following code fragment loads a 16-bit value into a register, whether the data is byte, halfword or word-aligned in memory, by using the ARM's load byte instruction. This code is also optimal for the common case where the 16-bit data is half word-aligned, ie. at either byte offset 0 or 2 (but the same code is required to deal with both cases). Optimisations can be made when it is known that the data is at byte offset 0, and also when it is known to be at byte offset 2 (but not when it could be at either offset).

        LDRB    R0, [R2, #0]        ; 16-bit value is loaded from the
        LDRB    R1, [R2, #1]        ; address in R2, and put in R0
        ORR     R0, R0, R1, LSL #8  ; R1 is required as a
        MOV     R0, R0, LSL #16     ; temporary register
        MOV     R0, R0, ASR #16
The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data-processing operation by specifying the second argument as R0, ASR #16 rather than just R0.

One data item - byte offset 2

If the data is aligned on a half word boundary, but not a word boundary (ie. the byte offset is 2), then the following code fragment can be used (which is clearly much more efficient than the general case given above):

        LDR     R0, [R2, #-2]       ; 16-bit data is loaded from
        MOV     R0, R0, LSR #16     ; address in R2 into R0
                                    ; (R2 has byte offset 2)
The LSR should be replaced with ASR if the data is signed. Note that as in the previous example it may be possible to combine the MOV with another data processing operation.
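The any-alignment little-endian load described above can be mirrored in C, working byte by byte just as the two LDRB instructions do. This is an illustrative sketch; the function names are hypothetical, and the sign extension is written in a portable form rather than as a literal shift pair.

```c
#include <stdint.h>

/* Unsigned load: low byte at p[0], high byte at p[1]. Works for any
   alignment because only byte accesses are used. */
uint32_t load_u16_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8);  /* LDRB, LDRB, ORR */
}

/* Signed load: copy bit 15 into bits 16-31, which is the effect of
   the MOV ... LSL #16 / MOV ... ASR #16 pair. */
int32_t load_s16_le(const uint8_t *p)
{
    uint32_t v = load_u16_le(p);
    return (int32_t)(v ^ 0x8000u) - 0x8000;
}
```

The unsigned version needs no extra work after the ORR, which matches the remark that the two MOV instructions are only required for signed data.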
As before, LSR should be replaced with ASR if the data is signed. Also, it may be possible to combine the second MOV with another data processing operation. This code can be further optimised if non-word-aligned word-loads are permitted (ie. alignment faults are not enabled). This makes use of the way ARM rotates data into a register for non-word-aligned word-loads (see the appropriate ARM Datasheet for more information):

        LDR     R0, [R2, #2]        ; 16-bit value is loaded from the
        MOV     R0, R0, LSR #16     ; word-aligned address in R2
                                    ; into R0.
Two data items - byte offset 0

Two 16-bit values stored in one word can be loaded more efficiently than two separate values. The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

        LDR     R0, [R2, #0]        ; 2 unsigned 16-bit values are
        MOV     R1, R0, LSR #16     ; loaded from one word of
        BIC     R0, R0, R1, LSL #16 ; memory [R2]. The 1st is put in
                                    ; R0, and the 2nd in R1.
The version of this for signed data is:

        LDR     R0, [R2, #0]        ; 2 signed 16-bit values are
        MOV     R1, R0, ASR #16     ; loaded from one word of
        MOV     R0, R0, LSL #16     ; memory [R2]. The 1st is put in
        MOV     R0, R0, ASR #16     ; R0, and the 2nd in R1.
The address in R2 should be word-aligned (byte offset 0), in which case these code fragments load the data item in bytes 0-1 into R0, and the data item in bytes 2-3 into R1.
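The effect of these two fragments on an already-loaded word can be sketched in C. The function names are hypothetical and the word is passed by value, standing in for the result of the LDR; the item in bytes 0-1 of a little-endian word is the low half.

```c
#include <stdint.h>

/* Portable sign extension of the low 16 bits of v. */
static int32_t sign_extend16(uint32_t v)
{
    return (int32_t)((v & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* Unsigned pair: first = low half, second = high half. */
void unpack_u16_pair(uint32_t word, uint32_t *first, uint32_t *second)
{
    *second = word >> 16;               /* MOV R1, R0, LSR #16     */
    *first  = word & ~(*second << 16);  /* BIC R0, R0, R1, LSL #16 */
}

/* Signed pair: both halves are sign-extended to 32 bits. */
void unpack_s16_pair(uint32_t word, int32_t *first, int32_t *second)
{
    *second = sign_extend16(word >> 16);  /* MOV R1, R0, ASR #16       */
    *first  = sign_extend16(word);        /* LSL #16 then ASR #16 pair */
}
```

The BIC form clears exactly the bits that were copied into the second register, so no separate mask constant has to be loaded.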
The second MOV instruction can be omitted if the data is no longer needed after being stored. Unlike load operations, knowing the alignment of the destination address does not make optimisations possible.

Two data items - byte offset 0

Two unsigned 16-bit values in two registers can be packed into a single word of memory very efficiently, as the following code fragment demonstrates:

        ORR     R3, R0, R1, LSL #16 ; Two unsigned 16-bit values
        STR     R3, [R2, #0]        ; in R0 and R1 are packed into
                                    ; the word addressed by R2
                                    ; R3 is a temporary register
If the values in R0 and R1 are not needed after they are saved, R3 need not be used as a temporary register (one of R0 or R1 can be used instead). The version for signed data is:

        MOV     R3, R0, LSL #16     ; Two signed 16-bit values
        MOV     R3, R3, LSR #16     ; in R0 and R1 are packed into
        ORR     R3, R3, R1, LSL #16 ; the word addressed by R2
        STR     R3, [R2, #0]        ; R3 is a temporary register
Again, if the values in R0 and R1 are not needed after they are saved, R3 need not be used as a temporary register (R0 can be used instead).
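The packing step can be sketched in C as the construction of the word that the STR then writes. The function names are hypothetical; the unsigned version assumes, as the assembly does, that both inputs are already in the range 0 to 0xFFFF.

```c
#include <stdint.h>

/* Unsigned pair: ORR R3, R0, R1, LSL #16.
   Assumes first and second are both below 0x10000. */
uint32_t pack_u16_pair(uint32_t first, uint32_t second)
{
    return first | (second << 16);
}

/* Signed pair: the sign bits of `first` must be cleared before the
   ORR, which is what the LSL #16 / LSR #16 pair achieves. */
uint32_t pack_s16_pair(int32_t first, int32_t second)
{
    uint32_t low = (uint32_t)first & 0xFFFFu;
    return low | ((uint32_t)second << 16);
}
```

Only the first value needs masking in the signed case: the sign bits of the second value are shifted out of the word by the LSL #16.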
The two MOV instructions are only required if the 16-bit value is signed, and it may be possible to combine the second MOV with another data-processing operation by specifying the second argument as R0, ASR #16 rather than simply R0.

One data item - byte offset 0

If the data is aligned on a word boundary, the following code fragment can be used (which is clearly much more efficient than the general case given above):

        LDR     R0, [R2, #0]        ; 16-bit value is loaded from the
        MOV     R0, R0, LSR #16     ; word-aligned address in R2
                                    ; into R0.
The LSR should be replaced with ASR if the data is signed. Note that as in the previous example it may be possible to combine the MOV with another data-processing operation.

One data item - byte offset 2

If the data is aligned on a halfword boundary, but not a word boundary (ie. the byte offset is 2) the following code fragment can be used (again a significant improvement over the general case):

        LDR     R0, [R2, #-2]       ; 16-bit value is loaded from the
        MOV     R0, R0, LSL #16     ; address in R2 into R0. R2 is
        MOV     R0, R0, LSR #16     ; aligned to byte offset 2
As before, LSR should be replaced with ASR if the data is signed. Also, it may be possible to combine the second MOV with another data-processing operation.
Two data items - byte offset 0

Two 16-bit values stored in one word can be loaded more efficiently than two separate values. The following code loads two unsigned 16-bit data items into two registers from a word-aligned address:

        LDR     R0, [R2, #0]        ; 2 unsigned 16-bit values are
        MOV     R1, R0, LSR #16     ; loaded from one word of memory.
        BIC     R0, R0, R1, LSL #16 ; The 1st in R0, the 2nd in R1.
The version of this for signed data is:

        LDR     R0, [R2, #0]        ; 2 signed 16-bit values are
        MOV     R1, R0, ASR #16     ; loaded from one word of memory.
        MOV     R0, R0, LSL #16     ; The 1st in R0, the 2nd in R1.
        MOV     R0, R0, ASR #16
5.5.5
Big-endian storing
The code fragment in this section which transfers a single 16-bit data item transfers it from the least-significant 16 bits of an ARM register. The byte offset referred to is the byte offset from a word address of the store address; eg. the address 0x4321 has a byte offset of 1.

One data item - any alignment (byte offsets 0,1,2 or 3)

The following code fragment saves a 16-bit value to memory, whatever the alignment of the data address:

        STRB    R0, [R2, #1]        ; 16-bit value is stored to the
        MOV     R0, R0, ROR #8      ; address in R2.
        STRB    R0, [R2, #0]
        MOV     R0, R0, ROR #24
The second MOV instruction can be omitted if the data is no longer needed after being stored. Unlike load operations, knowing the alignment of the destination address does not make optimisations possible.
Again, if the values in R0 and R1 are not needed after they are saved, R3 need not be used as a temporary register (R0 can be used instead).
5.5.6
The examples given above for loading and storing 16-bit data into the bottom half of ARM registers can be easily adapted to load the data into the top half of the registers (and ensure the bottom half is all zero), or save the data from the top half of the registers.
This operation is performed for all the new bits needed (ie. 32 bits). The entire operation can be coded compactly by making maximal use of the ARM's barrel shifter:

        ; enter with seed in R0 (32 bits), R1 (1 bit in least significant bit)
        ; R2 is used as a temporary register.
        ; on exit the new seed is in R0 and R1 as before
        ; Note that a seed of 0 will always produce a new seed of 0.
        ; All other values produce a maximal length sequence.
        ;
        TST     R1, R1, LSR #1      ; top bit into Carry
        MOVS    R2, R0, RRX         ; 33 bit rotate right
        ADC     R1, R1, R1          ; carry into lsb of R1
        EOR     R2, R2, R0, LSL #12 ; (involved!)
        EOR     R0, R2, R2, LSR #20 ; (similarly involved!)
5.6.1
These options can be omitted if the tools have already been configured appropriately.
5.7.1
A demonstration program which should help explain how this works has been provided in source form in directory examples/explasm. To compile this program and run it under armsd, first copy bytedemo.c from directory examples/explasm to your current working directory, and then use the following commands:

        >armcc bytedemo.c -o bytedemo -li -apcs 3/32bit
        >armsd -li bytedemo
        A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
        ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
        Object program file bytedemo
        armsd: go

Note
This program uses ANSI control codes, so should work on most terminal types under Unix and also on the PC. It will not work on HP-UX if the terminal emulator used is an HPTERM. An XTERM should be used to run this program on HP-UX.
5.9.1
LDM / STM
Use LDM and STM instead of a sequence of LDR or STR instructions wherever possible. This provides several benefits:
        The code is smaller (and thus will cache better on an ARM processor with a cache)
        An instruction fetch cycle and a register copy back cycle are saved for each LDR or STR eliminated
        On an uncached ARM processor (for LDM) or an unbuffered ARM processor (for STM), non-sequential memory cycles can be turned into faster sequential memory cycles
5.9.2
Conditional execution
In many situations, branches around short pieces of code can be avoided by using conditionally executed instructions. This reduces the size of code and may avoid a pipeline break.
5.9.3
5.9.4
Addressing modes
The ARM instruction set provides a useful selection of addressing modes, which can often be used to improve the performance of code; eg. using LDR or STR pre- or post-indexed with a non-zero offset increments the base register and performs the data transfer. For full details of the addressing modes available, refer to the appropriate ARM datasheet.
5.9.5
Multiplication
Be aware of the time taken by the ARM multiply and multiply accumulate instructions. When multiplying by a constant value, note that using the multiply instruction is often not the optimal solution. The issues involved are discussed in 5.3 Multiplication by a Constant on page 5-8.
5.9.7
Loop unrolling
Loop unrolling can be a useful technique, but detailed analysis is often necessary before using it. In some situations it can reduce performance. Loop unrolling involves using more than one copy of the inner loop of an algorithm. The following benefits may be gained by loop unrolling:
        the branch back to the beginning of the loop is executed less frequently
        it may be possible to combine some of one iteration with some of the next iteration, and thereby significantly reduce the cost of each iteration. A common case of this is combining LDR or STR instructions from two or more iterations into single LDM or STM instructions. This reduces code size, the number of instruction fetches, and in the case of LDM, the number of register writeback cycles.
As an example to illustrate the issues involved in loop unrolling, consider calculating the following over an array: x[i] = y[i] - y[i+1]. Below is a code fragment which performs this:

        LDR     R2, [R0]            ; Preload y[i]
Loop
        LDR     R3, [R0, #4]!       ; Load y[i+1]
        SUB     R2, R2, R3          ; x[i] = y[i] - y[i+1]
        STR     R2, [R1], #4        ; Store x[i]
        MOV     R2, R3              ; y[i+1] is the next y[i]
        CMP     R0, R4              ; Finished ?
        BLT     Loop
First examine the number of execution cycles this will take on an ARM6 based processor, where:

        IF      stands for Instruction Fetch
        WB      stands for Register Write Back
        R       stands for Read
        W       stands for Write
Analysing how this code executes for a y[] array of size 100 produces the following results:

        169 IF (18 caused by branching), 101 R, 10 WB, 100 W (380 cycles)
        Code size: 16 instructions
        Saving over unrolled code: 630 IF, 91 WB

Thus for this problem, unless the extra seven instructions make the code too large, unrolling ten times is likely to be the optimum solution.
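The structure of the transformation can be seen in a C sketch of the same calculation. This is an illustration only, unrolled four times rather than ten to keep it short; the function names are hypothetical, and the cycle counts quoted above apply to the assembly versions, not to this code.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled loop: x[i] = y[i] - y[i+1] for i in [0, n).
   y must hold n + 1 elements. */
void diff_rolled(int32_t *x, const int32_t *y, size_t n)
{
    int32_t cur = y[0];                 /* preload y[i] */
    for (size_t i = 0; i < n; i++) {
        int32_t next = y[i + 1];
        x[i] = cur - next;
        cur = next;                     /* y[i+1] is the next y[i] */
    }
}

/* Unrolled four times: four loads are grouped together (a candidate
   for one LDM) and the loop branch is taken a quarter as often. */
void diff_unrolled4(int32_t *x, const int32_t *y, size_t n)
{
    size_t i = 0;
    int32_t cur = y[0];
    for (; i + 4 <= n; i += 4) {
        int32_t y1 = y[i + 1], y2 = y[i + 2], y3 = y[i + 3], y4 = y[i + 4];
        x[i]     = cur - y1;
        x[i + 1] = y1 - y2;
        x[i + 2] = y2 - y3;
        x[i + 3] = y3 - y4;
        cur = y4;
    }
    for (; i < n; i++) {                /* leftover iterations */
        int32_t next = y[i + 1];
        x[i] = cur - next;
        cur = next;
    }
}
```

The trailing loop handles array sizes that are not a multiple of the unrolling factor, which is one source of the extra code size that has to be weighed against the saved cycles.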
This could be unrolled as follows:

Loop
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        LDMIA   R0!, {R3-R14}
        STMIA   R1!, {R3-R14}
        CMP     R0, #LimitAddress
        BLT     Loop
In this code the CMP and BLT will be executed only a quarter as often, but this will give only a small saving. However, other issues should be taken into account:
        If in the above case the amount of data to be transferred was not a multiple of 48 words (four transfers of 12 registers each), then this amount of loop unrolling will copy too much data. This may be catastrophic, or may merely be inefficient.
        On a cached ARM processor, the larger the inner loop, the more likely it is that the loop will not stay entirely in the cache. In this case, it is not obvious at what point the performance gain due to unrolling is offset by the performance loss due to cache misses, or the disadvantage of larger code.
        On an ARM processor with a write buffer, the loop unrolling in the above example is unlikely to help. If the data being copied is not in the cache, then every LDMIA will be stalled while the write buffer empties. Thus the time the CMP and BLT take is irrelevant, as the processor will be stalled on the following LDMIA.
5.9.8
This advice is not applicable to systems which use the ARM FPA co-processor nor to code using the software floating-point library.
If the software-only floating-point emulator is being used, floating-point instructions should be placed sequentially, as the floating-point emulator will detect that the next instruction is also a floating-point instruction, and will emulate it without leaving the undefined instruction code.
This technique is only appropriate to uncached ARM processors, and is intended for memory systems in which non-sequential memory accesses take longer than sequential memory accesses.
Consider a system where the length of memory bursts is B. That is, if executing a long sequence of data operations, the memory accesses which result are: one non-sequential memory cycle followed by B-1 sequential memory cycles. An example of this is DRAM controlled by the ARM memory manager MEMC1a. This sequence of memory accesses will be broken up by several ARM instruction types:
- Load or Store (single or multiple)
- Data Swap
- Branch instructions
- SWIs
- Other instructions which modify the PC
By placing these instructions carefully, so that they break up the normal sequence of memory cycles only where a non-sequential cycle was about to occur anyway, the number of sequential cycles which are turned into longer non-sequential cycles can be minimised. For a memory system which has memory bursts of length B, the optimal position for instructions which break up the memory cycle sequence is 3 words before the next B-word boundary. To help explain this, consider a memory system with memory bursts of length 4 (ie. quad-word bursts). The optimal position for these break-up instructions is 16-12=4 bytes from a quad-word offset. The following code demonstrates this:
    0x0000  Data Instr 1
    0x0004  STR
    0x0008  Data Instr 2
    0x000C  Data Instr 3
    0x0010  Data Instr 4
Taking into account the ARM instruction pipeline, the memory cycles executing this code will produce:
    Instruction Fetch 0x0000 (Non Seq)
    Instruction Fetch 0x0004 (Seq)
    Instruction Fetch 0x0008 (Seq)      + Execute Data Instr 1
    Instruction Fetch 0x000C (Seq)      + Execute STR
    Data Write        (Non Seq)
    Instruction Fetch 0x0010 (Non Seq)  + Execute Data Instr 2
    Instruction Fetch 0x0014 (Seq)      + Execute Data Instr 3
The instruction fetch after the Data Write cycle had to be a non-sequential cycle, but since the instruction fetch was of a quad-word-aligned address, it had to be non-sequential anyway. Therefore, the STR is optimally positioned to avoid changing sequential instruction fetches into non-sequential instruction fetches.
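The placement rule can be checked arithmetically. The C fragment below is a small sketch under the assumptions stated in the text (32-bit instructions, bursts of B words); the function name is_optimal_break_position is hypothetical and is not part of any ARM tool.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if an instruction at byte address addr
 * lies exactly 3 words before the next burst boundary, where a burst is
 * burst_words 32-bit words long; returns 0 otherwise. */
int is_optimal_break_position(uint32_t addr, uint32_t burst_words)
{
    assert(burst_words > 0 && (addr % 4u) == 0);

    uint32_t burst_bytes = burst_words * 4u;

    /* 3 words (12 bytes) after addr must be a burst boundary. */
    return ((addr + 12u) % burst_bytes) == 0;
}
```

With quad-word bursts, address 0x0004 satisfies the test (0x0004 + 12 = 0x0010), which matches the position of the STR in the example.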
You can find related information in Chapter 13, Benchmarking, Performance Analysis, and Profiling. A full description of the ARM C compiler is given in The ARM Software Development Toolkit Reference Manual: Chapter 2, C Compiler.
Some of the rules presented here are quite general; some are quite specific to the ARM or the ARM C compiler. It should be clear from context which rules are portable. When writing C, there are a number of considerations which, if handled intelligently, will result in more compact and efficient ARM code:
- The way functions are written, their size, and the way in which they call each other. This is discussed in 6.2.1 Function design, below.
- The distribution of variables within functions, and their scoping. This affects the register allocation of variables, and the frequency with which they are spilled to memory: see 6.2.2 Register allocation and how to help it on page 6-6.
- The use of alternatives to the switch() statement. Under certain circumstances, reductions in code size can be achieved by avoiding the use of switch(), as discussed in 6.2.4 The switch() statement on page 6-8.
6.2.1
Function design
Function call overhead on the ARM is small, and is often in proportion to the work done by the called function. Several features contribute to this:
- the minimal ARM call-return sequence is BL... MOV pc, lr, which is extremely economical
- the multiple load and store instructions, STM and LDM, which reduce the cost of entry to and exit from functions that must create a stack frame and/or save registers
- the ARM Procedure Call Standard, which has been carefully designed to allow two very important types of function call to be optimised so that the entry and exit overheads are minimal.
In general, it is a good idea to keep functions small, because this will help keep function calling overheads low. This section describes the conditions under which function call overhead is minimised, how small functions help the ARM C compiler, and explains how to assist the C compiler when functions cannot be kept small.
Leaf functions
In 'typical' programs, about half of all function calls made are to leaf functions (a leaf function is one that makes no calls from within its body). Often, a leaf function is rather simple. On the ARM, if it is simple enough to compile using just five registers (a1-a4 and ip), it will carry no function entry or exit overhead. A surprising proportion of useful leaf functions can be compiled within this constraint.
(Note that your version of armcc may produce slightly different code.) Here there is no function entry or exit overhead, and the function return has disappeared entirely: return is direct from __sys_alloc to malloc's caller. In this case, the basic call-return cost for the function pair has been reduced from:
    BL + BL + MOV pc,lr + MOV pc,lr
to:
    BL + B + MOV pc,lr
which works out as a saving of 25%.
Function arguments and argument passing
The final aspect of function design which influences low-level efficiency is argument passing. Under the ARM Procedure Call Standard, up to four argument words can be passed to a function in registers. Functions of up to four integral (not floating point) arguments are particularly efficient and incur very little overhead beyond that required to compute the argument expressions themselves (there may be a little register juggling in the called function, depending on its complexity). If more arguments are needed, then the 5th, 6th, etc., words will be passed on the stack. This incurs the cost of an STR in the calling function and an LDR in the called function for each argument word beyond four. To minimise argument passing:
- Try to ensure that small functions take four or fewer arguments. These will compile particularly well.
- If a function needs many arguments, try to ensure that it does a significant amount of work on every call, so that the cost of passing arguments is amortised.
- Factor out read-mostly global control state and make this static. If it has to be passed as an argument (to support multiple clients, for example), wrap it up in a struct and pass a pointer to it.
referenced throughout the program, but relatively rarely in any given function. Frequent references inside a function should be replaced by references to a local, non-static copy.
Note: Don't confuse control state with computational arguments, the values of which differ on every call.
- Collect related data into structs. Decide whether to pass pointers or struct values based on the use of each struct in the called function:
  - If few fields are read or written, passing a pointer is best.
  - The cost of passing a struct via the stack is typically a share in an LDM-STM pair for each word of the struct. This can be better than passing a pointer if on average each field is used at least once, and the register pressure in the function is high enough to force a pointer to be repeatedly re-loaded.
As a general rule, you cannot lose much efficiency if you pass pointers to structs rather than struct values. To gain efficiency by passing struct values rather than pointers usually requires careful study of a function's machine code.
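The guidelines above can be illustrated with a short C sketch. The type transform_cfg and the function apply_transform are hypothetical names invented for this example. The function takes two argument words, so both fit in registers under the APCS, and the read-mostly control state is wrapped in a struct and passed by pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-mostly control state, collected into one struct. */
typedef struct {
    int scale;
    int offset;
    int clamp_min;
    int clamp_max;
} transform_cfg;

/* Two argument words: a pointer to the control state and the value. */
int apply_transform(const transform_cfg *cfg, int x)
{
    assert(cfg != NULL);

    /* Local, non-static copies of the fields used more than once. */
    int lo = cfg->clamp_min;
    int hi = cfg->clamp_max;

    int y = x * cfg->scale + cfg->offset;
    if (y < lo) y = lo;
    if (y > hi) y = hi;
    return y;
}
```

Passing the four fields individually together with x would need five argument words, so the fifth would go via the stack.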
6.2.2
6.2.4
In the first role, switch() is hard to improve upon: the ARM C compiler does a good job of deciding when to compile jump tables and when to compile trees of if-then-elses. It is rare for a programmer to be able to improve upon this by writing if-then-else trees explicitly in the source. In the second role, however, use of switch() is often mistaken. You can probably do better by being more aware of what is being computed and how. The example below is a simplified version of a function taken from an early version of the ARM C compiler's disassembly module. Its purpose is to map a 4-bit field of an ARM instruction to a 2-character condition code mnemonic. We will use it to demonstrate:
- the cost of implementing an in-line function using switch()
- how to implement the same function more economically
Here is the source:
    char *cond_of_instr(unsigned instr)
    {
        char *s;
        switch (instr & 0xf0000000)
        {
            case 0x00000000: s = "EQ"; break;
            case 0x10000000: s = "NE"; break;
            ... ... ...
            case 0xF0000000: s = "NV"; break;
++j)
This fragment compiles to 68 bytes of code and 128 bytes of table data. Already this is a 30% improvement on the switch() case, but this schema has other advantages: it copes well with a random code-to-string mapping and, if the mapping is not random, admits further optimisation. For example, if the code is stored in a byte (char) instead of an unsigned and the comparison is with (instr >> 28) rather than (instr & 0xF0000000), then only 60 bytes of code and 64 bytes of data are generated for a total of 124 bytes. Another advantage of table lookup is that it is possible to share the same table between a disassembler and an assembler: the assembler looks up the mnemonic to obtain the code value, rather than the code value to obtain the mnemonic. Where performance is not critical, the symmetric property of lookup tables can sometimes be exploited to yield significant space savings. Finally, by exploiting the denseness of the indexing and the uniformity of the returned value it is possible to do better again, both in size and performance, by direct indexing:
    char *cond_of_instr(unsigned instr)
    {
        return "\
    EQ\0\0NE\0\0CS\0\0CC\0\0MI\0\0PL\0\0VS\0\0VC\0\0\
    HI\0\0LS\0\0GE\0\0LT\0\0GT\0\0LE\0\0AL\0\0NV" + (instr >> 28)*4;
    }
Note that the mnemonics must appear in encoding order (CS is condition 2 and CC is condition 3), and that the continuation lines of the string literal must start in the first column, because any leading white space would become part of the string.
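The sharing of one table between a disassembler and an assembler can be sketched as follows. This is an illustrative fragment, not the code of the ARM disassembly module: the names cond_names, cond_name and cond_code are assumptions, and the mnemonics are listed in the order of the ARM condition field encoding (0 = EQ through 15 = NV).

```c
#include <assert.h>
#include <string.h>

/* One shared table: index = condition code, entry = mnemonic. */
static const char cond_names[16][3] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "AL", "NV"
};

/* Disassembler direction: condition field of the instruction to mnemonic. */
const char *cond_name(unsigned instr)
{
    return cond_names[(instr >> 28) & 0xFu];
}

/* Assembler direction: mnemonic to condition code, or -1 if unknown.
 * A linear search is slower than direct indexing but needs no extra data. */
int cond_code(const char *mnemonic)
{
    assert(mnemonic != NULL);
    for (int i = 0; i < 16; ++i) {
        if (strcmp(cond_names[i], mnemonic) == 0) {
            return i;
        }
    }
    return -1;
}
```

The forward lookup is a single indexed access, while the reverse lookup trades speed for the space saved by not holding a second table.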
6.3.1
Compiler options
The ARM C compiler has a number of command line options which control the way in which code is generated. You can find a full list in The ARM Software Development Toolkit Reference Manual: Chapter 2, C Compiler. There are a number of compiler options which can affect the size and/or the performance of generated code.
-g
    -g severely impacts the size and performance of generated code, since it turns off all compiler optimisations. You should use it only when actually debugging your code, and it should never be enabled for a release build.
-Ospace, -Otime
    These options are complementary:
    -Ospace  optimises for code size at the expense of performance
    -Otime   optimises for performance at the expense of size
    They can be used together on different parts of a build. For example, -Otime could be used on time critical source files, with -Ospace being used on the remainder. If neither is specified, the compiler will attempt a balance between optimising for code size and optimising for performance.
-zpj0
    This disables crossjump optimisation. Crossjump optimisation is a space-saving optimisation whereby common sections of code at the end of each element in a switch() statement are identified and commoned together, each occurrence of the code section being replaced with a branch to the commoned code section. However, this optimisation can lead to extra branches being executed which may decrease performance, especially in interpreter-like applications which typically have large switch() statements. Use the -zpj0 option to disable this optimisation if you have a time-critical switch() statement. Alternatively, you can use:
        #pragma nooptimise_crossjump
6.4.1
In the C library build directory (eg. directory semi for the semi-hosted library), the options file is used to select variants of the C library. The supplied file contains the following:
    memcpy = fast
    divide = unrolled
    stdfile_redirection = off
    fp_type = module
    stack = contiguous
    backtrace = off
    thumb = false
Unrolled divide
The default divide implementation 'unrolled' is fast, but occupies a total of 416 bytes (55 instructions for the signed version plus 49 instructions for the unsigned version). This is an appropriate default for most Toolkit users who are interested in obtaining maximum performance.
Small divide
Alternatively you can change this file to select 'small' divide which is more compact at 136 bytes (20 instructions for signed plus 14 instructions for unsigned) but somewhat slower, as there is considerable looping overhead. For a comparison of the speed difference between these two routines, see Table 6-1: Signed division example timings on page 6-15 (the speed of divide is data-dependent).
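To show where the looping overhead of a compact divide comes from, here is a generic shift-and-subtract unsigned division written in C. It is a sketch of the general technique only and is not the C library's 'small' divide routine; the name udiv_small is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical looping divide: one quotient bit per iteration.
 * Returns n / d and, if rem is not NULL, stores n % d in *rem. */
uint32_t udiv_small(uint32_t n, uint32_t d, uint32_t *rem)
{
    assert(d != 0);

    uint32_t q = 0;
    uint64_t r = 0;   /* 64-bit so that the shift below cannot overflow */

    for (int bit = 31; bit >= 0; --bit) {
        r = (r << 1) | ((n >> bit) & 1u);   /* bring down the next bit */
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << bit;
        }
    }

    if (rem != NULL) {
        *rem = (uint32_t)r;
    }
    return q;
}
```

An unrolled variant repeats the loop body once per bit, removing the counter update and branch at the cost of code size, which is the trade-off between the two library options.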
6.4.2
6.4.3
Summary
The standard division routine used by the C library can be selected by using the options file in the C library build area. If the supplied variants are not suitable, you can write your own. For real-time applications, the maximum division time must be as short as possible to ensure that the calculation can complete in time. In this case, the functions __rt_sdiv64by32 and __rt_sdiv32by16 are useful.
6.5.1
The source code rtstand.s documents the options which you may want to change for your target. These are not covered here. The header file rtstand.h documents the functions which rtstand.s provides to the C programmer. A Thumb version of this file is located in thumb/rtstand.s.
Note: No support is provided for outputting data down the debugging channel. This can be done, but is specific to the target application. The example C programs described below use the ARM Debug Monitor available under armsd to output messages using in-line SWIs. See The ARM
6.5.2
You are now ready to experiment with the C standalone runtime system. In the examples below, the following options are passed to armcc, armasm, and in the first case armsd:
    -li
        specifies that the target is a little endian ARM.
    -apcs 3/32bit/hardfp
        specifies that the 32-bit variant of APCS 3 should be used. For armasm this is used to set the built-in variable {CONFIG} to 32. ARM FPA instructions are used for floating point operations.
These arguments can be changed if the target hardware differs from this configuration, or omitted if your tools have been configured to have these options by default. You may find it useful to study the sources to rtstand.s, errtest.c and memtest.c while working through the example programs.
6.5.3
A simple program
Let us first compile the example program errtest.c, and assemble the standalone runtime system. These can then be linked together to provide an executable image, errtest:
    armcc -c errtest.c -li -apcs 3/32bit/hardfp
    armasm rtstand.s -o rtstand.o -li -apcs 3/32bit
    armlink -o errtest errtest.o rtstand.o
We can then execute this image using armsd as follows:
    > armsd -li -size 512K errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
6.5.4
Error handling
The same program, errtest, can also be used to demonstrate error handling, by recompiling errtest.c and predefining the DIVIDE_ERROR macro:
    armcc -c errtest.c -li -apcs 3/32bit/hardfp -DDIVIDE_ERROR
    armlink -o errtest errtest.o rtstand.o
Again, we can now execute this image under armsd as follows:
    > armsd -li -size 512K errtest
    A.R.M. Source-level Debugger, version 4.10 (A.R.M.) [Aug 26 1992]
    ARMulator V1.20, 512 Kb RAM, MMU present, Demon 1.01, FPE, Little endian.
    Object program file errtest
    armsd: go
    (the floating point instruction-set is not available)
    Using integer arithmetic ...
    10000 / 0X0000000A = 0X000003E8
    10000 / 0X00000009 = 0X00000457
6.5.5
6.5.6
By Zero
    caller's pc = 0XE92DE000
    returning...
    Returning from __err_handler() with errnum = 0X80000202
    Program terminated normally at PC = 0x00008558 (__rt_exit + 0x10)
    +0010 0x00008558: 0xef000011 .... : > swi 0x11
    armsd: quit
    Quitting
    >
This time the floating point instruction set is found to be available, and when a floating point division by zero is attempted, __err_handler is called with the details of the floating point divide by zero exception. Note that if you have compiled errtest.c other than as in 6.5.5 longjmp and setjmp on page 6-20, you will not see precisely this dialogue with armsd.
6.5.8
6.5.9
Functions
__main, __rt_exit __rt_fpavailable __rt_trap __rt_alloc __rt_stkovf_split_* longjmp, setjmp __rt_sdiv, __rt_udiv, __rt_udiv10,__rt_udiv10, __rt_divtest
TOTAL
736
7
7.1 7.2 7.3
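[Figure: build flow. .s files are processed by armasm and C source module(s) (.c) by armcc -c, each producing .o files; armlink combines these with the C library to produce the executable.]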
Code which is produced by compilers is expected to adhere to the APCS at all times. Such code is said to be strictly conforming. Handwritten code is expected to adhere to the APCS when making calls to externally visible functions. Such code is said to be conforming. The ARM Procedure Call Standard comprises a family of variants. Each variant is exclusive, so that code which conforms to one cannot be used with code that conforms to another. Your choice of variant will depend on whether:
- the Program Counter is 32-bit or 26-bit
- stack limit checking is explicit (performed by code) or implicit (performed by memory management hardware)
- floating point values are passed in floating point registers
- code is reentrant or non-reentrant
For the full specification of the APCS, see The ARM Software Development Toolkit Reference Manual: Chapter 19, ARM Procedure Call Standard.
v1-v5, f4-f7
7.2.2
Modifying the compiler's output
Let us return to our original intention of coding the 64-bit integer addition using the Carry flag. Since use of the Carry flag cannot be specified in C, we must get the compiler to produce almost the right code, and then modify it by hand. Let us start with (incorrect) code which does not perform the carry addition:
    void add_64(int64 *dest, int64 *src1, int64 *src2)
    {
        dest->lo=src1->lo + src2->lo;
        dest->hi=src1->hi + src2->hi;
        return;
    }
You will find this in file examples/candasm/add64_2.c. Copy it to your current working directory, and then compile it to assembler source with the command:
    armcc -li -apcs 3/32bit -S add64_2.c
This produces source in add64_2.s, which will include something like the following code (though yours may be slightly different, depending on the version of armcc supplied with your release):
Comparing this to the C source we can see that the first ADD instruction produces the low order word, and the second produces the high order word. All we need to do to get the carry from the low to high word right is change the first ADD to ADDS (add and set flags), and the second ADD to an ADC (add with carry). This modified code is available in directory examples/candasm as add64_3.s.
What effect did the APCS have?
The most obvious way in which the APCS has affected the above code is that the registers have all been given APCS names. a1 holds a pointer to the destination structure, while a2 and a3 hold pointers to the operand structures. Both a4 and ip are used as temporary registers which are not preserved. The conditions under which ip can be corrupted will be discussed later in this chapter. This is a simple leaf function, which uses few temporary registers, so none are saved to the stack and restored on exit. Therefore a simple MOV pc,lr can be used to return. If we wished to return a result (perhaps the carry out from the addition) this would be loaded into a1 prior to exit. We could do this by changing the second ADD to ADCS (add with carry and set flags), and adding the following instructions to load a1 with 1 or 0 depending on the carry out from the high order addition.
    MOV     a1, #0
    ADC     a1, a1, #0
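For comparison, the effect of the ADDS/ADCS pair, including the returned carry out, can be written in portable C by testing for unsigned wrap-around. This is a sketch for illustration only; the type int64_sketch and the function add_64_carry are hypothetical names and are not files from examples/candasm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit value held as two 32-bit words. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} int64_sketch;

/* Adds *src1 and *src2 into *dest and returns the carry out (0 or 1). */
unsigned add_64_carry(int64_sketch *dest,
                      const int64_sketch *src1,
                      const int64_sketch *src2)
{
    assert(dest != NULL && src1 != NULL && src2 != NULL);

    uint32_t lo = src1->lo + src2->lo;
    uint32_t carry = (lo < src1->lo);        /* plays the role of C after ADDS */

    uint32_t hi = src1->hi + src2->hi;
    unsigned carry_out = (hi < src1->hi);    /* wrap in the high-word add */
    hi += carry;
    carry_out |= (hi < carry);               /* wrap caused by adding the carry */

    dest->lo = lo;
    dest->hi = hi;
    return carry_out;
}
```

The comparisons used to recover the carry are what make a pure C version longer than the two-instruction assembly sequence.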
Back to the first implementation
Although the first C implementation is inefficient, it shows us more about the APCS than the hand-modified version. We have already seen a4 and ip being used as non-preserved temporary registers. However, here v1 and lr are also used as temporary registers. v1 is preserved by being stored (together with lr) on entry. lr is corrupted, but a copy is saved onto the stack and then reloaded into pc when v1 is restored. Thus there is still only a single exit instruction, but now it is:
    LDMIA   sp!,{v1,pc}
lr
sp
sl
fp
sb
sp, sl, fp and sb must all be preserved on function exit for APCS conforming code. For more information refer to The ARM Software Development Toolkit Reference Manual: Chapter 19, ARM Procedure Call Standard.
7.3.1
Consider the following code:
    typedef struct two_ch_struct
    {
        char ch1;
        char ch2;
    } two_ch;

    two_ch max( two_ch a, two_ch b )
    {
        return (a.ch1>b.ch1) ? a : b;
    }
This is available in the directory examples/candasm as two_ch.c. It can be compiled to produce assembly language source using:
    armcc -S two_ch.c -li -apcs 3/32bit
where -li and -apcs 3/32bit can be omitted if armcc has been configured appropriately. Here is the code which armcc produced (the version of armcc supplied with your release may produce slightly different output to that listed here):
The STMDB instruction saves the arguments onto the stack, together with the frame pointer, stack pointer, link register and current pc value (this sequence of values is the stack backtrace data structure). a2 and a3 are then used as temporary registers to hold the required part of the structures passed, and a1 is a pointer to an area in memory in which the resulting struct is placed, all as expected.
7.3.2
The following structures are integer-like:
    struct { unsigned a:8, b:8, c:8, d:8; }
    union polymorphic_ptr { struct A *a; struct B *b; int *i; }
whereas the structure used in the previous example is not:
    struct { char ch1, ch2; }
An integer-like structure has its contents returned in a1. This means that a1 is not needed to pass a pointer to a result struct in memory, and is instead used to pass the first argument.
From this we can see that the contents of the half_words structure is returned directly in a1 as expected.
7.3.3
#16 #16
#16 #16
This code is fine for use with assembly language modules, but in order to use it from C we need to tell the compiler that this routine returns its 64-bit result in registers. This can be done by making the following declarations in a header file:
    typedef struct int64_struct
    {
        unsigned int lo;
        unsigned int hi;
    } int64;

    __value_in_regs extern int64 mul64(unsigned a, unsigned b);
The above assembly language code and declarations, together with a test program, are all in directory examples/candasm as the files mul64.s, mul64.h, int64.h and multest.c. To compile, assemble and link these to produce an executable image suitable for armsd, first copy them to your current directory, and then execute the following commands:
8
Advanced Linking
This chapter explains how to generate programs which use overlays, and the linker's scatter loading facility, and describes the use of the ARM shared libraries.
8.1 Using Overlays
8.2 ARM Shared Libraries
8.1 Using Overlays
The ARM linker has two different methods of generating applications that use overlays. These are selectable from the command line using the -OVERLAY and -SCATTER linker options. Note that these options are mutually exclusive.
-OVERLAY causes the linker to compute the size of the overlay segments automatically, and to abut distinct memory partitions. The linker generates a set of files in a directory specified by the -OUTPUT option. Overlay segments can be forced to specific memory addresses in a simple form of scatter loading. However, PCIT entries will be generated even for non-clashing overlays, producing extra overheads in terms of code size and execution speed. For this reason the -OVERLAY option is not recommended for generating scatter loaded images, and -SCATTER should be used instead.
-SCATTER instructs the linker to create either an extended AIF file or a directory of files. The overlays will be placed into load regions and the linker will add information to the executable to allow the overlay manager to copy the overlay segments from the correct load region. The directory of output files will be suitable for use in a ROM-based system. All the overlay segments must have an execution address specified in the scatter load description file. The linker will not place overlay segments automatically. The scatter loading scheme does not support dynamic overlays. With scatter loading, PCIT information is not generated for execution regions not marked as overlays, so these regions do not have any overlay overhead associated with them.
8.1.1
8.1.2 The overlay manager
The overlay manager for scatter loaded overlays and the conventional overlay scheme are very similar. Indeed, only the code for loading a segment need be different. For a scatter loaded application, the code will be of the form:
    Retry
    ;
    ; Use the overlay table generated by the linker. The table format
    ; is as follows:
    ; The first word in the table contains the number of entries in
    ; the table.
    ; There follows that number of table entries. Each entry is 3 words
    ; long:
    ; Word 1 Length of the segment in bytes.
    ; Word 2 Execution address of the PCIT section address. This is
    ;        compared against the value in R8. If the values are
    ;        equal we have found the entry for the called overlay.
    ; Word 3 Load address of the segment.
    ; Segment names are not used.
    ;
            IMPORT  |Root$$OverlayTable|
            LDR     r0,=|Root$$OverlayTable|
            LDR     r1,[r0],#4
    search_loop
            CMP     r1,#0
            MOVEQ   r0,#2               ; The end of the table has been reached
            BEQ     SevereErrorHandler  ; and the segment has not been found
            LDMIA   r0!,{r2,r3,r4}
            CMP     r8,r3
            SUBNE   r1,r1,#1
            BNE     search_loop
            LDR     r0,[ r8, #PCITSect_Base ]
            MOV     r1,r4
            MOV     r4,r2
            BL      MemCopy
where:
    Root$$OverlayTable is a symbol bound to the address of the linker generated overlay information table.
    SevereErrorHandler is a routine called when the overlay manager detects an error.
    MemCopy is a system-specific memory copy routine where r0 points to the destination area, r1 points to the source area, and r2 is the block size in bytes.
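The table search performed by this assembly code can also be expressed in C, which may make the table layout easier to follow. The fragment below is a sketch only: the type overlay_entry and the function find_overlay are hypothetical, and the table is modelled as an array of 32-bit words laid out as the comments above describe (a count word followed by three-word entries).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One table entry, in the order the words appear in the table. */
typedef struct {
    uint32_t length;     /* Word 1: length of the segment in bytes */
    uint32_t pcit_exec;  /* Word 2: execution address of the PCIT section */
    uint32_t load_addr;  /* Word 3: load address of the segment */
} overlay_entry;

/* Searches the table for the entry whose second word equals pcit_addr
 * (the value held in R8 in the assembly version). Returns 1 and fills
 * *out if found, or 0 if the end of the table is reached. */
int find_overlay(const uint32_t *table, uint32_t pcit_addr, overlay_entry *out)
{
    assert(table != NULL && out != NULL);

    uint32_t count = table[0];
    const uint32_t *p = table + 1;

    for (uint32_t i = 0; i < count; ++i, p += 3) {
        if (p[1] == pcit_addr) {
            out->length    = p[0];
            out->pcit_exec = p[1];
            out->load_addr = p[2];
            return 1;
        }
    }
    return 0;   /* the caller would invoke its severe error handler here */
}
```

On a match, the caller would copy length bytes from load_addr to the segment's base address, as the call to MemCopy does above.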
In the scatter loaded example supplied with the toolkit (in the scatter.s file in directory examples/scatter), the overlay manager initialisation routine has no work to do, as all memory copying and initialisation is done as part of the scatter loaded image initialisation. When using the -OVERLAY option, the overlay manager code would be:
    Retry
    ;
    ; Call a routine to load the overlay segment.
    ; First parameter is the length of the segment name.
    ; The second parameter is the address of the segment name
    ; The third parameter is the base address of the segment.
    ; The routine returns the segment length in r0.
    ;
            MOV     r0,#12
            ADD     r1, r8, #PCITSect_Name
            LDR     r2, [ r8, #PCITSect_Base]
            BL      LoadOverlaySegment
            TEQ     r0,#0
            MOVEQ   r0,#2
            BEQ     SevereErrorHandler
LoadOverlaySegment loads the named segment. In a non-embedded environment, this routine would be implemented to load the segment from a file somewhere on the file system. This is the case in the overlmgrs.s file in directory examples/overlay. In an embedded environment where the code is in some form of nonvolatile memory, the overlay segments would need to be packaged up with sufficient information for a LoadOverlaySegment implementation to load the segments correctly. For example, the overlay could be put into a pseudo file system in nonvolatile memory and the segments accessed by name. This 'packaging up' operation would need to be carried out after linking. The ARM software development toolkit does not do this, as it will be highly specific to the application's run time environment. In the overlay example supplied with the toolkit, the overlay manager initialisation routine is used to copy read/write data from the load address to the execution address.
8.1.3
            AREA    InitApp, CODE, READONLY
            EXPORT  InitialiseApp
    InitialiseApp
            ADR     r0,ziTable
            MOV     R3,#0
    ziLoop
            LDR     r1,[r0],#4
            CMP     r1,#0
            BEQ     initLoop
            LDR     r2,[r0],#4
    ziFillLoop
            STR     r3,[r2],#4
            SUBS    r1,r1,#4
            BNE     ziFillLoop
            B       ziLoop
    initLoop
            LDR     r1,[r0],#4
            CMP     r1,#0
            MOVEQ   pc,lr
            LDMIA   r0!,{r2,r3}
            CMP     r1,#16
            BLT     copyWords
    copy4Words
            LDMIA   r3!,{r4,r5,r6,r7}
            STMIA   r2!,{r4,r5,r6,r7}
            SUBS    r1,r1,#16
            BGT     copy4Words
            BEQ     initLoop
    copyWords
            SUBS    r1,r1,#8
            LDMIAGE r3!,{r4,r5}
            STMIAGE r2!,{r4,r5}
            BEQ     initLoop
            LDR     r4,[r3]
            STR     r4,[r2]
            B       initLoop
    ;
    ; A couple of MACROS to make the table entries easier to add.
    ; The execname parameter is the name of execution to initialise or copy.
    ;
            MACRO
            ZIEntry $execname
            LCLS    lensym
            LCLS    basesym
            LCLS    namecp
    namecp  SETS    "$execname"
    lensym  SETS    "|Image$$":CC:namecp:CC:"$$ZI$$Length|"
    basesym SETS    "|Image$$":CC:namecp:CC:"$$ZI$$Base|"
            IMPORT  $lensym
            IMPORT  $basesym
            DCD     $lensym
            DCD     $basesym
            MEND

            MACRO
            InitEntry $execname
            LCLS    lensym
            LCLS    basesym
            LCLS    loadsym
            LCLS    namecp
    namecp  SETS    "$execname"
    lensym  SETS    "|Image$$":CC:namecp:CC:"$$Length|"
    basesym SETS    "|Image$$":CC:namecp:CC:"$$Base|"
    loadsym SETS    "|Load$$":CC:namecp:CC:"$$Base|"
            IMPORT  $lensym
            IMPORT  $basesym
            IMPORT  $loadsym
            DCD     $lensym
            DCD     $basesym
            DCD     $loadsym
            MEND

    ziTable
            ZIEntry root        ; Zero initialised data from the root read/write
                                ; region
            DCD     0
    InitTable
            InitEntry root      ; Initialised data from the root read/write region
            DCD     0
            END
Each execution region that needs zero-initialised data to be initialised must have an entry in ziTable of the form:
    ZIEntry name
where name is the name of the execution region. Similarly, each execution region that needs to be copied must have an entry in InitTable of the form:
    InitEntry name
The InitialiseApp routine is not called automatically at startup: it must be called explicitly before the application main program is entered. This code can be found in the initapp.s file in directory examples/scatter.
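The work done by InitialiseApp can be summarised in C as two table walks: one that zero-fills regions and one that copies initialised data from its load address to its execution address. The sketch below is an illustration under simplifying assumptions, not a translation of initapp.s: the type and function names are hypothetical, the two tables are passed separately rather than laid out back to back, and each table ends with an entry whose length is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry for a zero-initialised region. */
typedef struct {
    size_t length;      /* length in bytes; 0 terminates the table */
    void  *base;        /* execution address of the region */
} zi_entry;

/* Hypothetical table entry for a region that must be copied. */
typedef struct {
    size_t      length; /* length in bytes; 0 terminates the table */
    void       *base;   /* execution address (destination) */
    const void *load;   /* load address (source) */
} init_entry;

void initialise_app(const zi_entry *zi, const init_entry *init)
{
    assert(zi != NULL && init != NULL);

    /* First table: clear every zero-initialised region. */
    for (; zi->length != 0; ++zi) {
        memset(zi->base, 0, zi->length);
    }

    /* Second table: copy each region from load to execution address. */
    for (; init->length != 0; ++init) {
        memcpy(init->base, init->load, init->length);
    }
}
```

As with the assembly routine, a function like this would have to be called explicitly before the main program uses any of the regions.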
8.2 ARM Shared Libraries
This section explains:
- what an ARM shared library is
- how the shared library mechanism works
- how to instruct the ARM linker to make a shared library
- how to make a toy shared library from the string section of the ANSI C library
8.2.1
8.2.2
Each member of the entry vector is a proxy for a function in the matching shared library.
When a client first calls a proxy function, the call is directed to a dynamic linker. This is a small function (typically about 50-60 ARM instructions) which:
- locates the matching shared library
- if required, copies an initial image of the library's static data from the library to the place holding area in the stub
- patches the entry vector so each proxy function points at the corresponding library function
- resumes the call
Once an entry vector has been patched, all future proxy calls proceed directly to the target library function with only minimal indirection delay and no intervention by the dynamic linker. Making an inter-link-unit call like this is more expensive than making a straightforward local procedure call, but not by much. It is also the only supported way to call a function more than 32MBytes away.
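The patch-on-first-call behaviour can be modelled in portable C with a table of function pointers. This is a toy model, not the ARM dynamic linker: the names entry_vector, dynamic_link, proxy0, proxy1 and the two library functions are invented for the sketch, and the library is assumed to be already located.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*lib_fn)(int);

/* Stand-ins for two functions that live in the shared library. */
int lib_double(int x) { return 2 * x; }
int lib_negate(int x) { return -x; }
static const lib_fn library_entries[2] = { lib_double, lib_negate };

int dynlink_calls = 0;          /* counts entries to the toy dynamic linker */

int proxy0(int x);
int proxy1(int x);

/* Entry vector in the stub: every slot starts out as a proxy. */
lib_fn entry_vector[2] = { proxy0, proxy1 };

/* Toy dynamic linker: patch every slot, then resume the original call. */
int dynamic_link(size_t slot, int arg)
{
    size_t i;
    assert(slot < 2);
    dynlink_calls++;
    for (i = 0; i < 2; i++)
        entry_vector[i] = library_entries[i];
    return entry_vector[slot](arg);
}

int proxy0(int x) { return dynamic_link(0, x); }
int proxy1(int x) { return dynamic_link(1, x); }

/* Client-side call: always goes through the entry vector. */
int call_entry(size_t slot, int arg)
{
    return entry_vector[slot](arg);
}
```

Only the first call through any slot reaches dynamic_link; later calls cost one extra indirection through the patched vector.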
8.2.3
Please refer to The ARM Software Development Toolkit Reference Manual: Chapter 6, Linker for a full explanation of parameter blocks.
8.2.4
A primitive location mechanism might be to search a ROM for a matching string. This would identify the start of the parameter block of the matching shared library. Immediately preceding it will be negative offsets to library entry points and a non-negative count word containing the number of entry points. By working backwards through memory and counting, you can be sure you have found the entry vector and can return the address of its count word to the dynamic linker.

More sophisticated location schemes are possible, for example:
- You might include in your library a header containing code to be executed when the library is first loaded (into RAM) or initialised (in ROM) which registers the library's name with a library manager. Since the library manager has to be locatable without using the library manager, either its address has to be known or its function has to be supported by an underlying system call.
- You might adopt a scheme similar to that which is used by Acorn's RISC OS operating system. This supports a module mechanism which is often used to implement shared libraries. A RISC OS module may, by declaring so in its module header, be called when software interrupts (SWIs) in a specified range occur. When such a module is loaded, it extends the range of SWIs interpreted by the operating system. This mechanism can be used to locate a shared library by storing the identity of a library location SWI in the library's parameter block, and by implementing this SWI in the library module's header.
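A minimal sketch of the primitive search follows. It makes several simplifying assumptions that are not taken from the toolkit: the parameter block is a single word (rather than a string), the library has no static data image between the offsets and the parameter block, and the ROM is presented as an array of 32-bit words. The function name find_entry_vector is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, lowest address first:
 *     [count] [negative offset 1] ... [negative offset count] [parameter block]
 * Returns the index of the count word, or -1 if no match is found. */
long find_entry_vector(const int32_t *rom, size_t nwords, int32_t key)
{
    size_t p;
    assert(rom != NULL);

    for (p = 0; p < nwords; p++) {
        size_t i = p;
        int32_t seen = 0;

        if (rom[p] != key)
            continue;               /* not a candidate parameter block */

        /* Work backwards over the negative entry point offsets. */
        while (i > 0 && rom[i - 1] < 0) {
            i--;
            seen++;
        }

        /* The next word back must be the count, and must agree. */
        if (seen > 0 && i > 0 && rom[i - 1] == seen)
            return (long)(i - 1);
    }
    return -1;
}
```

Checking the count against the number of offsets actually seen is what rejects a word that merely happens to equal the parameter block.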
8.2.5
An immediate consequence of the second rule is that it is impossible to make two shared libraries which refer to one another: to make the second library and its stub would require the stub of the first, but to make the first and its stub would require the stub of the second. The first rule is not 100% necessary, and is difficult to enforce. The linker warns you if it finds a non-reentrant code area in the list of objects to be linked into a shared library, but will build the library and its matching stub anyway. You must decide whether the warning is real, or merely a formality.
Linker outputs

The ARM linker generates a shared library as two files:
- a plain binary file containing the read-only, reentrant, usually position independent, shared code
- an AOF format stub file with which client applications can be linked.

The linker can also generate a reentrant stub suitable for inclusion in another shared library. The library image file contains, in order:
- read only code sections from your input objects
- if requested, a read only copy of the initialised static data from the input objects
- a table of (negative) offsets from the end of the library to its entry points
- if requested, the size and offset of the static data image
- a copy of the library's parameter block

You request a copy of the initialised static data to be included in a library when you describe to the linker how to make a shared library. If you request this, the linker writes the length and offset of the data image immediately after the entry vector. During linking, armlink defines symbols SHL$$data$$Size and SHL$$data$$Base to have these values; components of your library may refer to these symbols.

Instead of including the static data in the stub, armlink includes a zero initialised place holding area of the same size. It also writes the length and (relocatable) address of this area immediately after the dynamic linker's entry veneer, giving the dynamic linker sufficient information to initialise the place holder at run time. During linking, the linker symbols SHL$$data$$Size and $$0$$Base describe this length and relocatable address.

Any data included in your shared library must be free of relocation directives. Please refer to The ARM Software Development Toolkit Reference Manual: Chapter 6, Linker for a full explanation of what kind of data can be included in a shared library.

You specify a parameter block when you describe to the linker how to make a shared library. You might, for example, include the name of the library in its parameter block, to aid its location. An identical copy of the parameter block is included in the library's entry vector in the stub file.

Describing a shared library to the linker

To describe a shared library to the linker you have to prepare a file which describes:
- the name of the library
- the library parameter block
- what data areas to include
- what entry points to export
For precise details of how to do this, please refer to The ARM Software Development Toolkit Reference Manual: Chapter 6, Linker. Below is an intuitive example you can work with and adapt:

        ; First, give the name of the file to contain the library
        ; strlib - and its parameter block - the single word 0x40000...
        > strlib \
          0x40000
        ; ...then include all suitable data areas...
        + ()
        ; ... finally export all the entry points...
        ; ... mostly omitted here for brevity of exposition.
        memcpy
        ...
        strtok

The name of this file is passed to armlink as the argument to the -SHL command line option: see The ARM Software Development Toolkit Reference Manual: Chapter 6, Linker for further details.
8.2.6
-apcs /reent
        tells armcc to compile reentrant code.
-zps1
        turns off software stack limit checking and allows the string library to be independent of all other objects and libraries. With software stack limit checking turned on, the library would depend on the stack limit checking functions which, in turn, depend on other sections of the C run time library. While such dependencies do not much obstruct the construction of full scale, production quality shared libraries, they are major impediments to a simple demonstration of the underlying mechanisms.
-I.
        tells armcc to look for needed header files in the current directory.
Linking the string library

To make a shared library and matching stub from string.o, use

        armlink -o strstub.o -shl strshl -s syms string.o

where:
-o      instructs the linker to put strlib's stub in strstub.o
-shl    points to the file which contains instructions for making a shared library called strlib
-s      asks for a listing of symbol values in a file called syms
You may later need to look up the value of EFT$$Offset. As supplied, the dynamic linker expects a library's external function table (EFT) to be at address 0x40000. So, unless you extend the dynamic linker with a library location mechanism, you will have to load strlib at the address 0x40000-EFT$$Offset.

Making the test program and dynamic linker

You should now assemble the dynamic linker and compile the test code:

        armasm -li dynlink.s dynlink.o
        armcc -li -c strtest.c

To make the test program you must link together the test code, the dynamic linker, the string library stub and the appropriate ARM C library (so that references to library members other than the string functions can be resolved):

        armlink -d -o strtest strtest.o dynlink.o strstub.o ../../lib/armlib.32l

Running the test program with the shared string library

Now you are ready to try everything under the control of command-line armsd. For this example the value of EFT$$Offset is assumed to be 0xa38.

        >armsd strtest
        A.R.M. Source-level Debugger version ...
        ARMulator V1.30, 4 Gb memory, MMU present, Demon 1.1,...
        Object program file strtest
        armsd: getfile strlib 0x40000-0xa38
        armsd: go
        strerror(42) returns unknown shared string-library error 0x0000002A
        Program terminated normally at PC = 0x00008354 (__rt_exit + 0x24)
        +0024 0x00008354: 0xef000011 .... : swi 0x11
        armsd: q
        Quitting
        >
Before starting strtest you must load the shared string library with the command:

        getfile strlib 0x40000-0xa38

where strlib is the name of the file containing the library, 0x40000 is the hard-wired address at which the dynamic linker expects to find the external function table, and 0xa38 is the value of EFT$$Offset, the offset of the external function table from the start of the library.

When strtest runs, it calls strerror(42) which causes the dynamic linker to be entered, the static data to be copied, the stub vector to be patched and the call to be resumed. You can watch this in more detail by setting a breakpoint on __rt_dynlink and single stepping.
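The load address passed to getfile is simple arithmetic, worked here in C as a sanity check. The function name library_load_address is hypothetical; the two values are the ones used in the example above.

```c
#include <assert.h>
#include <stdint.h>

/* Address at which the library image must be loaded so that its external
 * function table ends up at the address the dynamic linker expects. */
uint32_t library_load_address(uint32_t eft_address, uint32_t eft_offset)
{
    assert(eft_offset <= eft_address);  /* otherwise the image cannot fit below the table */
    return eft_address - eft_offset;
}
```

With eft_address 0x40000 and eft_offset 0xa38 this yields 0x3f5c8, which is the value armsd computes from the expression 0x40000-0xa38.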
9
9.1 Introduction
9.2 Application Startup
9.3 Using the C Library in ROM
9.4 Troubleshooting Hints and Tips
9.2 Application Startup
One of the main considerations with C code in ROM is the way in which the application initialises itself and starts executing. If there is an operating system present this causes no problem as the application is entered automatically via the main() function. In an embedded system there are a number of ways an image may be entered:
- via the RESET vector at location 0
- at the base address of the image

Applications entered via the RESET vector

The simplest case is where the application ROM is located at address 0 in the address map. The first instruction of your application will then be a branch instruction to the real entry point.

Applications entered at the base address

An application may be entered at its base address in one of two ways:
- The hardware can fake a branch at address 0 to the base address of the ROM.
- On RESET the ROM is mapped to address 0 by the memory management. When the application initialises the MMU it remaps the ROM to its correct address and performs a jump to the copy of itself running at the correct address.
9.2.1 Initialisation on RESET
Nothing is initialised on RESET so the entry point will have to perform some initialisation before it can call any C code. Typically, the initialisation code may perform some or all of the following:

define the entry point
The assembler directive ENTRY marks the entry point.
The above must be initialised before interrupts are enabled.

SP_abt  for data abort handling
SP_und  for undefined instruction handling
Generally, the above two will not be used in a simple embedded system, however you may wish to initialise them for debugging purposes. SP_svc must always be initialised
initialise the memory system
If your system has an MMU, the memory mapping must be initialised at this point, before interrupts are enabled and before any code is called which might rely on RAM being present at a particular address, either explicitly, or implicitly via the use of stack space.
Critical IO devices are any devices which must be initialised before interrupts are enabled. Typically these devices will need to be initialised at this point. If they are not, they may cause spurious interrupts when interrupts are enabled.
initialise any RAM variables required by the interrupt system
For example, if your interrupt system has buffer pointers to read data into memory buffers, the pointers must be initialised at this point before interrupts are enabled.

enable interrupts and change processor mode/state if necessary
At this stage the processor will be in Supervisor mode. If your application runs in User mode, you should change to User mode at this point. You will also need to initialise the User mode SP register.

initialise memory required by C code
The initial values for any initialised variables must be copied from ROM to RAM. All other variables must be initialised to zero.
        LDR     r0, =|Image$$RO$$Limit| ; Get pointer to ROM data
        LDR     r1, =|Image$$RW$$Base|  ; and RAM copy
        LDR     r3, =|Image$$ZI$$Base|  ; Zero init base => top of
                                        ; initialised data
        CMP     r0, r1                  ; Check that they are different
        BEQ     %1
0       CMP     r1, r3                  ; Copy init data
        LDRCC   r2, [r0], #4
        STRCC   r2, [r1], #4
        BCC     %0
1       LDR     r1, =|Image$$ZI$$Limit| ; Top of zero init segment
        MOV     r2, #0
2       CMP     r3, r1                  ; Zero init
        STRCC   r2, [r3], #4
        BCC     %2
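For readers more comfortable with C, the copy and zero-initialisation steps can be paraphrased as below. This is a sketch for explanation only, since in a real ROM image this work has to be done in assembler before C code can run. The function name init_c_memory is hypothetical, and its four parameters stand for the linker symbols Image$$RO$$Limit, Image$$RW$$Base, Image$$ZI$$Base and Image$$ZI$$Limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

void init_c_memory(const uint32_t *rom_data,  /* Image$$RO$$Limit: initial values in ROM */
                   uint32_t *rw_base,         /* Image$$RW$$Base:  start of RAM copy      */
                   uint32_t *zi_base,         /* Image$$ZI$$Base:  top of initialised data */
                   uint32_t *zi_limit)        /* Image$$ZI$$Limit: top of zero init data   */
{
    uint32_t *dst;

    assert(rw_base <= zi_base && zi_base <= zi_limit);

    /* Copy initialised data, unless the image already runs where it was loaded. */
    if (rom_data != rw_base) {
        for (dst = rw_base; dst < zi_base; dst++)
            *dst = *rom_data++;
    }

    /* Zero the remaining read/write data. */
    for (dst = zi_base; dst < zi_limit; dst++)
        *dst = 0;
}
```

The equality test corresponds to the case where the image has been linked as an ordinary application, so that the ROM data and the RAM copy share one address and no copy is needed.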
enter C code
If your application runs in Thumb state, you should change to Thumb state using:

        ORR     lr, pc, #1
        BX      lr

It is now safe to call C code provided that it does not rely on any memory being initialised. For example:

        IMPORT  C_Entry
        BL      C_Entry
9.2.2
; Now some standard definitions...
Mode_IRQ        EQU     0x12
Mode_SVC        EQU     0x13
I_Bit           EQU     0x80
F_Bit           EQU     0x40
SWI_Exit        EQU     0x11

; Locations of various things in our memory system
RAM_Base        EQU     0x10000000      ; 64k RAM at this base
RAM_Limit       EQU     0x10010000
IRQ_Stack       EQU     RAM_Limit
SVC_Stack       EQU     RAM_Limit-1024

; --- Define entry point
        EXPORT  __main  ; The symbol '__main' is defined here to ensure
__main                  ; the C runtime system is not linked in.
        ENTRY

; --- Setup interrupt / exception vectors
        IF :DEF: ROM_AT_ADDRESS_ZERO
; If the ROM is at address 0 this is just a sequence of branches
        B       Reset_Handler
        B       Undefined_Handler
        B       SWI_Handler
        B       Prefetch_Handler
        B       Abort_Handler
; Now fall into the LDR PC, Reset_Addr instruction which will continue
; execution at 'Reset_Handler'
Vector_Init_Block
        LDR     PC, Reset_Addr
        LDR     PC, Undefined_Addr
        LDR     PC, SWI_Addr
        LDR     PC, Prefetch_Addr
        LDR     PC, Abort_Addr
        NOP
        LDR     PC, IRQ_Addr
        LDR     PC, FIQ_Addr

Reset_Addr      DCD     Reset_Handler
Undefined_Addr  DCD     Undefined_Handler
SWI_Addr        DCD     SWI_Handler
Prefetch_Addr   DCD     Prefetch_Handler
Abort_Addr      DCD     Abort_Handler
                DCD     0               ; Reserved vector
IRQ_Addr        DCD     IRQ_Handler
FIQ_Addr        DCD     FIQ_Handler
        ENDIF
; The following handlers do not do anything useful in this example.
;
Undefined_Handler
        B       Undefined_Handler
SWI_Handler
; The RESET entry point
Reset_Handler

; --- Initialise stack pointer registers

; Enter IRQ mode and set up the IRQ stack pointer
        MOV     R0, #Mode_IRQ:OR:I_Bit:OR:F_Bit ; No interrupts
        MSR     CPSR, R0
        LDR     R13, =IRQ_Stack

; Set up other stack pointers if necessary
; ...

; Set up the SVC stack pointer last and return to SVC mode
        MOV     R0, #Mode_SVC:OR:I_Bit:OR:F_Bit ; No interrupts
        MSR     CPSR, R0
        LDR     R13, =SVC_Stack

; --- Initialise memory system
; ...

; --- Initialise critical IO devices
; ...

; --- Initialise interrupt system variables here
; ...

; --- Enable interrupts
; Now safe to enable interrupts, so do this and remain in SVC mode
        MOV     R0, #Mode_SVC:OR:F_Bit          ; Only IRQ enabled
        MSR     CPSR, R0

; --- Initialise memory required by C code
        IMPORT  |Image$$RO$$Limit|      ; End of ROM code (=start of ROM data)
        IMPORT  |Image$$RW$$Base|       ; Base of RAM to initialise
; --- Now we enter the C code
        IMPORT  C_Entry
        [ :DEF:THUMB
        ORR     lr, pc, #1
        BX      lr
        CODE16                  ; Next instruction will be Thumb
        ]
        BL      C_Entry

; In a real application we wouldn't normally expect to return, however
; this example does so the debug monitor swi SWI_Exit is used to halt the
; application.
        SWI     SWI_Exit
        END

--- ex.c -----------------------------------------------------------

/* We use the following Debug Monitor SWIs to write things out
 * in this example */
extern __swi(0) WriteC(char c); /* Write a character */
extern __swi(2) Write0(char *s); /* Write a string */

/* The following symbols are defined by the linker and define
 * various memory regions which may need to be copied or initialised */
extern char Image$$RO$$Limit[];
-apcs 3/noswst/nofp
        Tells the compiler not to include code to do software stack checking (noswst) and not to use a frame pointer (nofp).

2 Assemble the initialisation code init.s.

        armasm -apcs 3/noswst init.s

-apcs 3/noswst
        Tells the assembler that this code is only suitable for use with other code which does not have software stack checking. Code which uses software stack checking cannot generally be mixed with code which does not. The assembler will mark the object file as containing code which does not perform software stack checking so that the linker can give an error if it is mixed with code which does.
3 Build the ROM image using armlink.

        armlink -o ex1_rom -Bin -RO 0xf0000000 -RW 0x10000000 -First init.o(Init) -Remove -NoZeroPad -Map -Info Sizes init.o ex.o

-Bin
        Tells the linker to produce a plain binary image with no header. This is the most suitable form of image for putting in ROM.
-RO 0xf0000000
        Tells the linker that the ReadOnly or code segment will be placed at 0xf0000000 in the address map. This is the base of the ROM in this example.
-RW 0x10000000
        Tells the linker that the ReadWrite or data segment will be placed at 0x10000000 in the address map. This is the base of the RAM in this example.
-First init.o(Init)
        Tells the linker to place this area first in the image. Note that on Unix systems you may need to put a backslash \ before each bracket.
-Remove
        Tells the linker to remove any unused code areas. In this example there are no unused areas, however this is a useful option for larger ROM builds.
-Map -Info Sizes
        These two options tell the linker to output various sorts of information during the link process. Neither of these options is necessary to build the ROM but they are included here as an example. The output generated by each option is given below.

-Map tells the linker to print an AREA map or listing showing where each code or data section will be placed in the address space. The output from the above example is given below.

        AREA map of ex1_rom:
        Base      Size  Type  RO?  Name
        f0000000  e4    CODE  RO   !!! from object file init.o
        f00000e4  238   CODE  RO   C$$code from object file ex.o
        f000031c  10    CODE  RO   C$$constdata from object file ex.o
        10000000  4     DATA  RW   C$$data from object file ex.o
        10000004  140   ZERO  RW   C$$zidata from object file ex.o

This shows that the linker places three code areas at successive locations starting from 0xf0000000 (where our ROM is based) and two data areas starting at address 0x10000000 (where our RAM is based).

-Info Sizes tells the linker to print information on the code and data sizes of each object file along with the totals for each type of code or data.

                      code   inline  inline   'const'  RW    0-Init  debug
        object file   size   data    strings  data     data  data    data
        init.o        228    0       0        0        0     0       0
        ex.o          184    28      356      16       4     320     0
        Totals        412    28      356      16       4     320     0
The required ROM size will be the sum of the code size (412), the inline data size (28), the inline strings (356), the const data (16) and the RW data (4). In this example the required ROM size would be 816 bytes. This should be exactly the same as the size of the ex1_rom image produced by the linker. The required RAM size will be the sum of the RW data (4) and the 0-Init data (320), in this case 324 bytes. Note that the RW data is included in both the ROM and the RAM counts. This is because the ROM contains the initialisation values for the RAM data.
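The same arithmetic can be written down once and reused for other builds. The struct and function names below are hypothetical; the figures in the usage note are the totals quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Totals as reported by -Info Sizes, in bytes. */
typedef struct {
    unsigned code;
    unsigned inline_data;
    unsigned inline_strings;
    unsigned const_data;
    unsigned rw_data;
    unsigned zi_data;       /* the "0-Init data" column */
} image_sizes;

/* ROM holds everything except zero-initialised data, including the
 * initial values of the RW data. */
unsigned rom_bytes(const image_sizes *s)
{
    assert(s != NULL);
    return s->code + s->inline_data + s->inline_strings
         + s->const_data + s->rw_data;
}

/* RAM holds the RW data and the zero-initialised data. */
unsigned ram_bytes(const image_sizes *s)
{
    assert(s != NULL);
    return s->rw_data + s->zi_data;
}
```

For the totals 412, 28, 356, 16, 4 and 320 this gives 816 bytes of ROM and 324 bytes of RAM, matching the figures in the text.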
You should now be able to execute the ROM image. Set the PC to the base of the ROM image, then run it.

        pc=0xf0000000
        go

This should produce the following output:

        'factory_id' is at address 10000000, contents = AA55AA55
        'display' is at address 10000004
        Program terminated normally at PC = 0xf00000c8
        0xf00000c8: 0xef000011 .... : swi 0x11
9.2.4
In this case, the -o option will create a subdirectory called ex3_rom containing a single binary file called root. Note that -Scatter tells the linker not to pad the end of the output binary files with zeros. Hence the -NoZeroPad option is not required when using -Scatter.
9.3 Using the C Library in ROM
You are more likely to want to include particular standalone functions from the C library in your ROM.

Note
Standalone functions are functions which do not rely on any part of the operating system environment. The functions memcpy() and strcpy() are examples of standalone functions. fopen() is not standalone, since it relies on being able to open files which are part of the operating system.

Only standalone functions can be included easily in ROM. See 9.3.2 Standalone C functions on page 9-17 for a list of which functions in the C library are standalone. No special code is necessary in your C code to use a standalone C function, just use the function as normal. See 6.5 Using the C Library in Deeply Embedded Applications on page 6-17 for further details of runtime support for deeply embedded applications.
9.3.1
/* We define some more meaningful names here */
#define rom_data_base Image$$RO$$Limit
#define ram_data_base Image$$RW$$Base

void C_Entry(void)
{
    char s[80];

    if (rom_data_base == ram_data_base) {
        Write0("Warning: Image has been linked as an application. To link as a ROM image\r\n");
        Write0("         link with the options -RO <rom-base> -RW <ram-base>\r\n");
    }
    sprintf(s, "ROM is at address %p, RAM is at address %p\n",
            rom_data_base, ram_data_base);
    Write0(s);
}
---------------------------------------------------------------------
Building the ROM image

Build the ROM image with the following armlink command:

        armlink -o ex4_rom -Bin -RO 0xf0000000 -RW 0x10000000 -First init.o(Init) -Remove -NoZeroPad -Info Sizes init.o sprintf.o armlib.16l

If armlib.16l is not in the current directory, you will need to specify the directory on the command line. This will produce the following output:

                        code   inline  inline   'const'  RW    0-Init  debug
        object file     size   data    strings  data     data  data    data
        init.o          236    0       0        0        0     0       0
        sprintf.o       40     12      184      0        0     0       0

                        code   inline  inline   'const'  RW    0-Init  debug
        library member  size   data    strings  data     data  data    data
        _sprintf.o      56     8       0        0        0     0       0
        _sputc.o        16     0       0        0        0     0       0
        nofpdisp.o      4      0       0        0        0     0       0
        __vfpntf.o      1828   4       68       0        0     0       0
        rtudiv10.o      40     0       0        0        0     0       0
        strlen.o        68     0       0        0        0     0       0
        ctype.o         0      0       0        0        260   0       0
        ferror.o        8      0       0        0        0     0       0

                        code   inline  inline   'const'  RW    0-Init  debug
                        size   data    strings  data     data  data    data
        Object totals   276    12      184      0        0     0       0
        Library totals  2020   12      68       0        260   0       0
        Grand totals    2296   24      252      0        260   0       0
9.3.2 Standalone C functions
The following functions are standalone functions and may be safely used in standalone ROM code.

<string.h>
        memcpy   memmove  memset   memcmp   memchr
        strcpy   strncpy  strcat   strncat  strcmp
        strncmp  strcoll  strxfrm  strchr   strrchr
        strspn   strcspn  strpbrk  strstr   strtok

<ctype.h>
You must call the _ctype_init() function in your initialisation if you wish to use any of the ctype.h functions.
        isalnum  isalpha  iscntrl  isdigit  isgraph  islower
        isprint  ispunct  isspace  tolower  toupper  isxdigit

<math.h>
        acos     asin     sinh     cosh     tan
        tanh     pow      sqrt     log10    modf

<setjmp.h>
        setjmp   longjmp

<stdlib.h>
        atof     atoi     atol     strtod   strtol
        bsearch  qsort    abs      labs     div
        srand    wctomb   mbtowc   mbstowcs wcstombs

<locale.h>
You must call the _locale_init() function in your initialisation if you wish to use any of the locale.h functions.
        setlocale  localeconv
gmtime
9.4 Troubleshooting Hints and Tips
Problem
The linker reports a number of undefined symbols of the form: __rt_... or __16__rt_...

Cause
These are run time support functions which are called by code generated by the compiler to perform tasks which cannot be performed simply in ARM or Thumb code, such as integer division or floating point operations. For example, the following code will generate a call to the run time support function __rt_sdiv to perform a division.

        int test(int a, int b)
        {
            return a/b;
        }

Solution
You should assemble file examples/clstand/rtstand.s and link this in. A Thumb version of this file is available in the thumb subdirectory.

Note
The divide routines in rtstand.s use Demon SWIs to report division by zero. You may need to edit rtstand.s to change these SWIs if your system does not support them.

Problem
The linker produces the error message:

        ARM Linker: (Fatal) No entry point for image.
        ARM Linker: garbage output file aif removed

Cause
You have not defined an entry point. You must define the entry point even if the entry point is the start of the ROM image.
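To see why division needs run time support, it helps to look at what a software divide has to do when there is no divide instruction. The sketch below is a plain shift-and-subtract divide in C. It is not the code in rtstand.s and says nothing about how __rt_sdiv is actually written; the names soft_udiv and soft_sdiv are hypothetical, and division by zero is handled here with an assertion rather than a SWI.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned restoring division: produces one quotient bit per iteration. */
uint32_t soft_udiv(uint32_t n, uint32_t d)
{
    uint32_t q = 0;
    uint64_t r = 0;     /* 64 bits so that the shift below cannot overflow */
    int bit;

    assert(d != 0);
    for (bit = 31; bit >= 0; bit--) {
        r = (r << 1) | ((n >> bit) & 1u);
        if (r >= d) {
            r -= d;
            q |= (uint32_t)1 << bit;
        }
    }
    return q;
}

/* Signed divide that truncates towards zero, as C's '/' does. */
int32_t soft_sdiv(int32_t a, int32_t b)
{
    uint32_t ua, ub, q;

    assert(b != 0);
    assert(!(a == INT32_MIN && b == -1));   /* result not representable, as in C */

    ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
    q = soft_udiv(ua, ub);

    if ((a < 0) == (b < 0) || q == 0)
        return (int32_t)q;
    return -(int32_t)(q - 1u) - 1;          /* negate without overflowing */
}
```

A loop of this kind is too long to expand inline at every division, which is why the compiler emits a call to a library routine instead.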
The initialisation code may not be at the start of the image because you have omitted the -First option.

Solution
Try relinking with the -First option to see if this resolves the problem.

Problem
The image loads without problem but when trying to run, it crashes/hangs immediately.

Causes
Any of the causes in the previous problem may also apply here. Another possibility is that it has been linked or loaded at the wrong address.

Solution
Check that the address is the same on each of the following:
- the linker's -RO option
- the GetFile command in armsd
- the PC= command in armsd

If all this is correct, try setting the PC to the start and using the Step In command to step through all the initialisation code to see if it is going wrong in the initialisation.
10 The ARMulator

This chapter describes the ARM processor software emulator, ARMulator.

10.1 The ARMulator
10.2 Using the ARMulator Rapid Prototype Memory Model
10.3 Writing Custom Serial Drivers for ARM Debuggers
10.4 Rebuilding the ARMulator
10.1 The ARMulator
The ARMulator is a software emulator of the ARM processor forming part of ARM's debugger. It allows you to debug ARM programs on an emulated system. It can emulate any current ARM processor at the instruction level, including Thumb-aware processors. The ARMulator consists of four parts:

- the model of the ARM processor, together with various utility functions to support system modelling
  This part of the ARMulator is not customisable and handles all communication with the debugger. It is supplied in object form only, on Unix hosts, and is built into the debugger in the Windows Toolkit.

- a memory interface which transfers data between the ARM model and the memory model or memory management unit model
  The memory model is fully customisable. Example implementations are provided with the ARMulator. Features such as models of memory-mapped I/O can be provided through the memory interface. Three memory models are provided with the ARMulator. armfast (a fast model of 512Kb of RAM) and armvirt (a slower model which models a full 4Gb of physical memory) use the full-blown memory interface (as described in the Software Development Toolkit Reference Manual). armproto provides a simpler memory interface allowing more rapid prototyping of memory models. An example of building such a model is provided in 10.4 Rebuilding the ARMulator on page 10-13.

- a coprocessor interface to optional ARM coprocessor models
  Although the ARM floating-point instruction set is implemented using ARM coprocessors, the ARMulator does not use a coprocessor model to emulate these on the host. The coprocessor model does not handle such instructions, so they are passed through the undefined instruction vector, and the ARM-code floating-point emulator (FPE400) emulates the operations. The default coprocessor model (armproto.c) provides a cut-down coprocessor #15 model, allowing software control of the processor's endianness, abort behaviour, etc. This model is fully customisable, and models of other co-processors can be added easily.

- an operating system interface to provide an execution environment
  The operating system model (supplied in armos.c) directly implements some operating system calls (such as open file, read the clock etc.) on the debugger host. These calls form the basis for the library calls (eg. fopen() and time()) provided by the ANSI C library. This part of the ARMulator is also fully customisable. Extra SWIs can be added to provide more host system functionality to the debuggee. SWIs that are not handled by this model take the SWI trap and can be handled by ARM SWI handler code running on the ARMulator.
By modifying or rewriting the supplied default models, you can make a model of almost any ARM system, and use it to debug code. For simple modelling of systems with different RAM types and access speeds, the armvirt.c memory model supports memory map files. See The ARM Software Development Toolkit Reference Manual: Chapter 14, ARMulator for further details. User supplied memory models can also support map files using armvirt.c as a template. A complete description of the API between the ARM debugger and the memory model, coprocessor model and operating system, and some additional calls for setting timed callbacks etc., can be found in The ARM Software Development Toolkit Reference Manual: Chapter 14, ARMulator.
10.2 Using the ARMulator Rapid Prototype Memory Model
10.2.1 Overview
This section gives an example implementation of a memory system using the rapid prototype ARMulator memory model. The starting point for this model is the ARMulator armproto.c model. It gives the implementation of an example ARMul_MemAccess function, and discusses methods for improving the efficiency of this model.
This produces the following memory map:
[Figure: memory map. Region labels, in the order they appear: Abort, I/O port, Page select, Paged RAM, Read-only RAM. The only address label that survives is 00040000.]
The example does not consider different endian modes. It assumes that the ARM is configured to be the same endianness as the host architecture.

    #define OFFSET(addr)  ((addr) & 0x7fff)
    #define WORDOFF(addr) (OFFSET(addr)>>2)

    unsigned ARMul_MemoryInit(ARMul_State *state, unsigned long initmemsize)
    {
        ModelState *s;
        int i;

        s=(ModelState *)malloc(sizeof(ModelState));
        if (s==NULL) return FALSE;
        for (i=0;i<8;i++) {
            s->p[i]=(page *)malloc(sizeof(page));
            if (s->p[i]==NULL) return FALSE;
            memset(s->p[i], 0, sizeof(page));
        }
        state->MemDataPtr=(unsigned char *)s;
        s->mapped_in=0;
        state->MemSize=8*PAGESIZE; /* ignore initmemsize */
        ARMul_ConsolePrint(state, ", 1Mb memory");
        /* Ask ARMulator to clear aborts for us regularly */
        state->clearAborts=TRUE;
        return TRUE;
    }

The Exit function is shown below:

    void ARMul_MemoryExit(ARMul_State *state)
    {
        free(state->MemDataPtr);
    }

Finally you need to write the generic access function:

    ARMword ARMul_MemAccess(ARMul_State *state, ARMword address,
                            ARMword dataOut, ARMword mas1, ARMword mas0,
                            ARMword Nrw, ARMword seq, ARMword Nmreq,
                            ARMword Nopc, ARMword lock, ARMword trans,
                            ARMword account)
    {
        int highpage=(address & (1<<17));
        ModelState *s=(ModelState *)(state->MemDataPtr);
        page *mem;

        if (highpage)
            mem=s->p[s->mapped_in];
        else
            mem=s->p[0];

        if (Nmreq==LOW) { /* memory request */

The memory models must track the numbers of N, S, I and C cycles that occur in the ARMulator. These counts are used to provide the $statistics and $statistics_inc variables in armsd.

            if (account) { /* an ARMulator request */
                if (seq==LOW)
                    state->NumNcycles++;
                else
                    state->NumScycles++;
            }
            switch ((address>>30)&0x3) {
            case 0: /* 00 - memory access */
                if (Nrw==LOW)
                    return mem->word[WORDOFF(address)];

You do not need to extract the relevant byte or halfword presented on the data bus for byte or half-word loads, as the ARM will do this for you. Note that this is not true of the high speed memory interface.

                else /* write - need to do right width access */
                /* Ignore writes out of supervisor mode to the "low" page */
                if (highpage || account==FALSE || state->NtransSig==LOW) {

Note
The trans value supplied is not correct. Use the NtransSig in the ARMul_State instead.

                    if (mas0==LOW) { /* byte or word */
                        if (mas1==LOW) /* byte */
                            mem->byte[OFFSET(address)]=dataOut;
                        else
                            mem->word[WORDOFF(address)]=dataOut;
                    } else { /* half-word */
                        ARMword offset=OFFSET(address) & ~1;
                        mem->byte[offset]=dataOut>>8;
                        mem->byte[offset+1]=dataOut;
                    }
                }
                break;
        case 1: /* 01 - page select in SVC mode */

To change the mapped in page:

            if (state->NtransSig==LOW || account==FALSE) {
                s->mapped_in=(address>>16) & 7;
            }
            break;
        case 2: /* 10 - single byte I/O */
            ARMul_ConsolePrint(state,"%c",(address>>16) & 0xff);
            break;
        case 3: /* 11 - generate an abort */

There are two types of abort: prefetch abort and data abort. Use the appropriate macro. (A real ARM has only one abort pin.)

            if (Nopc==LOW) {
                ARMul_PREFETCHABORT(address);
            } else {
                ARMul_DATAABORT(address);
            }
            return ARMul_ABORTWORD;
            break;
        }
    } else { /* not a memory request */

MemAccess is called for all ARM cycles, not just memory cycles, and must keep count of these I and C cycles.

        if (seq==LOW) /* I-cycle */
            state->NumIcycles++;
        else
            state->NumCcycles++;
    }
    return 0;
}
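The address decoding used by this example model can be tried on its own. The sketch below is not part of the ARMulator sources: decode_region, is_high_page and selected_page are hypothetical helper names that only restate the bit tests made in the example ARMul_MemAccess.

```c
#include <stdint.h>

/* Hypothetical helpers, not ARMulator API: they restate the bit tests
 * used by the example ARMul_MemAccess. */

/* The top two address bits select the region:
 * 0 = memory, 1 = page select, 2 = single byte I/O, 3 = abort. */
static unsigned decode_region(uint32_t address)
{
    return (unsigned)((address >> 30) & 0x3u);
}

/* Bit 17 distinguishes the switchable "high" page from the fixed low page. */
static int is_high_page(uint32_t address)
{
    return (address & (1u << 17)) != 0;
}

/* For a page-select access, bits 18:16 give the page number (0 to 7). */
static unsigned selected_page(uint32_t address)
{
    return (unsigned)((address >> 16) & 7u);
}
```

For instance, an access to 0x40030000 decodes as region 1 (page select) and asks for page 3 to be mapped in.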
10.2.4 Improving performance
Whilst running an emulation of the ARM, the ARMulator spends a lot of time in the memory access functions. Small improvements in the efficiency of the memory model can give significant performance boosts. In the simple case of this memory model, the code can be optimised by removing the need to look up the number of the currently mapped in top page on each access. Instead, you can retain a pointer to it in the ARMul_State, on the MemSparePtr. The modified MemoryInit is shown below:

unsigned ARMul_MemoryInit(ARMul_State *state, unsigned long initmemsize)
{
    page *memory;

    memory=(page *)calloc(8, sizeof(page));
    if (memory==NULL) return FALSE;
    state->MemDataPtr=(unsigned char *)memory;
    /* attach page zero to the top page pointer */
    state->MemSparePtr=(unsigned char *)(&memory[0]);
    ...
}

MemAccess is similar to the following:

...
{
    int highpage=(address & (1<<17));
    page *mem;

    mem=(page *)(highpage ? state->MemSparePtr  /* high */
                          : state->MemDataPtr); /* low */
    ...

Only one load from memory is required to get the address of the page, which gives a marked improvement in performance.
In other memory models, you can also improve performance by:

- Using a form of tree to model the memory map. The memory model performs access checks on memory cycles, and these access permissions can be encoded by using multiple trees: one that maps the real physical memory, one that maps pages that are readable, and one that maps pages that are writable. Using such a system, the model need not look up whether a page is readable or writable before an access takes place. It can gain a handle onto the page from the appropriate tree structure, and test its validity. This may improve performance of the common case, where there is no exception, at the cost of slowing down the uncommon cases where an exception occurs, or where a page has yet to be allocated in the memory model. A similar trick can be used for distinguishing between cached and uncached memory, or for modelling memory-mapped I/O devices.

- Using the distinction between S- and N-cycles: cache a pointer into the memory model on non-sequential cycles and reuse it on sequential cycles. This is similar to the page-mode access provided by DRAM hardware.

  Take care, because in some circumstances the ARM will perform an I-cycle followed by an S-cycle and not the expected N-cycle. On ARM6 and ARM7 these merged I-S cycles occur for the N-cycle instruction prefetch following a data load. The ARM (and ARMulator) will perform a cycle marked as sequential where the address is not actually sequential from the previous memory access. However, on current processors the address of such a fetch is always sequential from the previous instruction fetch. On a model which uses, for example, a deep tree structure or a hash table, this trick can be used to remove the need to search the model, instead allowing the program to immediately find the appropriate page.

Using these and similar techniques ARM have developed a complete model of the ARM610 memory system (physical memory, cache and MMU based translation unit) which only has a 25% overhead over the standard armvirt memory model.
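The pointer-caching idea can be sketched in a few lines of C. This is only an illustration under stated assumptions: the types Page and Model, the functions slow_lookup and find_page, and the 4KB page size are all hypothetical and do not come from the ARMulator sources. A plain array stands in for the tree or hash table, and a counter records how often the full search is made.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12u               /* assumed 4KB pages */
#define NUM_PAGES 16u               /* assumed toy address space */

typedef struct { uint32_t word[1u << (PAGE_BITS - 2)]; } Page;

typedef struct {
    Page    *pages[NUM_PAGES];      /* stands in for a tree or hash table */
    Page    *cached_page;           /* page found on the last full search */
    uint32_t cached_tag;            /* page number of cached_page */
    unsigned slow_lookups;          /* counts full searches */
} Model;

/* Slow path: a full search of the model (here just an array index). */
static Page *slow_lookup(Model *m, uint32_t address)
{
    m->slow_lookups++;
    return m->pages[(address >> PAGE_BITS) % NUM_PAGES];
}

/* Returns the page for 'address'. On a sequential cycle the cached
 * pointer is reused, but only if the address is still in the same page. */
static Page *find_page(Model *m, uint32_t address, int sequential)
{
    uint32_t tag = address >> PAGE_BITS;

    if (sequential && m->cached_page != NULL && tag == m->cached_tag)
        return m->cached_page;

    m->cached_page = slow_lookup(m, address);
    m->cached_tag = tag;
    return m->cached_page;
}
```

Comparing the page number even on sequential cycles is a defensive choice in this sketch: it costs one comparison, and it keeps the model correct when a cycle is marked sequential although its address does not follow the previous access.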
10.3 Writing Custom Serial Drivers for ARM Debuggers
In addition to the ARMulator target, the ARM debuggers can be used to debug a remote target, using the remote debug protocol (RDP). RDP is a byte-stream protocol, allowing communication to take place over any kind of channel, provided that drivers are written for both ends of the link. At the DEMON end, the drivers should be written in C or ARM assembler as part of the port of DEMON to your target hardware. At the debugger end of the link, the new drivers must be included as part of armsd/windbg. This involves adding a module into armsd/windbg, using the supplied prebuilt objects and makefile.

For Win32 tools (Windows 95 and Windows NT), the remote drivers are packaged into a DLL so they can be easily replaced with a user-supplied DLL. This means that the debugger (armsd.exe or windbg.exe) does not need to be rebuilt, and the same DLL is used for both debuggers. Under Unix/DOS the entire tool needs to be rebuilt. See 10.4 Rebuilding the ARMulator on page 10-13 for further details.

It is possible to rebuild the RDP drivers for the following hosts:

- SunOS
- HP/UX
- Macintosh
- Windows 95 and Windows NT
- DOS

It is NOT possible to rebuild the RDP drivers for Windows 3.1, because the Windows 3.1 serial/parallel drivers are complicated by extra 'thunking' between the 32-bit application and 16-bit Windows. Users wishing to rebuild the RDP drivers should upgrade to Windows 95 or Windows NT.
serdrive.c    This is an implementation of the serial driver.
spdrive.c     This is an implementation of the serial/parallel driver.
drivers.c     New drivers are added to the debugger by including a pointer to their DriverDesc in the drivers array defined by drivers.c.
pirdi.h       Defines functions used by hostos.c to read and write bytes across the RDP link.

In addition, the DOS serial drivers include code for interrupt-driven serial I/O (in the file comsisr.c). Any new driver must provide a DriverDesc structure, defined in serdrive.h, and add this structure to the DriverList defined in drivers.h. The elements of this structure are:

name          This should be a unique name for the driver, which is used by armsd as a command-line argument. For example the standard serial driver defines the name "SERIAL".
OpenProc      This function opens a connection, and returns a handle onto it. This handle is passed into the other driver functions. On the standard Unix serial port drivers it merely opens the appropriate serial port and returns the (Unix) file handle.
              The configuration function is used to set the line speed on the connection and to initialise it.
              The read and write functions transfer data across the link. The RDP assumes that the link is error free, and that characters are not lost (the link must have flow control or adequate buffering).
CloseProc     The CloseProc is called when the RDP wishes to close the connection.
LoggingProc   This is called when the level of RDP logging is changed (for example when the $rdi_log variable is changed in armsd). For a description of the meaning of the logging values, see The ARM Software Development Toolkit Reference Manual: Chapter 7, Symbolic Debugger.
10.4 Rebuilding the ARMulator
Under Unix, DOS, and on the Macintosh the ARMulator is linked onto a core debugger, to produce an armsd executable image. Under Windows, the ARMulator exists as a Windows DLL.
New sources (for example memory models and serial drivers) should be placed in the source directory. Any new memory models can be added to the ARMulator Makefile by adding a rule to it for the model. For example:

example.o: $(SRC)example.c $(HFILES)
        $(CC) $(CFLAGS) -c $(SRC)example.c

Any new serial drivers need similar rules, and must also be added to the OFILES list of object files at the top of the Makefile so that they are linked into the resulting armsd. You also need to declare the driver in drivers.c. An ARMulator can then be built using make:

make MODEL=example

This compiles an ARMulator-based armsd which uses the example memory model. By default, armsd will be rebuilt with the armvirt memory model.
10.4.2 Rebuilding ARMULATE.DLL under Windows
The process required to rebuild the ARMULATE.DLL component is very similar to that described above.

WIN32           Rebuild parent directory for WIN32 components
CLX
CUSTOM          Customisable resources for Windows Tools
REMOTE          WIN32 serial driver components
MSVC20 DLL      Microsoft Visual C++ V2.0+ DLL rebuild directory
WATCOM10 DLL    Watcom C/C++ V10.0a DLL rebuild directory
There is a choice of compilers for rebuilding the ARMulator DLL: Microsoft Visual C++ (produces the fastest code, but only runs under 32-bit Windows, that is Windows NT or Windows 95) and Watcom C/C++ V10.0a. The rebuild kit provides makefiles for both these compilers. As described above, new rules may be added to the ARMulator DLL makefile ARMULATE.MAK. When complete, the make procedure will produce a file called ARMULATE.DLL. This should be placed in the BIN sub-directory of the ARM Tools200 installation directory, typically C:\ARM200\BIN.
The new DLL will automatically be used on the next invocation of either the ARM Debugger for Windows or armsd (on 32-bit Windows operating systems). Rebuilding the serial driver DLL REMOTE.DLL is accomplished in the same way (using REMOTE.MAK). However, it should be noted that the rebuild kit provided is for 32-bit Windows operating systems only, i.e. Windows NT or Windows 95.
11 Exceptions

This chapter explains how the ARM deals with exceptions, and discusses the issues involved in writing exception handlers.

11.1 Overview
11.2 Entering and Leaving an Exception
11.3 The Return Address and Return Instruction
11.4 Writing an Exception Handler
11.5 Installing an Exception Handler
11.6 Exception Handling on Thumb-Aware Processors
11.1 Overview
During the normal flow of execution through a user program, the program counter generally increases sequentially through the address space, with branches to nearby labels or branch-with-links to subroutines. Exceptions occur when this normal flow of execution is diverted, so that the processor can handle events generated by internal or external sources. Examples of such events are:

- externally generated interrupts
- an attempt by the processor to execute an undefined instruction

The handling of such exceptions must preserve the previous processor state so that execution of the original user program can resume once the appropriate exception routine has been completed. The ARM recognises seven different types of exception, as shown below:

Exception                  Description
Reset                      Occurs when the CPU reset pin is asserted. Only expected to occur for signalling power-up, or for resetting as if the CPU has just powered up. It can therefore be useful for producing soft resets.
Undefined Instruction      Occurs if neither the CPU nor any attached coprocessor recognises the currently executing instruction.
Software Interrupt (SWI)   Is a user-defined synchronous interrupt instruction, so that a program running in User mode can request privileged operations which need to be run in Supervisor mode.
Prefetch Abort             Occurs when the CPU attempts to execute an instruction which was prefetched from an illegal address, that is an address that the memory management subsystem has determined as inaccessible to the CPU in its current mode.
Data Abort                 Occurs when a data transfer instruction attempts to load or store data at an illegal address.
IRQ                        Occurs when the CPU's external interrupt request pin is asserted (low) and the I bit in the CPSR is clear.
FIQ                        Occurs when the CPU's external fast interrupt request pin is asserted (low) and the F bit in the CPSR is clear.
11.1.1 The vector table
Exception handling is controlled by a vector table. This is a reserved area of 32 bytes at the bottom of the memory map with one word of space allocated to each exception type (plus one word currently reserved for handling address exceptions when the processor is configured for a 26-bit address space). This is not enough space to contain the full code for a handler, so the vector entry for each exception type typically contains a branch or load PC instruction to continue execution with the appropriate handler.
Each exception handler must ensure that other registers are restored to their original state upon exit. This can be done by storing the contents of any registers the handler needs to use onto its stack and restoring them before returning.

Note
You must ensure that the required stacks have been set up. If you are using Demon or ARMulator, this is done for you.
Table 11-2: The exception vectors on page 11-4 shows the processor modes, the exceptions that give rise to them and the priority in which the exceptions are handled.
Table 11-2: The exception vectors

Vector Address   Exception Type             Exception Mode   Priority (1=High, 6=Low)
0x0              Reset                      svc              1
0x4              Undefined Instruction      undef            6
0x8              Software Interrupt (SWI)   svc              6
0xC              Prefetch Abort             abort            5
0x10             Data Abort                 abort            2
0x14             Reserved                   Not applicable   Not applicable
0x18             Interrupt (IRQ)            irq              4
0x1C             Fast Interrupt (FIQ)       fiq              3
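The table can also be written down as data, which makes the priority column easy to experiment with. The following is a minimal sketch, not code from the toolkit: ExceptionInfo, exception_table and highest_priority are hypothetical names, and a priority of 0 is used here as a marker for the reserved vector.

```c
#include <stddef.h>

/* One row of the exception vector table. */
typedef struct {
    unsigned    vector;     /* vector address */
    const char *name;       /* exception type */
    const char *mode;       /* mode entered, NULL if not applicable */
    int         priority;   /* 1 = highest, 6 = lowest, 0 = not applicable */
} ExceptionInfo;

static const ExceptionInfo exception_table[] = {
    { 0x00, "Reset",                    "svc",   1 },
    { 0x04, "Undefined Instruction",    "undef", 6 },
    { 0x08, "Software Interrupt (SWI)", "svc",   6 },
    { 0x0C, "Prefetch Abort",           "abort", 5 },
    { 0x10, "Data Abort",               "abort", 2 },
    { 0x14, "Reserved",                 NULL,    0 },
    { 0x18, "Interrupt (IRQ)",          "irq",   4 },
    { 0x1C, "Fast Interrupt (FIQ)",     "fiq",   3 },
};

/* Given a bitmask of pending exceptions (bit n = table row n), return
 * the row that would be handled first, or NULL if none is pending. */
static const ExceptionInfo *highest_priority(unsigned pending)
{
    const ExceptionInfo *best = NULL;
    size_t i;

    for (i = 0; i < sizeof exception_table / sizeof exception_table[0]; i++) {
        const ExceptionInfo *e = &exception_table[i];
        if (!(pending & (1u << i)) || e->priority == 0)
            continue;               /* not pending, or reserved vector */
        if (best == NULL || e->priority < best->priority)
            best = e;
    }
    return best;
}
```

With both an IRQ and an FIQ pending (rows 6 and 7), the function picks the FIQ row; with a data abort and an FIQ pending, it picks the data abort.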
11.2 Entering and Leaving an Exception
11.2.1 The processor's response to an exception
When an exception is generated, the processor:

1 Copies the Current Program Status Register (CPSR) into the Saved Program Status Register (SPSR) for the mode in which the exception will be handled. This saves the current mode, interrupt mask and condition flags.
2 Sets the appropriate CPSR mode bits:
  a) to change to the appropriate mode, also mapping in the appropriate banked registers for that mode.
  b) to disable interrupts. IRQs are disabled when any exception occurs, and FIQs are also disabled when a FIQ occurs.
3 Stores the return address (PC - 4) in LR_<mode>.
4 Sets the PC to the appropriate vector address. This forces the branch to the appropriate exception handler.
These can be achieved in a single instruction, because adding the S flag (update condition codes) to a data-processing instruction when in a privileged mode with the PC as the destination register also transfers the SPSR to the CPSR as required. This also applies to the Load Multiple instruction (using the ^ qualifier).
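The four steps the processor takes on exception entry can be modelled in C to make the order of events concrete. This is a much simplified sketch under stated assumptions: the Cpu structure and take_exception function are hypothetical, only one SPSR and LR are modelled (those of the mode being entered), and the mode numbers and the positions of the I and F bits are the standard 32-bit CPSR encodings.

```c
#include <stdint.h>

/* 32-bit mode numbers and interrupt mask bits of the CPSR. */
enum { MODE_FIQ = 0x11, MODE_IRQ = 0x12, MODE_SVC = 0x13,
       MODE_ABT = 0x17, MODE_UND = 0x1B };
#define MODE_MASK 0x1Fu
#define F_BIT     (1u << 6)
#define I_BIT     (1u << 7)

/* Hypothetical, much simplified CPU state. */
typedef struct {
    uint32_t cpsr;
    uint32_t pc;
    uint32_t spsr;   /* SPSR_<mode> of the mode being entered */
    uint32_t lr;     /* LR_<mode> of the mode being entered */
} Cpu;

static void take_exception(Cpu *cpu, uint32_t vector, uint32_t mode)
{
    cpu->spsr = cpu->cpsr;                          /* 1: save CPSR */
    cpu->cpsr = (cpu->cpsr & ~MODE_MASK) | mode;    /* 2a: change mode */
    cpu->cpsr |= I_BIT;                             /* 2b: disable IRQs */
    if (mode == MODE_FIQ)
        cpu->cpsr |= F_BIT;                         /*     and FIQs on FIQ */
    cpu->lr = cpu->pc - 4;                          /* 3: return address */
    cpu->pc = vector;                               /* 4: jump to vector */
}
```

Returning from the handler reverses steps 1 and 4 together, which is why an instruction that both writes the PC and copies the SPSR back to the CPSR is needed.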
11.3 The Return Address and Return Instruction
The actual value in the PC which causes a return from a handler varies with the exception type. When an exception is taken, the PC may or may not have been updated, and the return address may not necessarily be the next instruction pointed to by the PC, because of the way the ARM loads its instructions. When loading the instructions it needs to execute a program, the ARM uses a pipeline with a fetch, decode and execute stage. At any one time, there will be one instruction in each stage of the pipeline. The PC actually points to the instruction being fetched. Since each instruction is a word long, the instruction being decoded is at address PC - 4 and the instruction being executed is at PC - 8.
11.3.4 Returning from data abort
When a load or store instruction tries to access memory, the PC has already been updated, so that storing PC - 4 in LR_ABORT makes it point to two instructions beyond the address where the exception was generated. Once the MMU has loaded the appropriate address into physical memory, the handler should return to the original, aborted instruction so a second attempt can be made to execute it. The return address is therefore two words (or eight bytes) less than that in LR_ABORT, making the return instruction:

    SUBS pc, lr, #8
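The arithmetic can be checked with a small worked example. The two functions below are hypothetical helpers for ARM state only; they simply restate the offsets given in the text.

```c
#include <stdint.h>

/* Value stored in LR_ABORT when the load or store at address 'instr'
 * takes a data abort: two instructions beyond the aborted one. */
static uint32_t lr_after_data_abort(uint32_t instr)
{
    return instr + 8;
}

/* Effect on the PC of the return instruction SUBS pc, lr, #8. */
static uint32_t data_abort_return(uint32_t lr)
{
    return lr - 8;
}
```

For an aborted load at 0x8000, LR_ABORT holds 0x8008 and the return instruction sets the PC back to 0x8000, so the load is attempted again.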
11.4 Writing an Exception Handler
This section explains the functions performed by the code that handles each type of exception.
Because of the need to access the link register and load in the actual SWI instruction, the top-level SWI handler must be written in assembly language. However, the individual routines that implement each SWI can be written in C if required: see Chapter 12, Implementing SWIs.
FIQs have higher priority than IRQs in two ways:

1 FIQs are serviced first when multiple interrupts occur.
2 Servicing an FIQ causes IRQs to be disabled, preventing them from being serviced until after the FIQ handler has re-enabled them (usually by restoring the CPSR from the SPSR at the end of the handler).
You can set up C functions as interrupt handlers by using the special function declaration keyword __irq. This keyword:

- preserves all registers (excluding floating-point)
- exits the function by setting the PC to LR - 4 and restoring the CPSR to its original value.
The simple example handler below reads a byte from location 0xc0000000 and writes it to location 0xc0000004:

void __irq IRQHandler (void)
{
    volatile char *base = (volatile char *) 0xc0000000;
    *(base+4) = *base;
}

Installing the FIQ handler

The FIQ vector is the last entry in the vector table, at address 0x1C. It is situated there so that the FIQ handler can be placed directly at the vector location and run sequentially from that address. This removes the need for a branch and its associated delays, and also means that if the system has a cache, the vector table and FIQ handler may all be locked down in one block within it. This is important because FIQs are designed to service interrupts as quickly as possible. The simplest way to place the FIQ handler at 0x1c is to copy it there. ARM code is inherently relocatable, but note that:

- the code should not use any absolute addresses
- PC-relative addresses are allowable as long as the data is copied as well as the code (so that it remains in the same relative place)
The five extra FIQ mode banked registers mean that status can be held between calls to the handler. A simple FIQ handler is shown below. This takes some data and copies it out to an i/o port, using the following registers:

r8     points to the i/o port (with an interrupt flag at r8 + 4)
r9     points to the current word in the data
r10    points to the end of data
r11    is used as temporary storage
r12    points to a semaphore which is set when the copy is complete

FIQ_Start                       ; Note no stack usage - banked registers
    STR   r8, [r8,#4]           ; set int_flag in port
    CMP   r9, r10               ; End of data reached?
    LDRNE r11, [r9],#4          ; Read in next word
    STRNE r11, [r8]             ; Copy it to port
    STREQ r8, [r12]             ; Set semaphore when finished
    SUBS  pc, lr, #4            ; Return
FIQ_End

This can be copied to the bottom of the vector table with:

memcpy ((void *)0x1c, FIQ_Start, FIQ_End-FIQ_Start);
11.4.3 Reset handler
The operations carried out by the Reset handler depend upon the system that the software is part of. It might, for example:

- do a hardware self test
- detect how much memory is available
- initialise stacks and registers
- initialise peripheral hardware such as i/o ports
- initialise the MMU if one is being used (ARM cached processors)
- call the main body of code (__main() if using C)
Once any chain of emulators is exhausted, no further processing of the instruction can take place, so the undefined instruction handler should report an error and quit.
11.4.5 Prefetch abort handler
If the system contains no MMU, the Prefetch Abort handler can simply report the error and quit. If there is an MMU, the address that caused the abort needs to be restored into physical memory. LR_ABORT points to the instruction at the address following the one that caused the abort, so the address that needs restoring is at LR_ABORT - 4. Thus the virtual memory fault for that address can be dealt with and the instruction fetch re-attempted. The handler should therefore return to the same instruction, rather than the following one.
Load / Store Multiple: If writeback is enabled, the base register will have been updated as if the whole transfer had taken place. (In the case of an LDM with the base register in the register list, the processor will handle replacing the overwritten value with the modified base value in such a way that recovery is possible.) The number of registers involved will therefore need to be used to recalculate the original base address.
In each case, the MMU can load the required virtual memory into physical memory (the address which caused the abort being stored in the MMU's Fault Address Register (FAR)). Once this is done, the handler can return and retry executing the instruction.
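For the Load / Store Multiple case, recalculating the original base address might look like the sketch below. It assumes the behaviour described above (the base register has been updated as if the whole transfer had taken place) and relies on the standard LDM/STM encoding: register list in bits 15:0, W (writeback) in bit 21 and U (up/down) in bit 23. The function names reg_count and original_base are hypothetical.

```c
#include <stdint.h>

/* Count the registers named in bits 15:0 of an LDM/STM instruction. */
static unsigned reg_count(uint32_t instr)
{
    unsigned n = 0;
    uint32_t list = instr & 0xFFFFu;

    while (list) {
        n += list & 1u;
        list >>= 1;
    }
    return n;
}

/* Undo base writeback for an aborted LDM/STM.
 * 'instr' is the aborted instruction, 'base' the updated base register. */
static uint32_t original_base(uint32_t instr, uint32_t base)
{
    uint32_t bytes = 4u * reg_count(instr);

    if (!(instr & (1u << 21)))      /* W bit clear: no writeback */
        return base;
    if (instr & (1u << 23))         /* U bit set: base was incremented */
        return base - bytes;
    return base + bytes;            /* U bit clear: base was decremented */
}
```

As an example, LDMIA r0!, {r1-r3} assembles to 0xE8B0000E; if r0 holds 0x100C after the abort, the original base was 0x1000.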
11.5 Installing an Exception Handler
Once a handler for a particular exception has been written, it must be installed in the vector table so that it will be executed when the exception occurs.
A C function which implements this algorithm is provided below. This takes as its arguments the address of the handler and the address of the vector in which the handler is to be installed. The function installs the handler and returns the original contents of the vector. This result might be used for creating a chain of handlers for a particular exception.

unsigned Install_Handler (unsigned routine, unsigned *vector)
/* Updates contents of 'vector' to contain branch instruction */
/* to reach 'routine' from 'vector'. Function return value is */
/* original contents of 'vector'.*/
/* NB: 'Routine' must be within range of 32Mbytes from 'vector'.*/
{
    unsigned vec, oldvec;
    vec = ((routine - (unsigned)vector - 0x8)>>2);
    if (vec & 0xff000000)
    {
        printf ("Installation of Handler failed");
        exit (0);
    }
    vec = 0xea000000 | vec;
    oldvec = *vector;
    *vector = vec;
    return (oldvec);
}
Code to call this to install an IRQ handler might be:

unsigned *irqvec = (unsigned *)0x18;
Install_Handler ((unsigned)IRQHandler, irqvec);

In this case the returned, original contents of the IRQ vector are discarded.
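To see what the branch-installing routine writes into the vector, the same arithmetic can be run on a host machine without touching address 0x18. The helper below is a hypothetical variant for illustration: encode_branch is not a toolkit function, and it returns 0 rather than exiting when the target is out of range.

```c
#include <stdint.h>

/* Build the B (branch always) instruction that, when placed at address
 * 'vector', jumps to 'routine'. Returns 0 if the offset does not fit.
 * Same arithmetic as Install_Handler, but nothing is written to memory. */
static uint32_t encode_branch(uint32_t routine, uint32_t vector)
{
    /* byte offset from vector, less 8 for the pipeline, as a word offset */
    uint32_t offset = (routine - vector - 0x8u) >> 2;

    if (offset & 0xff000000u)       /* must fit in 24 bits */
        return 0;
    return 0xea000000u | offset;    /* 0xea = B with condition "always" */
}
```

For a handler at 0x8000 and the IRQ vector at 0x18, the offset is (0x8000 - 0x18 - 0x8) >> 2 = 0x1FF8, so the word stored in the vector is 0xEA001FF8.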
The following C routine implements this:

unsigned Install_Handler (unsigned *location, unsigned *vector)
/* Updates contents of 'vector' to contain LDR pc, [pc, #offset] */
/* instruction to cause long branch to address in 'location'. */
/* Function return value is original contents of 'vector'.*/
{
    unsigned vec, oldvec;
    vec = ((unsigned)location - (unsigned)vector - 0x8) | 0xe59ff000;
    oldvec = *vector;
    *vector = vec;
    return (oldvec);
}

Code to call this to install an IRQ handler might be:

unsigned *irqvec = (unsigned *)0x18;
unsigned *irqaddr = (unsigned *)0x38; /* For example */
*irqaddr = (unsigned)IRQHandler;
Install_Handler (irqaddr,irqvec);

Again in this case the returned, original contents of the IRQ vector are discarded.
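The offset in LDR pc, [pc, #offset] is measured from the address of the vector plus 8, because the PC reads two words ahead of the executing instruction. The sketch below works the encoding through on a host machine; encode_ldr_pc is a hypothetical helper, and the range check reflects the 12-bit, forwards-only immediate that the 0xe59ff000 opcode pattern allows.

```c
#include <stdint.h>

/* Build the LDR pc, [pc, #offset] instruction that, when placed at
 * address 'vector', loads the PC from the word stored at 'location'.
 * Returns 0 if 'location' cannot be reached. */
static uint32_t encode_ldr_pc(uint32_t location, uint32_t vector)
{
    uint32_t offset = location - vector - 0x8u;  /* allow for the pipeline */

    if (offset > 0xfffu)            /* 12-bit immediate, forwards only */
        return 0;
    return 0xe59ff000u | offset;
}
```

With the vector at 0x18 and the handler address stored at 0x38, the offset is 0x38 - 0x18 - 0x8 = 0x18 and the instruction word is 0xE59FF018.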
11.6 Exception Handling on Thumb-Aware Processors
Note
This section only applies to processors that implement ARM Architecture 4T.
When writing exception handlers suitable for use on Thumb-aware processors, there are some further considerations to those already described in this chapter. The basic exception handling mechanism on Thumb-aware processors is the same as that of non-Thumb-aware processors, where an exception causes the next instruction to be fetched from the appropriate vector table entry. The same vector table is used for both Thumb state and ARM state exceptions. This means that an initial step must be added at the top of the exception handling procedure described in 11.2.1 The processor's response to an exception on page 11-5. The procedure now reads:

1 Check the processor's state. If it is operating in Thumb state, switch to ARM state.
2 Copy the CPSR into SPSR_<mode>.
3 Set the CPSR mode bits.
4 Store the return address (PC - 4) in LR_<mode>.
5 Set the PC to the appropriate vector address.
The switch from Thumb state to ARM state in step 1 ensures that the ARM instruction installed at the appropriate vector (either a branch or a PC-relative load) is correctly fetched, decoded and executed. Execution then moves to a top-level veneer, also written in ARM code, which saves the processor status and any registers. The programmer then has two choices:

1 Write the whole exception handler in ARM code.
2 Make the top-level veneer store any necessary status, and then perform a BX (branch and exchange) to a Thumb code routine which handles the exception. Such a routine will need to return to an ARM code veneer in order to return from the exception, since the Thumb instruction set does not have the instructions required for restoring the CPSR from the SPSR.

This second strategy is shown in Figure 11-1: Handling an exception in Thumb state on page 11-15.
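Figure 11-1: Handling an exception in Thumb state (diagram not reproduced; its parts are the Thumb-coded application, the vector table, the ARM-coded veneers and the Thumb-coded handler)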
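To find out which state the processor was in when the exception occurred, move the SPSR into a general-purpose register and test if bit 5 (the T bit) is set:

- T bit set: the exception occurred in Thumb state
- T bit clear: the exception occurred in ARM state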
An example of where this would be needed is in a SWI handler. Both ARM and Thumb instruction sets contain SWI instructions. See 11.4.1 SWI handler on page 11-8 and Chapter 12, Implementing SWIs for more information. When handling a Thumb SWI instruction, three things need to be taken into account:

1 The address of the instruction will be at LR - 2, rather than LR - 4.
2 A halfword load is required to fetch the instruction.
3 There are only 8 bits available for the SWI number instead of the ARM version's 24 bits.
The following fragment of ARM code will handle a SWI from either source. Note that TST leaves the Z flag clear when the T bit is set, so the Thumb case uses the NE condition and the ARM case uses EQ:

    MRS    r0, spsr             ; move SPSR into general purpose register
    TST    r0, #T_bit           ; Test if bit 5 is set
    LDRHNE r0,[lr,#-2]          ; T_bit set so load halfword (Thumb)
    BICNE  r0,r0,#0xff00        ; and clear top 8 bits of halfword
                                ; (LDRH clears top 16 bits of word)
    LDREQ  r0,[lr,#-4]          ; T_bit clear so load word (ARM)
    BICEQ  r0,r0,#0xff000000    ; and clear top 8 bits of word
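The masking step can be restated in C to check the two cases against real encodings. This is an illustrative sketch only: swi_number is a hypothetical helper, and in a real handler the instruction has to be fetched in assembly language because the link register is not visible from C.

```c
#include <stdint.h>

#define T_BIT (1u << 5)   /* Thumb state bit (bit 5) of the status register */

/* Extract the SWI number. 'spsr' is the saved status register and
 * 'instr' is the word (ARM) or halfword (Thumb) read from the address
 * of the SWI instruction. */
static uint32_t swi_number(uint32_t spsr, uint32_t instr)
{
    if (spsr & T_BIT)
        return instr & 0xffu;          /* Thumb: 8-bit SWI number */
    return instr & 0x00ffffffu;        /* ARM: 24-bit SWI number */
}
```

For example, the ARM instruction SWI 0x10 is encoded as 0xEF000010 and the Thumb instruction SWI 0x10 as 0xDF10; both yield 0x10.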
12 Implementing SWIs

This chapter explains how to implement SWIs and how to call them from your programs.

12.1 Introduction
12.2 Implementing a SWI Handler
12.3 Loading the Vector Table
12.4 Calling SWIs from your Application
12.5 Development Issues: SWI Handlers and Demon
12.6 Example SWI Handler
12.1 Introduction
This chapter explains the steps involved in writing and installing a Software Interrupt (SWI) handler that is able to deal with SWIs in your application code. It also examines the additions required to allow a user SWI handler to cooperate with the Debug Monitor (Demon) SWI handler when developing on a PIE card. For additional information, refer to:

- Chapter 3, Programmers Model, which explains the ARM's usage of modes and banked registers
- Chapter 11, Exceptions, which gives a general guide to writing exception handlers
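(Figure: a SWI in the user program is routed through the vector table to the SWI handler. Diagram not reproduced; its contents are listed below.)

User Program:
    ADD r0,r0,r1
    SWI 0x10
    SUB r2,r2,r0

Vector Table:
0x0     B R_Handler    Reset
0x4     B U_Handler    Undefined instruction
0x8     B S_Handler    SWI (branches to the SWI Handler)
0xC     B P_Handler    Prefetch Abort
0x10    B D_Handler    Data Abort
0x14                   Reserved
0x18    B I_Handler    IRQ
0x1C    B F_Handler    FIQ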
12.1.4 Decoding the SWI instruction
The handler's first task is to decode the SWI number to decide which function to perform. The SWI number is stored within the SWI instruction itself as a 24-bit field, giving a range of 0 to 0xFFFFFF. This is shown in Figure 12-2: The SWI instruction, below.
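Figure 12-2: The SWI instruction (diagram not reproduced). Bits 31 to 28 hold the condition field (cond), bits 27 to 24 are all ones, and bits 23 to 0 hold the 24-bit SWI number.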
The handler's first task is therefore to locate the instruction, so it can read the SWI number. When a SWI instruction is executed, LR_SVC is set to PC - 4, so the instruction is located at LR_SVC - 4. Because LR_SVC can only be accessed via assembly language, the code that obtains the SWI number must be written in assembler. The following two lines extract the SWI number and place it in r0:

    LDR r0,[lr,#-4]             ; Load the SWI instruction into r0
    BIC r0, r0, #0xff000000     ; Mask out the top 8 bits.
The part of the handler that actually implements the SWIs can be written in assembly language (using r0 to control execution through a jump table) or as a C subroutine (with r0 being passed as a parameter that controls a switch() statement). The last part of the handler will again need to be in assembly language because of the need to restore the CPSR from SPSR_SVC.
12.1.6 Re-entrant SWI handling
Unless your SWI handler is written to be re-entrant, it will be unable to use SWIs itself because taking another exception while in Supervisor mode will corrupt both SPSR_SVC and LR_SVC. The second exception will return correctly to the first, but the first will be unable to return to the calling program. If the handler stores SPSR_SVC and LR_SVC, along with the non-banked registers, each time it is called and then retrieves them again each time it exits, this problem will not arise, as each instance will have access to the correct return address and status information.
Passing parameters via the stack

Storing the registers has another advantage. A SWI will often be executed with a set of parameters, passed in registers. These can be accessed easily in assembly language, but you may have written your SWI handler in C. In this case, you can pass the SWI number in r0 (as extracted above), with r1 pointing to the location of the registers on the stack. See 12.2.1 Implementing a SWI handler in C on page 12-7.

Note that the stack used here is the Supervisor stack, so before the SWI Handler is called, the Supervisor stack pointer must have been set up to point to a dedicated area of memory. In a final system this might be done in the __main routine within rtstand.s, the standalone C library. It might also be necessary to add code for stack overflow checking to the top level SWI Handler, as follows:

SVCStackBase     EQU 0xA00      ; Full descending stack so base
SVCStackEnd      EQU 0x800      ; is higher in memory than end.
SVCStackHeadroom EQU 0x40       ; Allow headroom of 16 words, even
                                ; though maximum handler places on
                                ; stack is 15 words, because can then
                                ; use (8 bit shifted) immediate value
                                ; in the CMP.
SVCStackLimit    EQU SVCStackEnd + SVCStackHeadroom
        :
        MOV sp, #SVCStackBase   ; Set up SVC stack pointer
        :                       ; (in rtstand.s, say)
        :
SWIHandler
        CMP sp, #SVCStackLimit  ; Check if enough room on stack to
        BLS stack_overflow      ; store registers, if not report error
        ;
        ; Rest of SWI Handler code
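The comment about headroom refers to the form of ARM data-processing immediates: an 8-bit value rotated right by an even number of bits. The check below is a small sketch for trying out candidate constants on a host machine; is_arm_immediate is a hypothetical helper, not part of the toolkit.

```c
#include <stdint.h>

/* Returns 1 if 'value' can be written as an ARM data-processing
 * immediate: an 8-bit constant rotated right by an even number of bits. */
static int is_arm_immediate(uint32_t value)
{
    unsigned rot;

    for (rot = 0; rot < 32; rot += 2) {
        /* rotate left by 'rot' to undo a rotate right by 'rot' */
        uint32_t v = rot ? ((value << rot) | (value >> (32 - rot))) : value;

        if ((v & ~0xffu) == 0)
            return 1;
    }
    return 0;
}
```

With 16 words of headroom the limit is 0x800 + 0x40 = 0x840, which is 0x21 shifted left by 6 and so fits. With exactly 15 words it would be 0x83C, whose set bits span 10 bit positions, so it could not be used directly in the CMP.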
12.2 Implementing a SWI Handler
12.2.1 Implementing a SWI handler in C
The easiest way of implementing the SWI handling mechanism is to write it in C, using a switch() statement. Suppose we have the following function:

void C_SWI_Handler (unsigned number, unsigned *reg)
{
    /* Handle the SWIs */
}

the actual body of which is in the format:

switch (number)
{
    case 0 : /* SWI number 0 code */
        break;
    case 1 : /* SWI number 1 code */
        break;
    /* Rest of SWI routines */
}

The code implementing each SWI must be kept as short as possible and in particular should not call routines from the C library, as these can make many nested procedure calls which can exhaust the stack space and cause the application to crash. This C function is called from the top-level assembly language routine, which places the number of the SWI to be handled into r0, and a pointer to the registers as they were when the SWI was encountered (that is, the Supervisor stack) in r1. It then invokes the C function with a branch with link:

    BL C_SWI_Handler

Passing arguments from the top-level routine

The APCS ensures that when C_SWI_Handler is called, r0 is allocated to its first argument and r1 to the second. To read the values in registers r0 to r12 from C_SWI_Handler, access the integer values pointed to by reg, for example:

value_in_reg_0 = reg [0];
value_in_reg_1 = reg [1];
value_in_reg_2 = reg [2];
value_in_reg_3 = reg [3];
    :
    :
value_in_reg_12 = reg [12];

How reg relates to the stack is shown in Figure 12-4: Accessing the Supervisor stack.
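A filled-in handler might look like the following sketch. The two SWI numbers and their meanings are invented purely for illustration. The sketch also assumes that the top-level routine restores r0 to r12 from the same stack area that reg points to, so that a value written into reg[0] reaches the caller in r0.

```c
/* Hypothetical SWI numbers, for illustration only. */
enum { SWI_ADD = 0, SWI_NEGATE = 1 };

/* 'number' is the SWI number extracted by the top-level handler.
 * 'reg' points at the saved r0 to r12 on the Supervisor stack. */
void C_SWI_Handler(unsigned number, unsigned *reg)
{
    switch (number) {
    case SWI_ADD:       /* r0 = r0 + r1 */
        reg[0] = reg[0] + reg[1];
        break;
    case SWI_NEGATE:    /* r0 = -r0 */
        reg[0] = 0u - reg[0];
        break;
    default:            /* unknown SWI: leave the registers untouched */
        break;
    }
}
```

Each case is kept to a line or two and makes no C library calls, in line with the advice above about stack usage.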
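Figure 12-4: Accessing the Supervisor stack. (The figure shows the Supervisor stack from the previous sp_svc downwards: spsr_svc, lr_svc, then r12 (reg[12]) down to r0 (reg[0]). The current sp_svc points at r0, which is the address passed as reg.)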
    EndofSWI
            ; Return execution to top level SWI handler
            ; so as to restore registers and go back to user program
12.3 Loading the Vector Table
Finally, having written your SWI handler, install an instruction in the vector table so that encountering a SWI causes it to be called. To do this, place a branch instruction in the table. The instruction can be generated using the following method:

1   Take the address of the top level SWI handler.
2   Subtract the address of the SWI vector (i.e. 0x8).
3   Subtract 0x8 to allow for the pipeline.
4   Shift the result right by two to give a word offset rather than a byte offset.
5   Test that the top eight bits of this offset are clear to ensure that the offset is only 24 bits long (as the branch is limited to this).
6   Logically OR this with 0xea000000 (BAL instruction) to produce the complete instruction for placing in the vector.

In C this could be coded as:

    unsigned Install_Handler (unsigned routine, unsigned *vector)
    /* Updates contents of 'vector' to contain branch            */
    /* instruction to reach 'routine' from 'vector'.             */
    /* Function return value is original contents of             */
    /* vector.                                                   */
    /* NB: 'Routine' must be within range of 32Mbytes            */
    /* from 'vector'.                                            */
    {
        unsigned vec, oldvec;
        vec = ((routine - (unsigned)vector - 0x8)>>2);
        if (vec & 0xff000000)
        {
            printf ("Installation of Handler failed");
            exit (0);
        }
        vec = 0xea000000 | vec;
        oldvec = *vector;
        *vector = vec;
        return (oldvec);
    }
Notice that the contents of the vector are updated by the routine itself; the return value is the previous contents of the vector. The reason for returning this value will be examined shortly. For now, as no use is made of the previous contents, this could be called from your C program with:

    Install_Handler((unsigned)SWIHandler,swivec);

where

    unsigned *swivec = (unsigned *)0x8;
In most circumstances, the branch instruction's 32 Mbyte range will be sufficient to reach the SWI handler from the vector table. Sometimes, however, an alternative method is needed. This is to directly force the PC to the handler's start address. For this to work:

•   the address of the handler must be stored in a suitable memory location
•   the vector must contain the encoding of an instruction to load the PC with the contents of that memory location
This can be implemented as:

    unsigned Install_LDR_Handler (unsigned *vector, unsigned address)
    /* Updates contents of 'vector' to contain 'LDR pc,[pc,#offset]' */
    /* to cause branch from vector to location contained within     */
    /* 'address'. Function return value is original contents of     */
    /* 'vector'.                                                     */
    /* NB: 'address' must be within 4k of vector                     */
    {
        unsigned vec, oldvec;
        vec = (address - (unsigned)vector - 0x8);
        if (vec & 0xfffff000)
        {
            printf ("Installation of Handler failed");
            exit (0);
        }
        vec = 0xe59ff000 | vec;    /* LDR pc, [pc,#offset] */
        oldvec = *vector;
        *vector = vec;
        return (oldvec);
    }

This again returns the original contents of the vector. Temporarily ignoring this returned value, this routine could be called from the user's C program by:

    Install_LDR_Handler(swivec,(unsigned)swiaddr);

where

    unsigned *swivec = (unsigned *) 0x8;
    unsigned *swiaddr= (unsigned*)0x38;   /* An address<=4k from vector */
    *swiaddr = (unsigned)SWIHandler;
12.4 Calling SWIs from your Application
This is very simple in assembly language. Set up your registers as required and then call the relevant SWI:

    SWI 700

In C, things are slightly more complicated. You must map each SWI onto a function call in your code using the __swi compiler directive. This allows a SWI to be compiled in-line, without additional calling overhead, provided that:

•   its arguments (if any) are passed in r0-r3 only
•   its results (if any) are returned in r0-r3 only
The following sections demonstrate how to use the compiler's in-line SWI facility for a variety of different SWIs that conform to these rules. These SWIs are taken from the ARM Debug Monitor interface. For more information see The ARM Software Development Toolkit Reference Manual: Chapter 17, Demon.

In the examples below, the following options are used with armcc:

    -li             specifies that the target is a little endian ARM
    -apcs 3/32bit   specifies that the 32-bit variant of APCS 3 should be used
This generates the following:

    output_newline
            MOV   a1,#&d
            SWI   &0
            MOV   a1,#&a
            SWI   &0
            MOV   pc,lr

Note that your version of armcc may produce slightly different output to that listed here.
12.4.3 Calling a SWI which returns 2-4 results
If a SWI returns two, three or four results, its declaration must specify that it is a struct-valued SWI, and the special keyword __value_in_regs must also be used. This is because a struct-valued function is usually treated as if it were a void function whose first argument is the address where the result structure should be placed. See 7.3 Passing and Returning Structures on page 7-9 for more details.

As an example, consider SWI_InstallHandler, which we want to be SWI number 0x70. On entry r0 contains the exception number, r1 contains the workspace pointer and r2 contains the address of the handler. On exit r0 is undefined, r2 contains the address of the previous handler and r1 the previous handler's workspace pointer.

The following fragment demonstrates how this SWI could be declared and used in C:

    typedef struct SWI_InstallHandler_struct
    {
        unsigned exception;
        unsigned workspace;
        unsigned handler;
    } SWI_InstallHandler_block;

    SWI_InstallHandler_block __value_in_regs __swi(0x70)
        SWI_InstallHandler(unsigned r0, unsigned r1, unsigned r2);

    void InstallHandler(SWI_InstallHandler_block *regs_in,
                        SWI_InstallHandler_block *regs_out)
    {
        *regs_out=SWI_InstallHandler(regs_in->exception,
                                     regs_in->workspace,
                                     regs_in->handler);
    }

This code is provided in directory examples/swi as installh.c, and can be compiled to produce ARM assembler source using:

    armcc -S -li -apcs 3/32bit installh.c -o installh.s

The code which armcc produces is:

    InstallHandler
            STMDB sp!,{lr}
            MOV   lr,a2
            LDMIA a1,{a1-a3}
            SWI   &70
            STMIA lr,{a1-a3}
            LDMIA sp!,{pc}

Note that your version of armcc may produce slightly different output to that listed here.
12.4.4 Dealing with a SWI whose number is not known until run time
If you need to call a SWI whose number is not known until run time, the mechanisms discussed above are not appropriate. This situation might occur when there are a number of related operations that can be performed on an object, and each operation has its own SWI. There are several ways of dealing with this. For example:

•   constructing the SWI instruction from the SWI number, storing it somewhere and then executing it
•   using a 'generic' SWI which takes as an extra argument a code for the actual operation to be performed on its arguments. This 'generic' SWI would then decode the operation and perform it.
A mechanism has been added to armcc to support the second method outlined here. The operation is specified by a value which is passed in r12 (ip). The arguments to the 'generic' SWI are passed in registers r0-r3, and values optionally returned in r0-r3 using the mechanisms described above. The operation number passed in r12 could be, but need not be, the number of the SWI to be called by the 'generic' SWI.

Here is a C fragment which uses a 'generic', or 'indirect' SWI:

    unsigned __swi_indirect(0x80)
        SWI_ManipulateObject(unsigned operationNumber, unsigned object,
                             unsigned parameter);

    unsigned DoSelectedManipulation(unsigned object, unsigned parameter,
                                    unsigned operation)
    {
        return SWI_ManipulateObject(operation, object, parameter);
    }

This code is provided in directory examples/swi as swimanip.c, and can be compiled to produce ARM assembler source using:

    armcc -S -li -apcs 3/32bit swimanip.c -o swimanip.s

This produces the following code:

    DoSelectedManipulation
            MOV   ip,a3
            SWI   &80
            MOV   pc,lr

Note that your version of armcc may produce output which is slightly different from that listed here.
12.5 Development Issues: SWI Handlers and Demon
When developing an application for ARM, the initial testing ground for the code is liable to be armsd using either the ARMulator or a PIE card. The ARM debug monitor (Demon) reserves SWIs in the range 0-255. These implement many of the semi-hosted functions which, among other things, are needed for armsd to work correctly. You should therefore avoid defining SWIs that overlap this range.

If you are using an ARMulator, installing your own handler will not stop Demon's SWIs still being accessible. However, if you are using a PIE card, Demon's SWI facilities will disappear. You can prevent this from happening by intercepting Demon's SWI handler before installing your own. If your handler does not deal with a particular SWI, it can pass the SWI on to Demon's.

To do this, move Demon's installation instruction out of the vector table and replace it with one pointing to your own handler, putting Demon's instruction at an address where your own handler can call it if required. Bear in mind that Demon installs its SWI handler using the LDR method described above, since on a PIE card the SWI handler code is in ROM some 3 Gbytes above the vector table. You will therefore have to adjust the instruction's PC-relative offset value to take account of its new location.

First you need a location in which to store the original Demon SWI vector instruction. This might be:

    unsigned *Dswivec = (unsigned *) 0x20;

The Install_Handler() routine described in 12.3 Loading the Vector Table on page 12-9 returns the original contents of the vector being installed into, so the following call will store the original Demon SWI vector instruction at its new location as well as installing our own handler:

    *Dswivec = Install_Handler ((unsigned)SWIHandler, swivec);

Next update the PC-relative offset in the original Demon instruction, to allow for the fact that it now occupies a different memory location.
The following call will do this:

    Update_Demon_Vec (swivec, Dswivec);

where:

    void Update_Demon_Vec (unsigned *original, unsigned *Dvec)
    /* Returns updated instruction 'LDR pc, [pc,#offset]' when   */
    /* moved from 'original' to 'Dvec' (ie recalculates offset). */
    /* Assumes 'Dvec' is higher in memory than 'original'.       */
    {
        *Dvec = ((*Dvec &0xfff) - ((unsigned) Dvec - (unsigned) original))
                | (*Dvec & 0xfffff000);
    }
The C SWI handler function must be updated so that it can report whether or not it has handled this SWI:

    unsigned C_SWI_Handler (unsigned number, unsigned *reg)
    {
        unsigned done = 1;
        switch (number)
        {
            case 256: /* SWI number 256 code */
                      break;
            case 257: /* SWI number 257 code */
                      break;
            default:  done = 0;
        }
        return (done);
    }

The result passed back can be used by the top-level assembly language handler to determine whether the SWI has been handled, or whether it should be handed on to Demon's handler:

            BL     C_SWI_Handler    ; Call C routine to handle SWI
            CMP    r0, #0           ; Has C routine handled SWI ?
                                    ; 0 = no, 1 = yes
            ;
            ; Restore registers and cpsr from stack
            ;
            ; Now need to decide whether to return from handler or to
            ; call the next handler in the chain (the debugger's).
            MOVNES pc,lr            ; return from handler if SWI handled
            BEQ    Dswivec          ; else jump to address containing
                                    ; instruction to branch to address of
                                    ; debugger's SWI handler.

Note    The BEQ Dswivec instruction would not actually branch to the required stored vector, but would instead jump to the address where the location of that pointer is stored in the data area. It is cited here to illustrate the location to which the handler is attempting to branch.

The easiest way to write it is as a branch to a known address, which in this case would be:

            BEQ    0x20

However, this can be done more flexibly by importing the Dswivec label into the assembly language module. You can then store the address where the Demon vector is stored within the module and force the PC to that address.
This requires a short piece of code (MakeChain) which can be called from the main program after it has set up the new vector and stored the old vector.

            LDR   pc, swichain      ; else jump to address containing
                                    ; instruction to branch to address of
                                    ; debugger's SWI handler.
            :
            :
    swichain
            DCD   0
    MakeChain
            LDR   r0, =swichain     ; Load address of swichain into r0.
            LDR   r1, =Dswivec      ; Load address of Dswivec into r1.
            LDR   r2, [r1]          ; Load contents of Dswivec, i.e. the
                                    ; location of the stored Demon vector.
            STR   r2, [r0]          ; Store vector location within range
                                    ; of PC relative load.
            MOV   pc,lr             ; Return from routine.
Note that while developing under Demon, you will not need to set up a stack in Supervisor mode, as Demon creates a 512 byte stack for you. Once development is finished, you will need to do two further things before producing the code for your final system:

•   set up the Supervisor mode stack (as described earlier)
•   remove the additions that patch your handler in front of Demon's
12.6 Example SWI Handler
The following two program listings implement an example SWI handler that can be run on a PIE card, via armsd. To produce this program, enter the listings, then type:

    armcc -li -c install.c
    armasm -li handle.s
    armlink install.o handle.o /work/arm/lib/armlib.32l -o swi
    armsd -serial swi
    go

Note that you should use the pathname of the library on your system at the link stage. In addition, Demon only installs its vectors on start up of armsd, so the updating of the Demon SWI vector will only work correctly during the first execution of the application.

The C SWI handler stores the values it is passed in the memory locations pointed to by called_256, param_257, param_258 and param_259. After running the program you can check that the parameters were passed correctly by examining these locations.
12.6.1 install.c
    /***************************/
    /* File: install.c         */
    /* Author: Andy Beeson     */
    /* Date: 7th February 1994 */
    /***************************/

    #include <stdio.h>
    #include <stdlib.h>

    extern void SWIHandler (void);
    extern void MakeChain (void);

    unsigned *Dswivec = (unsigned *)0x20;  /* ie place to store old one */

    struct four_results
    {
        unsigned a;
        unsigned b;
        unsigned c;
        unsigned d;
    };

    __swi (256) void my_swi_256 (void);
    __swi (257) void my_swi_257 (unsigned);
    __swi (258) unsigned my_swi_258 (unsigned,unsigned,unsigned,unsigned);
    __swi (259) __value_in_regs struct four_results
        my_swi_259 (unsigned, unsigned, unsigned,unsigned);

    unsigned Install_Handler (unsigned routine, unsigned *vector)
    /* Updates contents of 'vector' to contain branch instruction    */
    /* to reach 'routine' from 'vector'. Function return value is    */
    /* original contents of 'vector'.                                */
    /* NB: 'Routine' must be within range of 32Mbytes from 'vector'. */
    {
        unsigned vec, oldvec;
        vec = ((routine - (unsigned)vector - 0x8)>>2);
        if (vec & 0xff000000)
        {
            printf ("Installation of Handler failed");
            exit (0);
        }
        vec = 0xea000000 | vec;
        oldvec = *vector;
        *vector = vec;
        return (oldvec);
    }

    void Update_Demon_Vec (unsigned *original, unsigned *Dvec)
    /* Returns updated instruction 'LDR pc, [pc,#offset]' when   */
    /* moved from 'original' to 'Dvec' (ie recalculates offset). */
    /* Assumes 'Dvec' is higher in memory than 'original'.       */
    {
        *Dvec = ((*Dvec &0xfff) - ((unsigned) Dvec - (unsigned) original))
                | (*Dvec & 0xfffff000);
    }

    unsigned C_SWI_Handler (unsigned number, unsigned *reg)
    {
        unsigned done = 1;
        /* Set up parameter storage block pointers */
        unsigned *called_256 = (unsigned *) 0x24;
        unsigned *param_257 = (unsigned*) 0x28;
        unsigned *param_258 = (unsigned*) 0x2c;  /* + 0x30,0x34,0x38 */
        unsigned *param_259 = (unsigned*) 0x3c;  /* + 0x40,0x44,0x48 */
        switch (number)
        {
            case 256:
                *called_256 = 256;       /* Store a value to show that */
                break;                   /* SWI was handled correctly. */
            case 257:
                *param_257 = reg [0];    /* Store parameter */
                break;
            case 258:
                *param_258++ = reg [0];  /* Store parameters */
                *param_258++ = reg [1];
                *param_258++ = reg [2];
                *param_258 = reg [3];
                /* Now calculate result */
                reg [0] += reg [1] + reg [2] + reg [3];
                break;
            case 259:
                *param_259++ = reg [0];  /* Store parameters */
                *param_259++ = reg [1];
                *param_259++ = reg [2];
                *param_259 = reg [3];
                reg [0] *= 2;            /* Calculate results */
                reg [1] *= 3;
                reg [2] *= 4;
                reg [3] *= 5;
                break;
            default:
                done = 0;                /* SWI not handled */
        }
        return (done);
    }

    int main ()
    {
        struct four_results r_259;           /* Results from SWI 259 */
        unsigned *swivec = (unsigned *)0x8;  /* Pointer to SWI vector */

        *Dswivec = Install_Handler ((unsigned)SWIHandler, swivec);
        Update_Demon_Vec (swivec, Dswivec);
        MakeChain ();

        printf("Hello 256\n");
        my_swi_256 ();
        printf("Hello 257\n");
        my_swi_257 (257);
        printf("Hello 258\n");
        printf(" Result = %u\n",my_swi_258 (1,2,3,4));
        printf ("Hello 259\n");
        r_259 = my_swi_259 (10,20,30,40);
        printf (" Results are: %u %u %u %u\n",
                r_259.a,r_259.b,r_259.c,r_259.d);
        printf("The end\n");
        return (0);
    }
12.6.2 handle.s
    ;***************************
    ; File: handle.s
    ; Author: Andy Beeson
    ; Date: 7th February 1994
    ;***************************

            AREA TopSwiHandler, CODE    ; name this block of code
            EXPORT SWIHandler
            EXPORT MakeChain
            IMPORT C_SWI_Handler
            IMPORT Dswivec

    SWIHandler
            SUB    r13, r13, #4         ; leave space to store spsr
            STMFD  r13!,{r0-r12,r14}    ; store registers
            MOV    r1, r13              ; second parameter to C routine
                                        ; is register values.
            LDR    r0,[r14,#-4]         ; Calculate address of SWI instruction
                                        ; and load it into r0
            BIC    r0,r0,#0xff000000    ; mask off top 8 bits of instruction
            MRS    r2, spsr
            STR    r2,[r13,#14*4]       ; store spsr on stack at original r13
            BL     C_SWI_Handler        ; Call C routine to handle SWI
            CMP    r0, #0               ; Has C routine handled SWI ?
                                        ; 0 = no, 1 = yes
            LDR    r2, [r13,#14*4]      ; extract spsr from stack
            MSR    spsr,r2              ; and restore it
            LDMFD  r13!, {r0-r12,lr}    ; Restore original registers
            ADD    r13,r13,#4
            ; Now need to decide whether to return from handler or to call
            ; the next handler in the chain (the debugger's).
            MOVNES pc,lr                ; return from handler if SWI handled
            LDR    pc, swichain         ; else jump to address containing
                                        ; instruction to branch to address of
                                        ; debugger's SWI handler.
    swichain
            DCD    0
    MakeChain
            LDR    r0, =swichain        ; Load address of swichain into r0.
            LDR    r1, =Dswivec         ; Load address of Dswivec into r1.
            LDR    r2, [r1]             ; Load contents of Dswivec, i.e. the
                                        ; location of the stored Demon vector.
            STR    r2, [r0]             ; Store vector location within range
                                        ; of PC relative load.
            MOV    pc,lr                ; Return from routine.
            END                         ; mark end of this file
13
Such information can allow you to:

•   compare the ARM's performance against other processors in benchmark tests
•   make decisions about required clock speed and memory configuration of a projected system
•   pinpoint where an application can be streamlined, leading to a reduction in the system's memory requirements
•   identify performance-critical sections of code which can then be optimised using a different algorithm, or by rewriting in assembler

This chapter shows you how to measure code size and execution time, and how to generate an execution profile to discover where the time is being spent in your application.
The columns in the table have the following meanings:

code size       gives the code size, excluding any data which has been placed in the code segment (see inline data, below).

inline data     reports the size of the data included in the code segment by the compiler. Typically, this data will contain the addresses of variables which are accessed by the code, plus any floating point immediate values or immediate values that are too big to load directly into a register. It does not include inlined strings, which are listed separately (see inline strings, below).
RW data
0-init data
debug data
The ROM and RAM requirements for the Dhrystone program would be:

    ROM = code size + inline data + inline strings + const data + RW data
        = 36680 + 428 + 2304 + 128 + 748
        = 40288

    RAM = RW data + 0-Init data
        = 748 + 11376
        = 12124

To repeat this experiment with the Thumb compiler, issue the command:

    tcc -c -Ospace -DMSC_CLOCK dhry_1.c dhry_2.c

This time use armlink's -info sizes option to give a complete breakdown of the code and data sizes:

    armlink -o dhry -info sizes dhry_1.o dhry_2.o armlib.16l
This information is conveyed via a file called armsd.map, which must be in the current directory when the debugger is run. You can find the following example map file in directory examples/sorts:

    0 80000000 RAM 4 rw 135/80 135/80

This describes a single contiguous section of memory from 0 up to 0x80000000. The memory system is 32 bits wide, and has an N cycle access time of 135nS and an S cycle access time of 80nS. The cycle times for reads and writes are the same.

The following steps investigate how changing the armsd.map file parameters alters the processor's performance. Compile the sorts.c example program in directory examples/sorts, as follows:

    armcc -Otime -o sorts sorts.c

This program sorts 1000 strings using three different algorithms (insertion, shell and quick sort) and reports the time taken by each. Run the program under armsd using the command:

    armsd -clock 33MHz sorts

where -clock 33MHz specifies the processor speed. When armsd starts up, it will report the following:

    Memory map ...
    00000000..80000000, 32-Bit, rw,
    Clock speed = 33.33Mhz
    R(N/S) = 135/80, W(N/S) = 135/80

If this information does not appear, armsd has failed to read the map file. Check that it is in the current directory.
Clock speed = 33MHz, Memory access times (N = 115nS, S = 85nS)

                                          ARM         Thumb
    32-bit memory                         16759.8     14018.7
    32-bit memory (with 16 bit latch)     16759.8     17142.9
    16-bit memory                         9063.4      11718.7
    8-bit memory                          4724.4      6237.0

Clock speed = 33MHz, Memory access times (N = 30nS, S = 30nS)

                                          ARM         Thumb
    32-bit memory                         52083.3     43478.3
    32-bit memory (with 16 bit latch)     52083.3     43478.3
    16-bit memory                         27624.3     35971.2
    8-bit memory                          14285.7     18939.4
cum%
self% desc%
calls
The section for insert_sort shows that it made 243432 calls to strcmp, and that this accounted for 59.44% of the time spent in strcmp (the desc% column shows 0 in this case because strcmp does not call any functions). In the case of strcmp, qs_string_compare (which is called by qsort), shell_sort and insert_sort made respectively 13021, 14059 and 243432 calls to strcmp and the time spent in strcmp is shared out between the functions in the ratio 3.17% to 3.43% to 59.44%.
14
This chapter offers advice on exploiting the features of the Thumb instruction set.

    14.1    Working with Thumb Assembly Language            14-2
    14.2    Hand-optimising the Thumb Compiler's Output     14-5
    14.3    ARM/Thumb Interworking                          14-8
    14.4    Division by a Constant in Thumb Code            14-12
Recompile main.c with armcc and re-run it to check that you get the same results:

    armcc -c main.c
    armlink -o mul main.o mul.o armlib.32l
cc -o divc divc.c
where cc is the appropriate compiler command. Run the resulting program, giving as the argument the constant you wish to divide by. For example, to generate code to divide by 3, enter:

    divc 3

This will produce the following assembly output.

    ; generated by Thumb divc 1.00 (Advanced RISC Machines) [4 Apr 95]
            CODE16
            AREA |div3$code|, CODE, READONLY
            EXPORT udiv3
    udiv3
            ; takes argument in a1
            ; returns quotient in a1, remainder in a2
            ; cycles could be saved if only divide or remainder is required
            MOV   a2, a1
            LSR   a1, #1
            LSR   a3, a1, #2
            ADD   a1, a3
            LSR   a3, a1, #4
            ADD   a1, a3
            LSR   a3, a1, #8
            ADD   a1, a3
            LSR   a3, a1, #16
            ADD   a1, a3
            LSR   a1, #1
            ASL   a3, a1, #2
            SUB   a3, a3, a1
            ASL   a3, #0
            SUB   a2, a3
            CMP   a2, #3
            BLT   %FT0
            ADD   a1, #1
            SUB   a2, #3
    0
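The shift-and-add sequence in the visible part of the listing can be read as a multiplication by an approximation to 1/3. The following derivation is a reading of that code, not a statement taken from the divc documentation. Ignoring the truncation in each LSR, the quotient estimate q built up in a1 from the argument n is:

```latex
\begin{align*}
q &\approx n \cdot \tfrac{1}{2}
     \left(1 + 2^{-2}\right)\left(1 + 2^{-4}\right)
     \left(1 + 2^{-8}\right)\left(1 + 2^{-16}\right) \cdot \tfrac{1}{2} \\
  &= \frac{n}{4} \cdot \frac{1 - 2^{-32}}{1 - 2^{-2}}
   = \frac{n \left(1 - 2^{-32}\right)}{3}
\end{align*}
% The product collapses because
% (1+x)(1+x^2)(1+x^4)(1+x^8) = (1-x^{16})/(1-x), here with x = 2^{-2}.
% The remainder estimate is then r = n - 3q, where 3q is formed
% as (q << 2) - q.
```

Because every LSR discards low-order bits, the estimate can only fall at or below the true quotient, never above it. This is why the sequence ends by comparing the remainder in a2 against 3 and adjusting both quotient and remainder.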
Index
A
Accessing hi registers in Thumb state 3-9 Addition of 64-bit integers 7-5 Addresses 32-bit 4-2 loading into registers 4-17 Addressing modes 5-29 ADR instruction 4-17 ADRL instruction 4-17 ALU status flags 4-6 APCS conforming 7-3 inter-link-unit 7-4 register usage 7-3, 7-8 stack chunk 7-4 static base 7-4 strictly conforming 7-3 APM 2-3
Symbols
__swi directive 12-11
Numerics
16-bit data 5-17, 5-34 using on the ARM 5-17 registers 5-24 32-bit addresses 4-2 data 4-2 instructions 4-2 64-bit integer addition 7-5 multiplication result 7-12 8-bit data 4-2, 5-34
Applications entered at the base address 9-2 entered via the reset vector 9-2 startup 9-2 Areas 4-4 Argument passing 6-5 ARM architecture 3-2 emulator 10-2 processor state 3-5 registers register set 3-6 using as 16-bit 5-24 ARM Procedure Call Standard see APCS ARM Project Manager 2-3 ARM state. See Operating state ARM Windows Debugger 2-3 ARM/Thumb interworking 14-8 armasm summary of features 2-3 armcc -apcs /nofp option 6-12 -apcs /noswst option 6-12 -ARM7 option 6-12 assembly output 5-10 -c option 2-6 -g option 2-5, 6-11 -o option 2-4 -Ospace option 6-11 -Otime option 6-11 -pcc option 6-12 -S option 2-6 summary of features 2-3 -zpj0 option 6-11 armfast 10-2 armlink 2-6 summary of features 2-3 armproto 10-2 armsd command break 2-5 go 2-4 print 2-5 quit 2-5 reg 2-5 reload 2-5 type 2-5 summary of features 2-3 ARMulator 10-2 improving performance 10-9 profiling programs 13-9 rebuilding 10-13 source tree 10-13 timing program execution 13-5 using the rapid prototype model 10-4 armvirt 10-2 Assembler module example 4-4 structure of 4-4 Assembly language Thumb 14-2
B
Barrel shifter 4-3, 4-10, 5-12, 5-29 Base register writeback 4-24 Big endian. See Memory format Branches long distance 11-13 to an exception handler 11-12 Breakpoints 2-5 Building a ROM to be entered at its base address 9-4 to be loaded at address 0 9-13 using scatter loading 9-13 Byte order reversal 5-28
C
C for deeply embedded applications 6-17 function design 6-3 standalone functions 9-17 using libraries in ROM 9-14 -c option armcc 2-6 Cache lines 5-34 Carry flag 14-6
Changing endianness 5-28 Clash detection overlays 8-2 Client applications 8-8 Code size measuring 13-3 Compiler options improving performance 6-11 Condition code flags 3-10 Conditional execution 4-6, 5-29 Constants literal pools 4-15 multiplication 4-11 Current Program Status Register 4-3, 11-5, 12-2 Custom serial drivers 10-11 Division by a constant 5-12 generating sequences 5-16 Thumb 14-12 Division implementation 6-14 Dynamic linker 8-9
E
Embedded applications 6-17 Entering ARM state 3-5 exceptions 11-5 Thumb state 3-5 ENTRY directive 4-4, 9-2 Error handling 6-19 Exception handlers data abort handler 11-11 FIQ handler 11-9 in Thumb state 11-14 installing 11-12 interrupt handler 11-8 on Thumb-aware processors 11-14 prefetch abort handler 11-11 reset handler 11-10 returning from 11-5 SWI handler 11-8 undefined instruction handler 11-10 writing 11-8 Exceptions 11-2 entering 11-5 leaving 11-5 priorities 11-3 response by processors 11-5 returning from 11-14 use of modes 11-3 use of registers 11-3 Execution conditions 4-6 Extending the standalone runtime system 6-24
D
Data 32-bit 4-2 8-bit 4-2 Data abort 11-7 exception 11-2 handler 11-11 Data flow analysis 6-7 Data size measuring 13-3 Debug monitor 12-2 Debugger Windows 2-3 Debuggers writing serial drivers 10-11 decAOF summary of features 2-3 Deeply embedded applications 6-17 Demon 12-2 and SWI 12-15 Detecting overflow 5-23 into the top 16 bits 5-23 Dhrystone 13-3 Divide routines for real-time applications 6-15
F
Feedback 1-3 FIQ banked register 3-7 exception 11-2 handler 11-9 handlers 11-6 vector 11-9 Flags ALU status 4-6 condition code 3-10 Floating point emulator 5-33 support 6-21 Formats memory 3-3 Function arguments 6-5 Function call overhead 6-3 Function design 6-3
Installing exception handlers 11-12 Instructions 32-bit 4-2 ADR 4-17 ADRL 4-17 LDM 5-29 LDR/STR 4-12 load/store multiple 4-2, 4-23 MOV/MVN 4-14 program status register transfer 4-13 STM 5-29 Integer to string conversion 5-3 Integer-like structures 7-10 Inter-link-unit 7-4 Interrupt handlers 11-8 Interworking between ARM and Thumb 14-8 IRQ exception 11-2 handlers 11-6
G
-g option armcc 2-5
J
Jump tables 4-21
H
Halfword data 5-17 Handling SWIs in Thumb state 11-15 Hello World example 2-4 hello.c file 2-4 Hi registers 3-9 description 3-9
L
LDM instruction 5-29 LDR Rd, = mechanism 4-15, 4-18 LDR/STR instruction 4-12 Leaf functions 6-3 Leaving exceptions 11-5 Link register 12-2 Linker outputs shared libraries 8-11 Linking with libraries 2-6 Literal pools 4-15 Little endian. See Memory format Lo registers 3-9
I
Increment /Decrement, Before/After 4-23 Initialisation on RESET 9-2 In-line functions 6-8
Load/store architecture 4-2 instruction 4-12 multiple 5-29 multiple instructions 4-2, 4-23 Loading a word from an unknown alignment 5-27 addresses into registers 4-17 big endian 5-21 constants into registers 4-14 little endian 5-18 Locating a shared library 8-9 Long multiply and Thumb 14-10 Longjmp and setjmp 6-20 Loop unrolling 5-30 Multiplication 5-29 constant 4-11 returning a 64-bit result 7-12 Multiply by a constant 5-8
N
Non integer-like structures 7-11 Non-sequential cycles minimising 5-35
O
-o option armcc 2-4 Operating state switching to ARM 3-5 to Thumb 3-5 Optimising multiple loads 14-5 multiple stores 14-5 register usage 5-30 Overflow detecting 5-23 Overlay manager 8-3 -OVERLAY option 8-2 Overlays clash detection 8-2 managing 8-3 -OVERLAY option 8-2 -SCATTER option 8-2 using 8-2
M
main.c file 14-8 Memory formats 3-3 big endian description 3-3 little endian description 3-3 loading big endian 5-21 loading little endian 5-18 storing big endian 5-22 storing little endian 5-20 Memory models armfast 10-2 armproto 10-2 armvirt 10-2 Minimising non-sequential cycles 5-35 Modules (RISC OS) 8-10 MOV/MVN instruction 4-14 Multiple instructions load/store 4-2 Multiple loads optimising 14-5 Multiple stores optimising 14-5 Multiple versus single transfers 4-23
P
Page table generation 5-25 Passing and returning structures 7-9 arguments 6-5 PC relative expressions 4-18
Performance analysis 5-11, 5-30 improving 10-9 issues 5-29 Pools literal 4-15 Pop and push using load and store multiple 4-24 Prefetch abort 11-6 exception 11-2 handler 11-11 Processor states 3-5 Processors responding to exceptions 11-5 Profile reports 13-10 Profiling programs 13-9 Program Status Register transfer 4-13 Program Status Registers 3-10 Project Manager 2-3 Proxy functions 8-8 PSR instructions 4-13 roles 3-6 usage 12-4 Reset exception 11-2 handler 11-10 vector applications entered via 9-2 Return address 11-6 Return instruction 11-6 Running out of heap 6-23
S
-S option armcc 2-6 Saved Program Status Register 11-5, 12-2 -SCATTER option 8-2 Scatter loading initialisation 8-4 Serial drivers writing for ARM debuggers 10-11 Serial/parallel driver source files 10-11 Shared libraries 8-8 linker outputs 8-11 parameter block 8-11 prerequisites 8-10 Small functions 6-3 Software interrupt 12-2 exception 11-2 sort.c file 14-8 sorts.c file 13-5 Source files supplied for serial/parallel drivers 10-11 Speed improvement 5-11 sprintf using in ROM 9-14 Stack notation 4-24 Stack overflow checking 6-23 Stack pointer 12-2 Stack usage 12-4
R
Random number generation 5-25 Rapid prototype model ARMulator 10-4 Rebuilding ARMulator 10-13 Recursion in assembly language 5-3 Reentrant code 8-8 Register allocation 6-6 Register list 4-23 Register sets ARM 3-6 Thumb 3-8 Registers 16-bit 5-24 FIQ banked 3-7 loading addresses into 4-17 loading constants into 4-14 moving to and from memory 4-23 program status 3-10
Stacks and SWIs 12-6 in assembly language 5-3, 5-6 Standalone C functions 9-17 runtime library size 6-25 runtime system 6-18 States of processors 3-5 Static base register 8-8 STM instruction 5-29 Storing big endian 5-22 little endian 5-20 strcopy 4-19 Strings copying 4-19 Structure passing and returning 7-9 Stubs 8-8 Supervisor mode 12-2, 12-4, 12-5, 12-17 Supervisor stack 12-6 SWI handler 11-8, 12-18 SWIs 12-2 and Demon 12-15 and undefined instruction handlers 11-6 calling from application 12-11 calling standard 12-4 decoding 12-4 executing 12-2 handling in Thumb state 11-15 implementing in C 12-7 re-entrant 12-5 returning from 12-3 that return no result 12-11 that return one result 12-12 that return two to four results 12-13 Switch statement 6-8 Switching processor state 3-5
T
Tail continued functions 6-4 tasm summary of features 2-3 tcc summary of features 2-3 Thumb 14-1 assembly language 14-2, 14-5 division by a constant 14-12 processor state 3-5 register set 3-8 state accessing hi registers 3-9 Thumb/ARM interworking 14-8 Tools armasm 2-3 armcc 2-3 armlink 2-3 armsd 2-3 decAOF 2-3 tasm 2-3 tcc 2-3
U
Undefined instruction exception 11-2 handler 11-10 Unused code finding and destroying 6-13 User mode 11-3
V
Variable access costs 6-7 Vector table 11-3, 12-2, 12-9 Vector tables branching 11-12 Veneer functions 6-4
W
Windbg 2-3 Write buffer stalling 5-34 Writing code for ROM troubleshooting 9-18 custom serial drivers for ARM debuggers 10-11