Module 3
Optimizing code can be time-consuming and make it harder to read. Focus on optimizing
functions that are used often and are important for performance. Use profiling tools to find these
key functions, and document any complex optimizations with comments.
C compilers have to translate your C function literally into assembly so that it works for all possible inputs, even though in practice many of those input combinations are not possible or simply won't occur.
Let’s start by looking at an example of the problems the compiler faces. The memclr function clears N bytes of memory starting at address data. It takes two arguments: a pointer to the block of memory to clear (data) and the number of bytes to clear (N). The function zeroes one byte at a time and repeats until N reaches 0, clearing each byte in the specified memory block.
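A minimal sketch of memclr as described above (the exact loop structure in the original source may differ slightly):
void memclr(char *data, int N)
{
    for (; N > 0; N--)
    {
        *data = 0;      /* clear one byte */
        data++;         /* advance to the next byte */
    }
}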
No matter how advanced the compiler, it does not know whether N can be 0 on input or not.
Therefore the compiler needs to test for this case explicitly before the first iteration of the loop.
Explanation of Compiler Challenges
• Early Architectures: Cores before the ARM7TDMI have more limited load and store support; the details are shown in Table 5.1.
• Value Extension: When loading 8- or 16-bit values, they extend to 32 bits in registers.
Unsigned values are zero-extended, signed values are sign-extended. This makes casting
to int efficient.
• Storing Values: When storing 8- or 16-bit values, only the lowest 8 or 16 bits of the
register are used. Casting int to a smaller type is efficient.
• ARMv4 and Above: Support signed 8-bit and 16-bit loads and stores directly with new
instructions.
• ARMv5: Adds 64-bit load/store instructions, available in ARM9E and later cores.
Before ARMv4, ARM processors struggled with signed 8-bit and any 16-bit values, so ARM
C compilers define char as an unsigned 8-bit value.
The armcc and gcc compilers use specific data type mappings (see Table 5.2).
Porting Code Issues: The unsigned char type can cause subtle bugs when porting code. For example, a char loop counter that is decremented and tested with i >= 0 never terminates, because an unsigned value is always >= 0 (see the sketch below). armcc warns about this with an "unsigned comparison with 0" message.
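A hypothetical illustration of the problem. On compilers where char is signed this loop terminates normally; on ARM compilers, where char is unsigned, i can never be negative and the loop runs forever:
#include <stdio.h>

int main(void)
{
    char i;                      /* unsigned 8-bit on ARM compilers */

    for (i = 9; i >= 0; i--)     /* always true when char is unsigned */
    {
        printf("%d\n", i);       /* i wraps from 0 back to 255 instead of going negative */
    }
    return 0;
}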
ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However, most
ARM data processing operations are 32-bit only.
For this reason, we should use a 32-bit datatype, int or long, for local variables wherever
possible. Avoid using char and short as local variable types, even if we are manipulating an 8- or
16-bit value. The one exception is when you want wrap-around to occur.
To see the effect of local variable types, let’s consider a simple example.
Example: A checksum function sums values in a data packet, common in protocols like
TCP/IP for error checking.
The following code checksums a data packet containing 64 words. It shows why we should
avoid using char for local variables.
int checksum_v1(int *data)       // takes a pointer to the 64-word data packet
{
    char i;                      // loop counter declared as char
    int sum = 0;                 // running total of the checksum

    for (i = 0; i < 64; i++)     // loop from 0 to 63 (64 iterations)
    {
        sum += data[i];          // add the value at index i of the data array to the sum
    }
    return sum;                  // return the final sum after the loop ends
}
In the first case, the compiler inserts an extra AND instruction to reduce i to the range 0 to 255 before the comparison with 64. This instruction disappears in the second case, where the loop counter is a 32-bit unsigned int (see the sketch below).
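For comparison, a minimal sketch of the second case (the checksum_v2 example referred to later); the only change assumed here is the type of the loop counter:
int checksum_v2(int *data)
{
    unsigned int i;              /* 32-bit loop counter: no truncation needed */
    int sum = 0;

    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}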
Next, suppose the data packet contains 16-bit values and we need a 16-bit
checksum. It is attractive to write the following C code:
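(A minimal sketch of such a function; the name checksum_v3 and the 64-halfword packet size are assumptions based on the surrounding discussion.)
short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;                        /* 16-bit accumulator */

    for (i = 0; i < 64; i++)
    {
        sum = (short)(sum + data[i]);     /* narrowing cast back to short each iteration */
    }
    return sum;
}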
The expression sum + data[i] is an integer and so can only be assigned to a short
using an (implicit or explicit) narrowing cast.
In the compiled assembly output, the compiler must insert extra instructions to implement the narrowing cast: the loop is now three instructions longer than the loop for the checksum_v2 example earlier!
5.2.2 Function Argument Types
(Explain the concept of passing function arguments "wide" versus "narrow" in the context of the
following function. Describe how different compilers might handle these arguments, using the armcc and
gcc compilers as examples. How does each compiler's approach affect the handling of the input values
and the resulting assembly code?)
Overview
In function argument passing, performance and code size can be optimized by converting local
variables from types like char or short to int. This concept extends to function arguments.
Understanding how compilers handle these conversions is crucial for efficient code.
Example Function
This function takes two 16-bit (short) values, halves the second, adds it to the first, and returns a
16-bit result.
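A minimal sketch of a function matching this description (the name add_v1 is an assumption):
short add_v1(short a, short b)
{
    return a + (b >> 1);    /* halve the second argument, add the first, return a 16-bit result */
}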
Key Points
1. Argument Passing:
o In ARM architecture, function arguments and return values are typically passed in 32-bit
registers.
o The question arises whether these 32-bit registers should contain values restricted to the
short range (−32,768 to +32,767).
2. Wide vs. Narrow Passing:
o Wide Passing: Arguments are passed without reducing them to the type’s range. The
callee (the function being called) must reduce them to the correct range.
o Narrow Passing: Arguments are reduced to the type’s range before passing. The caller
(the function making the call) ensures this reduction.
3. Compiler Behavior:
o Different compilers handle these conversions differently, affecting both performance and
correctness.
o armcc Compiler:
▪ Passes arguments narrow and returns values narrow.
▪ The caller casts the arguments to the short type before the call, and the callee casts the return value to short.
▪ The assembly output therefore assumes the caller has already reduced the 32-bit argument values to the short range; only the return value is narrowed in the callee.
o gcc Compiler:
▪ Makes no assumptions about the caller: it reduces the input arguments to the short range itself, so the values are in the correct range regardless of what the caller did.
In both outputs, shift operations are used to ensure the return value fits within the short range.
In programming, the choice between using signed int and unsigned int can affect the efficiency of
certain operations, particularly division. Let's explore how these types behave differently.
Arithmetic Operations
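A minimal sketch of the function discussed below; the name average_v1 matches the assembly label that follows:
int average_v1(int a, int b)
{
    return (a + b) / 2;     /* signed division by 2 */
}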
This function adds two integers and divides the sum by 2. When compiled, it
translates into the following assembly code:
average_v1
ADD r0, r0, r1 ; r0 = a + b
ADD r0, r0, r0, LSR #31 ; if (r0 < 0) r0++
MOV r0, r0, ASR #1 ; r0 = r0 >> 1
MOV pc, r14 ; return r0
1. Addition:
o ADD r0, r0, r1: Adds the two integers a and b.
2. Handling Signed Division:
o ADD r0, r0, r0, LSR #31: Adjusts the result for negative values. This line adds 1 to r0 if the
sum a + b is negative (ensuring proper rounding for negative numbers).
3. Arithmetic Shift Right:
o MOV r0, r0, ASR #1: Divides the sum by 2 using an arithmetic shift right (ASR), which preserves the sign of the integer.
4. Return:
o MOV pc, r14: Returns the result.
Key Points
• Signed Division:
o The compiler includes an extra step to handle rounding correctly for negative numbers.
This is why the assembly code includes ADD r0, r0, r0, LSR #31.
o Signed division requires careful handling to ensure that negative numbers are rounded
correctly.
• Unsigned Division:
o If the integers were unsigned, the compiler would not need to add 1 for negative numbers,
simplifying the division process.
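A hypothetical unsigned variant for comparison (the name average_v2_u is made up): because the operands cannot be negative, the division by 2 needs no rounding fix-up and can compile to a single logical shift right.
unsigned int average_v2_u(unsigned int a, unsigned int b)
{
    return (a + b) / 2;     /* unsigned: no sign correction is needed before the shift */
}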
Practical Implications
• When performing division in your code, consider whether the values can be negative. If they are
always non-negative, using unsigned int can simplify the generated code and potentially improve
performance.
Understanding the differences between signed and unsigned types helps you write more efficient
and accurate code, especially in performance-critical applications.
5.3 C Looping Structures
Loops with a Fixed Number of Iterations
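A minimal sketch of a fixed-count loop written with an incrementing counter (the name checksum_v5 is an assumption), the style the discussion below refers to:
int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++)    /* counts up towards the limit 64 */
    {
        sum += *(data++);
    }
    return sum;
}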
Compiled this way, the loop overhead costs three instructions per iteration: an add to increment the counter, a compare against the limit, and a conditional branch. This is not efficient. On the ARM, a loop should only use two
instructions:
■ A subtract to decrement the loop counter, which also sets the
condition code flags on the result
■ A conditional branch instruction.
The key point is that the loop counter should count down to zero rather
than counting up to some arbitrary limit.
Example: This example shows the improvement if we switch to a
decrementing loop rather than an incrementing loop.
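A minimal sketch of the decrementing version (the name checksum_v6 is an assumption); the assembly fragments explained below correspond to a loop of this shape:
int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--)   /* counts down to zero */
    {
        sum += *(data++);
    }
    return sum;
}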
MOV r2,r0: Copy the address of the data block from register r0 to register
r2. r2 will now point to the data array.
MOV r1,#0x40: Initialize the loop counter i (stored in r1) to 64. This implies the function will process 64 elements of data.
• The checksum_v7 example shows how the compiler handles a for loop
with a variable number of iterations N.
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;
    for (; N != 0; N--)
    {
        sum += *(data++);
    }
    return sum;
}
This compiles to the following assembly:
checksum_v7
    MOV r2,#0               ; sum = 0
    CMP r1,#0               ; compare N, 0
    BEQ checksum_v7_end     ; if (N==0) goto end
checksum_v7_loop
    LDR r3,[r0],#4          ; r3 = *(data++)
    SUBS r1,r1,#1           ; N-- and set flags
    ADD r2,r3,r2            ; sum += r3
    BNE checksum_v7_loop    ; if (N!=0) goto loop
checksum_v7_end
    MOV r0,r2               ; r0 = sum
    MOV pc,r14              ; return r0
Code Explanation:
MOV r2,#0: Initialize the sum (stored in r2) to 0.
CMP r1,#0: Compare the value of N (stored in r1) with 0.
BEQ checksum_v7_end: If N is 0 (comparison sets the zero flag), branch to the end of the function.
This handles the case where there is no data to process.
Loop Start:
checksum_v7_loop: This is the label marking the beginning of the loop.
LDR r3,[r0],#4: Load the word (4 bytes) from the address pointed to by r0
into r3, then increment
r0 by 4 to point to the next word in the data array.
SUBS r1,r1,#1: Decrement the loop counter N by 1. The S suffix means it will set the condition
flags based on the result (such as the zero flag if N becomes 0).
ADD r2,r3,r2: Add the value in r3 to the sum stored in r2.
BNE checksum_v7_loop: Branch to the checksum_v7_loop label if the zero flag is not set (i.e., if N is
not zero). This creates a loop that will iterate N times.
A do-while loop gives better performance and code density than a for loop.
Example 5.3: This example shows how to use a do-while loop to remove the test
for N being zero that occurs in a for loop.
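A minimal sketch of the do-while version (named checksum_v8 to match the loop label explained below), assuming the caller guarantees N is nonzero:
int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;

    do
    {
        sum += *(data++);       /* accumulate one word */
    } while (--N != 0);         /* decrement and test at the end of the body */

    return sum;
}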
MOV r2,#0: Initialize the sum (stored in r2) to 0. This will hold the running total of
the checksum.
checksum_v8_loop: This is the label marking the beginning of the loop.
LDR r3,[r0],#4: Load the word (4 bytes) from the address pointed to by r0 into r3, then increment r0 by 4.
• This means r3 = *(data++), and r0 now points to the next word in the array.
• Using register in C isn't consistently effective across different compilers and architectures
like Thumb and ARM.
• It's best to let the compiler handle register allocation without specifying register explicitly.
5.5 Function Calls
• ARM Procedure Call Standard (APCS) defines how function arguments and return values are
managed in ARM registers.
• ARM-Thumb Procedure Call Standard (ATPCS) extends this for ARM and Thumb
interworking.
• First four integer arguments go into registers r0, r1, r2, and r3; additional ones use the
stack.
• Function return values are typically passed back in register r0.
• Arguments like long long or double are handled in pairs of registers and returned in r0, r1.
• For C++, the first argument to an object method is the implicit this pointer, separate from
explicit arguments.
• If a function exceeds four arguments (or a C++ method three explicit ones), using structures
for related arguments is often more efficient.
• Group related arguments into structures and pass a structure pointer instead of multiple
individual arguments, based on your software's design.
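A hypothetical illustration of this guideline (the names Rect, draw_rect_v1, and draw_rect_v2 are made up): the five-argument version forces the fifth argument onto the stack, while the structure-pointer version passes everything through r0.
typedef struct
{
    int x, y, width, height;
    unsigned int color;
} Rect;

/* Five separate arguments: only the first four fit in r0-r3, the fifth goes on the stack. */
void draw_rect_v1(int x, int y, int width, int height, unsigned int color);

/* A single pointer argument: the whole call fits in register r0. */
void draw_rect_v2(const Rect *r);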
5.6 Pointer Aliasing
(A pointer is a fundamental concept in programming, especially in languages like C, C++,
and similar languages. Here's a simplified explanation:
Definition of a Pointer:
• Concept: A pointer is a variable that stores the memory address of another variable.
• Purpose: Pointers allow direct access and manipulation of memory locations.)
Pointer Aliasing
• Definition: Pointer aliasing occurs when two pointers point to the same memory location.
• Effect: If we modify (write to) one pointer, the modification can affect the value read
from the other pointer.
• Compiler Assumption: When optimizing code, the compiler assumes that any write
operation to a pointer could potentially change the value that another pointer reads, if
they alias.
*timer1 += *step;
*timer2 += *step;
(This increments the values pointed to by timer1 and timer2 by the value stored at the address in step.)
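For context, a minimal sketch of the surrounding function, assuming the three pointers arrive as arguments (consistent with the use of r0, r1, and r2 in the assembly below); the name timers_v1 is an assumption:
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}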
This compiles to
LDR r3, [r0, #0] ; Load the value of *timer1 into r3 (r3 = *timer1)
LDR r12, [r2, #0] ; Load the value of *step into r12 (r12 = *step)
ADD r3, r3, r12 ; Add r12 (which is *step) to r3 (which is *timer1)
STR r3, [r0, #0] ; Store the updated value of r3 back into *timer1
LDR r0, [r1, #0] ; Load the value of *timer2 into r0 (r0 = *timer2)
LDR r2, [r2, #0] ; Load the value of *step into r2 (r2 = *step)
ADD r0, r0, r2 ; Add r2 (which is *step) to r0 (which is *timer2)
STR r0, [r1, #0] ; Store the updated value of r0 back into *timer2
MOV pc, r14 ; Return from the subroutine (assuming this is the end of a function)
Compiler Behavior:
The compiler generates code to load values from memory (LDR instructions).
In the example, *step is loaded twice because the compiler cannot store the value once and
reuse it (common subexpression elimination): it must assume that the write to *timer1 could
change the value of *step, since timer1 and step might alias.
Simplified Points:
• Avoid relying on the compiler to optimize repeated memory accesses. Instead, create new
local variables to store the result of expressions that involve accessing memory. This ensures the
expression is computed only once (see the sketch after this list).
• Avoid taking the address of local variables. This can make accessing the variable less
efficient because it might prevent optimizations by the compiler.
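A minimal sketch of the first point above (the name timers_v2 and the surrounding function are assumptions): reading *step into a local variable lets the compiler load it from memory only once.
void timers_v2(int *timer1, int *timer2, int *step)
{
    int step_value = *step;     /* read the value once into a local variable */

    *timer1 += step_value;      /* no reload of *step is needed here */
    *timer2 += step_value;
}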
The way we organize a structure that is frequently used can greatly affect how efficiently our
program runs and how much memory it uses. There are two main concerns when it comes to
structures on ARM processors: alignment and overall size.
Alignment:
• ARM processors up to and including ARMv5TE have specific rules about how data
should be aligned in memory for efficient access.
• Load and store instructions on these architectures are guaranteed to work correctly only
when the data is stored at addresses aligned to the size of the data itself (like 4 bytes or 8
bytes).
• To ensure efficient memory access, ARM compilers automatically align the starting
address of a structure to a multiple of the largest data size used within the structure
(usually 4 or 8 bytes).
• They also align each data entry within the structure to its own size by adding extra
padding if needed.
Example Structure:
struct {
char a;
int b;
char c;
short d;
};
• Alignment Requirements:
• Memory Alignment: Many computer architectures, including ARM, require that certain
data types be stored at memory addresses that are multiples of their size. This is known as
memory alignment.
• For example, an int (typically 4 bytes) should ideally start at a memory address that is
divisible by 4 (4-byte alignment). Similarly, a short (typically 2 bytes) should start at a
memory address that is divisible by 2 (2-byte alignment).
• Padding:
• Between a and b: After char a (which takes 1 byte), the next variable int b requires a
memory address that is a multiple of 4 bytes (assuming int is 4 bytes). If a is at an
address like 0x1000, the compiler might insert 3 padding bytes after a so that b starts at
0x1004 (4-byte aligned).
• Between c and d: Similarly, after char c (which takes 1 byte), the next variable short
d needs to start at a memory address that is a multiple of 2 bytes (assuming short is 2
bytes). With a at 0x1000 (followed by 3 padding bytes) and b at 0x1004, c sits at 0x1008, so the
compiler inserts 1 padding byte after c and d starts at the 2-byte-aligned address 0x100A.
In a little-endian memory system, the structure is laid out with the padding bytes described above; the layout is sketched below.
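A sketch of the likely layout on a 32-bit ARM, assuming a 4-byte int and a 2-byte short, with the padding shown as comments:
struct {
    char  a;    /* offset 0                                             */
                /* offsets 1-3: 3 padding bytes so b is 4-byte aligned  */
    int   b;    /* offsets 4-7                                          */
    char  c;    /* offset 8                                             */
                /* offset 9: 1 padding byte so d is 2-byte aligned      */
    short d;    /* offsets 10-11                                        */
};              /* total size 12 bytes, 4 of which are padding          */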