Module 3


Overview of C Compilers and Optimization

Optimizing code can be time-consuming and make it harder to read. Focus on optimizing
functions that are used often and are important for performance. Use profiling tools to find these
key functions, and document any complex optimizations with comments.

A C compiler has to translate your C function literally into assembler so that it works for all
possible inputs, even though in practice many input combinations are impossible or never occur.
Let's start by looking at an example of the problems the compiler faces. The memclr function
clears N bytes of memory at address data.

void memclr(char *data, int N)
{
    for (; N > 0; N--)
    {
        *data = 0;
        data++;
    }
}

The memclr function sets a block of memory to zero. It takes two arguments:

1. char *data: A pointer to the start of the memory block.
2. int N: The number of bytes to clear.

Here's what the function does, step by step:

1. Loop: The for loop runs as long as N is greater than 0.
2. Set to Zero: Inside the loop, it sets the value at the current memory location (*data) to
zero.
3. Move to Next Byte: It then moves the pointer to the next byte (data++).
4. Decrease N: It decreases N by 1.

This process repeats until N reaches 0, clearing each byte in the specified memory block.

No matter how advanced the compiler, it does not know whether N can be 0 on input or not.
Therefore the compiler needs to test for this case explicitly before the first iteration of the loop.

Explanation of Compiler Challenges

1. Unknown N Value: The compiler doesn't know if N is 0, so it has to check before
starting the loop.
2. Data Alignment: The compiler doesn't know if the data pointer is aligned to four bytes. If
it is, it can clear four bytes at once.
3. Multiple of Four: The compiler doesn't know if N is a multiple of four. If it is, it can
optimize by clearing four bytes at a time or by repeating the loop body four times.
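
To illustrate points 2 and 3, here is a minimal sketch of what becomes possible when the
programmer, rather than the compiler, guarantees the preconditions. The name memclr_words and
its contract (word-aligned data, N a nonzero multiple of four) are assumptions made for this
illustration and are not part of the original example.

```c
#include <stdint.h>

/* Hypothetical variant of memclr. Assumes that data is 4-byte aligned
 * and that N (a byte count) is a nonzero multiple of four. */
void memclr_words(uint32_t *data, unsigned int N)
{
    do
    {
        *data++ = 0;   /* clear four bytes with a single word store */
        N -= 4;
    } while (N != 0);
}
```

Because the contract rules out N being 0, the sketch needs no test before the first iteration.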

Basic C Data Types


ARM processors have 32-bit registers and 32-bit data processing operations.
ARM uses a load/store architecture, meaning you must load values from memory into registers
before using them. There are no instructions that perform arithmetic or logic directly on
values in memory.

• Early Architectures: ARMv1 to ARMv3, used before the ARM7TDMI, could load and store only
unsigned 8-bit and unsigned or signed 32-bit values. Details are shown in Table 5.1.
• Value Extension: When loading 8- or 16-bit values, they extend to 32 bits in registers.
Unsigned values are zero-extended, signed values are sign-extended. This makes casting
to int efficient.
• Storing Values: When storing 8- or 16-bit values, only the lowest 8 or 16 bits of the
register are used. Casting int to a smaller type is efficient.
• ARMv4 and Above: Add instructions that directly load signed 8-bit values and load and
store 16-bit values (signed or unsigned).
• ARMv5: Adds 64-bit load/store instructions, available in ARM9E and later cores.

Before ARMv4, ARM processors were not good at handling signed 8-bit or any 16-bit values.
For this reason ARM C compilers define char as an unsigned 8-bit value rather than a signed
8-bit value.

The armcc and gcc compilers use specific data type mappings (see Table 5.2).

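Porting Code Issues: Because char is unsigned, code ported from platforms where char is signed
can misbehave. A common problem is an infinite loop when a char variable i is used as a loop
counter with the continuation condition i >= 0, which is always true for an unsigned type.
armcc warns about this with an "unsigned comparison with 0" message.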
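
The following sketch shows why the condition i >= 0 never fails for an unsigned 8-bit counter,
and one portable way around it. Both function names are hypothetical; unsigned char is written
explicitly so that the behavior is the same on compilers where plain char is signed.

```c
/* An unsigned 8-bit counter that is decremented at zero wraps to 255
 * instead of becoming negative, so "i >= 0" can never be false. */
int unsigned_char_after_decrement(void)
{
    unsigned char i = 0;
    i--;
    return i;
}

/* Portable alternative: use int for the loop counter.
 * Returns the number of iterations executed, which is start + 1. */
int count_down_iterations(int start)
{
    int i;
    int iterations = 0;
    for (i = start; i >= 0; i--)
    {
        iterations++;
    }
    return iterations;
}
```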

For the rest of these notes we assume an ARMv4 or newer processor, such as the ARM7TDMI.

5.2.1 Local Variable Types


ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However, most
ARM data processing operations are 32-bit only.

For this reason, we should use a 32-bit datatype, int or long, for local variables wherever
possible. Avoid using char and short as local variable types, even if we are manipulating an 8- or
16-bit value. The one exception is when you want wrap-around to occur.

To see the effect of local variable types, let’s consider a simple example.
Example: A checksum function sums values in a data packet, common in protocols like
TCP/IP for error checking.

The following code checksums a data packet containing 64 words. It shows why we should
avoid using char for local variables.

int checksum_v1(int *data)      // Takes a pointer to a packet of 64 words
{
    char i;                     // Loop counter declared as char (the type to avoid)
    int sum = 0;                // Running total, initialized to 0
    for (i = 0; i < 64; i++)    // Loop from 0 to 63 (64 iterations)
    {
        sum += data[i];         // Add the word at index i to the sum
    }
    return sum;                 // Return the final sum after the loop ends
}

Consider the compiler output for this function.

Now compare this to the compiler output where instead we declare i as an unsigned int.

In the first case, the compiler inserts an extra AND instruction to reduce i to the range 0 to 255
before the comparison with 64. This instruction disappears in the second case.
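
The second case can be sketched as follows. The name checksum_v2 is the one these notes use
later for this variant; the body is assumed to be checksum_v1 with only the type of i changed.

```c
/* Sketch of the unsigned int variant: identical to checksum_v1
 * except that the loop counter is a 32-bit unsigned int. */
int checksum_v2(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}
```

Since i already fills a 32-bit register, the compiler has no need to mask it after each increment.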

Next, suppose the data packet contains 16-bit values and we need a 16-bit
checksum. It is attractive to write the following C code:

short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum = (short)(sum + data[i]);
    }
    return sum;
}

You may wonder why the for loop body doesn't contain the code
sum += data[i];

The expression sum + data[i] is an integer and so can only be assigned to a short
using an (implicit or explicit) narrowing cast.

As you can see in the following assembly output, the compiler must insert extra
instructions to implement the narrowing cast:

The loop is now three instructions longer than the loop for example checksum_v2 earlier!

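One way to avoid the per-iteration narrowing cast is to accumulate in an int and narrow only
once, on return. The sketch below assumes this approach; the name checksum_v4 is chosen here
for illustration. For inputs whose 16-bit sum stays in the short range it returns the same
value as checksum_v3.

```c
/* Sketch: keep the running sum in a 32-bit int and cast to short
 * a single time when returning, instead of on every iteration. */
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum += *(data++);
    }
    return (short)sum;
}
```
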
5.2.2 Function Argument Types

(Explain the concept of passing function arguments "wide" versus "narrow" in the context of the
following function. Describe how different compilers might handle these arguments, using the armcc and
gcc compilers as examples. How does each compiler's approach affect the handling of the input values
and the resulting assembly code?)

Overview

Converting local variables from types such as char or short to int improves performance and
reduces code size, as Section 5.2.1 showed. The same holds for function arguments. How the
compiler passes narrow arguments determines which extra instructions it must generate.

Example Function

Consider the function:

short add_v1(short a, short b)
{
    return a + (b >> 1);
}

This function takes two 16-bit (short) values, halves the second, adds it to the first, and returns a
16-bit result.

Key Points

1. Argument Passing:
o In ARM architecture, function arguments and return values are typically passed in 32-bit
registers.
o The question arises whether these 32-bit registers should contain values restricted to the
short range (−32,768 to +32,767).
2. Wide vs. Narrow Passing:
o Wide Passing: Arguments are passed without reducing them to the type’s range. The
callee (the function being called) must reduce them to the correct range.
o Narrow Passing: Arguments are reduced to the type’s range before passing. The caller
(the function making the call) ensures this reduction.
3. Compiler Behavior:
o Different compilers handle these conversions differently, affecting both performance and
correctness.
o armcc Compiler:
▪ Passes arguments narrow and returns values narrow.
▪ Caller casts arguments to short type, and callee casts return value to short type.
▪ Assembly output shows caller ensures 32-bit values are in short range, and callee
casts return value.
o gcc Compiler:
▪ Reduces input arguments to the short range, ensuring values are in the correct
range regardless of assumptions about the caller.

Assembly Code Analysis (armcc)


ADD r0, r0, r1, ASR #1 ; r0 = (int)a + ((int)b >> 1)
MOV r0, r0, LSL #16 ; shift left by 16 to discard the upper 16 bits
MOV r0, r0, ASR #16 ; arithmetic shift right by 16 to sign-extend: r0 = (short)r0
MOV pc, r14 ; return r0

The code assumes the caller provided arguments within the short range.
Shifting operations ensure the return value fits within the short range.
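
For comparison, a version that takes and returns int needs no narrowing cast at all, whichever
convention the compiler uses. This is a sketch only; the name add_v2 is hypothetical, and the
result matches add_v1 only when the arguments and the sum lie in the short range.

```c
/* Hypothetical int version of add_v1: arguments and return value
 * already fill a 32-bit register, so nothing has to be narrowed. */
int add_v2(int a, int b)
{
    return a + (b >> 1);
}
```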

5.2.3 Signed versus Unsigned Types

Comparing Signed and Unsigned Integers


Overview

In programming, the choice between using signed int and unsigned int can affect the efficiency of
certain operations, particularly division. Let's explore how these types behave differently.

Arithmetic Operations

For most arithmetic operations (addition, subtraction, multiplication), there is no performance
difference between signed and unsigned integers. However, division operations can behave
differently.

Example: Averaging Two Integers

Consider the following function that averages two integers:

int average_v1(int a, int b)
{
    return (a + b) / 2;
}

This function adds two integers and divides the sum by 2. When compiled, it
translates into the following assembly code:

average_v1
ADD r0, r0, r1 ; r0 = a + b
ADD r0, r0, r0, LSR #31 ; if (r0 < 0) r0++
MOV r0, r0, ASR #1 ; r0 = r0 >> 1
MOV pc, r14 ; return r0

Assembly Code Explanation

1. Addition:
o ADD r0, r0, r1: Adds the two integers a and b.
2. Handling Signed Division:
o ADD r0, r0, r0, LSR #31: Adds the sign bit of the sum (r0 shifted right logically by 31) to
r0. This adds 1 only when a + b is negative, so that the following shift rounds toward
zero, as C division requires.
3. Arithmetic Shift Right:
o MOV r0, r0, ASR #1: Divides the adjusted sum by 2 using an arithmetic shift right
(ASR), which preserves the sign of the integer.
4. Return:
o MOV pc, r14: Returns the result.

Key Points

• Signed Division:
o The compiler includes an extra step to handle rounding correctly for negative numbers.
This is why the assembly code includes ADD r0, r0, r0, LSR #31.
o Signed division requires careful handling to ensure that negative numbers are rounded
correctly.
• Unsigned Division:
o If the integers were unsigned, the compiler would not need to add 1 for negative numbers,
simplifying the division process.

Practical Implications

• When performing division in your code, consider whether the values can be negative. If they are
always non-negative, using unsigned int can simplify the generated code and potentially improve
performance.

Understanding the differences between signed and unsigned types helps you write more efficient
and accurate code, especially in performance-critical applications.
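
The difference can be sketched in C as follows. average_v1 is the function from the text; the
name average_v2 is hypothetical and assumes both inputs are known to be non-negative. Note that
a + b can still overflow (signed) or wrap around (unsigned) for large inputs in either version.

```c
/* Signed version from the text. C division truncates toward zero,
 * which is why the compiler adds the sign bit before shifting. */
int average_v1(int a, int b)
{
    return (a + b) / 2;
}

/* Hypothetical unsigned version. Dividing an unsigned value by 2
 * is a plain logical shift right, with no correction step. */
unsigned int average_v2(unsigned int a, unsigned int b)
{
    return (a + b) / 2;
}
```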
5.3 C Looping Structures
5.3.1 Loops with a Fixed Number of Iterations

What is the most efficient way to write a for loop on the ARM? Let's return to our checksum
example and look at the looping structure.

Here is the last version of the 64-word packet checksum routine we studied in Section 5.2.
This shows how the compiler treats a loop with incrementing count i++.

int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum += *(data++);
    }
    return sum;
}

This compiles to

checksum_v5
        MOV r2,r0 ; r2 = data
        MOV r0,#0 ; sum = 0
        MOV r1,#0 ; i = 0
checksum_v5_loop
        LDR r3,[r2],#4 ; r3 = *(data++)
        ADD r1,r1,#1 ; i++
        CMP r1,#0x40 ; compare i, 64
        ADD r0,r3,r0 ; sum += r3
        BCC checksum_v5_loop ; if (i<64) goto loop
        MOV pc,r14 ; return sum

It takes three instructions to implement the for loop structure:

■ An ADD to increment i
■ A compare to check if i is less than 64
■ A conditional branch to continue the loop if i < 64

This is not efficient. On the ARM, a loop should only use two instructions:

■ A subtract to decrement the loop counter, which also sets the condition code flags on the
result
■ A conditional branch instruction

The key point is that the loop counter should count down to zero rather
than counting up to some arbitrary limit.

Example: This example shows the improvement if we switch to a
decrementing loop rather than an incrementing loop.

int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 64; i != 0; i--)
    {
        sum += *(data++);
    }
    return sum;
}

This compiles to

checksum_v6
        MOV r2,r0 ; r2 = data
        MOV r0,#0 ; sum = 0
        MOV r1,#0x40 ; i = 64
checksum_v6_loop
        LDR r3,[r2],#4 ; r3 = *(data++)
        SUBS r1,r1,#1 ; i-- and set flags
        ADD r0,r3,r0 ; sum += r3
        BNE checksum_v6_loop ; if (i!=0) goto loop
        MOV pc,r14 ; return sum

MOV r2,r0: Copy the address of the data block from register r0 to register
r2. r2 will now point to the data array.

MOV r0,#0: Initialize the sum (stored in r0) to 0.

MOV r1,#0x40: Initialize the loop counter i (stored in r1) to 64. This implies
the function will process 64 elements of data.

checksum_v6_loop: This is the label marking the beginning of the loop.

LDR r3,[r2],#4: Load the word (4 bytes) from the address pointed to by r2
into r3, then increment r2 by 4 to point to the next word in the data array.

SUBS r1,r1,#1: Decrement the loop counter i by 1. The S suffix means it
will set the condition flags based on the result (such as the zero flag if i
becomes 0).

ADD r0,r3,r0: Add the value in r3 to the sum stored in r0.

BNE checksum_v6_loop: Branch to the checksum_v6_loop label if the zero flag
is not set (i.e., if i is not zero). This creates a loop that will iterate 64
times.

5.3.2 Loops Using a Variable Number of Iterations

• Now suppose we want our checksum routine to handle packets of arbitrary size.
• We pass in a variable N giving the number of words in the data packet. Using the lessons
from the last section we count down until N = 0 and don't require an extra loop counter i.
• The checksum_v7 example shows how the compiler handles a for loop
with a variable number of iterations N.

int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;
    for (; N != 0; N--)
    {
        sum += *(data++);
    }
    return sum;
}

This compiles to

checksum_v7
        MOV r2,#0 ; sum = 0
        CMP r1,#0 ; compare N, 0
        BEQ checksum_v7_end ; if (N==0) goto end
checksum_v7_loop
        LDR r3,[r0],#4 ; r3 = *(data++)
        SUBS r1,r1,#1 ; N-- and set flags
        ADD r2,r3,r2 ; sum += r3
        BNE checksum_v7_loop ; if (N!=0) goto loop
checksum_v7_end
        MOV r0,r2 ; r0 = sum
        MOV pc,r14 ; return r0

Code Explanation:
MOV r2,#0: Initialize the sum (stored in r2) to 0.
CMP r1,#0: Compare the value of N (stored in r1) with 0.
BEQ checksum_v7_end: If N is 0 (comparison sets the zero flag), branch to the end of the function.
This handles the case where there is no data to process.
Loop Start:
checksum_v7_loop: This is the label marking the beginning of the loop.
LDR r3,[r0],#4: Load the word (4 bytes) from the address pointed to by r0 into r3, then
increment r0 by 4 to point to the next word in the data array.
SUBS r1,r1,#1: Decrement the loop counter N by 1. The S suffix means it will set the condition
flags based on the result (such as the zero flag if N becomes 0).
ADD r2,r3,r2: Add the value in r3 to the sum stored in r2.
BNE checksum_v7_loop: Branch to the checksum_v7_loop label if the zero flag is not set (i.e., if N is
not zero). This creates a loop that will iterate N times.

A do-while loop gives better performance and code density than a for loop, provided the loop
body is known to execute at least once (here, N is never zero on entry).

Example: This example shows how to use a do-while loop to remove the test
for N being zero that occurs in a for loop.

int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;
    do
    {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}

This compiles to

checksum_v8
        MOV r2,#0 ; sum = 0
checksum_v8_loop
        LDR r3,[r0],#4 ; r3 = *(data++)
        SUBS r1,r1,#1 ; N-- and set flags
        ADD r2,r3,r2 ; sum += r3
        BNE checksum_v8_loop ; if (N!=0) goto loop
        MOV r0,r2 ; r0 = sum
        MOV pc,r14 ; return r0

Code Explanation:

MOV r2,#0: Initialize the sum (stored in r2) to 0. This will hold the running total of
the checksum.
checksum_v8_loop: This is the label marking the beginning of the loop.
LDR r3,[r0],#4: Load the word (4 bytes) from the address pointed to by r0 into r3, then
increment r0 by 4 to point to the next word in the data array.

• This means r3 = *(data++), and r0 now points to the next word in the
array.

SUBS r1,r1,#1: Decrement the loop counter N (stored in r1) by 1. The S
suffix means it will set the condition flags based on the result (such as
the zero flag if N becomes 0).
ADD r2,r3,r2: Add the value in r3 to the sum stored in r2.
BNE checksum_v8_loop: Branch to the checksum_v8_loop label if the zero flag
is not set (i.e., if N is not zero). This creates a loop that will iterate N
times.
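
The do-while form is only correct when N is nonzero on entry: with N = 0 the counter wraps
around and the loop reads far past the end of the packet. The sketch below makes that
precondition explicit with an assertion and adds a wrapper for callers that cannot rule out
N = 0. The wrapper name checksum_checked is hypothetical.

```c
#include <assert.h>

/* do-while version from the text, with its precondition stated. */
int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;
    assert(N != 0);   /* N == 0 would wrap the counter and overrun data */
    do
    {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}

/* Hypothetical wrapper: performs the zero test once, outside the
 * routine, for callers that may pass an empty packet. */
int checksum_checked(int *data, unsigned int N)
{
    return (N == 0) ? 0 : checksum_v8(data, N);
}
```
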
5.4 Register Allocation
• The compiler assigns registers to local variables in C functions.
• It tries to reuse registers if variables are not used simultaneously.
• When there are more local variables than available registers, the extra variables are
stored on the processor stack.
• These stored variables are called spilled or swapped out variables.
• Accessing spilled variables is slower than accessing variables in registers.
• To optimize a function:
o Minimize spilled variables.
o Ensure important and frequently accessed variables are in registers.
• ARM C compilers follow the ARM-Thumb procedure call standard (ATPCS) for register
allocation.
• The C register keyword isn't consistently effective across different compilers and across
the ARM and Thumb instruction sets.
• It's best to let the compiler handle register allocation rather than using the register
keyword.
5.5 Function Calls
• ARM Procedure Call Standard (APCS) defines how function arguments and return values are
managed in ARM registers.
• ARM-Thumb Procedure Call Standard (ATPCS) extends this for ARM and Thumb
interworking.
• First four integer arguments go into registers r0, r1, r2, and r3; additional ones use the
stack.
• Function return values are typically passed back in register r0.
• Two-word arguments such as long long or double are passed in a pair of consecutive
registers, and two-word results are returned in r0 and r1.
• For C++, the first argument to an object method is the implicit this pointer. It occupies
r0, leaving only three registers for explicit arguments.
• If a function needs more than four arguments (or a C++ method more than three explicit
ones), using structures for related arguments is often more efficient.
• Group related arguments into structures and pass a structure pointer instead of multiple
individual arguments, based on your software's design.
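
As a sketch of the last two points, compare a function that takes six separate arguments with
one that takes a structure pointer. The Window type and both function names are hypothetical
and serve only to illustrate the idea.

```c
/* Hypothetical structure grouping four related values. */
typedef struct
{
    int left;
    int top;
    int right;
    int bottom;
} Window;

/* Six arguments: the first four travel in r0-r3 and the last two
 * are passed on the stack. Returns 1 if (x, y) is in the window. */
int inside_v1(int x, int y, int left, int top, int right, int bottom)
{
    return x >= left && x < right && y >= top && y < bottom;
}

/* Three arguments: all of them fit in registers. */
int inside_v2(const Window *w, int x, int y)
{
    return x >= w->left && x < w->right && y >= w->top && y < w->bottom;
}
```

The structure version pays for loads through the pointer inside the callee, so it is most
useful when the same structure is passed to several functions.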
5.6 Pointer Aliasing
(A pointer is a fundamental concept in C, C++, and similar languages. Here's a simplified
explanation:
Definition of a Pointer:
• Concept: A pointer is a variable that stores the memory address of another variable.
• Purpose: Pointers allow direct access and manipulation of memory locations.)

Pointer Aliasing

• Definition: Two pointers are said to alias when they point to the same memory location.
• Effect: A write through one pointer changes the value read through the other pointer.
• Compiler Assumption: The compiler often cannot tell which pointers alias. When optimizing
code, it must therefore assume that a write through any pointer could change the value
read through any other pointer.

void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

This function increments the values pointed to by timer1 and timer2 by the value pointed to by
step. It compiles to
LDR r3, [r0, #0] ; Load the value of *timer1 into r3 (r3 = *timer1)
LDR r12, [r2, #0] ; Load the value of *step into r12 (r12 = *step)
ADD r3, r3, r12 ; Add r12 (which is *step) to r3 (which is *timer1)
STR r3, [r0, #0] ; Store the updated value of r3 back into *timer1
LDR r0, [r1, #0] ; Load the value of *timer2 into r0 (r0 = *timer2)
LDR r2, [r2, #0] ; Load the value of *step into r2 (r2 = *step)
ADD r0, r0, r2 ; Add r2 (which is *step) to r0 (which is *timer2)
STR r0, [r1, #0] ; Store the updated value of r0 back into *timer2
MOV pc, r14 ; Return from the subroutine (assuming this is the end of a function)

Compiler Behavior:

The compiler generates code to load values from memory (LDR instructions).
In the example, *step is loaded twice because the compiler cannot optimize it by
storing the value once and reusing it (common subexpression elimination), as it
must assume that *step might change between the two accesses.

Simplified Points:

Pointers alias when they point to the same memory address.
The compiler assumes any pointer write can affect the value read from any other pointer.
This caution prevents certain optimizations, like reusing loaded values, to ensure
correctness.
Understanding pointer aliasing helps in writing efficient code and understanding compiler
optimizations.

Avoiding Pointer Aliasing

• Avoid relying on the compiler to optimize repeated memory accesses. Instead, create new
local variables to store the result of expressions that involve accessing memory. This ensures the
expression is computed only once.

• Avoid taking the address of local variables. This can make accessing the variable less
efficient because it might prevent optimizations by the compiler.
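
Applied to the timers example, the first guideline can be sketched as follows. The name
timers_v2 is hypothetical. Note that the behavior differs from timers_v1 in the one case where
step aliases timer1, because the second addition then uses the old value of *step.

```c
/* Sketch: read *step once into a local variable so that the
 * compiler does not have to reload it after the first store. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int s = *step;   /* local copy cannot be changed by the writes below */
    *timer1 += s;
    *timer2 += s;
}
```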

5.7 Structure Arrangement


Structure Layout and Performance on ARM Architecture

The way we organize a structure that is frequently used can greatly affect how efficiently our
program runs and how much memory it uses. There are two main concerns when it comes to
structures on ARM processors: alignment and overall size.

Alignment:

• ARM processors up to and including ARMv5TE have specific rules about how data
should be aligned in memory for efficient access.
• Load and store instructions on these architectures are guaranteed to work correctly only
when the data is stored at addresses aligned to the size of the data itself (like 4 bytes or 8
bytes).
• To ensure efficient memory access, ARM compilers automatically align the starting
address of a structure to a multiple of the largest data size used within the structure
(usually 4 or 8 bytes).
• They also align each data entry within the structure to its own size by adding extra
padding if needed.
Example Structure:
struct {
    char a;
    int b;
    char c;
    short d;
};

• Data Types and Sizes:

• char a;: This variable occupies 1 byte in memory.
• int b;: An int typically takes 4 bytes in memory on many systems, including ARM
processors.
• char c;: Another char variable, taking 1 byte.
• short d;: A short typically takes 2 bytes in memory.

• Alignment Requirements:

• Memory Alignment: Many computer architectures, including ARM, require that certain
data types be stored at memory addresses that are multiples of their size. This is known as
memory alignment.
• For example, an int (typically 4 bytes) should ideally start at a memory address that is
divisible by 4 (4-byte alignment). Similarly, a short (typically 2 bytes) should start at a
memory address that is divisible by 2 (2-byte alignment).

• Padding:

• Between a and b: After char a (which takes 1 byte), the next variable int b requires a
memory address that is a multiple of 4 bytes (assuming int is 4 bytes). If a is at
address 0x1000, the compiler inserts 3 padding bytes after a so that b starts at
0x1004 (4-byte aligned).
• Between c and d: Similarly, after char c (which takes 1 byte), the next variable short
d needs to start at a memory address that is a multiple of 2 bytes (assuming short is 2
bytes). Since b occupies 0x1004 to 0x1007, c sits at 0x1008, and the compiler inserts
1 padding byte so that d starts at 0x100A (2-byte aligned).

• Little-Endian Memory System:

• In a little-endian memory system, which is common in many computer architectures
(including x86 and most ARM systems), multi-byte data types are stored with their least
significant byte (LSB) at the lowest memory address.

• Memory Layout with Padding:

• Explanation of Layout:

• char a; is at offset 0, followed by 3 padding bytes at offsets 1 to 3.
• int b; starts at offset 4 and occupies offsets 4 to 7. In little-endian format, its least
significant byte is at offset 4 and its most significant byte at offset 7.
• char c; is at offset 8, followed by 1 padding byte at offset 9.
• short d; starts at offset 10 and occupies offsets 10 and 11. In little-endian format, its
least significant byte is at offset 10 (0x0A) and its most significant byte at offset 11
(0x0B).
• The structure therefore occupies 12 bytes, of which 4 are padding.
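
The layout described above can be checked with offsetof. This sketch assumes a typical ABI with
a 4-byte int and a 2-byte short; the structure tags and the helper function are hypothetical.
The second structure shows, as an illustration only, how ordering the same members from
smallest to largest removes the padding on such an ABI.

```c
#include <stddef.h>

/* Same members and order as the example structure.
 * Typical offsets: a = 0, b = 4, c = 8, d = 10. */
struct layout_v1
{
    char a;
    int b;
    char c;
    short d;
};

/* Same members sorted by size.
 * Typical offsets: a = 0, c = 1, d = 2, b = 4. */
struct layout_v2
{
    char a;
    char c;
    short d;
    int b;
};

/* Padding in layout_v1: total size minus the sizes of its members. */
size_t padding_v1(void)
{
    return sizeof(struct layout_v1)
         - (sizeof(char) + sizeof(int) + sizeof(char) + sizeof(short));
}
```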
