Module 3

The document discusses compilers, optimization, and porting C code to ARM processors. It covers endianness, structure layout, avoiding division and modulus through rearrangement of expressions, inline functions, and issues that can arise when porting code between ARM architectures like char types being unsigned, integer size mismatches, unaligned data, endianness assumptions, function prototyping, and bitfield layout dependencies. Optimization techniques include reordering structure elements, using multiplication instead of division with precalculated constants, and removing function call overhead with inline functions and assembly.


Overview of C compilers and Optimization

What is Little Endian and Big Endian format?


It is the order in which the bytes of a value, from LSB to MSB, are stored in memory.
Suppose we have the hexadecimal number

0x01234567; it will be stored as below.
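As a small sketch of the idea, the following C function checks the byte order at run time by reading the lowest-addressed byte of 0x01234567: on a little-endian system that byte is 0x67 (the LSB), on a big-endian system it is 0x01.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian machine, 0 on a big-endian one.
   On a little-endian system the least significant byte of
   0x01234567 sits at the lowest address. */
int is_little_endian(void)
{
    uint32_t value = 0x01234567;
    uint8_t first_byte;
    memcpy(&first_byte, &value, 1);  /* read the byte at the lowest address */
    return first_byte == 0x67;
}
```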

The way we lay out a frequently used structure affects both performance and code density.
There are two issues:
1. Alignment of the structure entries
2. Overall size of the structure

Structures often end up containing padding.

 Required because of the target’s data-type restrictions.


o e.g. ARM keeps ints on a 32-bit boundary,
o chars on an 8-bit boundary,
o shorts on a 16-bit boundary.

Easy to waste memory if you’re not aware of where padding is inserted.

Solution:

 Sort elements in the structure by size:


o Place elements in small-to-large or large-to-small order.
o This minimizes the amount of padding.

The compiler cannot perform this transformation itself as the C standard guarantees that
structure members will be laid out in the order that they’re specified.

For example, consider the structure

For a little-endian memory system, the compiler will lay this out by adding padding.
But to improve the memory usage, reorder the elements as

With this we can avoid unnecessary padding.
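Since the original figures are not reproduced here, a representative pair of layouts can make the point concrete. The member names below are illustrative, not taken from the text; the sizes in the comments assume a typical ABI with 4-byte ints and 2-byte shorts.

```c
/* Carelessly ordered: the compiler inserts padding to keep each
   member on its natural boundary. */
struct bad_layout {
    char  a;   /* 1 byte + 3 bytes padding */
    int   b;   /* 4 bytes                  */
    char  c;   /* 1 byte + 1 byte padding  */
    short d;   /* 2 bytes                  */
};             /* typically 12 bytes total */

/* Sorted large-to-small: no internal padding is needed. */
struct good_layout {
    int   b;   /* 4 bytes */
    short d;   /* 2 bytes */
    char  a;   /* 1 byte  */
    char  c;   /* 1 byte  */
};             /* typically 8 bytes total */
```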


The armcc compiler does include a keyword __packed that removes all padding.

However, packed structures are slow and inefficient to access.


So the __packed keyword should be used only in places where space is far more important than speed.
The exact layout of a structure in memory may depend upon the compiler vendor and the compiler
version you use.
Another ambiguity is enum: different compilers use different sizes for enumerated data types.
For example, consider
armcc will treat Bool as a one-byte type, whereas gcc treats it as 32-bit.
So, to avoid ambiguity, it is better to avoid enum data types in portable structures.
Another consideration is the size of the structure and the offsets of elements within it.
This is quite serious when we compile for Thumb: Thumb instructions are 16 bits
wide, so they allow only small immediate offsets.
So, the following rule may be applied when creating structures for maximum efficiency.

For portability, manually add padding into API structures, so that the layout of the structure does
not depend on the compiler.
Beware of using enum. The size of the enum type is compiler dependent.

Bitfields are probably the least standardized part of the ANSI C specification.
Bitfields are structure elements, usually accessed through structure pointers.
1. They suffer from the pointer aliasing problem.
2. Every bitfield access is really a memory access.
3. Because of pointer aliasing, the compiler is forced to reload the bitfield several times.
4. Also, compilers do not tend to optimize bitfield testing very well.
Example
The compiler output will be

Instead, if we treat the variable as a flag data type and rewrite the code as
Hint

A "flag" variable is simply a boolean variable whose contents is "true" or "false".


You can use either the bool type with true or false, or an integer variable with zero for "false"
and non-zero for "true".
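A minimal sketch of the two styles follows; the names (Stages_bits, STAGE_A, any_stage_active) are illustrative, not from the text. The flag version lets the compiler keep the whole word in a register and test several flags with a single AND, instead of one memory access per bitfield.

```c
/* Bitfield version: every access is a read-modify-write of memory,
   and pointer aliasing forces reloads. */
typedef struct {
    unsigned int stage_a : 1;
    unsigned int stage_b : 1;
    unsigned int stage_c : 1;
} Stages_bits;

/* Flag version: one unsigned word with explicit mask constants. */
typedef unsigned int Stages;
#define STAGE_A (1u << 0)
#define STAGE_B (1u << 1)
#define STAGE_C (1u << 2)

int any_stage_active(Stages s)
{
    /* one load and one test, instead of three bitfield extracts */
    return (s & (STAGE_A | STAGE_B | STAGE_C)) != 0;
}
```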

Summary of Bitfields usage

 ARM favours 32-bit aligned addresses.


 Unaligned values have to be pulled from memory a byte at a time and reformed.

Usually, ARM load and store instructions assume that the address is a multiple of the size of the type you are loading or storing.
If it is not aligned, the core may generate a data abort or load a rotated value.
Well-written portable code should avoid unaligned accesses.
Example
Summary on unaligned access and endianness
ARM does not have a hardware divide instruction. Instead, the compiler implements division
by calling software routines in the C library.
Division and modulus are slow operations, so they must be avoided as much as possible.
Tricks to avoid modulus
For an expression involving modulus, such as

(this type of expression is often used in circular buffers)

it takes around 50 cycles to complete.
However, the same logic implemented using an if statement

will take only 3 cycles, because it does not involve division.
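A sketch of the trick above, assuming the circular-buffer context and an increment smaller than the buffer size (the function name is illustrative):

```c
/* Circular-buffer increment without the modulus operator.
   (offset + increment) % buffer_size would call a library divide
   routine; this version costs only a compare and a subtract,
   provided increment < buffer_size. */
unsigned int circular_increment(unsigned int offset,
                                unsigned int increment,
                                unsigned int buffer_size)
{
    unsigned int next = offset + increment;
    if (next >= buffer_size)    /* wrap around instead of dividing */
        next -= buffer_size;
    return next;
}
```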

If you can’t avoid division in a program, then try to keep the numerator and denominator as unsigned int.
Signed versions are slower because they take the absolute values of the numerator and denominator, call the
unsigned version, and then fix the sign of the result.
To convert data available in an ordinary buffer, addressed by a 1-dimensional offset, to 2-D screen-buffer
coordinates, we usually write a C routine like
It appears that we have saved a division.
An improved version is

With the quotient in hand, the remainder follows from a multiply and subtract, so there is only one
division call. The corresponding compiler output has fewer instructions than getxy_v1.
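The improved routine can be sketched as below. The text names only getxy_v1, so getxy_v2 and the bytes_per_line parameter are assumptions for illustration:

```c
/* Convert a linear screen-buffer offset into (x, y) coordinates
   with a single division: the remainder is recovered with a
   multiply and subtract instead of a second call to '%'. */
void getxy_v2(unsigned int offset, unsigned int bytes_per_line,
              unsigned int *x, unsigned int *y)
{
    unsigned int q = offset / bytes_per_line;  /* the only division */
    *y = q;
    *x = offset - q * bytes_per_line;          /* remainder, no '%' */
}
```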
If the same denominator occurs multiple times, as below,

then instead of dividing by z each time, we can find 1/z once and multiply it with each numerator.
However, we must stick to integer arithmetic and avoid floating point.

This is applicable for integer division, not for mathematical division.

To find the integer part of n/d, as well as the quotient and remainder, this can be applied.
Approach 1:
To find n/d, we can compute n × d⁻¹.

To find d⁻¹ on a 32-bit processor, calculate


Then calculate n/d as

This approach has the following drawbacks


Substituting the value of s from Equation 5.3, we get

A small correcting code given below can be accommodated to remove the error.
Example application of divide through multiplication

Example 5.13
Here, we assumed that the numerator and denominator are 32-bit unsigned integers.
However, this algorithm works for 64-bit and 16-bit operands as well as 32-bit.
To divide by a constant, precalculate s as given in Example 5.13.
Another efficient method, employed in the ADS1.2 compiler, is:
Find an approximation to d⁻¹ that is sufficiently accurate that multiplying by it gives the
exact value of n/d.

If d < 0, i.e. negative, then divide by |d| and correct the sign later.
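As an illustration of division through multiplication by a precalculated reciprocal: for the fixed divisor d = 10, the multiplier 0xCCCCCCCD with a total shift of 35 is a standard choice that is exact for all 32-bit unsigned n. This (s, shift) pair is an assumption for illustration, not necessarily the constant derived in Example 5.13; a different divisor needs its own precalculated pair.

```c
#include <stdint.h>

/* Unsigned divide-by-10 via multiplication by a scaled reciprocal:
   s = 0xCCCCCCCD approximates 2^35 / 10, so (n * s) >> 35 equals
   n / 10 exactly for every 32-bit unsigned n.  The 64-bit product
   stands in for the UMULL high-word result on the ARM. */
uint32_t udiv10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```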
Summary

Hardware floating-point support is not provided in the ARM7. To save power and area, and
to keep the ARM suitable for price-sensitive applications, a floating-point unit is usually not integrated into the core.
Except on specific processors such as the ARM7500FE, which is provided with a Floating Point Accelerator (FPA),
or cores with the Vector Floating Point (VFP) coprocessor, the C compiler must provide support for floating
point in software.
The C compiler converts every floating-point operation into a subroutine call.
The C library contains subroutines that simulate floating-point behavior using integer
arithmetic.
This code is written in highly optimized assembly.
Even then, these operations are slow compared to their integer equivalents.

Function call overhead can be completely removed by inline functions and inline assembly.
Example:
Consider an operation that calculates a + 2*x*y,
where x and y are Q15 fixed-point integers and a is a Q31 fixed-point integer.
Q15 and Q31 are popular formats in which the most significant bit is the sign bit, followed by 15
bits of fraction (Q15) or 31 bits of fraction (Q31).
A Q15 number has a decimal range between −1 and 0.9999 (0x8000 to 0x7FFF).
Q31 has a decimal range between −1 and 2147483647/2147483648 ≃ 0.99999999953.
All operations saturate to the nearest representable value if they exceed the 16- or 32-bit range in a
calculation.
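A portable C sketch of the saturating a + 2*x*y operation follows; the helper names qadd and qmac are assumptions for illustration (on the ARM this would map onto saturating multiply-accumulate sequences rather than explicit compares).

```c
#include <stdint.h>

/* Saturating Q31 addition: clamps to the 32-bit signed range
   instead of wrapping on overflow. */
int32_t qadd(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* a + 2*x*y with x, y in Q15 and a in Q31: the Q15*Q15 product is
   Q30, doubling it gives Q31, then we add with saturation.  The
   doubling itself saturates, covering the 0x8000 * 0x8000 case. */
int32_t qmac(int32_t a, int16_t x, int16_t y)
{
    int32_t product = (int32_t)x * y;        /* Q30 */
    return qadd(a, qadd(product, product));  /* a + 2*product */
}
```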

Inline function example: it uses the keyword inline.

Here, an inline function as well as inline assembly are employed.


Porting C code to the ARM, or between ARM architectures, needs attention in the
following cases.
1. Char type — On the ARM, char is unsigned rather than signed by default. If we use a
char as a loop counter, then a condition such as i >= 0 becomes an infinite loop. armcc
produces a warning about an unsigned comparison with 0 at this point.

2. Int type — This causes problems when moving from a 16-bit to a 32-bit architecture:
the same comparison may be True on one architecture and False on the other.
Example: let i = -0x1000; then if (i == 0xF000) evaluates to True on a 16-bit
architecture, but False on a 32-bit architecture.

3. Unaligned data pointers — Older processors do not support unaligned pointers
(e.g. casting a char * to an int * on ARMv5 and earlier is not supported).
To detect this type of error, run with an alignment-checking trap enabled.

4. Endian assumptions — C code often makes assumptions about the endianness of the
memory system. It is better to replace such code with endian-independent code when
using pointers and in memory-intensive tasks.

5. Function prototyping — The armcc compiler passes arguments narrowed to the declared
range. If functions are not prototyped correctly, they may return wrong answers. Some
compilers pass arguments wide, which gives the correct answer even for unprototyped
functions. Hence, always use ANSI prototypes.

6. Use of bitfields — The layout of bits within a bitfield is implementation and
endianness dependent. If there is a mismatch, the code is not portable.

7. Use of the enum data type — enum is standard C, but its size is compiler
dependent: some compilers allot 8 bits, others 32 bits. This prevents cross-linking
code and libraries built with different compilers.

8. Inline assembly — Though inline assembly speeds up the code, it reduces portability
between architectures. Inline functions are a little easier to port than inline assembly.

Students are advised to refer to Text book 1, as specified in the syllabus, for more explanation.
Module-3 Chapter 6
Optimization
Compilers do perform optimization, but they have blind spots.

There are some optimizations that you can’t explicitly invoke by writing C, for example:

– Instruction scheduling

Reordering the instructions in a code sequence to avoid processor stalls. (The ARM has a pipelined
architecture, so stalling happens.)

– Register allocation

To minimize the number of memory accesses, the compiler must carefully decide register allocation.

– Conditional execution

Accessing the full range of ARM conditional instructions for better efficiency.

You have to use hand-written assembly to optimize critical routines.

1. How to convert a C function into an assembly function.

Let us take a program that calculates squares of integers from 0 to 9.

Now let us write an assembly language program to replace the function square.

To do this, first remove the C definition of square (don’t remove the declaration) and save the new C file as
main1.c.

Now, create an assembly program (ARM code) with the following contents and save it as square.s
Explanation

The AREA directive instructs the assembler to assemble a new code or data area. If the area name
does not start with an alphabetic character, enclose the name in vertical bars.

The EXPORT directive makes the symbol square available to external modules.

Line 6, square is declared as a label name.

When square is called, parameter passing is done as per the ATPCS rules (the four-register rule: r0 carries
the argument, and r0 is also used for the return value). The MUL instruction has a restriction that Rd should
not be the same as the first source register of the operation. Hence, the multiplication result is first stored
in r1 and then moved to r0.

The END directive marks the end of the assembly code.

Build the script in the command line as below

It works fine, if you compile the C as ARM code.

If you compile the C as Thumb code, then the return sequence of the assembly program must be changed as below.
To build the code,

2. How to call a subroutine from an assembly routine.

To understand this, let us take the same program and write it in assembly including the main.

Create a new assembly file main3.s with the following contents.


The IMPORT directive is used to declare symbols that are defined in other files.

i. The imported symbol |Lib$$Request$$armlib|,WEAK makes a request that the linker link in
the standard ARM C library. The WEAK specifier prevents the linker from giving an
error if the symbol is not found at link time. (If not found, it takes the value zero.)
ii. The imported symbol __main is the start of the C library initialization code. When we define
our own main, we must import it.
iii. Importing printf allows us to call the corresponding C library routine (which writes to stdout).

The RN directive allows us to use names for registers. In this program, we define i as an alternate
name for register r4. (This makes the code more readable.)

According to the ATPCS, a function must preserve registers r4 to r11 and sp.

Here we must preserve r4 and lr (we corrupt r4 by using it for i, and calling printf corrupts lr).

So, we stack these two registers at the start of the function using STMFD.

STMFD - Store Multiple Decrement Before (Store Multiple Full Descending) stores multiple registers
to consecutive memory locations using an address from a base register.

ADR is similar to LDR, but it is PC-relative: it forms an address by adding an offset to the pc.

LDMFD - pulls these registers from the stack and returns by writing the return address to pc.

DCB directive defines byte data described as a string or a comma separated list of bytes.
To build this, the following script should be run from the command line

3. Passing more than 4 parameters to a function


Let us take an example function that can sum any number of integers passed as arguments.

Now define the sumof function in assembly language as sumof.s


The code keeps count of the number of remaining values to sum.

This can be built using the following command line


Our aim is to optimize the ARM code.

The first stage of any optimization process is to identify the critical routines and measure their current
performance.

A profiler is a tool that measures and analyses a program.

Profiling: a form of dynamic program analysis that measures, for example, the space (memory) or time
complexity of a program, the usage of particular instructions, or the frequency and duration of function
calls. Most commonly, profiling information serves to aid program optimization.

The Cycle Counter counts the processor clock cycles. It measures the number of cycles taken by a
specific routine.

You can use a cycle counter to benchmark a given subroutine before and after optimization.

ARMulator, the ARM simulator available in ADS1.1, provides profiling and cycle-counting features.

It samples the pc at regular intervals. The profiler identifies the function the pc points to and updates a
hit counter for each function it encounters.

Another approach is to use the trace output of a simulator to analyse the code.

People relying on a profiler’s results must make sure they know the working principles of the tool as well
as its limitations.

If handled with limited knowledge, it may produce meaningless results.

People with thorough knowledge can create their own profilers. Cycle-counting hardware is not
available on the ARM, so to carry out cycle counting one should use the ARM debugger with the ARM simulator.

A properly configured ARMulator can serve the purpose of cycle counting and profiling.

The time taken to execute an instruction depends on the implementation pipeline.


The total time it takes the ARM processor to execute an instruction depends on the number
of cycles in the instruction fetch and instruction decode/execute phases. ARM has a pipelined
architecture, and each clock cycle advances the pipeline by one step. The throughput can approach
one instruction per cycle, but the actual time for an individual instruction from
fetch through completion can be 3 or more cycles. Typically, the total time ranges from 3 to 7 cycles.

To understand and perform instruction scheduling efficiently, we need to understand the ARM


pipeline and its dependencies.
The ARM9TDMI processor performs five operations in parallel, as shown in the figure.

Fetch – Gets the instruction from memory into the core.

Decode – Decodes the previous instruction and reads the operands from the register bank or
bypass buffers.

ALU – Executes the instruction that was decoded in the previous cycle. This instruction
was originally fetched from pc − 8 (ARM state) or pc − 4 (Thumb state).
This may be a data-processing operation, or the address calculation for a load, store, or branch.
Depending on the operation involved, this stage may take multiple cycles; for example,
multiply and register-controlled shift operations take several ALU cycles.
LS1 – The load or store is performed. (If the instruction is not a load or store, this stage has no effect.)
LS2 – Byte and half-word loads and stores (LDRB, STRB, LDRH, STRH) are completed here. (If the
instruction is not a byte or half-word load or store, this stage has no effect.)
After an instruction has completed the five stages of the pipeline, the core writes the result to the
register file.

Cycle timing rules for common instructions


Example code to understand processor stalling / interlocks.
If an instruction requires the result of a previous instruction that is not yet available, the processor
stalls. This is called a pipeline hazard or pipeline interlock.
Consider the example below, where there is no interlock.

This instruction pair takes two cycles. The ALU calculates r0 + r1 in one cycle. This result is available for the
ALU to calculate r0 + r2 in the second cycle.
Consider the example below, which shows a one-cycle interlock.
This takes three cycles. The ALU calculates the address r2 + 4 in the first cycle; at the same time,
the ADD instruction is in the decode stage. However, the ADD cannot be executed in the second cycle
because the load instruction has not yet loaded the value of r1. Hence, the pipeline stalls for one cycle
while the load instruction completes the LS1 stage. Now that r1 is ready, the ALU executes the ADD in the
third cycle.
The figure shows the one-cycle interlock that occurs here.

Example 3 Pipeline flush caused by a branch

Cycle 1: MOV is executed.


Cycle 2: The branch instruction calculates the destination address.
This causes a pipeline flush, and the new address is loaded into the pc. The refill takes 2 cycles.
Cycle 5: The SUB instruction is executed.

Load instructions occur frequently in a task.


They account for roughly one third of all instructions in a typical task.
Careful scheduling of load instructions avoids pipeline stalls and hence improves performance.
In general, the compiler applies its optimization rules and gives its best.
There are some limitations to this optimization due to pointer aliasing: whenever a pointer
operation occurs, the compiler reloads the value from memory rather than reusing the data already
held in a register.
In general, the compiler will reorder instructions without changing the meaning of the program.
However, it cannot move a load before a preceding store, because the pointers may alias.
Let us consider an example program that loads a zero-terminated string of characters
one by one from memory and converts the string to lowercase.
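The C source can be sketched as below; the function name str_tolower is an assumption for illustration (the original listing is not reproduced here). Each iteration loads one character, which is exactly the load the compiler reschedules:

```c
/* Convert a zero-terminated string to lowercase in place.
   One character is loaded per iteration; the compiled output
   discussed in the text moves this load to the end of the loop
   body to hide the load-use interlock. */
void str_tolower(char *s)
{
    char c;
    while ((c = *s) != '\0') {
        if (c >= 'A' && c <= 'Z')
            *s = (char)(c + ('a' - 'A'));  /* fold into lowercase */
        s++;
    }
}
```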
The compiled output from the ADS1.1 compiler is given here.

Here the loop is rearranged so that the data is loaded at the end of the loop rather than at the
beginning. Though the code size increases slightly, the unnecessary processor stall is removed.
Two cycles are saved per loop iteration.
The ARM architecture is well suited to this type of preloading, because instructions can be
executed conditionally.
Loop iteration i loads the data for iteration i + 1.

For the first iteration, we preload the data by inserting extra load instructions before the start
of the loop.
For the last iteration, no byte load occurs.

This method of load scheduling works by unrolling and then pipelining the body of the loop.
This is the most efficient implementation.
It gives a 1.57× speed increase over the original function.
The conditional instructions of the ARM are used to best advantage.

However, the number of instructions is roughly double the original.
Performance can be slower for very short strings because (i) stacking lr causes additional
function-call overhead, and (ii) the routine may process up to two characters pointlessly before
discovering that they lie beyond the end of the string.

This type of scheduling by unrolling can be employed in the time-critical parts of an application
where the data size is large.


Condition Codes

The processor core has 15 condition codes for conditional execution.

If no condition is specified, the assembler defaults to AL (the always condition).

The remaining 14 conditions are split into 7 pairs of complements.

Each condition depends on the 4 condition flags, namely

N, Z, C, V, of the CPSR register.

By default, ARM instructions do not update the flags in the CPSR unless the instruction has the S suffix.

Exceptions to this are comparison instructions such as CMP, whose primary job is to update the flags (they
do not write a result register).
By combining conditional execution and conditional setting of the flags, we can implement simple if
statements without any branches.

This improves efficiency, because it can be accomplished in fewer machine cycles.

Example.

The following C code converts an unsigned integer 0 ≤ i ≤ 15 to a hexadecimal digit.
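Since the original listing is not reproduced here, the routine can be sketched as below (the function name hex_digit is an assumption for illustration):

```c
/* Convert an unsigned integer 0 <= i <= 15 to its ASCII
   hexadecimal digit.  The two-way choice maps directly onto
   ARM conditional execution, with no branch needed. */
char hex_digit(unsigned int i)
{
    if (i < 10)
        return (char)('0' + i);
    else
        return (char)('A' + i - 10);
}
```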

Using conditional execution, this can be written in assembly language as


Example 2

In assembly, we can write this using conditional comparison:
