Module 3
The way we lay out a frequently used structure affects both performance and code density.
There are two issues:
1. Alignment of the structure entries
2. Overall size of the structure
Solution:
The compiler cannot reorder the members itself, because the C standard guarantees that
structure members are laid out in the order in which they are specified.
On a little-endian memory system, the compiler lays out the structure by adding padding so that each member is aligned.
To improve memory usage, reorder the elements so that less padding is needed.
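The effect of reordering can be sketched as follows. The structure and member names here are assumptions chosen for illustration, and the exact sizes depend on the compiler and target; on a typical 32-bit ARM compiler the first layout usually occupies 12 bytes and the second 8 bytes.

```c
#include <stddef.h>

/* Hypothetical layout: member order forces the compiler to insert padding. */
struct padded {
    char  a;   /* 1 byte, typically followed by 3 bytes of padding */
    int   b;   /* 4 bytes, usually needs 4-byte alignment */
    char  c;   /* 1 byte, typically followed by 1 byte of padding */
    short d;   /* 2 bytes */
};

/* Same members, smallest first, so they normally pack without gaps. */
struct reordered {
    char  a;
    char  c;
    short d;
    int   b;
};

/* Bytes saved by reordering on the current compiler and target. */
size_t bytes_saved(void)
{
    return sizeof(struct padded) - sizeof(struct reordered);
}
```

Printing `sizeof` for both structures on the target compiler is a quick way to confirm the layout before relying on it.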
For portability, manually add padding into API structures, so that the layout of the structure does
not depend on the compiler.
Beware of using enum. The size of the enum type is compiler dependent.
Bitfields are probably the least standardized part of the ANSI C specification.
Bitfields are structure elements and are usually accessed through structure pointers.
1. They suffer from the pointer aliasing problem.
2. Every bitfield access is really a memory access.
3. Because of pointer aliasing, the compiler is forced to reload the bitfield several times.
4. Also, compilers do not tend to optimize bitfield testing very well.
Example
The compiler output will be
Instead, we can treat the variable as a plain flag word, test the bits with logical masks, and rewrite the code as
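A minimal sketch of the two styles is given here. The type names, flag names and functions are assumptions made for illustration and are not the original listing. In real code, where function calls sit between the tests, copying the flags into a local variable is what lets the compiler avoid reloading them.

```c
/* Hypothetical bitfield version: each test reads the structure in memory. */
typedef struct {
    unsigned int stageA : 1;
    unsigned int stageB : 1;
    unsigned int stageC : 1;
} Stages_v1;

/* Hypothetical mask version: the flags live in one ordinary integer. */
typedef unsigned long Stages_v2;
#define STAGEA (1ul << 0)
#define STAGEB (1ul << 1)
#define STAGEC (1ul << 2)

/* Count the enabled stages using bitfield accesses. */
int count_stages_v1(const Stages_v1 *stages)
{
    int n = 0;
    if (stages->stageA) n++;
    if (stages->stageB) n++;
    if (stages->stageC) n++;
    return n;
}

/* Count the enabled stages using one load and logical masks. */
int count_stages_v2(const Stages_v2 *stages_v2)
{
    Stages_v2 stages = *stages_v2;   /* load the flag word once */
    int n = 0;
    if (stages & STAGEA) n++;
    if (stages & STAGEB) n++;
    if (stages & STAGEC) n++;
    return n;
}
```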
Hint
Usually, the ARM load and store instructions assume that the address is a multiple of the size of the type you are
loading or storing.
If the address is not aligned, the core may generate a data abort or load a rotated value.
Well-written portable code should avoid unaligned accesses.
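One portable way to avoid an unaligned word access is to assemble the value from individual bytes. The sketch below assumes the data is stored in little-endian order; the function name `read_u32_le` is hypothetical.

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from any address, aligned or not,
   using only byte loads. */
uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Because the byte order is spelled out explicitly, the same function gives the same result on a big-endian core.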
Example
Example
Summary on unalignment and Endianness
The ARM cores discussed here do not have a divide instruction in hardware. Instead, the compiler implements division
by calling software routines in the C library.
Division and modulus are slow operations, so they should be avoided as much as possible.
Tricks to avoid modulus
For an expression involving modulus as
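A common case is advancing an offset in a circular buffer. The sketch below uses hypothetical names and assumes that `offset` is already less than `buffer_size` and that `increment` does not exceed `buffer_size`, so a single conditional subtraction is enough.

```c
/* Straightforward version: uses a modulus, which becomes a library call. */
unsigned int advance_mod(unsigned int offset, unsigned int increment,
                         unsigned int buffer_size)
{
    return (offset + increment) % buffer_size;
}

/* Modulus-free version: valid when offset < buffer_size and
   increment <= buffer_size. */
unsigned int advance_nomod(unsigned int offset, unsigned int increment,
                           unsigned int buffer_size)
{
    offset += increment;
    if (offset >= buffer_size) {
        offset -= buffer_size;
    }
    return offset;
}
```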
If you can’t avoid division in a program, try to keep the numerator (Nr) and denominator (Dr) as unsigned int.
The signed versions are slower because they take the absolute values of the numerator and denominator, call the
unsigned version, and afterwards fix the sign of the result.
To convert a one-dimensional offset in an ordinary buffer into a position in a 2D screen
buffer, we usually write a C program like
It appears that we have saved a division.
An improved version is
The quotient and the remainder are obtained together, so there is only one division call. The corresponding
compiler output has fewer instructions than that of getxy_v1.
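The two versions discussed above might look like the following sketch. The name `getxy_v1` comes from the text; `getxy_v2`, the `point` type and the parameter names are assumptions. In the first version the remainder is rebuilt with a multiply and subtract. In the second, the compiler can obtain the quotient and remainder from a single call to the division routine.

```c
/* Hypothetical coordinate type. */
typedef struct {
    unsigned int x;
    unsigned int y;
} point;

/* Version 1: one explicit division, remainder rebuilt by multiply-subtract. */
point getxy_v1(unsigned int offset, unsigned int bytes_per_line)
{
    point p;
    p.y = offset / bytes_per_line;
    p.x = offset - p.y * bytes_per_line;
    return p;
}

/* Version 2: ask for both results and let the compiler share one
   division call that returns the quotient and the remainder. */
point getxy_v2(unsigned int offset, unsigned int bytes_per_line)
{
    point p;
    p.x = offset % bytes_per_line;
    p.y = offset / bytes_per_line;
    return p;
}
```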
If the same denominator occurs multiple times, as below,
Here, instead of dividing by z each time, we can compute 1/z once and multiply it by each numerator.
However, we must stick to integer arithmetic and avoid floating point.
The integer reciprocal is only an approximation, so a small correction step can be added to remove the error.
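The reciprocal-plus-correction idea can be sketched as follows for 32-bit unsigned values. The function and type names are hypothetical. The scaled reciprocal `s` is calculated once per denominator; the estimate `q` is then either the true quotient or one less, so a single compare-and-adjust step fixes it. The denominator `d` is assumed to be nonzero.

```c
#include <stdint.h>

/* Hypothetical result type holding quotient and remainder. */
typedef struct {
    uint32_t q;
    uint32_t r;
} udiv_result;

/* Scaled reciprocal s = floor((2^32 - 1) / d), computed once per d. */
uint32_t reciprocal_u32(uint32_t d)
{
    return 0xFFFFFFFFu / d;
}

/* Divide n by d using the precomputed reciprocal s and one correction. */
udiv_result udiv_by_reciprocal(uint32_t n, uint32_t d, uint32_t s)
{
    udiv_result res;
    res.q = (uint32_t)(((uint64_t)n * s) >> 32);  /* estimate: q or q - 1 */
    res.r = n - res.q * d;                        /* remainder of estimate */
    if (res.r >= d) {                             /* estimate was one too small */
        res.r -= d;
        res.q++;
    }
    return res;
}
```

The one division needed to obtain `s` pays off when many numerators are divided by the same denominator.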
Example application of divide through multiplication
Example 5.13
Here, we assumed that the numerator and denominator are 32-bit unsigned integers.
However, this algorithm works for 64-bit and 16-bit values as well as for 32-bit values.
To divide by a constant, precalculate s as given in Example 5.13.
Another efficient method, employed by the ADS1.2 compiler, is:
Find an approximation to 1/d that is sufficiently accurate so that multiplying n by the approximation gives the
exact value of n/d.
If d < 0, i.e. negative, then divide by |d| and correct the sign later.
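As an illustration of this idea, and not necessarily the exact constants any particular compiler emits, unsigned division by the constant 10 can be done with one multiplication and a shift. The constant 0xCCCCCCCD is 2^35/10 rounded up, which is accurate enough to give the exact quotient for every 32-bit unsigned n. The function name is hypothetical.

```c
#include <stdint.h>

/* Divide a 32-bit unsigned value by 10 without a division instruction:
   multiply by ceil(2^35 / 10) and keep the bits above bit 35. */
uint32_t udiv10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}
```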
Summary
Hardware floating-point support is not provided in ARM7. To save power and area and
to keep the cost of the ARM low, it is usually not integrated into the core.
With the exception of the Floating Point Accelerator (FPA) on the ARM7500FE
and the Vector Floating Point (VFP) accelerator hardware, the C compiler must provide floating-point
support in software.
The C compiler converts every floating-point operation into a subroutine call.
The C library contains subroutines that simulate floating-point behavior using integer
arithmetic.
This code is written in highly optimized assembly.
Even then, these operations are slow compared to their integer equivalents.
Function call overhead can be completely removed by inline functions and inline assembly.
Example:
Consider an operation that calculates a+2*x*y
where x and y are Q15 fixed-point integers and a is a Q31 fixed-point integer.
Q15 and Q31 are popular formats in which the most significant bit is the sign bit, followed by 15
bits of fraction (Q15) or 31 bits of fraction (Q31).
A Q15 number has a decimal range between -1 and 32767/32768 ≃ 0.99997 (0x8000 to 0x7FFF).
A Q31 number has a decimal range between -1 and 2147483647/2147483648 ≃ 0.99999999953.
All operations saturate to the nearest representable value if they exceed the 16-bit or 32-bit range in a
calculation.
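A portable sketch of the saturating operation a + 2*x*y is given here. The names `qmac` and `saturate32` are assumptions, and the 64-bit intermediate is used for clarity only; an ARM-tuned version would stay in 32-bit arithmetic or use inline assembly.

```c
#include <stdint.h>

/* Clamp a 64-bit value to the 32-bit signed (Q31) range. */
static inline int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Saturating a + 2*x*y, with x and y in Q15 and a and the result in Q31. */
static inline int32_t qmac(int32_t a, int16_t x, int16_t y)
{
    /* Q15 * Q15 gives Q30; doubling converts the product to Q31. */
    int32_t product = saturate32(2 * (int64_t)x * (int64_t)y);
    return saturate32((int64_t)a + product);
}
```

For example, x = y = 0x4000 (0.5 in Q15) with a = 0 gives 0x20000000, which is 0.25 in Q31.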
Students should refer to Text Book 1, as specified in the syllabus, for further explanation.
Module-3 Chapter 6
Optimization
Compilers do perform optimization, but they have blind spots.
Some optimization techniques cannot be applied explicitly when writing C, for example:
–Instruction scheduling
Reordering the instructions in a code sequence to avoid processor stalls. (ARM has a pipelined
architecture, so stalls do occur.)
–Register allocation
To minimize the number of memory accesses, one must decide the register
allocation carefully.
–Conditional execution
Using the full range of ARM condition codes and conditional instructions for better efficiency.
Now let us write an assembly language program to replace the function square.
To do this, first remove the C definition of square (don’t remove the declaration) and save the new C file as
main1.c.
Now create an assembly program (ARM code) with the following contents and save it as square.s
Explanation
The AREA directive instructs the assembler to assemble a new code or data area. If the area name
contains non-alphanumeric characters, enclose the name in vertical bars.
When square is called, parameter passing follows the ATPCS rule (the four-register rule, i.e. the argument arrives in R0
and the same R0 is used to return the result).
The MUL instruction has a restriction that Rd must not be the same as the
first source register of the operation. Hence, the multiplication result is first stored in R1 and then moved to
R0.
If you compile the C code as Thumb code, then the return in the assembly routine must be changed as below
To build the code,
To understand this, let us take the same program and write it in assembly including the main.
i. The imported symbol |Lib$$Request$$armlib|, WEAK makes a request that the linker link
with the standard ARM C library. The WEAK specifier prevents the linker from giving an
error if the symbol is not found at link time. (If it is not found, the symbol takes the value zero.)
ii. The imported symbol __main is the start of the C library initialization code. We need to import it
only when we define our own main.
iii. Importing printf allows us to call that C library function.
The RN directive allows us to use names for registers. In this program, we define i as an alternate
name for register r4. (This makes the code more readable.)
Here we must preserve r4 and lr (we corrupt r4 by using it for i, and calling printf corrupts lr).
So, we stack these two registers at the start of the function using STMFD.
STMFD - Store Multiple Full Descending (equivalent to Store Multiple Decrement Before) stores multiple registers
to consecutive memory locations using an address from a base register.
LDMFD - pulls these registers from the stack and returns by writing the return address to pc.
DCB directive defines byte data described as a string or a comma separated list of bytes.
To build this, the following script should be run from the command line
The first stage of any optimization process is to identify the critical routines and measure their current
performance.
Profiling: a form of dynamic program analysis that measures, for example, the space (memory) or time
complexity of a program, the usage of particular instructions, or the frequency and duration of function
calls. Most commonly, profiling information serves to aid program optimization.
The Cycle Counter counts the processor clock cycles. It measures the number of cycles taken by a
specific routine.
You can use a cycle counter to benchmark a given subroutine before and after optimization.
The ARMulator, the ARM simulator used by the ADS1.1 debugger, provides profiling and cycle-counting features.
The profiler samples the pc at regular intervals. It identifies the function the pc points to and updates a
hit counter for each function it encounters.
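The sampling step can be sketched in C as follows. This is a simplified illustration, not the ARMulator implementation: the table layout, the names and the linear search are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one function's address range and hit count. */
typedef struct {
    const char   *name;
    uintptr_t     start;   /* address of the first instruction */
    uintptr_t     end;     /* one past the last instruction */
    unsigned long hits;    /* number of samples that landed here */
} func_entry;

/* Record one pc sample. Returns the index of the function hit,
   or -1 if the pc lies outside every known function. */
int profile_sample(func_entry *table, size_t count, uintptr_t pc)
{
    size_t k;
    for (k = 0; k < count; k++) {
        if (pc >= table[k].start && pc < table[k].end) {
            table[k].hits++;
            return (int)k;
        }
    }
    return -1;
}
```

After many samples, the functions with the largest hit counts are the critical routines worth optimizing first.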
Another approach is to use the trace output of a simulator to analyse the code.
Anyone relying on a profiler’s results must understand the working principle of the tool as well
as its limitations.
Those with sufficient knowledge can create their own profiler. ARM implementations do not normally contain
cycle-counting hardware, so to carry out cycle counting one should use an ARM debugger with an ARM simulator.
A properly configured ARMulator can serve the purpose of both cycle counting and profiling.
Decode - Decodes the instruction that was fetched in the previous cycle and reads the operands from the register bank or
the forwarding buffers.
ALU - Executes the instruction that was decoded in the previous cycle. This instruction
was originally fetched from pc-8 (ARM state) or pc-4 (Thumb state).
The work done may be a data processing operation, or the address calculation for a load, store or branch operation.
Depending on the operation involved, this stage can take multiple cycles. For example,
multiply and register-controlled shift operations take several ALU cycles.
LS1 - The load or store instruction gets executed. (If the instruction is not a load or store, this stage has no effect.)
LS2 - The data loaded by a byte or half-word load (such as LDRB or LDRH) is extracted and zero- or sign-extended. (If the
instruction is not a byte or half-word load, this stage has no effect.)
After an instruction has completed the five stages of the pipeline, the core writes the result to the
register file.
This instruction pair takes two cycles. The ALU calculates r0+r1 in one cycle. This result is then available for the
ALU to calculate r0+r2 in the second cycle.
Consider the example below, which shows a one-cycle interlock.
This takes three cycles. The ALU calculates the address r2+4 in the first cycle while
the ADD instruction is in the decode stage. However, the ADD cannot be executed in the second cycle
because the load instruction has not yet loaded the value of r1. Hence, the pipeline stalls for one cycle
while the load instruction completes the LS1 stage. Now that r1 is ready, the ALU executes the ADD in the
third cycle.
The figure shows the one-cycle interlock that occurs in this sequence.
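The cycle counts quoted above can be reproduced with a deliberately simplified model. The sketch below is an assumption-laden illustration, not a real pipeline simulator: it charges one cycle per instruction plus one stall cycle whenever an instruction uses the result of the word load immediately before it. The `insn_t` record is hypothetical and is not an ARM encoding.

```c
#include <stddef.h>

/* Hypothetical, simplified instruction record. */
typedef struct {
    int is_load;   /* 1 for a word load such as LDR, 0 otherwise */
    int dest;      /* destination register number */
    int src1;      /* source register numbers, -1 if unused */
    int src2;
} insn_t;

/* Count cycles: one per instruction, plus a one-cycle interlock when an
   instruction reads the register loaded by the preceding word load. */
unsigned count_cycles(const insn_t *code, size_t n)
{
    unsigned cycles = 0;
    size_t k;
    for (k = 0; k < n; k++) {
        cycles++;
        if (k > 0 && code[k - 1].is_load &&
            (code[k].src1 == code[k - 1].dest ||
             code[k].src2 == code[k - 1].dest)) {
            cycles++;   /* pipeline stalls while the load finishes LS1 */
        }
    }
    return cycles;
}
```

With this model, two dependent ADD instructions cost two cycles, while a load into r1 followed by an ADD that reads r1 costs three.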
Here the loop is rearranged so that the data is loaded at the end of the loop rather than at the
beginning. Although the code size increases slightly, the unnecessary processor stall is eliminated.
Two cycles are saved for each loop iteration.
The ARM architecture is well suited to this type of preloading because
instructions can be executed conditionally.
Iteration i of the loop loads the data for iteration i+1.
For the first iteration, we can preload the data by inserting extra load instructions before the start
of the loop.
For the last iteration, no byte load must occur; otherwise the loop would read beyond the end of the data.
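The preloading structure can be shown in C, assuming the routine under discussion converts a string to lower case. The function name is hypothetical, and the real benefit appears only in hand-scheduled assembly, since a C compiler is free to reorder these statements.

```c
/* Copy a zero-terminated string from in to out, converting upper-case
   letters to lower case. The load for iteration i+1 is issued at the
   end of iteration i. */
void str_tolower_preload(char *out, const char *in)
{
    unsigned int c = (unsigned char)*in++;   /* preload for the first iteration */
    for (;;) {
        unsigned int t = c - 'A';
        if (t <= 'Z' - 'A') {                /* unsigned compare: 'A' <= c <= 'Z' */
            c += 'a' - 'A';
        }
        *out++ = (char)c;
        if (c == 0) {
            break;                           /* last iteration: no further load */
        }
        c = (unsigned char)*in++;            /* load the data for the next iteration */
    }
}
```

The conditional break before the final load plays the role of the conditionally executed load instruction in the assembly version.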
This method of load scheduling works by unrolling and then interleaving the body of the loop.
This is the most efficient implementation.
It gives a 1.57 times speed increase over the original function.
It makes the best use of the ARM conditional instructions.
However, the number of instructions is double that of the original.
Performance can be slower for very short strings because i) stacking lr adds
function call overhead, and ii) the routine may process up to two characters pointlessly before
discovering that they lie beyond the end of the string.
Condition Codes
By default, ARM instructions do not update the flags in the CPSR unless the instruction has the S suffix.
The exceptions are comparison instructions such as CMP, whose sole purpose is to update the flags (they do not write a result to a register).
By combining conditional execution and conditional setting of the flags, we can implement simple if
statements without any branches.
This improves efficiency because the test can be accomplished in a minimum number of machine cycles.
Example.
In assembly,
we can write using conditional comparison
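As a hedged illustration, which may differ from the example these notes originally showed, consider counting vowels. The C function below is runnable; the comment sketches how the compound test could be written with conditional comparisons in ARM assembly. The register names c and vowel are assumptions.

```c
/* Count the lower-case vowels in a zero-terminated string.
 *
 * One possible branch-free ARM sequence for the test inside the loop:
 *     TEQ    c, #'a'
 *     TEQNE  c, #'e'
 *     TEQNE  c, #'i'
 *     TEQNE  c, #'o'
 *     TEQNE  c, #'u'
 *     ADDEQ  vowel, vowel, #1
 * Each TEQNE runs only if no earlier comparison matched, so the Z flag
 * is set at the end exactly when c equals one of the five vowels.
 */
int count_vowels(const char *s)
{
    int vowel = 0;
    char c;
    while ((c = *s++) != '\0') {
        if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') {
            vowel++;
        }
    }
    return vowel;
}
```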