Optimizing Itanium-Based Applications
Version 1.11
Table of Contents
introduction
six levels of optimization
  level zero
  level one
  level two
  level two -ipo
  level three
  level four (level three ipo)
interprocedural optimizations with -ipo
loop optimizations at +O3 or +O4
advanced optimization options and pragmas
  enabling aggressive optimizations
  removing compilation time limits when optimizing
  limiting the size of optimized code
  controlling the scheduling model
  controlling floating point optimization
  controlling data allocation
  controlling symbol binding
  controlling other optimization features
profile-based optimization
  instrumenting the code
  collecting execution profile data
  performing profile-based optimization
  maintaining profile data files
  merging profile data files
  locking of profile database files
  Itanium- versus PA-RISC profile-based optimization differences
compiler-generated performance advice
putting it together with optimization option recipes
References
introduction
The HP Itanium-based optimizer transforms code so that it runs more efficiently on Itanium-based
HP-UX systems. The optimizer can dramatically improve application performance. However, compile
time and memory use increase with each higher level of optimization, due to the increasingly complex
analysis that is performed.
This document discusses the following topics:
Note that this version applies to the A.06.26 (AR1109) release of the HP compilers. For an overview of the
HP compiler technology, see HP Compilers for HP Integrity Servers[1].
level zero
+O0
description:
Disables optimization.
benefits:
Fastest compile time; however, use of this optimization level is strongly discouraged due to the
poor quality of the resulting code.
level one
+O1 (default)
description:
Local optimizations that optimize over a single basic block, including common subexpression
elimination, constant folding, and load-store elimination.
Performs data prefetching of simple array traversals.
More sophisticated instruction scheduling.
Register promotion of some scalar locals and C/C++ scalar formals.
In C++, inlining of calls within a translation unit.
benefits:
Produces much faster code than +O0, while compiling faster than +O2.
Debugging correctness of code is maintained. Breakpoints behave as expected and variables have
expected values at breakpoints. See Section 14.27 (Debugging optimized code) in Debugging with
GDB[2] for more information on this topic.
level two
+O2 or -O
benefits:
Significantly faster code than produced at Level 1, due to optimized code and better use of machine
resources and Itanium architectural features.
Non-numeric applications can be improved by 50% or more.
Loop intensive numeric applications achieve even greater speedups due to optimizations such as
more aggressive data prefetching and software pipelining.
level two -ipo
+O2 -ipo
description:
Performs Level 2 optimizations, plus optimizations across the entire application program.
Performs interprocedural optimizations (IPO) at link time, including improved range propagation
and alias analysis, cross module inlining, interprocedural data prefetching, dead variable and dead
function removal, variable privatization, short data optimization, data layout optimization, constant
propagation, and import stub inlining.
Performs indirect call promotion in whole program mode if dynamic PBO data is available
(+Oprofile=use).
Performs inlining of a larger set of math library routines into user code.
See the chapter on interprocedural optimizations below for more details.
This level of optimization limits the ability to debug the application. See Section 14.27 (Debugging
optimized code) in Debugging with GDB[2] for more information on this topic.
benefits:
Better alias information and inlining improve and enable additional optimizations over Level 2.
Applications containing many indirect calls or virtual function calls can benefit greatly from
indirect call promotion.
Data optimizations improve cache and TLB behavior.
Code optimizations reduce the number of instructions.
level three
+O3
description:
Performs Level 2 optimizations, plus optimizations across all functions in a single file.
Includes inlining and cloning of functions within the same file.
High-level optimizations, such as loop transformations (interchange, fusion, unrolling, and so on)
occur. Please see the section about loop optimization below.
Performs inlining of a larger set of math library routines into user code.
Recognizes simple copy loops and replaces them with calls to optimized memory copy routines.
Recognizes simple manually unrolled loops and rerolls them, enabling better unrolling decisions
for a given platform later in the loop optimizer.
This level of optimization limits the ability to debug the application. See Section 14.27 (Debugging
optimized code) in Debugging with GDB[2] for more information on this topic.
benefits:
Can produce faster code than Level 2. This is particularly true for numerical codes, which tend to
benefit more from the loop transformations, and for codes that frequently call small functions
within the same file or math library functions, which benefit from inlining.
level four (level three ipo)
+O4
description:
Performs Level 3 optimizations, plus optimizations across the entire application program.
Performs interprocedural optimizations at link time, please see level two -ipo for a summary.
This level of optimization limits the ability to debug the application. See Section 14.27 (Debugging
optimized code) in Debugging with GDB[2] for more information on this topic.
benefits:
Interprocedural optimizations generally improve application performance (see Level two -ipo).
Better alias information and inlining improves and enables additional loop transformations.
Interprocedural analysis of memory references and function arguments enables and improves many
optimizations; for example, it yields several additional opportunities for optimizations in the low-level
optimizer, including register promotion.
Consider this example:
void foo( int *x, int *y )
{
    ... = *x;    // load 1
    *y = ...;    // store 1
    ... = *x;    // load 2
}
Without any additional knowledge about the properties of the pointers x and y, the compiler has to
issue a second load instruction (load 2), since the store (store 1) may overwrite the memory that
x points to.
If, as a result of interprocedural analysis, the compiler is able to determine that x and y never alias
(point to the same memory location), it can promote the value of *x into a register and simply reuse
that register instead of issuing the second load (load 2).
The compiler interprocedurally propagates information about modified and referenced data items
(mod/ref analysis), which can benefit various other compiler analyses and transformations which need
to consider global side effects.
The compiler also interprocedurally propagates range information for certain entities.
Function inlining exposes traditional benefits, such as the reduction of call overhead, the improvement
of the locality of the executing code and the reduction of the number of branches. More importantly
though, inlining exposes additional optimization opportunities because of the widened scope and
enables better instruction scheduling.
The inliner framework has been designed to scale to very large applications; it uses a novel and very
fast underlying algorithm and employs an elaborate set of new heuristics for its inlining decisions.
Note: The inlining engine is also employed at +O2 for intra-module inlining. At this optimization level
the inliner uses scaled-back heuristics in order to guarantee fast compile times in addition to positive
performance effects.
The whole call graph is constructed, enabling indirect call promotion, where an indirect call is
converted to a test and a direct call. Depending on the application characteristics, and in the presence
of PBO data, this can result in significant application speedups (we have observed up to 20%
improvements for certain applications).
Dead variable removal allows the high level optimizer to reduce the total memory requirements of the
application by removing global and static variables that are never referenced.
Recognition of global, static and local variables that are assigned but never used allows the optimizer
to remove dead code (which may result in additional dead variables).
Conversion of global variables that are referenced only within a module allows the high level
optimizer to convert the symbol to a private symbol, guaranteeing that it can only be accessed from
within this module. This gives the low-level optimizer greater freedom in optimizing references to that
variable.
Dead function removal (functions that are never called) and redundant function removal (for example,
duplicate template instantiations) help to reduce compile time and improve the effectiveness of cross
module inlining by reducing the working set. Additionally, as the application's total code size is
reduced, it incurs fewer cache and page misses, potentially resulting in higher performance.
Short data optimizations. Global and static data allocated in the short data area can be accessed with a
more efficient access sequence. In whole program mode (-ipo) the compiler can perform precise
analysis to determine if all global and static data fits into the short data area and allocate it there. If the
data doesn't fit, the compiler can determine the best safe short data size threshold, enabling the
maximum number of data items to be addressed more efficiently.
Note: This is an IPO advantage. At other optimization levels the same optimization can be enabled
with the option +Oshortdata. The option -ipo derives an optimal short data threshold.
For calls to external functions (functions not residing in the same load module), the linker introduces a
small import stub. If the compiler knows that a function call is a call to an external function, it can
inline the import stub, resulting in better performance.
The HP compilers support a mechanism that allows annotating function prototypes with a pragma
(#pragma extern) marking those functions as external functions, enabling import stub inlining.
All this is no longer necessary with -ipo in whole program mode. In this model the compiler knows
which functions are defined by the application and which are external and automatically marks
functions appropriately.
The compiler performs interprocedural data layout optimizations, in particular, structure splitting, array
splitting and dead field removal. If the compiler is able to determine that a given record type can be
modified safely and if additionally heuristics find that such type modifications are beneficial, the
compiler may break a record type into a cold part and a hot part with the goal of reducing cache miss
and TLB penalties.
Currently, this optimization is limited to a very restricted set of scenarios. Please use +Oinfo to
determine whether this optimization has been performed.
The compiler can also perform non-contiguous array fusion. For some multi-dimensional,
non-contiguous, pointer-based arrays, the compiler will modify the declaration, allocation, and uses of
such arrays to use a contiguous memory layout instead. This transformation both allows for more
efficient element access and results in better cache utilization.
The compiler inserts interprocedural data prefetches before call sites for data accessed in the call chain
rooted at the call site. The inserted prefetches attempt to fetch data accessed via dereferences of
pointer parameters of the call.
The interprocedural analysis phase is also able to expose and warn on additional source problems, for
example, for variables that are declared with incompatible attributes in different source files.
The interprocedural optimization framework has been designed to scale to very large applications.
Fortunately, nothing changes from a user's perspective; in particular, existing build processes do not
have to be modified. Since the IPO and code generation are performed at link time, the link time may
increase significantly.
At the end of the IPO phase, the code generation and low-level optimization phase is started by invoking
multiple parallel processes of the be binary (the compiler back end). The default number of parallel be
processes is set to the number of processors on the machine. This number can be overridden by setting
the environment variable
PARALLEL, for example:
export PARALLEL=4
effect that the if-statement is now executed only when the loop is reached, and no longer on every loop
iteration.
loop cloning
Loop cloning seeks to special case loops with variable trip counts with help of profile information. For
example, if a loop iterates from 0 to N, but the profile information hints that the loop most of the time
executes with a constant trip count C, it can be beneficial to special case the loop for C and to check for
this value at runtime to select the proper loop variant. A loop with known trip count can be scheduled most
effectively by the low level optimizer, which can result in dramatic runtime improvements.
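As an illustration, the transformation can be sketched in C. The function names and the dominant trip count of 8 are hypothetical, and the dispatcher shown stands in for code the compiler would generate internally:

```c
#include <stddef.h>

/* General loop: the trip count n is known only at runtime. */
static void scale_any(double *a, size_t n, double f) {
    for (size_t i = 0; i < n; i++)
        a[i] *= f;
}

/* Cloned variant specialized for the profiled dominant trip count 8:
 * with a constant trip count, the low-level optimizer can fully
 * unroll and schedule the body. */
static void scale_8(double *a, double f) {
    for (size_t i = 0; i < 8; i++)
        a[i] *= f;
}

/* Dispatcher the compiler might emit under PBO: a runtime check
 * selects the specialized clone for the common case. */
void scale(double *a, size_t n, double f) {
    if (n == 8)
        scale_8(a, f);
    else
        scale_any(a, n, f);
}
```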
loop unrolling
The high level optimizer performs full outer loop unrolling for loops with small trip counts.
loop unroll and jam
The loop unroll and jam transformation performs outer loop unrolling and fusion, which increases
opportunities for scalar replacement. This can reduce the number of memory operations, resulting in better
instruction scheduling.
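A minimal C sketch of the idea (the function names and the unroll factor of 2 are hypothetical): the outer loop is unrolled and the resulting inner loops fused, so the load of x[j] is shared between two uses via a scalar temporary:

```c
/* Original doubly nested loop (small 4 x 64 matrix-vector product). */
void mvmul(double y[4], double a[4][64], double x[64]) {
    for (int i = 0; i < 4; i++) {
        y[i] = 0.0;
        for (int j = 0; j < 64; j++)
            y[i] += a[i][j] * x[j];
    }
}

/* Unroll-and-jam by 2: the outer i-loop is unrolled and the two inner
 * loops fused, so each x[j] is loaded once and reused twice through
 * the scalar temporary xj (scalar replacement). */
void mvmul_uaj(double y[4], double a[4][64], double x[64]) {
    for (int i = 0; i < 4; i += 2) {
        y[i] = y[i + 1] = 0.0;
        for (int j = 0; j < 64; j++) {
            double xj = x[j];              /* one load, two uses */
            y[i]     += a[i][j]     * xj;
            y[i + 1] += a[i + 1][j] * xj;
        }
    }
}
```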
recognition of memset/memcpy type loops
For loops that essentially copy blocks of data to another memory location, the compiler determines loop
properties, such as the direction of the copy, and then replaces the whole loop with a direct call to a highly
specialized and optimized copy routine.
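For example, a loop of the following shape qualifies; standard memcpy stands in here for HP's specialized copy routine, and the replacement is only valid once the compiler has established the copy direction and that the regions do not overlap:

```c
#include <string.h>
#include <stddef.h>

/* A simple ascending element-by-element copy loop ... */
void copy_loop(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* ... which the compiler can replace wholesale with a call to an
 * optimized block-copy routine. */
void copy_call(int *dst, const int *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
}
```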
loop rerolling
Some user code contains manually unrolled loops. These forms of manual unrolling usually come from
tuning efforts on a particular machine; on a different machine, the manually unrolled code may
perform poorly. The compiler tries to identify such unrolled loops and rerolls them by removing the
incremental statements and adjusting the loop bounds and increment. If such a rerolled loop is then
passed through the loop optimizer, better unrolling decisions can be made, depending on machine
characteristics. After loop rerolling, a loop merging pass is run to merge manually unrolled loops and
their remainder loops.
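A sketch of the pattern (hypothetical function names; n is assumed to be a multiple of 4 to keep the example short):

```c
#include <stddef.h>

/* Hand-unrolled-by-4 loop of the kind the rerolling pass recognizes. */
double sum_hand_unrolled(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    return s;
}

/* The rerolled form the compiler recovers: unit increment, single body
 * statement. The loop optimizer is then free to choose an unroll
 * factor that suits the actual target machine. */
double sum_rerolled(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```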
loop blocking
Loop blocking is a combination of strip mining and interchange that maximizes data localization. It is
provided primarily to deal with nested loops that manipulate arrays that are too large to fit into the cache.
Under certain circumstances, loop blocking allows reuse of these arrays by transforming the loops that
manipulate them so that they manipulate strips of the arrays that fit into the cache. Effectively, a
blocked loop accesses array elements in sections that are optimally sized to fit in the cache.
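A classic illustration is a blocked matrix transpose; the matrix size and tile size below are arbitrary, chosen only so the tile fits comfortably in cache:

```c
#define N 64    /* matrix dimension (illustrative) */
#define B 16    /* block (tile) size chosen to fit the cache */

/* Straightforward transpose: the column-wise walk through one of the
 * arrays touches a new cache line on nearly every iteration. */
void transpose_plain(double out[N][N], double in[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j][i] = in[i][j];
}

/* Blocked transpose: strip mining plus interchange, so each B x B tile
 * of both arrays stays cache-resident while it is processed. */
void transpose_blocked(double out[N][N], double in[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    out[j][i] = in[i][j];
}
```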
scalar replacement
The optimizer finds reuses of array locations in a loop and replaces them with uses of scalar temporaries.
These temporaries can be register promoted to reduce memory accesses.
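A minimal sketch of the transformation (hypothetical function names; a 4-point moving sum over b):

```c
#include <stddef.h>

/* Original: the location a[i] is loaded and stored on every inner
 * iteration. The caller must provide n + 3 elements in b. */
void smooth_orig(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0;
        for (size_t j = 0; j < 4; j++)
            a[i] += b[i + j];          /* memory reference each time */
    }
}

/* After scalar replacement: the reused location lives in the scalar
 * temporary t, which the low-level optimizer can keep in a register. */
void smooth_scalar(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double t = 0.0;
        for (size_t j = 0; j < 4; j++)
            t += b[i + j];
        a[i] = t;
    }
}
```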
loop multiversioning
The loop optimizer can find that some optimizations can be performed on a loop if certain conditions are
met (for example, two array references do not overlap). However, some of these conditions may not be
known at compile time. The optimizer can clone the loop, introduce runtime checks for these conditions,
and optimize the cloned loop more aggressively.
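A sketch of the two-version code the optimizer might emit (the function name is hypothetical; the two loop bodies are textually identical here, the point being that only the no-overlap clone may be scheduled without memory-dependence constraints):

```c
#include <stddef.h>
#include <stdint.h>

void vadd(double *dst, const double *src, size_t n) {
    /* Runtime disambiguation check inserted by the optimizer. */
    if ((uintptr_t)(dst + n) <= (uintptr_t)src ||
        (uintptr_t)(src + n) <= (uintptr_t)dst) {
        /* No overlap: this clone can be software pipelined and
         * scheduled aggressively. */
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    } else {
        /* Possible overlap: conservative original loop. */
        for (size_t i = 0; i < n; i++)
            dst[i] += src[i];
    }
}
```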
malloc combining
The optimizer can combine several small block allocations in a loop into a single large block allocation.
This improves locality and reduces the cost of calling the allocation routine.
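The effect can be sketched as follows (hypothetical function names; error handling omitted for brevity). Note that the combined form is also freed differently, with one free of the block plus one of the row-pointer array:

```c
#include <stdlib.h>
#include <stddef.h>

/* Before: one small allocator call per row inside the loop. */
double **rows_separate(size_t nrows, size_t ncols) {
    double **rows = malloc(nrows * sizeof *rows);
    for (size_t i = 0; i < nrows; i++)
        rows[i] = malloc(ncols * sizeof **rows);
    return rows;
}

/* After combining: one large block carved up with pointer arithmetic.
 * The rows are contiguous (better locality) and nrows allocator calls
 * collapse into one. */
double **rows_combined(size_t nrows, size_t ncols) {
    double **rows = malloc(nrows * sizeof *rows);
    double *block = malloc(nrows * ncols * sizeof *block);
    for (size_t i = 0; i < nrows; i++)
        rows[i] = block + i * ncols;
    return rows;
}
```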
Use +Ofast with stable, well-behaved code that does not rely on FP corner-case values, and that does
not utilize extremely large integer values.
+Ofast might imply +O3 in a future release, rather than +O2.
benefits:
Safely improves performance for most applications, particularly when the application only runs on
the type of system on which it was compiled.
Avoids the need to specify a larger number of optimization flags because it implies a number of
optimizations that are generally safe and can greatly improve application performance.
+Ofaster
description:
[Alias for +Ofast +O4]
Enables interprocedural optimizations in addition to the advanced optimizations described for +Ofast.
See the descriptions of +Ofast and +O4 for more information.
benefits:
Users can remove optimization time restrictions at +O2 and above by using the +Onolimit or
+Olimit=none option. This allows full optimization of large procedures, but can incur significant
compile time increases for very large procedures, especially those with large sequences of straight-line
code. If you are willing to tolerate longer compile times, +Onolimit can result in significant
performance improvements.
Users can limit the amount of time spent optimizing code to completely avoid non-linear compile times
using +Olimit or +Olimit=min.
On Itanium, the benefit of forming these contractions can be significant. Contractions can be enabled
and disabled in different blocks of code using the FP_CONTRACT pragma. FP_CONTRACT OFF
overrides any prior pragma or +Ofltacc=strict option. FP_CONTRACT ON has no effect other
than undoing a prior FP_CONTRACT OFF, and is overridden by +Ofltacc=strict.
+Ofltacc=limited enables a small number of other value-changing optimizations in addition to
the contractions. These optimizations can prevent the propagation of Not-a-Numbers (NaNs), infinities,
and the sign of zero. For example, performing the optimization of 0.0*x => 0.0 will prevent the
propagation of NaNs, infinities, and the sign of zero if x is a NaN, an infinity, or a negative number.
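The 0.0*x case can be demonstrated directly; under default (unrelaxed) IEEE evaluation the multiply is performed, so the NaN and the sign of zero survive, which is exactly what folding to +0.0 would lose:

```c
#include <math.h>

/* Evaluated per IEEE semantics: 0.0 * NaN is NaN, and 0.0 * (-1.0)
 * is -0.0. Folding 0.0*x to +0.0 (as +Ofltacc=limited may do) would
 * therefore change the result whenever x is a NaN, an infinity, or
 * negative. */
double zero_times(double x) {
    return 0.0 * x;
}
```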
The most aggressive floating-point optimizations are enabled with +Ofltacc=relaxed (or its
equivalent +Onofltacc). For example, faster and more efficient floating-point divide sequences are
enabled under relaxed accuracy.
Additionally, optimizations that reassociate floating-point computation are enabled with
+Ofltacc=relaxed. For example, the sum reduction optimization, which hides floating-point add
latency by computing partial sums, can be enabled in C or C++ with +Ofltacc=relaxed. It also
enables loop optimizations such as fusion, distribution, blocking, unroll and jam, and interchange in
loops with floating-point accesses. For Fortran, these optimizations are already enabled because
reassociation that does not violate explicit parentheses is always legal.
Finally, +Ofltacc=relaxed implies the +Ocxlimitedrange option (described below).
+O[no]sumreduction
Will [dis]allow the sum reduction optimization, regardless of the floating-point accuracy setting. This
can be used to enable optimization of sum reductions via the computation of partial sums for C or C++
without having to specify the more aggressive +Ofltacc=relaxed, which is less safe. Conversely,
+Onosumreduction can be used to disallow the sum reduction optimization under a floating-point
accuracy setting where it is normally allowed (e.g. by default for Fortran, where the language standard
allows this type of reassociation).
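The partial-sum form the optimizer produces can be sketched as follows (hypothetical function names; n is assumed to be a multiple of 4 to keep the sketch short). The reassociation can change the rounded result slightly, which is why it is gated on +Osumreduction or +Ofltacc=relaxed:

```c
#include <stddef.h>

/* Naive sum: each add depends on the previous one, so the loop is
 * serialized on the floating-point add latency. */
double sum_serial(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Partial-sum form: four independent accumulators hide the add
 * latency; their values are combined after the loop. */
double sum_partial(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```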
+O[no]cxlimitedrange
(default +Onocxlimitedrange for C, +Ocxlimitedrange for Fortran)
#pragma STDC CX_LIMITED_RANGE [ON/OFF/DEFAULT]
You can use this option to obtain faster, complex arithmetic sequences when an application does not
rely on out-of-range floating point values. This option indicates whether out-of-range floating point
values (for example, NaNs and infinities) can occur and must be preserved. With
+Ocxlimitedrange, out-of-range floating-point values might not be preserved. Enabling the limited
range switch results in faster complex arithmetic sequences. The CX_LIMITED_RANGE pragma
enables limited range behavior for specific blocks of code, whereas the option is global.
CX_LIMITED_RANGE ON overrides +Onocxlimitedrange, and CX_LIMITED_RANGE OFF
has no effect except to undo a prior CX_LIMITED_RANGE ON or +Ocxlimitedrange.
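A sketch of per-block use of the C99 pragma (the function name is hypothetical): inside this function, complex multiplication may use the fast textbook formula because the pragma promises no out-of-range values:

```c
#include <complex.h>

/* With CX_LIMITED_RANGE ON, the compiler may compile the complex
 * multiply below with the plain four-multiply formula instead of the
 * slower sequence that handles NaNs and infinities. */
double complex cmul_fast(double complex a, double complex b) {
#pragma STDC CX_LIMITED_RANGE ON
    return a * b;
}
```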
+O[no]fenvaccess (default +Onofenvaccess)
#pragma STDC FENV_ACCESS [ON/OFF/DEFAULT]
#pragma FLOAT_TRAPS_ON
+Ofenvaccess disables any optimizations that might affect behavior under non-default
floating-point modes (for example, alternate rounding directions or trap enables) or where floating-point
exception flags are queried. It can also be enabled locally using either the FENV_ACCESS or
FLOAT_TRAPS_ON pragmas. FENV_ACCESS ON and FLOAT_TRAPS_ON override
+Onofenvaccess. FENV_ACCESS OFF has no effect other than to undo a prior FENV_ACCESS
ON, FLOAT_TRAPS_ON, or +Ofenvaccess. Enabling fenvaccess, for example, prevents dead
code elimination of instructions that can raise exceptions, results in longer floating-point-to-integer
conversion sequences that explicitly check for out-of-range results, and results in longer floating-point
division sequences.
+O[no]libmerrno
(default +Onolibmerrno, except with C's -Aa, -c89, or -AC89 options, where the default
is +Olibmerrno)
Enables support for errno in libm functions. Different, less optimal versions of libm functions are
invoked under +Olibmerrno. Additionally, the optimizer is prohibited from performing
optimizations of these calls (such as coalescing calls to the same libm function with identical inputs)
because they are no longer side-effect-free.
-Bprotected[=symbol[,symbol]*]
#pragma binding protected
You can use this option or pragma to obtain the most optimized access sequences for data and code
symbols. Symbols with the given name(s) are specified as having protected export class. If no symbols
are given, then all symbols, including those referenced but not defined in the translation unit, are
specified as having protected export class. This means that these symbols are not preempted and can be
optimized as such. For example, the compiler can bypass the linkage table for both code and data
references. Additionally, the compiler can omit the saving and restoring of gp around calls to protected
symbols, and can generate a pc-relative call. If the target of the call is not local to the load module, the
linker produces an error. These optimizations are always performed for locally-defined symbols¹ unless
the symbols have been named in a -Bextern option list. -Bprotected enables these
optimizations for symbols that are not locally defined.
When -Bprotected is specified with no symbol list, it also implies -Wl,-aarchive_shared,
causing the linker to prefer an archive library to a shared one if one is available. This results in better
performance because accesses to archived libraries are faster than those to shared libraries.
The #pragma binding protected applies to all globally-scoped symbols² following the
pragma, up to the next #pragma binding.
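A minimal sketch of the pragma's placement (the function and variable names are invented for illustration; a non-HP compiler simply warns about and ignores the unknown pragma):

```c
/* Every globally-scoped symbol declared after this pragma receives
 * protected export class, so references can bypass the linkage table
 * and calls can skip the gp save/restore. */
#pragma binding protected

int hot_counter = 0;          /* data: direct, non-preemptible access */

int bump_counter(void) {      /* callable via a pc-relative call */
    return ++hot_counter;
}
```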
To avoid linker errors when making calls into shared system libraries, include the system header files
for these routines. The symbols are marked properly in the system headers as being preemptible. If the
header files are not included, and therefore these symbols are not marked properly, the linker issues an
error because they are not defined in the load module. This linker error prevents a run-time error, which
would occur due to incorrect optimization such as omission of gp saves and restores around calls to
these symbols. Similar problems can be encountered when linking with applications or third-party
shared libraries, unless they are decorated with the proper pragmas. Library providers should consult
David Gross's Library Provider's Guide to Symbol Binding[3] on how to enable use of -Bprotected
in user applications.
For application builds, this option can be used with -exec to obtain fastest data access and call
sequences (see -exec and -minshared).
-Bprotected_data
Marks all data symbols as having protected export class, implying the optimizations to data accesses
discussed under -Bprotected. This option can be used when system header files are not included for
shared library calls made by the application, to obtain a subset of the optimizations available with
-Bprotected. However, header files declaring any shared library data being accessed by the
application must be included. For fastest code, users should add the appropriate header file includes,
and compile with -Bprotected. Alternatively, use -Bprotected_data in combination with
either -Bprotected_def or -exec, to obtain optimized access sequences, if modifying source
code to add header file includes is not an option.
-Bprotected_def
Marks locally (non-tentatively) defined symbols as having protected export class. The optimizations
discussed under -Bprotected are applied to these symbols only. This can be used when system header
files are not included for shared library calls or data accesses. For fastest code, users should add the
appropriate header file includes, and compile with -Bprotected.
This option is a subset of -exec.
-Bhidden[=symbol[,symbol]*]
-Bhidden:filename
¹ A locally-defined symbol is a global or static symbol with a definition in the compilation set from which it is
referenced. The compilation set is the translation unit without -ipo, and with -ipo is the collection of translation units
presented to a single linker invocation.
² A globally-scoped symbol is a symbol that is visible across translation unit boundaries. Examples include simple
globals, static data members, and certain namespace members.
-exec
Asserts that code is being compiled for an executable. Similar to -Bprotected_def, all locally
defined symbols are marked as having protected export class. Additionally, accesses to symbols known
to be defined in the executable can be materialized with absolute addressing, rather than linkage table
accesses.
-minshared
Equivalent to -Bprotected -exec. When building an executable that makes minimal use of shared
libraries, use this option to obtain fastest access sequences to non-shared library code and data.
With profile data, the compiler may also insert stride prefetches for linked-list traversals that have
regular runtime address strides. Consider the following source code example:
for (p = ptr; p != 0; p = p->next)
x += p->data;
Normally, the compiler cannot insert prefetches for later iterations of the loop without dereferencing
successive values of the next field. However, profile data may indicate that the values of the p pointer
have a regular address stride in virtual memory. For example, if the values of p on successive iterations
are {8, 16, 24, 32, ...}, then it has a regular stride of 8 bytes. The compiler can then insert a prefetch
using this stride to prefetch later iterations:
for (p = ptr; p != 0; p = p->next) {
x += p->data;
lfetch p + PF*8;
}
In some cases, profile data may indicate that there are multiple dominant strides across the program's
execution. In that case, the compiler may insert a prefetch using a runtime computation of the stride,
such that the stride used in the current iteration's prefetch is the stride between the values of the pointer
in the last two successive iterations.
Without profile data indicating a regular stride for a linked-list traversal, the compiler will insert a
prefetch of the next field's pointer. For the above example, it would insert the following prefetch:
for (p = ptr; p != 0; p = p->next) {
lfetch p->next->next;
x += p->data;
}
If the loop is reasonably large, this can help hide some of the latency from the subsequent iteration's
dereference of p.
+Oprefetch_latency=n
Indicates that data prefetches in loops should hide n cycles of memory latency. By default, the compiler
attempts to issue prefetches far enough ahead to just fill the L2 cache outstanding request queue or
cover the expected memory latency. Using this option will override that heuristic, and cause prefetches
to be inserted enough iterations ahead of the corresponding load to cover the n cycles.
+O[no]inline:filename
+O[no]inline=symlist
#pragma no_inline
#pragma inline
#pragma [no]inline_call
Enable or disable inlining for specific functions. The functions can be listed in either a separate file
filename or on the command-line in symlist. By default, the compiler uses heuristics to determine
the profitability of inlining candidates, but these heuristics are overridden by this option. This option
can be used when the user knows that inlining of a certain function is always profitable, or never
profitable. The no_inline pragma can also be used to list those functions that should never be
inlined, and the inline pragma to list those that should always be inlined. Place the appropriate
pragma in the source file that contains the definition of the function that should or should not be inlined.
The [no]inline_call pragma is used to enable or disable inlining of a particular call site. It takes
no arguments and affects the outermost, leftmost call in the next statement. However, the
[no]inline_call pragma is not implemented at first release.
+inline_level n
Fine-tunes the aggressiveness of the inliner. The value of n can be in the range 0.0-9.0 in 0.1
increments. Certain values and ranges of n have special meanings.
With +Onoparmsoverlap, the optimizer assumes that subprogram arguments do not refer to
overlapping memory locations. This allows more aggressive optimization and scheduling of
pointer-intensive code.
+O[no]parminit (default +Onoparminit)
Not supported for Fortran.
When enabled, the optimizer inserts instructions to initialize to zero any unspecified function
parameters at call sites. This avoids NaT values in parameter registers. Enabling this option results in
small performance losses, but might be required for correctness.
+O[no]store_ordering (default +Onostore_ordering)
Not supported for Fortran.
Enabling this option forces the optimizer to preserve the original program order for stores to memory
that is possibly visible to another thread. This does not imply strong ordering. This option can be used
to achieve program ordering of stores without using the more conservative volatile semantics applied to
all accesses to global variables with +Ovolatile.
#pragma IF_CONVERT
This block-scoped pragma can be used to indicate that the compiler should employ if-conversion to
eliminate all control flow resulting from conditional code within that scope. If-conversion is the process
by which the compiler uses predicates to eliminate conditional branches. By default at +O2 and higher,
the compiler uses heuristics to determine when it is beneficial to apply if-conversion to eliminate a
conditional branch. This pragma overrides those heuristics and causes the compiler to eliminate all
non-loop control flow within the scope of the pragma. Users can specify this pragma to facilitate software
pipelining of inner loops that contain conditional code, because the compiler can software pipeline
only loops with limited forms of control flow. When placed within the scope of an inner loop, this
pragma causes the compiler to eliminate all branches except for the loop back branch.
+O[no]loop_unroll[=n] (default +Oloop_unroll)
#pragma UNROLL_FACTOR n
#pragma UNROLL n | (n)
The option indicates how many times the optimizer should attempt to unroll each loop. In most cases,
this will only affect innermost loops. Similarly, the block-scoped UNROLL_FACTOR and UNROLL
pragmas specify that the particular innermost loop should be unrolled n times. The
UNROLL_FACTOR pragma must be placed inside the associated loop, whereas the UNROLL pragma
can be placed just before the specified loop. By default the compiler uses heuristics to determine the
best unroll factor for an inner loop. However, if the user knows that a particular unroll factor is best for
the given loop, or alternatively, that no unrolling should be applied to the loop, the option or pragma
can be used to communicate this information to the compiler. The user specified unroll factor overrides
the unroll factor computed by the compiler. Specifying n=1 prevents the compiler from unrolling the
loop. Specifying n=0 causes the compiler to use its own heuristics to determine the best unroll factor
(same as not specifying the option or pragma). The pragma is ignored if it decorates a non-innermost
loop.
+Ointeger_overflow=[moderate|conservative] (default
+Ointeger_overflow=moderate)
Specifies how aggressive the optimizer should be in assuming that integer arithmetic computations do
not overflow. According to C and C++ language standards, signed integer arithmetic overflow in user
code results in undefined behavior. Therefore, by default (under
+Ointeger_overflow=moderate), the compiler assumes that such overflow will not occur. As a
result, the compiler may remove sign extensions of signed integer accumulations within loop bodies,
which enables further analysis and optimizations. Applications that rely on particular signed integer
overflow behavior should use +Ointeger_overflow=conservative.
+Oautopar
When the +Oautopar option is used at optimization levels +O3 and above, the compiler will
automatically parallelize those loops which are deemed safe and profitable by the loop transformer.
This optimization allows the compiled program to take advantage of more than one processor (or core)
when executing loops determined to be parallelizable. Most programs that spend a significant
percentage of their execution time in such loops will improve their performance, occasionally
dramatically, by using this technique. By contrast, some programs may experience performance
degradations when parallelized, and all parallelized programs increase their use of system resources,
which may slow down other programs running alongside them.
profile-based optimization
Profile-based optimization (PBO) is a set of performance-improving code transformations that make use of
an execution profile gathered for an application. There are three steps involved in performing this
optimization:
1. Compile the application with +Oprofile=collect to build an instrumented executable.
2. Run the instrumented executable on representative input data to collect an execution profile.
3. Recompile the application with +Oprofile=use so the optimizer can exploit the collected profile.
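The command lines discussed below were lost in extraction; a reconstructed sketch follows. The compiler driver name cc and the source file name sample.c are assumptions (sample.o and sample.exe come from the text):

```shell
cc +Oprofile=collect -c sample.c              # compile for profile collection
cc +Oprofile=collect -o sample.exe sample.o   # link in the data collection code
./sample.exe < training.input                 # training run of the instrumented binary
cc +Oprofile=use -o sample.exe sample.c       # recompile using the collected profile
```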
The first command line compiles the code; the +Oprofile=collect option requests that the compiler
prepare the module for profile collection. The -c option in the first command line suppresses linking and
creates an intermediate object file called sample.o. The second command line uses the -o option to link
sample.o into sample.exe. The +Oprofile=collect option prepares sample.exe with data
collection code.
Note: Instrumented programs run slower than non-instrumented programs. Use instrumented code only
to collect statistics for profile-based optimization.
The +Oprofile=use option is supported at optimization level 2 (-O or +O2) and above.
Note: Profile-based optimization has a greater impact on application performance at each higher level of
optimization. Profile-based optimization should be enabled during the final stages of application
development. To obtain the best performance, re-profile and re-optimize your application after making
source code changes.
During the second run of the instrumented executable, the execution profile data derived from running the
program on B.input is merged with (added to) the existing profile database /tmp/program.flow. Profile
databases may also be merged explicitly using the tool /opt/langtools/bin/fdm. Here is an
example of an explicit merge:
% unset FLOW_DATA ; rm flow.data
The PA-RISC equivalent of the +Oprofile=collect command-line option is +I, and the PA-RISC
equivalent of +Oprofile=use is +P; however, the PA-RISC options are still honored by the
Itanium compiler.
Instrumented applications are optimized less aggressively than non-instrumented executables. The
PA-RISC compiler is capable of optimizing instrumented code at level +O2, whereas with the
current Itanium-based compilers, profile collection is supported at +O1 optimization (a warning is
issued indicating that the optimization level will drop to +O1 internally for +Oprofile=collect
compiles). This restriction may be lifted in a future release, however.
In the PA-RISC +I implementation, profile counters are 32 bits in size. When selecting input data
sets for runs of instrumented executables, counter saturation can occur if the training run is too
lengthy. On Itanium, profile counters are 64 bits in size, so you can use longer training runs
without concern about counter saturation.
Profile data also helps the compiler identify:
- Frequently executed indirect function calls that may perform better as direct calls.
- Frequently called routines that are not defined in the load module and cannot be inlined by the
compiler.
- Use optimization level 2 (-O or +O2) at a minimum (+O3 for floating-point applications).
- Consider compiling with +O4 if not shipping archive libraries (if +O4 is not an option, consider
using -minshared, +Bprotected_def, and +Oshortdata to attain some of the benefit).
- Use PBO (profile-based optimization) for a potentially large improvement in performance (especially
for large commercial applications). PBO provides even bigger improvements on top of +O4.
- Use +Ofast, which is safe and effective for the vast majority of programs.
- For memory-intensive programs, use large pages via the +pd and +pi linker options or chatr(1).
- For floating-point applications, as mentioned above, +O3 should be the minimum optimization level.
Additionally, +Ofltacc=relaxed and +FPD (both included in +Ofast) often provide large
improvements.
Index
A
access sequences, optimized, 14
aggressive optimization
safety of, 10
aggressive optimization, enabling, 10
aggressively schedule code, 10
archive library, 14
C
compilation time limits, removing, 10
controlling optimization, 3
cross-region addressing, enabling/disabling, 18
D
data allocation, controlling, 13
data prefetch insertion, 16
dead code elimination, preventing, 12
debugging, 4
E
executable, compiling code for, 16
execution profile, collecting, 21
export class
default, 15
hidden, 15
export class, protected, 14
F
floating point optimizations
reassociating, 12
floating-point code, controlling optimization on, 11
floating-point contractions, 12
floating-point modes
non-default, 12
floating-point optimizations
aggressive, 12
floating-point values
out of range, 12
flush-to-zero rounding mode, 10
FP accuracy, 10
I
if-conversion, 19
inlining the import stub, 15
inlining, enabling or disabling, 17
interprocedural optimizations, 6
ipo. See interprocedural optimizations
L
large procedures, 10
level four, 5
level one, 3
level three, 5
level two, 4
level zero, 3
library, shared versus archived, 14
linker errors, avoiding, 14
loop optimizations, 8
N
NaN
preventing propagation of, 12
Not-a-Number. See NaN
O
optimization levels, 3
P
PA-RISC, differences, 22
PBO. See profile-based optimization
performance advice, 23
PGO. See profile-based optimization
prefetch insertion, 16
profile data
maintaining files, 21
merging files, 22
profile-based optimization, 20
program order for stores, preserving, 19
protected export class
marking all data symbols, 14
marking locally defined symbols, 14, 15
S
scheduling model, controlling, 11
shared library, 14
short data area, 13
symbol binding, controlling, 13
U
unrolling, 19