Table of Contents

What is the relation between ORC and Pro64?
Will there always be two separate sources, one ORC, the other Pro64?
How well were the compilers tested?
Can I use ORC to build the Linux/IA-64 kernel?
What is the status of ORC native compiler on IA-64 Itanium machines?
Are there tools that come with the compilers?
What are the differences between release 1.0 and 1.1?
What are the differences between release 1.1 and 2.0?
What are the differences between release 2.0 and 2.1?
How does one generate best optimized code using ORC?
How to use Whirl and code generation phase profile/feedback?
How to use the compilers to achieve peak performance?
Are there documents that come with the compilers?

Interactions of ORC and IA-64 Linux questions
How do I install ORC compilers on IA-64 Linux systems?
How do I install ORC compilers on IA32 Redhat 6.2 Linux systems?
How do I install ORC compilers on IA32 Redhat 7.1 Linux systems?
How do I install ORC compilers on IA32 Redhat 7.2 Linux systems?
How do I make a native built compiler?
How do I install multiple versions of ORC compilers on IA-64 systems?
If I do cross compile on an IA32 machine, how do I produce IA-64 binaries?
How do I rebuild the Fortran front-end?

Reporting bugs and problems
How do I decide if the bug is ORC specific?

Using tools that come with the compiler
How to use the hot path enumeration tool
How to use the cycle counting tool

Adding new optimizations or phases
Are there any coding conventions?
How do I add my own tracing options?
How do I change the driver and related files when I add an optimization?
How do I effectively use gdb to debug the compilers?
How do I make sure my new phase is not the cause of unnecessarily long compile time?
What is the relation between ORC and Pro64?

ORC is based on Pro64. Major changes have been made in the code generator and the profiling framework. We intend to continue improving ORC, in both features and performance, for the Itanium processor and its successors.
Will there always be two separate sources, one ORC, the other Pro64?

Currently the Pro64 source is hosted by the open64-user group, also found on SourceForge. The ORC 1.0 release has been merged into the Pro64 source and incorporated into the Open64 0.14 release. Further merging has been discussed, but there is no concrete proposal at this point.
How well were the compilers tested?
For the 1.0 release, a number of test suites for C, C++, and Fortran90 were used, as well as a number of open source programs. Tests were run at the -O2 and -O3 levels of optimization, on both simulators and Itanium systems. A number of large (>400,000 source lines) applications were also used, thanks to the University of Alberta, TsingHua University, and the University of Minnesota.
The 1.1 release is not a full-scale release. Hence, it has not gone through the same rigorous testing as 1.0, but we have done our best to ensure its quality. The same test suites and open source programs were used during our testing, with more optimization combinations (-O2, -O3, -IPA, profiling) than for 1.0. No external organizations were involved in this release, and the large applications used during 1.0 testing were not used for the 1.1 release either.
For the 2.0 release, we went through similarly rigorous testing as for the 1.0 release. Most of the testing was done with the cross compiler; the native compiler has gone through less testing due to lack of resources.
For the 2.1 release, we went through similarly rigorous testing as for the 2.0 release, and most of the testing was done with both the cross compiler and the native compiler.
Can I use ORC to build the Linux/IA-64 kernel?

We have not attempted to build the Linux/IA64 kernel, and we have no plan to do so. Hence, it is safest to assume that ORC cannot build Linux. However, the following instruction is extracted from the Readme file of the 0.13 Pro64(tm) release:

Edit the Linux/Makefile and replace the definitions of CROSS_COMPILE, CC, and CFLAGS with the following:

CROSS_COMPILE =
CC = orcc -D__KERNEL__ -I$(HPATH) -D__LP64__
CFLAGS = $(CPPFLAGS) -Wall -Wstrict-prototypes -O0 -fomit-frame-pointer -ffixed-r15 -CG:emit_unwind_info=off -Wf,-O2 -PHASE:p -D__OPTIMIZE -OPT:Olimit=5000
What is the status of ORC native compiler on IA-64 Itanium machines?

For the 1.1 release, we included a natively built binary, in the tar file orc-1.1.0.native.tgz. This compiler, although built at -O0, still outperforms a cross-built compiler running on an Itanium machine. This binary has not been tested as thoroughly as the cross-built compiler.

For the 2.0 release, we included an optimized build of the compiler (-O2 with all internal consistency checks removed, except that WOPT.so and BE.so are built at -O0). The binary is in the tar file orc-2.0-bin.tar.gz. You should see a large compile-time improvement from this compiler. This native compiler is not tested as thoroughly as the cross-built compiler.

We have also included two new tools for performance tuning and analysis: "hpe.pl", a hot path enumeration tool, and "cycount.pl", a cycle counting tool. Details are described below.
What are the differences between release 1.0 and 1.1?

There were three major focuses in 1.1: making IPA functional, improving performance, and making full profile feedback functional. To turn on InterProcedural Optimization, add -IPA to your compile line. You still need to specify -O0, -O2, or -O3 besides the -IPA flag. Performance has also improved compared with 1.0: we have measured an average 10%+ performance gain compiling with the 1.1 release under the same optimization flags. Two types of profile/feedback are now fully supported: Whirl-level profiling and code-generation-level profiling.

There is also a change in default optimization behavior. With the 1.1 release, alias analysis (memory disambiguation) defaults to assuming that user code is ANSI-type compliant. Also, compiled code is assumed to become part of the main executable, where we assume that functions defined in a main executable (non-DSO) will not be preempted. To get back the 1.0 behavior for both, use -OPT:Olegacy.
What are the differences between release 1.1 and 2.0?

The major focus in 2.0 has been performance. Although performance is not the primary goal of this Open Research Compiler, it is vital for researchers who base their work on this compiler to know that the binary it produces is very competitive, so that their work is also trustworthy and sound.

The concentration for ORC has always been on the integer side, mainly due to insufficient resources. We have not studied, nor conducted measurements of, the relative performance of floating-point code from ORC compared to other research or production compilers. Given that ORC is based on SGI's product compiler, which is strong on scientific applications, we have reason to expect that it has the capability to be among the best, should resources be put into it. On the integer side, we are pleased to say that, to the best of our knowledge, the code produced by this compiler is very competitive with all existing IA64 compilers at optimization levels -O2, -O3, and -IPA, with and without feedback information.

A major performance improvement of 2.0 over 1.1 is in C++ code quality. In 2.0 we spent considerable effort to ensure that C++-style code can achieve the kind of speedup people expect when they write C code.

Besides performance enhancements, the 2.0 release also includes Itanium-2 micro-architecture support. A new Itanium-2 machine model has been added to the compiler, and compiler optimizations such as instruction scheduling and bundling are now Itanium-2 aware. The option to turn on Itanium-2 mode is -TARG:platform=itanium2 or -itanium2; the default for the 2.0 release is the Itanium1 machine. Performance investigation and enhancements specific to Itanium-2 will come in the next release.
What are the differences between release 2.0 and 2.1?

The major focus in 2.1 has been performance on Itanium-2 systems. The 2.0 release was already capable of generating code based on the Itanium-2 machine model, but its performance tuning targeted Itanium rather than Itanium-2. After extensive work, we are pleased to say that, to the best of our knowledge, the code generated for both Itanium-2 and Itanium by ORC is very competitive with all existing IA64 compilers, at optimization levels -O2, -O3, and -IPA, with and without feedback information.
In addition to all the features in 2.0, the 2.1 release has added or improved a number of features: cache optimizations, loop-invariant code motion in the code generator, switch optimization, multi-way branches and renaming in instruction scheduling, SWP, unrolling, etc. In addition, we provide a new inter-procedural framework to balance RSE traffic and explicit spills. A dynamic instrumentation tool, called Pin, is also available along with 2.1.

The 2.1 release is also more robust compared with the 2.0 release, with a number of defects fixed in both the cross and native build environments.

By default, ORC 2.1 generates code for Itanium-2 systems. It can also generate code for Itanium by adding the option "-itanium".
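The target selection described above can be sketched as a dry run; the commands below are only echoed, not executed, and foo.c is a hypothetical source file, not one shipped with the release:

```shell
# Dry-run sketch of target selection in ORC 2.1 (nothing is compiled).
CMD_DEFAULT="orcc -O2 foo.c -o foo"
CMD_ITANIUM="orcc -O2 -itanium foo.c -o foo"
echo "$CMD_DEFAULT    # default in 2.1: Itanium-2 code"
echo "$CMD_ITANIUM    # generate Itanium code instead"
```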
How does one generate best optimized code using ORC?

Optimization results will vary with your specific program/application. Different optimization phases are involved at the various optimization levels: -O2 goes through the global optimizer and code generator, -O3 additionally goes through the loop-level optimizer, and -IPA invokes the inter-procedural optimizer. Hence, in general, -O3 code should run faster than -O2 code, and -O3 -IPA should outperform -O3 alone. -O3 also allows the compiler to perform more aggressive optimizations that may cause differences in results (such as assuming wrap-around is safe, or allowing changes in floating-point operation ordering). Occasionally, when the compiler sees that the procedure it is compiling is too big, it might choose to turn off optimization to curtail compile time (you will see a warning message when this happens). You can use the option -OPT:Olimit=0 to ensure the compiler never turns off optimizations.
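The level-by-level behavior above can be summarized as a dry run; foo.c is a placeholder and the orcc commands are only printed, not executed:

```shell
# Each level adds a phase on top of the previous one (nothing is compiled).
O2="orcc -O2 foo.c"                     # global optimizer + code generator
O3="orcc -O3 foo.c"                     # adds the loop-level optimizer
IPA="orcc -O3 -IPA foo.c"               # adds inter-procedural optimization
NOLIMIT="orcc -O3 -OPT:Olimit=0 foo.c"  # never disable optimization for huge PUs
for cmd in "$O2" "$O3" "$IPA" "$NOLIMIT"; do echo "$cmd"; done
```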
How to use Whirl and code generation phase profile/feedback?
The options to do instrumentation are:

-fb_create feed_back_file_name -fb_type={1,2,4,8} -fb_phase={0,4}

The options to use feedback data are:

-fb_opt feed_back_file_name

where fb_type = 1 : whirl; 2 : cg edge; 4 : cg value; 8 : cg stride
and fb_phase = 0 : before the very high whirl optimizer (VHO); 4 : before cg region formation.

The directory of feed_back_file_name specifies where the feedback data file will be produced. Currently, only the above types and phases are supported.
For example, to use feedback data from both whirl profiling and edge profiling, go through the following steps:

1. Compile with options: -fb_create feed_back_file_name -fb_type=1 -fb_phase=0
2. Run the binary.
3. Compile with options: -fb_opt feed_back_file_name -fb_create feed_back_file_name -fb_type=2 -fb_phase=4 -O3
4. Run the binary.
5. Compile with options: -fb_opt feed_back_file_name
6. Do the final run of the binary.
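The steps above can be sketched as a dry run. The commands are echoed rather than executed, and app.c / app are placeholder names; the options mirror the steps exactly:

```shell
# Whirl-then-edge feedback recipe (dry run, nothing is compiled).
FB=feed_back_file_name
STEP1="orcc -fb_create $FB -fb_type=1 -fb_phase=0 app.c -o app"
STEP2="orcc -fb_opt $FB -fb_create $FB -fb_type=2 -fb_phase=4 -O3 app.c -o app"
STEP3="orcc -fb_opt $FB app.c -o app"
echo "$STEP1"; echo "run ./app (training input)"
echo "$STEP2"; echo "run ./app (training input)"
echo "$STEP3"; echo "final run of ./app"
```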
How to use the compilers to achieve peak performance?
To use the compilers to generate peak-performance binary code, you need to turn on all optimizations in ORC (including inter-procedural analysis (IPA), inlining, stride pre-fetching, procedure reordering, etc.), as well as the various kinds of profiling. The process consists of three phases, as described below:
1. The first phase is Whirl profiling instrumentation. An example of the compiler options used is shown below:
"-O3 -ipa -fb_create fb.mid -fb_type=1 -fb_phase=0 "
After linkage, run the application with the "train" input data set. It will generate whirl feedback files with names like fb.mid.instr0.aginqw.
2. The second phase involves whirl profiling annotation, stride profiling instrumentation, and edge profiling instrumentation, e.g.,
"-O3 -ipa -fb_opt fb.mid -fb_create fb.mid -fb_type=10 -fb_phase=4"
After linkage, run the application with the "train" input data set. It will generate feedback files with names like fb.mid.instr0.01324, which combine the information obtained by edge profiling and stride profiling.
3. The last phase uses all the profiling information collected and turns on all optimizations, e.g.,
"-O3 -ipa -fb_opt fb.mid"
For further optimization opportunities, you may use -OPT:Olimit=0 as described above.
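The three phases can be put together as a dry-run script; app.c / app are placeholders, the commands are only echoed, and the option strings are taken verbatim from the phases above:

```shell
# Three-phase peak-performance recipe (dry run, nothing is compiled).
PHASE1="orcc -O3 -ipa -fb_create fb.mid -fb_type=1 -fb_phase=0 app.c -o app"
PHASE2="orcc -O3 -ipa -fb_opt fb.mid -fb_create fb.mid -fb_type=10 -fb_phase=4 app.c -o app"
PHASE3="orcc -O3 -ipa -fb_opt fb.mid app.c -o app"
echo "$PHASE1"; echo "run app on the train input"
echo "$PHASE2"; echo "run app on the train input"
echo "$PHASE3"
```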
Are there documents that come with the compilers?

We have included in the release some documents related to our added features and optimizations, and more will be coming. The documents are mostly in the code generator area, related to our changes/additions. For documents related to Pro64, please refer to the publication list; those marked with * in the list reflect actual implementation in various components of the compilers. We have also given three tutorials in the past two years: two at Micro34 and Micro35, and one at PACT02. Each tutorial covers different aspects of the compiler. You can find the tutorial material on the ORC site at SourceForge.
Interactions of ORC and IA-64 Linux questions

How do I install ORC compilers on IA-64 Linux systems?

The binaries are packaged in tar-ball form. Just untar the downloaded file and run the script "install-bin.sh". The help message of this script is:

Usage: install.sh [-hHnc] [-t toolroot] [-l native-archive-root]
  -h --help    give this help

Invoke "install.sh -H" to get a better understanding of "toolroot" and "libroot".

To install the compiler on an IA-64 Linux system, simply invoke ./install -n.
The environment variable $TOOLROOT affects installation as well as orcc's run-time behavior. $TOOLROOT refers to the root of the binary hierarchy of the ORC suite. For example, if this variable is not set, the full path of orcc is /usr/bin/orcc (which obviously requires root privileges); otherwise it is ${TOOLROOT}/usr/bin/orcc. If $TOOLROOT is set, you need to add $TOOLROOT/usr/bin/ to $PATH.
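A minimal sketch of the corresponding shell setup, assuming a per-user install prefix of $HOME/orc (the prefix is an assumption, not a path the release mandates):

```shell
# Hypothetical ~/.bashrc fragment for a non-root ORC install.
export TOOLROOT=$HOME/orc            # orcc then lives at $TOOLROOT/usr/bin/orcc
export PATH=$TOOLROOT/usr/bin:$PATH  # make orcc resolvable from the shell
```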
The following is an installation example. Assume the account name is joesmith, the IA-64 host name is uranus, and joesmith's login shell is GNU bash:

[joesmith@uranus joesmith]$ cat $HOME/.bashrc | grep "TOOLROOT\|PATH"
Please note that the binary in orc-2.0-bin.tar.gz is prepared for RedHat 7.2 Linux systems. Some users have experienced problems when installing RedHat 7.2 on Itanium 1 machines; our tests show that the compiler works fine on successfully installed machines.
How do I install ORC compilers on IA32 Redhat 6.2 Linux systems?

Sometimes it is desirable to install the compilers on IA32 machines and do cross compiles on a bare Linux 6.2 box. In this case you can use NUE, an Itanium simulation environment, to get a "virtual native IA64 system" on IA32.
1. Download NUE from HP website and install it.
2. After installing NUE, you might need to re-link the following two directories (depending on your NUE version) to make orcc work outside NUE as well.
If /nue/usr/include/asm is a symbolic link, change it to point to /nue/usr/src/linux/include/asm:

cd /nue/usr/include
rm asm
ln -s /nue/usr/src/linux/include/asm asm

If /nue/usr/include/linux is a symbolic link, change it to point to /nue/usr/src/linux/include/linux/:

cd /nue/usr/include
rm linux
ln -s /nue/usr/src/linux/include/linux/ linux
3. Download gcc (we have only used the 2.95.2 and 2.96 releases), then:

configure --prefix=/usr; make all install

Then run gcc -v to make sure you have the right version.
4. You don't need to build the binary at this point; just install it from the download.
To build the cross environment, keep in mind a number of things:

1. The NUE environment is a simulated IA64 system on top of the IA32 Linux box. However, we do not need to enter NUE to perform cross compilation.
2. ORC relies on NUE's cpp (C preprocessing) functionality as well as its native header files. Therefore NUE is an essential part of the setup for cross compilation.
3. By entering "nue", you are switched to a simulated native box, but you need not do so for cross compilation.
4. From the IA32 Linux box, the simulated file structure is really under /nue.
5. To do a cross compile, run ${TOOLROOT}/usr/ia64-orc-linux/bin/orcc file.c. NOTE: The path of the cross orcc is slightly different from that of the native one, i.e., "ia64-orc-linux" is interposed between "/usr" and "/bin".
How do I install ORC compilers on IA32 Redhat 7.1 Linux systems?

orc-2.0-bin.tar.gz does not work on this platform due to a shared object compatibility problem. A cross compiler on a 7.1 system is possible, but you need to build it from the source tree.
How do I install ORC compilers on IA32 Redhat 7.2 Linux systems?

For an IA32 system with Redhat 7.2 installed, the ORC compiler will be used as a cross compiler. We have provided a script INSTALL.cross that will install the compiler binaries. The binaries are packaged in tar-ball form. Just untar the downloaded file and run the script "install.sh -c". We recommend that users install NUE 1.1 on RedHat 7.2 or higher; since NUE 1.0 does not run on this system, you will be doing strictly cross compilation in this case.
How do I make a native built compiler?

We have provided a sample makefile (Make.native) in the same directory as Make.cross. Simply do a make with Make.native.
How do I install multiple versions of ORC compilers on IA-64 systems?

Sometimes it is desirable to install multiple versions of the compilers on one machine, for debugging or experimental purposes. Assuming you have two different versions of the ORC compiler binaries installed in different directories, simply set the environment variables TOOLROOT and COMP_TARGET_ROOT to point to the directories for the desired binary and archives, and it should work.
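A small sketch of switching between two installs by repointing the environment; the install paths ($HOME/orc-2.0, $HOME/orc-2.1) and the helper name are hypothetical:

```shell
# Helper to select one of several installed ORC versions.
use_orc() {
    TOOLROOT=$1            # root of the binary hierarchy for this version
    COMP_TARGET_ROOT=$1    # root for the matching archives
    PATH=$TOOLROOT/usr/bin:$PATH
    export TOOLROOT COMP_TARGET_ROOT PATH
    echo "orcc now resolves under $TOOLROOT/usr/bin"
}

use_orc $HOME/orc-2.0    # e.g. a stable install
use_orc $HOME/orc-2.1    # e.g. an experimental build
```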
How do I rebuild the Fortran front-end?
The Fortran front-end (i.e., mfef90) will not be built by default. If you want to rebuild it, you need to install the ORC binary first.
Then do the following:
cd ${ORC_SRC_ROOT}/src/osprey1.0/targia64_ia64_nodebug/crayf90/sgi; make BUILD_COMPILER=SGI
If I do cross compile on an IA32 machine, how do I produce IA-64 binaries?

In order to pick up the right set of archives or dynamic shared libraries, the simplest way is to produce object files via cross compilation, copy the objects to the target IA64 system, and do the final link there to produce the binary, as follows:
1. At IA32 side:
[IA32]% orcc -c hello.c -o hello.o
[IA32]% ftp IA64 (transfer hello.o to an Itanium machine)
2. At IA64 side:
[IA64]% orcc hello.o -o hello
[IA64]% ./hello
hello world!
%
The ORC 2.0 release allows you to produce IA64 binaries on IA32 machines directly. Simply follow the install instructions to set the environment variables and install the pre-built native archives properly.
Reporting bugs and problems

We cannot promise to fix every bug reported. If you think that you have uncovered a bug in the ORC compilers, you can post it to ipf-orc-support@lists.sourceforge.net. We will do our best if the bug is found to be ORC specific. For Pro64-specific bugs, we encourage you to post the problem to open64-devel@lists.sourceforge.net.
How do I decide if the bug is ORC specific?

Our changes are primarily in the code generator area. To find out whether the problem you have is specific to our changes:

1. If the bug persists at the -O0 level, it is likely Pro64 specific.
2. Otherwise, add "-CG:opt=0"; if the bug persists, it is most likely not in the code generator area.
3. Otherwise, add "-ORC:=off"; if the bug goes away, the bug is ORC related.
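The triage ladder above can be written out as a dry run; bug.c is a placeholder test case, the -O2 base level in the second and third compiles is an assumption (the steps only say to add the extra flag), and nothing is actually compiled:

```shell
# Bug-triage ladder (dry run, commands are echoed only).
T1="orcc -O0 bug.c"            # still fails?  likely Pro64 specific
T2="orcc -O2 -CG:opt=0 bug.c"  # still fails?  likely outside the code generator
T3="orcc -O2 -ORC:=off bug.c"  # now passes?   the bug is ORC related
for t in "$T1" "$T2" "$T3"; do echo "$t"; done
```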
You can help us quickly turn around fixes by doing some up-front work:

1. Minimize the test case. Provide a fully preprocessed file (-E option) to avoid dependence on specific header files.
2. Give us the full command line used to compile the test.
3. Tell us how to run your program if the symptom occurs at runtime. If the program needs input data, please attach it as well.
4. You can also help by using the triage tool we provide to narrow down the optimization, procedure, and BB/region/instruction that shows the problem.
Using tools that come with the compiler

How to use the hot path enumeration tool

The hot path enumeration tool (hpe.pl) can be used to enumerate the hot paths in a PU, which is useful for performance analysis. It works on an assembly file generated by ORC.

usage: hpe.pl [options] file
How to use the cycle counting tool

Usage: cycount.pl [options] benchmarks
Options:
  -h --help                Display this information
  -d --spec_home <dir>     Set <dir> as the SPEC2KINT home. Default is $HOME/spec2000
  -l --log <filename>      Give a log file name. Default is STDOUT
benchmarks                 SPEC2KINT benchmarks. Type 'int' for all 12 SPEC2KINT benchmarks
Adding new optimizations or phases

Are there any coding conventions?

Please find the coding convention guideline here. This guideline is designed to stay close to the Pro64 coding style and conventions. You can also find documentation on how the compiler handles memory management and various other issues.
How do I add my own tracing options?

One can obtain either a summarized or a detailed trace of any specific optimization performed. The traces are dumped to the file xxx.t, where xxx.o is the desired output object file. To add your own tracing options:
1. Make sure what you want to add is a phase-specific trace flag. Assume your phase is XYZ (a corresponding number xyz can be found in osprey1.0/common/util/tracing.h). It will be used like this:

-Wb,-ttXYZ,0xnn

(-Wb,-ttxyz,0xnn is the same)

2. Modify xyz_defs.h to define your flags (the minor number above, nn; the major number is mm). Please add them at the end.
3. Your code using these flags should be like:
if (Get_Trace(TP_XYZ, YOUR_FLAG)) {
YOUR_CLASS.Print(TFile);
}
where TFile is the file handle of the trace file.
4. common/util/tracing.h is the file of interest to add tracing.
How do I change the driver and related files when I add an optimization?

When you add your own phase to the compilers, the following points are important:
1. Suppose you want to add a phase inside the code generator.
2. All files you add should have lower-case file names, with words separated by underscores, such as if_conv.cxx.
3. You should define (declare) a flag in cg_flags.cxx(.h) controlling whether your phase should be run.
4. You should add an element to the array Options_IPFEC in cgdriver.cxx describing your flags. There are already several flags there that you can take as examples.
5. Your flags should have the prefix IPFEC_Enable_XXX, such as IPFEC_Enable_If_Conversion, or at the very least the prefix IPFEC_XXX.
6. The other components are all similar; the files common/com/xxx_config.h and common/util/flags.{h,c} are the files of interest.
How do I effectively use gdb to debug the compilers?

Prerequisites:
1. Your gcc version needs to be 2.95.2 or 2.96.
2. To enable debugging, you can build the entire compiler with "make BUILD_OPTIMIZE=DEBUG", or build individual components (such as wopt.so) by going into the corresponding targia_xxx directory and building with "BUILD_OPTIMIZE=DEBUG".
To single step inside the backend components (we'll show the CG portion, other components are similar):
1. Remember to set the environment variable LD_LIBRARY_PATH. For the cross compiler this variable should be set as:

export LD_LIBRARY_PATH=${TOOLROOT}/usr/ia64-orc-linux/lib/gcc-lib/ia64-orc-linux/2.0:$LD_LIBRARY_PATH

For the native compiler it should be:

export LD_LIBRARY_PATH=${TOOLROOT}/usr/lib/gcc-lib/ia64-orc-linux/2.0:$LD_LIBRARY_PATH
2. First run orcc with the options "-show -keep"; it will keep the needed intermediate files.
orcc -show -keep kk16.c
It will print some info like this:
/home/xyz/orc-2.0/usr/ia64-orc-linux/altbin/gcc -D_LANGUAGE_C -D_SGI_COMPILER_VERSION=10. -D__host_ia32 -D__INLINE_INTRINSICS -D_LP64 -D__ia64=1 kk16.c -E > kk16.i
/usr/ia64-orc-linux/lib/gcc-lib/ia64-orc-linux/2.0/gfec -O0 -dx -quiet -dumpbase kk16.c kk16.i -o kk16.B
/usr/ia64-orc-linux/lib/gcc-lib/ia64-orc-linux/2.0/be -PHASE:c -G8 -O0 -TENV:PIC -m1 -INTERNAL:return_val=on -INTERNAL:mldid_mstid=on -INTERNAL:return_info=on -show -TARG:abi=i64 -LANG:=ansi_c -fB,kk16.B -s -fs,kk16.s kk16.c
Compiling kk16.c (kk16.B) -- Back End
Compiling main(0)
/usr/ia64-orc-linux/bin/as kk16.s -o kk16.o
/usr/ia64-orc-linux/bin/ld -dynamic-linker /lib/ld-linux-ia64.so.2 -rpath-link ...
3. Now we can use the input file and options for BE as the input to gdb when debugging BE:
gdb /usr/ia64-orc-linux/lib/gcc-lib/ia64-orc-linux/2.0/be
(gdb) break be_debug
(gdb) run -PHASE:c -G8 -O0 -TENV:PIC -m1 -INTERNAL:return_val=on -INTERNAL:mldid_mstid=on -INTERNAL:return_info=on -show -TARG:abi=i64 -LANG:=ansi_c -fB,kk16.B -s -fs,kk16.s kk16.c
4. After gdb stops, set another breakpoint and step into the component you are interested in.
How do I make sure my new phase is not the cause of unnecessarily long compile time?

All optimization phases are timed by the compiler for ease of measurement; please make your phase timed as well. It's simple: you only need to add two lines to your code, one after entering your phase, for example:
Start_Timer(T_Aurora_LOCS_CU);
and another before leaving your phase, for example:
Stop_Timer(T_Aurora_LOCS_CU);
Remember to include timing.h in your code. Naturally, you'll need to define your own timing phase; please look at the file be/com/timing.h. After recompiling, you can get timing info using -Wb,-ti1.