
The MIPSpro compilers are SGI's third-generation family of optimizing and parallelizing compilers. The compilers generate 64-bit optimized code for the MIPS R8000-based POWER CHALLENGE systems running IRIX 6.0, the new 64-bit operating system, and support Fortran 77, C, and C++. Highlights:
Figure 5-1 diagrams the relationship between MIPSpro compilers and IRIX systems.
Figure 5-1 MIPSpro Compilers and IRIX Versions
Figure 5-2 illustrates the various components of the MIPSpro compilation system. Components include the compiler back end and the Parallel C and Fortran 77 analyzer. Scalar and superscalar optimizations common to all the compilers are performed in the common back end. All parallel processing optimizations are performed in the parallel analyzer phases of the compilation system. Kuck and Associates (KAI) technology is an integral part of the MIPSpro compilation system.
The compilers support value-added extensions to
Figure 5-2 MIPSpro Compiler Components This section explains:
The compilers perform a range of general-purpose and architecture-specific optimizations to improve application performance by reducing the number of instructions executed, thus making better use of the CPU's instruction set, maximizing register use, minimizing memory references, and eliminating unused or redundant code.
A range of new optimization techniques takes maximum advantage of the new processor features such as on-chip and off-chip caches, pipelining, and superscalar chip architecture. The optimizations are applicable to a wide range of scientific and engineering applications and benefit both scalar and parallel performance of applications.
A rich assortment of command-line options can leverage different combinations of optimizations. In general, the optimizations are spread across the compilation system for better efficiency. For instance, high-level optimizations like loop interchange and loop unrolling are performed in the compiler front ends, whereas architecture-specific optimizations like software pipelining and automatic blocking are implemented in the common back end. All optimizations are fine-tuned to take advantage of the new system. Key optimizations:
Memory hierarchy optimizations play a key role in matching the performance capabilities of the fast superscalar processor with the relatively slower main memory system. The primary function of the cache subsystem is to bridge the gap between processor and main memory speed. In this sense the cache architecture of a modern superscalar microprocessor like the MIPS R8000 is similar to the vector register sets of traditional vector supercomputers that used vector registers to keep the processors busy with computation without having to frequently reference main memory.
Caches are based on the observation that most application programs exhibit some degree of locality of reference: programs access pieces of data that are "near" already requested data, in space and time. A program that accesses memory without regard to locality of reference might perform poorly because of a high number of cache misses. The compiler plays a crucial role in restructuring programs to reduce cache misses by interchanging loops, or by tiling or blocking loop nests so that data is consumed most efficiently by the processor. This arrangement is similar to traditional vectorizing compilers that restructured programs to fit in vector memory (registers) in pieces. In short, the compiler restructures programs so that a useful subset of the problem can fit into the cache. Thus the processor can work on patches of the original code and data from the cache memory (thereby avoiding main memory references) before moving on to the next patch.
Figure 5-3 illustrates the cache misses per FLOP as a function of different cache sizes for a broad range of scientific and engineering applications.
Figure 5-3 Working Set Effects on Scientific Applications
For cache sizes of less than 1 MB, the miss ratio becomes negligible. This experiment illustrates how a combination of a moderately large cache size and good compiler technology can reduce cache misses to a negligible amount for a large (but not all-inclusive) class of big scientific and engineering problems.
5.1.3 Automatic and User-assisted Parallelization
The MIPSpro Power compilers (MIPSpro Power Fortran 77 and MIPSpro Power C) support automatic and user-directed parallelization of Fortran and C applications for multiprocessing execution. The compilers employ automatic parallelization techniques to analyze and restructure user applications for parallel execution, as preferred by users who rely on the compilers to parallelize their applications.
The compilers also provide a comprehensive set of standards-based comment directives that enable users to assist the compiler in the parallelization process. Users can use these directives to provide additional information to the compiler to boost parallel performance.
The parallelization technology is fine-tuned to take advantage of the POWER CHALLENGE system architecture. A combination of automatic and user-assisted parallelization can lead to substantial improvements in the performance of many programs.
5.1.4 High-performance Scientific Libraries
The compilers are complemented by CHALLENGEcomplib, a comprehensive collection of scientific and math subroutine libraries that provide support for mathematical and numerical algorithms used in scientific computing. The key motivation for creating CHALLENGEcomplib is to provide standard library functionality and to improve the runtime performance of applications.
CHALLENGEcomplib is similar to scientific libraries provided by other supercomputing vendors like the Cray SCILIB, IBM ESSL and the Convex VECLIB. The library consists of two subcomponents: complib.sgimath and complib.slatec. These libraries include support for Basic Linear Algebra Subprograms (BLAS), Extended BLAS (Level 2 and Level 3), LAPACK, FFT, Convolutions, and selected routines from Cray's SCILIB and SLATEC from the Energy, Science and Technology Software Center.
5.1.5 Software Development Support
The MIPSpro compiler family includes basic debugging and program runtime analysis tools including dbx, pixie and prof.
Note: Both pixie and prof can be used to profile parallel programs.
5.2 Optimization Technology in the MIPSpro Compilation System
Optimization technology is an integral part of the MIPSpro Compilation system. Supercomputing microprocessors like the MIPS R8000 offer enormous performance potential that needs to be harnessed by advanced optimization techniques. The optimizer takes the processor, system and program characteristics into consideration when restructuring programs for performance.
Figure 5-4 illustrates three important parameters that influence the effectiveness of the optimization phases of the compilers.
Figure 5-4 Optimization Phase Parameters
This section discusses
The optimizer has a sound understanding of the architecture and performance characteristics of the MIPS R8000 64-bit superscalar processor and it uses this information in generating optimized code for applications. The R8000 processor implements the MIPS IV Instruction Set Architecture (ISA) and has the following features:
The POWER CHALLENGE system architecture is also taken into consideration when restructuring programs for fast execution. For example, the compilers can generate parallel code to take advantage of the multiprocessing capabilities of the POWER CHALLENGE systems. Similarly, the compilation system takes the system bus and memory performance characteristics into consideration when implementing efficient synchronization primitives for parallel execution of programs.
POWER CHALLENGE system characteristics that are important to the optimizer include:
The compilers perform extensive analysis of application programs to perform appropriate optimizing transformations. Different applications exhibit different characteristics that determine the degree of optimizations that can be performed. For instance, compute-intensive applications that reference data in a regular manner may be very amenable to an optimization known as loop blocking. Similarly, applications that reference different sections of data simultaneously can be parallelized by the compiler for concurrent execution. In short, application characteristics play an important role in determining the degree and effectiveness of the optimizations that can be performed.
5.2.4 Optimization Technology
The MIPSpro compilers perform a hierarchy of optimizations ranging from fine-grained instruction-level optimizations to coarse-grained parallelization through loops and tasks to reduce the execution time of applications. The optimization phases are spread across the compilation system.
Instruction-level optimizations are performed mostly in the common back end to extract maximum performance out of the R8000 superscalar processor. Common instruction-level optimizations include software pipelining, instruction scheduling, global instruction movement and register allocation.
Loop-level optimizations are performed in the early stages of the compilation process. Key loop-level optimizations include automatic loop blocking, loop interchange and loop unrolling.
Automatic loop blocking and loop interchange are memory hierarchy optimizations that take advantage of the cache architecture of the machine. Similarly, loop unrolling attempts to expose more instruction level parallelism to the optimizer for fine-grained parallelism.
Other optimizations such as loop distribution and loop fusion are important for efficient parallel execution. The compiler uses extensive analysis and transformation techniques to detect parallelism in programs. The compilation system supports a complete runtime environment for parallel execution. This runtime library is common to all the MIPSpro compilers.
Figure 5-5 illustrates the different kinds of parallelism exploited by the compilers.
Figure 5-5 Compiler Parallelisms
5.2.5 Basic Block Optimizations
Basic optimizations are performed at optimization level 1 (-O1) and are meant to create efficient scalar code at the basic block level for both C and Fortran programs. A basic-block is a sequence of statements ending with a condition or unconditional branch. Many scalar optimizations are performed at the basic block level to improve the efficiency of the generated code. The following are some of the most common optimizations performed at the basic-block level:
5.2.6 Global Optimizations
The compilers perform extensive global optimizations at optimization level 2 (-O2) that are usually beneficial to most applications, and are conservative in nature to ensure integrity of the results of floating point computations. The compiler performs memory hierarchy optimizations to maximize reuse of data in the cache. Optimizations performed at this level include:
5.2.7 Advanced Optimizations
The compilers perform aggressive optimizations at level 3 (-O3) that focus on maximum code quality. Considerable flexibility is provided to enable combination of optimizations at this level for maximum floating point performance. Important superscalar optimizations like software pipelining are performed at this level.
5.2.8 Floating-point Optimizations
The compilers normally generate floating-point code that conforms to the IEEE 754 floating-point standard. However, many floating-point-intensive codes that have not been written with careful attention to floating-point behavior do not require precise conformance with the source language expression evaluation standards or the IEEE 754 arithmetic standards. It is therefore possible to relax the conformance restrictions in favor of better performance. The MIPSpro compilers provide a number of different command-line options to accomplish this goal:
The -OPT:roundoff=n flag is available to determine the extent to which optimizations are allowed to affect floating-point results, in terms of both accuracy and overflow/underflow behavior.
5.2.8.2 IEEE Option
The -OPT:IEEE_arithmetic=n flag specifies the extent to which optimizations should preserve IEEE floating -point arithmetic.
5.2.8.3 Reciprocal and Reciprocal Square Root
The flexible floating-point options provide users with a range of alternatives to trade off accuracy for speed. Thus applications can take advantage of fast MIPS IV instructions like recip and rsqrt. This is particularly significant for applications that were running on Crays, which have several fewer bits of precision than IEEE 64-bit. Heavy users of Cray and other non-IEEE compliant vector machines who have a need for speed can use these options.
In short, optimizing divides into multiplies by using reciprocal, and lifting the inverse calculation outside the loop can give rise to superior performance improvements. For example, if IEEE conformance is required, the generated code must do the n loop iterations in order, with a divide and an add in each iteration. Alternatively, if IEEE conformance is not required, the implementation of x/y as x * recip (y), or sqrt(x) as x * rsqrt(x) can be used to treat the divide as a(i) * ( 1.0/divisor ). On the MIPS R8000, the reciprocal can be calculated once before the loop is entered, thereby reducing the loop body to a much faster multiply and add per iteration, which can be a single madd instruction on the R8000:
INTEGER i,n REAL sum, divisor, a(n) sum = 0.0 do i = 1, n sum = sum + a (i) / divisor enddo
For example, a loop encountered in the Computational Chemistry Application called WESDYN developed at Wesleyan University is representative of loops that are frequently encountered in computation-intensive portions of chemistry applications. The loop contains reciprocal as well as square-root operations that are candidates for higher performance.
do 200i = 1,n r2( i ) = 1 / ( xx( i3 ) ** 2 + xx( i3 + 1 ) ** 2 + xx( i3 + 2 ) ** 2 ) r1( i ) = sqrt( r2( i ) ) i3 = i3 + 3 200 continuerecip and rsqrt are also important to graphics applications that use the reciprocal and reciprocal square root operations in important computational parts.
Valid speculative code motion must normally avoid moving operations which may cause runtime traps. As a result, turning off certain traps at runtime enables more code motion. Thus, applications that can ignore floating point exceptions in certain segments of the program can take advantage of this optimization. Similarly, applications that can ignore memory access exceptions can also take advantage of this feature. For example, in the SPECfp92 benchmark 013.spice2g6, it is possible to enable speculative code motion for a critical loop by turning off floating point and memory exceptions around that loop, resulting in a healthy performance gain.
5.2.8.4 Fast Intrinsics
The MIPSpro compilation system supports a fast version of intrinsic library functions. Selected mathematical functions from the standard mathematical library are hand-coded in assembly language to take maximum advantage of the MIPS IV instruction set of the MIPS R8000 architecture. Specifically, frequently used intrinsics like the transcendental functions (log, exp, power, sin, cos, cis and tan) are hand-coded in assembly and are part of a separate fast mathematical library.
The fast library can be invoked with the -lfastm command-line flag. The accuracy level of all the hand-code transcendental functions (except for tan) is better than 2 ULPS (units in the least significant place).
5.2.9 Global Code Motion
Global Code Motion (GCM) is a general-purpose optimization that redistributes the instructions among basic blocks along an execution path to improve instruction-level parallelism and make better use of machine resources.
Traditional global optimizers avoid moving instructions in cases that might cause them to be executed along control flow paths where they would not have been in the original program. The MIPSpro global optimizer does perform such code motion, called speculative code motion, because the instructions moved are executed based on speculation that they will actually prove useful. This kind of aggressive code motion is unique to the MIPSpro compilers. By default, GCM is very conservative in its speculations. However, a number of options are available to control the degree of speculation.
Valid speculative code motion must normally avoid moving operations that might cause runtime traps. As a result, turning off certain traps at runtime enables more code motion. Thus, applications that can ignore floating point exceptions in certain segments of the program can take advantage of this optimization. Similarly, applications that can ignore memory access exceptions can also take advantage of this feature. For example, in the SPECfp92 benchmark 013.spice2g6, speculative code motion can be enabled for a critical loop by turning off floating-point and memory exceptions around that loop, resulting in a healthy performance gain.
Figure 5-6 illustrates the different kinds of speculations available.
Figure 15-6 Speculation Types
What follows is a brief description of each of these optimizations (with appropriate user-level flag control):
if ( p -> next ! = NULL ) {
sum = sum + p -> next -> val ;
} else {
sum = sum + p -> final_val ;
}
5.2.10 Software Pipelining
Software pipelining is arguably the most important architecture-specific optimization implemented in the MIPSpro Compilers. It is a practical, efficient and general-purpose scheduling technique for exploiting fine-grained, instruction level parallelism available in modern-day superscalar processors like the MIPS R8000.
In software pipelining, iterations of loops are continuously initiated at constant intervals without having to wait for preceding iterations to complete. That is, multiple iterations, in different stages of computation, are in progress simultaneously. The steady state of this pipeline constitutes the loop body of the object code. Consider this simple DAXPY loop:
do i = 1, n v (i) = v (i) + X * w (i) enddoOn the MIPS R8000 architecture, this loop can be coded in assembly language as two load instructions followed by a multiply-add instruction and a store. Figure 5-7 shows this execution order constraint.
Figure 5-7 Execution Order Constraint
This simple schedule completes one iteration of the loop body in five machine cycles. Considering that the R8000 processor allows up to two memory operations and two floating-point operations in the same cycle, the instructions above are initiated by the machine as outlined in Table\x115-1. (This schedule completes one iteration of the loop in five machine cycles):
Table 5-1 Simple Schedule for Loop Body of DAXPY
If the same loop were unrolled by a factor of four, the loop body would look like:
Cycle Count Memory Operations Floating-point Operations
0 load andv(i); load andw(i) madd X, w(i), v(i) 1 2 3 4 store andv(i)
do i = 1, n, 4 v (i) = v (i) + X * w (i) v (i + 1) = v (i + 1) + X * w (i + 1) v (i + 2) = v (i + 2) + X * w (i + 2)` v (i + 2) = v (i + 2) + X * w (i + 2) enddoThe unrolled loop allows for instructions from four independent iterations to execute in parallel, thereby increasing the instruction level parallelism and hence the performance of this loop. Table 5-2 illustrates the sequence of loads, madds and stores for this unrolled loop. (This schedule completes four iterations of the loop in eight machine cycles).
Table 5-2 Schedule for Loop Body of Four-way Unrolled DAXPY
The schedule above completes four iterations of the loop body in eight cycles, thereby improving the performance to one quarter of R8000's peak megaFLOPs. However, this schedule still leaves room for improvement. Each store has to wait three cycles for its corresponding madd instruction to complete, thereby forcing the four store operations to be initiated in different cycles. In other words, the schedule above does not take advantage of the R8000's ability to do two stores in one cycle.
Cycle Memory Operations Floating Point Operations Count
0 load andv ( i); load andw (i) madd X, w(i), v(i) 1 load andv (i + 1); load andw (i + 1) madd X, w (i+1), v (i+1) 2 load andv (i + 2); load andw (i + 2) madd X, w (i+2), v (i+2) 3 load andv (i + 3); load andw (i + 3) madd X, w (i+3), v (i+3) 4 store andv (i) 5 store andv (i + 1) 6 store andv (i + 2) 7 store andv (i + 3)
By using software pipelining, the loop instructions can be initiated at constant intervals such that each iteration executes a combination of loads and stores from different iterations. In the DAXPY example, this can result in a schedule that would complete two iterations every three cycles to realize significant performance improvements over the two previous schedules.
Table 5-3 illustrates the machine schedule for the software pipelined version of the above loop. To prepare properly for entry into such a loop, a prologue section of code is added that sets up the registers for the first few stores in the main loop. Similarly, in order to exit the loop properly, an epilog section is added that performs the final stores. Any preparation of registers needed for the epilog is done in the cleanup section of the code. (The main loop completes four different iterations in six machine cycles).
Table 5-3 Schedule for Software-pipelined DAXPY.
Cycle Memory Operations Floating Point Operations Count
PROLOG: t1 = load andv (i); t7 = madd X, w(i), v(i) t2 = load andw (i) t4 = load andv (i + 1); t8 = madd X, w (i+1), v (i+1) t5 = load andw (i + 1) MAINLOOP: 0 t1= load andv (i + 2); t3 = madd X, w (i+2), v (i+2) t2 = load andw (i + 2) 1 t4 = load andv (i + 3); t6 = madd X, w (i+3), v (i+3) t5 = load andw (i + 3) 2 store t7; store t8 beq CLEANUP 3 t1 = load andv (i +4); t7 = madd X, w (i + 4), v (i +4) t2 = load andw (i + 4) 4 t4 = load andv (i + 5); t8 = madd X, w (i + 5), v (i + 5) t5 = load andw (i + 5) 5 store t3; store t6 bne MAINLOOP; EPILOG: store t7; store t8 br ALLDONE CLEANUP:t7 = t3; t8 = t6 br EPILOG ALLDONE:
In the main loop, the code completes four different iterations in six cycles, which is better than the previous two schedules. Table 5-4 illustrates the performance improvement in the three cases.
Table 5-4 DAXPY Speedup Factor for Simple Schedule
Scheduling Type Performance Speedup
Simple scheduling 1 iteration in 5 cycles 1.0 4-way unrolling 4 iterations in 8 cycles 2.5 Software pipelining 4 iterations in 6 cycles 3.3
As Table 5-4 suggests, software pipelining can make a huge difference in the performance of compute-intensive applications.
Another advantage of software pipelining is its ability to generate compact code, compared to transformations like loop unrolling that can increase the program size by a noticeable amount. Compact code prevents instruction cache penalties resulting from increased code size. The most important aspect of software pipelining is its ability to generate near-optimal code for loops.
Software pipelining in the MIPSpro compilers is based on the concept of "modulo iteration-interval scheduling", a technique for software pipelining innermost loops. This is a proven technique for generating code with near-optimal performance. In effect, this technique sets a performance goal for each loop prior to scheduling and then attempts to achieve this goal by taking into account resource constraints and the program data reference constraints. Typical benchmark programs, like the SPECfp92 benchmarks, illustrate substantial improvements in performance when compiled with software pipelining.Figure 5-8 summarizes the performance and cost effectiveness of software pipelining as implemented in the MIPSpro compilers.
Figure 5-8 Cost Effectiveness of Software Pipelining
5.2.11 Pointer Optimization
For many compiler optimizations, ranging from simply holding a value in a register to the parallel execution of a loop, it is necessary to determine whether two distinct memory references designate distinct objects. If the objects are not distinct, the references are said to be aliases. When these references are pointers, there is often not enough information available within a single function or compilation unit to determine whether the two pointers are aliased. Even when enough information is available, the analysis can require substantial amounts of time and space. For example, it could require an analysis of a whole program to determine the possible values of a pointer that is a function parameter.
In short, compilers must normally be conservative in optimizing memory references involving pointers (especially in languages like C), since aliases (i.e. different ways of accessing the same memory location) may be very hard to detect. Consider the following example:
float x[100];
float *c;
void f4 ( n , p , q )
int n;
float * p;
float * q;
for ( i = 0 ; i < n ; i ++ ) {
p[ i ] = q[ i ] + c[ i ] + x[i] ;
}
}
To be safe, the compiler must assume that the pointer references a, b, and c may all be aliased to each other. This in turn precludes the possibility of aggressive loop optimizations by the optimizer.
The MIPSpro compilers alleviate this general problem of aliasing by providing users with a number of different options for specifying pointer aliasing information to the compiler. The compiler can use this information to perform aggressive optimizations in the presence of pointers and thereby achieve healthy performance improvements. Table 5-5 illustrates the various user-controlled options that can be useful for improving runtime performance.
Table 5-5 User-assisted Pointer Optimizations
Flag Description
-OPT:alias=any Compiler should assume that any pair of memory references may be aliases unless proven otherwise. This is the default setting in the compiler and reflects a safe assumption by the compiler. -OPT:alias=typed Compiler assumes that any pair of memory references that are of distinct types cannot be aliased. For example: void dbl ( i , f ) { int * i; float * f; *i = *i + * i ; *f = *f + *f ; } The compiler assumes that i and f point to different memory locations as they are of different types. This can result in producing an overlapped schedule for the two calculations. -OPT:alias=unnamed Compiler can assume that pointers will never point to named objects. In the following example compiler will assume that the pointer p cannot point to the object g, and will produce an overlapped schedule for the two calculations. This is the default assumption for the pointers implicit in Fortran dummy arguments according to the ANSI standard. float g; void double (f) float* p; g = g * g ; *p = *p + *p ; } -OPT:alias=restrict Compiler should assume a very restrictive model of aliasing, where no two pointers ever point to the same memory area. This is a rather restrictive assumption, but when applied for specific well-controlled, valid cases, can produce significantly better code. vopid double (p,q) int*p; int*q; { *p = *p + *p ; *q = *q + *q ; }
5.2.12 Loop Optimizations
Most compute-intensive programs spend a significant portion of their execution time in loops; the MIPSpro compilers spend a significant portion of their time in optimizing loops in the program. Various techniques optimize the performance of loop, many of them automatically enabled by the compilers at various levels of optimizations. This section explains important loop transformations performed by the MIPSpro compilers.
Loop interchange is a memory hierarchy optimization that modifies the data access pattern of nested loops to match with the way data is laid out in memory. For example, in a typical Fortran 77 loop nest, Fortran stores array elements in column-major order (not row-major like most programming languages). Each iteration of the i loop steps across contiguous elements of A, while each iteration of the j loop steps over an entire column of A. Assuming that A is dimensioned as A (M, N), each iteration of the j loop steps across M elements of A. If M is larger than a page size of the machine, each iteration of the j loop steps on a new page, thereby exhibiting bad locality. As a result, the program may spend considerable portion of its time moving data between memory and the cache system, and exhibit poor performance.
do i = 1, m do j = 1, n do j = 1, n do i = 1, m a ( i , j) = a ( i-1 , j ) + 1.0 a ( i , j) = a ( i-1 , j ) + 1.0 enddo enddo enddo enddo
Original Loop Interchanged Loop
Figure 5-9 Loop Comparison
This problem of the innermost loop having a large stride is eliminated by interchanging the two loops, as shown in the right-hand example in Figure 5-9. Now the innermost loop runs across contiguous elements, minimizing page faults and cache misses. Depending on the dimensions of the arrays, the transformed loop can exhibit significantly better runtime performance.
Another advantage of loop interchange is its ability to move parallelism to outer levels of a nested loop. Before interchange, the innermost j loop can be parallelized by the compiler. However, the amount of work performed within the j loop may not be sufficient for efficient parallel execution. Once loop interchange is performed, the parallelism moves to the outer loop thereby increasing the amount of work in the loop. In effect the compiler is able to parallelize a larger region of code for better performance.
5.2.12.2 Loop Distribution
The compiler performs loop distribution to partition a single loop into multiple loops. Loop distribution has the advantage of making a loop's working set better fit the paging structure of the underlying machine. It can also expose more parallelism to the compiler. By distributing the loop into a sequential loop and a parallel loop, the compiler is able to efficiently execute parts of the original loop in parallel. The multiple loops are usually smaller (in body size) compared to the original loop and are more amenable to software pipelining. Figure 5-10 illustrates this transformation:
Figure 5-10 Loop Before (Left) and After (Right) Distribution
do i = 1 , m a ( i ) = b ( i ) + c ( i ) enddo do i = 1 , m a ( i ) = b ( i ) + c ( i ) do ii = 1 , m d ( i ) = d ( i + 1) + e ( i ) d ( ii ) = d ( ii + 1) = e ( ii) enddo enddo
The original loop (left) cannot be parallelized because of the data dependency arising from the reference to array D. However, after distribution the first i loop can be parallelized and the ii loop software pipelined for performance.
5.2.12.3 Loop Fusion
Loop fusion, the inverse of loop distribution, involves "jamming" two originally separate loops into a single loop. Figure 5-11 illustrates this transformation.
Figure 5-11 Loop Before (Left) and After (Right) Fusion
do i = 1 , m a ( i ) = b ( i ) + c ( i ) enddo do ij = 1 , m a ( ij ) = b ( ij) + c ( ij ) do j = 1 , m d ( j ) = a ( j ) + e ( j ) (d ( ij ) = a ( ij ) + e ( ij ) enddo enddo
Loop fusion can be used in many cases to combine two loops, each of which utilizes a large portion of the page space of the machine. The fused loop can have a working set that is smaller than the sum of the two individual loops, improving data reuse, and permitting better register allocation. In Figure 5-11, after loop fusion the elements of array A are immediately available for use by the second statement in each iteration. The optimizer recognizes the reuse of elements of A, and keeps them in registers for the operation. Loop fusion can also increase the size of loops to improve the efficiency of parallel execution. By combining two small loops into a bigger loop, fusion sets the stage for profitable parallelization of the bigger loop. In Figure 5-11, the two individual loops may be too small to overcome the overheads of parallelization. However, the combined loop after fusion may be large enough to realize performance improvements from parallelization.
5.2.12.4 Loop Blocking
Loop blocking is an effective technique available in the MIPSpro compilers for optimizing the performance of the memory hierarchy for numerical algorithms. The reason for blocking is that entire matrices typically do not fit in the fast data storage (for example, the register file or cache) of the machine. Blocking decomposes matrix operations into submatrix operations, with a submatrix size chosen so that the operands can fit in the register file or cache. Since elements of a submatrix are reused in matrix operations, this reduces slow memory accesses and speeds up the computation.
Figure 5-12 (courtesy of Dr. James Winget, Chief Scientist, SGI) shows the change in the memory access pattern as a result of loop blocking.
Figure 5-12 Memory Reference Pattern Before (Top) and After (Bottom) Loop Blocking
The before picture in Figure\x115-12 references four sets of consecutive addresses over a certain period of time before repeating the access pattern. Blocking restructures the loop to reflect the memory access pattern illustrated in the after picture. Here, subsets of all four data sets reside in cache and are accessed in a shorter period of time. This arrangement enables useful computation to be performed efficiently on a cache resident subset of the original dataset before moving on to the next subset. Performance improvements come from reduced processor-to-main memory traffic as a result of efficient cache utilization.
5.2.12.5 Loop Unrolling
Loop unrolling is a fundamental transformation that is a basic component of other restructuring techniques like software pipelining and unroll-and-jam. The unrolling of outer loops of nested loop regions are usually important for good use of the memory hierarchy. Unrolling of inner loops improves the usage of the floating-point registers and provides more room for instruction overlap. Unrolling decreases the trip count of loops, thereby reducing the loop's conditional branch overhead.
The number of times a loop should be unrolled (unrolling factor) is determined by the compiler, based on numerous considerations including the amount of data referenced in the loop body, the data access dependencies, the availability of registers, the size of data cache, and the purpose of unrolling.
Figure 5-13 illustrates the process of unrolling.
do i = 1 , n , 4 v ( i ) = v ( i - 2 ) + X * w ( i ) v ( i + 1 ) = v ( i - 2 ) + X * w ( i + 1 ) do i = 1 , n v ( i + 2 ) = v ( i ) + X * w ( i + 2 ) v ( i ) = v ( i - 2 ) + X * w ( i ) v ( i + 3 ) = v ( i + 1 ) + X * w ( i + 3 ) enddo enddo
In Figure 5-13, the unrolled loop exposes a lot of instruction level parallelism as the different assignments in the unrolled loop can be overlapped for performance.
5.2.12.6 Loop Multiversioning
Multiversioning is a technique employed by the compiler to improve the efficiency of parallel performance. Many loops, specially in Fortran, use symbolic bounds as trip counts which cannot be determined at compile time. However, the compiler can generate multiple versions of the original code at compile time. The resulting program will execute the appropriate path depending on the loops's trip count as determined dynamically at execution time.
Multiversioning is a technique employed by the compiler to improve the efficiency of parallel performance. Many loops, particularly in Fortran, use symbolic bounds as trip counts, which cannot be determined at compile time. However, the compiler can generate multiple versions of the original code at compile time. The resulting program executes the appropriate path depending on the loops' trip count, as determined dynamically at execution time.
Figure 5-14 Loop Before (Left) and After (Right) Multiversioning
if ( n > 100 ) then /* run this loop in parallel */ do i = 1 , n v ( i ) = v (i) + x * w(i) enddo else /* run this loop sequentially*/ do i = 1 , n v(i) = v(i) + X * w(i) do i = 1, m enddo v ( i ) = v ( i ) + X * w ( i ) enddo endif
Multiversioning improves the overall efficiency of parallel execution by using accurate information at program execution time.
5.2.12.7 Pattern Matching
The compiler back end understands patterns of standard computational kernels like SAXPY, DAXPY, LINPACK, and the like, and generates optimal code for such code sequences. Pattern matching is also used for performing basic dependency analysis on loops to improve the effectiveness of software pipelining.
5.2.13 Parallelization
The MIPSpro Fortran 77 and C compilers fully support automatic and user-assisted parallelization. User-assisted parallelization is available in the form of comment-based directives to guide program parallelization, which is enabled by specifying the --mp flag for both Fortran and C.
The directives provide comprehensive support for specifying and controlling the degree and dynamics of parallel execution. For example, the directives can be used to specify conditional parallelism to ensure that parallelism occurs only under certain dynamic conditions. In the two examples shown in Figure 5-15, the if clause in the directive specifies the conditions for parallel execution. In the Fortran 77 example, the loop executes in parallel only if the value of jmax is greater than 1000. Similarly, the C example executes in parallel only if the value of max is greater than 1000.
Figure 5-15 Fortran and C Parallelization Examples The compilers automatically detect program parallelism by employing the technique of data dependency analysis. Both Fortran 77 (-pfa flag) and C (-pca flag) compilers have this capability. Data dependence information is also used by a number of loop transformations listed in previous sections of this chapter.
5.2.14 Procedure Inlining
The compilers provide for automatic and user-directed inlining of functions and subroutines in Fortran and C programs. Inlining is the process of replacing a function reference with the text of the function. This process eliminates function call overhead and improves the effectiveness of numerous scalar and parallel optimizations by exposing the relationships between function arguments, returned values, and the surrounding code.
The compilers provide command-line options to direct the inlining of the specified list of subroutines. Flags are available to limit inlining to routines that are referenced in deeply nested loops, where the reduced call overhead or enhanced optimization is multiplied. Options exist to perform interprocedural inlining, whereby instances of routines can be inlined across different files.
One drawback of unlimited inlining is its tendency to increase the code size of the resulting program. Uncontrolled replacement of function or subroutine calls with the actual body of the called routine can cause "code explosion," which in turn increases compile time and reduces the effectiveness of other optimizations. The technique of Interprocedural Analysis (IPA) provides the benefits of inlining without necessitating inlining the code.
5.2.15 Interprocedural Analysis (IPA)
To increase the effectiveness of the automatic parallelizer via interprocedural analysis, the -ipa[=list] option can be used to specify the degree of interprocedural analysis to be performed. By default, this option turns on interprocedural analysis on all eligible routines. IPA can be enabled on selective routines by specifying the list of routines.
IPA tracks the flow of control and data across procedure boundaries and uses this information to drive the optimization process. IPA is particularly useful for performing interprocedural constant propagation, which enables routines with incoming loop bounds information to be available at compile time. This ability can be useful for driving optimization decisions. Figure 5-16 shows an example.
Figure 5-16 Loops With and Without Interprocedural Analysis
In this example, the value of n is used as a loop bound within the subroutine foo. In the absence of IPA, the compiler assumes that the values of n is modified inside the call to subroutine foo. Moreover, the value of n on entry to subroutine foo will not be known at compile-time. As a result, the compiler must generate multiversion code when parallelizing the j loop within foo. However, with IPA turned on, it is possible to know (at compile time) the value of n on entry to foo at the call site. This information is then used by the automatic parallelizer to decide how to parallelize the j loop in subroutine foo. If the value of n is small, the loop may not be profitably parallelized. On the other hand, if the value of n is large, the loop gets parallelized for profitable execution.
In short, IPA provides a mechanism to propagate information across procedure boundaries without having to inline calls, thereby increasing the effectiveness of all optimizations.
5.3 Porting Cray Code
The MIPSpro compilers have a number of value-added features and extensions to facilitate easy migration of existing Cray Fortran programs over to SGI systems. Important Cray extensions are accepted as-is by the MIPSpro compilers. A large portion of the remaining Cray features are handled by providing equivalent functionality in the MIPSpro compilation system. Before examining the degree of Cray compatibility provided by the MIPSpro compilers, it is worthwhile to understand the key features of Cray's compilation system. Cray extensions fall into three categories:
Cray Fortran 77 supports the 64-bit programming model for most elementary datatypes. For instance, basic datatypes like REAL, INTEGER and LOGICAL are 64-bit quantities for Cray. Support also exists for the quadruple-precision COMPLEX data type. Table 5-6 illustrates the equivalent functionality supported by the MIPSpro compilation system.
Table 5-6 Cray Fortran 77 data type extensions
Cray Fortran Corresponding SGII Comments Extension Functionality
real real*8 Replace real with real*8 or turn on the command-line option -r8, which promotes all real data types to 64-bit quantities. integer integer*8 Replace integer declarations with integer*8, or turn on the command-line option -i8 to promote all integer declarations to become 64-bit quantities. The -i8 flag provides a convenient way for users to selectively promote integers to be 64-bit quantities. This enables default integer data type size to remain as a 32-bit quantity, thereby preventing integer data cache conflicts that may result from larger integer quantities. Thus, the -i8 flag provides a much more flexible, performance-oriented approach to manage 64-bit integer data types. Library routines called from user applications should conform to the integer*8 format. logical logical*8 Replace logical with logical*8 if a 64-bit quantity is required. complex complex*32 Use complex*32 to get quad precision support.
Cray Fortran 77 also supports Fortran 90 style indexed array syntax. Table 5-7 illustrates the equivalent SGI functionality for handling Cray Fortran 77 array extensions. The MIPSpro Fortran 77 supports the Fortran 90 style array syntax. MIPSpro Fortran 77 also supports dynamic allocation of arrays.
Table 5-7 Cray Fortran 77 Array Syntax
Cray Program with Indexed Equivalent SGI Program and Vector-valued Array Section Selectors
dimension A(10), B(10), dimension A(10), B(10), C(5) C(5), TEMP(5) dimension X(5, 5), Y(2:10) dimension X(5, 5), Y(2:10) A = B do i = 1, 10 A(i) = B(i) enddo C = A(3: 7) do i = 1, 5 C(i) = A( i + 2 ) enddo B(1: 5) = X(3, 1: 5) do i = 1 , 5 B(i) = X(3, i) enddo A = sin(B) do i = 1, 10 A(i) = sin( B (i) ) enddo TEMP = A(C) do i = 1 , 5 TEMP(i) = A(C(i)) enddo TEMP = X(1, C) do i = 1, 5 TEMP(i) = X(1, C(i)) enddo TEMP = A (C + C) do i = 1 , 5 TEMP (i) = A ( C(i) + C(i) ) enddo
5.3.2 Compiler Directives for Scalar and Vector Performance
Cray Fortran supports a few directives to improve the performance of scalar and vector code. Notable vectorization directives are listed in the following table with the equivalent SGI functionality.
Table 5-8 Cray Fortran 77 Vectorization Directives
Cray Fortran SGI's Corresponding Functionality Vectorization Directive
cdir$ivdep This directive informs the Cray compiler that there is no data dependency in the loop that follows this directive, thus ensuring complete vectorization of the loop. This directive is accepted as is by the MIPSpro compilers. It tells the compiler to be less strict when deciding whether it can get some sort of parallelism between loop iterations. By default, the compiler makes conservative assumptions about possible memory reference conflicts. The directive allows the compiler to be more aggressive about such assumptions. Thus, superscalar optimizations like software pipelining can benefit from recognizing this directive. cdir$nextscalar This directive informs the compiler to generate only scalar (nonparallel) instructions for the loop that immediately follows this directive. The MIPSpro compiler accepts this directive as is and adheres to the original meaning of this directive.
The Cray compilers support three different techniques to parallelize Fortran code. These are:
To summarize, the MIPSpro compilers have sufficient functionality built into them to facilitate smooth migration of Cray code onto SGI platforms. Important Cray directives like c$dir ivdep are recognized and processed to improve the performance of superscalar optimizations like software pipelining, while other vector specific directives like c$dir vector are essentially ignored by the MIPSpro compilers. In terms of general purpose optimizations, the MIPSpro compilers favor superscalar optimizations like software pipelining and global speculative code motion while the Cray compilers attempt vectorization for best performance.
Table 5-9 Cray Parallel Processing Functionality Cray's Parallel SGI's Equivalent Capability Processing Functionality
The "cmic$parallel" directive The PCF directive "c*KAP* is used to declare a parallel parallel region" provides region in Cray Fortran. equivalent functionality. cmic$ parallel [ if (exp) ] c*KAP* parallel region [ if (exp) ] cmic$ shared ( var, ... ) c*KAP* shared ( var, ... ) cmic$ private ( var, ... ) c*KAP* local ( var, ... ) Parallel code includes Parallel code includes loops with independent loops with independent iterations, adjacent iterations, adjacent independent blocks of independent blocks of code, critical sections of code, critical sections or a combination of or a combination of all all these cases. these cases. cmic$ endparallel c*KAP* end parallel region The "cmic$doall" directive is This functionality can be used for specifying loop-level replicated by using SGI's parallelism in Cray Fortran. Fortran MP "c$doacross" directive. cmic$ doall [ if (exp) ] c$doacross [ if (exp) ] cmic$ shared ( var, ... ) c$and shared ( var, ... ) cmic$ private ( var, ... ) c$and last local ( var, ... ) cmic$ [single |chunksize (n) |numchunks (n) | c$and last local ( var, ... ) cmic$ guided \ vector] [ savelast ] c$and [ mp_schedtype=type, chunk=n ] do i = n1, n2 do i = n1, n2 .... .... enddo enddo cmic$ endparallel
5.4 References