The POWER CHALLENGE (TM) Technical Report

Chapter 5: MIPSpro(TM) Compiler Technology


This chapter consists of three sections: 5.1 The MIPSpro Compiler

The MIPSpro compilers are SGI's third-generation family of optimizing and parallelizing compilers. The compilers generate 64-bit optimized code for the MIPS R8000-based POWER CHALLENGE systems running IRIX 6.0, the new 64-bit operating system, and support Fortran 77, C, and C++. Highlights:

The compilers can be installed on IRIX 6.0- or IRIX 5.2-based SGI systems:

Figure 5-1 diagrams the relationship between MIPSpro compilers and IRIX systems.

Figure 5-1 MIPSpro Compilers and IRIX Versions

Figure 5-2 illustrates the various components of the MIPSpro compilation system. Components include the compiler back end and the Parallel C and Fortran 77 analyzer. Scalar and superscalar optimizations common to all the compilers are performed in the common back end. All parallel processing optimizations are performed in the parallel analyzer phases of the compilation system. Kuck and Associates (KAI) technology is an integral part of the MIPSpro compilation system.

The compilers support value-added extensions to

Figure 5-2 MIPSpro Compiler Components This section explains:

5.1.1 Optimizations

The compilers perform a range of general-purpose and architecture-specific optimizations to improve application performance by reducing the number of instructions executed, thus making better use of the CPU's instruction set, maximizing register use, minimizing memory references, and eliminating unused or redundant code.

A range of new optimization techniques takes maximum advantage of the new processor features such as on-chip and off-chip caches, pipelining, and superscalar chip architecture. The optimizations are applicable to a wide range of scientific and engineering applications and benefit both scalar and parallel performance of applications.

A rich assortment of command-line options can leverage different combinations of optimizations. In general, the optimizations are spread across the compilation system for better efficiency. For instance, high-level optimizations like loop interchange and loop unrolling are performed in the compiler front ends, whereas architecture-specific optimizations like software pipelining and automatic blocking are implemented in the common back end. All optimizations are fine-tuned to take advantage of the new system. Key optimizations:

5.1.2 Memory Hierarchy (Cache) Optimizations

Memory hierarchy optimizations play a key role in matching the performance capabilities of the fast superscalar processor with the relatively slower main memory system. The primary function of the cache subsystem is to bridge the gap between processor and main memory speed. In this sense the cache architecture of a modern superscalar microprocessor like the MIPS R8000 is similar to the vector register sets of traditional vector supercomputers that used vector registers to keep the processors busy with computation without having to frequently reference main memory.

Caches are based on the observation that most application programs exhibit some degree of locality of reference: programs access pieces of data that are "near" already requested data, in space and time. A program that accesses memory without regard to locality of reference might perform poorly because of a high number of cache misses. The compiler plays a crucial role in restructuring programs to reduce cache misses by interchanging loops, or by tiling or blocking loop nests so that data is consumed most efficiently by the processor. This arrangement is similar to traditional vectorizing compilers that restructured programs to fit in vector memory (registers) in pieces. In short, the compiler restructures programs so that a useful subset of the problem can fit into the cache. Thus the processor can work on patches of the original code and data from the cache memory (thereby avoiding main memory references) before moving on to the next patch.

Figure 5-3 illustrates the cache misses per FLOP as a function of different cache sizes for a broad range of scientific and engineering applications.

Figure 5-3 Working Set Effects on Scientific Applications

For cache sizes of less than 1 MB, the miss ratio becomes negligible. This experiment illustrates how a combination of a moderately large cache size and good compiler technology can reduce cache misses to a negligible amount for a large (but not all-inclusive) class of big scientific and engineering problems.

5.1.3 Automatic and User-assisted Parallelization

The MIPSpro Power compilers (MIPSpro Power Fortran 77 and MIPSpro Power C) support automatic and user-directed parallelization of Fortran and C applications for multiprocessing execution. The compilers employ automatic parallelization techniques to analyze and restructure user applications for parallel execution, as preferred by users who rely on the compilers to parallelize their applications.

The compilers also provide a comprehensive set of standards-based comment directives that enable users to assist the compiler in the parallelization process. Users can use these directives to provide additional information to the compiler to boost parallel performance.

The parallelization technology is fine-tuned to take advantage of the POWER CHALLENGE system architecture. A combination of automatic and user-assisted parallelization can lead to substantial improvements in the performance of many programs.

5.1.4 High-performance Scientific Libraries

The compilers are complemented by CHALLENGEcomplib, a comprehensive collection of scientific and math subroutine libraries that provide support for mathematical and numerical algorithms used in scientific computing. The key motivation for creating CHALLENGEcomplib is to provide standard library functionality and to improve the runtime performance of applications.

CHALLENGEcomplib is similar to scientific libraries provided by other supercomputing vendors like the Cray SCILIB, IBM ESSL and the Convex VECLIB. The library consists of two subcomponents: complib.sgimath and complib.slatec. These libraries include support for Basic Linear Algebra Subprograms (BLAS), Extended BLAS (Level 2 and Level 3), LAPACK, FFT, Convolutions, and selected routines from Cray's SCILIB and SLATEC from the Energy, Science and Technology Software Center.

5.1.5 Software Development Support

The MIPSpro compiler family includes basic debugging and program runtime analysis tools including dbx, pixie and prof.

5.2 Optimization Technology in the MIPSpro Compilation System

Optimization technology is an integral part of the MIPSpro Compilation system. Supercomputing microprocessors like the MIPS R8000 offer enormous performance potential that needs to be harnessed by advanced optimization techniques. The optimizer takes the processor, system and program characteristics into consideration when restructuring programs for performance.

Figure 5-4 illustrates three important parameters that influence the effectiveness of the optimization phases of the compilers.

Figure 5-4 Optimization Phase Parameters

This section discusses

5.2.1 Processor Architecture

The optimizer has a sound understanding of the architecture and performance characteristics of the MIPS R8000 64-bit superscalar processor and it uses this information in generating optimized code for applications. The R8000 processor implements the MIPS IV Instruction Set Architecture (ISA) and has the following features:

5.2.2 System Architecture

The POWER CHALLENGE system architecture is also taken into consideration when restructuring programs for fast execution. For example, the compilers can generate parallel code to take advantage of the multiprocessing capabilities of the POWER CHALLENGE systems. Similarly, the compilation system takes the system bus and memory performance characteristics into consideration when implementing efficient synchronization primitives for parallel execution of programs.

POWER CHALLENGE system characteristics that are important to the optimizer include:

5.2.3 Application Characteristics

The compilers perform extensive analysis of application programs to perform appropriate optimizing transformations. Different applications exhibit different characteristics that determine the degree of optimizations that can be performed. For instance, compute-intensive applications that reference data in a regular manner may be very amenable to an optimization known as loop blocking. Similarly, applications that reference different sections of data simultaneously can be parallelized by the compiler for concurrent execution. In short, application characteristics play an important role in determining the degree and effectiveness of the optimizations that can be performed.

5.2.4 Optimization Technology

The MIPSpro compilers perform a hierarchy of optimizations ranging from fine-grained instruction-level optimizations to coarse-grained parallelization through loops and tasks to reduce the execution time of applications. The optimization phases are spread across the compilation system.

Instruction-level optimizations are performed mostly in the common back end to extract maximum performance out of the R8000 superscalar processor. Common instruction-level optimizations include software pipelining, instruction scheduling, global instruction movement and register allocation.

Loop-level optimizations are performed in the early stages of the compilation process. Key loop-level optimizations include automatic loop blocking, loop interchange and loop unrolling.

Automatic loop blocking and loop interchange are memory hierarchy optimizations that take advantage of the cache architecture of the machine. Similarly, loop unrolling attempts to expose more instruction level parallelism to the optimizer for fine-grained parallelism.

Other optimizations such as loop distribution and loop fusion are important for efficient parallel execution. The compiler uses extensive analysis and transformation techniques to detect parallelism in programs. The compilation system supports a complete runtime environment for parallel execution. This runtime library is common to all the MIPSpro compilers.

Figure 5-5 illustrates the different kinds of parallelism exploited by the compilers.

Figure 5-5 Compiler Parallelisms

5.2.5 Basic Block Optimizations

Basic optimizations are performed at optimization level 1 (-O1) and are meant to create efficient scalar code at the basic block level for both C and Fortran programs. A basic-block is a sequence of statements ending with a condition or unconditional branch. Many scalar optimizations are performed at the basic block level to improve the efficiency of the generated code. The following are some of the most common optimizations performed at the basic-block level:

The compiler also performs the full range of scalar optimizations, including:

5.2.6 Global Optimizations

The compilers perform extensive global optimizations at optimization level 2 (-O2) that are usually beneficial to most applications, and are conservative in nature to ensure integrity of the results of floating point computations. The compiler performs memory hierarchy optimizations to maximize reuse of data in the cache. Optimizations performed at this level include:

5.2.7 Advanced Optimizations

The compilers perform aggressive optimizations at level 3 (-O3) that focus on maximum code quality. Considerable flexibility is provided to enable combination of optimizations at this level for maximum floating point performance. Important superscalar optimizations like software pipelining are performed at this level.

5.2.8 Floating-point Optimizations

The compilers normally generate floating-point code that conforms to the IEEE 754 floating-point standard. However, many floating-point-intensive codes that have not been written with careful attention to floating-point behavior do not require precise conformance with the source language expression evaluation standards or the IEEE 754 arithmetic standards. It is therefore possible to relax the conformance restrictions in favor of better performance. The MIPSpro compilers provide a number of different command-line options to accomplish this goal:

5.2.9 Global Code Motion

Global Code Motion (GCM) is a general-purpose optimization that redistributes the instructions among basic blocks along an execution path to improve instruction-level parallelism and make better use of machine resources.

Traditional global optimizers avoid moving instructions in cases that might cause them to be executed along control flow paths where they would not have been in the original program. The MIPSpro global optimizer does perform such code motion, called speculative code motion, because the instructions moved are executed based on speculation that they will actually prove useful. This kind of aggressive code motion is unique to the MIPSpro compilers. By default, GCM is very conservative in its speculations. However, a number of options are available to control the degree of speculation.

Valid speculative code motion must normally avoid moving operations that might cause runtime traps. As a result, turning off certain traps at runtime enables more code motion. Thus, applications that can ignore floating point exceptions in certain segments of the program can take advantage of this optimization. Similarly, applications that can ignore memory access exceptions can also take advantage of this feature. For example, in the SPECfp92 benchmark 013.spice2g6, speculative code motion can be enabled for a critical loop by turning off floating-point and memory exceptions around that loop, resulting in a healthy performance gain.

Figure 5-6 illustrates the different kinds of speculations available.

Figure 15-6 Speculation Types

What follows is a brief description of each of these optimizations (with appropriate user-level flag control):

5.2.10 Software Pipelining

Software pipelining is arguably the most important architecture-specific optimization implemented in the MIPSpro Compilers. It is a practical, efficient and general-purpose scheduling technique for exploiting fine-grained, instruction level parallelism available in modern-day superscalar processors like the MIPS R8000.

In software pipelining, iterations of loops are continuously initiated at constant intervals without having to wait for preceding iterations to complete. That is, multiple iterations, in different stages of computation, are in progress simultaneously. The steady state of this pipeline constitutes the loop body of the object code. Consider this simple DAXPY loop:

do i = 1, n
	v (i) = v (i) + X * w (i)
enddo
On the MIPS R8000 architecture, this loop can be coded in assembly language as two load instructions followed by a multiply-add instruction and a store. Figure 5-7 shows this execution order constraint.

Figure 5-7 Execution Order Constraint

This simple schedule completes one iteration of the loop body in five machine cycles. Considering that the R8000 processor allows up to two memory operations and two floating-point operations in the same cycle, the instructions above are initiated by the machine as outlined in Table\x115-1. (This schedule completes one iteration of the loop in five machine cycles):

Table 5-1 Simple Schedule for Loop Body of DAXPY


Cycle Count Memory Operations Floating-point Operations
0 load andv(i); load andw(i) madd X, w(i), v(i) 1 2 3 4 store andv(i)
If the same loop were unrolled by a factor of four, the loop body would look like:

do i = 1, n, 4 

	v (i) = v (i) + X * w (i)
	v (i + 1) = v (i + 1) + X * w (i + 1)
	v (i + 2) = v (i + 2) + X * w (i + 2)`	
	v (i + 2) = v (i + 2) + X * w (i + 2)

enddo
The unrolled loop allows for instructions from four independent iterations to execute in parallel, thereby increasing the instruction level parallelism and hence the performance of this loop. Table 5-2 illustrates the sequence of loads, madds and stores for this unrolled loop. (This schedule completes four iterations of the loop in eight machine cycles).

Table 5-2 Schedule for Loop Body of Four-way Unrolled DAXPY


Cycle Memory Operations Floating Point Operations Count
0 load andv ( i); load andw (i) madd X, w(i), v(i) 1 load andv (i + 1); load andw (i + 1) madd X, w (i+1), v (i+1) 2 load andv (i + 2); load andw (i + 2) madd X, w (i+2), v (i+2) 3 load andv (i + 3); load andw (i + 3) madd X, w (i+3), v (i+3) 4 store andv (i) 5 store andv (i + 1) 6 store andv (i + 2) 7 store andv (i + 3)
The schedule above completes four iterations of the loop body in eight cycles, thereby improving the performance to one quarter of R8000's peak megaFLOPs. However, this schedule still leaves room for improvement. Each store has to wait three cycles for its corresponding madd instruction to complete, thereby forcing the four store operations to be initiated in different cycles. In other words, the schedule above does not take advantage of the R8000's ability to do two stores in one cycle.

By using software pipelining, the loop instructions can be initiated at constant intervals such that each iteration executes a combination of loads and stores from different iterations. In the DAXPY example, this can result in a schedule that would complete two iterations every three cycles to realize significant performance improvements over the two previous schedules.

Table 5-3 illustrates the machine schedule for the software pipelined version of the above loop. To prepare properly for entry into such a loop, a prologue section of code is added that sets up the registers for the first few stores in the main loop. Similarly, in order to exit the loop properly, an epilog section is added that performs the final stores. Any preparation of registers needed for the epilog is done in the cleanup section of the code. (The main loop completes four different iterations in six machine cycles).

Table 5-3 Schedule for Software-pipelined DAXPY.


Cycle Memory Operations Floating Point Operations Count
PROLOG: t1 = load andv (i); t7 = madd X, w(i), v(i) t2 = load andw (i) t4 = load andv (i + 1); t8 = madd X, w (i+1), v (i+1) t5 = load andw (i + 1) MAINLOOP: 0 t1= load andv (i + 2); t3 = madd X, w (i+2), v (i+2) t2 = load andw (i + 2) 1 t4 = load andv (i + 3); t6 = madd X, w (i+3), v (i+3) t5 = load andw (i + 3) 2 store t7; store t8 beq CLEANUP 3 t1 = load andv (i +4); t7 = madd X, w (i + 4), v (i +4) t2 = load andw (i + 4) 4 t4 = load andv (i + 5); t8 = madd X, w (i + 5), v (i + 5) t5 = load andw (i + 5) 5 store t3; store t6 bne MAINLOOP; EPILOG: store t7; store t8 br ALLDONE CLEANUP:t7 = t3; t8 = t6 br EPILOG ALLDONE:

In the main loop, the code completes four different iterations in six cycles, which is better than the previous two schedules. Table 5-4 illustrates the performance improvement in the three cases.

Table 5-4 DAXPY Speedup Factor for Simple Schedule


Scheduling Type Performance Speedup
Simple scheduling 1 iteration in 5 cycles 1.0 4-way unrolling 4 iterations in 8 cycles 2.5 Software pipelining 4 iterations in 6 cycles 3.3

As Table 5-4 suggests, software pipelining can make a huge difference in the performance of compute-intensive applications.

Another advantage of software pipelining is its ability to generate compact code, compared to transformations like loop unrolling that can increase the program size by a noticeable amount. Compact code prevents instruction cache penalties resulting from increased code size. The most important aspect of software pipelining is its ability to generate near-optimal code for loops.

Software pipelining in the MIPSpro compilers is based on the concept of "modulo iteration-interval scheduling", a technique for software pipelining innermost loops. This is a proven technique for generating code with near-optimal performance. In effect, this technique sets a performance goal for each loop prior to scheduling and then attempts to achieve this goal by taking into account resource constraints and the program data reference constraints. Typical benchmark programs, like the SPECfp92 benchmarks, illustrate substantial improvements in performance when compiled with software pipelining.Figure 5-8 summarizes the performance and cost effectiveness of software pipelining as implemented in the MIPSpro compilers.

Figure 5-8 Cost Effectiveness of Software Pipelining

5.2.11 Pointer Optimization

For many compiler optimizations, ranging from simply holding a value in a register to the parallel execution of a loop, it is necessary to determine whether two distinct memory references designate distinct objects. If the objects are not distinct, the references are said to be aliases. When these references are pointers, there is often not enough information available within a single function or compilation unit to determine whether the two pointers are aliased. Even when enough information is available, the analysis can require substantial amounts of time and space. For example, it could require an analysis of a whole program to determine the possible values of a pointer that is a function parameter.

In short, compilers must normally be conservative in optimizing memory references involving pointers (especially in languages like C), since aliases (i.e. different ways of accessing the same memory location) may be very hard to detect. Consider the following example:

float x[100];
float *c;

void f4 ( n , p , q )
int n;
float * p;
float * q;

	for ( i = 0 ; i < n ; i ++ ) {
		p[ i ] = q[ i ] + c[ i ] + x[i] ;
	}
}

To be safe, the compiler must assume that the pointer references a, b, and c may all be aliased to each other. This in turn precludes the possibility of aggressive loop optimizations by the optimizer.

The MIPSpro compilers alleviate this general problem of aliasing by providing users with a number of different options for specifying pointer aliasing information to the compiler. The compiler can use this information to perform aggressive optimizations in the presence of pointers and thereby achieve healthy performance improvements. Table 5-5 illustrates the various user-controlled options that can be useful for improving runtime performance.

Table 5-5 User-assisted Pointer Optimizations


Flag Description
-OPT:alias=any Compiler should assume that any pair of memory references may be aliases unless proven otherwise. This is the default setting in the compiler and reflects a safe assumption by the compiler. -OPT:alias=typed Compiler assumes that any pair of memory references that are of distinct types cannot be aliased. For example: void dbl ( i , f ) { int * i; float * f; *i = *i + * i ; *f = *f + *f ; } The compiler assumes that i and f point to different memory locations as they are of different types. This can result in producing an overlapped schedule for the two calculations. -OPT:alias=unnamed Compiler can assume that pointers will never point to named objects. In the following example compiler will assume that the pointer p cannot point to the object g, and will produce an overlapped schedule for the two calculations. This is the default assumption for the pointers implicit in Fortran dummy arguments according to the ANSI standard. float g; void double (f) float* p; g = g * g ; *p = *p + *p ; } -OPT:alias=restrict Compiler should assume a very restrictive model of aliasing, where no two pointers ever point to the same memory area. This is a rather restrictive assumption, but when applied for specific well-controlled, valid cases, can produce significantly better code. vopid double (p,q) int*p; int*q; { *p = *p + *p ; *q = *q + *q ; }

5.2.12 Loop Optimizations

Most compute-intensive programs spend a significant portion of their execution time in loops; the MIPSpro compilers spend a significant portion of their time in optimizing loops in the program. Various techniques optimize the performance of loop, many of them automatically enabled by the compilers at various levels of optimizations. This section explains important loop transformations performed by the MIPSpro compilers.


do i = 1 , n , 4 v ( i ) = v ( i - 2 ) + X * w ( i ) v ( i + 1 ) = v ( i - 2 ) + X * w ( i + 1 ) do i = 1 , n v ( i + 2 ) = v ( i ) + X * w ( i + 2 ) v ( i ) = v ( i - 2 ) + X * w ( i ) v ( i + 3 ) = v ( i + 1 ) + X * w ( i + 3 ) enddo enddo
5.2.13 Parallelization

The MIPSpro Fortran 77 and C compilers fully support automatic and user-assisted parallelization. User-assisted parallelization is available in the form of comment-based directives to guide program parallelization, which is enabled by specifying the --mp flag for both Fortran and C.

The directives provide comprehensive support for specifying and controlling the degree and dynamics of parallel execution. For example, the directives can be used to specify conditional parallelism to ensure that parallelism occurs only under certain dynamic conditions. In the two examples shown in Figure 5-15, the if clause in the directive specifies the conditions for parallel execution. In the Fortran 77 example, the loop executes in parallel only if the value of jmax is greater than 1000. Similarly, the C example executes in parallel only if the value of max is greater than 1000.

Figure 5-15 Fortran and C Parallelization Examples The compilers automatically detect program parallelism by employing the technique of data dependency analysis. Both Fortran 77 (-pfa flag) and C (-pca flag) compilers have this capability. Data dependence information is also used by a number of loop transformations listed in previous sections of this chapter.

5.2.14 Procedure Inlining

The compilers provide for automatic and user-directed inlining of functions and subroutines in Fortran and C programs. Inlining is the process of replacing a function reference with the text of the function. This process eliminates function call overhead and improves the effectiveness of numerous scalar and parallel optimizations by exposing the relationships between function arguments, returned values, and the surrounding code.

The compilers provide command-line options to direct the inlining of the specified list of subroutines. Flags are available to limit inlining to routines that are referenced in deeply nested loops, where the reduced call overhead or enhanced optimization is multiplied. Options exist to perform interprocedural inlining, whereby instances of routines can be inlined across different files.

One drawback of unlimited inlining is its tendency to increase the code size of the resulting program. Uncontrolled replacement of function or subroutine calls with the actual body of the called routine can cause "code explosion," which in turn increases compile time and reduces the effectiveness of other optimizations. The technique of Interprocedural Analysis (IPA) provides the benefits of inlining without necessitating inlining the code.

5.2.15 Interprocedural Analysis (IPA)

To increase the effectiveness of the automatic parallelizer via interprocedural analysis, the -ipa[=list] option can be used to specify the degree of interprocedural analysis to be performed. By default, this option turns on interprocedural analysis on all eligible routines. IPA can be enabled on selective routines by specifying the list of routines.

IPA tracks the flow of control and data across procedure boundaries and uses this information to drive the optimization process. IPA is particularly useful for performing interprocedural constant propagation, which enables routines with incoming loop bounds information to be available at compile time. This ability can be useful for driving optimization decisions. Figure 5-16 shows an example.

Figure 5-16 Loops With and Without Interprocedural Analysis

In this example, the value of n is used as a loop bound within the subroutine foo. In the absence of IPA, the compiler assumes that the values of n is modified inside the call to subroutine foo. Moreover, the value of n on entry to subroutine foo will not be known at compile-time. As a result, the compiler must generate multiversion code when parallelizing the j loop within foo. However, with IPA turned on, it is possible to know (at compile time) the value of n on entry to foo at the call site. This information is then used by the automatic parallelizer to decide how to parallelize the j loop in subroutine foo. If the value of n is small, the loop may not be profitably parallelized. On the other hand, if the value of n is large, the loop gets parallelized for profitable execution.

In short, IPA provides a mechanism to propagate information across procedure boundaries without having to inline calls, thereby increasing the effectiveness of all optimizations.

5.3 Porting Cray Code

The MIPSpro compilers have a number of value-added features and extensions to facilitate easy migration of existing Cray Fortran programs over to SGI systems. Important Cray extensions are accepted as-is by the MIPSpro compilers. A large portion of the remaining Cray features are handled by providing equivalent functionality in the MIPSpro compilation system. Before examining the degree of Cray compatibility provided by the MIPSpro compilers, it is worthwhile to understand the key features of Cray's compilation system. Cray extensions fall into three categories:

5.3.1 Language Extensions

Cray Fortran 77 supports the 64-bit programming model for most elementary datatypes. For instance, basic datatypes like REAL, INTEGER and LOGICAL are 64-bit quantities for Cray. Support also exists for the quadruple-precision COMPLEX data type. Table 5-6 illustrates the equivalent functionality supported by the MIPSpro compilation system.

Table 5-6 Cray Fortran 77 data type extensions


Cray Fortran Corresponding SGII Comments Extension Functionality
real real*8 Replace real with real*8 or turn on the command-line option -r8, which promotes all real data types to 64-bit quantities. integer integer*8 Replace integer declarations with integer*8, or turn on the command-line option -i8 to promote all integer declarations to become 64-bit quantities. The -i8 flag provides a convenient way for users to selectively promote integers to be 64-bit quantities. This enables default integer data type size to remain as a 32-bit quantity, thereby preventing integer data cache conflicts that may result from larger integer quantities. Thus, the -i8 flag provides a much more flexible, performance-oriented approach to manage 64-bit integer data types. Library routines called from user applications should conform to the integer*8 format. logical logical*8 Replace logical with logical*8 if a 64-bit quantity is required. complex complex*32 Use complex*32 to get quad precision support.

Cray Fortran 77 also supports Fortran 90 style indexed array syntax. Table 5-7 illustrates the equivalent SGI functionality for handling Cray Fortran 77 array extensions. The MIPSpro Fortran 77 supports the Fortran 90 style array syntax. MIPSpro Fortran 77 also supports dynamic allocation of arrays.

Table 5-7 Cray Fortran 77 Array Syntax


Cray Program with Indexed Equivalent SGI Program and Vector-valued Array Section Selectors
dimension A(10), B(10), dimension A(10), B(10), C(5) C(5), TEMP(5) dimension X(5, 5), Y(2:10) dimension X(5, 5), Y(2:10) A = B do i = 1, 10 A(i) = B(i) enddo C = A(3: 7) do i = 1, 5 C(i) = A( i + 2 ) enddo B(1: 5) = X(3, 1: 5) do i = 1 , 5 B(i) = X(3, i) enddo A = sin(B) do i = 1, 10 A(i) = sin( B (i) ) enddo TEMP = A(C) do i = 1 , 5 TEMP(i) = A(C(i)) enddo TEMP = X(1, C) do i = 1, 5 TEMP(i) = X(1, C(i)) enddo TEMP = A (C + C) do i = 1 , 5 TEMP (i) = A ( C(i) + C(i) ) enddo

5.3.2 Compiler Directives for Scalar and Vector Performance

Cray Fortran supports a few directives to improve the performance of scalar and vector code. Notable vectorization directives are listed in the following table with the equivalent SGI functionality.

Table 5-8 Cray Fortran 77 Vectorization Directives


Cray Fortran SGI's Corresponding Functionality Vectorization Directive
cdir$ivdep This directive informs the Cray compiler that there is no data dependency in the loop that follows this directive, thus ensuring complete vectorization of the loop. This directive is accepted as is by the MIPSpro compilers. It tells the compiler to be less strict when deciding whether it can get some sort of parallelism between loop iterations. By default, the compiler makes conservative assumptions about possible memory reference conflicts. The directive allows the compiler to be more aggressive about such assumptions. Thus, superscalar optimizations like software pipelining can benefit from recognizing this directive. cdir$nextscalar This directive informs the compiler to generate only scalar (nonparallel) instructions for the loop that immediately follows this directive. The MIPSpro compiler accepts this directive as is and adheres to the original meaning of this directive.
5.3.3 Cray Compiler Parallel-processing Directives

The Cray compilers support three different techniques to parallelize Fortran code. These are:

Table 5-9 maps Cray's parallel processing capabilities with their equivalent SGI functionality.


Table 5-9 Cray Parallel Processing Functionality Cray's Parallel SGI's Equivalent Capability Processing Functionality
The "cmic$parallel" directive The PCF directive "c*KAP* is used to declare a parallel parallel region" provides region in Cray Fortran. equivalent functionality. cmic$ parallel [ if (exp) ] c*KAP* parallel region [ if (exp) ] cmic$ shared ( var, ... ) c*KAP* shared ( var, ... ) cmic$ private ( var, ... ) c*KAP* local ( var, ... ) Parallel code includes Parallel code includes loops with independent loops with independent iterations, adjacent iterations, adjacent independent blocks of independent blocks of code, critical sections of code, critical sections or a combination of or a combination of all all these cases. these cases. cmic$ endparallel c*KAP* end parallel region The "cmic$doall" directive is This functionality can be used for specifying loop-level replicated by using SGI's parallelism in Cray Fortran. Fortran MP "c$doacross" directive. cmic$ doall [ if (exp) ] c$doacross [ if (exp) ] cmic$ shared ( var, ... ) c$and shared ( var, ... ) cmic$ private ( var, ... ) c$and last local ( var, ... ) cmic$ [single |chunksize (n) |numchunks (n) | c$and last local ( var, ... ) cmic$ guided \ vector] [ savelast ] c$and [ mp_schedtype=type, chunk=n ] do i = n1, n2 do i = n1, n2 .... .... enddo enddo cmic$ endparallel
To summarize, the MIPSpro compilers have sufficient functionality built into them to facilitate smooth migration of Cray code onto SGI platforms. Important Cray directives like c$dir ivdep are recognized and processed to improve the performance of superscalar optimizations like software pipelining, while other vector specific directives like c$dir vector are essentially ignored by the MIPSpro compilers. In terms of general purpose optimizations, the MIPSpro compilers favor superscalar optimizations like software pipelining and global speculative code motion while the Cray compilers attempt vectorization for best performance.

5.4 References

  1. MIPSpro Fortran 77 Language Reference Manual
  2. MIPSpro POWER Fortran 77 Programmer's Guide
  3. MIPSpro POWER C Programmer's Guide
  4. MIPSpro COmpiling, Debugging, and Performance Tuning
  5. Practical Parallel Programming, Barr E. Bauer, Academic Press, Inc. Presents the practical aspects of parallel programming on Silicon Graphics multiprocessor systems.
  6. "Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors," Jaswinder Pal Singh, Anoop Gupta, and Edward Rothberg, in Proceedings of the 20th International Symposium on Computer Architecture, IEEE, New York, 1993, pp. 14-25.


    We welcome feedback and comments at webmaster@www.sgi.com.

    Copyright © 1994, 1995 Silicon Graphics, Inc.