File: OPTF4.WU of Tape: Various/Decus/decus-4
OPTF4 - Fortran IV Optimizer

OPTF4 is an OS/8 compatible program which optimizes RALF code generated
by the Fortran IV compiler.  Version 1.02 performs the following
optimizations:

  1. Unnecessary FLDA instructions are eliminated.
  2. FCLA, FADDM and FMULM instructions are implemented.
  3. Addition and subtraction of zero are eliminated (this optimizes
     some IF statements).
  4. Absolute values are computed in-line.
  5. Squaring is reduced to a single multiply instruction.

OPTF4 is designed to be included in the set of chained programs that
comprise the Fortran IV compiler.  OPTF4 executes after PASS2O (or
PASS3) and before the RALF assembler.  Optimization is activated by the
/O switch (along with the /Q switch):

  .R F4                        Compiles MAIN with optimization
  *MAIN<MAIN/O/Q               and produces MAIN.RL.

OPTF4 may also be run from the monitor.  In this case, the command
decoder is called, and one input and one output filename may be
specified.  The default output filename is SYS:FORTRN.RA and the
default extension is .RA.  Switches are:

  /R    chain to RALF
  /L    chain to RALF and then LOAD
  /G    chain to RALF, LOAD, and execute

Example:

  .R OPTF4                     Optimizes PROG1.RA, chains to RALF for
  *PROG1/L                     assembly, then chains to LOAD.  SUBR is
  *SUBR/G$                     added to the load module and the program
                               is executed.

The following changes will incorporate OPTF4.SV into the F4 chaining
sequence:

  PASS2O.SV:   5025/ 2201 1720
               5026/ 1406 2406
               5027/ 0000 6400

  PASS3.SV:    2157/ 2201 1720
               2160/ 1406 2406
               2161/ 0000 6400

These changes are good for OS/8 V3C and V3D; older versions may be
different.  Note that the filename 'RALF  ' is simply changed to
'OPTF4 '.

The following example shows optimization of some Fortran code.
Consider the program fragment,

        DIMENSION A(60,60),B(60)
        .
        .
        .
        DO 66 J=1,60
        DO 60 I=1,60
60      A(I,J)=A(I,J)*B(J)**2
66      CONTINUE

This will compile into the following RALF code (found in the file
SYS:FORTRN.RA):

        FLDA    #LIT+0000
        FSTA    J
#G0013, FLDA    #LIT+0000
        FSTA    I
#G0014,
#60,    FLDA    J
        FMUL    #LIT+0014
        FADD    I
        ATX     7
        FLDA    J
        ATX     6
        FLDA    B-0003,6
        FSTA    #BASE
        FLDA    #LIT+0003
        EXTERN  #EXPII
        JSA     #EXPII
        FMUL    A-0267,7
        FSTA    A-0267,7
        FLDA    I
        FADD    #LIT+0000
        FSTA    I
        FSUB    #LIT+0014
        JLE     #G0014
#66,    FLDA    J
        FADD    #LIT+0000
        FSTA    J
        FSUB    #LIT+0014
        JLE     #G0013

This will execute in about 4.7 sec on an FPP-12.  After optimization
with OPTF4, the code will look like:

        FLDA    #LIT+0000
        FSTA    J
#G0013, FLDA    #LIT+0000
        FSTA    I
#G0014,
#60,    FLDA    J
        FMUL    #LIT+0014
        FADD    I
        ATX     7
        FLDA    J
        ATX     6
        FLDA    B-0003,6
        FMUL    B-0003,6
        FMULM   A-0267,7
        FLDA    #LIT+0000
        FADD    I
        FSTA    I
        FSUB    #LIT+0014
        JLE     #G0014
#66,    FLDA    #LIT+0000
        FADD    J
        FSTA    J
        FSUB    #LIT+0014
        JLE     #G0013

Here the squaring has been done in-line, and the FMULM instruction has
replaced the FMUL; FSTA instructions.  This executes in 1.3 seconds.

Manual optimization, for ambitious programmers

In some cases, the programmer may wish to further optimize particular
sections of RALF code, especially those sections which are suspected of
accounting for most of the execution time.  Manual optimization will be
illustrated for the above example, but first, some observations about
Fortran-generated RALF code are useful.

  1. The compiler generates constants as needed, which are referenced
     by expressions of the sort #LIT+nnnn.
  2. The compiler generates tags of the sort #Gnnnn to use as targets
     for jump instructions.  One of these will be found near the
     beginning of each DO loop.
  3. The index of a DO loop is actually a floating point number, even
     though we think of it as an integer.  It is set, incremented, and
     tested by floating point instructions.
  4. Arrays are addressed through index registers.
     However, the compiler generates code to compute the index using
     floating point instructions, then uses the ATX instruction to put
     the computed value into an index register.

Using this information and some ingenuity, the code for the inner DO
loop in the above example can be made considerably shorter and faster.
The revisions are done by compiling with the /O and /Q options, then
editing the file SYS:FORTRN.RA (which is facilitated by using an
advanced editor such as SCROLL).

Noting that I and J serve only to index through the arrays, we will use
two index registers (6 and 7) to address the arrays, and use a third
index register (1) to count the 60 trips through the loop.  The
variable I will not be needed at all.  We proceed as follows.

The FLDA J; ATX 6 can be moved to before the inner loop, since it is
invariant during the inner loop.  Now we notice that the code FLDA J;
FMUL #LIT+0014; FADD I; ATX 7 has the effect of setting XR7 to the
value J times a constant, then incrementing XR7 at the beginning of
each trip through the loop.  So we put FLDA J; FMUL #LIT+0014; ATX 7
before the loop (taking advantage of the fact that we already have a
FLDA J there), and remember that we must increment XR7 during the loop.
XR1 is set to -60 (-74 octal), and will count trips through the loop.

Now we are ready for the loop.  The code for B(J)**2 remains as before.
A(I,J) is addressed as before, but XR7 must be pre-incremented, which
is nicely done by adding a + sign to the instruction.  Finally, the
conditional jump instruction JXN completes the loop.  The final loop is
just four instructions long.
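The transformed inner loop can be mirrored in Python, used here purely as executable pseudocode for the RALF above (the function name and list-based model of memory are illustrative, not part of OPTF4 or RALF):

```python
# The hand-optimized inner loop, modeled in Python.  XR7 (the array
# pointer, pre-incremented by the '+' in FMULM A-0267,7+) is modeled by
# a list index; XR1 is the trip counter loaded with -60 by LDX -74,1
# and counted up toward zero by JXN.

def inner_loop(a_col, b_j):
    """Multiply each element of one 60-element column of A by B(J)**2."""
    xr7 = -1                    # pre-incremented before each store
    xr1 = -60                   # LDX -74,1 (octal 74 = decimal 60)
    while True:
        square = b_j * b_j      # FLDA B-0003,6 ; FMUL B-0003,6
        xr7 += 1                # the '+' in FMULM A-0267,7+
        a_col[xr7] *= square    # FMULM A-0267,7+
        xr1 += 1                # JXN #G0014,1+ : count the trip,
        if xr1 >= 0:            # jump back while the count is still < 0
            break
    return a_col
```

Note that the squaring is recomputed each trip, exactly as in the four-instruction RALF loop; only the index arithmetic has been moved out.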
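Since hand-editing compiler output is error-prone, it is worth checking the rewritten loop against a straightforward rendering of the original Fortran on real data.  A minimal sketch, again with Python standing in for Fortran (the function names are hypothetical; the 60x60 dimensions match the DIMENSION statement in the example):

```python
# Check that the optimized loop computes the same result as a direct
# transcription of: DO 66 J / DO 60 I / A(I,J)=A(I,J)*B(J)**2.

def original(a, b):
    for j in range(60):
        for i in range(60):
            a[i][j] = a[i][j] * b[j] ** 2
    return a

def optimized(a, b):
    for j in range(60):
        square = b[j] * b[j]        # squaring reduced to one multiply
        for i in range(60):
            a[i][j] *= square
    return a
```

Running both on identical copies of the data and comparing the results is the software analogue of the advice to test the manually optimized program against the original.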
Here is the final code, which executes in .5 seconds on an FPP-12:

        FLDA    #LIT+0000
        FSTA    J
#G0013, FLDA    J
        ATX     6               /FOR ADDRESSING B(J)
        FMUL    #LIT+0014
        ATX     7               /FOR ADDRESSING A(I,J)
        LDX     -74,1           /XR1 COUNTS FROM -60 TO 0
#G0014,
#60,    FLDA    B-0003,6
        FMUL    B-0003,6        /B**2
        FMULM   A-0267,7+       /INCREMENTS XR7 FIRST
        JXN     #G0014,1+       /INCREMENTS XR1, JUMPS IF <0
#66,    FLDA    #LIT+0000
        FADD    J
        FSTA    J
        FSUB    #LIT+0014
        JLE     #G0013          /OUTER LOOP

Often a similar strategy can be employed, using index registers
advantageously and eliminating floating point instructions.  The
programmer should be careful that an eliminated variable, such as I in
the above example, is not expected to have some value later on (as 60
in the above example).  Also, some caution is needed to avoid X-register
conflicts.  The compiler assigns index registers starting with 7 and
works down toward 1, so XR1 is not likely to be used; however, this is
not always true.  Scan surrounding code, and test the manually
optimized program against the original.

                                        Eric Swanson
                                        Department of Biochemistry
                                        University of Washington
                                        March 20, 1978