File: OPTF4.WU of Tape: Various/Decus/decus-4
(Source file text) 


		OPTF4    Fortran IV Optimizer


OPTF4 is an OS/8 compatible program which optimizes RALF code
generated by the Fortran IV compiler.  Version l.02 performs the 
following optimizations:
	1.  Unnecessary FLDA instructions are eliminated,
	2.  FCLA, FADDM and FMULM instructions are implemented,
	3.  Addition and subtraction of zero are eliminated (this
optimizes some IF statements),
	4.  Absolute values are computed in-line
	5.  Squaring is reduced to a single multiply instruction.

OPTF4 is designed to be included in the set of chained programs
that comprize the Fortran IV compiler.  OPTF4 executes after 
PASS2O (or PASS3) and before the RALF assembler.  Optimization
is activated by the /O switch (along with the /Q switch):

	.R F4               Compiles MAIN with optimization
	*MAIN<MAIN/O/Q      and produces MAIN.RL.

OPTF4 may also be run from the monitor.  In this case, the command
decoder is called, and one input and one output filename may be 
specified.  The default output filename is SYS:FORTRN.RA and the 
the default extension is .RA.  Switches are:
	/R   chain to RALF
	/L   chain to RALF and then LOAD
	/G   chain to RALF, LOAD, and execute.

Example:
	.R OPTF4	Optimizes PROG1.RA, chains to RALF for
	*PROG1/L	assembly, then chains to LOAD.  SUBR is
	*SUBR/G$	added to the load module and the program
			is executed.

The following changes will incorporate OPTF4.SV into the F4 chaining
sequence:

PASS2O.SV:
	5025/ 2201  1720
	5026/ 1406  2406
	5027/ 0000  6400
PASS3.SV:
	2157/ 2201  1720
	2160/ 1406  2406
	2161/ 0000  6400
These changes are good for OS/8 V3C and V3D; older versions may be
different.  Note that the filename 'RALF  ' is simply changed to
'OPTF4 '.


The following example shows optimization of some Fortran code.
Consider the program fragment,

		DIMENSION A(60,60),B(60)
		.
		.
		.
		DO 66  J=1,60
		DO 60  I=1,60
	60	A(I,J)=A(I,J)*B(J)**2
	66	CONTINUE
	

This will compile into the following RALF code (found in the file
SYS:FORTRN.RA):
	FLDA	#LIT+0000
	FSTA	J
#G0013,	FLDA	#LIT+0000
	FSTA	I
#G0014,
#60,	FLDA	J
	FMUL	#LIT+0014
	FADD	I
	ATX	7
	FLDA	J
	ATX	6
	FLDA	B-0003,6
	FSTA	#BASE
	FLDA	#LIT+0003
	EXTERN	#EXPII
	JSA	#EXPII
	FMUL	A-0267,7
	FSTA	A-0267,7
	FLDA	I
	FADD	#LIT+0000
	FSTA	I
	FSUB	#LIT+0014
	JLE	#G0014

#66,	FLDA	J
	FADD	#LIT+0000
	FSTA	J
	FSUB	#LIT+0014
	JLE	#G0013

This will execute in about 4.7 sec, with a FPP-12.  After optimization
with OPTF4, the code will look like:
	FLDA	#LIT+0000
	FSTA	J
#G0013,	FLDA	#LIT+0000
	FSTA	I
#G0014,
#60,	FLDA	J
	FMUL	#LIT+0014
	FADD	I
	ATX	7
	FLDA	J
	ATX	6
	FLDA	B-0003,6
	FMUL	B-0003,6
	FMULM	A-0267,7
	FLDA	#LIT+0000
	FADD	I
	FSTA	I
	FSUB	#LIT+0014
	JLE	#G0014
#66,	FLDA	#LIT+0000
	FADD	J
	FSTA	J
	FSUB	#LIT+0014
	JLE	#G0013

Here the squaring has been done in-line, and the FMULM instruction has
replaced the FMUL; FSTA instructions.  This executes in 1.3 seconds.





Manual optimization, for ambitious programmers.

	In some cases, the programmer may wish to further optimize
particular sections of RALF code, especially those sections which are
suspected of accounting for most of the execution time.  Manual
optimization will be illustrated for the above example, but first,
some observations about Fortran generated RALF code are useful.
	1.  The compiler generates constants as needed, which are
referenced by expressions of the sort #LIT+nnnn.
	2.  The compiler generates tags of the sort #Gnnnn to use as
targets for jump instructions.  One of these will be found near the 
beginning of each DO loop.
	3.  The index of a DO loop is actually a floating point number,
even though we think of it as an integer.  It is set, incremented, and
tested by floating point instructions.
	4.  Arrays are addressed through index registers.  However, the
compiler generates code to compute the index using floating point
instructions, then uses the ATX instruction to put the computed value
into an index register.
	Using this information and some ingenuity, the code for the inner
DO loop in the above example can be made considerably shorter and faster.
The revisions are done by compiling with /O and /Q options, then editing
the file SYS:FORTRN.RA (which is facilitated by using an advanced editor
such as SCROLL).  We proceed as follows:
	Noting that I and J serve only to index through the arrays, we
will use two index registers (6 and 7) to address the arrays, and
use a third index register (1) to count the 60 trips through the
loop.  The variable I will not be needed at all.  We proceed as follows.
The FLDA J; ATX 6 can be moved to before the inner loop, since this
is invariant during the inner loop.  Now we notice that the code
FLDA J; FMUL #LIT+0014; FADD I; ATX 7 has the effect of setting XR7
to the value J times a constant, then incrementing XR7 at the beginning
of each trip through the loop.  So we put FLDA J; FMUL #LIT+0014;
ATX 7 before the loop (taking advantage of the fact that we already
have a FLDA J there), and remember that we must increment XR7 during
the loop.  XR1 is set to -60 (-74 octal), and will count trips through
the loop.
 	Now we are ready for the loop.  The code for B(J)**2 remains as
before.  A(I,J) is addressed as before, but XR7 must be pre-incremented,
which is nicely done by adding a + sign to the instruction.  Finally, 
the conditional jump instruction JXN completes the loop.  The final 
loop is just four instructions long.  Here is the final code, which
executes in .5 seconds on a FPP-12:


	FLDA	#LIT+0000
	FSTA	J
#G0013,	FLDA	J
	ATX	6		/FOR ADDRESSING B(J)
	FMUL	#LIT+0014
	ATX	7		/FOR ADDRESSING A(I,J)
	LDX	-74,1		/XR1 COUNTS FROM -60 TO 0
#G0014,
#60,	FLDA	B-0003,6
	FMUL	B-0003,6	/B**2
	FMULM	A-0267,7+	/INCREMENTS XR7 FIRST
	JXN	#G0014,1+	/INCREMENTS XR1, JUMPS IF <0
#66,	FLDA	#LIT+0000
	FADD	J
	FSTA	J
	FSUB	#LIT+0014
	JLE	#G0013		/OUTER LOOP




	Often a similar strategy can be employed, using index registers
advantageously and eliminating floating point instructions.  The
programmer should be careful that an eliminated variable, such as
I in the above example, is not expected to have some value later on
(as 60 in the above example).  Also, some caution is needed to avoid
X-register conflicts.  The compiler assigns index registers starting
with 7 and works down toward 1, so XR1 is not likely to be used;
however this is not always true.  Scan surrounding code and test
the manually optimized program against the original. 


Eric Swanson
Department of Biochemistry
University of Washington
March 20, l978