silverfrost Site Admin
Joined: 29 Nov 2006 Posts: 136 Location: Manchester

Posted: Tue Sep 07, 2004 9:42 pm Post subject: Unexpected floating point results 


Summary
It is possible that a program running on your system will not produce the expected results from floating point calculations due to the differences between 64 bit and 80 bit precision.
Details
The following program is an example of this problem and produces unexpected results:
Code: 
INTEGER*4 MAXROW, I
PARAMETER (MAXROW = 100)
COMMON /BLOCK/ VOLUME
COMMON /BOUND/ L(MAXROW), U(MAXROW)
DOUBLE PRECISION VOLUME, L, U
VOLUME = 100D0
DO 100 I = 1, MAXROW
L(I) = .2D0
U(I) = .2D0
100 CONTINUE
CALL SUB(L, U, MAXROW)
END
SUBROUTINE SUB(L, U, ROWS)
INTEGER*4 ROWS, I
DOUBLE PRECISION L, U, VOLUME
COMMON /BLOCK/ VOLUME
DIMENSION L(*), U(*)
DO 100 I = 1, ROWS
L(I) = L(I) / VOLUME
U(I) = U(I) / VOLUME  L(I)
WRITE(*, *) I, L(I), U(I)
READ(*, *)
100 CONTINUE
RETURN
END

The problem occurs with the line U(I) = U(I) / VOLUME  L(I).
At this point each element in the array L should be 0.002D0, and the expected result of this assignment is that each element in the array U should be 0. Running the program results in the following output:
Code:  1 2.0E3 6.95E20
2 2.0E3 6.95E20 
and so on
The values output from the program have been rounded and trailing zeroes removed. The third column is for each element of the array U and should always be 0. If you examine the assembly code generated by the compiler you will find the following code has been generated:
Code:  L(I) = L(I) / VOLUME
DFLD L(I)
DFDIV VOLUME
DFSTP L(I)
U(I) = U(I) / VOLUME  L(I)
DFLD U(I)
DFDIV VOLUME
DFSUB L(I)
DFSTP U(I) 
All the arithmetic within each statement is performed to 80 bit precision. However, when the data is saved to memory it is truncated to 64 bits for the double precision variable. If this were decimal and L(I) and U(I) were both 2/3 and volume was 2, you would effectively be performing 0.3333333333  0.333333 yielding a result of 0.3333e6. In this case the value is 0.002 which is not exactly representable in binary. The result is truncated to the 64 bits when saved to memory. When the DFSUB is performed, the 80 bit value of 0.002 has a 64 bit representation of the same number subtracted from it. This leaves the residue in the coprocessor which is still a valid 64 bit number.
In order to obtain the expected answers the above code would need to be recoded as follows:
Code: 
L(I) = L(I) / VOLUME
U(I) = U(I) / VOLUME
U(I) = U(I)  L(I)

For FTN77, the program would also need to be coded with the floating point tracking option turned off (i.e. /NO_FLOATING_TRACKING)
Note that, in order to allow you to obtain the best possible floating point accuracy within expressions, Salford compilers are coded so that the coprocessor works in 80 bit mode. 
