forums.silverfrost.com :: View topic - Unexpected floating point results

forums.silverfrost.com
Welcome to the Silverfrost forums

FAQ

Memberlist

Usergroups

Profile

Unexpected floating point results

forums.silverfrost.com Forum Index -> KBase

View previous topic :: View next topic

Author

Message

silverfrost
Site Admin

Joined: 29 Nov 2006
Posts: 195
Location: Manchester

Posted: Tue Sep 07, 2004 9:42 pm Post subject: Unexpected floating point results

Summary
It is possible that a program running on your system will not produce the expected results from floating point calculations due to the differences between 64 bit and 80 bit precision.

Details

The following program is an example of this problem and produces unexpected results:

Code:

INTEGER*4 MAXROW, I
PARAMETER (MAXROW = 100)
COMMON /BLOCK/ VOLUME
COMMON /BOUND/ L(MAXROW), U(MAXROW)
DOUBLE PRECISION VOLUME, L, U
VOLUME = 100D0
DO 100 I = 1, MAXROW
L(I) = .2D0
U(I) = .2D0
100 CONTINUE
CALL SUB(L, U, MAXROW)
END

SUBROUTINE SUB(L, U, ROWS)
INTEGER*4 ROWS, I
DOUBLE PRECISION L, U, VOLUME
COMMON /BLOCK/ VOLUME
DIMENSION L(*), U(*)
DO 100 I = 1, ROWS
L(I) = L(I) / VOLUME
U(I) = U(I) / VOLUME - L(I)
WRITE(*, *) I, L(I), U(I)
READ(*, *)
100 CONTINUE
RETURN
END

The problem occurs with the line U(I) = U(I) / VOLUME - L(I).

At this point each element in the array L should be 0.002D0, and the expected result of this assignment is that each element in the array U should be 0. Running the program results in the following output:

Code:

1 2.0E-3 6.95E-20
2 2.0E-3 6.95E-20

and so on

The values output from the program have been rounded and trailing zeroes removed. The third column is for each element of the array U and should always be 0. If you examine the assembly code generated by the compiler you will find the following code has been generated:

Code:

L(I) = L(I) / VOLUME
DFLD L(I)
DFDIV VOLUME
DFSTP L(I)
U(I) = U(I) / VOLUME - L(I)
DFLD U(I)
DFDIV VOLUME
DFSUB L(I)
DFSTP U(I)

All the arithmetic within each statement is performed to 80 bit precision. However, when the data is saved to memory it is truncated to 64 bits for the double precision variable. If this were decimal and L(I) and U(I) were both 2/3 and volume was 2, you would effectively be performing 0.3333333333 - 0.333333 yielding a result of 0.3333e-6. In this case the value is 0.002 which is not exactly representable in binary. The result is truncated to the 64 bits when saved to memory. When the DFSUB is performed, the 80 bit value of 0.002 has a 64 bit representation of the same number subtracted from it. This leaves the residue in the coprocessor which is still a valid 64 bit number.

In order to obtain the expected answers the above code would need to be re-coded as follows:

Code:

L(I) = L(I) / VOLUME
U(I) = U(I) / VOLUME
U(I) = U(I) - L(I)

For FTN77, the program would also need to be coded with the floating point tracking option turned off (i.e. /NO_FLOATING_TRACKING)

Note that, in order to allow you to obtain the best possible floating point accuracy within expressions, Salford compilers are coded so that the coprocessor works in 80 bit mode.

Display posts from previous:

	forums.silverfrost.com Forum Index -> KBase	All times are GMT + 1 Hour
Page 1 of 1

You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum