forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Unexpected floating point results

 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> KBase
View previous topic :: View next topic  
Author Message
silverfrost
Site Admin


Joined: 29 Nov 2006
Posts: 191
Location: Manchester

PostPosted: Tue Sep 07, 2004 9:42 pm    Post subject: Unexpected floating point results Reply with quote

Summary
It is possible that a program running on your system will not produce the expected results from floating point calculations due to the differences between 64 bit and 80 bit precision.

Details

The following program is an example of this problem and produces unexpected results:
Code:

INTEGER*4 MAXROW, I
PARAMETER (MAXROW = 100)
COMMON /BLOCK/ VOLUME
COMMON /BOUND/ L(MAXROW), U(MAXROW)
DOUBLE PRECISION VOLUME, L, U
VOLUME = 100D0
DO 100 I = 1, MAXROW
L(I) = .2D0
U(I) = .2D0
100 CONTINUE
CALL SUB(L, U, MAXROW)
END

SUBROUTINE SUB(L, U, ROWS)
INTEGER*4 ROWS, I
DOUBLE PRECISION L, U, VOLUME
COMMON /BLOCK/ VOLUME
DIMENSION L(*), U(*)
DO 100 I = 1, ROWS
L(I) = L(I) / VOLUME
U(I) = U(I) / VOLUME - L(I)
WRITE(*, *) I, L(I), U(I)
READ(*, *)
100 CONTINUE
RETURN
END


The problem occurs with the line U(I) = U(I) / VOLUME - L(I).

At this point each element in the array L should be 0.002D0, and the expected result of this assignment is that each element in the array U should be 0. Running the program results in the following output:
Code:
1 2.0E-3 6.95E-20
2 2.0E-3 6.95E-20

and so on

The values output from the program have been rounded and trailing zeroes removed. The third column is for each element of the array U and should always be 0. If you examine the assembly code generated by the compiler you will find the following code has been generated:

Code:
L(I) = L(I) / VOLUME
DFLD L(I)
DFDIV VOLUME
DFSTP L(I)
U(I) = U(I) / VOLUME - L(I)
DFLD U(I)
DFDIV VOLUME
DFSUB L(I)
DFSTP U(I)


All the arithmetic within each statement is performed to 80 bit precision. However, when the data is saved to memory it is truncated to 64 bits for the double precision variable. If this were decimal and L(I) and U(I) were both 2/3 and volume was 2, you would effectively be performing 0.3333333333 - 0.333333 yielding a result of 0.3333e-6. In this case the value is 0.002 which is not exactly representable in binary. The result is truncated to the 64 bits when saved to memory. When the DFSUB is performed, the 80 bit value of 0.002 has a 64 bit representation of the same number subtracted from it. This leaves the residue in the coprocessor which is still a valid 64 bit number.

In order to obtain the expected answers the above code would need to be re-coded as follows:

Code:

L(I) = L(I) / VOLUME
U(I) = U(I) / VOLUME
U(I) = U(I) - L(I)


For FTN77, the program would also need to be coded with the floating point tracking option turned off (i.e. /NO_FLOATING_TRACKING)

Note that, in order to allow you to obtain the best possible floating point accuracy within expressions, Salford compilers are coded so that the coprocessor works in 80 bit mode.
Back to top
View user's profile Send private message Visit poster's website
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> KBase All times are GMT + 1 Hour
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group