JohnCampbell
Joined: 16 Feb 2006 Posts: 2239 Location: Sydney

Posted: Mon Oct 26, 2009 6:35 am Post subject: New Kinds 


Paul,
Following on from the discussion of KIND, is it an option to provide REAL*6 or INTEGER*6?
There was a time when all reals were calculated in the coprocessor, and I thought that real*4 (and real*8) was just a truncated 80-bit real*10. Is this the case? If so, would REAL*6 be a simple extension of the way REAL*4 is managed? There is certainly a big gap in precision between R*4 and R*8, and R*6 would provide about 11 significant digits.
I'm not sure how INTEGER*8 is built on INTEGER*4, but INTEGER*6 could be a useful alternative.
Just a thought!
John 
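
The "about 11 significant digits" estimate can be checked with the rule the PRECISION intrinsic uses for a binary format, int((p - 1) * log10(2)), where p is the significand width in bits. A minimal sketch: the 40-bit significand for a 6-byte real is a hypothetical split chosen for illustration (no such type exists in FTN95), while 24, 53 and 64 bits are the IEEE single, IEEE double and x87 extended widths.

```fortran
program decimal_digits
  implicit none
  ! Significand widths in bits, counting the leading bit.
  ! 24, 53 and 64 are IEEE single, IEEE double and x87 extended.
  ! 40 is a HYPOTHETICAL width for a 6-byte real (assumption only).
  integer, parameter :: nfmt = 4
  integer, parameter :: bits(nfmt) = (/ 24, 40, 53, 64 /)
  character(len=8), parameter :: label(nfmt) = &
       (/ 'real*4  ', 'real*6 ?', 'real*8  ', 'real*10 ' /)
  integer :: k

  do k = 1, nfmt
     print '(a, i4, a, i3)', label(k), bits(k), ' bits ->', &
          decimal_precision(bits(k))
  end do

contains

  ! Decimal digits guaranteed by a binary significand of p bits,
  ! following the definition of the PRECISION intrinsic.
  integer function decimal_precision(p)
    integer, intent(in) :: p
    decimal_precision = int(real(p - 1) * log10(2.0))
  end function decimal_precision

end program decimal_digits
```

Under these assumptions the four formats come out at 6, 11, 15 and 18 decimal digits, so a 6-byte real with a 40-bit significand would sit roughly midway between real*4 and real*8.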


LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2187 Location: Yateley, Hants, UK

Posted: Mon Oct 26, 2009 3:11 pm Post subject: 


John,
I'm a real believer (no pun intended) in REAL*6 and INTEGER*6. The problem is that they aren't native to (x87) coprocessors, so all the operations would need to be coded from scratch (i.e. done in software).
When I used MS Fortran, there were two libraries one could link with: one where the math was done largely in software, and one where it was done largely in hardware. They didn't always give the same result! In part, this was because REAL*8 math was done in 64 bits, whereas the coprocessor operations loaded things into 80-bit registers, so that the roundoff was potentially different.
Eddie 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6593 Location: Salford, UK

Posted: Mon Oct 26, 2009 8:37 pm Post subject: 


selected_int_kind and selected_real_kind allow you to select the precision etc. (within certain hardware limits), but the requests are mapped to the kinds provided by the processor and coprocessor. In other words, if you asked for the equivalent of *6 you would get *8 anyway. Providing *6 via software would be slower than the *8 provided by the hardware.
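
The mapping can be observed by asking for 11 decimal digits and looking at what comes back. This is a minimal sketch in standard Fortran; the names r11, sp and dp are chosen here for illustration, and the kind numbers printed are compiler-specific.

```fortran
program kind_request
  implicit none
  ! Ask for at least 11 decimal digits, roughly what a 6-byte real might offer.
  integer, parameter :: r11 = selected_real_kind(11)
  integer, parameter :: sp  = kind(1.0)      ! default (single) real kind
  integer, parameter :: dp  = kind(1.0d0)    ! double precision kind
  real(kind=r11) :: x

  print *, 'kind returned for 11 digits:', r11
  print *, 'default real kind          :', sp
  print *, 'double precision kind      :', dp
  ! precision() is an inquiry function, so x does not need a value.
  print *, 'decimal digits delivered   :', precision(x)
  if (r11 == dp) print *, 'The request was mapped to double precision.'
end program kind_request
```

On a compiler whose only real kinds are the hardware ones, the request is expected to land on the double precision kind and to deliver more digits than were asked for.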


JohnCampbell
Joined: 16 Feb 2006 Posts: 2239 Location: Sydney

Posted: Tue Oct 27, 2009 3:54 am Post subject: 


Paul,
I was under the impression that real*4 and real*8 calculations were done in the 80-bit math coprocessor, with the results stored at the variable's address and truncated to its precision.
So my assumption for real*6 would be that the calcs would be in the coprocessor, but the truncation would be different.
This is not consistent with the statement "providing real*6 via software".
I have also seen past references to 64-bit rather than 80-bit arithmetic instructions (SSE?), which would change this assumption.
Is the coprocessor no longer used, and are real*4 and real*8 calculations now done differently?
John 


Sebastian
Joined: 20 Feb 2008 Posts: 177

Posted: Tue Oct 27, 2009 8:04 am Post subject: 


Quote:  So my assumption for real*6 would be that the calcs would be in the coprocessor, but the truncation would be different. 
The FPU has no support for that. It handles 32-bit (single precision), 64-bit (double precision) and 80-bit (extended precision) operations. If you need more information, just post here or read through some hardware docs such as http://sandpile.org/ia32/opc_fpu.htm or the Intel (AMD) instruction set references.


JohnCampbell
Joined: 16 Feb 2006 Posts: 2239 Location: Sydney

Posted: Tue Oct 27, 2009 8:26 am Post subject: 


Is 80-bit extended precision the same as real*10, or is real*10 implemented in software?


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6593 Location: Salford, UK

Posted: Tue Oct 27, 2009 9:10 am Post subject: 


Yes, extended precision is the same as real*10.


Sebastian
Joined: 20 Feb 2008 Posts: 177

Posted: Tue Oct 27, 2009 9:24 am Post subject: 


The *x usually specifies the number of bytes required for the data type (this may be awfully wrong for non-x86/non-PC Fortran implementations), so real*10 is the 10-byte = 80-bit floating point type, as Paul said.


JohnCampbell
Joined: 16 Feb 2006 Posts: 2239 Location: Sydney

Posted: Wed Oct 28, 2009 1:24 am Post subject: 


Paul,
I am trying to understand how real*6 could be done and how real*10 is done.
My question about real*10 is: is it implemented in hardware, with all calculations done in the 80-bit math coprocessor, or is that an obsolete technology?
To test this, I wrote a program that repeats a vector dot product on 1000-element arrays as real*8 or real*10, using either the dot_product intrinsic or a simple function containing a loop:
Code:
  REAL*10 FUNCTION VECSUM_10 (A, B, N)
  !
  ! Performs a vector dot product VECSUM = [A] . [B]
  ! account is taken of the leading zero terms in the vectors
  !
    integer*4,             intent (in) :: n
    real*10, dimension(n), intent (in) :: a
    real*10, dimension(n), intent (in) :: b
  !
    real*10   c
    integer*4 i
  !
    c = 0
  ! skip the leading zero terms of a
    do i = 1,n
      if (a(i) /= 0) exit
    end do
  ! accumulate from the first non-zero term (i carries over from the loop above)
    do i = i,n
      c = c + a(i)*b(i)
    end do
  !
    vecsum_10 = c
    return
  !
  END

Compiling without /opt, the results are:
Code:
  Test type      Routine       Seconds   Ratio
  real*8 test    vecsum_8       4.28     1.00
  real*8 test    dot_product    4.276    1.00
  real*10 test   vecsum_10      5.515    1.29
  real*10 test   dot_product    7.432    1.74
  real*4 test    vecsum_4       2.923    0.68

Real*10 takes about 30% longer than real*8, and 74% longer when using the dot_product intrinsic. Real*4 takes only 68% of the real*8 computation time.
This indicates to me that real*10 is not simply taking the 80-bit result from the math coprocessor while real*8 and real*4 truncate the output. Either that, or the instructions to move 4, 8 or 10 bytes take a lot of time.
Any advice?
John 
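
The timing driver itself was not posted. The following is a minimal sketch of how such a comparison might be set up for the real*8 case, written with standard kinds rather than REAL*8/REAL*10 so that it compiles on other compilers as well. The repeat count nrep, the test data and the name vecsum_8's argument names are assumptions made here, and the leading-zero skip of the posted function is left out for brevity.

```fortran
program time_dot
  implicit none
  integer, parameter :: dp   = kind(1.0d0)
  integer, parameter :: n    = 1000      ! vector length, as in the post
  integer, parameter :: nrep = 200000    ! repeat count: an assumption
  real(kind=dp) :: a(n), b(n), total
  real :: t0, t1, t2
  integer :: i, k

  ! Arbitrary non-zero test data (an assumption, not the original data).
  do i = 1, n
     a(i) = real(i, dp)
     b(i) = 1.0_dp / real(i, dp)
  end do

  ! Accumulate and print the results so the work cannot be optimised away.
  total = 0.0_dp
  call cpu_time(t0)
  do k = 1, nrep
     total = total + vecsum_8(a, b, n)
  end do
  call cpu_time(t1)
  do k = 1, nrep
     total = total + dot_product(a, b)
  end do
  call cpu_time(t2)

  print *, 'checksum       :', total
  print *, 'vecsum_8 (s)   :', t1 - t0
  print *, 'dot_product (s):', t2 - t1

contains

  ! Plain loop version of the dot product, double precision only.
  function vecsum_8(x, y, m) result(c)
    integer, intent(in) :: m
    real(kind=dp), intent(in) :: x(m), y(m)
    real(kind=dp) :: c
    integer :: j
    c = 0.0_dp
    do j = 1, m
       c = c + x(j)*y(j)
    end do
  end function vecsum_8

end program time_dot
```

Changing dp to another supported kind repeats the measurement for that type. The times depend heavily on the compiler options and the processor, so only ratios measured on the same machine are comparable.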


Sebastian
Joined: 20 Feb 2008 Posts: 177

Posted: Wed Oct 28, 2009 8:14 am Post subject: 


Quote:  This indicates to me that real*10 is not simply taking the 80-bit result from the math coprocessor while real*8 and real*4 truncate the output. 
How do you come to that conclusion? A lot of implementation details in the FPU make 80-bit usage the non-standard case. For example, there is no operation to "add an 80-bit value from memory to an FPU register", as there is for 32-bit and 64-bit values; 80-bit values always have to be loaded into a temporary FPU register first. Also keep in mind that reading 10 bytes from memory takes longer than reading only 4 or 8 bytes, especially since 10-byte values are usually laid out to occupy 16 bytes for better access speeds.
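
Whether a 10-byte value really occupies 16 bytes depends on the compiler and target, so the simplest check is to ask the compiler. A minimal sketch, assuming a compiler that supports the Fortran 2008 storage_size intrinsic and offers a kind with at least 18 decimal digits (neither is guaranteed for FTN95); sp, dp and ep are names chosen here.

```fortran
program storage_check
  implicit none
  ! Assumption: the compiler offers kinds with at least 6, 15 and 18
  ! decimal digits. If one is missing, selected_real_kind returns a
  ! negative value and this program will not compile.
  integer, parameter :: sp = selected_real_kind(6)
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: ep = selected_real_kind(18)

  ! digits() gives the significand width of the format.
  ! storage_size() (Fortran 2008) gives the bits an array element
  ! occupies, padding included.
  print *, 'single  : significand bits', digits(1.0_sp), &
           ' stored bits', storage_size(1.0_sp)
  print *, 'double  : significand bits', digits(1.0_dp), &
           ' stored bits', storage_size(1.0_dp)
  print *, 'extended: significand bits', digits(1.0_ep), &
           ' stored bits', storage_size(1.0_ep)
end program storage_check
```

If the extended line reports a 64-bit significand together with 96 or 128 stored bits, the 80-bit value is being padded in memory. A report of 80 stored bits would mean the values are packed into 10 bytes.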


JohnCampbell
Joined: 16 Feb 2006 Posts: 2239 Location: Sydney

Posted: Wed Oct 28, 2009 8:30 am Post subject: 


Sebastian wrote "How do you come to that conclusion?" I also said "Either this or the instructions to move 4, 8 or 10 bytes take a lot of time." I just find that the ratios of 130% and 68% are big spreads for just moving bytes, as compared to floating point calculation times. Is an 80-bit FPU always used for real calculations?
Sebastian also wrote:
Quote:  Also keep in mind that reading 10 bytes from memory takes longer than reading only 4 or 8 bytes, especially since 10-byte values are usually laid out to occupy 16 bytes for better access speeds. 
Again, I'm surprised how much longer it takes to read the values. And when is this 16-byte claim true?
John 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 6593 Location: Salford, UK

Posted: Wed Oct 28, 2009 9:20 am Post subject: 


The answer to these questions can be researched by using /explist on the command line. This will show the assembly instructions generated by FTN95. You will then need to look up these instructions in an Intel manual.
There will be little or no software intervention except perhaps in the case of INTEGER*8. The results of the native 32-, 64- and 80-bit instructions will not be truncated unless your source code stipulates this. You will also be able to look up the timing of the native instructions.
Basically FTN95 will aim to give you the maximum precision that is available in any given situation, even to the point of sometimes using 80 bits internally when a 64-bit result is being generated.
With the speed of modern processors, the speed of a native 32-bit multiply (say) as against a native 64-bit multiply is rarely an issue.


LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2187 Location: Yateley, Hants, UK

Posted: Wed Oct 28, 2009 11:37 am Post subject: 


Speed may not be an issue, but storage is, and if (say) REAL*6 was good enough for (again, say) FE calculations, then one would have arrays a third longer (each element taking 25% less storage) to do the matrix operations in, while sticking with a 32-bit OS and the limitations of that. That puts off the evil moment when the solution has to use the hard disk, which slows the process down hugely.
It's a very long time since I knew my way round the 8087 FPU book (8087 Applications and Programming), and my understanding is that first MMX and later SSE provided alternative ways to do certain math operations. I got lost at that point. None of the standard methods countenance REAL*6.
Eddie 
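
The storage argument is simple arithmetic: for a fixed memory budget the number of elements scales inversely with the bytes per element, so a 6-byte real needs 25% less storage per element than real*8, which corresponds to about a third more elements. A minimal sketch, assuming a hypothetical 1 GiB budget for a single array; REAL*6 appears only to show the arithmetic and is not a type FTN95 provides.

```fortran
program storage_budget
  implicit none
  ! Hypothetical memory budget for one array: 1 GiB (an assumption).
  integer, parameter :: budget = 1024*1024*1024
  ! Bytes per element; 6 is the hypothetical REAL*6.
  integer, parameter :: nbytes(4) = (/ 4, 6, 8, 10 /)
  integer :: nelem, nelem8, k

  nelem8 = budget / 8          ! elements that fit as real*8
  do k = 1, 4
     nelem = budget / nbytes(k)
     print '(a, i0, a, i12, a, f5.2)', 'real*', nbytes(k), &
          ' elements:', nelem, '  relative to real*8:', &
          real(nelem) / real(nelem8)
  end do
end program storage_budget
```

The relative figures come out at 2.00, 1.33, 1.00 and 0.80 for 4, 6, 8 and 10 bytes per element.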


JohnCampbell
Joined: 16 Feb 2006 Posts: 2239 Location: Sydney

Posted: Wed Oct 28, 2009 2:09 pm Post subject: 


Thanks, Eddie, for providing the names of the more recent MMX and SSE instructions.
I apologise, but I am not sufficiently familiar with assembler to understand what is happening in /explist.
Can I get a clear answer to my question: is the real*x maths done in the coprocessor, or by the more recent instructions?
I am surprised by the difference in gross computation time between real*4, *8 and *10. Is the only explanation the difference in moving the necessary bytes?
Any clear advice would be appreciated.
John 


Sebastian
Joined: 20 Feb 2008 Posts: 177

Posted: Wed Oct 28, 2009 4:26 pm Post subject: 


As far as I know, MMX/SSE/SSE2 do not support 80-bit registers.
Quote:  I am surprised by the difference in gross computation time between real*4, *8 and *10. Is the only explanation the difference in moving the necessary bytes? 
As I've already noted above, there are fundamental differences in how 80-bit data can be used in the FPU compared to 32-bit and 64-bit data. The differences between 32-bit and 64-bit access are the data loading and the time required for the respective instructions, which depends on the CPU's implementation. So you'd have to ask Intel/AMD why 64-bit operations are slower than 32-bit ones.
