JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Mon Oct 26, 2009 6:35 am Post subject: New Kinds |
|
|
Paul,
Following on from the discussion of KIND, is it an option to provide REAL*6 or INTEGER*6?
There was a time when all reals were calculated in the co-processor, and I thought that real*4 (and real*8) was just a truncated 80-bit real*10. Is this the case? If so, would REAL*6 be a simple extension of managing REAL*4? There is certainly a big gap in precision between R*4 and R*8, and R*6 would provide about 11 significant digits (precision).
I'm not sure of the basis of INTEGER*8 from INTEGER*4, but INTEGER*6 could be a useful alternative ?
Just a thought !
John |
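The "about 11 significant digits" estimate can be checked with a little arithmetic. The sketch below assumes two hypothetical 48-bit layouts (neither exists in x87 hardware) and applies the radix-2 formula used by the standard PRECISION intrinsic, int((p - 1) * log10(2)), where p is the number of significand bits including the hidden bit. The function name digits10 is made up for illustration.

```fortran
program real6_digits
  implicit none
  ! Existing formats, for comparison.
  print '(a, i3)', 'real*4  (p = 24): ', digits10(24)
  print '(a, i3)', 'real*8  (p = 53): ', digits10(53)
  print '(a, i3)', 'real*10 (p = 64): ', digits10(64)
  ! Hypothetical 48-bit layouts (assumptions, not real hardware formats):
  ! 1 sign + 11 exponent + 36 fraction bits, or 1 sign + 8 exponent + 39 fraction bits.
  print '(a, i3)', 'real*6, 11-bit exponent (p = 37): ', digits10(37)
  print '(a, i3)', 'real*6,  8-bit exponent (p = 40): ', digits10(40)
contains
  ! Decimal digits for a binary format with p significand bits,
  ! following the PRECISION model for radix 2.
  integer function digits10(p)
    integer, intent(in) :: p
    digits10 = int(real(p - 1) * log10(2.0))
  end function digits10
end program real6_digits
```

Under these assumptions a 6-byte real lands at roughly 10 or 11 decimal digits depending on how many bits go to the exponent, between the 6 of real*4 and the 15 of real*8.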
|
|
|
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Mon Oct 26, 2009 3:11 pm Post subject: |
|
|
John,
I'm a real believer (no pun intended) in REAL*6 and INTEGER*6. The problem is that they aren't native to (x87) coprocessors, and all the operations would need to be coded from scratch (i.e. done in software).
When I used MS Fortran, they had 2 libraries one could link with - one where the math was done largely in software, and one where it was done largely in hardware. They didn't always give the same result! In part, this was because the software REAL*8 math was done in 64 bits, whereas the coprocessor operations loaded things into 80-bit registers, so that the round-off was potentially different.
Eddie |
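The round-off effect described above can be illustrated by analogy. This is only a sketch: it uses single and double precision in place of the 64-bit and 80-bit case, because standard Fortran does not guarantee an extended kind. The same values are summed twice, once with a narrow accumulator and once with a wider one that is rounded only at the end. The repeat count and the 0.1 term are arbitrary choices.

```fortran
program accumulate_demo
  implicit none
  integer, parameter :: sp = kind(1.0)      ! default (single) real
  integer, parameter :: dp = kind(1.0d0)    ! double precision
  integer, parameter :: n = 1000000         ! arbitrary repeat count
  real(sp) :: term, sum_sp
  real(dp) :: sum_dp
  integer :: i

  term = 0.1_sp
  sum_sp = 0.0_sp
  sum_dp = 0.0_dp
  do i = 1, n
    sum_sp = sum_sp + term             ! narrow accumulator
    sum_dp = sum_dp + real(term, dp)   ! wider intermediate, same input values
  end do

  print *, 'single accumulator         :', sum_sp
  print *, 'double accumulator, rounded:', real(sum_dp, sp)
end program accumulate_demo
```

The two printed values will generally differ in their trailing digits, which is the kind of discrepancy one would expect between a library that works in 64 bits throughout and one that keeps intermediates in 80-bit registers. On an x87 build with optimisation the compiler may itself keep sum_sp in a wider register, which would narrow the gap.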
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7934 Location: Salford, UK
|
Posted: Mon Oct 26, 2009 8:37 pm Post subject: |
|
|
selected_int_kind and selected_real_kind allow you to select the precision etc. (within certain hardware limits), but these are mapped to the kinds provided by the processor and co-processor. In other words, if you asked for the equivalent of *6 then you would get *8 anyway. Providing *6 via software would be slower than the *8 provided by the hardware. |
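A minimal sketch of that mapping, using only standard intrinsics. The requests for 11 decimal digits and an integer range of 10**14 are assumptions standing in for what a 6-byte type might offer. Kind numbers themselves vary between compilers, so the program compares kinds and prints the delivered precision rather than relying on particular values.

```fortran
program kind_query
  implicit none
  integer, parameter :: k6  = selected_real_kind(6)    ! roughly REAL*4
  integer, parameter :: k11 = selected_real_kind(11)   ! stand-in for a REAL*6 request
  integer, parameter :: k15 = selected_real_kind(15)   ! roughly REAL*8
  integer, parameter :: i9  = selected_int_kind(9)     ! fits in INTEGER*4
  integer, parameter :: i14 = selected_int_kind(14)    ! stand-in for an INTEGER*6 request
  real(k6)     :: a
  real(k11)    :: b
  real(k15)    :: c
  integer(i9)  :: m
  integer(i14) :: n

  ! PRECISION and RANGE are inquiry functions, so the variables need no values.
  print *, 'asked for  6 digits, got ', precision(a)
  print *, 'asked for 11 digits, got ', precision(b)
  print *, 'asked for 15 digits, got ', precision(c)
  print *, 'asked for range  9, got  ', range(m)
  print *, 'asked for range 14, got  ', range(n)
  print *, '11-digit and 15-digit requests give the same kind: ', k11 == k15
end program kind_query
```

On a compiler that offers only the hardware real kinds, the 11-digit request should come back with the same kind as the 15-digit one.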
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Tue Oct 27, 2009 3:54 am Post subject: |
|
|
Paul,
I was under the impression that real*4 and real*8 were done in the 80-bit math co-processor. Results were stored in the word address, with truncation of the accuracy.
So my assumption for real*6 would be that the calcs would be in the coprocessor, but the truncation would be different.
This is not consistent with the statement "providing real*6 via software".
I have also seen past references to 64-bit rather than 80-bit arithmetic instructions (SSE?), which would change this assumption.
Is the co-processor no longer used and are real*4 and real*8 calculations now done differently ?
John |
|
|
|
Sebastian
Joined: 20 Feb 2008 Posts: 177
|
Posted: Tue Oct 27, 2009 8:04 am Post subject: |
|
|
Quote: | So my assumption for real*6 would be that the calcs would be in the coprocessor, but the truncation would be different. |
The fpu has no support for that. It handles 32bit (single precision), 64bit (double precision) and 80bit (extended precision) operations. If you need more information just post or read through some hardware docs like http://sandpile.org/ia32/opc_fpu.htm or the intel (amd) instruction set references. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Tue Oct 27, 2009 8:26 am Post subject: |
|
|
Is 80-bit extended precision the same as real*10 or is real*10 software implemented ? |
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7934 Location: Salford, UK
|
Posted: Tue Oct 27, 2009 9:10 am Post subject: |
|
|
Yes extended precision is the same as real*10. |
|
|
|
Sebastian
Joined: 20 Feb 2008 Posts: 177
|
Posted: Tue Oct 27, 2009 9:24 am Post subject: |
|
|
The *x usually specifies the number of bytes required for the data type (this may be awfully wrong for non-x86/non-PC Fortran implementations), so real*10 is the 10-byte = 80-bit floating point type, as Paul said. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Wed Oct 28, 2009 1:24 am Post subject: |
|
|
Paul,
I am trying to understand how real*6 could be done and real*10 is done.
My question re real*10 is : Is it hardware implemented, with all calculations done in the 80-bit math co-processor, or is that an obsolete technology?
To test this, I wrote a program that repeated a vector dot product on 1000-element arrays as real*8 or real*10, using either the dot_product intrinsic or a simple function containing a loop:-
Code: |
      REAL*10 FUNCTION VECSUM_10 (A, B, N)
!
!   Performs a vector dot product  VECSUM = [A] . [B]
!   account is taken of the leading zero terms in the vectors
!
      integer*4,                intent (in) :: n
      real*10,    dimension(n), intent (in) :: a
      real*10,    dimension(n), intent (in) :: b
!
      real*10   c
      integer*4 i
!
      c = 0
      do i = 1,n
         if (a(i) /= 0) exit
      end do
      do i = i,n
         c = c + a(i)*b(i)
      end do
!
      vecsum_10 = c
      return
!
      end
|
Compiling without /opt, the results are:-
Code: |
Test type      Routine       Seconds   Ratio
real*8 test    vecsum_8        4.28    1.00
real*8 test    dot_product     4.276   1.00
real*10 test   vecsum_10       5.515   1.29
real*10 test   dot_product     7.432   1.74
real*4 test    vecsum_4        2.923   0.68
|
Real*10 takes about 30% longer than real*8, but 74% longer using the dot_product intrinsic. Real*4 takes only 68% of the real*8 computation time.
This indicates to me that real*10 is not simply taking the 80-bit result from the math co-processor while real*8 and real*4 truncate the output. Either this or the instructions to move 4, 8 or 10 bytes take a lot of time.
Any advice ?
John |
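For anyone wanting to repeat this kind of measurement, a timing harness along the following lines could be used. It is a sketch only, not the test program from the post: the repeat count nrep and the array contents are assumptions, and only real*8 (written here as double precision) is timed so that the code stays within standard Fortran.

```fortran
program time_dot
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000          ! vector length, as in the post
  integer, parameter :: nrep = 1000000    ! repeat count (assumed)
  real(dp) :: a(n), b(n), s
  real :: t0, t1
  integer :: i, k

  ! Fill the vectors with non-zero values (contents are arbitrary).
  do i = 1, n
    a(i) = real(i, dp)
    b(i) = 1.0_dp / real(i, dp)
  end do

  s = 0.0_dp
  call cpu_time(t0)
  do k = 1, nrep
    s = s + dot_product(a, b)
  end do
  call cpu_time(t1)

  ! Printing the accumulated sum makes it harder for the compiler to drop the loop.
  print *, 'checksum:', s
  print *, 'seconds :', t1 - t0
end program time_dot
```

Changing the kind parameter and the accumulator type gives the other cases. An optimising compiler may hoist the repeated dot_product out of the loop, so timings taken with optimisation on need checking against the generated code.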
|
|
|
Sebastian
Joined: 20 Feb 2008 Posts: 177
|
Posted: Wed Oct 28, 2009 8:14 am Post subject: |
|
|
Quote: | This indicates to me that real*10 is not simply taking the 80-bit result from the math co-processor while real*8 and real*4 truncate the output. |
How do you come to that conclusion? Several implementation details of the fpu make 80bit operands the exception rather than the rule. For example, there is no operation like "add an 80bit value from memory to an fpu register" as there is for 32bit and 64bit; 80bit values always have to be loaded into a temp fpu register first. Also keep in mind that of course reading 10 bytes from memory obviously takes longer than only reading 4 or 8 bytes, especially since 10 bytes usually are laid out to occupy 16 bytes due to better access speeds. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Wed Oct 28, 2009 8:30 am Post subject: |
|
|
Sebastian wrote "How do you come to that conclusion?" I also said that "Either this or the instructions to move 4, 8 or 10 bytes take a lot of time." I just find that the ratios of 130% and 68% are big spreads for just moving bytes, as compared to floating point calculation times. Is an 80-bit fpu always used for real calculations ?
Sebastian also wrote :
Quote: | Also keep in mind that of course reading 10 bytes from memory obviously takes longer than only reading 4 or 8 bytes, especially since 10 bytes usually are laid out to occupy 16 bytes due to better access speeds. |
Again, I'm surprised how much longer it takes to read the values. And when is this 16-byte claim true ?
John |
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7934 Location: Salford, UK
|
Posted: Wed Oct 28, 2009 9:20 am Post subject: |
|
|
The answer to these questions can be researched by using /explist on the command line. This will show the assembly instructions generated by FTN95. You will then need to look up these instructions in an Intel manual.
There will be little or no software intervention except perhaps in the case of INTEGER*8. The native 32, 64 and 80 bit instructions will not be truncated unless your source code stipulates this. You will also be able to look up the timing of the native instructions.
Basically FTN95 will aim to give you the maximum precision that is available in any given situation, even to the point of sometimes using 80 bits internally when a 64 bit result is being generated.
With the speed of modern processors, the speed of a native 32 bit multiply (say) as against a 64 bit native multiply is rarely an issue. |
|
|
|
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2390 Location: Yateley, Hants, UK
|
Posted: Wed Oct 28, 2009 11:37 am Post subject: |
|
|
Speed may not be an issue, but storage is, and if (say) REAL*6 was good enough for (again, say) FE calculations, then each array would need 25% less storage than with REAL*8, so one would have arrays about a third longer to do the matrix operations in - while sticking with a 32-bit OS and the limitations of that. That puts off the evil moment when the solution has to use the hard disk .... which slows the process down hugely.
It's a very long time since I knew my way round the 8087 fpu book (8087 applications and programming), and my understanding is that first MMX and later SSE provided alternate ways to do certain math operations. I got lost at that point. None of the standard methods countenance REAL*6.
Eddie |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Wed Oct 28, 2009 2:09 pm Post subject: |
|
|
Thanks Eddie for providing the names of the more recent MMX and later SSE instructions.
I apologise, but I am not sufficiently familiar with assembler to understand what is happening in /explist.
Can't I get a clear answer to my question: is the real*x maths done in the co-processor, or with the more recent instructions ?
I am surprised by the difference in gross computation time between real*4, *8 and *10. Is the only explanation the difference in moving the necessary bytes ?
Any clear advice would be appreciated.
John |
|
|
|
Sebastian
Joined: 20 Feb 2008 Posts: 177
|
Posted: Wed Oct 28, 2009 4:26 pm Post subject: |
|
|
As far as I know MMX/SSE/SSE2 do not support 80bit registers.
Quote: | I am surprised by the difference in gross computation time between real*4, *8 and *10. Is the only explanation the difference in moving the necessary bytes ? |
As I've already noted above there are fundamental differences in how 80bit data can be used in the fpu compared to 32bit and 64bit. And the differences between 32bit and 64bit access are data loading and the time required for the respective instruction which depends on the CPU's implementation. So you'd have to ask Intel/AMD why 64bit operations are slower than 32bit. |
|