Topic: Performance penalty from using 64-bit integers in Support

mecej4

Posts: 1912

10 Jan 2018 3:48 #21112

When working with a test code related to a compiler bug (for details, see this recent thread: http://forums.silverfrost.com/viewtopic.php?p=23660), I thought that I had found another compiler bug: a small change in the program, namely, changing two variables from 32-bit integers to 64-bit integers, seemed to make the resulting EXE hang. It turns out that the program had become about twenty times slower (800 times slower than with GFortran).

Here is the test program:

  program fbug
   implicit none
   integer, parameter :: N8 = selected_int_kind(15)
   integer, parameter :: HundredMill = 100000000
   integer(N8) :: i, j             ! could be plain integers, instead
   integer(N8) ::  s

   s = 0_N8
   do i = 1, 30
      do j=1,HundredMill
         s = s + j
      end do
      write(*,*)i,s
   end do 

   write (*,*) 's =', s
   end

Here are some timing results from this program:

gftn -m32 -O2          0.047 s
gftn -m64 -O2          0.047 s
ftn95 /opt            40.89  s
ftn95 /opt /64         2.199 s

The GCC versions were 4.8 (32-bit) and 6.2 (64-bit), and I used FTN95 8.10, all on a laptop with an i5-4200U CPU and running Windows 10 64-bit.

I think that one has to be careful about using 64-bit integers with 32-bit FTN95. The use of X87 instructions for performing 8-byte integer arithmetic is probably the root cause of the slow-down.

Changing the DO loop index variables to 32-bit integers (by removing '(N8)' on line 5 of the program, the declaration of i and j) improves the FTN95 timings, but there is still room for much improvement.

gftn -m32 -O2          3.118 s
gftn -m64 -O2          1.256 s  
ftn95 /opt /64         2.221 s
ftn95 /opt            16.594 s

It is curious that the same change (removing '(N8)') that helped speed up the EXE compiled with FTN95 caused the EXE compiled with GFortran to slow down significantly. Please note that with '(N8)' in place, the GFortran compiler is smart enough to optimize away the inner loop, which explains the apparent high speed of the EXE that it produces.
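For reference, the inner loop has the closed form 1 + 2 + ... + N = N*(N+1)/2, which is presumably what allows an optimizer to drop the loop altogether. The following is only a minimal sketch of that arithmetic, not a reconstruction of what GFortran actually emits. The names I8, N and per_pass are made up for this illustration, and selected_int_kind(18) is used on the assumption that it maps to an 8-byte integer, since the final sum has 18 digits.

```fortran
program closed_form
   implicit none
   ! Assumption: selected_int_kind(18) yields a 64-bit integer kind
   integer, parameter :: I8 = selected_int_kind(18)
   integer(I8), parameter :: N = 100000000_I8
   integer(I8) :: per_pass, s
   integer :: i

   ! Sum of 1..N without a loop: N*(N+1)/2
   per_pass = N*(N + 1_I8)/2_I8

   ! The test program adds that sum 30 times
   s = 0_I8
   do i = 1, 30
      s = s + per_pass
   end do

   write (*,*) 's =', s
end program closed_form
```

The value printed should be 150000001500000000, which can be compared with the last line written by the test program.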

PaulLaidler

Posts: 7975 Salford, UK

10 Jan 2018 4:38 #21113

I have run this code on my machine using the developers' FTN95 and I can confirm the slowness for 32 bits.

For 64 bits I get:

gftn  (not optimised)   8.9 s
ftn95 (not optimised)   9.1 s
ftn95 (optimised)       1.8 s

I used SYSTEM_CLOCK for timing and noted that gftn and ftn95 use different count rates.
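Because the count rate differs between compilers, the tick difference has to be divided by the rate that SYSTEM_CLOCK itself reports. The following is a minimal sketch of such a timing harness, not the code actually used for the figures above. The names I8, t0, t1, rate and elapsed are made up for this illustration, and it assumes that the compiler accepts 8-byte integer arguments to SYSTEM_CLOCK (a Fortran 2003 feature); with a compiler that only takes default integers, drop the kind on t0, t1 and rate.

```fortran
program time_loop
   implicit none
   ! Assumption: selected_int_kind(18) yields a 64-bit integer kind
   integer, parameter :: I8 = selected_int_kind(18)
   integer, parameter :: DP = kind(1.0d0)
   integer(I8) :: t0, t1, rate
   integer(I8) :: s, j
   real(DP) :: elapsed

   ! Read the starting tick count and the ticks-per-second rate
   call system_clock(count=t0, count_rate=rate)

   s = 0_I8
   do j = 1_I8, 100000000_I8
      s = s + j
   end do

   ! Read the ending tick count
   call system_clock(count=t1)

   ! Convert ticks to seconds using the reported rate
   ! (wrap-around of the counter is ignored in this sketch)
   elapsed = real(t1 - t0, DP)/real(rate, DP)

   ! Printing s keeps the loop result live
   write (*,*) 's =', s
   write (*,*) 'elapsed (s) =', elapsed
end program time_loop
```

Dividing by the reported rate keeps the result in seconds regardless of whether the compiler counts in milliseconds, microseconds or something else.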

JohnCampbell

Posts: 2526 Sydney

12 Jan 2018 12:23 #21119

Paul,

I am surprised by the improvement you report for your test with FTN95 /64 /opt. I have not been able to achieve similar results.

Could FTN95's slow 32-bit performance be due to 8-byte integer instructions that are either not being used or are not available in 32-bit mode? In general I have been impressed by the performance of 8-byte integers, although I mainly generate 64-bit .exe files.

Mecej4, I too have an i5-4200U (with 3 MB cache) running Windows 10 64-bit. Its performance is very disappointing in comparison to the other PCs and laptops that I have. A purchase I regret. I am now considering an i7-8700K desktop, but so often the improvements are minimal.

PaulLaidler

Posts: 7975 Salford, UK

12 Jan 2018 8:18 #21123

John

I don't know why 32 bit mode INTEGER*8 arithmetic is so slow.

mecej4

Posts: 1912

12 Jan 2018 2:33 #21127

Quoted from PaulLaidler:

John

I don't know why 32 bit mode INTEGER*8 arithmetic is so slow.

The slowdown highlighted in this thread is probably of little significance to real-life applications. Here we have created a loop that does very little but is executed billions of times. Real applications rarely do such things.

In 64-bit mode, the code that FTN95 produces for the inner loop is just six instructions long. Two of those instructions could be removed as stated in the comments following '#'.

N_6:
ADD_Q     RDI,RSI
MOV_Q     R15,RSI         # REMOVE
INC_Q     RSI
MOV_Q     R15,RSI          # REMOVE
CMP_Q     R15,100000000   # REPLACE R15 by RSI
JLE       N_6
#Storing information in registers at exit of loop
MOV_Q     S,RDI
MOV_Q     J,RSI

More importantly, the instructions make no memory references.

The corresponding 32-bit code, however, makes lots of memory references:

Label     __N6      
mov       ecx,S         
mov       eax,S[4]      
add       ecx,J         
adc       eax,J[4]      
mov       Temp@1,ecx    
mov       Temp@1[4],eax 
mov       eax,Temp@1    
mov       edi,Temp@1[4] 
mov       S[4],edi      
mov       S,eax         
mov       edi,J         
mov       ecx,J[4]      
add       edi,1_4       
adc       ecx,1_4[4]    
mov       Temp@2,edi    
mov       Temp@2[4],ecx 
mov       ecx,Temp@2    
mov       eax,Temp@2[4] 
mov       J[4],eax      
mov       J,ecx         
qfild     100000000_4   
qfild     J             
fcomip    fr0,fr1       
ffree     fr0           
jbe       __N6

There is quite a bit of copying and fetching of temporary results to/from memory. The use of X87 instructions just to test if the DO loop is done is also rather odd.

In real life, where something substantial is done inside the loop, these inefficiencies probably have a negligible effect. We just need to be careful not to use 8-byte integers for DO loop index variables unless they are necessary.
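As a minimal sketch of that advice, here is the test program with default-integer loop indices and only the accumulator kept as an 8-byte integer. It is the same change as removing '(N8)' from the declaration of i and j; the program name narrow_index and the explicit int(j, N8) conversion are additions made here for illustration (the conversion would happen implicitly anyway).

```fortran
program narrow_index
   implicit none
   integer, parameter :: N8 = selected_int_kind(15)
   integer, parameter :: HundredMill = 100000000
   integer     :: i, j      ! default-kind (32-bit) loop indices
   integer(N8) :: s         ! only the accumulator needs 8 bytes

   s = 0_N8
   do i = 1, 30
      do j = 1, HundredMill
         s = s + int(j, N8)   ! widen j explicitly before adding
      end do
   end do

   write (*,*) 's =', s
end program narrow_index
```

The loop bounds fit comfortably in a default integer, so nothing is lost by narrowing the indices; only s, which grows to 18 digits, needs the wider kind.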