
forums.silverfrost.com Welcome to the Silverfrost forums

JohnCampbell
Joined: 16 Feb 2006 Posts: 2398 Location: Sydney

Posted: Mon Feb 04, 2013 12:58 pm Post subject: FTN95 run time with allocatable arrays 


Paul,
I have a program that uses allocatable arrays in a Gaussian linear equation solver.
Running with FTN95 Ver 6.1 or 6.3, it goes to sleep !!
I've sent an email with more details and the sample code.
Could you please review the email and let me know if you can review the generated assembler code in the inner loop.
John 



PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7369 Location: Salford, UK

Posted: Mon Feb 04, 2013 3:36 pm Post subject: 


Have you sent the email to Silverfrost? 



JohnCampbell

Posted: Mon Feb 04, 2013 10:20 pm Post subject: 


Paul,
Yes I sent the email to silverfrost. It contained the code as a standalone program.
The main routine that is failing is Code:  subroutine gaussean_reduction (sk, nszf, nband)
!
! Reduce stiffness matrix using gaussean reduction
!
   integer*4 nszf, nband
   real*8 sk(nband,nszf)
!
   real*8 c
   integer*4 n, b, i, j, k, iband
!
   integer*4, allocatable, dimension(:) :: row_i
   real*8, allocatable, dimension(:) :: row_f
!
   allocate ( row_i(nband) )
   allocate ( row_f(nband) )
!
   do n = 1,nszf ! nszf=139020
!
      iband = 1
      row_i(1) = 1
      row_f(1) = sk(1,n)
      do b = 2,min (nband,nszf-n+1)
         if (sk(b,n)==0) cycle
         iband = iband+1
         row_i(iband) = b
         row_f(iband) = sk(b,n)
         sk(b,n) = sk(b,n)/sk(1,n)
      end do
!
! slow loops
      do b = 2,iband
         i = n+row_i(b)-1
         c = row_f(b)/row_f(1)
         do k = b,iband
            j = 1 + row_i(k)-row_i(b)
            sk(j,i) = sk(j,i) - c*row_f(k)
         end do
      end do
!
   end do
!
   end

The slow loop is 50 times slower than the same code built with my 2000 version of LF95.
I can't see where it could be wrong or where a variable could be mistaken for an array operation.
John 



JohnCampbell

Posted: Wed Feb 06, 2013 1:38 am Post subject: 


Paul,
Have you received the email?
If you compile and start the run, you should at least see that it is so much slower than other compilers.
The FTN95 trace I get from the latest version I emailed to you starts like this (I have never reached the end!). It uses 50 steps over 13,000 equations. Code:  Creating executable: gaussean_t2.EXE
[SK] allocated; status = 0
Generate [SK]
Count_Rate = 10000
elapsed time (sec) = 7.290000000000E-02
Reduce [SK]
at equation 1 0.00000000000 0 0
at equation 2 0.00000000000 0 0
at equation 3 9.999999999999E-05 1 0
at equation 4 0.00000000000 0 0
at equation 261 5.170000000000E-02 36 481
at equation 521 8.870000000000E-02 71 816
at equation 781 8.840000000000E-02 80 804
at equation 1041 85.1000000000 111 850889
at equation 1301 115.134100000 135 1151206
at equation 1561 114.899300000 127 1148866
at equation 1821 117.090100000 111 1170790





JohnCampbell

Posted: Wed Feb 06, 2013 5:53 am Post subject: 


Paul,
I have posted further results. They identify that the problem is with the loop:
do k = b,iband
j = row_i(k)+ii
sk(j,i) = sk(j,i) - c*row_f(k)
end do
My empirical analysis:
If j effectively increments as j = j+1 when k = k+1, everything works well, as there is forward calculation in the CPU.
If j does not step uniformly but varies, then the forward calculation in the CPU must be reset, destroying the efficiency. This reset takes a long time!!
FTN95 needs to find a better way to reset when this pre-calculation fails. Other compilers do not exhibit this problem. I have other calculations (back-substitution in linear equation solution) which exhibit this problem, but not as dramatically.
There is much more detail in the email. I hope you can receive it.
John 



PaulLaidler Site Admin

Posted: Wed Feb 06, 2013 8:38 am Post subject: 


John
I am puzzled by this. You seem to imply that the FTN95 optimisation has changed with some recent versions and I can think of no good reason for this.
If you have access to two versions of FTN95 (one good, the other bad) then I suggest that you look at the /EXPLIST assembly for each and/or put in some timing to localise the difference.
Realistically, it is very unlikely that I will have the time to identify the problem let alone provide a fix to the compiler.
However, I will track down your email and keep it to hand. 



JohnCampbell

Posted: Wed Feb 06, 2013 9:05 am Post subject: 


Paul,
This is not a recent change in FTN95; the problem has always been there.
What has changed is that I now have a small test example which demonstrates the problem.
The problem is demonstrated by comparing the results between different compilers.
I have also been able to identify how the operation of the inner loop changes for the problem to occur or not when using FTN95, with the system_clock ticks going from 60 to 18,000 per iteration of the inner two loops.
John 



JohnCampbell

Posted: Fri Feb 08, 2013 8:49 am Post subject: 


Paul,
I understand you have a lot of work on at the moment, with the 64-bit version of ClearWin+ and Windows 8 compatibility.
The problem I am reporting has been around for years. What is different now is that the problem can be demonstrated much more clearly.
Having a do loop change from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time. FTN95's recovery from this unexpected state is very different from that of the other compilers I have available. I'm not sure how this state is identified.
I might try to run the problem in sdbg and then provide more measures of performance for the condition when I know that it is failing.
John 



jalih
Joined: 30 Jul 2012 Posts: 196

Posted: Fri Feb 08, 2013 10:41 am Post subject: Re: 


JohnCampbell wrote:  Having a do loop changing from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time. 
Could you post assembly dump? 



JohnCampbell

Posted: Sun Feb 10, 2013 1:20 pm Post subject: 


I don't think the problem is that simple.
I tried to write a smaller example using the values of ROW_I from a failing case, but it did not reproduce the delays.
I think the processor is seeing that ROW_I(k+1) = ROW_I(k)+1 typically, and so when this does not occur after a lot of calculations, things get upset.
The latest test is basically Code:  subroutine gaussean_reduction (sk, nszf, nband)
!
! Reduce stiffness matrix using gaussean reduction
!
   integer*4 nszf, nband
   real*8 sk(nband,nszf)
!
   real*8 c
   integer*4 n, b, i, j, k, iband, ii, ik, ib
   real*8 sum_zero, sum_nonzero, sum_coeff, sum_iband, sum_mband
   real*8 sec_start, sec_end, t2
!
   integer*8 del_tick, sum_tick, tick_1, tick_2
   external del_tick, sum_tick
   integer*4, allocatable, dimension(:) :: row_i
   real*8, allocatable, dimension(:) :: row_f
!
   integer*4, allocatable, dimension(:) :: stat_iband
   integer*8, allocatable, dimension(:) :: stat_tick
!
! initial statistics
   write (*,*) ' '
   write (*,*) 'Reduce [SK]'
   write (*,*) ' Eqn del_sec Tick_i Tick_r'
   sum_coeff = 0   ! coefficients in matrix
   sum_nonzero = 0 ! non-zero coefficients in reduced matrix
   sum_zero = 0    ! zero coefficients in original matrix
   sum_iband = 0   ! active coefficients in each row
   sum_mband = 0   ! row length envelope
!
   do n = 1,nszf ! nszf=139020
      do b = 1,min (nband,nszf-n+1)
         sum_coeff = sum_coeff + 1
         if (sk(b,n)==0) sum_zero = sum_zero+1
      end do
   end do
!
   allocate ( row_i(nband) )
   allocate ( row_f(nband) )
!
   allocate ( stat_iband(nszf) )
   allocate ( stat_tick(nszf) )
!
   stat_tick(1) = sum_tick ()
   call elapsed_time (sec_start)
   t2 = sec_start
   tick_1 = del_tick()
   tick_1 = 0
   tick_2 = 0
!
   do n = 1,nszf ! nszf=139020
!
      if (mod(n,nszf/20)==0 .or. n < 5) then
         call elapsed_time (sec_end)
         write ( *,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2
         write (14,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2
         t2 = sec_end
         tick_1 = 0
         tick_2 = 0
      end if
      iband = 1
      row_i(1) = 1
      row_f(1) = sk(1,n)
      do b = 2,min (nband,nszf-n+1)
         if (sk(b,n)==0) cycle
         sum_nonzero = sum_nonzero+1
         iband = iband+1
         row_i(iband) = b
         row_f(b) = sk(b,n)
         sk(b,n) = sk(b,n)/sk(1,n)
      end do
      sum_iband = sum_iband + dble(iband)
      sum_mband = sum_mband + dble(row_i(iband))
      tick_1 = tick_1 + del_tick()
!




JohnCampbell

Posted: Sun Feb 10, 2013 1:22 pm Post subject: 


Code:  !
   if (row_i(iband) == iband) then
      do b = 2,iband
         ii = b-1
         i = n+ii
         c = row_f(b)/row_f(1)
         do k = b,iband
            sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)
         end do
      end do
   else
      do ib = 2,iband
         b = row_i(ib)
         ii = b-1
         i = n+ii
         c = row_f(b)/row_f(1)
         do ik = ib,iband
            k = row_i(ik)
            sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)
         end do
      end do
   end if
!
   tick_2 = tick_2 + del_tick()
   stat_tick(n) = sum_tick ()
   stat_iband(n) = iband
!
   end do
   call elapsed_time (sec_end)
!
   c = sum_nonzero/dble(nszf) + 1.0
   do i = 1,2
      if (i==1) ii = 1
      if (i==2) ii = 14
      write (ii,*) ' '
      write (ii,*) 'Number of equations =', nszf
      write (ii,*) 'maximum bandwidth   =', nband
      write (ii,*) 'average envelope    =', nint (sum_mband/dble(nszf))
      write (ii,*) 'average bandwidth   =', nint (sum_iband/dble(nszf))
      write (ii,*) 'average non-zero    =', nint (c)
      write (ii,*) ' % zero before      =', sum_zero/sum_coeff * 100.
      write (ii,*) ' % zero reduced     =', (1. - sum_nonzero/sum_coeff) * 100.
      write (ii,*) 'elapsed time (sec)  =', sec_end-sec_start
   end do
!
   do n = 1,nszf
      write (14,*) n, stat_iband(n), stat_tick(n)
   end do
   end




JohnCampbell

Posted: Mon Feb 11, 2013 7:49 am Post subject: 


Paul,
I have been trying to produce a pair of minimal-change good and bad versions, and I have finally succeeded.
I have tracked down the problem to the following routine Code:  subroutine vec_add (row_b, row_f, n, c)
   real*8 row_b(*), row_f(*), c
   integer*4 n, k
!
   do k = 1,n
      row_b(k) = row_b(k) + c*row_f(k)
   end do
   end

This is effectively the inner loop of the example program I provided previously, with Row_i removed.
If I replace "call vec_add" with "call fast_asm_dsaxpy" (the vector version that David wrote last year), then the x100 slowdown does not occur.
The bad version works OK for a while (0.0041 vs 0.0043 seconds per equation at equation 545), then slows down by a factor of 100 (1.009 vs 0.0111 seconds per equation at equation 600), then speeds up again to be much faster than the other version (0.00369 vs 0.0251 seconds per equation at equation 4500), but only after about 30 minutes of run time. The total run time is 1995 seconds, versus only 263 seconds for the fast version. Both versions produce very similar but not identical results (average error = 9.617458238719E-16 vs 9.617314123230E-16).
My guess is that the processor prefetch gets it wrong for a while and then finally learns how to get it right. The SK matrix is about 170 MB.
I shall send the two programs, plus the batch file to run them and the trace files.
The program solves 13,000 equations, and I use calls to cpu_clock@ to report the elapsed time for each equation, covering both the generation of row_f and the calls to vec_add.
It is really surprising, from this time log, how the run time slows down so dramatically, then speeds up again. If I change the values in the SK matrix, the speed-up is delayed further.
I have done many checks to (hopefully) confirm there is no out-of-bounds addressing, so I think this is as I describe. The next check could be to confirm that n is in the range 1:1700 and that the memory addresses of row_i, row_f, n and c are valid.
There must be some processor status that is not being corrected. Real*8 alignment is about the only thing I can think of ?? although I am at a loss to explain this.
I will package up the files and email them tonight.
John 



JohnCampbell

Posted: Tue Feb 26, 2013 6:37 am Post subject: 


Paul,
I have continued to investigate this problem. The run times depend on how many zeros there are in the vectors row_b and row_f.
For n = 1700, when the proportion of zeros is between 50% and 80%, there is a dramatic slowdown.
Run times for my test vary from 70 seconds to 7,000 seconds, depending on the % zero, i.e. the values in the row_f vector. The zeros are typically grouped in blocks, not randomly.
I can only guess that the problem is related to the way the CPU handles lots of zero operations. Lots of delays?
The problem is significantly mitigated (although not totally eliminated) if I use Davidb's AVX routine:
call fast_asm_dsaxpy (sk(1,i), row_f(ib), k, c)
Any ideas ?
John 



PaulLaidler Site Admin

Posted: Tue Feb 26, 2013 8:54 am Post subject: 


I don't have any ideas about this at the moment, other than that it is difficult to see what the compiler can do about this.



JohnCampbell

Posted: Tue Feb 26, 2013 11:58 am Post subject: 


Paul,
I am wondering if it could be a timing problem with the assembler instructions generated by FTN95: the response time for a zero multiply, plus the resulting (non-)addition of the product, may be different from what the generated instructions expect, producing an extended delay because the expected response has already occurred before the following instructions were ready.
This could explain the delays.
There is certainly a problem where between 20% and 50% of the vector contents are zero. Below 20% or above 50% there is no problem, but otherwise the delay is significant.
John 





