FTN95 run time with allocatable arrays Goto page 1, 2, 3, 4, 5  Next
Author Message
JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Mon Feb 04, 2013 12:58 pm    Post subject: FTN95 run time with allocatable arrays Paul, I have a program that uses allocatable arrays for a gaussean linear equation solver. Running with FTN95 Ver 6.1 or 6.3, it goes to sleep !! I've sent an email with more details and the sample code. Could you please review the email and let me know if you can review the generated assembler code in the inner loop. John   PaulLaidler Joined: 21 Feb 2005
Posts: 7369
Location: Salford, UK Posted: Mon Feb 04, 2013 3:36 pm    Post subject: Have you sent the email to Silverfrost?    JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Mon Feb 04, 2013 10:20 pm    Post subject: Paul,
Yes I sent the email to silverfrost. It contained the code as a stand-alone program.
The main routine that is failing is
 Code: subroutine gaussean_reduction (sk, nszf, nband) ! !  Reduce stiffness matrix using gaussean reduction !       integer*4 nszf, nband       real*8    sk(nband,nszf) !       real*8    c       integer*4 n, b, i, j, k, iband !       integer*4, allocatable, dimension(:) :: row_i       Real*8,    allocatable, dimension(:) :: row_f !          allocate ( row_i(nband) )          allocate ( row_f(nband) ) !          do n = 1,nszf                   ! nszf=139020 !            iband    = 1            row_i(1) = 1            row_f(1) = sk(1,n)            do b = 2,min (nband,nszf-n+1)              if (sk(b,n)==0) cycle              iband        = iband+1              row_i(iband) = b              row_f(iband) = sk(b,n)              sk(b,n)      = sk(b,n)/sk(1,n)            end do ! !     slow loops            do b = 2,iband              i = n+row_i(b)-1              c = row_f(b)/row_f(1)              do k = b,iband                j = 1 + row_i(k)-row_i(b)                sk(j,i) = sk(j,i) - c*row_f(k)              end do            end do !          end do !       end

The slow loop is 50 times slower than my 2000 version of LF95.
I can't see where it could be wrong or where a variable could be mistaken for an array operation.

John   JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Wed Feb 06, 2013 1:38 am    Post subject: Paul,
Have you received the email?
If you compile and start the run, you should at least see that it is so much slower than other compilers.
The FTN95 trace I get from the latest version I emailed to you starts like: (I have never reached the end !) It uses 50 steps over 13,000 equations.
 Code: Creating executable: gaussean_t2.EXE  [SK] allocated; status =           0    Generate [SK]  Count_Rate =                10000  elapsed time (sec)  =       7.290000000000E-02     Reduce [SK]  at equation           1        0.00000000000                          0                    0  at equation           2        0.00000000000                          0                    0  at equation           3       9.999999999999E-05                      1                    0  at equation           4        0.00000000000                          0                    0  at equation         261       5.170000000000E-02                     36                  481  at equation         521       8.870000000000E-02                     71                  816  at equation         781       8.840000000000E-02                     80                  804  at equation        1041          85.1000000000                      111               850889  at equation        1301          115.134100000                      135              1151206  at equation        1561          114.899300000                      127              1148866  at equation        1821          117.090100000                      111              1170790   JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Wed Feb 06, 2013 5:53 am    Post subject: Paul, I have posted further results. They identify the problem is with the loop: do k = b,iband j = row_i(k)+ii sk(j,i) = sk(j,i) - c*row_f(k) end do My empirical analysis: If j effectively increments j = j+1 as k = k+1, everything works well, as there is foward-calculation in the CPU. If J does not step uniformly, but varied, then the foward-calculation in the CPU must be reset, destroying the efficiency. This reset takes a long time!! FTN95 needs to determine a better way to reset when this pre-calculation is failing. Other compilers do not exhibit this problem. I have other calculations (back-substitution in linear equation solution) which exhibit this problem, but not as dramatically. There is much more detail in the email. I hope you can receive it. John   PaulLaidler Joined: 21 Feb 2005
Posts: 7369
Location: Salford, UK Posted: Wed Feb 06, 2013 8:38 am    Post subject: John I am puzzled by this. You seem to imply that the FTN95 optimisation has changed with some recent versions and I can think of no good reason for this. If you have access to two versions of FTN95 (one good, the other bad) then I suggest that you look at the /EXPLIST assembly for each and/or put in some timing to localise the difference. Realistically, it is very unlikely that I will have the time to identify the problem let alone provide a fix to the compiler. However, I will track down your email and keep it to hand.    JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Wed Feb 06, 2013 9:05 am    Post subject: Paul, This is not a recent change in FTN95, the problem has always been there. What has changed is I now have a small test example which demonstrates the problem. The demonstration of the problem is by comparing the results between different compilers. I have also been able to identify how the operation of the inner loop changes for the problem to occur or not, when using FTN95, with the system_clock ticks going from 60 to 18,000 per itteration of the inner 2 loops. John   JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Fri Feb 08, 2013 8:49 am    Post subject: Paul, I understand you have a lot of work on at the moment, with the 64 bit version of clearwin and and windows 8 compatibility. The problem I am reporting has been around for years. What is different now is that the identification of the problem is more clearly demonstrated. Having a do loop changing from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time. FTN95's recovery from an unexpected state is much different to the other compilers I have available. I'm not sure how this state is identified. I might try to run the problem in sdbg and then provide more measures of performance for the condition when I know that it is failing. John   jalih

Joined: 30 Jul 2012
Posts: 196 Posted: Fri Feb 08, 2013 10:41 am    Post subject: Re: JohnCampbell wrote: Having a do loop changing from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time.

Could you post assembly dump?   JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Sun Feb 10, 2013 1:20 pm    Post subject: I don't think the problem is that simple.
I tried to write a smaller example, where I used values of ROW_I, when it failed, but did not reproduce the delays.
I think the processor is seeing that ROW_I(k+1) = ROW_I(k)+1 typically and so when this does not occur after a lot of calculations, things get upset.
The latest test is basically
 Code: subroutine gaussean_reduction (sk, nszf, nband) ! !  Reduce stiffness matrix using gaussean reduction !       integer*4 nszf, nband       real*8    sk(nband,nszf) !       real*8    c       integer*4 n, b, i, j, k, iband, ii, ik, ib       real*8    sum_zero, sum_nonzero, sum_coeff, sum_iband, sum_mband       real*8    sec_start, sec_end, t2 !       integer*8 del_tick, sum_tick, tick_1, tick_2       external  del_tick, sum_tick       integer*4, allocatable, dimension(:) :: row_i       Real*8,    allocatable, dimension(:) :: row_f !       integer*4, allocatable, dimension(:) :: stat_iband       integer*8, allocatable, dimension(:) :: stat_tick ! !--- initial statistics          write (*,*) ' '          write (*,*) 'Reduce [SK]'          write (*,*) '               Eqn   del_sec    Tick_i    Tick_r'          sum_coeff   = 0    ! coefficients in matrix          sum_nonzero = 0    ! non-zero coefficients in reduced matriz          sum_zero    = 0    ! zero coefficients in original matriz          sum_iband   = 0    ! active coeffieients in each row          sum_mband   = 0    ! row length envelope !          do n = 1,nszf                   ! nszf=139020            do b = 1,min (nband,nszf-n+1)              sum_coeff = sum_coeff + 1              if (sk(b,n)==0) sum_zero = sum_zero+1            end do          end do !          allocate ( row_i(nband) )          allocate ( row_f(nband) ) !          allocate ( stat_iband(nszf) )          allocate ( stat_tick(nszf) ) !          stat_tick(1) = sum_tick ()          call elapsed_time (sec_start)          t2     = sec_start          tick_1 = del_tick()          tick_1 = 0          tick_2 = 0 !          do n = 1,nszf                   ! nszf=139020 !            if (mod(n,nszf/20)==0 .or. n < 5) then               call elapsed_time (sec_end)               write ( *,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2               write (14,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2               t2     = sec_end               tick_1 = 0               tick_2 = 0            end if            iband    = 1            row_i(1) = 1            row_f(1) = sk(1,n)            do b = 2,min (nband,nszf-n+1)              if (sk(b,n)==0) cycle              sum_nonzero  = sum_nonzero+1              iband        = iband+1              row_i(iband) = b              row_f(b)     = sk(b,n)              sk(b,n)      = sk(b,n)/sk(1,n)            end do            sum_iband = sum_iband + dble(iband)            sum_mband = sum_mband + dble(row_i(iband))            tick_1 = tick_1 + del_tick() !   JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Sun Feb 10, 2013 1:22 pm    Post subject: Code: !           if (row_i(iband) == iband) then             do b = 2,iband               ii = b-1               i  = n+ii               c  = row_f(b)/row_f(1)               do k = b,iband                 sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)               end do             end do           else             do ib = 2,iband               b  = row_i(ib)               ii = b-1               i  = n+ii               c  = row_f(b)/row_f(1)               do ik = ib,iband                 k = row_i(ik)                 sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)               end do             end do           end if !            tick_2 = tick_2 + del_tick()            stat_tick(n)  = sum_tick ()            stat_iband(n) = iband !          end do          call elapsed_time (sec_end) !          c = sum_nonzero/dble(nszf) + 1.0          do i = 1,2            if (i==1) ii = 1            if (i==2) ii = 14            write (ii,*) ' '            write (ii,*) 'Number of equations =', nszf            write (ii,*) 'maximum bandwidth   =', nband            write (ii,*) 'average envelope    =', nint (sum_mband/dble(nszf))            write (ii,*) 'average bandwidth   =', nint (sum_iband/dble(nszf))            write (ii,*) 'average non-zero    =', nint (c)            write (ii,*) ' % zero before      =', sum_zero/sum_coeff           * 100.            write (ii,*) ' % zero reduced     =', (1. - sum_nonzero/sum_coeff) * 100.            write (ii,*) 'elapsed time (sec)  =', sec_end-sec_start          end do !          do n = 1,nszf            write (14,*) n, stat_iband(n), stat_tick(n)          end do       end   JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Mon Feb 11, 2013 7:49 am    Post subject: Paul,

I have been trying to produce a minimum change version of good and bad and finally succeeded.
I have tracked down the problem to the following routine
 Code: subroutine vec_add (row_b, row_f, n, c)       real*8    row_b(*), row_f(*), c       integer*4 n, k !       do k = 1,n          row_b(k) = row_b(k) + c*row_f(k)       end do       end

This is effectively the inner loop of the example program I have provided previously when Row_i is removed.
If I replace "call vec_add" with "call fast_asm_dsaxpy" : the vector version that David wrote last year, then the x 100 slow down does not occur.
The bad version works ok for a while ( 0.0041 vs .0043 seconds per equation 545), then slows down by a factor of 100 ( 1.009 vs .0111 seconds per equation 600), then speeds up again to be much faster than the other version !! ( 0.00369 vs .0251 seconds per equation 4500), but only after about 30 minutes run time. The run time is 1995 seconds vs the fast version takes only 263 seconds. Both versions produce very similar but not identical results ( average error = 9.617458238719E-16 vs 9.617314123230E-16 )

I could estimate that the processor pre-fetch is getting it wrong for a while then finally learns how to get it right. The SK matrix is about 170 mb.

I shall send the two programs, plus the batch file to run them and the trace files.
The program solves 13,000 equations and I use calls to cpu_clock@, to report the elapse time for each equation, both the generation of row_f then calls to vec_add.
It is realy surprising, from this time log, how the run time slows down so dramatically, then speeds up again. If I change the values in the SK matrix, the speed up is delayed further.
I have done many checks to (hopefully) confirm there is no out of bounds addressing so I think this is as I describe. The next check could be to confirm n is in the range 1:1700 and the memory address of row_i, row_f, n and c are valid.
There must be some processor status not being corrected. real*8 alignment is about the only thing I can think of ??, although I am at a loss to explain this.

I will package up the files and email tonight.

John   JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Tue Feb 26, 2013 6:37 am    Post subject: Paul, I have continued to investigate this problem. The run times are dependent on how many zeros are in the vectors row_b and row_f. For n=1700, when the number of zeros is between 50% and 80% there is a dramatic slow down. Run times for my test vary from 70 seconds, to 7,000 seconds, depending on % zero, ie the values in the row_f vector. The zeros are typically grouped in blocks and not randomly. I can only guess that the problem is related to the way the CPU handles lots of zero operations. lots of delays ? The problem is significantly mitigated (although not totally eliminated) if I use Davidb's AVX instruction: call fast_asm_dsaxpy (sk(1,i), row_f(ib), k, c) Any ideas ? John   PaulLaidler Joined: 21 Feb 2005
Posts: 7369
Location: Salford, UK Posted: Tue Feb 26, 2013 8:54 am    Post subject: I don't have any ideas about this at the moment, other than that it is difficult to see what the compiler can do about this.    JohnCampbell

Joined: 16 Feb 2006
Posts: 2398
Location: Sydney Posted: Tue Feb 26, 2013 11:58 am    Post subject: Paul, I am wondering if it could be a timing problem with the assembler instructions generated by FTN95, where the response time for a zero multiply plus resulting product (non) addition is different to what the instructions generated by FTN95 expect, thereby generating an extended delay because the expected response had already occurred before the following instructions were ready. This could explain the delays. There is certainly a problem, where between 20% and 50% of the vector contents are zero. < 20% or > 50% do not have a problem, but otherwise the delay is significant. John   Display posts from previous: All Posts1 Day7 Days2 Weeks1 Month3 Months6 Months1 Year Oldest FirstNewest First
 All times are GMT + 1 HourGoto page 1, 2, 3, 4, 5  Next Page 1 of 5