forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

FTN95 run time with allocatable arrays
Goto page 1, 2, 3, 4, 5  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support
View previous topic :: View next topic  
Author Message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Mon Feb 04, 2013 12:58 pm    Post subject: FTN95 run time with allocatable arrays Reply with quote

Paul,

I have a program that uses allocatable arrays for a gaussean linear equation solver.
Running with FTN95 Ver 6.1 or 6.3, it goes to sleep !!
I've sent an email with more details and the sample code.
Could you please review the email and let me know if you can review the generated assembler code in the inner loop.

John
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7916
Location: Salford, UK

PostPosted: Mon Feb 04, 2013 3:36 pm    Post subject: Reply with quote

Have you sent the email to Silverfrost?
Back to top
View user's profile Send private message AIM Address
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Mon Feb 04, 2013 10:20 pm    Post subject: Reply with quote

Paul,
Yes I sent the email to silverfrost. It contained the code as a stand-alone program.
The main routine that is failing is
Code:
      subroutine gaussean_reduction (sk, nszf, nband)
!
!  Reduce stiffness matrix using gaussean reduction
!
      integer*4 nszf, nband
      real*8    sk(nband,nszf)
!
      real*8    c
      integer*4 n, b, i, j, k, iband
!
      integer*4, allocatable, dimension(:) :: row_i
      Real*8,    allocatable, dimension(:) :: row_f
!
         allocate ( row_i(nband) )
         allocate ( row_f(nband) )
!
         do n = 1,nszf                   ! nszf=139020
!
           iband    = 1
           row_i(1) = 1
           row_f(1) = sk(1,n)
           do b = 2,min (nband,nszf-n+1)
             if (sk(b,n)==0) cycle
             iband        = iband+1
             row_i(iband) = b
             row_f(iband) = sk(b,n)
             sk(b,n)      = sk(b,n)/sk(1,n)
           end do
!
!     slow loops
           do b = 2,iband
             i = n+row_i(b)-1
             c = row_f(b)/row_f(1)
             do k = b,iband
               j = 1 + row_i(k)-row_i(b)
               sk(j,i) = sk(j,i) - c*row_f(k)
             end do
           end do
!
         end do
!
      end

The slow loop is 50 times slower than my 2000 version of LF95.
I can't see where it could be wrong or where a variable could be mistaken for an array operation.

John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Wed Feb 06, 2013 1:38 am    Post subject: Reply with quote

Paul,
Have you received the email?
If you compile and start the run, you should at least see that it is so much slower than other compilers.
The FTN95 trace I get from the latest version I emailed to you starts like: (I have never reached the end !) It uses 50 steps over 13,000 equations.
Code:
Creating executable: gaussean_t2.EXE
 [SK] allocated; status =           0
 
 Generate [SK]
 Count_Rate =                10000
 elapsed time (sec)  =       7.290000000000E-02 
 
 Reduce [SK]
 at equation           1        0.00000000000                          0                    0
 at equation           2        0.00000000000                          0                    0
 at equation           3       9.999999999999E-05                      1                    0
 at equation           4        0.00000000000                          0                    0
 at equation         261       5.170000000000E-02                     36                  481
 at equation         521       8.870000000000E-02                     71                  816
 at equation         781       8.840000000000E-02                     80                  804
 at equation        1041          85.1000000000                      111               850889
 at equation        1301          115.134100000                      135              1151206
 at equation        1561          114.899300000                      127              1148866
 at equation        1821          117.090100000                      111              1170790
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Wed Feb 06, 2013 5:53 am    Post subject: Reply with quote

Paul,

I have posted further results. They identify the problem is with the loop:
do k = b,iband
j = row_i(k)+ii
sk(j,i) = sk(j,i) - c*row_f(k)
end do
My empirical analysis:
If j effectively increments j = j+1 as k = k+1, everything works well, as there is foward-calculation in the CPU.
If J does not step uniformly, but varied, then the foward-calculation in the CPU must be reset, destroying the efficiency. This reset takes a long time!!
FTN95 needs to determine a better way to reset when this pre-calculation is failing. Other compilers do not exhibit this problem. I have other calculations (back-substitution in linear equation solution) which exhibit this problem, but not as dramatically.
There is much more detail in the email. I hope you can receive it.

John
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7916
Location: Salford, UK

PostPosted: Wed Feb 06, 2013 8:38 am    Post subject: Reply with quote

John

I am puzzled by this. You seem to imply that the FTN95 optimisation has changed with some recent versions and I can think of no good reason for this.

If you have access to two versions of FTN95 (one good, the other bad) then I suggest that you look at the /EXPLIST assembly for each and/or put in some timing to localise the difference.

Realistically, it is very unlikely that I will have the time to identify the problem let alone provide a fix to the compiler.

However, I will track down your email and keep it to hand.
Back to top
View user's profile Send private message AIM Address
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Wed Feb 06, 2013 9:05 am    Post subject: Reply with quote

Paul,

This is not a recent change in FTN95, the problem has always been there.
What has changed is I now have a small test example which demonstrates the problem.
The demonstration of the problem is by comparing the results between different compilers.
I have also been able to identify how the operation of the inner loop changes for the problem to occur or not, when using FTN95, with the system_clock ticks going from 60 to 18,000 per itteration of the inner 2 loops.

John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Fri Feb 08, 2013 8:49 am    Post subject: Reply with quote

Paul,

I understand you have a lot of work on at the moment, with the 64 bit version of clearwin and and windows 8 compatibility.

The problem I am reporting has been around for years. What is different now is that the identification of the problem is more clearly demonstrated.
Having a do loop changing from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time. FTN95's recovery from an unexpected state is much different to the other compilers I have available. I'm not sure how this state is identified.

I might try to run the problem in sdbg and then provide more measures of performance for the condition when I know that it is failing.

John
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 196

PostPosted: Fri Feb 08, 2013 10:41 am    Post subject: Re: Reply with quote

JohnCampbell wrote:
Having a do loop changing from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time.


Could you post assembly dump?
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Sun Feb 10, 2013 1:20 pm    Post subject: Reply with quote

I don't think the problem is that simple.
I tried to write a smaller example, where I used values of ROW_I, when it failed, but did not reproduce the delays.
I think the processor is seeing that ROW_I(k+1) = ROW_I(k)+1 typically and so when this does not occur after a lot of calculations, things get upset.
The latest test is basically
Code:
      subroutine gaussean_reduction (sk, nszf, nband)
!
!  Reduce stiffness matrix using gaussean reduction
!
      integer*4 nszf, nband
      real*8    sk(nband,nszf)
!
      real*8    c
      integer*4 n, b, i, j, k, iband, ii, ik, ib
      real*8    sum_zero, sum_nonzero, sum_coeff, sum_iband, sum_mband
      real*8    sec_start, sec_end, t2
!
      integer*8 del_tick, sum_tick, tick_1, tick_2
      external  del_tick, sum_tick
      integer*4, allocatable, dimension(:) :: row_i
      Real*8,    allocatable, dimension(:) :: row_f
!
      integer*4, allocatable, dimension(:) :: stat_iband
      integer*8, allocatable, dimension(:) :: stat_tick
!
!--- initial statistics
         write (*,*) ' '
         write (*,*) 'Reduce [SK]'
         write (*,*) '               Eqn   del_sec    Tick_i    Tick_r'
         sum_coeff   = 0    ! coefficients in matrix
         sum_nonzero = 0    ! non-zero coefficients in reduced matriz
         sum_zero    = 0    ! zero coefficients in original matriz
         sum_iband   = 0    ! active coeffieients in each row
         sum_mband   = 0    ! row length envelope
!
         do n = 1,nszf                   ! nszf=139020
           do b = 1,min (nband,nszf-n+1)
             sum_coeff = sum_coeff + 1
             if (sk(b,n)==0) sum_zero = sum_zero+1
           end do
         end do
!
         allocate ( row_i(nband) )
         allocate ( row_f(nband) )
!
         allocate ( stat_iband(nszf) )
         allocate ( stat_tick(nszf) )
!
         stat_tick(1) = sum_tick ()
         call elapsed_time (sec_start)
         t2     = sec_start
         tick_1 = del_tick()
         tick_1 = 0
         tick_2 = 0
!
         do n = 1,nszf                   ! nszf=139020
!
           if (mod(n,nszf/20)==0 .or. n < 5) then
              call elapsed_time (sec_end)
              write ( *,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2
              write (14,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2
              t2     = sec_end
              tick_1 = 0
              tick_2 = 0
           end if
           iband    = 1
           row_i(1) = 1
           row_f(1) = sk(1,n)
           do b = 2,min (nband,nszf-n+1)
             if (sk(b,n)==0) cycle
             sum_nonzero  = sum_nonzero+1
             iband        = iband+1
             row_i(iband) = b
             row_f(b)     = sk(b,n)
             sk(b,n)      = sk(b,n)/sk(1,n)
           end do
           sum_iband = sum_iband + dble(iband)
           sum_mband = sum_mband + dble(row_i(iband))
           tick_1 = tick_1 + del_tick()
!
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Sun Feb 10, 2013 1:22 pm    Post subject: Reply with quote

Code:
!
          if (row_i(iband) == iband) then
            do b = 2,iband
              ii = b-1
              i  = n+ii
              c  = row_f(b)/row_f(1)
              do k = b,iband
                sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)
              end do
            end do
          else
            do ib = 2,iband
              b  = row_i(ib)
              ii = b-1
              i  = n+ii
              c  = row_f(b)/row_f(1)
              do ik = ib,iband
                k = row_i(ik)
                sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)
              end do
            end do
          end if
!
           tick_2 = tick_2 + del_tick()
           stat_tick(n)  = sum_tick ()
           stat_iband(n) = iband
!
         end do
         call elapsed_time (sec_end)
!
         c = sum_nonzero/dble(nszf) + 1.0
         do i = 1,2
           if (i==1) ii = 1
           if (i==2) ii = 14
           write (ii,*) ' '
           write (ii,*) 'Number of equations =', nszf
           write (ii,*) 'maximum bandwidth   =', nband
           write (ii,*) 'average envelope    =', nint (sum_mband/dble(nszf))
           write (ii,*) 'average bandwidth   =', nint (sum_iband/dble(nszf))
           write (ii,*) 'average non-zero    =', nint (c)
           write (ii,*) ' % zero before      =', sum_zero/sum_coeff           * 100.
           write (ii,*) ' % zero reduced     =', (1. - sum_nonzero/sum_coeff) * 100.
           write (ii,*) 'elapsed time (sec)  =', sec_end-sec_start
         end do
!
         do n = 1,nszf
           write (14,*) n, stat_iband(n), stat_tick(n)
         end do
      end
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Mon Feb 11, 2013 7:49 am    Post subject: Reply with quote

Paul,

I have been trying to produce a minimum change version of good and bad and finally succeeded.
I have tracked down the problem to the following routine
Code:
      subroutine vec_add (row_b, row_f, n, c)
      real*8    row_b(*), row_f(*), c
      integer*4 n, k
!
      do k = 1,n
         row_b(k) = row_b(k) + c*row_f(k)
      end do
      end

This is effectively the inner loop of the example program I have provided previously when Row_i is removed.
If I replace "call vec_add" with "call fast_asm_dsaxpy" : the vector version that David wrote last year, then the x 100 slow down does not occur.
The bad version works ok for a while ( 0.0041 vs .0043 seconds per equation 545), then slows down by a factor of 100 ( 1.009 vs .0111 seconds per equation 600), then speeds up again to be much faster than the other version !! ( 0.00369 vs .0251 seconds per equation 4500), but only after about 30 minutes run time. The run time is 1995 seconds vs the fast version takes only 263 seconds. Both versions produce very similar but not identical results ( average error = 9.617458238719E-16 vs 9.617314123230E-16 )

I could estimate that the processor pre-fetch is getting it wrong for a while then finally learns how to get it right. The SK matrix is about 170 mb.

I shall send the two programs, plus the batch file to run them and the trace files.
The program solves 13,000 equations and I use calls to cpu_clock@, to report the elapse time for each equation, both the generation of row_f then calls to vec_add.
It is realy surprising, from this time log, how the run time slows down so dramatically, then speeds up again. If I change the values in the SK matrix, the speed up is delayed further.
I have done many checks to (hopefully) confirm there is no out of bounds addressing so I think this is as I describe. The next check could be to confirm n is in the range 1:1700 and the memory address of row_i, row_f, n and c are valid.
There must be some processor status not being corrected. real*8 alignment is about the only thing I can think of ??, although I am at a loss to explain this.

I will package up the files and email tonight.

John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Tue Feb 26, 2013 6:37 am    Post subject: Reply with quote

Paul,

I have continued to investigate this problem. The run times are dependent on how many zeros are in the vectors row_b and row_f.
For n=1700, when the number of zeros is between 50% and 80% there is a dramatic slow down.
Run times for my test vary from 70 seconds, to 7,000 seconds, depending on % zero, ie the values in the row_f vector. The zeros are typically grouped in blocks and not randomly.
I can only guess that the problem is related to the way the CPU handles lots of zero operations. lots of delays ?
The problem is significantly mitigated (although not totally eliminated) if I use Davidb's AVX instruction:
call fast_asm_dsaxpy (sk(1,i), row_f(ib), k, c)

Any ideas ?

John
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7916
Location: Salford, UK

PostPosted: Tue Feb 26, 2013 8:54 am    Post subject: Reply with quote

I don't have any ideas about this at the moment, other than that it is
difficult to see what the compiler can do about this.
Back to top
View user's profile Send private message AIM Address
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Tue Feb 26, 2013 11:58 am    Post subject: Reply with quote

Paul,

I am wondering if it could be a timing problem with the assembler instructions generated by FTN95, where the response time for a zero multiply plus resulting product (non) addition is different to what the instructions generated by FTN95 expect, thereby generating an extended delay because the expected response had already occurred before the following instructions were ready.
This could explain the delays.

There is certainly a problem, where between 20% and 50% of the vector contents are zero. < 20% or > 50% do not have a problem, but otherwise the delay is significant.

John
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support All times are GMT + 1 Hour
Goto page 1, 2, 3, 4, 5  Next
Page 1 of 5

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group