Topic: FTN95 run time with allocatable arrays in Support

JohnCampbell

Posts: 2526 Sydney

Back to Top

4 Feb 2013 11:58 #11510

Paul,

I have a program that uses allocatable arrays for a gaussean linear equation solver. Running with FTN95 Ver 6.1 or 6.3, it goes to sleep !! I've sent an email with more details and the sample code. Could you please review the email and let me know if you can review the generated assembler code in the inner loop.

John

PaulLaidler

Posts: 7977 Salford, UK

Back to Top

4 Feb 2013 2:36 #11512

Have you sent the email to Silverfrost?

JohnCampbell

Posts: 2526 Sydney

Back to Top

4 Feb 2013 9:20 #11514

Paul, Yes I sent the email to silverfrost. It contained the code as a stand-alone program. The main routine that is failing is subroutine gaussean_reduction (sk, nszf, nband) ! ! Reduce stiffness matrix using gaussean reduction ! integer4 nszf, nband real8 sk(nband,nszf) ! real8 c integer4 n, b, i, j, k, iband ! integer4, allocatable, dimension(:) :: row_i Real8, allocatable, dimension(:) :: row_f ! allocate ( row_i(nband) ) allocate ( row_f(nband) ) ! do n = 1,nszf ! nszf=139020 ! iband = 1 row_i(1) = 1 row_f(1) = sk(1,n) do b = 2,min (nband,nszf-n+1) if (sk(b,n)==0) cycle iband = iband+1 row_i(iband) = b row_f(iband) = sk(b,n) sk(b,n) = sk(b,n)/sk(1,n) end do ! ! slow loops do b = 2,iband i = n+row_i(b)-1 c = row_f(b)/row_f(1) do k = b,iband j = 1 + row_i(k)-row_i(b) sk(j,i) = sk(j,i) - c*row_f(k) end do end do ! end do ! end

The slow loop is 50 times slower than my 2000 version of LF95. I can't see where it could be wrong or where a variable could be mistaken for an array operation.

John

JohnCampbell

Posts: 2526 Sydney

Back to Top

6 Feb 2013 12:38 #11521

Paul, Have you received the email? If you compile and start the run, you should at least see that it is so much slower than other compilers. The FTN95 trace I get from the latest version I emailed to you starts like: (I have never reached the end !) It uses 50 steps over 13,000 equations. Creating executable: gaussean_t2.EXE [SK] allocated; status = 0

 Generate [SK]
 Count_Rate =                10000
 elapsed time (sec)  =       7.290000000000E-02  
  
 Reduce [SK]
 at equation           1        0.00000000000                          0                    0
 at equation           2        0.00000000000                          0                    0
 at equation           3       9.999999999999E-05                      1                    0
 at equation           4        0.00000000000                          0                    0
 at equation         261       5.170000000000E-02                     36                  481
 at equation         521       8.870000000000E-02                     71                  816
 at equation         781       8.840000000000E-02                     80                  804
 at equation        1041          85.1000000000                      111               850889
 at equation        1301          115.134100000                      135              1151206
 at equation        1561          114.899300000                      127              1148866
 at equation        1821          117.090100000                      111              1170790

JohnCampbell

Posts: 2526 Sydney

Back to Top

6 Feb 2013 4:53 #11522

Paul,

I have posted further results. They identify the problem is with the loop: do k = b,iband j = row_i(k)+ii sk(j,i) = sk(j,i) - c*row_f(k) end do My empirical analysis: If j effectively increments j = j+1 as k = k+1, everything works well, as there is foward-calculation in the CPU. If J does not step uniformly, but varied, then the foward-calculation in the CPU must be reset, destroying the efficiency. This reset takes a long time!! FTN95 needs to determine a better way to reset when this pre-calculation is failing. Other compilers do not exhibit this problem. I have other calculations (back-substitution in linear equation solution) which exhibit this problem, but not as dramatically. There is much more detail in the email. I hope you can receive it.

John

PaulLaidler

Posts: 7977 Salford, UK

Back to Top

6 Feb 2013 7:38 #11525

John

I am puzzled by this. You seem to imply that the FTN95 optimisation has changed with some recent versions and I can think of no good reason for this.

If you have access to two versions of FTN95 (one good, the other bad) then I suggest that you look at the /EXPLIST assembly for each and/or put in some timing to localise the difference.

Realistically, it is very unlikely that I will have the time to identify the problem let alone provide a fix to the compiler.

However, I will track down your email and keep it to hand.

JohnCampbell

Posts: 2526 Sydney

Back to Top

6 Feb 2013 8:05 #11526

Paul,

This is not a recent change in FTN95, the problem has always been there. What has changed is I now have a small test example which demonstrates the problem. The demonstration of the problem is by comparing the results between different compilers. I have also been able to identify how the operation of the inner loop changes for the problem to occur or not, when using FTN95, with the system_clock ticks going from 60 to 18,000 per itteration of the inner 2 loops.

John

JohnCampbell

Posts: 2526 Sydney

Back to Top

8 Feb 2013 7:49 #11532

Paul,

I understand you have a lot of work on at the moment, with the 64 bit version of clearwin and and windows 8 compatibility.

The problem I am reporting has been around for years. What is different now is that the identification of the problem is more clearly demonstrated. Having a do loop changing from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time. FTN95's recovery from an unexpected state is much different to the other compilers I have available. I'm not sure how this state is identified.

I might try to run the problem in sdbg and then provide more measures of performance for the condition when I know that it is failing.

John

jalih

Posts: 192

Back to Top

8 Feb 2013 9:41 #11533

Quoted from JohnCampbell Having a do loop changing from 60 system_clock cycles to 18,000 cycles is fairly dramatic. Something is causing the process to wait for a long time.

Could you post assembly dump?

JohnCampbell

Posts: 2526 Sydney

Back to Top

10 Feb 2013 12:20 #11536

I don't think the problem is that simple. I tried to write a smaller example, where I used values of ROW_I, when it failed, but did not reproduce the delays. I think the processor is seeing that ROW_I(k+1) = ROW_I(k)+1 typically and so when this does not occur after a lot of calculations, things get upset. The latest test is basically subroutine gaussean_reduction (sk, nszf, nband) ! ! Reduce stiffness matrix using gaussean reduction ! integer4 nszf, nband real8 sk(nband,nszf) ! real8 c integer4 n, b, i, j, k, iband, ii, ik, ib real8 sum_zero, sum_nonzero, sum_coeff, sum_iband, sum_mband real8 sec_start, sec_end, t2 ! integer8 del_tick, sum_tick, tick_1, tick_2 external del_tick, sum_tick integer4, allocatable, dimension(:) :: row_i Real8, allocatable, dimension(:) :: row_f ! integer4, allocatable, dimension(:) :: stat_iband integer8, allocatable, dimension(:) :: stat_tick ! !--- initial statistics write (,) ' ' write (,) 'Reduce [SK]' write (,*) ' Eqn del_sec Tick_i Tick_r' sum_coeff = 0 ! coefficients in matrix sum_nonzero = 0 ! non-zero coefficients in reduced matriz sum_zero = 0 ! zero coefficients in original matriz sum_iband = 0 ! active coeffieients in each row sum_mband = 0 ! row length envelope ! do n = 1,nszf ! nszf=139020 do b = 1,min (nband,nszf-n+1) sum_coeff = sum_coeff + 1 if (sk(b,n)==0) sum_zero = sum_zero+1 end do end do ! allocate ( row_i(nband) ) allocate ( row_f(nband) ) ! allocate ( stat_iband(nszf) ) allocate ( stat_tick(nszf) ) ! stat_tick(1) = sum_tick () call elapsed_time (sec_start) t2 = sec_start tick_1 = del_tick() tick_1 = 0 tick_2 = 0 ! do n = 1,nszf ! nszf=139020 ! if (mod(n,nszf/20)==0 .or. n < 5) then call elapsed_time (sec_end) write ( *,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2 write (14,fmt='(a,i7,f10.4,2i10)') ' at equation',n, sec_end-t2, tick_1, tick_2 t2 = sec_end tick_1 = 0 tick_2 = 0 end if iband = 1 row_i(1) = 1 row_f(1) = sk(1,n) do b = 2,min (nband,nszf-n+1) if (sk(b,n)==0) cycle sum_nonzero = sum_nonzero+1 iband = iband+1 row_i(iband) = b row_f(b) = sk(b,n) sk(b,n) = sk(b,n)/sk(1,n) end do sum_iband = sum_iband + dble(iband) sum_mband = sum_mband + dble(row_i(iband)) tick_1 = tick_1 + del_tick() !

JohnCampbell

Posts: 2526 Sydney

Back to Top

10 Feb 2013 12:22 #11537

!
          if (row_i(iband) == iband) then
            do b = 2,iband
              ii = b-1
              i  = n+ii
              c  = row_f(b)/row_f(1)
              do k = b,iband
                sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)
              end do
            end do
          else
            do ib = 2,iband
              b  = row_i(ib)
              ii = b-1
              i  = n+ii
              c  = row_f(b)/row_f(1)
              do ik = ib,iband
                k = row_i(ik)
                sk(k-ii,i) = sk(k-ii,i) - c*row_f(k)
              end do
            end do
          end if
!
           tick_2 = tick_2 + del_tick()
           stat_tick(n)  = sum_tick ()
           stat_iband(n) = iband
!
         end do
         call elapsed_time (sec_end)
!
         c = sum_nonzero/dble(nszf) + 1.0
         do i = 1,2
           if (i==1) ii = 1
           if (i==2) ii = 14
           write (ii,*) ' '
           write (ii,*) 'Number of equations =', nszf
           write (ii,*) 'maximum bandwidth   =', nband
           write (ii,*) 'average envelope    =', nint (sum_mband/dble(nszf))
           write (ii,*) 'average bandwidth   =', nint (sum_iband/dble(nszf))
           write (ii,*) 'average non-zero    =', nint (c)
           write (ii,*) ' % zero before      =', sum_zero/sum_coeff           * 100.
           write (ii,*) ' % zero reduced     =', (1. - sum_nonzero/sum_coeff) * 100.
           write (ii,*) 'elapsed time (sec)  =', sec_end-sec_start
         end do
!
         do n = 1,nszf
           write (14,*) n, stat_iband(n), stat_tick(n)
         end do
      end

JohnCampbell

Posts: 2526 Sydney

Back to Top

11 Feb 2013 6:49 #11539

Paul,

I have been trying to produce a minimum change version of good and bad and finally succeeded. I have tracked down the problem to the following routine subroutine vec_add (row_b, row_f, n, c) real8 row_b(), row_f(), c integer4 n, k ! do k = 1,n row_b(k) = row_b(k) + c*row_f(k) end do end

This is effectively the inner loop of the example program I have provided previously when Row_i is removed. If I replace 'call vec_add' with 'call fast_asm_dsaxpy' : the vector version that David wrote last year, then the x 100 slow down does not occur. The bad version works ok for a while ( 0.0041 vs .0043 seconds per equation 545), then slows down by a factor of 100 ( 1.009 vs .0111 seconds per equation 600), then speeds up again to be much faster than the other version !! ( 0.00369 vs .0251 seconds per equation 4500), but only after about 30 minutes run time. The run time is 1995 seconds vs the fast version takes only 263 seconds. Both versions produce very similar but not identical results ( average error = 9.617458238719E-16 vs 9.617314123230E-16 )

I could estimate that the processor pre-fetch is getting it wrong for a while then finally learns how to get it right. The SK matrix is about 170 mb.

I shall send the two programs, plus the batch file to run them and the trace files. The program solves 13,000 equations and I use calls to cpu_clock@, to report the elapse time for each equation, both the generation of row_f then calls to vec_add. It is realy surprising, from this time log, how the run time slows down so dramatically, then speeds up again. If I change the values in the SK matrix, the speed up is delayed further. I have done many checks to (hopefully) confirm there is no out of bounds addressing so I think this is as I describe. The next check could be to confirm n is in the range 1:1700 and the memory address of row_i, row_f, n and c are valid. There must be some processor status not being corrected. real*8 alignment is about the only thing I can think of ??, although I am at a loss to explain this.

I will package up the files and email tonight.

John

JohnCampbell

Posts: 2526 Sydney

Back to Top

26 Feb 2013 5:37 #11618

Paul,

I have continued to investigate this problem. The run times are dependent on how many zeros are in the vectors row_b and row_f. For n=1700, when the number of zeros is between 50% and 80% there is a dramatic slow down. Run times for my test vary from 70 seconds, to 7,000 seconds, depending on % zero, ie the values in the row_f vector. The zeros are typically grouped in blocks and not randomly. I can only guess that the problem is related to the way the CPU handles lots of zero operations. lots of delays ? The problem is significantly mitigated (although not totally eliminated) if I use Davidb's AVX instruction: call fast_asm_dsaxpy (sk(1,i), row_f(ib), k, c)

Any ideas ?

John

PaulLaidler

Posts: 7977 Salford, UK

Back to Top

26 Feb 2013 7:54 #11619

I don't have any ideas about this at the moment, other than that it is difficult to see what the compiler can do about this.

JohnCampbell

Posts: 2526 Sydney

Back to Top

26 Feb 2013 10:58 #11622

Paul,

I am wondering if it could be a timing problem with the assembler instructions generated by FTN95, where the response time for a zero multiply plus resulting product (non) addition is different to what the instructions generated by FTN95 expect, thereby generating an extended delay because the expected response had already occurred before the following instructions were ready. This could explain the delays.

There is certainly a problem, where between 20% and 50% of the vector contents are zero. < 20% or > 50% do not have a problem, but otherwise the delay is significant.

John

JohnCampbell

Posts: 2526 Sydney

Back to Top

9 Oct 2014 10:58 #14797

Paul,

I have been doing some work recently with this program, using gFortran with Openmp and sometimes validating with FTN95. The very poor performance is still occurring with FTN95 for the vector subtraction calculation, with long vectors (1700) and 20% to 50% zeros in the vectors and not using SSE instructions. I am sure there is an error state or timing problem being returned from the standard math calculation, possibly indicating zero values in the registers, which FTN95 is not correctly addressing. Not sure why it occurs with 20% to 50% zeros? I have now used 3 other compilers and none of them exhibit the problem, with the same code. Certainly the other compilers show better performance with SSE instructions, as does Davidb's SSE routine.

My next project is to merge Clearwin_64 with gFortran and OpenMP. The freedom from the 32bit memory limit is infectious, until you hit the physical memory wall. I've certainly forgotten the days of virtual memory on the Pr1me.

John

PaulLaidler

Posts: 7977 Salford, UK

Back to Top

9 Oct 2014 1:42 #14798

Thanks for the feedback.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

9 Oct 2014 6:01 #14799

John, The FTN95 developers truly puzzled everyone why they got full stop at 32 bit limit while every sh#tty and even free compiler in the world moved or plan to move to 64. I'd be interested with your experience and detailed instructions of transferring from Silverfrost to gFortran or Ifort without losing compatibility with FTN95 on smaller arrays which fit into 32 bits (for debugging purposes and to save the way back to Silverfrost if common sense will prevail and we will eventually see their 64 bit compiler).

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

Back to Top

10 Oct 2014 2:50 #14801

In the DBOS DOS-extender days, FTN77 was the speed king. The rot (if one dare call it that) was the decision to support Windows, and with that, a virtual halt on using new CPU facilities as they became available.

For me, the hit is negligible: compilation is still fast, error reports are comprehensive, and Clearwin+ is a huge bonus - plus, what I do is executed almost instantaneously on modern hardware and certainly fits well within the memory limits of 32 bit Windows.

I'm not sure that the Windows environment is ideal for running the computationally intensive routines John is using, especially not late version Windows which seems to me to service a whole bunch of things from time to time instead of getting on with the task in hand. Might not John's 100 times slowdown be the result of Windows simply deciding for itself that it was time to go and tidy something up? Defragment memory, perhaps? Empty some inconsequential cache? (Possibly when it got a bit fed up with multiplying by zero?)

It seems to me that as well as deploying SSE3 support and things of that ilk, one needs to lock down Windows' behaviour too.

Eddie

mecej4

Posts: 1914

Back to Top

16 Feb 2015 10:57 (Edited: 16 Feb 2015 3:01) #15655

Please excuse me for resurrecting this months-old thread. John Campbell brought my attention to the issues discussed here in a private message here, and we exchanged several responses. He made his complete source code available, and I spent some time exploring the code with several compilers, including FTN95 7.1. Here is a summary of my findings, which may be of interest to users of FTN95.

John's code is computationally intensive, so using a compiler without support for SSE2 imay be expected to result in poor performance.
The code tests several versions of two heavily used BLAS routines (DDOT and DAXPY), which are all linked into it (Fortran 90 vector ops version, Fortran Loops version, DavidB's inline assembly version for SSE2 and an OBJ version compiled with a compiler that can generate AVX instructions). John has published his timing results in several forum threads. I found out that in some parts of the code he calls more than one version of the BLAS routines in an inconsistent way, so his timing results were corrupted by this impurity. Of course, all of us are prone to such mix-ups when we try to improve and extend code. Before running his code to obtain the results given below, I corrected this deficiency.
Here are results from one run, comparing (i) Digital/Compaq 6.6C, (ii) FTN95 7.1, (iii) Intel 15.0.2, on a laptop with an i5-4200U CPU. With each compiler, I made two runs: (a) using the best BLAS supplied with the compiler, if any : CXML for CVF, MKL for IFort, and (b) DavidB's SSE2 assembler code. Compiling was done with full optimization when available. The input parameters for the runs, which make sense only if one builds and runs John's program, but are needed to document my results, were /l:5 /m:-1 /d:2. The compiled programs did not use multithreading, but the two vendor libraries may do so internally.

Compiler| BLAS Version| Time consumed in top two CPU hoggers

Comments:

If performance is important in your code, and you wish to use FTN95 nevertheless, identify a few key bottlenecks and compile them with another compiler or write assembly code. Performance is not something that FTN95 puts much emphasis on, but it outshines other compilers when it comes to error checking and debugging. In the code dealt with here, three subprograms with 400 bytes (out of a total count of 80 kb) were replaced using DavidB's SSE2 assembly code, and gave FTN95 a respectable place in the execution time comparisons.
If you are running floating-point intensive programs on a modern CPU, do not use a compiler that generates only X87 FPU instructions unless you are willing to accept a significant to severe slowdown. You may even start suspecting that the slowdown manifests bugs in the compiler, as John did, or lay blame/suspicion on Windows, as Eddie did. Interestingly, Pathscale has the motto 'We consider poor performance to be a bug'.
We owe our collective thanks to DavidB for spending his time to write the fine SSE inline assembly versions of the two BLAS routines used. Having the assembler source made it possible for me to use his routines even with compilers that cannot generate SSE2 instructions, such as CVF (15 years old) and FTN95 (compiled code performance not stressed). Changing 0.5 percent of the code to assembler increased the execution speed tenfold in the case of CVF.