forums.silverfrost.com Forum Index
Welcome to the Silverfrost forums
 

Severe slowdown with /64 /check for a certain program
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Thu Oct 04, 2018 3:28 pm    Post subject: Severe slowdown with /64 /check for a certain program

Paul, I found a peculiar behavior of SALFLIBC64.DLL (I think) when I compiled and ran a test program (Minpack LMDER example) with /64 /check (or /64 /checkmate).

The test program has no subscript errors, undefined variables or other errors of the kind that FTN95 is great at catching. The use of /check or /checkmate causes the run time of the program to increase by a factor of 200 to 800 when I use the SALFLIBC64.DLL that came with FTN95 V8.30.0 or older. There is no such drastic slowdown when the 32-bit target is chosen or if I allow the program to find the SALFLIBC64.DLL that was released with the 8.30.279 Beta.

Here are some results, which show the elapsed CPU_TIME in the last column.

SALFLIBC64.DLL, version 20.3.16.7:
Code:
 NPROB   N    M   NFEV  NJEV  INFO  FINAL L2 NORM CPU-t (s)

   11   12   31    10     9     3   0.2173104D-04  0.141
   11   12   31    13    12     2   0.2173104D-04  0.547
   11   12   31    34    28     2   0.2173104D-04  5.141

The same program (not even a recompile and link), with

SALFLIBC64.DLL, versions 20.4.9.10, 20.6.30.12:
Code:
 NPROB   N    M   NFEV  NJEV  INFO  FINAL L2 NORM CPU-t (s)

   11   12   31    10     9     3   0.2173104D-04  0.000
   11   12   31    13    12     2   0.2173104D-04  0.016
   11   12   31    34    28     2   0.2173104D-04  0.000


If you already know what was fixed in the newer DLL that may have a bearing on this, we users may just look forward to the next release. If this is an unknown issue, on the other hand, I can provide the test code and any other details necessary.

I use /check and /checkmate often, in 32- and 64-bit built programs. Usually, the slowdown compared to /opt is by a factor of ~10. I had never seen a slowdown by a factor of 800.


Last edited by mecej4 on Fri Oct 05, 2018 3:12 am; edited 1 time in total
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7916
Location: Salford, UK

Posted: Thu Oct 04, 2018 6:09 pm

mecej4

Thanks for the report. Please provide the test program or a link to it.
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Thu Oct 04, 2018 7:30 pm

Here is the link to the source file, zipped up:

https://www.dropbox.com/s/mr5yyuzihwvwj2n/xlmp11.zip?dl=0

Compile with /64 /checkmate /link.

When it uses the SALFLIBC64.DLL that came with FTN95 8.30, initial release, or earlier, it runs for around 10 seconds (depending, of course, on the CPU used).

With newer releases of SALFLIBC64.DLL, such as those included with FTN95 8.30.169 and 8.30.279, the program takes less than 0.1 s to run, i.e., a hundred times faster.

Thanks.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Fri Oct 05, 2018 7:01 am

mecej4,

I compiled with /64 /checkmate /timing /link

The resulting .tmr timing report suggested the problem is in QRFAC.
I would suspect the delay in the older code could come from lines like:
Code:
            temp1u      = a(:m, j)
            a(:m, j)    = a(:m, kmax)
            a(:m, kmax) = temp1u
...
               allocate (d1v(m-j+1))
               d1v = a(j:m, j)*a(j:m, k)
...
               a(j:m, k) = a(j:m, k) - temp*a(j:m, j)

Array sections have long been a performance problem in FTN95.
This could be tested by replacing all array sections with a "do i" loop

If this is the change, it could be an interesting outcome to know about.

ps 1: I also included the following code, removing CPU_time
Code:
 subroutine elapse_time (sec)
   real :: sec
   integer*8 clock, rate
   call system_clock ( clock, rate )
   sec = real(clock) / real(rate)
 end subroutine elapse_time


ps 2: I changed routine QRFAC and replaced the array sections with DO i = j,m and this removed the delay in QRFAC. This suggests that array sections are the cause, WHICH COULD BE A VERY USEFUL FIND !!

Paul, Is there any documentation to suggest that array section performance has been improved in Ver 8.3 beta ?
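
A minimal sketch of the kind of replacement described in ps 2, assuming a small made-up matrix rather than the real QRFAC data. The sizes, values and the name section_vs_loop are illustrative only. It applies the same column update once with an array section and once with an explicit DO loop, then compares the two results.

```fortran
program section_vs_loop
   ! Sketch only: sizes and values are invented for illustration,
   ! they are not taken from the Minpack test problem.
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: m = 6, n = 4
   real(dp) :: a(m, n), b(m, n), temp
   integer  :: i, j, k

   ! Fill both copies with the same values
   do k = 1, n
      do i = 1, m
         a(i, k) = real(i + 10*k, dp)
      end do
   end do
   b = a

   j = 2
   k = 3
   temp = 0.5_dp

   ! Array-section form, as in the converted QRFAC
   a(j:m, k) = a(j:m, k) - temp*a(j:m, j)

   ! Equivalent explicit loop over the same rows
   do i = j, m
      b(i, k) = b(i, k) - temp*b(i, j)
   end do

   if (maxval(abs(a - b)) > 0.0_dp) then
      print *, 'results differ'
   else
      print *, 'results agree'
   end if
end program section_vs_loop
```

Building this with and without /checkmate, and timing a repeated version of each form, would be one way to see whether the section or the loop is the slow part on a given DLL.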
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Fri Oct 05, 2018 10:32 am

Thanks for contributing, John. Your comments regarding the time consumed in QRFAC led me to do some more tests, and the findings are interesting. Instead of the single test problem P11 that I chose earlier to demonstrate the problem, I ran the entire test problem suite from:

http://www.netlib.org/minpack/ , "lmder1.f plus dependencies"

http://www.netlib.org/minpack/ex/ , file17, "LMDER test" and file22 (data file)

Here are the timing results (Intel i5-8400-2.8 GHz, Windows 10 X64).

F77, IMPLICIT NONE added and code fixed (see below)

/64 /opt : 0.110 s
/64 /checkmate : 0.150 s

F90, array expressions, allocate/deallocate temporary arrays as needed

/64 /opt : 0.110 s
/64 /checkmate : 6.710 s

F90, temporary local arrays allocated on stack,
array sizes passed as subroutine integer arg(s)

/64 /opt : 0.110 s
/64 /checkmate : 0.170 s


Clearly, the drastic slowdown with /64 /checkmate is attributable to the use of blocks of code such as
Code:
            do k = jp1, n
               sum = zero
               allocate (d1v(m-j+1))       !<<<
               d1v = a(j:m, j)*a(j:m, k)
               do i = 1, m - j + 1
                  sum = sum + d1v(i)
               end do
               deallocate (d1v)            !<<<
               ...
            end do

This problem, as we have seen, occurs only with older versions of SALFLIBC64.DLL, and Paul can probably record how/why this happens.

For you, John, it should be interesting to note that using array operations did not entail any measurable performance penalty.
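
To make the second and third variants in the timing list concrete, here is a rough sketch that assumes a simplified stand-in for the QRFAC inner loop. The names colsum_sketch, dots_alloc and dots_auto are hypothetical and the data are made up. The first function allocates and frees the temporary on every pass, as the tool-generated code does. The second uses an automatic array whose size comes from an integer argument, which compilers typically place on the stack.

```fortran
module colsum_sketch
   implicit none
   integer, parameter :: dp = kind(1.0d0)
contains

   ! Tool-generated style: the temporary is allocated and freed on every pass.
   function dots_alloc(a, m, n, j) result(total)
      integer, intent(in)  :: m, n, j
      real(dp), intent(in) :: a(m, n)
      real(dp) :: total
      real(dp), allocatable :: d1v(:)
      integer :: k
      total = 0.0_dp
      do k = j + 1, n
         allocate (d1v(m-j+1))
         d1v = a(j:m, j)*a(j:m, k)
         total = total + sum(d1v)
         deallocate (d1v)
      end do
   end function dots_alloc

   ! Alternative: an automatic array sized from the integer argument m.
   function dots_auto(a, m, n, j) result(total)
      integer, intent(in)  :: m, n, j
      real(dp), intent(in) :: a(m, n)
      real(dp) :: total
      real(dp) :: d1v(m)          ! generous upper bound, no ALLOCATE needed
      integer :: k, nlen
      total = 0.0_dp
      nlen = m - j + 1
      do k = j + 1, n
         d1v(1:nlen) = a(j:m, j)*a(j:m, k)
         total = total + sum(d1v(1:nlen))
      end do
   end function dots_auto

end module colsum_sketch

program compare_temporaries
   use colsum_sketch
   implicit none
   integer, parameter :: m = 8, n = 5
   real(dp) :: a(m, n)
   integer  :: i, k
   ! Made-up test data
   do k = 1, n
      do i = 1, m
         a(i, k) = real(i + k, dp)
      end do
   end do
   print *, 'allocate in loop :', dots_alloc(a, m, n, 2)
   print *, 'automatic array  :', dots_auto(a, m, n, 2)
end program compare_temporaries
```

Both functions do the same arithmetic in the same order, so any timing difference under /64 /checkmate should come from the ALLOCATE/DEALLOCATE pair rather than from the array expression itself.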

Note that the original F77 code from Netlib cannot be run with /checkmate because of a defect in FTN95: If implicit typing is being used in the code, an actual argument that has been given an EXTERNAL declaration is classified as REAL or INTEGER based on the initial letter of the name. There is no means in Fortran 77 of declaring that an external variable is typeless, i.e., that it is a subroutine and not a function. With /checkmate, a runtime error occurs, "real argument passed when subroutine was expected...". This is what made "code fixed" necessary.
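
The exact changes behind "code fixed" are not shown here, so the following is only one possible approach, with hypothetical names (driver_sketch, apply, square_all). It declares the dummy procedure through an interface block under IMPLICIT NONE, so that the argument is unambiguously a subroutine and no implicit REAL or INTEGER type is attached to its name.

```fortran
module driver_sketch
   implicit none
contains
   ! The dummy procedure fcn is described by an interface block,
   ! so it is known to be a subroutine rather than a typed function.
   subroutine apply(fcn, n, x, f)
      integer, intent(in) :: n
      double precision, intent(in)  :: x(n)
      double precision, intent(out) :: f(n)
      interface
         subroutine fcn(n, x, f)
            integer, intent(in) :: n
            double precision, intent(in)  :: x(n)
            double precision, intent(out) :: f(n)
         end subroutine fcn
      end interface
      call fcn(n, x, f)
   end subroutine apply
end module driver_sketch

! An external subroutine, standing in for a user-supplied FCN
subroutine square_all(n, x, f)
   implicit none
   integer, intent(in) :: n
   double precision, intent(in)  :: x(n)
   double precision, intent(out) :: f(n)
   f = x*x
end subroutine square_all

program external_demo
   use driver_sketch
   implicit none
   interface
      subroutine square_all(n, x, f)
         integer, intent(in) :: n
         double precision, intent(in)  :: x(n)
         double precision, intent(out) :: f(n)
      end subroutine square_all
   end interface
   double precision :: x(3), f(3)
   x = (/ 1.0d0, 2.0d0, 3.0d0 /)
   call apply(square_all, 3, x, f)
   print *, f
end program external_demo
```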
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Fri Oct 05, 2018 10:57 am

Are you suggesting that the problem relates to ALLOCATE as this is not my experience ?

From past versions of FTN95, I have learnt that using array section expressions like "a(j:m, k)" causes significant performance penalties. As a consequence, I never use these expressions. I concluded that FTN95 created lots of temporary copies of the array section, which caused most of the delay. ( FTN95 does not support non-contiguous vectors, which I think is a good thing. )

If this problem has been removed with the latest Ver 8.3 beta then it should be documented for the next release. It will improve FTN95 performance in a number of benchmark tests.

I am not sure of the history for some of the F90, array expressions code blocks you have used, such as:
Code:
            allocate (d1v(nsing-jp1+1))
            d1v = r(jp1:nsing, j)*wa(jp1:nsing)
            do i = 1, nsing - jp1 + 1
               sum = sum + d1v(i)
            end do
            deallocate (d1v)


These can be replaced by
Code:
            do i = jp1,nsing
               sum = sum + r(i, j)*wa(i)                  ! ~~ array section
            end do


I changed the original code you linked to remove unnecessary array sections and it works without /checkmate delays with Ver 8.20, see link

https://www.dropbox.com/s/k6tuemhg01fl8vd/xlmP11-v2.f90?dl=0

I do find /timing to be a very useful diagnostic tool. It does a per routine performance report which can quickly highlight where the problem occurs.
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Fri Oct 05, 2018 12:35 pm

I did the initial conversion from F77 to F90 using a tool (the old Vast77to90). Sometimes, that tool produces compiler-ready F90 code, but often the code looks odd/inhuman. The tool attempts to convert computed GO TO statements to SELECT ..CASE.. constructs, but does a bad job of it.

ALLOCATE/DEALLOCATE may not, by itself, be bad for performance as long as it is used sparingly. The tool-generated code, however, contains these statements inside loops. Those, together with /checkmate, seem to create a bottleneck when an older SALFLIBC64.DLL is used.

If those temporary arrays are allocated on the stack, with upper bounds specified generously enough, array expressions seem to work fine and the penalty is negligible with current versions of FTN95. You may consider reevaluating your opinion of array sections, or presenting a counterexample where the use of sections hurts performance.

I do not know the internal details of how FTN95 does array bounds checking, but I am inclined to think that it is easier to check bounds when viewing an array assignment than the equivalent DO loop. Therefore, /check should entail much less overhead with array assignments than with the equivalent F77 code -- in simple cases, the bounds check needs to be done once per array assignment, instead of during every iteration of the corresponding loop.
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sat Oct 06, 2018 12:14 pm

Just out of interest, did the code work properly (as originally written) before you changed it? If Vast77to90 does a bad job, why use it? (I once used a code rearranger called SPAG - and hated what it did, as I thought the output was unreadable).

Eddie
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Sat Oct 06, 2018 2:26 pm

Actually, Vast does a pretty good job most of the time. It is mainly when one of the targets of a computed go to is itself a computed go to that Vast botches it.

Why convert? Two main reasons:

a) to make the code more compact, readable and, possibly, more efficient. Finding bugs in old programs is a lot easier after conversion, since an original F77 subprogram with, say, 50 labels becomes a F90 program with 5 labels. I am incapable of imagining the possible execution paths when a subprogram has more labels than I have fingers.

b) to produce a non-trivial F90 program that is sufficiently dense in array operations to make a good test for a Fortran 9x compiler, with correct results available for verification of the run. I had suspected bugs in FTN95, but these programs, which have been used by thousands of people and have verified output, helped me catch and describe the bugs in FTN95 that I reported.

For a collection of converted Fortran programs, see http://wp.csiro.au/alanmiller/ .
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sat Oct 06, 2018 3:01 pm

I am not sure I agree that it does a good job as the resulting code:
Code:
            do k = jp1, n
               sum = zero
               allocate (d1v(m-j+1))       !<<<
               d1v = a(j:m, j)*a(j:m, k)
               do i = 1, m - j + 1
                  sum = sum + d1v(i)
               end do
               deallocate (d1v)            !<<<
               temp = sum/a(j, j)
               ...
            end do
can be replaced by
Code:
            do k = jp1, n
               temp = dot_product ( a(j:m, j), a(j:m, k) ) / a(j, j)
               ...
            end do

This could also be replaced by F77 syntax like
Code:
            do k = jp1, n
               temp = vec_product ( a(j, j), a(j, k), m-j+1 ) / a(j, j)
               ...
            end do


Edit: mecej4, am I correct assuming the above code was generated by vast ?
Could the alternative of dot_product, (or the F77 wrapper) be a useful way of avoiding /checkmate problems, while checking the array in the call ?

Years of FTN95 performance problems with array sections have conditioned me to avoid this syntax and use F77 wrappers.
(ifort does show the opposite, with array sections performing better than f77 wrapper routines, like ddotp or daxpy)
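
For readers who have not seen this style, a sketch of what such an F77 wrapper might look like. The body of vec_product is an assumption, since only its call is shown above, and the test matrix is made up. The wrapper receives the first element of each column section and a length, and relies on sequence association to walk down the column.

```fortran
! Hypothetical F77-style wrapper: x and y are the first elements of
! two contiguous vectors of length n.
double precision function vec_product(x, y, n)
   implicit none
   integer n, i
   double precision x(n), y(n), s
   s = 0.0d0
   do i = 1, n
      s = s + x(i)*y(i)
   end do
   vec_product = s
end function vec_product

program wrapper_demo
   implicit none
   double precision, external :: vec_product
   integer, parameter :: m = 5, n = 3
   double precision :: a(m, n), t1, t2
   integer :: i, j, k

   ! Made-up test data
   do k = 1, n
      do i = 1, m
         a(i, k) = dble(i + k)
      end do
   end do

   j = 2
   k = 3
   ! Array-section form
   t1 = dot_product(a(j:m, j), a(j:m, k))
   ! Wrapper form: pass the starting elements and the length
   t2 = vec_product(a(j, j), a(j, k), m - j + 1)

   print *, 'dot_product :', t1
   print *, 'vec_product :', t2
end program wrapper_demo
```

Note that the wrapper form only works because a column section such as a(j:m, k) is contiguous in memory. A row section would need a stride argument, as in the BLAS routines.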


Last edited by JohnCampbell on Sun Oct 07, 2018 4:57 am; edited 3 times in total
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Sat Oct 06, 2018 4:55 pm

Hi Mecej4,

I didn't ask if you preferred it a different way, I asked if it worked, i.e. did FTN95 throw up any errors, or fail to compile it, so that by 'worked' I mean did it run with the expected results - which include timings as well as numerical accuracy. (And I meant the original published code, not the demonstrator you kindly provided.)

The answer you gave was fairly obvious from the path you followed. I'm not going to even suggest that you change your modus operandi (or stylistic preference), merely to have confirmation - if that's possible - that my own fits me best.

Regards

Eddie
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Sat Oct 06, 2018 5:09 pm

Eddie:

Yes, this F77 code as well as most of the Netlib F77 codes work, provided one goes through a small ritual at first: one has to select the machine constants by modifying the source code, because the unmodified code may have been written for a CDC, Convex, etc. In many cases, one also has to tell the compiler to initialise variables to zero. If the code passes subprogram names as actual arguments, one has some more work to do because FTN95 attaches an implicit type attribute to such arguments and in turn issues run time error messages regarding mismatched formal and actual arguments.
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Sun Oct 07, 2018 12:40 pm

John Campbell said:
Quote:
... mecej4, am I correct assuming the above code was generated by Vast ?

Yes. Perhaps Vast was hobbled by the fact that the F77 input used the F90 intrinsic SUM as a variable. In many cases, Vast may not realise that a vector expression that serves only to be input to a function such as SUM should not be assigned to a temporary vector variable; that is why it produces code that allocates D1V, fills it with values, sums up the values, and deallocates it. As you observed, using DOT_PRODUCT makes the temporary array superfluous.

The old F77 codes also do other things that hinder conversion and optimisation. For example, the coder of:

Code:
NP1= N + 1
IF (NP1 .GE. 1) THEN
   ...
   DO j = 1, NP1

is being defensive against the DO loop being executed once even when NP1 = 0. Most compilers today do not do "one-trip DO loop"s unless you ask them. If the "..." is many statements, the compiler will probably not recognise that when the DO is entered we are guaranteed that NP1 >= 1, and will generate instructions to perform this comparison again.
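
A tiny sketch of the zero-trip behaviour that makes the defensive IF unnecessary under Fortran 77 and later rules. The program name and the counter trips are made up for illustration. With standard semantics the loop body should not execute at all when the upper bound is below the lower bound.

```fortran
program zero_trip
   implicit none
   integer :: np1, j, trips
   np1 = 0
   trips = 0
   ! Standard DO semantics: the trip count is max(0, np1 - 1 + 1) = 0
   do j = 1, np1
      trips = trips + 1
   end do
   print *, 'iterations executed:', trips
end program zero_trip
```

A compiler running in a one-trip (Fortran 66 style) mode would execute the body once, which is the case the original coder was guarding against.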

Sometimes, subscript checking could be performed quite economically. The following code extract has two nested loops, yet it is sufficient to check that the variable A is (N, N) or larger, just once and prior to entering the outer loop:
Code:
DO J = 1, N
   ...
   DO K = J+1, N
      TEMP = DOT_PRODUCT(A(J:N, J), A(J:N, K))/A(J, J)
      A(J:N, K) = A(J:N, K) - TEMP*A(J:N, J)
      ...
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Sun Oct 07, 2018 9:07 pm

mecej4 wrote:-
Quote:
a) to make the code more compact, readable and, possibly, more efficient. Finding bugs in old programs is a lot easier after conversion, since an original F77 subprogram with, say, 50 labels becomes a F90 program with 5 labels. I am incapable of imagining the possible execution paths when a subprogram has more labels than I have fingers.


you could use the well written and continuously updated program flowchart to understand the code :-) :-o ... just like we all do ! LOL
_________________
''Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together cocking up even the simplest calculation ... :-) "
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Sun Oct 07, 2018 9:22 pm

you also wrote:
Quote:
Yes, this F77 code as well as most of the Netlib F77 codes work, provided one goes through a small ritual at first: one has to select the machine constants by modifying the source code, because the unmodified code may have been written for a CDC, Convex, etc. In many cases, one also has to tell the compiler to initialise variables to zero. If the code passes subprogram names as actual arguments, one has some more work to do because FTN95 attaches an implicit type attribute to such arguments and in turn issues run time error messages regarding mismatched formal and actual arguments.

which made my Sunday. I think it's a great example of the so-called seamless hands-off one-click background compiling experience ;-) LOL

Maybe Paul could introduce some new intrinsics to facilitate the task:-
AUTOZEROINIT, IFORIGCDC, ...

Oh, but isn't there a F77toF90 (or 95) conversion program included in FTN95 ? they'll be in there no doubt ;-)

or run with the F77 as INCLUDES, doesn't FTN95 then compile each file as F77 or F95 code as appropriate based on their file extension ?

All times are GMT + 1 Hour
Page 1 of 2

 