forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Adding /check causes EXE to run 800 times slower
Goto page 1, 2  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support
View previous topic :: View next topic  
Author Message
mecej4



Joined: 31 Oct 2006
Posts: 873

PostPosted: Sat Mar 31, 2018 1:51 pm    Post subject: Adding /check causes EXE to run 800 times slower Reply with quote

I have run into a problem using /check /64 with a program comprising around 4500 lines, spread over six files. The program reads a single input file, and produces a number of unformatted files. These unformatted files are intended for input to another program, but we need not be concerned with that program now.

Without /check, the program runs to completion in 0.4 seconds (32- or 64-bit, 8.10 compiler). With /check, in 32-bit mode, the program takes 1.0 s. With /check and 64-bit, the program takes 320 seconds!

I have tested the same program with other compilers, and it runs without any noticeable problems, in about 1 second, even with the equivalent of /undef.

Since the source code is rather long (4,500 lines), I have zipped the source files, the input data file, and four batch files that I used to build the program with FTN95 8.10. The Zip file is at:

https://www.dropbox.com/s/dw9c9v390yzxo9v/swmol.7z?dl=0

Thanks.

[P.S.] Perhaps the first step to take is to obtain the timings with the 8.2 or 8.3 versions of the compiler.
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5275
Location: Salford, UK

PostPosted: Sat Mar 31, 2018 3:42 pm    Post subject: Reply with quote

Thanks for the feedback.

This is a strange behaviour which I have confirmed with v8.10, v8.20 and v8.30.

The first suspect would be an anti-virus checker but my test did not show any difference when switched off.

I will make a note that this needs investigating.
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5275
Location: Salford, UK

PostPosted: Sat Mar 31, 2018 3:57 pm    Post subject: Reply with quote

The problem is "fixed" when using /inhibit_check 4. This provides a temporary work-around and a starting point for us when investigating.
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 873

PostPosted: Sat Mar 31, 2018 4:21 pm    Post subject: Reply with quote

Thanks for the work-around.

However, since the help file says that "/inhibit_check 4" suppresses all pointer checking, I am puzzled; the code contains only allocatable variables and no pointer variables (I realise that, behind the scenes, allocatable variables may be handled as pointer variables with some additional attributes.)

I found that the subroutine VGAMMA in swmol3.f90 is the main culprit. Compiling everything else with /64 /check and just the VGAMMA subroutine with /64 /check /inhibit_check 4, I find that doing so reduces the run time from 320 s to about 4 s. (I used a Fortran source file split utility on swmol3.f90).
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5275
Location: Salford, UK

PostPosted: Sat Mar 31, 2018 6:27 pm    Post subject: Reply with quote

Thanks.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1928
Location: Sydney

PostPosted: Sun Apr 01, 2018 2:35 am    Post subject: Reply with quote

I tried to see if /timing would provide some answers.
I used/guessed the following build for /64
Code:
now                             >ftn95.tce
del *.obj                      >>ftn95.tce
del *.mod                      >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,

ftn95 precisn.f90 /64 /timing  >> ftn95.tce
ftn95 blasint.f90 /64 /timing >> ftn95.tce
ftn95 gdata.f90 /64 /timing >> ftn95.tce
ftn95 ipack.f90 /64 /timing >> ftn95.tce
ftn95 swdata.f90 /64 /timing /check >> ftn95.tce
ftn95 swmol3.f90 /64 /timing /check >> ftn95.tce

slink64 swmol3.obj swdata.obj ipack.obj gdata.obj blasint.obj precisn.obj /stack_size 10000000 /file swmo64.exe   >> ftn95.tce

swmo64   >> ftn95.tce



This produced file swmo64.tmo, with the first few lines being:
Code:
Name,CallsTo,CallsFrom,PageFaults,CPUsec,iCallsTo,iCallsFrom,iCPUsec
CONLOX,234,2466,0,155.3824,119409,119175,195.2694
CONLOR,21,147,0,130.3098,18406,18385,156.6794
VGAMMA,3880,22674,0,68.8965,26554,22674,68.9070
TWOEL,1,135,0,31.6267,231015,231014,397.7422
DRIVE,120,1551,0,14.1037,230994,230874,366.1142
ONELH,1,3973,0,0.5578,26443,26442,3.3684


This shows the delays could be occurring in CONLOX and CONLOR.

I do find that when stack is required for FTN95, this can indicate problems.
I would try to find any large local or temporary arrays that could demand stack and convert to ALLOCATE.

I would recomend the approach of using /timing as a way of locating potential problems,

I hope this helps.

John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1928
Location: Sydney

PostPosted: Sun Apr 01, 2018 3:14 am    Post subject: Reply with quote

Paul,

Regarding the use of /TIMING, I am puzzled by the use of "Entering-10 second timer calibration loop"

It is possible to calibrate real*8 function cpu_clock@ or INTEGER*8 FUNCTION RDTSC_VAL@() against SYSTEM_CLOCK or 'QueryPerformanceCounter' / 'QueryPerformanceFrequency' in only a few ticks of these reference timers, assuming that is what is taking place.
I also find the clock rate = 1024. * QueryPerformanceFrequency on all processors I have available, although I am not sure how long this has been the case. This approximation actually appears to be more accurate that any calibration test !

I do find /TIMING to be a very useful feature of FTN95, especially for identifying problem areas of code.

John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1928
Location: Sydney

PostPosted: Sun Apr 01, 2018 1:52 pm    Post subject: Reply with quote

I further investigated the performance and found some results that might help.
I compiled with /64 /check /timing on a i5-4200 so these results might be a bit different, but not much.
With these compile options and Ver 8.10, I found most time was spent in routines
CONLOR (line 2410:2476) 180 seconds
CONLOX (line2477:2541) 218 seconds
VGAMMA (line 1736:1850) 75 seconds

Strangely, the most time in CONLOX (161 seconds) was spent in lines 2522:2535
Code:
       do lqs = 1, ncqrs
          f(lqs) = f(lqs)*root(lqs)
       end do
       ib = 0
       do i = 2, nnn
          ib = ib + ncqrs
          do lqs = 1, ncqrs
             root(lqs) = -4._wp*root(lqs)*e1(lqs)
             f(lqs+ib) = f(lqs+ib)*root(lqs)
          end do
       end do
       do i = 1, 3*nbb
          a1nbb(i) = zero
       end do

Also in CONLOR at lines 2457:2470

Changing from an i5-4200 and Ver 8.10 to an i5-2300 and using Ver 8.20 reduced the total run time from 530 seconds to 330 seconds, then removing the /timing option reduced to 320 seconds. (processor change is most significant)
It is surprising that this code has so much delay.
Changing "-4._wp" had no effect.

Also, in routine readin at line 71, if array "kabs" is made allocatable, this removes 224 Mb from the stack, so no stack adjustment is then required.

To get this timing breakdown, I supplemented /timing with my own timer based on RDTSC_VAL@(). I could send the modified code if required,

John
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 873

PostPosted: Sun Apr 01, 2018 3:17 pm    Post subject: Reply with quote

John, thanks for the timing analysis. As it happens, I am more interested in getting fixed the unexpected huge slowdown caused by using /check. We are used to seeing the use of /check cause a doubling of the run time of a program, sometimes seeing the run time to increase by a factor of 10. But 800?

At this point, I am not interested in timing and attempting to speed up the UKRMOL program. Someone posted a question about the program crashing in the Intel Fortran for Linux forum. I used FTN95 to catch a couple of uninitialsed variables and incorrect usage of logical expressions in the program. The version that was provided is probably not the current one, since one has to apply to obtain the source code of UKRMOL, and I have not done so.

You measured the total time spent in routines, including the incremental time for the additional checking. You concluded that CONLOX and CONLOR were the routines where the most time was spent.

I find that compiling just those two subroutines with the additional option of /inhibit_check 4 did not do much to reduce the run time, but doing the same with VGAMMA essentially solved the problem. Thus, my conclusion is that the overhead of /check spikes for VGAMMA and not CONLOX and CONLOR.

However, those routines, as well as ONELH, each contain a call to VGAMMA. Thus, if you looked at inclusive timings, you would think that CONLOX, CONLOR and ONELH are where problems occur when /check is used. The problem, however, lies elsewhere.

I think that what the compiler developers may have to look at are the calls to routines such as __pcheck() that are made from VGAMMA when /check is used.

There are many ALLOCATE statements inside DO loops in VGAMMA and other routines. These are possibly the results of the UKRMOL authors having used some automatic F77 -> F90 conversion utilities. Estimating the peak allocations and making one-time allocations outside the loops should solve this problem. However, the program runs in less than a second, so there is no pressing need to do this.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1928
Location: Sydney

PostPosted: Sun Apr 01, 2018 3:40 pm    Post subject: Reply with quote

mecej4,

I am also at a loss to explain the huge time lost with /check.

I have been looking at CONLOR ( it is called less times than CONLOX)
I was timing a segment, that was taking about 60 seconds and moved some code ( zero array ) to a different part.
I now have the following code taking about 76 seconds for 21 passes , but without the line a1nbb = 0 it took 17 seconds !
Code:
       nnn3 = max(2, nnn-1)
       nbb = nnb*ncqrs
       npw = ncqrs*max(2, nnn-1)
       igam = max(5, nnn)*ncqrs
       nbb3 = nbb*3
       allocate (a1npw(3*npw), a1nbb(nbb3), args(ncqrs), f(igam))
        a1npw = 0
        a1nbb = 0
!       do i = 1, 3*npw
!          a1npw(i) = zero
!       end do

I was wondering if there were Floating Point Exception errors, but took out the following code, but it uses 0.0025 seconds for all calls.
Code:
    subroutine lqs_loop ( ncqrs, root, e1, f )
      integer*4 ncqrs, lqs
      real*8 root(*), e1(*), f(*)

          do lqs = 1, ncqrs
             root(lqs) = -4.0d0*root(lqs)*e1(lqs)
             f(lqs)    = f(lqs)*root(lqs)
          end do
    end subroutine lqs_loop

I have no explanation for this: delays in getting allocated memory active ?

John
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1928
Location: Sydney

PostPosted: Sun Apr 01, 2018 3:55 pm    Post subject: Reply with quote

I put in timings for segments of CONLOX and CONLOR using rdtsc_val@. With this I can monitor the elapsed times in these segments. It is hard to believe that the time can be taken up outside these segments that I am timing.
I am showing some time being spent in VGAMMA but much more time outside this call.
Delays in allocating memory pages as arrays are being used (zeroed) is the best explanation I can get.
Code:
  Id  Call Description        Calls          Ticks     Seconds
   0  Other time               4135    86522310412     30.9702
   1  <vgama> DO 1             2685    38384694245     13.7396
   2  <vgama> DO 2             2685    38802965904     13.8893
   3  <vgama> DO 3             2685    12861502266      4.6037
   4  <vgama> DO 4             2685    13906233666      4.9777
   5  <vgama> exit 5           2685    21003895422      7.5182
   6  <vgama> wexp 6           1306      316496509      0.1133
   7  <vgama> f= 7              112     2179037152      0.7800
   8  <vgama> f= 8             1194      208513310      0.0746
   9  <vgama> f= 9             1194      193226831      0.0692
  10  <vgama> exit 10          1306      156840561      0.0561
  16  <vgama> set tab 16          1        4527747      0.0016
  21  <conlor> exit 21           21    19841940782      7.1023
  22  <conlor> exit 22           21    50278686775     17.9970
  23  <conlor> set f 23          21    15060933199      5.3910
  24  <conlor> set f 24          21   167748801924     60.0448 **
  25  <conlor> dorbax 25         21       99642680      0.0357
  26  <conlor> exit 26           21       79447440      0.0284
  31  <conlox> exit 31          234    22570256060      8.0789
  32  <conlox> exit 32          234    82333831518     29.4709
  33  <conlox> f = 33           234    25752846438      9.2181
  34  <conlox> f = 34           234   180254180811     64.5210 **
  35  <conlox> call dorbax      234      262701403      0.0940
  36  <conlox> exit 36          234      361165824      0.1293
 run time =  278.9071 seconds

The two ** can be reduced substantially when removing the zero of a1nbb. There is not much else happening there
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 873

PostPosted: Sun Apr 01, 2018 7:35 pm    Post subject: Re: Reply with quote

JohnCampbell wrote:
I am showing some time being spent in VGAMMA but much more time outside this call. Delays in allocating memory pages as arrays are being used (zeroed) is the best explanation I can get.

Please note that you are attributing the time spent to the Fortran source in VGAMMA, etc. In contrast, I think, the time is being spent in the pointer and subscript checking code planted by the use of /check. Along the same lines, please note that allocating memory and zeroing the memory is done even when /check is not used, when it takes just a fraction of a second to do. If we had a profiler that could collect timings for the extra functions that check pointers, etc., we would be in a better position to understand what went wrong.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1928
Location: Sydney

PostPosted: Mon Apr 02, 2018 12:57 am    Post subject: Reply with quote

Mecej4,

I have confirmed that there is a significant proportion of the time being taken up in initialising arrays after allocation. I think this relates to the way that /check manages allocatable arrays.
for example, in conlor, for the loop
do i = 1, 3*nbb
a1nbb(i) = zero
end do
If I replace this with "call zero_array(a1nbb,3*nbb)" and use /timing, most of the time is attributed conlor, rather than zero_array. I used this routine in a few places and all 12,447 calls to zero_array took 0.0024 seconde from the /timer report. It was compiled with /64 /timing /check.

However, if I put explicit code to time this loop, using:
call jdc_time (24, '<conlor> set f 24')
do i = 1, 3*nbb
a1nbb(i) = zero
end do
call jdc_time (42, '<conlor> zero 24')
This identifies significant time for this block of code.
When using call zero_array, /check is spending significant time preparing the array.
Basically, there are large delays when /check prepares the allocated array for use ( memory page allocation or whatever ) This looks to be the/a source of the problem.
My test showed that the code between timer_24
to timer_42 was called 21 times and took 96 seconds.

There is a problem with /check in this respect.

John

I have retested this and the times were substantially reduced with array syntax in a call. There does not appear to be significant overhead outside the do loop. Do loops are causing big delays.
timer_42 reduced to .0030 seconds when zero_array was called.
subroutine zero_array (a,n)
integer n
real*8 a(n)
a = 0
end subroutine zero_array


Last edited by JohnCampbell on Mon Apr 02, 2018 1:54 am; edited 2 times in total
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 873

PostPosted: Mon Apr 02, 2018 1:01 am    Post subject: Reply with quote

John, do you know if /timing and /check have any mutual effects?

Anyway, your findings should be useful to Silverfrost as they identify and rectify the /check problem.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 1928
Location: Sydney

PostPosted: Mon Apr 02, 2018 1:06 am    Post subject: Re: Reply with quote

mecej4 wrote:
John, do you know if /timing and /check have any mutual effects?


removing /timing has little effect on the run time, reducing to 501 seconds for the 522 second test I just ran with /64 /timing /check

I can package up what I have done and provide the link

John


Last edited by JohnCampbell on Mon Apr 02, 2018 1:55 am; edited 1 time in total
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support All times are GMT + 1 Hour
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group