forums.silverfrost.com Welcome to the Silverfrost forums
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sat Mar 31, 2018 1:51 pm  Post subject: Adding /check causes EXE to run 800 times slower
I have run into a problem using /check /64 with a program comprising around 4500 lines, spread over six files. The program reads a single input file, and produces a number of unformatted files. These unformatted files are intended for input to another program, but we need not be concerned with that program now.
Without /check, the program runs to completion in 0.4 seconds (32- or 64-bit, 8.10 compiler). With /check, in 32-bit mode, the program takes 1.0 s. With /check and 64-bit, the program takes 320 seconds!
I have tested the same program with other compilers, and it runs without any noticeable problems, in about 1 second, even with the equivalent of /undef.
Since the source code is rather long (4,500 lines), I have zipped the source files, the input data file, and four batch files that I used to build the program with FTN95 8.10. The Zip file is at:
https://www.dropbox.com/s/dw9c9v390yzxo9v/swmol.7z?dl=0
Thanks.
[P.S.] Perhaps the first step to take is to obtain the timings with the 8.2 or 8.3 versions of the compiler.
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
Posted: Sat Mar 31, 2018 3:42 pm  Post subject:
Thanks for the feedback.
This is a strange behaviour which I have confirmed with v8.10, v8.20 and v8.30.
The first suspect would be an anti-virus checker, but my test did not show any difference when it was switched off.
I will make a note that this needs investigating.
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
Posted: Sat Mar 31, 2018 3:57 pm  Post subject:
The problem is "fixed" when using /inhibit_check 4. This provides a temporary work-around and a starting point for us when investigating.
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sat Mar 31, 2018 4:21 pm  Post subject:
Thanks for the work-around.
However, since the help file says that "/inhibit_check 4" suppresses all pointer checking, I am puzzled: the code contains only allocatable variables and no pointer variables. (I realise that, behind the scenes, allocatable variables may be handled as pointer variables with some additional attributes.)
I found that the subroutine VGAMMA in swmol3.f90 is the main culprit. Compiling everything else with /64 /check, and just the VGAMMA subroutine with /64 /check /inhibit_check 4, reduces the run time from 320 s to about 4 s. (I used a Fortran source-file splitting utility on swmol3.f90.)
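For context, the distinction can be sketched as below. Whether FTN95's /check routes allocatables through the same pointer-checking path is exactly the open question here, so the comments are assumptions, not documented behaviour:

```fortran
program alloc_vs_ptr
  implicit none
  real*8, allocatable :: a(:)   ! allocatable: no POINTER attribute in source
  real*8, pointer     :: p(:)   ! pointer: the kind of variable /inhibit_check 4
  real*8, target      :: t(10)  !   is documented to stop checking

  allocate (a(10))
  a = 0.0d0            ! may still pass through pointer-style checks under
  p => t               !   /check if allocatables are implemented internally
  p = 1.0d0            !   as pointers with extra attributes (an assumption)
  deallocate (a)
end program alloc_vs_ptr
```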
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
Posted: Sat Mar 31, 2018 6:27 pm  Post subject:
Thanks.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 2:35 am  Post subject:
I tried to see if /timing would provide some answers.
I used/guessed the following build for /64:

Code:
now >ftn95.tce
del *.obj >>ftn95.tce
del *.mod >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,
ftn95 precisn.f90 /64 /timing >> ftn95.tce
ftn95 blasint.f90 /64 /timing >> ftn95.tce
ftn95 gdata.f90 /64 /timing >> ftn95.tce
ftn95 ipack.f90 /64 /timing >> ftn95.tce
ftn95 swdata.f90 /64 /timing /check >> ftn95.tce
ftn95 swmol3.f90 /64 /timing /check >> ftn95.tce
slink64 swmol3.obj swdata.obj ipack.obj gdata.obj blasint.obj precisn.obj /stack_size 10000000 /file swmo64.exe >> ftn95.tce
swmo64 >> ftn95.tce
This produced file swmo64.tmo, with the first few lines being:

Code:
Name,CallsTo,CallsFrom,PageFaults,CPUsec,iCallsTo,iCallsFrom,iCPUsec
CONLOX,234,2466,0,155.3824,119409,119175,195.2694
CONLOR,21,147,0,130.3098,18406,18385,156.6794
VGAMMA,3880,22674,0,68.8965,26554,22674,68.9070
TWOEL,1,135,0,31.6267,231015,231014,397.7422
DRIVE,120,1551,0,14.1037,230994,230874,366.1142
ONELH,1,3973,0,0.5578,26443,26442,3.3684
This shows the delays could be occurring in CONLOX and CONLOR.
I do find that when a large stack is required with FTN95, this can indicate problems.
I would try to find any large local or temporary arrays that could demand stack and convert them to ALLOCATE.
I would recommend using /timing as a way of locating potential problems.
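As a sketch of that conversion (the routine name and bounds here are illustrative, not taken from the program):

```fortran
subroutine large_array_sketch (n)
  implicit none
  integer, intent(in) :: n
! before: a large automatic array, which lands on the stack and can
! force a /stack_size adjustment at link time
!   real*8 :: work(n, n)
! after: heap allocation via ALLOCATE, so no stack is consumed
  real*8, allocatable :: work(:,:)

  allocate (work(n, n))
  work = 0.0d0
  ! ... use work as before ...
  deallocate (work)
end subroutine large_array_sketch
```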
I hope this helps.
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 3:14 am  Post subject:
Paul,
Regarding the use of /TIMING, I am puzzled by the "Entering-10 second timer calibration loop" message.
It is possible to calibrate real*8 function cpu_clock@ or INTEGER*8 FUNCTION RDTSC_VAL@() against SYSTEM_CLOCK or QueryPerformanceCounter/QueryPerformanceFrequency in only a few ticks of these reference timers, assuming that is what is taking place.
I also find that the clock rate = 1024 * QueryPerformanceFrequency on all processors I have available, although I am not sure how long this has been the case. This approximation actually appears to be more accurate than any calibration test!
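A minimal sketch of such a short calibration, assuming RDTSC_VAL@() can be declared as below (FTN95 allows @ in external names; the spin length of a few SYSTEM_CLOCK ticks is the point being made above):

```fortran
program calibrate_tsc
  implicit none
  integer*8, external :: rdtsc_val@
  integer*8 :: c0, c1, rate, t0, t1
  real*8 :: ticks_per_count

  call system_clock (c0, rate)
  t0 = rdtsc_val@()
  do                          ! spin for just a few reference-timer ticks
    call system_clock (c1)
    if (c1 - c0 >= 4) exit
  end do
  t1 = rdtsc_val@()
  ticks_per_count = dble(t1 - t0) / dble(c1 - c0)
  write (*,*) 'RDTSC ticks per SYSTEM_CLOCK count:', ticks_per_count
end program calibrate_tsc
```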
I do find /TIMING to be a very useful feature of FTN95, especially for identifying problem areas of code.
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 1:52 pm  Post subject:
I further investigated the performance and found some results that might help.
I compiled with /64 /check /timing on an i5-4200, so these results might differ a little, but not much.
With these compile options and Ver 8.10, I found most time was spent in these routines:
CONLOR (lines 2410:2476): 180 seconds
CONLOX (lines 2477:2541): 218 seconds
VGAMMA (lines 1736:1850): 75 seconds
Strangely, most of the time in CONLOX (161 seconds) was spent in lines 2522:2535:

Code:
do lqs = 1, ncqrs
   f(lqs) = f(lqs)*root(lqs)
end do
ib = 0
do i = 2, nnn
   ib = ib + ncqrs
   do lqs = 1, ncqrs
      root(lqs) = -4._wp*root(lqs)*e1(lqs)
      f(lqs+ib) = f(lqs+ib)*root(lqs)
   end do
end do
do i = 1, 3*nbb
   a1nbb(i) = zero
end do
Also in CONLOR at lines 2457:2470.
Changing from an i5-4200 and Ver 8.10 to an i5-2300 and Ver 8.20 reduced the total run time from 530 seconds to 330 seconds; removing the /timing option then reduced it to 320 seconds. (The processor change is the most significant factor.)
It is surprising that this code has so much delay.
Changing "-4._wp" had no effect.
Also, in routine readin at line 71, if array "kabs" is made allocatable, this removes 224 Mb from the stack, so no stack adjustment is then required.
To get this timing breakdown, I supplemented /timing with my own timer based on RDTSC_VAL@(). I could send the modified code if required.
John
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sun Apr 01, 2018 3:17 pm  Post subject:
John, thanks for the timing analysis. As it happens, I am more interested in getting the unexpected huge slowdown caused by /check fixed. We are used to seeing /check double the run time of a program, and sometimes increase it by a factor of 10. But 800?
At this point, I am not interested in timing and speeding up the UKRMOL program itself. Someone posted a question about the program crashing in the Intel Fortran for Linux forum. I used FTN95 to catch a couple of uninitialised variables and incorrect usages of logical expressions in the program. The version that was provided is probably not the current one, since one has to apply to obtain the source code of UKRMOL, and I have not done so.
You measured the total time spent in routines, including the incremental time for the additional checking. You concluded that CONLOX and CONLOR were the routines where the most time was spent.
I find that compiling just those two subroutines with the additional option /inhibit_check 4 did not do much to reduce the run time, but doing the same with VGAMMA essentially solved the problem. Thus, my conclusion is that the overhead of /check spikes for VGAMMA and not for CONLOX and CONLOR.
However, those routines, as well as ONELH, each contain a call to VGAMMA. Thus, if you looked at inclusive timings, you would think that CONLOX, CONLOR and ONELH are where problems occur when /check is used. The problem, however, lies elsewhere.
I think that what the compiler developers may have to look at are the calls to routines such as __pcheck() that are made from VGAMMA when /check is used.
There are many ALLOCATE statements inside DO loops in VGAMMA and other routines. These are possibly the result of the UKRMOL authors having used automatic F77 -> F90 conversion utilities. Estimating the peak allocations and making one-time allocations outside the loops should remove this overhead. However, the program normally runs in less than a second, so there is no pressing need to do this.
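A sketch of that hoisting (names and bounds are illustrative, not taken from UKRMOL):

```fortran
subroutine hoist_alloc (niter, n)
  implicit none
  integer, intent(in) :: niter, n(niter)
  real*8, allocatable :: work(:)
  integer :: k, nmax

! before: one ALLOCATE/DEALLOCATE pair per iteration
!   do k = 1, niter
!      allocate (work(n(k))); ...; deallocate (work)
!   end do

! after: a single allocation sized to the peak requirement
  nmax = maxval(n(1:niter))
  allocate (work(nmax))
  do k = 1, niter
     work(1:n(k)) = 0.0d0    ! each pass uses only the first n(k) elements
  end do
  deallocate (work)
end subroutine hoist_alloc
```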
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 3:40 pm  Post subject:
mecej4,
I am also at a loss to explain the huge time lost with /check.
I have been looking at CONLOR (it is called fewer times than CONLOX).
I was timing a segment that was taking about 60 seconds, and moved some code (zeroing an array) to a different part.
I now have the following code taking about 76 seconds for 21 passes, but without the line a1nbb = 0 it took 17 seconds!
Code:
nnn3 = max(2, nnn-1)
nbb = nnb*ncqrs
npw = ncqrs*max(2, nnn-1)
igam = max(5, nnn)*ncqrs
nbb3 = nbb*3
allocate (a1npw(3*npw), a1nbb(nbb3), args(ncqrs), f(igam))
a1npw = 0
a1nbb = 0
! do i = 1, 3*npw
!    a1npw(i) = zero
! end do
I was wondering if there were floating-point exception errors, so I moved the following loop into a separate subroutine; but it uses only 0.0025 seconds over all calls.
Code:
subroutine lqs_loop (ncqrs, root, e1, f)
   integer*4 ncqrs, lqs
   real*8 root(*), e1(*), f(*)
   do lqs = 1, ncqrs
      root(lqs) = -4.0d0*root(lqs)*e1(lqs)
      f(lqs) = f(lqs)*root(lqs)
   end do
end subroutine lqs_loop
I have no explanation for this. Delays in getting allocated memory active, perhaps?
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 3:55 pm  Post subject:
I put in timings for segments of CONLOX and CONLOR using rdtsc_val@. With this I can monitor the elapsed times in these segments. It is hard to believe that the time can be taken up outside these segments that I am timing.
I am showing some time being spent in VGAMMA but much more time outside this call.
Delays in allocating memory pages as the arrays are being used (zeroed) is the best explanation I can get.

Code:
Id  Call Description       Calls          Ticks   Seconds
 0  Other time              4135    86522310412   30.9702
 1  <vgama> DO 1            2685    38384694245   13.7396
 2  <vgama> DO 2            2685    38802965904   13.8893
 3  <vgama> DO 3            2685    12861502266    4.6037
 4  <vgama> DO 4            2685    13906233666    4.9777
 5  <vgama> exit 5          2685    21003895422    7.5182
 6  <vgama> wexp 6          1306      316496509    0.1133
 7  <vgama> f= 7             112     2179037152    0.7800
 8  <vgama> f= 8            1194      208513310    0.0746
 9  <vgama> f= 9            1194      193226831    0.0692
10  <vgama> exit 10         1306      156840561    0.0561
16  <vgama> set tab 16         1        4527747    0.0016
21  <conlor> exit 21          21    19841940782    7.1023
22  <conlor> exit 22          21    50278686775   17.9970
23  <conlor> set f 23         21    15060933199    5.3910
24  <conlor> set f 24         21   167748801924   60.0448 **
25  <conlor> dorbax 25        21       99642680    0.0357
26  <conlor> exit 26          21       79447440    0.0284
31  <conlox> exit 31         234    22570256060    8.0789
32  <conlox> exit 32         234    82333831518   29.4709
33  <conlox> f = 33          234    25752846438    9.2181
34  <conlox> f = 34          234   180254180811   64.5210 **
35  <conlox> call dorbax     234      262701403    0.0940
36  <conlox> exit 36         234      361165824    0.1293
run time = 278.9071 seconds
The two ** entries can be reduced substantially by removing the zeroing of a1nbb. There is not much else happening there.
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sun Apr 01, 2018 7:35 pm  Post subject: Re:
JohnCampbell wrote:
"I am showing some time being spent in VGAMMA but much more time outside this call. Delays in allocating memory pages as arrays are being used (zeroed) is the best explanation I can get."
Please note that you are attributing the time spent to the Fortran source of VGAMMA, etc. In contrast, I think the time is being spent in the pointer- and subscript-checking code planted by the use of /check. Along the same lines, please note that allocating memory and zeroing it are done even when /check is not used, and then they take just a fraction of a second. If we had a profiler that could collect timings for the extra functions that check pointers, etc., we would be in a better position to understand what went wrong.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Mon Apr 02, 2018 12:57 am  Post subject:
Mecej4,
I have confirmed that there is a significant proportion of the time being taken up in initialising arrays after allocation. I think this relates to the way that /check manages allocatable arrays.
For example, in conlor, for the loop

Code:
do i = 1, 3*nbb
   a1nbb(i) = zero
end do
If I replace this with "call zero_array(a1nbb,3*nbb)" and use /timing, most of the time is attributed to conlor rather than to zero_array. I used this routine in a few places, and all 12,447 calls to zero_array took 0.0024 seconds according to the /timing report. It was compiled with /64 /timing /check.
However, if I put explicit code to time this loop, using:

Code:
call jdc_time (24, '<conlor> set f 24')
do i = 1, 3*nbb
   a1nbb(i) = zero
end do
call jdc_time (42, '<conlor> zero 24')
This identifies significant time for this block of code.
When using call zero_array, /check is spending significant time preparing the array.
Basically, there are large delays when /check prepares the allocated array for use (memory page allocation or whatever). This looks to be a source, possibly the source, of the problem.
My test showed that the code between timer_24 and timer_42 was called 21 times and took 96 seconds.
There is a problem with /check in this respect.
John
I have retested this, and the times were substantially reduced when using array syntax inside a called subroutine. There does not appear to be significant overhead outside the do loop; the do loops themselves are causing the big delays.
timer_42 reduced to 0.0030 seconds when zero_array was called.
Code:
subroutine zero_array (a, n)
   integer n
   real*8 a(n)
   a = 0
end subroutine zero_array
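The contrast reduces to this pattern (a sketch; the relative costs are those observed in these tests under /64 /check, not guaranteed behaviour):

```fortran
subroutine zero_compare (a, n)
  implicit none
  integer n, i
  real*8 a(n)
! slow in these tests under /64 /check: the element-by-element loop,
! where every iteration passes through the checking code
  do i = 1, n
     a(i) = 0.0d0
  end do
! fast: whole-array assignment, apparently checked once
  a = 0.0d0
end subroutine zero_compare
```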
Last edited by JohnCampbell on Mon Apr 02, 2018 1:54 am; edited 2 times in total
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Mon Apr 02, 2018 1:01 am  Post subject:
John, do you know if /timing and /check have any mutual effects?
Anyway, your findings should be useful to Silverfrost as they identify and rectify the /check problem.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Mon Apr 02, 2018 1:06 am  Post subject: Re:
mecej4 wrote:
"John, do you know if /timing and /check have any mutual effects?"
Removing /timing has little effect on the run time, reducing it to 501 seconds for the 522-second test I just ran with /64 /timing /check.
I can package up what I have done and provide a link.
John
Last edited by JohnCampbell on Mon Apr 02, 2018 1:55 am; edited 1 time in total