forums.silverfrost.com Welcome to the Silverfrost forums
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sat Mar 31, 2018 1:51 pm  Post subject: Adding /check causes EXE to run 800 times slower
I have run into a problem using /check /64 with a program comprising around 4500 lines, spread over six files. The program reads a single input file, and produces a number of unformatted files. These unformatted files are intended for input to another program, but we need not be concerned with that program now.
Without /check, the program runs to completion in 0.4 seconds (32- or 64-bit, 8.10 compiler). With /check, in 32-bit mode, the program takes 1.0 s. With /check and 64-bit, the program takes 320 seconds!
I have tested the same program with other compilers, and it runs without any noticeable problems, in about 1 second, even with the equivalent of /undef.
Since the source code is rather long (4,500 lines), I have zipped the source files, the input data file, and four batch files that I used to build the program with FTN95 8.10. The Zip file is at:
https://www.dropbox.com/s/dw9c9v390yzxo9v/swmol.7z?dl=0
Thanks.
[P.S.] Perhaps the first step to take is to obtain the timings with the 8.2 or 8.3 versions of the compiler.
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
Posted: Sat Mar 31, 2018 3:42 pm  Post subject:
Thanks for the feedback.
This is a strange behaviour which I have confirmed with v8.10, v8.20 and v8.30.
The first suspect would be an anti-virus checker, but my test did not show any difference when it was switched off.
I will make a note that this needs investigating.
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
Posted: Sat Mar 31, 2018 3:57 pm  Post subject:
The problem is "fixed" when using /inhibit_check 4. This provides a temporary work-around and a starting point for us when investigating.
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sat Mar 31, 2018 4:21 pm  Post subject:
Thanks for the work-around.
However, since the help file says that "/inhibit_check 4" suppresses all pointer checking, I am puzzled: the code contains only allocatable variables and no pointer variables. (I realise that, behind the scenes, allocatable variables may be handled as pointer variables with some additional attributes.)
I found that the subroutine VGAMMA in swmol3.f90 is the main culprit. Compiling everything else with /64 /check, and just the VGAMMA subroutine with /64 /check /inhibit_check 4, reduces the run time from 320 s to about 4 s. (I used a Fortran source-file splitting utility on swmol3.f90.)
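For context, the distinction can be sketched as below. Whether FTN95's /check routes allocatables through the same pointer-checking path is exactly the open question here, so the comments are assumptions, not documented behaviour:

```fortran
program alloc_vs_ptr
  implicit none
  real*8, allocatable :: a(:)   ! allocatable: no POINTER attribute in source
  real*8, pointer     :: p(:)   ! pointer: the kind of variable /inhibit_check 4
  real*8, target      :: t(10)  !   is documented to stop checking

  allocate (a(10))
  a = 0.0d0            ! may still pass through pointer-style checks under
  p => t               !   /check if allocatables are implemented internally
  p = 1.0d0            !   as pointers with extra attributes (an assumption)
  deallocate (a)
end program alloc_vs_ptr
```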
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7933 Location: Salford, UK
Posted: Sat Mar 31, 2018 6:27 pm  Post subject:
Thanks.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 2:35 am  Post subject:
I tried to see if /timing would provide some answers.
I used/guessed the following build for /64:

Code:
now >ftn95.tce
del *.obj >>ftn95.tce
del *.mod >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,
ftn95 precisn.f90 /64 /timing >> ftn95.tce
ftn95 blasint.f90 /64 /timing >> ftn95.tce
ftn95 gdata.f90 /64 /timing >> ftn95.tce
ftn95 ipack.f90 /64 /timing >> ftn95.tce
ftn95 swdata.f90 /64 /timing /check >> ftn95.tce
ftn95 swmol3.f90 /64 /timing /check >> ftn95.tce
slink64 swmol3.obj swdata.obj ipack.obj gdata.obj blasint.obj precisn.obj /stack_size 10000000 /file swmo64.exe >> ftn95.tce
swmo64 >> ftn95.tce
This produced file swmo64.tmo, with the first few lines being:

Code:
Name,CallsTo,CallsFrom,PageFaults,CPUsec,iCallsTo,iCallsFrom,iCPUsec
CONLOX,234,2466,0,155.3824,119409,119175,195.2694
CONLOR,21,147,0,130.3098,18406,18385,156.6794
VGAMMA,3880,22674,0,68.8965,26554,22674,68.9070
TWOEL,1,135,0,31.6267,231015,231014,397.7422
DRIVE,120,1551,0,14.1037,230994,230874,366.1142
ONELH,1,3973,0,0.5578,26443,26442,3.3684
This shows the delays could be occurring in CONLOX and CONLOR.
I do find that when a large stack is required with FTN95, this can indicate problems.
I would try to find any large local or temporary arrays that could demand stack and convert them to ALLOCATE.
I would recommend using /timing as a way of locating potential problems.
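As a sketch of that conversion (the routine name and bounds here are illustrative, not taken from the program):

```fortran
subroutine large_array_sketch (n)
  implicit none
  integer, intent(in) :: n
! before: a large automatic array, which lands on the stack and can
! force a /stack_size adjustment at link time
!   real*8 :: work(n, n)
! after: heap allocation via ALLOCATE, so no stack is consumed
  real*8, allocatable :: work(:,:)

  allocate (work(n, n))
  work = 0.0d0
  ! ... use work as before ...
  deallocate (work)
end subroutine large_array_sketch
```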
I hope this helps.
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 3:14 am  Post subject:
Paul,
Regarding the use of /TIMING, I am puzzled by the "Entering-10 second timer calibration loop" message.
It is possible to calibrate real*8 function cpu_clock@ or INTEGER*8 FUNCTION RDTSC_VAL@() against SYSTEM_CLOCK or QueryPerformanceCounter/QueryPerformanceFrequency in only a few ticks of these reference timers, assuming that is what is taking place.
I also find that the clock rate = 1024 * QueryPerformanceFrequency on all processors I have available, although I am not sure how long this has been the case. This approximation actually appears to be more accurate than any calibration test!
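A minimal sketch of such a short calibration, assuming RDTSC_VAL@() can be declared as below (FTN95 allows @ in external names; the spin length of a few SYSTEM_CLOCK ticks is the point being made above):

```fortran
program calibrate_tsc
  implicit none
  integer*8, external :: rdtsc_val@
  integer*8 :: c0, c1, rate, t0, t1
  real*8 :: ticks_per_count

  call system_clock (c0, rate)
  t0 = rdtsc_val@()
  do                          ! spin for just a few reference-timer ticks
    call system_clock (c1)
    if (c1 - c0 >= 4) exit
  end do
  t1 = rdtsc_val@()
  ticks_per_count = dble(t1 - t0) / dble(c1 - c0)
  write (*,*) 'RDTSC ticks per SYSTEM_CLOCK count:', ticks_per_count
end program calibrate_tsc
```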
I do find /TIMING to be a very useful feature of FTN95, especially for identifying problem areas of code.
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 1:52 pm  Post subject:
I further investigated the performance and found some results that might help.
I compiled with /64 /check /timing on an i5-4200, so these results might differ a little, but not much.
With these compile options and Ver 8.10, I found most time was spent in these routines:
CONLOR (lines 2410:2476): 180 seconds
CONLOX (lines 2477:2541): 218 seconds
VGAMMA (lines 1736:1850): 75 seconds
Strangely, most of the time in CONLOX (161 seconds) was spent in lines 2522:2535:

Code:
do lqs = 1, ncqrs
   f(lqs) = f(lqs)*root(lqs)
end do
ib = 0
do i = 2, nnn
   ib = ib + ncqrs
   do lqs = 1, ncqrs
      root(lqs) = -4._wp*root(lqs)*e1(lqs)
      f(lqs+ib) = f(lqs+ib)*root(lqs)
   end do
end do
do i = 1, 3*nbb
   a1nbb(i) = zero
end do
Also in CONLOR at lines 2457:2470.
Changing from an i5-4200 and Ver 8.10 to an i5-2300 and Ver 8.20 reduced the total run time from 530 seconds to 330 seconds; removing the /timing option then reduced it to 320 seconds. (The processor change is the most significant factor.)
It is surprising that this code has so much delay.
Changing "-4._wp" had no effect.
Also, in routine readin at line 71, if array "kabs" is made allocatable, this removes 224 Mb from the stack, so no stack adjustment is then required.
To get this timing breakdown, I supplemented /timing with my own timer based on RDTSC_VAL@(). I could send the modified code if required.
John
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sun Apr 01, 2018 3:17 pm  Post subject:
John, thanks for the timing analysis. As it happens, I am more interested in getting the unexpected huge slowdown caused by /check fixed. We are used to seeing /check double the run time of a program, and sometimes increase it by a factor of 10. But 800?
At this point, I am not interested in timing and speeding up the UKRMOL program itself. Someone posted a question about the program crashing in the Intel Fortran for Linux forum. I used FTN95 to catch a couple of uninitialised variables and incorrect usages of logical expressions in the program. The version that was provided is probably not the current one, since one has to apply to obtain the source code of UKRMOL, and I have not done so.
You measured the total time spent in routines, including the incremental time for the additional checking. You concluded that CONLOX and CONLOR were the routines where the most time was spent.
I find that compiling just those two subroutines with the additional option /inhibit_check 4 did not do much to reduce the run time, but doing the same with VGAMMA essentially solved the problem. Thus, my conclusion is that the overhead of /check spikes for VGAMMA and not for CONLOX and CONLOR.
However, those routines, as well as ONELH, each contain a call to VGAMMA. Thus, if you looked at inclusive timings, you would think that CONLOX, CONLOR and ONELH are where problems occur when /check is used. The problem, however, lies elsewhere.
I think that what the compiler developers may have to look at are the calls to routines such as __pcheck() that are made from VGAMMA when /check is used.
There are many ALLOCATE statements inside DO loops in VGAMMA and other routines. These are possibly the result of the UKRMOL authors having used automatic F77 -> F90 conversion utilities. Estimating the peak allocations and making one-time allocations outside the loops should remove this overhead. However, the program normally runs in less than a second, so there is no pressing need to do this.
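A sketch of that hoisting (names and bounds are illustrative, not taken from UKRMOL):

```fortran
subroutine hoist_alloc (niter, n)
  implicit none
  integer, intent(in) :: niter, n(niter)
  real*8, allocatable :: work(:)
  integer :: k, nmax

! before: one ALLOCATE/DEALLOCATE pair per iteration
!   do k = 1, niter
!      allocate (work(n(k))); ...; deallocate (work)
!   end do

! after: a single allocation sized to the peak requirement
  nmax = maxval(n(1:niter))
  allocate (work(nmax))
  do k = 1, niter
     work(1:n(k)) = 0.0d0    ! each pass uses only the first n(k) elements
  end do
  deallocate (work)
end subroutine hoist_alloc
```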
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 3:40 pm  Post subject:
mecej4,
I am also at a loss to explain the huge time lost with /check.
I have been looking at CONLOR (it is called fewer times than CONLOX).
I was timing a segment that was taking about 60 seconds, and moved some code (zeroing an array) to a different part.
I now have the following code taking about 76 seconds for 21 passes, but without the line a1nbb = 0 it took 17 seconds!
Code:
nnn3 = max(2, nnn-1)
nbb = nnb*ncqrs
npw = ncqrs*max(2, nnn-1)
igam = max(5, nnn)*ncqrs
nbb3 = nbb*3
allocate (a1npw(3*npw), a1nbb(nbb3), args(ncqrs), f(igam))
a1npw = 0
a1nbb = 0
! do i = 1, 3*npw
!    a1npw(i) = zero
! end do
I was wondering if there were floating-point exception errors, so I moved the following loop into a separate subroutine; but it uses only 0.0025 seconds over all calls.
Code:
subroutine lqs_loop (ncqrs, root, e1, f)
   integer*4 ncqrs, lqs
   real*8 root(*), e1(*), f(*)
   do lqs = 1, ncqrs
      root(lqs) = -4.0d0*root(lqs)*e1(lqs)
      f(lqs) = f(lqs)*root(lqs)
   end do
end subroutine lqs_loop
I have no explanation for this. Delays in getting allocated memory active, perhaps?
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Sun Apr 01, 2018 3:55 pm  Post subject:
I put in timings for segments of CONLOX and CONLOR using rdtsc_val@. With this I can monitor the elapsed times in these segments. It is hard to believe that the time can be taken up outside these segments that I am timing.
I am showing some time being spent in VGAMMA but much more time outside this call.
Delays in allocating memory pages as the arrays are being used (zeroed) is the best explanation I can get.

Code:
Id  Call Description       Calls          Ticks   Seconds
 0  Other time              4135    86522310412   30.9702
 1  <vgama> DO 1            2685    38384694245   13.7396
 2  <vgama> DO 2            2685    38802965904   13.8893
 3  <vgama> DO 3            2685    12861502266    4.6037
 4  <vgama> DO 4            2685    13906233666    4.9777
 5  <vgama> exit 5          2685    21003895422    7.5182
 6  <vgama> wexp 6          1306      316496509    0.1133
 7  <vgama> f= 7             112     2179037152    0.7800
 8  <vgama> f= 8            1194      208513310    0.0746
 9  <vgama> f= 9            1194      193226831    0.0692
10  <vgama> exit 10         1306      156840561    0.0561
16  <vgama> set tab 16         1        4527747    0.0016
21  <conlor> exit 21          21    19841940782    7.1023
22  <conlor> exit 22          21    50278686775   17.9970
23  <conlor> set f 23         21    15060933199    5.3910
24  <conlor> set f 24         21   167748801924   60.0448 **
25  <conlor> dorbax 25        21       99642680    0.0357
26  <conlor> exit 26          21       79447440    0.0284
31  <conlox> exit 31         234    22570256060    8.0789
32  <conlox> exit 32         234    82333831518   29.4709
33  <conlox> f = 33          234    25752846438    9.2181
34  <conlox> f = 34          234   180254180811   64.5210 **
35  <conlox> call dorbax     234      262701403    0.0940
36  <conlox> exit 36         234      361165824    0.1293
run time = 278.9071 seconds
The two ** entries can be reduced substantially by removing the zeroing of a1nbb. There is not much else happening there.
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Sun Apr 01, 2018 7:35 pm  Post subject: Re:
JohnCampbell wrote:
"I am showing some time being spent in VGAMMA but much more time outside this call. Delays in allocating memory pages as arrays are being used (zeroed) is the best explanation I can get."
Please note that you are attributing the time spent to the Fortran source of VGAMMA, etc. In contrast, I think the time is being spent in the pointer- and subscript-checking code planted by the use of /check. Along the same lines, please note that allocating memory and zeroing it are done even when /check is not used, and then they take just a fraction of a second. If we had a profiler that could collect timings for the extra functions that check pointers, etc., we would be in a better position to understand what went wrong.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Mon Apr 02, 2018 12:57 am  Post subject:
Mecej4,
I have confirmed that there is a significant proportion of the time being taken up in initialising arrays after allocation. I think this relates to the way that /check manages allocatable arrays.
For example, in conlor, for the loop

Code:
do i = 1, 3*nbb
   a1nbb(i) = zero
end do
If I replace this with "call zero_array(a1nbb,3*nbb)" and use /timing, most of the time is attributed to conlor rather than to zero_array. I used this routine in a few places, and all 12,447 calls to zero_array took 0.0024 seconds according to the /timing report. It was compiled with /64 /timing /check.
However, if I put explicit code to time this loop, using:

Code:
call jdc_time (24, '<conlor> set f 24')
do i = 1, 3*nbb
   a1nbb(i) = zero
end do
call jdc_time (42, '<conlor> zero 24')
This identifies significant time for this block of code.
When using call zero_array, /check is spending significant time preparing the array.
Basically, there are large delays when /check prepares the allocated array for use (memory page allocation or whatever). This looks to be a source, possibly the source, of the problem.
My test showed that the code between timer_24 and timer_42 was called 21 times and took 96 seconds.
There is a problem with /check in this respect.
John
I have retested this, and the times were substantially reduced when using array syntax inside a called subroutine. There does not appear to be significant overhead outside the do loop; the do loops themselves are causing the big delays.
timer_42 reduced to 0.0030 seconds when zero_array was called.
Code:
subroutine zero_array (a, n)
   integer n
   real*8 a(n)
   a = 0
end subroutine zero_array
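The contrast reduces to this pattern (a sketch; the relative costs are those observed in these tests under /64 /check, not guaranteed behaviour):

```fortran
subroutine zero_compare (a, n)
  implicit none
  integer n, i
  real*8 a(n)
! slow in these tests under /64 /check: the element-by-element loop,
! where every iteration passes through the checking code
  do i = 1, n
     a(i) = 0.0d0
  end do
! fast: whole-array assignment, apparently checked once
  a = 0.0d0
end subroutine zero_compare
```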
Last edited by JohnCampbell on Mon Apr 02, 2018 1:54 am; edited 2 times in total
mecej4
Joined: 31 Oct 2006 Posts: 1891
Posted: Mon Apr 02, 2018 1:01 am  Post subject:
John, do you know if /timing and /check have any mutual effects?
Anyway, your findings should be useful to Silverfrost as they identify and rectify the /check problem.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2556 Location: Sydney
Posted: Mon Apr 02, 2018 1:06 am  Post subject: Re:
mecej4 wrote:
"John, do you know if /timing and /check have any mutual effects?"
Removing /timing has little effect on the run time, reducing it to 501 seconds for the 522-second test I just ran with /64 /timing /check.
I can package up what I have done and provide a link.
John
Last edited by JohnCampbell on Mon Apr 02, 2018 1:55 am; edited 1 time in total