Silverfrost Forums


Adding /check causes EXE to run 800 times slower

31 Mar 2018 12:51 #21701

I have run into a problem using /check /64 with a program comprising around 4500 lines, spread over six files. The program reads a single input file, and produces a number of unformatted files. These unformatted files are intended for input to another program, but we need not be concerned with that program now.

Without /check, the program runs to completion in 0.4 seconds (32- or 64-bit, 8.10 compiler). With /check, in 32-bit mode, the program takes 1.0 s. With /check and 64-bit, the program takes 320 seconds!

I have tested the same program with other compilers, and it runs without any noticeable problems, in about 1 second, even with the equivalent of /undef.

Since the source code is rather long (4,500 lines), I have zipped the source files, the input data file, and four batch files that I used to build the program with FTN95 8.10. The Zip file is at:

 https://www.dropbox.com/s/dw9c9v390yzxo9v/swmol.7z?dl=0

Thanks.

[P.S.] Perhaps the first step to take is to obtain the timings with the 8.2 or 8.3 versions of the compiler.

31 Mar 2018 2:42 #21702

Thanks for the feedback.

This is a strange behaviour which I have confirmed with v8.10, v8.20 and v8.30.

The first suspect would be an anti-virus checker, but my test did not show any difference with it switched off.

I will make a note that this needs investigating.

31 Mar 2018 2:57 #21703

The problem is 'fixed' when using /inhibit_check 4. This provides a temporary work-around and a starting point for us when investigating.

31 Mar 2018 3:21 #21704

Thanks for the work-around.

However, since the help file says that '/inhibit_check 4' suppresses all pointer checking, I am puzzled; the code contains only allocatable variables and no pointer variables (I realise that, behind the scenes, allocatable variables may be handled as pointer variables with some additional attributes).

I found that the subroutine VGAMMA in swmol3.f90 is the main culprit. Compiling everything else with /64 /check and just the VGAMMA subroutine with /64 /check /inhibit_check 4 reduces the run time from 320 s to about 4 s. (I used a Fortran source file splitting utility on swmol3.f90.)

31 Mar 2018 5:27 #21705

Thanks.

1 Apr 2018 1:35 #21706

I tried to see if /timing would provide some answers. I used/guessed the following build for /64:

now > ftn95.tce
del *.obj >> ftn95.tce
del *.mod >> ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,

ftn95 precisn.f90 /64 /timing  >> ftn95.tce
ftn95 blasint.f90 /64 /timing >> ftn95.tce
ftn95 gdata.f90 /64 /timing >> ftn95.tce
ftn95 ipack.f90 /64 /timing >> ftn95.tce
ftn95 swdata.f90 /64 /timing /check >> ftn95.tce
ftn95 swmol3.f90 /64 /timing /check >> ftn95.tce

slink64 swmol3.obj swdata.obj ipack.obj gdata.obj blasint.obj precisn.obj /stack_size 10000000 /file swmo64.exe   >> ftn95.tce

swmo64   >> ftn95.tce

This produced file swmo64.tmo, with the first few lines being:

Name,CallsTo,CallsFrom,PageFaults,CPUsec,iCallsTo,iCallsFrom,iCPUsec
CONLOX,234,2466,0,155.3824,119409,119175,195.2694
CONLOR,21,147,0,130.3098,18406,18385,156.6794
VGAMMA,3880,22674,0,68.8965,26554,22674,68.9070
TWOEL,1,135,0,31.6267,231015,231014,397.7422
DRIVE,120,1551,0,14.1037,230994,230874,366.1142
ONELH,1,3973,0,0.5578,26443,26442,3.3684

This shows the delays could be occurring in CONLOX and CONLOR.

I do find that when stack is required for FTN95, this can indicate problems. I would try to find any large local or temporary arrays that could demand stack and convert to ALLOCATE.
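
As an illustration of what I mean (a minimal sketch with made-up names and sizes, not the actual declarations in the program), the change is of this form:

      subroutine demo
!        real*8 :: work(30000000)           ! fixed-size local array: needs roughly 229 Mb of stack
         real*8, allocatable :: work(:)     ! allocatable version: the memory comes from the heap instead
         allocate (work(30000000))
         work = 0
!        ... use work ...
         deallocate (work)                  ! optional; a local allocatable is released at exit anyway
      end subroutine demo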

I would recommend using /timing as a way of locating potential problems.

I hope this helps.

John

1 Apr 2018 2:14 #21707

Paul,

Regarding the use of /TIMING, I am puzzled by the 'Entering 10 second timer calibration loop' message.

It is possible to calibrate the real*8 function cpu_clock@ or the INTEGER*8 FUNCTION RDTSC_VAL@() against SYSTEM_CLOCK or 'QueryPerformanceCounter' / 'QueryPerformanceFrequency' in only a few ticks of these reference timers, assuming that is what is taking place. I also find the clock rate = 1024. * QueryPerformanceFrequency on all processors I have available, although I am not sure how long this has been the case. This approximation actually appears to be more accurate than any calibration test!
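
For illustration, that sort of quick calibration could look something like the sketch below (not tested; it assumes RDTSC_VAL@ can be declared and referenced like this and that SYSTEM_CLOCK accepts INTEGER*8 arguments):

      integer*8 :: rdtsc_val@                ! FTN95 processor tick counter (external function)
      integer*8 :: t1, t2, c1, c2, rate
      call system_clock (c1, rate)           ! standard Fortran reference timer
      t1 = rdtsc_val@()
      c2 = c1
      do while (c2 - c1 < rate/100)          ! spin for ~0.01 s of the reference timer
         call system_clock (c2)
      end do
      t2 = rdtsc_val@()
      print *, 'estimated processor ticks per second =', dble(t2 - t1)*dble(rate)/dble(c2 - c1)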

I do find /TIMING to be a very useful feature of FTN95, especially for identifying problem areas of code.

John

1 Apr 2018 12:52 #21708

I further investigated the performance and found some results that might help. I compiled with /64 /check /timing on an i5-4200, so these results might be a bit different, but not by much. With these compile options and Ver 8.10, I found most time was spent in the routines:

CONLOR (lines 2410:2476)  180 seconds
CONLOX (lines 2477:2541)  218 seconds
VGAMMA (lines 1736:1850)   75 seconds

Strangely, the most time in CONLOX (161 seconds) was spent in lines 2522:2535:

      do lqs = 1, ncqrs
         f(lqs) = f(lqs)*root(lqs)
      end do
      ib = 0
      do i = 2, nnn
         ib = ib + ncqrs
         do lqs = 1, ncqrs
            root(lqs) = -4._wp*root(lqs)*e1(lqs)
            f(lqs+ib) = f(lqs+ib)*root(lqs)
         end do
      end do
      do i = 1, 3*nbb
         a1nbb(i) = zero
      end do

Also in CONLOR at lines 2457:2470

Changing from the i5-4200 and Ver 8.10 to an i5-2300 and Ver 8.20 reduced the total run time from 530 seconds to 330 seconds; removing the /timing option then reduced it to 320 seconds (the processor change is the most significant factor). It is surprising that this code has so much delay. Changing '-4._wp' had no effect.

Also, in routine readin at line 71, if array 'kabs' is made allocatable, this removes 224 Mb from the stack, so no stack adjustment is then required.

To get this timing breakdown, I supplemented /timing with my own timer based on RDTSC_VAL@(). I could send the modified code if required,

John

1 Apr 2018 2:17 #21709

John, thanks for the timing analysis. As it happens, I am more interested in getting the unexpected huge slowdown caused by /check fixed. We are used to seeing /check double the run time of a program, and sometimes seeing the run time increase by a factor of 10. But 800?

At this point, I am not interested in timing and attempting to speed up the UKRMOL program. Someone posted a question in the Intel Fortran for Linux forum about the program crashing. I used FTN95 to catch a couple of uninitialised variables and incorrect usages of logical expressions in the program. The version that was provided is probably not the current one, since one has to apply to obtain the source code of UKRMOL, and I have not done so.

You measured the total time spent in routines, including the incremental time for the additional checking. You concluded that CONLOX and CONLOR were the routines where the most time was spent.

I find that compiling just those two subroutines with the additional option of /inhibit_check 4 did not do much to reduce the run time, but doing the same with VGAMMA essentially solved the problem. Thus, my conclusion is that the overhead of /check spikes for VGAMMA and not CONLOX and CONLOR.

However, those routines, as well as ONELH, each contain a call to VGAMMA. Thus, if you looked at inclusive timings, you would think that CONLOX, CONLOR and ONELH are where problems occur when /check is used. The problem, however, lies elsewhere.

I think that what the compiler developers may have to look at are the calls to routines such as __pcheck() that are made from VGAMMA when /check is used.

There are many ALLOCATE statements inside DO loops in VGAMMA and other routines. These are possibly the results of the UKRMOL authors having used some automatic F77 → F90 conversion utilities. Estimating the peak allocations and making one-time allocations outside the loops should solve this problem. However, the program runs in less than a second, so there is no pressing need to do this.
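
If it ever were needed, the restructuring would be roughly of this form (schematic only, with made-up names; the real loops in VGAMMA are more involved):

!     allocation inside the loop: one ALLOCATE/DEALLOCATE pair per iteration
!        do k = 1, niter
!           allocate (tmp(nsz(k)))
!           ...
!           deallocate (tmp)
!        end do

!     one-time allocation at the estimated peak size, reused on every iteration
      allocate (tmp(maxval(nsz(1:niter))))
      do k = 1, niter
!        ... use tmp(1:nsz(k)) ...
      end do
      deallocate (tmp)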

1 Apr 2018 2:40 #21711

mecej4,

I am also at a loss to explain the huge time lost with /check.

I have been looking at CONLOR (it is called fewer times than CONLOX). I was timing a segment that was taking about 60 seconds and moved some code (zeroing an array) to a different part. I now have the following code taking about 76 seconds for 21 passes, but without the line a1nbb = 0 it took 17 seconds!

       nnn3 = max(2, nnn-1)
       nbb = nnb*ncqrs
       npw = ncqrs*max(2, nnn-1)
       igam = max(5, nnn)*ncqrs
       nbb3 = nbb*3
       allocate (a1npw(3*npw), a1nbb(nbb3), args(ncqrs), f(igam))
        a1npw = 0
        a1nbb = 0
!       do i = 1, 3*npw
!          a1npw(i) = zero
!       end do

I was wondering if there were floating-point exception errors, so I moved the following code into its own subroutine, but it uses only 0.0025 seconds over all calls.

    subroutine lqs_loop ( ncqrs, root, e1, f )
      integer*4 ncqrs, lqs
      real*8 root(*), e1(*), f(*)

          do lqs = 1, ncqrs
             root(lqs) = -4.0d0*root(lqs)*e1(lqs)
             f(lqs)    = f(lqs)*root(lqs)
          end do
    end subroutine lqs_loop

I have no explanation for this: delays in getting allocated memory active?

John

1 Apr 2018 2:55 #21713

I put in timings for segments of CONLOX and CONLOR using rdtsc_val@. With this I can monitor the elapsed times in these segments. It is hard to believe that the time can be taken up outside these segments that I am timing. I am showing some time being spent in VGAMMA but much more time outside this call. Delays in allocating memory pages as arrays are being used (zeroed) is the best explanation I can get.

Id  Call Description        Calls         Ticks  Seconds
 0  Other time               4135   86522310412  30.9702
 1  <vgama> DO 1             2685   38384694245  13.7396
 2  <vgama> DO 2             2685   38802965904  13.8893
 3  <vgama> DO 3             2685   12861502266   4.6037
 4  <vgama> DO 4             2685   13906233666   4.9777
 5  <vgama> exit 5           2685   21003895422   7.5182
 6  <vgama> wexp 6           1306     316496509   0.1133
 7  <vgama> f= 7              112    2179037152   0.7800
 8  <vgama> f= 8             1194     208513310   0.0746
 9  <vgama> f= 9             1194     193226831   0.0692
10  <vgama> exit 10          1306     156840561   0.0561
16  <vgama> set tab 16          1       4527747   0.0016
21  <conlor> exit 21           21   19841940782   7.1023
22  <conlor> exit 22           21   50278686775  17.9970
23  <conlor> set f 23          21   15060933199   5.3910
24  <conlor> set f 24          21  167748801924  60.0448 **
25  <conlor> dorbax 25         21      99642680   0.0357
26  <conlor> exit 26           21      79447440   0.0284
31  <conlox> exit 31          234   22570256060   8.0789
32  <conlox> exit 32          234   82333831518  29.4709
33  <conlox> f = 33           234   25752846438   9.2181
34  <conlox> f = 34           234  180254180811  64.5210 **
35  <conlox> call dorbax      234     262701403   0.0940
36  <conlox> exit 36          234     361165824   0.1293

run time = 278.9071 seconds

The two ** can be reduced substantially by removing the zeroing of a1nbb. There is not much else happening there.
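
For anyone who wants to reproduce this, the idea behind these segment timers is roughly as follows (a simplified sketch of my spot_time approach; the real spot_timer_64.f90 also stores the labels and prints a report at the end of the run):

      module spot_data
         integer*8 :: ticks(0:50) = 0          ! accumulated processor ticks per segment id
         integer*8 :: last_tick = 0
      end module spot_data

      subroutine spot_time (id, label)
         use spot_data
         integer,      intent(in) :: id
         character(*), intent(in) :: label     ! kept for the final report (not shown here)
         integer*8 :: now, rdtsc_val@
         now = rdtsc_val@()
         if (last_tick /= 0) ticks(id) = ticks(id) + (now - last_tick)   ! charge the interval to this segment
         last_tick = now
      end subroutine spot_time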

1 Apr 2018 6:35 #21717

Quoted from JohnCampbell I am showing some time being spent in VGAMMA but much more time outside this call. Delays in allocating memory pages as arrays are being used (zeroed) is the best explanation I can get.

Please note that you are attributing the time spent to the Fortran source in VGAMMA, etc. In contrast, I think the time is being spent in the pointer and subscript checking code planted by the use of /check. Along the same lines, note that allocating and zeroing the memory is also done when /check is not used, and then it takes just a fraction of a second. If we had a profiler that could collect timings for the extra functions that check pointers, etc., we would be in a better position to understand what went wrong.

1 Apr 2018 11:57 (Edited: 2 Apr 2018 12:54) #21718

Mecej4,

I have confirmed that a significant proportion of the time is being taken up in initialising arrays after allocation. I think this relates to the way that /check manages allocatable arrays. For example, in conlor, for the loop

      do i = 1, 3*nbb
         a1nbb(i) = zero
      end do

if I replace this with 'call zero_array(a1nbb, 3*nbb)' and use /timing, most of the time is attributed to conlor rather than to zero_array. I used this routine in a few places and all 12,447 calls to zero_array took 0.0024 seconds according to the /timing report. It was compiled with /64 /timing /check.

However, if I put explicit code to time this loop, using

      call jdc_time (24, '<conlor> set f 24')
      do i = 1, 3*nbb
         a1nbb(i) = zero
      end do
      call jdc_time (42, '<conlor> zero 24')

this identifies significant time for this block of code. When using call zero_array, /check is spending significant time preparing the array. Basically, there are large delays when /check prepares the allocated array for use (memory page allocation or whatever). This looks to be the/a source of the problem. My test showed that the code between timer_24 and timer_42 was called 21 times and took 96 seconds.

There is a problem with /check in this respect.

John

I have retested this and the times were substantially reduced with array syntax in a call. There does not appear to be significant overhead outside the do loop; the do loops themselves are causing the big delays. timer_42 reduced to 0.0030 seconds when zero_array was called:

      subroutine zero_array (a, n)
         integer n
         real*8 a(n)
         a = 0
      end subroutine zero_array

2 Apr 2018 12:01 #21719

John, do you know if /timing and /check have any mutual effects?

Anyway, your findings should be useful to Silverfrost as they identify and rectify the /check problem.

2 Apr 2018 12:06 (Edited: 2 Apr 2018 12:55) #21720

Quoted from mecej4 John, do you know if /timing and /check have any mutual effects?

Removing /timing has little effect on the run time; it reduced to 501 seconds from the 522-second test I just ran with /64 /timing /check.

I can package up what I have done and provide the link.

John

2 Apr 2018 12:24 #21721

Thanks; please provide the link.

2 Apr 2018 2:57 #21724

This has been fixed (after a fashion) for the next release of clearwin64.dll (not the one released today via this forum).

The failure was in __pcheck which was doing far too much validating.

__pcheck checks that ALLOCATEd store is not accessed after it has been DEALLOCATEd.
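
For anyone unfamiliar with this check, the kind of misuse it exists to trap is along these lines (a trivial sketch, not code from the test program):

      real*8, allocatable :: a(:)
      allocate (a(100))
      a = 1
      deallocate (a)
      print *, a(5)       ! access after DEALLOCATE: with /check this should be reported at run time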

In the longer term we will need a better strategy for this particular check.

2 Apr 2018 3:48 #21725

Thanks for the information, Paul.

In the short term, with 8.10 and earlier, would it hurt to put /inhibit_check 4 in FTN95.cfg, if I am sure that my codes have no variables that have the pointer attribute?

Many of the allocatable variables in the test program are local subprogram variables, so these should get automatically deallocated at subprogram exit. For those variables, the checking (for incorrect use after they are DEALLOCATEd) should be stopped at subprogram exit, because those variables do not exist thereafter.
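
To illustrate, in a case like the following sketch (hypothetical code), the array ceases to exist when the subroutine returns, so there should be nothing left for the check to track afterwards:

      subroutine work (n)
         integer, intent(in) :: n
         real*8, allocatable :: buf(:)      ! local allocatable, no SAVE attribute
         allocate (buf(n))
!        ... use buf ...
      end subroutine work                   ! buf is deallocated automatically on return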

2 Apr 2018 8:19 #21726

Just to be clear, it's not just the POINTER attribute but also the ALLOCATABLE attribute that is involved - so anything that is ALLOCATEd.

If /config does the trick then all well and good.

We may be close to a decent fix, which will be in clearwin64.dll and can be released very soon. If you are not up to date with v8.30 then using the latest DLLs may work - it would just be a case of testing after taking a careful backup of your existing DLLs.

3 Apr 2018 9:57 #21727

The following link is the test directory that I have been using to test the program.

https://www.dropbox.com/s/m504de68txw8br0/swmo_test.zip?dl=0

There are a 'few' tests in the directory, but the latest files to look at are:

swmol3.f90 is my latest modified version, which includes calls to my spot_timer_64.f90 code that uses rdtsc_val@ to monitor ticks of the code.
swmol3_orig.f90 is the original code.
cmp_changes.tce lists the differences between the two.

I made 3 other changes to the code:

changed the KABS array to be allocatable (it is a 224 Mb local array)
changed fort.2 to fort.12 to remove a write (*,*) problem
changed 'real (kind=wp), dimension (nsiz+1, nndim), save :: tab' to be in COMMON

For testing I have used /timing and also included my routine spot_time, which allows timing of selected code inside a routine (the calls have to be inserted into the code).

The most glaring slowdown is at lines 2601:2605:

      call spot_time (24, '<conlor> set f 24')
      do i = 1, 3*nbb
         a1nbb(i) = zero
      end do
      call spot_time (43, '<conlor> zero f 24')

I have calculated that each iteration of the do loop can take up to 150,000 processor cycles!

I hope this helps

The latest .bat file to do the test is 'build_64.bat'

John
