|
forums.silverfrost.com Welcome to the Silverfrost forums
|
mecej4
Joined: 31 Oct 2006 Posts: 1895
|
Posted: Wed Apr 14, 2021 4:17 pm Post subject: Perplexing bug in program (or compiler?) |
|
|
I have been working on a groundwater flow program with about 12,000 lines of Fortran code. FTN95 (V 8.71) has been quite helpful in finding and fixing bugs related to subscript bounds, uninitialised variables, etc. However, in one run, I encountered strange program behaviour.
When I compiled the program with /checkmate and built a 32-bit EXE, the program ran for a few seconds and then aborted with INTEGER OVERFLOW on a line that contains just a subroutine call:
Code: | call abmult(Ap(1,l),Bp(:,l)) |
When I compiled the same program with /64 /checkmate to build a 64-bit EXE, the behaviour became stranger still. When I ran the EXE from the command line, the small Windows hourglass cursor appeared and, after up to ten seconds, the program quit with no messages of any kind. When I ran the same EXE from SDBG64, execution stopped after about one second, reporting:
Code: | Error: Access Violation writing address 0x000000000AC40000 |
The line of code on which this happens is just a line containing an argument variable declaration, and reads:
Code: | real(kind=KDP) , dimension(*) , intent(out) :: X |
[Added on 16 April] Looking at the assembler code in the area where the access violation occurs reveals that an attempt is made to store the marker byte value Z'80' (= 'undefined') into Z'653BFFFF3588' bytes of memory starting at address Z'0AC40000' . That byte count is too large to be represented in 32 bits, which may explain the integer overflow encountered by the 32-bit EXE.
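The size argument above can be checked with a few lines of Fortran. This is only a minimal sketch of the arithmetic: the program name is made up for illustration, and the only value taken from the post is the byte count Z'653BFFFF3588'.

```fortran
program byte_count_check
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  ! Byte count quoted in the post, held in a 64-bit integer
  integer(int64), parameter :: nbytes = int(Z'653BFFFF3588', int64)

  print *, 'byte count to fill     =', nbytes
  print *, 'largest 32-bit integer =', huge(1_int32)
  ! The comparison is false: the count cannot be held in a 32-bit integer
  print *, 'fits in 32 bits?       ', nbytes <= int(huge(1_int32), int64)
end program byte_count_check
```

The count needs 47 bits, so any 32-bit computation of it would overflow, which is at least consistent with the INTEGER OVERFLOW reported by the 32-bit EXE.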
I have created a reproducer by removing about 50 percent of the code, but the reproducer is still a bit large -- about 6,000 lines. The source code, a required input data file, and batch files for building EXEs from the sources are contained in a Zip file which may be downloaded from Dropbox:
https://www.dropbox.com/s/az2yedspwjk5unk/asizbug.7z?dl=0
Last edited by mecej4 on Thu Apr 15, 2021 3:11 pm; edited 2 times in total |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Thu Apr 15, 2021 3:21 am Post subject: |
|
|
Would abmult need an interface? |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1895
|
Posted: Thu Apr 15, 2021 3:38 am Post subject: |
|
|
Thanks for taking a look. Good point.
An explicit interface may be required when an array section is passed as an actual argument. However, ABMULT is an internal (contained) subroutine of the caller, so the host routine already has its explicit interface available.
The issue persists if the call is changed to
Code: | CALL abmult(ap(1,L),bp(1,L)) |
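To illustrate the point about interfaces, here is a small sketch (not taken from the reproducer; the names host_interface_demo, fill_shape and fill_size are hypothetical). Internal procedures get an explicit interface from their host automatically, so both an array-section call to an assumed-shape dummy and an element call to an assumed-size dummy compile without any INTERFACE block.

```fortran
program host_interface_demo
  implicit none
  real :: a(4,0:2)

  a = 0.0
  ! Both calls work on column 1 of A; no INTERFACE block is written anywhere
  call fill_shape(a(:,1))     ! array section -> assumed-shape dummy
  call fill_size(a(1,1), 4)   ! first element -> assumed-size dummy (sequence association)
  print *, a(:,1)
contains
  subroutine fill_shape(x)
    real, intent(inout) :: x(:)   ! assumed shape: needs an explicit interface
    x = x + 1.0
  end subroutine fill_shape

  subroutine fill_size(x, n)
    integer, intent(in) :: n
    real, intent(inout) :: x(*)   ! assumed size: extent is not known to the callee
    integer :: i
    do i = 1, n
      x(i) = x(i) + 1.0
    end do
  end subroutine fill_size
end program host_interface_demo
```

Each element of column 1 is incremented once by each routine.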
|
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7938 Location: Salford, UK
|
Posted: Thu Apr 15, 2021 2:56 pm Post subject: |
|
|
This is my impression...
The argument ap(1,L) of abmult is a scalar whereas an array is expected.
If I change to ap(1,L:L), it is an array but only with one element.
Within abmult, x(irn) = s is putting values into multiple elements of the actual argument ap. So the destination is not valid. |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1895
|
Posted: Thu Apr 15, 2021 4:18 pm Post subject: |
|
|
Paul, your remarks relate to the F77 conventions for passing contiguous array sections and their variance from the F90+ conventions for the same. Indeed, the F2003 standard comments at length on these differences in section C.9.5.
However, every current Fortran compiler that we are likely to use today, including FTN95, handles such F77 style calls perfectly well. Here is a very short example that illustrates that capability. FTN95 has no problem with this code, nor does the NAG compiler. Similarly, if /check or /debug is used instead of /checkmate on the larger test code that I posted to Dropbox, there is no problem.
Code: | program aseccopy
implicit none
integer a(10), b(5)
integer i,n
!
n = 5
do i=1, n
a(i) = i
b(i) = i+5
end do
! F77 style call: argument association by location/address
call acopy(a(n+1),b(1),n) ! copy b into upper half of array A
print '(1x,I2,2x,I4)',(i,a(i),i=1,10)
stop
contains
subroutine acopy(a,b,n)
integer, intent(in) :: n
integer, dimension(n), intent(in) :: b
integer, dimension(n), intent(in out) :: a
integer i
do i=1,n
a(i) = b(i)
end do
return
end subroutine
end program |
In the CALL statement, the argument a(n+1) has the same meaning as a(n+1:). |
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7938 Location: Salford, UK
|
Posted: Thu Apr 15, 2021 5:03 pm Post subject: |
|
|
mecej4
My impression was that the first argument is a scalar or an array with one element. Whereas the routine is setting multiple elements for the corresponding dummy argument.
If that is correct then the code is at fault regardless of the compiler or the Fortran standard. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Fri Apr 16, 2021 11:34 am Post subject: |
|
|
For 32-bit version, I ran in SDBG.
I changed
routine init.f90 to report that ap was allocated and its dimensions, which appeared to be correct:
Code: | if ( allocated (ap) ) then
write (*,*) 'in iter : AP is allocated as size ', size(ap,1),' x ',size(ap,2)
else
write (*,*) 'in iter : AP is NOT allocated'
end if
CALL gcgris(ap, bbp, ra, rr, sss, xx, ww, zz, sumfil) |
routine agcgris.f90 to report ap dimensions; which reported the second dimension as incorrect.
Code: | write (*,*) 'in gcgris : AP is allocated as size ', size(ap,1),' x *'
write (*,*) ' nrn =',nrn,' for AP(nrn,*'
write (*,*) ' nbn =',nbn,' for BP(nbn,*'
CALL abmult(ap(1,L),bp(:,L)) ! <<== Integer overflow with 32-bit /checkmate |
the program then crashed and SDBG reported invalid dimensions for arrays AP and BP
I think FTN95 is having a problem with "0:*" in:
REAL(KIND=kdp), DIMENSION(nrn,0:*) :: ap
In Routine abmult, what would FTN95 do with "DIMENSION(*), INTENT(OUT)" in X declaration ?
Looks like problems to me !!
Using INTENT can be problematic in this case, especially when going to other compilers.
Also, I think "va(jc,irn)*y(jcol-nrn)" could be "va(jc,irn)*y(jcol+1)" as L = 0 prior to call. |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1895
|
Posted: Fri Apr 16, 2021 1:14 pm Post subject: |
|
|
John,
The caller is passing the first columns of the two-dimensional arrays AP and BP as the actual arguments to match the one-dimensional arrays X and Y in ABMULT. Yes, L = 0, but the arrays AP and BP have been declared with matching 0-s in their second dimension.
With /checkmate, I expect FTN95 to set INTENT(OUT) arguments to 'UNDEFINED' . To do so correctly, it needs the actual bounds of those arguments, which are normally not available with assumed size arguments, but we expect /checkmate to pass any/all extra information needed for checking and initialising to 'UNDEFINED'.
The original code is the HST3D Version 2 from the USGS, see https://wwwbrr.cr.usgs.gov/projects/GW_Solute/hst/index.shtml and http://priede.bf.lu.lv/ftp/pub/TIS/datu_analiize/WaterFlow/HST3D/ .
Thanks for your comments. |
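For comparison, the sketch below shows the same kind of call with an explicit-shape INTENT(OUT) dummy, where the extent is passed as an argument and so is known inside the callee. The names explicit_bounds_sketch and fill_column are hypothetical, the sizes are arbitrary, and whether this variant behaves differently under /checkmate has not been verified here.

```fortran
program explicit_bounds_sketch
  implicit none
  integer, parameter :: nrn = 7
  real :: ap(nrn,0:5)

  ap = -1.0
  call fill_column(ap(1,0), nrn)   ! pass the start of column 0 plus its length
  print *, ap(:,0)                 ! column written by the callee
  print *, ap(1,1)                 ! neighbouring column, left untouched
contains
  subroutine fill_column(x, n)
    integer, intent(in) :: n
    real, dimension(n), intent(out) :: x   ! explicit shape: extent n is known here
    integer :: i
    do i = 1, n
      x(i) = 0.1
    end do
  end subroutine fill_column
end program explicit_bounds_sketch
```

With DIMENSION(*) in place of DIMENSION(n), the callee has no extent of its own to work with, which is the situation described above.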
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Fri Apr 16, 2021 1:30 pm Post subject: Re: |
|
|
mecej4 wrote: | The caller is passing the first columns of the two-dimensional arrays AP and BP as the actual arguments to match the one-dimensional arrays X and Y in ABMULT. Yes, L = 0, but the arrays AP and BP have been declared with matching 0-s in their second dimension. |
if (jcol > 0) ... *y(jcol-nrn) just looks wrong, as nrn = nbn = 6479 (although I don't know what jcol values mean)
-nrn looks like a probable column shift, ie expects Y in call is second column of bp.
Also why "CALL abmult(ap(1,L),bp(:,L))" rather than "CALL abmult(ap(1,L),bp(1,L))", as first dimension is declared as nrn/nbn.
There is no good reason for this if using F77 wrapper approach ? |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1895
|
Posted: Fri Apr 16, 2021 3:20 pm Post subject: |
|
|
John, the full story is that the original code passes assumed shape arrays to BLAS-like routines such as ABMULT, but for calls with assumed shape arrays, there is a major performance penalty with FTN95 (but not with Gfortran, Intel). Here are run times (FTN95 8.71 /opt /64, Ifort 21 /O2, CPU: i7-10710-1.1 GHz)
Code: |              ---- Silverfrost ----      Intel
        Test Case       A-Shape      A-Size
        ----------     ---------    --------    --------
        Elder_H          113.499       2.654       1.328
        Elder_S          111.826       2.476       1.125
        Henry              1.186       0.254       0.078
        Huyakorn          failed      16.136       8.219
        Hydrocoin        771.261      28.473      16.109 |
[Sorry, I don't want to experiment with the number of spaces to add to get the columns to come out aligned].
I have been gradually replacing these calls with equivalent assumed-size arguments. At the same time, the original code has some bugs, unfortunately, and I am using FTN95 to make the conversions and catch errors in the process.
You are seeing a mangled reproducer, so you will see lots of inconsistencies and objectionable style in it; you have to avoid allowing yourself to be distracted by such things.
You are welcome to try the HST3D original code. I have the assumed size version of Version 2.2.13, which I developed a couple of years ago, and I should be happy to provide that to you. I am in the process of making similar modifications to Version 2.2.16. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Sat Apr 17, 2021 4:00 am Post subject: |
|
|
I have further looked at the code you provided.
In iter.f90, from line 55, I have included:
Code: | if ( allocated (ap) ) then
write (*,*) 'in iter : AP is allocated as size ', size(ap,1),' x ',size(ap,2)
write (*,*) 'Loc(ap) =',loc(ap)
write (*,*) 'Loc(bp) =',loc(bbp)
else
write (*,*) 'in iter : AP is NOT allocated'
end if
CALL gcgris(ap, bbp, ra, rr, sss, xx, ww, zz, sumfil)
END SUBROUTINE iter
|
In Agcgris.f90, from line 38, I have included:
Code: | write (*,*) 'in gcgris : AP is allocated as size ', size(ap,1),' x *'
write (*,*) ' nrn =',nrn,' for AP(nrn,* : Loc(ap) =',loc(ap)
write (*,*) ' nbn =',nbn,' for BP(nbn,* : Loc(bp) =',loc(bp)
write (*,*) 'using CALL abmult (ap,bp)'
!zz CALL abmult (ap(1,L),bp(:,L)) ! <<== Integer overflow with 32-bit /checkmate
!zz CALL abmult (ap(1,L),bp(1,L)) ! <<== Integer overflow with 32-bit /checkmate
CALL abmult (ap,bp) ! <<== this form runs to completion
stop
CONTAINS
SUBROUTINE abmult(x,y)
IMPLICIT NONE
REAL(KIND=kdp), DIMENSION(*), INTENT(OUT) :: x ! (nrn <<== Access Violation writing address 0x000000000AD60000
REAL(KIND=kdp), DIMENSION(*), INTENT(IN) :: y ! (nbn
!
INTEGER :: irn, jc, jcol
REAL(KIND=kdp) :: s
integer :: start_x, start_y, nbad, ngood, nskip
!
start_x = loc(x)
start_y = loc(y)
write (*,*) 'Loc(ap/x) =',start_x
write (*,*) 'Loc(bp/y) =',start_y
nbad = 0
ngood = 0
nskip = 0
!
DO irn=1,nrn
s = 0.0_kdp
DO jc=1,6
jcol = ci(jc,irn)
IF (jcol > 0) s = s + va(jc,irn)*y(jcol-nrn)
IF (jcol > 0) then
if ( loc(y(jcol-nrn)) < start_y) then
write (*,*) irn,jc,' bad y usage : jcol=',jcol
nbad = nbad + 1
else
ngood = ngood + 1
end if
else
nskip = nskip+1
end if
END DO
x(irn) = s
END DO
write (*,*) 'ngood =',ngood
write (*,*) 'nbad =',nbad
write (*,*) 'nskip =',nskip
END SUBROUTINE abmult
END SUBROUTINE gcgris |
The first two calls to abmult failed with integer overflow, but using the third call option produced the following output.
Code: | ...
Compiling Z:\Temp\mecej4\asizbug\ldind.f90
NO ERRORS [<LDIND> FTN95 v8.64.0]
Z:\Temp\mecej4\asizbug>slink *.obj /out:asiz32
Creating executable: asiz32.exe
in iter : AP is allocated as size 6479 x 6
Loc(ap) = 280774288
Loc(bp) = 281085296
in gcgris : AP is allocated as size 6479 x *
nrn = 6479 for AP(nrn,* : Loc(ap) = 280774288
nbn = 6479 for BP(nbn,* : Loc(bp) = 281085296
using CALL abmult (ap,bp)
Loc(bp/y) = 281085296
ngood = 36805
nbad = 0
nskip = 2069
|
It is interesting that in the other two call tests, the address and size(ap,1) values are correct prior to the call to abmult, but once the call has started the SDBG information is corrupted. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Sat Apr 17, 2021 4:13 am Post subject: |
|
|
I am using FTN95 Ver 8.64 32-bit and SDBG Ver 8.62
Using CALL abmult (ap(1,L),bp(:,L));
when it crashes, in the Vars:GCGRIS window, there are 2 listed variables BP,
a variable BP and
an array "BP = REAL*8 (280774289,0:214748646)"
These have different memory addresses.
This looks confusing to me.
-- new test --
I returned to using CALL abmult (ap,bp);
The program terminates normally, but SDBG is still listing a variable BP and an array BP ??? FTN95/SDBG error ?
Arrays AP and BP have incorrect (random) sizes.
(I should put a breakpoint prior to the call)
-- new test --
Now in SDBG with breakpoint, checked prior to call abmult
AP has wrong size (57735933,0:2147483646)
BP has wrong size, (4604097,0:2147483646), plus a variable BP exists.
write statement reports size(ap,1) correctly as 6479
Their first dimension should be correct ?? FTN95/SDBG error ?
I would suggest this is possible cause of integer overflow.
Their second dimension is reported as 0:2147483646, a possible length for *, so could be ok.
In abmult
X and Y have same memory size of module MCS2 variables AP and BBP.
X = Y = REAL*8 (38874) { this is 6479*6 }
(my comment about "-nrn" appears invalid, as CI might be adjusted for this offset. ie my ngood = 36805 vs nbad = 0)
-- new test --
Further test: I changed the declarations for AP and BP to
REAL(KIND=kdp), DIMENSION(nrn,*) :: ap
REAL(KIND=kdp), DIMENSION(nbn,*) :: bp
SDBG still reports their first dimension incorrectly
Other arrays ra and rr are reported with correct dimensions by SDBG.
array h is also correct. (automatic array where nsdr is module variable)
REAL(KIND=kdp), DIMENSION(lrcgd1,*), INTENT(INOUT) :: ra
REAL(KIND=kdp), DIMENSION(*), INTENT(IN OUT) :: rr
REAL(KIND=kdp), DIMENSION(0:nsdr-2,0:nsdr-2) :: h |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1895
|
Posted: Sat Apr 17, 2021 7:00 am Post subject: |
|
|
Encouraged by your comments, John, I succeeded in creating a short reproducer, which I hope Paul will consider.
Code: | MODULE mcs
IMPLICIT NONE
INTEGER, PARAMETER :: LRCGD1 = 19
INTEGER :: NRN
REAL, DIMENSION(:,:), ALLOCATABLE, SAVE :: app
END Module
Program HST3d
USE mcs
IMPLICIT NONE
NRN = 7
allocate(app(NRN,0:5))
CALL gcgris(app)
print *,app(3,0)
end Program
SUBROUTINE gcgris(ap)
USE mcs
IMPLICIT NONE
REAL, DIMENSION(NRN,0:*), INTENT(IN OUT) :: ap
CALL abmult(ap(1:NRN,0))
end subroutine gcgris
SUBROUTINE abmult(x)
USE mcs
IMPLICIT NONE
REAL, DIMENSION(*), INTENT(OUT) :: x
INTEGER :: irn
!
DO irn=1,nrn
x(irn) = 0.1
END DO
END SUBROUTINE abmult |
Compile with /full_debug /full_undef, and run inside SDBG (running outside the debugger may cause the command window to hang up). The program will stop on line 31 with "Array subscript(s) out-of-bounds". The variables pane shows x to have a size of 1, which is incorrect. If you click on the GCGRIS line in the call stack window, you will now see the size of array AP as (7,0:2147483646) (which is preposterous -- in fact, the upper bound in the second extent is seen as one less than the largest 32-bit signed integer, whereas the declared upper bound is just 5. However, as John C. noted, this large value may be just a signature for '*' in the subroutine).
Turning now to 64-bits, we run into the same problem as in the original test program of this thread. Compile with /64 /full_debug /full_undef, and run inside sdbg64. The program stops with "Access violation writing address 0x0000000002575000". The variables pane shows x as having a size of 1000794972.
I suspect that one cause of this bug is how this program specifies a lower bound of 0 rather than the conventional 1.
Last edited by mecej4 on Sat Apr 17, 2021 1:32 pm; edited 2 times in total |
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7938 Location: Salford, UK
|
Posted: Sat Apr 17, 2021 7:40 am Post subject: |
|
|
mecej4
Thank you for this. I have made a note to investigate. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2560 Location: Sydney
|
Posted: Sun Apr 18, 2021 4:16 am Post subject: |
|
|
mecej4,
I had a look at your short reproducer.
For my testing with FTN95 Ver 8.64, on entry to gcgris, the array ap has valid dimensions, so I am not sure if this identifies the previous problem (that I am seeing).
I have produced a single file reproducer from your larger test.
https://www.dropbox.com/s/jvwul6p11ol83gi/Agcgris_test.f90?dl=0
https://www.dropbox.com/s/gdw1ndc10g5iccw/Agcgris_test3.f90?dl=0
For my testing of this reproducer, on entry to gcgris, the arrays ap, bp have incorrect dimensions in SDBG, but array ra has valid dimensions.
I think the invalid dimensions for ap and bp is a problem to be addressed.
I compiled as:
ftn95 agcgris_test /checkmate /link
sdbg agcgris_test
I set breakpoints at:
line 210 : at call to gcgris : shows arrays ap, bbp and ra defined as expected
line 115 : entry to gcgris : shows arrays ap, bp with invalid dimensions
line 145 : entry to abmult : shows X,Y with correct dimensions
F6 to start test
Paul,
I see an error at line 115 : entry to gcgris that AP and BP have invalid dimensions in SDBG, reported as (1,7), but first dimension should = nbn = 100.
I identified this as a problem in the bigger program.
( note write (*,*) ... size(ap,1) reports correct value of first dimension, different from SDBG ?? )
In my example:
lines 96:101 show alternative definitions of AP and BP, although there was no change to outcome.
lines 130:132 show alternative calls to abmult. The first 2 resulted in a crash in the larger program, while the 3rd ran to completion.
I have only tested the 3rd option with this reproducer.
I thought this identified the problem for FTN95/SDBG that needs checking.
edit:
the second link above for Agcgris_test3.f90 reproduces the original crash. This uses the 1st declarations (as in original post program) and then 3rd then 2nd call.
3rd call works, but 2nd call that requires array dimensions fails.
I hope this easily demonstrates the problem.
ps.
I do find the use of "SAVE" in a module to be annoying.
I do not know of any F95+ compiler that requires explicit use of SAVE in a module. SAVE does nothing as modules do not go out of scope. |
|
|
|
|
|
|