forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 

Perplexing bug in program (or compiler?)

mecej4
Joined: 31 Oct 2006
Posts: 1580

Posted: Wed Apr 14, 2021 4:17 pm

I have been working on a groundwater flow program with about 12,000 lines of Fortran code. FTN95 (V 8.71) has been quite helpful in finding and fixing bugs related to subscript bounds, uninitialised variables, etc. However, in one run, I encountered strange program behaviour.

When I compiled the program with /checkmate and built a 32-bit EXE, the program ran for a few seconds and then aborted with INTEGER OVERFLOW on a line that contains just a subroutine call:

Code:
call abmult(Ap(1,l),Bp(:,l))


When I compiled the same program with /64 /checkmate to build a 64-bit EXE, the behaviour became stranger still. When I ran the EXE from the command line, the small Windows hourglass icon came on. After up to ten seconds, the program quit with no messages of any kind. When I ran the same EXE from SDBG64, execution stopped after about one second, reporting:

Code:
Error: Access Violation writing address 0x000000000AC40000


The line of code on which this happens is just a line containing an argument variable declaration, and reads:

Code:
real(kind=KDP) , dimension(*) , intent(out)  ::  X


[Added on 16 April] Looking at the assembler code in the area where the access violation occurs reveals that an attempt is made to store the marker byte value Z'80' (= 'undefined') into Z'653BFFFF3588' bytes of memory starting at address Z'0AC40000' . That byte count is too large to be represented in 32 bits, which may explain the integer overflow encountered by the 32-bit EXE.
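The size comparison can be checked with a few lines of standard Fortran. This is only a sketch of the arithmetic, not code from the program under discussion: the program name is made up for illustration, and the BOZ constant inside INT assumes a Fortran 2003 or later compiler (it has not been tried with FTN95 here).

```fortran
program byte_count_check
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  ! Byte count quoted in the post, held in a 64-bit integer.
  integer(int64), parameter :: nbytes = int(Z'653BFFFF3588', int64)

  print *, 'byte count           =', nbytes
  print *, 'largest 32-bit value =', huge(1_int32)
  ! .false. here means the count cannot be held in a 32-bit signed integer.
  print *, 'fits in 32 bits?      ', nbytes <= int(huge(1_int32), int64)
end program byte_count_check
```

The last line should report false, which is consistent with an overflow when the same count is computed in 32-bit arithmetic.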

I have created a reproducer by removing about 50 percent of the code, but the reproducer is still a bit large -- about 6,000 lines. The source code, a required input data file, and batch files for building EXEs from the sources are contained in a Zip file which may be downloaded from Dropbox:

https://www.dropbox.com/s/az2yedspwjk5unk/asizbug.7z?dl=0


Last edited by mecej4 on Thu Apr 15, 2021 3:11 pm; edited 2 times in total

JohnCampbell
Joined: 16 Feb 2006
Posts: 2351
Location: Sydney

Posted: Thu Apr 15, 2021 3:21 am

Would abmult need an interface?

mecej4
Joined: 31 Oct 2006
Posts: 1580

Posted: Thu Apr 15, 2021 3:38 am

Thanks for taking a look. Good point.

When an array section is used as an actual argument, an interface may be required. However, since ABMULT is a contained subroutine of the caller, which is the host routine, the latter has an interface already available.

The issue persists if the call is changed to

Code:
 CALL abmult(ap(1,L),bp(1,L))

PaulLaidler
Site Admin
Joined: 21 Feb 2005
Posts: 7142
Location: Salford, UK

Posted: Thu Apr 15, 2021 2:56 pm

This is my impression...

The argument ap(1,L) of abmult is a scalar whereas an array is expected.

If I change to ap(1,L:L), it is an array but only with one element.

Within abmult, x(irn) = s is putting values into multiple elements of the actual argument ap. So the destination is not valid.

mecej4
Joined: 31 Oct 2006
Posts: 1580

Posted: Thu Apr 15, 2021 4:18 pm

Paul, your remarks relate to the F77 conventions for passing contiguous array sections and their variance from the F90+ conventions for the same. Indeed, the F2003 standard comments at length on these differences in section C.9.5.

However, every current Fortran compiler that we are likely to use today, including FTN95, handles such F77 style calls perfectly well. Here is a very short example that illustrates that capability. FTN95 has no problem with this code, nor does the NAG compiler. Similarly, if /check or /debug is used instead of /checkmate on the larger test code that I posted to Dropbox, there is no problem.

Code:
program aseccopy
implicit none
integer a(10), b(5)
integer i,n
!
n = 5
do i=1, n
   a(i) = i
   b(i) = i+5
end do

! F77 style call: argument association by location/address
call acopy(a(n+1),b(1),n)  ! copy b into upper half of array A

print '(1x,I2,2x,I4)',(i,a(i),i=1,10)
stop

contains

subroutine acopy(a,b,n)
integer, intent(in) :: n
integer, dimension(n), intent(in) :: b
integer, dimension(n), intent(in out) :: a
integer i
do i=1,n
   a(i) = b(i)
end do
return
end subroutine
end program


In the CALL statement, the argument a(n+1) has the same meaning as a(n+1:).

PaulLaidler
Site Admin
Joined: 21 Feb 2005
Posts: 7142
Location: Salford, UK

Posted: Thu Apr 15, 2021 5:03 pm

mecej4

My impression was that the first argument is a scalar or an array with one element. Whereas the routine is setting multiple elements for the corresponding dummy argument.

If that is correct then the code is at fault regardless of the compiler or the Fortran standard.

JohnCampbell
Joined: 16 Feb 2006
Posts: 2351
Location: Sydney

Posted: Fri Apr 16, 2021 11:34 am

For the 32-bit version, I ran the program in SDBG.
I changed:
routine init.f90 to report whether ap was allocated, along with its dimensions, which appeared to be correct
Code:
  if ( allocated (ap) ) then
  write (*,*) 'in iter : AP is allocated as size ', size(ap,1),' x ',size(ap,2)
  else
  write (*,*) 'in iter : AP is NOT allocated'
  end if 

  CALL gcgris(ap, bbp, ra, rr, sss, xx, ww, zz, sumfil)


routine agcgris.f90 to report ap dimensions; which reported the second dimension as incorrect.
Code:
  write (*,*) 'in gcgris : AP is allocated as size ', size(ap,1),' x *'
  write (*,*) ' nrn =',nrn,' for AP(nrn,*'
  write (*,*) ' nbn =',nbn,' for BP(nbn,*'
  CALL abmult(ap(1,L),bp(:,L))   ! <<== Integer overflow with 32-bit /checkmate

The program then crashed, and SDBG reported invalid dimensions for arrays AP and BP.
I think FTN95 is having a problem with "0:*" in:
REAL(KIND=kdp), DIMENSION(nrn,0:*) :: ap

In routine abmult, what would FTN95 do with "DIMENSION(*), INTENT(OUT)" in the declaration of X?
Looks like problems to me!
Using INTENT can be problematic in this case, especially when going to other compilers.
Also, I think "va(jc,irn)*y(jcol-nrn)" could be "va(jc,irn)*y(jcol+1)", as L = 0 prior to the call.

mecej4
Joined: 31 Oct 2006
Posts: 1580

Posted: Fri Apr 16, 2021 1:14 pm

John,

The caller is passing the first columns of the two-dimensional arrays AP and BP as the actual arguments to match the one-dimensional arrays X and Y in ABMULT. Yes, L = 0, but the arrays AP and BP have been declared with matching 0-s in their second dimension.

With /checkmate, I expect FTN95 to set INTENT(OUT) arguments to 'UNDEFINED' . To do so correctly, it needs the actual bounds of those arguments, which are normally not available with assumed size arguments, but we expect /checkmate to pass any/all extra information needed for checking and initialising to 'UNDEFINED'.
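One way to give a checking compiler the extent it needs is to pass the length explicitly and declare the dummy with explicit shape. The sketch below only illustrates that idea under assumptions: the module, the routine zero_fill, and the definition of kdp are hypothetical and are not taken from HST3D, and whether this form avoids the /checkmate failure was not established in this thread.

```fortran
module explicit_shape_sketch
  implicit none
  ! Assumed definition of the double precision kind; hypothetical.
  integer, parameter :: kdp = selected_real_kind(14, 60)
contains
  subroutine zero_fill(n, x)
    ! The extent n is part of the declaration, so a checker knows
    ! exactly how many elements an INTENT(OUT) dummy covers.
    integer, intent(in) :: n
    real(kind=kdp), dimension(n), intent(out) :: x
    integer :: i
    do i = 1, n
      x(i) = 0.0_kdp
    end do
  end subroutine zero_fill
end module explicit_shape_sketch

program explicit_shape_demo
  use explicit_shape_sketch
  implicit none
  integer, parameter :: nrn = 4
  real(kind=kdp) :: ap(nrn, 0:2)

  ap = 1.0_kdp
  ! F77-style call: the element ap(1,0) stands for the whole column 0.
  call zero_fill(nrn, ap(1,0))
  print *, ap(:,0)   ! column 0 was overwritten by zero_fill
  print *, ap(:,1)   ! column 1 is untouched
end program explicit_shape_demo
```

With dimension(*) in place of dimension(n), the callee carries no information about how many elements follow the first one, which is the situation described above.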

The original code is the HST3D Version 2 from the USGS, see https://wwwbrr.cr.usgs.gov/projects/GW_Solute/hst/index.shtml and http://priede.bf.lu.lv/ftp/pub/TIS/datu_analiize/WaterFlow/HST3D/ .

Thanks for your comments.

JohnCampbell
Joined: 16 Feb 2006
Posts: 2351
Location: Sydney

Posted: Fri Apr 16, 2021 1:30 pm

mecej4 wrote:
The caller is passing the first columns of the two-dimensional arrays AP and BP as the actual arguments to match the one-dimensional arrays X and Y in ABMULT. Yes, L = 0, but the arrays AP and BP have been declared with matching 0-s in their second dimension.

if (jcol > 0) ... *y(jcol-nrn) just looks wrong, as nrn = nbn = 6479 (although I don't know what the jcol values mean).
-nrn looks like a probable column shift, i.e. it expects that Y in the call is the second column of bp.

Also, why "CALL abmult(ap(1,L),bp(:,L))" rather than "CALL abmult(ap(1,L),bp(1,L))", as the first dimension is declared as nrn/nbn?
I see no good reason for this if using the F77 wrapper approach.

mecej4
Joined: 31 Oct 2006
Posts: 1580

Posted: Fri Apr 16, 2021 3:20 pm

John, the full story is that the original code passes assumed shape arrays to BLAS-like routines such as ABMULT, but for calls with assumed shape arrays, there is a major performance penalty with FTN95 (but not with Gfortran, Intel). Here are run times (FTN95 8.71 /opt /64, Ifort 21 /O2, CPU: i7-10710-1.1 GHz)

Code:
                 Silverfrost             Intel
Test Case        A-Shape      A-Size
----------       -------      ------     ------
Elder_H          113.499       2.654      1.328
Elder_S          111.826       2.476      1.125
Henry              1.186       0.254      0.078
Huyakorn          failed      16.136      8.219
Hydrocoin        771.261      28.473     16.109


[Sorry, I don't want to experiment with the number of spaces to add to get the columns to come out aligned].

I have been gradually replacing these calls with equivalent assumed-size arguments. At the same time, the original code has some bugs, unfortunately, and I am using FTN95 to make the conversions and catch errors in the process.
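For readers unfamiliar with the two argument styles, here is a minimal sketch of the kind of conversion meant. The module and routine names are hypothetical, the routines are not from HST3D, and the example makes no attempt to reproduce the timings above; it only shows how the declarations and calls differ.

```fortran
module dummy_styles
  implicit none
contains
  ! Assumed-shape dummy: bounds travel with the argument (a descriptor).
  subroutine scale_ashape(x, f)
    real, dimension(:), intent(inout) :: x
    real, intent(in) :: f
    x = f * x
  end subroutine scale_ashape

  ! Assumed-size dummy: only a starting address is passed, so the
  ! length has to be supplied separately.
  subroutine scale_asize(n, x, f)
    integer, intent(in) :: n
    real, dimension(*), intent(inout) :: x
    real, intent(in) :: f
    integer :: i
    do i = 1, n
      x(i) = f * x(i)
    end do
  end subroutine scale_asize
end module dummy_styles

program dummy_styles_demo
  use dummy_styles
  implicit none
  real :: a(3, 0:1)

  a = 1.0
  call scale_ashape(a(:,0), 2.0)      ! array section, F90 style
  call scale_asize(3, a(1,1), 3.0)    ! first element of column 1, F77 style
  print *, a
end program dummy_styles_demo
```

The assumed-size form gives up bounds information inside the callee, which is why run-time checking of such arguments depends on whatever extra information the compiler chooses to pass.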

You are seeing a mangled reproducer, so you will see lots of inconsistencies and objectionable style in it; you have to avoid allowing yourself to be distracted by such things.

You are welcome to try the HST3D original code. I have the assumed size version of Version 2.2.13, which I developed a couple of years ago, and I should be happy to provide that to you. I am in the process of making similar modifications to Version 2.2.16.

JohnCampbell
Joined: 16 Feb 2006
Posts: 2351
Location: Sydney

Posted: Sat Apr 17, 2021 4:00 am

I have further looked at the code you provided.
In iter.f90 from line 55 I have included:
Code:
  if ( allocated (ap) ) then
    write (*,*) 'in iter : AP is allocated as size ', size(ap,1),' x ',size(ap,2)
    write (*,*) 'Loc(ap) =',loc(ap)
    write (*,*) 'Loc(bp) =',loc(bbp)
  else
    write (*,*) 'in iter : AP is NOT allocated'
  end if 

  CALL gcgris(ap, bbp, ra, rr, sss, xx, ww, zz, sumfil)
END SUBROUTINE iter

In Agcgris.f90 from line 38 I have included:
Code:
  write (*,*) 'in gcgris : AP is allocated as size ', size(ap,1),' x *'
  write (*,*) ' nrn =',nrn,' for AP(nrn,*  : Loc(ap) =',loc(ap)
  write (*,*) ' nbn =',nbn,' for BP(nbn,*  : Loc(bp) =',loc(bp)
  write (*,*) 'using CALL abmult (ap,bp)'
!zz  CALL abmult (ap(1,L),bp(:,L))   ! <<== Integer overflow with 32-bit /checkmate
!zz  CALL abmult (ap(1,L),bp(1,L))   ! <<== Integer overflow with 32-bit /checkmate
  CALL abmult (ap,bp)   ! whole-array call: this form ran to completion
  stop
CONTAINS

SUBROUTINE abmult(x,y)
  IMPLICIT NONE
  REAL(KIND=kdp), DIMENSION(*), INTENT(OUT) :: x   ! (nrn <<== Access Violation writing address 0x000000000AD60000
  REAL(KIND=kdp), DIMENSION(*), INTENT(IN)  :: y   ! (nbn
  !
  INTEGER :: irn, jc, jcol
  REAL(KIND=kdp) :: s
  integer :: start_x, start_y, nbad, ngood, nskip
!
  start_x = loc(x)
  start_y = loc(y)
  write (*,*) 'Loc(ap/x) =',start_x
  write (*,*) 'Loc(bp/y) =',start_y
  nbad = 0
  ngood = 0
  nskip = 0
!
  DO irn=1,nrn
     s = 0.0_kdp
     DO jc=1,6
        jcol = ci(jc,irn)
        IF (jcol > 0) s = s + va(jc,irn)*y(jcol-nrn)
        IF (jcol > 0) then
          if ( loc(y(jcol-nrn)) < start_y) then
            write (*,*) irn,jc,' bad y usage : jcol=',jcol
            nbad = nbad + 1
          else
            ngood = ngood + 1
          end if
        else
          nskip = nskip+1
        end if
     END DO
     x(irn) = s
  END DO
  write (*,*) 'ngood =',ngood
  write (*,*) 'nbad  =',nbad
  write (*,*) 'nskip =',nskip
END SUBROUTINE abmult

END SUBROUTINE gcgris

The first 2 calls to abmult failed with integer overflow, but using the 3rd call option produced the following output.
Code:
...
Compiling Z:\Temp\mecej4\asizbug\ldind.f90
    NO ERRORS  [<LDIND> FTN95 v8.64.0]

Z:\Temp\mecej4\asizbug>slink *.obj /out:asiz32
Creating executable: asiz32.exe
 in iter : AP is allocated as size         6479 x            6
 Loc(ap) =   280774288
 Loc(bp) =   281085296
 in gcgris : AP is allocated as size         6479 x *
  nrn =        6479 for AP(nrn,*  : Loc(ap) =   280774288
  nbn =        6479 for BP(nbn,*  : Loc(bp) =   281085296
 using CALL abmult (ap,bp)
 Loc(bp/y) =   281085296
 ngood =       36805
 nbad  =           0
 nskip =        2069

It is interesting that in the other 2 call tests, prior to the call to abmult, the address and size(.,1) values are correct, but once the call starts, the SDBG info is corrupted.

JohnCampbell
Joined: 16 Feb 2006
Posts: 2351
Location: Sydney

Posted: Sat Apr 17, 2021 4:13 am

I am using FTN95 Ver 8.64 32-bit and SDBG Ver 8.62

Using CALL abmult (ap(1,L),bp(:,L));
when it crashes, in the Vars:GCGRIS window, there are 2 listed variables BP,
a variable BP and
an array "BP = REAL*8 (280774289,0:214748646)"
These have different memory addresses.
This looks confusing to me.

-- new test --
I returned to using CALL abmult (ap,bp);
The program terminates normally, but SDBG is still listing a variable BP and an array BP ??? FTN95/SDBG error ?
Arrays AP and BP have incorrect (random) sizes.
(I should put a breakpoint prior to the call)

-- new test --
Now in SDBG with breakpoint, checked prior to call abmult
AP has wrong size (57735933,0:2147483646)
BP has wrong size, (4604097,0:2147483646), plus a variable BP exists.
write statement reports size(ap,1) correctly as 6479

Their first dimension should be correct ?? FTN95/SDBG error ?
I would suggest this is a possible cause of the integer overflow.
Their second dimension is reported as 0:2147483646, a possible length for *, so could be ok.
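The reported upper bound is exactly one less than the largest default 32-bit signed integer, as this small check shows. It only confirms the arithmetic; that FTN95/SDBG uses this value as a stand-in for '*' remains a conjecture, and the program name is made up for illustration.

```fortran
program star_bound_check
  use, intrinsic :: iso_fortran_env, only: int32
  implicit none
  ! Largest 32-bit signed integer minus one.
  print *, huge(1_int32) - 1_int32   ! prints 2147483646
end program star_bound_check
```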

In abmult
X and Y have the same memory size as the module MCS2 variables AP and BBP.
X = Y = REAL*8 (38874) { this is 6479*6 }

(my comment about "-nrn" appears invalid, as CI might be adjusted for this offset. ie my ngood = 36805 vs nbad = 0)

-- new test --
Further test: I changed the declarations for AP and BP to
REAL(KIND=kdp), DIMENSION(nrn,*) :: ap
REAL(KIND=kdp), DIMENSION(nbn,*) :: bp

SDBG still reports their first dimension incorrectly

Other arrays ra and rr are reported with correct dimensions by SDBG.
array h is also correct. (automatic array where nsdr is module variable)
REAL(KIND=kdp), DIMENSION(lrcgd1,*), INTENT(INOUT) :: ra
REAL(KIND=kdp), DIMENSION(*), INTENT(IN OUT) :: rr
REAL(KIND=kdp), DIMENSION(0:nsdr-2,0:nsdr-2) :: h

mecej4
Joined: 31 Oct 2006
Posts: 1580

Posted: Sat Apr 17, 2021 7:00 am

Encouraged by your comments, John, I succeeded in creating a short reproducer, which I hope Paul will consider.

Code:
MODULE mcs
  IMPLICIT NONE
  INTEGER, PARAMETER :: LRCGD1 = 19
  INTEGER :: NRN
  REAL, DIMENSION(:,:), ALLOCATABLE, SAVE :: app
END Module

Program HST3d
USE mcs
IMPLICIT NONE
NRN = 7
allocate(app(NRN,0:5))
CALL gcgris(app)
print *,app(3,0)
end Program

SUBROUTINE gcgris(ap)
  USE mcs
  IMPLICIT NONE
  REAL, DIMENSION(NRN,0:*), INTENT(IN OUT) :: ap
  CALL abmult(ap(1:NRN,0))
end subroutine gcgris

SUBROUTINE abmult(x)
  USE mcs
  IMPLICIT NONE
  REAL, DIMENSION(*), INTENT(OUT) :: x
  INTEGER :: irn
!
  DO irn=1,nrn
     x(irn) = 0.1
  END DO
END SUBROUTINE abmult


Compile with /full_debug /full_undef, and run inside SDBG (running outside the debugger may cause the command window to hang). The program will stop on line 31 with "Array subscript(s) out-of-bounds". The variables pane shows x to have a size of 1, which is incorrect. If you click on the GCGRIS line in the call stack window, you will see the size of array AP as (7,0:2147483646), which looks preposterous: the upper bound of the second dimension is shown as one less than the largest 32-bit signed integer, whereas the array was allocated with an upper bound of just 5. However, as John C. noted, this large value may be just a signature for '*' in the subroutine.

Turning now to 64-bits, we run into the same problem as in the original test program of this thread. Compile with /64 /full_debug /full_undef, and run inside sdbg64. The program stops with "Access violation writing address 0x0000000002575000". The variables pane shows x as having a size of 1000794972.

I suspect that one cause of this bug is how this program specifies a lower bound of 0 rather than the conventional 1.


Last edited by mecej4 on Sat Apr 17, 2021 1:32 pm; edited 2 times in total

PaulLaidler
Site Admin
Joined: 21 Feb 2005
Posts: 7142
Location: Salford, UK

Posted: Sat Apr 17, 2021 7:40 am

mecej4

Thank you for this. I have made a note to investigate.

JohnCampbell
Joined: 16 Feb 2006
Posts: 2351
Location: Sydney

Posted: Sun Apr 18, 2021 4:16 am

mecej4,

I had a look at your short reproducer.
For my testing with FTN95 Ver 8.64, on entry to gcgris, the array ap has valid dimensions, so I am not sure if this identifies the previous problem (that I am seeing).

I have produced a single file reproducer from your larger test.

https://www.dropbox.com/s/jvwul6p11ol83gi/Agcgris_test.f90?dl=0

https://www.dropbox.com/s/gdw1ndc10g5iccw/Agcgris_test3.f90?dl=0

For my testing of this reproducer, on entry to gcgris, the arrays ap, bp have incorrect dimensions in SDBG, but array ra has valid dimensions.
I think the invalid dimensions for ap and bp are a problem to be addressed.

I compiled as:
ftn95 agcgris_test /checkmate /link
sdbg agcgris_test

I set breakpoints at:
line 210 : at call to gcgris : shows arrays ap, bbp and ra defined as expected
line 115 : entry to gcgris : shows arrays ap, bp with invalid dimensions
line 145 : entry to abmult : shows X,Y with correct dimensions
F6 to start test

Paul,
I see an error at line 115 : entry to gcgris that AP and BP have invalid dimensions in SDBG, reported as (1,7), but first dimension should = nbn = 100.
I identified this as a problem in the bigger program.
( note write (*,*) ... size(ap,1) reports correct value of first dimension, different from SDBG ?? )
In my example:
lines 96:101 show alternative definitions of AP and BP, although there was no change to outcome.
lines 130:132 show alternative calls to abmult. The first 2 resulted in a crash in the larger program, while the 3rd ran to completion.
I have only tested the 3rd option with this reproducer.

I thought this identified the problem for FTN95/SDBG that needs checking.

edit:
the second link above for Agcgris_test3.f90 reproduces the original crash. This uses the 1st declarations (as in original post program) and then 3rd then 2nd call.
3rd call works, but 2nd call that requires array dimensions fails.

I hope this easily demonstrates the problem.

ps.
I do find the use of "SAVE" in a module to be annoying.
I do not know of any F95+ compiler that requires explicit use of SAVE in a module. SAVE does nothing as modules do not go out of scope.

All times are GMT + 1 Hour
Page 1 of 2