|
forums.silverfrost.com Welcome to the Silverfrost forums
|
mecej4
Joined: 31 Oct 2006 Posts: 1897
|
Posted: Sat Nov 05, 2022 2:30 pm Post subject: Puzzling and Spurious Aborts with Integer Overflow |
|
|
I now have a working demonstration of what I believe is a bug in the bug-checking code that is introduced into an EXE produced by FTN95 with options such as /check, /undef, etc.
SYMPTOMS
One specific symptom is that when the EXE is run it aborts in totally unexpected and inexplicable contexts. Making a seemingly irrelevant change (adding a redundant IMPLICIT NONE, adding an otherwise unused variable in the declarations section, including a subprogram that never gets called, changing the compiler options or the input data) may cause the overflow abort to disappear, only for it to resurface when some other slight change is made to the program source.
These integer overflows are exhibited even for sources which are error free and run without any problems using other compilers that provide for catching integer overflow (NAG, the old CVF). It is this property -- error stops resulting from running error-free programs -- that differentiates this kind of error from the more common errors such as undefined variables, argument mismatches, array overruns, etc.
The overflow is not related to integer variables in the test program, but apparently originates in subscript calculations and bounds checking, for which the compiler inserts immediate data bytes in the instruction stream or places those bytes in the stack.
BUG IS ELUSIVE
In the past, I tried to post a bug report for this problem, but the test programs were too big, and had bugs that I was trying to hunt down and fix using FTN95. When a test program with suspected bugs aborts with integer overflow, what is the basis for apportioning blame between the compiler and the test program?
REPRODUCER
I have prepared a 640-line program that should demonstrate the problem.
1. Please download the single source file (in a zip file) from
https://www.dropbox.com/s/qiqggjcac01z58j/psovfl.zip?dl=0
2. Compile and link with /check or /64 /check and run the program. It will abort (at least in my experience, using the 8.92 compiler) with integer overflow on line 553:
Code: | ddn = dvnorm (n, yh(1, L), ewt) / tesco (1, nq) |
(Note that the 64 bit debugger reports 552, being off by one line). In the variable pane, you can see yh listed as "yh(216, *invalid*)".
3. Repeat step 2, this time adding /imp as a compiler option. Note that no variables are flagged by the compiler as not being covered by type declarations. This EXE runs to normal completion!
The EXEs differ by one byte in two places. In effect, that one byte difference causes havoc.
4. Here is my interpretation of what goes wrong at the machine level. At the point where the integer overflow occurs, these are the instructions:
Code: | 40C28A mov r15, [rsp+0x2f0]
inc r15
jno ...
mov eax, 0x02
int 0x9
imul r15, rbx #RBX has D8 (= 216)
jno ... |
The value loaded into R15 should be the first subscript extent of array YH:
Code: | MOV_Q R15,(YH:size:1) |
If you step to the instruction following this instruction (I noted EIP = 40C28A for the 64 bit program), you find that the value loaded into R15 from the stack is 0x8080808080808080 when the EXE was compiled with /64 /check, instead of the correct value, namely, 0x0000000000000017, for the extent of the second subscript of YH, minus 1. When this huge value (already so large that it is a negative number in two's-complement notation) is multiplied by RBX, which contains 0x00000000000000D8 = 216, the extent of the first subscript of YH(:,:), integer overflow results. When /imp is used in addition to /check, the correct value gets loaded and the whole problem goes away.
The same sort of thing happens with 32-bit EXEs, with the culminating instruction:
Code: | 405196 IMUL EDI,EDX |
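The arithmetic behind the 64-bit abort can be checked with a short sketch. It assumes only the values quoted above (0x17, 0xD8 and the bit pattern 0x8080808080808080); the program name, the variable names and the helper product_fits are hypothetical and are not taken from the reproducer. Rather than performing the overflowing multiplication, it tests beforehand whether the product would fit in a signed 64-bit integer.

```fortran
program stride_overflow_sketch
   implicit none
   ! 64-bit integer kind (at least 18 decimal digits).
   integer, parameter :: i8 = selected_int_kind(18)
   ! Extent of the first subscript of YH, as held in RBX (0xD8).
   integer(i8), parameter :: rbx = 216_i8
   ! Value that should have been loaded into R15 (0x17).
   integer(i8), parameter :: r15_good = 23_i8
   ! Signed 64-bit reading of the bit pattern 0x8080808080808080.
   integer(i8), parameter :: r15_bad = -9187201950435737472_i8

   ! Mirror the INC R15 that precedes the IMUL, then test the product.
   print *, 'good value, product fits: ', product_fits(r15_good + 1_i8, rbx)
   print *, 'bad value,  product fits: ', product_fits(r15_bad + 1_i8, rbx)

contains

   ! True when a*b is certain to fit in a signed 64-bit integer.
   ! Assumes b > 0; slightly conservative at the most negative value.
   logical function product_fits(a, b)
      integer(i8), intent(in) :: a, b
      product_fits = abs(a) <= huge(a) / b
   end function product_fits

end program stride_overflow_sketch
```

The first line should report T (24 * 216 = 5184) and the second F, which matches the IMUL setting the overflow flag only when the garbage value has been loaded.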
APOLOGY: This is rather long and technical, and is not meant for general reading. The reported details
Last edited by mecej4 on Sat Nov 05, 2022 7:08 pm; edited 1 time in total |
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8019 Location: Salford, UK
|
Posted: Sat Nov 05, 2022 5:54 pm Post subject: |
|
|
mecej4
Thank you for the feedback. I will make a note that this should be investigated. |
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8019 Location: Salford, UK
|
Posted: Sat Nov 05, 2022 7:01 pm Post subject: |
|
|
mecej4
I have had an initial look at this issue, enough to think that it might be very difficult to fix.
There are lots of arrays that have assumed size and a function is passed as an argument.
Basically the compiler is generating temporaries that are not being set. In other words the code is just too complex for some aspect of the checking mechanism. (The next stage might be to switch off parts of the checking to see when the fault goes away.)
The following kind of code is worrying...
rwork (lewt:lewt+n-1) = 1.0d0 / &
& (rtol*Abs(rwork(lyh:lyh+n-1))+atol)
Expansion into a do loop would be simpler and more efficient when parallelisation is not available. |
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1897
|
Posted: Sat Nov 05, 2022 10:23 pm Post subject: |
|
|
Paul, thanks for your analysis.
The original F77 source code with which I first noticed the issue is available, but that is about 30,000 lines of code, with lots of "spaghetti" style labelled statements connected by lots of GOTO statements. My reproducer originated in the Demo5 test problem of the LLNL Odepack package ( https://computing.llnl.gov/projects/odepack/software ).
The Vast77to90 code polisher that I used for generating the reproducer can be asked to refrain from outputting vector expressions such as those that you pointed out. I shall try to provide a cleaned up F77 version of the test code. |
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1897
|
Posted: Sun Nov 06, 2022 1:29 am Post subject: |
|
|
Paul, here is a link to a Fortran 77 fixed form version of the same test program:
https://www.dropbox.com/s/ytsa5dfu9j4wb2v/psovfl77.zip?dl=0
This version does not use any Fortran 90 features. In fact, this version can be built and run to completion without error using the following compilers that allow checking for integer overflow:
Code: | CVF6.6C     df /check:overflow
NAG 7.1     nagfor -C=intovf
            nagfor -C=intovf -C=undefined -gline
FTN77 4.03  ftn77 /check
FTN95 7.2   ftn95 /check
            ftn95 /undef
|
This Fortran 77 code, compiled with the current 8.92 compiler with /check, exhibits the same integer overflow issues that I described earlier in this thread for the Fortran 95 version. |
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2587 Location: Sydney
|
Posted: Tue Nov 08, 2022 10:50 am Post subject: |
|
|
mecej4,
Perhaps you could consider the following replacements for vector syntax in the example code Paul posted?
Code: | ! Option 1: explicit DO loop over the n elements of the segment
do i = 0,n-1
   rwork (lewt+i) = 1.0d0 / ( rtol*Abs(rwork(lyh+i)) + atol )
end do
! Option 2: the equivalent FORALL statement
Forall ( i = 0:n-1 ) rwork (lewt+i) = 1.0d0 / ( rtol*Abs(rwork(lyh+i)) + atol )
|
Could this be a place where "forall" would be useful? |
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1897
|
Posted: Tue Nov 08, 2022 5:31 pm Post subject: |
|
|
John, I think that Forall is "obsolescent" as of Fortran 2018. Apart from that, such changes (array syntax to Forall, Do Concurrent, etc.) could be viable when one is writing new code, but in the context of what I am working on (modernising Odepack), I am constrained to using whatever the code polisher puts out.
The original code from LLNL is Fortran 77, about 30,000 lines, with a high Spaghetti Index in portions of it. When FTN95 turns up a bug in that code, there is not much that I can do to fix the bug in the original code, given the thousands of statement labels and GoTo statements (many are computed GoTos). So, to make progress feasible, I use a code reformatter, and use FTN95 with /check or /undef to find and fix bugs.
In summary, I know that I have 1) original code with bugs, 2) reformatted code that is easier to work with but which may contain new bugs introduced in the reformatting process, and I want a bug-checking compiler that can help me find and fix the bugs.
If the compiler only generated a few false messages, I could live with that. However, when a faulty subscript calculation causes integer overflow and causes the program to abort, I am stuck. It would be nice if there were an /Inhibit_Check option for bypassing only those integer overflows that arise from subscript calculations. |
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8019 Location: Salford, UK
|
Posted: Wed Nov 16, 2022 2:11 pm Post subject: |
|
|
mecej4
This turns out to be essentially the same issue as http://forums.silverfrost.com/viewtopic.php?t=4534&start=0.
Bounds checking is being applied to a 1D array that is passed as an argument but is treated as a 2D (star-sized) array in the subprogram.
The bounds checking mechanism can't cope with this complexity.
The existing bypass using /inhibit_check 20 is not sufficient for this particular program but it has been extended so that the same "fix" will work for this program in the future. |
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8019 Location: Salford, UK
|
Posted: Wed Nov 16, 2022 5:33 pm Post subject: |
|
|
Rather than extend inhibit option 20, a new inhibit option 21 has been added for the current context, so inhibit 20 itself has not changed.
Inhibit 21 applies the "fix" for this problem and implies inhibit 20 as well. |
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1897
|
Posted: Thu Nov 17, 2022 12:10 pm Post subject: Re: |
|
|
PaulLaidler wrote: | ...
Bounds checking is being applied to a 1D array that is passed as an argument but is treated as a 2D (star-sized) array in the subprogram.
The bounds checking mechanism can't cope with this complexity.
|
Thanks for looking into this issue, Paul.
Your new option /inhibit_check 21 should be quite helpful with many of the numerical algorithms codes, mostly in Fortran 77, that were published in Netlib, TOMS, etc. during the 1980s, 1990s and some even later.
Most of these algorithms needed to use arrays whose size is not known at compile time. Since allocatable arrays were not available in Fortran 77, it was common practice to declare a large 1-D array in the main program, and hand out segments of that array to subroutines as adjustable work-space. As you noted, 1-D segments were used as actual arguments even when the corresponding dummy arguments were 2-D, 3-D, etc. With such usage, it is only reasonable to check if all accesses to the array are within the bounds of the actual argument. Quite often, it is impossible to detect where one work array segment ends and where the next segment begins. |
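The workspace idiom described above can be illustrated with a minimal sketch. All names and sizes here (workspace_demo, fill_yh, lrw, nyh, ncol, and the offsets lyh and lewt) are hypothetical and chosen for illustration; they are not taken from Odepack or from the reproducer. A segment of a 1-D work array is passed by its first element and viewed as a 2-D assumed-size array in the subprogram.

```fortran
program workspace_demo
   implicit none
   integer, parameter :: lrw  = 100   ! total workspace length (hypothetical)
   integer, parameter :: nyh  = 4     ! leading dimension of the 2-D view
   integer, parameter :: ncol = 3     ! number of columns actually used
   double precision :: rwork(lrw)
   integer :: lyh, lewt

   rwork = 0.0d0
   lyh  = 21                ! start of the segment viewed as yh(nyh,*)
   lewt = lyh + nyh*ncol    ! the next segment starts right after it

   ! Pass the 1-D segment by its first element (sequence association).
   call fill_yh(rwork(lyh), nyh, ncol)

   print *, 'first element of segment :', rwork(lyh)      ! yh(1,1)
   print *, 'last element of segment  :', rwork(lewt-1)   ! yh(nyh,ncol)
   print *, 'start of next segment    :', rwork(lewt)     ! not touched
end program workspace_demo

subroutine fill_yh(yh, nyh, ncol)
   implicit none
   integer, intent(in) :: nyh, ncol
   ! Assumed-size dummy: the extent of the last dimension is unknown
   ! here, so only the bounds of the actual argument are meaningful.
   double precision, intent(inout) :: yh(nyh, *)
   integer :: i, j

   do j = 1, ncol
      do i = 1, nyh
         yh(i, j) = dble(10*j + i)
      end do
   end do
end subroutine fill_yh
```

Element yh(i,j) lands at rwork(lyh + (j-1)*nyh + i - 1), so nothing in the subprogram marks where the yh segment ends and the segment starting at lewt begins.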
|
Back to top |
|
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|