Topic: Puzzling and Spurious Aborts with Integer Overflow in Support

mecej4

Posts: 1911

5 Nov 2022 1:30 (Edited: 5 Nov 2022 6:08) #29581

I now have a working demonstration of what I believe is a bug in the bug-checking code that is introduced into an EXE produced by FTN95 with options such as /check, /undef, etc.

SYMPTOMS

One specific symptom is that the EXE, when run, aborts in totally unexpected and inexplicable contexts. A seemingly irrelevant change (adding a redundant IMPLICIT NONE, adding an otherwise unused variable to the declarations section, including a subprogram that never gets called, changing the compiler options or the input data) may cause the overflow abort to disappear, only for the bug to resurface when some other slight change is made to the program source.

These integer overflows are exhibited even for sources which are error free and run without any problems using other compilers that provide for catching integer overflow (NAG, the old CVF). It is this property -- error stops resulting from running error-free programs -- that differentiates this kind of error from the more common errors such as undefined variables, argument mismatches, array overruns, etc.

The overflow is not related to integer variables in the test program, but apparently originates in subscript calculations and bounds checking, for which the compiler inserts immediate data bytes in the instruction stream or places those bytes in the stack.

BUG IS ELUSIVE

In the past, I tried to post a bug report for this problem, but the test programs were too big, and had bugs that I was trying to hunt down and fix using FTN95. When a test program with suspected bugs aborts with integer overflow, what is the basis for apportioning blame between the compiler and the test program?

REPRODUCER

I have prepared a 640 line program that should demonstrate the problem.

1. Download the single source file (in a zip file) from

   https://www.dropbox.com/s/qiqggjcac01z58j/psovfl.zip?dl=0

2. Compile and link with /check or /64 /check and run the program. It will abort (at least in my experience, using the 8.92 compiler) with integer overflow on line 553:

```
         ddn = dvnorm (n, yh(1, L), ewt) / tesco (1, nq) 
```

   (Note that the 64-bit debugger reports line 552, being off by one line.) In the variable pane, you can see yh listed as 'yh(216, invalid)'.

3. Repeat step 2, this time adding /imp as a compiler option. Note that no variables are flagged by the compiler as not being covered by type declarations. This EXE runs to normal completion!

The EXEs differ by one byte in two places. In effect, that one byte difference causes havoc.

Here is my interpretation of what goes wrong at the machine level. At the point where the integer overflow occurs, these are the instructions:

      40C28A  mov   r15, [rsp+0x2f0]
              inc   r15
              jno   ...
              mov   eax, 0x02
              int   0x9
              imul  r15, rbx        # RBX has D8 (= 216)
              jno   ...

The value loaded into R15 should be the first subscript extent of array YH:

      MOV_Q     R15,(YH:size:1)

If you step to the instruction following this instruction (I noted EIP = 40C28A for the 64 bit program), you find that the value loaded into R15 from the stack is 0x8080808080808080 when the EXE has been compiled with /64 /check, instead of the correct value, namely, 0x0000000000000017, for the extent of the second subscript of YH, minus 1. When this huge value (already so large that it is a negative number in two's-complement notation) is multiplied by RBX, which contains 0x00000000000000D8 = 216, the extent of the first subscript of YH(:,:), integer overflow results. When /imp is used in addition to /check, the correct value gets loaded and the whole problem goes away.
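The arithmetic can be checked with a small sketch. It is not taken from the reproducer: the program name, the helper would_overflow and the variable names are made up for illustration, and the only values used are the ones quoted above (0x8080808080808080, 0x17 and 216). The overflow test is done by division, so the sketch itself never overflows.

```fortran
program overflow_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64), parameter :: extent1 = 216_int64  ! value quoted for RBX (0xD8)
  integer(int64), parameter :: good    = 23_int64   ! 0x17, the value expected in R15
  integer(int64) :: junk
  integer :: i

  ! Build the bit pattern 0x8080808080808080 one byte at a time.
  ! ISHFT and IOR are bit operations, so setting the sign bit is harmless.
  junk = 0_int64
  do i = 1, 8
     junk = ior(ishft(junk, 8), 128_int64)
  end do

  print '(a,z16.16,a,i0)', 'junk = 0x', junk, ' = ', junk
  print '(a,l1)', 'good*216 overflows: ', would_overflow(good, extent1)
  print '(a,l1)', 'junk*216 overflows: ', would_overflow(junk, extent1)

contains

  ! Conservative test: true when |a|*b exceeds huge(a).
  ! Assumes b > 0 and a /= -huge(a)-1, which holds for both calls above.
  logical function would_overflow(a, b)
    integer(int64), intent(in) :: a, b
    would_overflow = abs(a) > huge(a) / b
  end function would_overflow

end program overflow_sketch
```

With the expected value 0x17 the product is 23 * 216 = 4968, far inside the 64-bit range, whereas the magnitude of the uninitialised pattern is already close to huge(a), so any multiplier larger than 1 pushes it out of range.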

The same sort of thing happens with 32-bit EXEs, with the culminating instruction:

405196 IMUL EDI,EDX

APOLOGY: This is rather long and technical, and is not meant for general reading. The reported details should help in fixing the problem. Some of my guesswork regarding the compiler's internal workings may well be wrong.

PaulLaidler

Posts: 7971 Salford, UK

5 Nov 2022 4:54 #29584

mecej4

Thank you for the feedback. I will make a note that this should be investigated.

PaulLaidler

Posts: 7971 Salford, UK

5 Nov 2022 6:01 #29586

mecej4

I have had an initial look at this issue. Enough to think that it might be very difficult to fix.

There are lots of arrays that have assumed size and a function is passed as an argument.

Basically the compiler is generating temporaries that are not being set. In other words the code is just too complex for some aspect of the checking mechanism. (The next stage might be to switch off parts of the checking to see when the fault goes away.)

The following kind of code is worrying...

  rwork (lewt:lewt+n-1) = 1.0d0 / &
 & (rtol*Abs(rwork(lyh:lyh+n-1))+atol)

Expansion into a do loop would be simpler and more efficient when parallelisation is not available.

mecej4

Posts: 1911

5 Nov 2022 9:23 #29587

Paul, thanks for your analysis.

The original F77 source code with which I first noticed the issue is available, but that is about 30,000 lines of code, with lots of 'spaghetti' style labelled statements connected by lots of GOTO statements. My reproducer originated in the Demo5 test problem of the LLNL Odepack package ( https://computing.llnl.gov/projects/odepack/software ).

The Vast77to90 code polisher that I used for generating the reproducer can be asked to refrain from outputting vector expressions such as those that you pointed out. I shall try to provide a cleaned up F77 version of the test code.

mecej4

Posts: 1911

6 Nov 2022 12:29 #29589

Paul, here is a link to a Fortran 77 fixed form version of the same test program:

https://www.dropbox.com/s/ytsa5dfu9j4wb2v/psovfl77.zip?dl=0

This version does not use any Fortran 90 features. In fact, this version can be built and run to completion without error using the following compilers that allow checking for integer overflow:

CVF6.6C             df /check:overflow
NAG 7.1             nagfor -C=intovf
                    nagfor -C=intovf -C=undefined -gline
FTN77 4.03          ftn77 /check
FTN95 7.2           ftn95 /check
                    ftn95 /undef

This Fortran 77 code, compiled with the current 8.92 compiler with /check, exhibits the same integer overflow issues that I described earlier in this thread for the Fortran 95 version.

JohnCampbell

Posts: 2526 Sydney

8 Nov 2022 9:50 #29598

mecej4,

Perhaps you could consider the following replacements for the vector syntax in the example code Paul posted?

do i = 0,n-1
  rwork (lewt+i) = 1.0d0 / ( rtol*Abs(rwork(lyh+i)) + atol )
end do

Forall ( i = 0:n-1 ) rwork (lewt+i) = 1.0d0 / ( rtol*Abs(rwork(lyh+i)) + atol )

Could this be a place where 'forall' would be useful?

mecej4

Posts: 1911

8 Nov 2022 4:31 #29605

John, I think that Forall is 'obsolescent' as of Fortran 2018. Apart from that, such changes (array syntax to Forall, Do Concurrent, etc.) could be viable when one is writing new code, but in the context of what I am working on (modernising Odepack), I am constrained to using whatever the code polisher puts out.

The original code from LLNL is Fortran 77, about 30,000 lines, with a high Spaghetti Index in portions of it. When FTN95 turns up a bug in that code, there is not much that I can do to fix the bug in the original code, given the thousands of statement labels and GoTo statements (many are computed GoTos). So, to make progress feasible, I use a code reformatter, and use FTN95 with /check or /undef to find and fix bugs.

In summary, I know that I have 1) original code with bugs, 2) reformatted code that is easier to work with but which may contain new bugs introduced in the reformatting process, and I want a bug-checking compiler that can help me find and fix the bugs.

If the compiler only generated a few false messages, I could live with that. However, when a faulty subscript calculation causes integer overflow and causes the program to abort, I am stuck. It would be nice if there were an /Inhibit_Check option for bypassing only those integer overflows that arise from subscript calculations.

PaulLaidler

Posts: 7971 Salford, UK

16 Nov 2022 1:11 #29631

mecej4

This turns out to be essentially the same issue as https://forums.silverfrost.com/Forum/Topic/4052&start=0.

Bounds checking is being applied to a 1D array that is passed as an argument but is treated as a 2D (star-sized) array in the subprogram.

The bounds checking mechanism can't cope with this complexity.

The existing bypass using /inhibit_check 20 is not sufficient for this particular program but it has been extended so that the same 'fix' will work for this program in the future.

PaulLaidler

Posts: 7971 Salford, UK

16 Nov 2022 4:33 #29633

Rather than extend inhibit option 20, a new inhibit option 21 has been added for the current context, so inhibit 20 itself has not changed.

Inhibit 21 applies the 'fix' for this problem and also does everything that inhibit 20 does.

mecej4

Posts: 1911

17 Nov 2022 11:10 #29635

Quoted from PaulLaidler ... Bounds checking is being applied to a 1D array that is passed as an argument but is treated as a 2D (star-sized) array in the subprogram.

The bounds checking mechanism can't cope with this complexity.

Thanks for looking into this issue, Paul.

Your new option /inhibit_check 21 should be quite helpful with many of the numerical algorithm codes, mostly in Fortran 77, that were published in Netlib, TOMS, etc. during the 1980s, 1990s and some even later.

Most of these algorithms needed to use arrays whose size is not known at compile time. Since allocatable arrays were not available in Fortran 77, it was common practice to declare a large 1-D array in the main program, and hand out segments of that array to subroutines as adjustable work-space. As you noted, 1-D segments were used as actual arguments even when the corresponding dummy arguments were 2-D, 3-D, etc. With such usage, it is only reasonable to check if all accesses to the array are within the bounds of the actual argument. Quite often, it is impossible to detect where one work array segment ends and where the next segment begins.
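A minimal sketch of this practice, for readers who have not met it. It is not code from Odepack: the names workspace_sketch, fill, lrw, n and m and all the sizes are made up for illustration, and only rwork, lyh, lewt, yh and ewt echo names that appear earlier in this thread.

```fortran
program workspace_sketch
  implicit none
  integer, parameter :: lrw = 100      ! total work-space length (arbitrary)
  integer, parameter :: n = 4, m = 3   ! shape wanted by the callee (arbitrary)
  double precision :: rwork(lrw)
  integer :: lyh, lewt

  lyh  = 1              ! segment 1: n*m elements, used as a 2-D array
  lewt = lyh + n*m      ! segment 2: n elements, used as a 1-D array
  rwork = 0.0d0

  ! Pass array elements; sequence association lets the callee treat
  ! the storage starting at each element as an array of its own shape.
  call fill(n, m, rwork(lyh), rwork(lewt))

  print '(a,f6.1)', 'rwork(lyh + n*m - 1) = ', rwork(lyh + n*m - 1)
  print '(a,f6.1)', 'rwork(lewt)          = ', rwork(lewt)
end program workspace_sketch

subroutine fill(n, m, yh, ewt)
  implicit none
  integer, intent(in) :: n, m
  double precision, intent(inout) :: yh(n, *)   ! assumed size: last extent unknown
  double precision, intent(inout) :: ewt(*)
  integer :: i, j

  ! The callee sees a 2-D array, although the caller owns only 1-D storage.
  do j = 1, m
     do i = 1, n
        yh(i, j) = 10.0d0*j + i
     end do
  end do
  ewt(1:n) = 1.0d0
end subroutine fill
```

Inside fill nothing records where the yh segment stops and the ewt segment starts, so the most a run-time checker can verify is that an access such as yh(i, j) still falls inside rwork as a whole.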