forums.silverfrost.com

mecej4 · Joined: 31 Oct 2006 Posts: 1928

There is an elusive code generation bug in FTN95 that occurs when one compiles a program in which DO loops are nested 3-deep or more. I have caught the bug only in 32-bit compilations. The bug surfaces more often with /opt, but I have also seen it occur with certain source codes when I compiled with /check and then not occur with /opt. The bug is data dependent, and that is another reason why it is so elusive. I have seen the bug with FTN95 7.2, 8.3 and 8.30 beta 279.

The shortest code that I have to exhibit the bug is about 325 lines, and is available at https://www.dropbox.com/s/1yuzdtla5bl4a5b/cdfbug.7z?dl=0 . Extract the source, data and batch files from the archive. Set the environment variable OPT to /p6 /opt (that is, set OPT=/p6 /opt), build with the batch file bld.bat, and run the resulting EXE.

I have tested the test code using Gfortran, Intel and NAG compilers. They gave identical results without any crashes.

I have done extensive tracking at the assembly level, and here is what I see happening. Register EBX contains the address of the dummy argument V of subroutine SLNPRO, whose base address is copied from [EBP+8] for quick access throughout the subroutine. EBX is also used to hold the DO index JJ of the inner loop that starts at Line 123 of SLNPRO.F90. During the next iteration of the outer loop LL1, this value of EBX is used as the address of V, the first dummy argument of Subroutine SLNPRO. The error can go unnoticed unless the value of EBX is such as to cause a memory access violation, floating point error, etc. Note the value of EBX after the crash. I see 0000000D, which certainly cannot be the base address of V. There are several places where EBX is refreshed by reading [EBP+8], and places where EBX is re-saved to [EBP+8] (which I think is unnecessary). In the instance that I just described, this save/restore operation was not present. In effect, the last used value of JJ is used as the base address of V and this can cause havoc.
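To make the mechanism described above easier to follow, here is a toy model in Python. It is only a sketch of the register reuse, not FTN95's actual code generation: the address V_BASE, the function name bases_seen, and the loop bounds are made up for illustration (the inner bound is chosen so that the last value of JJ is 13, matching the 0000000D seen in EBX after the crash).

```python
# Toy model of the register clash; NOT the real FTN95 code generator.
V_BASE = 0x0040A000  # hypothetical base address of V, stored at [EBP+8]


def bases_seen(reload_ebx):
    """Return the value EBX holds each time the outer loop uses it as the address of V."""
    frame = {"EBP+8": V_BASE}          # stack slot holding the address of V
    regs = {"EBX": frame["EBP+8"]}     # EBX caches that address on entry
    seen = []
    for ll1 in range(1, 3):            # outer loop LL1 (bounds assumed)
        if reload_ebx:
            regs["EBX"] = frame["EBP+8"]   # the refresh that was missing
        seen.append(regs["EBX"])       # EBX is used as the base address of V
        for jj in range(1, 14):        # inner loop reuses EBX as the index JJ
            regs["EBX"] = jj
    return seen


print([hex(b) for b in bases_seen(reload_ebx=True)])   # ['0x40a000', '0x40a000']
print([hex(b) for b in bases_seen(reload_ebx=False)])  # ['0x40a000', '0xd']
```

Without the reload, the second pass of the outer loop treats the leftover loop index as the address of V, which is the situation described above.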

This is a complicated bug, and investigating it is made difficult by the absence of facilities in SDBG to do low level debugging. I have found it possible to see how the bug occurs by looking for a long time at the /exp listing of SLNPRO after I compiled with /p6 /opt. If you wish to read those arguments, I shall be happy to provide them.

Thanks.

[P.S., added 9/29/2018: A short reproducer posted on Sat Sep 29, 2018 (see later in this thread) displays the register usage bug with many versions of FTN95, 32 as well as 64-bit, along with a number of different options.]

PaulLaidler · Posted: Sat Sep 22, 2018 6:48 am

I will make a note that this needs investigating.

At one time you could use F11 in SDBG to get the assembly level code but I am not sure that this still works.

mecej4 · Joined: 31 Oct 2006 Posts: 1928

Thanks for having this bug investigated.

I do use F11 as far as it can take me, but the assembly level capabilities are very restricted. Breakpoints cannot be set by address, and step-in and step-over do not work. For example, with the cursor on a CALL instruction to a library routine, one has to press the Step key once for each instruction in the library routine, and that count may be unknown.

By the way, FTN95 /exp produces 32-bit listings with the assembly and source instructions interspersed, whereas 64-bit listings are less useful because the source is segregated from the assembly listing. Even in 32-bit listings I found that the frequent use of pseudo-variables such as "Address of QROW" for, say, [EBP+12] makes low level work more difficult.

I did use the Visual Studio debugger as well as EDB (originally from Edinburgh Portable Compiler) in this project.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2627 Location: Sydney

mecej4,

I was able to reproduce your error using V8.20.0.

I also made some minor changes to see where the program was going, e.g. in ssetv.f90, and the bug disappeared. The problem looks to be elusive, as you describe.

mecej4 · Joined: 31 Oct 2006 Posts: 1928

John,

Thanks for putting in the effort to test for the presence of the bug in V8.2 and reporting your findings.

Bugs of this type are quite unpleasant and harmful, since a small change, such as removing a diagnostic WRITE statement from working code, may make the bug surface. Even then, if the user is not suspicious and on the lookout, wrong results may be taken as correct, so we should be happy when the program crashes.

Who knows, if FTN95-64 uses the same register allocation algorithm as FTN95-32, the same problem could occur there too. With 16 registers instead of 8, it might then surface only in huge programs such as CFD and FEA codes.

mecej4 · Joined: 31 Oct 2006 Posts: 1928

Paul, I managed to put together a short reproducer that contains the essentials of the code generation bug. The code is legal Fortran, and is error free. The bug may be seen by compiling and running with /opt /p6. The program will abort with the message "The instruction at address xxxxxxxx attempted to read from location 00000029". At this point, EBX = 1, ESI = 6, and the instruction is DFLD [ebx - 0x8 + esi*8].
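As a sanity check on the figures quoted above, the faulting address follows directly from the register values. The small Python sketch below evaluates the operand [ebx - 0x8 + esi*8] with 32-bit wraparound; the helper name effective_address is made up for this illustration.

```python
def effective_address(ebx, esi, disp=-0x8, scale=8):
    """Address of an operand of the form [ebx + disp + esi*scale], wrapped to 32 bits."""
    return (ebx + disp + esi * scale) & 0xFFFFFFFF


# Register values reported at the point of the crash.
addr = effective_address(ebx=1, esi=6)
print(f"{addr:08X}")  # 00000029
```

So 1 - 8 + 6*8 = 41 = 0x29, which matches the location in the error message and is consistent with EBX holding a small integer rather than an array base address.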

PaulLaidler · Posted: Sun Sep 23, 2018 4:08 pm

Thanks.

LitusSaxonicum · Posted: Sun Sep 23, 2018 6:19 pm

I abandoned /opt many years ago, thinking that I wasn't getting the right answer (in a case where I knew what the answer was, and got it without /opt) and believing it was the result of code re-arrangement. Perhaps fortunately, all my stuff executes very quickly without it on modern computers, and the 'need for speed' is not essential in a Windows program in which the pace is governed by human reactions and thought processes. However, it wasn't crashing.

Is it the /P6, the /opt, or the combination that is at fault?

Eddie

mecej4 · Joined: 31 Oct 2006 Posts: 1928

PaulLaidler · Posted: Thu Sep 27, 2018 12:47 pm

I have tested both the original code and the cut down version a number of times but unfortunately I don't get the crash.

I am guessing that a fix would involve switching off Fortran loop optimisation upon encountering a use of CYCLE. This would be a non-trivial task and I am not sure that it is worth putting on the list of things to do given the general move to 64 bit compilation.

If 32 bit optimisation makes a significant difference to the run time for this code, I can only suggest (at least for now) that you try switching off Fortran loop optimisation which I think would involve using "/inhibit_opt 43".

mecej4 · Joined: 31 Oct 2006 Posts: 1928

That is strange, Paul. With the short code above, I get a crash with FTN95 versions 6.35, 7.1, 7.2, 8.1, 8.2, 8.3 and 8.3.279.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2627 Location: Sydney

Using FTN95/Win32 Ver. 8.20.0 I tried the short code example posted on Sunday night and:
it crashed with set opt=/p6 /opt
but worked with set opt=/p6 /opt /inhibit_opt 43

my bat file is

mecej4 · Joined: 31 Oct 2006 Posts: 1928

Thanks for the quick response, John.

My expectations for optimisation are a bit nuanced. If the source code is correct and standard-conforming, choosing /opt should be a trade-off of compilation speed for faster execution. Integer and character results should be identical whether or not /opt was used. Floating point calculations can yield slightly different results.

If the source code is not quite standard-conforming, /opt becomes an adventurous option; code that works fine without /opt may fail now and then, but should work most of the time. Once in a while, we may even find that using /opt can cause slower runs.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2627 Location: Sydney

I have learnt to be very selective about where I use /opt, typically applying it only to code that is a few lines long. Using /opt on large source files can produce unexpected results that become too difficult to debug.
It would be a good outcome if this thread identified a problem that could eliminate a bug that occurs more generally.

PaulLaidler · Posted: Fri Sep 28, 2018 7:25 am

In general, with the progress towards 64 bit applications, it is not prudent for us to spend much time on 32 bit optimiser bugs. If they are easy to fix then all well and good. Otherwise we need to devote our resources to matters of greater impact and demand.

FTN95 is well known for its rapid compilation and good error reporting etc. When it comes to fast run time code it may be that other compilers can sometimes (or maybe often) do better. Going forward, this may not be the case for 64 bit applications where we have the potential to develop a fast LAPACK type library.

So for 32 bit applications the general rule is, (a) develop your application using CHECKMATE (b) test and get verifiable results in release mode (without optimisation) (c) if run time speed is important, try /opt but make sure you get the same test results as without it. Only use /opt after testing and when there is a clear improvement in run time speeds.
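One way to carry out step (c) is to capture the printed output of the non-optimised and the /opt builds and compare them mechanically. The Python sketch below is only an illustration under assumptions: the function names and the tolerance are made up, and it follows the expectation stated earlier in the thread that integer and character results should be identical while floating point results may differ slightly.

```python
import math


def _as_float(token):
    """Parse a real number, including Fortran D exponents such as 1.5D+00."""
    try:
        return float(token.replace("D", "E").replace("d", "e"))
    except ValueError:
        return None


def _is_integer(token):
    """True for tokens such as 10, -3 or +7."""
    return token.lstrip("+-").isdigit()


def outputs_match(ref_text, opt_text, rel_tol=1e-6):
    """Compare two program outputs token by token.

    Integers and text must match exactly; reals may differ within rel_tol.
    """
    ref_tokens, opt_tokens = ref_text.split(), opt_text.split()
    if len(ref_tokens) != len(opt_tokens):
        return False
    for ref, opt in zip(ref_tokens, opt_tokens):
        if ref == opt:
            continue
        if _is_integer(ref) or _is_integer(opt):
            return False                      # integer results must be identical
        x, y = _as_float(ref), _as_float(opt)
        if x is None or y is None:
            return False                      # character results must be identical
        if not math.isclose(x, y, rel_tol=rel_tol):
            return False                      # real results too far apart
    return True
```

In use, one would run each EXE with its output redirected to a file, read both files, and pass their contents to outputs_match. A False result means the /opt build should not be trusted until the difference is explained.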

Our aim is to make 64 bit optimisation more robust than proved to be the case with 32 bit optimisation. But in the end, for faster number crunching you may need to invest in a more expensive compiler and only use FTN95 during development. Alternatively you could invest in a faster processor, or more RAM etc.