forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Code Generation Bug
Goto page 1, 2  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support
View previous topic :: View next topic  
Author Message
mecej4



Joined: 31 Oct 2006
Posts: 1031

PostPosted: Fri Sep 21, 2018 9:52 pm    Post subject: Code Generation Bug Reply with quote

There is an elusive code generation bug in FTN95 that occurs when one compiles a program in which DO loops are nested 3-deep or more. I have caught the bug only in 32-bit compilations. The bug surfaces more often with /opt, but I have also seen it occur with certain source codes when I compiled with /check and then not occur with /opt. The bug is data dependent, and that is another reason why it is so elusive. I have seen the bug with FTN95 7.2, 8.3 and 8.30 beta 279.

The shortest code that I have to exhibit the bug is about 325 lines, and is available at https://www.dropbox.com/s/1yuzdtla5bl4a5b/cdfbug.7z?dl=0 . Unzip the source, data and batch files. Set the environment variable OPT=/p6 /opt, and use the batch file bld.bat. Run the resulting EXE.

I have tested the test code using Gfortran, Intel and NAG compilers. They gave identical results without any crashes.

I have done extensive tracking at the assembly level, and here is what I see happening. Register EBX contains the address of the dummy argument V of subroutine SLNPRO, whose base address is copied from [EBP+8] for quick access throughout the subroutine. EBX is also used to hold the DO index JJ of the inner loop that starts at Line 123 of SLNPRO.F90. During the next iteration of the outer loop LL1, this value of EBX is used as the address of V, the first dummy argument of Subroutine SLNPRO. The error can go unnoticed unless the value of EBX is such as to cause a memory access violation, floating point error, etc. Note the value of EBX after the crash. I see 0000000D, which certainly cannot be the base address of V. There are several places where EBX is refreshed by reading [EBP+8], and places where EBX is re-saved to [EBP+8] (which I think is unnecessary). In the instance that I just described, this save/restore operation was not present. In effect, the last used value of JJ is used as the base address of V and this can cause havoc.

This is a complicated bug, and investigating it is made difficult by the absence of facilities in SDBG to do low level debugging. I have found it possible to see how the bug occurs by looking for a long time at the /exp listing of SLNPRO after I compiled with /p6 /opt. If you wish to read those arguments, I shall be happy to provide them.

Thanks.

[P.S., added 9/29/2018: A short reproducer posted on Sat Sep 29, 2018 (see later in this thread) displays the register usage bug with many versions of FTN95, 32 as well as 64-bit, along with a number of different options.]


Last edited by mecej4 on Sun Sep 30, 2018 11:55 pm; edited 2 times in total
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5605
Location: Salford, UK

PostPosted: Sat Sep 22, 2018 6:48 am    Post subject: Reply with quote

I will make a note that this needs investigating.

At one time you could use F11 in SDBG to get the assembly level code but I am not sure that this still works.
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1031

PostPosted: Sat Sep 22, 2018 9:22 am    Post subject: Reply with quote

Thanks for having this bug investigated.

I do use F11 as far as it can take me, but assembly level capabilities are very restricted. Breakpoints cannot be set by address, step-in and step-over do not work. For example, with the cursor on a CALL instruction to a library routine, one has to press the Step key once for each instruction in the library routine -- whose count may be unknown.

By the way, FTN95 /exp produces 32-bit listings with the assembly and source instructions interspersed, where as 64-bit listings are less useful with the source segregated from the assembly listing. Even in 32-bit listings I found that the frequent use of pseudo-variables such as "Address of QROW" for, say, [EBP+12] makes low level work more difficult.

I did use the Visual Studio debugger as well as EDB (originally from Edinburgh Portable Compiler) in this project.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2007
Location: Sydney

PostPosted: Sat Sep 22, 2018 1:41 pm    Post subject: Reply with quote

mecej4,

I was able to reproduce your error using V8.20.0

I also did some minor changes to see where the program was going, eg ssetv.f90, and the bug disappeared. The problem looks to be elusive, as you describe.
Code:
subroutine ssetv(nump, numq, numgr, ftbwt, pwr, iopt, epsq, indlw, altbl, &
                 indf, wtble, mode, v, m)
   implicit none
   integer, parameter :: double = kind(0d0)
   integer ,      intent(in)    :: nump, numq, numgr, iopt, mode
   integer ,      intent(out)   :: m
   real(double) , intent(in)    :: epsq
   integer ,      intent(in)    :: indf(101), indlw(101)
   real(double) , intent(in)    :: ftbwt(101), altbl(101), wtble(101), pwr(101,15)
   real(double) , intent(inout) :: v(635,17)
!
   integer      :: ione, nmpp1, nmppq, n, np1, i, mm1, nnn
   real(double) :: eps
!
   write (*,*) 'entering ssetv'
   ione = mod(iopt,10)
...
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1031

PostPosted: Sat Sep 22, 2018 2:05 pm    Post subject: Reply with quote

John,

Thanks for putting in the effort to test for the presence of the bug in V8.2 and reporting your findings.

Bugs of this type are quite unpleasant and harmful, since a small change, such as removing a diagnostic WRITE statement from working code, may make the bug surface. Even then, if the user is not suspicious and on the lookout, wrong results may be taken as correct, so we should be happy when the program crashes.

Who knows, if FTN95-64 uses the same register allocation algorithm as FTN95-32, the same problem could occur -- now, with 16 instead of 8 registers, only with huge programs such as CFD and FEA programs.
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1031

PostPosted: Sun Sep 23, 2018 3:39 pm    Post subject: Short reproducer for code generation bug Reply with quote

Paul, I managed to put together a short reproducer that contains the essentials of the code generation bug. The code is legal Fortran, and is error free. The bug may be seen by compiling and running with /opt /p6. The program will abort with the message "The instruction at address xxxxxxxx attempted to read from location 00000029". At this point, EBX = 1, ESI = 6, and the instruction is DFLD [ebx - 0x8 + esi*8].
Code:
program slnpro
   implicit none
   integer, parameter :: double = kind(0.0d0)
   integer :: m=4, n=1
   real(double)  :: v(4,2)
   data v/ 6.7307d-1, 1.4748d-1, 3.4416d-1, 7.3645d-1, &
      6.1452d-1, 5.5309d-3, 5.4214d-1, 4.3507d-1/
!
   call sub(v,m,n)
end program slnpro

subroutine sub(v,m,n)
   implicit none
   integer, parameter :: double = kind(0.0d0)
   integer, intent(in) :: m,n
   real(double) :: v(m,n+1)
   integer :: mxrkn=2,jj,lrknt,j=2,id=3,np1,ii
   real(double) :: rea=1d2, rowq, amin, rea2=-1d-4,rtcol,y=-2d0

   np1=n+1
   amin=0d0

Outer: do ii = n, m
   if (ii == id) cycle
   rowq = v(ii,j)/v(id,j)            ! crash here during second iteration
                                     ! EBX should contain the base address of V,
                                     ! but actually contains the last value of JJ
                                     ! from the previous execution of the inner loop
   rtcol = v(ii,np1) - v(id,np1)*rowq
   if (rtcol + rea2 >= 0.D0) cycle
   lrknt = 3
   do jj = 1, n
      if (jj == j) cycle
      if (y+v(ii,jj) > rea) cycle
      lrknt = lrknt + 1
      if (lrknt > mxrkn) cycle  Outer ! this jump is taken, with EBX containing JJ
   end do
   ! when the DO JJ loop terminates normally, EBX is restored to base address of V
   if (lrknt >= mxrkn) then
      if (lrknt > mxrkn) cycle
      if (v(id,j) >= amin) cycle
   endif
   mxrkn = lrknt
   amin = v(id,j)
end do Outer
   return
end subroutine sub
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5605
Location: Salford, UK

PostPosted: Sun Sep 23, 2018 4:08 pm    Post subject: Reply with quote

Thanks.
Back to top
View user's profile Send private message
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 1879
Location: Yateley, Hants, UK

PostPosted: Sun Sep 23, 2018 6:19 pm    Post subject: Reply with quote

I abandoned /opt many years ago, thinking that I wasn't getting the right answer - in a case when I knew what the answer was, and got it without /opt - believing it was the result of code re-arrangement. Perhaps fortunately, all my stuff executes very quickly without on modern computers, and also, the'need for speed' is not essential with a Windows program in which the pace is governed by human reactions and thought processes.However, it wasn't crashing.

Is it the /P6, the /opt, or the combination that is at fault?

Eddie
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1031

PostPosted: Sun Sep 23, 2018 10:51 pm    Post subject: Re: Reply with quote

LitusSaxonicum wrote:
Is it the /P6, the /opt, or the combination that is at fault?

In this instance, it is the presence of /opt that leads to bad code. I specified the /P6 to have the compiler generate the x87 compare-and-set-flags instructions.

If you still have some old codes that gave correct results without /opt and bad results with /opt, I should appreciate your making them available.
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5605
Location: Salford, UK

PostPosted: Thu Sep 27, 2018 12:47 pm    Post subject: Reply with quote

I have tested both the original code and the cut down version a number of times but unfortunately I don't get the crash.

I am guessing that a fix would involve switching off Fortran loop optimisation upon encountering a use of CYCLE. This would be a non-trivial task and I am not sure that it is worth putting on the list of things to do given the general move to 64 bit compilation.

If 32 bit optimisation makes a significant difference to the run time for this code, I can only suggest (at least for now) that you try switching off Fortran loop optimisation which I think would involve using "/inhibit_opt 43".
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1031

PostPosted: Fri Sep 28, 2018 12:20 am    Post subject: Reply with quote

That is strange, Paul. With the short code above, I get a crash with FTN95 versions 6.35, 7.1, 7.2, 8.1, 8.2, 8.3 and 8.3.279.
Quote:
s:\lang\ftn95>ftn95 /p6 /opt slnpro.f90 /link
[FTN95/Win32 Ver. 8.30.279 Copyright (c) Silverfrost Ltd 1993-2018]
NO ERRORS [<SLNPRO> FTN95 v8.30.279]
NO ERRORS [<SUB> FTN95 v8.30.279]
Creating executable: slnpro.EXE

s:\lang\ftn95>slnpro

The crash report:
Quote:
Runtime error from program:s:\lang\ftn95\slnpro.exe
Access Violation
The instruction at address 0040110f attempted to read from location 00000029

00401050 SUB [+00bf]
00401000 main [+003f]

eax=00000002 ebx=00000001 ecx=00000003
edx=00000004 esi=00000006 edi=00404000
ebp=0360fc98 esp=0360fc00 IOPL=2
ds=002b es=002b fs=0053
gs=002b cs=0023 ss=002b
flgs=00010297 [CA EP NZ NS DN NV]

0040110f dfld [ebx-0x8+esi*8]
00401113 dfdiv [ebp-0x68]
00401116 dfstp [ebp-0x28]

The offsets in the traceback, and the contents of EBP and ESP (which are related to those offsets) vary with the compiler version, but the values of the other registers as well as the reported instructions are identical.

John Campbell, if you read this, please try the short example code with any version of FTN95.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2007
Location: Sydney

PostPosted: Fri Sep 28, 2018 2:15 am    Post subject: Reply with quote

Using FTN95/Win32 Ver. 8.20.0 I tried the short code example posted on Sunday night and:
it crashed with set opt=/p6 /opt
but worked with set opt=/p6 /opt /inhibit_opt 43

my bat file is
Code:
set opt=/p6 /opt /inhibit_opt 43

del %1.obj
del %1.exe

ftn95 %1.f90 %opt%
slink %1.obj /out:%1.exe

%1


Paul, given how difficult it is to identify errors associated with these register mis-settings, it would be good to have a better understanding of what is causing this and how extensive it may be for using /opt.

John
Back to top
View user's profile Send private message
mecej4



Joined: 31 Oct 2006
Posts: 1031

PostPosted: Fri Sep 28, 2018 2:40 am    Post subject: Reply with quote

Thanks for the quick response, John.

My expectations for optimisation are a bit nuanced. If the source code is correct and standard-conforming, choosing /opt should be a trade-off of compilation speed for faster execution. Integer and character results should be identical whether or not /opt was used. Floating point calculations can yield slightly different results.

If the source code is not quite standard-conforming, /opt becomes an adventurous option; code that works fine without /opt may fail now and then, but should work most of the time. Once in a while, we may even find that using /opt can cause slower runs.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2007
Location: Sydney

PostPosted: Fri Sep 28, 2018 3:38 am    Post subject: Reply with quote

I have learnt to be very selective where I use /opt, typically with code that is only a few lines long. Using /opt with large code (files) can produce unexpected results, that become too difficult to debug.
It would be a good outcome if this thread identified a problem that could eliminate a bug that occurs more generally.
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 5605
Location: Salford, UK

PostPosted: Fri Sep 28, 2018 7:25 am    Post subject: Reply with quote

In general, with the progress towards 64 bit applications, it is not prudent for us to spend much time on 32 bit optimiser bugs. If they are easy to fix then all well and good. Otherwise we need to devote our resources to matters of greater impact and demand.

FTN95 is well known for its rapid compilation and good error reporting etc. When it comes to fast run time code it may be that other compilers can sometimes (or maybe often) do better. Going forward, this may not be the case for 64 bit applications where we have the potential to develop a fast LAPACK type library.

So for 32 bit applications the general rule is, (a) develop your application using CHECKMATE (b) test and get verifiable results in release mode (without optimisation) (c) if run time speed is important, try /opt but make sure you get the same test results as without it. Only use /opt after testing and when there is a clear improvement in run time speeds.

Our aim is to make 64 bit optimisation more robust than proved to be the case with 32 bit optimisation. But in the end, for faster number crunching you may need to invest in a more expensive compiler and only use FTN95 during development. Alternatively you could invest in a faster processor, or more RAM etc.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> Support All times are GMT + 1 Hour
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group