Topic: Code generation bug with /64 in 64-bit

mecej4

Posts: 1911

26 Aug 2019 4:27 (Edited: 26 Aug 2019 11:33) #24227

A mysterious bug appeared in a large production Fortran code (12 K lines of code). Compiled with /64 together with /check or /debug, the program ran fine. Compiled with /64 alone, it also ran fine. Compiled with /opt /64, the program crashed with an access violation.

Fortunately, the pop-up shown after the crash included the crash address. Rerunning the FTN95-compiled EXE under the Visual Studio debugger revealed that improper CMP instructions had been generated.

Generally, the bug occurs in these circumstances: an expression involving 4-byte integers is (1) evaluated, (2) compared to zero, and (3) the result of the comparison decides whether another expression is evaluated and assigned to a variable. However, I am not yet sure that the bug occurs only when /opt is used. We would have to examine how the compiler treats other large codes to settle that question.

For example, for the source line

if (m1 - incj2 .gt. 0 .and. m1-incj2 .le. nxyz) then

the generated instructions were

  000000000040182F: 4D 8B E9              mov         r13,r9                                   ; R9 contains M1 already
  0000000000401832: 44 2B AD CC 50 00 00  sub         r13d,dword ptr [rbp+00000000000050CCh]   ; subtract INCJ2
  0000000000401839: 49 81 FD 00 00 00 00  cmp         r13,0                                    ; is M1 - INCJ2 > 0 ? (should have been r13d)

At this point, R13 contained the value 0000 0000 FFFF FFFC, because a SUB with a 32-bit destination register clears the upper 32 bits of the full 64-bit register. Read as a 64-bit signed integer, that value is positive (4294967292). However, R13D, which should have been used in the CMP instruction (or sign-extended into R13 with MOVSXD first), contains FFFF FFFC, which is the negative 32-bit signed integer -4. The comparison therefore succeeds when it should fail, and the negative integer is then used as an index into an array, which is likely to result in an access violation and abort. Or, worse, the program may output incorrect results that are not obviously wrong.
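
The two readings of the bit pattern FFFF FFFC can be reproduced in plain Fortran. The following is only a sketch and not part of the reproducer: the names width_demo, zero_extend and sign_extend are made up for illustration, and it assumes the compiler offers an 8-byte integer kind through selected_int_kind(18).

```fortran
      module width_demo
      implicit none
      integer, parameter :: i4 = selected_int_kind(9)    ! 4-byte integers
      integer, parameter :: i8 = selected_int_kind(18)   ! 8-byte integers (assumed available)
      contains
         ! The 32-bit pattern with 32 zero bits above it, i.e. what
         ! "cmp r13,0" sees after a 32-bit SUB has cleared the upper half.
         function zero_extend(d) result(w)
         integer(i4), intent(in) :: d
         integer(i8) :: w
         w = int(d, i8)
         if (w < 0_i8) w = w + 4294967296_i8   ! add 2**32
         end function zero_extend

         ! The value a correct widening (sign extension) would produce.
         function sign_extend(d) result(w)
         integer(i4), intent(in) :: d
         integer(i8) :: w
         w = int(d, i8)
         end function sign_extend
      end module width_demo

      program show_widths
      use width_demo
      implicit none
      integer(i4) :: d
      d = -4_i4                               ! bit pattern FFFF FFFC, as in R13D
      print *, '32-bit value     :', d,              d > 0                ! test is false
      print *, 'zero-extended    :', zero_extend(d), zero_extend(d) > 0   ! 4294967292, test is true
      print *, 'sign-extended    :', sign_extend(d), sign_extend(d) > 0   ! -4, test is false
      end program show_widths
```

Only the zero-extended reading passes the "greater than zero" test, which is the behaviour seen in the faulty code.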

I was able to create a reproducer (see below). The results from compiling and running with /64 /opt:

           i1  m1-incj2     xxn(i1)

  1         2        -1  4.0000E+00
  2         3         4  6.0000E+00
  3         2         2  4.0000E+00
  4         3         7 -4.3000E+01

The '-1' in the first line of results is wrong. In fact, that whole line should be absent, as running with /64 (i.e., without /opt) shows:

           i1  m1-incj2     xxn(i1)

  1         3         4  6.0000E+00
  2         2         2  4.0000E+00
  3         3         7 -4.3000E+01

The bug does not occur if, instead of

if (m1 - incj2 .gt. 0 .and. m1-incj2 .le. nxyz) then

we have

if (m1 .gt. incj2 .and. m1-incj2 .le. nxyz) then
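
As a sketch of how the rewritten test might be packaged, the condition can be placed in a small helper. The names range_check and diff_in_range are hypothetical, and I have not verified that the function form avoids the bug under /64 /opt; only the inline rewrite above was observed to do so. Note also that m1 .gt. incj2 is equivalent to m1 - incj2 .gt. 0 only as long as the subtraction does not overflow, which holds for the values in the reproducer.

```fortran
      module range_check
      implicit none
      contains
         ! .true. when 1 <= m1 - incj2 <= nxyz, written so that the
         ! difference is never compared with zero directly.
         logical function diff_in_range(m1, incj2, nxyz)
         integer, intent(in) :: m1, incj2, nxyz
         diff_in_range = m1 .gt. incj2 .and. m1 - incj2 .le. nxyz
         end function diff_in_range
      end module range_check

      program try_range_check
      use range_check
      implicit none
      print *, diff_in_range(19, 20, 7)   ! difference -1: false
      print *, diff_in_range(24, 20, 7)   ! difference  4: true
      print *, diff_in_range(41, 20, 7)   ! difference 21: false
      end program try_range_check
```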

Because of the page size limit of this forum, I have posted the source code of the program in the next post in this thread.

mecej4

Posts: 1911

26 Aug 2019 4:29 (Edited: 26 Aug 2019 8:33) #24228

Here is the source code of the reproducer:

      program tst
      implicit none
      real xx(7),xxn(7),va(7,7)
      xx = 71.
      xxn = 7.
      va  = 49.
      call sor2l(xx,xxn,va)
      end program

      subroutine sor2l(xx, xxn, va)
      implicit none
      integer :: l2x, i1, i2, ix2m = 7, m1, m2
      integer :: nx1 = 3, nx2 = 4, incj1 = 5, incj2 = 20
      integer :: ii1 = 2, ii3 = 7, nx3 = 8, i3
      real :: xxn(7), va(7, 7), xx(7)
      integer :: nx = 11, nxy = 3, nxyz = 7
      integer :: ilog(5),i1log(5)
      real :: xlog(5)

      xx(1:nxyz-1) = 0.
      xx(nxyz) = 1.
      l2x = 0
      print *
      print *,'          i1  m1-incj2     xxn(i1)'
      print *
      do i3 = ii3, nx3
         do i2 = 1, nx2 - 1, 2
            m1 = ((i3 - 1)*nxy + (i2 - 1)*nx + 1) - incj1
            m2 = m1 + incj2
            do i1 = ii1, nx1
               m1      = m1 + incj1
               m2      = m2 + incj1
               xxn(i1) = 2.0*i1
!
! The next IF (condition) THEN statement causes an incorrect CMP instruction
! to be generated with /64 /opt. All the conditional expressions are
! 4-byte integers, so only the lower half of a 64-bit x64 register
! should be used in CMP instructions.
!
! Inserting PRINT statements to report the values will not work.
! Store into memory instead of printing, to let optimiser be uninhibited.
! Print logged values after all loops are done
!
               if (m1 - incj2 .gt. 0 .and. m1-incj2 .le. nxyz) then
                  xxn(i1) = xxn(i1) - va(ix2m, m1-incj2)*xx(m1-incj2)
                  l2x=l2x+1
                  i1log(l2x) = i1
                  ilog(l2x) = m1-incj2
                  xlog(l2x) = xxn(i1)
               end if
            end do
         end do
      end do
      print '(1x,i2,2i10,ES12.4)', (i1,i1log(i1),ilog(i1),xlog(i1), i1=1,l2x)
      end subroutine sor2l
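
To cross-check which iterations should pass the IF test, the index arithmetic of sor2l can be replayed on its own, without any array accesses. This is a sketch under the assumption that the loop bounds and increments are exactly those initialised in sor2l; the names replay_mod, replay and show_replay are made up for illustration.

```fortran
      module replay_mod
      implicit none
      integer, parameter :: nx1 = 3, nx2 = 4, incj1 = 5, incj2 = 20
      integer, parameter :: ii1 = 2, ii3 = 7, nx3 = 8
      integer, parameter :: nx = 11, nxy = 3, nxyz = 7
      contains
         ! Collect, in loop order, every value of m1-incj2 that the
         ! IF statement in sor2l tests.
         subroutine replay(d, n)
         integer, intent(out) :: d(:), n
         integer :: i1, i2, i3, m1
         n = 0
         do i3 = ii3, nx3
            do i2 = 1, nx2 - 1, 2
               m1 = ((i3 - 1)*nxy + (i2 - 1)*nx + 1) - incj1
               do i1 = ii1, nx1
                  m1 = m1 + incj1
                  n = n + 1
                  d(n) = m1 - incj2
               end do
            end do
         end do
         end subroutine replay
      end module replay_mod

      program show_replay
      use replay_mod
      implicit none
      integer :: d(8), n, k
      logical :: ok
      call replay(d, n)
      do k = 1, n
         ok = d(k) .ge. 1 .and. d(k) .le. nxyz   ! should the IF body run?
         print '(1x,i10,l4)', d(k), ok
      end do
      end program show_replay
```

The eight tested values are -1, 4, 21, 26, 2, 7, 24 and 29. Only 4, 2 and 7 lie in 1..nxyz, matching the three rows printed with /64 alone; -1 is the value that slips through with /64 /opt.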

PaulLaidler

Posts: 7974 Salford, UK

26 Aug 2019 7:00 #24230

mecej4

Many thanks for the feedback and extensive analysis.

The current developers' version runs this code successfully, but I will make a note that this needs to be checked out.

PaulLaidler

Posts: 7974 Salford, UK

4 Nov 2019 3:47 #24628

This bug exists in the current release and has now been fixed for the next release of FTN95.