Topic: Bad 32-bit code generated for simple expression in Support

mecej4

Posts: 1912

9 Aug 2019 6:01 (Edited: 15 Oct 2019 2:28) #24178

For the following program, FTN95 V 8.51 generates bad X86 machine code when the options /opt /p6 are used.

      program terfdif
      print *,erfdif(0.6, 0.4)
      end program

      function erfdif(x1,x2)
      erfdif = erf(x1) - erf(x2)
      end

The address of the actual argument X2 is loaded into EBX when the function ERFDIF is entered. Similarly, the address of X1 is loaded into ECX. Then, DERF is called with argument X2. ECX is a caller-saved register under the usual 32-bit X86 calling conventions, so DERF is free to overwrite it, and it does. However, the generated code assumes that ECX still contains the address of X1, and dereferences it (the FLD [ECX] at offset 00000027) when preparing the argument for DERF(X1). When I ran the program, at this point ECX contained 00000003, and the program crashed with an access violation. The /EXP listing of the function follows.

      00000000(49/1/49)          push      ebp
      00000001(50/1/49)          mov       ebp,esp
      00000003(51/1/49)          push      ebx
      00000004(52/1/49)          push      esi
      00000005(53/1/49)          push      edi
      00000006(54/1/49)          push      eax
      00000007(55/1/49)          sub       esp,=16           ; Adjusted later if temporaries allocated
      0000000d(56/1/49)          mov       ebx,address of X2
                                                             ; Register ebx contains X2
      00000010(58/1/49)          mov       ecx,address of X1
                                                             ; Register ecx contains X1
   0003         end program                                                                      AT 13
   0004                                                                                          AT 13
   0005         function erfdif(x1,x2)                                                           AT 13
   0006         erfdif = erf(x1) - erf(x2)                                                       AT 13
      00000013(60/4/44)          fld       [ebx]
      00000015(61/4/44)          dfstp     Temp@11
      00000018(62/4/44)          lea       esi,Temp@11
      0000001b(63/4/44)          push      esi
      0000001c(64/4/44)          call      DERF@     ; function call, ECX is destroyed
      00000021(65/4/44)          add       esp,=4
      00000024(66/4/44)          fstp      Temp@10
      00000027(67/4/36)          fld       [ecx]     ; BUG HERE, ECX contains rubbish
      00000029(68/4/36)          dfstp     Temp@13
      0000002c(69/4/36)          lea       eax,Temp@13
      0000002f(70/4/36)          push      eax
      00000030(71/4/36)          call      DERF@
      00000035(72/4/36)          add       esp,=4
      00000038(73/4/36)          fstp      Temp@12
      0000003b(74/3/48)          fld       Temp@12
      0000003e(75/3/48)          fsub      Temp@10
      00000041(77/1/49)       Label     __N2
      00000041(78/1/49)          lea       esp,[ebp-12]
      00000044(79/1/49)          pop       edi
      00000045(80/1/49)          pop       esi
      00000046(81/1/49)          pop       ebx
      00000047(82/1/49)          pop       ebp
      00000048(83/1/49)          ret

PaulLaidler

Posts: 7975 Salford, UK

10 Aug 2019 5:14 #24179

Thank you for the bug report. Does the program fail or give incorrect results? It runs OK for me.

PaulLaidler

Posts: 7975 Salford, UK

10 Aug 2019 5:26 #24180

This is a bit strange. My explist is different but at the same time I don't recall that any changes have been made in this respect.

Perhaps we should wait till the next release to see if it is still a problem at your end.

mecej4

Posts: 1912

10 Aug 2019 5:27 #24181

It crashed with an access violation, since it tried to read memory at absolute address 00000003. The crash is at ERFDIF + 00000027.

I compiled with /opt /p6. Without /opt, the bug does not occur.

I posted only the portion of the assembly code listing encompassing the subroutine ERFDIF. The leading part of the /EXP listing is given below. I am curious to see the listing that you generated.

Silverfrost FTN95/Win32 Ver 8.51.0  erfd.F90  Sat Aug 10 04:12:09 2019

Compiler used    [c:\lang\FTN95V851\ftn95.exe]
Salflibc path    [c:\lang\FTN95V851\salflibc.dll]
Salflibc version [21.5.28.8]
Compiler options in effect:
    EXPLIST;IGNORE;OPTIMISE;P6;

   0001         program terfdif                                                                  AT 0
 ; Start of SUBROUTINE TERFDIF

      00000000(3/1/1)            push      ebp
      00000001(4/1/1)            mov       ebp,esp
      00000003(5/1/1)            push      ebx
      00000004(6/1/1)            push      esi
      00000005(7/1/1)            push      edi
      00000006(8/1/1)            push      eax
      00000007(9/1/1)            lea       ecx,2
      0000000d(10/1/1)           push      ecx
      0000000e(11/1/1)           lea       esi,[ebp+8]       ; Get command line arguments
      00000011(12/1/1)           push      esi
      00000012(13/1/1)           call      __FTN95INIT1_
      00000017(14/1/1)           add       esp,=8
      0000001a(15/1/1)           sub       esp,=16           ; Adjusted later if temporaries allocated
   0002         print *,erfdif(0.6, 0.4)                                                         AT 20
      00000020(16/4/5)           lea       eax,2
      00000026(17/3/2)           push      eax
      00000027(18/4/3)           lea       ecx,-32234_2
      0000002d(19/3/2)           push      ecx
      0000002e(20/3/2)           push      eax
      0000002f(21/3/2)           mov       Temp@1,eax
      00000032(22/3/2)           mov       Temp@2,ecx
      00000035(23/3/2)           call      WSF1@@
      0000003a(24/3/2)           add       esp,=12
      0000003d(25/4/22)          lea       edi,1
      00000043(26/3/16)          push      edi
      00000044(27/5/12)          push      ebx               ; For eight-byte alignment
      00000045(28/6/11)          lea       eax,0.4
      0000004b(29/5/12)          push      eax
      0000004c(30/6/10)          lea       ecx,0.6
      00000052(31/5/12)          push      ecx
      00000053(32/5/12)          mov       Temp@4,eax
      00000056(33/5/12)          mov       Temp@5,ecx
      00000059(34/5/12)          mov       Temp@3,edi
      0000005c(35/5/12)          call      ERFDIF
      00000061(36/5/12)          add       esp,=12
      00000064(37/4/20)          fstp      Temp@6
      00000067(39/4/20)          lea       edi,Temp@6
      0000006a(40/3/16)          push      edi
      0000006b(41/3/16)          mov       Temp@7,edi
      0000006e(42/3/16)          call      R4@WSF
      00000073(43/3/16)          add       esp,=8
      00000076(44/2/24)          call      WSF2@
      0000007b(45/1/1)        Label     __N2
      0000007b(46/1/1)           call      EXIT1@
      00000080(47/1/1)           nop       
                                                             ; Start of FUNCTION ERFDIF

PaulLaidler

Posts: 7975 Salford, UK

10 Aug 2019 12:49 #24182

This is what I get with the current developers' version...

      00000000(49/1/26)          push      ebp
      00000001(50/1/26)          mov       ebp,esp
      00000003(51/1/26)          push      ebx
      00000004(52/1/26)          push      esi
      00000005(53/1/26)          push      edi
      00000006(54/1/26)          push      eax
      00000007(55/1/26)          sub       esp,=16           ; Adjusted later if temporaries allocated
      0000000d(56/4/44)          mov       eax,address of X2
      00000010(57/4/44)          fld       [eax]
      00000012(58/4/44)          dfstp     Temp@9
      00000015(59/4/44)          lea       edi,Temp@9
      00000018(60/4/44)          push      edi
      00000019(61/4/44)          call      DERF@
      0000001e(62/4/44)          add       esp,=4
      00000021(64/4/36)          mov       eax,address of X1
      00000024(63/4/44)          fstp      Temp@8
      00000027(65/4/36)          fld       [eax]
      00000029(66/4/36)          dfstp     Temp@11
      0000002c(67/4/36)          lea       edi,Temp@11
      0000002f(68/4/36)          push      edi
      00000030(69/4/36)          call      DERF@
      00000035(70/4/36)          add       esp,=4
      00000038(71/4/36)          fstp      Temp@10
      0000003b(72/3/48)          fld       Temp@10
      0000003e(73/3/48)          fsub      Temp@8
      00000041(75/1/26)       Label     __N2
      00000041(76/1/26)          lea       esp,[ebp-12]
      00000044(77/1/26)          pop       edi
      00000045(78/1/26)          pop       esi
      00000046(79/1/26)          pop       ebx
      00000047(80/1/26)          pop       ebp
      00000048(81/1/26)          ret

mecej4

Posts: 1912

10 Aug 2019 1:31 #24183

Thanks; that listing does not exhibit the bug with the unsaved register being used across the function call.

I'll wait for the next version of the compiler.

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

10 Aug 2019 3:54 #24184

In the meantime, if you are desperate for the results of your program (even though you said that you'd wait), pre-calculating the two erf function results and then doing the subtraction in another statement does work. (As, I suspect, you knew already).

It did make me think about manipulating the error function, but a quick read of the Wikipedia page reminded me that I had better things to do with my time, like mowing the lawn!

Eddie

mecej4

Posts: 1912

10 Aug 2019 4:37 #24185

Eddie, I am not desperate at all, I am ready for Brexit or no Brexit. I have other compilers to use for such situations.

In the actual code where I noticed the problem, the Polyhedron AerMod benchmark, the error function is evaluated thousands of times, with arguments that are known only at run time, and covering the range 0 to very large values.

Had the code simply run, and produced incorrect results, I would not have noticed anything. However, the code (over 50,000 lines) actually crashed, and investigation led me to the tiny reproducer that I reported.

Erf, Erfc and Erfc_scaled are standard intrinsic functions in F2008.
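
As a minimal sketch of what the three F2008 intrinsics compute (the program name and the test value are made up for illustration, not taken from AERMOD, and a compiler that supports all three intrinsics is assumed):

      program terfs
      x = 0.5
!     error function
      print *, erf(x)
!     complementary error function, 1 - erf(x)
      print *, erfc(x)
!     scaled complement, exp(x**2)*erfc(x)
      print *, erfc_scaled(x)
      end program

The scaled form is intended for large arguments, where ERFC itself becomes too small to represent.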

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

10 Aug 2019 5:02 #24186

Well if they are intrinsic in F2008, you are jolly lucky to find them in FTN95, then! (And anyway, is it a bit hopeful to expect the error function not to yield errors?)

Incidentally, what did people who wanted the erf do originally? Would it be the same if you used a user-written erf, or an erf function from a third-party library? Is it the same with two intrinsic functions of any sort, or just erf?

As an answer to my own question, AERMOD seems to use one of the series functions that one finds on the Wikipedia page, and not only that, the tactic used looks like precalculation of the results.

More seriously, do you genuinely get much benefit from /opt or /p6 anyway? (genuine enquiry there). I got put off /opt when it caused crashes, but that was years ago.

And as for other compilers, they may be brilliant at all sorts of things, but only FTN95 has Clearwin+ ...

Eddie

PS. There's much more chance of Brexit happening than of there being no bugs in any software.

mecej4

Posts: 1912

10 Aug 2019 5:43 (Edited: 11 Aug 2019 12:19) #24187

The standard reference for transcendental functions is Abramowitz and Stegun, see http://people.math.sfu.ca/~cbm/aands/abramowitz_and_stegun.pdf ; see Chap. 7 for ERF. Netlib is the source for Fortran code (often decades old, though) for such functions.

FTN95's /opt gives some improvement in speed, but not as much as with Gfortran or Intel. In the assembler listings given above you can see many redundant loads and stores. However, as long as an option is provided and is likely to be used by a number of users, use of the option ought not to produce errors.

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

11 Aug 2019 8:42 #24188

Quoted from mecej4

However, as long as an option is provided and is likely to be used by a number of users, use of the option ought not to produce errors.

The above is a point that I have made on several occasions, whether or not the bug affects me personally. However, there is a caveat. A bug won't affect FTN95's usability if it is clearly documented, the user reads and understands the documentation, and there is a workaround. For example, if you know that:

      Y = ERF(X1) - ERF(X2)

creates errors under /OPT /P6, and instead you have to write

      A = ERF(X1)
      B = ERF(X2)
      Y = A - B

then maybe the program will run marginally more slowly, but the time taken for the programmer / user to get results will be a lot shorter. I know which is more important to me. (It also means that you don't need the wtf function).

Also, having tried the same thing using sin instead of erf, it's obvious that what is wrong lies in the way ftn95 handles erf, not just any old function - which is useful to know, especially if (like me) you have very little use for erf.
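
A minimal sketch of such a comparison, assuming the reproducer from the first post with SIN swapped in for ERF (the names TSINDIF and SINDIF are made up here):

      program tsindif
      print *,sindif(0.6, 0.4)
      end program

      function sindif(x1,x2)
      sindif = sin(x1) - sin(x2)
      end

If the fault is specific to ERF, this variant should print roughly 0.1752 with /opt /p6 as well as without.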

So, for the avoidance of doubt, while the best solution is, of course, for everything to run as it should, a close second is to know what doesn't work well, so that it can be avoided.

Eddie

PS. Thanks for the helpful reference.

PPS. No doubt there are people out there who think that transcendental functions are where you all sit round wearing white kaftans, smoke herbal mixture to a background of Sergeant Pepper, and chant ... seriously, what is 'spiritual' about erf? Odd language, English.

mecej4

Posts: 1912

12 Aug 2019 1:33 #24191

Sorry, the trick that you suggested (assigning values of sub-expressions to new variables and then combining the variables) is a risky solution. It works sometimes, fooling you into thinking that it is a reliable solution.

1. You try it out on a toy program, and it succeeds.
2. You try it on a slightly different toy program, and it fails.
3. You try it in a big program, where it changes the results. If you had skipped step 2, and you were not able to judge whether the results were correct, you would be tempted to accept erroneous results as correct.

Here is a counterexample.

      PROGRAM TFRG
      CALL FRGAUSS(0.5, 1E-1, 0.7, 0.6, FRACT)
      PRINT *,FRACT
      END PROGRAM

      SUBROUTINE FRGAUSS(HCNTR,SIGMA,H1,H2,FRACT)
      S = 1.4142*AMAX1(SIGMA,1E-5)
!
      X1 = (H1-HCNTR)/S
      X2 = (H2-HCNTR)/S
!
      Z1=ERF(X1)
      Z2=ERF(X2)
      FRACT = 0.5*ABS(Z1-Z2)
!
      CONTINUE
      END

With /opt, the output is 0.00000. The correct output, without /opt, is 0.135904 (both with FTN95 8.51).

Small changes to the syntax of the program, without changing the semantics, may make the bug go into hiding.

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

12 Aug 2019 10:16 #24193

Mecej4,

As you have shown that you like precise language, I refer you back to my previous post. To have a workaround, you need to know (a) that one is required, and (b) what does actually work. That, I’m afraid, is the job of documentation in the absence of a bug fix.

Regarding ‘toy programs’ you should note the very large number of complainants that report that their problems occur in ‘large’ programs, yet Paul inevitably responds with a request for a manageably small reproducer. I think that your example should send shivers down Paul’s spine when he realises the significance of your new example, that is that fixing the problem in such a small reproducer may not be the whole answer, as there is some deeper malaise. Therein, for me at least, is a most valuable point of your post.

The central point of my post was not that such a procedure was a solution, but if you knew etc.

As for whether or not you accept the results of a program as correct, then perhaps one should always be sceptical. With appropriate experience, one can detect nonsensical results, even if one cannot determine how and why they have been produced, or how large the error is. Unfortunately, scepticism is equated to 'denial of science' in some fields nowadays.

Just out of interest, I tried your new code using the user-supplied ERFX function from AERMOD, and found it worked. I then encapsulated the ERF function inside a user-supplied function, as follows:

      PROGRAM TFRG 
      CALL FRGAUSS (0.5, 1E-1, 0.7, 0.6, FRACT) 
      PRINT *,FRACT 
      END PROGRAM 

      SUBROUTINE FRGAUSS (HCNTR, SIGMA, H1, H2, FRACT) 
      S = 1.4142*AMAX1(SIGMA,1E-5) 
! 
      X1 = (H1-HCNTR)/S 
      X2 = (H2-HCNTR)/S 
! 
      Z1 = ERFX(X1) 
      Z2 = ERFX(X2) 
      FRACT = 0.5*ABS(Z1-Z2) 
! 
      END

      FUNCTION ERFX (ARG)
      ERFX = ERF(ARG)
      END

Once again the program runs to the answer you supplied. Does this mean, Paul, that something that FTN95 has remembered should be forgotten, or vice versa? Does the issue affect the Bessel functions too, implemented at the same time?

Eddie

mecej4

Posts: 1912

12 Aug 2019 1:16 #24194

Eddie, you make some excellent points in this post.

Optimizer bugs are elusive, hard to preserve while cutting away chunks of source code (in order to prepare a reproducer that is small enough to avoid the bug report being put into a 'to do on a day when there is nothing else to do' list) and -- worst of all -- fixing the compiler to make it work properly on the reproducer does not guarantee that the fix will also work on the original application code.

You may find the event described in the following report interesting:

 < http://www.envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/ >

The authors used a software formal verification tool to discover a flaw in the standard sort routine in the Java runtime, and proposed a fix. The Java developers acted upon the report, but implemented their own ad-hoc fix.

Currently sold compilers get updated at least once a year. Workarounds in users' source code, on the other hand, may stay in place far longer than the duration in which they had a purpose to serve. In fact, the Polyhedron Aermod source code -- in exactly the ERFDIF function that we have been discussing -- contains comments portraying some lines of code as workarounds for the 'flakey Lahey compiler'. There were many versions of the Lahey compiler that came after that workaround was added, and those versions did not need the workaround. Yet, the code changes have existed for three decades.

mecej4

Posts: 1912

13 Aug 2019 3:04 #24195

Lack of robustness is not just the compiler's fault. It can be caused by, for example:

1. Assuming that local variables are saved in some subroutines
2. Aliased actual arguments
3. Calling a subroutine inside a DO loop with the DO index variable as an argument that gets changed in the subroutine
4. Improper usage of mixed precision expressions ...

Any combination of all the preceding causes.
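
The first cause listed above can be sketched as follows. The routine TALLY and its arguments are hypothetical, introduced only for this illustration; the point is that a local variable is guaranteed to keep its value between calls only if it has the SAVE attribute.

      PROGRAM TSAVE
      CALL TALLY(.TRUE., N)
      CALL TALLY(.FALSE., N)
      CALL TALLY(.FALSE., N)
      PRINT *, N
      END PROGRAM

      SUBROUTINE TALLY(RESET, TOTAL)
      LOGICAL RESET
      INTEGER TOTAL, RUNSUM
!     Without this SAVE, RUNSUM is not guaranteed to keep its value
!     from one call to the next
      SAVE RUNSUM
      IF (RESET) RUNSUM = 0
      RUNSUM = RUNSUM + 1
      TOTAL = RUNSUM
      END

With the SAVE statement the program prints 3. Without it, the code may still appear to work with one compiler or one set of options and fail with another, which is what makes this kind of error look like a compiler problem.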

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

13 Aug 2019 9:42 #24196

@John,

I think it's the /opt not the /p6 wot dun it; but reading up on what the P6 does, doesn't give me a lot of confidence. Core!

@Mecej4,

Just so we are on the same page, 1. seems to me to be a cause of the lack of robustness as defined by John, but 3. surely brings the code down every time, as I would expect 4. to also do. Improper integer expressions alternate 'coming right' with 'going wrong', and that should be obvious.

As for No. 2, why does that cause problems? Just asking.

Eddie

mecej4

Posts: 1912

13 Aug 2019 2:05 #24197

Quoted from LitusSaxonicum

As for No. 2, why does that cause problems? Just asking.

Here is an example. Before you compile and run it, think out the answer, and compare that with what the program actually gives. Note that the subroutine seemingly does not touch the second argument at all.

program copy
  dimension j(10)
  j(1:10) = 1
  call subr(j, j(1:9:2))
  print '(10I3)', j
end

subroutine subr(j, k)
   dimension j(10), k(5)
   j(1:10) = 2
end

In real code with aliasing bugs, the offending aliased variables may have been declared and defined in various places up long call chains, and may have different names (because of Equivalence), but may overlap in memory in some way.

This code is non-conforming, but detection of that fact can be very hard.
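
One way to make the call conforming, sketched here on the assumption that the subroutine needs only the values of the section and not the section itself, is to pass an explicit copy. The names COPY2 and KTMP are introduced only for this illustration.

program copy2
  dimension j(10), ktmp(5)
  j(1:10) = 1
  ktmp = j(1:9:2)   ! explicit copy, so the two actual arguments no longer overlap
  call subr(j, ktmp)
  print '(10I3)', j
end

subroutine subr(j, k)
   dimension j(10), k(5)
   j(1:10) = 2
end

With the copy, every element of J prints as 2. In the aliased version, a compiler that passes the array section through a temporary and copies it back on return can restore the odd-numbered elements to 1, even though the subroutine never assigns to K.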

LitusSaxonicum

Posts: 2284 Yateley, Hants, UK

13 Aug 2019 2:56 #24198

Hi Mecej4,

Thanks for the explanation and the example.

Good, I'm safe from that one. I simply haven't got a clue what 'j(1:9:2)' is or does*, as I'm no enthusiast for some code constructs.

Indeed, having the same variable twice in an argument list seems to me to be an incantation to summon up Dannian Devilry. Just call me old-fashioned.

Perhaps the ability to spell and the adoption of certain restrictive rules on 'clever' constructs may be keeping some of us safe in ways we didn't realise.

But by the time I'd converted the example to Fortran that I can read and understand at a glance, the problem had gone away (and before you ask, fixed format, in capitals, removing parameters that don't do anything and constructs I never use, WRITE instead of print, a numbered FORMAT statement - and thank heavens you used implicit type so I was OK there!)

Eddie

*I worked it out for myself, after pondering briefly whether it meant 2,4,6 ... or 1,3,5... The fact that you can do certain things doesn't mean that you should, much less must.

PaulLaidler

Posts: 7975 Salford, UK

12 Oct 2019 8:35 #24508

I have discovered why this failure was not showing up on my machine. It is a bug and I have made a note that it needs to be fixed.

mecej4

Posts: 1912

12 Oct 2019 9:34 #24509

It cannot be a case of using a different SALFLIBC.DLL, so I am curious to learn about the explanation.