Silverfrost Forums

FTN95 run time with allocatable arrays

16 Feb 2015 11:12 #15656

It will be interesting to see if there are any improvements in these FTN95 run times with 64-bit FTN95. If not, then we may have to revisit this issue.

16 Feb 2015 11:23 #15657

Quoted from PaulLaidler It will be interesting to see if there are any improvements in these FTN95 run times with 64-bit FTN95. If not, then we may have to revisit this issue.

I look forward to revisiting this thread with results from FTN95-X64 when it becomes available. Since MMX/X87 instructions are disallowed in Windows-64 kernel mode and discouraged in user mode, I anticipate that FTN95-X64 will only issue SSE2 or newer FPU code, and that users such as John Campbell will see great performance improvements with FTN95-X64.

16 Feb 2015 11:45 #15658

I should point out that the problem I am presenting shows dramatically varying run times for FTN95, depending on the number of zero values in the matrix, even though the number of operations does not vary with the number of zeros. The run times range between 30 seconds and 6,000 seconds. It is the latter performance, with delays of up to 6,000 seconds, that I have been trying to understand. Delays of this magnitude cannot be explained by differences in the efficiency of the assembler code generated by different compilers.

The two functions I call are basically a dot product, acum = X . Y, and a vector subtraction, Y = Y - const * X.
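
In straightforward Fortran, the two kernels amount to something like this (an illustrative sketch; the names and interfaces here are mine, not the actual routines):

    ! Illustrative sketch of the two kernels under discussion.
    function vec_dot (x, y, n) result (acum)
      integer, intent(in) :: n
      real*8,  intent(in) :: x(n), y(n)
      real*8 :: acum
      integer :: i
      acum = 0.0d0
      do i = 1, n
        acum = acum + x(i)*y(i)       ! DDOT-style accumulation
      end do
    end function vec_dot

    subroutine vec_sub (y, x, const, n)
      integer, intent(in)    :: n
      real*8,  intent(in)    :: x(n), const
      real*8,  intent(inout) :: y(n)
      integer :: i
      do i = 1, n
        y(i) = y(i) - const*x(i)      ! DAXPY-style update
      end do
    end subroutine vec_sub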

The vector subtraction calculation typically exhibits worse delays than dot_product. It is interesting that DavidB's FTN95 code, based on SSE instructions, does not exhibit these delays; nor does the original code compiled with other compilers (Intel ifort and Lahey F95).

It would be good if an explanation of this delay could be found.

John

16 Feb 2015 12:02 #15659

Mecej4,

It is interesting to see your ifort tests:

 IFort | MKL         | 21
 IFort | DavidB SSE2 | 18

Did the IFort | MKL case use AVX instructions? I have found it difficult to identify improved AVX performance when using large vectors.

John

16 Feb 2015 12:22 #15660

Yes, I used /QxAVX to compile using IFort. Here, for example, is a section of assembly code from AVX_LIB.f90:

        vxorpd    xmm1, xmm1, xmm1                              ;29.6
        vxorps    xmm2, xmm2, xmm2                              ;29.6
        vmovsd    xmm0, xmm1, xmm0                              ;29.6
        mov       edi, DWORD PTR [8+ebx]                        ;29.6
        vinsertf128 ymm1, ymm0, xmm2, 1                         ;29.6
        vxorpd    ymm0, ymm0, ymm0                              ;29.6
        vmovapd   ymm3, ymm0                                    ;29.6
        vmovapd   ymm2, ymm0                                    ;29.6

These are clearly AVX instructions -- note the ones with three operands, and the use of the YMM registers. In my limited experience, there has been no significant and consistent improvement arising from merely using the /QxAVX flag. Now, if DavidB could modify his SSE2 inline assembly code to AVX inline... But, then, the problem remains that the FTN95 inline assembler would not be able to encode those instructions! We have to wait and see.

In order to analyse further the issue that you bring up, regarding the effect of doing arithmetic on many zero floating point values, we would need a test code that does nothing more than fill large arrays and do a couple of DAXPYs and a couple of DDOTs. Then we would be looking at a single branch of one tree in a large forest, and an instruction-level profiler such as Intel's VTune could be applied. Can you build such a test code?
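
Something along these lines might serve (a hypothetical sketch; the array size, fill values and repetition count are arbitrary choices of mine):

    ! Hypothetical minimal probe: fill two large arrays so that many
    ! products are denormal, then time repeated DAXPY and DDOT passes.
    program ufl_probe
      integer, parameter :: n = 1000000
      real*8  :: x(n), y(n), a, s, t0, t1
      integer :: i, rep
      a = 1.0d-300                     ! tiny multiplier provokes underflow
      do i = 1, n
        x(i) = 1.0d-3/dble(i)          ! a*x(i) is denormal for large i
        y(i) = 1.0d0/dble(i)
      end do
      call cpu_time (t0)
      do rep = 1, 100
        do i = 1, n                    ! DAXPY pass
          y(i) = y(i) - a*x(i)
        end do
        s = 0.0d0
        do i = 1, n                    ! DDOT pass
          s = s + x(i)*y(i)
        end do
      end do
      call cpu_time (t1)
      print *, 'time =', t1 - t0, '  s =', s
    end program ufl_probe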

Quoted from JohnCampbell The vector subtraction calculation typically exhibits worse delays than dot_product. It is interesting that DavidB's FTN95 code, based on SSE instructions, does not exhibit these delays, nor does the original code compiled with other compilers (Intel ifort and Lahey F95). It would be good if an explanation of this delay could be found.

My current surmise is that all the cases which exhibit little delay involve SSE2 code. IFort has an extensive list of processor selection options, with SSE2 being the default, and Lahey F95 has the /KPENTIUM4,SSE2 option, which you may have used either directly or through a configuration file. To reinforce this, look again at the CVF results with and without the SSE2 BLAS. CVF was the speed king in its day, but without SSE2 it produces code that is slow by today's standards and expectations. This behaviour of CVF is similar to that of FTN95 -- slow with X87 code, fast with the SSE2 BLAS. A similar conclusion applies to Lahey LF95 7.1: Pentium+X87 versus P4+SSE2.

16 Feb 2015 9:27 #15661

Mecej4,

Your indication that /QxAVX was used for that case is consistent with my finding that AVX instructions do not appear to help with large vectors, unless they are stored in the cache (I am not sure whether it has to be L1, L2 or L3, or even what these are!). Cache and alignment are AVX black magic, which should be better managed in the processor.

I would like to clarify the delay problem I am addressing.

I am not concerned with the cases where FTN95 performance is, say, 30-60 seconds (that can be addressed by the SSE routines), but I am concerned by the cases where the run times are 1,000 to 6,000 seconds. What is causing this delay? It looks like very long wait states, say up to 100 idle cycles per calculation cycle. How can this occur? If it is happening so often in the 6,000 second case, best identified by the number of zeros in the supplied vectors, then it could be happening less frequently in the 60 or 296 second cases, and also in a lot of other DDOT or DAXPY style DO loops in general code compiled by FTN95.

I have been trying to highlight this problem and get it identified. The problem with the size of this 'branch' is that it is not easy to isolate the cause. I have been identifying performance problems in FTN95 with a DAXPY-style routine since 1995, and this is the best branch I have found.

John

17 Feb 2015 12:47 #15662

John, I have found some clues that may help you. On my laptop, I build and run the program from the command line. I have found that, for a run with /l:n where n is not equal to 5 (i.e., a run in which only X87 FPU instructions are used), setting the environment variable SALFENVAR=MASK_UNDERFLOW prior to the run greatly speeded up your program as built with FTN95. If you can confirm the same on your hardware, there are a couple of things that you can try. Note that, if there are lots of underflow traps taken in parts of the code other than the VEC_XXX routines, setting the environment variable may help even when the SSE2 VEC_XXX routines are used.

  1. Examine your code to find potential trouble spots where underflow occurs and consider adding defensive coding to avoid underflow (see the sketch after this list). If you need more help, you can cause underflow traps to abort your program, locate the trouble spot from the traceback report, and fix the code; use the /UNDERFLOW compiler option for this.

  2. Either set the environment variable in the Windows control panel, if you think that is appropriate, or make a service call to MASK_UNDERFLOW@ (see the FTN95 documentation) to ignore underflows. With some recent versions of SALFLIBC.DLL, setting SALFENVAR in this way causes the program to crash with an X87 stack overflow in some WRITE statements. A workaround is to use another version of the DLL, or to change all your INTEGER*8 variables to INTEGER. It is probably not a good idea to use INTEGER*8 unless you are also targeting 64-bit.

  3. With SSE2 FPU code there are more options (this does not - yet - apply to FTN95, but it does to other compilers such as Gfortran): look at the -ftz (flush to zero) and -daz (denormals are zero) options.

CAUTION: It is not always safe to cause FPU underflows to be ignored. Check results from runs with underflow ignored to make sure that the accuracy is acceptable.
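
To illustrate the defensive coding suggested in item 1, here is a hypothetical guard for a DAXPY-style loop (the threshold logic assumes const is itself a normal number; the right fix depends on where the underflows actually occur):

    ! Hypothetical guard: skip updates whose product would fall below
    ! the smallest normal real*8.  Trades one compare per element for
    ! the avoided underflow interrupts.
    subroutine vec_sub_guarded (y, x, const, n)
      integer, intent(in)    :: n
      real*8,  intent(in)    :: x(n), const
      real*8,  intent(inout) :: y(n)
      real*8, parameter :: tiny_norm = 2.2250738585072014d-308
      real*8  :: cutoff
      integer :: i
      if (const == 0.0d0) return        ! nothing to subtract
      cutoff = tiny_norm/abs(const)     ! assumes const is a normal number
      do i = 1, n
        if (abs(x(i)) >= cutoff) y(i) = y(i) - const*x(i)
      end do
    end subroutine vec_sub_guarded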

17 Feb 2015 11:21 #15664

Mecej4,

Thanks for the advice. It does change the run performance. I tried setting the environment variable SALFENVAR=MASK_UNDERFLOW with FTN95 Ver 7.1 and got a number of floating point stack fault errors when writing real*8 values, initially using write (*,*). I replaced the * with a format statement number, but eventually this failed also. Prior to this failure, the run times were reduced substantially. There is definitely some sort of stack corruption.

I thought that a floating point underflow should produce zero rather than set an error state. Is there a setting in FTN95 for floating point underflow to become zero?

I also made the program more standard-conforming by removing all mixed-mode subroutine arguments. I then compiled the main routines with /check and the maths libraries with /p6 /opt. I did not get any /check error reports, just the 'floating point stack fault in IO_convert_long_double_to_ascii' when writing out a real*8 variable.

John

17 Feb 2015 12:24 (Edited: 18 Feb 2015 12:42) #15665

Fortran 200x requires underflow to be detected and reported by a compliant compiler. Most vendors give you ways of ignoring underflow, but these days you have to take some action (a compiler switch, or calling a service routine) to make that happen. Given your history with this program, I don't think that you would mind that little bit of work to avoid the billions of calls to the underflow ISR (Interrupt Service Routine) that were slowing the program down.

I am pretty sure that either you have some array overruns in your program or there is a bug in the current versions of SALFLIBC.DLL. I have three versions of that DLL: 15.3.15.0, which came with SIMDEM; 16.4.27.11, which came with FTN95 7.1; and 17.1.28.8, which is the latest version, released recently by Paul. Whichever version is used, a crash with a Floating Point Stack Fault occurs if the environment has SALFENVAR set to MASK_UNDERFLOW.

It appears difficult to present a small reproducer to Paul to request a fix for this problem in the DLL. If you can give me your newer source files (the ones without conflicting types of formal and actual arguments), I will see about creating a reproducer.

Here are two other errors that I found; they affect only the program's output reporting but, just as the conflicting argument types do, they make checked runs difficult. Both are in the file containing the main program. In subroutine DO_TESTS, mflop_xx needs to be initialized since, if use_method(1) is not true, subroutine Gaussean_Reduction_Old does not get called and mflop_xx would remain undefined. Likewise, in subroutine Check_D2, err_max needs to be initialized.

Also, please examine the nested DO loops in subroutine Calculate_Profile. The allocated size of eqn_band() is only NSZF, but in the loop the index could potentially go up to NSZF+NBAND-1. What do you think?

(b.t.w., 'Gaussian' is the common spelling; in Gauss's native German, the adjective is 'gaußsche', as in 'gaußsche Normalverteilung').

17 Feb 2015 1:01 #15667

For a description of how FTN95 handles underflows see ftn95.chm under Win32 platform->Exception handling->Underflows.

18 Feb 2015 7:21 #15668

Mecej4,

Thanks very much for identifying this as a possible source of the delay.

Paul, would this type of behaviour (delays of ~100 cycles per calculation) be consistent with the handling of an underflow error? It looks to be very long. I am surprised at the time it appears to take to handle these interrupts. I am not sure what part of this delay is due to the processor response/settings and what is under the control of FTN95.

If there is an Interrupt Service Routine for underflows, it would be useful to be able to count the number of occurrences, set the result to zero, and then continue. This would be an interim measure to identify this as the primary cause of the delay.

Also, is the 'floating point stack fault in IO_convert_long_double_to_ascii' error identifiable? It would be good if a patch were available.

At present I am reviewing the matrices being generated for the different zero-density settings, to see how many extremely small values are being generated in this test program. I shall post these results shortly. I will also need to compare this to the number generated in a real finite element matrix. It may be that the matrices I am generating in this test suite are not suitable for the test (poorly conditioned, as opposed to the 'symmetric positive definite well conditioned matrices' that are assumed in FE analysis).

The performance of my DAXPY-style routine has long been a problem with FTN95. It would be a great outcome if the underflow ISR response were found to be a significant cause of the delays and a faster response method were identified, although I am skipping a few steps with this suggestion.

John

ps: As an aside, I recently listened to a report on the radio about the problem of long-term data retention as storage and software technologies become obsolete. While this is apparent for the changes in tape/disk/usb/... storage, it is a bigger problem for retaining communications. When trying to look back on the reports I have made on this delay problem over the last 20 years, I realised I have lost most of the emails I sent. Not only are the files unavailable, but the email software packages that stored these emails and provided the interface are also gone. There remain next to no print-outs of past correspondence to review. Quite a shock when you think about what is now and what will be available, especially when the present smart information systems become obsolete. What goes into the clouds will get blown away.

18 Feb 2015 8:15 #15669

John

I don't have anything useful to add to this discussion at the moment, other than to note that there is a subroutine UNDERFLOW_COUNT@ that gives the count from start-up. See the help file for details. There is currently no routine to reset the count to zero, but it would not be difficult to provide one. The count is a 32-bit integer, so having a facility to reset it to zero may not be particularly useful.
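
In outline, bracketing a suspect section might look like this (a sketch only; whether UNDERFLOW_COUNT@ returns its count through an argument, as assumed here, or as a function result should be checked against the help file):

    ! Hedged sketch: the single INTEGER out-argument form of
    ! UNDERFLOW_COUNT@ is an assumption; check the help file.
    program count_ufl
      integer :: n_before, n_after
      call underflow_count@ (n_before)
      ! ... section of code under investigation ...
      call underflow_count@ (n_after)
      print *, 'underflows in this section:', n_after - n_before
    end program count_ufl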

18 Feb 2015 9:55 #15670

Here are some numbers to quantify the huge cost of handling underflows in John's program. In both runs, I called UNDERFLOW_COUNT@ and printed the returned count in the status displays made by the program (the lines that begin 'p at equation' and 'r at equation'). I also called UNDERFLOW_COUNT@ and printed the final count at program termination, just to cover the possibility that the 32-bit count of underflow exceptions had overflowed.

             C A S E                  UFL cnt (cumul)   Run duration
 profsal /l:5 /m:-1 /d:2 /e:1500            76650           1.7 s
 profsal /l:4 /m:-1 /d:2 /e:1500         39480441           306  s

Taking the difference, I worked out a cost per underflow exception of 7.8 us (microseconds). For the i5-4200U CPU in the laptop on which this test was run, that works out to about 12,000 CPU cycles per interrupt (assuming that the cost of calling UNDERFLOW_COUNT@ a few hundred times can be neglected, and that one DAXPY 'op' can cause at most one underflow). This figure for cycles per interrupt is far larger than John's estimates (given earlier in this thread and in private messages as a couple of hundred cycles) and far from reasonable for the work involved in doing two context switches and any bookkeeping in the ISR -- unreasonable not only from the CPU point of view, but also a huge problem for the end user, who may think of each element of a DAXPY operation (y := a x + y) as involving three memory accesses through cache, one FP multiplication and one FP addition. Whereas such an operation may consume some tens of cycles, an underflow consumes nearly 12,000; in other words, most of the program time is spent servicing interrupts.
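
For reference, the arithmetic behind those figures (assuming the i5-4200U's nominal 1.6 GHz base clock; the effective clock during the run is not known):

    (306 s - 1.7 s) / (39480441 - 76650 underflows)  ~=  7.7E-6 s per underflow
    7.7E-6 s  x  1.6E9 cycles/s                      ~=  12,300 cycles per underflow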

Why is the number of underflows so huge? John has already decided to probe into this problem-specific question. Here is a trimmed extract from the program output that may give some insight. Note that even though the operation count (John could perhaps describe precisely what that means) increased by a modest fifty percent, the underflow count increased about 170-fold.

 Eqn band envl      ops       ufls (cumul) 
 -----------------------------------
   1    1    1         0           0
   2    6  768      1552           0
   3    8  768      1946           0
   4   10  768      2342           0
  30   62  768    206440           0
  60  122  768    618440           0
  90  182  768   1077440           0
 120  242  768   1590440           0
 150  302  768   2157440           0
 180  312  768   2545740         182
 210  312  768   2552700         392
 240  312  768   2552700         602
 270  312  768   2552700         812
 300  312  768   2552700        1022
 330  312  768   2552700        1232
 360  312  768   2552700        1442
 390  618  768   3623161      247465
 420  618  768   7140390     1301035
 450  618  768   7139692     2354305
 480  618  768   7139692     3407251

18 Feb 2015 11:51 #15671

Mecej4

And what is the reason for the large 3x (48 s vs 17 s) speed advantage of other compilers versus FTN95, even with SSE? Can you run high-resolution timers with an accuracy of 1 processor cycle in the different compilers (or use other means) and find where the speed loss happens?

18 Feb 2015 12:21 (Edited: 20 Feb 2015 6:34) #15672

Dan, the 'with SSE' really means 'with SSE in the BLAS routines and X87 everywhere else'. There are FPU calculations done in the rest of the program and, with FTN95, they are done using X87 code and the ST0-ST7 registers. Comparisons of reals can be quite slow because, after a comparison, the FPU flags have to be stored to AX or to memory, loaded into the CPU flags, and only then used to execute Jcc instructions (JE, JL, JG, etc.). Formatted output of reals with WRITE statements also involves slow encoding of real numbers into decimal strings and exponents. The intrinsic functions SIN, COS, etc., and in fact any part of the SALFLIBC.DLL library that works with reals, use X87 code.
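
For illustration, here are the two instruction sequences for comparing a pair of reals and branching (a schematic sketch of typical code, not the output of any particular compiler):

        ; X87: route the FPU condition codes through AX to the CPU flags
        fcomp     st(1)              ; compare ST0 with ST1 and pop
        fnstsw    ax                 ; store FPU status word in AX
        sahf                         ; load AH into the CPU flags
        ja        greater            ; only now can a Jcc be taken

        ; SSE2: a single instruction sets the CPU flags directly
        comisd    xmm0, xmm1
        ja        greater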

With this background information, it seems to me that taking timings of X87 code in 2015 is not something most of us would want to do. Windows 7 and 8 will not even run on a PC with a 486 or a Pentium. In some ways, this situation reminds me of the predicament of the first-generation Itanium. It had X86 emulation in firmware, but that turned out to be so slow that people called the chip 'Itanic'. Later, out of Israel came a pure software emulator that outperformed the firmware emulation.

What we do with X87 code (almost emulation) on a modern chip with SSE2 and beyond is similar to buying a Porsche or a Mazda Miata and hitching a heavy trailer behind it.

As I (and others) have said, as of now it is good practice to use FTN95 for program development, given its fast compilation, good collection of nonstandard library routines, and excellent checking and debugging facilities. Once the program is debugged, switch to one of the optimizing compilers, including the free GFortran, which is quite competitive with the more expensive compilers.

When FTN95-64 bit is released, the situation is likely to see a dramatic improvement.

18 Feb 2015 1:13 #15674

John, I have posted a short test program with a description of the problem at http://forums.silverfrost.com/viewtopic.php?p=17303#17303. The test program does a DAXMY operation 50000 times, and displays the severe penalty that results from having underflow interrupts enabled (which they are by default). I prepared the data file for the test program.

What is remarkable to me is that the figure of CPU cycles per underflow interrupt (which I badly miscalculated earlier as about 200) remains the same for the single set of DAXMY data in the test code as for the hours-long run of your program.

Also remarkable is what I observed from the output of your program with /l:4 /m:-1 /d:4. Here is the relevant section of the output:

r at equation      1    1    1   0.00000             0             0             0           0
r at equation      2   10 1700   0.00030             3             0          6830           0
r at equation      3   14 1700   0.00040             3             1          9400           0
r at equation      4   18 1700   0.00030             2             1         11974           0
r at equation    260  622 1700   0.26680            64          2604      80576362        1908
r at equation    520 1236 1700 132.07620           150       1320612     174196844    17076442
r at equation    780 1239 1700 357.12730           183       3571090     273732718    63335469
r at equation   1040 1319 1700 358.88130           146       3588667     281352012   109568425
r at equation   1300 1416 1700 410.90720           225       4108847     299587327   153649017
r at equation   1560 1557 1700 273.48410           218       2734623     330343476   188004386
r at equation   1820 1700 1700  96.77590           234        967525     363067445   200396655
r at equation   2080 1700 1700   1.42500            50         14200     376361853   200422820
r at equation   2340 1700 1700   1.19730            31         11942     376362740   200422820

What is remarkable is that only a few of the equations generate most of the underflows. At the beginning, there is no underflow. Between equations 520 and 1560 the number of underflows shoots through the roof, and then it trickles down to nothing after equation 2080.

Earlier, I had thought that the underflows, most often occurring in VEC_SUB, were caused by many elements of a*x and y being of the same sign and of nearly equal magnitude. However, this portion of a printout from the program tells us otherwise: many elements of x, as well as the scalar multiplier a, are so small that forming the product a*x itself causes underflow, before y is even subtracted. If such a thing happens in your real FEA solver routine, you can just change the sign of the elements of y and return (see the sketch after the printout). A similar short cut could be taken in VEC_ADD.

  a =      -1.00506E-300

    i            x(i)                 y(i)
------------------------------------------
    1      -1.0049-298     9.9980E+01
    2       1.0050-300     9.9020E-01
    3      -1.0051-302     9.8039E-05
    4       1.0052-304    -9.8049E-07
    5      -1.0153-306     9.8059E-09
    6       0.0000E+00    -9.8069E-11
    7       0.0000E+00     9.8078E-13
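
A hypothetical rendering of that short cut for a DAXMY-style update (y := a*x - y); the routine name, interface and threshold test are all my assumptions:

    ! Hypothetical short cut: when every product a*x(i) would underflow,
    ! y := a*x - y reduces to y := -y, as suggested above.  The one-off
    ! maxval scan costs O(n) but avoids n underflow interrupts.
    subroutine vec_daxmy (y, x, a, n)
      integer, intent(in)    :: n
      real*8,  intent(in)    :: x(n), a
      real*8,  intent(inout) :: y(n)
      real*8, parameter :: tiny_norm = 2.2250738585072014d-308
      integer :: i
      if (a == 0.0d0) then
        y = -y                        ! a*x is exactly zero everywhere
        return
      end if
      if (maxval(abs(x)) < tiny_norm/abs(a)) then
        y = -y                        ! every a*x(i) underflows
        return
      end if
      do i = 1, n
        y(i) = a*x(i) - y(i)
      end do
    end subroutine vec_daxmy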

19 Feb 2015 2:14 (Edited: 22 Feb 2015 3:55) #15688

In this thread we have seen how having lots of underflow exceptions can degrade performance. However, there is a more general reason not to use X87 code on CPUs that have a more recent FPU capable of SSE2 and later instructions, whenever floating point performance is important. In the article at http://www.realworldtech.com/physx87/1/, David Kanter writes:

This article delves into the recent history of real-time game physics libraries (specifically PhysX), and analyzes the performance characteristics of PhysX. In particular, through our experiments we found that PhysX uses an exceptionally high degree of x87 code and no SSE, which is a known recipe for poor performance on any modern CPU.

There is a discussion of the shortcomings of the X87 family w.r.t. exception handling and FPU stack overflow by an expert on the subject: 'How Intel 80X87 Stack Over/Underflow Should Have Been Handled', http://www.cims.nyu.edu/~dbindel/class/cs279/stack87.pdf.

21 Feb 2015 12:48 #15699

I have updated my test program, making it more standard-conforming and removing most 'integer array(*)' usage. This allows most of the program to be compiled with /check, except for the low-level vector operations, which are compiled with /P6 /OPT.

I have not been able to identify any coding cause of the run time error 'floating point stack fault in IO_convert_long_double_to_ascii'.

It looks as if it is associated with the handling of floating point exceptions after MASK_UNDERFLOW@ has been called or SALFENVAR=MASK_UNDERFLOW has been set.

It would be good if a patch was available.

John

21 Feb 2015 4:34 #15700

John

Is it possible to provide me with a working program and source code that illustrates this problem?

As far as I know this routine is only called within a standard I/O call. There are functions invalid_float and invalid_double that might be useful:

c_external INVALID_DOUBLE@ 'invalid_double'(VAL):logical
c_external INVALID_FLOAT@ 'invalid_float'(VAL):logical

21 Feb 2015 11:07 #15701

John, here is a temporary work-around for the FPU stack overflow bug. At the very beginning of the program, call MASK_UNDERFLOW@(). In the file colsol.f90, just before the call to RedCol_Stats(), call UNMASK_UNDERFLOW@(). With these calls added, there is no longer any need to set the environment variable SALFENVAR=MASK_UNDERFLOW.

Note that this work-around is quite fragile. If you modify your program, it may happen that the bug will go active again.
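
In outline (the routine names are as given in this thread; the empty argument lists and exact placement are assumptions):

    ! Sketch of the temporary work-around described in this post.
    program fea_main
      call mask_underflow@ ()       ! at the very beginning: stop trapping
      ! ... the bulk of the run, including the solver in colsol.f90 ...
      call unmask_underflow@ ()     ! restore trapping just before the
      ! call RedCol_Stats()         ! statistics call, per the post above
    end program fea_main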
