forums.silverfrost.com

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Mecej4,

Thanks very much for identifying this as a possible source of the delay.

Paul, would this type of behaviour (delays of ~100 cycles) be consistent with handling an underflow error? It looks very long.
I am surprised at the time it appears to take to handle these interrupts. I am not sure what part of this delay is due to the processor response/settings and what is under the control of FTN95.

If there is an Interrupt Service Routine for underflows, it might be of use to be able to:
count the number of occurrences,
set the result to zero,
then continue
This would be an interim measure to identify this as the primary cause of the delay.

Also, is the "floating point stack fault in IO_convert_long_double_to_ascii" error identifiable ? It would be good if a patch was available.

At present I am reviewing the matrices being generated for the different zero density settings to see how many extremely small values are being generated in this test program. I shall post these results shortly.
I will also need to compare this to the number generated in a real finite element matrix. It may be that the matrices I am generating in this test suite are not suitable for the test (poorly conditioned, as opposed to the "symmetric positive definite well conditioned matrices" that are assumed in FE analysis).

The performance of my DAXPY style routine has long been a problem with FTN95. It would be a great outcome if the underflow ISR response was found to be a significant cause of the delays and a faster response method was identified, although I am skipping a few steps in this suggestion.

John

ps: As an aside, I recently listened to a report on the radio about the problem of long term data retention, as storage and software technologies become obsolete. While this is apparent for the change in tape/disk/usb/... storage, it is a bigger problem for retaining communications.
When trying to look back on the reports I have made on this delay problem over the last 20 years, I realised I have lost most of my emails I have sent. Not only are the files not available, but the email software packages that stored these emails and provided the interface are also gone. There remain next to no print-outs of past correspondence to review.
Quite a shock when you think about what is now and will be available, especially when the present smart information systems become obsolete. What goes into the clouds will get blown away.

PaulLaidler · Posted: Wed Feb 18, 2015 9:15 am

John

I don't have anything useful to add to this discussion at the moment other than to note that there is a subroutine UNDERFLOW_COUNT@ that gives the count from start up. See the help file for details. There is currently no routine to reset the count to zero but it would not be difficult to provide this. The count is a 32 bit integer so having the facility to reset to zero may not be particularly useful.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Here are some numbers to quantify the huge cost of handling underflows in John's program. In both runs, I called UNDERFLOW_COUNT@() and printed out the returned count in the status displays made by the program (lines that begin "p at equation" and "r at equation"). I also called UNDERFLOW_COUNT@ and printed the final count at program termination, just to cover the possibility that the 32-bit count of underflow exceptions had overflowed.

DanRRight · Posted: Wed Feb 18, 2015 12:51 pm

Mecej4

And what is the reason for the large 3x (48s vs 17s) speed advantage of other compilers over FTN95, even with SSE? Can you run high resolution timers with an accuracy of 1 processor cycle in the different compilers (or use other ways) to find where the speed loss happens?

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Dan, the "with SSE" really means "with SSE in the BLAS routines and X87 everywhere else". There are FPU calculations done in the rest of the program and, with FTN95, they are done using x87 code using the ST0-ST7 registers. Comparisons of reals can be quite slow, because after a comparison the FPU flags have to be pushed to the stack or to other memory, loaded into the CPU flags, and then used to execute JCC instructions (JE, JL, JG, etc.). Formatted output of reals with WRITE statements also involves slow encoding of real numbers into decimal strings+exponents. The intrinsic functions SIN, COS, etc., and in fact any part of the SALFLIBC.DLL library that works with reals uses X87 code.

With this background information, it seems to me that running timing results on X87 code in 2015 is not something most of us would want to do. Windows 7 and 8 will not even run on a PC with a 486 or a Pentium. In some ways, this situation reminds me of the predicament of the first generation Itanium. It had X86 emulation in firmware, but that turned out to be so slow that people called the chip 'Itanic'. Later, out of Israel came a pure software emulator that outperformed the firmware emulation.

What we do with X87 code (almost emulation) on a modern chip with SSE2 and beyond is similar to buying a Porsche or a Mazda Miata and hitching a heavy trailer behind it.

As I (and others) have said, as of now, it is good practice to use FTN95 for program development, given its fast compilation, good nonstandard library routine collection and excellent checking and debugging facilities. Once the program is debugged, switch to one of the optimizing compilers, including the free GFortran, which is quite competitive with the more expensive compilers.

When FTN95-64 bit is released, the situation is likely to see a dramatic improvement.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

John, I have posted a short test program with a description of the problem at http://forums.silverfrost.com/viewtopic.php?p=17303#17303. The test program does a DAXMY operation 50000 times, and displays the severe penalty that results from having underflow interrupts enabled (which is so by default). I prepared the data file for the test program.

What is remarkable to me is that the figure of CPU cycles/underflow interrupt (which I badly miscalculated earlier to be about 200) remains the same in the single set of DAXMY data of the test code as in the hours-long run of your program.

Also remarkable is what I observed from the output of your program with /l:4 /m:-1 /d:4. Here is the relevant section of the output:

mecej4 · Joined: 31 Oct 2006 Posts: 1886

In this thread we have seen how large numbers of underflow exceptions can degrade performance. However, there is a more general reason not to use X87 code when floating point performance is important and the CPU has a more recent FPU capable of SSE2 and later instructions. In the article, http://www.realworldtech.com/physx87/1/, David Kanter writes

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

I have updated my test program by making it more standard compliant and removing most "integer array(*)" usage.
This allows most of the program to be compiled with /check, except for the low level vector operations which are compiled with /P6 /OPT.

I have not been able to identify any coding cause of the run time error:
"floating point stack fault in IO_convert_long_double_to_ascii"

It looks as if it is associated with the handling of FPE after
MASK_UNDERFLOW@ has been called or
SALFENVAR=MASK_UNDERFLOW has been set.

It would be good if a patch was available.

John

PaulLaidler · Posted: Sat Feb 21, 2015 5:34 pm

John

Is it possible to provide me with a working program and source code that illustrates this problem?

As far as I know this routine is only called within a standard IO call.
There are functions invalid_float and invalid_double that might be useful.

c_external INVALID_DOUBLE@ 'invalid_double'(VAL):logical
c_external INVALID_FLOAT@ 'invalid_float'(VAL):logical

mecej4 · Joined: 31 Oct 2006 Posts: 1886

John, here is a temporary work-around for the FPU stack overflow bug. At the very beginning of the program, call mask_underflow@(). In file colsol.f90, just before calling RedCol_Stats(), call unmask_underflow@(). With these calls added, there is no longer a need for setting the environment variable SALFENVAR=MASK_UNDERFLOW.

Note that this work-around is quite fragile. If you modify your program, the bug may become active again.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Paul and Mecej4,

I am developing an updated version of the program which demonstrates the FPE error associated with calling mask_underflow@(). This program can be compiled and run with /check.

If I call mask_underflow@() at the start of the program, then I get problems at the end of the testing, when outputting a valid real*8 number after outputting an integer*8 number. There were previous writes of real*8 values that did not generate an error, and all numbers being reported are valid numbers (in the range 0.01 to 1000).
I am not getting the error during the test, associated with RedCol_Stats(), but at the reporting stage at the end of the main do ieq loop.

Alternatively, if I first call mask_underflow@() at the start of the main loop, then call unmask_underflow@() at the end of the main loop, before the write statements, then no error is generated. Doesn't the FTN95 documentation recommend that call mask_underflow@() be the first executable statement?

I also tried a test in the inner loop:
if ( abs(Col(Jeq+I0)) < 1.0d-90 ) Col(Jeq+I0) = 0
This removes most small numbers being generated in colsol and removes FP Exceptions, but only for 1 of the solution methods.
It may be that a well conditioned finite element matrix will not generate FPEs. This is disappointing, as I was hoping a source of this chronic delay problem had been found.
I shall send the link in a pm, together with documentation of how to generate the error.
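The inner-loop test above can be sketched in isolation as follows. This is only a minimal illustration in standard Fortran, not the colsol code: the module name flush_small_mod, the routine daxpy_flush, its argument list and the counter nflushed are all hypothetical, and the update is written as a plain y = y + a*x. Only the 1.0d-90 threshold idea is taken from the test above.

```fortran
module flush_small_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical DAXPY-style update: y := y + a*x, followed by a
  ! threshold test that zeroes any non-zero entry smaller than tol.
  ! nflushed returns the number of entries that were zeroed.
  subroutine daxpy_flush(n, a, x, y, tol, nflushed)
    integer,  intent(in)    :: n
    real(dp), intent(in)    :: a, x(n), tol
    real(dp), intent(inout) :: y(n)
    integer,  intent(out)   :: nflushed
    integer :: i
    nflushed = 0
    do i = 1, n
      y(i) = y(i) + a*x(i)
      if (abs(y(i)) < tol .and. y(i) /= 0.0_dp) then
        y(i) = 0.0_dp            ! discard the tiny value
        nflushed = nflushed + 1  ! count how often this happens
      end if
    end do
  end subroutine daxpy_flush
end module flush_small_mod
```

A call such as call daxpy_flush(n, a, x, y, 1.0d-90, k) would then report in k how many entries were discarded. The test runs after the arithmetic, so it cannot stop an underflow raised inside a*x itself; it only keeps very small values from being stored and fed into later operations.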

John

DanRRight · Posted: Sun Feb 22, 2015 3:39 pm

Keep looking at that issue. That way the underflow corruption will finally be addressed. It caused a lot of lost time in the past in one of my subroutines, which was doing a lot of exp(-a) with a exceeding -log(1e-37) ~ 85. It essentially killed Jalih's (and I think Paul's latest too) parallel method for me, since it is very sensitive to underflow and crashes immediately. Maybe even denormal numbers cause the problem. We discussed that here last year and even had a demo reproducer.
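One way to keep exp(-a) away from the underflow range is to test the argument first. The sketch below assumes single precision (suggested by the 1e-37 figure); the module name safe_exp_mod, the function exp_neg and the cut-off constant a_max are hypothetical and not taken from the subroutine mentioned above.

```fortran
module safe_exp_mod
  implicit none
contains
  ! exp(-a) in single precision, returning 0.0 instead of letting the
  ! result drop below the smallest normal number tiny(1.0) (~1.18e-38).
  elemental function exp_neg(a) result(r)
    real, intent(in) :: a
    real :: r
    ! Cut-off chosen slightly below -log(tiny(1.0)), which is about 87.3
    real, parameter :: a_max = 87.0
    if (a > a_max) then
      r = 0.0          ! result would underflow, so return zero directly
    else
      r = exp(-a)      ! result is still a normal number
    end if
  end function exp_neg
end module safe_exp_mod
```

Because the function is elemental, it can be applied to a whole array, for example w = exp_neg(a_array). The cost is one comparison per call; whether returning exactly zero is acceptable depends on how the result is used afterwards.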

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Paul,

The following link provides a cut-down example of the FPE failure.

https://www.dropbox.com/s/066ghzblgcmca9s/fpe_example.zip?dl=0

To demonstrate the problem, download and unzip the file from this link and run do_tests.bat
The final failure with SALFENVAR=MASK_UNDERFLOW shows the Floating point stack fault occurring.

I am using FTN95 Ver 7.10.0

Only the program is run with set SALFENVAR=MASK_UNDERFLOW

If you change line 23 of prof.f90 to : eqn_option = 2000
you will then see the FPE delays becoming more significant. The last column reports the incremental count of FPEs.

Thanks to Mecej4 for his assistance in identifying this error.

John

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Please note that the cut down example contains only about 240 lines and does not need any command line arguments to be supplied, whereas the original version had close to 3000 lines, and had (i) provisions for many alternative code paths and (ii) extensive instrumentation to time the program.

With the shortened example code and the batch file that he provides, John has made it easy to run and exhibit the two problems with the compiler: X87 stack overflow in a WRITE statement, and excessive time consumed in processing underflows. It is possible to work around only one of these problems.

PaulLaidler · Posted: Tue Feb 24, 2015 2:06 pm

The immediate problem occurs when WRITE is used for an INTEGER*8 value.
This has not been fixed yet but the temporary work-around is to avoid this situation.

To get this code working I kept op_count and last_count as INTEGER*8 but assigned them to an INTEGER*4 variable before calling WRITE with the INTEGER*4 value.
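A minimal sketch of that kind of conversion, in standard Fortran, might look like this. The module name int8_write_mod and the helper to_i4 are hypothetical, and the clamping to the 32-bit range is an added safeguard rather than part of the work-around described above.

```fortran
module int8_write_mod
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)  ! 64-bit integer
  integer, parameter :: i4 = selected_int_kind(9)   ! 32-bit integer
contains
  ! Copy a 64-bit counter into a 32-bit integer for output.
  ! Values outside the 32-bit range are clamped rather than wrapped.
  function to_i4(big) result(small)
    integer(i8), intent(in) :: big
    integer(i4) :: small
    if (big > int(huge(small), i8)) then
      small = huge(small)
    else if (big < -int(huge(small), i8)) then
      small = -huge(small)
    else
      small = int(big, i4)
    end if
  end function to_i4
end module int8_write_mod
```

With this in place, the counter stays INTEGER*8 in the computation and only a copy is written, for example op4 = to_i4(op_count) followed by write(*,*) op4, where op4 is a hypothetical INTEGER*4 variable.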

If I understand it correctly, in this context the non-zero underflow count is caused by the calls to WRITE when using an INTEGER*8 value.

I have logged this for further investigation.