|
forums.silverfrost.com Welcome to the Silverfrost forums
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Wed Feb 18, 2015 8:21 am Post subject: |
|
|
Mecej4,
Thanks very much for identifying this as a possible source of the delay.
Paul, would this type of behaviour (delays of ~100 cycles) be consistent with handling an underflow error? It seems very long.
I am surprised at the time it appears to take to handle these interrupts. I am not sure what part of this delay is due to the processor response/settings and what is under the control of FTN95.
If there is an Interrupt Service Routine for underflows, it might be of use to be able to:
count the number of occurrences,
set the result to zero,
then continue
This would be an interim measure to identify this as the primary cause of the delay.
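A minimal sketch of the "count, set to zero, continue" idea, done in software rather than in an ISR. It is written in Python only to illustrate the logic; the function name `daxpy_count_underflow` is hypothetical, and the test for "underflow" (a product of nonzero operands landing below the normal double range) only approximates what the FPU would actually trap on.

```python
import sys

# Smallest positive normal double, about 2.2e-308.
TINY = sys.float_info.min

def daxpy_count_underflow(a, x, y):
    """Return (y + a*x, count), flushing underflowed products to zero.

    Hypothetical helper: 'count' is the number of products a*x(i) that
    fell below the normal range although both operands were nonzero.
    """
    count = 0
    out = []
    for xi, yi in zip(x, y):
        p = a * xi
        # Nonzero operands but a subnormal or zero product: treat as underflow.
        if a != 0.0 and xi != 0.0 and abs(p) < TINY:
            count += 1      # count the occurrence
            p = 0.0         # set the result to zero
        out.append(yi + p)  # then continue
    return out, count

result, n_underflow = daxpy_count_underflow(1e-200, [1e-200, 1.0, 0.0], [1.0, 2.0, 3.0])
print(result, n_underflow)
```

Only the first element is counted here: 1e-200 * 1e-200 is far below the representable range, while the exact zero in x(3) is not an underflow.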
Also, is the "floating point stack fault in IO_convert_long_double_to_ascii" error identifiable ? It would be good if a patch was available.
At present I am reviewing the matrices being generated for the different zero density settings, to see how many extremely small values are being generated in this test program. I shall post these results shortly.
I will also need to compare this to the number generated in a real finite element matrix. It may be that the matrices I am generating in this test suite are not suitable for the test (poorly conditioned, as opposed to the "symmetric positive definite well conditioned matrices" that are assumed in FE analysis).
The performance of my DAXPY style routine has long been a problem with FTN95. It would be a great outcome if the underflow ISR response was found to be a significant cause of the delays and a faster response method was identified, although I am skipping a few steps in this suggestion.
John
ps: As an aside, I recently listened to a report on the radio about the problem of long term data retention, as storage and software technologies become obsolete. While this is apparent for the change in tape/disk/usb/... storage, it is a bigger problem for retaining communications.
When trying to look back on the reports I have made on this delay problem over the last 20 years, I realised I have lost most of the emails I have sent. Not only are the files not available, but the email software packages that stored these emails and provided the interface are also gone. There remain next to no print-outs of past correspondence to review.
Quite a shock when you think about what is now and will be available, especially when the present smart information systems become obsolete. What goes into the clouds will get blown away. |
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Wed Feb 18, 2015 9:15 am Post subject: |
|
|
John
I don't have anything useful to add to this discussion at the moment other than to note that there is a subroutine UNDERFLOW_COUNT@ that gives the count from start up. See the help file for details. There is currently no routine to reset the count to zero but it would not be difficult to provide this. The count is a 32 bit integer so having the facility to reset to zero may not be particularly useful. |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Wed Feb 18, 2015 10:55 am Post subject: |
|
|
Here are some numbers to quantify the huge cost of handling underflows in John's program. In both runs, I called UNDERFLOW_COUNT@() and printed out the returned count in the status displays made by the program (lines that begin "p at equation" and "r at equation"). I also called UNDERFLOW_COUNT@ and printed the final count at program termination, just to cover the possibility that the 32-bit count of underflow exceptions had overflowed.
Code: |
CASE                               UFL count (cumul)   Run duration
profsal /l:5 /m:-1 /d:2 /e:1500                76650          1.7 s
profsal /l:4 /m:-1 /d:2 /e:1500             39480441          306 s
|
Taking the difference, I worked out a cost per underflow exception of 7.8 us (microseconds). For the i5-4200U CPU on the laptop on which this test was run, that works out to 12,000 CPU cycles per interrupt (assuming that the cost of calling UNDERFLOW_COUNT@ a few hundred times can be neglected, and that one DAXPY 'op' can cause at most one underflow). This figure for cycles per interrupt is far larger than John's estimates (given earlier in this thread and in private messages as a couple of hundred cycles) and far from reasonable for the work involved in doing two context switches and any bookkeeping in the ISR. It is unreasonable not only from the CPU point of view, but also a huge problem for the end user, who may think of each element of a DAXPY operation (y := a x + y) as involving three memory accesses through cache, one FP multiplication and one FP addition. Whereas such an operation may consume some tens of cycles, an underflow consumes nearly 12,000; in other words, most of the program time is spent servicing interrupts.
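The arithmetic behind those figures can be reproduced as a quick check. This is only a sketch: the counts and run times are the ones from the table above, while the 1.6 GHz clock is an assumption (the nominal base clock of the i5-4200U, ignoring turbo), so the cycle figure is approximate.

```python
# Figures taken from the two runs in the table above.
ufl_fast, t_fast = 76_650, 1.7          # /l:5 run: underflow count, seconds
ufl_slow, t_slow = 39_480_441, 306.0    # /l:4 run: underflow count, seconds

# Assumption: 1.6 GHz nominal base clock of the i5-4200U (turbo ignored).
CLOCK_HZ = 1.6e9

# Extra time divided by extra underflows gives the cost per underflow.
cost_s = (t_slow - t_fast) / (ufl_slow - ufl_fast)
cost_us = cost_s * 1e6
cycles = cost_s * CLOCK_HZ

print(f"cost per underflow: {cost_us:.2f} us, about {cycles:,.0f} cycles")
```

With the rounded run times shown in the table this gives a little under 7.8 us and a little over 12,000 cycles per underflow.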
Why is the number of underflows so huge? John has already decided to probe into this problem-specific question. Here is a trimmed extract from the program output that may give some insight. Note that even when the operation count (John could perhaps describe precisely what that means) increased by a modest fifty percent (between equations 360 and 390), the underflow count increased about 170-fold.
Code: |
Eqn band envl ops ufls (cumul)
-----------------------------------
1 1 1 0 0
2 6 768 1552 0
3 8 768 1946 0
4 10 768 2342 0
30 62 768 206440 0
60 122 768 618440 0
90 182 768 1077440 0
120 242 768 1590440 0
150 302 768 2157440 0
180 312 768 2545740 182
210 312 768 2552700 392
240 312 768 2552700 602
270 312 768 2552700 812
300 312 768 2552700 1022
330 312 768 2552700 1232
360 312 768 2552700 1442
390 618 768 3623161 247465
420 618 768 7140390 1301035
450 618 768 7139692 2354305
480 618 768 7139692 3407251
|
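As a check on the jump between equations 360 and 390 in the extract above, the two growth ratios can be computed directly. This is a small sketch; the numbers are copied from the "ops" and "ufls (cumul)" columns.

```python
# (ops, cumulative underflows) at equations 360 and 390, from the extract above.
ops_360, ufl_360 = 2_552_700, 1_442
ops_390, ufl_390 = 3_623_161, 247_465

ops_growth = ops_390 / ops_360   # about 1.42
ufl_growth = ufl_390 / ufl_360   # about 172

print(f"ops grew {ops_growth:.2f}x, underflows grew {ufl_growth:.0f}x")
```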
|
|
|
|
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Wed Feb 18, 2015 12:51 pm Post subject: |
|
|
Mecej4
And what is the reason for the large 3x (48 s vs 17 s) speed advantage of other compilers over FTN95, even with SSE? Can you run high resolution timers with an accuracy of 1 processor cycle in different compilers (or use other ways) and find where the speed loss happens? |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Wed Feb 18, 2015 1:21 pm Post subject: |
|
|
Dan, the "with SSE" really means "with SSE in the BLAS routines and X87 everywhere else". There are FPU calculations done in the rest of the program and, with FTN95, they are done using x87 code using the ST0-ST7 registers. Comparisons of reals can be quite slow, because after a comparison the FPU flags have to be pushed to the stack or to other memory, loaded into the CPU flags, and then used to execute JCC instructions (JE, JL, JG, etc.). Formatted output of reals with WRITE statements also involves slow encoding of real numbers into decimal strings+exponents. The intrinsic functions SIN, COS, etc., and in fact any part of the SALFLIBC.DLL library that works with reals uses X87 code.
With this background information, it seems to me that running timing results on X87 code in 2015 is not something most of us would want to do. Windows 7 and 8 will not even run on a PC with a 486 or a Pentium. In some ways, this situation reminds me of the predicament of the first generation Itanium. It had X86 emulation in firmware, but that turned out to be so slow that people called the chip 'Itanic'. Later, out of Israel came a pure software emulator that outperformed the firmware emulation.
What we do with X87 code (almost emulation) on a modern chip with SSE2 and beyond is similar to buying a Porsche or a Mazda Miata and hitching a heavy trailer behind it.
As I (and others) have said, as of now, it is good practice to use FTN95 for program development, given its fast compilation, good nonstandard library routine collection and excellent checking and debugging facilities. Once the program is debugged, switch to one of the optimizing compilers, including the free GFortran, which is quite competitive with the more expensive compilers.
When FTN95-64 bit is released, the situation is likely to see a dramatic improvement.
Last edited by mecej4 on Fri Feb 20, 2015 7:34 pm; edited 1 time in total |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Wed Feb 18, 2015 2:13 pm Post subject: |
|
|
John, I have posted a short test program with a description of the problem at http://forums.silverfrost.com/viewtopic.php?p=17303#17303. The test program does a DAXMY operation 50000 times, and displays the severe penalty that results from having underflow interrupts enabled (which is so by default). I prepared the data file for the test program.
What is remarkable to me is that the figure of CPU cycles/underflow interrupt (which I badly miscalculated earlier to be about 200) remains the same in the single set of DAXMY data of the test code as in the hours-long run of your program.
Also remarkable is what I observed from the output of your program with /l:4 /m:-1 /d:4. Here is the relevant section of the output:
Code: |
r at equation 1 1 1 0.00000 0 0 0 0
r at equation 2 10 1700 0.00030 3 0 6830 0
r at equation 3 14 1700 0.00040 3 1 9400 0
r at equation 4 18 1700 0.00030 2 1 11974 0
r at equation 260 622 1700 0.26680 64 2604 80576362 1908
r at equation 520 1236 1700 132.07620 150 1320612 174196844 17076442
r at equation 780 1239 1700 357.12730 183 3571090 273732718 63335469
r at equation 1040 1319 1700 358.88130 146 3588667 281352012 109568425
r at equation 1300 1416 1700 410.90720 225 4108847 299587327 153649017
r at equation 1560 1557 1700 273.48410 218 2734623 330343476 188004386
r at equation 1820 1700 1700 96.77590 234 967525 363067445 200396655
r at equation 2080 1700 1700 1.42500 50 14200 376361853 200422820
r at equation 2340 1700 1700 1.19730 31 11942 376362740 200422820
|
What is remarkable is that only a few of the equations generated most of the underflows. At the beginning, there is no underflow. Between equations 520 and 1560, the number of underflows shoots through the roof, and then trickles down to nothing after equation 2080.
Earlier, I had thought that the underflows, most often occurring in VEC_SUB, were caused by many elements of a x and y being of the same sign and nearly equal magnitudes. However, this portion of a printout from the program tells us otherwise, that many elements of x as well as the scalar multiplier a are so small that forming the product a x itself causes underflow, before subtracting y. If such a thing happens in your real FEA program solver routine, you can just change the sign of the elements in y and return. A similar short cut could be taken in VEC_ADD.
Code: |
a = -1.00506E-300
i x(i) y(i)
------------------------------------------
1 -1.0049-298 9.9980E+01
2 1.0050-300 9.9020E-01
3 -1.0051-302 9.8039E-05
4 1.0052-304 -9.8049E-07
5 -1.0153-306 9.8059E-09
6 0.0000E+00 -9.8069E-11
7 0.0000E+00 9.8078E-13
|
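The data above can be checked numerically. The sketch below uses Python doubles, which are IEEE 754 64-bit values, and assumes the operation in VEC_SUB is a x - y, as the description of the short cut suggests. Every product a*x(i) is far below the smallest subnormal double, so it rounds to zero and the result is simply -y.

```python
a = -1.00506e-300
# x(i) and y(i) as printed above; "-1.0049-298" is Fortran output for -1.0049E-298.
x = [-1.0049e-298, 1.0050e-300, -1.0051e-302, 1.0052e-304, -1.0153e-306, 0.0, 0.0]
y = [9.9980e+01, 9.9020e-01, 9.8039e-05, -9.8049e-07, 9.8059e-09, -9.8069e-11, 9.8078e-13]

products = [a * xi for xi in x]
# Each product is far below the smallest subnormal double (about 4.9e-324),
# so it rounds to a (signed) zero.
all_zero = all(p == 0.0 for p in products)

# Assumed operation: a*x - y. With a*x == 0 this reduces to -y.
result = [p - yi for p, yi in zip(products, y)]
shortcut = [-yi for yi in y]

print(all_zero, result == shortcut)
```

So for this data the whole vector operation could be replaced by a sign change of y, without ever forming the underflowing products.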
|
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Thu Feb 19, 2015 3:14 pm Post subject: |
|
|
In this thread we have seen how having lots of underflow exceptions can cause degradation of performance. However, there is a more general reason not to use X87 code on CPUs that have a more recent FPU that is capable of SSE2 and later instructions, when floating point performance is important. In the article, http://www.realworldtech.com/physx87/1/, David Kanter writes Quote: | This article delves into the recent history of real-time game physics libraries (specifically PhysX), and analyzes the performance characteristics of PhysX. In particular, through our experiments we found that PhysX uses an exceptionally high degree of x87 code and no SSE, which is a known recipe for poor performance on any modern CPU. |
There is a discussion of the shortcomings of the X87 family w.r.t. exception handling and FPU stack overflow by an expert on the subject: "How Intel 80X87 Stack Over/Underflow Should Have Been Handled" http://www.cims.nyu.edu/~dbindel/class/cs279/stack87.pdf .
Last edited by mecej4 on Sun Feb 22, 2015 4:55 pm; edited 1 time in total |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Sat Feb 21, 2015 1:48 pm Post subject: |
|
|
I have updated my test program by making it more standard compliant and removing most "integer array(*)" usage.
This allows most of the program to be compiled with /check, except for the low level vector operations which are compiled with /P6 /OPT.
I have not been able to identify any coding cause of the run time error:
"floating point stack fault in IO_convert_long_double_to_ascii"
It looks as if it is associated with the handling of FPE after
MASK_UNDERFLOW@ has been called or
SALFENVAR=MASK_UNDERFLOW has been set.
It would be good if a patch was available.
John |
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Sat Feb 21, 2015 5:34 pm Post subject: |
|
|
John
Is it possible to provide me with a working program and source code that illustrates this problem?
As far as I know this routine is only called within a standard IO call.
There are functions invalid_float and invalid_double that might be useful.
c_external INVALID_DOUBLE@ 'invalid_double'(VAL):logical
c_external INVALID_FLOAT@ 'invalid_float'(VAL):logical |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Sun Feb 22, 2015 12:07 am Post subject: |
|
|
John, here is a temporary work-around for the FPU stack overflow bug. At the very beginning of the program, call mask_underflow@(). In file colsol.f90, just before calling RedCol_Stats(), call unmask_underflow@(). With these calls added, there is no longer a need for setting the environment variable SALFENVAR=MASK_UNDERFLOW.
Note that this work-around is quite fragile. If you modify your program, the bug may become active again. |
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Sun Feb 22, 2015 12:05 pm Post subject: |
|
|
Paul and Mecej4,
I am developing an updated version of the program which demonstrates the FPE error associated with calling mask_underflow@(). This program can be compiled and run with /check.
If I call mask_underflow@() at the start of the program, then I get problems at the end of the testing, when outputting a valid real*8 number after outputting an integer*8 number. Earlier writes of real*8 values did not generate an error, and all numbers being reported are valid numbers (in the range 0.01 to 1000).
I am not getting the error during the test, associated with RedCol_Stats(), but at the reporting stage at the end of the main do ieq loop.
Alternatively, if I first call mask_underflow@() at the start of the main loop, then call unmask_underflow@() at the end of the main loop, before the write statements, then there is no error generated. FTN95 documentation recommends that call mask_underflow@() be the first executable statement?
I also tried a test in the inner loop:
if ( abs(Col(Jeq+I0)) < 1.0d-90 ) Col(Jeq+I0) = 0
This removes most small numbers being generated in colsol and removes FP Exceptions, but only for 1 of the solution methods.
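The effect of that inner-loop test can be sketched outside the solver. The following is only an illustration in Python, with a hypothetical helper `flush_small` and made-up sample values; the 1.0e-90 threshold is the one used in the test above.

```python
def flush_small(col, tol=1.0e-90):
    """Return a copy of col with entries smaller in magnitude than tol set to 0.0."""
    return [0.0 if abs(v) < tol else v for v in col]

# Made-up sample column with two very small entries.
col = [3.5, -2.0e-120, 4.0e-200, -1.0e-10]
cleaned = flush_small(col)

# Without the flush, squaring 4.0e-200 would underflow (the exact result,
# about 1.6e-399, is not representable). After the flush the same product
# is 0.0 * 0.0, which is exact and raises nothing.
print(cleaned)
```

The threshold trades a small loss of information for fewer exceptions, so whether 1.0d-90 is appropriate depends on the scaling of the matrix.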
It may be that a well conditioned finite element matrix will not generate FPEs. This is disappointing, as I was hoping a source of this chronic delay problem may have been found.
I shall send the link in a pm, together with documentation of how to generate the error.
John |
|
|
|
DanRRight
Joined: 10 Mar 2008 Posts: 2816 Location: South Pole, Antarctica
|
Posted: Sun Feb 22, 2015 3:39 pm Post subject: |
|
|
Keep looking at that issue. That way underflow corruption will finally be addressed. It caused a lot of lost time in the past in one of my subroutines, which was doing a lot of exp(-a) with a exceeding |log(1e-37)| ~ 85. It essentially killed Jalih's (and I think Paul's latest too) parallel method for me, since it is very sensitive to underflow and crashes immediately. Maybe even denormal numbers cause the problem. We discussed that last year here and even had a demo reproducer |
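One common way to sidestep underflow in exp(-a) is to test the argument first and return zero outright. The sketch below is in Python and assumes single-precision limits, as the 1e-37 figure suggests; the helper name `exp_neg` is hypothetical, and the cutoff is derived from the smallest positive normal single-precision value.

```python
import math

# Smallest positive normal single-precision value (IEEE 754 binary32).
TINY_SINGLE = 1.17549435e-38
CUTOFF = -math.log(TINY_SINGLE)   # about 87.3

def exp_neg(a, cutoff=CUTOFF):
    """Return exp(-a), or 0.0 without calling exp when a exceeds the cutoff."""
    return 0.0 if a > cutoff else math.exp(-a)

print(CUTOFF, exp_neg(1.0), exp_neg(100.0))
```

A slightly lower cutoff, such as the value near 85 that corresponds to 1e-37, would also keep results clear of the denormal range.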
|
|
|
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
|
Posted: Tue Feb 24, 2015 1:16 am Post subject: |
|
|
Paul,
The following link provides a cut-down example of the FPE failure.
https://www.dropbox.com/s/066ghzblgcmca9s/fpe_example.zip?dl=0
To demonstrate the problem, download and unzip the file from this link and run do_tests.bat
The final failure with SALFENVAR=MASK_UNDERFLOW shows the Floating point stack fault occurring.
I am using FTN95 Ver 7.10.0
Only the program is run with set SALFENVAR=MASK_UNDERFLOW
If you change line 23 of prof.f90 to : eqn_option = 2000
you will then see the FPE delays becoming more significant. The last column reported is the incremental count of FPEs occurring.
Thanks to Mecej4 for his assistance in identifying this error.
John |
|
|
|
mecej4
Joined: 31 Oct 2006 Posts: 1886
|
Posted: Tue Feb 24, 2015 11:04 am Post subject: |
|
|
Please note that the cut down example contains only about 240 lines and does not need any command line arguments to be supplied, whereas the original version had close to 3000 lines, and had (i) provisions for many alternative code paths and (ii) extensive instrumentation to time the program.
With the shortened example code and the batch file that he provides, John has made it easy to run and exhibit the two problems with the compiler: X87 stack overflow in a WRITE statement, and excessive time consumed in processing underflows. It is possible to work around only one of these problems. |
|
|
|
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7925 Location: Salford, UK
|
Posted: Tue Feb 24, 2015 2:06 pm Post subject: |
|
|
The immediate problem occurs when WRITE is used for an INTEGER*8 value.
This has not been fixed yet but the temporary work-around is to avoid this situation.
To get this code working I kept op_count and last_count as INTEGER*8 but assigned these to an INTEGER*4 variable before calling WRITE with the INTEGER*4 value.
If I understand it correctly, in this context the non-zero underflow count is caused by the calls to WRITE when using an INTEGER*8 value.
I have logged this for further investigation. |
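Copying a 64-bit count into a 32-bit variable is only safe while the value fits. A small range check, sketched here in Python with a hypothetical helper `to_int32`, shows that the largest operation and underflow counts printed earlier in this thread are well inside the signed 32-bit range; whether that holds for other runs is not something this thread establishes.

```python
INT32_MAX = 2**31 - 1   # 2147483647, largest signed 32-bit integer

def to_int32(value):
    """Return value unchanged if it fits in a signed 32-bit integer, else raise."""
    if not -INT32_MAX - 1 <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in 32 bits")
    return value

# Largest operation count and underflow count printed earlier in this thread.
for count in (376_362_740, 200_422_820):
    to_int32(count)   # both fit, so no exception is raised
```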
|
|
|
|
|
|