FTN95 run time with allocatable arrays
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Feb 18, 2015 8:21 am

Mecej4,

Thanks very much for identifying this as a possible source of the delay.

Paul, would this type of behaviour (delays of ~100 cycles) be consistent with handling an underflow error? It looks very long.
I am surprised at the time it appears to take to handle these interrupts. I am not sure what part of this delay is due to the processor response/settings and what is under the control of FTN95.

If there is an Interrupt Service Routine for underflows, it might be of use to be able to:
count the number of occurrences,
set the result to zero,
then continue
This would be an interim measure to identify this as the primary cause of the delay.

Also, is the "floating point stack fault in IO_convert_long_double_to_ascii" error identifiable ? It would be good if a patch was available.

At present I am reviewing the matrices being generated for the different zero density settings and see how many extremely small values are being generated in this test program. I shall post these results shortly.
I will also need to compare this to the number generated in a real finite element matrix. It may be that the matrices I am generating in this test suite are not suitable for the test (poorly conditioned, as opposed to the "symmetric positive definite well conditioned matrices" that are assumed in FE analysis).

The performance of my DAXPY style routine has long been a problem with FTN95. It would be a great outcome if the underflow ISR response was found to be a significant cause of the delays and a faster response method was identified, although I am skipping a few steps in this suggestion.

John

ps: As an aside, I recently listened to a report on the radio about the problem of long term data retention, as storage and software technologies become obsolete. While this is apparent for the change in tape/disk/usb/... storage, it is a bigger problem for retaining communications.
When trying to look back on the reports I have made on this delay problem over the last 20 years, I realised I have lost most of my emails I have sent. Not only are the files not available, but the email software packages that stored these emails and provided the interface are also gone. There remain next to no print-outs of past correspondence to review.
Quite a shock when you think about what is now and will be available, especially when the present smart information systems become obsolete. What goes into the clouds will get blown away.
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Wed Feb 18, 2015 9:15 am

John

I don't have anything useful to add to this discussion at the moment other than to note that there is a subroutine UNDERFLOW_COUNT@ that gives the count from start up. See the help file for details. There is currently no routine to reset the count to zero but it would not be difficult to provide this. The count is a 32 bit integer so having the facility to reset to zero may not be particularly useful.
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Wed Feb 18, 2015 10:55 am

Here are some numbers to quantify the huge cost of handling underflows in John's program. In both runs, I called UNDERFLOW_COUNT@() and printed the returned count in the status displays made by the program (lines that begin "p at equation" and "r at equation"). I also called UNDERFLOW_COUNT@ and printed the final count at program termination, just to cover the possibility that the 32-bit count of underflow exceptions had overflowed.
Code:

            C A S E              UFL cnt (cumul)   Run duration
profsal /l:5 /m:-1 /d:2 /e:1500            76650        1.7 s
profsal /l:4 /m:-1 /d:2 /e:1500         39480441      306   s


Taking the difference, I worked out a cost per underflow exception of 7.8 us (microseconds). For the i5-4200U CPU on the laptop on which this test was run, that works out to 12,000 CPU cycles per interrupt (assuming that the cost of calling UNDERFLOW_COUNT@ a few hundred times can be neglected, and that one DAXPY 'op' can cause (actually, at most) one underflow). This figure for cycles per interrupt is far larger than John's estimates (given earlier in this thread and in private messages as a couple of hundred cycles) and far from reasonable for the work involved in doing two context switches and any bookkeeping in the ISR -- unreasonable not only from the CPU point of view, but also a huge problem for the end user, who may think about each element of a DAXPY operation (y := a x + y) as involving three memory accesses through cache, one FP multiplication and one FP addition. Whereas an operation may consume some tens of cycles, an underflow consumes nearly 12,000; in other words, most of the program time is spent servicing interrupts.
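The arithmetic behind that estimate can be reproduced as a quick check. This is a sketch in Python rather than Fortran. The 1.6 GHz clock is an assumption (the nominal base frequency of the i5-4200U, not a measured value; under turbo the cycle count would be higher), and the variable names are only illustrative.

```python
# Figures taken from the two runs in the table above.
ufl_fast, time_fast = 76_650, 1.7          # /l:5 run: underflow count, seconds
ufl_slow, time_slow = 39_480_441, 306.0    # /l:4 run: underflow count, seconds

# Assumed clock rate: nominal base frequency of the i5-4200U (not measured).
CLOCK_HZ = 1.6e9

# Attribute all of the extra run time to the extra underflow exceptions.
extra_underflows = ufl_slow - ufl_fast
extra_seconds = time_slow - time_fast

cost_seconds = extra_seconds / extra_underflows   # seconds per underflow
cost_cycles = cost_seconds * CLOCK_HZ             # CPU cycles per underflow

print(f"cost per underflow:   {cost_seconds * 1e6:.1f} us")
print(f"cycles per underflow: {cost_cycles:.0f}")
```

Under those assumptions the result lands within a few percent of the figures quoted above. The calculation is only as good as the premise that the two runs differ in nothing but the number of underflows.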

Why is the number of underflows so huge? John has already decided to probe into this, problem-specific, question. Here is a trimmed extract from the program output that may give some insight. Note that even when the operation count (John could perhaps describe precisely what that means) increased by a modest fifty percent, the underflow count increased about 170-fold.

Code:

 Eqn band envl      ops       ufls (cumul)
 -----------------------------------
   1    1    1         0           0
   2    6  768      1552           0
   3    8  768      1946           0
   4   10  768      2342           0
  30   62  768    206440           0
  60  122  768    618440           0
  90  182  768   1077440           0
 120  242  768   1590440           0
 150  302  768   2157440           0
 180  312  768   2545740         182
 210  312  768   2552700         392
 240  312  768   2552700         602
 270  312  768   2552700         812
 300  312  768   2552700        1022
 330  312  768   2552700        1232
 360  312  768   2552700        1442
 390  618  768   3623161      247465
 420  618  768   7140390     1301035
 450  618  768   7139692     2354305
 480  618  768   7139692     3407251
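The jump between equations 360 and 390 in the table above can be checked directly. This is a short Python sketch using only the two rows copied from that output; the variable names are illustrative.

```python
# Rows for equations 360 and 390: (ops, cumulative underflows).
ops_360, ufl_360 = 2_552_700, 1_442
ops_390, ufl_390 = 3_623_161, 247_465

ops_growth = ops_390 / ops_360   # relative growth in operation count
ufl_growth = ufl_390 / ufl_360   # relative growth in cumulative underflows

print(f"ops grew by a factor of        {ops_growth:.2f}")
print(f"underflows grew by a factor of {ufl_growth:.0f}")
```

The operation count grows by somewhat under fifty percent across those two rows, while the cumulative underflow count grows by a factor of roughly 170.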
DanRRight



Joined: 10 Mar 2008
Posts: 2816
Location: South Pole, Antarctica

Posted: Wed Feb 18, 2015 12:51 pm

Mecej4

And what is the reason for the large 3x (48s vs 17s) speed advantage of other compilers over FTN95, even with SSE? Can you run high resolution timers with an accuracy of 1 processor cycle in different compilers (or use other ways) and find where the speed loss happens?
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Wed Feb 18, 2015 1:21 pm

Dan, the "with SSE" really means "with SSE in the BLAS routines and X87 everywhere else". There are FPU calculations done in the rest of the program and, with FTN95, they are done using x87 code using the ST0-ST7 registers. Comparisons of reals can be quite slow, because after a comparison the FPU flags have to be pushed to the stack or to other memory, loaded into the CPU flags, and then used to execute JCC instructions (JE, JL, JG, etc.). Formatted output of reals with WRITE statements also involves slow encoding of real numbers into decimal strings+exponents. The intrinsic functions SIN, COS, etc., and in fact any part of the SALFLIBC.DLL library that works with reals uses X87 code.

With this background information, it seems to me that running timing results on X87 code in 2015 is not something most of us would want to do. Windows 7 and 8 will not even run on a PC with a 486 or a Pentium. In some ways, this situation reminds me of the predicament of the first generation Itanium. It had X86 emulation in firmware, but that turned out to be so slow that people called the chip 'Itanic'. Later, out of Israel came a pure software emulator that outperformed the firmware emulation.

What we do with X87 code (almost emulation) on a modern chip with SSE2 and beyond is similar to buying a Porsche or a Mazda Miata and hitching a heavy trailer behind it.

As I (and others) have said, as of now, it is good practice to use FTN95 for program development, given its fast compilation, good nonstandard library routine collection and excellent checking and debugging facilities. Once the program is debugged, switch to one of the optimizing compilers, including the free GFortran, which is quite competitive with the more expensive compilers.

When FTN95-64 bit is released, the situation is likely to see a dramatic improvement.


Last edited by mecej4 on Fri Feb 20, 2015 7:34 pm; edited 1 time in total
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Wed Feb 18, 2015 2:13 pm

John, I have posted a short test program with a description of the problem at http://forums.silverfrost.com/viewtopic.php?p=17303#17303. The test program does a DAXMY operation 50000 times, and displays the severe penalty that results from having underflow interrupts enabled (which is so by default). I prepared the data file for the test program.

What is remarkable to me is that the figure of CPU cycles/underflow interrupt (which I badly miscalculated earlier to be about 200) remains the same in the single set of DAXMY data of the test code as in the hours-long run of your program.

Also remarkable is what I observed from the output of your program with /l:4 /m:-1 /d:4. Here is the relevant section of the output:
Code:

r at equation      1    1    1   0.00000             0             0             0           0
r at equation      2   10 1700   0.00030             3             0          6830           0
r at equation      3   14 1700   0.00040             3             1          9400           0
r at equation      4   18 1700   0.00030             2             1         11974           0
r at equation    260  622 1700   0.26680            64          2604      80576362        1908
r at equation    520 1236 1700 132.07620           150       1320612     174196844    17076442
r at equation    780 1239 1700 357.12730           183       3571090     273732718    63335469
r at equation   1040 1319 1700 358.88130           146       3588667     281352012   109568425
r at equation   1300 1416 1700 410.90720           225       4108847     299587327   153649017
r at equation   1560 1557 1700 273.48410           218       2734623     330343476   188004386
r at equation   1820 1700 1700  96.77590           234        967525     363067445   200396655
r at equation   2080 1700 1700   1.42500            50         14200     376361853   200422820
r at equation   2340 1700 1700   1.19730            31         11942     376362740   200422820

What is remarkable is that only a few of the equations generated most of the underflows. At the beginning, there is no underflow. Between equations 520 and 1560, the number of underflows shoots through the roof, and then trickles down to nothing after equation 2080.

Earlier, I had thought that the underflows, most often occurring in VEC_SUB, were caused by many elements of a x and y being of the same sign and nearly equal magnitudes. However, this portion of a printout from the program tells us otherwise, that many elements of x as well as the scalar multiplier a are so small that forming the product a x itself causes underflow, before subtracting y. If such a thing happens in your real FEA program solver routine, you can just change the sign of the elements in y and return. A similar short cut could be taken in VEC_ADD.

Code:

  a =      -1.00506E-300

    i            x(i)                 y(i)
------------------------------------------
    1      -1.0049-298     9.9980E+01
    2       1.0050-300     9.9020E-01
    3      -1.0051-302     9.8039E-05
    4       1.0052-304    -9.8049E-07
    5      -1.0153-306     9.8059E-09
    6       0.0000E+00    -9.8069E-11
    7       0.0000E+00     9.8078E-13
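The shortcut suggested above for VEC_SUB can be sketched as follows. This is Python rather than Fortran, and `vec_sub_guarded` is a hypothetical stand-in, not John's routine. It assumes the operation is y := a x - y, which is my reading of "before subtracting y". The data are the a, x(i) and y(i) values from the printout.

```python
import sys

# Smallest positive normal double (about 2.2e-308); products smaller than
# this in magnitude are subnormal or zero.
TINY = sys.float_info.min

def vec_sub_guarded(a, x, y):
    """Return a*x - y elementwise, skipping products that would underflow.

    Hypothetical stand-in for the VEC_SUB routine discussed in this thread.
    """
    result = []
    for xi, yi in zip(x, y):
        # |a*xi| < TINY is equivalent to |xi| < TINY/|a|, which can be
        # tested without ever forming the product a*xi.
        if a == 0.0 or abs(xi) < TINY / abs(a):
            result.append(-yi)           # a*xi is negligible: flip the sign of y
        else:
            result.append(a * xi - yi)
    return result

# Values copied from the printout above.
a = -1.00506e-300
x = [-1.0049e-298, 1.0050e-300, -1.0051e-302, 1.0052e-304, -1.0153e-306, 0.0, 0.0]
y = [9.9980e+01, 9.9020e-01, 9.8039e-05, -9.8049e-07, 9.8059e-09, -9.8069e-11, 9.8078e-13]

result = vec_sub_guarded(a, x, y)
print(result)
```

For this data every product a*x(i) is far below the double precision range, so the guarded routine never performs the multiplication. The shortcut changes the result only when y(i) is itself of subnormal size, which is not the case here.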
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Thu Feb 19, 2015 3:14 pm

In this thread we have seen how having lots of underflow exceptions can cause degradation of performance. However, there is a more general reason not to use X87 code on CPUs that have a more recent FPU capable of SSE2 and later instructions, when floating point performance is important. In the article http://www.realworldtech.com/physx87/1/, David Kanter writes:
Quote:
This article delves into the recent history of real-time game physics libraries (specifically PhysX), and analyzes the performance characteristics of PhysX. In particular, through our experiments we found that PhysX uses an exceptionally high degree of x87 code and no SSE, which is a known recipe for poor performance on any modern CPU.


There is a discussion of the shortcomings of the X87 family w.r.t. exception handling and FPU stack overflow by an expert on the subject: "How Intel 80X87 Stack Over/Underflow Should Have Been Handled" http://www.cims.nyu.edu/~dbindel/class/cs279/stack87.pdf .


Last edited by mecej4 on Sun Feb 22, 2015 4:55 pm; edited 1 time in total
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sat Feb 21, 2015 1:48 pm

I have updated my test program by making it more standard compliant and removing most "integer array(*)" usage.
This allows most of the program to be compiled with /check, except for the low level vector operations which are compiled with /P6 /OPT.

I have not been able to identify any coding cause of the run time error:
"floating point stack fault in IO_convert_long_double_to_ascii"

It looks as if it is associated with the handling of FPE after
MASK_UNDERFLOW@ has been called or
SALFENVAR=MASK_UNDERFLOW has been set.

It would be good if a patch was available.

John
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Sat Feb 21, 2015 5:34 pm

John

Is it possible to provide me with a working program and source code that illustrates this problem?

As far as I know this routine is only called within a standard IO call.
There are functions invalid_float and invalid_double that might be useful.

c_external INVALID_DOUBLE@ 'invalid_double'(VAL):logical
c_external INVALID_FLOAT@ 'invalid_float'(VAL):logical
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Sun Feb 22, 2015 12:07 am

John, here is a temporary work-around for the FPU stack overflow bug. At the very beginning of the program, call mask_underflow@(). In file colsol.f90, just before calling RedCol_Stats(), call unmask_underflow@(). With these calls added, there is no longer a need for setting the environment variable SALFENVAR=MASK_UNDERFLOW.

Note that this work-around is quite fragile. If you modify your program, the bug may become active again.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sun Feb 22, 2015 12:05 pm

Paul and Mecej4,

I am developing an updated version of the program which demonstrates the FPE error associated with calling mask_underflow@(). This program can be compiled and run with /check.

If I call mask_underflow@() at the start of the program, then I get problems at the end of the testing, when outputting a valid real*8 number after outputting an integer*8 number. Earlier writes of real*8 values did not generate an error, and all numbers being reported are valid (in the range 0.01 to 1000).
I am not getting the error during the test, associated with RedCol_Stats(), but at the reporting stage at the end of the main do ieq loop.

Alternatively, if I first call mask_underflow@() at the start of the main loop, then call unmask_underflow@() at the end of the main loop, before the write statements, then no error is generated. Doesn't the FTN95 documentation recommend that call mask_underflow@() be the first executable statement?

I also tried a test in the inner loop:
if ( abs(Col(Jeq+I0)) < 1.0d-90 ) Col(Jeq+I0) = 0
This removes most small numbers being generated in colsol and removes FP Exceptions, but only for 1 of the solution methods.
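The effect of that inner-loop test can be illustrated outside the solver. The sketch below is Python, not the Fortran used in colsol; `flush_small` is a hypothetical helper and the sample column values are made up for illustration. Only the 1.0d-90 threshold comes from the line above.

```python
def flush_small(values, threshold=1.0e-90):
    """Return a copy of values with entries smaller than threshold in magnitude set to 0.0."""
    return [0.0 if abs(v) < threshold else v for v in values]

# Made-up sample column: two entries are far below the threshold,
# one sits exactly on it and is therefore kept.
col = [9.998e+01, -3.2e-95, 4.1e-120, -7.5e-03, 1.0e-90]
flushed = flush_small(col)
print(flushed)
```

Flushing at 1e-90 leaves a wide margin above the double precision underflow limit near 2.2e-308, so the product of two surviving entries should stay well inside the normal range.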
It may be that a well conditioned finite element matrix will not generate FPEs. This is disappointing, as I was hoping a source of this chronic delay problem may have been found.
I shall send the link in a pm, together with documentation of how to generate the error.

John
DanRRight



Joined: 10 Mar 2008
Posts: 2816
Location: South Pole, Antarctica

Posted: Sun Feb 22, 2015 3:39 pm

Keep looking at that issue. That way underflow corruption will finally be addressed. It caused a lot of lost time in the past in one of my subroutines which was doing a lot of exp(-a) with a exceeding -log(1e-37) ~ 85. It essentially killed Jalih's (and I think Paul's latest too) parallel method for me, since it is very sensitive to underflow and crashes immediately. Maybe even denormal numbers cause the problem. We discussed that last year here and even had a demo reproducer.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Feb 24, 2015 1:16 am

Paul,

The following link provides a cut-down example of the FPE failure.

https://www.dropbox.com/s/066ghzblgcmca9s/fpe_example.zip?dl=0

To demonstrate the problem, download and unzip the file from this link and run do_tests.bat
The final failure with SALFENVAR=MASK_UNDERFLOW shows the Floating point stack fault occurring.

I am using FTN95 Ver 7.10.0

Only the program is run with set SALFENVAR=MASK_UNDERFLOW

If you change line 23 of prof.f90 to : eqn_option = 2000
you will then see the FPE delays becoming more significant. The last column reports the incremental count of FPEs occurring.

Thanks to Mecej4 for his assistance in identifying this error.

John
mecej4



Joined: 31 Oct 2006
Posts: 1886

Posted: Tue Feb 24, 2015 11:04 am

Please note that the cut down example contains only about 240 lines and does not need any command line arguments to be supplied, whereas the original version had close to 3000 lines, and had (i) provisions for many alternative code paths and (ii) extensive instrumentation to time the program.

With the shortened example code and the batch file that he provides, John has made it easy to run and exhibit the two problems with the compiler: X87 stack overflow in a WRITE statement, and excessive time consumed in processing underflows. It is possible to work around only one of these problems.
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7925
Location: Salford, UK

Posted: Tue Feb 24, 2015 2:06 pm

The immediate problem occurs when WRITE is used for an INTEGER*8 value.
This has not been fixed yet but the temporary work-around is to avoid this situation.

To get this code working I kept op_count and last_count as INTEGER*8 but assigned them to an INTEGER*4 variable before calling WRITE with the INTEGER*4 value.
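One caveat with this kind of narrowing is that the 64-bit count must fit in 32 bits before it is copied. The sketch below shows the range check in Python; `fits_int32` is a hypothetical helper and both sample counts are illustrative values, not figures from the test program.

```python
INT32_MIN = -2**31       # smallest value a signed 32-bit integer can hold
INT32_MAX = 2**31 - 1    # largest value a signed 32-bit integer can hold

def fits_int32(value):
    """True if value can be stored in a signed 32-bit integer without loss."""
    return INT32_MIN <= value <= INT32_MAX

# Illustrative counts: the first is safe to narrow, the second is not.
small_count = 376_362_740
large_count = 3_000_000_000

print(fits_int32(small_count), fits_int32(large_count))
```

If a count could exceed the 32-bit range, one option under the same assumption would be to write it scaled (for example in thousands) rather than narrowing it directly.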

If I understand it correctly, in this context the non-zero underflow count is caused by the calls to WRITE when using an INTEGER*8 value.

I have logged this for further investigation.
All times are GMT + 1 Hour