forums.silverfrost.com
Welcome to the Silverfrost forums

eegeedee

Joined: 09 Nov 2023
Posts: 7

Posted: Sat Nov 11, 2023 6:06 pm    Post subject: Implausible effect regarding CPU time

I am currently comparing several correlations with regard to accuracy and CPU time. To determine the CPU time I use the Plato option "Plant timing information". During this comparison I observed that LOG(y1) in one FUNCTION requires more CPU time than LOG(y1) in another FUNCTION, although in both cases y1 is an already calculated variable.

The effect in detail:
Below are the FUNCTIONs ONE and TWO, which differ only in the lines "z=..." and "x1=..." in ONE versus "x2=..." in TWO. At the end of both FUNCTIONs LOG(y1) is called. The CPU times are 78.6 nsec for ONE and 71.0 nsec for TWO, i.e. ONE needs about 11% more CPU time than TWO, although the numerical effort is almost the same (12 basic algebraic operations in ONE versus 13 in TWO).
 Code:
FUNCTION one(R,K)
.....
.....
z   = CKiZ*K+AR*Cz
X1  = ((Ap2*z+Ap1)*z+Ap0)/((Bp2*z+Bp1)*z+Bp0)
Y1  = AR*X1+AK
one = LOG(Y1)
END FUNCTION one

FUNCTION two(R,K)
.....
.....
X2  = B0+C0*k+(R*N2+N1)*R/(((k*Dk+D2)*R+D1)*R+D0)
Y1  = AR*X2+AK
two = LOG(Y1)
END FUNCTION two

To analyse this effect further I have written the FUNCTIONs ONE1 and TWO1 below, which are identical to ONE and TWO except that the LOG calls at the end have been removed.
 Code:
FUNCTION one1(R,K)
.....
.....
z   = CKiZ*K+AR*Cz
X1  = ((Ap2*z+Ap1)*z+Ap0)/((Bp2*z+Bp1)*z+Bp0)
Y1  = AR*X1+AK
one1= Y1
END FUNCTION one1

FUNCTION two1(R,K)
.....
.....
X2  = B0+C0*k+(R*N2+N1)*R/(((k*Dk+D2)*R+D1)*R+D0)
Y1  = AR*X2+AK
two1= Y1
END FUNCTION two1

The CPU times are 26.8 nsec for ONE1 and 26.5 nsec for TWO1, i.e. ONE1 needs about 1% more CPU time than TWO1. This confirms my estimate that the numerical effort for the differing parts of ONE and TWO is about the same. It also tells me, looking back at the FUNCTIONs ONE and TWO, that up to the lines "y1=..." both FUNCTIONs need the same CPU time, and that LOG(y1) in ONE requires more CPU time than LOG(y1) in TWO.
What is a possible explanation for this effect? How can I get rid of it?
mecej4

Joined: 31 Oct 2006
Posts: 1839

 Posted: Sun Nov 12, 2023 5:37 am    Post subject:

Your reasoning ignores the fact that loading and storing operands between memory and registers can often consume more processor time than arithmetic operations on operands that are already in registers, and the time required also depends on the usage of the CPU and other caches. Instead of just counting arithmetic operations in the Fortran source lines, you could generate machine-code listings and count the machine instructions used to compute the argument to log() in the two cases.

Last edited by mecej4 on Sun Nov 12, 2023 2:56 pm; edited 1 time in total
PaulLaidler
Site Admin

Joined: 21 Feb 2005
Posts: 7766
Location: Salford, UK

 Posted: Sun Nov 12, 2023 8:37 am    Post subject:

eegeedee

Timings are generally not very consistent; they depend on the other processes running on your machine. Run the tests a few times and take the average. I have a virus checker running on my machine that is more active just after I switch on, and timings appear to improve after a while.

For 64 bits, optimised code generally reduces the number of assembly instructions by about 50%. You can see the details by using /explist on the FTN95 command line.

Again for 64 bits, you will find that the next release of FTN95 (v9.0) will be significantly faster when computing basic maths intrinsics like log.
eegeedee

Joined: 09 Nov 2023
Posts: 7

 Posted: Sun Nov 12, 2023 12:21 pm    Post subject: Re:

mecej4, given my level of skill I am not familiar with loading and storing operands, registers and caches. I am also unable to generate machine code, and even if I could, I would not be able to understand it. I simply write FORTRAN programs as a hobby, compare the accuracies of several correlations with a FORTRAN program, and use the Plant timing analysis to compare the CPU times. And I critically question the results I get. Thanks for your hints, but this is not a direction I am able to take.
mecej4

Joined: 31 Oct 2006
Posts: 1839

 Posted: Sun Nov 12, 2023 2:03 pm    Post subject:

Your approach (measuring run time and trying to correlate it with a count of arithmetic operations) makes some sense for an interpreted language that has only one interpreter on your computer. It has no validity for a compiled language, for which you may have several compilers, each of which may have many options that affect the speed of the program it generates. As an example of a more realistic benchmark, look at the Polyhedron Induct program, https://polyhedron.com/?page_id=175 . If your measure had validity, all the timings in each line of the table would have been the same, because all the entries on a line are for the same Fortran source file and therefore have the same operation count (using your criterion of "operation"). How would you explain why the slowest run took fourteen times as long as the fastest?
eegeedee

Joined: 09 Nov 2023
Posts: 7

 Posted: Sun Nov 12, 2023 2:12 pm    Post subject: Re:

Paul, for my timing analysis I repeated the tests several times. For each run I (1) sum up the CPU times of all FUNCTIONs and (2) calculate for each FUNCTION its percentage of the CPU time. To get reliable results for my "production runs" I disconnect my PC from the internet, restart Windows 11, and wait several minutes so that all services have started completely. Then I start my timing measurement with Plato for all my FUNCTIONs and keep my fingers off the PC until the test has finished. With this procedure the sums of the CPU times for several runs (1) differ by less than +/- 1.3%, and the percentages of the FUNCTIONs (2) deviate by less than +/- 0.7%. This small noise means the results are very well reproducible; therefore I judge these timing results as very reliable.

There is no anti-virus program on my PC; only Windows Defender is on.

I run Plato in the Win32 environment because I have programmed the reference FUNCTION for the accuracy comparison in Extended Precision, and the x64 environment doesn't accept EP. But EP is not mandatory for me; there is no problem in changing to DP. I will run my tests with x64 when (?) v9.0 is available.

Nevertheless, over the last months I have gained some experience with improving programs of this kind at the FORTRAN code level and observing the effect on CPU time. My feeling tells me that the effect I have described is implausible when viewed from the FORTRAN code.
eegeedee

Joined: 09 Nov 2023
Posts: 7

 Posted: Sun Nov 12, 2023 5:40 pm    Post subject: Re:

mecej4, comparing absolute CPU times when comparing correlations makes little sense because these times depend on the compiler, the computer hardware and the OS. Therefore I take one of these correlations as the reference for the CPU time and set the other times in relation to it. This kind of comparison is more reliable, but it still depends on the compiler, hardware and OS. When testing the correlations with Plato, FTN95 and the Plant timing analysis I got a good feeling for the effect of the source code on the CPU time. And the changes I made to improve or optimize the code were, from my perspective, always in accordance with the changes in CPU time, apart from the issue I have posted. In this case I see an implausible effect. What are possible explanations?

1) The effect is caused by my program. Very unlikely; nobody has pointed this out so far.
2) My view of this issue / my interpretation of the data is incorrect. This is possible, but as I posted to Paul, the timing data are very reliable, and the difference of 10% is far outside the noise of the measured times.
3) The effect is a compiler issue. Your link to Polyhedron, and from there to Dr. Appleyard, confirms that this is potentially possible. But a compiler issue is not under my control.

What are my options? Changes to the source code will not remove the effect, because I have already tried a lot of changes. Changing the compiler is not an option for me, because the overhead is too high. So I can only wait for the next FTN95 version 9.0 and repeat my tests. mecej4, thanks for your support.
PaulLaidler
Site Admin

Joined: 21 Feb 2005
Posts: 7766
Location: Salford, UK

 Posted: Sun Nov 12, 2023 9:51 pm    Post subject:

For 32 bits, log(x) is computed directly via an assembly instruction to the coprocessor. That's the limit of my understanding, but I am guessing that the coprocessor performs some kind of series calculation and that the response time might depend on the value supplied. If I wanted to time how long it takes, I would compute log(x) for a fixed value of x many (millions of) times, then try a different value. But in the end 11% is not really that significant.
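Paul's suggestion can be sketched as a small self-contained Fortran program (my own sketch, not code from this thread; the iteration count and test value are arbitrary) that times LOG for one fixed argument:

```fortran
program time_log
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 10000000   ! arbitrary iteration count
  integer(8) :: t0, t1, rate
  integer :: i
  real(dp) :: x, s

  x = 5.0_dp                ! fixed argument; rerun with other values
  s = 0.0_dp
  call system_clock(t0, rate)
  do i = 1, n
     s = s + log(x)         ! accumulate so the call is not optimised away
  end do
  call system_clock(t1)
  ! Caveat: an optimising compiler may still hoist LOG of a loop-invariant
  ! argument out of the loop; compile without optimisation (or read x at
  ! run time) for a meaningful measurement.
  print '(a,f8.2,a)', 'avg per LOG call: ', &
        1.0d9*real(t1 - t0, dp)/(real(rate, dp)*real(n, dp)), ' ns'
  print *, 'checksum: ', s  ! print the sum so it cannot be discarded
end program time_log
```

Running it twice with different values of x would show whether the LOG time is value-dependent, which is the hypothesis being tested here.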
JohnCampbell

Joined: 16 Feb 2006
Posts: 2502
Location: Sydney

 Posted: Mon Nov 13, 2023 3:05 am    Post subject:

@eegeedee, I cannot see your methodology for finding "CPU time".

The intrinsic CPU_TIME is only updated 64 times per second, which gives very poor accuracy for the fine timing you are attempting; an accuracy of 0.0156 seconds poses problems for estimating the performance differences you are identifying.

The precision of the intrinsic SYSTEM_CLOCK depends on the kind of its arguments, so INTEGER*8 arguments should be used. It is based on QueryPerformance_tick(), which is based on RDTSC(); however each transition uses a reduced clock rate (so reduced precision) to report elapsed time. We really need INTEGER*8 functions RDTSC_tick@ and RDTSC_rate@ available for this testing, although using these functions doesn't eliminate all the problems when estimating function performance.

Last edited by JohnCampbell on Tue Nov 14, 2023 5:49 am; edited 1 time in total
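John's two points (the coarse CPU_TIME tick and the argument-kind dependence of SYSTEM_CLOCK) can be checked with a short sketch of my own (not code from the thread); the exact rates reported depend on compiler and OS:

```fortran
program clock_resolution
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer    :: rate4
  integer(8) :: rate8
  real(dp) :: c0, c1

  ! SYSTEM_CLOCK resolution depends on the kind of its arguments:
  call system_clock(count_rate=rate4)   ! default integer: often a coarse rate
  call system_clock(count_rate=rate8)   ! integer*8: usually a much finer rate
  print *, 'ticks/s with default-integer argument:', rate4
  print *, 'ticks/s with integer*8 argument:      ', rate8

  ! Measure the smallest observable CPU_TIME step by spinning until it moves.
  call cpu_time(c0)
  do
     call cpu_time(c1)
     if (c1 > c0) exit
  end do
  print *, 'CPU_TIME step (s):', c1 - c0   ! ~1/64 s on Windows, per the post
end program clock_resolution
```

If the two rates differ by orders of magnitude, that confirms why INTEGER*8 arguments matter for fine-grained timing.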
eegeedee

Joined: 09 Nov 2023
Posts: 7

 Posted: Mon Nov 13, 2023 9:16 pm    Post subject: Re:

@JohnCampbell I simply use the Plato option "Plant Timing Information"; in the Plato Help it is written that RDTSC is used. And I am convinced that these experts can do the timing analysis better than I could with intrinsic functions. For my purposes the quality of the results I get with the Plant Timing Information is sufficient: the times I get are reproducible to within +/-1.3% (see my reply to Paul). My concern is the implausible effect I have posted rather than the method of evaluating the CPU time, because this effect is absolutely (in 100% of my tests) reproducible and, at 10% to 11%, far beyond the reproducibility. Thanks for your reply.
eegeedee

Joined: 09 Nov 2023
Posts: 7

 Posted: Mon Nov 13, 2023 10:59 pm    Post subject: Re:

@PaulLaidler In my post I was a little short with my test description, maybe too short. For the accuracy evaluation my test grid consists of 1 million (R,K) couples with different values. For the CPU time evaluation I run this test 20 times so that the fastest FUNCTION reaches times in the range of one second in the tmr report; the slowest FUNCTION is then in the range of 6 seconds. The tmr report also gives the time for one call of a FUNCTION, which is simply the total time divided by the number of calls. These are the times I gave in my post in nsec.

In order to rule out side effects from the data I also ran the test with a fixed data set and checked the values of all variables inside the FUNCTIONs. They are in the range of +10 to +1E-4 / -10 to -1E-4. For the full test grid the argument Y1 of the LOG function is in the range of 3 to 8; from my point of view nothing that would trigger exception handling. The timing result (+11%) for the fixed data set is the same as the average over 1 million data sets.

You are right, 11% is not really significant. But the relation between the two FUNCTIONs goes against the experience I have gained from the timing analysis of about 40 correlations. Thanks again for your feedback.
PaulLaidler
Site Admin

Joined: 21 Feb 2005
Posts: 7766
Location: Salford, UK

 Posted: Tue Nov 14, 2023 9:54 am    Post subject:

I did a quick test of the processing time for a call to log(x) and it does not appear to vary significantly with the value of x.

The primary way to understand how the time is being consumed is to examine and compare the assembly instructions for the two cases. For 32 bits, the calculations are carried out via a stack associated with the x87 coprocessor. Depending on the complexity of the expression and the limitations of the stack, it is possible that some parts of the calculation must be stored as temporaries in memory. If the number of temporaries required differs in the two cases then you can expect to get different timings. Similarly for 64 bits, the CPU has 16 registers for floating point values but there is still the possibility that parts of the calculation might need to be swapped out to temporary memory on the way.

The only way to see if temporaries are being used (and how many) is to look at the assembly instructions. For FTN95 you get this via /explist on the command line.
eegeedee

Joined: 09 Nov 2023
Posts: 7

Posted: Tue Nov 14, 2023 5:32 pm    Post subject: Re:

@PaulLaidler

I have run my test on an old PC with Windows 10 and a Core i3 processor with 2 cores and got the same result. Below are the results of two runs.
 Code:
             Called       Calls  Page Flts   P/F%   Cpu sec   Avg Cpu
CPU_TIME          1  40,000,000          0  0.000    1.7641   1.764 s
ONE      10,000,000           0          0  0.000    1.4461   0.145 u
TWO      10,000,000           0          0  0.000    1.2706   0.127 u
TWO1     10,000,000           0          0  0.000    0.4180   0.042 u
ONE1     10,000,000           0          0  0.000    0.4120   0.041 u

             Called       Calls  Page Flts   P/F%   Cpu sec   Avg Cpu
CPU_TIME          1  40,000,000          0  0.000    1.3609   1.361 s
ONE      10,000,000           0          0  0.000    1.2913   0.129 u
TWO      10,000,000           0          0  0.000    1.1326   0.113 u
TWO1     10,000,000           0          0  0.000    0.3350   0.033 u
ONE1     10,000,000           0          0  0.000    0.3194   0.032 u

ONE requires about 14% more CPU time than TWO. And TWO1 is a little bit slower than ONE1.
Thanks for your further investigation.
 Quote: The only way to see if temporaries are being used (and how many) is to look at the assembly instructions. For FTN95 you get this via /explist on the command line.

This is not a route I will take myself, and I do not see the need for somebody else to do it for me.
From my point of view this book may be closed, because obviously there is no easy explanation or correction for this effect.
V9.0 has been announced, and I will repeat my test with x64.


Powered by phpBB © 2001, 2005 phpBB Group