forums.silverfrost.com

eegeedee · Joined: 09 Nov 2023 Posts: 7

I am presently comparing several correlations with regard to accuracy and CPU time. To determine the CPU time I use the Plato option "Plant timing information". During this comparison I found that LOG(y1) in one FUNCTION requires more CPU time than LOG(y1) in another FUNCTION, although in both cases y1 is an already calculated variable.

The effect in detail:
Below are the FUNCTIONs ONE and TWO that differ only in the lines "z=..." and "x1=..." in ONE and "x2=..." in TWO. At the end of both FUNCTIONs LOG(y1) is called. The CPU times are 78.6 nsec for ONE and 71.0 nsec for TWO, i.e. ONE needs about 11% more CPU time than TWO, although the numerical effort is almost the same (12 basic algebraic operations in ONE versus 13 in TWO).

mecej4 · Joined: 31 Oct 2006 Posts: 1899

Your reasoning ignores the fact that loading and storing operands from memory to registers and back can often consume more processor time than arithmetic operations on operands that are already in registers, and the time required also depends on the usage of the CPU and other caches.

Instead of just counting arithmetic operations in the Fortran source lines, you could generate machine code listings and count the machine instructions used to compute the argument to log() in the two cases.

PaulLaidler · Posted: Sun Nov 12, 2023 8:37 am

eegeedee

Timings are generally not very consistent. They depend on the effect of other processes running on your machine. Run the tests a few times and take the average.

I have a virus checker running on my machine that is more active when I switch on. Timings appear to improve after a while.

For 64 bits, optimised code generally reduces the number of assembly instructions by about 50%. You can see the details by using /explist on the FTN95 command line.

Again for 64 bits, you will find that the next release of FTN95 (v9.0) will be significantly faster when computing basic maths intrinsics like log.

eegeedee · Joined: 09 Nov 2023 Posts: 7

mecej4

With my level of skill, I am not familiar with loading and storing operands, registers and caches. I am also unable to generate machine code, and even if I could, I would not be able to understand it.

I simply write FORTRAN programs as a hobby, compare the accuracies of several correlations with a FORTRAN program and use the Plant timing analysis to compare the CPU times. And I am critically questioning the results I get.

Thanks for your hints, but this is not the direction I am able to go.

mecej4 · Joined: 31 Oct 2006 Posts: 1899

Your approach (measuring run time and trying to correlate it with a count of arithmetic operations) makes some sense for an interpreted language that has only one interpreter on your computer. It has no validity for a compiled language for which you may have several compilers, each of which may have many options that affect the speed of the program that it generates.

As an example of a more realistic benchmark, look at the Polyhedron Induct program, https://polyhedron.com/?page_id=175 . If your measure had validity, all the timings in each line of the table should have been the same, because all the entries on a line are for the same Fortran source file, and therefore have the same operation count (using your criterion of "operation"). How would you try to explain why the slowest run took fourteen times as much as the fastest?

eegeedee · Joined: 09 Nov 2023 Posts: 7

Paul,
for my timing analysis I repeated the tests several times. For each run I (1) sum up the CPU times of all FUNCTIONs and (2) calculate for each FUNCTION its percentage of the CPU time. To get reliable results for my "production runs" I disconnect my PC from the internet, restart Windows 11, and wait several minutes so that all services have started completely. Then I start my timing measurement with Plato for all my FUNCTIONs and keep my fingers off the PC until the test has finished.
With this procedure the sum of the CPU times (1) differs between runs by less than +/- 1.3%, and the percentages of the FUNCTIONs (2) deviate by less than +/- 0.7%. This small noise means that the results are very well reproducible. Therefore, I judge these timing results as very reliable.

There is no anti virus program on my PC, only Windows defender is on.

I run Plato in the Win32 environment because I have programmed the reference FUNCTION for the accuracy in Extended Precision and the x64 environment doesn't accept EP. But EP is not mandatory for me; there is no problem in changing to DP. I will run my tests with x64 when (?) v9.0 is available.

Nevertheless, over the last months I have gained some experience with improving programs of this kind at the FORTRAN code level and with the effect on CPU time. My feeling tells me that the effect I have described is implausible from the point of view of the FORTRAN code.

eegeedee · Joined: 09 Nov 2023 Posts: 7

mecej4,
comparing absolute CPU times when comparing correlations makes little sense because these times depend on compiler, computer hardware and OS. Therefore I take one of these correlations as the reference for the CPU time and express the other times relative to it. This kind of comparison is more reliable, but it still depends on compiler, computer hardware and OS.

When testing the correlations with Plato, FTN95 and the Plant timing analysis I got a good feeling for the effect of the source code on the CPU time. And the changes I made to improve or optimize the code were, from my perspective, always in accordance with the changes in CPU time, apart from the issue I have posted. In this case I see an implausible effect.
What are possible explanations?
1) This effect is caused by my program. Very unlikely, nobody has told me so up to now.
2) My view of this issue / my interpretation of the data is incorrect. This is possible, but as I have posted to Paul the timing data are very reliable. And the difference of 10% is far outside the noise of the measured times.
3) This effect is a compiler issue. Your link to Polyhedron and from there to Dr. Appleyard confirms that this is potentially possible. But a compiler issue is not under my control.
What are my options? Changes to the source code will not improve the effect, because I have already tried a lot of changes. Changing the compiler is not an option for me, because the overhead is too high. So I can only wait for the next FTN95 version 9.0 and repeat my tests.

mecej4, thanks for your support.

PaulLaidler · Posted: Sun Nov 12, 2023 9:51 pm

For 32 bits, log(x) is computed directly via an assembly instruction to the coprocessor. That's the limit of my understanding but I am guessing that the coprocessor will perform some kind of series calculation and that the response time might depend on the value supplied.

If I wanted to time how long it takes, I would compute log(x) for a fixed value of x many (millions) of times. Then try a different value.

But in the end 11% is not really that significant.
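
The fixed-argument test described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the test that was actually run: the repeat count `n` and the three sample values (chosen inside the 3 to 8 range reported for Y1 in this thread) are made up, and the tiny perturbation added to x exists only to discourage an optimiser from hoisting LOG out of the loop. The reported time per call also includes the loop and addition overhead, so only the comparison between values of x is meaningful.

```fortran
program time_log
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)       ! 64-bit integer kind
  integer, parameter :: dp = selected_real_kind(15, 307) ! double precision kind
  integer, parameter :: n = 20000000                     ! LOG calls per value (assumed)
  real(dp), parameter :: xs(3) = (/ 3.0_dp, 5.5_dp, 8.0_dp /)  ! sample values (assumed)
  integer(i8) :: t0, t1, rate
  real(dp) :: x, s
  integer :: i, k

  call system_clock(count_rate=rate)
  if (rate <= 0) stop 'no system clock available'

  do k = 1, size(xs)
    x = xs(k)
    s = 0.0_dp
    call system_clock(t0)
    do i = 1, n
      ! the perturbation (at most 2.0e-5) keeps the argument loop-dependent
      s = s + log(x + real(i, dp)*1.0e-12_dp)
    end do
    call system_clock(t1)
    ! printing the checksum s keeps the loop from being optimised away
    print '(a,f5.2,a,f8.2,a,es12.4)', 'x = ', x, ': ', &
          1.0e9_dp*real(t1 - t0, dp)/real(rate, dp)/real(n, dp), &
          ' ns per call, checksum ', s
  end do
end program time_log
```

Running it several times and comparing the per-call figures for the different x values gives a rough idea of whether the cost of LOG depends on its argument.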

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

@eegeedee,

I cannot see your methodology for finding "CPU time".

Intrinsic CPU_Time is only updated 64 times per second. It gives very poor accuracy for the fine timing you are attempting. 0.0156 second accuracy poses problems for estimating the performance you are identifying.

Intrinsic SYSTEM_CLOCK depends on the type of arguments, so integer*8 arguments should be used. It is based on QueryPerformance_tick (), which is based on RDTSC(), however each transition uses a reduced clock rate (so reduced precision) to report elapsed time.

We really need integer*8 functions RDTSC_tick@ and RDTSC_rate@ available for this testing, although using these functions doesn't eliminate all the problems when estimating function performance.
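
As a rough way to check these resolution claims on a given machine, the following sketch uses only the standard intrinsics SYSTEM_CLOCK (with 64-bit integer arguments) and CPU_TIME. It is an illustration, not a statement about how FTN95 implements either intrinsic: it spins until each clock reports a new value and prints the size of that step. Because each call itself takes time, the printed step is an upper bound on the true resolution.

```fortran
program clock_resolution
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)       ! 64-bit integer kind
  integer, parameter :: dp = selected_real_kind(15, 307) ! double precision kind
  integer(i8) :: rate, c0, c1
  real(dp) :: t0, t1

  ! ticks per second reported by SYSTEM_CLOCK for 64-bit arguments
  call system_clock(count_rate=rate)
  if (rate <= 0) stop 'no system clock available'
  print *, 'SYSTEM_CLOCK ticks per second: ', rate

  ! spin until SYSTEM_CLOCK returns a different count
  call system_clock(c0)
  do
    call system_clock(c1)
    if (c1 /= c0) exit
  end do
  print *, 'observed SYSTEM_CLOCK step (s):', real(c1 - c0, dp)/real(rate, dp)

  ! spin until CPU_TIME returns a larger value
  call cpu_time(t0)
  if (t0 < 0.0_dp) stop 'CPU_TIME not available'
  do
    call cpu_time(t1)
    if (t1 > t0) exit
  end do
  print *, 'observed CPU_TIME step (s):    ', t1 - t0
end program clock_resolution
```

If CPU_TIME really is updated only 64 times per second, the last line should print a step of about 0.0156 s, which is far too coarse for resolving differences of a few nanoseconds per call unless the total run is very long.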

eegeedee · Joined: 09 Nov 2023 Posts: 7

@JohnCampbell

I simply use the Plato option "Plant Timing Information"; in the Plato Help it's written that RDTSC is used. And I am convinced that these experts can do the timing analysis better than I could with intrinsic functions.

For my purposes the quality of the results I get with the Plant Timing Information is sufficient. The times I get are reproducible to within +/-1.3%. See my reply to Paul.

My concern is the implausible effect I have posted rather than the method used to evaluate the CPU time, because this effect is absolutely reproducible (in 100% of my tests) and, at 10% to 11%, far beyond the reproducibility limit.

Thanks for your reply.

eegeedee · Joined: 09 Nov 2023 Posts: 7

@PaulLaidler
In my post I was a little bit short with my test description, maybe too short. For the accuracy evaluation my test grid consists of 1 million points of (R,K) couples with different values. For the CPU time evaluation I run this test 20 times so that the times in the tmr report are in the range of one second for the fastest FUNCTION. The slowest FUNCTION is then in the range of 6 secs. The tmr report also gives the time for 1 call of a FUNCTION, which is simply the total time divided by the number of calls. These are the times I have given in my post in nsec.

In order to exclude side effects from the data I also ran the test with a fixed data set and checked the values of all variables inside the FUNCTIONs. They are in the range of +10 to +1E-4 / -10 to -1E-4. For the full test grid the argument Y1 of the LOG function is in the range of 3 to 8. From my point of view there is nothing here that would trigger exception handling. The timing result (+11%) for the fixed data set is the same as for the average over 1 million data sets.

You are right, 11% is not really significant. But the relation between the two FUNCTIONs contradicts the experience I have gained from the timing analysis of about 40 correlations.

Thanks again for your feedback.

PaulLaidler · Posted: Tue Nov 14, 2023 9:54 am

I did a quick test of the processing time for a call to log(x) and it does not appear to vary significantly with the value of x.

The primary way to understand how the time is being consumed is to examine and compare the assembly instructions for the two cases.

For 32 bits, the calculations are carried out via a stack associated with the x87 coprocessor. Depending on the complexity of the expression and the limitations of the stack, it is possible that some parts of the calculation must be stored as temporaries in memory. If the number of temporaries required differs in the two cases then you can expect to get different timings.

Similarly for 64 bits, the CPU has 16 registers for floating point values but there is still the possibility that parts of the calculation might need to be swapped out to temporary memory on the way.

The only way to see if temporaries are being used (and how many) is to look at the assembly instructions. For FTN95 you get this via /explist on the command line.

eegeedee · Joined: 09 Nov 2023 Posts: 7

@PaulLaidler

I have run my test on an old PC with Windows 10 and a Core i3 processor with 2 cores and got the same result. Below are the results of two runs.