Silverfrost Forums

Fortran modernisation workshop

16 Oct 2016 11:47 #18132

mecej4,

I find the /timing option to be a good approach. It reports the elapsed time associated with each routine compiled with the /timing option. From the source code I generate two lists of files, depending on whether they are utility routines or code whose delays I want to investigate. I then gather each list into a file of INCLUDE statements and compile the first with /debug and the second with /timing. This encourages you to break up large subroutines into smaller pieces and get timings for each piece, which is helpful when isolating code to improve. Basically, I compile with /timing only the source code that I want to review and where the overhead is acceptable (e.g. I exclude functions that are called millions of times). There is a timing-call overhead on entry to and exit from each routine (based on cpu_clock@/RDTSC_VAL@).

The following is a batch file I used for a large simulation of mine:

now                             >ftn95.tce
del *.obj                      >>ftn95.tce
del *.mod                      >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,
ftn95 sim_ver1_tim     /timing >>ftn95.tce
ftn95 sutil            /debug  >>ftn95.tce
ftn95 util             /debug  >>ftn95.tce
slink  main_tim.txt            >>ftn95.tce
type ftn95.tce
dir aaa_tim.exe
rem run aaa_tim.exe
aaa_tim IH_2009_AB_g40_C205.txt  >sim_tim.tce
 

sim_ver1_tim.f95 is a file of INCLUDE 'xxx.f95' statements (a sketch follows the link list below). main_tim.txt is the link list, which names the .obj files plus some libraries:

lo sim_ver1_tim.obj
lo sutil.obj
lo util.obj
le \clearwin\saplib.mem\saplib.lib
map aaa_tim.map
file aaa_tim.exe
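
sim_ver1_tim.f95 itself is little more than a list of INCLUDE statements gathering the routines I want timed. A minimal sketch (the included file names are invented for illustration):

      ! sim_ver1_tim.f95 : routines to be compiled with /timing
      INCLUDE 'solver.f95'
      INCLUDE 'assemble.f95'
      INCLUDE 'report.f95'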

The timing output consists of two files, .tmo and .tmr. One is aaa_tim.tmo, a .csv file of accumulated elapsed times that is easy to review in Excel. You can see where all the time is being taken and may identify where the code has problems. I find it provides a lot of information at the routine level, which is more helpful than the /profile approach.

I would recommend this approach as worth testing. (I have not yet used this with /64.)

FTN95 is not good with array sections and long strides in array addressing. It can benefit from including SSE vector routines where available.
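
To illustrate what 'long stride' means here (a small sketch of my own, not taken from any particular program): Fortran stores arrays column by column, so a column section is contiguous while a row section jumps through memory by the column length at every element, which is the access pattern that hurts.

      program stride_demo
      integer, parameter :: n = 200
      double precision a(n,n), s
      a = 1.0d0
      s = sum(a(:,5))       ! unit stride: a column is contiguous in memory
      s = s + sum(a(5,:))   ! long stride: a row touches every n-th memory word
      print *, s
      end program stride_demo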

John

16 Oct 2016 12:31 #18133

Thanks, John. Indeed, /timing provides a nicely formatted report with a lot of useful information. Unfortunately, if I compile the same program (HST3D) with /timing, it enters a timer calibration loop and then aborts with the following message before doing any real calculation.

Access Violation.
The instruction at address 004bc598 attempted to read from location ace0930c

I suppose I could try /timing without /opt.

16 Oct 2016 5:34 #18134

FTN95 is not good with array sections and long strides in array addressing.

John, that was a perfect diagnosis.

After running with /timing (without /opt), I found that 98 percent of the time was consumed in an 'envelope storage' solver for positive definite matrices. The solver consists of three subroutines, and the main solver passes pointers to array sections to the subsidiary subroutines. I examined the code and found that all the array sections had unit stride, which means that it would suffice to pass just the first element of each section as the subroutine argument.
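
To illustrate the kind of change involved (the routine and variable names below are invented, not the actual HST3D code): replacing a unit-stride section argument by its first element plus a length lets the compiler pass a plain address instead of possibly building a section descriptor or a temporary copy.

      subroutine demo(ap, n, jstart, m)
      integer n, jstart, m
      double precision ap(n)
      ! before: the compiler may build a descriptor or temporary for the section
      call triad(ap(jstart:jstart+m-1), m)
      ! after: the section has unit stride, so its first element is enough;
      ! the callee uses an F77-style explicit-shape dummy
      call triad(ap(jstart), m)
      end subroutine demo

      subroutine triad(a, m)
      integer m, i
      double precision a(m)       ! explicit-shape dummy, F77 style
      do i = 1, m
         a(i) = 2.0d0*a(i)
      end do
      end subroutine triad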

These changes reduced the run time from 178 s to 4.6 s for the Elder_solute problem. The new runs showed that FTN95 32-bit produced code that was consistently comparable in speed to that produced by Gfortran.

Perhaps there is scope for improvement in the code that FTN95 generates for passing array sections. I doubt that I would have believed the drastic slowdown if I had not experienced it myself. Had the sections been non-unit-stride sections, the conversion would have been more difficult, so help from the compiler would be valuable.

16 Oct 2016 8:52 #18135

It is often the case that using an F77-style wrapper can dramatically improve the run-time performance of these types of calls in FTN95. I find the opposite with ifort, where such wrappers often increase the run time.

Years of using FTN95 have biased my programming style towards simple F77-style calls, which work well. The KISS principle certainly got thrown out with F03/F08. Perhaps Eddie will agree?

I think the 'bug' in FTN95 is that it does not recognise when array sections are contiguous and temporary copies are therefore not required, although I'm not sure which cases break this rule.

Ironic post, given the title of this thread.

John

17 Oct 2016 9:53 #18142

As you challenged me, John, I will reply, although apart from my previous posting I was keeping my head down on this. And before I start, I freely confess that I am a programming dinosaur. My needs were entirely met by Fortran 77 complemented by a graphics package and a few routines to access DOS functions, and for quite a while that graphics package was assembled from a handful of commands to program a plotter and a simple system for putting graphics on a VGA screen. Clearwin+ satisfies my needs for an extension to Fortran, and as 77 is a subset of 95, that's fine by me. I am not surprised that the genetics of FTN95 mean that the Fortran 77 way of doing things works better than the Fortran 9x style, nor that IFORT is the other way around.

As far as optimisation is concerned, this is again a function of programming style. I think that common subexpression removal (for example) is best done by the programmer, although where the common subexpression is a simple variable, this is probably best done by the compiler when it manages registers. What I understand from your explanation to be the mechanism for passing array subsections seems clumsy in the extreme, requiring big chunks of stack, and it is no wonder it's slow and inefficient. I'd program that with three parts:

(array_name, lower_limit, limit_higher)

That seems simple to me, but if

array_name(lower_limit:limit_higher)

is your preference, then why it isn't implemented the same way just causes puzzlement in my mind.

I particularly wanted to talk about 80-bit arithmetic, round-off and efficiency. It's about 30 years since I understood 8086/7 assembler, but I do remember playing around with an idea that came out of Richard Startz's book on programming the 8087. Take the very common requirement to do something like this:

      C=0.0D0
      DO 100 I=1,NUMBER
      C = C + A(I)*B(I)
 100  CONTINUE

One could not avoid incrementing I, nor fetching A(I) and B(I) and multiplying them together, but one could avoid storing the result back in RAM, which was not only slow but also truncated the result from 80 bits in the 8087 registers to 64 bits (assuming the temporary copy was REAL*8). It only took sensible management of the 8087 stack to hold C, and then there was only a tiny overhead instead of a big one. In the days of the 8086/7 even a quite short loop took appreciable time to execute, and my now distant recollection is that doing it the Startz way was at least 10 times faster than the way Microsoft Fortran did it.

Microsoft Fortran also at one stage had two libraries, one in which the 8087 was assumed present and one where the functions were done in software. It didn't take a very complicated calculation for the two to produce different answers, and this is all down to round-off. I've no doubt that things are different with on-chip cache RAM and modern processor architectures, but I was left with a very cautious attitude to round-off, and a belief that there were productivity gains available if compiler writers were prepared to take them.

We then went into a period of incredibly rapid development in raw processor speed, so that if one wanted to do things faster it was a matter of buying an updated PC, and the gains from that outstripped what one could get by playing with the software. That was not always true in the past: at one time it was possible to find oneself using the same computer for 8 to 10 years, and without an optimising compiler. In those days hand optimisation using simple rules always gave significant run-time improvements, and one just got used to programming in that way. I also discovered that straightforward programming with lots of white space made source codes easy to understand many years after they were written.

Eddie

17 Oct 2016 10:57 #18143

Eddie, if you wish to compile snippets of code (such as your dot-product code) and see the assembly output, there are sites such as www.godbolt.org that enable you to do so in a browser window, without having to install compilers, etc. Since godbolt.org only has C/C++ support, I tried

double ddot(double *a,double *b,int *n){
double s=0.0;
for(int i=0; i<*n; i++)s+=*a++ * *b++;
return s;
}

with gcc -O2 and obtained this X64-SSE2 assembly listing, which is notably short (comments added by me):

 mov    edx,DWORD PTR [rdx]           # vector length 
 test   edx,edx 
 jle    L1 
 pxor   xmm0,xmm0                     # s = 0 
 xor    eax,eax                       # i = 0 
 nop    DWORD PTR [rax+0x0]           # pad for alignment? 
 L0: 
 movsd  xmm1,QWORD PTR [rdi+rax*8]    # load a(i) 
 mulsd  xmm1,QWORD PTR [rsi+rax*8]    # multiply by b(i) 
 add    rax,0x1                       # increment i 
 cmp    edx,eax                       # test if done 
 addsd  xmm0,xmm1                     # update s 
 jg     L0 
 repz ret 
 L1: 
 pxor   xmm0,xmm0 
 ret

The body of the loop contains only four instructions, including memory fetches for a(i) and b(i), multiply-and-accumulate, plus two more instructions to increment and test the index i. The result is kept and returned in xmm0.

This is not yet optimal code, since it is not 'vectorized'.
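
For completeness, a minimal Fortran rendering of the same function (my own sketch, essentially Eddie's loop wrapped as a function), which could be fed to FTN95 or gfortran for comparison:

      double precision function ddot(a, b, n)
      integer n, i
      double precision a(n), b(n), s
      s = 0.0d0
      do i = 1, n
         s = s + a(i)*b(i)
      end do
      ddot = s
      end function ddot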

17 Oct 2016 1:35 #18146

Hi Mecej4,

Thanks for the useful link. It does seem to me that compilers could always be improved, but so too can programmers' stylistic efforts. I'm not sure that computer speeds are going up relatively as fast as they did a few years ago, but I remember my first PC costing about four months' income, whereas the fastest one I could buy retail today costs me less than a day's income, and if it wasn't for the fact that I program in a relatively straightforward if old-fashioned style, I certainly wouldn't waste time on hand optimisation today. Whereas for you, and perhaps John Campbell, every speed gain is worth it, I'm not sure that's always the case for everybody, and it is not normally the case for me these days. If a response to user interaction is, as far as I can tell, instantaneous, then halving the time taken is rather meaningless. There are also other ways to get the job done: for example, in a structural analysis program solving multiple load cases it is probably cheaper to run each load case on a separate computer than to labour for months to make it run faster on a single computer.

Round-off and all the issues of finite-precision arithmetic continue to perplex many folk (me included, generally speaking), but using SSEx vectorised arithmetic instead of x87 will give different results for many algorithms, of that I'm sure.

Eddie

19 Oct 2016 1:47 #18172

But did you notice how Mecej4 improved the performance of FTN95 on one of the examples, making it even 2-3 times faster than Intel VF and GFortran? That means there is still a lot of potential for the developers to make this compiler fly at super speed.

21 Oct 2016 11:33 #18199

But I will add -- please finish the debugger first and port the Simpleplot %pl to 64-bit ClearWin.

22 Oct 2016 7:04 #18201

The current release (v8.05) of the compiler comes with a beta release of our 64 bit debugger SDBG64. Compile with /debug and run 'SDBG.exe prog.exe'.

A beta version of a native %pl is now available for testing by using the following link to download new DLLs. Please use with caution and make sure that the existing DLLs are backed up before installing. A text file in the download provides notes on how to use the native %pl.

https://www.dropbox.com/s/2p4n4bjt8bfo7tv/newDlls10.zip?dl=0

Here is an illustrative sample program:

      WINAPP
      INCLUDE <clearwin.ins>
      C_EXTERNAL WINOP@ '__winop'(INSTRING) !Remove this line for a new clearwin.ins
      INTEGER i,x
      INTEGER,PARAMETER::n=1000
      DOUBLE PRECISION p1,p2,p3,y(n)
      INTEGER,EXTERNAL::cb
      !read*,i
      p1=1.5d0
      p2=150.0d0
      p3=15d0
      x=0
      DO i=1,n
        y(i)=p1*sin(x/p3)*exp(-x/p2)
        x=x+1
      ENDDO
      i=winio@('%ww[no_border]%ca[Damped wave]%pv&')
      i=winio@('%fn[Tahoma]&')
      i=winio@('%ts&', 1.1d0)
      i=winio@('%tc&',rgb@(0,0,80))
      i=winio@('%it&')
      i=winio@('%`bg&',rgb@(230,255,225))
      call winop@('%pl[native]')
      call winop@('%pl[width=2]')
      call winop@('%pl[title=''Sample plot'']')
      call winop@('%pl[x_axis=Time(Milliseconds)]')
      call winop@('%pl[y_axis=Amplitude@(-4.0)]')
      call winop@('%pl[style=2]')     ! curve joins points
      call winop@('%pl[smoothing=4]') ! anti-aliasing
      i=winio@('%^pl[colour=red]',500,400,n,0.0d0,1.0d0,y,cb)
      END

      INTEGER FUNCTION cb()
      INCLUDE <clearwin.ins>
      call draw_characters@('Legend:..', 300, 100, 0)
      call draw_line_between@(300,120,360,120,rgb@(0,0,255))
      cb = 0
      END

24 Oct 2016 12:12 #18210

Paul,

the new %pl format seems quite interesting. I have tested it and it works fine, and I like it, although I wonder how to use it to replace the old %dw. In my case I am still using Simpleplot via:

width  = 0.8*clearwin_info@('SCREEN_WIDTH')
height = 0.8*clearwin_info@('SCREEN_DEPTH')
ans    = winio@('%bg[grey]%ww[maximise]%^dw[user_resize]&', width, height, bitmap)   ! Pass bitmapDC to ClearWin

so I can reserve a region for plots and create menus and buttons around it. With %pl I feel a little lost, because it seems it always asks for data to plot, so if I use it I get a 'second' window once the plots are calculated. Or am I forgetting something? As far as I remember, in the past it was possible to define something like i=winio@('%pl[user_drawn]&',400,300) and use a callback to plot something or leave the corresponding place empty, but now this option is no longer there.

I have some other comments about the new %pl format, but I would prefer to have an answer to this first question before going further....

Agustin

24 Oct 2016 6:54 #18211

%dw was the original graphics control that was developed into %gr. So %gr replaces %dw.

The native %pl shares a 'drawing surface' with %gr which means that you can use %gr routines to draw directly to a native graph.

The %pl identifier is described in the original documentation for %pl. The native %pl replicates all of the original %pl syntax apart from the user_drawn option. For further details see the notes that are included in the download.

24 Oct 2016 5:36 #18224

I'm sorry, but I'm lost: %pl always requires a pair of x/y data sets, so I cannot create a %pl window without plotting something.... I find no way to create, as with %gr, an empty window that initially shows the place for the coming plots (like I did with %gr and %dw, i.e. something like just winio@('%pl[options]', width, height)). On the other hand, if I initially plot something with a first call to %pl, then when I make a second call to %pl (because I have a new plot), %pl opens in a second window.

Agustin

25 Oct 2016 12:26 #18226

Uff... I already thought %pl would never be revived from oblivion! A great start, and the changes are in the right direction. The fonts look nice and are in the right places, with tick marks. It is also possible to plot symbols instead of lines, and the whole approach is more aligned with the rest of ClearWin+.

Please do not forget my easy-to-implement suggestions on how to make it produce top-notch quality plots, so that in one of our next Nature or Science papers I can add the caption 'Plotted using Silverfrost Clearwin+'.

One obvious bug shows up in this demo: when you scale the plot size to zero with the mouse, it crashes with a floating-point overflow error.

I also tried the 64-bit debugger for the first time on this very example, but it did not work...

25 Oct 2016 5:50 #18227

aebolzan

%pl is only for plotting 2D graphs; otherwise use %gr.

25 Oct 2016 1:10 #18230

I know that; what I am saying is that you cannot have a blank %pl window when a program starts, as can be done with %dw and %gr, so it cannot be used when a program first makes some calculations and then plots the resulting data within the same single window. If you use %pl during the run of the program, %pl opens in a second window. I do not know if I am clear on this point, am I?..... The new %pl facility is quite interesting, but limited in this respect, unless I am missing something....

Agustin

25 Oct 2016 4:11 #18231

OK. Could you do that with the old %pl?

Perhaps, if ClearWin+ were to process the number of points n as a reference, then the user could set it to zero initially and trigger a redraw with a non-zero value later.

At the moment the native %pl fails if n is zero, but it might be a relatively simple feature to add.

25 Oct 2016 6:17 #18232

Paul,

I have to admit that it was also not possible with the old non-native %pl, but I thought that the new implementation had overcome that limitation. I think it would be very useful to have the option of no plot by setting n=0, as you mention.

By the way: I do not know how difficult it would be to also implement at least three line styles for curves (full, dot, dash) and different types of symbols (square, circle, triangle, both empty and filled). Such styles would make %pl quite versatile for 2D plots.....

Sorry if I am asking too much.....

Agustin

25 Oct 2016 8:15 #18233

Agustin

I will look at the n=0 feature. The other things should be simple to implement but the best I can do for now is to put them on the wish list.

25 Oct 2016 10:41 #18234

Thanks Paul, that would be a good start; as for the rest... well... we can wait, and in the meantime we can use colours to plot different sets of data curves....

Agustin
