forums.silverfrost.com Forum Index

FTN 95 8.10 Personal Edition
Forum Index -> General
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Wed Mar 08, 2017 3:56 am

The new 64-bit 8.10 is fast, and sometimes much faster with the /optimize option, but optimization does not always work and sometimes crashes the code.

The old compiler was never completely fixed for all errors of this kind over the years; I suspect it was difficult to demonstrate the cause on some reasonably small code for the developers to work on.

I'd urge users to try /opt, and if you can minimize the source to a smaller demonstration program, report it to Silverfrost.
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sat Mar 11, 2017 10:58 am

A couple of years back Davidb wrote assembler utilities Vec_Add_SSE, Vec_Sum_SSE, ... to use SSE. As usual, they were just embedded into the Fortran text and recognized. They looked like this:
Code:

! Assembly code is between code, edoc lines
    code
       movupd xmm7%, v            ; move v array to xmm7
       mov eax%, =x               ; address of x
       mov ecx%, =y               ; address of y
.................


Now the 64-bit compiler does not recognize them:

Code:

6942) movsd [ecx%], xmm0%        ; form y(1) = y(1) + a*x(1)
*** Error 29: Syntax Error
6966) movupd [ecx%], xmm0%       ; move xmm0 into next 2 doubles in y
*** Error 29: Syntax Error
*** Error 343: Unrecognised assembler mnemonic - MOVAPD
6976) movapd xmm1%, [eax%+16]    ; move next 2 doubles in x into xmm1
6999) movsd [ecx%], xmm0%
    10 ERRORS  [<VEC_ADD_SSE> FTN95 v8.10.0]


Any ideas on how to resolve this issue?
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Sat Mar 11, 2017 11:06 am

Dan,

FTN95 /64 provides new routines for this.
See ...\ftn95\doc\noteson64bitftn95.txt:
Code:
SSE and AVX support
-------------------------------------------------------------------------------
FTN95 /64 creates machine code that makes some use of the SSE and AVX instruction
sets (see https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions). Users can
also provide direct SSE/AVX support via CODE/EDOC statements in their code (see
below for further details).

Four "BLAS" type library routines (DOT_PRODUCT8@,DOT_PRODUCT4@,AXPY8@ and AXPY4@)
are also provided and these make direct use of the SSE/AVX instruction sets.
In addition, the library function USE_AVX@ can be called in order to instruct these
routines to use AVX rather than SSE when the CPU and operating system make this
possible.

REAL*8 FUNCTION DOT_PRODUCT8@(x,y,n)
REAL*8 x(n),y(n)
INTEGER*8 n

REAL*4 FUNCTION DOT_PRODUCT4@(x,y,n)
REAL*4 x(n),y(n)
INTEGER*8 n

SUBROUTINE AXPY8@(y,x,n,a)
REAL*8 x(n),y(n),a
INTEGER*8 n
(Y = Y + A*X)

SUBROUTINE AXPY4@(y,x,n,a)
REAL*4 x(n),y(n),a
INTEGER*8 n
(Y = Y + A*X)

INTEGER FUNCTION USE_AVX@(level)
INTEGER level
(Set level = 0 for SSE. Set level = 1 for AVX. The function returns the level that
will be used by the current CPU/OS.
The default level is 1 which means that AVX will be used when available otherwise
SSE. If USE_AVX@(1) is called before an ALLOCATE statement then the resultant
addresses will be 32 byte aligned. The USE_AVX@ level must be the same at a
corresponding DEALLOCATE.)

For example:

INTEGER(4),PARAMETER::n=100
REAL(2) DOT_PRODUCT8@,prod,x(n),y(n)
INTEGER USE_AVX@,level
! x = ...; y = ...
level = USE_AVX@(0)
prod = DOT_PRODUCT8@(x,y,n)
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sat Mar 11, 2017 11:28 am

Cool! Thanks, John. At first glance I do not see that they offer exactly the same functionality as the Vec_Add_SSE and Vec_Sum_SSE used in the routine below, but I will look closer:
Code:

    subroutine SSE_BlockSolver
    use clrwin
    use MajorDeclarations
    real*8    FFFF, SUM1, Vec_Sum_SSE
    external  Vec_Sum_SSE
    integer*4 k,i, next_k

    next_k = 100
    Progress = 0
    DO  k=1, nEquat-1

 !........ Progress
      if (k == next_k) then
         Progress = k/(nEquat-1.)
         call temporary_yield@
         call window_update@(Progress)   
         next_k = k+100
      endif
 !....... End Progress

      do I=k+1,IJmax(k)
         FFFF = -AT(k,i)/AT(k,k)
         AT(k,i) = 0.
 !          do  j=k+1,IJmax(k)
 !            AT(j,i) = AT(j,i) - FFFF * AT(j,k)
 !          enddo
         call Vec_Add_SSE ( AT(k+1,i), AT(k+1,k), FFFF, IJmax(k)-k)
         B(i) = B(i) + FFFF * B(k)
      end do
    END DO

 !   X(nEquat) = B(nEquat)/AT(nEquat,nEquat)
 ! 100   SUM1=0.
 !      do j=i+1,IJmax(I)
 !        SUM1 = SUM1 + AT(j,i) * X(j)
 !      enddo
    do i = nEquat, 1, -1
       SUM1  = Vec_Sum_SSE ( AT(i+1,i), X(i+1) , IJmax(I)-i )
       X(i) = (B(i)-SUM1)/AT(i,i)
     end do
 !      i=i-1
 !      IF(i.gt.0) GOTO 100

       if(kLookAtSolution.eq.1) write(*,'( 1pe14.7)') (X(i),i=1,5)
 
 ! 10000   continue
      end subroutine
mecej4
Joined: 31 Oct 2006
Posts: 673

Posted: Sat Mar 11, 2017 2:44 pm

One should be careful when using linear equation solving subroutines that do not implement pivoting, at least partial pivoting.

Adding pivoting, however, need not imply the use of FPU or SSE instructions, since block copies can be performed using memcpy() and friends, which use only integer instructions.
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sat Mar 11, 2017 9:49 pm

Besides that, without pivoting the algorithm becomes super simple. So far I have never seen any problems after removing pivoting, especially if you move to real*8, where rounding errors decrease tremendously while the speed is the same. There were no zeroes on the major diagonal in my physical model, and the numbers there were naturally the largest, or at least not too small. I would not risk doing that for calculating a Mars landing, though. :)

Mecej4, are you familiar with good parallel methods for block matrices (squares of different sizes on the major diagonal)? This is the only reason I use the LAIPE.LIB library, which now has to be recompiled by its author for 64-bit Intel Fortran; it should be partially compatible in LIB form, or fully compatible as a DLL. It is generally a good library and exists for 32-bit IVF and for 32- and 64-bit gFortran, but the 64-bit one has never been tried with FTN95, unless JohnCampbell has already done that. The gFortran build comes for free.

By the way, John promised to come to my North Pole and "collect" from me a small prize (I forget how much: $30, $50, $100?) that I offered a few years back for showing proof that his own methods are faster than LAIPE, but I have never seen a real comparison, even for a simple dense or skyline matrix, and even for 32 bits. Any news, John? :)

Comparisons of different compilers can be seen on the website equation.com.
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Sun Mar 12, 2017 2:58 am

The use of partial pivoting is made more difficult when sparse storage methods, such as banded or skyline storage, are used.
SSE_BlockSolver is a variable-band solver, used for well-conditioned sets of equations.
It appears to use Gaussian elimination with variable-length rows, as DAXPY is used for the forward reduction.
I have not seen examples of pivoting used with banded or skyline solvers, but I presume some "partial" pivoting could be applied.
Typically with these sets of equations, if a diagonal entry is very small, an artificial restraint is applied to the equation.

Dan,

To answer your question: I have found my Laipe comparison results, run on my i7-4790K, i5-2300 and i7-6700HQ. All are 4-core processors.
I've been trying to source new PCs (i7-7700K or i7-6850K) with faster memory and/or more cores, to see whether cache, cores or memory speed is significant, but I don't have the budget.
The Laipe test is to compute [C]=[A][B], where [A], [B] and [C] are 4-byte real matrices: [A] is of order 15,000-by-11,000, [B] is 11,000-by-12,000, and [C] is 15,000-by-12,000.
My tests use 8-byte reals, which doubles the memory requirement (more cache conflicts).
My matrix multiplier includes a cache-size blocking strategy to minimise cache-memory conflicts.
Large matrix multiplication is one of the easiest calculations for applying OpenMP.
One of the interesting outcomes from my tests is that I don't get good efficiency as more threads are introduced, due mainly to problems with hyper-threading 5-8 threads onto 4 cores; but it is elapsed time, rather than efficiency, that is important. (The i7-4790K result is the clearest/worst example of hyper-thread failure I have found.)
Code:
No of       i5      i7      i7    Intel     AMD
Threads    2300    4790K  6700HQ   Xeon  Opteron
cache        4.5       6     4.5  L7555    6168
       1  1108.6   579.8   656.5  5678.2  3493.6
       2   577.9   295.8   373.6  2839.3  1730.2
       3   404.5   201.0   296.6  1896.5  1151.6
       4   318.0   154.9   240.7  1420.4   865.9
       5           196.0   246.8  1136.6   691.4
       6           179.9   232.0   955.1   580.7
       7           190.2   234.4   820.9   498.0
       8           193.6   241.2   745.7   434.8
      32                           204.4   119.6
      48                                    88.6

The processors I have used are your basic Intel i-series processors, the cheap processors available in most stores.
I don't know a lot about the multi-core processors that were used for the Laipe results, but for a single thread they are amazingly slow. One is a many-core Xeon, so it should not be this slow?
To quote great efficiency for multi-threaded calculations with such poor elapsed-time performance is hardly relevant.

John
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sun Mar 12, 2017 4:16 am

Again, John, you are feeding this Shakespeare-country forum with words, words, words. This comparison is not even apples to oranges, but apples to a description of oranges. Take the real LAIPE library and your test and do the elementary:
1) SAME SOURCE SOFTWARE on
2) SAME HARDWARE.

Over the decades I have seen many strange claims and strange test results caused by typos, different assumptions, wrong initial conditions, etc.
Everything must be done as a so-called clean experiment, where there is no other possible explanation. In our case that means everything has to be done side by side in order to get clean results.

Lately on the net, kids compare everything to everything: CPUs, GPUs, cellphones, car fuel efficiency, etc., and not a single novice would do a comparison like the one in your post. No one ever compares, say, different cellphones running even different VERSIONS of the same software! You are comparing one unknown test with another unknown test done on different processors, and claiming that your method is faster! :)

And finally, what cache misses are you talking about? Your cache is around 10 MB, while the memory size is 12000*15000*8 = more than 1 GB! The 12000*15000 multiplications themselves take less than a second out of the ~1000 s your test takes. This is a memory-bandwidth-bound problem. A bad "test", a bad solution method; the processor is doing nothing, just waiting for the SDRAM. :) For this primitive test the cache is used for exactly nothing, because there are no intermediate results that are reused; besides one multiplication per new pair of array elements, nothing else is done. :) The only thing it is good for is showing the scalability of the method with the number of cores, which is exactly what the author of LAIPE does. I do not see matrix multiplication in my LAIPE library, by the way; this is probably some add-on. Take the skyline, block, or just the dense solver, and prove in a straight side-by-side comparison that your method is faster, John. The prize is a good-quality Stoli, whiskey, or $50.

Additionally, if you or anyone else succeeds in adapting 64-bit LAIPE to 64-bit FTN95, and this increases code speed with block matrices versus the current 32-bit LAIPE on 32-bit FTN95, I will double the prize. The same offer applies to any other parallel method for block matrices adapted to 64-bit FTN95, if it is faster than the current 32-bit LAIPE. Worth the fun!


Last edited by DanRRight on Sun Mar 12, 2017 10:35 pm; edited 1 time in total
John-Silver
Joined: 30 Jul 2013
Posts: 503

Posted: Sun Mar 12, 2017 10:32 pm

Referring to Eddie's lead-in comment above, I also saw that Paul commented on another post about the discussion here.
Just to justify the relevance of the discussion to FTN95... look what I dropped upon:

'Poor Dan is in a droop' is a palindrome!!!

Very apt, Dan, for those de-bugging problem posts. LOL

I'm sure someone can come up with another FTN95-related one, more apt, which would make the above example palin (as the Alaskan sister) in comparison. Cool
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sun Mar 12, 2017 11:12 pm

Of course it is relevant to FTN95, specifically to a future FTN XX, which should be parallel: have you noticed that AMD released an 8-core/16-thread processor last week, beating Intel in price and performance? For $300+. Everyone should run to the shops and start thinking "parallel".

So all of the above is much more than just fancy, as in: "General. General discussions on FTN95, Fortran, Third Party tools... basically anything that takes your fancy!"
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Mon Mar 13, 2017 12:13 am

Dan,

Adapting from Shakespeare : "methinks you doth protest too much"

I am not going to run the Laipe approach. It just doesn't make sense, for the performance times they are quoting.

Not sure about some of your comments, but for context:
# the operation count for the calculation is 3,688 Gflop
# my matrix multiply basically uses DAXPY and partitions the matrices to focus on smaller packets.
# memory usage is 1.8 GB, so there are lots of memory-to-cache transfers, which is the significant bottleneck, especially when many threads are operating. This is why a cache-blocking strategy is so important.

To explain the testing I have done:

I re-did my test using real*4 arrays and got interesting results. ( I can send you the test program if you wish)

In my (very old) i5-2300, which is 4-core and 4-thread, i.e. no hyper-threading, but using SSE instructions:
The intrinsic MATMUL takes 660 seconds
The single thread cache strategy takes 517 seconds
The 4-thread cache strategy takes 145 seconds, which is equivalent to 25.4 gflops

Compare this to the quoted Intel Xeon L7555 performance of 5,678 seconds for a single thread and 204 seconds for 32 threads. How can it be so slow?

In my (now old) i7-4790K, which is 4-core and 8-thread, using 1600 MHz memory and AVX instructions:
The intrinsic MATMUL takes 507 seconds
The single thread cache strategy takes 286 seconds
The 8-thread cache strategy takes 66.5 seconds, which is equivalent to 55.5 gflops

Compare this to the quoted AMD Opteron 6168 performance of 3,494 seconds for a single thread and 88.6 seconds for 48 threads.

Perhaps these multi-core processors are not suited to this type of calculation. I would expect the Xeon to support SSE/AVX instructions?
They appear to be incredibly slow; neither is as fast as a basic 4th-gen Intel 4-core processor. A strange result!

The Laipe single-thread times start from such slow performance that, while they may demonstrate good thread efficiency, they don't demonstrate overcoming some of the important problems associated with multi-threading, such as the memory-to-cache bottleneck and having data in cache to enable AVX instructions.

I should point out that in these matrix multiply tests the cache strategy works very well, and so AVX performance on the i7 is working very well. Most other multi-threaded calculations I have do not perform this well. A larger cache and faster memory should make this better, but I have yet to test that.

John
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Mon Mar 13, 2017 1:05 am

Oh my... more words... I'll have to take some palen'aya Stoli... :) I have noticed recently that I cannot explain elementary things to anyone. These gflops are different gflops: they were not obtained in a controlled environment on a similar setup and hardware. And they are not really gflops anyway, because there is hardly any FP work there; it is mostly memory transfers.
John-Silver
Joined: 30 Jul 2013
Posts: 503

Posted: Mon Mar 13, 2017 4:47 am

My comment about relevance was about palindromes, not parallel processing and all the other important stuff!
Paul made a comment on the 2nd 'Native %pl' post, and then I found your palindrome, which I quoted, Dan!
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Tue Mar 14, 2017 1:02 am

Dan,

Rather than more words, here is the test:

To give some FTN95 relevance to my tests, I converted the test program to FTN95 and ran it with 1 thread using FTN95 /64. (The conversion mainly involved limiting the multi-thread options, changing the non-standard timer routines, and including the AXPY4@ routine for vector instructions.)
These tests use real*4 arrays.

The results are not good, especially for MATMUL !

There are 4 different matrix multiplication approaches being tested in the linked program:
FTN95 /64 using MATMUL achieves 0.2 gflops on my i7-4790K
FTN95 /64 using array syntax in the inner loop achieves 0.6 gflops
FTN95 /64 using AXPY4@ in the inner loop achieves 6 gflops
FTN95 /64 using caching and AXPY4@ achieves 11 gflops

MATMUL performance with /64 is very poor.
Any performance below 1 gflops is not good, which shows the penalty for not using SSE/AVX calculations where they are available.

The following links provide the test program and the batch files I have used. ( you may want to stop at test:3 !)
https://www.dropbox.com/s/j1avyv18kvfko4p/laipe4_sf.f90?dl=0
https://www.dropbox.com/s/3w32uns3fihh9rf/do_sf.bat?dl=0
https://www.dropbox.com/s/e1kyhuv598tckjf/run_laipe_sf.bat?dl=0

do_sf.bat is used to perform the tests.

MATMUL is called at line 226
array syntax is stream_matmul_dp : lines 288:303
AXPY4@ in the inner loop is laipe_matmul_dp : lines 305:323
cached + AXPY4@ laipe_matmul_cache : lines 325:356

I tried FTN95 /64 /opt, but this made little change to MATMUL or array syntax performance.

I would recommend the use of laipe_matmul_dp for "small" arrays, although the extension for caching is not a large overhead.

The code includes !$OMP OpenMP syntax where it is available and is an example of its use for matrix multiplication. Matrix multiplication is one of the easiest applications of OpenMP, with little overhead. FTN95 ignores this syntax.

John
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Tue Mar 14, 2017 2:50 am

I tried to download and view the files from your Dropbox on my phone, because I am away from my computer, but the phone complains that it cannot open them. Let me ask you now, so as not to lose a whole day to the time difference with Australia before you or I go to sleep: in what form did you get LAIPE? As a source file, a LIB, or a DLL?

If as a source file, then the performance is obviously not expected to be good, at least until FTN95 is fully optimized and parallelized. The only way I have used LAIPE so far is to link FTN95 OBJ files with a LIB compiled by the fastest compiler. The author has a bunch of different LIBs, but the fastest, approximately 8 years ago with my current laipe.lib library, was the IVF one. The difference can reach a factor of a few between libraries made with different compilers, and even between the 32- and 64-bit libraries of the same compiler; see the benchmarks on his site.

The question is whether gFortran can make a 64-bit (or at least 32-bit) DLL (or maybe you or the author can generate a 64-bit DLL with Intel Fortran; the author promised but still hasn't done it). Then it would be compatible with FTN95 or any other compiler, and that is how it should be used.
Page 2 of 3

Powered by phpBB © 2001, 2005 phpBB Group