forums.silverfrost.com Forum Index

FTN 95 8.10 Personal Edition
Forum Index -> General
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Wed Mar 08, 2017 3:56 am

The new 64-bit 8.10 is fast, and sometimes much faster with the /optimize option, but optimization does not always work and sometimes crashes the code.

The old compiler was never completely fixed for all errors of this kind over the years; I suspect it was difficult to demonstrate the cause on some reasonably small code for the developers to work on.

I'd urge users to try /opt, and if you can minimize the source to a smaller demonstration program, report it to Silverfrost.
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sat Mar 11, 2017 10:58 am

A couple of years back Davidb wrote assembler utilities Vec_Add_SSE, Vec_Sum_SSE, ... to use SSE. As usual, they were just embedded into the Fortran text and recognized. They looked like this:
Code:

! Assembly code is between code, edoc lines
    code
       movupd xmm7%, v            ; move v array to xmm7
       mov eax%, =x               ; address of x
       mov ecx%, =y               ; address of y
.................


Now the 64-bit compiler does not recognize them:

Code:

6942) movsd [ecx%], xmm0%        ; form y(1) = y(1) + a*x(1)
*** Error 29: Syntax Error
6966) movupd [ecx%], xmm0%       ; move xmm0 into next 2 doubles in y
*** Error 29: Syntax Error
*** Error 343: Unrecognised assembler mnemonic - MOVAPD
6976) movapd xmm1%, [eax%+16]    ; move next 2 doubles in x into xmm1
6999) movsd [ecx%], xmm0%
    10 ERRORS  [<VEC_ADD_SSE> FTN95 v8.10.0]


Any ideas on how to resolve this issue?
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Sat Mar 11, 2017 11:06 am

Dan,

FTN95 /64 provides new routines for this.
See ...\ftn95\doc\noteson64bitftn95.txt:
Code:
SSE and AVX support
-------------------------------------------------------------------------------
FTN95 /64 creates machine code that makes some use of the SSE and AVX instruction
sets (see https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions). Users can
also provide direct SSE/AVX support via CODE/EDOC statements in their code (see
below for further details).

Four "BLAS" type library routines (DOT_PRODUCT8@,DOT_PRODUCT4@,AXPY8@ and AXPY4@)
are also provided and these make direct use of the SSE/AVX instruction sets.
In addition, the library function USE_AVX@ can be called in order to instruct these
routines to use AVX rather than SSE when the CPU and operating system make this
possible.

REAL*8 FUNCTION DOT_PRODUCT8@(x,y,n)
REAL*8 x(n),y(n)
INTEGER*8 n

REAL*4 FUNCTION DOT_PRODUCT4@(x,y,n)
REAL*4 x(n),y(n)
INTEGER*8 n

SUBROUTINE AXPY8@(y,x,n,a)
REAL*8 x(n),y(n),a
INTEGER*8 n
(Y = Y + A*X)

SUBROUTINE AXPY4@(y,x,n,a)
REAL*4 x(n),y(n),a
INTEGER*8 n
(Y = Y + A*X)

INTEGER FUNCTION USE_AVX@(level)
INTEGER level
(Set level = 0 for SSE. Set level = 1 for AVX. The function returns the level that
will be used by the current CPU/OS.
The default level is 1 which means that AVX will be used when available otherwise
SSE. If USE_AVX@(1) is called before an ALLOCATE statement then the resultant
addresses will be 32 byte aligned. The USE_AVX@ level must be the same at a
corresponding DEALLOCATE.)

For example:

INTEGER(4),PARAMETER::n=100
REAL(2) DOT_PRODUCT8@,prod,x(n),y(n)
INTEGER USE_AVX@,level
! x = ...; y = ...
level = USE_AVX@(0)
prod = DOT_PRODUCT8@(x,y,n)
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sat Mar 11, 2017 11:28 am

Cool! Thanks, John. At first glance I do not see that they offer exactly the same functionality as the Vec_Add_SSE and Vec_Sum_SSE used in the routine below, but I will look closer:
Code:

    subroutine SSE_BlockSolver
    use clrwin
    use MajorDeclarations
    real*8    FFFF, SUM1, Vec_Sum_SSE
    external  Vec_Sum_SSE
    integer*4 k,i, next_k

    next_k = 100
    Progress = 0
    DO  k=1, nEquat-1

 !........ Progress
      if (k == next_k) then
         Progress = k/(nEquat-1.)
         call temporary_yield@
         call window_update@(Progress)   
         next_k = k+100
      endif
 !....... End Progress

      do I=k+1,IJmax(k)
         FFFF = -AT(k,i)/AT(k,k)
         AT(k,i) = 0.
 !          do  j=k+1,IJmax(k)
 !            AT(j,i) = AT(j,i) - FFFF * AT(j,k)
 !          enddo
         call Vec_Add_SSE ( AT(k+1,i), AT(k+1,k), FFFF, IJmax(k)-k)
         B(i) = B(i) + FFFF * B(k)
      end do
    END DO

 !   X(nEquat) = B(nEquat)/AT(nEquat,nEquat)
 ! 100   SUM1=0.
 !      do j=i+1,IJmax(I)
 !        SUM1 = SUM1 + AT(j,i) * X(j)
 !      enddo
    do i = nEquat, 1, -1
       SUM1  = Vec_Sum_SSE ( AT(i+1,i), X(i+1) , IJmax(I)-i )
       X(i) = (B(i)-SUM1)/AT(i,i)
     end do
 !      i=i-1
 !      IF(i.gt.0) GOTO 100

       if(kLookAtSolution.eq.1) write(*,'( 1pe14.7)') (X(i),i=1,5)
 
 ! 10000   continue
      end subroutine
mecej4
Joined: 31 Oct 2006
Posts: 673

Posted: Sat Mar 11, 2017 2:44 pm

One should be careful when using linear equation solving subroutines that do not implement pivoting, at least partial pivoting.

Adding pivoting, however, need not imply the use of FPU or SSE instructions, since block copies can be performed using memcpy() and friends, which use only integer instructions.
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sat Mar 11, 2017 9:49 pm

Besides that, without pivoting the algorithm becomes super simple. So far I have never seen any problems after removing pivoting, especially if you move to real*8, where rounding errors decrease tremendously while the speed is the same. There were no zeroes on the major diagonal in my physical model, and the numbers there were naturally the largest, or at least not too small. I would not risk doing that for calculating a Mars landing, though. :)

Mecej4, are you familiar with good parallel methods for block matrices (squares of different sizes on the major diagonal)? This is the only reason I use the LAIPE.LIB library, which now has to be recompiled by its author for 64-bit Intel Fortran; it should be partially compatible in LIB form, or fully compatible as a DLL. It is generally a good library and exists for 32-bit IVF and for 32- and 64-bit gFortran, but the 64-bit one has never been tried with FTN95, unless JohnCampbell has already done that. The gFortran build comes for free.

By the way, John promised to come to my North Pole and "collect" from me a small prize (I forget how much: $30, $50, $100?) that I offered a few years back for showing proof that his own methods are faster than LAIPE, but I have never seen a real comparison, even for a simple dense or skyline matrix, and even for 32 bits. Any news, John? :)

Comparisons of different compilers can be seen on the website equation.com.
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Sun Mar 12, 2017 2:58 am

The use of partial pivoting is made more difficult when sparse storage methods, such as banded or skyline storage, are used.
SSE_BlockSolver is a variable-band solver, used for well-conditioned sets of equations.
It appears to use Gaussian elimination with variable-length rows, as DAXPY is used for the forward reduction.
I have not seen examples of pivoting used with banded or skyline solvers, but I presume some "partial" pivoting could be applied.
Typically with these sets of equations, if a diagonal entry is very small, an artificial restraint is applied to the equation.

Dan,

To answer your question: I have found my Laipe comparison results, run on my i7-4790K, i5-2300 and i7-6700HQ. All are 4-core processors.
I've been trying to source new PCs (i7-7700K or i7-6850K) with faster memory and/or more cores, to see whether cache, cores or memory speed is significant, but I don't have the budget.
The Laipe test is to compute [C]=[A][B], where [A], [B] and [C] are 4-byte real matrices: [A] is of order 15,000-by-11,000, [B] is 11,000-by-12,000, and [C] is 15,000-by-12,000.
My tests use 8-byte reals, which doubles the memory requirement (more cache conflicts).
My matrix multiplier includes a cache-size blocking strategy to minimise cache-memory conflicts.
Large matrix multiplication is one of the easiest calculations for applying OpenMP.
One of the interesting outcomes from my tests is that I don't get good efficiency as more threads are introduced, due mainly to problems with hyper-threading 5-8 threads onto 4 cores; but it is elapsed time, rather than efficiency, that is important. (The i7-4790K result is the clearest/worst example of hyper-thread failure I have found.)
Code:
No of       i5      i7      i7    Intel     AMD
Threads    2300    4790K  6700HQ   Xeon  Opteron
cache        4.5       6     4.5  L7555    6168
       1  1108.6   579.8   656.5  5678.2  3493.6
       2   577.9   295.8   373.6  2839.3  1730.2
       3   404.5   201.0   296.6  1896.5  1151.6
       4   318.0   154.9   240.7  1420.4   865.9
       5           196.0   246.8  1136.6   691.4
       6           179.9   232.0   955.1   580.7
       7           190.2   234.4   820.9   498.0
       8           193.6   241.2   745.7   434.8
      32                           204.4   119.6
      48                                    88.6

The processors I have used are your basic Intel i-series processors, the cheap processors available in most stores.
I don't know a lot about the multi-core processors that were used for the Laipe results, but for a single thread they are amazingly slow. One is a many-core Xeon, so it should not be this slow?
To quote great efficiency for multi-threaded calculations with such poor elapsed-time performance is hardly relevant.

John
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sun Mar 12, 2017 4:16 am

Again, John, you are feeding this Shakespeare-country forum with words, words, words. This comparison is not even apples to oranges, but apples to a description of oranges. Take the real LAIPE library and your test and do the elementary:
1) SAME SOURCE SOFTWARE on
2) SAME HARDWARE.

Over the decades I have seen many strange claims and strange test results caused by typos, different assumptions, wrong initial conditions, etc.
Everything must be done as a so-called clean experiment, where there is no other possible explanation. In our case that means everything has to be done side by side in order to get clean results.

Lately on the net, kids compare everything to everything: CPUs, GPUs, cellphones, car fuel efficiency, etc., and not a single novice would do a comparison like the one in your post. No one ever compares, say, different cellphones running even different VERSIONS of the same software! You are comparing one unknown test with another unknown test done on different processors, and claiming that your method is faster! :)

And finally, what cache misses are you talking about? Your cache is around 10 MB, while the memory size is 12000*15000*8 = more than 1 GB! The 12000*15000 multiplications themselves take less than a second out of the ~1000 s your test takes. This is a memory-bandwidth-bound problem. A bad "test", a bad solution method; the processor is doing nothing, just waiting for the SDRAM. :) For this primitive test the cache is used for exactly nothing, because there are no intermediate results that are reused; besides one multiplication per new pair of array elements, nothing else is done. :) The only thing it is good for is showing the scalability of the method with the number of cores, which is exactly what the author of LAIPE does. I do not see matrix multiplication in my LAIPE library, by the way; this is probably some add-on. Take the skyline, block, or just the dense solver, and prove in a straight side-by-side comparison that your method is faster, John. The prize is a good-quality Stoli, whiskey, or $50.

Additionally, if you or anyone else succeeds in adapting 64-bit LAIPE to 64-bit FTN95, and this increases code speed with block matrices versus the current 32-bit LAIPE on 32-bit FTN95, I will double the prize. The same offer applies to any other parallel method for block matrices adapted to 64-bit FTN95, if it is faster than the current 32-bit LAIPE. Worth the fun!


Last edited by DanRRight on Sun Mar 12, 2017 10:35 pm; edited 1 time in total
John-Silver
Joined: 30 Jul 2013
Posts: 503

Posted: Sun Mar 12, 2017 10:32 pm

Referring to Eddie's lead-in comment above, I also saw that Paul commented on another post about the discussion here.
Just to justify the relevance of the discussion to FTN95... look what I dropped upon:

'Poor Dan is in a droop' is a palindrome!!!

Very apt, Dan, for those de-bugging problem posts. LOL

I'm sure someone can come up with another FTN95-related one, more apt, which would make the above example palin (as the Alaskan sister) in comparison. Cool
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Sun Mar 12, 2017 11:12 pm

Of course it is relevant to FTN95, specifically to a future FTN XX, which should be parallel: have you noticed that AMD released an 8-core/16-thread processor last week, beating Intel in price and performance? For $300+. Everyone should run to the shops and start thinking "parallel".

So all of the above is much more than just fancy, as in: "General. General discussions on FTN95, Fortran, Third Party tools... basically anything that takes your fancy!"
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Mon Mar 13, 2017 12:13 am

Dan,

Adapting from Shakespeare : "methinks you doth protest too much"

I am not going to run the Laipe approach. It just doesn't make sense, for the performance times they are quoting.

Not sure about some of your comments, but for context:
# the operation count for the calculation is 3,688 Gflop
# my matrix multiply basically uses DAXPY and partitions the matrices to focus on smaller packets.
# memory usage is 1.8 GB, so there are lots of memory-to-cache transfers, which is the significant bottleneck, especially when many threads are operating. This is why a cache-blocking strategy is so important.

To explain the testing I have done:

I re-did my test using real*4 arrays and got interesting results. ( I can send you the test program if you wish)

In my (very old) i5-2300, which is 4-core and 4-thread, i.e. no hyper-threading, but using SSE instructions:
The intrinsic MATMUL takes 660 seconds
The single thread cache strategy takes 517 seconds
The 4-thread cache strategy takes 145 seconds, which is equivalent to 25.4 gflops

Compare this to the quoted Intel Xeon L7555 performance of 5,678 seconds for a single thread and 204 seconds for 32 threads. How can it be so slow?

In my (now old) i7-4790K, which is 4-core and 8-thread, using 1600 MHz memory and AVX instructions:
The intrinsic MATMUL takes 507 seconds
The single thread cache strategy takes 286 seconds
The 8-thread cache strategy takes 66.5 seconds, which is equivalent to 55.5 gflops

Compare this to the quoted AMD Opteron 6168 performance of 3,494 seconds for a single thread and 88.6 seconds for 48 threads.

Perhaps these multi-core processors are not suited to this type of calculation. I would expect the Xeon to support SSE/AVX instructions?
They appear to be incredibly slow; neither is as fast as a basic 4th-gen Intel 4-core processor. A strange result!

The Laipe single-thread times start from such slow performance that, while they may demonstrate good thread efficiency, they don't demonstrate overcoming some of the important problems associated with multi-threading, such as the memory-to-cache bottleneck and having data in cache to enable AVX instructions.

I should point out that in these matrix multiply tests the cache strategy works very well, and so AVX performance on the i7 is working very well. Most other multi-threaded calculations I have do not perform this well. A larger cache and faster memory should make this better, but I have yet to test that.

John
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Mon Mar 13, 2017 1:05 am

Oh my... more words... I'll have to take some palen'aya Stoli... :) I have noticed recently that I cannot explain elementary things to anyone. These gflops are different gflops: they were not obtained in a controlled environment on a similar setup and hardware. And they are not really gflops anyway, because there is hardly any FP work there; it is mostly memory transfers.
John-Silver
Joined: 30 Jul 2013
Posts: 503

Posted: Mon Mar 13, 2017 4:47 am

My comment about relevance was about palindromes, not parallel processing and all the other important stuff!
Paul made a comment on the 2nd 'Native %pl' post, and then I found your palindrome, which I quoted, Dan!
JohnCampbell
Joined: 16 Feb 2006
Posts: 1739
Location: Sydney

Posted: Tue Mar 14, 2017 1:02 am

Dan,

Rather than more words, here is the test:

To give some FTN95 relevance to my tests, I converted the test program to FTN95 and ran it with 1 thread using FTN95 /64. (The conversion mainly involved limiting the multi-thread options, changing the non-standard timer routines, and including the AXPY4@ routine for vector instructions.)
These tests use real*4 arrays.

The results are not good, especially for MATMUL !

There are 4 different matrix multiplication approaches being tested in the linked program:
FTN95 /64 using MATMUL achieves 0.2 gflops on my i7-4790K
FTN95 /64 using array syntax in the inner loop achieves 0.6 gflops
FTN95 /64 using AXPY4@ in the inner loop achieves 6 gflops
FTN95 /64 using caching and AXPY4@ achieves 11 gflops

MATMUL performance with /64 is very poor.
Any performance below 1 gflops is not good, which shows the penalty for not using SSE/AVX calculations where they are available.

The following links provide the test program and the batch files I have used. ( you may want to stop at test:3 !)
https://www.dropbox.com/s/j1avyv18kvfko4p/laipe4_sf.f90?dl=0
https://www.dropbox.com/s/3w32uns3fihh9rf/do_sf.bat?dl=0
https://www.dropbox.com/s/e1kyhuv598tckjf/run_laipe_sf.bat?dl=0

do_sf.bat is used to perform the tests.

MATMUL is called at line 226
array syntax is stream_matmul_dp : lines 288:303
AXPY4@ in the inner loop is laipe_matmul_dp : lines 305:323
cached + AXPY4@ laipe_matmul_cache : lines 325:356

I tried FTN95 /64 /opt, but this made little change to MATMUL or array syntax performance.

I would recommend the use of laipe_matmul_dp for "small" arrays, although the extension for caching is not a large overhead.

The code includes !$OMP OpenMP syntax where it is available and is an example of its use for matrix multiplication. Matrix multiplication is one of the easiest applications of OpenMP, with little overhead. FTN95 ignores this syntax.

John
DanRRight
Joined: 10 Mar 2008
Posts: 1484
Location: South Pole, Antarctica

Posted: Tue Mar 14, 2017 2:50 am

I tried to download and view the files from your Dropbox on my phone, because I am away from my computer, but the phone complains that it cannot open them. Let me ask you now, so as not to lose a whole day to the time difference with Australia before you or I go to sleep: in what form did you get LAIPE? As a source file, a LIB, or a DLL?

If as a source file, then the performance is obviously not expected to be good, at least until FTN95 is fully optimized and parallelized. The only way I have used LAIPE so far is to link FTN95 OBJ files with a LIB compiled by the fastest compiler. The author has a bunch of different LIBs, but the fastest, approximately 8 years ago with my current laipe.lib library, was the IVF one. The difference can reach a factor of a few between libraries made with different compilers, and even between the 32- and 64-bit libraries of the same compiler; see the benchmarks on his site.

The question is whether gFortran can make a 64-bit (or at least 32-bit) DLL (or maybe you or the author can generate a 64-bit DLL with Intel Fortran; the author promised but still hasn't done it). Then it would be compatible with FTN95 or any other compiler, and that is how it should be used.
Page 2 of 3

Powered by phpBB © 2001, 2005 phpBB Group