
forums.silverfrost.com Welcome to the Silverfrost forums

View previous topic :: View next topic 
Author 
Message 
JohnCampbell
Joined: 16 Feb 2006 Posts: 1959 Location: Sydney

Posted: Mon Jun 04, 2012 2:38 am Post subject: 


This has been a rewarding colaborative effort as I think we are getting some useful results, thanks to David's coding efforts.
Thanks to Dan's prompting, I have now written 4 different test programs to test Dot_Product performance and the alignment impact. I would be pleased to forward these to anyone who wants them, if they provide me with email addresses, via messages.
The simplest is the MATMUL test, which I have now run using both Salford Ver 6.10 and Intel Ver 11.1 on my old Xeon processor. The performance times for 5/6 options I have tested are Code:  Test Salford Intel
MATMUL Intrinsic 14.12 0.90
Simple 3 DO loops 10.09 10.06
2 DO + Dot_Product 10.09 10.06
Column transpose + Dot_Product 1.42 0.85
Column transpose + Vec_Sum 1.43 1.46
Column transpose + Asm_Sum 0.93 
The Intel test was compiled with /o2, which provided the SSE vector instructions.
Core i5 gave similar performance, without selecting AXV instructions.
This demonstrates that Intel's MATMUL combines the temporary transpose column and an optimised Dot_Product to achieve their performance.
For Salford, by using a temporary column and David's SSE2 assembler code, the run times are comparible to Intel's performance.
This demonstrates (in a simple example) that the performance times being achieved by other fortran compilers can be replicated with Salford by selective use of SSE2 instructions in key inner loops. From my limited experience, I have found that SSE and now AVX instructions are a relatively easy path to performance improvement, certainly easier than parallel coding approaches.
I know for my Finite Element package, there are probably only a few key loops that are critical to performance. David has now produced examples for the 2 key loops I have.
The Profile_Test example I posted on 13th May gave 6 similar DO loop coding approaches, which produced significantly different performance times with FTN95. If /opt could address some of these inefficiencies and potentially identify some SSE2 instructions for the iner loop, then FTN95 could achieve relative benchmark performance that this FTN77 achieved.
A useful result for others might be to publish and document a few simpler loop approaches, so that you might be able to utilise SSE2 instructions using FTN95.
Thanks for all the help,
John 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1959 Location: Sydney

Posted: Fri Oct 11, 2013 2:24 am Post subject: 


Building on the example of Davidb, would it be possible to add some support for SSE2, SSE4 or AVX instructions in FTN95 via a math library ?
By placing this in a library, this could limit the changes required to FTN95 to making the instructions available in a CODE / EDOC block.
I have identified two main vector calculations in past posts which I would like supported and it would be good to have these available as they could significantly improve run times when used.
Code:  REAL*8 FUNCTION Dot_Product_8 ( A, B, n )
Real*8 A(*), B(*)
Integer*4 n
acum = 0
do i = 1,n
acum = acum + A(i)*B(I)
end do
Dot_Product_8 = acum
END FUNCTION Dot_Product_8
REAL*8 FUNCTION Dot_Product_SSE2
REAL*8 FUNCTION Dot_Product_SSE4
REAL*8 FUNCTION Dot_Product_AVX
SUBROUTINE Vec_Add_8 ( A, B, Const, n )
Real*8 A(*), B(*), Const
Integer*4 n
do i = 1,n
A(i) = A(i) + Const * B(i)
end do
END SUBROUTINE Vec_Add_8
SUBROUTINE Vec_Add_SSE2
SUBROUTINE Vec_Add_SSE4
SUBROUTINE Vec_Add_AVX

(I have kept the argument list as F77 style to maintain simplicity and not include the complexity of array slices or F90 interface syntax)
Is it possible to investigate this enhancement and see if it could be possible ?
Even supplying as a source code, as David has posted would provide an indication of supporting vector instructions in FTN95.
Vector instructions can be a simpler path to significant performance improvement using FTN95, without the complexity of OpenMP multithread.
I'm sure there are a lot of other vector or matrix calculations that could be supported, but a demonstration of the above two could be a significant start. Certainly a restricted set of MATMUL ( A x B ) or preferably ( A' x B ) as A' is a simpler calculation, based on Dot_Product.
This approach could also provide an opportunity for performance improvement while using FTN95. I would certainly like to understand the opportunities SSE4 and AVX instructions could provide for improving FTN95. Issues associated with alignment could also be investigated, although I understand newer processors have also addresed this issue with AVX. It would be good to be able to test these options !
John
I note there is even newer FMA3 instructions to support the calculation "d = a + b x c" which would suit the Vec_Add calculation. Can these be accessed via the CODE/EDOC blocks ?
Last edited by JohnCampbell on Fri Oct 11, 2013 4:43 am; edited 1 time in total 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1959 Location: Sydney

Posted: Fri Oct 11, 2013 4:39 am Post subject: 


Paul,
I also note there is a compiler option /SSE2 listed from FTN95 /?.
This is not listed in FTN95.CHM.
Is this a new or obsolete option ?
John 

Back to top 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 5418 Location: Salford, UK

Posted: Fri Oct 11, 2013 8:03 am Post subject: 


It is old and does not appear to do anything so I have now removed it from the list. 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1959 Location: Sydney

Posted: Wed Oct 16, 2013 12:40 am Post subject: 


Paul,
I sent an email of an example for some of the real*8 math library, which uses the SSE routines that Davidb provided. I am not sure if they are SSE2 or SSE4 instructions.
I would like to see this library developed and either provided with FTN95 or in the coding examples.
This would provide us users of FTN95 with some access to faster computing, by way of vector instructions for a few basic types of array calculations, using CODE EDOC structures.
If it is possible to document these routines, this could provide a vehicle for enlarging the set.
My understanding of the change required to FTN95 is that more of the SSE4, AVX and possibly FMA3 instructions need to be supported by CODE/EDOC. Is this the case ?
I am at present using the 2 routines that David provided for
[a] . [b] (dot_product) and
[a] = [a] + const * [b]
There are two main areas of computing performance change that is leaving FTN95 behind, being multiple processor or larger vector registers.
My experience from testing both vector instructions and OpenMP is that vector instructions can be a much easier way of achieving 200% to 300% run time improvement, by targeting key inner loops of computation.
I'd be interested in your comments and those from others who might find this of use.
John 

Back to top 


PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 5418 Location: Salford, UK

Posted: Wed Oct 16, 2013 7:55 am Post subject: 


John
Thanks for the enquiry and information.
I have logged it for investigation.
Paul 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1843 Location: South Pole, Antarctica

Posted: Wed Oct 16, 2013 9:30 pm Post subject: 


John,
Post ***simplest possible*** benchmarks here demonstrating the effect on multiple common tasks. Dot product or matrix inversion are OK but great would be to have at least few applications like LUfactorization, Gauss or band matrix solvers for AX=B equations. In a bigger picture, after SSE support added and multithreading now working (buy 16core latest Intel server admittedly very expensive chips and you may see tremendous speedups) this compiler will miss mostly only 64bit support 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1959 Location: Sydney

Posted: Thu Oct 17, 2013 8:52 am Post subject: 


Dan,
I have written a sample program and will try to save it to dropbox.com, so that it can be downloaded.
I have run a number of tests of matrix multiplication, being:
1) using FTN95 MATMUL for C(l,n) = A(l,m) x B(m,n)
2) using vector addition using Vec_Add_SSE written by Davidb
3) using dot_product using Vec_Sum_SSE, by providing A'
The test results are in seconds
Code: 
Test 1 Test 2 Test 3 Description
20.38 2.08 1.77 uniform values of l,m,n = 1600 1000 1400
2.09 1.10 1.12 test with small m l,m,n = 4500 100 3500
3.82 1.89 1.35 test with large l l,m,n = 12000 100 1500
12.06 1.21 1.19 test with large m l,m,n = 550 10000 250
0.62 0.22 0.72 test with large n l,m,n = 2500 10 12000

These results are with /opt compilation and show how SSE instructions can provide a significant improvement.
I'll set up the drop box from home tonight and attach the link here.
John
https://dl.dropboxusercontent.com/s/qvftgj1ezddqaqw/veclib.f90?token_hash=AAHcRRJYzeUZNOTECeJ8nVeaTOPHSULo_TAb8x3R2_1D5w&dl=1
This link appears to contain the test program I used to generate the above results.
The program consists of:
running 5 tests of matrix multiplication, multiplication, using A transpose and then using the MATMUL intrinsic.
The optimised tests are based on the 2 routines that Davidb provided.
There is also use of a timer based on CPU_clock@, which is RDTSC timer that cycles at the processor clock rate ( 2.66 ghz )
Give it a run and let me know if it works in a similar way for you.
It would be good if this could be updated to include AVX routines and could be shared with all FTN95 users who want faster real*8 vector computations. 

Back to top 


davidb
Joined: 17 Jul 2009 Posts: 553 Location: UK

Posted: Thu Oct 17, 2013 6:37 pm Post subject: 


It would be good if the alignment issue in FTN95 could be fixed. Then we could at least write some user contributed routines.
In the meantime, my preference has been just to use FTN95 for development, testing and debugging, and compile with gfortran (or ifort) when I want efficient release code. This is easier now that Clearwin+ is supported on other compilers (though I don't use it much personally).
This way I can take advantage of OpenMP, SSE, AVX etc. _________________ Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1843 Location: South Pole, Antarctica

Posted: Tue Oct 22, 2013 2:35 am Post subject: 


John, davidb,
The speedups I see tell me not too much because it is not clear to me right now if SSE will be actually applied exactly to the bottlenecks of my tasks. We need more tests on general purpose cases. But it is good to ask the authors of my parallel algebra library if they use SSE and if not then ask them to try it on multiple matrix equation solvers they have in there.
Meantime want an exercise of implementing SSE on A*X = B system of linear equations solver? Here is simplest Gauss elimination program modified to solve usual square dense matrix system of equation as well as block matrix ones. The text explains it all. Play with it first (it is written intentionally a bit verbose for others if they not yet started using Clearwin+) and then try to modify the short 18 lines Gauss solver adding SSE case to the last subroutine I left empty. Run the comparison. That way we will see the difference at least on some most commonly used general purpose programs
Code: 
!...compilation : FTN95 Bench.f95 /opt /link
module MajorDeclarations
use clrwin
!.... matrix related
real*8, dimension(:,:), allocatable :: A
real*8, dimension(:), allocatable :: B, X
integer nEquat
integer*4, dimension(:),allocatable :: IJmax
!... GUI
real*8 ElapsedTime, Progress
integer keySSEon
CONTAINS
!......................................
integer function cbRun ()
real*4 tStart, tFinish
real*8 random, SEED ! , DATE_TIME_SEED@
integer li(100)
CALL PERMIT_ANOTHER_callback@()
!...Allocating arrays
allocate ( A(nEquat,nEquat), stat = li(50))
allocate ( B(nEquat), stat = li(51))
allocate ( X(nEquat), stat = li(52))
allocate ( IJmax(nEquat), stat = li(53))
!...Randomly fill the arrays
! call DATE_TIME_SEED@
SEED = 1.234
call SET_SEED@(SEED)
do i=1,nEquat
B(i) = random()
do j=1,nEquat
A(i,j) = random()
enddo
enddo
IJmax(:) = nEquat
ElapsedTime = 0
!...Do the test
CALL CPU_TIME(tStart)
if(keySSEon.eq.0) then
call GAUSS_Square_Block
else
call SSE_BlockSolver
endif
CALL CPU_TIME(tFinish)
!...
ElapsedTime = tFinish  tStart
write(*,'(a,i1, i10,a,f10.3)') 'SSE=',keySSEon, nEquat, ':', ElapsedTime
! call window_update@(ElapsedTime)
!...Deallocating
deallocate ( A )
deallocate ( B )
deallocate ( X )
deallocate ( IJmax )
!...zeroising progress bar
Progress = 0
call window_update@(Progress)
cbRun=1
end function
end module MajorDeclarations
!=================================================================
Program Bench
winapp
use clrwin
use MajorDeclarations
nEquat = 1000
i=winio@('%ww%ca[A*X=B Matrix Benchmark]%1tl&', 21)
i=winio@('Number of equations%ta%il%dd%6rd%`il%ff&',1,10000,100,nEquat)
i=winio@('SSE Method %ta%rb[SSE]%ff %ff&',keySSEon)
i=winio@('%cn%bc%^bt[Start]%ff %ff&', rgb@(237,177,11), cbRun)
i=winio@('%ac[Enter]&',cbRun)
i=winio@('%ac[esc]&','exit')
i=winio@('%cn%30br%ff&', Progress, RGB@(255,0,0))
! i=winio@('Elapsed time %ta%6rf%ff&', ElapsedTime)
i=winio@('%30.15cw', 0)
end

Last edited by DanRRight on Wed Oct 23, 2013 9:26 am; edited 17 times in total 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1843 Location: South Pole, Antarctica

Posted: Tue Oct 22, 2013 2:37 am Post subject: 


Code:  !=============================================================
! Dan R Right 2013
SUBROUTINE GAUSS_Square_Block
use clrwin
use MajorDeclarations
real*8 FFFF, SUM
!.............................................................
! Here IJmax is vector of lengths of rows or columns
! For square dense matrix IJmax(:) = nEquat
! >
! 11 12 13 14
! 21 22 23 24
! 31 32 33 34
! 41 42 43 44
!
! Here IJmax(:) = 4
!
! For block matrix
! 11 12 13
! 21 22 23
! 31 32 33 34 35 36
! 43 44 45 46
! 53 54 55 56
! 63 64 65 66 67 68 69
! 76 77 78 79
! 86 87 88 89
! 96 97 98 99
!
! for clarity the IJmax are plotted below together with the matrix
!
! >
! 11 12 13
! >
! 21 22 23
! >
! 31 32 33 34 35 36
! >
! 43 44 45 46
! >
! 53 54 55 56
! >
! 63 64 65 66 67 68 69
! >
! 76 77 78 79
! >
! 86 87 88 89
! >
! 96 97 98 99
!
!
! You can see that IJmax(1:2)=3, IJmax(3:5)=6, IJmax(6:9)=9
! Since blocks are of squared shape their vertical lengths are the same as horizontal ones.
! But we will be dealing here only with the first square dense matrix case.
! ..........................................................
DO k=1, nEquat1
!........ Progress
Progress = k/(nEquat1.)
if(k/30*30.eq.k) then
call temporary_yield@
call window_update@(Progress)
endif
!....... End Progress
do I=k+1,IJmax(k)
FFFF = A(i,k)/A(k,k)
A(i,k)=0.
do j=k+1,IJmax(k)
A(i,j) = A(i,j)  FFFF * A(k,j)
enddo
B(i)=B(i)FFFF * B(k)
enddo
ENDDO
X(nEquat)=B(nEquat)/A(nEquat,nEquat)
i=nEquat1
100 SUM=0.
do j=i+1,IJmax(I)
SUM = SUM + A(i,j) * X(j)
enddo
X(i)=(B(i)SUM)/A(i,i)
i=i1
IF(i.gt.0) GOTO 100
! Print*,' X=',(X(i),i=1,nEquat)
10000 continue
end subroutine
!============================================================
subroutine SSE_BlockSolver
print*,'SSE needs your contribution'
end subroutine 
Last edited by DanRRight on Wed Oct 23, 2013 11:47 pm; edited 2 times in total 

Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1959 Location: Sydney

Posted: Wed Oct 23, 2013 5:59 am Post subject: 


Dan,
I have not tested the code, but basically the changes I made to your code were:
* change A to A transpose to allow sequential memory access, and
* Include the vector routines in the inner loops.
This should show the benefit of the new instructions.
Note that the 2 routines in the library are required for this solver.
John
ps: I will test the update and copy to dropbox link shortly
The change is: Code:  use clrwin
use MajorDeclarations
real*8 FFFF, SUM, Vec_Sum_SSE
external Vec_Sum_SSE
integer*4 k,i, next_k
next_k = 30
DO k=1, nEquat1
!........ Progress
if (k == next_k) then
Progress = k/(nEquat1.)
call temporary_yield@
call window_update@(Progress)
next_k = k+30
endif
!....... End Progress
do I=k+1,IJmax(k)
FFFF = AT(k,i)/AT(k,k)
AT(k,i) = 0.
! do j=k+1,IJmax(k)
! AT(j,i) = AT(j,i)  FFFF * AT(j,k)
! enddo
call Vec_Add_SSE ( AT(k+1,i), AT(k+1,k), FFFF, IJmax(k)k)
B(i) = B(i) + FFFF * B(k)
end do
END DO
! X(nEquat) = B(nEquat)/AT(nEquat,nEquat)
! 100 SUM=0.
! do j=i+1,IJmax(I)
! SUM = SUM + AT(j,i) * X(j)
! enddo
do i = nEquat, 1, 1
SUM = Vec_Sum_SSE ( AT(i+1,i), X(i+1) , IJmax(I)i )
X(i) = (B(i)SUM)/AT(i,i)
end do
! i=i1
! IF(i.gt.0) GOTO 100
! Print*,' X=',(X(i),i=1,nEquat)
! 10000 continue
end SUBROUTINE GAUSS_Square_Block



Back to top 


JohnCampbell
Joined: 16 Feb 2006 Posts: 1959 Location: Sydney

Posted: Wed Oct 23, 2013 6:50 am Post subject: 


Dan,
I tested on my i5 notebook.
For 1,000 equations : gauss 4.103 seconds, SSE 0.312 seconds
For 1,500 equations : gauss 17.113 seconds, SSE 1.076 seconds
For 2,000 equations : gauss 44.819 seconds, SSE 2.574 seconds
For 3,000 equations : gauss 175.938 seconds, SSE 8.564 seconds
I would suggest that if you applied A' storage to your Gauss routine, then it would run faster, but I would still expect a saving of 2 to 4x with SSE.
I'll post the code tonight. You could add a 3rd option with A' storage.
Also you could test the solution ?
This demonstrates that there is the potential for significant performance improvement for FTN95 if these SSE ( I think SSE4 ) instructions are made available. AVX could provide a further improvement ( unless the SSE4 instructions are implemented in the AVX register on the i5 ?? )
John 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1843 Location: South Pole, Antarctica

Posted: Wed Oct 23, 2013 9:14 am Post subject: 


Wow! 

Back to top 


DanRRight
Joined: 10 Mar 2008 Posts: 1843 Location: South Pole, Antarctica

Posted: Wed Oct 23, 2013 11:59 am Post subject: 


The solution looks fine, you've done good job and incredibly fast, John.
Instead of transposing block matrix case I included parallel library case. Please do the transpose case if you'll have time, my brain does not work now, 3am. Anyway, SSE is still very good, specifically for smaller matrices where parallelization takes too much overhead. Got the following timings
Code: 
matrix size > 1000 2000 3000

Dense_Block 2.22 30.4 127.
SSE 0.12 1.81 6.70
LAIPE 0.09 0.75 2.44

Parallel method though killed the timer, I do not know why. Here is the code, it will need two parallel libraries to work for all three cases.
Code: 
module MajorDeclarations
use clrwin
!.... matrix related
real*8, dimension(:,:), allocatable :: A
real*8, dimension(:,:), allocatable :: AT
real*8, dimension(:), allocatable :: B, Bsave, X
integer nEquat
integer*4, dimension(:),allocatable :: IJmax
!... GUI
real*8 ElapsedTime, Progress
integer keyDenseBlockOn, keySSEon, keyLAIPEon
CONTAINS
!......................................
integer function cbRun ()
real*4 tStart, tFinish
real*8 random, SEED ! , DATE_TIME_SEED@
integer li(100)
CALL PERMIT_ANOTHER_callback@()
allocate ( A(nEquat,nEquat), stat = li(50))
allocate ( AT(nEquat,nEquat),stat = li(55))
allocate ( B(nEquat), stat = li(51))
allocate ( Bsave(nEquat), stat = li(56))
allocate ( X(nEquat), stat = li(52))
allocate ( IJmax(nEquat), stat = li(53))
! call DATE_TIME_SEED@
SEED = 1.234
call SET_SEED@(SEED)
do i=1,nEquat
B(i) = random()
Bsave(i) = B(i)
do j=1,nEquat
A(i,j) = random()
AT(j,i) = A(i,j)
enddo
enddo
IJmax(:) = nEquat
ElapsedTime = 0
!.... Dense_Block
if(keyDenseBlockOn.eq.1) then
CALL CPU_TIME(tStart)
call GAUSS_Square_Block
CALL CPU_TIME(tFinish)
ElapsedTime = tFinish  tStart
write(*,'(a,i2, i10,a,f10.3)') 'DBlock=',keyDenseBlockOn, nEquat, ':', ElapsedTime
endif
!.... SSE
if(keySSEon.eq.1) then
CALL CPU_TIME(tStart)
call SSE_BlockSolver
CALL CPU_TIME(tFinish)
ElapsedTime = tFinish  tStart
write(*,'(a,i2, i10,a,f10.3)') 'SSE =',keySSEon, nEquat, ':', ElapsedTime
endif
CALL CPU_TIME(tFinish)
!.... parallel
if(keyLAIPEon.eq.1) then
Call GetCPUs(NumberOfCPUs)
PRINT*,' NumberOfCPUs=',NumberOfCPUs
do nproc=1,NumberOfCPUs
do i=1,nEquat
do j=1,nEquat
A(j,i) = AT(i,j)
enddo
enddo
B(:) = Bsave(:)
Call SetTasks(nproc)
! CALL CPU_TIME(tStart)
Call GetElapsedTime(tStart)
call LAIPE_parallel_solver
Call GetElapsedTime(tFinish)
! CALL CPU_TIME(tFinish)
ElapsedTime = tFinish  tStart
write(*,'(a,i2, i10,a,f10.3)') 'N CPUs=',nproc, nEquat, ':', ElapsedTime
enddo
endif
deallocate ( A )
deallocate ( AT)
deallocate ( B )
deallocate ( Bsave )
deallocate ( X )
deallocate ( IJmax )
Progress = 0
call window_update@(Progress)
cbRun=1
end function
end module MajorDeclarations 
Last edited by DanRRight on Wed Oct 23, 2013 12:27 pm; edited 5 times in total 

Back to top 




You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum

Powered by phpBB © 2001, 2005 phpBB Group
