forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Fortran 2003/2008
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General
View previous topic :: View next topic  
Author Message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Wed Oct 23, 2013 12:01 pm    Post subject: Reply with quote

Code:

!=================================================================
   Program Bench
   winapp
   use clrwin   
   use MajorDeclarations

   nEquat = 1000
   keyDenseBlockOn=1; keySSEon=0; keyLAIPEon=0
   i=winio@('%ww%ca[A*X=B Matrix Benchmark]%1tl&', 21)
   i=winio@('Number of equations%ta%il%dd%6rd%`il%ff&',1,10000,100,nEquat)

   i=winio@('Dense Block Method1 %ta%`rb[Dense_Block]%ff&',keyDenseBlockOn)
   i=winio@('SSE Method %ta%`rb[SSE] %ff&',keySSEon)
   i=winio@('LAIPE parallel Method %ta%`rb[LAIPE]%ff %ff&',keyLAIPEon)
   i=winio@('%3`ga&', keyDenseBlockOn, keySSEon, keyLAIPEon)
   i=winio@('%cn%bc%^bt[Start]%ff %ff&', rgb@(237,177,11), cbRun)
   i=winio@('%ac[Enter]&',cbRun)   
   i=winio@('%ac[esc]&','exit')   
   i=winio@('%cn%30br%ff&', Progress, RGB@(255,0,0))
!  i=winio@('Elapsed time %ta%6rf%ff&', ElapsedTime)
   i=winio@('%30.15cw', 0)

   end   

!=============================================================
!       Dan R Right 2013

   SUBROUTINE GAUSS_Square_Block
        use clrwin
        use MajorDeclarations
        real*8 FFFF, SUM
!.............................................................
!  IJmax is length of row or column
!  For square dense matrix  IJmax(:) = nEquat
!     |--------->
!     11 12 13 14
!     21 22 23 24
!     31 32 33 34
!     41 42 43 44
!
!     Here IJmax(:) = 4
!
!  For block matrix
!     11 12 13
!     21 22 23
!     31 32 33 34 35 36
!           43 44 45 46
!           53 54 55 56
!           63 64 65 66 67 68 69
!                    76 77 78 79
!                    86 87 88 89
!                    96 97 98 99
!     |------>
!     11 12 13
!     21 22 23
!     |--------------->
!     31 32 33 34 35 36
!           43 44 45 46
!           53 54 55 56
!     |------------------------>
!           63 64 65 66 67 68 69
!                    76 77 78 79
!                    86 87 88 89
!                    96 97 98 99

!
!    IJmax(1:2)=3,   IJmax(3:5)=6,    IJmax(6:9)=9
!
! ..........................................................

       next_k = 30
       Progress = 0

   DO  k=1, nEquat-1

!........ Progress
      if (k == next_k) then
         Progress = k/(nEquat-1.)
         call temporary_yield@
         call window_update@(Progress)   
         next_k = k+30
      endif
!....... End Progress

     do  I=k+1,IJmax(k)
     FFFF = A(i,k)/A(k,k)
     A(i,k)=0.
       do  j=k+1,IJmax(k)
         A(i,j) = A(i,j) - FFFF * A(k,j)
       enddo
     B(i)=B(i)-FFFF * B(k)
     enddo
   ENDDO

   X(nEquat)=B(nEquat)/A(nEquat,nEquat)
   i=nEquat-1

100   SUM=0.
      do j=i+1,IJmax(I)
        SUM = SUM + A(i,j) * X(j)
      enddo

      X(i)=(B(i)-SUM)/A(i,i)
      i=i-1
      IF(i.gt.0) GOTO 100

!      write(*,'( 1pe14.7)') (X(i),i=1,5)

10000   continue
   end subroutine
!============================================================
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Wed Oct 23, 2013 12:03 pm    Post subject: Reply with quote

Code:

!============================================================
    subroutine SSE_BlockSolver
    use clrwin
    use MajorDeclarations
    real*8    FFFF, SUM, Vec_Sum_SSE
    external  Vec_Sum_SSE
    integer*4 k,i, next_k

    next_k = 30
    Progress = 0
    DO  k=1, nEquat-1

 !........ Progress
      if (k == next_k) then
         Progress = k/(nEquat-1.)
         call temporary_yield@
         call window_update@(Progress)   
         next_k = k+30
      endif
 !....... End Progress

      do I=k+1,IJmax(k)
         FFFF = -AT(k,i)/AT(k,k)
         AT(k,i) = 0.
 !          do  j=k+1,IJmax(k)
 !            AT(j,i) = AT(j,i) - FFFF * AT(j,k)
 !          enddo
         call Vec_Add_SSE ( AT(k+1,i), AT(k+1,k), FFFF, IJmax(k)-k)
         B(i) = B(i) + FFFF * B(k)
      end do
    END DO

 !   X(nEquat) = B(nEquat)/AT(nEquat,nEquat)
 ! 100   SUM=0.
 !      do j=i+1,IJmax(I)
 !        SUM = SUM + AT(j,i) * X(j)
 !      enddo
    do i = nEquat, 1, -1
       SUM  = Vec_Sum_SSE ( AT(i+1,i), X(i+1) , IJmax(I)-i )
       X(i) = (B(i)-SUM)/AT(i,i)
     end do
 !      i=i-1
 !      IF(i.gt.0) GOTO 100

 !       write(*,'( 1pe14.7)') (X(i),i=1,5)
 
 ! 10000   continue
      end subroutine

!============================================================
    subroutine  LAIPE_parallel_solver
    use clrwin
    use MajorDeclarations
    LOGICAL*4 NoGood

      CALL Decompose_DAG_8 (A,nEquat,NoGood)
      IF(NoGood) STOP 'LAIPE: Cannot be decomposed'
      CALL Substitute_DAG_8 (A,nEquat,B)

    end subroutine


!============================================================
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Wed Oct 23, 2013 12:13 pm    Post subject: Reply with quote

Dan,

Most of the savings comes from the transposed solution. However the SSE library is still faster than this improved solution. It provides about a 33% improvement.
I have updated your program to do 3 tests. I have not included your latest parallel solution. ( no check on ALLOCATE !)

https://www.dropbox.com/s/tz8qb88amvs2uae/gauss.f90

It is good to see from this that optimising code can still have an effect.
It would be good to see an AVX solution in the suite I have attached.
My testing has shown an improved performance (better than 33%) from the SSE instructions.

Please try the test and let me know what you find.
John
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Wed Oct 23, 2013 1:13 pm    Post subject: Reply with quote

Thanks John, i ran your transposed case (Tr). Here are final stats
Code:

matrix size --> 1000    2000    3000
--------------------------------------
Dense/Block     2.22    30.4    127.
Dense/Block Tr. 0.20    2.06    7.36
SSE             0.12    1.81    6.70
LAIPE           0.09    0.75    2.44
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Wed Oct 23, 2013 1:27 pm    Post subject: Reply with quote

Dan,

Thanks for the results. ( I have changed the version in Dropbox to improve the progress metre.)

The results for SSE improvement from Dense Block Tr. are surprisingly small, in comparison to past tests. I'm not sure why.
It could be :
a problem with alignment in the SSE instructions, or
conflict with the graphics environment, or
SSE only gives this much improvement

Anyway, it does show a benefit from SSE instructions ( 10% in your tests look disappointing) What processor and Hz are you using ?

John
Back to top
View user's profile Send private message
jalih



Joined: 30 Jul 2012
Posts: 196

PostPosted: Wed Oct 23, 2013 3:01 pm    Post subject: Reply with quote

Does anyone here have system that supports AVX-instructions?

I could probably rewrite SSE2 assembler routines to use AVX instructions. It should probably double the performance. My computer is too old to support that stuff.
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Wed Oct 23, 2013 10:13 pm    Post subject: Reply with quote

The computer I used is I7 4770k Haswell 4.5GHz, OC & watercooled, with 16GB RAM, graphics is NVIDIA EVGA 770. Should support AVX2 256bit probably. As always, will be glad to test, jalih

Last edited by DanRRight on Thu Oct 24, 2013 7:37 am; edited 2 times in total
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Wed Oct 23, 2013 11:22 pm    Post subject: Reply with quote

Dan,

Could you try the veclib.f90 code and see if it replicates the gains for SSE instructions.
My tests show reduced SSE improvements with your gauss.f90 test, in comparison to veclib, but much better than you report.

The folowing results are from the latest gauss.f90, showing some variability in run times, but no significant run penalty from the progress metre.

Code:
Gauss.f90 with run metre
 1000 Equations : 0.015 Gb memory
Test=SSE       1000:     0.374  2.86E-06
Test=Tran      1000:     0.608  1.02E-06
Test=Gaus      1000:     4.930  5.16E-06
 2000 Equations : 0.060 Gb memory
Test=SSE       2000:     3.370  5.16E-07
Test=Tran      2000:     4.696  8.79E-07
 3000 Equations : 0.134 Gb memory
Test=SSE       3000:    10.514  3.07E-06
Test=Tran      3000:    14.820  1.43E-06
 4000 Equations : 0.239 Gb memory
Test=SSE       4000:    25.803  1.48E-06
Test=Tran      4000:    34.008  1.17E-05

Gauss.f90 without run metre
 1000 Equations : 0.015 Gb memory
Test=SSE       1000:     0.359  2.86E-06
Test=Tran      1000:     0.515  1.02E-06
Test=Gaus      1000:     4.696  5.16E-06
 2000 Equations : 0.060 Gb memory
Test=SSE       2000:     3.151  5.16E-07   ???
Test=Tran      2000:     5.101  8.79E-07   ???
 3000 Equations : 0.134 Gb memory
Test=SSE       3000:    11.466  3.07E-06   ???
Test=Tran      3000:    14.680  1.43E-06
 4000 Equations : 0.239 Gb memory
Test=SSE       4000:    25.881  1.48E-06
Test=Tran      4000:    33.681  1.17E-05
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Thu Oct 24, 2013 12:11 am    Post subject: Reply with quote

John,
Here are the results together with my earlier today post (just for comparison convenience)

Code:

John Campbell gauss.f90, 2013/10/23 4:49am US Pacific Time, with progress bar, i7 4770k 4.5GHz computer
matrix size --> 1000    2000    3000    4000
--------------------------------------------
Dense/Block     2.16    30.3    127.7   297.
Dense/Block Tr. 0.25    2.22    7.73    18.3
SSE             0.11    1.77    6.72    16.4


i7 4770k 4.5GHz computer, with progress bar
matrix size --> 1000    2000    3000    4000
--------------------------------------------
Dense/Block     2.22    30.4    127.    297.
Dense/Block Tr. 0.20    2.06    7.36    17.5
SSE             0.12    1.81    6.70    16.2

LAIPE           0.09    0.75    2.44    5.90


UPDATE: I ran the test again and the results were a bit slower by approximately 1-3%. Removing progress bar may produce surprisingly slower results but again within the same fluctuation bars. That's typical. Is alignment the reason?
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Thu Oct 24, 2013 11:46 pm    Post subject: Reply with quote

John,
How about testing smaller matrix sizes in 30-1000 territory when they fit into L3/L2 caches? This is huge application area when such matrices are called million times per run. Here hardware parallelization with SSE/AVX vs software based parallelization may specifically shine. For testing that sizes a better timer is needed. You seems have the experience with high resolution timers, can you comment on native FTN95 one vs your own?
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Sun Oct 27, 2013 1:04 am    Post subject: Reply with quote

Dan,

I record elapsed time, rather than CPU_TIME and use SYSTEM_CLOCK (fortran intrinsic) or cpu_clock@ (Salford,which is basically RDTSC).
Veclib.f90 uses QueryPerformanceCounter, which is what SYSTEM_CLOCK is based on.
CPU_TIME is very poor, as it is accurate to only 1/64 second. So are date_and_time, dclock@ GetLocalTime, GetTickCount and GetProcessTimes.

I have modified gauss.f90 to also do the matrix multiplication test of veclib.f90. These calculations have much better improvement with SSE instructions, showing a 2x factor.

I think the gauss solver gives less improvement because of alignment issues that David allowed for in the code. More recent implementations of AVX instructions have improved manageing alignment issues, but there is a lot I don't understand about this area. If AVX instructions are available in FTN95, they should give a 4x factor for real*8 calculations.

Changing the subscript order in the original GAUSS_Square_Block solver identifies the issue of efficient cache usage, which like alignment is a very common problem that is difficult to overcome. These can be fixed by restructuring the data structure, but what is good for one calculation msy not work for other uses of the data set.

I have now updated gauss.for on dropbox which shows these new matrix multiplication tests.

To date we can demonstrate the benefits of SSE instructions for FTN95 but we need access to AVX instructions in CODE/EDOC structures to progress this project.

John
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Mon Oct 28, 2013 12:54 pm    Post subject: Reply with quote

Is this correct that you are counting processors ticks and base time on that? But the problem in this case is that processor clock changes all the time.

SSE part shows up to 80% speedup on smaller matrices up to 1000 in size. Suspect larger matrix timings are more and more RAM bandwidth bound which is typically 1.6GHz. Would be nice if someone confirmed that by taking faster memory (say 2.5 and even 3GHz one) by overclocking it
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Mon Oct 28, 2013 11:22 pm    Post subject: Reply with quote

Dan,

Thanks for your comments. I have never experienced problems of "processor clock changes all the time"
What I do know is that the higher resolution timers of SYSTEM_CLOCK (QueryPerformanceCounter) and CPU_CLOCK@ (RDTSC) both give far better timing reports than any of the other 1/64 second timers.

Also elapsed time is probably a more relevant time than CPU time.

Checking out hardware alternatives can be difficult if you don't have access to a lot of new PC's and where I work has put a go-slow on new hardware. I'd like to be able to test the new i7, but I'm limited to i5 at the moment.

John
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

PostPosted: Tue Oct 29, 2013 7:04 am    Post subject: Reply with quote

John,

It's actually a big fat problem to have Really Good Timer

The real*10 CPU_CLOCK@() counts processor ticks but they are changing according to the power envelope of the processor and the clock frequency is wildly varying as you can see for example with CPU-Z. Of course when we do processor intensive tasks (like ours) CPU is on its maximum. But it takes tme for it to jump to higher clocks rates. Of course our tasks were mostly single-core, but affinity is not locking tasks now to one specific processor core even for single-threaded tasks and in multiprocessing LAIPE case when all 2, 4 or 8 cores are employed, the clock for individual core, L1 cache ratios etc of latest power-saving processors are also varied due to thermal restrictions by power management. So when we benchmark multiprocessed tasks then what is tick here and what is clock time is totally messed up since all cores dance their own tune.

When I took counting ticks route and after finding amount of ticks divided them on 4.5GHz (which is what CPU-Z is reporting, the processor is overclocked) I got that I must use different processor clock by 0.777777 which is exactly 3.5GHz or the clock of non-overclocked processor! Seems the system has no clue that the processor is running faster. Due to that this way of time measurement is not suitable because every kid now overclocks everything from CPU, GPU, RAM to cellphone processors. In their age at the times of first personal computers I overclocked even floppy disk drives to run 3 times faster - that caused huge laugh of everyone who have seen that harddrive-like performance on steroids

Even more surprising was that this ratio stays almost the same no matter how short task is which either tells that power management is superfast or tick is not actually tick but some artificial value scaled to the clock frequency of non-overclocked processor.

So like you I also moved to elapsed time right now starting using again HIGH_RES_CLOCK@(.TRUE.) / HIGH_RES_CLOCK@(.FALSE.) for ON/OFF because it stays the same no matter what but always worry because I remember that in the times of FTN77 this routine caused problems with the entire OS changing system timer to high gear and stayed at it until reboot.

Or I use own clock of LAIPE library because multitasking messed this 1/64 second low resolution timer (and nobody knows what probably else which I am investigating but am very restricted in time. Will look at other methods including ones you use in a month or so)
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

PostPosted: Tue Oct 29, 2013 7:57 am    Post subject: Reply with quote

Dan,

I have not experienced the problem of over-clocking that you describe, or the different cycle rates of the different CPU's.

The file veclib.f90 (in Dropbox) uses CPU_CLOCK@ ( which is RDTSC) and then calibrates it with the WINAPI routine QueryPerformanceCounter.
Could you run this on your 4.5ghz pc and let me know what speed rating you get (the first report when the program starts).
If QueryPerformanceCounter doesn't know you are overclocking, then I suspect it would return 3 ghz.

I have another timing program, which uses about 8 different timers that are available. It would be interesting to see how each of these are calibrated.

Does Date_and_Time give the correct time on your 4.5 ghz pc ?

John
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General All times are GMT + 1 Hour
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
Page 8 of 10

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group