forums.silverfrost.com Welcome to the Silverfrost forums
JohnCampbell
Joined: 16 Feb 2006 Posts: 2593 Location: Sydney
Posted: Thu Mar 14, 2024 1:42 pm Post subject: Re:
DanRRight wrote: | I am sure you've heard that no one optimizes code by hand anymore; compilers do that better than the average programmer. |
I don't see any compiler fixing your mistakes!
Try this changed code and see if you get a more accurate elapsed time.
Code: |
integer, parameter :: i=1000, j=1000, m=1000, n=1
real, allocatable :: DensitySpecies(:,:,:,:)
integer*8 :: idim, nnn, num, ni, two=2
integer :: k, nn, ierr
real :: t1, t2, GByte
integer*8 :: na, nb
logical :: dan = .false.

call delta_sec ('Start tests')
k = 1
do nn = 0,3
   nnn  = two**nn
   ni   = nnn*i
   idim = ni * j * m * n
   gbyte = 4.*idim/1.e9
   print *, '=====================', nn, nnn
   write (*,'(A, f0.3, A, 5i7)') 'Size ',gbyte,' GB, Size i,j,m,n= ', ni,j,m,n
   call cpu_time (t1)
   call delta_sec ('start loop')
   allocate ( DensitySpecies(ni,j,m,n), stat=ierr )
   if ( ierr /= 0 ) print *, '====ierr=', ierr
   call cpu_time (t2)
   call delta_sec ('Allocation time')
   print *, ' Allocation time= ', t2-t1
   DensitySpecies = 123
   call delta_sec ('Initialisation time')
   call cpu_time (t1)
   if ( dan ) then
      DensitySpecies(:,:,:,1) = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
   else
      na = ni
      nb = j*m
      call wrapper_add ( na, nb, DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k) )
   end if
   call cpu_time (t2)
   call delta_sec ('Calculation')
   print *, ' END section :::, time= ', t2-t1
   deallocate (DensitySpecies)
   call delta_sec ('Deallocation')
end do
end

subroutine delta_sec ( desc )
   character desc*(*)
   integer*8 :: clock, rate, last_clock = 0
   real*8 :: sec
   call system_clock ( clock, rate )
   sec = dble(clock-last_clock) / dble(rate)
   write (*,fmt='( f10.4,2x,a )') sec, desc
   last_clock = clock
end subroutine delta_sec

subroutine wrapper_add ( na, nb, accum, add )
   integer*8 :: na, nb
   real :: accum(na,nb), add(na,nb)
   integer*8 :: k
   write (*,fmt='(a,i0,a,i0,a)') 'Add arrays( ',na,', ',nb,' )'
   do k = 1,nb
      accum(:,k) = accum(:,k) + add(:,k)
   end do
end subroutine wrapper_add
|
Unfortunately, no compiler I used stripped out the bad code.
I will have to correct any errors by hand!
It appeared to run successfully in Plato with both FTN95 and Gfortran, up to 32 GBytes.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2593 Location: Sydney
Posted: Thu Mar 14, 2024 2:11 pm Post subject:
You could try this alternative code, selecting axpy4@.
I ran this with FTN95 Release x64 on my Ryzen with 64 GBytes of physical memory. The test used 59 GBytes and ran faster than Gfortran.
Code: |
integer, parameter :: i=1000, j=900, m=1000, n=1
real, allocatable :: DensitySpecies(:,:,:,:)
integer*8 :: idim, nnn, num, ni, two=2
integer :: k, nn, ierr
real :: t1, t2, GByte
integer*8 :: na, nb
! logical :: dan = .true.  , john = .false. , use_avx = .true.
logical :: dan = .false. , john = .false. , use_avx = .true.

call delta_sec ('Start tests')
k = 1
do nn = 0,4
   nnn  = two**nn
   ni   = nnn*i
   idim = ni * j * m * n
   gbyte = 4.*idim/1.e9
   print *, '=====================', nn, nnn
   write (*,'(A, f0.3, A, 5i7)') 'Size ',gbyte,' GB, Size i,j,m,n= ', ni,j,m,n
   call delta_sec ('start loop')
   allocate ( DensitySpecies(ni,j,m,n), stat=ierr )
   if ( ierr /= 0 ) print *, '====ierr=', ierr
   call delta_sec ('Allocation time')
   DensitySpecies = 123
   call delta_sec ('Initialisation time')
   if ( dan ) then
      DensitySpecies(:,:,:,1) = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
   else if ( john ) then
      na = ni
      nb = j*m
      call wrapper_add ( na, nb, DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k) )
   else if ( use_avx ) then
      num = ni*j*m
      call axpy4@ ( DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k), num, 1.0 )
   end if
   call delta_sec ('Calculation')
   deallocate (DensitySpecies)
   call delta_sec ('Deallocation')
end do
end

subroutine delta_sec ( desc )
   character desc*(*)
   integer*8 :: clock, rate, last_clock = 0
   real*8 :: sec
   call system_clock ( clock, rate )
   sec = dble(clock-last_clock) / dble(rate)
   write (*,fmt='( f10.4,2x,a )') sec, desc
   last_clock = clock
end subroutine delta_sec

subroutine wrapper_add ( na, nb, accum, add )
   integer*8 :: na, nb
   real :: accum(na,nb), add(na,nb)
   integer*8 :: k
   write (*,fmt='(a,i0,a,i0,a)') 'Add arrays( ',na,', ',nb,' )'
   do k = 1,nb
      accum(:,k) = accum(:,k) + add(:,k)
   end do
end subroutine wrapper_add
|
I modified the array sizes to fit in my available memory, but the test demonstrates good performance using FTN95 with arrays up to 59 GBytes.
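From the call above, axpy4@ appears to perform a SAXPY-style update, accum = accum + scale*add, over num contiguous elements, presumably using AVX instructions. That semantics is an assumption; a portable plain-Fortran equivalent, useful for checking results against other compilers, might look like this:

```fortran
! Portable sketch of what axpy4@ appears to compute (an assumption based on
! the call site above): accum = accum + scale*add over num contiguous
! elements, i.e. a SAXPY-style update without the AVX acceleration.
subroutine axpy_portable ( accum, add, num, scale )
   integer*8 :: num
   real :: accum(num), add(num), scale
   integer*8 :: k
   do k = 1, num
      accum(k) = accum(k) + scale*add(k)
   end do
end subroutine axpy_portable
```

With scale = 1.0 this matches the wrapper_add result, so either path should give identical output for the test above.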
DanRRight
Joined: 10 Mar 2008 Posts: 2877 Location: South Pole, Antarctica
Posted: Thu Mar 14, 2024 2:30 pm Post subject:
All irrelevant to the subject.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2593 Location: Sydney
Posted: Fri Mar 15, 2024 1:38 am Post subject:
No Dan, it is entirely relevant.
If you have a poor solution approach, the compiler can only go so far.
There is still a need to understand the preferred numerical approaches for large calculations.
With your large 3D mesh, perhaps you should consider sparse calculation techniques to eliminate unnecessary calculations, something even the best optimising compilers can't yet easily do for you.
I think my example showed that you can adapt your code to work around limitations in the compiler and utilise what is available to improve performance. This applies to all compilers, especially Gfortran and FTN95.
DanRRight
Joined: 10 Mar 2008 Posts: 2877 Location: South Pole, Antarctica
Posted: Fri Mar 15, 2024 8:18 am Post subject:
Take the flag and show your skills on the Polyhedron examples. They have been waiting for you for 25 years. Oops, you already tried... Also, MPI, OpenMP and CUDA have been begging for you for 15-20 years. You tried some of those too... so why not with this compiler?
I am not interested in 3% proprietary "improvements" on a single core which do not carry over anywhere else.
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8037 Location: Salford, UK
Posted: Fri Mar 15, 2024 9:00 am Post subject:
Like the Gilbert and Sullivan policeman, the compiler developer's "lot is not a happy one". They must "do what it says on the tin" while at the same time trying to make up for naive mistakes, all within the time and resources available.
I remember a case where a user was having problems inverting a very large matrix via determinants. Why was the compiler so slow!
The current thread reminds me that FTN95 could do more to compensate for the unnecessary use of array sections. It also reveals a stack limitation that has become out of date.
The feedback is useful and will hopefully make FTN95 even better in the future.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2593 Location: Sydney
Posted: Sun Mar 17, 2024 2:42 am Post subject:
Paul,
Could you provide some more information on "Vstack"?
Is it a general replacement for the stack, enabling much larger local or automatic arrays without the need to redefine the stack size?
(Perhaps two stacks would work very well, with Vstack used for large local or automatic arrays, while subroutine argument references and local variables stay on a small near stack.)
As it is only a reserved memory address range, it could be an address offset greater than the physical memory installed (say 128 GBytes). If it needs a long address, it can be anywhere. This has no effect on available memory, as it only takes physical memory when required.
Ifort can place some very large memory strides between stack and heap addresses without any severe performance hit.
Admittedly, an array section temporary larger than half the physical memory (or the configured virtual memory) will always crash the program, so we may need a better test of whether the array section is actually non-contiguous before resorting to a temporary copy.
Unnecessary temporary copies of array sections are a major cause of FTN95's poor performance on the Polyhedron examples.
Please do not resort to the Ifort approach of supporting non-contiguous memory arrays, as it breaks the F77_wrapper approach.
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8037 Location: Salford, UK
Posted: Sun Mar 17, 2024 2:22 pm Post subject:
John,
The current virtual stack might be replaced, subject to the planned review of this issue, so it would probably be best to await the outcome before I provide further information. I will make sure that this has a high priority.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2593 Location: Sydney
Posted: Mon Mar 18, 2024 12:43 am Post subject:
Paul,
The concept of a larger stack for automatic or large local arrays, plus temporary arrays, is very good.
The use of large virtual address strides also provides flexibility for a very large Vstack and heap. You could review Gfortran and Ifort load maps to identify the strides they provide with no appreciable performance problem.
This would leave the conventional stack small, managing subroutine argument lists and smaller local variables, and so able to use short addresses.
In my programs, most arrays are on the heap, which uses long addresses but still provides good performance.
I look forward to the review and hope that it can lead to fewer stack overflow errors!
John
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 8037 Location: Salford, UK
Posted: Tue Apr 09, 2024 3:27 pm Post subject:
Here is the outcome of the promised review of what has been called the 64-bit "virtual stack".
The next release of FTN95 and its associated DLLs will be amended and the following instructions will apply (i.e. these will be the new instructions).
The compiler generates temporary blocks of data, for example, when passing non-contiguous array sections and when functions return array-valued results.
For 64-bit programs, by default this temporary data is allocated from a private heap that is separate from the global heap used for ALLOCATE statements in the user's program.
The compiler uses the global heap rather than this private heap when /ALLOCATE is added to the FTN95 command line, but code created with this option could run more slowly.
The default size of this private heap is 128 GB. This is the reserved size, not the committed size, so reducing this value should have no impact on performance.
It should not be necessary to increase this value: if it is too small, runtimes will probably be unacceptable because of the amount of data being copied.
However, the default can be set by using /VSTACK <size> on the FTN95 command line (<size> is the required number of GBs). Alternatively, the default can be changed by a call to HEAP_RESERVE...
SUBROUTINE HEAP_RESERVE(RESERVE)
INTEGER RESERVE
This routine must be called before calling other routines. It sets the reserve size of the private heap as the number of GBs required.
There is no known advantage in setting this value below its default value.
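The two configuration routes described above can be sketched as follows; the program name and the 32 GB figure are illustrative only:

```fortran
! Run-time route: reserve a 32 GB private heap for compiler temporaries.
! Per the instructions above, HEAP_RESERVE must be called before calling
! any other routines.
! (Compile-time alternative, per the same instructions:
!    FTN95 mytest.f95 /64 /VSTACK 32 )
program mytest
   call heap_reserve ( 32 )
   ! ... remainder of the program: allocations, array-section work, etc.
end program mytest
```

Since 128 GB is only a reserved (not committed) size, most programs should need neither route and can simply take the default.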
Powered by phpBB © 2001, 2005 phpBB Group