Insufficient virtual stack with 64bits
JohnCampbell



Joined: 16 Feb 2006
Posts: 2593
Location: Sydney

Posted: Thu Mar 14, 2024 1:42 pm    Post subject: Re:

DanRRight wrote:
I am sure you've heard that no one already optimizes codes by hand anymore, compilers do that better than average programmer.


I don't see any compiler fixing your mistakes!

Try this changed code and see whether you get a more accurate elapsed time.

Code:
  integer, parameter :: i=1000, j=1000, m=1000, n=1
  Real, allocatable :: DensitySpecies(:,:,:,:)

  integer*8 :: idim, nnn, num, ni, two=2
  integer   :: k, nn, ierr
  real      :: t1, t2, GByte
  integer*8 :: na, nb
  logical   :: dan = .false.

    call delta_sec ('Start tests')
    k = 1
    do nn = 0,3
      nnn  = two**nn
      ni   = nnn*i
      idim = ni * j * m * n
      gbyte = 4.*idim/1.e9
      print*,'=====================', nn, nnn
      write(*,'(A, f0.3, A, 5i7)') 'Size ',gbyte,' GB, Size i,j,m,n= ', ni,j,m,n
   
      call cpu_time (t1)
      call delta_sec ('start loop')
      allocate ( DensitySpecies(ni,j,m,n), stat=ierr )
      if(ierr.ne.0) print*, '====ierr=', ierr
      call cpu_time (t2)
      call delta_sec ('Allocation time')
      print*,'                      Allocation time= ', t2-t1
   
      DensitySpecies = 123
   
      call delta_sec ('Initialisation time')
      call cpu_time (t1)
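      if ( dan ) then
        ! whole-array sections; the compiler may create a temporary copy here
        DensitySpecies(:,:,:,1) = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
      else
        ! F77-style wrapper: pass the first elements and treat the data as 2D arrays
        na = ni
        nb = j*m
        call wrapper_add ( na, nb, DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k) )
      end if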
      call cpu_time (t2)
      call delta_sec ('Calculation')
      print*,'                      END section :::, time= ', t2-t1
   
      deallocate (DensitySpecies)
      call delta_sec ('Deallocation')
    end do

  END

  subroutine delta_sec ( desc )
    character desc*(*)
    integer*8 :: clock, rate, last_clock = 0
    real*8    :: sec
     call system_clock ( clock, rate )
     sec = dble(clock-last_clock) / dble(rate)
     write (*,fmt='( f10.4,2x,a )') sec, desc
     last_clock = clock
  end subroutine delta_sec

  subroutine wrapper_add ( na, nb, accum, add )
    integer*8 :: na, nb
    real :: accum(na,nb), add(na,nb )
    integer*8 :: k

    write (*,fmt='(a,i0,a,i0,a)') 'Add arrays( ',na,', ',nb,' )'
    do k = 1,nb
      accum(:,k) = accum(:,k) + add(:,k)
    end do
  end subroutine wrapper_add


Unfortunately, no compiler I used stripped out the bad code.
I will have to correct any errors by hand!
It appeared to run successfully in Plato with FTN95 and Gfortran up to 32 GBytes.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2593
Location: Sydney

Posted: Thu Mar 14, 2024 2:11 pm

You could try this alternative code, which selects axpy4@ (use_avx = .true.).

I ran this with FTN95 Release x64 on my Ryzen with 64 GBytes of physical memory. The test used 59 GBytes and ran faster than Gfortran.
Code:
  integer, parameter :: i=1000, j=900, m=1000, n=1
  Real, allocatable :: DensitySpecies(:,:,:,:)

  integer*8 :: idim, nnn, num, ni, two=2
  integer   :: k, nn, ierr
  real      :: t1, t2, GByte
  integer*8 :: na, nb
!  logical   :: dan = .true. ,  john = .false. ,  use_avx = .true.
  logical   :: dan = .false. ,  john = .false. ,  use_avx = .true.

    call delta_sec ('Start tests')
    k = 1
    do nn = 0,4
      nnn  = two**nn
      ni   = nnn*i
      idim = ni * j * m * n
      gbyte = 4.*idim/1.e9
      print*,'=====================', nn, nnn
      write(*,'(A, f0.3, A, 5i7)') 'Size ',gbyte,' GB, Size i,j,m,n= ', ni,j,m,n
   
!      call cpu_time (t1)
      call delta_sec ('start loop')
      allocate ( DensitySpecies(ni,j,m,n), stat=ierr )
      if (ierr.ne.0) print*, '====ierr=', ierr
!      call cpu_time (t2)
      call delta_sec ('Allocation time')
!      print*,'                      Allocation time= ', t2-t1
   
      DensitySpecies = 123
   
      call delta_sec ('Initialisation time')
!      call cpu_time (t1)
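      if ( dan ) then
        ! whole-array sections; the compiler may create a temporary copy here
        DensitySpecies(:,:,:,1) = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
      else if ( john ) then
        ! F77-style wrapper: pass the first elements and treat the data as 2D arrays
        na = ni
        nb = j*m
        call wrapper_add ( na, nb, DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k) )
      else if ( use_avx ) then
        ! axpy4@ over num elements, treating the array as a flat vector
        num = ni*j*m
        call axpy4@ ( DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k), num, 1.0 )
      end if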
!      call cpu_time (t2)
      call delta_sec ('Calculation')
!      print*,'                      END section :::, time= ', t2-t1
   
      deallocate (DensitySpecies)
      call delta_sec ('Deallocation')
    end do

  END

  subroutine delta_sec ( desc )
    character desc*(*)
    integer*8 :: clock, rate, last_clock = 0
    real*8    :: sec
     call system_clock ( clock, rate )
     sec = dble(clock-last_clock) / dble(rate)
     write (*,fmt='( f10.4,2x,a )') sec, desc
     last_clock = clock
  end subroutine delta_sec

  subroutine wrapper_add ( na, nb, accum, add )
    integer*8 :: na, nb
    real :: accum(na,nb), add(na,nb )
    integer*8 :: k

    write (*,fmt='(a,i0,a,i0,a)') 'Add arrays( ',na,', ',nb,' )'
    do k = 1,nb
      accum(:,k) = accum(:,k) + add(:,k)
    end do
  end subroutine wrapper_add


I modified the array sizes to fit in my available memory, but the test demonstrates good performance using FTN95 with arrays up to 59 GBytes.
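
To make the size arithmetic in the test above explicit, here is a minimal sketch that only repeats the element count and GByte calculation for each pass, using the same i, j, m, n values. The program name and the i8 kind parameter are my own choices for illustration. The point is that the element count must be held in a 64-bit integer: the last pass has 16000 x 900 x 1000 = 14,400,000,000 elements (about 57.6 GB at 4 bytes each), well beyond the default 32-bit integer limit of 2,147,483,647.

```fortran
program size_check
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integer kind
  integer, parameter :: i = 1000, j = 900, m = 1000, n = 1
  integer(i8) :: ni, idim
  integer     :: nn
  real        :: gbyte

  do nn = 0, 4
    ni    = 2_i8**nn * i            ! first dimension doubles on each pass
    idim  = ni * j * m * n          ! product is evaluated in 64-bit arithmetic
    gbyte = 4.0 * real(idim) / 1.0e9
    write (*,'(a,i0,a,i0,a,f0.1,a)') 'nn=', nn, '  elements=', idim, &
                                     '  size=', gbyte, ' GB'
  end do
end program size_check
```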
DanRRight



Joined: 10 Mar 2008
Posts: 2877
Location: South Pole, Antarctica

Posted: Thu Mar 14, 2024 2:30 pm

All irrelevant to the subject
JohnCampbell



Joined: 16 Feb 2006
Posts: 2593
Location: Sydney

Posted: Fri Mar 15, 2024 1:38 am

No Dan, it is relevant.

If you have a poor solution approach, the compiler can only go so far.

There is still some need for understanding preferred numerical approaches in large calculations.

With your large 3d mesh, perhaps you should consider sparse calculation techniques to eliminate unnecessary calculations, which even the best optimising compilers can't yet easily implement.

I think my example showed that you can adapt to remove limitations in the compiler and utilise what is available to improve performance. This applies to all compilers, especially Gfortran and FTN95.
DanRRight



Joined: 10 Mar 2008
Posts: 2877
Location: South Pole, Antarctica

Posted: Fri Mar 15, 2024 8:18 am

Flag into your hands and show your skills on Polyhedron examples. They have been waiting for you for 25 years. Oops, you already tried... MPI, OpenMP and CUDA have also been begging for you for 15-20 years. You tried some of those too... why not with this compiler? :)

Not interested in 3% proprietary "improvements" on a single core which do not go anywhere else.
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 8037
Location: Salford, UK

Posted: Fri Mar 15, 2024 9:00 am

Like the Gilbert and Sullivan policeman, the compiler developer's "lot is not a happy one". They must "do what it says on the tin" but try at the same time to make up for naive mistakes. All within the time and resources available.

I remember a case where a user was having problems inverting a very large matrix via determinants. Why was the compiler so slow!

The current thread reminds me that FTN95 could do more to compensate for the unnecessary use of array sections. It also reveals a stack limitation that has become out of date.

The feedback is useful and hopefully will make FTN95 even better in the future.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2593
Location: Sydney

Posted: Sun Mar 17, 2024 2:42 am

Paul,

Could you provide some more information on "Vstack"?

Is it a general replacement for the STACK, enabling much larger local or automatic arrays without the need to redefine the stack size?
(Perhaps two stacks would work very well: Vstack for large local or automatic arrays, and a small near stack for subroutine argument references and local variables.)

As it is a memory address, it could be at an address offset greater than the physical memory installed (say 128 GBytes). If it needs a long address, it can be anywhere. This has no effect on available memory, as it only takes physical memory when required.

Ifort can place some very large memory strides for stack and heap addresses, without any severe performance hits.

Admittedly an array section that is larger than half the physical memory or configured virtual memory will always crash the program, so we may need a better test that the array section is not contiguous before resorting to a temporary copy.
Unnecessary temporary copies of array sections are a major cause of FTN95's poor performance in the Polyhedron examples.
Please do not resort to the Ifort approach of supporting non-contiguous memory arrays, as it breaks the F77_wrapper approach.
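
As a hedged illustration of the "F77_wrapper approach" mentioned above, the sketch below passes the first element of a contiguous allocatable array to a routine with an explicit-shape dummy, so the callee sees the same storage as a flat vector and no array section (and so no temporary copy) is involved. The names f77_wrapper_demo and scale_flat are hypothetical and the sizes are tiny on purpose; this is not code from the thread. It relies on the array being contiguous in memory, which is why non-contiguous array storage would break it.

```fortran
program f77_wrapper_demo
  implicit none
  integer, parameter :: n1 = 4, n2 = 3, n3 = 2
  real, allocatable :: a(:,:,:)

  allocate ( a(n1,n2,n3) )
  a = 1.0

  ! Pass the first element only; sequence association lets the callee
  ! address all size(a) elements as one contiguous vector.
  call scale_flat ( a(1,1,1), size(a), 2.0 )

  print *, 'sum = ', sum(a)     ! 24 elements, each now 2.0
  deallocate ( a )
end program f77_wrapper_demo

subroutine scale_flat ( x, n, factor )
  implicit none
  integer :: n
  real    :: x(n), factor
  integer :: k

  do k = 1, n
    x(k) = factor * x(k)
  end do
end subroutine scale_flat
```

The same idea is what wrapper_add does in the test programs earlier in the thread, where a 4D array is addressed as accum(na,nb).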
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 8037
Location: Salford, UK

Posted: Sun Mar 17, 2024 2:22 pm

John

The current virtual stack might be replaced subject to the planned review of this issue. So it would probably be best to await the outcome before providing further information. I will make sure that this has a high priority.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2593
Location: Sydney

Posted: Mon Mar 18, 2024 12:43 am

Paul,

The concept of a larger stack for automatic or large local arrays, plus temporary arrays, is very good.

Also the use of large virtual address strides provides flexibility for a very large Vstack and heap. You should review Gfortran and Ifort load maps to identify the strides that they provide, with no appreciable performance problem.

This could leave the conventional stack small, used for managing subroutine argument lists and smaller local variables, and so able to use short addresses.

In my programs, most arrays are on the heap, which uses a long address but still provides good performance.

I look forward to the review and hope that this can lead to fewer stack overflow errors !

John
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 8037
Location: Salford, UK

Posted: Tue Apr 09, 2024 3:27 pm

Here is the outcome of the promised review of what has been called the 64 bit "virtual stack".

The next release of FTN95 and its associated DLLs will be amended and the following instructions will apply (i.e. these will be the new instructions).

The compiler generates temporary blocks of data, for example, when passing non-contiguous array sections and when functions return array valued results.
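
For readers unsure which constructs are meant, here is a small sketch of the two cases named above. It is an illustration under my own assumptions, not FTN95 internals: the names temp_demo, doubled, total and show_temporaries are hypothetical, and whether a temporary is actually created in each case depends on the compiler and its options.

```fortran
module temp_demo
  implicit none
contains
  ! Array-valued function: the result is typically built in a temporary
  ! before being assigned to the left-hand side.
  function doubled ( x ) result ( y )
    real, intent(in) :: x(:)
    real :: y(size(x))
    y = 2.0 * x
  end function doubled
end module temp_demo

! External routine with an explicit-shape dummy: it expects contiguous data.
subroutine total ( v, n, s )
  implicit none
  integer :: n
  real    :: v(n), s
  s = sum(v)
end subroutine total

program show_temporaries
  use temp_demo
  implicit none
  real :: a(3,4), s

  a = 1.0
  ! a(2,:) is a row with stride 3 (non-contiguous), so a compiler will
  ! usually copy it into a temporary before the call.
  call total ( a(2,:), 4, s )
  ! a(:,2) is a contiguous column, so no copy should be necessary.
  call total ( a(:,2), 3, s )
  ! Array-valued function result assigned back into a column.
  a(:,1) = doubled ( a(:,1) )
  print *, s, a(1,1)
end program show_temporaries
```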

For 64 bit programs, by default this temporary data is allocated from a private heap that is different from the global heap which is used for ALLOCATE statements in the user's program.

The compiler uses the global heap rather than this private heap when /ALLOCATE is added to the FTN95 command line but code created using this option could run more slowly.

The default size of this private heap is 128GB. This is the reserved and not the committed size so reducing this value should have no impact on performance.
It should not be necessary to increase this value. If it is too small then runtimes will probably be unacceptable because of the amount of data being copied.
However, the default can be changed by using /VSTACK <size> on the FTN95 command line (<size> is the required number of GBs). Alternatively, the default can be changed by a call to HEAP_RESERVE...

SUBROUTINE HEAP_RESERVE(RESERVE)
INTEGER RESERVE
This routine must be called before calling other routines. It sets the reserve size of the private heap as the number of GBs required.
There is no known advantage in setting this value below its default value.
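
A minimal usage sketch, assuming only the interface given above (one default INTEGER argument, in GBs) and that the routine is available in the FTN95 release being described. It will not link with other compilers. The value 256 and the program name are purely illustrative, since the post says increasing the default should not normally be necessary.

```fortran
program reserve_demo
  implicit none
  integer :: reserve_gb

  reserve_gb = 256                  ! illustrative value only, in GBs
  ! Per the description above, this must be called before other routines.
  call heap_reserve ( reserve_gb )

  ! ... rest of the program: array sections, array-valued functions, etc.
end program reserve_demo
```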
All times are GMT + 1 Hour