Silverfrost Forums


Insufficient virtual stack with 64bits

8 Mar 2024 10:39 #31232

Got a run-time error (with 0.5TB RAM 😃)

Insufficient virtual stack (FTN95 /VSTACK <MB-value>)

Does 64-bit also need manual stack control??? How do I use it?

8 Mar 2024 11:08 #31233

Dan

Can you send me the code for a program that compiles and runs except for this runtime error?

8 Mar 2024 11:46 #31235

Sending it could be problematic... and making a demo could be almost impossible. The error appeared when I tried to load a really large file of 100GB or so, which might also demand even more RAM (I expected that with the 64-bit compiler all limits were off, with the code allowed to grow automatically and decide for itself; it is not likely the code would demand more than 1TB). During the next few days I will investigate what caused this. Before, I was able to load ~300GB even with 5x less real+virtual memory.

8 Mar 2024 2:04 #31236

FTN95 currently has a maximum 'virtual' stack size of 8GB. This can be reduced but not increased by using /VSTACK <MB-value> on the FTN95 command line. The maximum value is automatically reduced when there is limited physical memory available.
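For example (the source file name and the MB value here are only placeholders), the option is given on the compile line like this:

    ftn95 prog.f90 /64 /VSTACK 4096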

The FTN95 'virtual' stack is used for so-called automatic arrays and for temporary arrays created by the compiler to handle array sections that are not contiguous or not known to be whole arrays.
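As a rough illustration (the routine names are made up for the sketch), both of the following typically take storage from this 'virtual' stack rather than from the heap:

    subroutine demo ( a, n )
      integer :: n
      real :: a(n,n)
      real :: work(n,n)                   ! automatic array: storage comes from the 'virtual' stack
      work = 2.0*a
      call process_row ( work(1,1:n), n ) ! non-contiguous row section: the compiler passes a
                                          ! temporary copy, also taken from the 'virtual' stack
    end subroutine demo

    subroutine process_row ( row, n )     ! trivial routine just to make the sketch complete
      integer :: n
      real :: row(n)
      row = row + 1.0
    end subroutine process_row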

With 0.5TB of RAM there is effectively no physical limit for this virtual stack so the limit is currently 8GB.

I have made a note that this needs to be reviewed.

9 Mar 2024 1:39 #31238

Dan,

What is Vstack and how does it differ from stack or heap ??

I know about Stack and Heap, but what is Vstack and why is it limited to 8 GBytes, when Heap gets all virtual memory ?

I have recently come across this 'VSTACK' limit when trying to write/read 8.5 GByte records to a binary file. (This is an ongoing issue related to solving 'Fails to save arrays > 4GB'.)

I used ' read (lu) vector(1:nn)'

The problem is that this array section initiates a temporary copy of vector(1:nn)

Paul indicates this when he stated 'The FTN95 'virtual' stack is used for so-called automatic arrays and for temporary arrays created by the compiler to handle array sections that are not contiguous or not known to be whole arrays.'

So the solution is to avoid these temporary copies of very large array sections. Use ALLOCATE wherever possible!!
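As a rough sketch of that advice (names are illustrative only): an ALLOCATE'd work array is taken from the heap, so it avoids the 8 GByte 'virtual' stack limit, whereas an automatic array of the same size would not:

    subroutine big_work ( n )
      integer*8 :: n
    ! real :: work(n)              ! an automatic array this size would come from the 'virtual' stack
      real, allocatable :: work(:) ! an allocatable array comes from the heap instead
      allocate ( work(n) )
      work = 0.0
      deallocate ( work )
    end subroutine big_work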

In the above case my solution was to replace the read statement with an 'F77' wrapper:

       iostat = read_stream_vector ( lu, vector, nn )
       if ( iostat /= 0 ) exit
...
    integer function read_stream_vector ( lu, vector, nn )

    !  reads Fortran unformatted sequential access records using stream access

      use timer_info                 ! assumed to supply delta_seconds() and GB_per_sec()
      integer*4 :: lu
      integer*8 :: nn
      integer*4 :: vector(nn)

      integer*8 :: four = 4, num_bytes
      integer   :: iostat
      real*4    :: gbytes, seconds, gb_sec

      num_bytes = nn*four
      gbytes    = real(num_bytes) / 2.**30

      seconds = delta_seconds ()
!z    read (lu,iostat=iostat) vector(1:nn)          ! this fails above 8 GBytes
!z    read (lu,iostat=iostat) (vector(k),k=1,nn)    ! this is very slow
      read (lu,iostat=iostat) vector                ! this worked OK

      gb_sec  = GB_per_sec ( gbytes )
      write (*,12) '  reading record  nn = ',nn,' : iostat = ',iostat, gb_sec,' GB/sec'
   12 format (5x,a,i0,a,i0, 2x,f0.3,a)

      read_stream_vector = iostat

    end function read_stream_vector

The alternative ' read (lu) vector ' solved the problem, as the compiler now knows this is a contiguous vector in memory (which it previously could not deduce from the array section). The other two alternatives I have commented out crashed or were far too slow.

I hope to post more about these large records soon, but I am achieving over 7 GBytes per second read rates on a PCIe SSD ( although the file is probably in the memory disk buffers )

The following is a trace of write, then read testing a 16 GByte vector, although the write rates are only about 1 GBy/sec. ( 4 GByte vector write is over 2 GBy/sec )

 TEST 5 : array size 4294967358 : record size 17179869432 bytes : 16.000 GBytes

Unformatted Sequential WRITE
   Array nn = 4294967358
      generating vector of 16.000 GBytes : stat = 0
      writing record 1 : iostat = 0  1.280 GB/sec
      writing record 2 : iostat = 0  0.846 GB/sec

Unformatted Sequential READ
   Array nn = 4294967358
      reading record 1 : iostat = 0  3.800 GB/sec
      reading record 2 : iostat = 0  3.899 GB/sec

Stream Access Header READ
   Array nn = 4294967358
      Header type -2 L = 17179869432 V = 1 Iostat = 0  header OK 

Stream Access Sequential READ
   Array nn = 4294967358
       Record Header type -2 : Size = 17179869432 bytes
       reading record  nn = 4294967358 : iostat = 0  7.555 GB/sec
       reading record  nn = 4294967358 : iostat = 0  8.026 GB/sec

This works with unformatted sequential read/write, as it uses a new header type -2: a 9-byte header/trailer. I will post more soon when Paul confirms this is supported in the released FTN95 Ver 9.0x compiler.

9 Mar 2024 8:13 #31241

John

I think that all of this should work with version 9.02 that can be downloaded from the Support 'Sticky Post'.

This 'virtual' stack is the one that is created and used by FTN95 for 64 bit automatic arrays and compiler generated temporary arrays. It currently has an upper limit size of 8GB but this will be reviewed.

The stack is generated by a system call to VirtualAlloc, and this call is built into the startup code for the user's executable by SLINK64. As a result the call is not currently visible in an /EXPLIST listing.

9 Mar 2024 11:30 #31243

I use 9.02. The place where I got this error is:

    dVolumeCell = XCellsize * dyCellsize  * dZCellsize +1.d-30
    DensityE3D(:,:,:) = DensityE3D(:,:,:) / dVolumeCell

Dimensions 1280 x 1280 x 2000 = 3,276,800,000, or just 25GB.

It is almost like the old '640K is enough for everyone'.

Paul, please remove this limit! Even cellphones have more than 8GB these days, and supercomputers use petabytes of RAM.

9 Mar 2024 12:52 #31244

Dan,

could you try the following, i.e. remove the 'array section':

    dVolumeCell = XCellsize * dyCellsize * dZCellsize + 1.d-30
    DensityE3D  = DensityE3D / dVolumeCell

and let me know what happens ?

Also you could try

dVolumeCell = 1.0 / ( XC0ellsize * dyCellsize  * dZCellsize +1.d-30 )
DensityE3D = DensityE3D * dVolumeCell

Do you enable AVX instructions for this time consuming calculation?

Increasing the Vstack size allocation might be a problem for others without 0.5TB of RAM. Increasing the Vstack size reduces the virtual memory address pool, but perhaps not the virtual memory actually allocated?

It would not be an issue for physical memory usage, but might reduce the available virtual memory size ?

I could check, as Gfortran puts bigger memory address strides for stack and heap, compared to FTN95. I have not identified the Vstack address in memory maps.

10 Mar 2024 7:38 #31255

John,

  1. What 'problems for others' are you warning about here if I increase the stack just for myself (if this limit is even needed)? The debugger will tell you where your problem is. If there is a problem, it will be a problem in your code, not in the compiler like now.

  2. I also have places like this: how do I eliminate the array section here, for example?

       do k=2, nActualAtomicSpeciesPresent
         DensitySpecies(:,:,:,1)    = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
       enddo 
    
  3. Were AVX vector instructions included in FTN95?

10 Mar 2024 12:46 #31257

Dan,

  1. The available virtual memory (not address space), i.e. physical memory + paging space, can sometimes be a limit on x64. A larger Vstack may not be a problem, as memory pages are only really consumed once they are set a value. Something like 'DensitySpecies(:,:,:,1) = 0' could therefore touch a lot of memory pages. (See the small sketch at the end of this post.)

  2. 'F77 wrappers' are a great way to avoid temporary arrays. The following could avoid the problem

    Real :: DensitySpecies(ni,nj,nk,nz)
    integer*8 :: num
    

    num = ni*nj*nk
    do k = 2, nActualAtomicSpeciesPresent
      call add_species_k ( DensitySpecies(1,1,1,k), DensitySpecies(1,1,1,1), num )
    end do
    ...
    subroutine add_species_k ( from, to, num )
      integer*8 :: num, j
      real :: from(*), to(*)
      do j = 1, num
        to(j) = to(j) + from(j)
      end do
    end subroutine add_species_k

  3. For AVX instructions, see noteson64bitftn95.txt for more info.

You could replace call add_species_k ( DensitySpecies(1,1,1,k), DensitySpecies(1,1,1,1), num ) with

  num = ni*nj*nk
  do k = 2, nActualAtomicSpeciesPresent
    call axpy4@ ( DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k), num, 1.0 )
  end do

Note :

  1. num must be integer*8
  2. If the vectors are large, memory access speed/bandwidth will still limit what the AVX instructions can achieve. It may not scale up by the full AVX-256 width, but it should be considerably faster.

Let me know how it goes.
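Regarding point 1 above, a rough sketch of what I mean (the size is illustrative only, and the exact reserve/commit behaviour depends on the run-time library and Windows settings):

    program commit_demo
      real, allocatable :: big(:)
      integer*8 :: n
      integer   :: ierr
      n = 2000000000
      n = 2*n                                  ! 4e9 elements, ~16 GB
      allocate ( big(n), stat=ierr )           ! claims the address space
      if (ierr == 0) big(1:1024) = 0.0         ! physical pages are only consumed as they are written
    end program commit_demo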

11 Mar 2024 3:50 #31260

This is a club of workarounders. Besides mecej4 no one reports any problems or bugs, and you will not hear suggestions from absolutely anyone. If the company had not moved on from FTN77 by itself, everyone would still be actively making workarounds in F77. Even AVX is a workaround. 😃 😃 😃 Do you know what needs to change in the code to enable AVX with gFortran or Intel? Nothing, you just add a compilation switch.

Do gFortran and Intel also have an 8GB limit?

11 Mar 2024 7:58 #31261

FTN95 uses AVX in some contexts and no switch is required.

I have already said that this particular limit will be reviewed, so I expect that it will at least be increased and possibly made configurable. Your request for no limit is uninformed.

Personally, I find your comments disrespectful and hence counterproductive because they could discourage others from using FTN95.

11 Mar 2024 11:01 #31262

Sorry, from my words you can feel that the rhetoric between the two camps of penguins is heating up in my Antarctica, so the ice is melting 😃. By the way, that is considered pro- not counterproductive, and usually encourages rather than discourages. Anyway, from my side I apologize if my words sounded offensive, because, to be 100% objective, the workarounds can also be useful sometimes.

gFortran: up to the full RAM + swap of 933 GB, no limits were found.

13 Mar 2024 11:07 #31269

Quoted from DanRRight: "gFortran: up to the full RAM + swap of 933 GB, no limits were found."

What stack size are you selecting in Gfortran? Have you been able to exceed the 512 MByte limit I have assumed ?

I expect you have succeeded as Gfortran is not using temporary arrays.

Have you tested any of the strategies I suggested for avoiding temporary arrays with FTN95 ?

These are incredible memory sizes you have available! It was not long ago that 933 GB disk files were unachievable! When I went to 64 GBytes of installed memory I changed my disk files into allocatable derived type memory arrays. Now performance is dictated by memory-to-cache transfer delays! Unfortunately we cannot control cache usage.
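For what it is worth, the 'allocatable derived type memory arrays' I mention are along these lines (names and sizes are illustrative only):

    program file_in_memory
      integer, parameter :: num_records = 1000, record_length = 250000
      type record_buffer
        real, allocatable :: rec(:)
      end type record_buffer
      type (record_buffer), allocatable :: file_image(:)
      integer :: i

      allocate ( file_image(num_records) )
      do i = 1, num_records
        allocate ( file_image(i)%rec(record_length) )   ! each element replaces what was a disk record
      end do
    end program file_in_memory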

14 Mar 2024 7:28 #31270

I am sure you've heard that no one optimizes code by hand anymore; compilers do that better than the average programmer. The programmer just has to write clear code, not a spaghetti nightmare. And in Soviet Russia, the code already optimizes the programmers. 😃

gFortran with the -O3 -march=native switches optimizes this code to AVX speeds without any workarounds. It even optimizes your Fortran77 workaround above to the same speed; the non-standard FTN95 AVX@ routine it of course could not swallow. And yes, it runs without crashes: after exhausting 0.5 TB of RAM it takes swap. No stacks at all. FTN95 stops at 8 GB.
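For reference, the compile line was of this form (the source file name is just a placeholder):

    gfortran -O3 -march=native bigtest.f90 -o bigtest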

integer, parameter :: i=1000, j=1000, m=1000, n=1
Real, allocatable :: DensitySpecies(:,:,:,:)
integer*8 :: idim, nnn

k=1
do nn=0,7
  nnn = 2**nn
  idim = nnn * i * j * m * n

  print*,'=====================', nn, nnn
  write(*,'(A, 5i7)') 'Size GB, Size i,j,m,n=', 4*idim/1000000000, nnn*i,j,m,n

  call cpu_time(t1)
  allocate(DensitySpecies(nnn*i,j,m,n), stat=ierr )
  if(ierr.ne.0) print*, '====ierr=', ierr
  call cpu_time(t2)
  print*,'Allocation time= ', t2-t1

  DensitySpecies = 123

  call cpu_time(t1)
  DensitySpecies(:,:,:,1)    = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
  call cpu_time(t2)
  print*,'END section :::, time= ', t2-t1

  deallocate(DensitySpecies)
enddo

END

https://i.postimg.cc/3rDCHNVF/Exhausing-RAM.png

There is a song 'This is California, Baby'... Here on the same block are AMD, Intel, Apple, Google, Western Digital, etc. etc.; here server chips can be found practically at the city dump and on the beaches. I found a few Genoas and made a supercomputer. Soon the 3nm Turins will be on the dumps too, and I will increase to 400 cores.

14 Mar 2024 12:42 #31271

Quoted from DanRRight: "I am sure you've heard that no one optimizes code by hand anymore; compilers do that better than the average programmer."

I don't see any compiler fixing your mistakes !

Try this changed code and see if you get more accurate elapsed times.

  integer, parameter :: i=1000, j=1000, m=1000, n=1
  Real, allocatable :: DensitySpecies(:,:,:,:)

  integer*8 :: idim, nnn, num, ni, two=2
  integer   :: k, nn, ierr
  real      :: t1, t2, GByte
  integer*8 :: na, nb
  logical   :: dan = .false.

    call delta_sec ('Start tests')
    k = 1
    do nn = 0,3
      nnn  = two**nn
      ni   = nnn*i
      idim = ni * j * m * n
      gbyte = 4.*idim/1.e9
      print*,'=====================', nn, nnn
      write(*,'(A, f0.3, A, 5i7)') 'Size ',gbyte,' GB, Size i,j,m,n= ', ni,j,m,n
    
      call cpu_time (t1)
      call delta_sec ('start loop')
      allocate ( DensitySpecies(ni,j,m,n), stat=ierr )
      if(ierr.ne.0) print*, '====ierr=', ierr
      call cpu_time (t2)
      call delta_sec ('Allocation time')
      print*,'                      Allocation time= ', t2-t1
    
      DensitySpecies = 123
    
      call delta_sec ('Initialisation time')
      call cpu_time (t1)
    if ( dan ) then     
      DensitySpecies(:,:,:,1)    = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
    else
      na = ni
      nb = j*m
      call wrapper_add ( na, nb, DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k) )
    end if
      call cpu_time (t2)
      call delta_sec ('Calculation')
      print*,'                      END section :::, time= ', t2-t1
    
      deallocate (DensitySpecies)
      call delta_sec ('Deallocation')
    end do

  END

  subroutine delta_sec ( desc )
    character desc*(*)
    integer*8 :: clock, rate, last_clock = 0
    real*8    :: sec
     call system_clock ( clock, rate )
     sec = dble(clock-last_clock) / dble(rate)
     write (*,fmt='( f10.4,2x,a )') sec, desc
     last_clock = clock
  end subroutine delta_sec

  subroutine wrapper_add ( na, nb, accum, add )
    integer*8 :: na, nb
    real :: accum(na,nb), add(na,nb )
    integer*8 :: k

    write (*,fmt='(a,i0,a,i0,a)') 'Add arrays( ',na,', ',nb,' )'
    do k = 1,nb
      accum(:,k) = accum(:,k) + add(:,k)
    end do
  end subroutine wrapper_add 

Unfortunately no compiler I used stripped out the bad code; I will have to correct any errors by hand! It appeared to run successfully in Plato with FTN95 and Gfortran up to 32 GBytes.

14 Mar 2024 1:11 #31272

You could try this alternative code, selecting axpy4@.

I ran this with FTN95 Release x64 on my Ryzen with 64 GBytes of physical memory. The test used 59 GBytes and ran faster than Gfortran.

  integer, parameter :: i=1000, j=900, m=1000, n=1
  Real, allocatable :: DensitySpecies(:,:,:,:)

  integer*8 :: idim, nnn, num, ni, two=2
  integer   :: k, nn, ierr
  real      :: t1, t2, GByte
  integer*8 :: na, nb
!  logical   :: dan = .true. ,  john = .false. ,  use_avx = .true.
  logical   :: dan = .false. ,  john = .false. ,  use_avx = .true.

    call delta_sec ('Start tests')
    k = 1
    do nn = 0,4
      nnn  = two**nn
      ni   = nnn*i
      idim = ni * j * m * n
      gbyte = 4.*idim/1.e9
      print*,'=====================', nn, nnn
      write(*,'(A, f0.3, A, 5i7)') 'Size ',gbyte,' GB, Size i,j,m,n= ', ni,j,m,n
    
!      call cpu_time (t1)
      call delta_sec ('start loop')
      allocate ( DensitySpecies(ni,j,m,n), stat=ierr )
      if (ierr.ne.0) print*, '====ierr=', ierr
!      call cpu_time (t2)
      call delta_sec ('Allocation time')
!      print*,'                      Allocation time= ', t2-t1
    
      DensitySpecies = 123
    
      call delta_sec ('Initialisation time')
!      call cpu_time (t1)
    if ( dan ) then     
      DensitySpecies(:,:,:,1)    = DensitySpecies(:,:,:,1) + DensitySpecies(:,:,:,k)
    else if ( john ) then
      na = ni
      nb = j*m
      call wrapper_add ( na, nb, DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k) )
    else if ( use_avx ) then
      num = ni*j*m
      call axpy4@ ( DensitySpecies(1,1,1,1), DensitySpecies(1,1,1,k), num, 1.0 )      
    end if
!      call cpu_time (t2)
      call delta_sec ('Calculation')
!      print*,'                      END section :::, time= ', t2-t1
    
      deallocate (DensitySpecies)
      call delta_sec ('Deallocation')
    end do

  END

  subroutine delta_sec ( desc )
    character desc*(*)
    integer*8 :: clock, rate, last_clock = 0
    real*8    :: sec
     call system_clock ( clock, rate )
     sec = dble(clock-last_clock) / dble(rate)
     write (*,fmt='( f10.4,2x,a )') sec, desc
     last_clock = clock
  end subroutine delta_sec

  subroutine wrapper_add ( na, nb, accum, add )
    integer*8 :: na, nb
    real :: accum(na,nb), add(na,nb )
    integer*8 :: k

    write (*,fmt='(a,i0,a,i0,a)') 'Add arrays( ',na,', ',nb,' )'
    do k = 1,nb
      accum(:,k) = accum(:,k) + add(:,k)
    end do
  end subroutine wrapper_add 

I modified the array sizes to fit in my available memory, but the test demonstrates good performance using FTN95 with arrays up to 59 GBytes.

14 Mar 2024 1:30 #31273

All of this is irrelevant to the subject.

15 Mar 2024 12:38 #31274

No Dan, it is all relevant.

If you have a poor solution approach, the compiler can only go so far.

There is still some need for understanding preferred numerical approaches in large calculations.

With your large 3d mesh, perhaps you should consider sparse calculation techniques to eliminate unnecessary calculations, which even the best optimising compilers can't yet easily implement.

I think my example showed that you can adapt to remove limitations in the compiler and utilise what is available to improve performance. This applies to all compilers, especially Gfortran and FTN95.

15 Mar 2024 7:18 #31275

Flag in your hands, go show your skills on the Polyhedron examples. They have been waiting for you for 25 years. Oops, you already tried... Also MPI, OpenMP and CUDA have been begging for you for 15-20 years. Also tried some... why not with this compiler? 😃

Not interested in 3% proprietary 'improvements' on a single core which do not go anywhere else.
