How to cRAM as much computing power as possible into a progr
forums.silverfrost.com Forum Index -> General
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Tue Aug 11, 2020 6:54 am    Post subject: How to cRAM as much computing power as possible into a progr

How to cRAM as much computing power as possible into a program for use from a 'normal' PC
(damn that post title length limitation of 60 characters!)

A short discussion here: http://forums.silverfrost.com/viewtopic.php?t=4290
between Bill and Eddie has prompted me to ask a question which has been lingering in the back of my mind for a long time.

Imagine you have an array of, say, 10000 x 10000 of data,

which at double precision needs 0.8 GB of free RAM to store (10000 x 10000 x 8 bytes).

A 'typical' computer these days has maybe 4 GB of RAM installed, but 2 GB is already eaten up by that 'frugal' software called M$ Windows, leaving 2 GB 'free'.
But of course 'free' is not always 'free', and a user will often have many programs on the go at the same time (browser, Word, Excel, PowerPoint, etc.).
So even getting the nominal 0.8 GB of memory to run the specific program may be problematic for the typical user, leading to virtual memory use (disk paging).

ClearWin+ variables need to be double precision!
So, if you want to plot a large amount of the data, that adds another dimension to the problem.

So, 2 questions:

1. Why can't FTN95, and especially ClearWin+, have an option to switch back and forth between single and double precision? Many applications won't actually NEED double precision, and a single precision version would effectively double the memory available.

2. How can a program with arrays of this typical size (or larger) best be structured so that, as far as possible, the RAM limitations of the machine are not exceeded?

(Eddie's suggestion in the previous post to simply take the 'easy route' and put more physical memory in the machine might not go down too well with a customer!)
_________________
"Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together they cock up even the simplest calculation... :)"
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Tue Aug 11, 2020 11:38 am

John,

It's a challenge. The first thing to note is that if you are running a program that needs a 10k x 10k REAL*8 array, you probably aren't running other memory-demanding applications at the same time; indeed, if you think you can, you will probably make the computer slow to a crawl. If you need more memory, you can always compile in 64-bit mode, and then you have access to more.

My suggestion is that if you employ computer models of that size, you probably will never understand what they are telling you, so where I've seen them used it's always by someone who hasn't got a clue either about what they are doing or, even more probably, about what limitations their answers have.

If you don't like the idea of buying more RAM, then build a computer to do those analyses and let it run in the background. It isn't expensive. Since FTN95 only uses one core, a twin-core machine should do nicely. Here's the build list:
Case, Deepcool Smarter £19.99
Motherboard, Asrock A320M-DVS £44.18
CPU, AMD Athlon 3000G £45.49
RAM, 16 GB Corsair DDR4 2400 £60.98
SSD, 256 GB Verbatim £24.98
PSU, 400 W CoolerMaster £32.99
Total £228.61 + carriage, say another £5. You can install Windows 10 for free if you link it to your Microsoft account. And yes, I’ve built a machine like this, and it runs FTN95 programs as fast as my current main machine that cost 8 times as much to build. (These are prices I got off a supplier website today).
In my experience, you need REAL*8 to reduce roundoff in just about any calculation, but REAL*4 can be used for data that you input. For example, if you wanted to define the coordinates of anything on an Airbus A380 (length 73 m, wingspan 80 m), you can do it to the nearest 1 mm using 5 significant figures, or to 0.1 mm with 6 – and REAL*4 is typically good enough for 7, roughly speaking. You don't need REAL*8 for that. In fact, you could do all the data input and checking from a file in REAL*4.
The problem then arises that you can't use CW+ in REAL*4 mode! You might be better off using INTEGER*4 coordinates with units of a micrometre, which would allow some operations to be carried out exactly, e.g. where particular rivets are positioned relative to the wingtip can be computed exactly in the whole-aircraft coordinate system.
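Something like the following sketch shows the bookkeeping (the coordinate values are invented, and it isn't tested against CW+ itself): the coordinates live as exact INTEGER*4 micrometres and are widened to REAL*8 only at the display boundary.
Code:
 ! Sketch only: exact INTEGER*4 coordinates in micrometres,
 ! converted to REAL*8 metres just before display.  An 80 m
 ! wingspan is 80 000 000 um, well inside the INTEGER*4 range
 ! of roughly +/- 2.1e9.
 integer*4 :: x_um(3), y_um(3)       ! exact, 1 um resolution
 real*8    :: x_m(3), y_m(3)         ! what ClearWin+ expects
 x_um = (/ 0, 36500000, 73000000 /)  ! nose to tail (73 m)
 y_um = (/ 0, 40000000, 0 /)         ! half the 80 m span
 x_m = dble(x_um) * 1.0d-6           ! widen at the interface only
 y_m = dble(y_um) * 1.0d-6
 print *, x_m, y_m
 end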
So my answer to Q1 is to use REAL*4 when you want to, and REAL*8 when you want to. The calculations in x87 are always done in REAL*10, and in SSE in REAL*8, with the usual reservations about padding out the input and rounding off the output of any calculation. If CW+'s use of REAL*8 bothers you, then input things like coordinates and so on in INTEGER*4. There's a small overhead in converting eventually to REAL, but anything with a human interface runs faster than the human can keep up.
My answer to Q2 is to separate the input, computation and output phases into three separate applications that run sequentially, storing intermediate results on file; a sketch of the hand-off follows. I've had to do this in the past, even using different computers and transferring the intermediate results on punched paper tape.
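The hand-off needs nothing fancier than an unformatted file (the file name here is illustrative); the second program simply mirrors the WRITE with a READ.
Code:
 ! Phase 1 (sketch): compute, then dump the intermediate results
 ! for the next program in the chain.
 real*8 :: a(1000,1000)
 a = 1.0d0                      ! stands in for the real computation
 open  (10, file='phase1.tmp', form='unformatted')
 write (10) a
 close (10)
 end
 ! Phase 2 (a separate .exe) declares the same array and does:
 !   open (10, file='phase1.tmp', form='unformatted')
 !   read (10) a
 !   close (10)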
My answer to the question you didn't ask is to supply a complete computer with any marketed software. At the above price it would be cheaper than answering one support call.

Eddie
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 7916
Location: Salford, UK

Posted: Tue Aug 11, 2020 11:56 am

For 32-bit applications, FTN95 has single, double and extended precision reals and you can program with any or all, or even switch from one to another via command line options such as /DREAL.
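For example (a minimal sketch; the source file name is arbitrary), the same program can be built at either precision and will report what it got:
Code:
 ! Build normally and the default REAL is single precision; build
 ! with /DREAL and the same source runs in double precision:
 !   ftn95 prec.f95 /link
 !   ftn95 prec.f95 /DREAL /link
 real :: x
 x = 4.0 * atan(1.0)
 print *, 'decimal digits =', precision(x), '  pi =', x
 end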

The same is true for 64 bit FTN95 except that extended precision is not provided.

ClearWin+ provides an input/output interface where reals must be double precision. This is by design and makes the interface easier to write, maintain and document. However, it does not force you to use a double precision model; it only means that you must convert to double precision for the purposes of inputting and outputting real values.
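In other words, a REAL*4 model only needs a short-lived REAL*8 copy at the ClearWin+ boundary. A sketch (the array names are made up, and the plotting call itself is omitted):
Code:
 ! Model data stays REAL*4; widen a temporary copy only for the
 ! double precision ClearWin+ call, then release it.
 real*4 :: x4(100000), y4(100000)     ! the memory-hungry model data
 real*8, allocatable :: x8(:), y8(:)  ! short-lived display copies
 call random_number(x4)
 call random_number(y4)
 allocate (x8(size(x4)), y8(size(y4)))
 x8 = dble(x4)                        ! widen at the interface only
 y8 = dble(y4)
 ! ... pass x8, y8 to the ClearWin+ plotting routine here ...
 deallocate (x8, y8)                  ! reclaim once the call returns
 end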
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Tue Aug 11, 2020 3:40 pm

John-Silver wrote:
A 'typical' computer these days has maybe 4Gb of RAM installed

Typical? Where? I am using 32 GB for computation and am about to go to 64 GB.

I do find that ClearWin+ can probably work within 8 GB for graphics, but that is probably because I do most computation in a separate program. Even a 4K screen would not require a lot of memory for a virtual image (64 MB?).

Changing from real*4 to real*8 involves doubling memory, which is not a big deal. I have not used real*4 for anything real in 40 years! It is at least ten years since I had a 4 GB PC.
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Tue Aug 11, 2020 5:15 pm

John (C), you really are a special case!

FTN95 works perfectly well for many applications in far less RAM than you have available, and although for decades I have subscribed to the bigger-is-better paradigm, it isn't necessarily so. There are myriad useful things you can do with a lot less.

As I pointed out, REAL*4 or even INTEGER*4 is good for inputting coordinates if you are short of memory, and I'll bet that you have never directly measured any quantity to REAL*8 precision. Or, for that matter, counted to more than 4 billion (although mecej4 will tell us that the computer has to).

Eddie
mecej4



Joined: 31 Oct 2006
Posts: 1885

Posted: Tue Aug 11, 2020 6:45 pm

Let us not forget that current PCs sold with 2, 4 or 8 GB RAM also come with an integrated display controller, which shares RAM with the CPU. That display controller may well use 2 to 4 GB unless the user sets a limit on how much the display driver is allowed to use.
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Thu Aug 13, 2020 12:41 pm

Yes, I hadn't forgotten that, but then onboard video reserving a 5 GB memory space on a machine with only 2 GB of RAM tells us rather a lot about the genius of hardware manufacturers, doesn't it?

Perhaps we should rename RAM as Tardis Memory - bigger on the inside than the outside.

Eddie
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Aug 13, 2020 3:17 pm

Not sure where this 2, 4 or 8 GB concern about memory is going?
With /64, worrying about the last few MB is not really an issue today.
What can be a problem is declaring large arrays and not addressing them in a contiguous/sequential manner.

My latest approach to stack problems is to declare a stack size of 500 MB. All that consumes is virtual memory address space; only the part of the stack actually used is given a physical memory allocation. This works for threads also.

I don't see many PCs with less than 8 GB being sold.
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Fri Aug 14, 2020 8:37 pm

Thanks for the feedback so far, lads.
My example wasn't an 'absolute' case; the logic is the same if you have 8 GB, since the array dimensions can easily grow by an order of magnitude and produce the same basic problem.
Also, there are a lot of budget PCs out there with 4 GB, which companies like Airbus would jump at, especially in the econo-covid environment we are tip-toeing into at the moment.

Big, and not so big, companies are notorious for going low-end when they have to 'update' their hardware, e.g. 5000 machines x 500 apiece = a lot of spondoolicks, in whatever currency they like to buy in.

The machines typically have a 10-year life expectancy before the company goes through another of those cycles.


I wonder, has Silverfrost ever run any benchmarking of potential runtime and disk usage on some 'standard' programs (not the polywotsit ones, but ones devised internally) to gauge the real performance of FTN95 on matrix operations?
_________________
"Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together they cock up even the simplest calculation... :)"
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Fri Aug 21, 2020 7:19 pm

My last post had an interesting question at the bottom which appears somewhat 'hidden'... I'll flush it out into the open here:

Quote:
I wonder, has Silverfrost ever run any benchmarking of potential runtime and disk usage on some 'standard' programs (not the polywotsit ones, but ones devised internally) to gauge the real performance of FTN95 on matrix operations?

_________________
"Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together they cock up even the simplest calculation... :)"
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Wed Aug 26, 2020 10:11 pm

About the speed of matrix operations John mentioned: it would be good to know how fast FTN95 is here vs other compilers.

One thing with large matrices has annoyed me lately, and it is as simple as zeroising an array. It became so time consuming, due to the large array sizes, that I spent some time removing the need to do it at all. But often zeroising is still needed.

So I have these questions, or a request, for you guys who have other compilers installed:

1) How does FTN95 compare to other compilers in this respect?
2) If all compilers are similarly slow - is it possible to parallelise this operation?

Here is the code to check. It creates and zeroises matrices from 1000x1000 up to 100000x100000; the largest is 40 GB in size. The run takes 30-40 seconds to zeroise that largest 40 GB matrix. Compilation: ftn95 aaa.f95 /link /64 >z

Of course, when the matrix does not fit into RAM, the swap to hard drive (SSD) will slow the runtime down further. But even when the matrix fits, it is still way too slow for large sizes >20 GB.

Code:
 ! Times the zeroising of square REAL*4 matrices from 1000x1000
 ! (4 MB) up to 100000x100000 (40 GB).
 ! Compile with: ftn95 aaa.f95 /link /64
 implicit none
 real*4, allocatable :: A(:,:)
 real*4  :: c
 integer :: i, j, i0, i1, icount_rate, icount_max

 c = sqrt(10.)

 do i=6, 10
   j = nint(c**i)                  ! 1000, 3162, ..., 100000
   allocate(A(j,j))

   call system_clock(i0, icount_rate, icount_max)
   A(:,:) = 0                      ! the operation being timed
   call system_clock(i1, icount_rate, icount_max)

   print *, 'Dim., Size_MB, Time', j, 4.*j*j/1e6, real(i1-i0)/icount_rate
   deallocate(A)
 end do
 pause
 end

Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Aug 27, 2020 3:25 am

Dan,

You could simply write "A = 0" and let the compiler choose the best approach to zero the array.

However, I disagree with your comment.
For most practical uses, the time taken for "A = 0" is insignificant compared with the time spent using A for computation.

Qualifier: that holds as long as you have sufficient physical memory to store A.

I do have an example of using an "A" much bigger than the physical memory, but smaller than the virtual memory size limit, where I only use small parts of A as 'virtual storage'. This works very well, although "A = 0" would defeat the process and page the array to disk.
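A small sketch of the effect (sizes illustrative; it assumes a /64 build with enough virtual memory behind it): only the pages actually touched acquire physical RAM.
Code:
 ! Allocate a notionally huge array but write only a small window.
 ! The untouched pages cost address space and commit charge, not
 ! physical RAM; "A = 0" would touch every page and defeat this.
 real*8, allocatable :: A(:,:)
 allocate (A(100000,10000))        ! 8 GB of address space
 A(1:1000,1:1000) = 1.0d0          ! ~8 MB actually touched
 print *, sum(A(1:1000,1:1000))
 end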

FTN95 does not generate the fastest .exe; use of the /64 AVX functions can help.

FTN95's MATMUL has not been optimised for AVX or cache efficiency, but it does work. I do have some recent MATMUL tests using AVX if you are interested, although I've never used MATMUL in a production code.
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Thu Aug 27, 2020 6:12 am

John,
The data I use is block sparse, but when zeroising you do it for the entire matrix, typically dimensioned large enough to cover all potential sizes. I do not know in advance which cells of the matrix will be filled and which will be empty, so the zeroising increased processing time in my case several times over.

Currently I zeroise only the block which was already processed, and hence whose coordinates I know.

I am thinking there may exist a way to zeroise large matrices using a parallel multi-core approach.


Last edited by DanRRight on Fri Aug 28, 2020 2:42 am; edited 1 time in total
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Aug 27, 2020 10:35 am

DanRRight wrote:
The data I use is block sparse, but when zeroising you do it for the entire matrix, typically dimensioned large enough to cover all potential sizes. I do not know in advance which cells of the matrix will be filled and which will be empty, so the zeroising increased processing time in my case several times over.

Could you have a derived type array with allocatable blocks?
You could test for the existence of each block and allocate/initialise each block as you go. The following may work as you describe:
Code:
  module block_info
!
      TYPE block_array_record                  ! record for each block
         integer*4 :: block_size                 ! block_size
         integer*4, allocatable :: block(:,:)  ! (block_size,block_size)  block data
      END TYPE block_array_record
!
      integer*4 :: max_blocks
      integer*4 :: num_blocks
!
      type (block_array_record), allocatable :: block_array_records(:)        ! (max_blocks)
!
  end module block_info

  use block_info
!
      integer :: i, stat, k, step
      integer :: block_size, alloc_size, last_block
!
      alloc_size = 0
      max_blocks = 50000
      allocate ( block_array_records(max_blocks) )
      alloc_size = max_blocks
      do i = 1, max_blocks
        block_array_records(i)%block_size = -1
      end do

      num_blocks = 0
      do k = 1,2
        if ( k==1 ) step = 2000
        if ( k==2 ) step = 3000
       do i = 1, max_blocks, step
        block_size = 100 + mod(i,42) + k
        if ( block_array_records(i)%block_size == block_size ) cycle
!
        if ( block_array_records(i)%block_size > 0 ) then
          deallocate ( block_array_records(i)%block, stat=stat )
          write (*,*) 'Block',I,' released : stat=',stat
          num_blocks = num_blocks-1
        end if
!
        allocate ( block_array_records(i)%block(block_size,block_size), stat=stat )
        if ( stat /= 0 ) then
          write (*,*) 'Block',I,' could not be allocated : stat=',stat
          block_array_records(i)%block_size = -2

        else
          write (*,*) 'Block',I,' allocated with size=',block_size
          block_array_records(i)%block_size = block_size
          block_array_records(i)%block = 0
          num_blocks = num_blocks + 1
          last_block = i
          alloc_size = alloc_size + block_size**2
        end if

       end do ! i
      end do  ! k
!
      write (*,*) num_blocks,' blocks allocated and initialised'
      write (*,*) last_block,' last block number'
      write (*,*) alloc_size,' integers allocated'

    end

DanRRight wrote:
I am thinking there may exist a way to zeroise large matrices using a parallel multi-core approach.

Using a multi-core approach to initialise a large shared array might not be a good idea. OpenMP is a shared-memory, multi-core approach where there is only a single memory source, so I don't think it would run much faster; but perhaps the scheme above would give dispersed memory for each active block. With large arrays in OpenMP, memory can become the bottleneck.
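For what it's worth, on a compiler that does support OpenMP (e.g. gfortran with -fopenmp; FTN95 itself has no OpenMP support), a parallel zero is only a directive away, though memory bandwidth rather than core count usually sets the ceiling. A sketch:
Code:
 ! Sketch for an OpenMP-capable compiler, e.g. gfortran -fopenmp.
 ! Each thread zeroes a contiguous run of columns; expect the gain
 ! to be limited by memory bandwidth, not by the number of cores.
 real*4, allocatable :: A(:,:)
 integer :: j, n
 n = 20000                         ! 1.6 GB
 allocate (A(n,n))
 !$omp parallel do schedule(static)
 do j = 1, n
    A(:,j) = 0.0
 end do
 !$omp end parallel do
 print *, A(1,1), A(n,n)
 end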
DanRRight



Joined: 10 Mar 2008
Posts: 2813
Location: South Pole, Antarctica

Posted: Fri Aug 28, 2020 10:35 pm

Still, question #1 is whether FTN95 lags or not in this respect. So the question stays: does my test run at the same speed on Intel and gfortran?

If it lags, then this needs fixing, which will benefit all users. If not, then I will think about how to fix it with workarounds.

I have so far had bad experience with derived type allocatable arrays. After years of preparation I spent a month or two on programming and debugging but failed: the code was crashing, freezing, not compiling. I just had no more time to find the problem.