forums.silverfrost.com Welcome to the Silverfrost forums
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Tue Aug 11, 2020 6:54 am Post subject: How to cRAM as much computing power as possible into a progr
How to cRAM as much computing power as possible into a program for use from a 'normal' PC
(damn that post title length limitation of 60 characters !)
A short discussion here: http://forums.silverfrost.com/viewtopic.php?t=4290
between Bill and Eddie has prompted me to ask a question which has been lingering in the back of my mind for a long time.
Imagine you have an array of, say, 10000 x 10000 double precision data, which needs 0.8 GB of free RAM to store.
A 'typical' computer these days has maybe 4 GB of RAM installed, but 2 GB is already eaten up by that 'frugal' software called M$ Windows, leaving 2 GB 'free'.
But of course 'free' is not always 'free', and a user will often have many programs on the go at the same time (browser, Word, Excel, PowerPoint, etc. ...).
So even getting the nominal 0.8 GB of memory to run the specific program may be problematic for the typical user ! ... leading to virtual memory use (disk paging).
ClearWin+ variables need to be double precision !
So, if you want to plot a large amount of the data, that adds another dimension to the problem.
So, 2 questions:
1) Why can't FTN95, and especially ClearWin+, have an option to switch back and forth between single and double precision? Many applications won't actually NEED double precision, and a single precision version would effectively double the memory available.
2) How best can a program with such typical size (or larger) arrays be structured so that the RAM limitations of a machine are not exceeded, as far as possible?
(Eddie's suggestion in the previous post to simply take the 'easy route' and put more physical memory in the machine might not go down too well with a customer !) _________________ ''Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together cocking up even the simplest calculation ... "
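For reference, the arithmetic behind the 0.8 GB figure, written out as a small sketch (this program is editorial illustration, not code from the thread):

```fortran
program array_storage
  ! Storage needed for an n x n matrix at two precisions (illustrative).
  integer, parameter :: n = 10000
  print *, 'REAL*8 storage (GB):', 8.0 * real(n) * real(n) / 1.0e9   ! 0.8
  print *, 'REAL*4 storage (GB):', 4.0 * real(n) * real(n) / 1.0e9   ! 0.4
end program array_storage
```

Halving the element size halves the footprint, which is exactly the saving question 1 is after.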
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Tue Aug 11, 2020 11:38 am Post subject:
John,
It’s a challenge. The first thing to note is that if you are running a program that needs a 10k x 10k REAL*8 array, you probably aren’t running other memory-demanding applications at the same time; indeed, if you think you can, you will probably make the computer slow to a crawl. If you need more memory, you can always compile in 64-bit mode, and then you have access to more.
My suggestion is that if you employ computer models of that size, you probably will never understand what they are telling you, so where I’ve seen them used it’s always by someone who hasn’t got a clue either about what they are doing, or, even more probably, about what limitations their answers have.
If you don’t like the idea of buying more RAM, then build a computer to do those analyses and let it run in the background. It isn’t expensive. Since FTN95 only uses 1 core, a twin core machine should do nicely. Here’s the build list:
Case, Deepcool Smarter £19.99
Motherboard, Asrock A320M-DVS £44.18
CPU, AMD Athlon 3000G £45.49
RAM, 16Gb Corsair DDR4 2400 £60.98
SSD 256Gb Verbatim £24.98
PSU 400W CoolerMaster £32.99
Total £228.61 + carriage, say another £5. You can install Windows 10 for free if you link it to your Microsoft account. And yes, I’ve built a machine like this, and it runs FTN95 programs as fast as my current main machine that cost 8 times as much to build. (These are prices I got off a supplier website today).
In my experience, you need REAL*8 to reduce roundoff in just about any calculation, but REAL*4 can be used for data that you input. For example, if you wanted to define coordinates of anything on an Airbus A380, length 73 m, wingspan 80 m, you can do it to the nearest 1 mm using 5 significant figures, or to 0.1 mm with 6 – and REAL*4 is typically good for 7, roughly speaking. You don’t need REAL*8 for that. In fact, you could do all the data input and checking from a file in REAL*4.
The problem then arises that you can’t use CW+ in REAL*4 mode! You might be better off using INTEGER*4 coordinates with units of a micrometre, which would allow some operations to be carried out exactly, e.g. where particular rivets are positioned relative to the wingtip can be computed relative to the whole-aircraft coordinate system.
So my answer to Q1 is to use REAL*4 when you want to, and REAL*8 when you want to. Calculations in x87 are always done in REAL*10, and in SSE2 in REAL*8, with the usual reservations about padding out the input and rounding off the output from any calculation. If CW+’s use of REAL*8 bothers you, then input things like coordinates and so on in INTEGER*4. There’s a small overhead converting eventually to REAL, but anything with a human interface runs faster than the human can keep up.
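The INTEGER*4 coordinate idea above can be sketched as follows; the variable names and the micrometre unit are illustrative, not taken from any real application:

```fortran
program int_coords
  ! Sketch of storing a coordinate as INTEGER*4 micrometres and
  ! converting to REAL*8 only where an interface needs it.
  integer*4 :: x_um          ! coordinate in micrometres, exact
  real*8    :: x_m           ! same coordinate in metres, for output
  x_um = 80000000            ! 80 m wingspan, exactly representable
  x_m  = dble(x_um) * 1.0d-6 ! convert only at the interface
  print *, 'x =', x_m, 'm'
end program int_coords
```

INTEGER*4 spans roughly +/- 2.1e9, i.e. about +/- 2.1 km at 1 micrometre resolution, which is ample for an 80 m airframe, and integer offsets within that range are computed exactly.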
My answer to Q2 is to separate input, computation and output phases into three separate applications that run sequentially, storing intermediate results on file. I’ve had to do this in the past, even using different computers and transferring the intermediate results on punched paper tape.
My answer to the question you didn’t ask, is to supply a complete computer with any marketed software. At the above price it would be cheaper than answering one support call.
Eddie
PaulLaidler (Site Admin)
Joined: 21 Feb 2005 Posts: 7927 Location: Salford, UK
Posted: Tue Aug 11, 2020 11:56 am Post subject:
For 32-bit applications, FTN95 has single, double and extended precision reals, and you can program with any or all, or even switch from one to another via command-line options such as /DREAL.
The same is true for 64-bit FTN95 except that extended precision is not provided.
ClearWin+ provides an input/output interface where reals must be double precision. This is by design and makes the interface easier to write, maintain and document. However, it does not force you to use a double precision model; it only means that you must convert to double precision for the purposes of inputting and outputting real values.
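As an illustration of the pattern Paul describes (keep the model single precision, convert only at the interface boundary), here is a minimal sketch; the names are invented and the actual ClearWin+ calls are omitted so the fragment stands alone:

```fortran
program sp_model_dp_io
  ! The model stays REAL*4; only the value handed to a double
  ! precision interface is converted. Names are illustrative.
  real*4, allocatable :: model(:)
  real*8 :: display_value
  allocate (model(1000000))        ! 4 MB instead of 8 MB
  model = 1.5                      ! stand-in for a computed result
  display_value = dble(model(1))   ! convert at the I/O boundary only
  print *, display_value
end program sp_model_dp_io
```

The large array costs 4 bytes per element; only the handful of scalars crossing the interface are widened to 8 bytes.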
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Tue Aug 11, 2020 3:40 pm Post subject:
John-Silver wrote: | A 'typical' computer these days has maybe 4Gb of RAM installed |
Typical? Where? I am using 32 GB for computation and am about to go to 64 GB.
I do find that ClearWin+ can probably work within 8 GB for graphics, but that is probably because I do most computation in a separate program. Even a 4K screen would not require a lot of memory for a virtual image (64 MBytes?).
Changing from real*4 to real*8 involves doubling memory, which is not a big deal. I have not used real*4 for anything real in 40 years ! It is ten years at least since I had a 4 GB PC.
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Tue Aug 11, 2020 5:15 pm Post subject:
John (C), you really are a special case!
FTN95 works perfectly well for many applications in far less RAM than you have available, and although for decades I have subscribed to the bigger-is-better paradigm, it isn't necessarily so. There are myriad useful things you can do with a lot less.
As I pointed out, REAL*4 or even INTEGER*4 is good for inputting coordinates if you are short of memory, and I'll bet that you have never directly measured any quantity to REAL*8 precision. Or, for that matter, counted to more than 4 billion (although Mecej4 will tell us that the computer has to).
Eddie
mecej4
Joined: 31 Oct 2006 Posts: 1886
Posted: Tue Aug 11, 2020 6:45 pm Post subject:
Let us not forget that current PCs sold with 2, 4 or 8 GB RAM also come with an integrated display controller, which shares RAM with the CPU. That display controller may well use 2 to 4 GB unless the user sets a limit on how much the display driver is allowed to use.
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Thu Aug 13, 2020 12:41 pm Post subject:
Yes, I hadn't forgotten that, but then onboard video reserving a 5Gb memory space on a machine with only 2Gb of RAM tells us rather a lot about the genius of hardware manufacturers, doesn't it?
Perhaps we should rename RAM as Tardis Memory - bigger on the inside than the outside.
Eddie
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Thu Aug 13, 2020 3:17 pm Post subject:
Not sure where this 2, 4 or 8 GB concern about memory is going?
With /64, worrying about the last few MB is not really an issue today.
What can be a problem is declaring large arrays and not addressing them in a concurrent/sequential manner.
My latest approach to stack problems is to declare a stack size of 500 MB. All it uses is a virtual memory address range, and only the used stack is given a physical memory allocation. This works for threads also.
I don't see many PCs with less than 8 GB being sold.
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Fri Aug 14, 2020 8:37 pm Post subject:
Thanks for the feedback so far, lads.
My example wasn't an 'absolute' case; the logic is the same if you have 8 GB, as the array numbers can easily multiply by an order of magnitude and result in the same basic problem.
Also, there are a lot of budget PCs out there with 4 GB, which companies like Airbus would jump at, especially in the econo-covid environment we are tip-toeing into at the moment.
Big, and not so big, companies are notorious for going low-end when they have to 'update' their hardware, e.g. 5000 x 500 = a lot of spondoolicks, in whatever currency they like to buy in.
The company typically keeps them for a 10-year life expectancy before going through those cycles.
I wonder: have Silverfrost ever run any benchmarking concerning potential runtime & disk usage on some 'standard' programs (not the polywotsit ones, but ones devised internally) to gauge the real performance of FTN95 on matrix operations?
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Fri Aug 21, 2020 7:19 pm Post subject:
My last post had an interesting question at the bottom which appears somewhat 'hidden' ... I'll flush it out into the open here:
Quote: | I wonder: have Silverfrost ever run any benchmarking concerning potential runtime & disk usage on some 'standard' programs (not the polywotsit ones, but ones devised internally) to gauge the real performance of FTN95 on matrix operations? |
DanRRight
Joined: 10 Mar 2008 Posts: 2818 Location: South Pole, Antarctica
Posted: Wed Aug 26, 2020 10:11 pm Post subject:
About the speed of matrix operations John has mentioned: it would be good to know how fast FTN95 is here vs other compilers.
One thing with large matrices has annoyed me lately, and it is as simple as zeroing an array. It became so time consuming due to large array sizes that I spent some time removing the need to do it. But often zeroing is still needed.
So I have a question or request for you guys who have other compilers installed:
1) How does FTN95 compare to other compilers in this respect?
2) If all compilers are similarly slow, is it possible to parallelise this operation?
Here is the code to check. It creates and zeroes matrices from 1000x1000 to 100000x100000; the largest is 40 GB in size. The run takes 30-40 seconds to zero the largest 40 GB matrix. Compilation: ftn95 aaa.f95 /link /64 >z
Of course, when the matrix does not fit into RAM, swapping to the hard drive (SSD) will slow the runtime down further. But even when the matrix fits, it is still way too slow for large sizes >20 GB.
Code: | Real*4, allocatable :: A(:,:)
Integer :: i, j, i0, i1, icount_rate, icount_max
Real :: c
c = sqrt(10.)
do i = 6, 10
   j = nint(c**i)                  ! j = 1000, 3162, ..., 100000
   allocate (A(j,j))
   call system_clock(i0, icount_rate, icount_max)
   A(:,:) = 0                      ! the zeroing operation being timed
   call system_clock(i1, icount_rate, icount_max)
   print *, 'Dim., Size_MB, Time', j, 4.*j*j/1e6, real(i1-i0)/icount_rate
   deallocate (A)
end do
pause
end
! Compile: ftn95 aaa.f95 /link /64 >z |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Thu Aug 27, 2020 3:25 am Post subject:
Dan,
You could simply write " A = 0 " and let the compiler choose the best approach to zero the array.
However, I disagree with your comment.
For most practical uses, the time taken for "A = 0" is insignificant compared to using A for computation.
Qualifier : That is as long as you have sufficient physical memory to store A.
I do have an example of using an "A", much bigger than the physical memory, but less than the virtual memory size limit, where I only use small parts of A as a "virtual storage". This works very well, although "A=0" would defeat the process and page the array to disk.
FTN95 does not generate the fastest .exe. Use of /64 AVX functions can help.
FTN95 MATMUL has not been optimised for AVX or cache efficiency but does work. I do have some recent MATMUL tests using AVX if you are interested, although I've never used MATMUL in a production code.
DanRRight
Joined: 10 Mar 2008 Posts: 2818 Location: South Pole, Antarctica
Posted: Thu Aug 27, 2020 6:12 am Post subject:
John,
The data I use is block sparse, but when zeroing you do it for the entire matrix, typically dimensioned larger to fit all potential sizes. I do not know in advance which cells of the matrix will be filled and which will be empty. So the zeroing increased processing time in my case several times over.
Currently I zero only the block which was already processed, and hence I know its coordinates.
I am thinking there may be a way to zero large matrices using a parallel multi-core approach.
Last edited by DanRRight on Fri Aug 28, 2020 2:42 am; edited 1 time in total
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Thu Aug 27, 2020 10:35 am Post subject: Re:
DanRRight wrote: | The data I use is block sparse, but when zeroing you do it for the entire matrix, typically dimensioned larger to fit all potential sizes. I do not know in advance which cells of the matrix will be filled and which will be empty. So the zeroing increased processing time in my case several times over. |
Could you have a derived type array with allocatable blocks ?
You could test for the existence of each block and allocate/initialise each block as you go. The following may work as you describe.
Code: | module block_info
!
   TYPE block_array_record                    ! record for each block
      integer*4 :: block_size                 ! block_size
      integer*4, allocatable :: block(:,:)    ! (block_size,block_size) block data
   END TYPE block_array_record
!
   integer*4 :: max_blocks
   integer*4 :: num_blocks
!
   type (block_array_record), allocatable :: block_array_records(:)   ! (max_blocks)
!
end module block_info

   use block_info
!
   integer :: i, stat, k, step
   integer :: block_size, alloc_size, last_block
!
   alloc_size = 0
   max_blocks = 50000
   allocate ( block_array_records(max_blocks) )
   alloc_size = max_blocks
   do i = 1, max_blocks
      block_array_records(i)%block_size = -1
   end do
   num_blocks = 0
   do k = 1,2
      if ( k==1 ) step = 2000
      if ( k==2 ) step = 3000
      do i = 1, max_blocks, step
         block_size = 100 + mod(i,42) + k
         if ( block_array_records(i)%block_size == block_size ) cycle
!
         if ( block_array_records(i)%block_size > 0 ) then
            deallocate ( block_array_records(i)%block, stat=stat )
            write (*,*) 'Block',I,' released : stat=',stat
            num_blocks = num_blocks-1
         end if
!
         allocate ( block_array_records(i)%block(block_size,block_size), stat=stat )
         if ( stat /= 0 ) then
            write (*,*) 'Block',I,' could not be allocated : stat=',stat
            block_array_records(i)%block_size = -2
         else
            write (*,*) 'Block',I,' allocated with size=',block_size
            block_array_records(i)%block_size = block_size
            block_array_records(i)%block = 0
            num_blocks = num_blocks + 1
            last_block = i
            alloc_size = alloc_size + block_size**2
         end if
      end do ! i
   end do ! k
!
   write (*,*) num_blocks,' blocks allocated and initialised'
   write (*,*) last_block,' last block number'
   write (*,*) alloc_size,' integers allocated'
end |
DanRRight wrote: | I am thinking there may be a way to zero large matrices using a parallel multi-core approach. |
Using a multi-core approach for initialising a large shared array might not be a good idea. OpenMP is a shared-memory, multi-core approach where there is only a single memory source. I don't think this would run much faster, but perhaps the above would give dispersed memory arrays for each active block. With large arrays in OpenMP, memory can become the bottleneck.
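For Dan's question 2, a parallel zeroing loop can at least be sketched with OpenMP directives as below. This assumes an OpenMP-capable compiler (e.g. gfortran with -fopenmp) and is not a claim that FTN95 supports OpenMP; without OpenMP the directives are plain comments and the loop runs serially. As John notes, the operation is memory-bandwidth bound, so the speedup from extra cores may be small.

```fortran
program par_zero
  ! Sketch only: zero a REAL*4 array column-by-column in parallel.
  real*4, allocatable :: a(:,:)
  integer :: j
  allocate (a(4000,4000))          ! 64 MB, illustrative size
!$omp parallel do
  do j = 1, size(a,2)
     a(:,j) = 0.0                  ! each thread zeroes whole columns
  end do
!$omp end parallel do
  print *, 'max =', maxval(a)
end program par_zero
```

Zeroing by whole columns keeps each thread's writes contiguous in memory, which matters more here than the thread count.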
DanRRight
Joined: 10 Mar 2008 Posts: 2818 Location: South Pole, Antarctica
Posted: Fri Aug 28, 2020 10:35 pm Post subject:
Still, question #1 is whether FTN95 is lagging or not in this respect. So the question stays: does my test run at the same speed on Intel and gfortran?
If it lags, then this needs fixing, which will benefit all users. If not, then I will think about how to fix this using workarounds.
I have so far had bad experience with derived-type allocatable arrays. After years of preparation I spent a month or two on programming and debugging but failed. The code was crashing, freezing, not compiling. I just had no more time to find the problem.