forums.silverfrost.com Welcome to the Silverfrost forums
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Tue Aug 11, 2020 6:54 am Post subject: How to cRAM as much computing power as possible into a progr
How to cRAM as much computing power as possible into a program for use from a 'normal' PC
(damn that post title length limitation of 60 characters !)
A short discussion here: http://forums.silverfrost.com/viewtopic.php?t=4290
between Bill and Eddie has prompted me to ask a question which has been lingering in the back of my mind for a long time.
Imagine you have an array of, say, 10000 x 10000 double precision data, which needs 0.8 GB of free RAM to store.
A 'typical' computer these days has maybe 4 GB of RAM installed, but 2 GB is already eaten up by that 'frugal' software called M$ Windows, leaving 2 GB 'free'.
But of course 'free' is not always 'free', and a user will often have many programs on the go at the same time (browser, Word, Excel, PowerPoint, etc. ...).
So even getting the nominal 0.8 GB of memory to run the specific program may be problematic for the typical user ! ... leading to virtual memory use (disk paging).
ClearWin+ variables need to be double precision !
So, if you want to plot a large amount of the data, that adds another dimension to the problem.
So, 2 questions:
1) Why can't FTN95, and especially ClearWin+, have an option to switch back and forth between single and double precision? Many applications won't actually NEED double precision, and a single precision version would effectively double the memory available.
2) How best can a program with such typical size (or larger) arrays be structured so that the RAM limitations of a machine are not exceeded, as far as possible?
(Eddie's suggestion in the previous post to simply take the 'easy route' and put more physical memory in the machine might not go down too well with a customer !) _________________ ''Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together cocking up even the simplest calculation ... "
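For reference, the arithmetic behind the 0.8 GB figure, written out as a small sketch (this program is editorial illustration, not code from the thread):

```fortran
program array_storage
  ! Storage needed for an n x n matrix at two precisions (illustrative).
  integer, parameter :: n = 10000
  print *, 'REAL*8 storage (GB):', 8.0 * real(n) * real(n) / 1.0e9   ! 0.8
  print *, 'REAL*4 storage (GB):', 4.0 * real(n) * real(n) / 1.0e9   ! 0.4
end program array_storage
```

Halving the element size halves the footprint, which is exactly the saving question 1 is after.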
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Tue Aug 11, 2020 11:38 am Post subject:
John,
It’s a challenge. The first thing to note is that if you are running a program that needs a 10k x 10k REAL*8 array, you probably aren’t running other memory-demanding applications at the same time; indeed, if you think you can, you will probably make the computer slow to a crawl. If you need more memory, you can always compile in 64-bit mode, and then you have access to more.
My suggestion is that if you employ computer models of that size, you probably will never understand what they are telling you, so where I’ve seen them used it’s always by someone who hasn’t got a clue either about what they are doing, or, even more probably, about what limitations their answers have.
If you don’t like the idea of buying more RAM, then build a computer to do those analyses and let it run in the background. It isn’t expensive. Since FTN95 only uses 1 core, a twin core machine should do nicely. Here’s the build list:
Case, Deepcool Smarter £19.99
Motherboard, Asrock A320M-DVS £44.18
CPU, AMD Athlon 3000G £45.49
RAM, 16Gb Corsair DDR4 2400 £60.98
SSD 256Gb Verbatim £24.98
PSU 400W CoolerMaster £32.99
Total £228.61 + carriage, say another £5. You can install Windows 10 for free if you link it to your Microsoft account. And yes, I’ve built a machine like this, and it runs FTN95 programs as fast as my current main machine that cost 8 times as much to build. (These are prices I got off a supplier website today).
In my experience, you need REAL*8 to reduce roundoff in just about any calculation, but REAL*4 can be used for data that you input. For example, if you wanted to define coordinates of anything on an Airbus A380, length 73 m, wingspan 80 m, you can do it to the nearest 1 mm using 5 significant figures, or to 0.1 mm with 6 – and REAL*4 is typically good for 7, roughly speaking. You don’t need REAL*8 for that. In fact, you could do all the data input and checking from a file in REAL*4.
The problem then arises that you can’t use CW+ in REAL*4 mode! You might be better off using INTEGER*4 coordinates with units of a micrometre, which would allow some operations to be carried out exactly, e.g. where particular rivets are positioned relative to the wingtip can be computed relative to the whole-aircraft coordinate system.
So my answer to Q1 is to use REAL*4 when you want to, and REAL*8 when you want to. Calculations in x87 are always done in REAL*10, and in SSE2 in REAL*8, with the usual reservations about padding out the input and rounding off the output from any calculation. If CW+’s use of REAL*8 bothers you, then input things like coordinates and so on in INTEGER*4. There’s a small overhead converting eventually to REAL, but anything with a human interface runs faster than the human can keep up.
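The INTEGER*4 coordinate idea above can be sketched as follows; the variable names and the micrometre unit are illustrative, not taken from any real application:

```fortran
program int_coords
  ! Sketch of storing a coordinate as INTEGER*4 micrometres and
  ! converting to REAL*8 only where an interface needs it.
  integer*4 :: x_um          ! coordinate in micrometres, exact
  real*8    :: x_m           ! same coordinate in metres, for output
  x_um = 80000000            ! 80 m wingspan, exactly representable
  x_m  = dble(x_um) * 1.0d-6 ! convert only at the interface
  print *, 'x =', x_m, 'm'
end program int_coords
```

INTEGER*4 spans roughly +/- 2.1e9, i.e. about +/- 2.1 km at 1 micrometre resolution, which is ample for an 80 m airframe, and integer offsets within that range are computed exactly.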
My answer to Q2 is to separate input, computation and output phases into three separate applications that run sequentially, storing intermediate results on file. I’ve had to do this in the past, even using different computers and transferring the intermediate results on punched paper tape.
My answer to the question you didn’t ask, is to supply a complete computer with any marketed software. At the above price it would be cheaper than answering one support call.
Eddie
PaulLaidler (Site Admin)
Joined: 21 Feb 2005 Posts: 7927 Location: Salford, UK
Posted: Tue Aug 11, 2020 11:56 am Post subject:
For 32-bit applications, FTN95 has single, double and extended precision reals, and you can program with any or all, or even switch from one to another via command-line options such as /DREAL.
The same is true for 64-bit FTN95 except that extended precision is not provided.
ClearWin+ provides an input/output interface where reals must be double precision. This is by design and makes the interface easier to write, maintain and document. However, it does not force you to use a double precision model; it only means that you must convert to double precision for the purposes of inputting and outputting real values.
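As an illustration of the pattern Paul describes (keep the model single precision, convert only at the interface boundary), here is a minimal sketch; the names are invented and the actual ClearWin+ calls are omitted so the fragment stands alone:

```fortran
program sp_model_dp_io
  ! The model stays REAL*4; only the value handed to a double
  ! precision interface is converted. Names are illustrative.
  real*4, allocatable :: model(:)
  real*8 :: display_value
  allocate (model(1000000))        ! 4 MB instead of 8 MB
  model = 1.5                      ! stand-in for a computed result
  display_value = dble(model(1))   ! convert at the I/O boundary only
  print *, display_value
end program sp_model_dp_io
```

The large array costs 4 bytes per element; only the handful of scalars crossing the interface are widened to 8 bytes.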
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Tue Aug 11, 2020 3:40 pm Post subject:
John-Silver wrote: | A 'typical' computer these days has maybe 4Gb of RAM installed |
Typical? Where? I am using 32 GB for computation and am about to go to 64 GB.
I do find that ClearWin+ can probably work within 8 GB for graphics, but that is probably because I do most computation in a separate program. Even a 4K screen would not require a lot of memory for a virtual image (64 MBytes?).
Changing from real*4 to real*8 involves doubling memory, which is not a big deal. I have not used real*4 for anything real in 40 years ! It is ten years at least since I had a 4 GB PC.
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Tue Aug 11, 2020 5:15 pm Post subject:
John (C), you really are a special case!
FTN95 works perfectly well for many applications in far less RAM than you have available, and although for decades I have subscribed to the bigger-is-better paradigm, it isn't necessarily so. There are myriad useful things you can do with a lot less.
As I pointed out, REAL*4 or even INTEGER*4 is good for inputting coordinates if you are short of memory, and I'll bet that you have never directly measured any quantity to REAL*8 precision. Or, for that matter, counted to more than 4 billion (although Mecej4 will tell us that the computer has to).
Eddie
mecej4
Joined: 31 Oct 2006 Posts: 1886
Posted: Tue Aug 11, 2020 6:45 pm Post subject:
Let us not forget that current PCs sold with 2, 4 or 8 GB RAM also come with an integrated display controller, which shares RAM with the CPU. That display controller may well use 2 to 4 GB unless the user sets a limit on how much the display driver is allowed to use.
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Thu Aug 13, 2020 12:41 pm Post subject:
Yes, I hadn't forgotten that, but then onboard video reserving a 5Gb memory space on a machine with only 2Gb of RAM tells us rather a lot about the genius of hardware manufacturers, doesn't it?
Perhaps we should rename RAM as Tardis Memory - bigger on the inside than the outside.
Eddie
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Thu Aug 13, 2020 3:17 pm Post subject:
Not sure where this 2, 4 or 8 GB concern about memory is going?
With /64, worrying about the last few MB is not really an issue today.
What can be a problem is declaring large arrays and not addressing them in a concurrent/sequential manner.
My latest approach to stack problems is to declare a stack size of 500 MB. All it uses is a virtual memory address range, and only the used stack is given a physical memory allocation. This works for threads also.
I don't see many PCs with less than 8 GB being sold.
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Fri Aug 14, 2020 8:37 pm Post subject:
Thanks for the feedback so far, lads.
My example wasn't an 'absolute' case; the logic is the same if you have 8 GB, as the array numbers can easily multiply by an order of magnitude and result in the same basic problem.
Also, there are a lot of budget PCs out there with 4 GB, which companies like Airbus would jump at, especially in the econo-covid environment we are tip-toeing into at the moment.
Big, and not so big, companies are notorious for going low-end when they have to 'update' their hardware, e.g. 5000 x 500 = a lot of spondoolicks, in whatever currency they like to buy in.
The company typically keeps them for a 10-year life expectancy before going through those cycles.
I wonder: have Silverfrost ever run any benchmarking concerning potential runtime & disk usage on some 'standard' programs (not the polywotsit ones, but ones devised internally) to gauge the real performance of FTN95 on matrix operations?
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Fri Aug 21, 2020 7:19 pm Post subject:
My last post had an interesting question at the bottom which appears somewhat 'hidden' ... I'll flush it out into the open here:
Quote: | I wonder: have Silverfrost ever run any benchmarking concerning potential runtime & disk usage on some 'standard' programs (not the polywotsit ones, but ones devised internally) to gauge the real performance of FTN95 on matrix operations? |
DanRRight
Joined: 10 Mar 2008 Posts: 2818 Location: South Pole, Antarctica
Posted: Wed Aug 26, 2020 10:11 pm Post subject:
About the speed of matrix operations John has mentioned: it would be good to know how fast FTN95 is here vs other compilers.
One thing with large matrices has annoyed me lately, and it is as simple as zeroing an array. It became so time consuming due to large array sizes that I spent some time removing the need to do it. But often zeroing is still needed.
So I have a question or request for you guys who have other compilers installed:
1) How does FTN95 compare to other compilers in this respect?
2) If all compilers are similarly slow, is it possible to parallelise this operation?
Here is the code to check. It creates and zeroes matrices from 1000x1000 to 100000x100000; the largest is 40 GB in size. The run takes 30-40 seconds to zero the largest 40 GB matrix. Compilation: ftn95 aaa.f95 /link /64 >z
Of course, when the matrix does not fit into RAM, swapping to the hard drive (SSD) will slow the runtime down further. But even when the matrix fits, it is still way too slow for large sizes >20 GB.
Code: | Real*4, allocatable :: A(:,:)
Integer :: i, j, i0, i1, icount_rate, icount_max
Real :: c
c = sqrt(10.)
do i = 6, 10
   j = nint(c**i)                  ! j = 1000, 3162, ..., 100000
   allocate (A(j,j))
   call system_clock(i0, icount_rate, icount_max)
   A(:,:) = 0                      ! the zeroing operation being timed
   call system_clock(i1, icount_rate, icount_max)
   print *, 'Dim., Size_MB, Time', j, 4.*j*j/1e6, real(i1-i0)/icount_rate
   deallocate (A)
end do
pause
end
! Compile: ftn95 aaa.f95 /link /64 >z |
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Thu Aug 27, 2020 3:25 am Post subject:
Dan,
You could simply write " A = 0 " and let the compiler choose the best approach to zero the array.
However, I disagree with your comment.
For most practical uses, the time taken for "A = 0" is insignificant compared to using A for computation.
Qualifier : That is as long as you have sufficient physical memory to store A.
I do have an example of using an "A", much bigger than the physical memory, but less than the virtual memory size limit, where I only use small parts of A as a "virtual storage". This works very well, although "A=0" would defeat the process and page the array to disk.
FTN95 does not generate the fastest .exe. Use of /64 AVX functions can help.
FTN95 MATMUL has not been optimised for AVX or cache efficiency but does work. I do have some recent MATMUL tests using AVX if you are interested, although I've never used MATMUL in a production code.
DanRRight
Joined: 10 Mar 2008 Posts: 2818 Location: South Pole, Antarctica
Posted: Thu Aug 27, 2020 6:12 am Post subject:
John,
The data I use is block sparse, but when zeroing you do it for the entire matrix, typically dimensioned larger to fit all potential sizes. I do not know in advance which cells of the matrix will be filled and which will be empty. So the zeroing increased processing time in my case several times over.
Currently I zero only the block which was already processed, and hence I know its coordinates.
I am thinking there may be a way to zero large matrices using a parallel multi-core approach.
Last edited by DanRRight on Fri Aug 28, 2020 2:42 am; edited 1 time in total
JohnCampbell
Joined: 16 Feb 2006 Posts: 2554 Location: Sydney
Posted: Thu Aug 27, 2020 10:35 am Post subject: Re:
DanRRight wrote: | The data I use is block sparse, but when zeroing you do it for the entire matrix, typically dimensioned larger to fit all potential sizes. I do not know in advance which cells of the matrix will be filled and which will be empty. So the zeroing increased processing time in my case several times over. |
Could you have a derived type array with allocatable blocks ?
You could test for the existence of each block and allocate/initialise each block as you go. The following may work as you describe.
Code: | module block_info
!
   TYPE block_array_record                    ! record for each block
      integer*4 :: block_size                 ! block_size
      integer*4, allocatable :: block(:,:)    ! (block_size,block_size) block data
   END TYPE block_array_record
!
   integer*4 :: max_blocks
   integer*4 :: num_blocks
!
   type (block_array_record), allocatable :: block_array_records(:)   ! (max_blocks)
!
end module block_info

   use block_info
!
   integer :: i, stat, k, step
   integer :: block_size, alloc_size, last_block
!
   alloc_size = 0
   max_blocks = 50000
   allocate ( block_array_records(max_blocks) )
   alloc_size = max_blocks
   do i = 1, max_blocks
      block_array_records(i)%block_size = -1
   end do
   num_blocks = 0
   do k = 1,2
      if ( k==1 ) step = 2000
      if ( k==2 ) step = 3000
      do i = 1, max_blocks, step
         block_size = 100 + mod(i,42) + k
         if ( block_array_records(i)%block_size == block_size ) cycle
!
         if ( block_array_records(i)%block_size > 0 ) then
            deallocate ( block_array_records(i)%block, stat=stat )
            write (*,*) 'Block',I,' released : stat=',stat
            num_blocks = num_blocks-1
         end if
!
         allocate ( block_array_records(i)%block(block_size,block_size), stat=stat )
         if ( stat /= 0 ) then
            write (*,*) 'Block',I,' could not be allocated : stat=',stat
            block_array_records(i)%block_size = -2
         else
            write (*,*) 'Block',I,' allocated with size=',block_size
            block_array_records(i)%block_size = block_size
            block_array_records(i)%block = 0
            num_blocks = num_blocks + 1
            last_block = i
            alloc_size = alloc_size + block_size**2
         end if
      end do ! i
   end do ! k
!
   write (*,*) num_blocks,' blocks allocated and initialised'
   write (*,*) last_block,' last block number'
   write (*,*) alloc_size,' integers allocated'
end |
DanRRight wrote: | I am thinking there may be a way to zero large matrices using a parallel multi-core approach. |
Using a multi-core approach for initialising a large shared array might not be a good idea. OpenMP is a shared-memory, multi-core approach where there is only a single memory source. I don't think this would run much faster, but perhaps the above would give dispersed memory arrays for each active block. With large arrays in OpenMP, memory can become the bottleneck.
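For Dan's question 2, a parallel zeroing loop can at least be sketched with OpenMP directives as below. This assumes an OpenMP-capable compiler (e.g. gfortran with -fopenmp) and is not a claim that FTN95 supports OpenMP; without OpenMP the directives are plain comments and the loop runs serially. As John notes, the operation is memory-bandwidth bound, so the speedup from extra cores may be small.

```fortran
program par_zero
  ! Sketch only: zero a REAL*4 array column-by-column in parallel.
  real*4, allocatable :: a(:,:)
  integer :: j
  allocate (a(4000,4000))          ! 64 MB, illustrative size
!$omp parallel do
  do j = 1, size(a,2)
     a(:,j) = 0.0                  ! each thread zeroes whole columns
  end do
!$omp end parallel do
  print *, 'max =', maxval(a)
end program par_zero
```

Zeroing by whole columns keeps each thread's writes contiguous in memory, which matters more here than the thread count.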
DanRRight
Joined: 10 Mar 2008 Posts: 2818 Location: South Pole, Antarctica
Posted: Fri Aug 28, 2020 10:35 pm Post subject:
Still, question #1 is whether FTN95 is lagging or not in this respect. So the question stays: does my test run at the same speed on Intel and gfortran?
If it lags, then this needs fixing, which will benefit all users. If not, then I will think about how to fix this using workarounds.
I have so far had bad experience with derived-type allocatable arrays. After years of preparation I spent a month or two on programming and debugging but failed. The code was crashing, freezing, not compiling. I just had no more time to find the problem.