forums.silverfrost.com Welcome to the Silverfrost forums
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sat May 23, 2020 11:01 am Post subject: Quest for Speed
Now that I’ve by and large retired from it, the quest for speed still fascinates and seduces me. There was a time when the battle was to get things to run at all, given the limited amount of memory in the average computer. I can well remember finding that a particular program could compile and run on one make of mainframe because it had an overlay system, and not on another because it didn’t. (And before anyone asks, 32k words of memory, or 96k in new money).
Algorithms were developed that made use of tape or disk storage to make up for lack of memory. They still work because of Fortran’s longevity, and far better than they did in the past because the modern PC disk is fast and SSDs even faster. It means for me that stuff I wrote 40 to 50 years ago runs better now than it ever did then, and I can run problems hundreds of times bigger just by increasing array sizes. More to the point, those overnight runs of the past are now more like instant.
My first PC had an 8086, and I could run things faster with an 8087 aboard, then I put in a V30 clock-doubled chip. I found that ran 1.8 times faster with no 8087, but only 1.2 times faster with it (although that 1.2x was faster than the 1.8x if you get my gist). I later discovered that clock-multiplied CPUs eventually ran out of steam with less and less improvement each time.
Of course, there are things that have stood the test of time, like choosing the right algorithm, nesting DO loops in the right order, and not printing results as you compute them but saving them to disk for printing later, if at all.
So now I have a multicore computer (actually I've had them for more than a decade), and frankly I don't see much improvement in the performance of my FTN95 programs the more cores I have.
I suppose that this is because FTN95 is, at heart, single-threaded.
As I am a committed reader of documentation, I came across START_THREAD@ and wondered if I could make some of my programs run multi-threaded. One question is how many logical processors there are on any particular PC. It seems that I can discover this outside my FTN95 program, but there doesn't seem to be a routine for it in the FTN95 library. Obviously, if one starts more threads than there are logical processors, the subdivision of the task can't be as efficient as when every thread gets going at once instead of waiting in a queue. It also occurs to me that some cores are going to be busy anyway, so maybe one should start fewer threads than there are logical processors. Let me imagine that I can start 8 threads. Allowing for the overhead of doing things this way, say that means things execute 6 times faster than single-threaded. That would be a huge benefit if the whole program could be run ab initio as those separate threads, or even if a large part of the program execution could be multithreaded.
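[Editor's note: discovering the logical-processor count outside Fortran is straightforward; here is a minimal sketch in Python, purely for illustration — FTN95 code later in this thread uses the library routine RecommendedProcessorCount@ for a comparable figure.]

```python
# Querying the number of logical processors from outside a Fortran program.
# (Illustrative Python; FTN95's own RecommendedProcessorCount@ routine,
# which appears later in this thread, reports a comparable figure.)
import os

logical = os.cpu_count()                 # logical processors the OS reports
print("logical processors:", logical)

# On Windows the same figure is also exposed as an environment variable,
# so even a batch file can read it before launching the program:
print(os.environ.get("NUMBER_OF_PROCESSORS", "not set (not Windows)"))
```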
The question I have, really, is does anyone on the Forum have experience of multithreading in an FTN95 program? And is it helpful?
Suppose I have a program that has 3 phases, A, B and C, with execution times respectively TA, TB and TC. Suppose now that TA=TB=TC, and that only phases A and C are readily divided into separate tasks for multi-threading without a huge amount of reprogramming. If I multithread one of them, say phase C, and thereby reduce its runtime to a sixth of TC, the program still runs in TA+TB+TC/6; if I multithread phase A as well, it runs in TB+(TA+TC)/6. With TA=TB=TC=T, those come to 2.17T and 1.33T respectively, against 3T single-threaded. Multithreading both A and C is very worthwhile; doing only one phase is perhaps still worthwhile but far from astonishing. And if TB is much bigger than TA or TC or both, the gains are smaller still, because the serial phase B sets a floor on the total runtime.
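[Editor's note: this phase arithmetic is Amdahl's law in miniature, and the figures are easy to check; a quick illustrative calculation in Python, with T standing for the common phase time.]

```python
# Three equal phases TA = TB = TC = T; multithreading shrinks a phase
# to one sixth, but any phase left serial limits the overall speedup.
T = 1.0
serial  = 3 * T                  # TA + TB + TC, single-threaded
c_only  = T + T + T / 6          # only phase C multithreaded
a_and_c = T + (T + T) / 6        # phases A and C multithreaded

print(round(serial / c_only, 2))   # speedup with C only: 1.38
print(round(serial / a_and_c, 2))  # speedup with A and C: 2.25
```

Even with two of the three phases running six times faster, the whole program speeds up by barely 2.25x — which is exactly the point about TB dominating.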
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sat May 23, 2020 11:02 am Post subject:
As always, a speed gain is very worthwhile if one values one's own time at nothing; once that cost enters the equation, what is worth doing may vary. For example, I have a backup computer with a dual-core processor, and its CPU, motherboard and RAM cost rather less than my hourly consulting rate (I had the rest already). It runs my single-threaded FTN95-compiled programs just as fast as my main computer, which has 16-thread capability and was assembled from equivalent components that cost my daily rate. (OK, I haven't costed the time to build them, nor to install Windows, but the cost of computers is nowhere near what it was in the mainframe days, when one might cost a lifetime's earnings or more, or the early PC days, when it still amounted to months.)
A point that still puzzles me is that by swapping my scratch drive from mechanical to SSD, for the price of 10 minutes' fee, I managed to double the speed of a disk-bound application; perhaps that was the most cost-effective step of all, because it took no programming effort.
Finally, I hear of huge gains from CUDA programming, but I imagine that to do that requires a huge investment in reprogramming.
Eddie
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7927 Location: Salford, UK
Posted: Sat May 23, 2020 12:14 pm Post subject:
Eddie
Have you seen the document called notes_on_parallel_processing.txt that is in the DOC subfolder, typically at C:\Program Files (x86)\Silverfrost\FTN95\DOC?
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sat May 23, 2020 2:20 pm Post subject:
Paul,
I remember reading it some time ago now that you mention it, and not finding it terribly useful (then). Has it increased in size and content recently? There's a lot more in it than I remember, including a lot more functions. I'm sure it will prove to contain some of the answers to my questions.
Or it may have been the mention of 64-bit that made me not realise its significance. I'm afraid that 64-bit* makes my eyes glaze over, as to be frank, once you transfer stuff from a 32k word mainframe to a reasonably modern PC even the 32-bit Windows space is as vast as the solar system, and the contemplation of interstellar space may be left to others! Do the routines work with 32-bit? (My attempt to do that with the Mandelbrot example suggests not.)
I had tried the multiple applications route before, including using multiple computers, and came to the conclusion that I saved the programming time by buying a faster computer (I build my own from components). In the past I've been put off SSDs because they wear out, but I have discovered that Windows wears out with incompatible upgrades, CDs wear out with time, cars get rusty, and nothing is forever. The speedup with something developed for mag tapes or 8 to 10Mb hard drives on a PC with an SSD is something to marvel at.
I think that I need to go and study the document in detail. Thanks for pointing it out. What about putting on the top page of the website that FTN95 contains all the tools needed to exploit all those cores in your PC?
Eddie
* And I retain an affection for the x87 that may well be misplaced.
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7927 Location: Salford, UK
Posted: Sat May 23, 2020 2:54 pm Post subject:
Eddie
This feature is only available for 64-bit executables built using FTN95.
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sat May 23, 2020 3:12 pm Post subject:
I came to the conclusion that the remaining years of my life are far too short a time to understand the Mandelbrot example, so I focused on the other examples. Hmmm. After a certain amount of experimentation, I also came to the conclusion that I didn’t understand the second example, either.
And none of the examples contains START_THREAD@. I think that I'd better do my own experimentation.
At least I've got several more weeks in lockdown!
Eddie
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Sun May 24, 2020 5:03 pm Post subject:
And to add to my confusion, I'm not entirely sure that START_THREAD@ actually does anything. It certainly doesn't work for me: it doesn't appear to start the subroutine, and it can't make up its mind whether to raise an exception or not.
Eddie
John-Silver
Joined: 30 Jul 2013 Posts: 1520 Location: Aerospace Valley
Posted: Fri Jun 12, 2020 10:26 am Post subject:
Eddie, not surprising, as the examples are not multi-threading techniques but multi-tasking!
(Same aim, different name/concept, I guess.)
See https://silverfrost.com/19/ftn95/support/ftn95_revision_history.aspx and scroll down to the first entry under the V8.3 changes, where you'll see the statement mentioned together with the file Paul quotes.
_________________
''Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together cocking up even the simplest calculation ... ''
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2388 Location: Yateley, Hants, UK
Posted: Fri Jun 12, 2020 11:36 am Post subject:
John,
Thanks for your comment, and actually I don't understand the difference. My wish is to understand concepts even if I don't actually employ them, and if possible to get a worthwhile gain for minimal pain.
Eddie
Kenneth_Smith
Joined: 18 May 2012 Posts: 697 Location: Hamilton, Lanarkshire, Scotland.
Posted: Sun Jun 14, 2020 4:25 pm Post subject:
Eddie’s post prompted me to look again at the notes on parallel processing – it was a miserable wet day here in Scotland. I was thinking about the possibility of inserting some parallel tasks within some serial code and below is a slightly modified version of Example 2.
There is a print statement in the code before any parallel processes are initiated, yet that print statement is executed a number of times equal to the number of processes invoked later in the code.
The print statement at the end of the code, after the parallel processes are killed, executes correctly only once.
You can change the number of processors used by changing the declaration of np – this demonstrates the speed increase with np = 1, 2, 4, 8 etc.
If you uncomment the do loop which varies the number of processors – be prepared to kill the executable via the Task Manager!
At the moment, I don’t understand what’s happening.
Ken
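[Editor's note: Ken's surprise — a print statement before the parallel section executing NP times — is what happens in any SPMD (single program, multiple data) scheme: every process runs the same program from the top, and only a run-time task ID separates master from slaves. A hypothetical analogy in Python, not FTN95; the function and variable names here are invented for illustration.]

```python
# SPMD analogy: every process executes the SAME code; an ID obtained at
# run time tells each copy whether it is the master or a slave.  Anything
# placed before the ID test therefore runs once per process.
import multiprocessing as mp

def program(ID, np_, results):
    print("executed by every process")      # appears np_ times in total
    if ID == 0:
        print("master-only setup")          # like testing IsSlaveProcess@()
    results.put(ID * ID)                    # each process does its share

if __name__ == "__main__":
    np_ = 4
    results = mp.Queue()
    slaves = [mp.Process(target=program, args=(i, np_, results))
              for i in range(1, np_)]
    for s in slaves:
        s.start()
    program(0, np_, results)                # the master runs the same code
    total = sum(results.get() for _ in range(np_))
    for s in slaves:
        s.join()
    print("sum of ID squares:", total)      # 0 + 1 + 4 + 9 = 14
```

The master-only branch is the analogue of guarding the initial serial code, as Ken does with IsSlaveProcess@() in his second listing.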
Kenneth_Smith
Joined: 18 May 2012 Posts: 697 Location: Hamilton, Lanarkshire, Scotland.
Posted: Sun Jun 14, 2020 4:26 pm Post subject:
Code:
implicit none !##
INCLUDE <windows.ins>
DOUBLE PRECISION start_time,end_time,sum
double precision duration, sum1 !##
DOUBLE PRECISION,allocatable::partial_answer(:)
INTEGER(kind=4) ID
INTEGER(kind=4) k
integer(kind=4) :: np = 2, i, j
print*, 'This print statement is executed NP times'
!$$$$$$ do np = 2, RecommendedProcessorCount@(.true.), 2
!>> Start np-1 additional tasks. ID will be returned thus:
!>> Master task ID=0
!>> Slave task ID=1,2,3 in the different processes
ID=GetParallelTaskID@(np-1) !##
IF(ID .eq. 0) print*, 'Number of processors', np
!>> Allocate a shared array. SHARENAME couples the ALLOCATE to the parallel task mechanism
ALLOCATE(partial_answer(np),SHARENAME="shared_stuff")
CALL TaskSynchronise@()
!>> Time the task using wall clock elapsed time
CALL dclock@(start_time)
sum=0d0
!>> All np processes compute the sum in an interleaved fashion
k = 10000000000_4 - ID
WHILE(k > 0)DO
sum = sum + k
k = k - np
ENDWHILE
!>> Copy the partial sum into the array shared between the processes
partial_answer(ID+1)=sum
CALL TaskSynchronise@()
CALL dclock@(end_time)
IF(ID==0)THEN
!>> We are the master task, so print out the results and the timing
sum1 = 0.d0
do i = 1, np
sum1 = sum1 + partial_answer(i)
end do
PRINT *,"Sum=",sum1
duration=end_time-start_time
PRINT *,"Parallel computation time = ",duration
ENDIF
CALL TaskSynchronise@()
!>> Kill off the slave process
IF(ID .ne. 0) STOP
DEALLOCATE(partial_answer)
!$$$$$$ end do
print*, 'This print statement is executed once'
END PROGRAM
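[Editor's note: on the interleaved summation in the listing above — each process starts at N − ID and steps down by np, so the processes between them visit every integer in 1..N exactly once. A quick check of that decomposition, in illustrative Python mirroring the WHILE loop.]

```python
# Process ID sums N-ID, N-ID-np, N-ID-2*np, ...: exactly the integers in
# 1..N congruent to (N - ID) mod np.  The np residue classes are disjoint
# and cover 1..N, so the partial sums add up to N*(N+1)/2 with no overlap.
N, NP = 1000, 4

def partial_sum(ID):
    s, k = 0, N - ID
    while k > 0:
        s += k
        k -= NP
    return s

total = sum(partial_sum(ID) for ID in range(NP))
print(total, N * (N + 1) // 2)    # both 500500: the partition is exact
```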
Kenneth_Smith
Joined: 18 May 2012 Posts: 697 Location: Hamilton, Lanarkshire, Scotland.
Posted: Sun Jun 14, 2020 11:41 pm Post subject:
I've made a little bit of progress, but I think it's fair to say that I am well out of my comfort zone! The code is certainly not behaving the way I envisaged it would.
A possible topic for your next video, Paul?
Code:
program main
implicit none !##
INCLUDE <windows.ins>
DOUBLE PRECISION start_time,end_time,sum
double precision duration, sum1 !##
DOUBLE PRECISION,allocatable::partial_answer(:)
INTEGER(kind=4) ID
INTEGER(kind=4) k
integer(kind=4) :: np=4, i, j
IF( .not. IsSlaveProcess@()) THEN
! Initial serial commands go here
call set_parameters(np)
ENDIF
!>> Start np-1 additional tasks. ID will be returned thus:
!>> Master task ID=0
!>> Slave task ID=1,2,3 in the different processes
ID=GetParallelTaskID@(np-1) !##
IF(ID .eq. 0) print*, 'Number of processors', np
!>> Allocate a shared array. SHARENAME couples the ALLOCATE to the parallel task mechanism
ALLOCATE(partial_answer(np),SHARENAME="shared_stuff")
CALL TaskSynchronise@()
!>> Time the task using wall clock elapsed time
CALL dclock@(start_time)
sum=0d0
!>> All np processes compute the sum in an interleaved fashion
k = 10000000000_4 - ID
WHILE(k > 0)DO
sum = sum + k
k = k - np
ENDWHILE
!>> Copy the partial sum into the array shared between the processes
partial_answer(ID+1)=sum
CALL TaskSynchronise@()
CALL dclock@(end_time)
IF(ID==0)THEN
!>> We are the master task, so print out the results and the timing
sum1 = 0.d0
do i = 1, np
sum1 = sum1 + partial_answer(i)
end do
PRINT *,"Sum=",sum1
duration=end_time-start_time
PRINT *,"Parallel computation time = ",duration
ENDIF
CALL TaskSynchronise@()
!>> Kill off the slave process
IF(ID .ne. 0) STOP
DEALLOCATE(partial_answer)
END PROGRAM
subroutine set_parameters(np)
implicit none
integer(kind=4), intent(out) :: np
10 write(6,*)
write(6,*) 'Enter number of processors to use'
read(5,*) np
if (np .lt. 1) goto 10
end subroutine set_parameters
PaulLaidler Site Admin
Joined: 21 Feb 2005 Posts: 7927 Location: Salford, UK
Posted: Mon Jun 15, 2020 7:37 am Post subject:
Thanks. I am preparing a video about %pl. After that I don't know.
Kenneth_Smith
Joined: 18 May 2012 Posts: 697 Location: Hamilton, Lanarkshire, Scotland.
Posted: Mon Jun 15, 2020 11:22 am Post subject:
Paul,
I think what is needed is a better explanation of the flow of control through the parallel programs.
I assumed that it was something like this: a block of serial code executes first to set up the task, then N parallel processes start, the results are pulled together, the slave processes are killed, and the results are output.
For example, say we have a do loop within an existing serial program that executes 8000 times calling a function, where the result of the function call at trip i=x round the loop has no dependency on earlier trips at i<x. My thought was that at the do loop we can switch to parallel processing, so the loop body now executes 1000 times in each of 8 processes running in parallel.
Looking at Examples 1 and 2 with my preconditioned "serial" thinking suggested this is possible. What is not clear from Examples 1 and 2 is that all the parallel processes run all the code. The print statement at the beginning of my first example told me that, which was rather mind-blowing! Why should the print statement be executed N times when it occurs in the serial code before the parallel processes are kicked off using GetParallelTaskID@?
Example 3 does point towards the idea of all the processes running all the code; the test with IsSlaveProcess@() at the beginning shows this, so it is possible to create a serial path through the initial code for the master process only. So my initial thought of what might be possible is indeed possible, but to get it to work I have to forget decades of "serial" thinking.
Example 3 produces a nice picture, but I don't really understand what the calculation is doing, and find it difficult to relate to.
I will not give up. I still want to learn, and when the Windows Task Manager sits at 25% running some code I am reminded that there is the opportunity to do much better performance-wise. Some more pointers on how to get there are needed – I don't like doing a hard reset on my machine when N processes themselves each launch N processes, etc.
So what did I learn yesterday? Think parallel from the beginning of writing the code.
I am sure any hints, tips, and/or examples that could be provided would be appreciated by all.
Ken
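[Editor's note: Ken's motivating case — a do loop of 8000 independent trips shared among 8 workers, each doing 1000 — maps directly onto a worker-pool pattern. A minimal sketch in hypothetical Python, since the FTN95 task mechanism differs in detail; f and the counts are invented for illustration.]

```python
# Splitting a loop of independent iterations across worker processes.
# This is only valid when, as Ken requires, trip i depends on no result
# from an earlier trip; the 8000 calls are then shared among 8 workers.
import multiprocessing as mp

def f(i):
    return i * i              # stand-in for the real per-trip computation

if __name__ == "__main__":
    with mp.Pool(8) as pool:
        # chunksize=1000 hands each worker a contiguous block of 1000 trips
        results = pool.map(f, range(8000), chunksize=1000)
    print(len(results), results[:3])    # order preserved: 8000 [0, 1, 4]
```

The pool does for free what the shared partial_answer array does in the FTN95 examples: it collects each worker's share back into one ordered result.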