
Quest for Speed

 
LitusSaxonicum
Joined: 23 Aug 2005, Posts: 2388, Location: Yateley, Hants, UK
Posted: Sat May 23, 2020 11:01 am    Post subject: Quest for Speed

Now that I’ve by and large retired from it, the quest for speed still fascinates and seduces me. There was a time when the battle was to get things to run at all, given the limited amount of memory in the average computer. I can well remember finding that a particular program could compile and run on one make of mainframe because it had an overlay system, and not on another because it didn’t. (And before anyone asks, 32k words of memory, or 96k in new money).
Algorithms were developed that made use of tape or disk storage to make up for lack of memory. They still work because of Fortran’s longevity, and far better than they did in the past because the modern PC disk is fast and SSDs even faster. It means for me that stuff I wrote 40 to 50 years ago runs better now than it ever did then, and I can run problems hundreds of times bigger just by increasing array sizes. More to the point, those overnight runs of the past are now more like instant.
My first PC had an 8086, and I could run things faster with an 8087 aboard, then I put in a V30 clock-doubled chip. I found that ran 1.8 times faster with no 8087, but only 1.2 times faster with it (although that 1.2x was faster than the 1.8x if you get my gist). I later discovered that clock-multiplied CPUs eventually ran out of steam with less and less improvement each time.
Of course, there were things that have stood the test of time, like choosing the right algorithm, nesting DO loops in the right order and so on, and things like not printing when you got something but saving it on disk for printing later, if at all.
So now I have a multicore computer (actually I’ve had them for more than a decade), and frankly I don’t see much improvement in the performance of my FTN95 programs the more cores I have.
I suppose that this is because FTN95 is, at heart, single-threaded.
As I am a committed reader of documentation, I came across START_THREAD@ and wondered if I could make some of my programs run multi-threaded. One question is how many logical processors there are on any particular PC. It seems that I can discover this outside my FTN95 program, but there doesn’t seem to be a routine for it in the FTN95 library. Obviously, if one starts more threads than there are logical processors, some threads must wait in a queue rather than getting going at once, so the subdivision of the task can’t be as efficient. Also, it occurs to me that some cores are going to be busy anyway, so maybe one should start with fewer threads than there are logical processors. Let me imagine that I can start 8 threads. Allowing for the overhead of doing things this way, say that means things might execute 6 times faster than single-threaded. That would be a huge benefit if the whole program could be run ab initio as those separate threads, or even if a large part of the program execution could be multithreaded.
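(As an illustrative aside, here is a minimal sketch of reading that count from inside a program. It assumes the compiler accepts the standard GET_ENVIRONMENT_VARIABLE intrinsic and that Windows has set NUMBER_OF_PROCESSORS; 64-bit code could instead call the RecommendedProcessorCount@ function that appears later in this thread.)

Code:

      PROGRAM count_processors
      IMPLICIT NONE
      CHARACTER(LEN=8) buffer
      INTEGER length, status, ncpu
!     Windows normally sets NUMBER_OF_PROCESSORS to the logical processor count
      CALL GET_ENVIRONMENT_VARIABLE('NUMBER_OF_PROCESSORS', buffer, length, status)
      IF (status == 0 .AND. length > 0) THEN
        READ (buffer,*) ncpu
        PRINT *, 'Logical processors reported by Windows: ', ncpu
      ELSE
        PRINT *, 'NUMBER_OF_PROCESSORS is not set'
      END IF
      END PROGRAM count_processors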
The question I have, really, is: does anyone on the Forum have experience of multithreading in an FTN95 program? And is it helpful?
Suppose I have a program that has 3 phases, A, B and C, with execution times TA, TB and TC respectively. Suppose now that TA=TB=TC, and that only phases A and C are readily divided into separate tasks for multithreading without a huge amount of reprogramming. If I do one of them, say phase C, and thereby reduce its runtime to a sixth of TC, the program still runs in TA+TB+TC/6; if I program phase A as well, it runs in TB+(TA+TC)/6, which is a big speedup. Obviously, reducing the overall runtime to a sixth would be a fantastic reduction; multithreading both A and C is very worthwhile, and multithreading only one phase is perhaps still worthwhile but far from astonishing. Also, if TB is much bigger than TA or TC or both, the gains are smaller.
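(To put rough numbers on that, a small sketch assuming TA = TB = TC = 1 and a 6x gain on each phase that is multithreaded:)

Code:

      PROGRAM phase_speedup
      IMPLICIT NONE
      DOUBLE PRECISION ta, tb, tc, serial, c_only, a_and_c
      ta = 1d0
      tb = 1d0
      tc = 1d0
      serial  = ta + tb + tc                   ! single-threaded runtime
      c_only  = ta + tb + tc/6d0               ! only phase C multithreaded
      a_and_c = tb + (ta + tc)/6d0             ! phases A and C multithreaded
      PRINT *, 'Speedup, C only  :', serial/c_only    ! roughly 1.4x
      PRINT *, 'Speedup, A and C :', serial/a_and_c   ! 2.25x
      END PROGRAM phase_speedup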
LitusSaxonicum
Posted: Sat May 23, 2020 11:02 am

As always, a speed gain is very worthwhile if one counts one’s own time as worthless. If that time enters the equation, what is worth doing may vary. For example, I have a backup computer with a dual-core processor, and the CPU, motherboard and RAM cost rather less than my hourly consulting rate. (I had the rest already.) It runs my single-threaded FTN95-compiled programs just as fast as my main computer, which has 16-thread capability and was assembled from equivalent components that cost my daily rate. (OK, I haven’t costed the time to build them, nor to install Windows, but the cost of computers is nowhere near what it was in the mainframe days, when one might cost a lifetime’s earnings or more, or the early PC days, when it still amounted to months.)
A point that still puzzles me is that by swapping my scratch drive from mechanical to SSD, for the price of 10 minutes’ fee, I managed to double the speed of a disk-bound application; perhaps that was the most cost-effective change of all, because it took no programming effort.
Finally, I hear of huge gains from using CUDA programming, but I imagine that to do that requires a huge investment in reprogramming.
Eddie
PaulLaidler (Site Admin)
Joined: 21 Feb 2005, Posts: 7916, Location: Salford, UK
Posted: Sat May 23, 2020 12:14 pm

Eddie

Have you seen the document called notes_on_parallel_processing.txt that is in the DOC subfolder, typically at C:\Program Files (x86)\Silverfrost\FTN95\DOC?
LitusSaxonicum
Posted: Sat May 23, 2020 2:20 pm

Paul,

Now that you mention it, I remember reading it some time ago and not finding it terribly useful (then). Has it increased in size and content recently? There's a lot more in it than I remember, including a lot more functions. I'm sure it will prove to contain some of the answers to my questions.

Or it may have been the mention of 64-bit that made me not realise its significance. I'm afraid that 64-bit* makes my eyes glaze over because, to be frank, once you transfer stuff from a 32k-word mainframe to a reasonably modern PC, even the 32-bit Windows space is as vast as the solar system, and the contemplation of interstellar space may be left to others! Do the routines work with 32-bit? (My attempt to do that with the Mandelbrot example suggests not.)

I had tried the multiple applications route before, including using multiple computers, and came to the conclusion that I saved the programming time by buying a faster computer (I build my own from components). In the past I've been put off SSDs because they wear out, but I have discovered that Windows wears out with incompatible upgrades, CDs wear out with time, cars get rusty, and nothing is forever. The speedup when something developed for mag tapes or 8 to 10 MB hard drives runs on a PC with an SSD is something to marvel at.

I think that I need to go and study the document in detail. Thanks for pointing it out. What about putting on the top page of the website that FTN95 contains all the tools needed to exploit all those cores in your PC?

Eddie

* And I retain an affection for the x87 that may well be misplaced.
PaulLaidler
Posted: Sat May 23, 2020 2:54 pm

Eddie

This feature is only available for 64-bit executables built using FTN95.
LitusSaxonicum
Posted: Sat May 23, 2020 3:12 pm

I came to the conclusion that the remaining years of my life are far too short a time to understand the Mandelbrot example, so I focused on the other examples. Hmmm. After a certain amount of experimentation, I also came to the conclusion that I didn’t understand the second example, either.

And none of the examples contains START_THREAD@. I think that I'd better do my own experimentation.

At least I've got several more weeks in lockdown!

Eddie
LitusSaxonicum
Posted: Sun May 24, 2020 5:03 pm

And to add to my confusion, I'm not entirely sure that START_THREAD@ actually does anything. Certainly it doesn't seem to work for me: it doesn't seem to start the subroutine, and it can't make up its mind about whether to take exception or not.

Eddie
John-Silver
Joined: 30 Jul 2013, Posts: 1520, Location: Aerospace Valley
Posted: Fri Jun 12, 2020 10:26 am

Eddie, that's not surprising, as the examples are not multi-threading techniques but multi-tasking!!!!
(Same aim, different name/concept, I guess.)

See https://silverfrost.com/19/ftn95/support/ftn95_revision_history.aspx and scroll down to the first entry under the V8.3 changes, where you'll see the feature announced together with a mention of the file Paul quotes.
_________________
''Computers (HAL and MARVIN excepted) are incredibly rigid. They question nothing. Especially input data. Human beings are incredibly trusting of computers and don't check input data. Together, cocking up even the simplest calculation ... :) "
LitusSaxonicum
Posted: Fri Jun 12, 2020 11:36 am

John,

Thanks for your comment, and actually I don't understand the difference. My wish is to understand concepts even if I don't actually employ them, and if possible to get a worthwhile gain for minimal pain.

Eddie
Kenneth_Smith
Joined: 18 May 2012, Posts: 697, Location: Hamilton, Lanarkshire, Scotland
Posted: Sun Jun 14, 2020 4:25 pm

Eddie’s post prompted me to look again at the notes on parallel processing – it was a miserable wet day here in Scotland. I was thinking about the possibility of inserting some parallel tasks within some serial code, and below is a slightly modified version of Example 2.

There is a print statement in the code before any parallel processes are initiated, yet that print statement is executed a number of times equal to the number of processes invoked later in the code.

The print statement at the end of the code, after the parallel processes are killed, executes only once, as expected.

You can change the number of processors used by changing the declaration of np – this demonstrates the speed increase with np = 1, 2, 4, 8 etc.

If you uncomment the do loop which varies the number of processors, be prepared to kill the executable via the Task Manager!

At the moment, I don’t understand what’s happening.

Ken
Kenneth_Smith
Posted: Sun Jun 14, 2020 4:26 pm

Code:

    implicit none !##
    INCLUDE <windows.ins>
    DOUBLE PRECISION start_time,end_time,sum
    double precision duration, sum1      !##
    DOUBLE PRECISION,allocatable::partial_answer(:)
    INTEGER(kind=4) ID
    INTEGER(kind=4) k
    integer(kind=4) :: np = 2, i, j

      print*, 'This print statement is executed NP times'

!$$$$$$     do np = 2, RecommendedProcessorCount@(.true.), 2
     
!>>   Start np-1 additional tasks. ID will be returned thus:
!>>   Master task ID=0
!>>   Slave task ID=1,2,3 in the different processes       
      ID=GetParallelTaskID@(np-1)    !##
      IF(ID .eq. 0) print*, 'Number of processors', np
!>>   Allocate a shared array. The SHARENAME string couples the ALLOCATE with the parallel task mechanism
      ALLOCATE(partial_answer(np),SHARENAME="shared_stuff")
      CALL TaskSynchronise@()
!>>   Time the task using wall clock elapsed time   
      CALL dclock@(start_time)
      sum=0d0
!>>   All np processes compute the sum in an interleaved fashion   
      k = 10000000000_4 - ID
      WHILE(k > 0)DO
        sum = sum + k       
        k = k - np
      ENDWHILE
!>>   Copy the partial sum into the array shared between the processes   
      partial_answer(ID+1)=sum
      CALL TaskSynchronise@()
      CALL dclock@(end_time)
      IF(ID==0)THEN
!>>     We are the master task, so print out the results and the timing   
        sum1 = 0.d0
        do i = 1, np
          sum1 = sum1 + partial_answer(i)
        end do
        PRINT *,"Sum=",sum1
        duration=end_time-start_time
        PRINT *,"Parallel computation time = ",duration
      ENDIF
      CALL TaskSynchronise@()
!>>   Kill off the slave process   
      IF(ID .ne. 0) STOP
      DEALLOCATE(partial_answer)
     
!$$$$$$     end do

      print*, 'This print statement is executed once'
     
  END PROGRAM
Kenneth_Smith
Posted: Sun Jun 14, 2020 11:41 pm

I've made a little bit of progress, but I think it's fair to say that I am well out of my comfort zone! It's certainly not behaving the way I envisaged it would.

A possible topic for your next video, Paul?

Code:
    program main
    implicit none !##
    INCLUDE <windows.ins>
    DOUBLE PRECISION start_time,end_time,sum
    double precision duration, sum1      !##
    DOUBLE PRECISION,allocatable::partial_answer(:)
    INTEGER(kind=4) ID
    INTEGER(kind=4) k
    integer(kind=4) :: np=4, i, j
   
     IF( .not. IsSlaveProcess@()) THEN
        ! Initial serial commands go here
        call set_parameters(np)
     ENDIF
     
!>>   Start np-1 additional tasks. ID will be returned thus:
!>>   Master task ID=0
!>>   Slave task ID=1,2,3 in the different processes       
      ID=GetParallelTaskID@(np-1)    !##
      IF(ID .eq. 0) print*, 'Number of processors', np
!>>   Allocate a shared array. The SHARENAME string couples the ALLOCATE with the parallel task mechanism
      ALLOCATE(partial_answer(np),SHARENAME="shared_stuff")
      CALL TaskSynchronise@()
!>>   Time the task using wall clock elapsed time   
      CALL dclock@(start_time)
      sum=0d0
!>>   All np processes compute the sum in an interleaved fashion   
      k = 10000000000_4 - ID
      WHILE(k > 0)DO
        sum = sum + k       
        k = k - np
      ENDWHILE
!>>   Copy the partial sum into the array shared between the processes   
      partial_answer(ID+1)=sum
      CALL TaskSynchronise@()
      CALL dclock@(end_time)
      IF(ID==0)THEN
!>>     We are the master task, so print out the results and the timing   
        sum1 = 0.d0
        do i = 1, np
          sum1 = sum1 + partial_answer(i)
        end do
        PRINT *,"Sum=",sum1
        duration=end_time-start_time
        PRINT *,"Parallel computation time = ",duration
      ENDIF
      CALL TaskSynchronise@()
!>>   Kill off the slave process   
      IF(ID .ne. 0) STOP
      DEALLOCATE(partial_answer)
  END PROGRAM

  subroutine set_parameters(np)
  implicit none
  integer(kind=4), intent(out) :: np
10  write(6,*)
    write(6,*) 'Enter number of processors to use'
    read(5,*) np
    if (np .lt. 1) goto 10
  end subroutine set_parameters
PaulLaidler
Posted: Mon Jun 15, 2020 7:37 am

Thanks. I am preparing a video about %pl. After that I don't know.
Kenneth_Smith
Posted: Mon Jun 15, 2020 11:22 am

Paul,

I think what is needed is a better explanation of the flow of control through the parallel programs.

I assumed that it was something like this: a block of serial code executed first to set up the task, start N parallel processes, pull the results together, kill the slave processes, output the results.

For example, say we have a do loop within an existing serial program that executes 8000 times calling a function, where the result of the function call at trip i=x round the loop has no dependency on earlier trips at i<x. My thought was that at the do loop we could switch to parallel processing, so that the loop need only execute 1000 times in each of 8 processes running in parallel.
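(A minimal sketch of that idea, following the GetParallelTaskID@ / SHARENAME pattern of the examples above; the 8000-trip loop is split in an interleaved fashion across np processes, and my_func and its body are hypothetical stand-ins for the real work.)

Code:

      PROGRAM split_loop
      IMPLICIT NONE
      INCLUDE <windows.ins>
      INTEGER(kind=4), PARAMETER :: n = 8000, np = 8
      DOUBLE PRECISION, ALLOCATABLE :: results(:)
      DOUBLE PRECISION my_func
      INTEGER(kind=4) ID, i
!>>   Note: anything placed before GetParallelTaskID@ runs in every process
!>>   unless guarded with IsSlaveProcess@(), as in Example 3
      ID = GetParallelTaskID@(np-1)            ! master ID=0, slaves ID=1..np-1
      ALLOCATE(results(n), SHARENAME="loop_results")
      CALL TaskSynchronise@()
!>>   Interleaved split: each process takes every np-th trip of the loop
      DO i = ID+1, n, np
        results(i) = my_func(i)
      END DO
      CALL TaskSynchronise@()
      IF (ID == 0) PRINT *, 'Sum of results = ', SUM(results)
!>>   Kill off the slave processes
      IF (ID .ne. 0) STOP
      DEALLOCATE(results)
      END PROGRAM split_loop

      DOUBLE PRECISION FUNCTION my_func(i)     ! hypothetical independent work
      IMPLICIT NONE
      INTEGER(kind=4), INTENT(IN) :: i
      my_func = DBLE(i)**2
      END FUNCTION my_func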

Looking at Examples 1 and 2 with my preconditioned “serial” thinking implies this is possible. What is not clear from Examples 1 and 2 is that all the parallel processes run all the code. The print statement at the beginning of my first example told me that, which was rather mind-blowing! Why should the print statement be executed N times when it occurs in the serial code before the parallel processes are kicked off using GetParallelTaskID@?

Example 3 does point towards the idea of all the processes running all the code; the test with IsSlaveProcess@() at the beginning shows this, so it is possible to create a serial path through the initial code for the master process only. So clearly my initial thought of what might be possible is indeed possible, but to get it to work I have to forget decades of “serial” thinking.

Example 3 produces a nice picture, but I don’t really understand what the calculation is doing, and I find it difficult to relate to.

I will not give up. I still want to learn, and when the Windows Task Manager sits at 25% while running some code I am reminded that there is the opportunity to do much better performance-wise. Some more pointers on how to get there are needed – I don’t like doing a hard reset on my machine when N processes themselves each launch N processes, and so on.

So what did I learn yesterday? Think parallel from the beginning of writing the code.

I am sure any hints, tips, and/or examples that could be provided would be appreciated by all.

Ken