forums.silverfrost.com Forum Index forums.silverfrost.com
Welcome to the Silverfrost forums
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Fortran 2003/2008
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
 
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General
View previous topic :: View next topic  
Author Message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2105
Location: Sydney

PostPosted: Thu May 10, 2012 2:11 pm    Post subject: Reply with quote

Dan,

The answer must be to multi-thread the J loop, as this loop group is initiated 160,000 times.
The K loop is initiated 270,000,000 times, which is what I previously tried to multi-thread. Too many threads, which is why I did not get a good result.
I introduced the K loop so there would be no call inside the J loop.
I'll look at your thread example.

Back in 70's, operations counts looked at the number of multiplies and divides, as these were considered the time consuming parts of the run. That is no longer the case !

For this example, I presented 3 alternative codes, 1 that takes 287 seconds and 2 that takes 950 seconds. We can assume the multiplies take only 287 seconds and something else takes 660 seconds. Thats 11 minutes of doing nothing, worse than when windows starts up.
Considering :
A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
I thought this might be slow due to the array sections, but the K loop :
do k = JEQ_bot,J-1
c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
end do
This has similar poor performance. ( I thought the K index, common to both A and B would have been efficient )
There must be another reason why it processes so slowly, and it is difficult to understand what it is.

Eddie,
I use EXTERNAL like /implicit_none, to document the functions being used.
Using functions as an argument has never appealed. I hope you note that the F77 syntax is far better than F90 in the above example !

John
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2032
Location: South Pole, Antarctica

PostPosted: Thu May 10, 2012 7:56 pm    Post subject: Reply with quote

Well, that does not sound good. Processor must spend at least 0.01 second (from my very rough estimation. Could be 10 times more) in the thread for overhead of creating the thread to be small. Think if you can change the way the code works.

But did i understood correctly that the run takes around 10 min to complete on laptop? I'd say that's fast already to think about modifications!
Back to top
View user's profile Send private message
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2023
Location: Yateley, Hants, UK

PostPosted: Thu May 10, 2012 8:17 pm    Post subject: Reply with quote

John,

Surely the best way to indicate that something is a function is to include FUNCTION (or _FN) in its name. You don't have to for a subroutine, as it is CALLed.

For things that have to be EXTERNAL like Clearwin callbacks, I found that FTN95 is happy to be EXTERNAL without explicitly giving the INTEGER attribute, although I have begun to adopt the Fortran 90-ish version

INTEGER, EXTERNAL :: etc.

The habit of declaring routines as EXTERNAL when they aren't passed as parameters to a subprogram seems to have originated with named BLOCKDATA routines, as otherwise one could link without them and nobody would be any the wiser ...

Your timing problem looks to me like it is caused by differing numbers of cache misses between the various approaches.

Eddie
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Fri May 11, 2012 7:13 am    Post subject: Re: Reply with quote

JohnCampbell wrote:

By placing the inner loop in a parallel construct, there are 150,000 x 1,500 = 225 million initiated threads involving a dot product of size 1 to 1,500. This loop calculates the modification to a single element of column (:,j) : too many threads ! My attempt (1 year ago) was on the inner loop and did not work well since I did not have enough processors and there were too many threads.
John


Forgive me, but this doesn't make sense. The number of threads and the number of iterations are different. You can have a loop doing 16 million iterations, shared across 8 threads with each thread doing 2 million iterations. The work should be shared out amongst the threads.

If you want to parallelise your inner loop (the dot product) you would do something like the following in OpenMP. Note, I haven't coded your offsets to the arrays.

Also I have vectorised contributions from successive 4 array elements. You might change this to 2 depending on the hardware and whether you are working in single or double precision. Keep it at 4 for double precision if your processor supports AVX, set it to 2 if it supports SSE2. (Obviously your compiler needs to support OpenMP and vectorization too!)

A mix of parallelising using threads and parallelising using vectorization within each thread should give good performance.

I have tried this and I get just less than a 4 X speed up with 4 cores when N is 1000000.

Code:

! N is number of iterations

NN = MOD(N, 4)
C = SUM(A(1:NN)*B(1:NN))
S = 0.0D0

!$OMP PARALLEL DO PRIVATE(J) REDUCTION(+:S)
DO J=NN+1,N,4
   S = S + SUM(A(J:J+3)*B(J:J+3))
END DO
C = C + S


The !$OMP directive creates the parallel region, and shares out the work of the DO loop amongst the threads. You can limit the number of threads by adding more clauses to the directive, but I have just assumed you want to use the maximum number (the number of cores usually).
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl


Last edited by davidb on Fri May 11, 2012 7:56 am; edited 2 times in total
Back to top
View user's profile Send private message
Wilfried Linder



Joined: 14 Nov 2007
Posts: 314
Location: Düsseldorf, Germany

PostPosted: Fri May 11, 2012 7:40 am    Post subject: Reply with quote

I would like to use parallel processing, and after reading this discussion, I try to find information about OpenMP. On the official page http://openmp.org/wp/openmp-compilers/ is a list of compilers supporting OpenMP - FTN95 is missing. Is it nevertheless possible to use OpenMP together with FTN95?

Regards - Wilfried
Back to top
View user's profile Send private message
davidb



Joined: 17 Jul 2009
Posts: 557
Location: UK

PostPosted: Fri May 11, 2012 8:01 am    Post subject: Reply with quote

Hi Wilfried,

Silverfrost's FTN95 doesn't support OpenMP. But it will compile OpenMP programs since it just treats the directives as comments. I use FTN95 to debug my programs and another compiler (Ifort, Gfortran etc) to create executables to run which enables the OpenMP features.

David.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl
Back to top
View user's profile Send private message
Wilfried Linder



Joined: 14 Nov 2007
Posts: 314
Location: Düsseldorf, Germany

PostPosted: Fri May 11, 2012 8:52 am    Post subject: Reply with quote

Thank you, David. What a pity... FTN95: No development for 64 bits, no support of parallel processing, no support of OpenMP... It's like a Ferrari with an 80-hp-engine. I really hope that ClearWin will be prepared to work together with other compilers like Intel, then I will start using that compiler for calculations and the very good ClearWin library only for the GUI.

Fortran is the language of choice for number-crunching software, and here we need access to really all of the machines's computing power. For other purposes, I wouldn't use Fortran.

Regards - Wilfried
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2105
Location: Sydney

PostPosted: Fri May 11, 2012 9:56 am    Post subject: Reply with quote

David,

Thanks for your comments. They have been very helpful.

To explain the calculation I am performing:
There are 154,220 equations in a blocked skyline, using 4 blocks.
Giving 159,139 column reductions (repeated for columns spanning blocks)
These results in 271,679,207 Dot_Product calls, roughly the number of numbers in the matrix profile.
The number of iterations in each Dot Product range from 1 to 9,861
0.5% < 10
2% < 34
10% < 100
66% in the range 400-1800
10% > 1,800
I hope this better describes the profile of my problem.

I was under the impression there would be new threads generated for each Dot_Product call if the process was parallelised using OpenMP.
I thought the use of vector instructions or AVX is a different approach which could suit yours and my example.
I certainly need to understand the difference between these approaches or if they are the same process.
My impression is that vector instructions or AVX involve vector instructions in a single processor and is the solution I should be chasing.

Back to the FTN95 solution
I have now produced 4 approaches to the "Dot_Product" call.
1) The old F77 code, using a DO loop:
c = 0
do k = 0,jband-1
c = c + A(A_ptr_b+k) * B(B_ptr_b+k)
end do
A(A_ptr_0+J) = A(A_ptr_0+J) - c

2) Using Dot_Product intrinsic with array sections:
A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_b+jband-1), B(B_ptr_b:B_ptr_b+jband-1) )

3) A DO loop inside a F77 wrapper
A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND)
where
REAL*8 FUNCTION VEC_SUM (A, B, N)
!
integer*4, intent (in) :: n
real*8, dimension(n), intent (in) :: a
real*8, dimension(n), intent (in) :: b
!
c = 0
do k = 1,N
c = c + A(k) * B(k)
end do
vec_sum = c
end

4) A Dot_Product inside a F77 wrapper
where
REAL*8 FUNCTION VEC_SUM (A, B, N)
!
integer*4, intent (in) :: n
real*8, dimension(n), intent (in) :: a
real*8, dimension(n), intent (in) :: b
!
vec_sum = dot_product ( a, b )
end

Both options 1 and 2 take about 950 seconds while both wrapper options 3 and 4 take 300 seconds on my core i5 notebook using FTN95 Ver. 6.10 with /opt
I'm not sure if 4) becomes 3 or 3) becomes 4 when compiled ?
I don't think FTN95 has an optimised Dot_Product library function.

I like Eddie's suggestion that it is due to "cache misses" but I have no idea of what that could really mean or how to address it.
Again, 11 minutes of doing nothing due to a one line syntax change is also difficult to understand.

John
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2032
Location: South Pole, Antarctica

PostPosted: Fri May 11, 2012 6:22 pm    Post subject: Reply with quote

Wilfried,
Absence of 64bit is really worst miss, i agree. But i hope for the movement here.
The OpenMP probably too but you have some choices - you can do some parallelization with FTN95 and do that NATIVELY what no other compiler can do (non-standard way, of course, same as non-standard is with the OpenMP). NVIDIA CUDA or AMD's own parallel compiler libraries support would be probably even better because our graphics cards are actually massively parallel vector supercomputers

John,
It's tricky for outsiders to suggest something they do not know in all details. I can only suggest something very general which works if you satisfy some requirements.

Does your method allow division of the workload into Np independent threads - one per processor core - which will do their 1/Np portion of job or not ? In other words, can you fit your code into this parallel program demo where i do non-threaded do loop run and then divide it into two threads with 1/2 job and measure time for all these tasks? You should see speedup almost 100% with two cores (or 4x, 6x .... if you add more threads same way with multicore CPU)


Code:
 ! Multithreading example mult04.for
 ! Dan R Right 2012
 !
 ! Watch in Task Manager how two threads run1 and run2
 ! grab two processor cores working in parallel
 !
 ! Compilation: ftn95 mult04.FOR /clr /link /free /multi_threaded
 !
 include <clearwin.ins>
 EXTERNAL run1, run2
 common /abc_/kEnded1, kended2
 character*1 cha

!..... straight non-threaded run

 print*,' Run w/o threads started'
 call clock@ (time_start)
    d=2.0
    do k=1,20
    do i=1,10000000
    d=alog(exp(d))
    enddo
    enddo
 call clock@ (time_finish)
 print*, 'Elapsed time without threads=', time_finish-time_start

!...multithreaded
 call clock@ (time_start)

   CALL CREATE_THREAD@(run1,22)
   CALL CREATE_THREAD@(run2,23)

!...wait till both threads end
 do while (kEnded1==0.or.kEnded2==0)
   call sleep1@(0.1)
 enddo

 call clock@ (time_finish)

 print*, 'Elapsed time with threads=', time_finish-time_start
 print*, 'Enter any key to exit'
 read(*,*) cha
 END
 !----------------
 subroutine run1()
 include <clearwin.ins>
 common /abc_/kEnded1, kended2 
    lock;   print*,'Thr.1 started'; end lock
    kEnded1=0
    d=2.0
    do k=1,10
    do i=1,10000000
    d=alog(exp(d))
    enddo
    enddo
    kEnded1=1
    lock;   print*,'Thr.1 ended' ; end lock
 end
 !----------------
 subroutine run2()
 include <clearwin.ins>
 common /abc_/kEnded1, kended2 

    lock;   print*,'Thr.2 started' ;end lock
    kEnded2=0
    d=2.0
    do k=1,10
    do i=1,10000000
    d=alog(exp(d))
    enddo
    enddo     
    kEnded2=1
    lock;   print*,'Thr.2 ended' ;end lock
 end


Last edited by DanRRight on Sat May 12, 2012 8:54 am; edited 3 times in total
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2105
Location: Sydney

PostPosted: Sat May 12, 2012 1:05 am    Post subject: Reply with quote

Dan,

I've looked at this problem a lot and one issue that worries me is why the poor FTN95 performance in approach 3) compared to 1)
Basically changing from:
Code:
!3) - good
   do k = 1,n
      c = c + A(k) * B(k)
   end do
to:
! 1) - bad
   do k = 0,n-1
      c = c + A(A_ptr_b+k) * B(B_ptr_b+k)
   end do

This results in a 3-fold increase in run time.
I've tested this on both a Core i5 and Xeon processor and got similar results.
I'm not sure if 3) has fluked a processor optimisation or 1) has a basic inefficiency, but the results are dramatic. It would be good to have an optimised/vectorised Dot_Product to apply to case 4) and see any possible change or if there is already some processor optimisation.
I think the performance, as in case 1) is similar to that seen in the Polyhedron benchmark tests.
I'm not sure if this example can be a generalised guide to DO loops with FTN95 or an isolated case.

I am not as enthusiastic about the OpenMP approach, as I think the new vector instructions have much to offer for this problem.
Thanks for your comments, as I will consider multiple threading further.

John
Back to top
View user's profile Send private message
PaulLaidler
Site Admin


Joined: 21 Feb 2005
Posts: 6003
Location: Salford, UK

PostPosted: Sat May 12, 2012 8:04 am    Post subject: Reply with quote

John

I have noted this for investigation. What is the purpose of A_ptr_b and B_ptr_b? How do they vary?

An initial investigate would take a look at the /explist for each case and also possibly...

Code:
   do k = 0,n-1
      ia = A_ptr+k
      ib = B_ptr+k
      c = c + A(ia) * B(ib)
   end do
Back to top
View user's profile Send private message
DanRRight



Joined: 10 Mar 2008
Posts: 2032
Location: South Pole, Antarctica

PostPosted: Sat May 12, 2012 9:02 pm    Post subject: Reply with quote

I'd like to remind all that small demo code would simplify understanding of the problem and speed up the solutions by the factor of .... 1000.

Optimization is trickiest problem in any language and the main achievement and pride of Fortran. Besides its inherent simplicity and legacy of developed libraries, it's the primary reason Fortran still survives. My suggestion is to ask the hard questions in comp.lang.fortran where you may have a luck finding the ultimate fortran pro in your subject. 2-3 fortran compiler developers and fortran standard committee people were often there and hidden tricks of optimization were always the favorite topic there.
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2105
Location: Sydney

PostPosted: Sun May 13, 2012 11:13 am    Post subject: Reply with quote

Dan,

I've generated a simplified code, based on the direct reduction of a profile solver. I've included 6 inner loop options, the 4 I presented plus 2 variations of Paul's post.
It should be compiled with /lgo or /opt /lgo.
To change the size of the problem, I'd recommend changing "max_size". It can run up to 2gb on Win_7-64 or probably 1.3gb on XP_32.
The results show that the wrapper improves by a factor of 3 with FTN95 /opt.
I need to test it with other compilers.
253 lines of code, so see how much can be copied per post.

John
Code:
module big_arrays
   integer*4, parameter :: mb_8     = 1024*1024/8       ! real*8 variables in 1 mb
!
   integer*4, parameter :: max_eqn  = 200000            ! maximum equations possible
   integer*4, parameter :: max_band =   2800            ! maximum profile size
   integer*4, parameter :: max_size = 100*mb_8          ! maximum size profile matrix  (CHANGE THIS)
!
   integer*4, allocatable, dimension(:) :: nb           ! profile storage
   real*8,    allocatable, dimension(:) :: B            ! matrix storage

   integer*4  neq
   integer*4  mstor
   integer*4  avg_band
!
end module big_arrays

use big_arrays
!
   integer*4  ieq, iband, ipos, method, k
   real*8     c_start(6), c_end(6), x, dt, ops_per_sec
   integer*8  num_ops, num_call
!
   character  method_name(6)*30
   data method_name / 'Original DO Loop', 'F90 syntax', 'F77 wrapper for DO Loop',         &
                      'F77 wrapper for Dot_Product', 'Paul option', 'alternate Paul option' /
!
!  generate profile in nb
!
    allocate ( nb(max_eqn) )
!
    iband = 1
    nb(1) = 1
    nb(2) = 0
    do ieq = 1, max_eqn
       call random_number (x)
       iband = (iband + x * max_band) / 2
       nb(ieq+2) = nb(ieq+1) + min (iband,ieq)
       if (nb(ieq+2) > max_size) exit
    end do
!
!  generate diagonal pointer
    neq    = ieq-1
    do ieq = 0,neq
       nb(ieq+2) = nb(ieq+2) + neq+2
    end do
!
    mstor    = nb(neq+2)
    avg_band = mstor / neq
    x        = real(mstor) / real(mb_8)
!
    write (*,2001) neq,      ' equations'
    write (*,2001) avg_band, ' average profile'
    write (*,2001) mstor,    ' coefficients'
    write (*,2002) x ,       ' storage (mb)'
    write (*,1001) ' '
2001 format (i10,a)
2002 format (f10.2,a)
!
!  initialise the matrix
!
    allocate ( B(mstor) )
    do ieq = 1,neq
       ipos  = ieq+2
       iband = nb(ipos) - nb(ipos-1)
       do k = nb(ipos-1)+1, nb(ipos)
          b(k) = 1
       end do
       b(nb(ipos)) = iband
    end do
!
   do method = 1,6
!
     call cpu_time (c_start(method))
!
     call redblk (B, NB, B, NB, method, num_ops, num_call)
!
     call cpu_time (c_end(method))
     dt = c_end(method)-c_start(method)
     ops_per_sec = dble (num_ops) / dt
     write (*,1001) ' Method', method, '  CPU_time =',dt,'  ops/sec =',ops_per_sec, method_name(method)
   end do
1001 format (a,i2,a,f9.3,a,en12.2,2x,a)
!
   write (*,*) num_call,' Number of dot_product calls'
   write (*,*) num_ops, ' Number of itterations'
!
end

      SUBROUTINE REDBLK (A, NA, B, NB, method, num_ops, num_call)
!
!   reduces block 'A' by block 'B'
!
      REAL*8    A(*), B(*)
      INTEGER*4 NA(*), NB(*), method
      integer*8 num_ops, num_call
      INTRINSIC MIN, MAX
!
      INTEGER*4 I, IBOT, ITOP, IPOINT, A_ptr_0, A_ptr_I, IL,            &
                J, JBOT, JTOP, JB, JT, JI
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2105
Location: Sydney

PostPosted: Sun May 13, 2012 11:15 am    Post subject: Reply with quote

continued
Code:
!
      IBOT = NA(1)                               ! FIRST EQUATION
      ITOP = IBOT + NA(2) - 3                    ! LAST EQUATION
!
      JBOT = NB(1)
      JTOP = JBOT + NB(2) - 3
      num_ops  = 0
      num_call = 0
!
      DO I = IBOT,ITOP
         IPOINT  = I - IBOT + 3
         A_ptr_0 = NA(IPOINT) - I
!
         IL      = I - (NA(IPOINT) - NA(IPOINT-1)) + 1
         JB      = MAX (IL+1,JBOT)
         JT      = MIN (I-1 ,JTOP)
!
!     patch to test the column reduction, returning stats
!
         call redcol (A(A_ptr_0+IL), B, NB, JB, JT, IL, method, num_ops, num_call)
         IF (IBOT /= JBOT) CYCLE
!
!      NOW REDUCE PIVOT COLUMN ON ITSELF (now resert values)
!
         A_ptr_I = A_ptr_0 + I
         DO J = IL,JT
            JI = A_ptr_0+J
            A(JI) = 1
         END DO
         A(A_ptr_I) = NA(IPOINT) - NA(IPOINT-1)
!
      END DO
!
      RETURN
      END

      SUBROUTINE REDCOL (A, B, NB, JB, JT, IEQ_bot, method, num_ops, num_call)
!
!   Reduces vector 'A' by block 'B'
!   Allows for skyline matrix to be stored in multiple blocks
!   Vector A might or might not be in block B
!
      REAL*8    A(*),                &          ! column to be reduced provided as A(IEQ_bot:)
                B(*)                            ! skyline block of columns
      INTEGER*4 NB(*),               &          ! block structure of B
                JB,                  &          ! first equation to reduce (is in block B)
                JT,                  &          ! last equation to reduce (is in block B and < equation A)
                IEQ_bot,             &          ! first active equation in column A
                method
      integer*8 num_ops, num_call
!
      INTEGER*4 J, JEQ_bot, JBOT, JPOINT, JBAND, k, ia, ib,              &
                A_ptr_0, B_ptr_0, A_ptr_b, B_ptr_b, A_ptr_t, B_ptr_t
      REAL*8    Vec_Sum, Vec_Sum_d, c
      EXTERNAL  Vec_Sum, Vec_Sum_d              ! Dot_Product using F77 syntax
!
      IF (JB > JT) RETURN
      JBOT    = NB(1)                           ! first equation column in Block B
      A_ptr_0 = 1-IEQ_bot                       ! virtual pointer to A(0)
!
      DO J = JB,JT                              ! loop over range of equations
         JPOINT  = J - JBOT + 3                 ! pointer to index for column J in block B
         B_ptr_0 = NB(JPOINT) - J               ! virtual pointer to B(0,j)
!
         JEQ_bot = NB(JPOINT-1) - B_ptr_0 + 1   ! first active equation of column J  = J - jband + 1 where jband = NB(JPOINT)-NB(JPOINT-1)
          IF (IEQ_bot > JEQ_bot) JEQ_bot = IEQ_bot   
!
         JBAND   = J - JEQ_bot                  ! active bandwidth for equation J and equation A
          IF (JBAND < 1) CYCLE
!
         A_ptr_b  = A_ptr_0  + JEQ_bot
         B_ptr_b  = B_ptr_0  + JEQ_bot
         num_ops  = num_ops  + JBAND
         num_call = num_call + 1
!
         select case (method)
!
          case (1) ! if (method == 1) then                    ! Original DO Loop
            c = 0
            do k = JEQ_bot,J-1
               c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
            end do
            A(A_ptr_0+J) = A(A_ptr_0+J) - c
!
          case (2) ! else if (method == 2) then               ! F90 syntax

            A_ptr_t = A_ptr_b + JBAND - 1
            B_ptr_t = B_ptr_b + JBAND - 1
            A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
!
Back to top
View user's profile Send private message
JohnCampbell



Joined: 16 Feb 2006
Posts: 2105
Location: Sydney

PostPosted: Sun May 13, 2012 11:18 am    Post subject: Reply with quote

hopefully the end
Code:
          case (3) ! else if (method == 3) then               ! F77 wrapper for DO loop
!
            A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND)
!
          case (4) ! else if (method == 4) then               ! F77 wrapper for Dot_Product
!
            A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum_d (A(A_ptr_b), B(B_ptr_b), JBAND)
!
          case (5) ! else if (method == 5) then               ! Paul option, simplified subscripts
            c = 0
            do k = JEQ_bot,J-1
               ia = A_ptr_0+k
               ib = B_ptr_0+k
               c = c + A(ia) * B(ib)
            end do
            A(A_ptr_0+J) = A(A_ptr_0+J) - c
!
          case (6) ! else if (method == 6) then               ! alternate Paul option (auto increment ?)
            c = 0
            do k = JEQ_bot,J-1
               c = c + A(A_ptr_b) * B(B_ptr_b)
               A_ptr_b = A_ptr_b+1
               B_ptr_b = B_ptr_b+1
            end do
            A(A_ptr_b) = A(A_ptr_b) - c
         end select
!
      END DO
!
      RETURN
      END

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!    Performs a vector dot product  VECSUM =  [A] . [B]
!    account is NOT taken of the zero terms in the vectors
!
      real*8    a(*), b(*), c
      integer*4 n, k
!
      c = 0
      do k = 1,N
         c = c + A(k) * B(k)
      end do
      vec_sum = c
!
      RETURN
!
      END

      REAL*8 FUNCTION VEC_SUM_d (A, B, N)
!
!   Performs a vector dot product  VECSUM =  [A] . [B]
!
      integer*4,               intent (in) :: n
      real*8,    dimension(n), intent (in) :: a
      real*8,    dimension(n), intent (in) :: b
!
      vec_sum_d = dot_product ( a, b )
!
      RETURN
!
      END
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic    forums.silverfrost.com Forum Index -> General All times are GMT + 1 Hour
Goto page Previous  1, 2, 3, 4, 5, 6, 7, 8, 9, 10  Next
Page 3 of 10

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group