Paul and Robert,
I think what Dan may be asking is whether a third-party .dll can be linked into an FTN95 executable, either 32-bit or 64-bit.
The proposed .dll would contain a multi-threaded computation, built with either gFortran or ifort, that uses !$OMP directives.
Could the following code (or a subset) be compiled in gFortran with the options -O3 -mavx -ffast-math -fopenmp and then linked into an FTN95 calling program?
Basically, we have seen the opposite with clearwin64.
John

subroutine laipe_matmul_cache (a,b,c,nra,nca,ncb)
   use precision
!  matrix multiplication : multi-thread and caching strategy
   integer*4 nra,nca,ncb, j,k, k1,k2
   real(dp) :: a(nra,nca), b(nca,ncb)
   real(dp) :: c(nra,ncb)
!
   integer*4 num_cache_columns, nk
   external  num_cache_columns
!
!  determine columns of A per pass
   nk = num_cache_columns (nra,nca)
!
   do k1 = 1,nca,nk
      k2 = min ( k1+nk-1, nca)
!
!$OMP PARALLEL DO shared (a,b,c,nra,nca,ncb,k1,k2) private (j,k)
      do j = 1,ncb
         if (k1==1) c(:,j) = 0
         do k = k1,k2
!!          c(1:nra,j) = c(1:nra,j) + a(1:nra,k) * b(k,j)
            call vec_add_dp ( c(1,j), a(1,k), b(k,j), nra )
         end do
      end do
!$OMP END PARALLEL DO
!
   end do   ! cache size passes of A
!
end subroutine laipe_matmul_cache
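
On the gFortran side, one way the routine above might be exposed from a .dll is through a small C-interoperable wrapper. This is only a sketch under assumptions: the wrapper name matmul_cache_c is hypothetical, dp is assumed to be a 64-bit real, and the build is assumed to add -shared to the options in the question (for example gfortran -shared -O3 -mavx -ffast-math -fopenmp with the source files and a .dll output name). How FTN95 would declare and import the routine is not shown here and would need to be checked against the FTN95 documentation. Note also that vec_add_dp as posted calls AXPY4@, which is marked as an FTN95 /64 routine, so a gFortran build would presumably have to use the array syntax or do loop alternative instead.

```fortran
! Hypothetical exported entry point for a gFortran-built DLL.
! The name matmul_cache_c is an assumption made for illustration only.
subroutine matmul_cache_c (a, b, c, nra, nca, ncb) bind(C, name="matmul_cache_c")
   use iso_c_binding, only : c_double, c_int
   implicit none
!GCC$ ATTRIBUTES DLLEXPORT :: matmul_cache_c
   integer(c_int), intent(in)  :: nra, nca, ncb     ! passed by reference
   real(c_double), intent(in)  :: a(nra,nca), b(nca,ncb)
   real(c_double), intent(out) :: c(nra,ncb)
   external laipe_matmul_cache
!
!  assumes integer*4 matches c_int and real(dp) matches c_double
   call laipe_matmul_cache (a, b, c, nra, nca, ncb)
end subroutine matmul_cache_c
```

The bind(C) name gives the export a fixed, unmangled symbol, which should make it easier to reference from another compiler. The gFortran and OpenMP run-time libraries would also have to be found when the calling program runs.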

subroutine vec_add_dp ( y, x, a, n )
!  DAXPY interface routine
   use precision
   integer*4 :: n
   real(dp)  :: y(n), x(n), a
!
   INTEGER*8 :: n8
   n8 = n
   call AXPY4@(y,x,n8,a)     ! FTN95 /64 routine
!
!  y = y + x * a             ! array syntax alternative
!
!  do i = 1,n                ! do loop alternative
!     y(i) = y(i) + x(i) * a
!  end do
end subroutine vec_add_dp

integer*4 function num_cache_columns (nra,nca)
!
!  matrix multiplication : multi-thread and caching strategy
!  find the number of columns of A to store in each pass of multiplication
!  number is based on
!    size of cache and
!    number of cores (threads) in use
!
   use precision    ! byte_size
   use laipe_test   ! cache_size, use_cores, nk, ncp
   integer*4 nra, &   ! number of rows in A
             nca      ! number of columns of A
!
!  Check that A is cached to 5 MB
!    nk  = number of columns per cache pass
!    ncp = number of passes
!
!  Estimate number of columns for cache limit
   nk = (cache_size/byte_size) / nra - use_cores   ! allow 1 column for C for each thread
!
   if ( nk > nca ) then               ! too many : no cache strategy required
      nk  = nca
      ncp = 1
!
   else if ( nk <= use_cores ) then   ! too few : no smaller than 1 column per thread
      nk  = use_cores
      ncp = (nca+nk-1)/nk             ! number of passes
!
   else
      ncp = (nca+nk-1)/nk             ! number of passes
      nk  = (nca+ncp-1)/ncp           ! even up columns per pass
      if ( use_cores > 1 ) &          ! make sure multiple of use_cores
         nk = ( (nk+use_cores-1)/use_cores ) * use_cores   ! round up to columns as multiple of cores
!
   end if
!
   write (*,*) ' A is cached to',ncp,' passes of',nk,' for',nca,' columns'
!
   num_cache_columns = nk
!
end function num_cache_columns
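
The routines above use two modules, precision and laipe_test, that are not included in the post. A minimal sketch of what they might contain is given below so the code can be compiled on its own. Only the names dp, byte_size, cache_size, use_cores, nk and ncp come from the comments in the code; every value is an assumption (the 5 MB cache follows the comment in num_cache_columns, and the 4 threads are purely illustrative).

```fortran
! Assumed contents of the two modules; the values are illustrative only.
module precision
   implicit none
   integer, parameter :: dp        = selected_real_kind (15)   ! 64-bit real
   integer, parameter :: byte_size = 8                          ! bytes per real(dp) value
end module precision

module laipe_test
   implicit none
   integer :: cache_size = 5*1024*1024   ! assumed cache size in bytes (5 MB)
   integer :: use_cores  = 4             ! assumed number of threads in use
   integer :: nk         = 0             ! columns of A per cache pass
   integer :: ncp        = 0             ! number of passes
end module laipe_test
```

With these assumed values, nra = 1000 and nca = 2000, the first estimate is nk = (5242880/8)/1000 - 4 = 651. That falls in the final branch, which gives ncp = 4 passes, evens nk to 500 columns per pass, and leaves it at 500 since that is already a multiple of 4.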