forums.silverfrost.com

JohnCampbell · Joined: 16 Feb 2006 Posts: 2560 Location: Sydney

Dan,

The answer must be to multi-thread the J loop, as this loop group is initiated 160,000 times.
The K loop is initiated 270,000,000 times, which is what I previously tried to multi-thread. Too many threads, which is why I did not get a good result.
I introduced the K loop so there would be no call inside the J loop.
I'll look at your thread example.

Back in the 70's, operation counts looked at the number of multiplies and divides, as these were considered the time-consuming parts of the run. That is no longer the case!

For this example, I presented 3 alternative codes: one that takes 287 seconds and two that take 950 seconds. We can assume the multiplies take only 287 seconds and something else takes about 660 seconds. That's 11 minutes of doing nothing, worse than when Windows starts up.
Considering:
   A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product ( A(A_ptr_b:A_ptr_t), B(B_ptr_b:B_ptr_t) )
I thought this might be slow due to the array sections, but the K loop:
   do k = JEQ_bot,J-1
      c = c + A(A_ptr_0+k) * B(B_ptr_0+k)
   end do
has similarly poor performance. (I thought the K index, common to both A and B, would have been efficient.)
There must be another reason why it processes so slowly, and it is difficult to understand what it is.

Eddie,
I use EXTERNAL like /implicit_none, to document the functions being used.
Using functions as an argument has never appealed. I hope you note that the F77 syntax is far better than F90 in the above example !

John

DanRRight · Posted: Thu May 10, 2012 7:56 pm Post subject:

Well, that does not sound good. Each thread must run for at least 0.01 second (by my very rough estimate; it could be 10 times more) for the overhead of creating the thread to be small. Think about whether you can change the way the code works.

But did I understand correctly that the run takes around 10 min to complete on a laptop? I'd say that's already fast enough that modifications are hardly worth thinking about!

LitusSaxonicum · Posted: Thu May 10, 2012 8:17 pm Post subject:

John,

Surely the best way to indicate that something is a function is to include FUNCTION (or _FN) in its name. You don't have to for a subroutine, as it is CALLed.

For things that have to be EXTERNAL, like ClearWin callbacks, I found that FTN95 is happy with a plain EXTERNAL declaration without explicitly giving the INTEGER attribute, although I have begun to adopt the Fortran 90-ish version

INTEGER, EXTERNAL :: etc.

The habit of declaring routines as EXTERNAL when they aren't passed as parameters to a subprogram seems to have originated with named BLOCKDATA routines, as otherwise one could link without them and nobody would be any the wiser ...

Your timing problem looks to me like it is caused by differing numbers of cache misses between the various approaches.

Eddie

Wilfried Linder · Posted: Fri May 11, 2012 7:40 am Post subject:

I would like to use parallel processing, and after reading this discussion I tried to find information about OpenMP. The official page http://openmp.org/wp/openmp-compilers/ has a list of compilers supporting OpenMP, and FTN95 is missing. Is it nevertheless possible to use OpenMP together with FTN95?

Regards - Wilfried

davidb · Joined: 17 Jul 2009 Posts: 560 Location: UK

Hi Wilfried,

Silverfrost's FTN95 doesn't support OpenMP. But it will compile OpenMP programs since it just treats the directives as comments. I use FTN95 to debug my programs and another compiler (Ifort, Gfortran etc) to create executables to run which enables the OpenMP features.
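To illustrate the point about directives being treated as comments, here is a minimal sketch (not David's code; the module and routine names are made up for illustration). Lines starting with `!$omp` are OpenMP directives and lines starting with `!$ ` are only compiled when OpenMP is switched on, so a compiler without OpenMP support sees them as ordinary comments and builds a serial program.

```fortran
module omp_optional_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Number of threads OpenMP would use, or 1 when the compiler
  ! ignores OpenMP and treats the "!$" lines as comments.
  function thread_count() result(n)
    !$ use omp_lib, only: omp_get_max_threads
    integer :: n
    n = 1
    !$ n = omp_get_max_threads()
  end function thread_count

  ! Sum of i*i for i = 1..n. With OpenMP enabled the loop is shared
  ! between threads; without it this is an ordinary serial DO loop.
  function sum_of_squares(n) result(total)
    integer, intent(in) :: n
    real(dp) :: total
    integer :: i
    total = 0.0_dp
    !$omp parallel do reduction(+:total)
    do i = 1, n
      total = total + real(i, dp) * real(i, dp)
    end do
    !$omp end parallel do
  end function sum_of_squares

end module omp_optional_demo
```

With gfortran the OpenMP version is enabled with -fopenmp; leaving the option off gives the same serial behaviour described above for FTN95.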

David.
_________________
Programmer in: Fortran 77/95/2003/2008, C, C++ (& OpenMP), java, Python, Perl

Wilfried Linder · Posted: Fri May 11, 2012 8:52 am Post subject:

Thank you, David. What a pity... FTN95: No development for 64 bits, no support of parallel processing, no support of OpenMP... It's like a Ferrari with an 80-hp-engine. I really hope that ClearWin will be prepared to work together with other compilers like Intel, then I will start using that compiler for calculations and the very good ClearWin library only for the GUI.

Fortran is the language of choice for number-crunching software, and here we need access to really all of the machine's computing power. For other purposes, I wouldn't use Fortran.

Regards - Wilfried

JohnCampbell · Joined: 16 Feb 2006 Posts: 2560 Location: Sydney

David,

Thanks for your comments. They have been very helpful.

To explain the calculation I am performing:
There are 154,220 equations in a blocked skyline, using 4 blocks.
This gives 159,139 column reductions (repeated for columns spanning blocks).
This results in 271,679,207 Dot_Product calls, roughly the number of numbers in the matrix profile.
The number of iterations in each Dot_Product ranges from 1 to 9,861:
0.5% < 10
2% < 34
10% < 100
66% in the range 400-1800
10% > 1,800
I hope this better describes the profile of my problem.

I was under the impression there would be new threads generated for each Dot_Product call if the process was parallelised using OpenMP.
I thought the use of vector instructions or AVX is a different approach which could suit yours and my example.
I certainly need to understand the difference between these approaches or if they are the same process.
My impression is that vector instructions or AVX work within a single processor core, and that this is the solution I should be chasing.
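The difference between the two approaches can be sketched on the dot product itself. This is only an illustration under stated assumptions: the module and function names are hypothetical, and the `simd` directive comes from OpenMP 4.0, which is newer than this thread and needs a compiler that supports it. The first version asks for the loop to be split between threads (several cores); the second asks for vector instructions within a single thread. A compiler without OpenMP treats both directives as comments.

```fortran
module dot_variants
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Thread-level parallelism: iterations are divided among threads,
  ! each thread keeps a private partial sum, and the partial sums
  ! are combined at the end of the loop.
  function dot_threads(a, b, n) result(c)
    integer, intent(in)  :: n
    real(dp), intent(in) :: a(n), b(n)
    real(dp) :: c
    integer :: k
    c = 0.0_dp
    !$omp parallel do reduction(+:c)
    do k = 1, n
      c = c + a(k) * b(k)
    end do
    !$omp end parallel do
  end function dot_threads

  ! Vector-level parallelism: one thread, but the compiler is asked
  ! to process several iterations per instruction (SSE/AVX lanes).
  function dot_simd(a, b, n) result(c)
    integer, intent(in)  :: n
    real(dp), intent(in) :: a(n), b(n)
    real(dp) :: c
    integer :: k
    c = 0.0_dp
    !$omp simd reduction(+:c)
    do k = 1, n
      c = c + a(k) * b(k)
    end do
  end function dot_simd

end module dot_variants
```

For dot products of a few hundred to a few thousand terms, as in the profile above, starting threads for every call is likely to cost more than it saves, which is consistent with the earlier observation about multi-threading the K loop.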

Back to the FTN95 solution
I have now produced 4 approaches to the "Dot_Product" call.
1) The old F77 code, using a DO loop:
      c = 0
      do k = 0,jband-1
         c = c + A(A_ptr_b+k) * B(B_ptr_b+k)
      end do
      A(A_ptr_0+J) = A(A_ptr_0+J) - c

2) Using the Dot_Product intrinsic with array sections:
      A(A_ptr_0+J) = A(A_ptr_0+J) - Dot_Product (A(A_ptr_b:A_ptr_b+jband-1), B(B_ptr_b:B_ptr_b+jband-1) )

3) A DO loop inside an F77 wrapper:
      A(A_ptr_0+J) = A(A_ptr_0+J) - Vec_Sum (A(A_ptr_b), B(B_ptr_b), JBAND)
   where
      REAL*8 FUNCTION VEC_SUM (A, B, N)
      !
      integer*4, intent (in) :: n
      real*8, dimension(n), intent (in) :: a
      real*8, dimension(n), intent (in) :: b
      !
      real*8    :: c      ! accumulator, declared so this compiles with /implicit_none
      integer*4 :: k      ! loop index
      !
      c = 0
      do k = 1,N
         c = c + A(k) * B(k)
      end do
      vec_sum = c
      end

4) A Dot_Product inside an F77 wrapper,
   where
      REAL*8 FUNCTION VEC_SUM (A, B, N)
      !
      integer*4, intent (in) :: n
      real*8, dimension(n), intent (in) :: a
      real*8, dimension(n), intent (in) :: b
      !
      vec_sum = dot_product ( a, b )
      end

Options 1 and 2 both take about 950 seconds, while wrapper options 3 and 4 both take about 300 seconds on my Core i5 notebook using FTN95 Ver. 6.10 with /opt.
I'm not sure whether 4) becomes 3) or 3) becomes 4) when compiled.
I don't think FTN95 has an optimised Dot_Product library function.
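For anyone who wants to compare the four approaches side by side, the following is a small sketch that puts them behind one routine. It is not the solver code: `reduce_coefficient`, its argument list and the `method` selector are hypothetical names introduced here, and only the arithmetic follows the snippets above. All four methods should give the same value of A(A_ptr_0+J), so any timing difference comes from how the compiler handles each form.

```fortran
module dot_approaches
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Approach 3: DO loop inside a wrapper with explicit-shape arguments.
  function vec_sum_loop(a, b, n) result(c)
    integer, intent(in)  :: n
    real(dp), intent(in) :: a(n), b(n)
    real(dp) :: c
    integer :: k
    c = 0.0_dp
    do k = 1, n
      c = c + a(k) * b(k)
    end do
  end function vec_sum_loop

  ! Approach 4: Dot_Product intrinsic inside the same kind of wrapper.
  function vec_sum_intrinsic(a, b, n) result(c)
    integer, intent(in)  :: n
    real(dp), intent(in) :: a(n), b(n)
    real(dp) :: c
    c = dot_product(a, b)
  end function vec_sum_intrinsic

  ! Applies one of the four approaches to A(a_ptr_0+j).
  ! a and b are explicit-shape so that passing a(a_ptr_b) as the
  ! start of a vector (approaches 3 and 4) is legal.
  subroutine reduce_coefficient(a, na, b, nb, a_ptr_0, a_ptr_b, b_ptr_b, &
                                j, jband, method)
    integer, intent(in)     :: na, nb, a_ptr_0, a_ptr_b, b_ptr_b
    integer, intent(in)     :: j, jband, method
    real(dp), intent(inout) :: a(na)
    real(dp), intent(in)    :: b(nb)
    real(dp) :: c
    integer  :: k
    select case (method)
    case (1)   ! inline DO loop
      c = 0.0_dp
      do k = 0, jband - 1
        c = c + a(a_ptr_b + k) * b(b_ptr_b + k)
      end do
    case (2)   ! intrinsic applied to array sections
      c = dot_product(a(a_ptr_b:a_ptr_b + jband - 1), &
                      b(b_ptr_b:b_ptr_b + jband - 1))
    case (3)   ! wrapper around a DO loop
      c = vec_sum_loop(a(a_ptr_b), b(b_ptr_b), jband)
    case default   ! wrapper around the intrinsic
      c = vec_sum_intrinsic(a(a_ptr_b), b(b_ptr_b), jband)
    end select
    a(a_ptr_0 + j) = a(a_ptr_0 + j) - c
  end subroutine reduce_coefficient

end module dot_approaches
```

Calling `reduce_coefficient` in a loop over `method = 1..4` on copies of the same data, with a timer around each pass, would be one way to reproduce the comparison on another compiler.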

I like Eddie's suggestion that it is due to "cache misses" but I have no idea of what that could really mean or how to address it.
Again, 11 minutes of doing nothing due to a one line syntax change is also difficult to understand.

John

DanRRight · Posted: Fri May 11, 2012 6:22 pm Post subject:

Wilfried,
The absence of 64-bit is really the worst omission, I agree, but I hope for movement here.
The lack of OpenMP probably is too, but you have some choices: you can do some parallelization with FTN95 and do it NATIVELY, which no other compiler can do (in a non-standard way, of course, just as OpenMP is not part of the Fortran standard). Support for NVIDIA CUDA or AMD's own parallel compiler libraries would probably be even better, because our graphics cards are actually massively parallel vector supercomputers.

John,
It's tricky for outsiders to suggest something they do not know in all details. I can only suggest something very general which works if you satisfy some requirements.

Does your method allow the workload to be divided into Np independent threads, one per processor core, each doing its 1/Np portion of the job? In other words, can you fit your code into this parallel program demo, where I run a non-threaded DO loop and then divide it into two threads with half the job each, and measure the time for all these tasks? You should see a speedup of almost 2x with two cores (or 4x, 6x ... if you add more threads the same way on a multicore CPU).
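The idea of dividing the job into Np independent portions can be sketched as follows. This is not Dan's demo and does not use FTN95's native threading; it uses an OpenMP directive instead, purely as an assumption for illustration, and the names `chunk_bounds` and `chunked_sum` are made up. Each chunk works on its own index range and writes only its own partial result, which is the independence requirement the question is asking about.

```fortran
module chunk_split
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! First and last index of chunk ip (1..np) when the range 1..n is
  ! divided as evenly as possible; the first mod(n,np) chunks get
  ! one extra element. An empty chunk has last = first - 1.
  subroutine chunk_bounds(n, np, ip, first, last)
    integer, intent(in)  :: n, np, ip
    integer, intent(out) :: first, last
    integer :: base, extra
    base  = n / np
    extra = mod(n, np)
    first = (ip - 1) * base + min(ip - 1, extra) + 1
    last  = first + base - 1
    if (ip <= extra) last = last + 1
  end subroutine chunk_bounds

  ! Sums x(1:n) in np independent chunks. With OpenMP enabled the
  ! chunks run on separate threads; otherwise they run one by one.
  function chunked_sum(x, n, np) result(total)
    integer, intent(in)  :: n, np
    real(dp), intent(in) :: x(n)
    real(dp) :: total
    real(dp) :: partial(np)
    integer  :: ip, i, first, last
    !$omp parallel do private(i, first, last)
    do ip = 1, np
      call chunk_bounds(n, np, ip, first, last)
      partial(ip) = 0.0_dp
      do i = first, last
        partial(ip) = partial(ip) + x(i)
      end do
    end do
    !$omp end parallel do
    total = sum(partial)
  end function chunked_sum

end module chunk_split
```

Whether the column reduction can be cut up this way depends on whether one chunk ever needs a value that another chunk is still computing, which only the author of the solver can judge.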

JohnCampbell · Joined: 16 Feb 2006 Posts: 2560 Location: Sydney

Dan,

I've looked at this problem a lot, and one issue that worries me is why FTN95 performs so poorly in approach 1) compared to 3).
Basically changing from:

PaulLaidler · Posted: Sat May 12, 2012 8:04 am Post subject:

John

I have noted this for investigation. What is the purpose of A_ptr_b and B_ptr_b? How do they vary?

An initial investigation would look at the /explist output for each case and also possibly...

DanRRight · Posted: Sat May 12, 2012 9:02 pm Post subject:

I'd like to remind everyone that a small demo code would simplify understanding of the problem and speed up the solutions by a factor of ... 1000.

Optimization is the trickiest problem in any language, and it is the main achievement and pride of Fortran. Besides its inherent simplicity and its legacy of developed libraries, it's the primary reason Fortran still survives. My suggestion is to ask the hard questions in comp.lang.fortran, where you may have the luck to find the ultimate Fortran pro in your subject. Two or three Fortran compiler developers and Fortran standard committee people were often there, and hidden optimization tricks were always the favorite topic.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2560 Location: Sydney

Dan,

I've generated a simplified code, based on the direct reduction of a profile solver. I've included 6 inner loop options: the 4 I presented plus 2 variations from Paul's post.
It should be compiled with /lgo or /opt /lgo.
To change the size of the problem, I'd recommend changing "max_size". It can run up to 2 GB on Win_7-64 or probably 1.3 GB on XP_32.
The results show that the wrapper improves the run time by a factor of about 3 with FTN95 /opt.
I need to test it with other compilers.
It is 253 lines of code, so I'll see how much can be copied per post.

John

JohnCampbell · Joined: 16 Feb 2006 Posts: 2560 Location: Sydney

continued

JohnCampbell · Joined: 16 Feb 2006 Posts: 2560 Location: Sydney

hopefully the end