Silverfrost Forums

FTN 95 8.10 Personal Edition

24 Feb 2017 11:03 #18888

Will the new version 8.10 of FTN95 also be available as a Personal Edition?

25 Feb 2017 2:40 #18895

It will.


-- Admin Silverfrost Limited
25 Feb 2017 4:41 #18896

Why not just 'Yes'?

28 Feb 2017 3:24 #18925

Not palindromes, sorry, as understood by Brits who don't have a thing about Alaskan Bush People but who mastered ancient Greek at school.

'It will' was excessively prolix when 'Yes' would have done. The answer to 'When' is probably 'Soon'. Eventually it will be 'Now'.

Is it? It is! (which isn't, if you get my meaning.)

Eddie

28 Feb 2017 3:36 #18926

Sorry, I am not so familiar with your language, so I looked up the meaning of 'palindrome' on Wikipedia. Wikipedia says:

A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward, such as madam or racecar.

So I think:

Yes backward is sew

and

'It will' backward is 'lliw tI'

My original question was: when will the Personal Edition 8.10 be available? Today, tomorrow, or later on?

28 Feb 2017 3:37 #18927

Of course:

'Yes' backward is 'seY'

1 Mar 2017 1:53 #18928

When you get 8.10 or any further upgrade, please do not forget to report all the problems you find with it. This way we users will help debug it and make it perfect much faster; otherwise it sometimes takes years for a hidden bug to expose itself. Do not forget your suggestions for improving the compiler and debugger, either.

1 Mar 2017 2:20 #18934

Paul,

Since you read every post, I suspect that you didn't appreciate the difference between palindrome and Palin drome: the latter being a punning usage about a US political figure (I guessed) or one of the Monty Python team (more probable in retrospect).

My original post was an observation on brevity, for which you are famous (notorious?).

The whole point is: will 8.10 be PE? Yes. And when? Soon. Even you can't be briefer than that! (I'll accept 'OK' instead of 'Yes' to prove that you can.)

The second point is that a PE release opens up a whole set of extra testers. For example, I use an obsolete Academic version of FTN95 to support some software in the department I retired from 5 years ago, and the current PE for my personal dabblings. While 8.10 isn't out as a PE, I can't look for any problems in it (Dan).

If you want the answer to the question 'Which is more likely to find bugs: (a) one tester working for X days, or (b) N testers each working for X/N days?', the answer is probably (b). If you tack (b) on after (a) and ask whether this does a better job than (a) or even (b) alone, the answer with a high degree of probability is 'yes', and the certainty increases with N.

Every dilettante's favourite program is, in effect, a test suite that cannot simply be reproduced at Silverfrost, and collectively these programs run on a range of hardware that few single organisations can support.

An early release of the PE seems to me therefore to be in everyone's interest.

Eddie

PS. The briefest reply short of ignoring this is 'OK', but I suppose most of us would settle for a date.

6 Mar 2017 5:07 (Edited: 7 Mar 2017 10:36) #18984

Eddie, Now!

Today, FTN95 8.10 PE is available for download. I am pleased to find that most of the bugs that I reported have been fixed.

For the first time, I find that SDBG64 is usable, but it still lacks features that I value in the 32-bit debugger. For example, I cannot set the font large enough for easy reading, nor can I set the background color.

It would be nice if the register window displayed the values in FPU registers, and if the call stack showed, in addition to routine names, line numbers or at least addresses.

7 Mar 2017 9:21 #18994

Does the 32-bit version show line numbers in the call stack window?

7 Mar 2017 10:38 #18997

No, but I wish it did! The information to do that seems to be in the EXE already, since the pop up after a program error shows the information in the traceback.

7 Mar 2017 10:48 #18998

I'll put it on the to-do list.

8 Mar 2017 2:56 #19005

The new 64-bit 8.10 is fast, and sometimes much faster with the /optimize option, but optimization does not always work and sometimes crashes the code.

The old compiler was never completely fixed for all such errors, I suspect because it was difficult to demonstrate the cause in some reasonably small code that the developers could work on.

I'd urge users to try /opt, and if you can minimize the source into a smaller demonstration program, report it to Silverfrost.

11 Mar 2017 9:58 #19061

A couple of years back, Davidb wrote assembler utilities Vec_Add_SSE, Vec_Sum_SSE, ... to use SSE. As usual, they were just embedded in the Fortran source and recognized by the compiler. They looked like this:

! Assembly code is between code, edoc lines
    code
       movupd xmm7%, v            ; move v array to xmm7
       mov eax%, =x               ; address of x
       mov ecx%, =y               ; address of y
.................

Now the 64-bit compiler does not recognize them:

6942) movsd [ecx%], xmm0%        ; form y(1) = y(1) + a*x(1)
*** Error 29: Syntax Error
6966) movupd [ecx%], xmm0%       ; move xmm0 into next 2 doubles in y
*** Error 29: Syntax Error
*** Error 343: Unrecognised assembler mnemonic - MOVAPD
6976) movapd xmm1%, [eax%+16]    ; move next 2 doubles in x into xmm1
6999) movsd [ecx%], xmm0%
    10 ERRORS  [<VEC_ADD_SSE> FTN95 v8.10.0]

Any ideas on how to resolve this issue?

11 Mar 2017 10:06 #19062

Dan,

FTN95 /64 provides new routines for this; see ...\ftn95\doc\noteson64bitftn95.txt:

SSE and AVX support
-------------------------------------------------------------------------------
FTN95 /64 creates machine code that makes some use of the SSE and AVX instruction 
sets (see https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions). Users can 
also provide direct SSE/AVX support via CODE/EDOC statements in their code (see 
below for further details).

Four 'BLAS' type library routines (DOT_PRODUCT8@,DOT_PRODUCT4@,AXPY8@ and AXPY4@) 
are also provided and these make direct use of the SSE/AVX instruction sets. 
In addition, the library function USE_AVX@ can be called in order to instruct these
routines to use AVX rather than SSE when the CPU and operating system make this 
possible.

REAL*8 FUNCTION DOT_PRODUCT8@(x,y,n) 
REAL*8 x(n),y(n) 
INTEGER*8 n 

REAL*4 FUNCTION DOT_PRODUCT4@(x,y,n) 
REAL*4 x(n),y(n) 
INTEGER*8 n 

SUBROUTINE AXPY8@(y,x,n,a) 
REAL*8 x(n),y(n),a 
INTEGER*8 n
(Y = Y + A*X) 

SUBROUTINE AXPY4@(y,x,n,a) 
REAL*4 x(n),y(n),a 
INTEGER*8 n 
(Y = Y + A*X) 

INTEGER FUNCTION USE_AVX@(level)
INTEGER level
(Set level = 0 for SSE. Set level = 1 for AVX. The function returns the level that 
will be used by the current CPU/OS.
The default level is 1 which means that AVX will be used when available otherwise 
SSE. If USE_AVX@(1) is called before an ALLOCATE statement then the resultant 
addresses will be 32 byte aligned. The USE_AVX@ level must be the same at a 
corresponding DEALLOCATE.)

For example:

INTEGER(4),PARAMETER::n=100
REAL(2) DOT_PRODUCT8@,prod,x(n),y(n)
INTEGER USE_AVX@,level
! x = ...; y = ...
level = USE_AVX@(0)
prod = DOT_PRODUCT8@(x,y,n)

11 Mar 2017 10:28 #19064

Cool! Thanks, John. At first glance I do not see that they offer exactly the same functionality as Vec_Add_SSE and Vec_Sum_SSE in the routine below, but I will look closer.

    subroutine SSE_BlockSolver
    use clrwin 
    use MajorDeclarations 
    real*8    FFFF, SUM1, Vec_Sum_SSE 
    external  Vec_Sum_SSE 
    integer*4 k,i, next_k 

    next_k = 100
    Progress = 0
    DO  k=1, nEquat-1 

 !........ Progress 
      if (k == next_k) then 
         Progress = k/(nEquat-1.) 
         call temporary_yield@ 
         call window_update@(Progress)    
         next_k = k+100 
      endif 
 !....... End Progress 

      do I=k+1,IJmax(k) 
         FFFF = -AT(k,i)/AT(k,k) 
         AT(k,i) = 0. 
 !          do  j=k+1,IJmax(k) 
 !            AT(j,i) = AT(j,i) - FFFF * AT(j,k) 
 !          enddo 
         call Vec_Add_SSE ( AT(k+1,i), AT(k+1,k), FFFF, IJmax(k)-k) 
         B(i) = B(i) + FFFF * B(k) 
      end do 
    END DO 

 !   X(nEquat) = B(nEquat)/AT(nEquat,nEquat) 
 ! 100   SUM1=0. 
 !      do j=i+1,IJmax(I) 
 !        SUM1 = SUM1 + AT(j,i) * X(j) 
 !      enddo 
    do i = nEquat, 1, -1 
       SUM1  = Vec_Sum_SSE ( AT(i+1,i), X(i+1) , IJmax(I)-i ) 
       X(i) = (B(i)-SUM1)/AT(i,i) 
     end do 
 !      i=i-1 
 !      IF(i.gt.0) GOTO 100 

       if(kLookAtSolution.eq.1) write(*,'( 1pe14.7)') (X(i),i=1,5)
 
 ! 10000   continue 
      end subroutine
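
On paper, though, the mapping looks direct: AXPY8@(y,x,n,a) computes Y = Y + A*X, which is what Vec_Add_SSE does here, and DOT_PRODUCT8@(x,y,n) is the sum of x(i)*y(i), i.e. Vec_Sum_SSE; the one catch is that both want the length argument as INTEGER*8. A minimal self-contained check against plain Fortran (my sketch, untested, not a drop-in for the solver above):

    PROGRAM check_axpy_dot
    IMPLICIT NONE
    INTEGER*8 n
    PARAMETER (n = 1000)
    REAL*8 x(n), y(n), yref(n), a, s
    REAL*8 DOT_PRODUCT8@            ! FTN95 /64 library function
    INTEGER i
    a = 0.5d0
    DO i = 1, n
       x(i) = DBLE(i)
       y(i) = DBLE(2*i)
    END DO
    yref = y + a*x                  ! reference result in plain Fortran
    CALL AXPY8@(y, x, n, a)         ! y = y + a*x via SSE/AVX
    PRINT *, 'AXPY8@ max diff:', MAXVAL(ABS(y - yref))
    s = DOT_PRODUCT8@(x, y, n)      ! dot product via SSE/AVX
    PRINT *, 'dot product diff:', s - DOT_PRODUCT(x, y)
    END PROGRAM
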
11 Mar 2017 1:44 #19070

One should be careful when using linear equation solving subroutines that do not implement pivoting, at least partial pivoting.

Adding pivoting, however, need not imply the use of FPU or SSE instructions, since block copies can be performed using memcpy() and friends, which use only integer instructions.
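
For illustration, here is a minimal sketch (mine, not from any particular library) of one dense elimination step with partial pivoting; step 2 is the pure data movement just mentioned:

    SUBROUTINE pivot_step(a, b, n, k)
    ! One Gaussian elimination step with partial pivoting (dense case).
    IMPLICIT NONE
    INTEGER n, k, p, i
    REAL*8 a(n,n), b(n), f, t
    ! 1. Find the largest |a(i,k)| on or below the diagonal.
    p = k
    DO i = k+1, n
       IF (ABS(a(i,k)) > ABS(a(p,k))) p = i
    END DO
    ! 2. Swap rows k and p: block copies, no floating-point arithmetic.
    IF (p /= k) THEN
       DO i = 1, n
          t = a(k,i); a(k,i) = a(p,i); a(p,i) = t
       END DO
       t = b(k); b(k) = b(p); b(p) = t
    END IF
    ! 3. Eliminate below the pivot (each row update is an AXPY).
    DO i = k+1, n
       f = -a(i,k)/a(k,k)
       a(i,k:n) = a(i,k:n) + f*a(k,k:n)
       b(i) = b(i) + f*b(k)
    END DO
    END SUBROUTINE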

11 Mar 2017 8:49 #19073

Besides that, without pivoting the algorithm becomes super simple. So far I have never seen any problems after killing pivoting, especially if you move to REAL*8, where rounding errors decrease tremendously while the speed is the same. There were no zeroes on the major diagonal in my physical model, and the numbers there were naturally the largest, or at least not too small. I would not risk doing that when calculating a Mars landing, though. 😃

Mecej4, are you familiar with good parallel methods for block matrices (squares of different sizes on the major diagonal)? This is the only reason I use the LAIPE.LIB library, which now has to be recompiled by its author for 64-bit Intel Fortran; that should be partially compatible with FTN95 in LIB form, or fully compatible as a DLL. It is generally a good library and exists for 32-bit IVF and for 32- and 64-bit gFortran, but the 64-bit one was never tried with FTN95, unless JohnCampbell has already done that. The gFortran version should be available for free.

By the way, John promised to come to my North Pole and 'collect' from me a small prize (I forget how much I offered a few years back: $30, $50, $100?) for showing proof that his own methods are faster than LAIPE, but I have never seen a real comparison, even for a simple dense or skyline matrix, and even for 32 bits. Any news, John? 😃

Comparisons of different compilers can be seen on the website equation dot com.

12 Mar 2017 1:58 #19075

The use of partial pivoting is made more difficult when sparse storage methods are used, such as banded or skyline storage. SSE_BlockSolver is a variable-band solver, used for well-conditioned sets of equations. It appears to use Gaussian elimination with variable-length rows, as a DAXPY operation is used for the forward reduction. I have not seen examples of pivoting used with banded or skyline solvers, but I presume some 'partial' pivoting could be applied. Typically with these sets of equations, if the diagonal is very small, an artificial restraint is applied to the equation.

Dan,

To answer your question: I have found my Laipe comparison results, run on my i7-4790K, i5-2300 and i7-6700HQ, all of which are 4-core processors. I've been trying to source new PCs (i7-7700K or i7-6850K) with faster memory and/or more cores, to see whether cache, cores or memory speed is significant, but I don't have the budget. The Laipe test is to compute [C] = [A][B], where [A], [B] and [C] are 4-byte real matrices: [A] is of order 15,000 by 11,000, [B] is of order 11,000 by 12,000, and [C] is of order 15,000 by 12,000. My tests use 8-byte reals, which doubles the memory requirement (more cache conflicts). My matrix multiplier includes a cache-size blocking strategy to minimise cache-memory conflicts; a sketch of the idea appears at the end of this post. Large matrix multiplication is one of the easiest calculations for applying OpenMP. One interesting outcome of my tests is that I don't get good efficiency as more threads are introduced, due mainly to problems with hyper-threading 5 to 8 threads onto 4 cores; but it is elapsed time, rather than efficiency, that is important. (The i7-4790K result is the clearest, and worst, example of hyper-threading failure I have found.)

Threads     i5-2300  i7-4790K  i7-6700HQ  Xeon L7555  Opteron 6168
cache           4.5         6        4.5
      1      1108.6     579.8      656.5      5678.2        3493.6
      2       577.9     295.8      373.6      2839.3        1730.2
      3       404.5     201.0      296.6      1896.5        1151.6
      4       318.0     154.9      240.7      1420.4         865.9
      5                 196.0      246.8      1136.6         691.4
      6                 179.9      232.0       955.1         580.7
      7                 190.2      234.4       820.9         498.0
      8                 193.6      241.2       745.7         434.8
     32                                        204.4         119.6
     48                                                       88.6

The processors I have used are basic Intel i-series processors, the cheap kind available in most stores. I don't know a lot about the multi-core processors used for the published Laipe results, but for a single thread they are amazingly slow; one is a many-core Xeon, so it should not be this slow, should it? Quoting great efficiency for multi-threaded calculations with such poor elapsed-time performance is hardly relevant.
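
The blocking idea, in a bare sketch (not my production code: the tile size nb is just a placeholder to tune against the cache, and the OpenMP directives assume a compiler that supports them):

    SUBROUTINE blocked_matmul(a, b, c, l, m, n)
    ! Cache-blocked C = A*B: for each (jj,kk) tile, the nb-by-nb block
    ! of B stays in cache and is reused across the whole sweep over i.
    IMPLICIT NONE
    INTEGER l, m, n
    REAL*8 a(l,m), b(m,n), c(l,n)
    INTEGER, PARAMETER :: nb = 64   ! tile size: tune to the cache
    INTEGER i, j, k, jj, kk
    c = 0.0d0
    ! Threads take distinct column blocks of C, so there are no races.
    !$OMP PARALLEL DO PRIVATE(i, j, k, kk)
    DO jj = 1, n, nb
       DO kk = 1, m, nb
          DO j = jj, MIN(jj+nb-1, n)
             DO k = kk, MIN(kk+nb-1, m)
                DO i = 1, l
                   c(i,j) = c(i,j) + a(i,k)*b(k,j)
                END DO
             END DO
          END DO
       END DO
    END DO
    !$OMP END PARALLEL DO
    END SUBROUTINE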

John

12 Mar 2017 3:16 (Edited: 12 Mar 2017 9:35) #19076

Again, John, you are feeding this Shakespeare-country forum with words, words, words. This comparison is not even apples to oranges; it is apples to a description of oranges. Take the real LAIPE library and your test and do the elementary thing:

  1. SAME SOURCE SOFTWARE on
  2. SAME HARDWARE.

Over the decades I have seen many strange claims and strange test results caused by typos, different assumptions, wrong initial conditions, etc. Everything must be done as a so-called clean experiment, where there is no other possible explanation. In our case that means everything has to be run side by side in order to get clean results.

Lately on the net, kids compare everything to everything: CPUs, GPUs, cellphones, car fuel efficiency, etc., yet not a single novice would do a comparison like the one in your post. No one ever compares, say, different cellphones running even different VERSIONS of the same software! You are comparing an unknown test with another unknown test run on different processors, and you claim that your method is faster!!! 😃

And finally, what cache misses are you talking about? Your cache is around 10 MB, while the memory involved is 12000 * 15000 * 8 bytes, i.e. more than 1 GB! The 12000 * 15000 multiplications themselves take less than a second out of the ~1000 s your test takes. This is a memory-bandwidth-bound problem: bad 'test', bad solution method; the processor is doing nothing, just waiting for the SDRAM. 😃 For this primitive test the cache is effectively unused, because there are no intermediate results that are used again; besides just one multiplication per new pair of array elements, nothing else is done. 😃 The only thing it is good for is to show the scalability of the method with the number of cores, exactly as the author of LAIPE does. I do not see matrix multiplication in my LAIPE library, by the way; this is probably some add-on. Take the skyline, block, or just dense solver, for example, and prove in a straight side-by-side comparison that your method is faster, John. The prize is good-quality Stoli, whiskey, or $50.

Additionally, if you or anyone else succeeds in adapting 64-bit LAIPE to 64-bit FTN95, and this increases code speed with block matrices versus the current 32-bit LAIPE on 32-bit FTN95, I will double the prize. The same offer stands for any other parallel method for block matrices adapted to 64-bit FTN95, if it is faster than the current 32-bit LAIPE. Worth the fun!
