forums.silverfrost.com Forum Index -> Support

Dual Processor Optimisation

lozzer
Joined: 27 Jun 2007    Posts: 49
Posted: Wed Jan 09, 2008 4:17 pm    Post subject: Dual Processor Optimisation

It might be a dumb question, but is FTN95 able to multithread and therefore utilise the capabilities of a dual core processor setup such as the Intel Centrino Duo systems?

We've found that certain FTN95 applications now run slower on a dual core system than they did on the previous model single-processor computers using the same OS.

Any ideas anyone please?
_________________
Lozzer

PaulLaidler (Site Admin)
Joined: 21 Feb 2005    Posts: 7926    Location: Salford, UK
Posted: Wed Jan 09, 2008 5:26 pm

Multi-threading is available for .NET only unless you want to do something clever with the Windows API.

In any case, multi-threading requires careful program design and modification. You don't get it automatically.

lozzer
Joined: 27 Jun 2007    Posts: 49
Posted: Wed Jan 09, 2008 5:30 pm

Ah, thought as much. I think a few customers remain disappointed with what they thought would be PCs with twice the performance, only to discover that most software doesn't use the extra power, and that Vista has already sapped what little performance headroom there was.

I think .NET is the way we are going anyway medium to long-term.

Thanks for your quick reply.
_________________
Lozzer

LitusSaxonicum
Joined: 23 Aug 2005    Posts: 2388    Location: Yateley, Hants, UK
Posted: Thu Jan 10, 2008 12:42 pm

I've sent you a private message because my suggestions are too long to appear in the normal postings. If you don't get it all, I might try your firm's e-mail, if you don't mind.

Eddie

JohnHorspool
Joined: 26 Sep 2005    Posts: 270    Location: Gloucestershire UK
Posted: Thu Jan 10, 2008 1:15 pm

My memory may be playing tricks on me, but couldn't the old DOS-based FTN77 compile multi-threaded applications? If so, why must we use .NET with FTN95?

One other obvious advantage of a dual processor over a single one is that more than one CPU-intensive application can run at a time with much less impact on performance. But managing a dual processor carries an overhead, so running a single CPU-intensive application on the dual machine will be slightly slower than on the single.

In practice there are a limited number of applications that lend themselves to parallel processing. FE analysis is one. The mathematics involved (matrix partitioning and such like) is daunting. I have great admiration for anyone doing this !

Eddie, I'm sure forum members would like to see your suggestions !

PaulLaidler (Site Admin)
Joined: 21 Feb 2005    Posts: 7926    Location: Salford, UK
Posted: Thu Jan 10, 2008 2:42 pm

I cannot find any multi-threading routines in any version of FTN77 or FTN95 except FTN95 for .NET. I guess it was implemented there because it was relatively easy to tap into the .NET multi-threading features.

In theory we could provide multi-threading capabilities in Win32 FTN95 but I do not see this as having a high priority.

LitusSaxonicum
Joined: 23 Aug 2005    Posts: 2388    Location: Yateley, Hants, UK
Posted: Thu Jan 10, 2008 4:39 pm

In response to John Horspool as well as Nigel, here is my suggestion for client-server computing, effectively multi-threading, using FTN95 and ClearWin, not .NET. If the forum truncates my message, I'll post a "second half". I originally had FE analysis in mind. My idea was to have several server apps able to (a) produce stiffness matrices, perhaps for 100-element blocks, and (b) perform stress extraction, again perhaps in 100-element blocks. The client in any case orders the elements (perhaps in frontal solution order) for processing. I imagine that reduction of your structure into substructures would also permit server applications that performed the substructure reduction, so that the client eventually only needs to solve an assemblage of substructures. John, you will have to read the following with FE in mind.

Nigel,

I read your posting on the Silverfrost Forum with interest, and was reminded of a comparable debate when CPUs started to become clock-doubled (they are now all clock-multiplied). The problem is that the rest of the system stays the same – the RAM, the hard disk, etc. This reply is too long for a single post on the Forum, so I have taken the liberty of communicating directly.

Imagine a café, running smoothly with one cook, one waitress, a few customers. The speed of service is how long it takes to cook a pizza. More cooks (dual-core cooks!) often don't help – they are just idle more of the time. When more customers arrive, the extra cooks come into their own. But two cooks can't cook twice as many pizzas – they have to use the same pizza oven and the same fridge – and they get in each other's way. The single waitress often can't cope with a rush … and in any case, the café isn't used only by paying customers: there is also a health inspector, the VAT inspector, and (assuming the café is in Naples!) a Mafioso checking on activity to see how much profit can be creamed off for protection.

For some situations, the extra cook is useful. In others, it needs more waitresses, more tables, a bigger pizza oven etc.

I find that my dual-core machine helps when my virus checker does a scan – I can keep working, most times. Before, I couldn’t. But nothing runs faster.

* If your problem is disk access, then as well as a faster cpu, you need either a faster hard disk, or hard disks set up in a Raid array.
* If your problem is speed of generating complicated graphics, then you need a faster and more expensive graphics card.
* If the problem is speed of RAM, you need faster RAM, lower latency RAM (although this depends on cpu type), more cache (ditto).
* If the problem is lack of RAM (all the inspectors taking up tables in the café) – more RAM helps, although a faster hard disk helps too if the virtual memory sends things to disk often.

In the days before FTN for Windows (i.e. FTN77 with DBOS), the compiler did offer multi-threading, but I could never see the point of it (except for the ease with which some things could be programmed) on a single CPU.

I looked on your firm’s website to see what you do - so that I could write sensibly with an example. Here it is.

When your main application senses that it is going to have a busy computationally-intensive set of jobs to do – say to superimpose a 10km road layout on a digital ground model – the problem is probably solved in its entirety, and then one displays the part of the job that can be seen. If the problem is the amount of computation on the model, then there’s a wait before the scene shows, but the scene renders quickly. If the problem is drawing the screen, then a faster graphics system is probably going to help, but if it is calculating the geometry of what is on the screen, then it is cpu and memory speed again that helps most.



LitusSaxonicum
Joined: 23 Aug 2005    Posts: 2388    Location: Yateley, Hants, UK
Posted: Thu Jan 10, 2008 4:40 pm

If you now divide your problem into (say) 1 km sections, you only need to compute the sections actually visible on the screen. But, when the screen view changes, you need to compute the extra sections – and that slows the screen responsiveness down.

Users have a poor idea of what takes time to compute, and in any case will perceive a program as sluggish after only a few seconds of waiting. The hourglass cursor and the progress bar are ways of pacifying users, no more!

What you are asking, in effect, is: could you calculate two sections simultaneously, using the two cores? The problem is that even if your application is multi-threaded, it is the operating system that distributes the extra thread(s) to the processors. They may both end up on the same core if the other core is busy when the request is made. Now the one core doing the work has both threads contending for its resources, and in any case there is an overhead in dividing the problem into two and then merging the results, so you get poorer performance. So the answer has to be: yes, in principle, distributing the computational load between processors might greatly speed things up, but it could equally slow them down a bit in some cases.

You can get some of the benefits of the second CPU core without explicit multi-threading by getting some of the work done by a second (helper or "server") application (B) that can work in parallel with the first, or "client", application (A). The servers take the role of extra threads, but are self-contained programs in their own right. You could then have separate instances of B running – in the case mentioned, perhaps B1 … B10 to cope with each km. I have been playing around with something similar in recent months, though not at this very moment.

So, the strategy is as follows. Sort out which parts of the problem are visible, and which are not (these have lower priority). Say km 5 to 8 are visible. First, abstract the data to work on for km 5, and write it to disk; the data file will stay in the cache if it isn't too big. Start up instance B1 of the program that is doing the heavy computations for this section of the road – this is done with CALL START_PPROCESS@. You can write results to disk, and again this data set stays in the cache (depending on size). Start up instance B2 to run km 6, instance B3 for km 7, and so on. Each instance of program B can report back when it has finished through ClearWin's messaging (see SEND_TEXT_MESSAGE@, REPLY_TO_TEXT_MESSAGE@, CLEARWIN_STRING@('MESSAGE_TEXT') and the %nc and %rm format codes). You can also make the main window of a server application not show on the taskbar using %sy[toolwindow], and minimise it with %ww[minimise], so that the use of server applications is invisible. (You need a window of sorts so that you can use %nc and %rm.) When instance B1 has finished, maybe you can send it the names of other datasets to work on, and avoid the overhead of starting up a new instance (say B4). When enough server applications have reported back, application A can get on with displaying those results.
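
By way of illustration only (not part of the original post), the client side of that strategy might look something like the sketch below. The server executable name and the data file names are invented, and the two-string argument list shown for START_PPROCESS@ is an assumption – check the ClearWin+ entries in the FTN95 .chm for the exact signature.

Code:
      PROGRAM CLIENT_A
C     Sketch only: launch helper ("server") programs to process 1 km
C     sections of the model alongside the main application.
C     START_PPROCESS@ is the routine named above; its argument list
C     here (executable, command line) is an assumption - see the .chm.
C     ROADSERV.EXE and the KMxx.DAT file names are invented.
      INTEGER K, NSEC
      PARAMETER (NSEC = 3)
      CHARACTER*16 DATFIL(NSEC)
      DATA DATFIL / 'KM05.DAT', 'KM06.DAT', 'KM07.DAT' /
C
      DO K = 1, NSEC
C        1. Abstract the data for this visible section and write it
C           to disk; a small unformatted file should stay in the
C           Windows file cache.
         OPEN (UNIT=10, FILE=DATFIL(K), FORM='UNFORMATTED',
     +         STATUS='UNKNOWN')
C        ... WRITE (10) the subset of the model for this section ...
         CLOSE (UNIT=10)
C        2. Start an independent server instance to crunch it; the
C           server is given the data file name on its command line.
         CALL START_PPROCESS@ ('ROADSERV.EXE', DATFIL(K))
      END DO
C     3. Each server reports completion through ClearWin+ messaging
C        (SEND_TEXT_MESSAGE@, received in this program's window via
C        %rm), after which the client reads the result files and
C        displays them.
      END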

When the display has been completed, perhaps you need to send messages and datasets to the now idle B1, B2 and B3 to process km 4, km 9, km 3, km 10, km 2 and then km 1, so that if the user scrolls away from the presently visible section the computations have already been done. Indeed, assuming that the user stops to look at the display, nothing much is happening in client A, and all the servers B1 … Bn can get on with their calculations.

So far, I have determined that there are useful points in any program at which to actually start the helper applications B1 … Bn. A good time is while A is starting up – users are used to this taking some time. Another good time is after something has been displayed that the user has to look at and read – then you know that the CPU(s) will be comparatively idle for long enough to get B1 … Bn launched.



LitusSaxonicum
Joined: 23 Aug 2005    Posts: 2388    Location: Yateley, Hants, UK
Posted: Thu Jan 10, 2008 4:40 pm

The more subtasks (server applications) there are, the more chance you have of using both cores (or all four in a quad core) simultaneously. The smaller the datasets are, the more likely they stay in cache and aren't written to or read from the hard disk while the processor cores are fully engaged. However, the more messaging is going on, the worse the performance will be on a single-core CPU. It takes some tuning to know how many server applications are needed relative to the number of CPU cores, etc.

All this is a major programming task, but I have concluded that the move from ClearWin to .NET is a bigger one, both in comprehension and in coding. The strategies for using multi-threading, and the problems of optimising it, are broadly similar to the client-server approach of using separate applications, except that in the latter you are most likely not operating on a single dataset (for example some huge arrays), and you have the problem of extracting subsets and merging the results.

What would be nice is to run the server applications B1 … Bn on different computers on a network as well as on different cores of the CPU in a particular computer. This could distribute the entire processing load throughout all the networked computers in a given office, most of which are doing absolutely nothing but consuming power at any instant in time!

To sum up (identifying my personal experience with FTN95):

• Multi-threading is, in effect, a client-server approach within a single program.
• Multi-threading can't be done straightforwardly, if at all, using Win32 FTN95 and ClearWin.
• There are no speed benefits from it on a single-core CPU, and there may well be speed disbenefits in that case.
• You can write client-server programs with Win32 FTN95 and ClearWin both straightforwardly and simply, and they do work (my personal experience).
• The server programs have a good chance of running on different CPU cores at the same time on a multi-core CPU (my personal experience).
• The datasets passed from client (A) to server(s) B can be written and read very quickly if they are small enough to remain in cache – otherwise they contend for access to the hard disk and the process is slowed (my personal experience).
• Fortran's modular nature means that it is possible to extract huge chunks of code to put in server applications without necessarily removing them from the client, and the client application can then decide whether or not to use server applications (my personal experience), so the programming effort is not necessarily prohibitive.
• Working out which sections of a given application are worth treating this way should become obvious from which parts of the program users find sluggish.
• There are no speed benefits from client-server systems on a single-core CPU (obviously), and there may well be speed disbenefits in that case. However, if their use is optional, which it can be, then the more effective strategy can be selected in a particular case.
• Once you have a client-server system, its benefits increase dramatically as the number of CPU cores goes up – from 1 to 2, 4 and beyond. The more there are, the more likely it is that some will be free. As I understand it, both Intel and AMD are saying this is the way they are going, rather than towards yet faster single-core CPUs.
• In principle, client-server computing should be possible over a Windows network, but how this is done, with or without FTN95, is beyond me.

I hope that this is some help.

Best regards

Eddie Bromhead

LitusSaxonicum
Joined: 23 Aug 2005    Posts: 2388    Location: Yateley, Hants, UK
Posted: Thu Jan 10, 2008 5:00 pm

So finally, and in response to Paul: multi-threading isn't the only way to do client-server computing – you can do it with separate applications. It is simply a matter of deciding what the roles and responsibilities of each type of server are, deciding when to launch the server apps, and knowing how to (a) break the problem down and (b) re-assemble the results. In virtually every respect, these are the same problems that have to be addressed in multi-threading. The inter-process communications are trivial, and are fully described in FTN95's help file – they use standard ClearWin procedures and facilities!

The communication medium is the simple Fortran "disk" file, which may well remain cached and thus is written to and read from at RAM speeds (this is actually stated in FTN95's .chm file, if you look!).
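
By way of illustration (not from the original post), a server program built around that file hand-off can be as short as the sketch below. The file layout and names are invented, and CMNAM@ is assumed to be the library routine that returns the next command-line token – check the .chm for the exact routine and the length to declare.

Code:
      PROGRAM ROADSERV
C     Sketch of a "server": read the dataset named on the command
C     line, do the heavy computation, write a results file.  Small
C     files normally stay in the Windows file cache, so the hand-off
C     runs at RAM-like speeds.  CMNAM@ and its declared length are
C     assumptions - check the FTN95 .chm.
      CHARACTER*256 DATFIL, RESFIL, CMNAM@
      INTEGER N, I
      DOUBLE PRECISION X(1000), Y(1000)
      DATFIL = CMNAM@()
      RESFIL = 'RES_' // DATFIL(1:LEN_TRIM(DATFIL))
C     Read the dataset prepared by the client.
      OPEN (UNIT=10, FILE=DATFIL(1:LEN_TRIM(DATFIL)),
     +      FORM='UNFORMATTED', STATUS='OLD')
      READ (10) N
      READ (10) (X(I), I = 1, N)
      CLOSE (UNIT=10)
C     Stand-in for the real number crunching.
      DO I = 1, N
         Y(I) = 2.0D0*X(I)
      END DO
C     Write the results where the client expects to find them.
      OPEN (UNIT=11, FILE=RESFIL(1:LEN_TRIM(RESFIL)),
     +      FORM='UNFORMATTED', STATUS='UNKNOWN')
      WRITE (11) N
      WRITE (11) (Y(I), I = 1, N)
      CLOSE (UNIT=11)
C     Finally the server would report back to the client window via
C     SEND_TEXT_MESSAGE@ / %rm, as described in the earlier posts.
      END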

As is usual, there is a lot more in Clearwin than first meets the eye, and whatever geniuses conceived it deserve admiration.

My apologies for such a lengthy posting, which of necessity had to be split over several entries.

I just re-read John's posting, and the difficulties you see in matrix partitioning in your FE code are perhaps not as great as you imagine. If you sort your elements into a list for processing in frontal order, the list can be chopped up into arbitrary "substructures". Within each sublist you have (I think) a perfectly self-contained substructure, within which all the interior node equations can be eliminated (by a server app). Eventually, the final assembly and solution is of a much-reduced problem made up of a number of "super-elements", each of which is one of the substructures. I never originally saw the point of substructure operations, but they fit well into a client-server model. After the first, arbitrary, subdivision stage, you might wish to consider substructures that contain only linear behaviour and those that contain non-linear behaviour, and treat the two with different servers. Start now, and you will be ready for 16-core CPUs!!!!!

Eddie

JohnCampbell
Joined: 16 Feb 2006    Posts: 2554    Location: Sydney
Posted: Sat Jan 12, 2008 8:57 am

I'd be very interested to know if some basic vector/matrix operations, such as:
dot_product, or
vector A = vector A + constant x vector B
could be adapted for multiple (dual/quad) processors.

I find these two operations account for the bulk of my FE calculations.
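
By way of illustration (a sketch added here, not from the original post – the subroutine name, chunk count and array names are arbitrary), both kernels split naturally into independent chunks, which is the property any multi-core or multi-process scheme would exploit: the dot product needs only a short serial reduction at the end, and the vector update is independent element by element.

Code:
      SUBROUTINE VECOPS (N, A, B, C, DOT)
C     Sketch: the two kernels written chunk by chunk, so that each
C     chunk could in principle be handed to a different core or
C     server program.  NCHUNK is an arbitrary value for the example.
      INTEGER N, NCHUNK, I, K, I1, I2
      PARAMETER (NCHUNK = 4)
      DOUBLE PRECISION A(N), B(N), C, DOT, PARTIAL(NCHUNK)
C     dot_product: each chunk forms its own partial sum ...
      DO K = 1, NCHUNK
         I1 = (K-1)*N/NCHUNK + 1
         I2 = K*N/NCHUNK
         PARTIAL(K) = 0.0D0
         DO I = I1, I2
            PARTIAL(K) = PARTIAL(K) + A(I)*B(I)
         END DO
      END DO
C     ... and only the final reduction is serial.
      DOT = 0.0D0
      DO K = 1, NCHUNK
         DOT = DOT + PARTIAL(K)
      END DO
C     A = A + C*B: every element is independent, so any slice of the
C     index range could be updated by a different worker.
      DO I = 1, N
         A(I) = A(I) + C*B(I)
      END DO
      END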

John

LitusSaxonicum
Joined: 23 Aug 2005    Posts: 2388    Location: Yateley, Hants, UK
Posted: Sun Feb 10, 2008 11:46 am

John,

Have you looked at coding this in Assembler?

Even back in 8086/8087 days you could do vector operations extremely fast if you kept intermediate results on the coprocessor stack, basically accumulating inner products and not writing intermediate results back to memory, thus saving (many) clock cycles (you also reduced round-off, as the coprocessor stack was 80-bit). The basic principles were written up by (I think) Richard Startz in a paperback book on the 8087 coprocessor. I Googled, and all I could find was what I think is an updated version also covering the 287 and 387. In his original book he did the vector ops that you mentioned, and interfaced his ASM routines with MS interpreted Basic. The improvement was spectacular. I tried the routines out with MS Fortran, and they were still a huge improvement in speed.
Nowadays you also have options relating to SSE'n' for doing the calcs as well as the copro stack, which seems to be regarded as rather obsolete.
My bet is that you could produce a much faster dot_product routine even today by using assembler than in Fortran.
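
Purely as an illustration of that idea (a sketch, not Startz's code), the Fortran equivalent is simply to accumulate into one local scalar; a hand-coded assembler version of the same loop keeps that scalar on the 80-bit x87 stack, or nowadays in an SSE register, so no intermediate result goes back to memory inside the loop.

Code:
      DOUBLE PRECISION FUNCTION DOTPRD (N, A, B)
C     Sketch only: inner product accumulated in a single local
C     scalar.  Keeping SUM in a register (or on the x87 stack in an
C     assembler version) is where the memory-traffic saving comes
C     from.
      INTEGER N, I
      DOUBLE PRECISION A(N), B(N), SUM
      SUM = 0.0D0
      DO I = 1, N
         SUM = SUM + A(I)*B(I)
      END DO
      DOTPRD = SUM
      END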

Eddie

JohnCampbell
Joined: 16 Feb 2006    Posts: 2554    Location: Sydney
Posted: Mon Feb 25, 2008 4:21 am

Eddie,

Thanks for the thought. I did do that many years ago. These days with the optimisation available in processors and the use of primary and secondary cache, I admit I don't know how best to approach the problem.
The concepts I once used (20 years ago) to optimise do loops do not appear to apply today. If I do simple benchmark tests, I rarely get the same results. With virus checkers and firewalls and all the other background processes running, I'm never sure what I am testing.
I also know that dot products can vary considerably using FTN95, which is the subject of many of my earlier emails. I'm hoping that this can be addressed in the near future.

regards john

PaulLaidler (Site Admin)
Joined: 21 Feb 2005    Posts: 7926    Location: Salford, UK
Posted: Mon Feb 25, 2008 9:31 am

John

Off the cuff, I cannot think of an effective way to use multiple processors for dot_product etc., either at the compiler end or at the programmer end.

My experience is that improvements are hardware driven in the sense that, by the time you have finished optimising the code (via assembler etc.) the advances in hardware have outstripped the effect of the optimisations.

The general advice is:
a) make sure that your algorithms are optimal (in the numerical analysis sense). You still find programmers using determinants to solve linear equations (see the small illustration after this list).
b) use the latest processors
c) maximise the RAM and the size of the swap file.
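
As a small illustration of point (a) – a sketch added here, not part of the original post, with an invented routine name – straightforward Gaussian elimination solves an n-by-n system in O(n**3) operations, whereas Cramer's rule needs n+1 determinants, each ruinously expensive if expanded by cofactors. For real work a tested library solver is preferable.

Code:
      SUBROUTINE GAUSS (N, A, B)
C     Sketch: solve A x = b by Gaussian elimination with partial
C     pivoting.  A is overwritten; B returns the solution.  No check
C     for singularity is made - use a proper library routine in
C     production code.
      INTEGER N, I, J, K, IP
      DOUBLE PRECISION A(N,N), B(N), T
C     Forward elimination with row interchanges.
      DO K = 1, N-1
         IP = K
         DO I = K+1, N
            IF (ABS(A(I,K)) .GT. ABS(A(IP,K))) IP = I
         END DO
         IF (IP .NE. K) THEN
            DO J = K, N
               T = A(K,J)
               A(K,J) = A(IP,J)
               A(IP,J) = T
            END DO
            T = B(K)
            B(K) = B(IP)
            B(IP) = T
         END IF
         DO I = K+1, N
            T = A(I,K) / A(K,K)
            DO J = K+1, N
               A(I,J) = A(I,J) - T*A(K,J)
            END DO
            B(I) = B(I) - T*B(K)
         END DO
      END DO
C     Back substitution.
      DO K = N, 1, -1
         T = B(K)
         DO J = K+1, N
            T = T - A(K,J)*B(J)
         END DO
         B(K) = T / A(K,K)
      END DO
      END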

LitusSaxonicum
Joined: 23 Aug 2005    Posts: 2388    Location: Yateley, Hants, UK
Posted: Mon Feb 25, 2008 9:36 pm

Paul's suggestions are, as always, useful. Certainly, having an efficient algorithm helps - sometimes. If the program is waiting for user input, there is no harm in having a slower algorithm which you can write and debug more easily.

Using the latest processors is good advice - for someone programming for himself (like me!). Sadly, commercial programmers have to make their wares run on the customers' machines. You are likely to find in my Uni that the admin staff have the latest hardware to do nothing on except drive the rest of us mad, and everyone else scrapes by on old and slow machines. My solution to this is to build my own, and take an appropriately specified machine into work as a Fortran Engine. The machine I have to do Uni business on has a clockwork cpu, I'm sure. (Paul: you "scrape by" too. Where is your Vista development machine?).

The remark about hardware is only true if Intel and AMD keep making faster processors. That stalled a year or two ago. Now they make multi-core processors that improve "the windows experience", but don't numbercrunch significantly faster. For example, the extra cpu does help with the virus checker problem, but it doesn't help at all with the 10hr run time, if that is what it takes to solve the problem going flat out.

Coding specific elements of a Fortran program in assembler, when those elements are used millions of times, could benefit John, provided that the gains are significant and the effort isn't huge. Startz's example, which was simply to accumulate intermediate results on the 8087 stack, made a huge difference regardless of CPU speed, and was simple to do. John did mention dot products, and these are simple enough – isn't that what Startz was doing? His 2nd edition must cover 32-bit CPUs. Recoding the whole application in assembler would be barmy. The big advantage of this approach is that when you move to a faster CPU you still reap the benefits – so long as you still have x86 architecture, you keep the speed improvement. I remember trying out Startz's routines (for the 8086/7) on a 286/7, and still getting the same ratios of speed-up … I used MASM and MS Fortran v3, he used MASM and compiled Basic, but no matter. I think it was 10x faster on multiplying two matrices. Just compare the time it takes to push an 80-bit number onto the coprocessor stack (or whatever the SSE equivalent is) with how long it takes to store it to RAM, even cache RAM, and then get it back to add into the running total, and you will see the point.

The benefits from a multi-cpu set up can only be realised if either the programmer, the compiler writer, or the OS author implements the right approach. I can't see a finite element application getting much benefit from a dual core machine. I'm writing this on such a machine, and it isn't a jot faster to number crunch than the single core cpu I took out. Effort in coding specifically for multicore machines, however, could be worth it. By the time the program is debugged, there will be 8, 16 or more cores in common use, and sending off myriads of applications each to solve part of the problem will get there faster than a single threaded behemoth. However, converting a single threaded application to be multi-threaded is not a trivial undertaking.

Eddie