mecej4,
I find the /timing option to be a good approach. It reports the elapsed time accumulated in each routine compiled with /timing. From the source code, I generate two lists of files: the utility routines, and the code whose run time I want to measure. I gather each list into a file of INCLUDE statements, then compile the first with /debug and the second with /timing. This encourages you to break up large subroutines into smaller pieces and get timings for each piece, which helps when isolating code to improve. Basically, I compile with /timing only the source code that I want to review, leaving out anything that would take too long to run with timing enabled (e.g. functions that are called millions of times), because there is a timing call overhead on entry to and exit from each routine (based on cpu_clock@/RDTSC_VAL@).
The following is a batch file I used for a large simulation I have:
now >ftn95.tce
del *.obj >>ftn95.tce
del *.mod >>ftn95.tce
SET TIMINGOPTS=/TMO /DLM ,
ftn95 sim_ver1_tim /timing >>ftn95.tce
ftn95 sutil /debug >>ftn95.tce
ftn95 util /debug >>ftn95.tce
slink main_tim.txt >>ftn95.tce
type ftn95.tce
dir aaa_tim.exe
rem run aaa_tim.exe
aaa_tim IH_2009_AB_g40_C205.txt >sim_tim.tce
sim_ver1_tim.f95 is a list of INCLUDE 'xxx.f95' statements. main_tim.txt is the link list, which lists the .obj files plus some libraries:
lo sim_ver1_tim.obj
lo sutil.obj
lo util.obj
le \clearwin\saplib.mem\saplib.lib
map aaa_tim.map
file aaa_tim.exe
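As a rough sketch of what an include-list file such as sim_ver1_tim.f95 could look like: the four file names below are hypothetical placeholders, not the real simulation sources. Every routine pulled in by this file is compiled with /timing, so anything that should not be timed belongs in one of the /debug lists instead.

```fortran
! sim_ver1_tim.f95 : routines to be timed (compiled with /timing).
! The file names are hypothetical placeholders for illustration only.
include 'read_input.f95'
include 'build_model.f95'
include 'run_steps.f95'
include 'write_report.f95'
```

Moving a routine between the timed and untimed builds is then just a matter of moving its INCLUDE line from one list file to the other.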
The timing output is two files, with extensions .tmo and .tmr. One of them, aaa_tim.tmo, is a .csv file of accumulated elapsed times that is easy to review in Excel. You can see where all the time is being taken, which may identify where the code has problems. I find it provides a lot of information at the routine level, which is more helpful than the /profile approach.
I would recommend this approach as worth testing. (I have not yet used this with /64.)
FTN95 does not handle array sections or long strides in array addressing efficiently. It can benefit from SSE vector routines where they are available.
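To illustrate what a long stride means here, the following is a minimal sketch in standard Fortran, not taken from the simulation. The program name, the array size n and the use of SYSTEM_CLOCK are all assumptions made for the example. It sums the same array twice, once with the inner loop running along a row (stride n) and once down a column (stride 1), so the two timings can be compared on whatever compiler is being tested.

```fortran
program stride_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000          ! assumed size, about 32 MB of data
  real(dp), allocatable :: a(:,:)
  real(dp) :: s_row, s_col
  integer :: i, j, t0, t1, t2, rate

  allocate (a(n,n))
  a = 1.0_dp

  call system_clock (t0, rate)

  ! Long stride: the inner loop varies the second index, so consecutive
  ! accesses are n elements apart in memory (Fortran is column-major).
  s_row = 0.0_dp
  do i = 1, n
    do j = 1, n
      s_row = s_row + a(i,j)
    end do
  end do
  call system_clock (t1)

  ! Unit stride: the inner loop varies the first index, so consecutive
  ! accesses are adjacent in memory.
  s_col = 0.0_dp
  do j = 1, n
    do i = 1, n
      s_col = s_col + a(i,j)
    end do
  end do
  call system_clock (t2)

  ! Assumes the system clock is available (rate > 0).
  write (*,*) 'row-wise sum', s_row, ' seconds', real(t1-t0)/real(rate)
  write (*,*) 'col-wise sum', s_col, ' seconds', real(t2-t1)/real(rate)
end program stride_demo
```

Both loops compute the same sum, so any difference in the reported times comes from the access pattern alone. Where the algorithm allows it, keeping the first index in the innermost loop avoids the long stride.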
John