Silverfrost Forums

Bug in SCC 3.88

25 Nov 2016 1:47 #18470

For the following program, SCC /64 generates two false warnings.

#include <stdio.h>
#include <stdlib.h>
#define MMASK 0x7FFFFF
#define SMASK 0x0800000
#define OMASK 0x7000000

int main(){
int ival; unsigned mant;
int expo2,expo8,nshft;
int n=3;

ival=0x38C8EB83;
mant= (ival & MMASK) | SMASK;
expo2=((ival >> 23) & 0x0FF) - 0x07F - 2;
switch(expo2%3){
   case -1 : mant <<= 2; expo2-=2; break;
   case -2 : mant <<= 1; expo2--; break;
   case 1 : mant <<=1; expo2--; break;
   case 2 : mant <<=2; expo2-=2; break;
   }
expo8=expo2/3;
if(mant & OMASK){
   nshft=n-8; expo8++;
   }
else nshft=n-7;
printf("nshft = %d\n",nshft);
return 0;
}

The messages:

   0021   expo8=expo2/3;
WARNING - This statement will never be executed
   0025   else nshft=n-7;
WARNING - This statement will never be executed
    NO ERRORS, 2 WARNINGS  [<BUG> SCC/WIN32 Ver 3.88]

P.S. Sorry, I should have posted this in the Support section.

12 Dec 2016 11:27 #18539

Mecej4, since you are familiar with SCC, I have the following suggestion/request, if you have some free time. Could CrystalDiskMark be compiled successfully with SCC, to show how it works and how it checks I/O speed? That way we would know how the read/write speed test works, where the bottlenecks are, and whether there is potential for improvement.

Question #1 is: the test shows read/write speeds of 10 GB per second on RAM drives. That means that the underlying read/write calls (including their overhead) must run at even faster speeds! Is this true with C?

http://crystalmark.info/software/CrystalDiskMark/index-e.html

15 Dec 2016 12:49 #18550

That is a full-fledged Windows GUI program, and I do not think that SCC can compile the thing from source code without a lot of pampering. Besides, why on earth do you want to compile it from source?

Frankly, I do not understand your fixation on I/O benchmarks when there are so many other aspects of your programs that are more worthy of your attention.

A disk I/O benchmark program is justified in shoving random data to and fro and timing the movement. You cannot do the same, however, in any real application that does something useful. Real programs tend to consume and/or produce buckets of data. If you want to assess how fast your application would perform with almost infinite I/O speed, simply set the output file to NUL: (on Windows; /dev/null on Unix/Linux) and time a simulation run of your program. If the time of the run is not drastically less than it was with output to a real file, you will have proved that you are barking up the wrong tree.
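
In Fortran, for instance, the redirection does not require touching the rest of the program. A minimal sketch of the timing harness, assuming the output goes through a unit you control (the file name and record contents are illustrative only):

program time_null_output
implicit none
integer :: i
real :: t1, t2
! 'NUL' is the Windows null device; use '/dev/null' on Unix/Linux
open(unit=20, file='NUL', form='unformatted', access='stream')
call cpu_time(t1)
do i = 1, 1000000
   write(20) real(i)           ! stand-in for the program's real output
end do
call cpu_time(t2)
close(20)
write(*,'(A,F8.3,A)') 'Run time with infinite-speed output: ', t2-t1, ' s'
end program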

You can also try this MS Technet command-line utility to time I/O to a specific file of your choice:

 https://gallery.technet.microsoft.com/DiskSpd-a-robust-storage-6cd2f223
16 Dec 2016 7:16 #18551

I more or less know how my app will behave with infinite I/O speed: it will go at least ~2-3x faster. I could probably stretch out an additional factor of 2 by switching off some extra, not-always-needed calculations during the load. And that is the reason for my interest. My Fortran loading speed, even with unformatted reads, is hellishly annoying because it is slow, around 300KB/s.

This is a program which visualizes existing data, and there are many TBs of data. When you try to find something in this forest, the visualization ideally must run at instant speed, because a lot of the data you click on is just not what you need to find. As a result you do not even want to touch the data, so sickeningly boring is the loading process. As soon as the data is loaded, the OpenGL visualization is almost instant, thanks to a very good OpenGL implementation and fast hardware (thanks to realistic 3D games).

With several Fortran compilers we do not see speeds faster than those mentioned above with any settings. C code, as in this benchmark, somehow shows speeds 30x faster. The question remains: how does C reach those speeds, and why can't Fortran?

16 Dec 2016 3:22 #18552

That note clears up some questions. It also clarifies that by using I/O devices and software that are '30X faster', your effective overall speed gain may be about 2X. And, because the I/O is mostly input of massive amounts of data, you cannot use the NUL device to test the best achievable speed.

Other speed-ups such as those coming from avoiding or delaying calculations are not relevant at this point of the discussion. You can implement them or not, independently of the I/O problem and solution.

This kind of situation is standard when searching a database. The usual solution is to compile an index to the data tables. These indices are much smaller than the main tables, so one can search the index quickly and, when an exact or partial match is found, read the corresponding portion of the main table into memory and process it further.

The indices do take time to build, but they need to be rebuilt/refreshed only when new data is loaded or old data is deleted. Therefore, in a 'create once, use many times' scenario, they are definitely worthwhile.

To define and create an effective index, you have to know your data intimately, and you must have a good idea of the access patterns of your users (including yourself). You have probably used an old dictionary that had thumb indices cut into the edge of the pages. So, if you want to look up 'Dan', you put your thumb on the 'D' notch and open the book. The same idea should be tried on your data.
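
As a concrete illustration of the mechanism (the file names and record layout here are invented for the example): keep a small index file of (key, byte offset) pairs, scan the index, and then jump straight to the matching record in the big file with one positioned read.

program indexed_lookup
implicit none
integer, parameter :: I8 = selected_int_kind(18)
integer :: key, k, ios
integer (I8) :: offset
real :: record(10)
open(unit=10, file='big.idx', form='unformatted', access='stream')
open(unit=11, file='big.dat', form='unformatted', access='stream')
key = 12345                          ! the record we want to find
do
   read(10, iostat=ios) k, offset    ! linear scan of the small index...
   if (ios /= 0) exit                ! end of index: key not present
   if (k == key) then
      read(11, pos=offset) record    ! ...one direct jump into the big file
      write(*,*) record
      exit
   end if
end do
close(10); close(11)
end program

The index is tiny compared with the data, so even a linear scan of it is fast; a sorted index with a binary search would be faster still.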

Your reaction?

17 Dec 2016 3:34 #18553

I still hope to get 5x from the software I/O speed bump alone. Because if C really can read GBs per second, Fortran literally MUST do it even faster; that is what users expect from Fortran - to beat all others in speed in the science and engineering arena.

If this fails, the only other way for me to speed up the navigation would be to make small thumbnail images of all the parameters, as in photography. I cannot imagine how else it would be possible to build an index for fast searching.

17 Dec 2016 1:19 #18556

Quoted from DanRRight: '...if C really can read GBs per second, Fortran literally MUST do it even faster; that is what users expect from Fortran'.

That is a wish stated in the form of an assertion that happens not to be true.

Fortran code can be marginally faster than C for some types of work (numerical calculations, for example) and can be substantially slower than C for other types of work (character processing, for example). These days, Fortran and C compilers on microprocessor systems almost always use the same back end for code generation and optimization, and most compiler systems use substantially the same RTL (Microsoft DLLs).

In general, in my experience, the speed of compiled Fortran code is the same as that of compiled C code.

Starting out with expectations of the improbable or, worse, the impossible, is not a recipe for success.

18 Dec 2016 1:14 #18561

Dan,

I would like to agree with mecej4.

In the benchmarking I did for you, I showed that even basic numerical conversion of text, with no file I/O, processes about 100 million bytes per second. (Some gFortran versions are very poor and convert F and ES formats at 4 MB/s, while FTN95 /64 and FTN95 /32 do much better.) With a processor clock rate of 3 GHz, I don't see how you could achieve multiple gigabytes per second. (The C code rate claims don't look realistic, and even if they are real, the data can't be consumed by even basic processing of the info. Your quoting of multiple GB/s is not feasible, as you cannot process the data at that speed.) When quoting transmission rates, there is also the perennial confusion between MB (megabytes) and Mb (megabits) or Gb (gigabits), so there is always some uncertainty about what speed is really being quoted.

My impression was that you were struggling with a 1 MB/s (megabyte per second) read-and-process rate, which could be increased to 50-100 MB/s with stream I/O on an HDD, or 200-500 MB/s with an SSD. BUT, as you can only process the characters at about 100 MB/s, does it matter?

Also, you have not identified the source of this data. How do you get it? If it arrives via the internet, the transmission rate for receiving the files is much slower than the rate at which you can read them from disk.

In summary, you need to identify where the bottleneck is, and I doubt that it is the SSD or HDD transmission rate. It will probably be in processing or receiving the files.

It sounds to me as if you need multiple PCs to process all the different files into summary or indexed forms.

John

18 Dec 2016 2:23 #18562

Ok, mecej4 and agreeing with you John,

Please show me read and write speeds at least half of what CrystalDiskMark measures - 5-6 GBytes per second in my case on ramdrives (yes, bytes, not bits per second, like all the C tests show) - with any of your methods using Fortran, and then we will continue the conversation about Fortran delivering almost the same speeds as C.

I don't even read and process characters, John; I use unformatted read. You are welcome to use it too for your tests, to make your life easier. Processing speed after the data is loaded is an entirely different topic and is not discussed here. PM me your address and I will send you 12 beers for the effort. 😃

18 Dec 2016 12:49 #18563

Dan, I think that you are still tilting at windmills, as you can see with these tiny example programs. Both write a 64 MByte 'binary' file. I ran the programs on a laptop with an i5-4200U CPU and a 128 MB ramdisk.

The Fortran code:

program writebinbuf
integer, parameter :: I2 = selected_int_kind(4), I4 = selected_int_kind(9), &
                      I8 = selected_int_kind(18)
integer, parameter :: BSIZ = Z'4000000'   ! 64 megabytes
character (Len=1) :: buf(BSIZ)
integer (I2) :: hndl, ecode
integer (I8) :: nbytes = BSIZ
real :: t1,t2
!
call openw@('big.bin',hndl,ecode)          ! FTN95 library routine: open for writing
if(ecode /= 0)stop 'Error opening file BIG.BIN for writing'
call cpu_time(t1)
call writef@(buf,hndl,nbytes,ecode)        ! raw block write of the whole buffer
call cpu_time(t2)
if(ecode /= 0)stop 'Error writing file BIG.BIN'
call closef@(hndl,ecode)
if(ecode /= 0)stop 'Error closing file'
write(*,'(A,2x,F7.3,A)')'Time for writing 64 MB file: ',t2-t1,' s'
write(*,'(A,6x,F6.0,A)')'Estimated throughput = ',64.0/(t2-t1),' MB/s'
end program

The equivalent C program:

#include <stdio.h>
#include <stdlib.h>
#include <io.h>
#include <fcntl.h>
#include <time.h>
#include <sys/stat.h>

#define BSIZ 0x4000000

int main(){
char *buf; int fid,bsiz=BSIZ; clock_t t1,t2;
float te;

buf=(char *)malloc(bsiz);
fid=open("BIG.BIN", O_CREAT | O_WRONLY | O_BINARY, S_IWRITE | S_IREAD); /* mode argument is required with O_CREAT */
t1=clock();
write(fid,buf,bsiz);
t2=clock(); te=(t2-t1)/(float)CLOCKS_PER_SEC;
printf("Time for writing 64 MB to file: %6.3f s\nEstimated throughput = %.1f MB/s\n",
   te,64.0/te);
close(fid);
}

We run the first with FTN95:

s:\FTN95>ftn95 /no_banner fwrfil.f90 & slink fwrfil.obj & fwrfil
Creating executable: s:\FTN95\fwrfil.exe
Time for writing 64 MB file:     0.047 s
Estimated throughput =        1365. MB/s

We run the second with SCC:

s:\FTN95>scc /no_banner cwrfil.c & slink cwrfil.obj & cwrfil
Creating executable: s:\FTN95\cwrfil.exe
Time for writing 64 MB to file:  0.047 s
Estimated throughput = 1361.7 MB/s

Vive la non-différence! And, please drink those 12 beers on my behalf.

It would be interesting to see what numbers you get on your terabyte cruncher of a machine with these small test programs.

Once you run one of these two programs you will have a 64 MB file that you can use with similar read tests. Change writef@ to readf@, and so on. I see more or less the same speeds for reads as I did for writes.
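
For reference, a sketch of the read-side counterpart: OPENR@ replaces OPENW@, and READF@ replaces WRITEF@. This is a sketch only; the READF@ argument list (it takes an extra argument returning the count of bytes actually read) should be checked against the FTN95 documentation.

program readbinbuf
integer, parameter :: I2 = selected_int_kind(4), I8 = selected_int_kind(18)
integer, parameter :: BSIZ = Z'4000000'   ! 64 megabytes
character (Len=1) :: buf(BSIZ)
integer (I2) :: hndl, ecode
integer (I8) :: nbytes = BSIZ, nread
real :: t1,t2
!
call openr@('big.bin',hndl,ecode)          ! open the file written earlier
if(ecode /= 0)stop 'Error opening file BIG.BIN for reading'
call cpu_time(t1)
call readf@(buf,hndl,nbytes,nread,ecode)   ! raw block read; nread = bytes read
call cpu_time(t2)
if(ecode /= 0)stop 'Error reading file BIG.BIN'
call closef@(hndl,ecode)
if(ecode /= 0)stop 'Error closing file'
write(*,'(A,2x,F7.3,A)')'Time for reading 64 MB file: ',t2-t1,' s'
write(*,'(A,6x,F6.0,A)')'Estimated throughput = ',64.0/(t2-t1),' MB/s'
end program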

Having done that, compare the read throughput (1.36 GB/s on my laptop) with the value that you gave in #3 (counting from 0 for the initial post), 0.0003 GB/s. The difference must be investigated, and it will turn out to be explained by the throughput of your I/O devices and by the rate and complexity of the data processing in your application, and not at all by differences between C and Fortran, because the grunt work of the I/O is done by the MS system DLLs.

19 Dec 2016 12:17 #18568

Dan, there is something else that I don't understand about your 'problem statement'. You said in #3 that your reading speed with unformatted Fortran files was 300 KB/s, and that you needed to process 'terabytes of data'. If so, since 1 TB at 300 KB/s takes about 3.3 million seconds, you would have to run your computer nonstop 24/7 for over five weeks for one run. Really? Are you doing this now?

19 Dec 2016 3:51 #18569

LOLOLOL. Now I know why you did not want my present. Clearly, like me at my North Pole, you have no lack of booze at your place 😃. Well, it's the holiday season, anyway.

That KB/s was of course a typo. I have mentioned MB/s many times before, but not this time.

Please keep going. Still 1.8 GB/s on my PC, not 5-6, let alone 10-12, but the steps are encouraging.

19 Dec 2016 12:22 (Edited: 19 Dec 2016 1:33) #18570

Quoted from DanRRight: Please keep going. Still 1.8 GB/s on my PC, not 5-6, let alone 10-12, but the steps are encouraging.

Was that run with the current directory on a RAMdisk? If so, you can only hope to get throughput that is less than that, unless you can introduce parallelism into your application.

As we found in our earlier thread, where we compared formatted internal reads with direct conversion of input strings to numbers, the best that we could do without any error checking of the input, i.e., with zero disk latency, was about 300 MB/s. In this test, we have found that raw file I/O, i.e., with zero decoding latency, can be done at about 1-2 GB/s. If you combine these latency estimates (similarly to two resistors in series), you can estimate an effective speed of less than 230 MB/s for formatted reads from disk (less, because a 4-byte real stored as a decimal number on disk takes about 12 bytes). If that is not good enough, I don't see how you can overcome these latencies without resorting to parallel processing.
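
To put numbers on the 'resistors in series' analogy, using the 1.36 GB/s raw-read figure from my laptop and the 300 MB/s conversion figure (the latencies, i.e., the times per byte, add, so the rates combine harmonically):

    1/R_eff = 1/R_convert + 1/R_raw = 1/300 + 1/1360   =>   R_eff ≈ 246 MB/s

and the expansion of each 4-byte real into roughly 12 bytes of text pulls the effective rate for useful data below that again, consistent with the 'less than 230 MB/s' figure above.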

Please go back and look at some of John Campbell's comments about what realistic I/O speeds you can aim for.

19 Dec 2016 12:53 #18571

Something is deeply wrong here

  1. Does the CrystalDiskMark test use parallelism, leaving all the test results here in shameful misery?

  2. How in the world is it possible to write 12 GB of data to a disk drive in one single second, while it is not possible to load the same 12 GB in a second into the RAM of the computer? The fun part is that a RAMdrive is made out of the same RAM, and in practice I/O has never been faster than RAM bandwidth.

Also, as a note: ReadF@ and ReadFA@ may be fast at reading a big chunk of data, but they are still very slow at reading line by line (10 numbers, or ~160 characters, per line).

19 Dec 2016 2:26 #18573

mecej4

Thanks for the feedback. I have made a note of your original post.

19 Dec 2016 3:04 #18580

Quoted from DanRRight: Something is deeply wrong here

  1. Does the CrystalDiskMark test use parallelism, leaving all the test results here in shameful misery?

Test results don't feel shame. Question for a philosopher, perhaps?

I look out of my window and I see a bird flying around, chirping happily. I don't feel shame for not being able to fly, and the bird probably is not jealous because it cannot talk Fortran with DanRRight.

Quoted from DanRRight: Also, as a note: ReadF@ and ReadFA@ may be fast at reading a big chunk of data, but they are still very slow at reading line by line (10 numbers, or ~160 characters, per line).

Yes, these are among the facts of life that one has to accept and cope with. I think that we have covered these points already, and repeatedly.
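
The usual way to cope, for what it is worth, is to read one big block and carve the lines out of it in memory, so that the per-line cost is a memory scan rather than an I/O call. A minimal sketch, assuming LF-terminated lines and a compiler that supports F2003 stream I/O and INQUIRE with SIZE= (the file name is illustrative):

program split_lines
implicit none
character (Len=1), allocatable :: buf(:)
integer :: i, first, n
integer (kind=selected_int_kind(18)) :: fsize
open(unit=20, file='big.txt', form='unformatted', access='stream')
inquire(unit=20, size=fsize)
n = int(fsize)
allocate(buf(n))
read(20) buf                       ! one block read at raw-I/O speed
close(20)
first = 1
do i = 1, n
   if (buf(i) == char(10)) then    ! LF marks the end of a line
      ! parse the ~160 characters in buf(first:i-1) here
      first = i + 1
   end if
end do
end program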

20 Dec 2016 1:04 #18585

Then perhaps I did not articulate my points clearly. Leaving philosophical motives off the table (here, for example, is an entirely different view: in reality it is not a happy song your bird sings but more of a 'swan song'. Every spring a bird triples its family size just to be the same size next spring, which means that out of a potential life expectancy of a decade, the poor birdy actually lives about a quarter of a year. Isn't it total misery to be food for others or to die of hunger?), here are the questions, put a bit differently:

  1. I see other read/write tests that are almost an order of magnitude faster than anything anyone in this forum can show. What are the reasons for that? Can we get similar speeds if everything is 'MS DLLs'?

  2. Why is there no way to load 12 GB into RAM in one second directly, perhaps by somehow bypassing the slow format processing, when we see that it is possible to unload those same 12 GB onto RAMdisk space, which is supposed to be slower than plain RAM?

OK, OK, in our case we are kind of slow; the public domain C compiler leaves us to bite the dust and listen to the birds laughing. But still, we can load data into a 1-dimensional array Arr(X) at 1.8 GB/s. Can we load the data into a 3D array Arr(X,Y,Z) at the same speed?

As a matter for discussion, let me illustrate the last point by suggesting one potential way of doing that. I need to put the 10 numbers of data from each line of the file into an array Arr(X,Y,Z) - say Arr(10,1000000,100) to be exact, which holds 1 billion numbers. The data on the disk is formatted a bit differently from what we played with before in this or another thread: the first 2 additional numbers in each line will be the array indices Y and Z, and the remaining 10 numbers will go into the X elements. That is done to avoid computing the X,Y,Z indices, eliminating the processing needed to calculate the position of an element in Arr. Though this index calculation overhead may actually be negligible compared with the extra time spent reading the indices - I have not checked that yet. Adding two numbers per line decreases the reading speed by just 20%, which means that instead of 12 GB/s we will get 10. 'Big' deal....

Again, the superfast reading routine - let's call it ReadSuperFast2@ - would read 12 numbers, of which the first two are the indices Y1 and Z1, and place the remaining 10 numbers into elements 1 to 10 of the first dimension:

Arr(1:10, Y1, Z1),

then

Arr(1:10, Y2, Z2) etc

The simpler case of a lower-rank array would require only one index Y, and a routine ReadSuperFast1@.

A more general case would require a routine ReadSuperFast3@, which would use all 3 indices X, Y and Z to fill sparse array data. Even in that case the useful read speed would be 12/4 = 3 GB/s.
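
In plain Fortran the loop such a routine would perform looks something like this (ReadSuperFast2@ itself is of course hypothetical - this is just an unformatted stream READ doing the job, assuming the file holds 12 reals per record):

program read_super_fast2
implicit none
real, allocatable :: Arr(:,:,:)
real :: rec(12)
integer :: ios, y, z
allocate(Arr(10,1000000,100))      ! ~4 GB of reals; needs a /64 build
open(unit=20, file='data.bin', form='unformatted', access='stream')
do
   read(20, iostat=ios) rec        ! 12 numbers: Y, Z, then 10 values
   if (ios /= 0) exit
   y = nint(rec(1)); z = nint(rec(2))
   Arr(1:10, y, z) = rec(3:12)     ! drop the 10 values straight into place
end do
close(20)
end program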

20 Dec 2016 9:58 (Edited: 20 Dec 2016 11:58) #18586

The CrystalDiskMark benchmark program is, as far as I can see, just a GUI placed on top of the Microsoft DiskSpd command line utility. Instead of arbitrarily picking the highest speed reported by CrystalDiskMark, which corresponds to using multiple threads, and feeling miserable, read through the options of DiskSpd in https://github.com/Microsoft/diskspd/blob/master/DiskSpd_Documentation.pdf , select the options that best match your intended usage of I/O, and rejoice.

Your own reported speed of 1.8 GiB/s on your PC for block binary I/O is actually about the same as the speed reported by CrystalDisk for single-thread sequential I/O on large files. You can try this out yourself. Open a command window in Administrator mode, change to the directory containing your large input file, and run the command

<DiskSpeed directory>\amd64fre>diskspd.exe -fs -F1 -b64K <big_data_file>

Even these speeds are out of reach if you must do formatted READs. As we saw in https://forums.silverfrost.com/Forum/Topic/2970&postdays=0&postorder=asc&start=30, formatted READ using standard Fortran gives speeds of about 30 MB/s. If we assume that the input data contains no errors and we do the format conversions ourselves, we can raise the speed to about 300 MB/s.

That is approximately the best that you can do with a single thread, even if you could do disk I/O with infinite speed.
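
To make 'do format conversions ourselves' concrete, here is a minimal parser for blank-separated unsigned integers in a character buffer - no READ, no error checking, which is exactly where the speed comes from. A sketch only; real input would need sign, floating-point and error handling:

subroutine parse_ints(buf, n, vals, nvals)
implicit none
character (Len=1), intent(in) :: buf(*)
integer, intent(in)  :: n
integer, intent(out) :: vals(*), nvals
integer :: i, v
logical :: innum
nvals = 0; v = 0; innum = .false.
do i = 1, n
   if (buf(i) >= '0' .and. buf(i) <= '9') then
      v = 10*v + (ichar(buf(i)) - ichar('0'))   ! accumulate digits
      innum = .true.
   else if (innum) then                         ! delimiter ends a number
      nvals = nvals + 1; vals(nvals) = v
      v = 0; innum = .false.
   end if
end do
if (innum) then                                 ! number at the end of the buffer
   nvals = nvals + 1; vals(nvals) = v
end if
end subroutine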

20 Dec 2016 11:45 #18587

Such oil does exist, but unfortunately no one sells it to Fortranners in this newsgroup. Parallel NetCDF and HDF5 are just a few examples. All the libraries for that exist, but again, one would need to find Fortranners who will do the initial testing with FTN95. I have heard complaints about them too, but the slowness of loading large data is more than a nail in the foot.

20 Dec 2016 12:43 #18588

Even if we don't agree on what I/O speeds are possible, there is one good outcome from these discussions.

Herman Cain, a US Republican Primary Presidential candidate in 2012, became well known for his 9-9-9 tax plan. We have come up with something similar and quite useful in planning large programs.

On a PC circa 2016, we can use this rule of thumb:

    30 MB/s           300 MB/s                 3 GB/s
    formatted read    custom formatted read    unformatted read

are the upper limits of what is possible with a single thread processing a large input file.
