Silverfrost Forums

Welcome to our forums

Direct access binary speed - again!

26 Jan 2015 6:39 #15458

By strange coincidence, I too have been surprised with the poor performance writing to binary direct access files. I've not used this approach before, but I was aware that sometimes my programming style does not led itself to speed - lots of writing and then reading the data by different program units - so I thought I look at this some wet winter evening.

On my machine the code below, takes 47 ms to allocate and populate the R4 1000x1000 array A, 17.2 s to write it to the direct access file, and 296 ms to recover same the data into array B.

Is this typical performance for writing, or am I doing something silly here?

Cheers Ken

      PROGRAM TEST
      IMPLICIT NONE      
      INTEGER RECLEN, NROW, NCOL, I, J, RECORD
      REAL, ALLOCATABLE:: A(:,:), B(:,:)
      REAL T1,T2,T3,T4
      
      CALL CLOCK@(T1)
      
      NROW = 1000
      NCOL = 1000
      ALLOCATE(A(NROW,NCOL))
      DO J = 1, NCOL, 1
        DO I = 1, NROW, 1
          A(I,J) = (J-1)*NCOL + I
        END DO
      END DO
      
      CALL CLOCK@(T2)
      
      WRITE(6,*)'TIME ALLOCATE AND POLULATE ARRAY IN MEMORY', T2-T1
      INQUIRE(IOLENGTH=RECLEN)A(1,1)
      OPEN(UNIT=10,FILE='data1.bin',FORM='UNFORMATTED',ACCESS='DIRECT',
     &     RECL=reclen)
      DO J = 1, NCOL, 1
        DO I = 1, NROW, 1
          RECORD = (J-1)*NCOL + I
          WRITE(10,REC=RECORD) A(I,J)
        END DO
      END DO
      
      CALL CLOCK@(T3)
      
      WRITE(6,*)'TIME TO WRITE ARRAY TO DIRECT ACCESS FILE',T3-T2
      ALLOCATE(B(NROW,NCOL))
      DO J = 1, NCOL, 1
        DO I = 1, NROW, 1
          RECORD = (J-1)*NCOL + I
          READ(10,REC=RECORD) B(I,J)
        END DO
      END DO
      
      CALL CLOCK@(T4)
      
      WRITE(6,*)'TIME TO READ ARRAY FROM DIRECT ACCESS FILE',T4-T3
      CLOSE(UNIT=10,STATUS='DELETE')
      END
26 Jan 2015 8:03 #15460

For me with your example, I get 1.6 seconds for the write section and 0.16 for the read section. So there is certainly a difference.

26 Jan 2015 8:58 #15462

Interesting David, with that write speed I would not have raised the query! I realise it will be machine/hardware dependant, but it's intriguing to see how much difference there can be. Out of curiosity I've bug out the old machines I have sitting at home, listed below in ascending order of increasing performance spec, along with their windows version and the time to perform the write operation:-

Windows XP                      7.2s
Windows 7 Home Premium 17.2s
Windows 7 Enterprise         8.5s

Suddenly my old XP machine looks good - but it's not good when there's a lot of number crunching to be carried out!

Ken

26 Jan 2015 9:16 #15465

Ken,

I have not seen 'INQUIRE(IOLENGTH=RECLEN)A(1,1)' used in this way before. Is this legal ?

I tried to improve the performance, by reducing the number of records from 1,000,000 to 1,000. This does have a significant improvement in speed, even though your original code wrote the records out sequentially. I would have expected better buffering.

When I changed this, 'INQUIRE(IOLENGTH=RECLEN)A(:,1)' did not work, so a different approach is needed for this. The intrinsic SIZEOF would be a great addition to FTN95!

I would also have expected an improvement with Windows 7xx over XP, but you are not finding this. Perhaps the improvement is more noticeable with larger files. Anyway, the problem is 1,000,0000 records of 4 bytes is much slower than 1,000 records of 4,000 bytes. Reading is faster, as it is being buffered. Note that clock@ is only accurate to 1/64 second, so the read from buffer can be quicker than this. Hence 0 seconds for read. Use system_clock for better accuracy.

My modified code to see the change is:

      PROGRAM TEST 
       IMPLICIT NONE      
       INTEGER RECLEN, NROW, NCOL, I, J, RECORD, K
       REAL, ALLOCATABLE:: A(:,:), B(:,:) 
       REAL T1,T2,T3,T4 
       LOGICAL :: big_blocks = .true.
       do k = 1,2
       big_blocks = k > 1
       CALL CLOCK@(T1) 
       
       NROW = 1000 
       NCOL = 1000 
       ALLOCATE(A(NROW,NCOL)) 
       DO J = 1, NCOL, 1 
         DO I = 1, NROW, 1 
           A(I,J) = (J-1)*NCOL + I 
         END DO 
       END DO 
       
       CALL CLOCK@(T2) 
       
       WRITE(6,*)'TIME ALLOCATE AND POLULATE ARRAY IN MEMORY', T2-T1 
!       INQUIRE(IOLENGTH=RECLEN)A(:,1) 
       reclen = 4
       if ( big_blocks) reclen = 4*nrow
       write(6,*)'RECL=',reclen
       OPEN (UNIT=10,FILE='data1.bin',FORM='UNFORMATTED',ACCESS='DIRECT',  &
      &     RECL=reclen) 
       DO J = 1, NCOL, 1 
         if ( big_blocks) then
          RECORD = J 
          WRITE(10,REC=RECORD) A(:,J) 
         else
          DO I = 1, NROW, 1 
            RECORD = (J-1)*NCOL + I 
            WRITE(10,REC=RECORD) A(I,J) 
          END DO
         end if 
       END DO 
       
       CALL CLOCK@(T3) 
       
       WRITE(6,*)'TIME TO WRITE ARRAY TO DIRECT ACCESS FILE',T3-T2 
       ALLOCATE(B(NROW,NCOL)) 
       DO J = 1, NCOL, 1 
         if ( big_blocks) then
          RECORD = j
          READ(10,REC=RECORD) B(:,J) 
         else
          DO I = 1, NROW, 1 
            RECORD = (J-1)*NCOL + I 
            READ(10,REC=RECORD) B(I,J) 
          END DO 
         end if
       END DO 
       
       CALL CLOCK@(T4) 
       
       WRITE(6,*)'TIME TO READ ARRAY FROM DIRECT ACCESS FILE',T4-T3 
       CLOSE(UNIT=10,STATUS='DELETE') 
       deallocate (a,b)
       end do
       END 
26 Jan 2015 9:52 #15466

Quoted from JohnCampbell

I have not seen 'INQUIRE(IOLENGTH=RECLEN)A(1,1)' used in this way before. Is this legal ?

I think you can use a list of variables which matches the list of variables written or read, so A(1,1), which is a real scalar value is OK.

26 Jan 2015 9:54 #15467

Quoted from Kenneth_Smith

I realise it will be machine/hardware dependant, but it's intriguing to see how much difference there can be.

Perhaps this depends on the hard disc you have fitted. Mine is pretty zippy.

Perhaps, also, writes are slower than reads?

26 Jan 2015 10:03 #15469

INQUIRE (IOLENGTH=RECLEN) A(:,1) and INQUIRE (IOLENGTH=RECLEN) A(1:NROW,1) both produces a compiler error or stack overflow.

The following is F95 standard conforming for SIZEOF

   reclen = size ( transfer ( a(1,1), (/'A'/) ) )
   if ( big_blocks) reclen = reclen*size (A(:,1))
   write(6,*)'RECL=',reclen

John

26 Jan 2015 10:40 #15470

Thanks John, that is super fast!

As I understand you code B(:,J) is writing a whole column of data as a single record - I must admit that construction does not jump out to me as the obvious way to do things - guess I still think too much in the F77 world!

I did try writing rows of data via an implied do loop - without much success and with hindsight that would still be the same number of writes and my serial approach. I did think that, one possible advantage of the serial approach might be that there is no need to recover the whole array into memory, simply access the required element i,j when required by it's record number, but clearly it may well be easier to recover the whole column that contains i,j.

I see I have much to learn, i would have written

big_blocks = k > 1

as

IF (K.GT.1) THEN
  BIG_BLOCKS = .TRUE.
ELSE
  BIG_BLOCKS = .FALSE.
END IF 

Thanks again!

Ken

30 Jun 2016 5:56 #17710

Ken,

I was just looking at IOLENGTH and noticed a problem with your example. The following code is in error :

      DO J = 1, NCOL, 1
         DO I = 1, NROW, 1 
           RECORD = (J-1)*NCOL + I 
           WRITE(10,REC=RECORD) A(I,J) 
         END DO 
       END DO 

It should be : RECORD = (J-1)*NROW + I

If NROW = NCOL all would be ok !

John

1 Jul 2016 2:25 #17717

What I found a long time ago (https://forums.silverfrost.com/Forum/Topic/2648) was the record length has an effect, but it's really the total number of I/O calls that determines the performance. Short record, lots of I/O, poor performance as compared to doing the identical kinds of I/O in 'C' (but without the flexibility [and overhead] that FTN95 I/O lists gives).

I have abandoned almost all of the direct access file I/O in my software because the performance is so poor. And, for still unknown reasons. Running my benchmarks (written long ago also) on a solid state drive that is blindingly fast still shows the lack of performance discovered a couple of years ago. Running the same benchmark on a RAM disk also showed the same poor performance (albeit less poor than on a hard drive, but the percentages still held true).

For me, that which used to be done in a direct access temporary file is now done in memory. Which limits the size of the data sets that can be processed. Not good, right?

Still modifying code to remove the file limitations of direct access and placing all the files in memory......

1 Jul 2016 5:28 #17718

Bill,

I read the link you posted. I must admit I don't use Win 8.1 except for 1 program on that PC. I run most analysis runs on Win 7 desktops as single user, ie no change to SHARE. Apart from your test example from Jan-15, my experience of Direct Access has always been of very good performance and I would always recommend that approach for substantial disk I/O. Did you come up with any fixes to the problem ?

Perhaps the combination of file buffers and some SHARE options is a problem, but reading the post from 18 months ago that may have been dismissed as a major cause. The other problem could have been the virus checker, as it can clash with I/O performance. Perhaps you should split the files, with a small token file with SHARE status, while the large database files do not require share write to work. (Share read should not be a problem, although I don't remember where the previous thread finished up)

My experience is direct access works very well and combines well with the file cache/buffers. All my programs that use it perform well for disk I/O. I would always recommend this approach.

John

1 Jul 2016 1:15 #17722

John, it works well, but..... The performance hit is too great to ignore. And, it was the same hit whether running the same benchmark under Windows 8 or Windows 2000.

I think, when I get some time, I'll repeat the benchmarking, this time linking in an equivalent 'C' implementation and getting the number for that to help in the comparisons. The source of the performance penalty, should we be able to find it, would be of interest to many, I think.

Just FYI, under Windows 3.1 up through XP, I used a different FORTRAN compiler to create the operational code. None of these effects were seen. The same code also harkens back to CP/M days using an early Microsoft compiler. The same FORTRAN code was ported to a VAX and a PDP-11. In all these cases, there was not a performance hit due to the direct-access unformatted I/O.

I may have a platform that can support an old compiler to compile and run the same benchmark. I'll look at adding that to the mix. That would at least eliminate the OS as a culprit.

Please login to reply.