Topic: Database or NetCDF / HDF5 interfaces? in General

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

17 Nov 2016 12:54 #18419

Well, unformatted read is probably the only good solution until someone will break the sound barrier in speed with formatted read of text files.

i will probably meantime transform all the files which have no problems in them into binary and then read unformatted way. This of course will take time for conversion (will be done in batch regime or overnight) but this will be compensated by 10x speedup when loading them.

mecej4

Posts: 1911

Back to Top

17 Nov 2016 1:30 #18420

Do note that FTN95 uses a different convention for record markers for Fortran unformatted files than most other current Fortran compilers, and your C programmers may be unfamiliar with Fortran unformatted files. You can search for messages in the Silverfrost forums about conversion from the common unformatted file format to the Silverfrost format.

Not a big issue, but something to be aware of.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

17 Nov 2016 6:49 #18424

Got so far 3.5x speed increase on unformatted read with real code versus 10-15x on simple test benches. All that versus my older formatted *-format specifier read. This is because besides loading there are a lot of processing with LOG, EXP and large arrays in my code go simultaneously...Binary files are 3x more compact by the way then my older text files

And tests show that when 64 will be finally ready the read speed will be additionally 1.8x faster

Can you guys please do independent check for unformatted read speed ? We have chance to break 1GB/second on RAMdrives

JohnCampbell

Posts: 2526 Sydney

Back to Top

17 Nov 2016 11:19 #18425

dan,

We have chance to break 1GB/second on RAMdrives

I am getting these high rates when the file is cached, typically with a binary I/O test of write then read. If you are opening a file on disk for reading, which has not been previously loaded into a disk cache in available memory, these speeds may not be possible. While binary files have the advantage of speed, text based files from external sources have the advantage of easy verification and adaptation of a free-format reader. My recent testing has shown these can be read at 60 to 100 mb/sec, which is not too bad. The alternative is you could write a converter that reads the text file and produces a binary format for later quicker analysis, although this involves duplication of the information. The advantage of this is the converter can also be a data reviewer, that reports likely errors and simple statistics of range and extremes, which I find essential. You haven't indicated the quantity of data to be analysed, say gb's or terra-bytes per day.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

18 Nov 2016 12:57 (Edited: 18 Nov 2016 1:00) #18427

John,

PCI-express SSDs have 3GB/s, we have to utilize this bandwidth

I am so sick and furious with slow formatted text read that i want to try anything. Unformatted read of binary data is of course very dangerous but hopefully we passed the learning curve when it was very hard to catch the error in huge data sets.

Right now we have to handle 10-100GB size files for one run. Coming soon 64-bit compiler will increase the loaded volumes by order of magnitude.

mecej4

Posts: 1911

Back to Top

18 Nov 2016 1:00 #18428

John,

Here is Fortran code for a fast but approximate log10 function. You may try replacing the two calls to log10() in your E format output routine by calls to qlg10(). It computes an 8-bit approximation to log10(x), which is slightly better than two decimal digits. In the first call, a small table is set up. Each subsequent evaluation of qlg10() involves a single floating-multiply-add (FMA) operation and some integer shift and AND operations.

real function qlg10(x)
!
! quick evaluation of log10(x) with 8 bit precision, for use in finding 
! integer part of log10(x) to use with E and G format output processing
!
! See http://www.icsi.berkeley.edu/pubs/techreports/TR-07-002.pdf
!
implicit none
integer, parameter :: n = 8
integer, parameter :: tblsiz = 256, mbits = 23   ! 2^n, mantissa bits
real, intent(in) :: x
real, dimension(tblsiz),save :: tbl
logical :: beg = .true.
integer :: i, ix, lg2, y, mask
data mask/Z'7FFFFF'/
!
if(beg)then
   beg = .false.
   do i=1,tblsiz
   	  tbl(i) = LOG10((i-1)/real(tblsiz) + 1.0)
   end do
endif
ix=transfer(x,ix)
lg2=iand(ishft(ix,-mbits),255) - 127  ! exponent
y=ishft(iand(ix,mask),n-mbits)    ! 8-bit mantissa
qlg10=0.30103*lg2+tbl(y+1)
return 
end function qlg10

mecej4

Posts: 1911

Back to Top

20 Nov 2016 2:27 (Edited: 21 Nov 2016 8:42) #18436

After some fine tuning of the E and F output formatting subroutines, I obtained the following timing results (W10, i7-2720QM).

               F-fmt  F-sub    E-fmt  E-sub

FTN95 32-bit   2.904  0.595    3.057  2.703
FTN95 64-bit   9.057  0.796    8.985  3.013

GFTN  6.2-32  50.940  0.359   85.160  2.578
GFTN  6.2-64  48.290  0.234   81.870  2.110

LGF   7.7-32  48.120  0.344   80.470  2.562
LGF   7.7-64  45.930  0.250   77.030  2.047

IFort 17-32    2.320  0.427    2.300  1.247
IFort 17-64    2.390  0.512    1.980  1.122

GFTN is Gfortran 6.2 from Equation.com. LGF is Lahey/GNU 7.7, which is a repackaging of Gfortran 5.4.1.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

20 Nov 2016 3:12 #18437

Thanks mecej4, such wide comparisons tell a lot.

Was this for WRITE? How about READ?

And for completeness would be good to add also * format which is very good when there are different type variables of different widths in the line

JohnCampbell

Posts: 2526 Sydney

Back to Top

21 Nov 2016 3:45 #18438

mecej4's summary of write performance gives a good indication of Fortran formatted write performance, especially, FTN95 /32 is comparatively good for formatted write FTN95 /64 is slow for Format Write gFortran is incredibly slow for Format write.

The function approach is faster and may provide a use if there are specific changes to output, say for 0. or -0.000, or if using gFortran !

regarding READ performance, I have reviewed the internal read options, using a variation of mecej4's post on 15 Nov and Dan's free format layout of 16 Nov.

The test options I considered are: 1 read fixed format using 10F10.3 2 read fixed format using mecej4's read_val routine 3 read fixed format using read_val routine for optional decimal 4 read free format layout, using Dan's layout example, with numbers separated by a space or comma 5 parse string only, using Dan's layout example, with numbers separated by a space or comma

I have tested this for FTN95 /32, FTN95 /64 and gFortran 64-bit.

ftn95/32 ftn95/64 gfortran  Test
  1.102   0.665   6.331    1 Fortran FORMAT read
  0.233   0.405   0.284    2 function read - fixed field length
  0.272   0.499   0.328    3 function read - optional fraction value,  fixed field length
  0.622   1.033   0.755    4 function read - optional fraction value, variable field length
  0.211   0.410   0.485    5 get_next_field only

These results do not show the variation that the write test did show, although the parse string code (finding each number string) is surprisingly slow compared to the reading times.

My guess is that the test #4 approach does give greater flexibility reading number strings, but is about the same speed as FORMAT reads. FTN95 and gFortran does offer comma support in the standard F format. gFortran FORMAT read is again slow in comparison.

The advantage of the function approach for large data sets is you can change the valid number formats to suit the equipment that writes the file, eg tab or colon delimited fields, as well as space or comma can be supported.

The test program is on this link

https://www.dropbox.com/s/ut82pkdexc3si5h/intlRead.f90?dl=0

Neither of the functions listed support E format reading, although this would need to be included if required.

Based on this, and previous posts, I estimate that a read speed for external text files is of the order of 100 mb/sec for HDD and 300 mb/sec for SSD, which would be limited by the read conversion. This is faster than some of Dan's performance reports, which may be due to other disk type problems.

Unfortunately when reading large files, they are typically not buffered.

It has always been my preference to use a free format function approach, where I also provide other error and character checking reports. This is easiest using stream I/O. Checking and other field definitions will reduce read rates. At the end of reading, providing statistics of character counts etc is useful to give confidence about the reading process. eg counts of lines read, control characters, delimiters and fields per line always help.

John

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

21 Nov 2016 7:56 #18439

Thanks, John, that comparison was great. You both with mecej4 have made very nice insight into the Fortran I/O, made huge step forward in speeding it and pointed out the further areas of improvement !

mecej4

Posts: 1911

Back to Top

21 Nov 2016 8:25 #18440

John: There is one improvement that can be made to your code for writing E-formatted numbers; it may yield a slight but noticeable improvement in speed.

After one has found log10(x), rounded and scaled x, the integer v expressed in the decimal scale is exactly m digits long. Therefore, the two IF tests in the DO loop following 'working with' are uncecessary. The loop can be replaced by

     do k=3+m,p+1,-1
        str(k:k) = char (mod(v,10)+z)
        v=v/10
     end do
     str(p-1:p-1) = char (v+z)

if you remember that str(p:p) contains the decimal point character. .............. Have you thought out how to apply what we have learned to develop a work-around for your FEA code? If you can segregate a few of the most CPU-cycle-eating WRITE statements, you could replace them with calls to subroutines that generate formatted strings and print the strings with WRITE (nn, '(A)') str statements.

JohnCampbell

Posts: 2526 Sydney

Back to Top

21 Nov 2016 12:44 #18441

mecej4,

Thanks for the further advice. I have modified the loops to remove some of the unnecessary instructions you indicated and have produced some significant improvements. The following table tracks the elapsed time performance of the ES Function for changes that have been discussed, including: removing a LOG10 function then replacing the other with a quick LOG10 look-up function cleaning the DO loops to remove unnecessary instructions

ES Function run time (seconds)
id FTN95   FTN95/64 gFortran  Test
 1 2.862   2.353   0.939      Initial ES Function
 2 2.520   2.320   0.684      remove 1 x LOG10
 3 2.417   2.042   0.640      restructure ** power
 5                 0.424      replace log10 with qlg10
 6 2.278   2.353   0.423      use fortran Qlog10
 7 1.886   2.062   0.302      tidy ES function

FTN95 /32 and gFortran have produced anticipated performance improvements, while FTN95 /64 has shown some variability in gains.

The final function is: subroutine write_val_e4 (val, str, n) ! ! writes -3.04E+01 ! real4 :: val ! value to write; must fit integer4 :: n ! digits >= 0 and < len(str) character :: str*() ! integer4 :: l ! len ( str ) real4 :: rv ! abs ( val ) real4 :: power ! log10 ( val ) integer4 :: ip ! E+ip integer8 :: v ! integer for digits of val integer8 :: ten = 10 ! mod integer4 :: k ! position of digit ! integer4 :: p = 3 ! position of '.' integer4 :: sgn ! +/- integer4 :: d ! digit integer4 :: z = ichar ('0') integer4 :: i,m ! REAL QLG10 external QLG10 ! ! Remove sign if ( val > 0 ) then sgn = 1 rv = val else if ( val < 0 ) then sgn = -1 rv = -val else str = ' 0.' return end if ! ! Check for overflow m = max (n,1) ! digits to provide l = len (str) if ( m > l-7 ) goto 99 ! ! Determine power power = QLG10 (rv) ; ip = power ; if ( power < 0) ip = (power - .9999999d0) rv = rv * 10.0**(m-ip) + 0.5 ! ! generate digits ! k = 3+m ! last digit position v = rv ! digits str = ' ' do k = 3+m,3+1,-1 d = mod(v,ten) str(k:k) = char (d+z) v = v/10 end do ! str(3:3) = '.' if ( v > 9 ) then ! patch for round up power write (,*) 'Power increase identified :', str(1:3+m), ip str(2:2) = '1' ip = ip+1 else d = v str(2:2) = char (d+z) end if ! ! -ve values if ( sgn < 0 ) str(1:1) = '-' ! ! write power k = 3+m+2 ! if ( ip < 0 ) then v = abs(ip) str(k-1:k) = 'E-' else v = ip str(k-1:k) = 'E+' end if ! m = 2 if ( v > 99 ) m = 3 do i = k+m,k+1,-1 d = mod(v,ten) str(i:i) = char (d+z) v = v/10 end do return ! ! overflow field 99 str = repeat ('#', l) return

   end subroutine write_val_e4

It will be good to test the updated FTN95 /64 /OPT on this and my earlier post on Thu Nov 17, 2016, when it is available.

mecej4

Posts: 1911

Back to Top

21 Nov 2016 5:58 (Edited: 26 Sep 2019 5:15) #18442

A couple of observations.

1. In many borderline cases, your new routine is faithful to the 'round to nearest or even' rule; the previous version does not always get the least significant digit right.

2. When the input number causes the 'patch for round up power' to be applied, the result is not correct. Try, for example,

val = 1.000857E-1
call write_val_e4 (val,str,3)
print *,str

The printed value is 1.009E-01, which is missing the third '0' preceding the '9'.

JohnCampbell

Posts: 2526 Sydney

Back to Top

21 Nov 2016 10:00 #18443

mecej4,

How do you find these numbers !

The error is due to the error with QLG10 estimating the wrong integer power. ( I was hoping the lookup table would have been correct at multiples of 10 ) I corrected the problem by making a more general response to the integer power 'ip' being wrong. (could have an infinite loop with rounding ?)

John

   program IntlWrite
     call test ( 99.999 )
     call test ( 1.000857e-1 )
     call test ( 1.00857e-1 )
   end

   subroutine test ( x )
     implicit none 
     character(len=100) :: str 
     real   :: x 
     REAL     QLG10
     external QLG10
!
      write (*,*) 'log', qlg10 (x), log10 (x), x
      call write_val_e4 ( x, str(1:12), 3 )  
      write (*,*) 'es :',str(1:12), x
!
   end subroutine test
   
   subroutine write_val_e4 (val, str, n)
!
!  writes -3.04E+01
!
     real*4    :: val         !  value to write; must fit 
     integer*4 :: n           !  digits >= 0 and < len(str)
     character :: str*(*)
!
     integer*4 :: l           ! len ( str )
     real*4    :: rv,av       ! abs ( val )
     real*4    :: power       ! log10 ( val )
     integer*4 :: ip          ! E+ip
     integer*8 :: v           ! integer for digits of val
     integer*8 :: ten = 10    ! mod
     integer*4 :: k           ! position of digit
!     integer*4 :: p = 3       ! position of '.'
     integer*4 :: sgn         ! +/-
     integer*4 :: d           ! digit
     integer*4 :: z = ichar ('0')
     integer*4 :: i,m
!
     REAL     QLG10
     external QLG10
!
!  Remove sign
     if ( val > 0 ) then
       sgn = 1
       av  = val
     else if ( val < 0 ) then
       sgn = -1
       av  = -val
     else
       str = ' 0.'
       return
     end if
!
!  Check for overflow
     m   = max (n,1)                !  digits to provide
     l   = len (str)
     if ( m > l-7 ) goto 99
!zzz
!     write (*,*) 'val   =',val
!     write (*,*) 'av    =',av
!     write (*,*) 'log10 =', log10(av), QLG10 (av)
!
!  Determine power
     power = QLG10 (av)
      ip = power
      if ( power < 0) ip = (power - .9999999d0)
!zzz
!     write (*,*) 'power =',power
!     write (*,*) 'ip    =',ip
!
!  Apply rounding
 10  rv  = av * 10.0**(m-ip) + 0.5
!
!  generate digits
     v   = rv
     str = ' '
     do k = 3+m,3+1,-1
       d = mod(v,ten)
       str(k:k) = char (d+z)
       v = v/10
     end do
!
!  check power
     str(3:3) = '.'
     if ( v > 9 ) then          !  patch for round up power
       write (*,*) 'Power increase identified :', str(1:3+m), ip
       ip = ip+1
       goto 10
     else if ( v < 1 ) then     !  patch for round down power
       write (*,*) 'Power decrease identified :', str(1:3+m), ip
       ip = ip-1
       goto 10
     else
       d = v
       str(2:2) = char (d+z)
     end if
!
!  -ve values
     if ( sgn < 0 ) str(1:1) = '-'
!
!  write power
     k = 3+m+2
     if ( ip < 0 ) then
       v = abs(ip)
       str(k-1:k) = 'E-'
     else
       v = ip
       str(k-1:k) = 'E+'
     end if
!
     m = 2
     if ( v > 99 ) m = 3
     do i = k+m,k+1,-1
       d = mod(v,ten)
       str(i:i) = char (d+z)
       v = v/10
     end do
     return
!
!  overflow field
 99  str = repeat ('#', l)
     return

   end subroutine write_val_e4

mecej4

Posts: 1911

Back to Top

22 Nov 2016 4:59 (Edited: 22 Nov 2016 1:57) #18444

You can reduce the number of instances (where a correction is needed) by increasing n to, say, 12, and TBLSIZ to 2^12 = 4096. You could, in addition, revert to using the full-precision log10() function in the few instances in which you detect that the estimated exponent needs correction.

Here is an improved version of QLG10 in which, after the table lookup, instead of taking the closest lower value, linear interpolation is applied to the two bracketing values. With this change, I find that 556 exponent updates were needed for 10 million random numbers between -1E5 to 1E5. If the full precision log10() is used, the number of exponent updates becomes 548. Thus, your exponent checks and updates are needed to be present in the code (even when you use log10()), but the performance hit is minor because such updates may be rarely invoked.

real function qlg10l(x) 
! 
! quick evaluation of log10(x) with 8 bit precision, for use in finding 
! integer part of log10(x) to use with E and G format output processing 
! 
! See http://www.icsi.berkeley.edu/pubs/techreports/TR-07-002.pdf 
! In this version, after the table is looked up linear interpolation is 
! performed to improve the approximation 
! 
implicit none 
integer, parameter :: n = 8 
integer, parameter :: tblsiz = 257, mbits = 23   ! 2^n+1, mantissa bits 
real, intent(in) :: x 
real, dimension(tblsiz),save :: tblx,tbly
real, dimension(tblsiz-1),save :: cm
logical :: beg = .true. 
integer :: i, ix, lg2, y, z, mask, emask 
real rz 
data mask/Z'7FFFFF'/,emask/Z'3F800000'/ 
! 
if(beg)then 
   beg = .false. 
   do i=1,tblsiz 
        tblx(i) = (i-1)/real(tblsiz-1) + 1.0 
        tbly(i) = LOG10(tblx(i)) 
   end do 
   do i=1,tblsiz-1
	  cm(i)   = (tbly(i+1)-tbly(i))/(tblx(i+1)-tblx(i))
   end do
endif 
ix=transfer(x,ix) 
lg2=iand(ishft(ix,-mbits),255) - 127  ! exponent 
y=iand(ix,mask); z=ior(y,emask); rz=transfer(z,rz) 
y=ishft(y,n-mbits)    ! 8-bit mantissa 
qlg10l=0.30103*lg2+tbly(y+1)+cm(y+1)*(rz-tblx(y+1))
!write(*,10)y,rz,tblx(y+1),tbly(y+1),tblx(y+2),tbly(y+2),qlg10 
!10 format(1x,i4,5F10.4,F14.4) 
return 
end function qlg10l

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

22 Nov 2016 12:33 #18446

OK, finally i was able to reboot my computer and install RAMdisk.

And now after being happy for last few days with increased by factor of 4.5x read speeds using unformatted I/O with binary data i'm again in the big black jealousy mode.

Here are my RAMdisk speeds.

Versus what i got (~300MB/s) these speeds are 20-30x larger...The C code of this benchmark somehow reads hell faster and writes damn faster.

If you know C and have time to investigate how these speeds are tested here is link on the C source code of this benchmark (called CrystalDiskMark which is mostly used for such tests)

http://crystalmark.info/software/CrystalDiskMark/index-e.html

How come Fortran is 30x slower at unformatted I/O then C ?? No need to find the field start and end, no LOG tables, no slowness of char arrays and still such hell small speeds. Something is still rotten in the state of Denmark in its Fortran district

JohnCampbell

Posts: 2526 Sydney

Back to Top

22 Nov 2016 9:16 #18450

Dan,

I presume that reading a sequential file at 8491 MB/s is 8.4 GigaBytes per second. These are impressive rates. The problem is you can't process the information at this rate.

You have removed Disk I/O performance as a bottleneck for your analysis with this approach.

John

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

22 Nov 2016 10:03 #18452

John, Yes, these are multi-GB per second I/O speeds. Despite of these crazy speeds the DRAM is still next slowest bottleneck of the computer after harddrives. In my fox hole at North Pole I have older generation DDR3 memory, the newer DDR4 are probably twice faster (my previous DDR2 memory computer had 6GB/s speed with the same QSoft RAMdisk). And processor's caches are more then an order of magnitude faster with L1 cache two orders faster. Given that why we can not process the information at this rate ? This 10GB/s rate is like a turtle for any processor. The rate at which Fortran codes are doing I/O is like a dead man walk

What this CrystalDiskMark benchmark doing is the same like all we do above with all our tests - it is reading/writing and measuring time. But if this test was written in Fortran it'd show the speeds not more then 500-600MB/s because Fortran itself is a bottleneck, it by some unknown reason can not do faster. This is what fast drives like RAMdisk revealed. So the question is : why taking bytes from one place of DRAM (on RAMdisk) and placing them into another place of DRAM using C takes 30x less time then with Fortran? Is Fortran calculation of indexes of an array in the DRAM where data will be placed THAT more expensive then for C to find the address of the same data placed on the RAMdisk file?

We clearly don't use all capabilities of modern hardware. The compiler developers must notice and address this huge discrepancy. Besides RAMdisks, the 3.5GB/s speeds have PCI-Express (not SATA) mainstream reasonably priced SSDs (like the ones from Samsung), all ultraportable computers have at least 1 GB/s. Compare that to the shameful speeds we read/write...

By the way if anyone think that RAMdisks lose information on reboot etc, they are way overprotective. I still have all files on my older computer RAMdisk which are probably 3-4 years old. It makes automatic backups on shutdowns and on predefined intervals and loads that on login