Topic: Best way to add data to the growing file in General

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

9 Feb 2010 5:11 #5937

What do you folks think is the most optimal and fastest way to add the new piece of data to the growing file ? Additions will be done ones per day so let's call it for clarity 'multiday database file'. This file will be very large with time a GB or so. And there will be many 1000th of such files so i have to care about the right ways of doing that

I guess that standard way could be just reading this multiday database file until its end and add new ('today') portion at the end. But disadvantages are that it will take a lot of time to read as the file is growing and the fact that newer data (which will be more often used) will be placed at the end of multiday database file

So adding at the end or at the beginning is better, simpler, more reliable?

If adding to the beginning is preferable what is the best way to do that? Would it be with the use of APPEND file attribute (and the large database file then has to be added to the today smaller portion not vise versa, that will assure that the newer additions will at beginning) ? Is APPEND option working reliably (i have for many years strange not resolved problem sometimes with reading file up to the end, then calling one BACKSPACE and writing an addition to the file. I may get one line missing, do not know reason of that )?

jjgermis

Posts: 400 Nürnberg, Germany

Back to Top

10 Feb 2010 5:13 #5955

My first thought would have been[color=blue:7843d72505] APPEND [/color:7843d72505]as well. However, I do not have any experience with the task that you have in mind. Are you working with text files and do they have a special format? What is the reason for keeping the database growing? I assume that you would like to keep to your present format and do not want to change your database format.

JohnCampbell

Posts: 2526 Sydney

Back to Top

10 Feb 2010 8:17 #5957

An alternative is to use a direct access file, which requires fixed length records.

Each record could be designed to have a day's information, except the first record which indicates the number of active records, and any other information you may require, such as the date of the last record update.

You can then have a utility program which can analyse the file and check the length if required.

I'm not sure which is the best approach. On some O/S, the file allocation table can indicate where the end of the file is, but I suspect access=append may read all the file. You could try both methods with a file of at least 1gb and see how long they take to update the file.

John

IanLambley

Posts: 501 Sunderland

Back to Top

10 Feb 2010 12:09 #5961

Dan, If it is a sequential file, then opening it with the APPEND option should possition the pointer so that the next write puts data after the last record.

Using the read until the end of file and a backspace, is likely to position the pointer at the start of the last existing record, and this would then be overwritten by the next write statement, hence losing the last line of the existing. Instead, read till end, backspace and perform one more read and then write new data, may be the solution.

If you use direct access files and you don't know how many records are in the file, perform a dummy read on record 1 and look at the iostat value. If zero, then dummy read record 2, 4, 8, 16 etc until you get a fail, then use the successive approximation technique (error halving) used in an analogue to digital converter to find the number of existing records, add 1 and put in the new records. I would send you a routine to do this, but I'm not proud of it. By a dummy read, I mean just use

read(ichan,rec=itest,iostat=ios)

and don't put an data list on the read or if you find one is needed, then a 1byte variable is all that is needed. That should reduce the data transfer required and the same routine will work for all record sizes.

Ian

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

11 Feb 2010 9:51 #5968

Great suggestions, thank you all.

Barely remember and never worked with direct access file during last 20 years. Is it safe ? Say, suppose computer crashes during the write, what will happen? With the sequential file i lose probably only 'today' portion. Will i lose the whole database in direct access approach?

The database will contain detailed tick-by-tick stock market prices for all existing stocks

Also thanks for suggestion to fix backspace annoyance. I've spent a lot of time with it in the past and gave up. I do not remember what i've not tried....Will see what will happen now, would be nice to fix this damn thing.

JohnHorspool

Posts: 260 Gloucestershire UK

Back to Top

11 Feb 2010 11:45 #5969

Dan,

Why not have two files, todays and yesterdays? On the next day you initially have yesterdays file and the day before. Then each day, delete the oldest file, create a new file, copy the contents of yesterdays file into todays file and addon todays portion. Repeat everyday. Since you are only ever reading yesterdays file (to write into todays file), I don't think that it can be harmed by a read. Or am I talking nuts?

John

IanLambley

Posts: 501 Sunderland

Back to Top

12 Feb 2010 12:13 #5970

John, That is probably the standard method - the old sort and merge system. The existing file is merged with the sorted 'day file' by interleaving them together into a new file so that the resulting combination file is in the desired order, and you have only had to sort the shorter updated 'day file'. It is a commercial processing thing (e.g. COBOL), rather than scientific Fortran. Ian

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

12 Feb 2010 9:54 #5973

John,

In average we have 200 MB per year per stock, 10000 stocks = 2 TB per year

The way you are proposing to operate might be safe, but one thing exists besides that, i.e. the reading&writing 2TB of data could become the forever job. With the read+write+processing speed probably 20MB/sec it will take

2x10^12 / 20x10^6 = 10^5 seconds

more then 86400 seconds, or a whole day... I'd hope APPEND would just patch the new file to the database (or vice versa) without reading the whole damn thing. Is it how APPEND works?

JohnHorspool

Posts: 260 Gloucestershire UK

Back to Top

12 Feb 2010 10:36 #5974

Dan,

I think you will have to seriously consider John Campbell's suggestion of using a direct access file. You can certainly grow the file without having to read through all of it. The downside is the fixed record length, thus choosing an appropriate record length is a critical decision. Keeping track of the number of records shouldn't be a problem, as John suggested just store the count as part of the first record and update it when you add more records on the end of the file.

Even if you get a power cut when updating the file, just write a recovery program to read through the file record by record writing each recovered record to a new file, with the process stopping when IOSTAT hits an error reading the corrupted file.

JohnCampbell

Posts: 2526 Sydney

Back to Top

13 Feb 2010 11:13 #5977

Dan,

Even with a direct access file, when opening the file and writing past the last record, we do not know how windows locates the position of the last record on the disk. Presumably it will access the FAT to get the file location. We do not know how efficient windows O/S is to locate the end of file position, so that it can update the file. Alternatively you could just have new files each day (or week), then have other utility programs that run once a day (or week) to merge the new files. In windows, a direct access file does not have a significantly different structure to a sequential access file. There is no concept of block size or contiguous records on disk, as there was with older O/S say digital Vax. No guarantee they were any better either. Another option may be to find the size of the file (files@ will give this) then respond appropriately, either seeding a new file or appending to the old one. It would not be a big overhead to have a table of file names, and identify what period of time they relate to. Writing a database management system to keep track of all the files would not be very hard. Depending on the file size, you could choose to create a new file at any suitable time. (Last Thursday I wrote a paging system to handle a 6gb sparce array, by keeping a map of active pages. I was surprised how easy it was and it works as fast as the earlier version of the 1.2gb non-sparce array version) You realy need to write a program that tests how quickly you can update a 1gb sequential and a direct access file.

Good luck !!

John

JohnCampbell

Posts: 2526 Sydney

Back to Top

15 Feb 2010 1:14 #5985

I tried to write a program to test the two methods. Unfortunately these tests do not flush the disk buffers and so are not a real test, but they do give some indication of timing. The program is:

!     Last change:  JDC  15 Feb 2010   12:01 pm
!  program to open a file and append to the end
!
      integer*4, parameter :: num = 400000
      integer*4 i, dn
      REAL*8    elapse(0:num)
!
      open  (11, file=  'f11_seq.txt')
      close (11, status='delete')
!
      open  (11, file=  'f11_dir.txt')
      close (11, status='delete')
!
!   Sequential file
      dn = num/10
      CALL ELAPSE_SECOND (elapse(0))
      do i = 1,num
         if (mod(i,dn) == 0) write (*,*) 'Seq', i, elapse(i-1)-elapse(i-dn)
         CALL ELAPSE_SECOND (elapse(i))
         call Update_seq_file (i)
      end do
!
      open (unit=11,file='update_seq.log')
      do i = 1,num,10
         write (11,1002) i, elapse(i), elapse(i) - elapse(i-1)
      end do
      write (*,*) 'Sequential', elapse(num)-elapse(0)
!
!   Direct file
      CALL ELAPSE_SECOND (elapse(0))
      do i = 1,num
         if (mod(i,dn) == 0) write (*,*) 'Dir', i, elapse(i-1)-elapse(i-dn)
         CALL ELAPSE_SECOND (elapse(i))
         call Update_dir_file (i)
      end do
!
      open (unit=11,file='update_dir.log')
      do i = 1,num,10
         write (11,1002) i, elapse(i), elapse(i) - elapse(i-1)
      end do
      write (*,*) 'Direct', elapse(num)-elapse(0)
!
1002  format (i7,f14.5,f14.7)
      end

      subroutine update_seq_file (irec)
!
      integer*4 irec
      character date_time_buffer*25
      integer*4 ii(11,11), i,j,k
!
      open (11,file='f11_seq.txt', access='append')
!
      do i = 1,11
       do j = 1,11
         ii(i,j) = i+j+2
       end do
      end do
!
      call date_time_string (Date_Time_buffer)
      write (11,1000) 'new record',irec,' at '//date_time_buffer
      do k = 1,4
       do i = 1,11
         write (11,1001) k,i,(II(I,J),J=1,11)
       end do
      end do
      close (unit=11)
!
1000  FORMAT (/a,i8,a)
1001  format (2i4,12i6)
      end subroutine

      subroutine update_dir_file (irec)
!
      integer*4, parameter :: rec_len = 848
      integer*4 irec
      integer*4 record(rec_len), rec_num, iostat
!
      open (11, file='f11_dir.txt', access='direct', recl=rec_len*4, iostat=iostat)
!
!    get number of records
      READ (UNIT=11, REC=1, IOSTAT=IOSTAT) record
!
      if (iostat /= 0) then
         record = 1
      end if
!
      if (record(1) /= irec) write (*,*) 'unexpected irec value', record(1), irec
!
      rec_num = record(1) + 1
      record  = 1
      record(1) = rec_num
!
      write (UNIT=11, REC=1, IOSTAT=IOSTAT) record
!
      write (UNIT=11, REC=rec_num, IOSTAT=IOSTAT) record
      close (unit=11)
!
      end subroutine

JohnCampbell

Posts: 2526 Sydney

Back to Top

15 Feb 2010 1:17 #5986

The library routines are ( shame that preview and submit don't have the same size limit)

      subroutine date_time_string (Date_Time_buffer)
!
!     returns the date in the form 11-Jan-00 hh:mm:ss.sss
!
      character (len=*),        intent (out)    :: date_time_buffer
      CHARACTER LABEL(12)*3, CBUF*22
      STDCALL   GETLOCALTIME 'GetLocalTime' (REF)
      integer*2 ia(8), YEAR,MONTH
      DATA LABEL / 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',            &
     &             'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' /
!
      call GetLocalTime (ia)
!
!     ia_name(1) = 'year'
!     ia_name(2) = 'month'
!     ia_name(3) = 'day of week'
!     ia_name(4) = 'day'
!     ia_name(5) = 'hour'
!     ia_name(6) = 'minute'
!     ia_name(7) = 'second'
!     ia_name(8) = '.001_sec' (accurate only to 1/60 second)
!
      year  = mod (ia(1),100)
      month = ia(2)
!
      WRITE (CBUF,1009) ia(4), LABEL(MONTH), YEAR, ia(5), ia(6), ia(7),ia(8)
!
      Date_Time_Buffer = CBUF
      RETURN
!
 1009 FORMAT (I2,'-',A3,'-',I2.2, i3,':',i2.2,':',i2.2,'.',i3.3)
      END

!---------SUBROUTINE--ELAPSE_SECOND-----------------------------J.D.C. INC-----
!
      SUBROUTINE ELAPSE_SECOND (ELAPSE)
!
!     Returns the total elapsed time in seconds
!     based on QueryPerformanceCounter
!     This is the fastest and most accurate timing routine
!
      real*8,   intent (out) :: elapse
!
      STDCALL   QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
      STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
!
      real*8    :: freq  = 1
      logical*4 :: first = .true.
      integer*8 :: start = 0
      integer*8 :: num
      logical*4 :: ll
      integer*4 :: lute
!
!   Calibrate this time using QueryPerformanceFrequency
      if (first) then
         num   = 0
         ll    = QueryPerformanceFrequency (num)
         freq  = 1.0d0 / dble (num)
         call get_echo_unit (lute)
         WRITE (lute,*) 'Elapsed time counter :',num,' ticks per second'
         start = 0
         ll    = QueryPerformanceCounter (start)
         first = .false.
      end if
!
      num    = 0
      ll     = QueryPerformanceCounter (num)
      elapse = dble (num-start) * freq
      return
      end
 
      subroutine get_echo_unit (lute)
      integer*4 lute
      lute = 1
      end

JohnCampbell

Posts: 2526 Sydney

Back to Top

15 Feb 2010 1:24 #5987

and the run log I got from my PC was:

 Elapsed time counter :           3166720000 ticks per second
 Seq       40000          15.4560447065    
 Seq       80000          24.7551289164    
 Seq      120000          24.2236882661    
 Seq      160000          24.5701263193    
 Seq      200000          20.4406994556    
 Seq      240000          21.9553754124    
 Seq      280000          22.8299788671    
 Seq      320000          18.0617328416    
 Seq      360000          15.9047791112    
 Seq      400000          16.5379037240    
 Sequential          204.739364586    
 Dir       40000          19.0621267428    
 Dir       80000          19.2571968103    
 Dir      120000          19.2111123453    
 Dir      160000          19.1347387199    
 Dir      200000          19.6607339547    
 Dir      240000          20.0625909957    
 Dir      280000          19.5169702471    
 Dir      320000          19.8339843788    
 Dir      360000          19.6213950444    
 Dir      400000          20.0680457979    
 Direct          195.433487027

These times do not show the delay if disk buffers were cleared but they do show similar times for the 2 approaches. I have kept the record sizes similar.

I have not understood from your description :- how many times the files would be updated in a day, what error recovery may be required, or if each file is limited to a single process.

All these could make the program more complex, but still workable. I do think a database of file names woukld help. Using files8@ can help with this. I hope this helps

John

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

15 Feb 2010 5:08 #5988

John, special thanks, couldn't be better help then a piece of Fortran text

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

19 May 2010 8:14 (Edited: 20 May 2010 6:00) #6386

I found also one more useful thing in John's code which might be interesting to everyone. The function QueryPerformanceFrequency i re-wrapped separately into CPUclockGHz and placed in simple demo Clearwin program returns you real*8 value of the clock of your processor in GHz

use clrwin
real*8 CPUclockGHz; external CPUclockGHz 
   i=winio@('CPU Clock, GHz %rf%ac[esc]', CPUclockGHz(),'exit')
end

real*8 function CPUclockGHz() 
   integer*8 :: freq 
   logical*4 :: al      
   STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4 
   al = QueryPerformanceFrequency (freq) 
   CPUclockGHz = freq/1.e9 
end function CPUclockGHz

JohnHorspool

Posts: 260 Gloucestershire UK

Back to Top

19 May 2010 9:09 #6387

Dan,

It gives incorrect result on my machine (64bit AMD Phenom quadcore) running XP64.

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

19 May 2010 9:18 #6388

What it gives, current clock, max clock, other? You know clock frequency is varied by OS, right? Issue may appear when different cores are clocked differently.

JohnCampbell

Posts: 2526 Sydney

Back to Top

20 May 2010 12:42 #6391

John,

As you have found, it is not guaranteed for QueryPerformanceFrequency to give the processor frequency. QueryPerformanceCounter and CPU_Clock@ appear to be based on the QueryPerformanceFrequency and most accurately give elapsed time. On reading the documentation for these there is no guarantee what the processor manufacturer does with updating these counters.

Another puzzle to me is why so many of the other system timing counters are only updated @ 60 hz. I tried to send questions on the MSDN Forums but never got an answer as to how to increase the update frequency. CPU_Time, date_and_time, Dclock@, GetLocalTime, GetTickCount and GetProcessTimes all appear to be related to this timer that is only updated 60 times a second. I don't know:

where the 60 cycles is defined,
how it could be changed, and
what overhead would result.

I have written a routine to run CPU_Clock@ for .7 seconds (timed using dclock@) and calibrate the tick count, then store this in c:\processor_mhz.txt for other routines to quickly get. (I keep calling dclock@, noting when it changes value 41 times, indicating the start time of each cycle and call QPC each cycle)

If you want to differentiate between elapsed time and process (CPU) time, then GetProcessTimes gives some better info, but is only updated @ 60hz. It can be useful to differentiate between CPU or Processor time and disk I/O time, but achieving this changes with different processors, even different Intel processors !

If anyone can provide more information on this, I would love to know.

John

DanRRight

Posts: 2877 South Pole, Antarctica

Back to Top

12 Sep 2010 12:57 #6933