Best way to add data to the growing file
DanRRight



Joined: 10 Mar 2008
Posts: 2818
Location: South Pole, Antarctica

Posted: Tue Feb 09, 2010 6:11 am    Post subject: Best way to add data to the growing file

What do you folks think is the fastest way to add a new piece of data to a growing file? Additions will be done once per day, so for clarity let's call it the "multiday database file". Over time this file will become very large, a GB or so, and there will be many thousands of such files, so I have to care about doing this the right way.

I guess the standard way would be to read this multiday database file to its end and add the new ("today") portion there. The disadvantages are that the read takes longer and longer as the file grows, and that the newer data (which will be used more often) ends up at the end of the multiday database file.

So is adding at the end or at the beginning better, simpler, more reliable?

If adding at the beginning is preferable, what is the best way to do that? Would it be with the APPEND file attribute (the large database file would then have to be appended to today's smaller portion, not vice versa, which ensures that the newer additions are at the beginning)? Does the APPEND option work reliably? (For many years I have had a strange, unresolved problem: sometimes, after reading a file up to the end, calling BACKSPACE once and writing an addition to the file, I get one line missing, and I do not know the reason.)
jjgermis



Joined: 21 Jun 2006
Posts: 404
Location: Nürnberg, Germany

Posted: Wed Feb 10, 2010 6:13 am

My first thought would have been APPEND as well. However, I do not have any experience with the task that you have in mind. Are you working with text files and do they have a special format? What is the reason for keeping the database growing? I assume that you would like to keep to your present format and do not want to change your database format.
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Wed Feb 10, 2010 9:17 am

An alternative is to use a direct access file, which requires fixed length records.

Each record could be designed to have a day's information, except the first record which indicates the number of active records, and any other information you may require, such as the date of the last record update.

You can then have a utility program which can analyse the file and check the length if required.

I'm not sure which is the best approach. On some O/S, the file allocation table can indicate where the end of the file is, but I suspect access=append may read all the file. You could try both methods with a file of at least 1gb and see how long they take to update the file.

John
IanLambley



Joined: 17 Dec 2006
Posts: 490
Location: Sunderland

Posted: Wed Feb 10, 2010 1:09 pm

Dan,
If it is a sequential file, then opening it with the APPEND option should position the pointer so that the next write puts data after the last record.

Reading until end of file and then doing one backspace is likely to position the pointer at the start of the last existing record, which would then be overwritten by the next write statement, hence losing the last line of the existing file. Instead, reading to the end, backspacing, performing one more read and then writing the new data may be the solution.
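
For the sequential case, here is a minimal sketch of appending without any read or BACKSPACE at all. It uses the standard Fortran 90 `position='append'` specifier rather than the `access='append'` extension, and the file name and unit number are made up for the demo, so treat it as an illustration to try on your own compiler rather than a statement about how FTN95 behaves.

```fortran
program append_demo
   implicit none
   integer, parameter :: lun = 21                              ! arbitrary unit number (assumption)
   character(len=*), parameter :: fname = 'append_demo.txt'    ! hypothetical file name
   character(len=80) :: line
   integer :: ios, nlines

   ! Create a small "existing" file with three records
   open (lun, file=fname, status='replace', action='write')
   write (lun, '(a)') 'day 1', 'day 2', 'day 3'
   close (lun)

   ! Reopen positioned after the last record and add today's portion
   open (lun, file=fname, status='old', position='append', action='write')
   write (lun, '(a)') 'day 4'
   close (lun)

   ! Read the file back and count records to check nothing was overwritten
   nlines = 0
   open (lun, file=fname, status='old')
   do
      read (lun, '(a)', iostat=ios) line
      if (ios /= 0) exit
      nlines = nlines + 1
   end do
   close (lun, status='delete')
   print *, 'records in file:', nlines
end program append_demo
```

If the count printed is four, the last existing line survived the append; if it is three, the new write replaced it.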

If you use direct access files and you don't know how many records are in the file, perform a dummy read on record 1 and look at the iostat value. If zero, then dummy read record 2, 4, 8, 16 etc until you get a fail, then use the successive approximation technique (error halving) used in an analogue to digital converter to find the number of existing records, add 1 and put in the new records. I would send you a routine to do this, but I'm not proud of it. By a dummy read, I mean just use
Code:

read(ichan,rec=itest,iostat=ios)

and don't put a data list on the read. If you find that one is needed, a 1-byte variable is enough. That should reduce the data transfer required, and the same routine will work for all record sizes.
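
This is not the routine I mentioned, just a minimal sketch of the doubling and halving search described above. The names `rec_exists` and `last_record`, the demo file and the record size are all invented for the example, and it assumes the records are contiguous and that your compiler returns a non-zero iostat when reading a record past the end of the file.

```fortran
program count_records
   implicit none
   integer, parameter :: lun = 21, nwords = 16, nwritten = 37   ! demo values (assumptions)
   integer :: buf(nwords), rl, i, n

   ! Build a small direct access file so the search has something to find
   inquire (iolength=rl) buf
   open (lun, file='count_demo.bin', access='direct', recl=rl, status='replace')
   do i = 1, nwritten
      buf = i
      write (lun, rec=i) buf
   end do

   n = last_record(lun)
   print *, 'records found:', n
   close (lun, status='delete')

contains

   ! Dummy read: transfer one small variable and look only at iostat
   logical function rec_exists (iu, irec)
      integer, intent(in) :: iu, irec
      integer :: ios, dummy
      read (iu, rec=irec, iostat=ios) dummy
      rec_exists = (ios == 0)
   end function rec_exists

   ! Double the record number until a read fails, then halve the interval
   integer function last_record (iu)
      integer, intent(in) :: iu
      integer :: lo, hi, mid
      last_record = 0
      if (.not. rec_exists(iu, 1)) return       ! empty file
      lo = 1                                    ! known to exist
      hi = 2
      do while (rec_exists(iu, hi))
         lo = hi
         hi = 2*hi
      end do
      ! Here record lo exists and record hi does not
      do while (hi - lo > 1)
         mid = lo + (hi - lo)/2
         if (rec_exists(iu, mid)) then
            lo = mid
         else
            hi = mid
         end if
      end do
      last_record = lo
   end function last_record

end program count_records
```

The number of dummy reads grows only with the logarithm of the record count, so even a file with millions of records needs a few dozen reads. New data would then go to record `last_record + 1`.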

Ian
DanRRight



Joined: 10 Mar 2008
Posts: 2818
Location: South Pole, Antarctica

Posted: Thu Feb 11, 2010 10:51 pm

Great suggestions, thank you all.

I barely remember direct access files and have not worked with them during the last 20 years. Are they safe? Say the computer crashes during a write, what will happen? With a sequential file I probably lose only the "today" portion. Will I lose the whole database with the direct access approach?

The database will contain detailed tick-by-tick stock market prices for all existing stocks.

Also thanks for the suggestion on fixing the backspace annoyance. I've spent a lot of time on it in the past and gave up. I do not remember what I have not tried... We'll see what happens now; it would be nice to fix this damn thing.
JohnHorspool



Joined: 26 Sep 2005
Posts: 270
Location: Gloucestershire UK

Posted: Fri Feb 12, 2010 12:45 am

Dan,

Why not have two files, today's and yesterday's? On the next day you initially have yesterday's file and the one from the day before. Then each day, delete the oldest file, create a new file, copy the contents of yesterday's file into today's file and add on today's portion. Repeat every day. Since you are only ever reading yesterday's file (to write into today's file), I don't think that it can be harmed by a read. Or am I talking nuts?

John
IanLambley



Joined: 17 Dec 2006
Posts: 490
Location: Sunderland

Posted: Fri Feb 12, 2010 1:13 pm

John,
That is probably the standard method - the old sort and merge system.
The existing file is merged with the sorted "day file" by interleaving them together into a new file so that the resulting combination file is in the desired order, and you have only had to sort the shorter updated "day file".
It is a commercial processing thing (e.g. COBOL), rather than scientific Fortran.
Ian
DanRRight



Joined: 10 Mar 2008
Posts: 2818
Location: South Pole, Antarctica

Posted: Fri Feb 12, 2010 10:54 pm

John,

On average we have 200 MB per year per stock, and 10000 stocks = 2 TB per year.

The way you are proposing to operate might be safe, but there is one other thing: reading and writing 2 TB of data could become a never-ending job. With a read+write+processing speed of probably 20 MB/sec it will take

2x10^12 / 20x10^6 = 10^5 seconds

which is more than 86400 seconds, i.e. more than a whole day... I'd hope APPEND would just patch the new file onto the database (or vice versa) without reading the whole damn thing. Is that how APPEND works?
JohnHorspool



Joined: 26 Sep 2005
Posts: 270
Location: Gloucestershire UK

Posted: Fri Feb 12, 2010 11:36 pm

Dan,

I think you will have to seriously consider John Campbell's suggestion of using a direct access file. You can certainly grow the file without having to read through all of it. The downside is the fixed record length, thus choosing an appropriate record length is a critical decision. Keeping track of the number of records shouldn't be a problem, as John suggested just store the count as part of the first record and update it when you add more records on the end of the file.

Even if you get a power cut when updating the file, just write a recovery program to read through the file record by record writing each recovered record to a new file, with the process stopping when IOSTAT hits an error reading the corrupted file.
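
A minimal sketch of such a recovery pass, with made-up file names and record size, and a small file built on the spot to stand in for the damaged one. Note that it only catches missing or unreadable records; a record that reads without error but holds garbage would need a check of its own.

```fortran
program recover_demo
   implicit none
   integer, parameter :: lin = 21, lout = 22, nwords = 16   ! demo values (assumptions)
   integer :: buf(nwords), rl, irec, ios, ngood

   inquire (iolength=rl) buf

   ! Stand-in for the damaged database: five good records, nothing after them
   open (lin, file='damaged_demo.bin', access='direct', recl=rl, status='replace')
   do irec = 1, 5
      buf = irec
      write (lin, rec=irec) buf
   end do

   open (lout, file='recovered_demo.bin', access='direct', recl=rl, status='replace')

   ! Copy record by record and stop at the first read that reports an error
   ngood = 0
   irec  = 1
   do
      read (lin, rec=irec, iostat=ios) buf
      if (ios /= 0) exit
      write (lout, rec=irec) buf
      ngood = irec
      irec  = irec + 1
   end do

   print *, 'records recovered:', ngood
   close (lin,  status='delete')
   close (lout, status='delete')
end program recover_demo
```

If the first record holds the record count, as John suggested, the recovery program would finish by rewriting record 1 of the new file with the recovered count.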
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sat Feb 13, 2010 12:13 pm

Dan,

Even with a direct access file, when opening the file and writing past the last record, we do not know how Windows locates the position of the last record on the disk. Presumably it will access the FAT to get the file location. We do not know how efficiently the Windows O/S locates the end-of-file position so that it can update the file.
Alternatively you could just have new files each day (or week), then have other utility programs that run once a day (or week) to merge the new files.
In Windows, a direct access file does not have a significantly different structure to a sequential access file. There is no concept of block size or contiguous records on disk, as there was with older O/S, say Digital VAX. No guarantee they were any better either.
Another option may be to find the size of the file (files@ will give this) then respond appropriately, either seeding a new file or appending to the old one.
It would not be a big overhead to have a table of file names and identify what period of time they relate to. Writing a database management system to keep track of all the files would not be very hard. Depending on the file size, you could choose to create a new file at any suitable time. (Last Thursday I wrote a paging system to handle a 6 GB sparse array by keeping a map of active pages. I was surprised how easy it was, and it works as fast as the earlier 1.2 GB non-sparse array version.)
You really need to write a program that tests how quickly you can update a 1 GB sequential file and a 1 GB direct access file.
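
On the file-size option: as a sketch of the same idea without files@, the Fortran 2003 `size=` specifier of INQUIRE can be used where the compiler supports it (an assumption you would need to check). The file name and record size below are invented for the demo, and the record count is only meaningful if the compiler stores direct access records back to back with no markers.

```fortran
program size_demo
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)          ! 64-bit, for files over 2 GB
   integer, parameter :: lun = 21, nwords = 16               ! demo values (assumptions)
   character(len=*), parameter :: fname = 'size_demo.bin'    ! hypothetical file name
   integer     :: buf(nwords), rl, i, rec_bytes
   integer(i8) :: nbytes
   logical     :: found

   ! Build a small direct access file with ten records
   inquire (iolength=rl) buf
   open (lun, file=fname, access='direct', recl=rl, status='replace')
   do i = 1, 10
      buf = i
      write (lun, rec=i) buf
   end do
   close (lun)

   ! Ask for the size without opening or reading the file
   inquire (file=fname, exist=found, size=nbytes)
   rec_bytes = nwords * storage_size(buf) / 8                ! bytes per record

   if (found .and. nbytes > 0) then
      print *, 'file size in bytes :', nbytes
      print *, 'records in file    :', nbytes / rec_bytes
   else
      print *, 'no data yet, seed a new file'
   end if

   ! Clean up the demo file
   open (lun, file=fname, access='direct', recl=rl, status='old')
   close (lun, status='delete')
end program size_demo
```

The next record number is then the record count plus one, so no header record and no search are needed, at the cost of trusting the size reported by the file system.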

Good luck !!

John
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Feb 15, 2010 2:14 am

I tried to write a program to test the two methods. Unfortunately these tests do not flush the disk buffers and so are not a real test, but they do give some indication of timing. The program is:
Code:
!     Last change:  JDC  15 Feb 2010   12:01 pm
!  program to open a file and append to the end
!
      integer*4, parameter :: num = 400000
      integer*4 i, dn
      REAL*8    elapse(0:num)
!
      open  (11, file=  'f11_seq.txt')
      close (11, status='delete')
!
      open  (11, file=  'f11_dir.txt')
      close (11, status='delete')
!
!   Sequential file
      dn = num/10
      CALL ELAPSE_SECOND (elapse(0))
      do i = 1,num
         if (mod(i,dn) == 0) write (*,*) 'Seq', i, elapse(i-1)-elapse(i-dn)
         CALL ELAPSE_SECOND (elapse(i))
         call Update_seq_file (i)
      end do
!
!   Log every 10th time stamp, then release unit 11 for the next test
      open (unit=11,file='update_seq.log')
      do i = 1,num,10
         write (11,1002) i, elapse(i), elapse(i) - elapse(i-1)
      end do
      close (unit=11)
      write (*,*) 'Sequential', elapse(num)-elapse(0)
!
!   Direct file
      CALL ELAPSE_SECOND (elapse(0))
      do i = 1,num
         if (mod(i,dn) == 0) write (*,*) 'Dir', i, elapse(i-1)-elapse(i-dn)
         CALL ELAPSE_SECOND (elapse(i))
         call Update_dir_file (i)
      end do
!
      open (unit=11,file='update_dir.log')
      do i = 1,num,10
         write (11,1002) i, elapse(i), elapse(i) - elapse(i-1)
      end do
      write (*,*) 'Direct', elapse(num)-elapse(0)
!
1002  format (i7,f14.5,f14.7)
      end

      subroutine update_seq_file (irec)
!
      integer*4 irec
      character date_time_buffer*25
      integer*4 ii(11,11), i,j,k
!
      open (11,file='f11_seq.txt', access='append')
!
      do i = 1,11
       do j = 1,11
         ii(i,j) = i+j+2
       end do
      end do
!
      call date_time_string (Date_Time_buffer)
      write (11,1000) 'new record',irec,' at '//date_time_buffer
      do k = 1,4
       do i = 1,11
         write (11,1001) k,i,(II(I,J),J=1,11)
       end do
      end do
      close (unit=11)
!
1000  FORMAT (/a,i8,a)
1001  format (2i4,12i6)
      end subroutine

      subroutine update_dir_file (irec)
!
      integer*4, parameter :: rec_len = 848
      integer*4 irec
      integer*4 record(rec_len), rec_num, iostat
!
      open (11, file='f11_dir.txt', access='direct', recl=rec_len*4, iostat=iostat)
!
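!    get number of records: record 1 is the header and
!    record(1) holds the number of the last record written
      READ (UNIT=11, REC=1, IOSTAT=IOSTAT) record
!
!    a failed read means the file is new, so start the count at 1
      if (iostat /= 0) then
         record = 1
      end if
!
      if (record(1) /= irec) write (*,*) 'unexpected irec value', record(1), irec
!
!    the new record goes immediately after the last one
      rec_num = record(1) + 1
      record  = 1
      record(1) = rec_num
!
!    update the header record, then write the new data record
      write (UNIT=11, REC=1, IOSTAT=IOSTAT) record
!
      write (UNIT=11, REC=rec_num, IOSTAT=IOSTAT) record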
      close (unit=11)
!
      end subroutine
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Feb 15, 2010 2:17 am

The library routines are (shame that preview and submit don't have the same size limit):
Code:
      subroutine date_time_string (Date_Time_buffer)
!
!     returns the date in the form 11-Jan-00 hh:mm:ss.sss
!
      character (len=*),        intent (out)    :: date_time_buffer
      CHARACTER LABEL(12)*3, CBUF*22
      STDCALL   GETLOCALTIME 'GetLocalTime' (REF)
      integer*2 ia(8), YEAR,MONTH
      DATA LABEL / 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',            &
     &             'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec' /
!
      call GetLocalTime (ia)
!
!     ia_name(1) = 'year'
!     ia_name(2) = 'month'
!     ia_name(3) = 'day of week'
!     ia_name(4) = 'day'
!     ia_name(5) = 'hour'
!     ia_name(6) = 'minute'
!     ia_name(7) = 'second'
!     ia_name(8) = '.001_sec' (accurate only to 1/60 second)
!
      year  = mod (ia(1),100)
      month = ia(2)
!
      WRITE (CBUF,1009) ia(4), LABEL(MONTH), YEAR, ia(5), ia(6), ia(7),ia(8)
!
      Date_Time_Buffer = CBUF
      RETURN
!
 1009 FORMAT (I2,'-',A3,'-',I2.2, i3,':',i2.2,':',i2.2,'.',i3.3)
      END

!---------SUBROUTINE--ELAPSE_SECOND-----------------------------J.D.C. INC-----
!
      SUBROUTINE ELAPSE_SECOND (ELAPSE)
!
!     Returns the total elapsed time in seconds
!     based on QueryPerformanceCounter
!     This is the fastest and most accurate timing routine
!
      real*8,   intent (out) :: elapse
!
      STDCALL   QUERYPERFORMANCECOUNTER 'QueryPerformanceCounter' (REF):LOGICAL*4
      STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
!
      real*8    :: freq  = 1
      logical*4 :: first = .true.
      integer*8 :: start = 0
      integer*8 :: num
      logical*4 :: ll
      integer*4 :: lute
!
!   Calibrate this time using QueryPerformanceFrequency
      if (first) then
         num   = 0
         ll    = QueryPerformanceFrequency (num)
         freq  = 1.0d0 / dble (num)
         call get_echo_unit (lute)
         WRITE (lute,*) 'Elapsed time counter :',num,' ticks per second'
         start = 0
         ll    = QueryPerformanceCounter (start)
         first = .false.
      end if
!
      num    = 0
      ll     = QueryPerformanceCounter (num)
      elapse = dble (num-start) * freq
      return
      end
 
      subroutine get_echo_unit (lute)
      integer*4 lute
      lute = 1
      end
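
ELAPSE_SECOND above relies on the Windows API. Where that is not available, a rough equivalent can be sketched with the standard SYSTEM_CLOCK intrinsic. This assumes the compiler accepts 64-bit integer arguments to SYSTEM_CLOCK (not all older compilers do), and the resolution is whatever the compiler provides, so it is not a drop-in replacement.

```fortran
program timer_demo
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit counter (assumption: supported)
   integer(i8) :: t0, t1, rate
   real(kind(1.0d0)) :: elapsed, s
   integer :: i

   ! Ticks per second of the clock behind SYSTEM_CLOCK
   call system_clock (count_rate=rate)

   ! Time a small piece of work
   call system_clock (count=t0)
   s = 0.0d0
   do i = 1, 1000000
      s = s + sqrt(dble(i))
   end do
   call system_clock (count=t1)

   elapsed = dble(t1 - t0) / dble(rate)
   print *, 'ticks per second :', rate
   print *, 'elapsed seconds  :', elapsed
   print *, 'checksum         :', s      ! printed so the loop is not optimised away
end program timer_demo
```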
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Mon Feb 15, 2010 2:24 am

And the run log I got from my PC was:
Code:
 Elapsed time counter :           3166720000 ticks per second
 Seq       40000          15.4560447065   
 Seq       80000          24.7551289164   
 Seq      120000          24.2236882661   
 Seq      160000          24.5701263193   
 Seq      200000          20.4406994556   
 Seq      240000          21.9553754124   
 Seq      280000          22.8299788671   
 Seq      320000          18.0617328416   
 Seq      360000          15.9047791112   
 Seq      400000          16.5379037240   
 Sequential          204.739364586   
 Dir       40000          19.0621267428   
 Dir       80000          19.2571968103   
 Dir      120000          19.2111123453   
 Dir      160000          19.1347387199   
 Dir      200000          19.6607339547   
 Dir      240000          20.0625909957   
 Dir      280000          19.5169702471   
 Dir      320000          19.8339843788   
 Dir      360000          19.6213950444   
 Dir      400000          20.0680457979   
 Direct          195.433487027   


These times do not show the delay if disk buffers were cleared but they do show similar times for the 2 approaches.
I have kept the record sizes similar.

I have not understood from your description :-
how many times the files would be updated in a day,
what error recovery may be required, or
if each file is limited to a single process.

All these could make the program more complex, but still workable.
I do think a database of file names would help. Using files8@ can help with this.
I hope this helps

John
DanRRight



Joined: 10 Mar 2008
Posts: 2818
Location: South Pole, Antarctica

Posted: Mon Feb 15, 2010 6:08 am

John, special thanks, there couldn't be better help than a piece of Fortran text.
DanRRight



Joined: 10 Mar 2008
Posts: 2818
Location: South Pole, Antarctica

Posted: Wed May 19, 2010 9:14 pm

I also found one more useful thing in John's code which might interest everyone. I re-wrapped the QueryPerformanceFrequency function separately into CPUclockGHz and placed it in a simple ClearWin demo program. It returns the performance counter frequency as a real*8 value in GHz, which matches the clock of your processor on PCs where the counter runs at the CPU clock rate (as in John's log, 3166720000 ticks per second).
Code:
use clrwin
real*8 CPUclockGHz; external CPUclockGHz
   i=winio@('CPU Clock, GHz %rf%ac[esc]', CPUclockGHz(),'exit')
end

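real*8 function CPUclockGHz()
!  Returns the QueryPerformanceFrequency value (ticks per second) scaled to GHz
   integer*8 :: freq
   logical*4 :: al     
   STDCALL   QUERYPERFORMANCEFREQUENCY 'QueryPerformanceFrequency' (REF):LOGICAL*4
   al = QueryPerformanceFrequency (freq)
   CPUclockGHz = freq/1.0d9      ! double precision constant keeps full real*8 precision
end function CPUclockGHz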


Last edited by DanRRight on Thu May 20, 2010 7:00 pm; edited 6 times in total
All times are GMT + 1 Hour
Page 1 of 2