forums.silverfrost.com

DanRRight

What do you folks think is the best and fastest way to add a new piece of data to a growing file? Additions will be made once per day, so for clarity let's call it a "multiday database file". Over time this file will become very large, a GB or so. And there will be many thousands of such files, so I have to care about doing this the right way.

I guess the standard way would be just reading this multiday database file until its end and adding the new ("today") portion at the end. But the disadvantages are that it will take a lot of time to read as the file grows, and that the newer data (which will be used more often) will be placed at the end of the multiday database file.

So adding at the end or at the beginning is better, simpler, more reliable?

If adding at the beginning is preferable, what is the best way to do that? Would it be with the APPEND file attribute (the large database file would then have to be appended to today's smaller portion, not vice versa, which ensures that the newer additions are at the beginning)? Does the APPEND option work reliably? (For many years I have had a strange unresolved problem: sometimes, after reading a file up to the end, calling BACKSPACE once and writing an addition to the file, I get one line missing, and I do not know the reason.)

jjgermis · Posted: Wed Feb 10, 2010 6:13 am

My first thought would have been APPEND as well. However, I do not have any experience with the task that you have in mind. Are you working with text files and do they have a special format? What is the reason for keeping the database growing? I assume that you would like to keep to your present format and do not want to change your database format.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

An alternative is to use a direct access file, which requires fixed length records.

Each record could be designed to have a day's information, except the first record which indicates the number of active records, and any other information you may require, such as the date of the last record update.
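A minimal sketch of this layout, written in Python rather than Fortran only to keep it short and easy to run. The record length `RECLEN`, the 8-byte count stored in the header, and the helper names `create`, `append_record` and `read_record` are all assumptions made for illustration and are not taken from any code in this thread. In Fortran the same idea would use a file opened with ACCESS='DIRECT' and a fixed RECL.

```python
import os
import struct

RECLEN = 64  # assumed fixed record length in bytes (hypothetical value)

def create(path):
    """Create the file with a header record whose record count is zero."""
    with open(path, "wb") as f:
        f.write(struct.pack("<q", 0).ljust(RECLEN, b"\0"))

def read_count(f):
    """Read the number of active data records from the header record."""
    f.seek(0)
    return struct.unpack("<q", f.read(8))[0]

def append_record(path, payload):
    """Write payload as the next data record, then update the header count."""
    if len(payload) > RECLEN:
        raise ValueError("payload longer than the record length")
    with open(path, "r+b") as f:
        n = read_count(f)
        f.seek((n + 1) * RECLEN)      # the header occupies the first record
        f.write(payload.ljust(RECLEN, b"\0"))
        f.flush()
        os.fsync(f.fileno())          # push the data out before the count changes
        f.seek(0)
        f.write(struct.pack("<q", n + 1))
    return n + 1

def read_record(path, k):
    """Return data record k (1-based) without reading any other data record."""
    with open(path, "rb") as f:
        if not 1 <= k <= read_count(f):
            raise IndexError(k)
        f.seek(k * RECLEN)
        return f.read(RECLEN).rstrip(b"\0")   # sketch only: strips padding bytes
```

Because the count is updated only after the new record has been written, a crash in the middle of an update would, under these assumptions, leave the header pointing at the last complete record.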

You can then have a utility program which can analyse the file and check the length if required.

I'm not sure which is the best approach. On some O/S, the file allocation table can indicate where the end of the file is, but I suspect access=append may read the whole file. You could try both methods with a file of at least 1 GB and see how long each takes to update the file.

John

IanLambley · Joined: 17 Dec 2006 Posts: 506 Location: Sunderland

Dan,
If it is a sequential file, then opening it with the APPEND option should position the pointer so that the next write puts data after the last record.
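As a rough illustration of that behaviour, here is a small Python sketch (not Fortran, and not FTN95-specific). Opening in append mode places the pointer at the end of the file without reading what is already there, so the next write lands after the last existing record. The helper name `append_day` is hypothetical. In standard Fortran the corresponding OPEN specifier is POSITION='APPEND'; whether a given compiler scans the file to get there is exactly the open question in this thread.

```python
import os

def append_day(path, lines):
    """Append today's lines to the file; return the byte offset where they start."""
    with open(path, "ab") as f:            # append mode creates the file if missing
        start = f.seek(0, os.SEEK_END)     # pointer at end, nothing read
        for line in lines:
            f.write(line.encode("ascii") + b"\n")
    return start
```

The returned offset equals the previous file size, which shows that the existing records are left untouched.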

Reading until the end of file and then doing a backspace is likely to position the pointer at the start of the last existing record, which would then be overwritten by the next write statement, hence losing the last line of the existing file. Instead, reading to the end, backspacing, performing one more read and then writing the new data may be the solution.

If you use direct access files and you don't know how many records are in the file, perform a dummy read on record 1 and look at the iostat value. If zero, then dummy read record 2, 4, 8, 16 etc until you get a fail, then use the successive approximation technique (error halving) used in an analogue to digital converter to find the number of existing records, add 1 and put in the new records. I would send you a routine to do this, but I'm not proud of it. By a dummy read, I mean just use
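The doubling-then-halving search described in this post might look roughly like the following Python sketch. It is not Ian's routine. The function name `count_records` and the `readable(k)` callback are assumptions: `readable(k)` stands in for a dummy direct-access read of record k that reports whether IOSTAT came back as zero. The sketch also assumes that records 1..n all exist and that nothing beyond n can be read.

```python
def count_records(readable):
    """Return the number of existing records, given readable(k) -> bool."""
    if not readable(1):
        return 0
    lo = 1                       # highest record known to exist
    hi = 2
    while readable(hi):          # probe records 2, 4, 8, 16, ... until one fails
        lo = hi
        hi *= 2
    # Invariant from here on: record lo exists, record hi does not.
    while hi - lo > 1:           # halve the interval, as in successive approximation
        mid = (lo + hi) // 2
        if readable(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

The new data would then go into record `count_records(...) + 1`. The number of probes grows with the logarithm of the record count, so even a file with a million records needs only a few dozen dummy reads.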

DanRRight · Posted: Thu Feb 11, 2010 10:51 pm

Great suggestions, thank you all.

I barely remember direct access files and have not worked with one in the last 20 years. Is it safe? Say the computer crashes during the write, what will happen? With the sequential file I probably lose only the "today" portion. Will I lose the whole database with the direct access approach?

The database will contain detailed tick-by-tick stock market prices for all existing stocks.

Also thanks for the suggestion to fix the backspace annoyance. I've spent a lot of time on it in the past and gave up. I do not remember what I haven't tried... We'll see what happens now; it would be nice to fix this damn thing.

JohnHorspool · Joined: 26 Sep 2005 Posts: 270 Location: Gloucestershire UK

Dan,

Why not have two files, today's and yesterday's? On the next day you initially have yesterday's file and the one from the day before. Then each day, delete the oldest file, create a new file, copy the contents of yesterday's file into today's file and add on today's portion. Repeat every day. Since you are only ever reading yesterday's file (to write into today's file), I don't think that it can be harmed by a read. Or am I talking nuts?

John

IanLambley · Joined: 17 Dec 2006 Posts: 506 Location: Sunderland

John,
That is probably the standard method - the old sort and merge system.
The existing file is merged with the sorted "day file" by interleaving them together into a new file so that the resulting combination file is in the desired order, and you have only had to sort the shorter updated "day file".
It is a commercial processing thing (e.g. COBOL), rather than scientific Fortran.
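A minimal Python sketch of the sort-and-merge idea, assuming each row carries a sortable key such as a timestamp. The function name `merge_day_into_master` and the row layout are hypothetical. Only the short day file is sorted; the long existing file is assumed to be in order already and is streamed through once.

```python
import heapq

def merge_day_into_master(master_rows, day_rows, key):
    """Yield all rows in key order; master_rows must already be sorted by key."""
    day_sorted = sorted(day_rows, key=key)    # sort only the short "day file"
    return heapq.merge(master_rows, day_sorted, key=key)
```

For example, merging a master of `[(1, "a"), (3, "c"), (5, "e")]` with a day file of `[(4, "d"), (2, "b")]` on the first field interleaves the two into a single ordered stream, which would then be written out as the new combined file.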
Ian

DanRRight · Posted: Fri Feb 12, 2010 10:54 pm

John,

On average we have 200 MB per year per stock; 10000 stocks = 2 TB per year.

The way you are proposing to operate might be safe, but there is one other thing: reading and writing 2 TB of data could become a never-ending job. With a read+write+processing speed of probably 20 MB/sec it will take

2x10^12 / 20x10^6 = 10^5 seconds

more than 86400 seconds, i.e. more than a whole day... I'd hope APPEND would just patch the new file onto the database (or vice versa) without reading the whole damn thing. Is that how APPEND works?
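The estimate above can be checked with a few lines of Python. The figures are the ones quoted in this post (200 MB per stock per year, 10000 stocks, roughly 20 MB/sec); the variable names are just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60              # 86400

bytes_per_stock_per_year = 200e6            # 200 MB, as quoted in the post
stocks = 10_000
throughput = 20e6                           # assumed 20 MB/sec read+write+processing

total_bytes = bytes_per_stock_per_year * stocks   # 2e12 bytes, i.e. 2 TB
seconds = total_bytes / throughput                # 100000.0 seconds
days = seconds / SECONDS_PER_DAY                  # about 1.16 days

print(seconds, days)
```

So a full copy of one year's worth of data would take longer than the one-day interval between updates, which is why a copy-everything scheme does not scale here.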

JohnHorspool · Joined: 26 Sep 2005 Posts: 270 Location: Gloucestershire UK

Dan,

I think you will have to seriously consider John Campbell's suggestion of using a direct access file. You can certainly grow the file without having to read through all of it. The downside is the fixed record length, so choosing an appropriate record length is a critical decision. Keeping track of the number of records shouldn't be a problem: as John suggested, just store the count as part of the first record and update it when you add more records at the end of the file.

Even if you get a power cut when updating the file, just write a recovery program to read through the file record by record writing each recovered record to a new file, with the process stopping when IOSTAT hits an error reading the corrupted file.
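A recovery pass of this kind could be sketched as follows, in Python for brevity. The function name `recover` and its arguments are hypothetical, and the sketch treats "corrupted" in the simplest sense: a trailing record that was only partially written. A short read plays the role of the IOSTAT error mentioned above.

```python
def recover(src_path, dst_path, reclen):
    """Copy complete fixed-length records from src to dst; return how many."""
    good = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            rec = src.read(reclen)
            if len(rec) < reclen:     # end of file, or a partially written record
                break
            dst.write(rec)
            good += 1
    return good
```

If the file keeps a record count in its first record, that count would also need to be rewritten in the recovered file to match the number of records actually copied.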

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

Dan,

Even with a direct access file, when opening the file and writing past the last record, we do not know how Windows locates the position of the last record on the disk. Presumably it will access the FAT to get the file location. We do not know how efficiently the Windows O/S locates the end-of-file position so that it can update the file.
Alternatively you could just have new files each day (or week), then have other utility programs that run once a day (or week) to merge the new files.
In Windows, a direct access file does not have a significantly different structure to a sequential access file. There is no concept of block size or contiguous records on disk, as there was with older O/S such as Digital VAX. There is no guarantee they were any better either.
Another option may be to find the size of the file (files@ will give this) then respond appropriately, either seeding a new file or appending to the old one.
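A small Python sketch of that decision, assuming fixed-length records. `os.path.getsize` stands in for the file-size query (files@ in the post), and the function name `plan_update` and its return values are made up for illustration. The size comes from the file system's metadata, so no data records are read.

```python
import os

def plan_update(path, reclen):
    """Return (action, next_record): seed a new file or append to the old one."""
    size = os.path.getsize(path) if os.path.exists(path) else 0
    if size == 0:
        return "seed", 1                      # no file yet: start at record 1
    return "append", size // reclen + 1       # next record after the existing ones
```

With a 64-byte record length, a 192-byte file holds three records, so the next write would go to record 4.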
It would not be a big overhead to have a table of file names and identify what period of time they relate to. Writing a database management system to keep track of all the files would not be very hard. Depending on the file size, you could choose to create a new file at any suitable time. (Last Thursday I wrote a paging system to handle a 6 GB sparse array by keeping a map of active pages. I was surprised how easy it was, and it works as fast as the earlier 1.2 GB non-sparse array version.)
You really need to write a program that tests how quickly you can update a 1 GB sequential file and a 1 GB direct access file.

Good luck !!

John

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

I tried to write a program to test the two methods. Unfortunately these tests do not flush the disk buffers and so are not a real test, but they do give some indication of timing. The program is:

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

The library routines are (shame that preview and submit don't have the same size limit)

JohnCampbell · Joined: 16 Feb 2006 Posts: 2615 Location: Sydney

and the run log I got from my PC was:

DanRRight · Posted: Mon Feb 15, 2010 6:08 am

John, special thanks, there couldn't be better help than a piece of Fortran text.

DanRRight · Posted: Wed May 19, 2010 9:14 pm

I also found one more useful thing in John's code which might be interesting to everyone. The function QueryPerformanceFrequency, which I re-wrapped separately into CPUclockGHz and placed in a simple demo Clearwin program, returns a real*8 value of your processor's clock in GHz.