forums.silverfrost.com

arctica · Joined: 10 Sep 2006 Posts: 148 Location: United Kingdom

Hello,

I have a quick query. I have a data file which is an ASCII file, and I can read and process the file I/O in text fine, but I was wondering if it is easy to write the output as a binary (real*4) file of just the z-values:

open(10, file=infile, status='old', action='read', position='rewind') ! read the ascii data file
open(11, file=outfile, status='new', action='write')
...

do i = 1, nn
   read(10,*) (row(j),j=1,ne)
   rlon = rlon1
   do j = 1, ne
      if (rlon .gt. 360.0) rlon = rlon1
      !write(11,*) rlat, rlon, row(j) for XYZ GMT
      write(11,*) row(j)
      ! GMT xyz2grd -ZTLa option
      rlon = rlon + dlon
   enddo
   rlat = rlat - dlat
enddo

...

The current code writes the results from row(j) as a stream of ASCII z-values, which is fine, but if the input file gets a lot larger it becomes unwieldy. Can I generate a direct-access, fixed-record-length, unformatted binary output stream of real*4 values without too much work?

Thanks

Lester

Wilfried Linder · Posted: Wed Aug 26, 2015 3:23 pm

Lester, perhaps you can use a construction like this:

IanLambley · Joined: 17 Dec 2006 Posts: 514 Location: Sunderland

Direct access file
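
For illustration only (this is not any poster's original code): a minimal sketch of a direct-access, fixed-record-length, unformatted file holding one real*4 z-value per record. The file name z_values.bin, the unit number, the grid size and the stand-in data are all assumptions. INQUIRE(IOLENGTH=...) is used because the unit of RECL (bytes or words) is compiler-dependent.

```fortran
program write_z_direct
  implicit none
  integer, parameter :: sp = selected_real_kind(6, 37)  ! single precision (real*4 on common compilers)
  integer, parameter :: ne = 4, nn = 3                   ! hypothetical grid: ne columns, nn rows
  integer, parameter :: u = 11                           ! hypothetical unit number
  real(sp) :: row(ne), z
  integer :: i, j, irec, reclen

  ! Ask the compiler how long a record holding one single-precision value is.
  inquire(iolength=reclen) z

  open(u, file='z_values.bin', access='direct', form='unformatted', &
       recl=reclen, status='replace', action='readwrite')

  irec = 0
  do i = 1, nn
     ! Stand-in for "read(10,*) (row(j),j=1,ne)" on the ASCII input file.
     do j = 1, ne
        row(j) = real(10*i + j, sp)
     end do
     ! One z-value per record, written in row order.
     do j = 1, ne
        irec = irec + 1
        write(u, rec=irec) row(j)
     end do
  end do

  ! Any value can be fetched by record number: here row 2, column 3.
  read(u, rec=(2 - 1)*ne + 3) z
  print *, 'records written:', irec
  print *, 'row 2, column 3 =', z

  close(u)
end program write_z_direct
```

Writing a whole row per record (RECL taken from INQUIRE(IOLENGTH=...) on the row array) would cut the number of I/O statements. Whether the resulting file is a bare stream of values with no record markers depends on the compiler, so it is worth checking the file size against 4*nn*ne bytes before handing it to another program.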

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

I would recommend against any binary file format and suggest that you use a simple text file with a fixed format.

If I am preparing data for a program, I use Excel and structure the data in a simple way, then export the data to a .prn file, which is easy to read in Fortran.

Typically when reading data, the first information you need is how much data is being provided, so the first line of data can be a count of fields and number of records. That is easy to generate in Excel.
You can then use ALLOCATE to allocate arrays of sufficient size.
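
A minimal sketch of that header-then-ALLOCATE pattern, assuming a layout in which the first line carries the number of records and the number of fields. The file name sample.prn, the unit number and the sample values are made up, and the program writes its own small input file so that it is self-contained.

```fortran
program read_with_header
  implicit none
  integer, parameter :: u = 10           ! hypothetical unit number
  integer :: nrec, nfld, i, j, ios
  real, allocatable :: table(:,:)

  ! Create a small sample file (stand-in for a .prn file exported from Excel).
  open(u, file='sample.prn', status='replace', action='write')
  write(u, '(2i8)') 3, 2                 ! header: number of records, number of fields
  write(u, '(2f12.3)') 1.0, 2.0
  write(u, '(2f12.3)') 3.0, 4.0
  write(u, '(2f12.3)') 5.0, 6.0
  close(u)

  ! Read the header first, then size the array to match.
  open(u, file='sample.prn', status='old', action='read')
  read(u, *, iostat=ios) nrec, nfld
  if (ios /= 0) stop 'could not read header line'

  allocate(table(nfld, nrec))
  do i = 1, nrec
     read(u, *, iostat=ios) (table(j, i), j = 1, nfld)
     if (ios /= 0) then
        print *, 'bad or missing data at record', i
        stop
     end if
  end do
  close(u)

  print *, 'records read:', nrec, '  fields per record:', nfld
  print *, 'last value  :', table(nfld, nrec)
  deallocate(table)
end program read_with_header
```

Checking IOSTAT on every record also gives a natural place to report which line of the text file is faulty.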

Not knowing how much data is to be supplied is a significant issue to manage, so look after this in your data preparation. I have spent years trying to manage this issue when computer memory was scarce and editors could not manage large datasets.
If your data is so large that this is still a problem, I'd still recommend a pre-processor that assembles the data in a fixed format text file.

Once you have read in the data, you have it all in memory so don't need random access data structures that direct access files provide.

In 2015, the saving in time between text read vs binary read is less than the time it takes to type in the file name.

The other big disadvantage of binary files is that they are difficult to check. Again, any text file with a simple structure can be opened in Notepad or analysed in Excel, especially when you think the data is not as expected.
It is easy to check the data and then devise error tests that can filter out the bad input. (I'm also a big fan of pivot tables, which make data checking a lot easier.)

So my recommendation is:
Design a simple text file data layout. (you can do that in Excel)
Design Fortran arrays that allow you to store the data and use it easily.

John

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

John,

If you have a data set that is perhaps 10 gigabytes, it probably justifies using something more sophisticated than "read(10,*) (row(j),j=1,ne)"

With data of this size, I would still prefer to store it as text, although it depends on how it is supplied.
I would spend some time to look at structuring the data in some way and using a pre-processor to re-format it with either multiple files or an index file that defines the size of the data-sets.
There is also the issue of data errors and data patterns, both of which can be better reviewed with a text image.

I have used this approach with survey files of about 1 GB, which are more portable in text. Handling DOS/Unix text is a lot easier than managing the endian mix of real and integer binary data.

You would also have to wait for FTN95_64 to store all of that in memory.

John

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

John,

Direct access files have been very useful. My FE program I mentioned elsewhere makes extensive use of direct access files.
As you suggest, they are basically very structured files.
The FE program creates many different types of records, each with a known location, so that when the information is required, it can be retrieved "quickly".
My recent program changes have been to transfer more information into memory arrays. Whereas previously I had a direct access disk-based database that transfers quickly to memory, I now have a memory-based database that transfers quickly to cache.

While binary files can still be very useful, data portability and visibility are key features of text files. Both have their advantages.

John

LitusSaxonicum · Posted: Tue Sep 08, 2015 9:05 pm

This thread has strayed into the area of desirable attributes of datafiles, a particular interest of mine.

It seems to me that datafiles for Fortran programs can be either plain text or unformatted (binary) sequential files. There doesn’t seem to be much point in making the files direct access so that the records can be read in any order. If you are creating datafiles from scratch it seems sensible that they should be plain text, and I can see the merits of creating them in a text editor (like Notepad) or in a spreadsheet (like Excel) as in the latter you can generate some of the data according to rules.

Plain text has the advantage that you can create it very easily in the first place, and modify it later at will.

Now imagine that you have a Windows program, created say with FTN95 and Clearwin+, and where the datafile is not created as lists of numbers in an editor, but is derived from work in progress in the application itself, and this may be via some form of graphical input. We want to save the dataset, possibly to open and modify it at some future date, or at least to work with it. The problem now is not so clear cut.

If the basic file format for the program is a plain text file, then once again it is easy to open in an editor to modify it. However, user modification may create errors in the file that render it difficult to read back, or somehow inconsistent, i.e. invalid for the application. If it is saved as a binary file, then direct user modification is pretty much impossible, and when it comes to reading a file that the program created in the first place, it is (subject to the program actually working!) a foregone conclusion that the datafile is both valid and readable. Whatever validity checks are done, they can be much simpler if a binary format is chosen than if a plain text format is chosen.

Moreover, the binary format will preserve an appropriate number of decimal places of precision, whereas the precision is always somewhat compromised in the conversion to and from decimals.
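
A small sketch of that difference, assuming a double precision value written through an F10.3 field (the format mentioned later in this post) and, separately, through an unformatted scratch file on a hypothetical unit 20. A much wider format with enough significant digits can be expected to narrow the gap, so the loss shown here is that of the narrow field, not of text as such.

```fortran
program precision_round_trip
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: x, x_text, x_bin
  character(len=32) :: buf

  x = 1.0_dp / 3.0_dp

  ! Text round trip through a fixed-point field: only 3 decimals survive.
  write(buf, '(f10.3)') x
  read(buf, *) x_text

  ! Binary round trip through an unformatted scratch file: the stored
  ! representation is copied out and back without any decimal conversion.
  open(20, status='scratch', form='unformatted')
  write(20) x
  rewind(20)
  read(20) x_bin
  close(20)

  print *, 'original     :', x
  print *, 'text   error :', abs(x - x_text)
  print *, 'binary error :', abs(x - x_bin)
end program precision_round_trip
```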

Particularly in the case of binary files it is annoying to a user if a file of version 1 of the application cannot be opened in version 2 (well it is to me, my example being CorelDRAW where I created files in versions 1, 2, 3, 4 that are no longer readable without version 5 on the computer, and the current version is 17 – X7 – so you know that 5 isn’t going to work well under Windows 10, as version 5 was contemporary with Windows 98!). A plain text datafile continues to be readable with an editor, and presumably can be massaged into an appropriate format, whereas a binary file may not be, at least not without a lot of codebreaking skill.

Moreover, the convention in Windows is to identify a file by its extension, and to associate some extensions with certain applications via Registry entries. This is all well and good for universal extensions such as .FOR (etc.), but may not be for your chosen default extension. So, being unimaginative, one tries to open .DAT files and discovers that instead of the expected plain text input for a particular application, one has tried to open a virus scanner's set of virus definitions, the people writing that being equally unimaginative.

If in writing a datafile, the program name AND version number is embedded somewhere near the top, it is possible to verify at least that you have opened a file that is intended for your program, AND you can take the necessary steps to accommodate an earlier version layout or content of the data file. Discovering version info from the layout of a datafile without this help is difficult.
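
One way this might look in a plain text file, as a sketch only: the first line carries a program tag and a layout version, and the reader checks both before touching the data. The tag MYSURVEY, the file name survey.dat, the unit number and the version numbers are all hypothetical.

```fortran
program versioned_header
  implicit none
  character(len=8), parameter :: magic = 'MYSURVEY'   ! hypothetical program tag
  integer, parameter :: current_version = 2           ! hypothetical layout version
  integer, parameter :: u = 30                        ! hypothetical unit number
  character(len=8) :: tag
  integer :: version, ios

  ! Writing: the tag and version go on the first line, ahead of any data.
  open(u, file='survey.dat', status='replace', action='write')
  write(u, '(a8,1x,i4)') magic, current_version
  close(u)

  ! Reading: check the tag and version before interpreting anything else.
  open(u, file='survey.dat', status='old', action='read')
  read(u, '(a8,1x,i4)', iostat=ios) tag, version
  close(u)

  if (ios /= 0 .or. tag /= magic) then
     print *, 'not a file written by this program'
  else if (version > current_version) then
     print *, 'file comes from a newer version:', version
  else if (version < current_version) then
     print *, 'older layout, converting from version', version
  else
     print *, 'file accepted, layout version', version
  end if
end program versioned_header
```

The same check works for an unformatted file if the tag and version are written as the first record.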

A particular peril of plain text is the assumption of the likely values that the variables might take. As an example I often operate on (x,y) coordinates in the range -200 to +200 m, accurate to 3 decimal places (mm). Some decades ago I got into the habit of writing them in a format F10.3. Circa 1990 I wrote a program in Fortran to run under DOS to store topographic coordinates in a 2000 x 1000m box. (continued)

LitusSaxonicum · Posted: Wed Sep 09, 2015 10:53 pm

This DOS version, and a later FTN77/FTN95 version certainly served for nearly a quarter century on student field courses. This year, for the first time, the program produced an unreadable file as my colleague chose to use UK National Grid coordinates instead of local coordinates, and those had 6 digits left of the decimal point (i.e. >100km) which robbed the formatted data of the <sp> separator that the program relied upon when reading the coordinates back using a * format descriptor. Lessons learnt (if using plain text):

(a) Make formats even bigger than necessary to contain probable values
(b) Provide explicit separators, e.g. 2(F15.3,',')
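
The failure and the fix can be reproduced with internal files. This is a sketch with made-up coordinate values of National Grid magnitude. In the first case each value fills all ten columns of its F10.3 field, so the two numbers run together and the list-directed read cannot be expected to recover them. In the second case the wider field and the comma keep them apart.

```fortran
program field_separator
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: e, n, e2, n2
  character(len=40) :: line
  integer :: ios

  e = 123456.789_dp      ! made-up easting with 6 digits left of the point
  n = 654321.123_dp      ! made-up northing

  ! Two F10.3 fields: no blank is left between the values.
  write(line, '(2f10.3)') e, n
  print *, '[', trim(line), ']'
  read(line, *, iostat=ios) e2, n2
  print *, 'iostat without separator:', ios

  ! Wider fields plus an explicit comma after each value.
  write(line, '(2(f15.3,'',''))') e, n
  print *, '[', trim(line), ']'
  read(line, *, iostat=ios) e2, n2
  print *, 'iostat with separator   :', ios
  print *, 'values read back        :', e2, n2
end program field_separator
```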

E formats are also troublesome, as one decides the number of significant figures, and thus the absolute precision of the number (say one wants mm) depends on the magnitude of the number, so if for example I had used E15.7, coordinates of 1000m would be stored to millimetric precision, but coordinates of 100,000 m would be to a lesser precision. Thus, returning to the UK National Grid coordinates example, if using this format type, the program would run fine in much of England and Wales, but be useless in Scotland. This is not to get distances from Land's End to John O'Groats accurate to a millimetre, but to get distances between points in that box to have that accuracy. Issues like this impinge on the choice of data storage layout and format.
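
A sketch of the E15.7 effect, using two made-up coordinates held in double precision so that the only loss comes from the format itself: seven significant figures keep the millimetres at about 1 km but not at about 100 km.

```fortran
program e_format_precision
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: near, far, back
  character(len=15) :: buf

  near = 1234.567_dp       ! about 1 km: 7 significant figures reach the mm
  far  = 123456.789_dp     ! about 100 km: 7 significant figures stop at 0.1 m

  write(buf, '(e15.7)') near
  read(buf, *) back
  print *, buf, '  round-trip error (m):', abs(near - back)

  write(buf, '(e15.7)') far
  read(buf, *) back
  print *, buf, '  round-trip error (m):', abs(far - back)
end program e_format_precision
```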

Eddie

JohnCampbell · Joined: 16 Feb 2006 Posts: 2629 Location: Sydney

Eddie,

I agree with all the points you make, as they are valid when considering a data file design. I would say that a well designed data structure should enable editing with minimal chance of making errors.
I have always wanted to put a version number on the data file, but most programs I write only use the current data format, or apply default values to new variables introduced into the analysis.

Binary files are limited to communication between programs or for restart of the same program.

John

LitusSaxonicum · Posted: Thu Sep 10, 2015 4:20 pm

Thanks for the support. I wanted to write the above somewhere.

As for a 'document' created in a Windows program but intended to be reopened somewhere else or at another time, doesn't that count as a 'restart'? And yet its behaviour and attributes are those of a datafile created elsewhere.

Eddie