forums.silverfrost.com Forum Index
Welcome to the Silverfrost forums
 

writing a binary output file from ascii input

 
arctica



Joined: 10 Sep 2006
Posts: 105
Location: United Kingdom

Posted: Wed Aug 26, 2015 2:36 pm    Post subject: writing a binary output file from ascii input

Hello,

I have a quick query. I have an ascii data file which I can read and process as text i/o fine, but I was wondering whether it is easy to write the output as a binary (real*4) file of just the z-values:

open(10, file=infile, status='old', action='read', position='rewind')  ! read the ascii data file
open(11, file=outfile, status='new', action='write')
...

do i = 1, nn
  read(10,*) (row(j), j=1,ne)
  rlon = rlon1
  do j = 1, ne
    if (rlon .gt. 360.0) rlon = rlon1
    !write(11,*) rlat, rlon, row(j)  ! for XYZ GMT
    write(11,*) row(j)
    ! GMT xyz2grd -ZTLa option
    rlon = rlon + dlon
  enddo
  rlat = rlat - dlat
enddo

...

the current code writes the results from row(j) as a stream of ascii z-values, which is fine, but if the input file gets a lot larger it becomes unwieldy. Can I generate a direct access, fixed record-length, unformatted binary output file of real*4 values without too much work?

Thanks

Lester
Wilfried Linder



Joined: 14 Nov 2007
Posts: 314
Location: Düsseldorf, Germany

Posted: Wed Aug 26, 2015 3:23 pm

Lester, perhaps you can use a construction like this:

Code:
integer*2  handle,dummy
integer    rtcode
real*4,allocatable::z_values(:)

! handle must first be obtained by opening the file, e.g. with openw@
allocate(z_values(ne),stat=rtcode)
if (rtcode /= 0) goto ...

do i = 1,nn
  do j = 1,ne
    z_values(j) = row(j)
  end do
  call writef@(z_values,handle,4*ne,dummy)  ! writef@ counts bytes: 4 per real*4
  if (dummy /= 0) goto ...
end do

deallocate(z_values)

Wilfried
IanLambley



Joined: 17 Dec 2006
Posts: 490
Location: Sunderland

Posted: Thu Aug 27, 2015 12:44 pm

Direct access file
Code:

! delete any old version, as a direct-access file is not truncated after the
! last write of this session, so stale records from a longer old file would survive.
! opened with recl=1 first, as any file can be opened for direct access with a 1-byte record.
      open(unit=11,file=outfile, status='unknown',access='direct',recl=1)
      close(unit=11,status='delete')
      open(unit=11,file=outfile, status='unknown',access='direct',recl=4)

      do j=1,n
        write(11,rec=j) row(j)
      enddo
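An alternative worth knowing about, if your compiler supports Fortran 2003 stream access (FTN95 also offers access='transparent' as its own extension), is a headerless byte stream with no record structure at all, which is what GMT's binary -Z reader expects. A minimal sketch, reusing the names from the original post:

```fortran
      ! Fortran 2003 stream access: writes ne consecutive 4-byte reals per row,
      ! with no record-length markers, i.e. a raw real*4 stream.
      open(unit=11, file=outfile, status='replace', access='stream',
     &     form='unformatted')
      do i = 1, nn
        read(10,*) (row(j), j=1,ne)
        write(11) row(1:ne)
      end do
      close(11)
```

With this layout the GMT side would read it with the single-precision flag (xyz2grd -ZTLf rather than -ZTLa), though check the GMT documentation for the exact flag letters.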




JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sun Aug 30, 2015 5:13 am

I would recommend against any binary file format and suggest that you use a simple text file with a fixed format.

If I am preparing data for a program, I use Excel and structure the data in a simple way, then export the data to a .prn file, which is easy to read in Fortran.

Typically when reading data, the first information you need is how much data is being provided, so the first line of data can be a count of fields and number of records. That is easy to generate in Excel.
You can then use ALLOCATE to allocate arrays of sufficient size.

Not knowing how much data is to be supplied is a significant issue to manage, so look after this in your data preparation. I have spent years trying to manage this issue when computer memory was scarce and editors could not manage large datasets.
If your data is so large that this is still a problem, I'd still recommend a pre-processor that assembles the data in a fixed format text file.

Once you have read in the data, you have it all in memory, so you don't need the random access that direct access files provide.

In 2015, the saving in time between text read vs binary read is less than the time it takes to type in the file name.

The other big disadvantage of binary files is that they are difficult to check. Again, any text file with a simple structure can be opened in Notepad or analysed in Excel, especially when you think the data is not as expected.
It is easy to check the data and then devise error tests that can filter out the bad input. (I'm also a big fan of pivot tables, which make data checking a lot easier.)

So my recommendation is:
Design a simple text file data layout. (you can do that in Excel)
Design Fortran arrays that allow you to store the data and use it easily.
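The header-count scheme described above could be sketched like this (the file name and variable names are illustrative, not from any particular program):

```fortran
! First line of the text file carries the field count and record count,
! so arrays can be allocated before reading the data proper.
integer :: nf, nr, i, j, rtcode
real*4, allocatable :: table(:,:)

open(10, file='survey.prn', status='old', action='read')
read(10,*) nf, nr                        ! header: fields, records
allocate(table(nf,nr), stat=rtcode)
if (rtcode /= 0) stop 'allocation failed'
do i = 1, nr
  read(10,*) (table(j,i), j=1,nf)
end do
close(10)
```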

John
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Fri Sep 04, 2015 6:43 pm

Quote:
In 2015, the saving in time between text read vs binary read is less than the time it takes to type in the file name


that's probably a bit of a generalisation, John, especially if the datasets are very big and if, for example, the initial data input is first pre-processed in some way and you then want to save it for use in further runs?

What, for example, if you have an array of 1,000,000 x 1,000 terms?
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sat Sep 05, 2015 1:38 am

John,

If you have a data set that is perhaps 10 gigabytes, it probably justifies using something more sophisticated than "read(10,*) (row(j),j=1,ne)".

With data of this size, I would still prefer to store it as text, although it depends on how it is supplied.
I would spend some time to look at structuring the data in some way and using a pre-processor to re-format it with either multiple files or an index file that defines the size of the data-sets.
There is also the issue of data errors and data patterns, both of which can be better reviewed with a text image.

I have used this approach with survey files of about 1 GB, which are more portable as text. Handling dos/unix text differences is a lot easier than managing the endian mix of real and integer binary data.

You would also have to wait for FTN95_64 to store all of that in memory.

John
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Sat Sep 05, 2015 10:58 am

Thanks John,
I'm too of the 'if you can see it, it's best' brigade.
The difference between what you've said above and what I'm usually trying to do is that you're talking about preparing your own data, whereas I'm usually reading someone else's data, sifting out the useful parts and putting them into a format better suited to my use.
Having had a quick look at 'direct access' files (never having used them before) and how they might be useful, I'm a bit surprised: 'direct access' is something of a misnomer in my eyes, since you still have to know quite a bit about the structure of the file and how the data is written!
If, for example, you have a hundred thousand 'rows' of data, each corresponding to a different entity, perhaps with its 'identifier' (a name or number) in the first field, then to access a particular item directly you still have to read that first field row by row until you reach the 'row' you're looking for.
How can you actually 'jump' to the precise data you're after? You need an index of some kind, I guess, which requires reading and pre-processing ALL the data first, which maybe defeats the object, unless you plan to read some data over and over again.
I guess the main 'time saver' with direct access files is that they are binary, which is where the saving is? So maybe a sequential binary file is the most 'efficient' answer in computing terms in most cases.
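For what it's worth, a sequential unformatted file is the simplest binary option: one write per record, though each record is bracketed by compiler-dependent length markers, so it is compact but not a raw byte stream. A sketch, reusing the names from the first post:

```fortran
! Unformatted sequential output: one binary record per row of z-values.
! Each record carries hidden length markers added by the runtime.
open(11, file=outfile, status='replace', form='unformatted')
do i = 1, nn
  write(11) (row(j), j=1,ne)
end do
close(11)
```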
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Sat Sep 05, 2015 11:39 am

John,

Direct access files have been very useful. My FE program I mentioned elsewhere makes extensive use of direct access files.
As you suggest, they are basically very structured files.
The FE program creates many different types of records, each with a known location, so that when the information is required, it can be retrieved "quickly".
My recent program changes have been to transfer more information into memory arrays. Whereas previously I had a direct-access disk-based database that transfers quickly to memory, I now have a memory-based database that transfers quickly to cache.

While binary files can still be very useful, data portability and visibility are key features of text files. They all have their advantages.
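The record-addressing idea can be sketched as follows: with fixed-length records, no scanning or index lookup is needed, because the record number is computed by arithmetic. The file name, record layout and numbers here are purely illustrative (recl is in bytes, as in the direct-access example earlier in the thread):

```fortran
! Fixed-length records make any item addressable by arithmetic:
! a table starting at record 'base' has item k at record base+k-1.
real*4 :: rec_data(8)                  ! one 32-byte record: 8 real*4 values
integer :: base, k
base = 100                             ! where this table starts in the file
k = 7                                  ! the item we want
open(12, file='model.dat', access='direct', recl=32, form='unformatted')
read(12, rec=base+k-1) rec_data        ! one positioned read, no scanning
close(12)
```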

John
John-Silver



Joined: 30 Jul 2013
Posts: 1520
Location: Aerospace Valley

Posted: Mon Sep 07, 2015 9:33 am

Not quite sure I understand what you mean by a 'memory-based database', John, but it sounds like you've found the course for your particular horse.
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Tue Sep 08, 2015 9:05 pm

This thread has strayed into the area of desirable attributes of datafiles, a particular interest of mine.

It seems to me that datafiles for Fortran programs can be either plain text or unformatted (binary) sequential files. There doesn’t seem to be much point in making the files direct access so that the records can be read in any order. If you are creating datafiles from scratch it seems sensible that they should be plain text, and I can see the merits of creating them in a text editor (like Notepad) or in a spreadsheet (like Excel) as in the latter you can generate some of the data according to rules.

Plain text has the advantage that you can create it very easily in the first place, and modify it later at will.

Now imagine that you have a Windows program, created say with FTN95 and Clearwin+, and where the datafile is not created as lists of numbers in an editor, but is derived from work in progress in the application itself, and this may be via some form of graphical input. We want to save the dataset, possibly to open and modify it at some future date, or at least to work with it. The problem now is not so clear cut.

If the basic file format for the program is a plain text file, then once again it is easy to open in an editor to modify it. However, user modification may create errors in the file that render it difficult to read back, or somehow inconsistent, i.e. invalid for the application. If it is saved as a binary file, then direct user modification is pretty much impossible, and when it comes to reading a file that the program created in the first place, it is (subject to the program actually working!) a foregone conclusion that the datafile is both valid and readable. Whatever validity checks are done, they can be much simpler if a binary format is chosen than if a plain text format is chosen.

Moreover, the binary format will preserve an appropriate number of decimal places of precision, whereas the precision is always somewhat compromised in the conversion to and from decimals.

Particularly in the case of binary files it is annoying to a user if a file of version 1 of the application cannot be opened in version 2 (well it is to me, my example being CorelDRAW where I created files in versions 1, 2, 3, 4 that are no longer readable without version 5 on the computer, and the current version is 17 – X7 – so you know that 5 isn’t going to work well under Windows 10, as version 5 was contemporary with Windows 98!). A plain text datafile continues to be readable with an editor, and presumably can be massaged into an appropriate format, whereas a binary file may not be, at least not without a lot of codebreaking skill.

Moreover, the convention in Windows is to identify a file by its extension, and to associate some extensions with certain applications via Registry entries. This is all well and good for such universal extensions as .FOR (etc), but may not be with your chosen default extension. So, for the unimaginative, one tries to open .DAT files and discovers that instead of the expected plain text input for a particular application, you have instead tried to open a virus scanner's set of virus definitions – the people writing that being equally unimaginative.

If in writing a datafile, the program name AND version number is embedded somewhere near the top, it is possible to verify at least that you have opened a file that is intended for your program, AND you can take the necessary steps to accommodate an earlier version layout or content of the data file. Discovering version info from the layout of a datafile without this help is difficult.

A particular peril of plain text is the assumption of the likely values that the variables might take. As an example I often operate on (x,y) coordinates in the range -200 to +200 m, accurate to 3 decimal places (mm). Some decades ago I got into the habit of writing them in a format F10.3. Circa 1990 I wrote a program in Fortran to run under DOS to store topographic coordinates in a 2000 x 1000m box. (continued)
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Wed Sep 09, 2015 10:53 pm

This DOS version, and a later FTN77/FTN95 version, served for nearly a quarter of a century on student field courses. This year, for the first time, the program produced an unreadable file: my colleague chose to use UK National Grid coordinates instead of local coordinates, and those had 6 digits left of the decimal point (i.e. >100 km), which robbed the formatted data of the <sp> separator that the program relied upon when reading the coordinates back using a * format descriptor. Lessons learnt (if using plain text):

(a) Make formats even bigger than necessary to contain probable values
(b) Provide explicit separators, e.g. 2(F15.3,',')

E formats are also troublesome: one decides the number of significant figures, so the absolute precision of the number (say one wants mm) depends on its magnitude. If, for example, I had used E15.7, coordinates of 1,000 m would be stored to millimetric precision, but coordinates of 100,000 m to a lesser precision. Thus, returning to the UK National Grid example, a program using this format type would run fine in much of England and Wales, but be useless in Scotland. This is not about getting distances from Land's End to John O'Groats accurate to a millimetre, but about getting distances between points in that box to that accuracy. Issues like this impinge on the choice of data storage layout and format.
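A minimal illustration of the field-overflow trap described above (the values are invented; real*8 is used so the printed digits are exact):

```fortran
! F10.3 leaves no spare column once a value needs 6 digits before the
! point: 123456.789 fills all 10 columns, so adjacent fields written
! with 2F10.3 run together and a list-directed read-back cannot
! separate them. Wider fields with explicit separators avoid this.
real*8 :: x, y
x = 123456.789d0
y = 654321.123d0
write(*,'(2F10.3)')       x, y   ! fields touch: 123456.789654321.123
write(*,"(2(F15.3,','))") x, y   ! wide fields with explicit commas
```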

Eddie
JohnCampbell



Joined: 16 Feb 2006
Posts: 2554
Location: Sydney

Posted: Thu Sep 10, 2015 9:13 am

Eddie,

I agree with all the points you make, as they are valid when considering a data file design. I would say that a well designed data structure should enable editing with minimal chance of making errors.
I have always wanted to put a version number on the data file, but most programs I write only use the current data format, or apply default values to new variables introduced into the analysis.

Binary files are best limited to communication between programs or to restarting the same program.

John
LitusSaxonicum



Joined: 23 Aug 2005
Posts: 2388
Location: Yateley, Hants, UK

Posted: Thu Sep 10, 2015 4:20 pm

Thanks for the support. I wanted to write the above somewhere.

As for a 'document' created in a Windows program but intended to be reopened somewhere else or at another time: doesn't that count as a 'restart'? And yet its behaviour and attributes are those of a datafile created elsewhere.

Eddie


Powered by phpBB © 2001, 2005 phpBB Group