forums.silverfrost.com

silicondale

I am leading a project team which includes C and C++ programmers as well as Silverfrost FTN95 (in Windows, not .Net). We shall have very large volumes of data to process, and are looking at holding everything in a DBMS: the leading candidate currently is PostgreSQL. It seems that F90SQL is discontinued, and ForDBC isn't supported for recent FTN95 versions.

Is there any current library that works with Silverfrost FTN95 8.x for ODBC database interfacing ?

The alternative to a DBMS if this is ruled out would be NetCDF or HDF5 self-describing data formats. So the same question - is there any library available that will handle these under FTN95 8.x ? Both data formats are open-source, and C/C++ libraries seem to be available. They say Fortran is supported and NetCDF even supplies some Fortran source code, but it seems a mammoth task to build and verify the library ourselves.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

This is a rather startling question to pose in this forum. Someone launching a "big data" team project would have spent some time defining the task and doing research on appropriate tools for the task. A specialty compiler (FTN95) for a niche language (Fortran) seems to be an unjustifiable choice.

How large is "large volumes of data"? What is its form, and will you continue to receive more data in the future that you will have to add to the current data? What is the information that you wish to extract, and how are the results to be presented or transferred to others? Is Fortran a suitable language for any of this work, and why?

An RDBMS is not necessarily the system of choice. If, indeed, an SQL database is shown to be appropriate, you have to subject your project to a "normalization" study. See, for example, https://en.wikipedia.org/wiki/Codd%27s_12_rules .

HDF/NetCDF files ? What for? These files are used in a small number of fields (climate data, for one). Having "this file contains totally disorganized data" as the self-description in a header record does not magically make the data usable.

That said, you can try out NetCDF with FTN95 (32-bit). At present, you can only make F77 style calls to NetCDF from FTN95, using the DLLs in the Windows NetCDF binary distribution.

You cannot create the necessary module files from the source distribution to make F9X calls to NetCDF since FTN95 does not yet support the ISO_C_BINDING feature of Fortran 2003.

Unidata does not provide a binary distribution of netCDFF. You can try out an incomplete but usable version at https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/676006#comment-1882319 .

silicondale · Posted: Sun Oct 30, 2016 3:17 pm Post subject:

Maybe startling, but nonetheless rational. Am doing the research now, before we make any coding decisions! So far we have no data. Volumes will be up to a few terabytes, including point-clouds with multiple attributes, sets of images, and time-stamped physical/chemical sensor data. Image data will probably be in 'raw' format but not planned to process it in Fortran (we have C/C++ software for this).

Reason for using Fortran is ability to take advantage of a considerable amount of legacy code for various statistical applications. Not exactly a "niche language" - also I have been using Fortran since 1970 and have a large library of legacy code.

Agreed, a DBMS not necessarily ideal, though if we are going to use a DBMS it should be relational. I have been working with RDBMS for most of that time and indeed have been involved in development of RDBMS systems (for specialist applications) since the 1970s. Am fully aware of the various normal-form requirements.

From your reply about NetCDF it seems that if we wish to use the legacy Fortran code, then NetCDF is probably not the way to go. The self-description text would be standardised across the project, and principal purpose of using one of these standards is to avoid errors resulting from different representations of numeric data across different operating systems and languages (ROS, Linux, Windows; Fortran, C, C++, Python). A tradeoff between coding the I/O for the interface format and new coding or translation for the applications.

Anyway - your reply is quite helpful. Thanks. Probably eliminates NetCDF and HDF5 from consideration.

mecej4 · Joined: 31 Oct 2006 Posts: 1886

silicondale · Posted: Sun Oct 30, 2016 6:01 pm Post subject:

Thanks. Good point! Hadn't considered Cygwin but indeed it would make a good testbed environment.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

mecej4 · Joined: 31 Oct 2006 Posts: 1886

Silicondale: whether you agree with the conclusions or not, you may find the narrative at http://cyrille.rossant.net/moving-away-hdf5/ instructive. You may be able to forecast whether or not you can avoid some of the pitfalls that are listed there.

silicondale · Posted: Mon Oct 31, 2016 11:29 am Post subject:

Many thanks, both! This is very helpful.

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Thanks mecej4,

That link was very interesting.
It is amazing what other projects have developed.
Stepping away must have been a difficult decision.
It is hard to drop something that works some of the time.

For data analysis, my approach has been to start simple and try to build from that.
I often work with new/derived data sets that are reproducible from the original (as received data) via a rewritten data filter program, as this is easy to audit or change your applied corrections.

DanRRight · Posted: Tue Nov 08, 2016 3:30 am Post subject:

Have anyone done testing in Fortran that since HDF5 files are compressed is reading speed way faster ? The slowness of reading of few tens of GB size ASCII files annoys me like hell. But if there exist way to read ZIPed files directly and get 4-5 times speed boost then there will be good reason not even to consider HDF5

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Dan,

The best solution is to get a SSD. However, when combined with Fortran I/O, there appears to be performance differences between various SSD drives I had tested. I found it difficult to understand the quoted performance of different drives vs actual performance for the few SSD I have tested. My understanding is that it depends on the driver bandwidth for the SSD drive.

In June 2015 my solution was to buy a Samsung 850 Evo 250GB SSD. It compared well in reported performance tests and has good actual performance for Fortran I/O. There are probably better available today, but I would recommend this as a good reference.

I would suspect that if the file is stored in memory in the disk cache, compression will only be an overhead, as decompression takes place in software and not via the disk buffering.
Actually once the file is in the disk cache, what type of disk might not matter and it will depend on the bandwidth for transfers between disk cache memory and program memory, which can vary depending on the O/S.

John

ps: My experience is that SSD is the easiest solution, but if anyone knows better, I would be pleased to edit this post, as the above summary is based on my limited testing.

DanRRight · Posted: Tue Nov 08, 2016 8:12 am Post subject:

John, I tried that. I even checked drives which are up to 10x faster then any SSD like RAMdrives. Conclusion - no big difference in read speed. By the way all of them are not much different in read speed then fast harddrives like WD Black.

Clearly something else is giving bottleneck. Hope future parallelization of reading, or solutions like HDF5 will give a boost. Or something else.

How about other Fortran compilers? Polyhedron never compared them in I/O speed but this becomes crucial as file sizes grow and grow. Anyone has any unusual experience?

JohnCampbell · Joined: 16 Feb 2006 Posts: 2554 Location: Sydney

Dan,

I would have to look back at my I/O testing.

My conclusion was that the SSD driver had a bandwidth limit and this was the problem with other SSDs. I thought I was getting transfer rates at 100's of mega bytes per second (up to 400 mb/sec is possible from memory) but this was disguised by the size of the available memory buffers in comparison to the file size.
At present my PC has 32 gb memory and 250 gb SSD, so it is hard to tell what is happening.
I will resurrect the test and compare with HDD and get back to you when I have some real numbers.

John

mecej4 · Joined: 31 Oct 2006 Posts: 1886

DanRRight · Posted: Fri Nov 11, 2016 3:20 am Post subject:

Would be nice if someone clarified one amazing thing before we plunge into any alternative methods of I/O. For that you have to make a RAMdisk with couple GB size otherwise we will be dealing with the delayed write by the OS which masks actual write (when we will copy large file the OS will report that the file is copied fast or almost instantly but in reality the process goes after that with much slower speed. RAMdisk read/write speed exceeds 5-9 GB per second, see here

https://www.raymond.cc/blog/12-ram-disk-software-benchmarked-for-fastest-read-and-write-speed/ )

Q: Why copying 1-3 GB file takes less then a ~0.5 second while just loading the same data into RAM by Fortran code takes 100-1000x more time ?

Here some kind of really absurd things are happening. The OS reads and then writes the content of the file with the blazing speed while Fortran code is just to read the same bits and bytes doing this like a drunk turtle.

The Fortran code which is supposed to be fast is actually the most terrible bottleneck?!

/* (Playing with this effect make sure to copy before some other large files to other places to kick our initial file out of RAM cache)