forums.silverfrost.com Forum Index -> General
Database or NetCDF / HDF5 interfaces?
silicondale



Joined: 15 Mar 2007
Posts: 154
Location: Matlock, Derbyshire

Posted: Sat Oct 29, 2016 11:50 am    Post subject: Database or NetCDF / HDF5 interfaces?

I am leading a project team that includes C and C++ programmers as well as Silverfrost FTN95 developers (working in Windows, not .NET). We shall have very large volumes of data to process, and we are looking at holding everything in a DBMS; the leading candidate currently is PostgreSQL. It seems that F90SQL has been discontinued, and ForDBC is not supported for recent FTN95 versions.

Is there any current library that works with Silverfrost FTN95 8.x for ODBC database interfacing?

The alternative to a DBMS, if that is ruled out, would be the NetCDF or HDF5 self-describing data formats. So the same question: is there any library available that will handle these under FTN95 8.x? Both formats are open standards with open-source C/C++ libraries available. Fortran is said to be supported, and NetCDF even supplies some Fortran source code, but it seems a mammoth task to build and verify the library ourselves.
mecej4



Joined: 31 Oct 2006
Posts: 717

Posted: Sun Oct 30, 2016 1:41 pm

This is a rather startling question to pose in this forum. Someone launching a "big data" team project would be expected to have spent some time defining the task and researching appropriate tools for it. A specialty compiler (FTN95) for a niche language (Fortran) seems an unjustifiable choice.

How large is "large volumes of data"? What is its form, and will you continue to receive more data in the future that you will have to add to the current data? What is the information that you wish to extract, and how are the results to be presented or transferred to others? Is Fortran a suitable language for any of this work, and why?

An RDBMS is not necessarily the system of choice. If, indeed, an SQL database is shown to be appropriate, you have to subject your project to a "normalization" study. See, for example, https://en.wikipedia.org/wiki/Codd%27s_12_rules.

HDF/NetCDF files? What for? These files are used in a small number of fields (climate data, for one). Having "this file contains totally disorganized data" as the self-description in a header record does not magically make the data usable.

That said, you can try out NetCDF with FTN95 (32-bit). At present, you can only make F77-style calls to NetCDF from FTN95, using the DLLs in the Windows NetCDF binary distribution.

You cannot create the necessary module files from the source distribution to make F9X calls to NetCDF since FTN95 does not yet support the ISO_C_BINDING feature of Fortran 2003.

Unidata does not provide a binary distribution of netCDF-Fortran. You can try out an incomplete but usable version at https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/676006#comment-1882319.


Last edited by mecej4 on Sun Oct 30, 2016 7:23 pm; edited 1 time in total
silicondale



Joined: 15 Mar 2007
Posts: 154
Location: Matlock, Derbyshire

Posted: Sun Oct 30, 2016 3:17 pm

Maybe startling, but nonetheless rational. I am doing the research now, before we make any coding decisions! So far we have no data. Volumes will be up to a few terabytes, including point clouds with multiple attributes, sets of images, and time-stamped physical/chemical sensor data. Image data will probably be in 'raw' format, but we do not plan to process it in Fortran (we have C/C++ software for this).

The reason for using Fortran is the ability to take advantage of a considerable amount of legacy code for various statistical applications. It is not exactly a "niche language"; besides, I have been using Fortran since 1970 and have a large library of legacy code.

Agreed, a DBMS is not necessarily ideal, though if we are going to use one it should be relational. I have been working with RDBMSs for most of that time, and indeed have been involved in the development of RDBMS systems (for specialist applications) since the 1970s. I am fully aware of the various normal-form requirements.

From your reply about NetCDF, it seems that if we wish to use the legacy Fortran code, then NetCDF is probably not the way to go. The self-description text would be standardised across the project, and the principal purpose of using one of these standards is to avoid errors resulting from different representations of numeric data across different operating systems and languages (ROS, Linux, Windows; Fortran, C, C++, Python). It is a trade-off between coding the I/O for the interface format and new coding or translation for the applications.
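The cross-language numeric-representation concern above is usually solved with a fixed, explicit binary layout. A minimal Python sketch (the record fields here are hypothetical, not part of the project's design): with an explicit byte order and fixed-size IEEE fields, every language and OS reads the same bytes identically.

```python
import struct

# Hypothetical sensor record: timestamp (float64), reading (float32), id (int32).
# The "<" prefix forces little-endian byte order and standard (unpadded) field
# sizes, so the on-disk layout is identical for every compiler and OS.
RECORD = struct.Struct("<dfi")

blob = RECORD.pack(1477843200.0, 21.5, 42)   # 8 + 4 + 4 = 16 bytes
ts, reading, sensor_id = RECORD.unpack(blob)
```

The same 16-byte layout can then be read from C with a packed struct, or from Fortran with an unformatted stream read; a self-describing format like NetCDF does essentially this at larger scale, with the layout recorded in the file header.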

Anyway - your reply is quite helpful. Thanks. Probably eliminates NetCDF and HDF5 from consideration.
mecej4



Joined: 31 Oct 2006
Posts: 717

Posted: Sun Oct 30, 2016 4:33 pm

Quote:
Anyway - your reply is quite helpful. Thanks. Probably eliminates NetCDF and HDF5 from consideration.

Let us not allow my opinions to cloud your choice. On Linux and on Windows with Cygwin, you can install full pre-built distributions of NetCDF, HDF5 and prerequisites, for use with GCC and Gfortran. With these, you can run appropriate experiments on your data. If NetCDF comes out as something worthy of continued interest, you can then write some glue code to access the same DLLs from FTN95 programs.
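The "glue code" idea might be pictured with this Python ctypes sketch. It is only an analogy: it loads the C math library and calls sqrt through its flat C API, which is the same load-library / declare-signature / call pattern a binding to netcdf.dll would use. The library name lookup is an assumption about a Unix-like host, and sqrt is a stand-in, not a NetCDF entry point.

```python
import ctypes
import ctypes.util

# Stand-in for netcdf.dll: load a C shared library and declare one entry
# point's signature.  Real glue code would do this for each nc_* function.
libname = ctypes.util.find_library("m") or "libm.so.6"   # assumes a Unix-like host
libm = ctypes.CDLL(libname)
libm.sqrt.restype = ctypes.c_double       # C prototype: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]

root = libm.sqrt(2.0)
```

From FTN95 the same job is done with STDCALL/C_EXTERNAL declarations against the DLL's exported names, which is why the F77-style entry points remain callable even without ISO_C_BINDING support.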


Last edited by mecej4 on Mon Oct 31, 2016 12:36 am; edited 1 time in total
silicondale



Joined: 15 Mar 2007
Posts: 154
Location: Matlock, Derbyshire

Posted: Sun Oct 30, 2016 6:01 pm

Thanks. Good point! Hadn't considered Cygwin but indeed it would make a good testbed environment.
JohnCampbell



Joined: 16 Feb 2006
Posts: 1757
Location: Sydney

Posted: Mon Oct 31, 2016 1:55 am

Quote:
Having "this file contains totally disorganized data"


Although I have probably worked in different fields (logistics, dredging and finite elements), I have never had data that fits this classification; when data does at first appear that way, I have written code to restructure it so that it is easier to analyse.
Excel and pivot tables can be a good place to start when experimenting with samples of raw data, to identify ways of restructuring the data definition without destroying important unusual values.
My approach has been to apply some pre-processing to the data and retain all the information in a text format. Admittedly, I have only analysed numerical records, such as survey data (up to 100 GB) or text reports from other databases (1 GB), but it has worked for me.
Text files with a few headings can be much easier to manage.
This is probably a Fortran-focused approach, but there must be some structure that can be utilised.
I have not applied this approach to non-text data inputs, such as images, which would be much more challenging.
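A minimal sketch of the pre-processing approach described above, in Python for brevity (the field names and filtering rules are invented for illustration): irregular raw lines are filtered and rewritten as a headed, tab-separated text file.

```python
import io

# Hypothetical raw survey dump: "id value" pairs mixed with comments,
# blank lines and a malformed record, the kind of irregular input the
# post describes.
raw = io.StringIO("# survey dump\nA1 3.2\n\nA2 bad\nA3 7.9\n")

clean = []
for line in raw:
    parts = line.split()
    if line.startswith("#") or len(parts) != 2:
        continue                          # drop comments, blanks, short lines
    ident, value = parts
    try:
        clean.append((ident, float(value)))   # keep only numeric records
    except ValueError:
        continue          # an "important unusual value" could be logged here

out = io.StringIO()
out.write("id\tvalue\n")                  # a few headings, as suggested above
for ident, value in clean:
    out.write(f"{ident}\t{value}\n")
```

The derived file stays human-readable, and because the filter is a program rather than a manual edit, the cleaning step is reproducible and auditable.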
mecej4



Joined: 31 Oct 2006
Posts: 717

Posted: Mon Oct 31, 2016 4:29 am

Silicondale: whether you agree with its conclusions or not, you may find the narrative at http://cyrille.rossant.net/moving-away-hdf5/ instructive. You may be able to forecast whether you can avoid some of the pitfalls listed there.
silicondale



Joined: 15 Mar 2007
Posts: 154
Location: Matlock, Derbyshire

Posted: Mon Oct 31, 2016 11:29 am

Many thanks, both! This is very helpful.
JohnCampbell



Joined: 16 Feb 2006
Posts: 1757
Location: Sydney

Posted: Mon Oct 31, 2016 12:42 pm

Thanks mecej4,

That link was very interesting.
It is amazing what other projects have developed.
Stepping away must have been a difficult decision.
It is hard to drop something that works some of the time.

For data analysis, my approach has been to start simple and try to build from that.
I often work with new/derived data sets that can be reproduced from the original (as-received) data via a data-filter program; this makes it easy to audit or change the corrections applied.
DanRRight



Joined: 10 Mar 2008
Posts: 1522
Location: South Pole, Antarctica

Posted: Tue Nov 08, 2016 3:30 am

Has anyone tested in Fortran whether reading HDF5 files is much faster because they are compressed? The slowness of reading ASCII files a few tens of GB in size annoys me like hell. But if there were a way to read zipped files directly and get a 4-5x speed boost, that would be a good reason not even to consider HDF5.
JohnCampbell



Joined: 16 Feb 2006
Posts: 1757
Location: Sydney

Posted: Tue Nov 08, 2016 7:43 am

Dan,

The best solution is to get an SSD. However, when combined with Fortran I/O, there appear to be performance differences between the various SSD drives I have tested. I found it difficult to reconcile the quoted performance of different drives with their actual performance for the few SSDs I have tested. My understanding is that it depends on the driver bandwidth for the SSD.

In June 2015 my solution was to buy a Samsung 850 Evo 250 GB SSD. It compared well in published performance tests and has good actual performance for Fortran I/O. There are probably better drives available today, but I would recommend it as a good reference point.

I would suspect that if the file is already held in the disk cache, compression will only be an overhead, as decompression takes place in software and not in the disk hardware.
Indeed, once the file is in the disk cache, the type of disk may not matter at all; performance then depends on the bandwidth of transfers between the cache memory and program memory, which can vary with the O/S.

John

ps: My experience is that an SSD is the easiest solution, but if anyone knows better, I would be pleased to edit this post, as the above summary is based on my limited testing.
DanRRight



Joined: 10 Mar 2008
Posts: 1522
Location: South Pole, Antarctica

Posted: Tue Nov 08, 2016 8:12 am

John, I tried that. I even checked drives that are up to 10x faster than any SSD, such as RAM drives. Conclusion: no big difference in read speed. In fact, none of them is much faster to read from than a fast hard drive like the WD Black.

Clearly something else is the bottleneck. I hope future parallelisation of reading, or solutions like HDF5, will give a boost. Or something else will.

What about other Fortran compilers? Polyhedron never compared their I/O speed, but this becomes crucial as file sizes grow and grow. Has anyone had any unusual experience?


Last edited by DanRRight on Tue Nov 08, 2016 8:34 am; edited 1 time in total
JohnCampbell



Joined: 16 Feb 2006
Posts: 1757
Location: Sydney

Posted: Tue Nov 08, 2016 8:30 am

Dan,

I would have to look back at my I/O testing.

My conclusion was that the SSD driver had a bandwidth limit, and that this was the problem with the other SSDs. I thought I was getting transfer rates of hundreds of megabytes per second (up to 400 MB/s is possible, from memory), but the measurements were disguised by the size of the available memory buffers relative to the file size.
At present my PC has 32 GB of memory and a 250 GB SSD, so it is hard to tell what is happening.
I will resurrect the test and compare with HDD and get back to you when I have some real numbers.

John
mecej4



Joined: 31 Oct 2006
Posts: 717

Posted: Tue Nov 08, 2016 11:00 am

DanRRight wrote:
Has anyone tested in Fortran whether reading HDF5 files is much faster because they are compressed?

By using compressed files you reduce the amount of I/O, which tends to speed up execution. However, on-the-fly compression (during writes) and decompression (during reads) consumes resources (memory, CPU time) and tends to slow execution down. The combined effect on run time can be either a decrease or an increase. If your tests revealed that faster I/O devices did not make your program run faster, perhaps decompression is the bottleneck. You did not state which compression/decompression scheme you used.

Most compression algorithms involve a trade-off between tightness of compression and speed of compression. For example, Gzip has an option -#, where # ranges from 1 to 9; see https://linux.die.net/man/1/gzip .
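The trade-off is easy to demonstrate with Python's stdlib gzip module (illustrative only; the sample data is artificially repetitive, so real data will compress less well):

```python
import gzip

# Artificially repetitive sample "sensor log": compresses extremely well.
data = b"1477843200 21.5 42\n" * 50_000

fast = gzip.compress(data, compresslevel=1)   # quickest, loosest packing
tight = gzip.compress(data, compresslevel=9)  # slowest, tightest packing

print(len(data), len(fast), len(tight))
```

On real, less regular data the size gap between levels narrows while the time gap grows, which is exactly the trade-off gzip's -1 to -9 options expose.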


Quote:
But if there were a way to read zipped files directly and get a 4-5x speed boost, that would be a good reason not even to consider HDF5.

There are C/C++/C#/Java APIs for reading/writing compressed files. See, for example, https://msdn.microsoft.com/en-us/library/windows/desktop/hh920921 . To use these APIs with Fortran, you may have to write some interfaces and glue code.

There is no reason to use HDF if your only goal is to compress the data.
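For completeness, here is what "reading a zipped file directly" looks like with Python's stdlib gzip module (illustrative; this is not the Windows Compression API mentioned above): the data is decompressed on the fly during reading, and no uncompressed copy ever exists on disk.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt.gz")

# Write 1000 "index value" lines straight into a gzip-compressed file.
with gzip.open(path, "wt") as f:
    for i in range(1000):
        f.write(f"{i} {i * 0.5}\n")

# Read it back line by line; decompression happens transparently.
total = 0.0
with gzip.open(path, "rt") as f:
    for line in f:
        _, value = line.split()
        total += float(value)
```

Whether this is faster than reading the uncompressed file depends on the ratio of disk bandwidth to decompression speed, which is the trade-off discussed above.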
DanRRight



Joined: 10 Mar 2008
Posts: 1522
Location: South Pole, Antarctica

Posted: Fri Nov 11, 2016 3:20 am

It would be nice if someone clarified one amazing thing before we plunge into any alternative methods of I/O. For the test you have to create a RAM disk a couple of GB in size; otherwise we will be dealing with delayed writes by the OS, which mask the actual write (when we copy a large file, the OS reports that the file was copied quickly or almost instantly, but in reality the transfer continues afterwards at a much slower speed). RAM disk read/write speeds exceed 5-9 GB per second; see here:

https://www.raymond.cc/blog/12-ram-disk-software-benchmarked-for-fastest-read-and-write-speed/

Q: Why does copying a 1-3 GB file take less than ~0.5 seconds, while just loading the same data into RAM with Fortran code takes 100-1000x longer?

Something really absurd is happening here. The OS reads and then writes the contents of the file at blazing speed, while the Fortran code, which only has to read the same bits and bytes, does so like a drunk turtle.

The Fortran code, which is supposed to be fast, is actually the worst bottleneck?!

(When playing with this effect, make sure first to copy some other large files elsewhere, to kick our initial file out of the RAM cache.)
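Dan's question has a largely language-independent answer: a file copy just moves blocks of bytes, while a formatted read must scan every character, split the stream into fields and convert decimal digits to binary floating point. The gap can be reproduced in a few lines of Python (illustrative; FTN95's formatted READ adds its own record-handling overheads on top of this):

```python
import os
import tempfile
import time

# Build a small text file of numbers, one per line.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
with open(path, "w") as f:
    for i in range(200_000):
        f.write(f"{i * 0.001}\n")

# The "file copy" side: one block transfer of raw bytes.
t0 = time.perf_counter()
with open(path, "rb") as f:
    blob = f.read()
t_raw = time.perf_counter() - t0

# The "formatted READ" side: scan, split and convert every line.
t0 = time.perf_counter()
with open(path) as f:
    values = [float(line) for line in f]
t_parse = time.perf_counter() - t0

print(f"raw {t_raw:.4f}s  parse {t_parse:.4f}s  ratio {t_parse / t_raw:.0f}x")
```

The parsing pass is the expensive one regardless of how fast the disk or RAM disk is, which is why faster drives showed little gain above; unformatted (binary) I/O avoids the conversion entirely.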
Page 1 of 6