forums.silverfrost.com Forum Index
Welcome to the Silverfrost forums
 

Best Text Searching Storage Strategy

 
forums.silverfrost.com Forum Index -> General
John-Silver
Joined: 30 Jul 2013
Posts: 453

Posted: Mon Jun 06, 2016 2:47 pm    Post subject: Best Text Searching Storage Strategy

I have a heritage program I'm going to try to rejuvenate. It basically searches through a (large) number of blocks of text for keywords/strings and then outputs only those blocks which contain the said keyword/string.
The question is: what is the best strategy to use?
a) Read ALL the blocks of text into memory: initially slower, but subsequent searches are quicker, at the cost of the memory usage overhead ... and how best to 'store' the text data?
b) Read the data each and every time a search is invoked, searching as you go. This is the current strategy of the program, which was written almost 20 years ago when there was a herculean 32 Mb of memory (if memory serves me right) on the mainframe: a block at a time is processed, line by line, and output if it contains the searched-for string.
The target is, obviously, speed.
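The block-at-a-time strategy in (b) can be sketched roughly as follows. This is an illustrative Python sketch, not the original Fortran, and the blank-line block delimiter is an assumption (the thread does not say how blocks are separated in the files):

```python
def matching_blocks(lines, keyword):
    """Yield each block (list of lines) that contains the keyword.

    Blocks are assumed to be separated by blank lines; only one
    block is held in memory at a time, as in strategy (b).
    """
    block = []
    for line in lines:
        if line.strip() == "":          # blank line ends the current block
            if any(keyword in l for l in block):
                yield block
            block = []
        else:
            block.append(line)
    if block and any(keyword in l for l in block):  # flush the final block
        yield block
```

The point of the sketch is that memory use stays bounded by the largest single block, at the price of re-reading every file on every search.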
A block of data is of variable size, of the order of 10-100 lines per block, with variable line lengths.
There could be thousands of blocks, and more than one file of blocks.
As an example, a text file processed at the time had 2500 pages of text, 40 lines per page. There could be several files like that to process.
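For scale, the example above works out to only a few megabytes of text per file. A back-of-the-envelope check (the 80-characters-per-line average is an assumption; the thread only says line lengths vary):

```python
pages_per_file = 2500
lines_per_page = 40
avg_chars_per_line = 80   # assumed average line length

bytes_per_file = pages_per_file * lines_per_page * avg_chars_per_line
print(bytes_per_file)        # 8000000, i.e. about 8 MB per file
print(10 * bytes_per_file)   # even ten such files are only ~80 MB
```

On that arithmetic, holding several whole files in memory is well within reach of a modern machine.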
_______________________________
Associated with this question, but with wider implications for memory management in programs in general too: are there any FTN95 functions and/or Win APIs to check the actual (cumulative) memory usage during runtime, or is it a question of working out a personal memory-management scheme (i.e. continuously counting how much is being used at any given time)?
Any experience in this area would be gratefully received.
LitusSaxonicum
Joined: 23 Aug 2005
Posts: 1574
Location: Yateley, Hants, UK

Posted: Mon Jun 06, 2016 5:45 pm

John,

If it worked in the past, it should work now. Not only that, but it should be a lot quicker, as CPU speeds are better than they were. Not changing the source code gives you the minimum number of things to fix.

If you download 77library.pdf (from FTN77 days) and read Chapter 6, it may give you some idea of how to do your task the alternative way.

If it ran in 32 Mb, then that means that you have 64 x as much RAM in a 2 Gb computer, and you are unlikely to run out of it.

Eddie
mecej4
Joined: 31 Oct 2006
Posts: 607

Posted: Mon Jun 06, 2016 7:03 pm

Consider building an index for each text corpus. The structure of the index needs careful planning, but the index has to be built only once per text. For example, if you have fixed-size blocks of text, the text could be processed as a direct-access file with a fixed block size, and the index file could contain two columns: keyword and block number. If you want the index to cover more than one corpus, you could add a third column to the index file: corpus name.

Once you build the index file(s), you can sort them in various ways to speed up subsequent look-ups.
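A minimal sketch of the idea, here as an in-memory inverted index in Python rather than a sorted index file (the keyword/block-number/corpus-name columns described above map onto the dictionary keys and value tuples; treating every lowercased word as a keyword is an assumption for illustration):

```python
from collections import defaultdict

def build_index(corpora):
    """Build an inverted index: keyword -> set of (corpus_name, block_number).

    `corpora` maps a corpus name to its list of blocks, each block a string.
    Keywords here are simply the lowercased words in each block.
    """
    index = defaultdict(set)
    for corpus_name, blocks in corpora.items():
        for block_number, block in enumerate(blocks):
            for word in block.lower().split():
                index[word].add((corpus_name, block_number))
    return index

# Hypothetical corpora; look-ups then avoid rescanning the text at all:
corpora = {"manual": ["stress analysis notes", "thermal loads"],
           "report": ["stress results summary"]}
index = build_index(corpora)
print(sorted(index["stress"]))   # [('manual', 0), ('report', 0)]
```

Once built, such an index answers "which blocks contain this keyword?" in a single dictionary look-up, which is the pay-off for the one-off indexing cost.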

Do you know the keywords in advance, or are they to be extracted from the texts themselves, using some detection rules?

There exist full-fledged software packages, many of them free, for doing this type of work with texts that involve multi-byte characters/Unicode alphabets. Do a Web search using the keyword "concordance". Many of these packages handle ASCII texts fine.
JohnCampbell
Joined: 16 Feb 2006
Posts: 1706
Location: Sydney

Posted: Tue Jun 07, 2016 4:40 am

John,
I have my own 32-bit line editor (based on Pr1me's line editor!). It stores the file as a list of variable-length lines in a character array and is at present configured for 1 GB files, 20 million lines. Searching this file for a string is very quick, even case-insensitively.
It can also sort on multiple fields within each line, again very fast.
You have described a data structure of blocks and lines, which would not be too hard to set up.
My recommendation would be to read the file into memory and do the search there.
You did not indicate the size of each file or the number of files, but I have found that for files of around 1 GB, a simple read and in-memory search is much faster than Notepad (which could also do for a one-off test). Notepad appears to struggle with memory allocation when the file gets very large.
I have not extended the program to 64 bits, but would not expect huge delays.

John
John-Silver
Joined: 30 Jul 2013
Posts: 453

Posted: Thu Jun 09, 2016 12:57 am

Eddie,
My first thoughts were obviously to suck-it-and-see with the existing structure of the program, as you rightly suggest.
Thanks for the pointer to the F77 doc; I'll read through that.
As for:

Quote:
If it ran in 32Mb, then that means that you have 64 x as much RAM in a 2 Gb computer


that may be true, but it's unlikely that 2 Gb will be free, hence my supplementary general question at the bottom of the post about tracking memory usage.
Maybe I'm trying to think too far in advance and to anticipate problems before they occur, but I'm getting old and hesitate to plough in like I used to on a 'break it first, fix it later' philosophy. LOL
And of course I'm stuck for time, as is the usual situation.

mecej4,
Thanks for the suggestion of building index files.
I'll consider that as an option to look into if the existing program doesn't run as efficiently as I'd like.
Concordance is something new to me; I'll have to look into its subtleties and try to find some examples.

No, the keywords are not known in advance: a string is input by the user and the program searches for occurrences of that complete string in the various files.


John C,

Thanks for your input too.
I tried a quick search for info about Pr1me's line editor, to get some background on any relevant techniques, but so far I have not come up trumps. I'll continue to look for more info.
My thoughts were, after verifying the existing program's speed as Eddie suggests, to then try the modified strategy and compare performance.
JohnCampbell
Joined: 16 Feb 2006
Posts: 1706
Location: Sydney

Posted: Thu Jun 09, 2016 2:40 am

John,

The editor basically stores the file in a character array and then searches this array for a character string, often ignoring upper/lower case.
The data structure is very basic:
Code:
module edit_parameters
!
      INTEGER*4, PARAMETER :: million =   1000000   ! 1 million
      INTEGER*4, PARAMETER :: MAXLIN =  20*million  ! max lines in file              240mb
      INTEGER*4, PARAMETER :: MAXSTR = 950*million  ! max characters in file         950mb
      INTEGER*4, PARAMETER :: LENCOM =        512   ! max characters in command
      INTEGER*4, PARAMETER :: LENLIN =        512   ! max characters in line
!
!    common variables
!
      character*1               CSTOR(MAXSTR)       ! text storage array
      integer*4                 START(MAXLIN)       ! pointer to first character of each line
      integer*4                 LENGTH(MAXLIN)      ! length of each line in characters
      integer*4                 LINE_ORDER(MAXLIN)  ! line storage index for sorting line storage order
!
      character (len=lenlin) :: LINE                ! active line of text
!
      character (len=lencom) :: XCOM                ! command line
!
...
!
end module edit_parameters

I then just search each line in order for the search string. If I am ignoring case, lower-case characters are converted to upper case as the search proceeds.
It is a sequential search with no smarts.
The main point I am trying to make is that even with a 1 GB file the search time is hardly noticeable, perhaps 1 or 2 seconds.
The main delay is reading the text file into memory, which depends on what type of disk is being used and whether the file is already in a disk buffer.
Unless it is being done millions of times, a simple search will do.
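The same storage scheme (one flat character store plus per-line start/length pointers, as in CSTOR/START/LENGTH above) and the case-folding sequential search can be sketched in Python for illustration; the Fortran module above is the real thing, and the function names here are made up:

```python
def load_lines(text):
    """Mimic CSTOR/START/LENGTH: one flat string plus (start, length) per line."""
    cstor = text
    start, length = [], []
    pos = 0
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\n")
        start.append(pos)       # offset of the first character of the line
        length.append(len(body))
        pos += len(line)
    return cstor, start, length

def search(cstor, start, length, target, ignore_case=True):
    """Sequential search with no smarts; returns matching line numbers (1-based)."""
    if ignore_case:
        target = target.upper()
    hits = []
    for i, (s, n) in enumerate(zip(start, length), 1):
        line = cstor[s:s + n]
        if ignore_case:
            line = line.upper()   # fold case as the scan proceeds
        if target in line:
            hits.append(i)
    return hits
```

For example, `search(*load_lines("First line\nsecond LINE\nlast\n")[:3], "line")` visits every line once, which is the linear scan being described: no index, just raw speed from having everything already in memory.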

John