forums.silverfrost.com Welcome to the Silverfrost forums
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2402 Location: Yateley, Hants, UK
Posted: Mon Jun 06, 2016 5:45 pm Post subject:
John,
If it worked in the past, it should work now. Not only that, it should be much quicker, since CPU speeds are better than they were. Not changing the source code gives you the minimum number of things to fix.
If you download 77library.pdf (from the FTN77 days) and read Chapter 6, it may give you some idea of how to do your task the alternative way.
If it ran in 32 MB, then a 2 GB computer gives you 64 times as much RAM, and you are unlikely to run out of it.
Eddie
mecej4
Joined: 31 Oct 2006 Posts: 1899
Posted: Mon Jun 06, 2016 7:03 pm Post subject:
Consider building an index for each text corpus. The structure of the index needs careful planning, but it has to be built only once per text. For example, if you have fixed-size blocks of text, the text could be processed as a direct-access file with a fixed block size, and the index file could contain two columns: keyword and block number. If you want the index to cover more than one corpus, you could add a third column to the index file: corpus name.
Once you have built the index file(s), you can sort them in various ways to speed up subsequent look-ups.
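A minimal sketch of that scheme in Fortran, assuming the corpus really is stored as fixed 512-character records; the file names, block size, and hard-coded keyword list are placeholders for illustration only:
Code:
! Sketch of the block-index idea: scan fixed-size blocks of a corpus and
! write "keyword  block-number" pairs to an index file.  The keywords here
! are hard-coded placeholders; in practice they would come from a list or
! be extracted from the text.
program build_index
  implicit none
  integer, parameter :: blklen = 512              ! fixed block size, characters
  character (len=blklen) :: block
  character (len=8) :: keywords(2) = (/ 'DANELAW ', 'WESSEX  ' /)
  integer :: iblk, k, ios
  open (unit=10, file='corpus.txt', access='direct', recl=blklen, &
        form='formatted', status='old')
  open (unit=11, file='corpus.idx', status='replace')
  do iblk = 1, huge (0)
    read (10, '(a)', rec=iblk, iostat=ios) block
    if (ios /= 0) exit                            ! past the last block
    do k = 1, size (keywords)
      ! note: matches that straddle a block boundary are missed by this simple scan
      if (index (block, trim (keywords(k))) > 0) &
          write (11, '(a,1x,i8)') trim (keywords(k)), iblk
    end do
  end do
  close (10)
  close (11)
end program build_index
Sorting corpus.idx on the keyword column then gives the fast look-up file described above.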
Do you know the keywords in advance, or are they to be extracted from the texts themselves, using some detection rules?
There are full-fledged software packages, many of them free, for doing this type of work with texts that involve multi-byte characters or Unicode alphabets. Do a Web search using the keyword "concordance". Many of these packages handle ASCII texts fine.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
Posted: Tue Jun 07, 2016 4:40 am Post subject:
John,
I have my own 32-bit line editor (based on Pr1me's line editor!). It stores the file as a list of variable-length lines in a character array and is at present configured for 1 GB files and 20 million lines. Searching this file for a string is very quick, even case-insensitively.
It can also sort on multiple fields within each line, again very fast.
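Not the editor's actual code, but a sketch of that multi-field idea: permute an index array by comparing two fixed-column fields, so the stored text itself never moves. The field columns are placeholders, and the insertion sort is for clarity only; a real editor would use a faster sort for millions of lines.
Code:
! Sketch only: reorder an index array so lines compare by a primary field
! (columns 11-20) and then a secondary field (columns 1-10).  Assumes each
! line is at least 20 characters; the column positions are illustrative.
subroutine sort_lines (lines, order, n)
  implicit none
  integer, intent (in) :: n
  character (len=*), intent (in) :: lines(n)
  integer, intent (inout) :: order(n)     ! on entry 1..n; on exit sorted order
  integer :: i, j, k
  do i = 2, n                             ! simple insertion sort
    k = order(i)
    j = i - 1
    do while (j >= 1)
      if (.not. precedes (lines(k), lines(order(j)))) exit
      order(j+1) = order(j)
      j = j - 1
    end do
    order(j+1) = k
  end do
contains
  logical function precedes (a, b)
    character (len=*), intent (in) :: a, b
    if (a(11:20) /= b(11:20)) then        ! primary field
      precedes = a(11:20) < b(11:20)
    else                                  ! secondary field
      precedes = a(1:10) < b(1:10)
    end if
  end function precedes
end subroutine sort_lines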
You have described a data structure of blocks and lines, which would not be too hard to set up.
My recommendation would be to read the file into memory and do the search there.
You did not indicate the size or number of your files, but I have found that for files around 1 GB in size, a simple read and in-memory search is much faster than Notepad (which would also do for a one-off test). Notepad appears to struggle with memory allocation when the file gets very large.
I have not extended the program to 64-bit, but would not expect huge delays.
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
Posted: Thu Jun 09, 2016 2:40 am Post subject:
John,
The editor basically stores the file in a character array and then searches this array for a character string, often ignoring upper/lower case.
The data structure is very basic:
Code:
module edit_parameters
!
  INTEGER*4, PARAMETER :: million = 1000000    ! 1 million
  INTEGER*4, PARAMETER :: MAXLIN = 20*million  ! max lines in file (the three index arrays total 240 MB)
  INTEGER*4, PARAMETER :: MAXSTR = 950*million ! max characters in file (950 MB)
  INTEGER*4, PARAMETER :: LENCOM = 512         ! max characters in a command
  INTEGER*4, PARAMETER :: LENLIN = 512         ! max characters in a line
!
! common variables
!
  character*1 CSTOR(MAXSTR)       ! text storage array
  integer*4   START(MAXLIN)       ! pointer to the first character of each line
  integer*4   LENGTH(MAXLIN)      ! length of each line in characters
  integer*4   LINE_ORDER(MAXLIN)  ! line order index, used when sorting the line order
!
  character (len=LENLIN) :: LINE  ! active line of text
!
  character (len=LENCOM) :: XCOM  ! command line
!
  ...
!
end module edit_parameters
I then just search each line in order for the search string. For a case-insensitive search, lower-case characters are converted to upper case as the comparison proceeds.
It is a sequential search with no smarts.
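Something like the following, using the arrays above; a sketch only, and the TO_UPPER helper is illustrative rather than the editor's actual routine:
Code:
! Sketch only: sequential, case-insensitive scan of the stored lines
! for SRCH.  Assumes the module arrays have been filled and NLINES is
! the number of lines currently stored.
subroutine find_string (srch, nlines)
  use edit_parameters
  implicit none
  character (len=*), intent (in) :: srch
  integer*4, intent (in) :: nlines
  integer*4 :: i, j, k
  do i = 1, nlines
    k = length(i)
    do j = 1, k                          ! copy line i into the line buffer
      line(j:j) = cstor(start(i)+j-1)
    end do
    if (index (to_upper (line(1:k)), to_upper (srch)) > 0) &
        write (*,'(a,i10,2a)') ' found at line', i, ' : ', line(1:k)
  end do
contains
  function to_upper (s) result (u)       ! illustrative helper only
    character (len=*), intent (in) :: s
    character (len=len(s)) :: u
    integer :: j, ic
    u = s
    do j = 1, len (s)
      ic = ichar (u(j:j))
      if (ic >= ichar ('a') .and. ic <= ichar ('z')) u(j:j) = char (ic-32)
    end do
  end function to_upper
end subroutine find_string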
The main point I am trying to make is that even with a 1 GB file the search time is hardly noticeable, perhaps 1 or 2 seconds.
The main delay is reading the text file into memory, which depends on what type of disk is being used and on whether the file is already in a disk buffer.
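For what it is worth, that load step might look something like this; again a sketch only, with the file name, unit number and crude overflow handling as placeholders rather than the editor's actual code:
Code:
! Sketch only: load a text file into the module's storage arrays.
subroutine load_file (nlines)
  use edit_parameters
  implicit none
  integer*4, intent (out) :: nlines
  integer*4 :: next, k, j, ios
  open (unit=12, file='input.txt', status='old', action='read')
  nlines = 0
  next   = 1                             ! next free character slot in CSTOR
  do
    read (12, '(a)', iostat=ios) line
    if (ios /= 0) exit                   ! end of file
    k = max (len_trim (line), 1)         ! keep blank lines as a single space
    if (nlines >= maxlin .or. next+k-1 > maxstr) exit   ! storage full: stop
    nlines = nlines + 1
    start(nlines)  = next
    length(nlines) = k
    do j = 1, k
      cstor(next+j-1) = line(j:j)
    end do
    next = next + k
    line_order(nlines) = nlines          ! initial order is file order
  end do
  close (12)
end subroutine load_file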
Unless the search is being done millions of times, a simple scan will do.
John