forums.silverfrost.com Welcome to the Silverfrost forums
LitusSaxonicum
Joined: 23 Aug 2005 Posts: 2402 Location: Yateley, Hants, UK
Posted: Mon Jun 06, 2016 5:45 pm Post subject:
John,
If it worked in the past, it should work now. Not only that, it should be much quicker, since CPU speeds are better than they were. Not changing the source code gives you the minimum number of things to fix.
If you download 77library.pdf (from the FTN77 days) and read Chapter 6, it may give you some idea of how to do your task the alternative way.
If it ran in 32 MB, then a 2 GB computer gives you 64 times as much RAM, and you are unlikely to run out of it.
Eddie
mecej4
Joined: 31 Oct 2006 Posts: 1899
Posted: Mon Jun 06, 2016 7:03 pm Post subject:
Consider building an index for each text corpus. The structure of the index needs careful planning, but it has to be built only once per text. For example, if you have fixed-size blocks of text, the text could be processed as a direct-access file with a fixed block size, and the index file could contain two columns: keyword and block number. If you want the index to cover more than one corpus, you could add a third column to the index file: corpus name.
Once you have built the index file(s), you can sort them in various ways to speed up subsequent look-ups.
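A minimal sketch of that scheme in Fortran, assuming the corpus really is stored as fixed 512-character records; the file names, block size, and hard-coded keyword list are placeholders for illustration only:
Code:
! Sketch of the block-index idea: scan fixed-size blocks of a corpus and
! write "keyword  block-number" pairs to an index file.  The keywords here
! are hard-coded placeholders; in practice they would come from a list or
! be extracted from the text.
program build_index
  implicit none
  integer, parameter :: blklen = 512              ! fixed block size, characters
  character (len=blklen) :: block
  character (len=8) :: keywords(2) = (/ 'DANELAW ', 'WESSEX  ' /)
  integer :: iblk, k, ios
  open (unit=10, file='corpus.txt', access='direct', recl=blklen, &
        form='formatted', status='old')
  open (unit=11, file='corpus.idx', status='replace')
  do iblk = 1, huge (0)
    read (10, '(a)', rec=iblk, iostat=ios) block
    if (ios /= 0) exit                            ! past the last block
    do k = 1, size (keywords)
      ! note: matches that straddle a block boundary are missed by this simple scan
      if (index (block, trim (keywords(k))) > 0) &
          write (11, '(a,1x,i8)') trim (keywords(k)), iblk
    end do
  end do
  close (10)
  close (11)
end program build_index
Sorting corpus.idx on the keyword column then gives the fast look-up file described above.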
Do you know the keywords in advance, or are they to be extracted from the texts themselves, using some detection rules?
There are full-fledged software packages, many of them free, for doing this type of work with texts that involve multi-byte characters or Unicode alphabets. Do a Web search using the keyword "concordance". Many of these packages handle ASCII texts fine.
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
Posted: Tue Jun 07, 2016 4:40 am Post subject:
John,
I have my own 32-bit line editor (based on Pr1me's line editor!). It stores the file as a list of variable-length lines in a character array and is at present configured for 1 GB files and 20 million lines. Searching this file for a string is very quick, even case-insensitively.
It can also sort on multiple fields within each line, again very fast.
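Not the editor's actual code, but a sketch of that multi-field idea: permute an index array by comparing two fixed-column fields, so the stored text itself never moves. The field columns are placeholders, and the insertion sort is for clarity only; a real editor would use a faster sort for millions of lines.
Code:
! Sketch only: reorder an index array so lines compare by a primary field
! (columns 11-20) and then a secondary field (columns 1-10).  Assumes each
! line is at least 20 characters; the column positions are illustrative.
subroutine sort_lines (lines, order, n)
  implicit none
  integer, intent (in) :: n
  character (len=*), intent (in) :: lines(n)
  integer, intent (inout) :: order(n)     ! on entry 1..n; on exit sorted order
  integer :: i, j, k
  do i = 2, n                             ! simple insertion sort
    k = order(i)
    j = i - 1
    do while (j >= 1)
      if (.not. precedes (lines(k), lines(order(j)))) exit
      order(j+1) = order(j)
      j = j - 1
    end do
    order(j+1) = k
  end do
contains
  logical function precedes (a, b)
    character (len=*), intent (in) :: a, b
    if (a(11:20) /= b(11:20)) then        ! primary field
      precedes = a(11:20) < b(11:20)
    else                                  ! secondary field
      precedes = a(1:10) < b(1:10)
    end if
  end function precedes
end subroutine sort_lines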
You have described a data structure of blocks and lines, which would not be too hard to set up.
My recommendation would be to read the file into memory and do the search there.
You did not indicate the size or number of your files, but I have found that for files around 1 GB in size, a simple read and in-memory search is much faster than Notepad (which would also do for a one-off test). Notepad appears to struggle with memory allocation when the file gets very large.
I have not extended the program to 64-bit, but would not expect huge delays.
John
JohnCampbell
Joined: 16 Feb 2006 Posts: 2615 Location: Sydney
Posted: Thu Jun 09, 2016 2:40 am Post subject:
John,
The editor basically stores the file in a character array and then searches this array for a character string, often ignoring upper/lower case.
The data structure is very basic:
Code:
module edit_parameters
!
  INTEGER*4, PARAMETER :: million = 1000000    ! 1 million
  INTEGER*4, PARAMETER :: MAXLIN = 20*million  ! max lines in file (the three index arrays total 240 MB)
  INTEGER*4, PARAMETER :: MAXSTR = 950*million ! max characters in file (950 MB)
  INTEGER*4, PARAMETER :: LENCOM = 512         ! max characters in a command
  INTEGER*4, PARAMETER :: LENLIN = 512         ! max characters in a line
!
! common variables
!
  character*1 CSTOR(MAXSTR)       ! text storage array
  integer*4   START(MAXLIN)       ! pointer to the first character of each line
  integer*4   LENGTH(MAXLIN)      ! length of each line in characters
  integer*4   LINE_ORDER(MAXLIN)  ! line order index, used when sorting the line order
!
  character (len=LENLIN) :: LINE  ! active line of text
!
  character (len=LENCOM) :: XCOM  ! command line
!
  ...
!
end module edit_parameters
I then just search each line in order for the search string. For a case-insensitive search, lower-case characters are converted to upper case as the comparison proceeds.
It is a sequential search with no smarts.
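Something like the following, using the arrays above; a sketch only, and the TO_UPPER helper is illustrative rather than the editor's actual routine:
Code:
! Sketch only: sequential, case-insensitive scan of the stored lines
! for SRCH.  Assumes the module arrays have been filled and NLINES is
! the number of lines currently stored.
subroutine find_string (srch, nlines)
  use edit_parameters
  implicit none
  character (len=*), intent (in) :: srch
  integer*4, intent (in) :: nlines
  integer*4 :: i, j, k
  do i = 1, nlines
    k = length(i)
    do j = 1, k                          ! copy line i into the line buffer
      line(j:j) = cstor(start(i)+j-1)
    end do
    if (index (to_upper (line(1:k)), to_upper (srch)) > 0) &
        write (*,'(a,i10,2a)') ' found at line', i, ' : ', line(1:k)
  end do
contains
  function to_upper (s) result (u)       ! illustrative helper only
    character (len=*), intent (in) :: s
    character (len=len(s)) :: u
    integer :: j, ic
    u = s
    do j = 1, len (s)
      ic = ichar (u(j:j))
      if (ic >= ichar ('a') .and. ic <= ichar ('z')) u(j:j) = char (ic-32)
    end do
  end function to_upper
end subroutine find_string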
The main point I am trying to make is that even with a 1 GB file the search time is hardly noticeable, perhaps 1 or 2 seconds.
The main delay is reading the text file into memory, which depends on what type of disk is being used and on whether the file is already in a disk buffer.
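For what it is worth, that load step might look something like this; again a sketch only, with the file name, unit number and crude overflow handling as placeholders rather than the editor's actual code:
Code:
! Sketch only: load a text file into the module's storage arrays.
subroutine load_file (nlines)
  use edit_parameters
  implicit none
  integer*4, intent (out) :: nlines
  integer*4 :: next, k, j, ios
  open (unit=12, file='input.txt', status='old', action='read')
  nlines = 0
  next   = 1                             ! next free character slot in CSTOR
  do
    read (12, '(a)', iostat=ios) line
    if (ios /= 0) exit                   ! end of file
    k = max (len_trim (line), 1)         ! keep blank lines as a single space
    if (nlines >= maxlin .or. next+k-1 > maxstr) exit   ! storage full: stop
    nlines = nlines + 1
    start(nlines)  = next
    length(nlines) = k
    do j = 1, k
      cstor(next+j-1) = line(j:j)
    end do
    next = next + k
    line_order(nlines) = nlines          ! initial order is file order
  end do
  close (12)
end subroutine load_file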
Unless the search is being done millions of times, a simple scan will do.
John