Building cross reference -- how?

Stewart C. Russell scruss-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Mon Oct 14 20:54:33 UTC 2013


On 13-10-14 04:03 AM, William Park wrote:
> 
> I'm trying to avoid scanning the entire file.  If I have 1M files, each
> with 1K lines, then that's 1G lines.

In corpus linguistics terms, that would be a medium-sized research
corpus. The sort of thing my colleagues at Birmingham University were
serving real-time queries over telnet to multiple clients from a single
SparcStation 10 in the late 1990s.

If it were me doing this for single-user access, I'd smack all the text
into a SQLite FTS table. This allows for fast token-based searching, but
has some features that you might have to work around:

 · only whole-token matches (and token-prefix matches, e.g. lin*) are
   supported: not arbitrary substrings, not regex
 · searches are usually case-insensitive
 · Unicode support might be limited or completely absent
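
A minimal sketch of that idea in Python, in case it helps. It assumes the
SQLite behind Python's sqlite3 module was built with FTS4, which is common
but not guaranteed. The table name xref, its columns, and the helper names
build_index and find are made up for illustration.

```python
import sqlite3


def build_index(conn, rows):
    """Load (path, line_no, text) tuples into a hypothetical FTS4 table."""
    # Assumes FTS4 is compiled in; the default "simple" tokenizer is used.
    conn.execute("CREATE VIRTUAL TABLE xref USING fts4(path, line_no, body)")
    conn.executemany(
        "INSERT INTO xref (path, line_no, body) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def find(conn, query):
    """Return sorted (path, line_no) pairs whose body matches the query."""
    cur = conn.execute(
        "SELECT path, line_no FROM xref WHERE body MATCH ?", (query,)
    )
    # Convert line numbers back to int in case they were stored as text.
    return sorted((path, int(line_no)) for path, line_no in cur)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, [
        ("a.txt", 1, "The quick brown fox"),
        ("a.txt", 2, "jumps over the lazy dog"),
        ("b.txt", 1, "A Fox is not a dog"),
        ("b.txt", 2, "foxes are a different token"),
    ])
    print(find(conn, "fox"))   # whole-token match, ASCII case folded
    print(find(conn, "fox*"))  # token-prefix match
    print(find(conn, "ox"))    # substrings are not matched
```

With the sample rows, a query for fox should hit the lines containing "fox"
and "Fox" but not the one containing "foxes", which is a different token.
The prefix form fox* picks up all three, and ox finds nothing.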

A couple of links on FTS:
- SQLite FTS3 and FTS4 Extensions: https://www.sqlite.org/fts3.html
- How to use Full-Text Search in SQLite (O'Reilly Answers):
  http://answers.oreilly.com/topic/1955-how-to-use-full-text-search-in-sqlite/


cheers,
 Stewart
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




