FFW (Freetext search For Web) is a package that provides easy-to-use
freetext searching over HTML documents (and, as a special case,
plain text documents). Its output is intended as input
to the scripts that provide the user interface, typically CGI scripts.
FFW is basically intended to replace similar solutions based on the
Wais search engine, and solves some of the problems we experienced when
using the Wais engine.
FFW is developed by the MultiTorg project at TeleNor Research, Norway.
The FFW info pages are at http://www.nta.no/produkter/ffw/ffw.html
Sources are at ftp://ftp.nta.no/pub/ffw
The sources have been compiled under SunOS 4.1.3 with gcc 2.6.2 ONLY, so
those using other systems might encounter problems. This IS version 1.0 :)
However, I do not expect big problems making it compile on other systems.
FFW features:
- Traditional inverted index, considerably smaller than a Wais index.
On test datasets we have seen FFW indexes at 1/3 the size of a Wais index.
This of course will depend on data set size and content.
- Full HTML parsing on input; reserved HTML words are not indexed.
The input parser can easily be replaced with a parser for other formats.
- Low semantic content words like and, or, not, if, etc. can be filtered
out of the index to reduce index size. This is done by providing exclusion
lists.
- Flexible indexer, can take document list from input, stdin or parameter
files.
- Memory conservative merge program allows efficient incremental building of
huge indexes. Two FFW indexes can be quickly merged into one. Building
huge indexes can generally be a problem because indexer program size
outgrows machine physical memory, leading to excessive paging load.
ffwmerge solves this problem.
- Can search in several indexes at the same time.
- Self-contained index, does not need access to the data files to construct
the user presentation. URLs and document titles are stored directly in
the index, so the index server can be totally independent of the server
holding the documents.
- Written in compiled C++ for efficiency.
- Searching supports a formal expression grammar with AND, OR, NOT and ().
- Program messages are kept in one separate file for easy localisation.
Norwegian and English versions are provided.
- Support for using several indexes with one CGI script, no need to use one
script for each searchable area.
- 8-bit characters are fully supported. HTML character escape codes are
changed into their 8-bit ISO 8859-1 equivalents where possible. This makes
words with escape codes in them searchable.
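To make the index features listed above a bit more concrete, here is a
minimal C++ sketch of an inverted index with an exclusion list and a merge
step. It is not FFW's code: the names Index, DocId, add_document and merge
are hypothetical, splitting on whitespace and lower-casing are assumptions,
and this merge works entirely in memory, unlike ffwmerge, which is designed
to conserve memory.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical types for illustration only; FFW's real index format differs.
using DocId = int;
using Index = std::map<std::string, std::set<DocId>>;  // word -> documents containing it

// Lower-case a word so that lookups are case-insensitive (an assumption here).
std::string lower(std::string w) {
    for (char& c : w) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return w;
}

// Add every whitespace-separated word of `text` to `index` under document `id`,
// skipping any word found in the exclusion list.
void add_document(Index& index, DocId id, const std::string& text,
                  const std::set<std::string>& exclusions) {
    assert(id >= 0);  // this sketch uses non-negative document ids
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        word = lower(word);
        if (exclusions.count(word) == 0) {
            index[word].insert(id);
        }
    }
}

// Merge two indexes into one: for each word, take the union of both document sets.
Index merge(const Index& a, const Index& b) {
    Index out = a;
    for (const auto& entry : b) {
        out[entry.first].insert(entry.second.begin(), entry.second.end());
    }
    return out;
}
```

With an exclusion list of "and" and "the", indexing the text "The wolf and
the poodle" as document 1 stores only "wolf" and "poodle". Merging that with
a second index that also contains "wolf" gives a single "wolf" entry listing
both documents.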
I wish you all a Merry Christmas and a Happy New Year!
|
Baard Haafjeld | When you give a wolf a poodle cut, you
Norwegian TeleNor Research | don't get a show dog but a pissed wolf.
SMTP-mail: Baard.Haafjeld@tf.tele.no | -Robert Asprin