> Does anyone know of an engine that does this yet? We're talking with some
> SGML experts so that we can figure out the right technical strategy and
> we're talking with everyone on this list about HTML...
Check out Harvest:
http://rd.cs.colorado.edu/harvest/
It has a gatherer that uses Essence to do customized extraction. Lots of
theory on it, all of which makes me dizzy. I'm beta testing Harvest now,
and it is some pretty powerful stuff, with strong indexing, replication,
caching, etc.
> Clearly the potential is tremendous, but the search engines have to have a
> document model that mirrors the right aspects of structured text.
The Harvest system first uses heuristics to determine what the file is --
HTML, FAQ, etc. -- and then, since it knows about the structure of those
file types, it can extract type-specific info. I'm using it to build in
bibliographic information via META tags (as you mention below).
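
A rough sketch of that two-step idea in Python, just to make it concrete.
This is not Harvest or Essence code: guess_type, MetaExtractor and summarize
are made-up names, and the type heuristics here are far cruder than what the
real gatherer does.

```python
from html.parser import HTMLParser


def guess_type(text):
    """Crude content-type guess (hypothetical; not the Essence heuristics)."""
    head = text[:1024].lower()
    if "<html" in head or "<head" in head or "<title" in head:
        return "HTML"
    if "frequently asked questions" in head or "faq" in head:
        return "FAQ"
    return "Text"


class MetaExtractor(HTMLParser):
    """Collects the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            # META names are case-insensitive in practice, so normalize them.
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(text):
    """Guess the type first, then run the extractor that fits that type."""
    kind = guess_type(text)
    fields = {"type": kind}
    if kind == "HTML":
        parser = MetaExtractor()
        parser.feed(text)
        parser.close()
        fields["title"] = parser.title.strip()
        fields["meta"] = parser.meta
    return fields
```

Calling summarize() on an HTML page gives back a dict with the guessed type,
the title, and the META name/content pairs, which is the kind of fielded
information an indexer could load into its attribute fields.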
> FYI, most engines at best just have the notion of "zones" -- phrases,
> sentences and paragraphs -- and attributes, which would be fielded
> meta-information. We are designing a generic capability to take header
> information from an HTML document and put it into our attribute fields.
> I'm digging up the old "META" discussion to see what, if anything, we
> decided is a minimal standard.
How Harvest does this is pretty interesting. It builds what is known as a
SOIF (Summary Object Interchange Format) object, similar to an IAFA
template, that summarizes the object as a set of attribute-value pairs.
Lots of things happen after that, but it allows for sophisticated queries.
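
To give a feel for what a SOIF summary looks like, here is a small Python
sketch. It assumes the layout as I understand it from the Harvest
documentation: a "@TYPE { URL" header line, then one "name{byte-count}:"
line per attribute with a tab before the value, then a closing brace. The
to_soif helper and the example URL are hypothetical, so check the Harvest
manual before relying on the exact syntax.

```python
def to_soif(template_type, url, attributes):
    """Serialize attribute-value pairs in a SOIF-like layout (assumed format)."""
    lines = ["@%s { %s" % (template_type, url)]
    for name, value in attributes.items():
        # The number in braces is the size of the value in bytes.
        size = len(value.encode("utf-8"))
        lines.append("%s{%d}:\t%s" % (name, size, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


record = to_soif(
    "FILE",
    "http://example.com/report.html",  # placeholder URL
    {"Title": "Report", "Author": "A. Author"},
)
print(record)
```

Because every value carries its own length, a reader of the record does not
have to guess where a multi-line value ends.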
Paul Everitt V 703.785.7384 Email Paul.Everitt@cminds.com
Connecting Minds, Inc. F 703.785.7385 WWW http://www.cminds.com/