Our engine builds a queryable index of the attributes, as
well as a full-word index. The question I was asking was how
we should tag attributes in the HTML documents so that they can be
captured by a spider or other indexing tool.
However the indexable attributes are tagged in HTML documents, I
believe it is important that there should be some way to closely
associate index entries with the text they are about. In other words,
the index entries should be immediately next to (before or after) the
paragraph or list item.
This is especially important for larger documents, but it helps even for
small ones. Both users and authors benefit. Searching tools
can refer to the relevant part of the document, as if the tag were an
anchor. The author of the document is reminded of the index entries
and can correct them as needed or move them along with the text during
editing.
Are meta tags only allowed in the HEAD of a document? If so, I don't
believe they are sufficient for indexing. Probably some filter should
be used to extract the embedded index tags and store them separately.
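
To make the filter idea concrete, here is a minimal sketch in Python. It
assumes a hypothetical <index term="..."> tag placed right after the
paragraph or list item it describes; no such tag is defined in HTML, and
the names IndexTagFilter and extract_index are invented for this
illustration. Each entry is stored with the number of the block it
follows, so a searching tool could point at that part of the document.

```python
from html.parser import HTMLParser


class IndexTagFilter(HTMLParser):
    """Collect hypothetical <index term="..."> tags (an assumed convention)."""

    BLOCK_TAGS = {"p", "li"}

    def __init__(self):
        super().__init__()
        self.block_number = 0  # count of <p>/<li> blocks opened so far
        self.entries = []      # (term, block_number) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.block_number += 1
        elif tag == "index":
            # Attach the entry to the most recently opened block.
            term = dict(attrs).get("term")
            if term:
                self.entries.append((term, self.block_number))


def extract_index(html_text):
    """Return the embedded index entries so they can be stored separately."""
    parser = IndexTagFilter()
    parser.feed(html_text)
    parser.close()
    return parser.entries


doc = """
<p>Caching is required for scalability.</p>
<index term="caching">
<ul>
<li>Spiders collect attributes.</li>
<index term="spider">
</ul>
"""
print(extract_index(doc))
```

An index tag that appears before any paragraph gets block number 0 here;
a real scheme would have to decide whether entries attach to the text
before or after them.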
I'm not sure that being able to exchange the indexes is the ultimate
solution, but it's a very interesting one that we'd like to enable.
While I am suspicious of any scheme involving manual replication,
automatic replication (caching) will be required for scalability.
Caching indexes is the easy part, but caching services, such as
search services, is trickier.
Daniel LaLiberte
National Center for Supercomputing Applications
liberte@ncsa.uiuc.edu
(Fight interface copyrights and software patents.
Join the League for Programming Freedom: lpf@uunet.uu.net)