site-index.pl, Perl Script to index WWW sites, version 0.2...

Robert S. Thau (rst@ai.mit.edu)
Fri, 1 Apr 94 12:12:51 EST


I have a new version of site-index.pl, a Perl script which largely
automates the job of building local indexes for Martijn Koster's ALIWEB
(and, perhaps, other future services), at sites running NCSA httpd.
Documentation, and pointers to code, are available at:

http://www.ai.mit.edu/tools/site-index.html

The script works by looking for HTML documents keyed with metainformation,
using the <META ...> tag which has been proposed on this list as a possible
feature for HTML+. As of this version, the script can also build
multiple indexes (e.g., one for information of local interest only, and one
for export). As an example of the metainformation, site-index.html itself
has the <META> info:

<meta name="keywords" value="resource discovery, site management, tools">
<meta name="description"
value="This is the documentation for site-index.pl, a tool which allows
administrators of sites running NCSA httpd to largely automate the
construction of local indexes for services such as Martijn Koster's ALIWEB">
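
To make the scanning step concrete, here is a minimal sketch of pulling
name/value pairs out of <META> tags like the ones above. It is written in
Python purely for illustration and is not code from site-index.pl (which is
Perl). The names MetaCollector, extract_meta and SAMPLE_PAGE, and the sample
page's title, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the <meta> tags follow the ones quoted above,
# the <title> text is made up for this sketch.
SAMPLE_PAGE = """<html><head>
<title>site-index.pl documentation</title>
<meta name="keywords" value="resource discovery, site management, tools">
<meta name="description"
value="This is the documentation for site-index.pl, a tool which allows
administrators of sites running NCSA httpd to largely automate the
construction of local indexes for services such as Martijn Koster's ALIWEB">
</head><body></body></html>"""


class MetaCollector(HTMLParser):
    """Collect <meta name=... value=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name")
            value = attrs.get("value")
            if name and value is not None:
                # Names are case-insensitive; values may span several
                # lines, so collapse runs of whitespace.
                self.meta[name.lower()] = " ".join(value.split())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_meta(html_text):
    """Return (title, {meta name: value}) for one HTML document."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    return title, parser.meta


title, meta = extract_meta(SAMPLE_PAGE)
```

A file whose meta dictionary comes back empty would simply be skipped in
this sketch, since it carries no metainformation to index.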

There are configuration flags which determine whether the script follows
symlinks, where it puts its output, etc. (There's also an undocumented
flag which will cause the script to build an index including every HTML
file you have with so much as a title, whether or not it has any <meta>
information. I put it in for debugging only, and I can't imagine that
it's the right thing for very many people, but if you really think you
want it, it's in there.)

A note of possible wider interest --- I've changed the NAMEs of the
<META> information to which the script responds. The current lot are:

description --- used to fill in the Description field of the IAFA
templates (i.e., index entries) which the script builds.

keywords --- used to fill in the Keywords field of the index entries

resource-type --- what sort of thing this HTML file is. The default
(if none specified) is 'document', which is almost always appropriate;
however, cover pages for search engines, input forms, and the like
may be more appropriately indexed as being a 'service' (which is the
other recognized value).

distribution --- if the script is configured to build multiple indexes,
this meta-datum is used to determine which index is appropriate for
a particular file. The mapping between distributions and the names
of the index files is configurable, as is the distribution used for
documents which don't specify any.

(These names are case-insensitive; also, the 0.1 meta-names,
'iafa-description', 'iafa-keywords', and 'iafa-type', are deprecated but
still work.)
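
The naming rules above can be sketched as follows. This is a Python
illustration under stated assumptions, not the script's own (Perl) code:
the pairing of 'iafa-type' with 'resource-type' is inferred from the names,
the distribution labels and index file names in INDEX_FILES are invented
placeholders for whatever a site configures, and the field layout emitted by
format_entry only approximates an IAFA template. Letting a current name win
over a deprecated one, and returning None for an unknown distribution, are
choices of this sketch.

```python
# Deprecated 0.1 names and their current equivalents. The last pairing
# is an inference from the names, not something stated outright.
DEPRECATED_NAMES = {
    "iafa-description": "description",
    "iafa-keywords": "keywords",
    "iafa-type": "resource-type",
}

DEFAULT_RESOURCE_TYPE = "document"

# Hypothetical site configuration: distribution label -> index file.
INDEX_FILES = {"local": "local-index.txt", "global": "export-index.txt"}
DEFAULT_DISTRIBUTION = "global"


def normalize_meta(raw):
    """Lowercase meta names, fold deprecated names, apply the default type."""
    meta = {}
    for name, value in raw.items():
        key = name.strip().lower()
        if key in DEPRECATED_NAMES:
            # A current name, if present, takes precedence (sketch choice).
            meta.setdefault(DEPRECATED_NAMES[key], value)
        else:
            meta[key] = value
    meta.setdefault("resource-type", DEFAULT_RESOURCE_TYPE)
    return meta


def index_file_for(meta):
    """Pick the index file for a document, or None if its label is unknown."""
    label = meta.get("distribution", DEFAULT_DISTRIBUTION)
    return INDEX_FILES.get(label.strip().lower())


def format_entry(uri, title, meta):
    """Render one index entry; the exact field layout is approximate."""
    lines = [
        "Template-Type: " + meta["resource-type"].upper(),
        "Title: " + title,
        "URI: " + uri,
    ]
    if "description" in meta:
        lines.append("Description: " + meta["description"])
    if "keywords" in meta:
        lines.append("Keywords: " + meta["keywords"])
    return "\n".join(lines) + "\n"


example = normalize_meta({"IAFA-Keywords": "tools", "Distribution": "local"})
example_target = index_file_for(example)
```

Keeping the alias table and the distribution map as plain data mirrors the
point that both mappings are configurable rather than hard-wired.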

A few final notes:

The script should be getting the description for the index entries out of
an HTML+ <ABSTRACT>, if one is present, but it doesn't do that yet. (I
haven't yet figured out what to do with formatting tags in the abstract.
That problem also arises with <TITLE>s, BTW, but there I'm just stripping
them out).
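
For the tag-stripping approach mentioned for <TITLE>s, a rough Python
sketch might look like the following. The function name strip_tags is made
up for this example, and the regular expression is a deliberately crude
assumption: it treats anything between '<' and '>' as a tag, so it would
misfire on a '>' inside an attribute value.

```python
import re

# Crude tag matcher: anything from '<' up to the next '>'.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(fragment):
    """Drop formatting tags and collapse whitespace in a text fragment."""
    text = TAG_RE.sub("", fragment)
    return " ".join(text.split())


cleaned = strip_tags("The <em>site-index</em>\n  <b>script</b>")
```

The same helper could in principle be applied to an <ABSTRACT>, though
simply discarding the formatting there may lose more than it does in a
title.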

Also, it should be fairly easy to adapt the script to servers other than
NCSA --- the NCSA dependencies are entirely confined to one function which
reads the config files.

rst