The purpose of this request is to get feedback on maintenance requirements
in order to incorporate them into pre-release versions of MOMspider.
In addition, I have already uncovered some problems that will (eventually)
require additions to the HTTP and HTML specifications and would like to
hear what others think of these additions.
I am sending this message to both the www-talk mailing list and to
the newsgroup comp.infosystems.www in order to maximize responses both
from programmers and infobase authors. Since I will be actively reading
both areas, feel free to respond to those lists or by personal e-mail
addressed to <fielding@ics.uci.edu>. Please tell me if you want your
comments to remain private.
This began as a class project (for Dr. Mark Ackerman's graduate course
on Distributed Information Services here at UCI) but will continue
indefinitely in order to support our own maintenance needs. Once a
stable version of MOMspider is developed, it will be publicly distributed
as freeware (including source code) with the usual public-domain terms.
BTW, I do know about James Pitkow's html_analyzer program, so there's
no need to remind me of it. That program handles different, although
complementary, aspects of HTML maintenance (consistency and completeness)
which will be ignored by MOMspider. This should in no way be considered
as a replacement for html_analyzer.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Problem
-----------
The documents available at each server site can be considered a form
of information resource database (infobase). The infobase often
contains a wide variety of information in the form of interlinked
documents which are maintained by a number of different document
owners (usually, but not necessarily, the original document authors).
Since this information is rarely static, the structure and contents of
the infobase are likely to change over time. Documents may be moved or
deleted, referenced information may change, and hypertext links may be
broken.
As it grows, the infobase becomes more complex and difficult to
maintain. Such maintenance effort currently relies upon the error
logs of each server (often never relayed to the document owners), the
complaints of users (often not seen by the actual document
maintainers), and periodic manual traversals by each owner of all the
webs for which they are responsible. Since thorough manual traversal
of a web can be time-consuming and boring, the result is that
maintenance is rarely or inconsistently performed and the infobase
eventually becomes corrupted. What is needed is an automated means for
traversing a web of documents and checking for changes which may
require the attention of the human maintainers (owners) of that web.
Proposed Partial Solution
-------------------------
I propose to (at least partially) solve this maintenance problem by
building a robot client which can periodically traverse the webs of a
set of owners, check each web for any changes which may require its
owner's attention, and build a special index document for each owner
that lists out the attributes and connections of their web in a form
that can itself be traversed as a hypertext document. This robot will
be called the Multi-Owner Maintenance spider (MOMspider).
MOMspider will look for three types of document change which may be of
interest to the owner:

   1) referenced objects which cannot be accessed (broken links);
   2) referenced objects with recent modification dates; and
   3) owned objects with past expiration dates.
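
As a rough illustration only (this is not MOMspider code), the three checks
listed above could be sketched in Python as follows. The dictionary keys
('accessible', 'owned', 'modified', 'expires') and the seven-day meaning of
"recent" are assumptions made for this example; the proposal does not
define them.

```python
from datetime import datetime, timedelta

def interesting_changes(obj, now, recent_days=7):
    """Return the reasons an object may need its owner's attention.

    `obj` is a hypothetical record: 'accessible' and 'owned' are booleans,
    'modified' and 'expires' are datetime values or None.
    """
    reasons = []
    if not obj["accessible"]:
        # 1) a referenced object which cannot be accessed
        reasons.append("broken link")
        return reasons  # nothing more can be learned about it
    modified = obj.get("modified")
    if modified is not None and now - modified <= timedelta(days=recent_days):
        # 2) a referenced object with a recent modification date
        reasons.append("recently modified")
    expires = obj.get("expires")
    if obj.get("owned") and expires is not None and expires < now:
        # 3) an owned object with a past expiration date
        reasons.append("expired")
    return reasons
```

Note that the expiration check applies only to owned objects, while the
first two checks apply to anything an owned document references.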
MOMspider will (in effect) recursively traverse each owner's web from
a specified "top" document down to each leaf node, where a leaf node
is defined to be any information object which is not of document-type
HTML, is owned by a different or unspecified owner, or is interactively
generated by some program (i.e. a script or query). [I may also include
a special mode wherein owner=all and the web is traversed for all links
at a particular site.]
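
To make the leaf-node definition concrete, here is a minimal Python sketch
of the traversal over an in-memory stand-in for a web. It is a sketch under
stated assumptions, not the eventual implementation: the `web` dictionary,
its keys ('type', 'owner', 'generated', 'links') and the function names are
all hypothetical, and a real robot would obtain this information from the
server rather than from a table.

```python
def is_leaf(info, owner):
    """A leaf is non-HTML, owned by someone else (or no one), or generated."""
    return (info["type"] != "text/html"
            or info.get("owner") != owner
            or info.get("generated", False))

def traverse(web, top, owner):
    """Walk one owner's web from `top`; return (documents, leaves).

    `web` maps a URL to a hypothetical record describing that object.
    URLs missing from `web` stand in for broken links and end the walk
    on that branch.
    """
    visited, documents, leaves = set(), [], []
    stack = [top]
    while stack:
        url = stack.pop()
        if url in visited:        # already seen: avoids cycles and repeats
            continue
        visited.add(url)
        info = web.get(url)
        if info is None or is_leaf(info, owner):
            leaves.append(url)    # recorded, but not traversed further
            continue
        documents.append(url)
        stack.extend(reversed(info["links"]))
    return documents, leaves
```

An explicit stack is used instead of recursion so that a deep web cannot
exhaust the call stack; the `visited` set is what stops a cycle such as
A -> B -> A from looping forever.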
As MOMspider analyzes each owner's web, it will build an index
document (complete with cross-references and links to the actual
documents) which contains the following entries:
[Note that this is an index in the traditional sense, not an <ISINDEX>]
   o  Information regarding how and when the index was generated
      (i.e. program options and execution time);
   o  A hypertext link to the immediately preceding version of the
      index document;
   o  The following for each document owned and accessible via the "top":
        -- An anchor which links to the actual document;
        -- Document header info (Title, Modification Date, Expires Date);
        -- A list of all unique hypertext references made by the document.
   o  Each hypertext reference item will include:
        -- The type of reference made (e.g. get, query, ftp, script);
        -- An anchor which duplicates the reference;
        -- Document header info if available (Title and Modification Date);
        -- If the referenced object is owned by the current owner, then an
           additional anchor is provided to cross-reference jump to its
           own entry in the index document.
   o  A list of cross-reference anchors which point to interesting changes
      as reflected in the index entries.
The MOMspider program will be designed in such a way as to maximize
flexibility for each web owner. This will be achieved through the use
of an external configuration file, read at the beginning of the
process, containing owner names and their associated set of program
options, top document URL, and e-mail address. This allows the owner
of a document to be specified by an alias name rather than their true
name or e-mail address, thus preventing broadcast of that information
to all readers of the document.
When a suspected change is found, each owner will have the option of
being automatically notified of the list of changes via an e-mail
message. The message will, as an added convenience, also include a
hypertext link to the owner's generated index document.
A key design constraint of MOMspider is that of efficiency -- both in
terms of execution time and network bandwidth usage. It would be
irresponsible to develop a maintenance robot which overly taxed the
limited resources of the Internet. Therefore, MOMspider will be
designed to keep track of where it has been (to avoid cycles and
needless repetition) and to use the HTTP request "HEAD" to locate
documents and examine their type and owner without transmitting the
document body.
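
For readers who want to see what a HEAD request looks like from a client,
the following is a small sketch using Python's standard urllib. It is an
illustration under my own assumptions, not MOMspider code: the function
names and the ten-second timeout are hypothetical choices.

```python
import urllib.request

def make_head_request(url):
    """Build a request that asks for headers only, not the document body."""
    return urllib.request.Request(url, method="HEAD")

def fetch_headers(url, timeout=10):
    """Send the HEAD request and return (status, headers as a dict).

    urlopen raises urllib.error.HTTPError for error statuses such as 404,
    which a caller could treat as a broken link.
    """
    with urllib.request.urlopen(make_head_request(url), timeout=timeout) as resp:
        return resp.status, dict(resp.headers.items())
```

Combined with a record of URLs already visited, each object would be asked
about at most once per run and its body would never be transferred unless
it had to be parsed for links.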
Implementation Details
----------------------
The MOMspider program will be a WWW client with the expectation that
it will be run on a regular basis (e.g. as a crontab entry).
It will probably be written in Perl, although I may default to C if
I run into problems (as I have much more experience with C). The
initial implementation will be designed to be portable, but I will
only be able to test it on Suns running SunOS 4.1.2. Of course,
since the code will be in the public domain, others will be free to
port it to additional platforms.
Due to deficiencies in the HTTP and HTML specs, the initial versions
will require some small changes to each maintained site's httpd server and
a special information comment entered at the top of maintained HTML files.
Hopefully, these changes will eventually be made obsolete by changes to the
official HTTP and HTML specs.
Each maintained HTML file would include an HTML comment of the following
format as the first line of the file:
<!-- MOM Owner="AnyAuthorAlias" Expires="31 Dec 1993" -->
where the Expires parameter is optional. In BNF (with literals surrounded
by single quotes `'), this would be:
MOMtag ::= `<!-- MOM ' OwnerParam [` ' ExpireParam] ` -->'
OwnerParam ::= `Owner="' AliasName `"'
AliasName ::= CDATA (max 20 characters)
ExpireParam ::= `Expires="' Date `"'
Date ::= DD Mmm YYYY
Note that since this added line is a comment, it will have no effect on
existing servers or clients.
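
The grammar above is simple enough to recognize with a single regular
expression. The Python sketch below is one possible reading of it, not a
reference parser: it assumes the alias is 1 to 20 characters with no
embedded double quote, that the month is a capitalized three-letter
abbreviation, and that trailing whitespace after the tag is allowed. The
names `MOM_TAG` and `parse_mom_tag` are my own.

```python
import re

# One space between parts, exactly as in the BNF; Expires is optional.
MOM_TAG = re.compile(
    r'^<!-- MOM Owner="([^"]{1,20})"'
    r'(?: Expires="(\d{2} [A-Z][a-z]{2} \d{4})")? -->\s*$'
)

def parse_mom_tag(first_line):
    """Return {'owner': ..., 'expires': ...} or None if the line is not a MOMtag.

    'expires' is the raw "DD Mmm YYYY" string, or None when it is absent.
    """
    match = MOM_TAG.match(first_line)
    if match is None:
        return None
    return {"owner": match.group(1), "expires": match.group(2)}
```

For example, the sample line shown earlier yields the owner
"AnyAuthorAlias" and the expiry string "31 Dec 1993", while an ordinary
HTML comment yields None. The date is only checked for shape here, not
for validity.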
HTTP servers which want to serve documents maintainable by MOMspider would
need to parse the above MOMtag and send the information as headers in a
response to any GET or HEAD request for that document. These headers would
appear after the HTTP status response and before the MIME headers.
The OwnerParam would be translated and output as an "Owner:" header
[or should this be WWW-Owner: or MOM-Owner: ???]. The ExpireParam
(if present) would be translated into an RFC850/RFC822 date format
(with time of 00:00:00 GMT) and output as an "Expires:" header as per the
HTTP/1.0 specification. For example, a HEAD request on a document with
<!-- MOM Owner="RTF" Expires="01 Jan 1994" -->
as its first line would result in an OK response something like:
-----------------------------------------------------------
HTTP/1.0 200 OK
Owner: RTF
Expires: Sat, 1 Jan 1994 00:00:00 GMT
Date: Wed Nov 24 09:59:53 1993 GMT
Server: NCSA/1.0a5
MIME-version: 1.0
Content-type: text/html
Last-modified: Wed Nov 24 09:01:39 1993 GMT
Content-length: 32576
-----------------------------------------------------------
[Actually, I just noticed that the Date: and Last-modified: headers
output by NCSA httpd/1.0a5 are not compliant with RFC850. Will this
be fixed in the next version???]
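
The translation a server would perform, from the tag's parameters to the
two extra response headers, can be sketched in Python as below. This is
an assumption-laden illustration rather than server code: the function
name `mom_headers` is hypothetical, and the standard library formats the
day of the month with a leading zero ("01 Jan"), which differs slightly
from the "1 Jan" shown in the example response above.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def mom_headers(owner, expires=None):
    """Turn MOMtag parameters into (name, value) header pairs.

    `expires` is the tag's "DD Mmm YYYY" string, or None if it was omitted.
    The time of day is taken to be 00:00:00 GMT, as the proposal states.
    Month names are parsed with the current locale (English is assumed).
    """
    headers = [("Owner", owner)]
    if expires is not None:
        midnight = datetime.strptime(expires, "%d %b %Y").replace(
            tzinfo=timezone.utc)
        headers.append(("Expires", format_datetime(midnight, usegmt=True)))
    return headers
```

Called as mom_headers("RTF", "01 Jan 1994"), this produces an Owner
header of "RTF" and an Expires header of "Sat, 01 Jan 1994 00:00:00 GMT".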
Proposed Changes to HTML and HTTP
---------------------------------
Obviously, the above method of getting the owner and expiration date
output is a bit of a kluge. I would prefer to have official HTML
metainformation elements for OWNER and EXPIRES which would be optionally
specified within the HEAD element (similar to the TITLE element).
Similarly, the HTTP response would include that metainformation as
appropriate headers (note that this has already been suggested for
the Expires header but I haven't seen any mention of how the expire
date would be obtained from normal HTML files).
The relevant changes to the HTML DTD would be something like this:
-----------------------------------------------------------
<!ELEMENT HEAD - - ( TITLE? & ISINDEX? & NEXTID? & LINK*
& BASE? & OWNER? & EXPIRES? )>
<!ELEMENT OWNER - - CDATA -- Alias name for document owner -->
<!ELEMENT EXPIRES - - CDATA -- Expiration date in RFC850 format -->
-----------------------------------------------------------
One point which I think may spark discussion is whether we should
specify the Owner as a LINK relationship rather than as its own
element. I decided not to do so for reasons of efficiency and
understandability. If the owner were specified as a LINK, MOMspider
(and any similar clients) would have to parse through all the fields
of every LINK header in order to find an owner relationship.
Furthermore, the document author would have to build a contrived
reverse LINK relationship with fields normally used for document
references -- a concept which is counter to understandability and
everything I know about software engineering. I believe that the
notion of document ownership is encountered frequently enough to
justify a special HTML element for that purpose.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Well, that should be enough to generate some healthy debate.
Please, please, please tell me if you think I have missed some
requirement or misinterpreted some part of WWW.
In about a week I will place this document (with corrections and
public comments) on our local WWW server (under construction) and
publish the URL. Once I'm satisfied with the code, I'll also be
looking for additional test sites, so let me know if you want an
advance copy before the general release.
....Roy T. Fielding (fielding@ics.uci.edu)
Department of Information & Computer Science (714)856-4049
University of California, Irvine, CA 92717-3425