forwarded message from connolly@pixel.convex.com

Jean-Francois Groff (jfg@bernd.cern.ch)
Mon, 2 Dec 91 10:08:04 -2300


WWW folks may like to comment on this, posted to wais-talk and
cni-arch... Sorry if you've already read it there !

-- Jean-Francois

------- Start of forwarded message -------

From: connolly@pixel.convex.com
To: wais-talk@Think.COM
Cc: cni-arch@uccvma.BITNET
Subject: Re: Document identifiers
Date: Mon, 02 Dec 91 01:32:36 CST

>The Coalition for Networked Information
>Architectures & Standards Working Group
>
I don't like the direction this technology is headed.

What is the desired functionality of these identifiers?

If you want an identifier that uniquely identifies a file,
why not use a checksum, such as returned by the unix
sum command?
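
As a minimal sketch of what such an identifier might look like, the Python below computes a 16-bit rotating checksum in the style of the BSD sum command. The function name bsd_sum and the sample bytes are illustrative only, and other versions of sum use a different algorithm, so treat this as one assumed variant rather than the definitive one.

```python
def bsd_sum(data: bytes) -> int:
    """Return a 16-bit rotating checksum of data (BSD-style sum, assumed variant)."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit accumulator right by one bit.
        checksum = (checksum >> 1) + ((checksum & 1) << 15)
        # Add the next byte and keep only the low 16 bits.
        checksum = (checksum + byte) & 0xFFFF
    return checksum


if __name__ == "__main__":
    # Hypothetical document contents; the identifier depends only on the
    # bytes, not on where the file happens to live.
    document = b"A full discussion of the BLURF protocol\n"
    print(f"{bsd_sum(document):05d}")
```

A 16-bit value collides readily, so a practical identifier would presumably need a wider checksum.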

Let's see how a checksum solves these issues, and then see
what functionality I'd like to see instead.

>1. The need for identifiers, as distinct from location
>information. This is best handled by a number (much like an
>ISSN or ISBN), but the system must accommodate multiple
>number-assigning agencies. Thus, the identifier is proposed
>as <numbering-authority>,<identifier> where numbering
>authorities are registered.
>
There's no location info in a checksum. Done deal.

>2. The pointers must be representable as an ASCII string to
>facilitate inclusion in a wide range of material, including
>documents and electronic mail.
>
Check.

>3. Location information must support multiple Locations for
>the document, including the "location of record" and one or
>more redistribution centers, local caches, etc. The means of
>specifying a location should be sufficiently general to span
>at least the set of networks covered under the Internet
>Domain Naming system (DNS).
>
Ah! Now we want to be able to get location info out of the
identifier. Checksums don't help. Well, in fact, they help
no more or less than <numbering authority>-<id> helps, unless
a numbering authority implies a location. I'm not clear on
this at all.

>4. Objects may be retrieved by a variety of access
>mechanisms from servers, including FTP, LISTSERV, Z39.50,
>and perhaps FTAM and SQL-based database access, as well as
>requests for paper copies. The location information should
>be sufficiently general to include information about these
>different types of access techniques, and extensible to
>include new access methods that may develop in future.
>
Hmmm... now it looks like the doc id should tell how to
get the document... but not exactly. What we're really looking
for is some client software that interprets these numbers
and queries servers. Checksums look as good as anything again.

>5. Perhaps the location identifier should include some
>information about the format and size of the object; on the
>other hand, perhaps it should not. Discussion?
>
Checksums do not contain type/size info. If that's what we want,
the checksum idea is no good.

>6. It should be possible to further qualify a reference to a
>"sublocation" within an object (which would have meaning
>only to the server that houses it). This is needed, for
>example, for hypertext-type links. Such a sublocation might
>be the 25th paragraph of a text, for a hypertext-type
>pointer.
>
Now we raise the question: just what does a document identifier
identify? Until this item, it appeared that a document was
a file. Now it's not so clear. Perhaps a document should be anything
from a single character to a paragraph to a file to a chapter to
a book to an encyclopedia to a library. That would be a good trick.
Is that what we're after?

>7. Indirection should be supported. In other words, one
>should be able to format the location as the name of a
>server that can be passed the identifier and which would
>return location information. The protocol mechanism(s) for
>doing this need to be specified as well.
>
Ah. Now the objectives of the location info become more clear.
Sounds to me like the location is a TCP connection, or enough
information on how to establish one.

>8. While full rights and permissions data would seem to be
>outside the scope of such a pointer, it might be useful to
>include at least some basic information. This might be an
>indication that the object is not copyrighted and can be
>freely distributed, that it is copyrighted but can be freely
>distributed, that it can be redistributed for noncommercial
>use, or that restrictions apply to redistribution. Also, it
>might make sense to include a pointer of some sort (an
>e-mail address? a host address?) for further information
>about rights.
>
Ack! This stuff seems totally orthogonal to the rest of the
stuff, but in practice, this looks like a crucial issue.
I don't have any good ideas here.

>9. Perhaps there might be some type of checksum that can be
>calculated on the retrieved object to ensure that the
>pointer and the object have not gotten out of synch?
>
This is what sparked the checksum idea.
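
A rough sketch of that synchronization check, assuming the pointer carries a checksum recorded when it was made. CRC-32 from Python's zlib module stands in for whatever checksum would actually be chosen, and the names checksum and in_sync are hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC-32 of the object's bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def in_sync(retrieved: bytes, recorded_checksum: int) -> bool:
    """True if the retrieved object still matches the checksum kept with the pointer."""
    return checksum(retrieved) == recorded_checksum


if __name__ == "__main__":
    original = b"contents of blurf-proto.tex"  # hypothetical object
    recorded = checksum(original)              # stored alongside the pointer
    print(in_sync(original, recorded))                  # same bytes
    print(in_sync(original + b" (revised)", recorded))  # object changed on the server
```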

My response to all this:

I don't think we need [yet another] document identifier format.
If you want location info, use an internet address; if you want
data integrity, use a checksum; if you want format, we are lacking
a standard here; if you want copyright info, ditto.

What we need is some nifty client software to glue all the parts
together. I guess there is some room for standardization, but please:
LET'S LEVERAGE EXISTING SYSTEMS!

Where these systems are robust, I think we should support them. I'd
also like to see support for ad-hoc document identifiers. Here's
an example to clarify:

I'm browsing some email, netnews, or a README file from somewhere.
I see a reference to more info:

    A full discussion of the BLURF protocol is available via
    anonymous FTP from frob.mit.edu as blurf-proto.tex
    in the directory /pub/protos.

I select some or all of that text, and I click one of the buttons
in my document retrieval tool:

make ftp id      -- extract the relevant information and display
                    a well-formed identifier acceptable to some
                    existing FTP client (I've heard of something
                    called ange FTP. Another idea is to make
                    a shell script that would do the retrieval:
                        ftp frob.mit.edu
                        cd /pub/protos
                        get blurf-proto.tex
                    )

make wais id     -- get enough info to make a WAIS doc ID
                    [scrap this unless it stabilizes]
make WWW id      -- same thing for World Wide Web HTTP addresses.
make NNTP id     -- same thing for USENET news message id's.
make LISTSERV id -- you get the idea
                    Rather than making up a new format, these id's
                    are instructions to EXISTING clients to retrieve
                    a document.

verify id        -- connect to the necessary server(s) and verify
                    that the id references an existing document.
                    Append to the id a "verification date," which
                    is the last time a server acknowledged the
                    existence of the document.

get id info      -- connect to the necessary server(s) and get about
                    1K of miscellaneous info: document size in bytes,
                    date of last modification, available formats,
                    short summary, etc.

retrieve raw     -- connect and retrieve the document in whatever
                    format is convenient to the server, e.g.
                    a compressed tar archive of C and troff sources.

retrieve text    -- connect and retrieve the document as
                    plain text [defined, e.g. as the body of an
                    RFC-822 mail message]

retrieve...      -- the user or the supporting client software
                    specifies the supported information formats
                    (compression schemes, archiving formats,
                    image file formats, typesetting languages);
                    the client and the server hash over their options
                    [perhaps with user intervention],
                    and the server sends the most desirable version
                    of the document it has available.
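
To make the "make ftp id" button a little more concrete, here is a small Python sketch that pulls the host, file name, and directory out of the selected text and emits the three ftp commands from the example. The regular expression assumes the exact phrasing of the BLURF reference, and the names FTP_REFERENCE and make_ftp_id are made up for illustration; real references would need far more forgiving matching.

```python
import re
from typing import List, Optional

# Assumed phrasing: "... FTP from <host> as <file> in the directory <dir>".
FTP_REFERENCE = re.compile(
    r"FTP\s+from\s+(?P<host>\S+)\s+as\s+(?P<file>\S+)\s+"
    r"in\s+the\s+directory\s+(?P<dir>\S+)",
    re.IGNORECASE,
)


def make_ftp_id(selection: str) -> Optional[List[str]]:
    """Turn selected text into ftp commands, or None if nothing matches."""
    match = FTP_REFERENCE.search(selection)
    if match is None:
        return None
    # Drop sentence punctuation that trails the directory name.
    directory = match.group("dir").rstrip(".,;")
    return [
        f"ftp {match.group('host')}",
        f"cd {directory}",
        f"get {match.group('file')}",
    ]


SELECTION = """A full discussion of the BLURF protocol is available via
anonymous FTP from frob.mit.edu as blurf-proto.tex
in the directory /pub/protos."""

if __name__ == "__main__":
    for line in make_ftp_id(SELECTION) or []:
        print(line)
```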

If we add a few buttons, we begin to encompass the scope of many existing
systems:

expand           -- change the doc id to reference the "document"
                    containing it. In the ftp example, rather than
                    "get blurf-proto.tex," it would have "ls."
                    Click again and get "cd ..; ls."
                    Obviously, this operation depends on the access
                    mechanism. For WAIS documents, the expansion of
                    a document is the source that contains it.

select           -- narrow the document to some of its parts. For a
                    text file, select some of the characters/paragraphs;
                    for a WAIS source, select some of the documents.
                    For a WWW node, select a neighboring node. For
                    a directory, select some files.
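
For the ftp case, "expand" might be sketched as follows in Python. The FtpId class and its fields are hypothetical, invented only to show the file -> directory -> parent directory progression described above; other access mechanisms would need their own notion of containment.

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FtpId:
    host: str
    directory: str
    filename: Optional[str] = None  # None means "the directory itself"

    def commands(self) -> List[str]:
        """Commands an existing ftp client could run for this id."""
        last = f"get {self.filename}" if self.filename else "ls"
        return [f"ftp {self.host}", f"cd {self.directory}", last]

    def expand(self) -> "FtpId":
        """Reference the 'document' that contains this one."""
        if self.filename is not None:
            # A file expands to the directory that holds it.
            return FtpId(self.host, self.directory)
        # A directory expands to its parent (the root expands to itself).
        parent = posixpath.dirname(self.directory.rstrip("/")) or "/"
        return FtpId(self.host, parent)


if __name__ == "__main__":
    doc = FtpId("frob.mit.edu", "/pub/protos", "blurf-proto.tex")
    print(doc.expand().commands())
    print(doc.expand().expand().commands())
```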

I guess my point is, let's think about how folks are going to use this
document referencing technology, and let's see how well existing systems
meet these needs.

I guess some groups have come to the conclusion that the existing systems
don't cut it. I'm beginning to agree.

I guess we'd all agree that we should decide how we're going to use these
doc id's and let that drive the design of the format, i.e., let's decide
on the methods of this object before we decide on its representation.

[an idea: for syntax, the WAIS folks chose LISP. What about using
something akin to RFC-822 syntax? I think it works well: define a bunch
of standard headers; require some, allow some, disregard others; allow
free-form text in the body. examples:

ISBN: 0-13-590126-X
or
MESSAGE-ID: usenet-thing
or
FTP-HOST: frob.mit.edu
USER: anonymous
or
WAIS-PORT: 8001@think.com

This would allow us to leverage all the email technology out there, plus
the emerging multi-part mail format.
(and it would allow me to use PERL on these beasties! :-)
]
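
One way to try the RFC-822 idea with off-the-shelf mail tooling is sketched here in Python, using the standard library's email package to read a doc id written as headers plus a free-form body. The header names FTP-HOST and USER come from the examples above; the body text and the function name parse_doc_id are invented for the illustration.

```python
from email import message_from_string
from typing import Dict, Tuple

SAMPLE_ID = """\
FTP-HOST: frob.mit.edu
USER: anonymous

Free-form notes about the document go in the body.
"""


def parse_doc_id(text: str) -> Tuple[Dict[str, str], str]:
    """Split an RFC-822-style doc id into (lower-cased headers, body)."""
    msg = message_from_string(text)
    headers = {name.lower(): value for name, value in msg.items()}
    return headers, msg.get_payload()


if __name__ == "__main__":
    headers, body = parse_doc_id(SAMPLE_ID)
    print(headers["ftp-host"])
    print(headers["user"])
    print(body.strip())
```

Unknown headers simply end up in the dictionary, which fits the "require some, allow some, disregard others" approach.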

Another thing I hope folks are keeping in mind: I don't think any one
client can meet the information-retrieval needs of everybody. We need
to support multiple platforms, for one thing. But I hope other folks are
considering using multiple clients at the same time! I'd like to use
one slick X-windows front end to the whole ball of wax, in some ways like
emacs does for programming, and in some ways like the mac GUI does for
office-productivity applications. But I'm going to be using POST mail
servers, NNTP servers, WAIS servers, FTP servers, etc, and I don't
expect one client to do it all. The crucial trick is to make all this
intuitive and interactive, i.e. to support hypertext browsing, fulltext
retrieval, USENET news reading, and maybe email correspondence, all in
one environment. Let's get started!

Dan

------- End of forwarded message -------