A Formalism for Internet Information References
Daniel W. Connolly
$Id: formalism.txt,v 1.4 1994/04/15 20:30:56 connolly Exp $
http://www.hal.com/users/connolly/drafts/formalism.txt
ABSTRACT
========
This is a mathematical model for references between typical
information objects in the internet community; specifically, the model
covers the URL concept from the World Wide Web application and the
external body mechanism from MIME.
I suggest that this model be used to define the requirements for
specifications of new URL schemes and MIME access types, and that it
provides basis for other extensible reference mechanisms such as the
URN and URC mechanisms as variously proposed in the URI working group.
INTRODUCTION
============
We begin with definitions of some commonly used notions:
The set of octets, or bytes:
Octet = { 0, 1, 2, ..., 255 }
Strings, sequences, or streams of bytes:
Using the notation SEQ{S}, short for
{ seq | for some N, seq : {0, 1, 2, ... N} -> S }
OctetStream = SEQ{Octet}
Then we formalize some of the MIME notions:
ContentType = { text/plain, application/octet-stream, ... }
/* just a finite set. Properties or the elements,
aside from identity, are not defined */
Entity = ContentType x OctetStream
And finally, we begin with a function to model the resolution of
references between entities in terms of an as-yet undefined set of
References:
Resolve : Reference -> Entity
While in general, internet resources are do not all fit the definition
of Entity (telnet servers, ftp directories, etc.), the information
units transferred between information clients and servers generally
do. Thus we begin by discussing references to entites, and we define
references to other sorts of resources in terms of the basic Resolve
function.
For example, the basic model of computation for a WWW client goes:
1. Given some reference r0, compute e0 := Resolve(r0)
by sending r0 to the server and getting e0 back.
2. Present e0
3. Let the user choose a reference r1 from e0
(click on an anchor, select a number, drag-select
a range of text, enter some search words, etc.)
4. Compute e1 := Resolve(r1) as above
5. Iterate steps 2 thru 4
THE BASIC WEB "GLOBAL FILESYSTEM" REFERENCE
===========================================
The WWW reference mechnism, the URL, is a generalization of a
filesystem pathname.
Dirs <: SEQ{OctetStream} /* <: meaning subset */
File <: OctetStream
A normal filesystem pathname doesn't satisfy the properties of a
reference for several reasons: The pathname "x/y/z" may reference
different entities depending on the "current directory" of the client
process. So the relation:
Dirs x Filename x Entity
is not a function. This can be addressed by using full pathnames. But
the path "/etc/motd" references different entities on different hosts.
So the WWW application adds hostname information to the path to make
it globally unique.
Host <: OctetStream /* internet host names */
But even "local-file://host/dir/file" can reference different entities
at different times. But if we include a time argument
Time = { real numbers } /* time in seconds since start of
C.E. calendar, written here in
ISO time format: YYYYMMDDHHMMSSZ.
For the purposes of this discussion,
a one second clock resolution is
assumed to be sufficient */
then we have a well defined function:
LOCAL-FILE.GET : (Host x Dirs x File) x Time -> OctetStream
Many filesystems don't associate an explicit ContentType with every
file. But we assume that the following function is well-defined:
Resolve.LOCAL-FILE : Host x Dirs x File x Time -> Entity
For example, we can require that the content type be part of the
reference, or we can use heuristics guided by the filename. So with a
content type ct in hand, we have
Resolve.LOCAL-FILE(h, d, f, t) = (ct, LOCAL-FILE.GET((h, d, f), t))
Or with a content type heuristic
InferType : File -> ContentType
we can have
Resolve.LOCAL-FILE(h, d, f, t)
= (InferType(f), LOCAL-FILE.GET((h, d, f), t))
The local-file: access type shares semantics so far with the anon-ftp:
access type. In general, the ftp RETR operation can be modelled as:
User <: OctetStream
Account <: OctetStream u { {} } /* account is optional */
FTP.GET : (Host x User x Account x Dirs x File) x Time -> OctetStream
Note 1: we don't consider the password, since the retrieved entity does not
vary as a function of the password
Note 2: we don't deal with type ASCII versus IMAGE transfers here
either. I don't think has significant impact on the model, but @@ I
need to think about it some more...
The anon-ftp: access type is a simplification of FTP that obviates the
User and Account parameters:
ANON-FTP.GET : (Host x Dirs x File) x Time -> OctetStream
where
ANON-FTP.GET((h, d, f), t) = FTP.GET((h, "anonymous", {}, d, f), t)
And gopher is similar:
Port = { 0, 1, 2, ... ,65535 } /* IP port number */
Selector = SEQ{Octet - { 9, 13 }} /* no tabs or newlines allowed */
Search <: OctetStream u { {} } /* @@ what subset? */
GType < ContentType /* Gopher types: subset of MIME types */
GOPHER.GET : (Host x Port x Selector) x Time -> OctetStream
Resolve.GOPHER : Host x Port x Selector x Search x GType x Time
-> Entity
where
Resolve.GOPHER(h, p, sel, ser, gt, t) =
(gt, GOPHER.GET((h, p, sel, ser), t))
The HTTP 0.9 model is nearly identical to gopher:
Path = SEQ{ { 33 .. 126 } } /* no whitespace or 8-bit chars */
HTTP0.GET : (Host x Port x Path x Search) x Time -> OctetStream
Resolve.HTTP0 : Host x Port x Path x Search x Time -> Entity
where
Resolve.HTTP0(h, p, path, ser, t)
= (text/html, HTTP0.GET((h, p, path, ser), Time))
The HTTP 1.0 model is significantly more complex. We'll get to that
later. But in general, for a set Location* and a set Context*, if there
are functions g and t and a set s where:
g : Location* x Context* -> OctetStream
t : Location* x Context* -> ContentType
and
For all c in Context* and l in Location*,
Resolve((s, l, c)) = (t(l, c), g(l, c))
then we say that g is a GET function for Context* and Location*, t is
a TYPE function for them, and s is a SCHEME for them.
So far we have that LOCAL-FILE.GET is the GET function for Context* =
Time and Location* = Host x Dirs x File, ANON-FTP.GET is a GET function
for the same set -- hence the need for the distinguishing scheme.
We now have some subsets of the set of References. If s is a SCHEME
for a set Context* and a set Location*, with corresponding GET* and
TYPE* functions, then
For all c in Context*, l in Location*
(s, l, c) is an element of Reference.
since
Resolve((s, l, c)) = (TYPE*(l, c), GET*(l, c))
Here we define several current URL schemes in this manner, and I
suggest that in order to introduce a new URL scheme, we required a
definitions its Location and Context sets along with definitions of
GET and TYPE functions.
LOCAL-FILE:
/* defined somewhat informally, as this is defined in terms
of the local host */
SCHEME = "local-file" (a string is an OctetStream, which is a set)
Location = Host x Dirs x File
Context = Time x (ContentType u { {} })
GET((h, d, f)) given informally:
FILE *fp = fopen(path(d,f)); fread(fp...);
TYPE((h, d, f), (t, ct)) := if ct = {}, InferType(f)
else ct
FTP:
SCHEME = "ftp"
Location = Host x User x Account x Dirs x File
Context = Time x (ContentType u { {} })
GET((h, u, a, d, f)) as per FTP RFC:
CONNECT h
USER u; PASSWD ...
Account a /* unless a is {} */
CWD d1
CWD d2 /*@@ how many CWDs? */
...
RETR f /*@@ type i/a command? */
TYPE((h, u, a, d, f), (t, ct)) := if ct = {}, InferType(f)
else ct
ANON-FTP:
SCHEME = "anon-ftp" /* @@ sometimes written ftp */
Location = Host x Dirs x File
Context = Time x (ContentType u { {} })
GET((h, d, f)) as per FTP RFC:
CONNECT h
USER anonymous; PASSWD <client's email address>
CWD d1
CWD d2 /*@@ how many CWDs? */
...
RETR f /*@@ type i/a command? */
TYPE((h, d, f), (t, ct)) := if ct = {}, InferType(f)
else ct
GOPHER:
SCHEME = "gopher"
Location = Host x Port x GType x Selector x Search
Context = Time
GET((h, p, gt, sel, ser)) as per Gopher Protocol:
telnet h:p
if ser = {}, send sel
else send sel<tab>ser
TYPE((h, p, gt, sel, ser), t) = gt /* @@ other type mechansims? */
HTTP:
SCHEME = "http"
Location = Host x Port x Dirs x File x Search
Accept = { f | f : ContentType -> Real }
Context = Time x Accept
GET((h, p, d, f, s), (t, a)) and
TYPE((h, p, d, f, s), (t, a)) in terms of HTTP GET operation:
compute path from d, f, s
compute accept encoding from a
telnet to h:p
send: GET path HTTP/1.0
Accept: <accept encoding>
As-Of: t /* @@ new HTTP header. Version? */
@@ forms in HTTP in MAILTO; MIME MAIL-SERVER access type
@@ wais, prospero, finger?
RELIABILITY ISSUES WITH FILE (AKA MUTABLE OCTET-STREAM OBJECT) REFERENCES
=========================================================================
The following problem of references to mutable objects is sharted by
several schemes: Suppose an author discovers a file at 1pm and
constructs an HREF to that file; then, the file is modified at 2pm,
and finally a reader resolves the reference at 3pm: the entity the
reader sees is different from what the author saw. That is,
GET(loc, tread) != GET(loc, tauthor)
and hence
Resolve(loc, tread) != Resolve(loc, tauthor)
But in many cases, the author and the reader _can_ read the file at
different times and get the same result. How can we determine if this
is the case?
One way is to use the time of last modification of the file. We can
model this as a function lastmod satisfying:
lastmod : Location* x Time -> Time
where
For all l in Location*, t1, t2 in Time
If t1 >= lastmod(l, t2)
then GET*(t2)(l) = GET*(t1)(l)
Where GET* is a GET function for Location and Time. We'll call such a
function a LASTMOD function for the set Location* and the function
GET*.
Note 1: I don't believe the FTP RFC specifies how to determine the
time of last modification, but it can be determined reliably in many
cases. @@ I need to look into this...)
Note 2: I don't believe Gopher specifies a way to determine the time
of last modification. Perhaps Gopher+ does.
Then, in the above example, we have the following:
If tauthor >= LASTMOD(loc, tread)
Then GET(loc, tread) = GET(loc, tauthor)
and hence
Resolve((s, loc, tread)) = Resolve((s, loc, tauthor))
That is, if the reader can determine that the file hasn't changed
since the link was made, s/he can have confidence in the integrity of
the link. If the mod time is later, the client software can warn the
reader before transferring the possibly erroneous data.
So one way to enhance the typical HREF="local-file://host/dir/file" to
be a well-defined Reference is to determine the time context of the
reference. This can ben done in any of several ways:
1. Enhance local-file URLs to contain time info; e.g. the author
writes:
local-file://host/dir/file;time=19940414141000Z
2. Enhance the HTML A element to contain time info; e.g.:
<A HREF="local-file://host/dir/file" TIME="19940414141000Z">
3. If the reference is in a file whose modification time is known,
we can make the heuristic assumption that all the references
in the file were current as of the last modification of the
file.
Either of the first two is tedious for hand-edited documents. But they
are simple to implement in any cut-n-paste or "paste link" feature.
They also work nicely to fill in the missing content type.
<A HREF="ftp://host/dir/file;user=connolly;account=general"
date="19940414141000Z"
content-type="application/postscript">
The third option is more convenient, though less robust. Consider:
1pm: file Y written with info on apples
2pm: file X says "see file Y for more on apples"
3pm: file Y modified to have info on oranged rather than apples
4pm: file X modified, but still says "see Y for apples"
5pm: reader of X cannot detect (by machine) that
the link to Y is corrupt.
Also: references get copied and pasted, quoted, etc. The time context
of the reference should travel with the reference.
Another technique for detecting faults in resolving references is to
include the MD5 signature of the entity in the reference. This
mechanism is somewhat expensive: it only detects faults _after_
transmission of the octet stream, and it requires computation
proportional to the length of the entity.
A much less reliable technique is to include the octet count of the
entity in the reference. The FTP RFC includes a mechanism to detect
this type of fault before transmission of the entity. (i.e. you can
ask an FTP server how big a file is before you transfer it.)
STRING REPRESENTATIONS OF REFERENCES
====================================
The string representation of a reference is an important part of the
specification of various applications. But it is not clear that we
need one string representation for all applications. It is much more
critical that we understand the properties of the objects we are
trying to represent. Once we have a definition of the set of object
we're representing, we can choose various means to represent the set
in various contexts, with well-defined mappings between syntaxes, for
example, between W3's HREF="xxx" and MIME's external-body; access-type="xxx"
syntaxes.
A definition of a scheme syntax may include incomplete string
representations, as long as it discusses mechanisms and/or heuristics
to determine the remaining parts of a well-defined reference.
LOCAL-FILE:
string rep((h, d, f), (t, ct)) given in perl as
&file_uri("local-file", $h, *d, $f, $t, $ct)
where d is a list of the directories
and t is in ISO time format
sub file_rep{
local($scheme, $host, *dirs, $file, $time, $content_type) = @_;
local($sep, $ret);
if(@dirs){ $sep = "/"; }
grep($_ = &uri_escape($_), @dirs);
$ret = $scheme . "://"
. &uri_escape(host)
. "/" . join('/', @dirs)
. $sep . $file;
if($time){
$ret .= ";as-of=" . $time; # @# better name for as-of?
}
if($content_type){
$ret .= ";content-type=" . &uri_escape($content_type);
}
return $ret;
}
sub uri_escape{
local($_) = @_;
s-[/?=;:]-sprintf("%%%02X", ord($&))-ge; # list of illegal chars?
return $_;
}
FTP:
string rep((h, u, a, d, f), (t, ct)) given in perl as
&file_uri("ftp", "$u@$h", *d, $f, $t, $ct) # @@ account?
where d is a list of the directories
and t is in ISO time format
NOTE: the string representation is often written without the
time and content-type information. In those cases, the content
type should be inferred from the filename and the time should
be inferred from the context of the reference.
ANON-FTP:
string rep((h, d, f), (t, ct)) given in perl as
&file_uri("anon-ftp", $h, *d, $f, $t, $ct)
where d is a list of the directories
and t is in ISO time format
NOTE: the string representation is often written without the
time and content-type information. In those cases, the content
type should be inferred from the filename and the time should
be inferred from the context of the reference.
GOPHER:
string rep((h, p, ty, sel, ser), t) given in perl as
&file_uri("gopher", $p == 70 ? $h : "$h:$p", (),
>ype_char($ct) . $sel, $t)
. $ser ? "?$ser" : ""
where d is a list of the directories
NOTE: the string representation is often written without the
time information. And unfortunately, there are no mechanisms
for determining whether gopher references with different times
refer to the same entity. @@md5 mechanism
HTTP:
string rep((h, p, d, f, s), (t, a)) given in perl as
&file_uri("http", $p == 80 ? $h : "$h:$p", *d,
$f, $t, max(a))
. $s ? "?$s" : ""
where d is a list of the directories t in ISO time format
NOTE 1: the string representation is often written without the
time. In these cases, the time should be inferred from
context.
NOTE 2: the string representation can only represent a limited
form of the content type metric function, i.e. "type x is
best," and ofthen it is ommitted completely. In those cases,
the client should encode a content type metric function based
on its capabilities and the user preferences.
RELATIVE NAMES
==============
Essentially (@@ need more discussion), if we have
Resolve(r0) = e0
and a name n is found inside e0, we apply the function
Relative : Name x Reference -> Reference
and compute
Resolve(Relative(n, r0))
RELIABLE CACHING AND MIRRORING
==============================
Essentially (@@ need more discussion), if a client wants to
compute Resolve(r0), it can, for example, give the whole r0
reference to a proxy server s, ala:
Resolve(r0) = Resolve((proxy, (s, r0)))
and so the question becomes: is Resolve(r0) in the cache?
In order to do mirroring, we need techniques to distribute information
about circumstances where a client can substitute references, i.e.
where
Resoleve(r') = Resolve(r)
but r' is easier to get to.
This is where heirarchical namespaces come in: we'd like to distribute
information of the form:
For all *,
Resolve("ftp://host1/dir1/*") = Resolve("ftp://host2/dir2/*")
LOCATION-INDEPENDENT REFERENCES (URN's)
=======================================
Rather than:
Resolve : Scheme x Location x Context -> Entity
we might as well write:
Resolve : Scheme x Name x Context -> Entity
for example we can define:
RFC.GET : Name x ContentType -> OctetStream
RFC.FILENAME : Name x ContentType -> File
as
RFC.FILENAME(n, c) = "n.txt" if c is text/plain,
"n.ps" if c is application/postscript
For all time t /* since rfc's don't change */
RFC.GET(n, c) = ANON-FTP.GET(("ds.internic.net", "/pub/rfc",
RFC.FILENAME(n, c)), t)
FUTURE DISCUSSION
=================
@@ security, scalability, reliability (fault detection and tolerance,
caching, migration, compression, encryption, authentication)
Daniel W. Connolly "We believe in the interconnectedness of all things"
Software Engineer, Hal Software Systems, OLIAS project (512) 834-9962 x5010
<connolly@hal.com> http://www.hal.com/%7Econnolly/index.html