Re: Redundancy in links, Davenport Proposal [long]

Jim Davis (davis@DRI.cornell.edu)
Mon, 30 Jan 1995 17:06:44 +0100


Date: Sat, 28 Jan 1995 01:03:10 +0100
From: "Daniel W. Connolly" <connolly@hal.com>

...From the evidence that I have studied, the way to make links more
reliable is not to deploy some new centralized namespace (a la URNs
with publisher IDs), but to put more redundant info in links.

Rather than looking at the web as documents addressed by an
identifier, I think we should look at it as a great big
content-addressable-memory. "Give me the document written by Fred in
1992 whose title is 'authentication in distributed systems'."

I think the same sort of thing that makes for a high-quality citation
in written materials will make for a reliable link in a distributed
hypermedia system. A robust _link_ should look like a BibTeX entry
(MARC record, etc.).

strategy to increase the quality of service in information retrieval.

Theory of Operation
===================

The body of information offered by these vendors can be regarded as a
sort of distributed relational database, the rows being individual
documents (retrievable entities, to be precise), and the columns being
attributes of those documents, such as content, publisher, author,
title, date of publication, etc.

The pattern of access on this database is much like that of many
databases: some columns are searched, and then the relevant row is selected. This
motivates keeping a certain portion of this data, sometimes referred
to as "meta-data," or indexing information, highly available.
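
A minimal sketch of that access pattern, in Python. The field names,
sample rows, and example.org locations are invented for illustration;
none of this is part of harvest itself. Only the small indexing columns
are held in memory, and a search over them selects the row to retrieve.

```python
# Hypothetical meta-data index: one dict per document "row".
# The bulky content column stays with the publisher; only the
# searchable columns and a location are kept highly available.
INDEX = [
    {"author": "Fred", "year": 1992,
     "title": "authentication in distributed systems",
     "location": "http://example.org/fred/auth.ps"},
    {"author": "Fred", "year": 1993,
     "title": "naming in distributed systems",
     "location": "http://example.org/fred/naming.ps"},
]


def search(index, **criteria):
    """Return every row whose columns equal all of the given criteria."""
    return [row for row in index
            if all(row.get(col) == val for col, val in criteria.items())]


matches = search(INDEX, author="Fred", year=1992)
```

The more columns a link supplies, the fewer rows survive the search,
which is the sense in which redundant information makes a link robust.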

The harvest system is a natural match. Each vendor or publisher would
operate a gatherer, which culls the indexing information from the rows
of the database that it maintains. A harvest broker would collect the
indexing information into an aggregate index. This gatherer/broker
collection interaction is very efficient, and the load on a
publisher's server would be minimal. The broker can be replicated to
provide sufficiently high availability.
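
The division of labor between gatherer and broker might look roughly
like the following Python sketch. It is a toy model, not the harvest
software or its protocol: the names gather and Broker, the field list,
and the sample data are all assumptions made for illustration.

```python
def gather(documents, index_fields=("publisher", "author", "title", "date")):
    """Gatherer: reduce each full document to its indexing fields.

    The (possibly large) content is left behind; a location is kept
    so the document can be retrieved later.
    """
    wanted = (*index_fields, "location")
    return [{f: doc[f] for f in wanted if f in doc} for doc in documents]


class Broker:
    """Broker: merges summaries from many gatherers into one index."""

    def __init__(self):
        self.index = []

    def collect(self, summaries):
        self.index.extend(summaries)


# Hypothetical publisher holdings.
publisher_a = [{"publisher": "A", "title": "Guide", "content": "x" * 10000,
                "location": "http://a.example.org/guide"}]
publisher_b = [{"publisher": "B", "title": "Manual", "content": "y" * 10000,
                "location": "http://b.example.org/manual"}]

broker = Broker()
for holdings in (publisher_a, publisher_b):
    broker.collect(gather(holdings))
```

Because only the summaries travel to the broker, the cost to each
publisher is one pass over its own documents.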

Typically, a harvest broker exports a forms-based HTTP searching
interface. But locating documents in the davenport database is a
non-interactive process in this system. Ultimately, smart browsers
can be deployed to conduct the search of the nearest broker and
select the appropriate document automatically. But the system should
interoperate with existing web clients.

Hence the typical HTTP/harvest proxy will have to be modified to not
only search the index, but also select the appropriate document and
retrieve it. To decrease latency, a harvest cache should be collocated
with each such proxy.

Ideally, links would be represented in the harvest query syntax, or a
simple s-expression syntax. (Wow! In surfing around for references, I
just found an example of how these links could be implemented. See the
PRDM project[2].) But since the only information passed from
contemporary browsers to proxy servers is a URL, the query syntax will
have to be embedded in the URL syntax.

I'll leave the details aside for now, but for example, the query:

(Publisher-ISBN: 1232) AND (Title: "Microsoft Windows User Guide")
AND (Edition: Second)

might be encoded as:

harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20User%20Guide;edition=Second
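
One way to produce such a URL, sketched in Python with the standard
urllib.parse module. The function name encode_link is hypothetical, and
the lower-cased attribute names, the ";" separator, and the
"//davenport" form are simply read off the example and the surrounding
text; the details of the encoding are not settled here.

```python
from urllib.parse import quote


def encode_link(index_name, fields):
    """Embed (attribute, value) pairs in a harvest: URL.

    Values are percent-encoded so that spaces, ';' and '=' inside a
    value cannot be mistaken for URL syntax.
    """
    query = ";".join(
        f"{quote(name.lower(), safe='')}={quote(value, safe='')}"
        for name, value in fields
    )
    return f"harvest://{index_name}?{query}"


url = encode_link("davenport", [
    ("Publisher-ISBN", "1232"),
    ("Title", "Microsoft Windows User Guide"),
    ("Edition", "Second"),
])
```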

Each client browser is configured with the host and port of the
nearest davenport broker/HTTP proxy. The reason for the "//davenport"
in the above URL is that such a proxy could serve other application
indices as well. Ultimately, browsers might implement the harvest:
semantics natively, and the browser could use the Harvest Server
Registry to resolve the "davenport" keyword to the address of a
suitable broker.

To resolve the above link, the browser client contacts the proxy and
sends the full URL. The proxy contacts a nearby davenport broker,
which processes the query and returns results. The proxy then selects
any match from those results and retrieves it.
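
The resolution steps above can be sketched as follows. This is a toy
model under stated assumptions: the broker is a list of dicts held in
memory, fetch is whatever callable retrieves a document by location,
and the names parse_link and resolve are hypothetical. A real proxy
would speak HTTP to the client and query a harvest broker over the
network.

```python
from urllib.parse import unquote


def parse_link(url):
    """Split a harvest: URL into (index name, {attribute: value})."""
    scheme, _, rest = url.partition(":")
    if scheme != "harvest":
        raise ValueError("not a harvest: URL: " + url)
    index_name, _, query = rest.lstrip("/").partition("?")
    criteria = {}
    for pair in query.split(";"):
        if pair:
            name, _, value = pair.partition("=")
            criteria[unquote(name)] = unquote(value)
    return index_name, criteria


def resolve(url, brokers, fetch):
    """Parse the link, query the named index, pick any match, fetch it."""
    index_name, criteria = parse_link(url)
    matches = [row for row in brokers[index_name]
               if all(row.get(k) == v for k, v in criteria.items())]
    if not matches:
        raise LookupError("no document matches " + url)
    # Any match will do: all matches are assumed to be replicas.
    return fetch(matches[0]["location"])


# Hypothetical broker contents and a stand-in for real retrieval.
brokers = {"davenport": [
    {"publisher-isbn": "1232", "title": "Microsoft Windows User Guide",
     "edition": "Second", "location": "http://example.org/guide.html"},
]}
result = resolve(
    "harvest://davenport?publisher-isbn=1232;edition=Second",
    brokers,
    fetch=lambda location: "content of " + location,
)
```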

Through careful administration of the links and the index, all the
matches should identify replicas of the same entity, possibly on
different ftp/http/gopher servers. An alternative to manually
replicating the data on these various servers would be to let the
harvest cache collocated with the broker provide high availability of
the document content.
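
If the matches really are replicas, the retrieving side can treat them
as interchangeable and fall back from one to the next. A minimal Python
sketch of that idea; fetch_any and the flaky_fetch stand-in are
hypothetical, and retrieval failures are assumed to surface as OSError.

```python
def fetch_any(locations, fetch):
    """Try each replica in turn and return the first successful result."""
    failures = []
    for location in locations:
        try:
            return fetch(location)
        except OSError as exc:
            failures.append((location, str(exc)))
    raise OSError(f"all {len(locations)} replicas failed: {failures}")


def flaky_fetch(location):
    """Stand-in for real retrieval: the ftp replica is down."""
    if location.startswith("ftp:"):
        raise OSError("connection refused")
    return "content from " + location


document = fetch_any(
    ["ftp://ftp.example.org/guide", "http://www.example.org/guide"],
    flaky_fetch,
)
```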

Security Considerations
=======================

The main considerations are authenticity and access control for the
distributed database.

Securely-obtained links (from a CD-ROM, for example) could include the
MD5 checksum of the target document. If the target document changes, a
digital signature providing a secure override to the MD5 could be
transmitted in the HTTP header. Assuming the publishers' public keys
are made available to the cache/proxies in a secure fashion, this
would allow the cache/proxy to detect a forgery. But the link from the
cache/proxy to the client is insecure until clients are enhanced to
implement more of this functionality natively. At that point, the
problem of key distribution becomes more complex.
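
The checksum half of this is easy to sketch with Python's standard
hashlib and hmac modules. The function name verify_md5 and the idea of
carrying the digest as a hex string in the link are assumptions; the
signed override for changed documents is not shown, since it depends on
a key distribution scheme that is left open here.

```python
import hashlib
import hmac


def verify_md5(content: bytes, expected_hex: str) -> bool:
    """Check retrieved content against the MD5 digest carried in a link."""
    actual = hashlib.md5(content).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, expected_hex.lower())


# The digest would come from the securely obtained link;
# here it is computed locally just to exercise the check.
original = b"Second edition, chapter 1"
link_digest = hashlib.md5(original).hexdigest()

ok = verify_md5(original, link_digest)
tampered = verify_md5(original + b" (altered)", link_digest)
```

A cache/proxy would run this check after retrieval and refuse to serve
a document whose digest does not match, unless a valid signed override
accompanies it.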

This proposal does not address access control. As long as all
information distributed over the web is public, this solution is
complete. But over time, the publishers will expect to be able
to control access to their information.

If the publishers were willing to trust the cache/proxy servers to
implement access control, I expect an access control mechanism could
be added to this system. If the publishers are willing to allow the
indexing information to remain public, I believe that performance
would not suffer tremendously. The primary difficulty would be
distributing a copy of the access control database among the proxies
in a secure fashion.

Conclusions
===========

I believe this solution scales well in many ways. It allows the
publishers to be responsible for the quality of the index and the
links, while delegating the responsibility of high-availability to
broker and cache/proxy servers. The publishers could reach agreements
with network providers to distribute those brokers among the client
population (much as GNN is available through various sites).

It allows those cache/proxy servers to provide high-availability to
other applications as well as the davenport community. (The Linux
community and the Computer Science Technical reports community already
operate harvest brokers.)

The impact on clients is minimal -- a one-time configuration of the
address of the nearest proxy. I believe that the benefits to the
respective parties outweigh the cost of deployment, and that this
solution is very feasible.

[1] http://www.acl.lanl.gov/URI/archive/uri-95q1.messages/0080.html
Sun, 22 Jan 1995 12:41:10 PST

[2] PRDM
http://www-pcd.stanford.edu/ANNOT_DOC/annotations.html

[3] http://www.research.digital.com/SRC/larch/larch-home.html

[4] http://www.cs.utexas.edu/~qr/algernon.html