Re: Resource discovery, replication (WWW Announcements archives?)

Markus Stumpf (stumpf@informatik.tu-muenchen.de)
Wed, 4 May 1994 03:52:13 +0200


Daniel W. Connolly writes:
|>Is www-announce and/or comp.infosystems.announce archived? I keep a
|>lot of copies of announcements, thinking "someday I might want to look
|>at that..." and I'd get rid of my local copies if I knew I could
|>replace them if I wanted to...

There is an archive of comp.infosystems.announce at
ftp://ftp.informatik.tu-muenchen.de/pub/comp/usenet/comp.infosystems.announce/
Look at :INDEX for a list of "filename -> subject"

|>It seems to me that a database of these articles, with a WAIS search
|>index, would be an extremely valuable resource discovery
|>application. Possibly more useful than databases currently built by
|>knowbots such as WWWWorm, AliWeb, Veronica, etc.

True! We had one for comp.archives (or rather, we still have one, but
it is really out of date); the problem remains keeping it up to date.
That is not a problem with c.i.a right now, but it has been with c.a.
The size of the database was growing really fast, and there was the
question of which timeframe should be honoured ... 6 months? 1 year?
2 years? Within 2 years you get about 10 - 20 announcements for some
software packages, which makes it hard to find the most recent one,
yet deleting the older ones is a hard job.

I also have a - not publicly available - mail folder of all
www-announce messages from within the last year or so.

|>But beyond that, it allows a distributed solution to the resource
|>discovery problem: Any site could build an index of available internet
|>resources just by archiving news.resources, indexing the contents, and
|>expiring old articles.

Hmmm ... this is exactly the same idea that's behind ALIWEB and IAFA,
isn't it? So why have a new one?

|>This could also be used as a way of distributing information about
|>replicated data. A mirror archive site could post a summary of its
|>contents, with (a) references to its own contents(A), and (b)
|>references to the original materials that it mirrors(B), and (c) a
|>machine-readable indication that A is a copy of B. Then any client
|>looking for any part of B that also has access to c can try looking at
|>A if it's closer or more available.

I think we should really move away from "mirroring" as we know it
today. The approach used by caching HTTPds right now is IMHO more
efficient and more transparent, and you always have the information
where a document came from plus a kind of expiry mechanism.
I really hate all those sites "mirroring" e.g. the Xmosaic
documentation or the like. It is NEVER accurate and up to date, and
you NEVER know whether a copy is self-made or mirrored.
Using caches solves the problem of e.g. slow links and still remains
a kind of pointer to the original.
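
To make that concrete: a cache entry basically only needs the origin
URL, the data itself and an expiry time. A minimal sketch in Python
(all names made up, not code from any existing server):

    import time

    class CacheEntry:
        def __init__(self, origin_url, body, max_age):
            self.origin_url = origin_url   # where the copy came from
            self.body = body
            self.expires = time.time() + max_age

        def fresh(self):
            return time.time() < self.expires

    cache = {}

    def lookup(url, fetch):
        entry = cache.get(url)
        if entry is None or not entry.fresh():
            # stale or missing: refetch from the original, new expiry
            cache[url] = entry = CacheEntry(url, fetch(url), max_age=86400)
        return entry.body

So the cache never pretends to be the original; it just keeps a dated
copy and knows where to go when that copy gets too old.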

|>We also need clients to have access to a database of these "A is a
|>copy of B" factoids. I think we should extend HTML to express this,
|>ala:
|> See <A HREF="http://host1/dir1/dir2/file">more info</A>
|> <REPLICA ORIGINAL="http://host1/dir1/" COPY="ftp://host2/dir3/">
|>Then, any client that parses this document would know that it can
|>retrieve http://host1/dir1/dir2/file as ftp://host2/dir3/dir2/file
|>if it prefers. It could also scribble that REPLICA factoid into a
|>database for use in other queries. It can also scribble the factoid
|>away in memory for use in other queries.
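
Just to spell out what such a REPLICA factoid buys the client: the
rewrite is a plain prefix substitution. A minimal sketch (the helper
name is made up, this is not code from any existing client):

    def apply_replica(url, original, copy):
        # "A is a copy of B": any URL under ORIGINAL can be fetched
        # from the same relative path under COPY instead
        if url.startswith(original):
            return copy + url[len(original):]
        return url

    # apply_replica("http://host1/dir1/dir2/file",
    #               "http://host1/dir1/", "ftp://host2/dir3/")
    # -> "ftp://host2/dir3/dir2/file"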

I am currently working on a caching-only server that would also
allow for some hierarchy in caching. This would solve the problem
completely without the client having to know about net topology and
replication.
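
Roughly, as a sketch only (none of this is finished code): each
caching-only server answers from its own cache if it can, asks its
parent cache otherwise, and only the topmost one talks to the origin
server. The client only ever talks to its nearest cache:

    def resolve(url, cache, parent, fetch_origin):
        # local cache first
        body = cache.get(url)
        if body is not None:
            return body
        if parent is not None:
            body = parent(url)        # ask the next cache up the hierarchy
        else:
            body = fetch_origin(url)  # only the top cache contacts the origin
        cache[url] = body             # every level keeps a copy on the way back
        return body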

\Maex

-- 
______________________________________________________________________________
 Markus Stumpf                        Markus.Stumpf@Informatik.TU-Muenchen.DE 
                                http://www.informatik.tu-muenchen.de/~stumpf/