Re: No Nasty Robots! [Was: Full-text indexing for WWW conference avail. ]

Nick Arnett (narnett@verity.com)
Thu, 13 Oct 1994 09:44:25 -0800


At 6:47 AM 10/13/94 +0100, Daniel W. Connolly wrote:

>I expect this would require installing Verity's Topic at the various
>information providers' sites. Not practical, I expect.

If we released our spider, people could build local indexes and then ftp
them here to be served, so that they wouldn't have to have the whole ball
o' wax. But as I mentioned, it's pre-release software.

We are absolutely NOT releasing this version, precisely because it is
pre-release. We won't run it against anyone's site without their
permission, which is why I asked, of course. When we've given it to a
handful of pre-release customers so that they could index their own
servers, we've made sure that they understand what it is, what it does and
why they absolutely must not use it on others' servers without permission.
It is designed to index just one directory tree of one site, so the load it
can create is limited.
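
To give a sense of what that scoping amounts to (a minimal sketch in
Python, with made-up names -- not our actual code), a crawler can confine
itself to one tree simply by rejecting any URL that doesn't share the
starting prefix:

    from urllib.parse import urlparse

    def in_scope(url, root):
        # Confine the crawl to a single directory tree: the URL must be
        # on the same host, and its path must begin with the root path.
        u, r = urlparse(url), urlparse(root)
        return u.netloc == r.netloc and u.path.startswith(r.path)

    # in_scope("http://host/docs/a.html", "http://host/docs/")  -> True
    # in_scope("http://other/b.html",     "http://host/docs/")  -> False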

We have every intention of building an *ethical* tool for mining
information from the Internet, one that follows the guidelines, including
limitations on how often it will hit a particular site.
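
On that last point, the principle is simple enough to sketch (again in
Python, with an assumed politeness interval -- this is an illustration,
not our shipping code): enforce a minimum delay between requests to any
one host, and the load on any single site is capped.

    import time
    from urllib.parse import urlparse

    MIN_DELAY = 60.0   # assumed: at most one hit per host per minute
    last_hit = {}      # host -> time of our previous request there

    def polite_fetch(url, fetch):
        # Sleep until at least MIN_DELAY has passed since our last
        # request to this host, then record the hit and fetch.
        host = urlparse(url).netloc
        wait = MIN_DELAY - (time.time() - last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        last_hit[host] = time.time()
        return fetch(url)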

If someone can tell me how to make our search engine the Internet standard,
then I'll be happy to give everyone on the net the tools to make their own
TOPIC indexes. But I think the reality is that it's a multi-standard
world, so people will build their own spiders, robots, etc., in order to
get indexes into their own formats. Perhaps the URI system, when it gets
here, will make the whole process more efficient.

One might notice that for many months, I've had a few pointers to key net
information resources on my server at MCC... among them is a pointer to the
Nexor page on robots. In other words, we're paying attention, folks.

I think we should all realize the inevitability of publishers who will
make a living out of creating subject- and discipline-oriented network
indexes, for which they legally don't need the permission of the server
owners. People have already designed server tools to stop them: gizmos
that lock out a client if it hits a server too often. I expect that kind
of capability to get built into commercial servers eventually.
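
Such a lockout needn't be elaborate. A sketch of the idea (hypothetical
thresholds and names; I'm not describing any particular server's
implementation):

    import time

    WINDOW = 60.0    # seconds of request history to keep (assumed)
    MAX_HITS = 30    # requests a client may make per window (assumed)
    hits = {}        # client address -> times of its recent requests

    def allow(client):
        # Discard requests older than the window, record this one, and
        # lock the client out once it exceeds the per-window limit.
        now = time.time()
        recent = [t for t in hits.get(client, []) if now - t < WINDOW]
        recent.append(now)
        hits[client] = recent
        return len(recent) <= MAX_HITS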

Meanwhile, I understand and appreciate the cautions that are expressed
here. We intend to build very powerful information navigation tools, so we
need to be thoroughly aware of proper design.

Nick