W4 took many hours (maybe 20) to run, but I don't remember exactly, because
it saves state, so I could kill it and restart it whenever I wanted. In
total, W4 found more than 17,000 http documents (it didn't follow any other
kinds of links) on more than 125 unique hosts. In the current version, it
retrieves *only* the URL of each document.
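The save-and-restart behavior can be sketched roughly like this (a minimal
illustration, not the actual W4 code; the state filename and JSON format are
my own assumptions):

```python
# Hypothetical sketch of restartable crawl state, NOT the actual W4 code.
# The queue of pending URLs and the set of already-seen URLs are written to
# disk, so the crawler can be killed and resumed at any point.
import json
import os

STATE_FILE = "w4_state.json"  # assumed filename, for illustration only

def save_state(pending, seen):
    # Write atomically: dump to a temp file, then rename over the old state,
    # so a kill mid-write never leaves a corrupt checkpoint.
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"pending": pending, "seen": sorted(seen)}, f)
    os.replace(tmp, STATE_FILE)

def load_state():
    # On restart, resume from the last checkpoint if one exists.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
        return state["pending"], set(state["seen"])
    return [], set()
```

The atomic rename is the important design point: the crawler can die at any
moment without losing or corrupting the checkpoint.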
In the next version, I hope to have it do the following additional things:
o Get the <title>Title</title> of the document
o Get the length of the document
o Do a 'keyword' analysis of the document
o Count the number of links in a document
o Improve on the boredom system
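Two of the items above (getting the title and counting links) can be
sketched with a small parser. This is just an illustration of the idea, not
W4's implementation:

```python
# A minimal sketch (not the W4 itself) of two of the planned features:
# pulling out the <title> and counting <a href=...> links in a page.
from html.parser import HTMLParser

class PageInfo(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ""
        self.link_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            # Count only anchors that actually carry a link target.
            self.link_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = ("<html><head><title>Example</title></head>"
        "<body><a href='http://a/'>a</a><a href='http://b/'>b</a></body></html>")
info = PageInfo()
info.feed(page)
print(info.title, info.link_count)  # prints "Example 2"
```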
By a 'keyword' analysis, I mean looking through the document for words that
appear frequently but aren't normally common words. Additionally, titles
and text appearing in headers would be good candidates for keywords.
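The frequency idea can be sketched as below (the stopword list is a tiny
illustrative stand-in for a real list of common words, and none of this is
W4's actual code):

```python
# Sketch of the 'keyword' idea: count word frequencies in a document and
# drop words that are common everywhere.
import re
from collections import Counter

# Illustrative stand-in for a proper common-word list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def keywords(text, top_n=5):
    # Lowercase, split into alphabetic words, filter out common words,
    # and return the most frequent survivors as candidate keywords.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]
```

Words from titles and headers could simply be given extra weight in the
same counting scheme.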
I'll try to get the current code at least clean enough that I'm willing to
let everyone in the world see it, but if you *really* want to see it now,
send me mail. Any other suggestions would be welcome.
Once this index is produced, it will be searchable via http, and I suppose
via WAIS, though I really detest the way WAIS restricts searches. In any
case, there is a possibility that this will be done by the end of the summer.
Matthew Gray
mkgray@athena.mit.edu