Path: gmdzi!unido!mcsun!uunet!clyde.concordia.ca!news-server.csri.toronto.edu!mailrus!sharkey!math.lsa.umich.edu!math.lsa.umich.edu!emv
From: e...@math.lsa.umich.edu (Edward Vielmetti)
Newsgroups: news.software.b,news.software.notes,news.software.nntp
Subject: Re: netnews storage representation
Message-ID: <EMV.90Jun12215100@stag.math.lsa.umich.edu>
Date: 13 Jun 90 01:51:00 GMT
References: <VIXIE.90Jun12153343@volition.pa.dec.com>
Sender: use...@math.lsa.umich.edu
Organization: University of Michigan Math Dept., Ann Arbor MI.
Lines: 53
Posted: Wed Jun 13 02:51:00 1990
In-Reply-To: vixie@decwrl.dec.com's message of 12 Jun 90 22:33:43 GMT
Nice dodge; if everyone reads news with nntp, you don't need to
store things as one article per file.
I'm from a community where old discussions tend to stay around
"forever" as well (Confer on MTS) so this question hits home.
It's a real shame that we can't dredge out old net.groups articles
from the mists of time and use them in all sorts of creative ways.
I have a couple of useful collections of old netnews articles, and
it would be nice to reconstitute them into something which could
be fetched off an nntp server without too much hassle. In that
way I could (for example) provide a permanent on-line archival
home for comp.archives, and anyone could simply point their
NNTP clients at (some mythical) machine which would be able to
serve the articles up.
I've done a little bit of work here with some software called "Pat"
from Open Text, the Waterloo folks who brought you the New OED
project. Its native representation for text is one file for the whole
text, with a few additional files to hold indices of various sorts.
Pat lets you do very quick searches on a text or a part of a text. If
you were to glue it to netnews, one way might be to take the raw
articles, massage them a little bit to put in markers that delineate
the various header fields and boundaries between articles, then turn
an NNTP request like
article <VIXIE.89Feb27023...@jove.pa.dec.com>
into PAT commands like
docs msgid including "<VIXIE.89Feb27023...@jove.pa.dec.com>"
pr.docs.article
assuming that you've properly delineated the "article" and "msgid"
stuff ahead of time. You might also need to do some cleaning to remove
any tags that were necessary to mark out the various header fields
and reconstruct the original text.
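
A rough sketch of the tagging step in C; the <article> and <msgid>
marker names are invented for this example, not anything PAT requires:

    /*
     * Hypothetical tagger: wrap one article (read on stdin) in
     * <article> markers and its Message-ID in <msgid> markers,
     * so that PAT regions can be built over the result.
     */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[1024];

        printf("<article>\n");
        while (fgets(line, sizeof line, stdin) != NULL) {
            if (strncmp(line, "Message-ID:", 11) == 0) {
                char *lb = strchr(line, '<');
                char *rb = lb ? strchr(lb, '>') : NULL;

                if (lb != NULL && rb != NULL) {
                    printf("Message-ID: <msgid>%.*s</msgid>\n",
                           (int)(rb - lb + 1), lb);
                    continue;
                }
            }
            fputs(line, stdout);
        }
        printf("</article>\n");
        return 0;
    }

The nntpd side is then a one-line translation: pull the message-id out
of the ARTICLE command and interpolate it into the two PAT commands
shown above.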
This could work really well for large static collections of text; my
comp.archives collection is about 4 megabytes, but a few tests on a
Sun SPARCstation suggested that searches take on the order of 0.05 CPU
seconds each. Indexing takes a long time, though: about 26 minutes of
wall-clock time the last time I did a run. That's just on a
SPARCstation, though; I'm sure it would be faster on the DEC gear you
have.
I have heard that Young Minds was going to put together a collection
of old netnews articles on CD-ROM, which I have joked would be the
perfect kind of data to go over for my PhD thesis. I don't know if
these have hit the market yet. It would be very nice to have this
sort of collection accessible by NNTP as well.
--Ed
Edward Vielmetti, U of Michigan math dept <e...@math.lsa.umich.edu>
comp.archives moderator
Path: gmdzi!unido!mcsun!uunet!bfmny0!tneff
From: tn...@bfmny0.BFM.COM (Tom Neff)
Newsgroups: news.software.b,news.software.notes,news.software.nntp
Subject: Re: netnews storage representation
Message-ID: <15585@bfmny0.BFM.COM>
Date: 13 Jun 90 03:33:36 GMT
References: <VIXIE.90Jun12153343@volition.pa.dec.com>
Reply-To: tn...@bfmny0.BFM.COM (Tom Neff)
Followup-To: news.software.b
Lines: 14
Posted: Wed Jun 13 04:33:36 1990
It's interesting to contemplate other storage methods, but really, most
netnews isn't worth keeping around. The fraction that does merit
archiving probably should be archived in some other form than as
"active" news articles.
Maybe what we need is a standard (e.g., RFC) for archived news. The
paradigm would be write-once, read-occasionally. Storage efficiency
could be high, and the potential media varied (CD-ROM, tape, DASD, etc.).
With a solid archival standard, newsreaders (and other utilities) could
be taught to access these auxiliary archives at the user's discretion.
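
To make that concrete, an archive record under such a standard might
look something like this C struct; every name here is invented, since
no such RFC exists:

    /* One conceivable on-media record for archived news: a fixed
     * header followed by the (possibly compressed) article bytes.
     * All names are invented for illustration; this is not an
     * existing format. */
    struct arcrec {
        char          magic[4];    /* identifies the archive format     */
        unsigned long origlen;     /* article length before compression */
        unsigned long storelen;    /* bytes actually stored on media    */
        unsigned char comp;        /* 0 = none, 1 = compress(1), ...    */
        char          msgid[256];  /* NUL-terminated Message-ID         */
    };

On write-once media you would only ever append records like these, with
a separate index mapping Message-IDs to offsets.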
--
There's nothing wrong with Southern California that a || Tom Neff
rise in the ocean level wouldn't cure. -- Ross MacDonald || tn...@bfmny0.BFM.COM
Path: gmdzi!unido!mcsun!uunet!cs.utexas.edu!usc!elroy.jpl.nasa.gov!decwrl!bacchus.pa.dec.com!vixie
From: vi...@decwrl.dec.com (Paul A Vixie)
Newsgroups: news.software.b,news.software.notes,news.software.nntp
Subject: netnews storage representation
Message-ID: <VIXIE.90Jun12153343@volition.pa.dec.com>
Date: 12 Jun 90 22:33:43 GMT
Sender: n...@wrl.dec.com (News)
Organization: DEC Western Research Lab
Lines: 104
Posted: Tue Jun 12 23:33:43 1990
1. Out of Inodes
I'm running up against a new kind of wall in netnews, namely the 4.2bsd FFS
bug wherein you can't have fewer than 16 cylinders in a group and you can't
have more than 2048 inodes in a cylgroup. This means that on a DEC RZ57,
my bytes:inode is approximately 4K, and that even though I've got 1.0GB of
user-writable data blocks available, I have only ~230K inodes. I'm working
with the Ultrix people to see about getting the 4.3-tahoe FFFS (fixed fast
file system) which will allow more than 2048 inodes/cylgroup, but that's
going to take a while since it's not exactly a backward-compatible fix.
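
For the curious, here's the back-of-envelope arithmetic in C; the
cylinder-group count is my own estimate for the RZ57, not a published
figure:

    /* Back-of-envelope check: ~115 cylinder groups (assumed for
     * the RZ57) at the 2048-inode FFS cap, against ~1.0GB of
     * user-writable data blocks. */
    #include <stdio.h>

    int main(void)
    {
        long   cylgroups = 115;               /* assumed           */
        long   inodes    = cylgroups * 2048;  /* ~235K inodes      */
        double bytes     = 1.0e9;             /* ~1.0GB of blocks  */

        printf("%ld inodes, %.0f bytes per inode\n",
               inodes, bytes / inodes);       /* ~4.2K per inode   */
        return 0;
    }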
This is important because, in the traditional netnews storage
scheme, each article is stored in a separate file, thus consuming an inode.
Article #231 of alt.sex.bestiality, for example, is stored in a file called
(usually) /usr/spool/news/alt/sex/bestiality/231. Cross-posted articles,
one of Netnews' strongest architectural features, are handled by creating
multiple hard links -- an article can show up in more than one newsgroup
because it has namelinks in more than one subdirectory of /usr/spool/news.
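
In code, a cross-post is nothing more than an extra link(2) call. A
minimal sketch (the second path is invented):

    /* A cross-posted article is one inode with several directory
     * entries.  The first path is the example above; the second
     * group/number is made up for illustration. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (link("/usr/spool/news/alt/sex/bestiality/231",
                 "/usr/spool/news/alt/flame/452") != 0)
            perror("link");
        return 0;
    }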
2. Archival
One of the requests I see fairly frequently from my users is for archival.
They complain that Netnews isn't as good as VAXnotes, because in Netnews,
articles are expired after a while. This seems so natural to me and so
foreign to them. VAXnotes live forever, which is fairly easy because each
notesfile is permanently stored on its own host. Thus we can keep terabytes
of old VAXnotes around since each host only has to store a small part of it,
and most hosts don't store any of it at all.
Obviously there are a lot of groups we wouldn't want to keep at all, though
keeping the headers of such articles would be useful for measurement purposes.
One option denied to me from the outset is to simply never expire netnews.
I would run out of inodes first, but ultimately I would run out of blocks,
since netnews cannot span file systems (because of the way cross-posted
articles are handled). 1.0GB is currently enough for about 50..75 days of
netnews. This doesn't constitute an "archive" by anyone's standards. In
particular the VAXnotes people often refer back to multi-year-old notes.
Another option is to use the "archive" option present in most versions of
the netnews "expire" utility. This beats not expiring at all, since the
inodes of /usr/spool/news are freed up for future articles. With some
cleverness, one could even arrange to fill up one disk spindle after another
with archived articles, meaning that you could keep online as many articles
as you wanted to buy disk drives for.
3. Fragments
Quite a bit of space is wasted in the tail ends of partially-full blocks.
Given a fragment size of 512 bytes, this wastage will probably average
256 bytes, or 1K for every four articles. 100K articles would waste 25MB
of space.
4. Notes
Notes has a different representation from netnews, which is one of the
reasons it can't cross-post effectively and therefore one of the reasons
Netnews people tend to dislike it. Notes maintains a "notesfile" (think:
"newsgroup" here) in one or more flat files that contain pointers into
each other. There's no wastage of partial blocks since there are no partial
blocks -- new articles are appended (more or less) directly to the end of
the representation file(s).
More recent versions of notes, in particular HP Notes, have some support for
cross-posting. This is done by storing out-of-group pointers in each group
after the first one on the "Newsgroups:" line. When a notesreader sees
a pointer of this kind, it knows to go fetch the article text from some other
notesfile, rather than just reading the text of the note from the current
notesfile.
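
Something like the following struct captures the pointer idea; this is
my own sketch, not the actual Notes or HP Notes on-disk layout:

    /* Illustrative record for an append-only notesfile.  Offsets
     * chain the notes together; a cross-posted note stores a
     * pointer to the notesfile holding the real text instead of
     * the text itself.  Not the real Notes layout. */
    struct noterec {
        long next;         /* offset of the next note; -1 at end  */
        long textoff;      /* offset of this note's text          */
        long textlen;      /* 0 means the text lives elsewhere    */
        char xgroup[64];   /* notesfile holding the real text     */
        long xoff;         /* offset of the note in that file     */
    };

New notes append a record (and its text) to the end of the file, which
is where the no-partial-blocks property comes from.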
5. The NNTPD barrier
Most of my users now use NNTP to read their news. Various holdouts are
being retrained and given new tools, such that shortly, 100% of my readers
will use NNTP to read news here.
This gives me a wonderful opportunity to try other representations for
netnews. As long as I teach nntpd how to read articles from any new
representation I come up with, I am free to choose any representation that
works. Obviously the netnews transport (C News, in my case) will have to
be retrained as well.
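
The abstraction nntpd would need is small; a sketch in C, with invented
function names (nntpd has no such interface today):

    /* Sketch of a per-representation dispatch inside nntpd.  The
     * backend names are invented; each would be implemented once
     * and selected by configuration. */
    #include <stdio.h>

    typedef enum { REP_FILES, REP_NOTES } rep_t;

    extern FILE *file_open_art(const char *msgid);  /* one file/article  */
    extern FILE *notes_open_art(const char *msgid); /* seek in notesfile */

    FILE *open_article(rep_t rep, const char *msgid)
    {
        switch (rep) {
        case REP_FILES:
            return file_open_art(msgid);
        case REP_NOTES:
            return notes_open_art(msgid);
        }
        return NULL;
    }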
6. Possibilities
One of the possibilities I'm thinking about is the current HP Notes
mechanism. It has several advantages:
-> doesn't need a lot of inodes (one or a few per group,
rather than one per article)
-> doesn't waste space on partial blocks
-> doesn't care about file system boundaries
It isn't as elegant about cross-posted articles, but if these can be
made to work then the inelegance will not break my heart.
7. Comments
Any?
--
Paul Vixie
DEC Western Research Lab <vi...@wrl.dec.com>
Palo Alto, California ...!decwrl!vixie
Path: gmdzi!unido!mcsun!sunic!uupsi!rpi!uwm.edu!cs.utexas.edu!samsung!uakari.primate.wisc.edu!dali.cs.montana.edu!mathisen
From: mathi...@dali.cs.montana.edu (Jaye Mathisen)
Newsgroups: news.software.b,news.software.notes,news.software.nntp
Subject: Re: netnews storage representation
Message-ID: <2076@dali>
Date: 13 Jun 90 07:11:17 GMT
References: <VIXIE.90Jun12153343@volition.pa.dec.com>
Reply-To: mathi...@dali.UUCP (Jaye Mathisen)
Followup-To: news.software.b
Organization: Montana State University, Dept. of Computer Science, Bozeman MT 59717
Lines: 57
Posted: Wed Jun 13 08:11:17 1990
In article <VIXIE.90Jun12153...@volition.pa.dec.com> vi...@decwrl.dec.com (Paul A Vixie) writes:
>1. Out of Inodes
>
>I'm running up against a new kind of wall in netnews, namely the 4.2bsd FFS
>bug wherein you can't have fewer than 16 cylinders in a group and you can't
Somebody (I don't think it was recently) posted some patches/comments on
how to make news use symlinks rather than hard links. That doesn't solve
the inode problem, but if one wanted, one could mount an RZ57 for each
directory in /usr/spool/news in order to get a terabyte or more of
storage. Of course, there are all kinds of problems with this from an
OS standpoint, such as mount table sizes and so on, but it's possible.
>
>2. Archival
I got some literature in the mail about the Epoch 1 Infinite Storage
Server, which combines Winchester and optical storage; that might work.
Personally, I try to archive all source/binary groups and all local
groups. Every 30 days or so, I dump the important headers of all
articles in the archive directories, make a listing file, and tar the
whole mess to a cartridge tape. Then I can riffle through the listings
and find the tape. Crude, but doable. With 8mm tape (when it shows up
here!) I might only have to archive once a year.
Maybe somebody should spring for one of those 50-tape 8mm jukeboxes.
>
>5. The NNTPD barrier
>
>
>This gives me a wonderful opportunity to try other representations for
>netnews. As long as I teach nntpd how to read articles from any new
>representation I come up with, I am free to choose any representation that
>works. Obviously the netnews transport (C News, in my case) will have to
>be retrained as well.
I once tried running news on two machines. The first one had all the
NNTP connections for the feeds that I exchanged; it then fed one other
machine that everybody used for reading. I changed mini-inews so that a
posting went to machine 1 for processing, which then sent it on to
machine 2, so it might take a few minutes for an article to show up in
the second machine's spool directory. Machine 1 only stored 2 days'
worth of news, while machine 2 stored 21 days or some such. It worked
out OK, except that machine 1 was a MicroVAX II and was severely
underpowered for that many incoming and outgoing feeds.
If you were to do something similar, you wouldn't have to worry much
about the transport layer for the second machine, something really simple
would suffice.
>
>6. Possibilities
>
>One of the possibilities I'm thinking about is the current HP Notes
>mechanism. It has several advantages:
How does it work?
Path: gmdzi!unido!mcsun!uunet!jarthur!usc!zaphod.mps.ohio-state.edu!tut.cis.ohio-state.edu!rutgers!mcdchg!chinet!les
From: l...@chinet.chi.il.us (Leslie Mikesell)
Newsgroups: news.software.b,news.software.notes,news.software.nntp
Subject: Re: netnews storage representation
Message-ID: <1990Jun15.152814.26515@chinet.chi.il.us>
Date: 15 Jun 90 15:28:14 GMT
References: <VIXIE.90Jun12153343@volition.pa.dec.com>
Organization: Chinet - Chicago Public Access UNIX
Lines: 34
Posted: Fri Jun 15 16:28:14 1990
In article <VIXIE.90Jun12153...@volition.pa.dec.com> vi...@decwrl.dec.com (Paul A Vixie) writes:
>1. Out of Inodes
>2. Archival
>3. Fragments
>6. Possibilities
>One of the possibilities I'm thinking about is the current HP Notes
>mechanism. It has several advantages:
>
> -> doesn't need a lot of inodes (one or a few per group,
> rather than one per article)
>
> -> doesn't waste space on partial blocks
>
> -> doesn't care about file system boundaries
>7. Comments
I'd like to see a compressed multi-part file format standardized
so that it could be used for batch transmissions as well as
archives. Zoo might be a good starting point, although the
readers should be set up to use an external directory. A database
(dbm or otherwise) of ID#, archive filename, compression method,
and starting offset would let you use an arbitrary mix of compression
techniques and file groupings (including normal uncompressed one-message
per file mode). If the reader is smart enough to cache the open files
and the groups are organized reasonably, it could easily be more efficient
to decompress than to open individual files per message. With a little
thought, it should be possible to work out a scheme to pass these files
around without changing them between machines, and perhaps use the
same scheme for mailbox storage (maybe with the index stored at the end
of the file).
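
A sketch of that index using ndbm; the record layout and the archive
file name are made up, but the dbm calls are the standard API:

    /* Index an archived article by Message-ID using ndbm.  The
     * artloc layout and the archive file name are invented for
     * illustration. */
    #include <ndbm.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    struct artloc {
        char file[128];   /* which archive file holds the article */
        int  comp;        /* 0 = none, 1 = compress, 2 = zoo, ... */
        long offset;      /* starting byte offset within the file */
    };

    int main(void)
    {
        DBM *db = dbm_open("artindex", O_RDWR | O_CREAT, 0644);
        struct artloc loc;
        datum key, val;

        if (db == NULL) {
            perror("dbm_open");
            return 1;
        }
        strcpy(loc.file, "news.arc.0001");   /* made-up name */
        loc.comp = 1;
        loc.offset = 40960L;

        key.dptr  = "<VIXIE.90Jun12153343@volition.pa.dec.com>";
        key.dsize = strlen(key.dptr);
        val.dptr  = (char *)&loc;
        val.dsize = sizeof loc;
        dbm_store(db, key, val, DBM_REPLACE);
        dbm_close(db);
        return 0;
    }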
Les Mikesell
l...@chinet.chi.il.us