Re: <draft-ietf-iiir-html-01.txt, .ps> to be deleted.

Erik Naggum (enag@ifi.uio.no)
Tue, 15 Feb 1994 15:27:43 --100


[Eliot Kimber] (1994-02-14 23:02:36 -0500)

| I'm afraid that on this point there can be no compromise. If a
| document is an SGML document then it *must* start with a DOCTYPE
| declaration and include the document element *in the same entity*. The
| definition of SGML document entity is quite clear on this. In fact, it
| is impossible to know whether or not a given stream of data is valid
| SGML *unless* there is a doctype declaration (and an SGML declaration,
| which may be implied by the processing system).

I beg to differ.

This is a complicated issue, and I'm at work now, so I can't elaborate
until sometime tonight (or this afternoon, EST), but the reason it has
become complicated is that there has been a general failure to understand
the distinction between what an SGML parser will see, and what was really
there. Charles Goldfarb and I quickly came to the conclusion that the
record boundary characters were figments of the SGML entity manager's
interface, and we have worked hard to specify a mechanism that allows
"storage objects" (a generalization of "file") to identify their record
boundary convention (a generalization of "line terminator") such that the
entity manager could do the right thing with them. We also took this
argument further, realizing that the "entity" is not a file, or a string of
characters "out there". It's a string of characters as seen by the parser.
We allowed substrings of storage objects, concatenation of (substrings of)
storage objects, and the reason we have "storage object" instead of "file"
is the realization that the user needs the ability to identify the "storage
manager" that can take a "storage object specification" and convert it to a
string of characters. We provide two default storage managers: "file", and
"literal". A user can thereby provide his own storage manager to read text
from an in-memory buffer, from a network resource, from a database, from
the execution of a program, etc.
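
To make the storage manager idea concrete, here is a minimal sketch in
Python. It is only an illustration: the names (MANAGERS, register_manager,
resolve_entity) and the tuple layout are hypothetical and come from no
standard or existing entity manager. Only the two default manager names,
"file" and "literal", are taken from the text above.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A storage manager converts a storage object specification into text.
StorageManager = Callable[[str], str]


def file_manager(spec: str) -> str:
    """Read the named file; newline=None maps CR, LF and CRLF to '\n'."""
    with open(spec, "r", encoding="utf-8", newline=None) as handle:
        return handle.read()


def literal_manager(spec: str) -> str:
    """The specification is itself the text."""
    return spec


# Registry of storage managers (hypothetical structure).
MANAGERS: Dict[str, StorageManager] = {
    "file": file_manager,
    "literal": literal_manager,
}


def register_manager(name: str, manager: StorageManager) -> None:
    """Let the user plug in a manager of their own."""
    MANAGERS[name] = manager


# One part of an entity: (manager name, specification, start, end).
Part = Tuple[str, str, int, Optional[int]]


def resolve_entity(parts: List[Part]) -> str:
    """Concatenate (substrings of) storage objects into one entity."""
    chunks = []
    for manager_name, spec, start, end in parts:
        text = MANAGERS[manager_name](spec)
        chunks.append(text[start:end])
    return "".join(chunks)


# Usage: a user-supplied manager that reads from in-memory buffers.
BUFFERS = {"body": "xx<p>Hello</p>yy"}
register_manager("memory", lambda spec: BUFFERS[spec])

entity = resolve_entity([
    ("literal", "<!DOCTYPE HTML>\n", 0, None),
    ("memory", "body", 2, 14),
])
```

The parser only ever sees the string that resolve_entity returns. Where
the characters came from is the entity manager's business.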

Conceptually, there is no limit to the number of transformations that could
be applied to the storage objects before they were presented to the parser
as the string of characters of an entity.
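
As a rough sketch of such a chain of transformations, again in Python and
with purely hypothetical names: each transformation is a function from
string to string, and the entity is whatever comes out at the end. The two
transformations shown are arbitrary examples, not anything SGML prescribes.

```python
from functools import reduce
from typing import Callable, Sequence

Transform = Callable[[str], str]


def normalize_records(text: str) -> str:
    """Map CRLF and bare CR record boundaries to a single '\n'."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def strip_trailing_blanks(text: str) -> str:
    """Remove trailing whitespace from every record."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


def present_to_parser(raw: str, transforms: Sequence[Transform]) -> str:
    """Apply each transformation in order; the result is the entity."""
    return reduce(lambda text, transform: transform(text), transforms, raw)


entity = present_to_parser(
    "a \r\nb\rc",
    [normalize_records, strip_trailing_blanks],
)
```

Adding another step is a matter of appending one more function to the list.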

An extension of this idea was the realization that people work with one
particular document type much more than they work with others. It would be
a waste to parse the same DTD thousands of times a day, and we got the idea
that a pre-parsed DTD could be stored in some way transparent to the user,
which would be used by the SGML parser. This folds itself neatly into the
idea of a resumable parser that stores enough state information that it can
resume from any point in the parsing process. Right after the DTD parsing
is just one example. The idea was that a parser client could parse up to a
certain point, keep a "bookmark", and resume parsing from there if the text
following this point changed. Well, apply this to a DTD, and the whole
instance could change.
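
A toy sketch of the bookmark idea, in Python, may help. It assumes a
drastically simplified "parser" that only collects declared element names
with a regular expression. A real SGML parser carries far more state, and
every name here (ParserState, DtdBookmarks, parse_instance) is
hypothetical.

```python
import copy
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ParserState:
    """Everything the toy parser needs in order to resume."""
    elements: Set[str] = field(default_factory=set)  # declared names
    seen: List[str] = field(default_factory=list)    # start-tags so far


class DtdBookmarks:
    """Keeps one bookmark per DTD: the state right after DTD parsing."""

    def __init__(self) -> None:
        self.parse_count = 0
        self._states: Dict[str, ParserState] = {}

    def _parse_dtd(self, dtd_text: str) -> ParserState:
        self.parse_count += 1
        names = re.findall(r"<!ELEMENT\s+([A-Za-z][A-Za-z0-9.-]*)", dtd_text)
        return ParserState(elements={name.upper() for name in names})

    def state_after_dtd(self, dtd_id: str, dtd_text: str) -> ParserState:
        """Parse the DTD once; afterwards hand out copies of the bookmark."""
        if dtd_id not in self._states:
            self._states[dtd_id] = self._parse_dtd(dtd_text)
        return copy.deepcopy(self._states[dtd_id])


def parse_instance(state: ParserState, instance_text: str) -> List[str]:
    """Resume from the bookmark and record the start-tags encountered."""
    for name in re.findall(r"<([A-Za-z][A-Za-z0-9.-]*)", instance_text):
        name = name.upper()
        if name not in state.elements:
            raise ValueError(f"undeclared element: {name}")
        state.seen.append(name)
    return state.seen


DTD = "<!ELEMENT HTML - - (P+)>\n<!ELEMENT P - - (#PCDATA)>"
bookmarks = DtdBookmarks()

first = parse_instance(bookmarks.state_after_dtd("html", DTD),
                       "<HTML><P>one</P></HTML>")
second = parse_instance(bookmarks.state_after_dtd("html", DTD),
                        "<html><p>two</p><p>three</p></html>")
```

Both instances are parsed from the same saved state, and the DTD text is
read only once. The copy is what keeps one instance from disturbing the
bookmark used by the next.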

I believe I have outlined a standards-conforming process that can be used
to support the initial view that HTML+ need not include a DTD in every file
(a reference would suffice), and need not parse the DTD itself (a pre-
parsed version will do). Since the HTML+ application is restrictive, the
number of document type declarations that will conform to the application
is small, and can, for all practical uses, be limited to one, which is the
one that all HTML+ processors implement.

However, this is not really such a big deal. I have argued that validating
a DTD is a different task than using it, and some parsers implement this
distinction. Validating is _hard_. Parsing it to use it is relatively
easy, and takes almost the same amount of resources as reading and
processing a binary format resulting from the pre-parsed DTD. (Barring tons
of comments in the DTD, or lots of small files with DTD fragments.)

If we also assume that HTML+ document authors validate their documents
before they ship them (a not unfriendly requirement when you consider the
alternative), parsing relative to a DTD is a relatively simple process. If
done with something other than an SGML parser, however, it can be
expensive, hard to get right, and terribly complicated. Therefore, an SGML
parser should be used for this purpose. Whatever it is that actually does
the job will be an "SGML parser", although probably not a _conforming_ SGML
parser. Using a publicly available tool that can communicate with its
client in the way I have outlined should offer some significant advantages.
I also believe using POEM would solve many problems in retrieving files
over the network, and would thus simplify the entire parsing process.

Well, duty calls. I will have to continue later.

Best regards,
</Erik>

--
Erik Naggum <erik@naggum.no> <SGML@ifi.uio.no>  |  Memento, terrigena.
ISO 8879 SGML, ISO 10744 HyTime, ISO 10646 UCS  |  Memento, vita brevis.