Re: <draft-ietf-iiir-html-01.txt, .ps> to be deleted.

Daniel W. Connolly (connolly@hal.com)
Tue, 15 Feb 1994 13:19:15 --100


In message <9402141145.AA08042@manuel.hpl.hp.com>, Dave_Raggett writes:
>Dan, did you get my pointer to the current HTML+ DTD?
>
> ftp://15.254.100.100/pub/htmlplus.dtd.txt
>

I just now took a look at it. Very nice.

My major concern is that it implies a tremendous increase in the
complexity of the HTML parser. With my original HTML specification, an
HTML parser only parsed the instance part of the SGML document. With
this HTML+ specification, WWW clients will have to parse the prologue
as well. For example:

[[[[[[

<!DOCTYPE HTMLPLUS [

<!-- here's a blurb I hate to type all the time: -->
<!ENTITY sgml "SGML (Standard Generalized Markup Language)">

]>

<htmlplus><body>Parsing &sgml; in a general fashion is quite complicated!
</body></htmlplus>

]]]]]]
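
To make the extra work concrete, here is a rough Python sketch of the
bare minimum a client would need in order to handle the example above:
split off the prologue, collect the general entities declared in the
internal subset, and expand references to them in the instance. The
function and pattern names are made up for illustration, and the regular
expressions assume only double-quoted, internally declared general
entities. A conforming SGML parser must also cope with parameter
entities, external entities, marked sections and more, so this is an
illustration of the added burden rather than a parser.

```python
import re

# Hypothetical names throughout; a sketch, not a conforming SGML parser.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+\w+\s*\[(.*?)\]\s*>', re.S)
COMMENT_RE = re.compile(r'<!--.*?-->', re.S)
ENTITY_RE = re.compile(r'<!ENTITY\s+([\w.-]+)\s+"([^"]*)"\s*>')
REF_RE = re.compile(r'&([\w.-]+);')


def expand_entities(document):
    """Collect entities from the internal subset and expand them in the instance."""
    match = DOCTYPE_RE.search(document)
    if match is None:
        # No prologue at all: instance-only parsing is enough.
        return document
    # Drop comments first so commented-out declarations are not picked up.
    subset = COMMENT_RE.sub('', match.group(1))
    entities = dict(ENTITY_RE.findall(subset))
    instance = document[match.end():]
    # References to undeclared entities are left untouched rather than guessed at.
    return REF_RE.sub(lambda m: entities.get(m.group(1), m.group(0)), instance)


doc = '''<!DOCTYPE HTMLPLUS [
<!-- here's a blurb I hate to type all the time: -->
<!ENTITY sgml "SGML (Standard Generalized Markup Language)">
]>
<htmlplus><body>Parsing &sgml; in a general fashion is quite complicated!
</body></htmlplus>'''

print(expand_entities(doc))
```

Even this toy version needs four patterns and a table of declarations
before it can look at the first tag of the instance.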

You have, however, simplified matters by not putting any parameter
entities in content models. This means that WWW clients won't have to
deal with individual documents introducing new element types "on the
fly."

But you've introduced OMITTAG, <!ENTITY> parsing, and lots of other
stuff. If we plan to include a full blown SGML parser in every WWW
client, why not use all the syntactic sugar like short references and
cool stuff like that while we're at it?

One of the things I released (or was just about to release when I
changed jobs...) was an SGML compliant HTML parser in a few hundred
lines of vanilla ANSI C.

>
>One issue in formalising HTML+ was in providing an adequate structure
>while dealing with legacy documents. As you can see in my current DTD,
>documents have a richer structure than with the old HTML DTD.
>

Yes... and it seems to me (at first glance... I'll have to look more
closely...) that we've lost the ability to translate HTML to Microsoft
Word or FrameMaker without any loss of information.

Let's get formal, why don't we: I do not mean that we should be able to
take any RTF file and convert it to HTMLPLUS, or MIF for that matter.
But I think it's crucial that there exist invertible mappings

h : HTML -> RTF
and
g : HTML -> MIF
and
f : HTML -> TeXinfo

so that I can take a given HTML document, convert it to RTF, and
convert it back and get exactly what I started with (the same ESIS,
that is... perhaps SGML comments and a few meaningless RE's would get
lost).

>For instance, document text is forced to appear within paragraph elements
>which act as containers. Documents are broken into divisions using the
>new DIVn elements which give substance to the notion that headers start
>sections which continue up to the next peer or larger header.

If we're going to burden WWW clients with all this rich structure and
OMITTAG parsing, why don't we go with something like DocBook, which
has a proven ability to capture the structure of existing technical
documents, instead of trying to roll our own?

>The ability to omit starting tags suggested a neat trick for handling
>existing HTML documents by defining DIVn and P as having omissible
>starting tags. Thus an <H1> tag can only occur as the first element
>of a DIV1 element, so browsers can infer missing DIVn start tags.

I'd like to see a more formal argument that this is a general
solution... Perhaps in the form of a short perl program that does the
inference.
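
The kind of inference in question might be sketched as follows, in
Python rather than perl. This is a minimal sketch under several
assumptions that are not taken from the HTML+ DTD itself: the document
has already been reduced to a flat list of element names, only H1
through H6 open divisions, and a header Hn closes every open division
at level n or deeper. The name infer_divs is hypothetical. The sketch
says nothing about documents that already carry explicit DIVn tags or
that skip header levels, and those are exactly the cases a formal
argument would have to cover.

```python
import re

HEADER_RE = re.compile(r'^H([1-6])$')


def infer_divs(tokens):
    """Insert DIVn start and end markers around a flat stream of element names."""
    out = []
    open_levels = []  # stack of open DIV levels, outermost first
    for tok in tokens:
        m = HEADER_RE.match(tok)
        if m:
            level = int(m.group(1))
            # A header ends every open division at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append('</DIV%d>' % open_levels.pop())
            # The header itself becomes the first element of a new division.
            open_levels.append(level)
            out.append('<DIV%d>' % level)
        out.append(tok)
    # End of document closes whatever is still open, innermost first.
    while open_levels:
        out.append('</DIV%d>' % open_levels.pop())
    return out


print(infer_divs(['H1', 'P', 'H2', 'P', 'H2', 'P', 'H1', 'P']))
```

Whether the output always matches what a standard SGML parser would
infer from the DTD is the open question; the sketch only shows that the
rule as stated is mechanical.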

>Similarly, missing <P> tags can be inferred when the browser sees
>something belonging to %text. This neatly deals with the common case
>where some authors think of <P> as a paragraph separator or something
>you should put at the end of each paragraph (this view is promulgated
>by Mosaic documentation).

Is this form of inference consistent with the SGML standard? Or is
this a non-standard extension to support legacy HTML documents?

>My HTML+ browser works this way, using a top-down parser which permits
>most elements to have omissible start and end tags, using the context
>to identify missing tags. Each element is associated with a procedure.
>It's easy this way to recover the structure of badly authored documents
>e.g. with missing <DL> start tags. BTW this browser will be demoed at
>the forthcoming WWW Conference in May.
>General purpose SGML parsers have difficulties with omitted start tags
>reflecting the outcome of a debate in the standards committee. Small
>print in the SGML standard limits the power of parsers to infer missing
>start tags. This restriction was added to simplify writing general parsers
>to handle DTDs in which the content model specifies exceptions.

If you're suggesting we use a parser that's "smarter" than standard
SGML parsers, I don't see the point. Either we buy into SGML, or we
make up something application-specific. And if we're going to make up
something application-specific, we might as well scrap SGML syntax
altogether and build something simple out of lex and yacc, or build
on TeXinfo.

>As a result the HTML+ DTD specifies the paragraph element as requiring a
>start tag. The DTD can therefore be used with existing SGML authoring tools.
>HTML+ browsers are expected to exploit the DTD to infer missing tags, and
>hence deal with the wide variety of markup errors in existing documents.
>
>In future, we expect authors will use specialized wysiwyg editors for HTML+
>or automated document format conversion tools and hence produce documents
>which naturally conform to the DTD.
>

Ahh... now I am beginning to understand the strategy, and I think I
like it: We begin anew with HTMLPLUS, defining a DTD that we expect to
be suitable to our needs. Then simply acknowledge that existing
documents contain a significant number of markup errors, and develop
heuristic techniques for inferring the ESIS from these "broken"
documents.

Hmmm... as long as there are no un-broken documents that would be
misinterpreted by these heuristics, I think it's a great idea. (Again,
though, I'd like to see a formal argument that this is the case.)

>> I think the HTML-Plus does a good job of getting a lot of interesting
>> issues on the table, but its approach of throwing all the stuff into
>> one DTD, and making the DTD extensible (thereby forcing clients to
>> know how to _parse_ SGML DTD's) is a little off track.
>
>Actually, once you state that HTML is an SGML format, then formally each
>document can extend the DTD.

Nope. I took great pains in the specification to prevent WWW clients
from having to deal with anything but _instances_ of the DTD I wrote:

<!-- Regarding clause 6.1, SGML Document:

[1] SGML document = SGML document entity,
(SGML subdocument entity |
SGML text entity | non-SGML data entity)*

The role of SGML document entity is filled by this DTD,
followed by the conventional HTML data stream.
-->

> HTML+ merely exploits this to show authors
>how to declare which extension they wish to use: forms, tables, figures etc.
>I owe a debt here to Lou Burnard and the TEI DTDs which showed me how and
>why to use this approach. It is also pivotal in addressing the problems
>in providing a wide enough range of semantic markup to cover all needs.
>In practice, this is a bottomless pit, and the best solution is for HTML+
>browsers to offer a small basis set of emphasis primitives and to allow
>authors to define their specific elements in terms of this basis set (see
>the RENDER element). At least one browser out there already supports this.

Another solution is to go with more of a MIME architecture, where HTML
is just one data format. TeXinfo is another handy one, and maybe
DocBook, etc. ... I'll have to explain my thoughts on this a little
more in another message.

>
>I have investigated HyTime compliance with Yuri Rubinsky and Elliot Kimber
>(Dr Macro), and know how to add this in. At the moment though, most people
>in the WWW community see little value in switching to a model which forces
>you to declare hypertext links at the start of the document.

There are ways to exploit HyTime without using <!ENTITY>s for all
links. More on that later too...

> This no doubt
>will change if and when HyTime gets widely adopted. On the other hand, I
>feel it is essential for HTML+ to conform to SGML. Without this, publishers
>and businesses will tend to see WWW as a passing experiment that needs to
>be replaced by something on a more professional/commercial footing. This is
>why I am working so hard to extend HTML into something that meets publishers'
>and users' expectations for document delivery. NCSA have done their bit - now
>it's my turn to roll up my sleeves and get down to serious programming :-)

There's a lot of good stuff in this latest DTD. I think we need a more
sophisticated, fault-tolerant linking element, and a few other things,
but you might be on the right track.

>
>> I've got a lot of catching up to do. I hope it's not too late to
>> keep folks from losing confidence in communicating with HTML.
>
>No problem! I am confident that html+ will go a long way to vitiating
>current objections and raising confidence in WWW as a model for the
>development of national information highways. I look forward to renewed
>vigour in the debate on where we should go next, and hope you can make it
>to the WWW Conference in Geneva.

I wish! Maybe...

Dan

p.s. I'd like to start some sort of html-successor-design discussion
forum. Is comp.infosystems.www, comp.text.sgml, or www-talk a suitable
place? Shall we create one?