I strongly suggest we bring the definition of HTML into conformance
with the SGML standard before we register it with the IANA.
>Published specification:
> "The HTTP Protocol as Implemented in W3", available for
> anonymous ftp from ftp://info.cern.ch/pub/doc/www/http.txt.
> Describes the HTTP interactive access protocol and the tags used
> in HTML documents.
This is the HTTP document, not the HTML document:
    This document defines the Hypertext Transfer protocol (HTTP) as
    currently implemented by the WorldWideWeb initiative software.
The HTML document is: http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
an old version of which is contained in http.txt.
In any case, both documents mention some relationship between HTML and
SGML which is not formally defined:
    The hypertext mark-up language is an SGML format. This defines the
    basic syntax used. The particular language, the set of tags and the
    rules about their use, and their significance is not part of the
    SGML standard. There being no standard on this, we have adopted a
    set which seems sensible. We call them HTML -- hypertext markup
    language. HTML is not an alternative to SGML, it is a particular
    format within the SGML rules (an SGML "DTD").
The standard is very clear on this kind of thing. [I just got myself a
copy, so I can quote it:]
    4.103 (document) type declaration: A markup declaration that
    contains the formal specification of a document type
    definition.
    4.104 document type declaration subset: The element, entity,
    and short reference sets occurring within the declaration
    subset of a document type declaration.
    4.105 document (type) definition: Rules, determined by an
    application, that apply SGML to the markup of documents of a
    particular type. A document type definition includes a formal
    specification, expressed in a document type declaration, of
    the element types, element relationships, and attributes, and
    references that can be represented by markup. It thereby
    defines the vocabulary of the markup for which SGML defines
    the syntax.
So it seems that the HTML DTD is missing the "formal specification."
I have written a document type declaration subset that matches HTML as
currently defined and implemented, with a few exceptions (most
importantly, the PLAINTEXT tag). See
http://info.cern.ch/hypertext/WWW/MarkUp/HTML.dtd
Most existing HTML documents need only small modifications to bring
them into conformance (quote attribute values, add the <!DOCTYPE ...>
prologue). And the existing WWW browser parses conforming documents
just fine.
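A rough sketch of the first of those modifications (quoting attribute values), in Python. The function name and the regular expressions are made up for illustration, and this is pattern matching rather than a real SGML parser: it assumes tags contain no ">" inside attribute values.

```python
import re

# Hypothetical helper, not part of any WWW software.
# A start tag: "<", a letter, then anything up to the next ">".
START_TAG = re.compile(r"<[A-Za-z][^<>]*>")
# An "=" followed by a quoted value (left alone) or an unquoted one.
ATTR_VALUE = re.compile(r"""(=\s*)("[^"]*"|'[^']*'|[^\s"'<>]+)""")

def quote_attribute_values(html: str) -> str:
    """Wrap unquoted attribute values in double quotes."""
    def fix_value(match):
        prefix, value = match.group(1), match.group(2)
        if value[0] in "\"'":
            return match.group(0)       # already quoted
        return f'{prefix}"{value}"'

    def fix_tag(match):
        return ATTR_VALUE.sub(fix_value, match.group(0))

    return START_TAG.sub(fix_tag, html)
```

For example, `<A NAME=top HREF="a.html?x=1">` would come out as `<A NAME="top" HREF="a.html?x=1">`; text outside tags is not touched.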
    Currently HTML documents are transmitted without the normal SGML framing
    tags, but if these are included parsers will ignore them.
I don't know what "the normal SGML framing tags" are. An SGML document
has three parts: the SGML declaration, the prologue, and the instance.
It is common in SGML applications to use an implied SGML declaration
and include the prologue by reference (kinda like an #include
directive in C), but without these "framing tags," it's just not an
SGML document.
Besides, it's very little work to add the line:
    <!DOCTYPE HTML SYSTEM>
at the beginning of HTML documents.
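To show how little work that is, here is a minimal sketch in Python. The function name is hypothetical, and it assumes that a document already starting with any "<!DOCTYPE" declaration (in either case) should be left alone.

```python
# The declaration line proposed in this message.
DOCTYPE_LINE = "<!DOCTYPE HTML SYSTEM>"

def add_doctype(document: str) -> str:
    """Prepend the DOCTYPE line unless the document already has one."""
    # Assumption: the declaration, if present, is the first non-blank text.
    if document.lstrip().upper().startswith("<!DOCTYPE"):
        return document
    return DOCTYPE_LINE + "\n" + document
```

Running it twice over the same document gives the same result as running it once, so it is safe to apply to a whole tree of files.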
More non-conforming stuff in MarkUp.html:
    Plaintext
    This tag indicates that all following text is to be taken literally, up to
    the end of the file. Plain text is designed to be represented in the same
    way as example XMP text, with fixed width character and significant line
    breaks. Format:
    <PLAINTEXT>
    This tag allows the rest of a file to be read efficiently without parsing.
    Its presence is an optimisation. There is no closing tag.
This should be moved outside the definition of HTML. It should just be
part of the HTTP protocol: if the server starts the response with
<PLAINTEXT>, what you're getting is plain text, not SGML.
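A sketch of what that might look like on the client side, in Python. The function name and the "text"/"sgml" labels are invented for illustration, and matching the marker without regard to case is my assumption, not something the HTTP document says.

```python
PLAINTEXT_MARKER = "<PLAINTEXT>"

def classify_response(body: str):
    """Return (kind, content), where kind is "text" or "sgml".

    If the response starts with the marker, the rest is handed over as
    plain text and never reaches the SGML parser.
    """
    head = body[:len(PLAINTEXT_MARKER)]
    if head.upper() == PLAINTEXT_MARKER:      # assumed case-insensitive
        return "text", body[len(PLAINTEXT_MARKER):]
    return "sgml", body
```

The point of doing the check here is that the HTML DTD then never has to describe an element with no end.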
Another problem:
    Example sections
    The text may contain any ISO Latin printable characters, including the
    tag opener, so long as it does not contain the closing tag in full.
This doesn't fit in SGML. The ETAGO delimiter ("</") ends a CDATA
section.
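To make the mismatch concrete, here is a small Python sketch of where an SGML parser would stop reading CDATA content. The function name is hypothetical, and it assumes ETAGO is only recognized when a letter follows it, which is my reading of the standard's rules on delimiters in context.

```python
import re

# ETAGO ("</") immediately followed by a name start character.
ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")

def cdata_content(text_after_start_tag: str) -> str:
    """Return the part of the text an SGML parser would treat as CDATA.

    The content ends at the first recognized ETAGO, whatever tag name
    follows it, not at the element's own closing tag "in full".
    """
    match = ETAGO_IN_CONTEXT.search(text_after_start_tag)
    if match is None:
        return text_after_start_tag
    return text_after_start_tag[:match.start()]
```

So an example section that wants to show `</B>` literally gets cut short at that point under SGML rules, while the MarkUp.html wording would let it run on to `</XMP>`.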
A clarification:
    Paragraph
    This tag indicates a new paragraph. The exact representation of this
    (indentation, leading, etc) is not defined here, and may be a function of
    other tags, style sheets etc. The format is simply
    <P>
    (In SGML terms, paragraph elements are transmitted in minimised form).
The implementation suggests that the <P> tag marks an empty element, a
paragraph separator, rather than allowing minimization in the form of
an omitted end tag, </P>.
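The two readings give different document structures. A toy Python sketch, with made-up function names and ignoring every tag except <P>:

```python
import re

P_TAG = re.compile(r"<P>", re.IGNORECASE)

def paragraphs_as_separator(text: str) -> list:
    """<P> is an empty element sitting between paragraphs."""
    return [chunk.strip() for chunk in P_TAG.split(text) if chunk.strip()]

def paragraphs_as_container(text: str) -> list:
    """<P> starts a paragraph element whose end tag is omitted.

    Each paragraph runs to the next <P> or the end of the text; anything
    before the first <P> is not inside any paragraph element.
    """
    return [chunk.strip() for chunk in P_TAG.split(text)[1:]]
```

Given "First.<P>Second.<P>Third.", the separator reading yields three paragraphs, while the container reading yields two and leaves "First." outside any paragraph.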
We could even go so far as to call WWW an SGML application:
    4.279 SGML Application: Rules that apply SGML to a text
    processing application. An SGML application includes a formal
    specification of the markup constructs used in the
    application, expressed in SGML. It can also include a
    non-SGML definition of semantics, application conventions,
    and/or processing.
    Note 2 The formal specification of an SGML application
    constitutes the common portions of the documents processed by
    the application. These common portions are frequently made
    available as public text.
In other words, ftp://info.cern.ch/pub/doc/the_www_book.txt would
serve as the "non-SGML definition." [By the way: I could only find
PostScript and LaTeX versions of the book: no txt file.] The "common
portion" is html.dtd (we could obtain a public text identifier for
it...).
If we want to do this (define an SGML application), section 15.5
requires this statement to be plastered all over the place:
    An SGML Application Conforming to International Standard
    ISO 8879 -- Standard Generalized Markup Language
If we're gonna use SGML, why not do it right?
Dan