HTML and TEI?

Lou Burnard (lou@vax.ox.ac.uk)
Mon, 26 Jul 1993 11:24:17 +0200


Seriously impressed by our new Xmosaic server, I have at last started
to look at the differences between HTML and the emerging TEI dtd. I
downloaded the most recent HTML spec (the draft RFC titled "Hypertext
Markup Language", dated 13 July 1993) and read through it fairly
carefully.

This exercise has so far produced a mapping file which can be used with
the TF filter to turn a P2X file into a close approximation to an HTML
document, though I have yet to put this to the test. It also led to my
producing the present note, which I am forwarding to both WWW and TEI
technical discussion lists in the hope of opening up a useful
dialogue. Anyone on the TEI list who doesn't know what WWW is, and
anyone on the WWW list who doesn't know what TEI is, please ask me
directly rather than waste bandwidth rehearsing the familiar here.

What follows is mostly organized as a set of reactions to the RFC. In
each case, I've tried to think how I would express HTML semantics using
TEI syntax (rather than the other way round, which looks like a
non-starter). Various minor queries and niggles about HTML and the RFC
itself emerged in the process which I have also included. For clarity,
HTML tag names are given in uppercase (e.g. TITLE) and TEI ones in
pointy brackets (e.g. <title>).

Apologies if I am rehearsing points already made several times over by
others!

Firstly, a minor irritant: the RFC is spattered with typos and spelling
errors. I can supply a list if needed, but a spell checker would do so
more quickly and painlessly.

More specific comments (keyed by section number) follow:

1.1.3 Several sections/features don't specify any status. Does this mean
they are mainstream or what? Does "treat the contents as though
the tags were not there" mean that processors suppress any tag they
don't know about but retain its content? It would be nice if they could
at least optionally preserve the tag for the benefit of other downstream
applications that can do something with them (tho I suppose that might
be counted as processing them).

I think any reference to "undefined elements" here should be extirpated
root and branch. If you are using an SGML dtd (which HTML appears to
be) then there cannot be any undefined elements in your document,
unless you are willing to countenance illegal documents. In which case,
why are you using SGML at all? One should distinguish between elements
not defined in the DTD (always illegal) and elements which are defined
by the DTD but which an application does not know, or care to know,
about.
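
To make that last distinction concrete, here is a minimal Python sketch
of an application that suppresses the tags it does not know about while
retaining their content, with an option to pass them through untouched
for downstream applications. It is an illustration only: KNOWN_TAGS is
a made-up list, and Python's html.parser is a lenient tag scanner, not
a validating SGML parser, so it says nothing about what the DTD itself
defines.

```python
from html.parser import HTMLParser

# Hypothetical set of tags this particular application knows how to handle.
KNOWN_TAGS = {"title", "p", "a", "h1"}


class LenientFilter(HTMLParser):
    """Drop tags the application does not know, but keep their content.

    Comments and declarations are simply discarded in this sketch.
    """

    def __init__(self, preserve_unknown=False):
        super().__init__(convert_charrefs=False)
        self.preserve_unknown = preserve_unknown
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS or self.preserve_unknown:
            # Re-emit the start-tag exactly as it appeared in the input.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in KNOWN_TAGS or self.preserve_unknown:
            # The parser reports tag names in lowercase.
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def filter_markup(text, preserve_unknown=False):
    parser = LenientFilter(preserve_unknown)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

With preserve_unknown left at its default the unknown tag vanishes but
its content survives; with preserve_unknown=True the input passes
through unchanged.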

2.1 The paragraph on character sets seems to imply that the top half of
the ISO-646 IRV can be used ("There is no obligation...to contain any
characters above decimal 127"). As I read it, however, characters above
127 are at best undefined, at worst precluded, by the statement
DESCSET 0 128 0 within the HTML SGML declaration.

3. The reference for ISO SGML is not present.

3.1.1 The second para under heading 'Attributes' reads "See other
tolerated forms (@@)". It would be nice to know what other forms are
tolerated. Presumably both e.g. <A HREF='FOO'> and <A HREF=FOO> are?

3.1.2 Under the heading 'Character data' there is a rather confusing
discussion of the difficulties inherent in representing markup
characters within HTML content in such a way that they are not
interpreted as markup. There are at least two other ways of achieving
this not suggested here:

(a) use entity references, e.g. &lt; and &gt; for < and > respectively
(these do appear in the reference list at section 5, but they have been
expanded!)

(b) enclose the passage within a CDATA marked section (This last is the
most effective method we have found for dealing with the vexed problem of
getting SGML examples into our text in the TEI)
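
As a rough illustration of the two methods, here is a small Python
sketch. The function names are my own invention; html.escape is the
standard library routine, and the only check made on the marked section
is that the sample does not itself contain the closing delimiter.

```python
from html import escape


def as_entity_refs(sample):
    """Method (a): replace &, < and > by &amp;, &lt; and &gt;."""
    return escape(sample, quote=False)


def as_cdata_section(sample):
    """Method (b): wrap the sample in a CDATA marked section."""
    # The section ends at the first "]]>", so that sequence cannot be
    # allowed to appear inside it.
    if "]]>" in sample:
        raise ValueError("sample contains the marked-section close ']]>'")
    return "<![CDATA[" + sample + "]]>"
```

Method (b) leaves the example text itself untouched, which is why it is
convenient for long passages of SGML.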

3.1.3 Obviously, given my previous comment, I am alarmed to read that
"Marked sections are deprecated". Why and by whom? And does "deprecated"
mean that they are illegal, unsupported, or what? And what is the
reference "See the SGML standard for complete information" supposed to
be doing here? (I mean, it's definitely a true statement, but I don't see
what it does to help in this context.)

3.1.5 "In HTML, multiple spaces should be rendered as proportionally
larger spaces". I think this means that an HTML processor should treat
sequences of white space characters differently from a single such
character, which is sort of reasonable but rather far removed from the
spirit of SGML. More to the point though, it seems to contradict the
sentence "Neither spaces nor tabs should be used to make SGML source
layout more attractive or easier to read". I agree that tabs should be
eschewed entirely, but I don't see what's wrong with making the HTML
source human readable, so long as it's clearly understood that a
processor won't necessarily take any notice!

4.1 The elements defined as content of the HEAD element would, to my
way of thinking, more naturally be defined as attributes, since they
are "properties of the whole document", not content; an exception to
this is the LINK elements. It would be helpful to have a clear
statement of the circumstances in which one would encode an
association between documents using a LINK rather than with an ANCHOR.
Is it in fact arbitrary, or must a LINK also be included for *all*
ANCHORs in a document that target some other document?

TEI equivalence: There is nothing directly equivalent to the HEAD and
its content in the TEI scheme. The 'n' attribute on the <text> or
<TEI.2> element would be the most natural place to put the information
contained by the TITLE element. It would probably be better to define a
special element within the <encodingDesc> of the <TEIHeader> to hold the
other contents of the HEAD.

4.7 Anchors

There is a lot to be said about the different linking philosophies of
HTML and TEI, and I will only scratch the surface here (particularly
because details of the TEI proposals are still being finalized). As I
understand it, the target of an HTML LINK or ANCHOR is either the whole
of a document (identified by URL), or a point within a document (named
by an identifier preceded by a sharp sign), or a combination of the
two. In TEI the natural way of pointing at another document would be to
use an <xref>, and of pointing within the current document to use a
<ref>. The former uses TEI-defined syntax for extended pointers, while
the latter relies on the built-in id/idref mechanism of SGML.

For example, the HTML link <A HREF="http://info.cern.ch/">CERN</A> might
be represented in a TEI document as <xref doc=CERNdoc>CERN</xref>. This
has the added requirement that there be a system entity definition
somewhere else in the prologue associating the entity CERNdoc with the
system identifier which is given explicitly within the HTML document.
[I think Elliot Kimber made a similar point in relation to making HTML
HyTime-conformant in May this year]

An HTML link such as <A HREF="#foo"> (which is presumably legal only if
there is somewhere else in the same document another anchor <A
NAME='foo'>) translates simply into <ref target=foo> (again, assuming
that somewhere else there is an element bearing the attribute
specification 'id=foo').

Finally, an HTML link such as <A HREF="http://info.cern.ch/#foo"> maps
fairly neatly onto <xref doc=CERNdoc from='id (foo)'>, with the same
additional requirement that the entity CERNdoc be declared.
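
The three cases above can be summarized in a short Python sketch. It
is not the mapping file mentioned earlier, just an illustration: the
function name and the entity_for table are hypothetical, and the table
stands in for the system entity declarations that would have to appear
in the prologue.

```python
def href_to_tei(href, entity_for):
    """Map an HTML HREF value to a TEI <ref> or <xref> start-tag.

    entity_for is a hypothetical lookup from document URL to the name
    of a system entity assumed to be declared in the prologue.
    """
    url, sep, fragment = href.partition("#")
    if not url:
        # Same-document pointer: rely on the SGML id/idref mechanism.
        return "<ref target=%s>" % fragment
    entity = entity_for[url]
    if sep:
        # Another document plus a point within it.
        return "<xref doc=%s from='id (%s)'>" % (entity, fragment)
    # Another document as a whole.
    return "<xref doc=%s>" % entity
```

For example, with entity_for = {"http://info.cern.ch/": "CERNdoc"}, the
call href_to_tei("http://info.cern.ch/#foo", entity_for) gives the
<xref doc=CERNdoc from='id (foo)'> shown above. A URL with no entry in
the table raises KeyError, which is the sketch's way of saying that the
entity has not been declared.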

There are no direct equivalents for the REL, REV or METHODS attributes,
though there is a 'type' attribute which could carry similar information.
The TITLE attribute could here, as elsewhere, be mapped to the TEI
global 'n' attribute. I don't understand how the proposed URN attribute
will work, so cannot comment on it.

When the A element is used to encode the target of a pointer (i.e. where
the NAME attribute is used rather than the HREF attribute), the natural
TEI solution might appear to be to use the <anchor> element,
representing

A <A NAME=serious> serious</A> crime ...

as

A <anchor id=serious> serious crime ...

However, as shown above, <anchor> is an empty element, whereas A has content.
I don't know how serious a problem this is, partly because I don't see
what the purpose of delimiting the scope of the A element is in this
case. If it is important, then some other element needs to be chosen,
probably the general purpose <seg> element -- in the TEI scheme, all
elements may take the 'id' attribute, and can therefore serve as the
target of a link.

The TEI scheme also provides a parallel pair of elements <ptr> and
<xptr> for links which do not have content. These are probably not
relevant, except when converting from TEI to HTML, which will always be
fraught with difficulty because of the number of possible ways of
encoding links in the former which are unsupported by the latter.

An alternative would be to regard the HTML coding for targets simply as
an alien notation, and use only <xref>, specifying the correct HTML
syntax with the 'foreign' keyword on the 'from' attribute. I think
this simply sweeps the difficulties under the carpet, but it may turn out
to be the only generally viable solution in the long run.

4.8 The TEI <address> currently has element content only and cannot
therefore be used as a direct equivalent for the ADDRESS element. This
may be worth reviewing; I don't see any reason why it (the TEI one)
should not have mixed content.

4.10 It's not clear from the discussion whether the BLOCKQUOTE element
is also intended for use with inline quotes, nor exactly what "rendered
specially" means here. The TEI has a <quote> element which is probably
the best match.

4.11 As defined, HTML will allow for an entirely arbitrary sequence of
heading levels, so that a sequence like 'H1 H2 H4 H2 H5' would be
perfectly OK. The TEI (like most other SGML schemes) would find that
decidedly odd. The 'level' semantics in the TEI scheme are attached to a
<div> element rather than to a heading (so that untitled document
subdivisions are possible, which HTML does not allow).

This means that translating from TEI to HTML is relatively painless:
<div1><head> becomes <H1>, <div2><head> becomes <H2> and so on.
The reverse translation would be problematic if the HTML document
had an erratic sequence like the one noted above.
The TEI scheme also allows for an arbitrary depth of nesting levels, if
"vanilla" or un-numbered divs are used: however, as noted in the RFC,
deeply nested structures are improbable.
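
To show what "erratic" means in practice, here is a small Python
sketch that reports every heading which descends more than one level
at a time, these being the places where a reverse translation would
have to invent intermediate divisions. The function name and the shape
of the result are my own choices.

```python
def skipped_levels(headings):
    """Return (position, previous level, level) for each heading that
    descends more than one level at once."""
    problems = []
    previous = 0  # level 0 stands for the document itself, above any H1
    for position, name in enumerate(headings):
        level = int(name[1:])  # "H4" -> 4
        if level > previous + 1:
            problems.append((position, previous, level))
        previous = level
    return problems
```

Applied to the sequence H1 H2 H4 H2 H5, this flags the H4 (which
follows an H2) and the H5 (which also follows an H2).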

4.12 IMG. Why is the attribute which specifies the URL of an IMG called
'SRC' (rather than HREF)? It might be handy to have an extra attribute
to hold a brief description of the image for the benefit of processors
which cannot display the graphic: or would this be better treated as an
Anchor? The notion of "inline" might helpfully be defined here. Is the
implication simply that an IMG will not cause a line break on the
screen, whereas an A that points to a graphic image always will?

The TEI has an element <figure>, but I'm not sure whether this is what is
wanted here. More probably the generic <xref> or <xptr> would be an
appropriate equivalent, possibly with a 'rend' attribute.

4.13 I don't understand the sentence beginning "The node may be
queried...". I also don't understand what "Status: standard" at the end
of this section means (it's not in the list at 1.1.3 unless it's
synonymous with 'mainstream').

4.14 See remarks under 4.1 and 4.7. Clarification as to whether LINKs are
permitted within BODY would be useful.

4.15.1 The content model for DL is a bit weird, and does not reflect the
constraints required by the discussion here. For example, the content
model would allow <DL> <DT>term1<DT>term2<DT>term3 </DL> which is
clearly wrong. It should be changed as the note in the DTD suggests. I
will refrain from pointing out that the underlying problem here is the
use of empty elements to indicate things that really have content (such
as terms, definitions, paragraphs etc.) rather than any inherent
"messiness" in mixed content models, as I expect you're sick of being
told as much.

What is it about glossary lists? The TEI also currently proposes a
somewhat eccentric model in which DT would be mapped to <label> and DD
to <item> (note however that this has already attracted some adverse
criticism, and may change).

One thing which probably won't change is that all lists are mapped to
the same element <list>, with the distinctions specified in HTML by
different tags being specified as values of a 'type' attribute. So, for

<UL><LI>list element <LI>another one</UL>
<OL><LI>list element <LI>another one</OL>

a natural TEI translation would be

<list><item>list element <item>another one</list>
<list type=ordered><item>list element <item>another one</list>

The values for the 'type' attribute are not currently defined. It would
be reasonable to use 'ol', 'ul', 'menu', 'dir' if that seemed
appropriate.
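
As a sketch of that last suggestion, the following Python fragment
rewrites the four HTML list tags as <list> with the lowercased HTML tag
name as the 'type' value, and LI as <item>. It is deliberately naive:
it uses regular expressions, so it only recognizes start-tags without
attributes, and the 'type' values are the ones proposed above rather
than anything the TEI defines.

```python
import re


def lists_to_tei(html):
    """Rewrite UL, OL, MENU and DIR as TEI <list>, and LI as <item>."""

    def start(match):
        # Carry the HTML tag name over as the value of 'type'.
        return "<list type=%s>" % match.group(1).lower()

    html = re.sub(r"<(ul|ol|menu|dir)>", start, html, flags=re.I)
    html = re.sub(r"</(ul|ol|menu|dir)>", "</list>", html, flags=re.I)
    return re.sub(r"<li>", "<item>", html, flags=re.I)
```

Because the distinction lives in an attribute value, adding a further
kind of list means extending the pattern, not inventing a new element.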

The COMPACT attribute on DL would most naturally be translated into
another of the TEI global attributes 'rend', which can be applied to
any element to supply additional information about its rendering. Why
is COMPACT not permitted in HTML for anything except glossary lists?

4.17.2 Additional white space has crept into the first example within
the <P> tags.

4.18 The PRE element has no obvious equivalent in the TEI scheme as
published, though we have (of course) found it necessary to invent
something like it in the dtd for documenting tagsets. This is the
element <eg>, which has a content model of CDATA, but generally
realizes this by a CDATA marked section. Probably because I have been
doing things this way for so long, I find a bit strange the way that
some, but not all, elements within a PRE element are interpreted. I
think I would be happier with an element which was entirely
preformatted, behaving rather like the LaTeX verbatim environment.

The sentence describing the use of the tab character needs to be a bit
tighter, I think. I suggest it should read " ... must be interpreted
either as the smallest positive.... multiple of 8, or as equivalent to
a single space." (and delete the last sentence)
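
For what it is worth, here is how I read the first alternative,
sketched in Python. The assumption (mine, not the RFC's wording) is
that a tab stands for the number of spaces needed to reach the next
column position that is a multiple of 8; the function name is
invented.

```python
def expand_tabs(line, stop=8):
    """Replace each tab by spaces up to the next multiple of `stop`."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            pad = stop - (column % stop)  # always between 1 and stop
            out.append(" " * pad)
            column += pad
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

On a single line this gives the same result as Python's built-in
str.expandtabs(8).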

As with COMPACT, the TEI equivalent for the WIDTH attribute would
probably be 'rend=width:80' or something like it.

4.20.1 It would be nice to know whether these physical styles are
additive, i.e. whether <I>this is <B>really</B> important</I>
results in bold italic for the word 'really'. If so, there must
be some combinations which are illegal or undefined: for example
<I><TT>implausible</TT></I>, or <B><B>very bold <B>indeed</B></B></B>

The TEI approach would be to mark all passages which are distinct
typographically (but not in any other respect) with the <hi> element,
optionally using its 'rend' attribute to specify the flavour of
highlighting involved: thus <hi rend=I>this is <hi rend=B>really</hi>
important</hi>. And, before you ask, I don't think the TEI has
expressed an opinion as to whether the semantics of the rend attribute
are additive either.

4.20.2 I need more information about what exactly most of these mean
before being able to specify TEI equivalents for them. In an ideal
world I'd recommend that they all be junked and replaced by whatever
subset of the existing TEI phrase level elements seem to be most
useful. For starters, the list should include foreign words, technical
terms, names of persons, places and organizations, dates, times,
expressions of quantity... What is a citation? (examples would help)

That's probably enough for now.

Lou Burnard
European Editor, TEI