Re: Flames & WWW (was Frames & WWW)

Gavin Nicol (gtn@ebt.com)
Fri, 18 Nov 1994 01:48:00 -0500


>|>Phil proposed this:
>|>
>|> http:///bongo.cern.ch/fred.html#H1:2/H2:4/H3:3/P:4/10,15
>|>
>|>while I proposed long ago the use of the TEI invented naming schemes.
>|>
>|> http:///bongo.cern.ch/fred.html/section=2/subsection=4/subsubsection=3/P=4
>|> http:///bongo.cern.ch/fred.html/2/4/3/4
>
>Except that this is a URL that identifies an object not a position within an
>object. Not the same thing at all.

Unless I am extraordinarily dense, doesn't your scheme refer to a
single character or possibly element within a document? Isn't a
character an object? If you are talking about the difference between
one being a fragment ID, and the other a URL, well, you could just as
easily make the TEI path part of a fragment ID. If you mean that I
have not shown how one might refer to a single character, then it is
easily solved by something like:

../fred.html/section=2/subsection=4/subsubsection=3/P=4/PCDATA=nnn

(depending on the content model), though I'm not sure it's all that
useful, whereas I am sure that accessing and managing data at the
element level *is* useful.
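
A minimal sketch of the point about fragment IDs versus URLs: the same
element path can ride in either place and be recovered either way. The
function name split_element_address, and the rule that the document is
the first path segment ending in ".html", are assumptions made only for
this illustration; neither proposal in this thread defines them.

```python
from urllib.parse import urlsplit


def split_element_address(url, doc_suffix=".html"):
    """Return (document_path, element_steps) for either addressing style."""
    parts = urlsplit(url)
    if parts.fragment:
        # Fragment style: everything after '#' is the element path.
        return parts.path, [s for s in parts.fragment.split("/") if s]
    # Path style: assume the document is the first segment with the suffix.
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.endswith(doc_suffix):
            document = "/".join(segments[: i + 1])
            return document, [s for s in segments[i + 1:] if s]
    return parts.path, []


as_path = "http://bongo.cern.ch/fred.html/section=2/subsection=4/P=4"
as_fragment = "http://bongo.cern.ch/fred.html#section=2/subsection=4/P=4"

# Both spellings yield the same document and the same element steps.
print(split_element_address(as_path))
print(split_element_address(as_fragment))
```

Under these assumptions the choice between the two forms decides who
resolves the path (server or client), not what can be addressed.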

Now, the reason I think it is better as a URL is because it defines a
*path* to the element, in the same way that normal URLs are a *path*
to the document. The benefit is much like that found in using
hierarchical file systems over flat file systems. I'm sure no-one
wants to go back to the MSDOS 1.0 or CP/M file systems...
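
A rough sketch of how such a path might be walked, covering both the
name=n form (nth occurrence of that element name among the children)
and the bare-number form (nth child). The resolve function, the sample
document, and the use of Python's xml.etree.ElementTree are assumptions
for illustration only; this is not the TEI's own definition, and a real
SGML document would need an SGML parser.

```python
import xml.etree.ElementTree as ET


def resolve(root, path):
    """Walk a TEI-style path from root; return the element, or None."""
    node = root
    for step in (s for s in path.split("/") if s):
        if "=" in step:
            # name=n: nth child carrying that element name (1-based)
            name, _, index = step.partition("=")
            candidates = [child for child in node if child.tag == name]
        else:
            # n: nth child regardless of its name (1-based)
            index = step
            candidates = list(node)
        position = int(index)
        if not 1 <= position <= len(candidates):
            return None
        node = candidates[position - 1]
    return node


# Hypothetical document, invented for this example.
DOC = ET.fromstring(
    "<doc>"
    "<title>Manual</title>"
    "<section><P>one</P></section>"
    "<section><P>two</P><P>three</P></section>"
    "</doc>"
)

print(resolve(DOC, "section=2/P=2").text)  # prints "three"
print(resolve(DOC, "3/2").text)            # prints "three"
```

Nothing in the walk depends on the element names themselves, which is
why the same code serves any hierarchical data.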

I should note that there is an increasing trend to move away from
physical paths (i.e. a 1 to 1 mapping between URL and file), and this is
just one example of it. One advantage of the TEI paths, which I have
repeated time and time again, is that <h1>they are not SGML
specific</h1> whereas your model is.

>|>(Note that the second uses the child number of the element, whereas
>|>the first is using the occurrence of the element name within the child
>|>list.)
>
>Except that this TEI scheme bears no relationship to HTML and you do not
>define one. There is no such thing as a HTML `section' or subsection unless

Perhaps you are just pretending to not understand? When I use
"section" and "subsection" I *obviously* am not referring to HTML. In
fact, in my last posting I explained that HTML was a degenerate case
because it is "flat" (i.e. it is represented as a very shallow
hierarchy in which order is not well specified). I showed that the TEI
paths can be used for HTML (and showed that in the final analysis,
your proposal and TEI are the same), but I have repeatedly said that
the real benefit is with more structured datasets. In other words, the
TEI paths represent a *superset* of the functionality that you desire,
that can be used on *any* hierarchical data system. This whole
"doesn't apply to HTML" is a smokescreen because anything that applies
to SGML applies to HTML, because HTML *is* SGML (just a particularly
poor form of it. I should note that HTML 2.0, and HTML 3.0, thanks to
the hard work of Dave Raggett and Dan Connolly, are getting better
and better).

>This is not "Not Invented Here" but "Other Suggestion Non-Starter". I
>somehow don't think that TEI proposed the scheme in the context you
>intend to apply it. Again references please.

You are probably correct in saying that the TEI never envisaged the
TEI paths being used in URLs, but they certainly wanted to use them
to retrieve parts of large documents.

>|>>What relation do these `sections' have to HTML elements. Is H1 a section?
>|>>Is H2 a subsection? What is a H3???
>|>
>|>Well, now we come to the crux of the matter. HTML was very poorly
>|>designed because it ignored the inherent structure of documents, so in
>|>fact we don't have many containers... if we want to address
>|>something using the TEI stuff, it will be very "flat"
>|>(fred.html/P=14).
~~~~~~~~~~~~~~~~~ note the example of TEI paths and HTML!

>By poorly designed I suppose you mean it didn't happen to conform to
>the SGML communities views on document design.

No. By poorly designed I mean that its designer ignored (or probably simply
wasn't aware of) all the research done on markup systems. He ignored
both structural markup, and the concept of working in the author's
mindset, and instead defined another TROFF.

>Guess the ratio of HTML documents to other SGML documents. Guess the
>likely ratio in a few months time.

I tell you what. Go to a large aerospace company and tell them that
SGML is a waste of time. Tell them to mark up the 500,000 to
1,000,000 pages of technical documentation they have using HTML for use
in IETMs. There are *gigabytes* of mission critical SGML data around
the world, and probably a gigabyte of HTML. Compare the content,
compare the purpose of the documents, compare the authors. Then
perhaps you will understand why SGML *is* coming to the WWW.

>|>>Is this an SGML standard or a Web standard? Who has commented on it? Dave
>|>>Raggett? Tim B-L?
>|>
>|>This is a *humanities* standard. The people found that using SGML was
>
>Ah the humanities people, well known for their ability to create technical
>standards.

It is very easy to sneer at things you do not understand.
I'll let you in on a little secret. People do best what they
understand most (or usually anyway). These people have a huge amount
of experience with markup and document processing. Did Tim B-L? Do
you? You would be *amazed* at the technical skills needed to maintain
large document repositories. In addition, some (many?) TEI people also
happen to be SGML experts (by choice, or because they saw that SGML
was required, and spent time understanding it. Very admirable if you
ask me...)

>|>>If it's an SGML standard don't imagine that it has any relationship to
>|>>HTML.
>|>
>|>Well, HTML *is* SGML (which of course you know), but it is a
>|>particularly poor form of it. As I noted, these are not SGML specific
>|>(see below).
>
>Given the SGML spec HTML is probably the best you can do from a very
>poorly designed system. If SGML was properly designed it would not
>have required over a year to get the basic HTML DTD correct.

This verges on the ridiculous. HTML is close to the *worst* one can do
with SGML. The reason it took close to a year to get a formal HTML DTD
is because most WWW people did not understand SGML (starting with the
original designer of HTML), and because they gave almost no thought to
SGML when they designed it. Again, do you blame your computer language
of choice when your programs crash? ("God, C is a crock! Why doesn't it
have automatic array index checking? Why doesn't it detect NULL
pointers? Why doesn't it have a counted string type?")

>|>>It is possible to create containers by associating sections of
>|>>text with the preceding headers and nesting Hn+1 elements within
>|>>Hn elements. This may be hard to express in SGML lossage but that
>|>>is SGML for you.
>|>
>|>This has got to be the funniest thing I have read all week! Probably
>|>all year! SGML's primary purpose is to define the structure of a document
>|>explicitly by defining containers and content model.
>
>A circumlocutory way of saying that SGML fails at its principal
>design purpose. I quite agree.

Ha ha ha ha. You'll have me in stitches yet! <H1>Wrong!</H1>
In SGML one defines containers to contain things. I guess if you don't
do that, then you can't expect to have any containers! I wouldn't say
that's SGML's fault though...

Now compare the SGML and HTML markups.

Section 1

Section 1 ..... ......

Section 2

Section 2

We can immediately see 2 things:
1) The SGML tends to be more verbose because it has more tags.
2) The boundaries between elements are easily found in the SGML, but
one *cannot* deduce the hierarchy from the HTML document.

The key difference is this: HTML was designed with presentation and a
linear document structure in mind, whereas (good) SGML emphasizes
hierarchical structure, or at least, tries to say *what* something is
rather than how it looks.
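
A small illustration of the two observations, using invented markup:
the element names (doc, section, title, p) and both sample strings are
hypothetical, and Python's XML parser stands in for an SGML one, so
every end tag is written out in full.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured markup: each section is a container.
NESTED = (
    "<doc>"
    "<section><title>Section 1</title><p>text</p></section>"
    "<section><title>Section 2</title><p>text</p></section>"
    "</doc>"
)

# Hypothetical HTML-style markup: headings and paragraphs are siblings.
FLAT = (
    "<body>"
    "<h1>Section 1</h1><p>text</p>"
    "<h1>Section 2</h1><p>text</p>"
    "</body>"
)


def child_tags(markup):
    """List the element names found directly under the root."""
    return [child.tag for child in ET.fromstring(markup)]


print(child_tags(NESTED))  # ['section', 'section']
print(child_tags(FLAT))    # ['h1', 'p', 'h1', 'p']
```

In the nested form each section arrives as one element with its content
inside it. In the flat form the parser reports only a run of siblings,
and any grouping has to be guessed afterwards.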

>|>One cannot define
>|>containers by associating Hn with the following text for 2 reasons:
>|>
>|>1) Many people use Hn for font effects
>|>2) One cannot find the boundaries
>
>It's easy enough to define the boundaries. If people engage in Mosaic
>tag-abuse they end up losers, so what? Their documents light up the bad HTML
>flag on more modern browsers and eventually the users get educated.

Some will applaud your sentiments, some will call it arrogance. At
least the SGML people propose a backwardly compatible
solution... people will abuse tags *because the original HTML
was so badly designed*. The tools for restricting tag use were there,
but they were not used. Don't curse the tools now.

>The tree structure may be deduced using a simple set of rules since at the top
>level within the BODY container the only valid elements are <Hn>, <P>, <UL>
><DL>, <OL>, <PRE> and <IMG>. The <Hn> elements are the only ones which define
>structure within the tree and all the others may be regarded simply as
>different types of paragraph.

Which HTML spec are you referring to? Unless it has changed
considerably since I last looked, <Hn> could occur within an <A> which
could occur almost anywhere. In addition, there is no way to enforce
using H1 before using H2, because they all occur at the same level
within the hierarchy defined in the DTD. This means you cannot
reliably find the end of the (faked!) container without building
HTML version specific information into your parser, and even then,
you'd find cases where you would fail. What if someone does
something like the following?

<h3>Big</>
<h2>Bigger</>
<h1>Biggest</>

I don't pretend that the TEI paths can do any more than access parts
of documents by element. I certainly don't claim that it can put
structure where there is none.
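
To make the boundary problem concrete, here is a sketch of one possible
"simple set of rules": treat each <Hn> as opening a container that runs
until the next heading of the same or a higher rank. The function
infer_outline and the regular expression are assumptions for
illustration, not anyone's actual proposal, and the sample input spells
out full end tags instead of the short </> form.

```python
import re

# Matches <h1>...</h1> through <h6>...</h6> with full end tags only.
HEADING = re.compile(r"<h([1-6])>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def infer_outline(html):
    """Return (depth, level, title) per heading, nesting by level number."""
    stack = []    # levels of the inferred containers that are still open
    outline = []
    for match in HEADING.finditer(html):
        level = int(match.group(1))
        # A heading closes every open container of equal or deeper level.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)
        outline.append((len(stack), level, match.group(2)))
    return outline


ordered = "<h1>A</h1><h2>B</h2><h3>C</h3>"
reversed_order = "<h3>Big</h3><h2>Bigger</h2><h1>Biggest</h1>"

print(infer_outline(ordered))
# [(1, 1, 'A'), (2, 2, 'B'), (3, 3, 'C')]
print(infer_outline(reversed_order))
# [(1, 3, 'Big'), (1, 2, 'Bigger'), (1, 1, 'Biggest')]
```

The rules always produce some tree, but for the reversed input they
yield three siblings at depth 1, and nothing in the markup says whether
that is what the author meant.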

>I blame SGML for having the most incomprehensible structure definition
>grammar since sendmail and still not allowing the structure to be effectively
>represented.

The only thing which really confused me was groups. Apart from that,
it's just YACC on steroids. Now I would *love* to see some examples of
where it is impossible to define the structure of some document using
SGML. I will concede that there is one area where I found SGML to be
lacking: one cannot define the format of the character data. What this
means is that for things like phone numbers, you cannot say "this data
is 3 numbers followed by a dash followed by 4 numbers". This might be
very useful, because one could do data verification, thereby making it
much more suitable for database work.
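
The kind of check meant here has to live outside the DTD, for example
in application code run after parsing. A minimal sketch, assuming the
"3 digits, a dash, 4 digits" format from the example; the function name
is_valid_phone is hypothetical.

```python
import re

# 3 ASCII digits, a dash, then 4 ASCII digits, and nothing else.
PHONE = re.compile(r"[0-9]{3}-[0-9]{4}")


def is_valid_phone(text):
    """Check character data against the phone-number format."""
    return PHONE.fullmatch(text.strip()) is not None


print(is_valid_phone("555-1234"))   # True
print(is_valid_phone("5551234"))    # False
print(is_valid_phone("555-12345"))  # False
```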

>In other words not all documents are divided into chapters or sections.

No, and I never claimed they were. The example I showed using
"section" and "subsection" showed the use of tag names (any tag names
are fine).

>This is why an untagged indexing scheme has to operate on the tags used
>for the markup.

I'm sorry, I don't understand this sentence.

>|>academic sites that *require* SGML, and the structure it contains. In
>|>fact, you told me that *you* had plans to do an SGML aware browser...
>
>Quite true. This is one reason why I am more aware than most of the
>scale of SGML lossage.

I think most people go through stages when dealing with SGML. Most CS
types go through a stage I'll call "I could have done it better", in
which they are aghast at the baroque syntax, and the context-sensitive
parser. In most cases, I'd agree. I had the same thoughts, but then I
also looked at how people were using it, and it makes much more
sense. The TEI subset of SGML is parseable using YACC and flex, and
provides a very good minimum. Take a look at it. People using SGML
will not accept HTML, but they would probably accept the TEI subset as
a common base for document exchange.

>The main reason why it won't fit in the bottle being of course that the life
>support system its attached to won't fit into the neck. As we know the users
>vote with their feet, hello Microsoft Word 6.0.

Which can now also read/write SGML. Yes, Microsoft thinks SGML is
important too. Hello? Smell the tea?

>Phillip M. Hallam-Baker
>Not Speaking for anyone else.

Thank goodness. I have every respect for you in some technical areas,
but in large scale document processing and SGML, I think I might have
found an area that "needs more work", shall we say. Your proposal
cannot work, SGML is not evil, HTML is poor for large
documentation/database projects, TEI paths are great for accessing
parts of hierarchical documents, and I have spent far too much time on
this already. These are facts. Please give me good, solid, technical
reasons for

1) Why TEI paths are not superior to your scheme, and indeed a
superset of it.
2) Why SGML is so evil and braindamaged, and why HTML is king.
3) How you could possibly make your scheme fly (how to find
boundaries)

Or even better, show me *code* that works. As you said "Code! General
Consensus! Working Implementation".

----

Gavin Nicol
Not even speaking for myself. The real me is doing something far more
constructive.