SGML has no mechanism for doing this, so the word I keep hearing is
that we should strangle HTML with the same restrictions.
The fallacy I often hear uttered is that if we can stuff Unicode into
the MIME header as the charset, then we can avoid the problem of having
to define a CHARSET tag (since Unicode encompasses most national
characters). But this way of thinking is WRONG. Unicode doesn't provide
a mechanism for varying sort order and other things that vary according
to locale and language. THE UNICODE STANDARD ITSELF SAYS THAT ADDITIONAL
TAGS ARE NECESSARY for this sort of thing.
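To make the sort-order point concrete, here is a rough sketch in Python.
The two key functions are hypothetical, hand-written stand-ins for real
locale tables (they are my simplification, not anything taken from the
Unicode standard): one folds the umlauts into their base letters, as
German dictionary ordering does, and the other puts a-ring, a-umlaut and
o-umlaut after z, as Swedish does. The strings and the charset are the
same in every case; only the declared language changes the result.

```python
# Hypothetical, simplified collation rules for illustration only.
# Real software would use complete locale tables.

def german_key(word):
    # German dictionary ordering: sort umlauts with their base letters.
    table = {0x00E4: "a", 0x00F6: "o", 0x00FC: "u"}
    return word.lower().translate(table)

def swedish_key(word):
    # Swedish ordering: a-ring, a-umlaut, o-umlaut are letters after z.
    alphabet = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return [rank.get(ch, len(alphabet)) for ch in word.lower()]

words = ["zebra", "\u00e4rger", "apfel"]

by_code_point = sorted(words)                 # raw Unicode order
by_german = sorted(words, key=german_key)     # a-umlaut sorts with "a"
by_swedish = sorted(words, key=swedish_key)   # a-umlaut sorts after "z"

print(by_code_point)
print(by_german)
print(by_swedish)
```

Nothing in the character data says which of these orderings is wanted,
which is why the language has to be stated separately.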
So although offering Unicode or UTF-8 as a default charset is a good
idea, it does not do away with the need for LANG and CHARSET tags.
Just to do away with one other fallacy: You can't get by with LANG tags
alone or CHARSET tags alone. You need both. You can have two different
charsets in a single document (e.g., Shift-JIS and ISO 8859-1), and you
can have two or more languages within the same charset (e.g., English
and German for ISO 8859-1; Urdu, Persian, and Arabic for Unicode, since
they all use the same Unicode pages).
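A small Python sketch of why the two tags are independent. The sample
words and the helper in_arabic_block are my own choices for
illustration. The first half shows that the same bytes mean different
things under different charsets; the second shows that an Arabic word
and a Persian word draw on the same range of Unicode code points, so
the charset by itself cannot tell you the language.

```python
# Same bytes, two charsets: the charset must be declared.
raw = "\u65e5\u672c".encode("shift_jis")   # the word "Nihon" (Japan)
as_shift_jis = raw.decode("shift_jis")     # the two intended kanji
as_latin1 = raw.decode("iso-8859-1")       # four unrelated characters

# Same charset, two languages: the language must be declared too.
arabic_word = "\u0643\u062a\u0627\u0628"   # "kitab" (book), Arabic
persian_word = "\u06a9\u062a\u0627\u0628"  # "ketab" (book), Persian

def in_arabic_block(text):
    # Hypothetical helper: true if every character lies in the basic
    # Arabic range U+0600..U+06FF.
    return all(0x0600 <= ord(ch) <= 0x06FF for ch in text)

print(as_shift_jis == as_latin1)
print(in_arabic_block(arabic_word), in_arabic_block(persian_word))

# English and German text both fit in ISO 8859-1 without complaint.
english = "street".encode("iso-8859-1")
german = "Stra\u00dfe".encode("iso-8859-1")
```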
It may not make sense for all clients to allow all possible
combinations, but this is something they can negotiate with servers.
It is not a reason to cripple HTML.
If I'm misunderstanding the Unicode standards, HTML, or SGML, someone
please let me know. I'm doing my best to keep up :-).
Richard Goerwitz
goer@midway.uchicago.edu