multimodal style
David Seibert (seibert@hep.physics.mcgill.ca)
Mon, 4 Mar 1996 15:06:13 -0500 (EST)
This is a proposal for a multimodal styling language, as opposed to a
linear combination of visual/audio/whatever styling languages. I suggest
a few natural multimodal attributes, and discuss why and how to encourage
their use. I also suggest allowing attribute values to be in a fixed
range of numbers, which both simplifies the styling language and
minimizes the dependence on English.
I would be very happy to hear comments from any interested reader. This
proposal is publicly available at
http://www.physics.mcgill.ca/WWW/seibert/style/mms.html. I am also
temporarily storing the audio style sheet proposal of T.V. Raman at
http://www.physics.mcgill.ca/WWW/seibert/style/raman.html, so that this
manuscript is also publicly available.
Regards,
David Seibert
Multimodal document styling
Encouraging the production of stylish and accessible documents
- Introduction
- Design goals
- Unimodal attributes
- Multimodal attributes
- Standardization and language independence
- Independent specification of attributes
- Encouraging authors to produce multimodal documents
- Precise specification of multimodal attributes
- Summary
HTML (HyperText Markup Language), the standard markup language of the World
Wide Web, is an SGML (Standard Generalized Markup Language) document type.
Rules have been specified to transform HTML tags to a set of "canonical"
elements in accordance with SDA (SGML Document Access) standards, so that
HTML documents can easily be presented using Braille, large print, audio,
or any other type of display. The ICADD (International Committee for
Accessible Document Design) recommends that DTD (Document Type Definition)
authors define mappings to a standard tag set because this practice will
"minimize the burden on writers and editors of understanding the
requirements of markup for Braille, large print and voice synthesized
delivery" [Y. Rubinsky, "Description of the ICADD Mechanism"].
The use of HTML styling languages, such as DSSSL (Document Style
Semantics and Specification Language) Online and CSS (Cascading Style
Sheets), has been suggested as a simple way for Web publishers to
control the presentation of HTML documents. The hope is that this
mechanism will encourage publishers and software vendors to use HTML
rather than creating their own DTDs using SGML, or inventing new HTML
markup tags. Some advantages of continued HTML use are:
- Cross-platform development is facilitated by having a
well-established, simple standard.
- The dependence of presentation on proprietary software is
minimized because the meanings of markup tags are relatively
well-defined.
- Web publishers are less likely to create HTML documents
that cannot be presented well in alternative formats.
- The relatively small set of allowed markup tags allows HTML
display software to be simpler than SGML processing software.
However, documents created using customized styles will not be presented
well in all display modes, unless the style designer spends time creating
the proper specification for each mode. This extra work would again make
it less likely that web authors would create documents and styles that
can be presented well in all formats.
Current styling language proposals concentrate on giving authors control
of the visual presentation of text and images, for the most part ignoring
the possibility of alternative formats. HTML style sheets for audio
presentation have been proposed by T.V. Raman and by the TEO group at the
Katholieke Universiteit Leuven, but in these schemes the audio controls
are totally independent of visual controls. Little attention has been
paid to the fact that, in most cases, authors use markup to express an
idea, so that audio and visual presentations (along with presentations in
any other modes) are related because they represent the same idea, just
as visual presentations of markup in different languages are related by
the semantics of the tags.
In this document, I propose to create a multimodal HTML styling
language, in which visual, audio, and other style descriptions are
integrated as much as is practical. The purpose of this unification is
to make it as easy as possible to produce better web documents for
people with disabilities, by reducing the work for the author or style
designer to enrich a document or style for all display modes. I
suggest design goals for such a styling language and discuss means to
implement those goals. I give concrete examples for five multimodal
presentation attributes that can be derived from visual and audio
attributes. Finally, I discuss how to combine multimodal and unimodal
attributes to create a styling language that not only allows but
encourages authors to produce multimodal documents and styles.
A well-designed multimodal styling language should
- contain a fairly complete set of attributes, so that authors can
specify a wide range of properties for visual, audio, and other
displays.
- contain multimodal attributes that allow authors to simultaneously
specify properties for multiple presentation modes.
- standardize attribute values as much as possible, to make it
easier for casual authors to use the styling language.
- reduce language dependence by minimizing the use of English.
- allow authors to specify visual and audio properties
independently if they wish to do so, for maximal control over
presentation.
- encourage authors to use multimodal attributes (that control
presentation in more than one mode), which will produce
documents that can be presented well in any mode, rather than
unimodal attributes (that control presentation in a single
mode).
- allow precise physical specification of presentation properties
when feasible.
The first design goal, providing a wide range of attributes, can be met
by simply combining current proposals for visual and audio style
attributes. In this section, I give the names of some proposed
attributes and their natural language and numerical values (without
actual or implied units). The definitions are usually fairly obvious;
when they are not, readers should refer to the proposals for visual and
audio style sheets. I do not discuss physical values for these
attributes, because they cannot be translated to values of multimodal
attributes as simply as the less precise (but more intuitive) natural
language or numerical values can.
I list here all attributes described by Raman, with the exception of
speech-other, which is suggested for experimental purposes, and
spatial-audio, which is suggested for possible use in the future. The
attributes proposed by the TEO group are a subset of those proposed by
Raman, so they are also included below. The emphasized attributes
(volume, pitch-range, stress, the pause attributes, and the sound
attributes) are those that can be naturally combined with visual
attributes. For simplicity, I use the CSS syntax, although the proposal
could be written using the notation of either CSS or DSSSL Online.
Attribute name: | Natural language and numerical values |
volume: | soft | medium | loud | [0-10] |
[left | right]-volume: | <none> |
voice-family: | <string> (name) |
speech-rate: | slow | medium | fast | [1-10] |
average-pitch: | [1-10] |
pitch-range: | [0-200] |
stress: | [0-100] |
richness: | [0-100] |
pause-[before | after | around]: | <none> |
pronunciation-mode: | <string> |
language: | <string> |
country: | <string> |
dialect: | <string> (name) |
[before | after | during]-sound: | <uri> |
I will not list the full range of visual attributes that can be
controlled by proposed HTML style sheets. Instead, I give only
the attributes that are naturally linked with audio attributes.
I use the nomenclature of CSS, although these attributes can be
equally well expressed using the terminology of DSSSL Online.
Attribute name: | Natural language and numerical values |
font-size: | xx-small | x-small | small | medium | large | x-large | xx-large |
font-style: | normal | italic | oblique | small-caps | [ italic | oblique ] small-caps |
font-weight: | extra-light | light | demi-light | medium | demi-bold | bold | extra-bold |
padding: | auto |
background: | transparent | <uri> |
Here I give an example of the solution to the second design goal by
proposing a set of multimodal attributes designed for simultaneous
control of visual and audio displays.
In a number of cases, visual and audio attributes given above can be
expressed by a common meaning. In these cases, the visual and audio
attributes can be combined in a natural manner to produce multimodal
attributes. I propose the multimodal style attributes given in the
following table as they are defined below.
Multimodal attribute: | Audio name, | Visual name |
size: | volume, | font-size |
range: | pitch-range, | font-style |
weight: | stress, | font-weight |
space: | pause-[before | after | around], | padding |
background: | [before | after | during]-sound, | background |
- size: 1 | 2 | 3 | 4 | 5 | 6 | 7
- The relationship here is obvious - larger text, louder speech, and
higher numbers will usually be associated. If they are not,
authors should use a suitable combination of unimodal style attributes,
but if they are, authors will minimize their work by using the multimodal
forms. Possible mnemonics for the values (from musical notation):
pianissimo | piano | mezzopiano | mezzo | mezzoforte | forte | fortissimo.
- range: 1 | 2 | 3 | 4 | 5 | 6 | 7
- Here again the relation is fairly obvious if you consider how printed
words are normally spoken (e.g., "It's not really important
..."). The mapping is a bit trickier, mainly because voices are so much
more expressive in this regard than print. Probably values 1-4 would map
to normal type, and 5-7 would map to italics or oblique type. Possible
mnemonics (could use work): dead | dull | boring | normal | happy |
excited | wild.
- weight: 1 | 2 | 3 | 4 | 5 | 6 | 7
- Stress and font-weight are again fairly naturally related, and
the mapping from numbers to current natural language values is obvious.
Possible mnemonics (more or less from boxing): feather | light |
midlight | middle | midheavy | heavy | superheavy.
- space: 1 | 2 | 3 | 4 | 5 | 6 | 7 {above/right/below/left specified following CSS}
- Here the attribute values should be tied to the visual presentation,
which is richer because printed spaces are two-dimensional while
audio spaces can only be one-dimensional. Space should be tied
to the visual attribute of padding or margin; I picked padding, but
I think that either could be chosen. Possible mnemonics: none (a bit
counter-intuitive at 1) | narrower | narrow | normal | wide | wider |
widest.
- background: <uri>
- Here you just save a little time, but again the meanings match so it
makes sense to allow authors to simultaneously specify audio and visual
backgrounds. The allowed values are the same, so the presentation
software must interpret the URI, but that is trivial - visual backgrounds
go with visual presentations, audio with audio, and so on. Maybe
style sheets should also provide visual before- and after-cues, to go
along with the audio cues that Raman suggests? Once one is allowed, the
other would follow naturally in the same way that background can be used
naturally for both audio and visual presentation without the need for any
extra notation.
Other values can (and generally should) be allowed for most of these
attributes. Physical values obviously should be allowed, to give
authors detailed control over document formats; however, the allowed
values were selected to be as useful and intuitive as possible, to
encourage casual authors to use them rather than physical values. They
should be granular enough to give good control, but not so granular as
to be confusing. The mnemonics could also be allowed values, although
I am not sure that I would recommend this in general.
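To make the idea concrete, here is a minimal sketch of how presentation software might expand the numerical values of size, range, and weight into the unimodal attributes listed in the tables above. The keyword lists come from those tables; pairing them one-to-one with 1-7, the linear scaling onto the audio ranges, and the function names are all assumptions made for illustration, not part of the proposal.

```python
# Keyword lists as quoted in the visual attribute table; pairing them
# position-by-position with the levels 1..7 is an assumption.
FONT_SIZE = ["xx-small", "x-small", "small", "medium",
             "large", "x-large", "xx-large"]
FONT_WEIGHT = ["extra-light", "light", "demi-light", "medium",
               "demi-bold", "bold", "extra-bold"]


def check_level(level):
    """Accept only the integers 1..7 used for multimodal values."""
    if not isinstance(level, int) or isinstance(level, bool) or not 1 <= level <= 7:
        raise ValueError(f"level must be an integer from 1 to 7, got {level!r}")
    return level


def scale(level, top):
    """Map 1..7 linearly onto 0..top (hypothetical mapping)."""
    return round((level - 1) * top / 6)


def expand_size(level):
    level = check_level(level)
    return {"font-size": FONT_SIZE[level - 1], "volume": scale(level, 10)}


def expand_weight(level):
    level = check_level(level)
    return {"font-weight": FONT_WEIGHT[level - 1], "stress": scale(level, 100)}


def expand_range(level):
    level = check_level(level)
    # Values 1-4 map to normal type, 5-7 to italic (or oblique) type.
    style = "italic" if level >= 5 else "normal"
    return {"font-style": style, "pitch-range": scale(level, 200)}


if __name__ == "__main__":
    print(expand_size(4))
    print(expand_weight(7))
    print(expand_range(6))
```

With these assumptions the default value 4 lands on the middle of every unimodal scale, which matches the intent that 4 is always "normal".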
I simultaneously address the third and fourth design goals,
standardization of allowed values and language independence, by
allowing numbers, e.g. [1-7], to be used to represent the values of
multimodal attributes for which this procedure seems to be intuitively
reasonable, in their natural order (smallest=1, normal=4, largest=7).
This is similar to the practices proposed by Raman and the TEO group; the
only significant difference is the proposal to use the same range of
numbers for all attributes. I allow 7 values because that seems to
provide a reasonable amount of granularity for most attributes. Using
the same range for most attributes makes it simpler for authors to
remember the allowed values and their meanings (e.g., 4 is always the
default), so I suggest that the range remain the same across attributes
if possible, even if a different global range is preferred.
Using numbers gives relative language independence because numerical
notation is more widespread than any language. If numbers are allowed,
authors can learn the definitions of the numbers in whatever language
they prefer. There will still be some difficulty for authors
who are not familiar with Arabic numerals, but this could be dealt with
simply by also allowing non-Arabic numerals with the same meanings;
because numbers are well defined, translation is trivial.
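As a rough illustration of how trivial that translation can be, the sketch below reads a single-digit attribute value written in any decimal digit script and checks it against the range [1-7]. The function name parse_level is hypothetical, and the sketch covers only scripts with positional decimal digits; other number systems would need their own tables.

```python
import unicodedata


def parse_level(text, lowest=1, highest=7):
    """Parse a one-character attribute value in any decimal digit script."""
    text = text.strip()
    if len(text) != 1:
        raise ValueError(f"expected a single digit, got {text!r}")
    try:
        # unicodedata.decimal returns the value of a decimal digit
        # character and raises ValueError for anything else.
        value = unicodedata.decimal(text)
    except ValueError:
        raise ValueError(f"{text!r} is not a decimal digit") from None
    if not lowest <= value <= highest:
        raise ValueError(f"{value} is outside the range {lowest}-{highest}")
    return value


if __name__ == "__main__":
    print(parse_level("4"))        # Western digit
    print(parse_level("\u0664"))   # Arabic-Indic digit four
    print(parse_level("\u096a"))   # Devanagari digit four
```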
Because it simultaneously solves two design problems, the practice of
using a standard numerical range as allowed values would also be an
advantageous practice for general styling language design.
An additional minor benefit is also obtained: because each integer is
represented by a single character, the amount of typing needed to create
style descriptions is reduced.
The fifth design goal, allowing independent specifications for audio
and visual presentations, is also easily met. Under CSS, the use of
multimodal attributes would not preclude the specification of
refinements to any single mode of presentation. Rather, authors should
first specify the document style as accurately as possible using
multimodal attributes, and then add further refinements through
modifications of unimodal attributes, so that the document is presented
well to all users.
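One way to picture this two-step process is sketched below: multimodal declarations are expanded first, and any unimodal declarations are then applied on top as refinements. The resolve function, the single size expander, and its linear volume scaling are hypothetical and stand in for whatever rules a real styling language would define.

```python
FONT_SIZE = ["xx-small", "x-small", "small", "medium",
             "large", "x-large", "xx-large"]


def expand_size(level):
    """Hypothetical expansion of the multimodal size attribute (1..7)."""
    return {"font-size": FONT_SIZE[level - 1],
            "volume": round((level - 1) * 10 / 6)}


# Multimodal attributes known to this sketch and how to expand them.
EXPANDERS = {"size": expand_size}


def resolve(declarations):
    """Expand multimodal attributes, then let unimodal ones refine them."""
    resolved = {}
    # Pass 1: multimodal attributes set a value for every mode.
    for name, value in declarations.items():
        if name in EXPANDERS:
            resolved.update(EXPANDERS[name](value))
    # Pass 2: unimodal attributes override the expanded values.
    for name, value in declarations.items():
        if name not in EXPANDERS:
            resolved[name] = value
    return resolved


if __name__ == "__main__":
    # Large text, but spoken more quietly than size 6 alone would imply.
    print(resolve({"size": 6, "volume": 3}))
```

In this sketch the refinement wins regardless of the order in which the author wrote the declarations, which is a design choice rather than something the proposal specifies.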
My sixth design goal is to encourage authors to use the multimodal
attributes provided by the styling language. Establishing and
standardizing multimodal attributes is necessary to enable authors to
easily produce rich web documents for multimodal display, but it is not
sufficient to ensure that authors regularly produce rich documents and
styles suitable for multimodal presentation. Additional steps should be
taken to encourage authors to create customized multimodal style
descriptions in place of unimodal descriptions. For example, in a
well-designed styling language, authors should be pushed to use
multimodal attributes as the first step of designing styles, in
preference to unimodal attributes.
There will be some resistance to using multimodal attributes, as
many authors, regardless of the level of experience, are accustomed to
working primarily with unimodal (usually visual) attributes. To
counteract this resistance, styling language designers should not give
an overcomplete set of attributes, i.e., all unimodal and
multimodal attributes. Instead, for each group of unimodal attributes
that combines to form a multimodal attribute, the multimodal attribute
should replace the richest of the unimodal attributes (so that authors
are likely to need less refinement of unimodal attributes). For the
attributes proposed above, this scheme could be implemented as follows.
- size
- could probably replace either attribute
reasonably well. It should probably replace the visual attribute,
font-size, as authors are probably more likely to prescribe visual
style than audio style. In this case, the name would be a bit less
intuitive than before, but this could be an advantage as it would serve
to remind the author that the attribute is multimodal rather than
unimodal.
- range
- would replace the audio attribute, pitch-range, which
has more granularity and therefore carries more information than the
visual attribute, font-style. The association with an audio property
might discourage visual authors from using this attribute, however,
especially because the two attributes overlap in meaning but are not
equivalent, so I suggest a replacement of
font-style as discussed below.
- weight
- would replace the visual attribute, font-weight,
following the case for size.
- space
- would again replace the visual attribute, padding, as in
the cases above (although padding may be a better name for this
attribute).
- background
- would replace both attributes. The use of multiple
URIs of different types should be allowed, as presentation software can
tell easily which to use from the context (e.g., use a visual
background for visually displayed text, not an audio background).
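A minimal sketch of that selection step follows. It assumes, purely for illustration, that the presentation software guesses the type of each background URI from its file extension; the extension table, the function name pick_background, and the example.org URIs are all hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension table; real software would more likely rely on
# the media type reported when the resource is fetched.
MODE_BY_EXTENSION = {
    ".gif": "visual", ".jpg": "visual", ".jpeg": "visual", ".png": "visual",
    ".au": "audio", ".wav": "audio", ".aiff": "audio",
}


def pick_background(uris, mode):
    """Return the first background URI suited to the display mode."""
    for uri in uris:
        extension = posixpath.splitext(urlparse(uri).path)[1].lower()
        if MODE_BY_EXTENSION.get(extension) == mode:
            return uri
    return None  # no background available for this mode


if __name__ == "__main__":
    backgrounds = ["http://example.org/paper.gif",
                   "http://example.org/rain.au"]
    print(pick_background(backgrounds, "visual"))
    print(pick_background(backgrounds, "audio"))
```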
In addition to the new multimodal attributes, two changes to the CSS
scheme would be needed. I propose that small-caps be added to the set of
allowed values for the CSS attribute text-transform, and that a new
attribute, emph-style, be added, as given below. The old attribute,
font-style, would then be expressed through combinations of range,
text-transform, and emph-style.
- emph-style: italic | oblique
- controls whether high-range text is presented in italic or oblique
type.
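The decomposition of font-style can be sketched as a small conversion routine. The choice of range 5 for italic or oblique text (the lowest value that maps to italics in the earlier discussion of range) and the function name split_font_style are assumptions made for illustration.

```python
def split_font_style(font_style):
    """Rewrite an old font-style value as range, text-transform, emph-style."""
    result = {"range": 4}  # 4 is the normal (default) range
    for word in font_style.split():
        if word in ("italic", "oblique"):
            result["range"] = 5          # assumed: lowest "high-range" value
            result["emph-style"] = word
        elif word == "small-caps":
            result["text-transform"] = "small-caps"
        elif word != "normal":
            raise ValueError(f"unknown font-style value: {word!r}")
    return result


if __name__ == "__main__":
    print(split_font_style("normal"))
    print(split_font_style("oblique"))
    print(split_font_style("italic small-caps"))
```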
One difficulty with the scheme proposed here is that many authors will
want to specify physical quantities, such as the font-size, very
precisely. Solving this problem requires only a standard mapping
from the interval [1-7] to the range of reasonable physical values for
each attribute. Then, if the author specifies a physical value (with
units) for a multimodal attribute, the units will enable the presentation
software to determine to which mode the value applies, and the mappings
can be used to calculate the appropriate values for the other modes
connected with the attribute.
Alternatively, if such a mapping exists, the value could be more
precisely specified by allowing any number in the range [1-7], and not
just integers. This is probably preferable, as it decreases the
device-dependence of the style, and so should probably be allowed.
However, physical values (with units) should be allowed in any case, as
a large body of authors are accustomed to using those values.
I have not attempted to produce the required mappings here. That is
left for the present as a research problem, as it will be best solved
experimentally by having a wide range of subjects evaluate a large
number of displays with a variety of mappings.
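Purely to illustrate the mechanism, and not as a candidate for the real mappings, the sketch below uses placeholder linear mappings between [1-7] and two invented physical ranges. The unit of the author's value selects the mode, the value is converted to a (possibly fractional) level, and the level is converted to physical values for the other modes. Every number and name in it is hypothetical.

```python
# Placeholder physical ranges for the size attribute: unit -> (mode, value
# at level 1, value at level 7). These numbers are invented for the sketch.
PHYSICAL_RANGE = {
    "pt": ("visual", 6.0, 24.0),   # font size in points
    "dB": ("audio", 40.0, 80.0),   # speech volume in decibels
}


def level_from_physical(value, unit):
    """Convert a physical value to a level in [1, 7]; the unit picks the mode."""
    mode, low, high = PHYSICAL_RANGE[unit]
    level = 1 + 6 * (value - low) / (high - low)
    return mode, min(7.0, max(1.0, level))  # clamp to the allowed interval


def physical_from_level(level, unit):
    """Convert a level in [1, 7] back to a physical value in the given unit."""
    _, low, high = PHYSICAL_RANGE[unit]
    return low + (level - 1) * (high - low) / 6


def spread(value, unit):
    """Keep the author's value for its own mode and derive the others."""
    _, level = level_from_physical(value, unit)
    return {other: value if other == unit else physical_from_level(level, other)
            for other in PHYSICAL_RANGE}


if __name__ == "__main__":
    print(level_from_physical(15.0, "pt"))
    print(spread(15.0, "pt"))
```

The same two functions also cover non-integer levels such as 4.5, since nothing in the linear mapping requires whole numbers.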
Although I have used HTML style sheet proposals as examples here, the
design goals and the methods used to achieve them would apply to
multimodal styling languages for use with any DTD. The third and
fourth, standardization of allowed attribute values and minimal language
dependence, are also useful goals for unimodal styling language
designers (as are the first and seventh, but they are so intuitive that
they are almost universally followed).
This is not a complete proposal for a multimodal HTML styling
language. I have, however, proposed the creation of five multimodal
style attributes, and the elimination or modification of five related
visual or audio attributes in order to encourage authors to use the
multimodal attributes. As proposed, the creation of one new visual
attribute and the slight modification of another would also be
necessary to recover the full proposed functionality of CSS.
The changes proposed to current HTML styling languages are
small but important. Without these changes, it will be significantly
harder for web authors to produce rich documents that can be presented
well in all modes, and therefore most documents will be designed for
unimodal (probably visual) display. These changes could be implemented
in the future, but this would result in a significant diminution of their
full power, as once visual authors become accustomed to visually oriented
styling languages there will be more resistance to the multimodal forms.
In addition, there may be some problems with backward compatibility of
attributes, as in the case of font-style, which may be easily eliminated
if designers plan now for eventual conversion to multimodal styling
languages. Thus, I suggest that the proposed changes be made early,
before visual styling technology has a chance to become widespread and
the current technology is locked in.
Last modified 1 March 1996 by
David Seibert.
seibert@hep.physics.mcgill.ca