Re: Initial Draft --Cascaded Speech Style Sheets

Evan Kirshenbaum (evan@poirot.hpl.hp.com)
Tue, 13 Feb 1996 18:16:26 -0800


> Here is a first-cut at a draft specification for speech stylesheets.

Good first cut. I do [of course :-)] have some suggestions.

First off, a caveat: while I have a fair bit of experience in language
design, I have almost none in auditory or speech systems.

My main observation is that you have a lot of attributes specified
very precisely and numerically. One of the strengths that I see in
CSS is that it allows the author to specify the values using
meaningful symbols, which allows the user to customize their browser
to map onto desired interpretations. As a simple example, I certainly
don't want to listen to a page whose designer has specified the volume
in decibels without knowing whether I am listening to it through
headphones or playing it to a lecture hall. I would rather have them
tell me that it is louder or softer relative to some baseline volume
which I get to set.
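
To make that concrete, here is a rough sketch in Python of how a browser might resolve a symbolic volume against a baseline that the listener sets. The keyword names, the scale factors, and the function name resolve_volume are all my own assumptions for illustration; nothing here comes from the draft.

```python
# Hypothetical multipliers applied to a listener-chosen baseline level.
# Both the keywords and the factors are illustrative assumptions.
VOLUME_SCALE = {
    "very soft": 0.25,
    "soft": 0.5,
    "normal": 1.0,
    "loud": 1.5,
    "very loud": 2.0,
}


def resolve_volume(keyword, user_baseline):
    """Scale the listener's baseline by the factor for a symbolic keyword."""
    try:
        return VOLUME_SCALE[keyword] * user_baseline
    except KeyError:
        # Unknown keyword: fall back to the listener's own baseline.
        return user_baseline
```

The point of the design is that the author only ever says "loud"; whether that means headphones or a lecture hall is decided by the baseline on the listener's side.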

In the same vein, you occasionally talk in terms of free-form
strings which the browser will interpret (as for the specification of
voice). This will only work if there is some relatively widely agreed
upon standard for naming the resource, as there is for fonts. You are
generally better off coming up with a set of values specified by the
standard and using a URL (or system-dependent string) as a fallback to
point to a description of the resource.

Finally, you have several places in which you allow "device-specific"
values. This is generally dangerous, especially as different devices
may assign different meanings to the various values. If you must
allow this (and I would recommend against it, in favor of being a
little lenient about letting people experiment with added attributes),
I'd make sure that the code identifies the device with respect to
which the attribute and value are to be interpreted.

On to the specifics:

- For volume, I certainly wouldn't specify a concrete number of
decibels. (And if you must allow this, at least force the author to
suffix "dB".) I'd go more with a set analogous to that used for
font-size: very soft, soft, normal, loud, very loud. For relative
values, I'd probably allow [much] louder/softer.

- For voice-family, you have the problem (I assume) that there aren't
any good standards for names. As with font-family, I'd define a few
that can be assumed. My recommendation would be

male/female-[adult/child/elder]-[<n>]

where the optional trailing index can be used to contrast similar
voices (male-child-1 with male-child-2). If there is a way to
describe a voice, it should probably be allowable to point at it by
URL or name. As with font-family, it should be possible to specify
a list of such values, with the browser picking the first that it
understands.

- For speech-rate, I'd append "wpm" to the number. I'd also allow
(and recommend the use of): very slow, slow, normal, fast, very
fast. For relative values, I'd add [much] slower/faster.

- For average-pitch, I'd append "Hz" to the number. I'd also allow
(and recommend the use of): soprano, alto, tenor, baritone, bass
(and perhaps a couple of others), as well as [much] higher/lower.

- For pitch-range, I'd add something like: monotone, normal, animated
(and if this is the place to add it: whisper, scream, shriek, etc.)
and possibly [much] more/less animated.

- For stress, the notion that some elements of a sentence get primary,
secondary, or tertiary stress is hard to map onto elements. For
relative stress of elements with respect to the surrounding context,
perhaps: destressed, unstressed, [weakly/highly] stressed, with
[much] more/less stressed as the relative. Perhaps the attribute
should be changed to "emphasis".

- For richness, I'd try to select a set of canonical symbolic values.

- For speech-other, either drop entirely or make the value be a list
of triples, with the name of the device (or schema) encoded as well.

- For pause-before-pause (which should probably change to simply
pause-before), etc., add "ms" after the number, and add: none, very
short, short, medium, long, very long, as well as [much]
shorter/longer.

- For pronunciation-mode, you need to define at least a first cut at a
canonical set (which not all browsers need be able to understand).

- language, country, and dialect are all combined in a single value
according to RFC 1766, which is used as the value of the LANG
attribute in the HTML internationalization draft and the value of
the Content-Language header in MIME (and therefore HTTP). It
probably doesn't hurt to have a single attribute with an rfc1766
value, but the information should already be available to the
browser, and I'm not sure what the appropriate behavior should be if
the element's LANG attribute and its style sheet's language
attribute disagree.

- For the various non-speech cues, I would strongly recommend against
talking about file names. URLs are probably best, but a good base
set of assumable effects is probably a good idea.

- For these cues (especially for during-sound), you probably want to
be able to specify a "cue-volume", as it will probably need to
differ from the speech volume.
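
As a rough illustration of two of the suggestions above (unit suffixes on numeric values, and a fallback list for voice-family), here is a minimal Python sketch. The table of which property takes which unit, and the names parse_value and pick_voice, are assumptions of mine and not part of the draft.

```python
import re

# Unit suffix expected for each numeric property (assumed pairing,
# following the dB / wpm / Hz / ms suggestions above).
UNITS = {
    "volume": "dB",
    "speech-rate": "wpm",
    "average-pitch": "Hz",
    "pause-before": "ms",
}

_NUMBER_WITH_UNIT = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]+)$")


def parse_value(prop, text):
    """Return ("number", value, unit) or ("keyword", text).

    A bare number, or a number with the wrong unit, is rejected.
    """
    text = text.strip()
    match = _NUMBER_WITH_UNIT.match(text)
    if match:
        value, unit = float(match.group(1)), match.group(2)
        if unit != UNITS.get(prop):
            raise ValueError(f"{prop}: unexpected unit {unit!r}")
        return ("number", value, unit)
    if re.match(r"^\d", text):
        raise ValueError(f"{prop}: number needs a unit suffix")
    # Anything else is treated as a symbolic keyword such as "fast".
    return ("keyword", text)


def pick_voice(candidates, available):
    """Return the first requested voice the browser understands, else None."""
    for name in candidates:
        if name in available:
            return name
    return None
```

For example, parse_value("speech-rate", "120wpm") yields a number with its unit, parse_value("speech-rate", "fast") yields a keyword, and a bare "120" is refused. pick_voice walks the author's list in order, as a browser does for font-family.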

----
Evan Kirshenbaum +------------------------------------
HP Laboratories |The plural of "anecdote"
1501 Page Mill Road, Building 1U |is not "data"
Palo Alto, CA 94304

kirshenbaum@hpl.hp.com
(415)857-7572

http://www.hpl.hp.com/personal/Evan_Kirshenbaum/