RE: WWW support for Cyrillic (and UNICODE)

Vladimir Sukonnik, Process Software Corp (sukonnik@elnath.process.com)
Thu, 3 Nov 94 08:03:45 PST


>Date: Wed, 2 Nov 94 16:19:11 CST
>From: "Richard L. Goerwitz" <goer@midway.uchicago.edu>
>Subject: RE: WWW support for Cyrillic (and UNICODE)

>I've heard the Microsoft hoopla, but so far can't determine what it
>is all about. From what I can tell, Microsoft is using Unicode ac-
>cording to the old, internationalization/localization model. It is
>not using Unicode as it should be used, namely as a multilingual en-
>coding standard. Take a look at Apple's WorldScript for an example
>of that.

I apologize for taking up the bandwidth for something that may be
trivial to the list (please let me know if it is), but I need a little education.

Would you please explain:

1. What the old internationalization/localization model is,
2. What the multilingual encoding standard is, and
3. How UNICODE fits into all of this.

>If you have the specs on hand, tell me what the 32-bit GUI does with
>characters in the Arabic and Hebrew code blocks, by the way. I am
>not fishing for a particular response here, but I suspect that the
>system is nonconformant in that it will recognize the codes, but yet
>fail to show the behavior outlined in appendix A of the standard.

Below, I am including a write-up on UNICODE that I found in the Microsoft
Development Library (the October '94 issue). I would very much like to understand
what the issues are. Thanks for taking your time to explain it.

>Richard Goerwitz

Best regards,
Vladimir.

Since code pages are different for each script and operating environment, attempts to
standardize and consolidate all code pages onto one code page are now underway. One
standard is called Unicode, an effort driven by Apple, Borland, Digital,
Hewlett-Packard, IBM, Lotus, Metaphor, Microsoft, Next, Novell, Research Libraries
Group, Sun, WordPerfect, and Xerox.
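
As an illustration of the code-page problem for the Cyrillic case this thread is about, here is a minimal sketch in Python. Python and its bundled codec names (cp1252, cp1251, koi8_r, cp866) are assumptions made for the example and are not part of the write-up; the byte value 0xE0 is an arbitrary pick.

```python
# One byte value, four code pages, four different characters.
# Codec names are Python's own; the write-up does not name specific code pages.
CODE_PAGES = ["cp1252", "cp1251", "koi8_r", "cp866"]


def decode_under_each(byte_value):
    """Map each code page name to the character that byte_value stands for there."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in CODE_PAGES}


def code_points(byte_value):
    """Same mapping, but as Unicode code points, which are unambiguous."""
    return {name: ord(ch) for name, ch in decode_under_each(byte_value).items()}


result = code_points(0xE0)
# cp1252 gives U+00E0 (a with grave), cp1251 gives U+0430 (Cyrillic small a),
# koi8_r gives U+042E (Cyrillic capital yu), cp866 gives U+0440 (Cyrillic small er).
```

Without knowing which code page the sender used, the receiver cannot tell which of the four characters was meant; a single shared encoding removes that guesswork.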

Unicode is a pure 16-bit character encoding that encompasses all characters used in
general text interchange. The two-volume standard is available from Addison-Wesley as
The Unicode Standard: Worldwide Character Encoding, Volume 1 and The Unicode Standard:
Worldwide Character Encoding, Volume 2 (ISBN 0-201-56788-1 and ISBN 0-201-60845-6,
respectively).

The ISO DIS (Draft International Standard) 10646.2 was merged with Unicode version 1.0 to
form Unicode's current version, 1.1.

Differences between Unicode and Existing Code Pages

The main differences between Unicode and existing code pages are as follows:

All characters are 16 bits wide. Unlike DBCS, there are no lead bytes. Random
access to character strings is also possible, and programs generally don't need to
maintain state information when parsing strings.

A Unicode index refers unequivocally to a given character; for example, the
symbol happy face and the control code Ctrl-A are two different characters in Unicode.

Unicode is a character encoding. Ligatures are not characters, but glyphs. Text
in Unicode is always in character form. Only in final output stage does an application
or graphics engine combine f and i and substitute the glyph for the fi ligature.

There are non-spacing accents in Unicode that can be combined with base
characters to create composite letters.

Unicode provides mappings to and from all the important single-byte code pages in use on
computers today.
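
Two of the differences listed above can be made concrete with a short sketch: random access without lead bytes, and non-spacing accents as separate characters. The assumptions here are Shift-JIS standing in for "DBCS", Python's utf-16-le codec standing in for the pure 16-bit encoding (the two agree for the characters used), and little-endian byte order. None of these choices come from the write-up.

```python
import struct
import unicodedata


def is_sjis_lead_byte(b):
    """True if b starts a two-byte character in Shift-JIS (one common DBCS)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def dbcs_char_at(data, index):
    """Find character number `index` by scanning from the start of the string."""
    pos = 0
    for _ in range(index):
        pos += 2 if is_sjis_lead_byte(data[pos]) else 1
    width = 2 if is_sjis_lead_byte(data[pos]) else 1
    return data[pos:pos + width]


def wide_char_at(data, index):
    """Fetch 16-bit character number `index` directly: its offset is 2 * index."""
    (code,) = struct.unpack_from("<H", data, 2 * index)
    return code


def is_nonspacing(ch):
    """True if ch is a non-spacing mark (Unicode general category Mn)."""
    return unicodedata.category(ch) == "Mn"


text = "A\u65e5\u672cB"             # "A", two kanji, "B"
sjis = text.encode("shift_jis")      # bytes 41 93 FA 96 7B 42
wide = text.encode("utf-16-le")      # four characters, two bytes each

# The trail byte 0x7B is also ASCII "{", so a DBCS parser that loses track
# of where it is can mistake half a kanji for a brace.
third_dbcs = dbcs_char_at(sjis, 2)   # needs a scan over the first two characters
third_wide = wide_char_at(wide, 2)   # one lookup at byte offset 4

# A composite letter stored as base character plus non-spacing accent.
composite = "e\u0301"                # e followed by COMBINING ACUTE ACCENT
```

The DBCS lookup costs time proportional to the index, while the 16-bit lookup is constant. The composite string holds two characters even though a display engine would draw a single accented letter.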

Unicode Goals

The following are the basic goals of Unicode:

Eliminate special-case Systems and Applications code for multiple character
sets, thus speeding up localization and reducing testing time.

Make a larger range of characters available than will fit in a single-byte code
page.

Ensure that character code is independent of compression and text formatting
considerations.

Make code more efficient when used as an internal processing code.

Ensure that Unicode is complete when used as data interchange or reference code.

Unicode provides a model that has much of the simplicity and efficiency of the
plain-text model, but with greater international capabilities. Microsoft is supporting
Unicode in the 32-bit API for Windows (Win32 API), and several other companies are
working on Unicode implementation.

Unicode is a Character Encoding

Unlike DBCS, Unicode is not a variable-length encoding. Text in Unicode cannot be passed
to functions that are expecting zero-terminated ASCII strings. The first 256 characters
in Unicode have the same layout as the international standard ISO 8859/1, but each
character is zero-extended to 16 bits. Because many Unicode characters therefore contain
one null byte, the terminator in a Unicode string is the 16-bit value 0x0000.
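
The zero-extension and terminator points can be sketched as follows. This is an illustration only: the helper names are hypothetical, and little-endian byte order is an assumption the write-up does not state.

```python
def latin1_to_wide(data):
    """Zero-extend each ISO 8859-1 byte to 16 bits (little-endian by assumption)."""
    out = bytearray()
    for b in data:
        out += bytes((b, 0x00))
    return bytes(out)


def byte_strlen(buf):
    """What a zero-terminated ASCII routine sees: stop at the first 0x00 byte."""
    return buf.index(0)


def wide_strlen(buf):
    """Count 16-bit characters up to the 0x0000 terminator."""
    for n in range(len(buf) // 2):
        if buf[2 * n:2 * n + 2] == b"\x00\x00":
            return n
    raise ValueError("no 0x0000 terminator found")


latin1 = "Caf\u00e9".encode("latin-1")        # four bytes: 43 61 66 E9
wide = latin1_to_wide(latin1) + b"\x00\x00"   # 43 00 61 00 66 00 E9 00 00 00

seen_by_ascii_code = byte_strlen(wide)        # stops right after "C"
seen_by_wide_code = wide_strlen(wide)         # counts all four characters
```

A byte-oriented routine handed this buffer treats the high byte of the first character as the end of the string, which is why Unicode text needs its own 16-bit string functions.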

ISO 8859/1 was used as the source for the Windows ANSI character set. Except for the 80H
through 9FH range, which is the C1 control-code range, there is a one-to-one
correspondence between Windows ANSI and Unicode. The few additional characters defined
in the Windows ANSI set in this range have corresponding characters elsewhere in
Unicode.
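
The correspondence described above can be checked with a small sketch. It assumes Python's cp1252 codec is an acceptable stand-in for the Windows ANSI character set; that codec reflects the current table, which includes a few assignments made after this write-up, and it treats five byte values in the 80H to 9FH range as unassigned.

```python
def ansi_to_unicode(byte_value):
    """Unicode code point for one Windows ANSI byte, or None if unassigned."""
    try:
        return ord(bytes([byte_value]).decode("cp1252"))
    except UnicodeDecodeError:
        return None


# Bytes whose Unicode code point equals the byte value itself.
identical = [b for b in range(0x100) if ansi_to_unicode(b) == b]

# Bytes in the C1 range that Windows ANSI assigns to printable characters.
remapped = {b: ansi_to_unicode(b) for b in range(0x80, 0xA0)
            if ansi_to_unicode(b) is not None}
```

With this codec, `identical` covers everything outside 80H through 9FH, and every entry in `remapped` lands above U+00FF; for example, byte 93H (left double quotation mark) maps to U+201C.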

Volume 1 of the Unicode 1.0 standard was published in the fall of 1991. This volume
contains assigned codes for every major alphabet and symbol in the world except for the
Han characters of the Korean, Chinese, and Japanese writing systems. Volume 2,
containing a unified Han character set, became available in early 1992.


+---------------------------------------------------------------+
| Vladimir Sukonnik Voice: 1-508-879-6994 |
| Principal Software Engineer http://www.process.com |
| Process Software Corp Fax: 1-508-879-0042 |
| 959 Concord Street E-mail: sukonnik@process.com or |
| Framingham, MA 01760 USA sukonnik@bumetb.bu.edu |
+---------------------------------------------------------------+