Introduction to Unicode

Notion of character
Character set
Encodings
Combining character sequences
Grapheme clusters
Normalization
Other operations
Character properties

Language terms:

Character

a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language.

Grapheme

the fundamental unit in written language.

Glyph

an element of writing in a natural language; one of the possible visual variants of the abstract unit known as a grapheme.

As can be seen, the notion of a character is only loosely defined. Unicode approximates it at several levels, which are explained further below.

Unicode terms:

Code unit

the unit in terms of which a string of text is encoded in one of the Unicode transformation formats.

Code point

a numerical value, potentially encoded as several code units, that belongs to the Unicode code space, i.e. the range of integers that Unicode maps to characters.

Combining character sequence

a sequence of code points that forms the unit for the Unicode composition and decomposition processes.

Grapheme cluster

a cluster of code points that form a grapheme.

The general idea is that "character" in Unicode usually refers to code points, while higher levels of abstraction are usually referred to as "abstract characters", or directly by their actual name.

The Unicode character set is a mapping that associates code points, which are integers, to characters for any writing system or language.

As of version 5.1, there are 100,507 characters, which would require a storage capacity of 17 bits per code point. The Unicode standard, however, organizes the code space into code ranges known as planes, so that code points range up to U+10FFFF and really require a storage capacity of 21 bits.

Since microprocessors usually deal with integers whose widths are multiples of 8 bits, a naive approach would use 32 bits per code point, which is quite wasteful, especially since most commonly used characters lie in the Basic Multilingual Plane (BMP), which fits in 16 bits.
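To make the numbers concrete, here is a small illustration (in Python, whose strings expose code points directly; this is purely illustrative and not the library's API):

```python
# Code points are plain integers; ord()/chr() expose the mapping directly.
assert ord("A") == 0x41              # ASCII letter
assert ord("é") == 0xE9              # still within the BMP (fits in 16 bits)
assert ord("𝄞") == 0x1D11E           # MUSICAL SYMBOL G CLEF, outside the BMP

# The highest valid code point, U+10FFFF, needs 21 bits of storage.
assert (0x10FFFF).bit_length() == 21
assert ord("𝄞").bit_length() == 17
```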

That is why variable-width encodings were designed, where each code point is represented by a variable number of code units, formerly also known as code values.

The UTF-X family of encodings encodes a single code point as a variable number of code units, each of which is X bits wide.

UTF-32

This encoding is fixed-width, each code unit is simply a code point.

This encoding isn't really recommended for internal representations other than for use with algorithms that strictly require random access to code points.
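The fixed-width property can be observed with any codec implementation; for example, using Python's built-in codecs as an illustration:

```python
import struct

s = "a𝄞"                              # one BMP and one non-BMP code point
units = s.encode("utf-32-be")         # big-endian, no byte order mark
# Fixed width: every code point occupies exactly one 32-bit code unit.
assert len(units) == 4 * len(s)
# Random access is trivial: the n-th code point starts at byte 4*n.
second = struct.unpack(">I", units[4:8])[0]
assert second == ord("𝄞") == 0x1D11E
```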

UTF-16

Every code point is encoded as one or two code units. A code point within the BMP is represented by a single code unit with the same value. Any other code point is represented by a surrogate pair: two code units whose values lie in the surrogate range of Unicode code points.

This is the recommended encoding for dealing with Unicode internally for general purposes, since it has fairly low processing overhead compared to UTF-8 and doesn't waste as much memory as UTF-32.
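The surrogate-pair mechanism can be seen by inspecting the encoded code units (again illustrated with Python's codecs rather than the library's own facilities):

```python
# A BMP code point is a single 16-bit code unit with the same value.
assert "é".encode("utf-16-be") == b"\x00\xe9"

# A code point outside the BMP becomes a surrogate pair.
pair = "𝄞".encode("utf-16-be")         # U+1D11E
hi = int.from_bytes(pair[:2], "big")
lo = int.from_bytes(pair[2:], "big")
assert 0xD800 <= hi <= 0xDBFF          # high (lead) surrogate
assert 0xDC00 <= lo <= 0xDFFF          # low (trail) surrogate

# Reconstructing the code point from the pair:
assert 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00) == 0x1D11E
```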

UTF-8

This encoding was designed to be compatible with legacy, 8-bit based, text management.

Every code point within the ASCII range is represented as a single byte with that ASCII value; other code points are represented as variable-sized sequences of two to four bytes, all of which lie outside the ASCII range.

This encoding is popular for data storage and interchange, but can also be useful for compatibility with byte-oriented string manipulation.
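The ASCII-compatibility property is easy to check (a Python sketch for illustration):

```python
# ASCII code points map to themselves, one byte each.
assert "A".encode("utf-8") == b"A"
# Other code points take two to four bytes.
assert len("é".encode("utf-8")) == 2       # U+00E9
assert len("𝄞".encode("utf-8")) == 4       # U+1D11E
# Every byte of a multi-byte sequence has its high bit set, so
# byte-oriented code never mistakes one for an ASCII character.
assert all(b >= 0x80 for b in "é𝄞".encode("utf-8"))
```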

A non-combining code point may be followed by an arbitrary number of combining code points to form a single combining character sequence, which is really a composite character.

Certain characters are only available as a combination of multiple code points, while others, typically the most frequently used ones, are also available as a single precomposed code point. The order of the combined code points may also vary, but all code point combinations leading to the same abstract character are still canonically equivalent.

While a combining character sequence can be arbitrarily big, the Unicode standard also introduces the concept of a stream-safe string, where a combining character sequence is at most 31 code points long, which is more than sufficient for any linguistic use.
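The properties above can be demonstrated with Python's standard unicodedata module (used here purely as an illustration of the Unicode data, not of this library):

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT and COMBINING DOT BELOW:
# one combining character sequence, three code points.
seq = "e\u0301\u0323"
assert len(seq) == 3
assert unicodedata.combining("e") == 0          # the base is a starter
assert unicodedata.combining("\u0301") == 230   # combining class of the acute
assert unicodedata.combining("\u0323") == 220   # combining class of the dot

# Swapping marks of different combining classes yields a canonically
# equivalent sequence: both orders normalize to the same string.
assert (unicodedata.normalize("NFC", "e\u0301\u0323")
        == unicodedata.normalize("NFC", "e\u0323\u0301"))
```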

Yet another, higher-level abstraction of character is the grapheme cluster, i.e. a cluster of code points that constitutes a grapheme. All combining character sequences are grapheme clusters, but there are other sequences of code points that are as well; for example \r\n is one.

For certain classes of applications, such as word processors, it can be important to operate at the grapheme level rather than at the code point or combining character sequence level, since graphemes are the units in terms of which a document is composed.
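A small illustration of why code-point-level operations can break graphemes (a Python sketch):

```python
# "aé", with the é spelled as 'e' + COMBINING ACUTE ACCENT (two code points).
s = "ae\u0301"
# Reversing at the code point level strands the accent at the front of the
# string, detached from its base: the result is no longer "éa".
assert s[::-1] == "\u0301ea"
# A grapheme-aware reversal would keep "e\u0301" together, giving "e\u0301a".
```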

The Unicode standard defines four normalization forms in Annex #15 - Unicode Normalization Forms, where combining character sequences are either fully composed or fully decomposed, using either canonical or compatibility decompositions.

Normalization Form C is of great interest, as it composes every grapheme so that it uses as few code points as possible. It is also the form that interoperates best with legacy systems unaware of combining character sequences and with font rendering systems, and it is the normalization form assumed by the XML standard.

On the other hand, Normalization Form D uses more space, but is more efficient to compute, and cheaper to maintain when concatenating combining characters to a string while preserving the form.
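The effect of the canonical forms can be observed with Python's standard unicodedata module (again only as an illustration of the Unicode algorithms, not of this library's interface):

```python
import unicodedata

decomposed = "e\u0301"                     # 'e' + COMBINING ACUTE ACCENT
# NFC composes the sequence into the single precomposed code point U+00E9.
nfc = unicodedata.normalize("NFC", decomposed)
assert nfc == "\u00e9" and len(nfc) == 1
# NFD fully decomposes it back.
nfd = unicodedata.normalize("NFD", "\u00e9")
assert nfd == decomposed and len(nfd) == 2
# NFKC/NFKD additionally apply compatibility decompositions,
# e.g. the ligature U+FB01 becomes the two letters "fi".
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```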

The Unicode standard also specifies various features such as a collation algorithm in Technical Standard #10 - Unicode Collation Algorithm for comparison and ordering of strings with a locale-specific criterion, as well as mechanisms to iterate over words, sentences and lines in Annex #29 - Text Segmentation.

Those features are not implemented by the current version of the library.

Unicode also provides a database of character properties called the Unicode Character Database (UCD), which consists of a set of files describing the following properties:

  • Name.
  • General category (classification as letters, numbers, symbols, punctuation, etc.).
  • Other important general characteristics (white space, dash, ideographic, alphabetic, non character, deprecated, etc.).
  • Character shaping (bidi category, shaping, mirroring, width, etc.).
  • Case (upper, lower, title, folding; both simple and full).
  • Numeric values and types (for digits).
  • Script and block.
  • Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).
  • Age (version of the standard in which the code point was first designated).
  • Boundaries (grapheme cluster, word, line and sentence).
  • Standardized variants.

The database is useful for Unicode implementation in general, as it is the basis for most algorithms, but it can also be of interest to the library user who wants to implement facilities not provided by the library core.
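Several of the properties listed above can be queried through Python's standard unicodedata module, which wraps part of the UCD (shown here as an illustration of the data, independent of this library):

```python
import unicodedata

assert unicodedata.name("\u00e9") == "LATIN SMALL LETTER E WITH ACUTE"
assert unicodedata.category("\u00e9") == "Ll"              # general category: lowercase letter
assert unicodedata.decomposition("\u00e9") == "0065 0301"  # canonical decomposition
assert unicodedata.combining("\u0301") == 230              # canonical combining class
assert unicodedata.numeric("\u00bd") == 0.5                # numeric value of '½'
```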
