
Overview

Components
Organization
Linking

Part of this library is header-only, while the rest requires linking against a compiled library.

The library provides the following header-only components:

  • the types char16 and char32, suitable for holding UTF-16 and UTF-32 code units respectively,
  • a comprehensive converter and segmenter framework, which allows, among other things, converting a range as it is iterated or converting a file stream using a codecvt facet,
  • converters between the various UTF encodings and the locale character sets,
  • compile-time Unicode strings and compile-time UTF converters,
  • converters that compose or decompose Hangul characters.
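Hangul composition and decomposition can be header-only because the Unicode Standard defines them arithmetically, so no lookup table is required. The following is a minimal sketch of that arithmetic, not this library's implementation; the names `HangulJamo`, `decompose_hangul` and `compose_hangul` are hypothetical and chosen for illustration only.

```cpp
#include <cassert>

// Constants of the Hangul syllable arithmetic defined by the Unicode Standard.
constexpr char32_t S_BASE = 0xAC00;  // first precomposed syllable
constexpr char32_t L_BASE = 0x1100;  // first leading consonant
constexpr char32_t V_BASE = 0x1161;  // first vowel
constexpr char32_t T_BASE = 0x11A7;  // one before the first trailing consonant
constexpr int L_COUNT = 19;
constexpr int V_COUNT = 21;
constexpr int T_COUNT = 28;                 // includes the "no trailing consonant" slot
constexpr int N_COUNT = V_COUNT * T_COUNT;  // 588
constexpr int S_COUNT = L_COUNT * N_COUNT;  // 11172

// Hypothetical result type: t == 0 means "no trailing consonant".
struct HangulJamo { char32_t l; char32_t v; char32_t t; };

inline bool is_hangul_syllable(char32_t c) {
    return c >= S_BASE && c < S_BASE + S_COUNT;
}

// Splits a precomposed syllable into its jamo using integer division only.
inline HangulJamo decompose_hangul(char32_t s) {
    assert(is_hangul_syllable(s));
    const int s_index = static_cast<int>(s - S_BASE);
    const int l_index = s_index / N_COUNT;
    const int v_index = (s_index % N_COUNT) / T_COUNT;
    const int t_index = s_index % T_COUNT;
    HangulJamo j;
    j.l = static_cast<char32_t>(L_BASE + l_index);
    j.v = static_cast<char32_t>(V_BASE + v_index);
    j.t = t_index != 0 ? static_cast<char32_t>(T_BASE + t_index) : char32_t(0);
    return j;
}

// Inverse operation: builds the precomposed syllable from its jamo.
inline char32_t compose_hangul(char32_t l, char32_t v, char32_t t = 0) {
    assert(l >= L_BASE && l < L_BASE + L_COUNT);
    assert(v >= V_BASE && v < V_BASE + V_COUNT);
    assert(t == 0 || (t > T_BASE && t < T_BASE + T_COUNT));
    const int l_index = static_cast<int>(l - L_BASE);
    const int v_index = static_cast<int>(v - V_BASE);
    const int t_index = t != 0 ? static_cast<int>(t - T_BASE) : 0;
    return static_cast<char32_t>(
        S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index);
}
```

For instance, `decompose_hangul(0xD55C)` yields the jamo U+1112, U+1161 and U+11AB, and `compose_hangul` maps them back to U+D55C.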

The following features are available by linking against the library:

  • a Unicode character database, which, for each Unicode code point, provides many properties,
  • converters for decomposition, composition, and normalization,
  • functions to concatenate normalized ranges,
  • segmenters for graphemes and, in the near future, words, sentences and line breaks.

This library defines the concepts of Converter and Segmenter, which are mechanisms to arbitrarily convert or segment ranges of data, expressed as pairs of iterators. The Converter and Segmenter framework allows these operations to be performed either eagerly or lazily.
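To illustrate the idea of a converter that works on a pair of iterators, the sketch below applies one conversion step either to a whole range at once or on demand, one element at a time, as the range is iterated. It is a simplified model and not this library's API: `Utf8DecodeStep`, `convert_eager` and `LazyConverted` are hypothetical names, the decoder assumes well-formed UTF-8 and performs no validation, and the on-demand cursor assumes each step produces exactly one output value.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical conversion step: consumes one UTF-8 sequence from [begin, end),
// writes one code point to out, and returns the advanced iterators.
// Assumes well-formed input; no validation is performed.
struct Utf8DecodeStep {
    template <typename In, typename Out>
    std::pair<In, Out> operator()(In begin, In end, Out out) const {
        assert(begin != end);
        const unsigned char lead = static_cast<unsigned char>(*begin++);
        char32_t cp;
        int trailing;
        if (lead < 0x80)      { cp = lead;        trailing = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }
        else                  { cp = lead & 0x07; trailing = 3; }
        for (; trailing > 0 && begin != end; --trailing)
            cp = (cp << 6) | (static_cast<unsigned char>(*begin++) & 0x3F);
        *out++ = cp;
        return std::make_pair(begin, out);
    }
};

// Eager application: run the step until the whole input range is consumed.
template <typename In, typename Out, typename Step>
Out convert_eager(In begin, In end, Out out, Step step) {
    while (begin != end) {
        const std::pair<In, Out> r = step(begin, end, out);
        begin = r.first;
        out = r.second;
    }
    return out;
}

// On-demand application: each call to next() runs exactly one step, so no
// work is done for elements that are never requested.
template <typename In, typename Step>
class LazyConverted {
public:
    LazyConverted(In begin, In end, Step step = Step())
        : pos_(begin), end_(end), step_(step) {}

    bool done() const { return pos_ == end_; }

    char32_t next() {
        char32_t cp = 0;
        pos_ = step_(pos_, end_, &cp).first;  // one-element output buffer
        return cp;
    }

private:
    In pos_;
    In end_;
    Step step_;
};
```

Both forms reuse the same step object; only the driver differs, which is the property that lets one conversion serve whole-buffer and on-the-fly use alike.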

Caution

The organization of headers may change in the future in order to improve compile times.

Main headers

boost/cuchar.hpp

Primitive types for UTF code units.

boost/unicode/utf.hpp

Conversion between UTF encodings.

boost/unicode/static_utf.hpp

Compile-time conversion between UTF encodings.

boost/unicode/graphemes.hpp

Functions to iterate and identify graphemes.

boost/unicode/compose.hpp

Functions to compose, decompose and normalize Unicode strings.

boost/unicode/cat.hpp

Functions to concatenate normalized strings while maintaining a normalized form.

boost/unicode/search.hpp

Utility to adapt Boost.StringAlgo finders to discard matches that lie on certain boundaries.

boost/unicode/ucd/properties.hpp

Access to the properties attached to a code point in the Unicode Character Database.

As stated in Introduction to Unicode, several Unicode algorithms require a large database of information which, as of preview 4 of this library, is 500 KB on x86 when stripped. Note that at the current stage of development the database does not contain everything one might need to deal with Unicode text, so it may grow in the future.

Features that can avoid depending on that database do so; for example, it is not required for UTF conversions, which are purely header-only.
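UTF conversions need no database because the encodings are defined by bit arithmetic alone. As a rough illustration, the sketch below encodes UTF-32 as UTF-16 using the standard `char16_t` and `char32_t` types; `append_utf16` and `to_utf16` are hypothetical helper names, not functions of this library, and the input is assumed to contain only valid Unicode scalar values.

```cpp
#include <cassert>
#include <string>

// Appends the UTF-16 encoding of one code point. No table lookup is involved.
// Assumes cp is a valid scalar value (not a surrogate, at most U+10FFFF).
inline void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit, value unchanged.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
}

inline std::u16string to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) append_utf16(out, cp);
    return out;
}
```

By contrast, operations such as normalization depend on per-code-point data, which is why they live in the linked part of the library.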

UCD generation

The Unicode Character Database can be generated using a parser present in the source distribution of this library to analyze the data provided by Unicode.org.

Note, however, that the parser itself needs to be updated to be made aware of new property values; otherwise those properties will fall back to the default value for that property and the parser will issue a warning.

Binary compatibility

The UCD is fully backward compatible, and unknown property values returned by the linked library will automatically be converted to the default value for that property. This is consistent with how new values are introduced in the standard.
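One way such a fallback could work is sketched below, assuming the linked library hands back a raw integer that headers compiled against an older version then map onto the values they know. This is only an illustration of the principle: `grapheme_break`, `known_value_count`, `from_raw` and `name_of` are hypothetical, the enumeration is deliberately abbreviated, and the library's actual mechanism may differ.

```cpp
#include <cassert>

// Abbreviated, hypothetical view of a property as known to the headers the
// client was compiled against. "other" is the default value of the property.
enum class grapheme_break : unsigned {
    other = 0, cr = 1, lf = 2, control = 3, extend = 4
};
constexpr unsigned known_value_count = 5;

// Maps a raw value coming from a possibly newer database onto the known set.
// Values introduced after these headers were written fall back to the default.
inline grapheme_break from_raw(unsigned raw) {
    return raw < known_value_count ? static_cast<grapheme_break>(raw)
                                   : grapheme_break::other;
}

// Client code can then index tables sized for the known values only.
inline const char* name_of(grapheme_break v) {
    static const char* const names[] = {"Other", "CR", "LF", "Control", "Extend"};
    const unsigned i = static_cast<unsigned>(v);
    assert(i < known_value_count);  // holds for every value produced by from_raw
    return names[i];
}
```

With this scheme, a client built against older headers keeps working when the shared object is upgraded: it simply sees the default wherever the newer database reports a value it does not know.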

Alternate databases

Future versions of this library may provide alternate implementations of this database as a thin layer over a database provided by another library or environment to prevent duplication of data. All this should be entirely binary compatible, and using one database or another should just be a drop-in replacement of a shared object.

