
User's Guide

UTF converters and segmenters
Composition and Normalization
String searching algorithms

UTF converters and segmenters

This library provides two kinds of operations on bidirectional ranges: conversion (e.g. converting a range in UTF-8 to a range in UTF-32) and segmentation (i.e. demarcating sections of a range, like code points, grapheme clusters, words, etc.).

Conversion

The naming scheme of the utilities is as follows; as an example, here is what is provided to convert UTF-32 to UTF-8:

[Note] Note

The library considers a conversion from UTF-32 an "encoding", while a conversion to UTF-32 is called a "decoding". This is because code points are what the library mainly deals with, and UTF-32 is a sequence of code points.
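
For illustration, here is a minimal usage sketch of an eager and a lazy conversion from UTF-32 to UTF-8; the spellings u8_encode and adaptors::u8_encode, as well as the header, are assumptions drawn from the naming conventions rather than verified parts of the interface.

    // Usage sketch only: the u8_encode entry points and the header name are
    // assumptions, not verified API.
    #include <boost/unicode/utf.hpp>   // assumed header
    #include <iterator>
    #include <string>
    #include <vector>

    int main()
    {
        std::vector<char32_t> code_points = {0x48, 0x20AC};   // 'H', euro sign

        // Eager: encode the whole range into an output iterator at once.
        std::string utf8;
        boost::unicode::u8_encode(code_points, std::back_inserter(utf8));

        // Lazy: adapt the range so encoding happens as it is iterated.
        auto lazy = boost::unicode::adaptors::u8_encode(code_points);
        std::string utf8_lazy(lazy.begin(), lazy.end());
    }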

Segmentation

The naming scheme is as follows:

UTF type deduction with SFINAE

Whenever a function or class exists in two versions, one for UTF-8 and one for UTF-16, and the UTF encoding in use can be deduced from the input, an additional version is provided that automatically forwards to the appropriate one.

The naming scheme is as follows:

[Tip] Tip

UTF type deduction recognizes not only UTF-8 and UTF-16 but UTF-32 as well.
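
As a sketch of what deduction buys you (utf_decode and the header below are assumed spellings, not verified API), the same call can accept UTF-8 or UTF-16 input and dispatch on the range's value type:

    // Sketch: utf_decode is an assumed deduced entry point that forwards to
    // the UTF-8 or UTF-16 decoder depending on the range's value type.
    #include <boost/unicode/utf.hpp>   // assumed header
    #include <iterator>
    #include <string>
    #include <vector>

    int main()
    {
        std::string    utf8  = "caf\xC3\xA9";   // "café" as UTF-8 code units
        std::u16string utf16 = u"caf\u00E9";    // "café" as UTF-16 code units

        std::vector<char32_t> cp_from_8, cp_from_16;

        // Same spelling for both calls; SFINAE picks the right decoder from
        // the size of the code units (8-bit vs 16-bit).
        boost::unicode::utf_decode(utf8,  std::back_inserter(cp_from_8));
        boost::unicode::utf_decode(utf16, std::back_inserter(cp_from_16));
    }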

Composition and Normalization

Normalized forms are defined in terms of certain decompositions applied recursively, followed by certain compositions also applied recursively, and finally canonical ordering of combining character sequences.

A decomposition is the conversion of a single code point into several, and a composition is the opposite conversion, with some exceptions. For example, U+00E9 (é) canonically decomposes into U+0065 (e) followed by U+0301 (combining acute accent), and composing that pair yields U+00E9 again.

Decomposition

The Unicode Character Database associates certain decompositions with code points, which can be obtained with boost::unicode::ucd::get_decomposition. It does not include Hangul syllable decompositions, however, since those can easily be generated procedurally, which saves space.

The library provides boost::unicode::hangul_decomposer, a OneManyConverter to decompose Hangul syllables.
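
A hedged sketch of both facilities; the headers, the return type of get_decomposition and the exact call syntax of hangul_decomposer are assumptions:

    // Sketch only: headers, the return type of get_decomposition and the
    // call syntax of hangul_decomposer are assumptions.
    #include <boost/unicode/ucd/properties.hpp>   // assumed header
    #include <boost/unicode/hangul.hpp>           // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // The UCD maps U+00E9 (é) to its canonical decomposition U+0065 U+0301.
        auto decomposition = boost::unicode::ucd::get_decomposition(0x00E9);
        (void)decomposition;

        // Hangul syllables are absent from that table and are decomposed
        // procedurally instead: U+AC00 (가) yields the jamo U+1100 U+1161.
        std::vector<char32_t> jamo;
        boost::unicode::hangul_decomposer()(0xAC00, std::back_inserter(jamo));
    }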

There are several types of decompositions, which are exposed by boost::unicode::ucd::get_decomposition_type. Most importantly, the canonical decomposition is obtained by applying both the Hangul decompositions and the canonical decompositions from the UCD, while the compatibility decomposition is obtained by applying the Hangul decompositions and all decompositions from the UCD.

boost::unicode::decomposer, a model of Converter, performs any decomposition that matches a given mask, recursively, including the Hangul ones (which are treated as canonical decompositions), and canonically orders combining sequences as well.
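
As a sketch, assuming an eager decompose() helper built on top of decomposer (the helper's name, its default mask and the header are assumptions):

    // Sketch: decompose() as an eager wrapper over decomposer, defaulting to
    // canonical decompositions, is an assumed spelling.
    #include <boost/unicode/compose.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        std::vector<char32_t> input = {0x00E9, 0xAC00};   // é, 가
        std::vector<char32_t> nfd;

        // Canonical decomposition applied recursively, Hangul included, with
        // combining sequences canonically reordered on output.
        boost::unicode::decompose(input, std::back_inserter(nfd));
        // nfd == {0x0065, 0x0301, 0x1100, 0x1161}
    }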

Composition

Likewise, Hangul syllable compositions are not provided by the UCD and are implemented by boost::unicode::hangul_composer instead.

Some distinct code points may have the same decomposition, and for some of them the decomposed form is the preferred one. That is why the UCD also provides an exclusion table.

The library uses a pre-generated prefix tree (or, in the current implementation, a lexicographically sorted array) of all canonical compositions, keyed by their fully decomposed and canonically ordered form, to identify composable sequences and apply the compositions.

boost::unicode::composer is a Converter that uses that tree as well as the Hangul compositions.
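
A sketch of the reverse direction, assuming an eager compose() helper over composer (the helper's name and the header are assumptions):

    // Sketch: compose() as an eager wrapper over composer is an assumed
    // spelling; the header is an assumption as well.
    #include <boost/unicode/compose.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // A fully decomposed, canonically ordered sequence: e + combining
        // acute, followed by the two jamo of the Hangul syllable U+AC00.
        std::vector<char32_t> nfd = {0x0065, 0x0301, 0x1100, 0x1161};
        std::vector<char32_t> nfc;

        boost::unicode::compose(nfd, std::back_inserter(nfc));
        // nfc == {0x00E9, 0xAC00}: the prefix tree matches e + acute, the
        // Hangul composer handles the jamo pair.
    }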

Normalization

Normalization can be performed by applying decomposition followed by composition, which is what the current version of boost::unicode::normalizer does.

The Unicode standard also provides quick-check properties that can be used to avoid that operation when possible, but the current version of the library does not support that scheme.
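
A sketch of full normalization, assuming an eager normalize() helper over normalizer (the helper's name and the header are assumptions):

    // Sketch: normalize() as an eager wrapper over normalizer is an assumed
    // spelling; the header is an assumption as well.
    #include <boost/unicode/compose.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // U+212B (ANGSTROM SIGN) decomposes to U+0041 U+030A, which then
        // composes to U+00C5: the angstrom sign itself is excluded from
        // composition, so normalization replaces it with U+00C5.
        std::vector<char32_t> input = {0x212B};
        std::vector<char32_t> nfc;

        boost::unicode::normalize(input, std::back_inserter(nfc));
        // nfc == {0x00C5}
    }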

Concatenation

Concatenating strings in a given normalization form does not guarantee the result is in that same normalization form if the right operand starts with a combining code point.

Therefore the library provides functionality to identify the boundaries where re-normalization needs to occur, as well as eager and lazy versions of the concatenation that maintain the input normalization form.

Note that concatenation with Normalization Form D is slightly more efficient, as it only requires canonical sorting of the combining character sequence at the junction, while Normalization Form C requires that sequence to be renormalized.
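
As a sketch of the eager flavour; normalized_concat below is a hypothetical name standing in for whatever the library actually calls its normalization-preserving concatenation:

    // Hypothetical sketch: normalized_concat stands in for the library's
    // normalization-preserving concatenation; name, header and signature are
    // assumptions.
    #include <boost/unicode/cat.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // Both operands are in NFC, but naive concatenation would not be: the
        // right operand starts with U+0301 (combining acute accent).
        std::vector<char32_t> left  = {0x0063, 0x0065};   // "ce"
        std::vector<char32_t> right = {0x0301, 0x006C};   // combining acute, "l"

        std::vector<char32_t> out;
        // Only the combining sequence at the junction is re-normalized; the
        // rest of both operands is copied through untouched.
        boost::unicode::normalized_concat(left, right, std::back_inserter(out));
        // out == {0x0063, 0x00E9, 0x006C}, i.e. "cél" in NFC.
    }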

See:

String searching algorithms

The library provides mechanisms to perform searches at the code unit, code point, or grapheme level, and in the future will provide word and sentence level as well.

Different approaches to do that are possible:

  • Converter- or Segmenter-based: simply run classic search algorithms, such as the ones from Boost.StringAlgo, on ranges of the appropriate elements -- those elements may themselves be ranges (the subranges returned by boost::segment_iterator are EqualityComparable).
  • BoundaryChecker-based: the classic algorithms are run, then false positives that do not lie on the right boundaries are discarded. This has the advantage of reducing conversion and iteration overhead in certain situations. The most practical way to achieve this is to adapt a Finder from Boost.StringAlgo with boost::algorithm::boundary_finder and the boundary you are interested in testing, for example boost::unicode::utf_grapheme_boundary (see the sketch below).
[Important] Important

You will have to normalize input before the search if you want canonically equivalent things to compare equal.
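
For example, the BoundaryChecker-based approach might look like the following sketch; boost::algorithm::find and first_finder are regular Boost.StringAlgo components, while the construction of boundary_finder and the header providing it are assumptions:

    // Sketch: the boundary_finder construction and the boost/unicode header
    // are assumptions; find and first_finder are standard Boost.StringAlgo.
    #include <boost/algorithm/string/find.hpp>
    #include <boost/algorithm/string/finder.hpp>
    #include <boost/unicode/search.hpp>   // assumed header
    #include <string>

    int main()
    {
        // Inputs should already be normalized (see the note above) so that
        // canonically equivalent spellings compare equal.
        std::string haystack = "na\xC3\xAFve";   // "naïve" in UTF-8
        std::string needle   = "\xC3\xAF";       // "ï"

        // Run a classic code-unit search, then discard matches that do not
        // lie on grapheme cluster boundaries.
        auto match = boost::algorithm::find(
            haystack,
            boost::algorithm::boundary_finder(
                boost::algorithm::first_finder(needle),
                boost::unicode::utf_grapheme_boundary()));
        (void)match;
    }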

