
User's Guide

UTF converters and segmenters
Composition and Normalization
String searching algorithms

UTF converters and segmenters

This library provides two kinds of operations on bidirectional ranges: conversion (e.g. converting a range in UTF-8 to a range in UTF-32) and segmentation (i.e. demarcating sections of a range, like code points, grapheme clusters, words, etc.).

Conversion

The naming scheme of the utilities is as follows; as an example, here is what is provided to convert UTF-32 to UTF-8:

[Note] Note

The library considers a conversion from UTF-32 an "encoding", while a conversion to UTF-32 is called a "decoding". This is because code points are what the library mainly deals with, and UTF-32 is a sequence of code points.
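
For illustration, here is a minimal usage sketch of an eager and a lazy conversion from UTF-32 to UTF-8; the spellings u8_encode and adaptors::u8_encode, as well as the header, are assumptions drawn from the naming conventions rather than verified parts of the interface.

    // Usage sketch only: the u8_encode entry points and the header name are
    // assumptions, not verified API.
    #include <boost/unicode/utf.hpp>   // assumed header
    #include <iterator>
    #include <string>
    #include <vector>

    int main()
    {
        std::vector<char32_t> code_points = {0x48, 0x20AC};   // 'H', euro sign

        // Eager: encode the whole range into an output iterator at once.
        std::string utf8;
        boost::unicode::u8_encode(code_points, std::back_inserter(utf8));

        // Lazy: adapt the range so encoding happens as it is iterated.
        auto lazy = boost::unicode::adaptors::u8_encode(code_points);
        std::string utf8_lazy(lazy.begin(), lazy.end());
    }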

Segmentation

The naming scheme is as follows:

UTF type deduction with SFINAE

Whenever a function or class exists in two versions, one for UTF-8 and one for UTF-16, and the UTF encoding in use can be deduced from the input, an additional version is provided that automatically forwards to the appropriate one.

The naming scheme is as follows:

[Tip] Tip

UTF type deduction recognizes not only UTF-8 and UTF-16 but UTF-32 as well.
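
As a sketch of what deduction buys you (utf_decode and the header below are assumed spellings, not verified API), the same call can accept UTF-8 or UTF-16 input and dispatch on the range's value type:

    // Sketch: utf_decode is an assumed deduced entry point that forwards to
    // the UTF-8 or UTF-16 decoder depending on the range's value type.
    #include <boost/unicode/utf.hpp>   // assumed header
    #include <iterator>
    #include <string>
    #include <vector>

    int main()
    {
        std::string    utf8  = "caf\xC3\xA9";   // "café" as UTF-8 code units
        std::u16string utf16 = u"caf\u00E9";    // "café" as UTF-16 code units

        std::vector<char32_t> cp_from_8, cp_from_16;

        // Same spelling for both calls; SFINAE picks the right decoder from
        // the size of the code units (8-bit vs 16-bit).
        boost::unicode::utf_decode(utf8,  std::back_inserter(cp_from_8));
        boost::unicode::utf_decode(utf16, std::back_inserter(cp_from_16));
    }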

Composition and Normalization

Normalized forms are defined in terms of certain decompositions applied recursively, followed by certain compositions also applied recursively, and finally canonical ordering of combining character sequences.

A decomposition is the conversion of a single code point into several, and a composition is the opposite conversion, with some exceptions. For example, U+00E9 (é) canonically decomposes into U+0065 (e) followed by U+0301 (combining acute accent), and composing that pair yields U+00E9 again.

Decomposition

The Unicode Character Database associates certain decompositions with code points, which can be obtained with boost::unicode::ucd::get_decomposition. It does not include Hangul syllable decompositions, however, since those can easily be generated procedurally, which saves space.

The library provides boost::unicode::hangul_decomposer, a OneManyConverter to decompose Hangul syllables.
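
A hedged sketch of both facilities; the headers, the return type of get_decomposition and the exact call syntax of hangul_decomposer are assumptions:

    // Sketch only: headers, the return type of get_decomposition and the
    // call syntax of hangul_decomposer are assumptions.
    #include <boost/unicode/ucd/properties.hpp>   // assumed header
    #include <boost/unicode/hangul.hpp>           // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // The UCD maps U+00E9 (é) to its canonical decomposition U+0065 U+0301.
        auto decomposition = boost::unicode::ucd::get_decomposition(0x00E9);
        (void)decomposition;

        // Hangul syllables are absent from that table and are decomposed
        // procedurally instead: U+AC00 (가) yields the jamo U+1100 U+1161.
        std::vector<char32_t> jamo;
        boost::unicode::hangul_decomposer()(0xAC00, std::back_inserter(jamo));
    }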

There are several types of decompositions, which are exposed by boost::unicode::ucd::get_decomposition_type. Most importantly, the canonical decomposition is obtained by applying both the Hangul decompositions and the canonical decompositions from the UCD, while the compatibility decomposition is obtained by applying the Hangul decompositions and all decompositions from the UCD.

boost::unicode::decomposer, a model of Converter, performs any decomposition that matches a given mask, recursively, including the Hangul ones (which are treated as canonical decompositions), and canonically orders combining sequences as well.
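
As a sketch, assuming an eager decompose() helper built on top of decomposer (the helper's name, its default mask and the header are assumptions):

    // Sketch: decompose() as an eager wrapper over decomposer, defaulting to
    // canonical decompositions, is an assumed spelling.
    #include <boost/unicode/compose.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        std::vector<char32_t> input = {0x00E9, 0xAC00};   // é, 가
        std::vector<char32_t> nfd;

        // Canonical decomposition applied recursively, Hangul included, with
        // combining sequences canonically reordered on output.
        boost::unicode::decompose(input, std::back_inserter(nfd));
        // nfd == {0x0065, 0x0301, 0x1100, 0x1161}
    }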

Composition

Likewise, Hangul syllable compositions are not provided by the UCD and are implemented by boost::unicode::hangul_composer instead.

Some distinct code points may have the same decomposition, and for some of them the decomposed form is the preferred one. That is why the UCD also provides an exclusion table.

The library uses a pre-generated prefix tree (or, in the current implementation, a lexicographically sorted array) of all canonical compositions, keyed by their fully decomposed and canonically ordered form, to identify composable sequences and apply the compositions.

boost::unicode::composer is a Converter that uses that tree as well as the Hangul compositions.
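
A sketch of the reverse direction, assuming an eager compose() helper over composer (the helper's name and the header are assumptions):

    // Sketch: compose() as an eager wrapper over composer is an assumed
    // spelling; the header is an assumption as well.
    #include <boost/unicode/compose.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // A fully decomposed, canonically ordered sequence: e + combining
        // acute, followed by the two jamo of the Hangul syllable U+AC00.
        std::vector<char32_t> nfd = {0x0065, 0x0301, 0x1100, 0x1161};
        std::vector<char32_t> nfc;

        boost::unicode::compose(nfd, std::back_inserter(nfc));
        // nfc == {0x00E9, 0xAC00}: the prefix tree matches e + acute, the
        // Hangul composer handles the jamo pair.
    }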

Normalization

Normalization can be performed by applying decomposition followed by composition, which is what the current version of boost::unicode::normalizer does.

The Unicode standard also provides quick-check properties that can be used to avoid that operation when possible, but the current version of the library does not support that scheme.
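
A sketch of full normalization, assuming an eager normalize() helper over normalizer (the helper's name and the header are assumptions):

    // Sketch: normalize() as an eager wrapper over normalizer is an assumed
    // spelling; the header is an assumption as well.
    #include <boost/unicode/compose.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // U+212B (ANGSTROM SIGN) decomposes to U+0041 U+030A, which then
        // composes to U+00C5: the angstrom sign itself is excluded from
        // composition, so normalization replaces it with U+00C5.
        std::vector<char32_t> input = {0x212B};
        std::vector<char32_t> nfc;

        boost::unicode::normalize(input, std::back_inserter(nfc));
        // nfc == {0x00C5}
    }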

Concatenation

Concatenating strings in a given normalization form does not guarantee the result is in that same normalization form if the right operand starts with a combining code point.

Therefore the library provides functionality to identify the boundaries where re-normalization needs to occur, as well as eager and lazy versions of the concatenation that maintain the input normalization form.

Note that concatenation with Normalization Form D is slightly more efficient, as it only requires canonical sorting of the combining character sequence at the junction, while Normalization Form C requires that sequence to be renormalized.
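
As a sketch of the eager flavour; normalized_concat below is a hypothetical name standing in for whatever the library actually calls its normalization-preserving concatenation:

    // Hypothetical sketch: normalized_concat stands in for the library's
    // normalization-preserving concatenation; name, header and signature are
    // assumptions.
    #include <boost/unicode/cat.hpp>   // assumed header
    #include <iterator>
    #include <vector>

    int main()
    {
        // Both operands are in NFC, but naive concatenation would not be: the
        // right operand starts with U+0301 (combining acute accent).
        std::vector<char32_t> left  = {0x0063, 0x0065};   // "ce"
        std::vector<char32_t> right = {0x0301, 0x006C};   // combining acute, "l"

        std::vector<char32_t> out;
        // Only the combining sequence at the junction is re-normalized; the
        // rest of both operands is copied through untouched.
        boost::unicode::normalized_concat(left, right, std::back_inserter(out));
        // out == {0x0063, 0x00E9, 0x006C}, i.e. "cél" in NFC.
    }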

See:

String searching algorithms

The library provides mechanisms to perform searches at the code unit, code point, or grapheme level, and in the future will provide word and sentence level as well.

Different approaches to do that are possible:

  • Converter- or Segmenter-based: simply run classic search algorithms, such as the ones from Boost.StringAlgo, on ranges of the appropriate elements -- those elements may themselves be ranges (the subranges returned by boost::segment_iterator are EqualityComparable).
  • BoundaryChecker-based: the classic algorithms are run, then false positives that do not lie on the right boundaries are discarded. This has the advantage of reducing conversion and iteration overhead in certain situations. The most practical way to achieve this is to adapt a Finder from Boost.StringAlgo with boost::algorithm::boundary_finder and the boundary you are interested in testing, for example boost::unicode::utf_grapheme_boundary (see the sketch below).
[Important] Important

You will have to normalize input before the search if you want canonically equivalent things to compare equal.
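
For example, the BoundaryChecker-based approach might look like the following sketch; boost::algorithm::find and first_finder are regular Boost.StringAlgo components, while the construction of boundary_finder and the header providing it are assumptions:

    // Sketch: the boundary_finder construction and the boost/unicode header
    // are assumptions; find and first_finder are standard Boost.StringAlgo.
    #include <boost/algorithm/string/find.hpp>
    #include <boost/algorithm/string/finder.hpp>
    #include <boost/unicode/search.hpp>   // assumed header
    #include <string>

    int main()
    {
        // Inputs should already be normalized (see the note above) so that
        // canonically equivalent spellings compare equal.
        std::string haystack = "na\xC3\xAFve";   // "naïve" in UTF-8
        std::string needle   = "\xC3\xAF";       // "ï"

        // Run a classic code-unit search, then discard matches that do not
        // lie on grapheme cluster boundaries.
        auto match = boost::algorithm::find(
            haystack,
            boost::algorithm::boundary_finder(
                boost::algorithm::first_finder(needle),
                boost::unicode::utf_grapheme_boundary()));
        (void)match;
    }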

