This library provides two kinds of operations on bidirectional ranges: conversion (e.g. converting a range in UTF-8 to a range in UTF-32) and segmentation (i.e. demarcating sections of a range, like code points, grapheme clusters, words, etc.).
The naming scheme of the utilities is as follows; here is an example of what is provided to convert UTF-32 to UTF-8:
boost::unicode::u8_encoder
is a model of the OneManyConverter concept.
boost::unicode::u8_encode
is an eager encoding algorithm.
boost::unicode::adaptors::u8_encode
returns a range adapter that performs on-the-fly encoding.
boost::unicode::adaptors::u8_encode_output
returns an output iterator adapter that encodes its elements before forwarding them to the wrapped output iterator.
Note | |
---|---|
The library considers a conversion from UTF-32 an "encoding", while a conversion to UTF-32 is called a "decoding". This is because code points are what the library mainly deals with, and UTF-32 is a sequence of code points. |
The naming scheme is as follows:
boost::unicode::u8_boundary
is a BoundaryChecker that tells whether a position is the start of a code point in a range of UTF-8 code units.
boost::unicode::grapheme_boundary
is a BoundaryChecker that tells whether a position is the start of a grapheme cluster in a range of code points.
boost::unicode::adaptors::u8_segment
adapts its input range in UTF-8 into a range of ranges of code units, each range being a code point.
boost::unicode::adaptors::grapheme_segment
adapts its input range in UTF-32 into a range of ranges of code points, each range being a grapheme cluster.
boost::unicode::adaptors::u8_grapheme_segment
adapts its input range in UTF-8 into a range of ranges of code units, each range being a grapheme cluster.
Whenever a function or class has two versions, one for UTF-8 and the other for UTF-16, and the type of UTF encoding to use can be deduced, an additional version is provided that automatically forwards to the appropriate one.
The naming scheme is as follows:
boost::unicode::utf_decode
behaves like either boost::unicode::u8_decode or boost::unicode::u16_decode, depending on the value_type of its input range.
boost::unicode::utf_boundary
behaves like either boost::unicode::u8_boundary or boost::unicode::u16_boundary, depending on the value_type of the input ranges passed to ltr and rtl.
Tip | |
---|---|
UTF type deduction recognizes not only UTF-8 and UTF-16 but UTF-32 as well. |
Normalized forms are defined in terms of certain decompositions applied recursively, followed by certain compositions also applied recursively, and finally canonical ordering of combining character sequences. A decomposition is the conversion of a single code point into several, and a composition is the opposite conversion, with some exceptions.
The Unicode Character Database associates certain decompositions with code points, which can be obtained with boost::unicode::ucd::get_decomposition, but it does not include Hangul syllable decompositions, since those can easily be generated procedurally, saving table space.
The library provides boost::unicode::hangul_decomposer, a OneManyConverter that decomposes Hangul syllables.
There are several types of decompositions, which are exposed by boost::unicode::ucd::get_decomposition_type. Most importantly, the canonical decomposition is obtained by applying both the Hangul decompositions and the canonical decompositions from the UCD, while the compatibility decomposition is obtained by applying the Hangul decompositions and all decompositions from the UCD.
boost::unicode::decomposer, a model of Converter, can perform any decomposition that matches a certain mask, recursively, including the Hangul ones (which are treated as canonical decompositions), and canonically orders combining sequences as well.
Likewise, Hangul syllable compositions are not provided by the UCD and are implemented by boost::unicode::hangul_composer instead.
Some distinct code points may have the same decomposition, so certain decomposed forms are preferred; that is why the UCD also provides a composition exclusion table.
The library uses a pre-generated prefix tree (or, in the current implementation, a lexicographically sorted array) of all canonical compositions, keyed by their fully decomposed and canonically ordered forms, to identify composable sequences and apply the compositions.
boost::unicode::composer
is a Converter that uses that tree as well as the Hangul compositions.
Normalization can be performed by applying decomposition followed by composition, which is what the current version of boost::unicode::normalizer does.
The Unicode standard, however, also provides quick-check properties that make it possible to skip that operation when the input is already normalized, but the current version of the library does not support that scheme.
Concatenating strings in a given normalization form does not guarantee that the result is in that same normalization form if the right operand starts with a combining code point. Therefore the library provides functionality to identify the boundaries where re-normalization needs to occur, as well as eager and lazy versions of a concatenation that maintains the input normalization.
Note that concatenation with Normalization Form D is slightly more efficient, as it only requires canonical ordering of the combining character sequence at the junction, while Normalization Form C requires that sequence to be renormalized.
See:
boost::unicode::cat_limits
to partition the operands into the different sub-ranges.
boost::unicode::composed_concat
is the eager version for input in Normalization Form C.
boost::unicode::adaptors::composed_concat
is the lazy version for input in Normalization Form C.
boost::unicode::decomposed_concat
is the eager version for input in Normalization Form D.
boost::unicode::adaptors::decomposed_concat
is the lazy version for input in Normalization Form D.
The library provides mechanisms to perform searches at the code unit, code point, or grapheme level, and in the future will provide word and sentence level as well.
Different approaches are possible:
Adapting both the input and the searched-for sequence into ranges of segments and applying a classic search algorithm (ranges obtained from boost::segment_iterator are EqualityComparable).
Using a Finder from Boost.StringAlgo built with boost::algorithm::boundary_finder and the boundary you are interested in testing, for example boost::unicode::utf_grapheme_boundary.
Important | |
---|---|
You will have to normalize the input before the search if you want canonically equivalent strings to compare equal. |