Two concepts are central to this library: the Segmenter concept, which is used to segment text, and, more importantly, the Converter concept, which is used for conversion, including transcoding and normalization.
A model of the Segmenter concept is a class that takes an input range, specified as two iterators, and consumes it left-to-right or right-to-left, modifying the appropriate iterator as it advances.
Semantically, consuming right-to-left immediately after consuming left-to-right should restore the original position. Both primitives must be provided symmetrically so that bidirectional iteration can be implemented.
Here is an example of a segmenter that consumes one element in a range of integers:
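    struct element_segmenter
    {
        typedef int input_type;

        // Consume one element from the left by advancing begin.
        template<typename In>
        void ltr(In& begin, In end)
        {
            ++begin;
        }

        // Consume one element from the right by moving end back.
        template<typename In>
        void rtl(In begin, In& end)
        {
            --end;
        }
    };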
A model of the Segmenter concept may then be used to segment a range, either by calling it manually, or by using boost::adaptors::segment, which returns a boost::segmented_range that adapts the range into a range of subranges.
With the above example, there would be as many subranges as elements, and each subrange would be one element.
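Calling a segmenter manually can be sketched as follows, using only the standard library. The helper segment_all is hypothetical and is not part of the library; it only illustrates how repeated calls to ltr carve a range into subranges.

```cpp
#include <cassert>
#include <vector>

// Same segmenter as above: each step consumes exactly one element.
struct element_segmenter
{
    typedef int input_type;

    template<typename In>
    void ltr(In& begin, In /* end */) { ++begin; }

    template<typename In>
    void rtl(In /* begin */, In& end) { --end; }
};

// Hypothetical helper (not a library function): applies ltr repeatedly and
// copies each consumed subrange into its own vector.
template<typename Segmenter>
std::vector<std::vector<int> > segment_all(const std::vector<int>& input, Segmenter seg)
{
    std::vector<std::vector<int> > out;
    std::vector<int>::const_iterator it = input.begin();
    while(it != input.end())
    {
        std::vector<int>::const_iterator first = it;
        seg.ltr(it, input.end());                     // advances it past one segment
        out.push_back(std::vector<int>(first, it));
    }
    return out;
}
```

Applied to [1, 2, 3], this sketch yields three subranges of one element each. Calling rtl right after ltr on the same iterator brings it back to where it started, which is the symmetry the concept requires.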
A model of the BoundaryChecker concept is a function object that takes three iterators, the begin, the end, and a position, and that returns whether the position lies on a particular boundary.
Here is an example of a boundary checker that tells whether a position is at the end of a non-decreasing sequence of numbers.
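    struct increasing_boundary
    {
        typedef int input_type;

        // True when the element before pos is greater than the one at pos.
        // pos must lie strictly between begin and end.
        template<typename In>
        bool operator()(In begin, In end, In pos)
        {
            return *boost::prior(pos) > *pos;
        }
    };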
A model of the BoundaryChecker concept may then be used to test if a position is the right boundary to apply a converter, such as needed by codecvt facets, or to define a Segmenter using boost::boundary_segmenter.
With the above example, a segmenter created from this boundary checker applied to the sequence [1, 4, 8, 2, 2, 1, 7, 4] would result in [ [1, 4, 8], [2, 2], [1, 7], [4] ].
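The segmentation above can be reproduced with a small standard-library sketch. The helper split_at_boundaries is hypothetical and only approximates what a boundary-based segmenter might do; std::prev (C++11) stands in for boost::prior so that the block has no Boost dependency.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Boundary checker from above, written with std::prev instead of boost::prior.
struct increasing_boundary
{
    typedef int input_type;

    template<typename In>
    bool operator()(In /* begin */, In /* end */, In pos)
    {
        return *std::prev(pos) > *pos;
    }
};

// Hypothetical helper (not a library function): starts a new segment at every
// interior position that the checker reports as a boundary.
template<typename Checker>
std::vector<std::vector<int> > split_at_boundaries(const std::vector<int>& input, Checker is_boundary)
{
    typedef std::vector<int>::const_iterator iterator;
    std::vector<std::vector<int> > out;
    iterator first = input.begin();
    if(first == input.end())
        return out;                                   // nothing to segment

    for(iterator pos = first + 1; pos != input.end(); ++pos)
    {
        if(is_boundary(input.begin(), input.end(), pos))
        {
            out.push_back(std::vector<int>(first, pos));
            first = pos;
        }
    }
    out.push_back(std::vector<int>(first, input.end()));  // last segment
    return out;
}
```

The checker is only ever called on interior positions here, so it never dereferences the position before begin or the end position itself.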
A model of the Converter concept is a class that takes an input range, specified as two iterators, consumes it left-to-right or right-to-left, modifying the appropriate iterator as it advances, writes some elements to an output iterator, and returns it.
In terms of semantics, not only does the consuming need to be symmetric, but the output shall also be the same for a given consumed subrange, whatever the consuming direction. Furthermore, the output shall always be ordered left-to-right, even when applying the conversion right-to-left.
Here is an example of a converter that converts two adjacent numbers into the two numbers reversed, in a range of integers that must have an even number of elements; indeed, for the two operations to be symmetric here, there is not really another way.
    struct reverse2_convert
    {
        typedef int input_type;
        typedef int output_type;
        typedef mpl::int_<2> max_output; // a step never writes more than 2 elements

        // Consume two elements from the left and write them swapped.
        template<typename In, typename Out>
        Out ltr(In& begin, In end, Out out)
        {
            int i = *begin++;
            if(begin == end)
                throw std::out_of_range("unexpected end");

            *out++ = *begin++;
            *out++ = i;
            return out;
        }

        // Consume two elements from the right; the output is still written
        // left-to-right, so it matches what ltr writes for the same pair.
        template<typename In, typename Out>
        Out rtl(In begin, In& end, Out out)
        {
            *out++ = *--end;
            if(end == begin)
                throw std::out_of_range("unexpected begin");

            *out++ = *--end;
            return out;
        }
    };
A model of the Converter concept may then be used to perform a many-to-many conversion on a whole range, be it eagerly (by repeatedly calling the converter) or lazily (by evaluating it step by step as an iterator adapter is advanced).
The boost::convert function provides the former, while the boost::adaptors::convert function, which returns a range in terms of boost::converted_range, provides the latter.
With the above example, the range [1, 2, 3, 4] would be converted to [2, 1, 4, 3].
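Eager conversion amounts to looping over the converter's ltr step, which can be sketched with the standard library alone. The helper convert_all is hypothetical and is not the library's boost::convert; the max_output typedef is left out so that the block does not depend on Boost.MPL.

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <vector>

// Converter from above, minus the max_output typedef.
struct reverse2_convert
{
    typedef int input_type;
    typedef int output_type;

    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        int i = *begin++;
        if(begin == end)
            throw std::out_of_range("unexpected end");
        *out++ = *begin++;
        *out++ = i;
        return out;
    }

    template<typename In, typename Out>
    Out rtl(In begin, In& end, Out out)
    {
        *out++ = *--end;
        if(end == begin)
            throw std::out_of_range("unexpected begin");
        *out++ = *--end;
        return out;
    }
};

// Hypothetical helper (not a library function): applies one left-to-right
// step after another until the whole input has been consumed.
template<typename Converter>
std::vector<int> convert_all(const std::vector<int>& input, Converter conv)
{
    std::vector<int> out;
    std::vector<int>::const_iterator it = input.begin();
    while(it != input.end())
        conv.ltr(it, input.end(), std::back_inserter(out));
    return out;
}
```

With this sketch, [1, 2, 3, 4] becomes [2, 1, 4, 3], and an input with an odd number of elements makes the last step throw std::out_of_range. A single rtl step on the same input consumes [3, 4] and writes [4, 3], the same output that ltr produces for that subrange.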
Additionally, there is a refinement of the Converter concept named OneManyConverter, where one element is converted to many.
This allows avoiding managing iterator advancement; the converter can be defined as a single function that takes a value, an output iterator, and returns it.
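As an illustration, a one-to-many converter might look like the sketch below. That the single function is spelled operator() is an assumption, and both duplicate_converter and the driver convert_each are hypothetical names that do not come from the library.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical one-to-many converter: every input value is written twice.
// It takes a value and an output iterator, and returns the output iterator.
struct duplicate_converter
{
    typedef int input_type;
    typedef int output_type;

    template<typename Out>
    Out operator()(int value, Out out)
    {
        *out++ = value;
        *out++ = value;
        return out;
    }
};

// Hypothetical driver: the caller advances through the input, so the
// converter itself never touches an input iterator.
template<typename OneMany>
std::vector<int> convert_each(const std::vector<int>& input, OneMany conv)
{
    std::vector<int> out;
    std::back_insert_iterator<std::vector<int> > sink = std::back_inserter(out);
    for(std::vector<int>::const_iterator it = input.begin(); it != input.end(); ++it)
        sink = conv(*it, sink);
    return out;
}
```

With this sketch, [1, 2] becomes [1, 1, 2, 2].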
Conversions can be applied in a variety of ways, all built on a Converter that performs one step of the conversion:
Eagerly, by looping the Converter until the whole input range has been treated.
Lazily, as the range is read, with boost::converted_range.
Through an output iterator, with boost::convert_output_iterator.
Segmentations are expressed in terms of the Segmenter concept, which is inherently very similar to the Converter concept, except that it doesn't perform any kind of transformation; it just reads part of the input. As a matter of fact, a Converter can be converted to a Segmenter using boost::converter_segmenter.
Segmentation may be done either by using the appropriate Segmenter directly, or by using the boost::segmented_range template to adapt the range into a read-only range of subranges.
Additionally, the BoundaryChecker concept may prove useful to tell whether a segment starts at a given position; a Segmenter may also be defined in terms of it using boost::boundary_segmenter.
While it is possible to apply a converter after another, be it with boost::convert or by using boost::converted_range, it is not generally possible to define a converter that is the combination of two others.
Indeed, a Converter defines a step of a conversion, so it becomes difficult to define what the step of a combined conversion is if the two steps it tries to combine are mismatched or overlap.
There are therefore two limited ways to define a converter that is the combination of two others:
boost::multi_converter applies a step of the first converter, then applies the second converter step by step on its output until it is completely consumed. It only works as expected if the second converter expects less input than the first one outputs in a step. For example, it does not work to apply a boost::unicode::normalizer after a boost::unicode::utf_decoder, because each step of the normalizer would only be run on a single code point, but it does work to normalize then encode.
boost::converted_converter applies a step of the second converter, passing it input that has been adapted with boost::converted_range. Unfortunately, since it needs to advance the original input iterator, this cannot work unless the first converter only ever outputs 1 element. As a result it works fine to decode then normalize, but not the other way around.
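The first strategy, and the way it fails when the steps are mismatched, can be sketched with toy integer converters. Everything here is hypothetical: chained_convert only imitates the behaviour described for boost::multi_converter, negate_convert and convert_all are made up for the illustration, and only the left-to-right step is modelled.

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <vector>

// Consumes two elements per step and writes them swapped.
struct reverse2_convert
{
    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        int i = *begin++;
        if(begin == end)
            throw std::out_of_range("unexpected end");
        *out++ = *begin++;
        *out++ = i;
        return out;
    }
};

// Consumes one element per step and writes its negation.
struct negate_convert
{
    template<typename In, typename Out>
    Out ltr(In& begin, In /* end */, Out out)
    {
        *out++ = -*begin++;
        return out;
    }
};

// Hypothetical combinator: one step of First, then Second applied step by
// step to that step's output until the buffer is exhausted.
template<typename First, typename Second>
struct chained_convert
{
    First first;
    Second second;

    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        std::vector<int> buffer;
        first.ltr(begin, end, std::back_inserter(buffer));

        std::vector<int>::const_iterator it = buffer.begin();
        std::vector<int>::const_iterator last = buffer.end();
        while(it != last)
            out = second.ltr(it, last, out);
        return out;
    }
};

// Hypothetical driver that loops a converter over a whole range.
template<typename Converter>
std::vector<int> convert_all(const std::vector<int>& input, Converter conv)
{
    std::vector<int> out;
    std::vector<int>::const_iterator it = input.begin();
    while(it != input.end())
        conv.ltr(it, input.end(), std::back_inserter(out));
    return out;
}
```

Chaining reverse2_convert then negate_convert turns [1, 2, 3, 4] into [-2, -1, -4, -3], because the second converter needs only one element of the two that each first step produces. In the opposite order the second converter is handed a single element per step while it needs two, so its step throws std::out_of_range.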
For some converters, applying the converter on a range of data then on another, and concatenating the results is not the same as applying the converter once on the concatenated data. In particular, the Unicode decomposition and composition processes are not stable by concatenation.
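A toy analogy, unrelated to the actual Unicode algorithms, shows what failing to be stable by concatenation means. The hypothetical collapse_runs conversion merges adjacent equal values, much as composition merges a base character with a combining mark that follows it; if the two land in different chunks, the merge is missed.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical toy conversion: collapses each run of equal adjacent values
// into a single value.
std::vector<int> collapse_runs(const std::vector<int>& input)
{
    std::vector<int> out;
    std::unique_copy(input.begin(), input.end(), std::back_inserter(out));
    return out;
}

// Appends right to a copy of left.
std::vector<int> concatenate(std::vector<int> left, const std::vector<int>& right)
{
    left.insert(left.end(), right.begin(), right.end());
    return left;
}
```

Converting [1, 1] and [1, 2] separately and concatenating gives [1, 1, 2], whereas converting the concatenated input [1, 1, 1, 2] gives [1, 2]. The run that straddles the chunk boundary is only collapsed when the converter sees both sides at once.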
Such converters will not work properly when used as the first parameter to boost::multi_converter, and their existence is part of the rationale for converters not to emit special "partial" states indicating they're lacking input.
A codecvt facet is a facility of the standard C++ locales subsystem that can describe a left-to-right two-way conversion between two encodings of data.
Standard file streams are imbued with a locale, and make use of the codecvt facet attached to said locale to perform conversion between the data they receive and give to the stream user, the so-called "internal" format, and the underlying "external" format of the file, as is manipulated by the underlying, char-based, filebuf. Unfortunately, it appears it is only possible to use this mechanism with codecvt facets that have char as external and either char or wchar_t as internal, but C++0x may improve the situation.
To use boost::converter_codecvt, which allows building a codecvt facet from converters, you will need two Converters, one for each direction, as well as two BoundaryCheckers. Indeed, as codecvt facets are passed arbitrary input buffers, there needs to be a way to tell what the right boundaries are to apply the steps on. An alternative would be to try to apply a step and try again if there was an error due to incomplete data. This is however not sufficient for converters that are not stable by concatenation.
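The buffer problem is easiest to see with UTF-8, where a buffer handed to a facet may end in the middle of a multi-byte sequence. The sketch below is a hypothetical, standard-library-only illustration of finding the last safe boundary in such a buffer; complete_prefix_end is not a library function, it performs no validation, and it only handles encoding boundaries, not the boundaries that normalization would need.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units announced by a UTF-8 lead byte
// (1 for ASCII and for bytes that are not valid lead bytes).
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if(lead >= 0xF0) return 4;
    if(lead >= 0xE0) return 3;
    if(lead >= 0xC0) return 2;
    return 1;
}

// Continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

// Hypothetical helper: returns the end of the longest prefix of [begin, end)
// that does not cut a UTF-8 sequence in two.
inline const char* complete_prefix_end(const char* begin, const char* end)
{
    if(begin == end)
        return end;

    // Step back over trailing continuation bytes to the last lead byte.
    const char* p = end;
    do { --p; }
    while(p != begin && is_continuation(static_cast<unsigned char>(*p)));

    std::size_t available = static_cast<std::size_t>(end - p);
    std::size_t needed = utf8_sequence_length(static_cast<unsigned char>(*p));
    return available >= needed ? end : p;   // drop the incomplete tail, if any
}
```

A facet built this way would convert only up to the returned position and leave the incomplete tail to be completed by the next buffer.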
You may also build converters out of codecvt facets with boost::codecvt_in_converter or boost::codecvt_out_converter, or directly convert locales to UTF-32 with boost::unicode::locale_decoder or boost::unicode::locale_encoder.
This test/example shows how to use a codecvt facet that transcodes from wide chars (UTF-16 or UTF-32) to UTF-8 on the way out, and that does the opposite on the way in. It also demonstrates a variant that normalizes data read from the file.
    #define BOOST_TEST_MODULE Codecvt
    #include <boost/test/included/unit_test.hpp>

    #include <boost/unicode/codecvt.hpp>

    #include <fstream>
    #include <iterator>
    #include <locale>
    #include <boost/range/algorithm.hpp>
    #include <boost/range/as_literal.hpp>

    // e\u0301 is \u00E9
    // \U0002FA1D is \U0002A600
    const wchar_t data_[] = L"hello e\u0301 \U0002FA1D world";
    boost::iterator_range<const wchar_t*> data = boost::as_literal(data_);

    const wchar_t data_normalized_[] = L"hello \u00E9 \U0002A600 world";
    boost::iterator_range<const wchar_t*> data_normalized = boost::as_literal(data_normalized_);

    BOOST_AUTO_TEST_CASE( codecvt )
    {
        std::locale old_locale;
        std::locale utf8_locale(old_locale, new boost::unicode::utf_u8_codecvt());

        // Set a new global locale
        //std::locale::global(utf8_locale);

        // Send the UTF-X data out, converting to UTF-8
        {
            std::wofstream ofs("data.ucd");
            ofs.imbue(utf8_locale);
            boost::copy(data, std::ostream_iterator<wchar_t, wchar_t>(ofs));
        }

        // Read the UTF-8 data back in, converting to UTF-X on the way in
        {
            std::wifstream ifs("data.ucd");
            ifs.imbue(utf8_locale);
            wchar_t item = 0;
            size_t i = 0;
            while (ifs >> std::noskipws >> item)
            {
                BOOST_CHECK_EQUAL(data[i], item);
                i++;
            }
            BOOST_CHECK_EQUAL(i, (size_t)boost::size(data));
        }
    }

    BOOST_AUTO_TEST_CASE( codecvt_normalized )
    {
        std::locale old_locale;
        std::locale utf8_locale(old_locale, new boost::unicode::utf_u8_normalize_codecvt());

        // Set a new global locale
        //std::locale::global(utf8_locale);

        // Send the UTF-X data out, converting to UTF-8
        {
            std::wofstream ofs("data.ucd");
            ofs.imbue(utf8_locale);
            boost::copy(data, std::ostream_iterator<wchar_t, wchar_t>(ofs));
        }

        // Read the UTF-8 data back in, converting to UTF-X and normalizing on the way in
        {
            std::wifstream ifs("data.ucd");
            ifs.imbue(utf8_locale);
            wchar_t item = 0;
            size_t i = 0;
            while (ifs >> std::noskipws >> item)
            {
                BOOST_CHECK_EQUAL(data_normalized[i], item);
                i++;
            }
            BOOST_CHECK_EQUAL(i, (size_t)boost::size(data_normalized));
        }
    }