Converters and Segmenters

Concepts

Two concepts are central to this library: the Segmenter concept, which is used to segment text, and, more importantly, the Converter concept, which is used for conversion, including transcoding and normalization.

Segmenter

A model of the Segmenter concept is a class that takes an input range, specified as two iterators, and consumes it left-to-right or right-to-left, modifying the appropriate iterator as it advances.

Semantically, consuming right-to-left immediately after consuming left-to-right should restore the original position. Both primitives must be provided symmetrically so that bidirectional iteration can be implemented.

Here is an example of a segmenter that consumes one element in a range of integers:

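struct element_segmenter
{
    typedef int input_type;

    // Consume one element from the left by advancing begin.
    template<typename In>
    void ltr(In& begin, In end)
    {
        ++begin;
    }
    
    // Consume one element from the right by moving end back.
    template<typename In>
    void rtl(In begin, In& end)
    {
        --end;
    }
};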

A model of the Segmenter concept may then be used to segment a range, either by calling it manually or by using boost::adaptors::segment, which returns a boost::segmented_range that adapts the range into a range of subranges.

With the above example, there would be as many subranges as elements, and each subrange would be one element.
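Calling a segmenter manually can be sketched as follows. This is only an illustration under stated assumptions: segment_all is a hypothetical helper, not part of the library, and it copies each subrange into a std::vector instead of returning a lazy range as boost::segmented_range is described to do.

```cpp
#include <cassert>
#include <vector>

// A one-element segmenter mirroring element_segmenter above.
struct element_segmenter
{
    typedef int input_type;

    template<typename In>
    void ltr(In& begin, In end) { ++begin; }

    template<typename In>
    void rtl(In begin, In& end) { --end; }
};

// Hypothetical helper (not a library function): calls ltr repeatedly and
// stores a copy of every consumed subrange.
template<typename Segmenter, typename In>
std::vector<std::vector<typename Segmenter::input_type> >
segment_all(Segmenter seg, In begin, In end)
{
    typedef std::vector<typename Segmenter::input_type> segment;
    std::vector<segment> result;
    while (begin != end)
    {
        In start = begin;
        seg.ltr(begin, end);               // advances begin past one segment
        result.push_back(segment(start, begin));
    }
    return result;
}
```

Applied to a range of three integers, segment_all yields three one-element segments. Calling rtl on a position that ltr has just produced moves it back to where it started, which is the symmetry required above.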

BoundaryChecker

A model of the BoundaryChecker concept is a function object that takes three iterators, the begin, the end, and a position, and that returns whether the position lies on a particular boundary.

Here is an example of a boundary checker that tells whether a position marks the end of a non-decreasing run of numbers:

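struct increasing_boundary
{
    typedef int input_type;
    
    template<typename In>
    bool operator()(In begin, In end, In pos)
    {
        // Assumption: both ends of the range count as boundaries; this also
        // avoids dereferencing before begin or at end.
        if(pos == begin || pos == end)
            return true;

        // A boundary lies wherever the previous element is greater.
        return *boost::prior(pos) > *pos;
    }
};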

A model of the BoundaryChecker concept may then be used to test whether a position is a valid boundary at which to apply a converter, as needed by codecvt facets, or to define a Segmenter using boost::boundary_segmenter.

With the above example, a segmenter created from this boundary checker applied to the sequence [1, 4, 8, 2, 2, 1, 7, 4] would result in [ [1, 4, 8], [2, 2], [1, 7], [4] ].
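How a segmenter might be derived from a boundary checker can be sketched with the standard library alone. The helper split_at_boundaries is hypothetical and is not boost::boundary_segmenter; the checker below also assumes that both ends of the range are boundaries, which the text does not state.

```cpp
#include <cassert>
#include <vector>

struct increasing_boundary
{
    typedef int input_type;

    template<typename In>
    bool operator()(In begin, In end, In pos)
    {
        // Assumption: the two ends of the range are always boundaries.
        if (pos == begin || pos == end)
            return true;
        In prev = pos;
        --prev;
        return *prev > *pos;   // boundary where the previous element is greater
    }
};

// Hypothetical helper: advance until the next boundary and store a copy of
// every segment found.
template<typename Checker, typename In>
std::vector<std::vector<typename Checker::input_type> >
split_at_boundaries(Checker is_boundary, In begin, In end)
{
    typedef std::vector<typename Checker::input_type> segment;
    std::vector<segment> result;
    In pos = begin;
    while (pos != end)
    {
        In start = pos;
        do
            ++pos;
        while (pos != end && !is_boundary(begin, end, pos));
        result.push_back(segment(start, pos));
    }
    return result;
}
```

On the sequence [1, 4, 8, 2, 2, 1, 7, 4] this sketch produces the four segments listed above.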

Converter

A model of the Converter concept is a class that takes an input range, specified as two iterators, consumes it left-to-right or right-to-left, modifying the appropriate iterator as it advances, writes some elements to an output iterator, and returns it.

In terms of semantics, not only does the consuming need to be symmetric, but the output shall also be the same for a given consumed subrange, whatever the consuming direction. Furthermore, the output shall always be ordered left-to-right, even when applying the conversion right-to-left.

Here is an example of a converter that swaps each pair of adjacent numbers in a range of integers. The range must contain an even number of elements; otherwise the two directions could not be made symmetric.

struct reverse2_convert
{
    typedef int input_type;
    typedef int output_type;
    typedef mpl::int_<2> max_output;

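    // Left-to-right step: consume two elements and write them swapped.
    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        int i = *begin++;
        if(begin == end)
            throw std::out_of_range("unexpected end");
            
        *out++ = *begin++;
        *out++ = i;
        return out;
    }
    
    // Right-to-left step: the elements are read last-first, so writing them
    // in read order already yields the swapped pair in left-to-right order.
    template<typename In, typename Out>
    Out rtl(In begin, In& end, Out out)
    {
        *out++ = *--end;
        if(end == begin)
            throw std::out_of_range("unexpected begin");
            
        *out++ = *--end;
        return out;
    }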
};

A model of the Converter concept may then be used to perform a many-to-many conversion on a whole range, be it eagerly (by calling the converter repeatedly) or lazily (by evaluating it step by step as an iterator adapter is advanced).

The boost::convert function provides the former, while the boost::adaptors::convert function, which returns a range in terms of boost::converted_range, provides the latter.

With the above example, the range [1, 2, 3, 4] would be converted to [2, 1, 4, 3].
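The eager case amounts to a simple loop, sketched below with the standard library only. The helper convert_all is hypothetical and is not boost::convert; the converter repeats the logic of reverse2_convert, with the max_output typedef left out so that the sketch has no Boost dependency.

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <vector>

struct reverse2_convert
{
    typedef int input_type;
    typedef int output_type;

    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        int i = *begin++;
        if (begin == end)
            throw std::out_of_range("unexpected end");
        *out++ = *begin++;
        *out++ = i;
        return out;
    }

    template<typename In, typename Out>
    Out rtl(In begin, In& end, Out out)
    {
        *out++ = *--end;
        if (end == begin)
            throw std::out_of_range("unexpected begin");
        *out++ = *--end;
        return out;
    }
};

// Hypothetical helper: apply left-to-right steps until the input is used up.
template<typename Converter, typename In, typename Out>
Out convert_all(Converter conv, In begin, In end, Out out)
{
    while (begin != end)
        out = conv.ltr(begin, end, out);
    return out;
}
```

Passing [1, 2, 3, 4] and a std::back_inserter to convert_all fills the target with [2, 1, 4, 3]. A single rtl step on the same input consumes [3, 4] and writes [4, 3], the same output the second ltr step produces for that subrange.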

OneManyConverter

Additionally, there is a refinement of the Converter concept named OneManyConverter, where one element is converted to many.

This removes the need to manage iterator advancement: the converter can be defined as a single function that takes a value and an output iterator, and returns the output iterator.
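As a rough sketch of the idea, the one-to-many converter below is a function object taking a value and an output iterator. Both duplicate_converter and one_many_adapter are hypothetical names, and the use of operator() as the single function is an assumption; the adapter only illustrates how the ltr and rtl steps could be derived from it.

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical one-to-many converter: writes every value twice.
struct duplicate_converter
{
    typedef int input_type;
    typedef int output_type;

    template<typename Out>
    Out operator()(int value, Out out)
    {
        *out++ = value;
        *out++ = value;
        return out;
    }
};

// Hypothetical adapter: derives both directional steps from the single
// function, since exactly one element is consumed per step.
template<typename OneMany>
struct one_many_adapter
{
    OneMany f;

    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        return f(*begin++, out);
    }

    template<typename In, typename Out>
    Out rtl(In begin, In& end, Out out)
    {
        return f(*--end, out);
    }
};
```

The converter itself never touches the input iterators; all advancement lives in the adapter.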

Converting and segmenting

Conversion

Conversions can be applied in several ways, all built on the Converter concept, which performs one step of the conversion:

Eager evaluation, which simply calls the Converter in a loop until the whole input range has been consumed.
Lazy evaluation, where a new range is returned that wraps the input range and converts step-by-step as the range is advanced. The resulting range is however read-only. It is implemented in terms of boost::converted_range.
Lazy output evaluation, where an output iterator is returned that wraps the output and converts every pushed element with a OneManyConverter. It is implemented in terms of boost::convert_output_iterator.
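The third option can be pictured as an output iterator whose assignment operator runs the converter. The sketch below is an assumption-laden illustration, not the implementation of boost::convert_output_iterator: converting_output_iterator and duplicate_converter are hypothetical names, and the converter is assumed to be callable as f(value, out).

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical one-to-many converter: writes every value twice.
struct duplicate_converter
{
    typedef int input_type;
    typedef int output_type;

    template<typename Out>
    Out operator()(int value, Out out)
    {
        *out++ = value;
        *out++ = value;
        return out;
    }
};

// Hypothetical output iterator that converts each value pushed into it and
// forwards the converted elements to the wrapped output iterator.
template<typename OneMany, typename Out>
class converting_output_iterator
{
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef std::ptrdiff_t difference_type;
    typedef void pointer;
    typedef void reference;

    converting_output_iterator(OneMany f, Out out) : f_(f), out_(out) {}

    // Dereference and increment are no-ops, as is usual for output iterators.
    converting_output_iterator& operator*() { return *this; }
    converting_output_iterator& operator++() { return *this; }
    converting_output_iterator& operator++(int) { return *this; }

    // Assigning a value pushes it through the converter.
    converting_output_iterator& operator=(typename OneMany::input_type value)
    {
        out_ = f_(value, out_);
        return *this;
    }

    Out base() const { return out_; }

private:
    OneMany f_;
    Out out_;
};
```

Writing 1 and then 5 through such an iterator wrapped around a std::back_inserter leaves [1, 1, 5, 5] in the target container.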

Segmentation

Segmentations are expressed in terms of the Segmenter concept, which is inherently very similar to the Converter concept, except that it does not perform any kind of transformation; it just reads part of the input. As a matter of fact, a Converter can be converted to a Segmenter using boost::converter_segmenter.
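One plausible way to turn a Converter into a Segmenter is to run each step and throw the output away. The sketch below assumes exactly that; discarding_segmenter and null_output are hypothetical names, and nothing here is claimed about how boost::converter_segmenter is actually implemented.

```cpp
#include <cassert>
#include <stdexcept>

// Converter that swaps pairs, mirroring reverse2_convert above.
struct reverse2_convert
{
    typedef int input_type;
    typedef int output_type;

    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        int i = *begin++;
        if (begin == end)
            throw std::out_of_range("unexpected end");
        *out++ = *begin++;
        *out++ = i;
        return out;
    }

    template<typename In, typename Out>
    Out rtl(In begin, In& end, Out out)
    {
        *out++ = *--end;
        if (end == begin)
            throw std::out_of_range("unexpected begin");
        *out++ = *--end;
        return out;
    }
};

// Hypothetical output iterator that ignores everything written to it.
struct null_output
{
    null_output& operator*() { return *this; }
    null_output& operator++() { return *this; }
    null_output& operator++(int) { return *this; }
    template<typename T>
    null_output& operator=(const T&) { return *this; }
};

// Hypothetical adapter: a segment is whatever one converter step consumes.
template<typename Converter>
struct discarding_segmenter
{
    typedef typename Converter::input_type input_type;
    Converter conv;

    template<typename In>
    void ltr(In& begin, In end) { conv.ltr(begin, end, null_output()); }

    template<typename In>
    void rtl(In begin, In& end) { conv.rtl(begin, end, null_output()); }
};
```

With reverse2_convert as the converter, each segment is a pair of elements, so [1, 2, 3, 4] would be segmented as [1, 2] and [3, 4].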

Segmentation may be done either by using the appropriate Segmenter directly, or by using the boost::segmented_range template to adapt the range into a read-only range of subranges.

Additionally, the BoundaryChecker concept may prove useful to tell whether a segment starts at a given position; a Segmenter may also be defined in terms of it using boost::boundary_segmenter.

Combining converters

While it is possible to apply one converter after another, be it with boost::convert or by using boost::converted_range, it is not generally possible to define a converter that is the combination of two others.

Indeed, a Converter defines a step of a conversion, so it becomes difficult to define what the step of a combined conversion is if the two steps it tries to combine are mismatched or overlap.

There are therefore two limited ways to define a converter that is the combination of two others:

boost::multi_converter applies a step of the first converter, then applies the second converter step by step on its output until it is completely consumed. It only works as expected if the second converter expects less input than the first one outputs in a step. For example, it does not work to apply a boost::unicode::normalizer after a boost::unicode::utf_decoder, because each step of the normalizer would only ever be run on a single code point, but it does work to normalize and then encode.
boost::converted_converter applies a step of the second converter, passing it input that has been adapted with boost::converted_range. Unfortunately, since it needs to advance the original input iterator, this cannot work unless the first converter only ever outputs 1 element. As a result, it works fine to decode and then normalize, but not the other way around.
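The first strategy and its limitation can be illustrated with a small standard-library sketch. All names here (swap2_step, pass1_step, chained_step) are hypothetical, only the left-to-right direction is shown, and the code is a simplified stand-in for the idea behind boost::multi_converter, not its implementation.

```cpp
#include <cassert>
#include <iterator>
#include <stdexcept>
#include <vector>

// Consumes two ints and writes them swapped.
struct swap2_step
{
    typedef int input_type;
    typedef int output_type;

    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        int i = *begin++;
        if (begin == end)
            throw std::out_of_range("unexpected end");
        *out++ = *begin++;
        *out++ = i;
        return out;
    }
};

// Consumes one int and writes it unchanged.
struct pass1_step
{
    typedef int input_type;
    typedef int output_type;

    template<typename In, typename Out>
    Out ltr(In& begin, In, Out out)
    {
        *out++ = *begin++;
        return out;
    }
};

// One step of First, then Second applied step by step to that output only.
template<typename First, typename Second>
struct chained_step
{
    First first;
    Second second;

    template<typename In, typename Out>
    Out ltr(In& begin, In end, Out out)
    {
        std::vector<typename First::output_type> tmp;
        first.ltr(begin, end, std::back_inserter(tmp));

        typename std::vector<typename First::output_type>::iterator
            b = tmp.begin(), e = tmp.end();
        while (b != e)
            out = second.ltr(b, e, out);   // never sees past this one step
        return out;
    }
};
```

chained_step<swap2_step, pass1_step> behaves as expected because the second step needs less input than the first produces. chained_step<pass1_step, swap2_step> throws std::out_of_range on its first step, since swap2_step is handed a single element and cannot reach the next one.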

Stability by concatenation

For some converters, applying the converter on a range of data then on another, and concatenating the results is not the same as applying the converter once on the concatenated data. In particular, the Unicode decomposition and composition processes are not stable by concatenation.

Such converters will not work properly when used as the first parameter to boost::multi_converter, and their existence is part of the rationale for converters not to emit special "partial" states indicating they're lacking input.
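The property can be illustrated with a toy analogy that has nothing to do with Unicode: a conversion that collapses runs of equal adjacent values. Like composition, its result at a given position depends on neighbouring input, so converting two pieces separately differs from converting their concatenation. The functions collapse_runs and concat are hypothetical helpers written for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Toy conversion: keep only the first element of each run of equal values.
std::vector<int> collapse_runs(const std::vector<int>& in)
{
    std::vector<int> out;
    std::unique_copy(in.begin(), in.end(), std::back_inserter(out));
    return out;
}

// Append b to a copy of a.
std::vector<int> concat(std::vector<int> a, const std::vector<int>& b)
{
    a.insert(a.end(), b.begin(), b.end());
    return a;
}
```

For a = [1, 1] and b = [1, 2], converting the pieces and concatenating gives [1, 1, 2], whereas converting the concatenation [1, 1, 1, 2] gives [1, 2].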

Codecvt facets

A codecvt facet is a facility of the standard C++ locales subsystem that can describe a left-to-right, two-way conversion between two encodings of data.

Standard file streams are imbued with a locale and use the codecvt facet attached to that locale to convert between the "internal" format, which is the data exchanged with the stream user, and the "external" format of the file, which is what the underlying char-based filebuf manipulates. Unfortunately, it appears that this mechanism can only be used with codecvt facets that have char as the external type and either char or wchar_t as the internal type, though C++0x may improve the situation.

To use boost::converter_codecvt, which builds a codecvt facet from converters, you will need two Converters, one for each direction, as well as two BoundaryCheckers. Because codecvt facets are passed arbitrary input buffers, there needs to be a way to tell where the right boundaries are for applying the steps. An alternative would be to attempt a step and retry if it fails because of incomplete data, but that is not sufficient for converters that are not stable by concatenation.

You may also build converters out of codecvt facets with boost::codecvt_in_converter or boost::codecvt_out_converter, or directly convert locales to UTF-32 with boost::unicode::locale_decoder or boost::unicode::locale_encoder.

This test/example shows how to use a codecvt facet that transcodes from wide chars (UTF-16 or UTF-32) to UTF-8 on the way out, and that does the opposite on the way in. It also demonstrates a variant that normalizes data read from the file.

#define BOOST_TEST_MODULE Codecvt
#include <boost/test/included/unit_test.hpp>

#include <boost/unicode/codecvt.hpp>

#include <fstream>
#include <iterator>
#include <locale>
#include <boost/range/algorithm.hpp>
#include <boost/range/as_literal.hpp>

// e\u0301 normalizes to \u00E9
// \U0002FA1D normalizes to \U0002A600
const wchar_t data_[] = L"hello e\u0301 \U0002FA1D world";
boost::iterator_range<const wchar_t*> data = boost::as_literal(data_);
    
const wchar_t data_normalized_[] = L"hello \u00E9 \U0002A600 world";
boost::iterator_range<const wchar_t*> data_normalized = boost::as_literal(data_normalized_);

BOOST_AUTO_TEST_CASE( codecvt )
{
    std::locale old_locale;
    std::locale utf8_locale(old_locale, new boost::unicode::utf_u8_codecvt());

    // Set a new global locale
    //std::locale::global(utf8_locale);

    // Send the UTF-X data out, converting to UTF-8
    {
        std::wofstream ofs("data.ucd");
        ofs.imbue(utf8_locale);
        boost::copy(data, std::ostream_iterator<wchar_t, wchar_t>(ofs));
    }

    // Read the UTF-8 data back in, converting to UTF-X (no normalization here)
    {
        std::wifstream ifs("data.ucd");
        ifs.imbue(utf8_locale);
        wchar_t item = 0;
        size_t i = 0;
        while (ifs >> std::noskipws >> item)
        {
            BOOST_CHECK_EQUAL(data[i], item);
            i++;
        }
        BOOST_CHECK_EQUAL(i, (size_t)boost::size(data));
    }
}

BOOST_AUTO_TEST_CASE( codecvt_normalized )
{
    std::locale old_locale;
    std::locale utf8_locale(old_locale, new boost::unicode::utf_u8_normalize_codecvt());

    // Set a new global locale
    //std::locale::global(utf8_locale);

    // Send the UTF-X data out, converting to UTF-8
    {
        std::wofstream ofs("data.ucd");
        ofs.imbue(utf8_locale);
        boost::copy(data, std::ostream_iterator<wchar_t, wchar_t>(ofs));
    }

    // Read the UTF-8 data back in, converting to UTF-X and normalizing on the way in
    {
        std::wifstream ifs("data.ucd");
        ifs.imbue(utf8_locale);
        wchar_t item = 0;
        size_t i = 0;
        while (ifs >> std::noskipws >> item)
        {
            BOOST_CHECK_EQUAL(data_normalized[i], item);
            i++;
        }
        BOOST_CHECK_EQUAL(i, (size_t)boost::size(data_normalized));
    }
}