Appendices

Appendix A: Unicode in source files

It is often quite useful to embed strings of text directly into source files, and C++, as of the 2003 standard, provides the following ways to do so: string literals, wide string literals, character and wide character literals, and finally type lists that form compile-time strings. One has to be aware, however, of the various portability issues associated with character encoding within source files.

The first limitation is the character encoding the compiler expects source files to be in: the source character set. The second is the character encoding that narrow and wide string literals will have at runtime: the execution character set, which may differ between narrow and wide strings.

Indeed, while some compilers remain encoding-agnostic as long as the source is ASCII-compatible, others will convert string literals from the source character set to the appropriate execution character set. This is the case with MSVC which, when it detects that a source file is in UTF-8 or UTF-16, will convert narrow string literals to ANSI and wide string literals to UTF-16. Furthermore, if it doesn't detect the character encoding of the source file, it will still convert wide string literals from ANSI to UTF-16 while leaving narrow ones untouched.

Also, regardless of whether the compiler detects the character encoding of the source file or not, Unicode escape sequences, \uxxxx and \Uxxxxxxxx, will be translated to the execution character set of the literal type they're embedded in. Unfortunately, that makes them unusable portably within narrow strings, as there is no way to set the narrow execution character set to UTF-8 with MSVC, and UTF-8 is the way Unices are going.

Finally, wide characters are not well defined. In practice, they're either UTF-16 or UTF-32 code units, but their use is discouraged because the size of wchar_t varies between platforms: 16 bits on MS Windows, usually 32 bits on Unices. Nevertheless, in the absence of the UTF-16 and UTF-32 literals that are coming with C++0x, wide string literals are probably the closest thing there is to native Unicode in the compiler. The library tools that automatically deduce the UTF encoding from the size of the value type will therefore work as expected, since they treat wchar_t as either UTF-16 or UTF-32 depending on its size.
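The size-based deduction described above can be sketched as follows. This is a minimal illustration, not the library's actual interface: the names utf_encoding, utf_for_size and deduced_utf are hypothetical.

```cpp
#include <cstddef>

// Hypothetical tag for the three UTF encoding forms.
enum utf_encoding { utf8, utf16, utf32 };

// Map the size of a code unit, in bytes, to an encoding form.
// Sizes other than 1, 2 and 4 are left undefined on purpose, so that
// an unsupported value type fails to compile.
template<std::size_t Size> struct utf_for_size;
template<> struct utf_for_size<1> { static const utf_encoding value = utf8; };
template<> struct utf_for_size<2> { static const utf_encoding value = utf16; };
template<> struct utf_for_size<4> { static const utf_encoding value = utf32; };

// Deduce the encoding from the value type alone (hypothetical name).
template<typename Char>
struct deduced_utf : utf_for_size<sizeof(Char)> {};
```

With this scheme, deduced_utf<wchar_t>::value is utf16 where wchar_t is 16 bits wide and utf32 where it is 32 bits wide, which matches the behaviour the paragraph describes.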

Alternatively, compile-time strings may be used, which allow a great deal of flexibility as arbitrary character encoding conversion may then be performed at compile-time, but which remain more verbose to declare and increase compilation times.

From this we can derive guidelines for writing Unicode data within C++ source files in a portable way, while making a few reasonable assumptions.

Legacy guidelines

The following guidelines ensure everything will go well, regardless of compiler or environment setup:

Source file encoding: use UTF-8 without a Byte Order Mark or use ASCII. This ensures most compilers will run in an encoding-agnostic mode and not perform translations, plus most compilers only support ASCII-compatible input.
Narrow character literals: use ASCII only; if you use escape sequences such as \x treat the data you're inputting as UTF-8 code units. Ban \u and \U.
Narrow string literals: freely input UTF-8; if you use escape sequences such as \x treat the data you're inputting as UTF-8 code units. Ban \u and \U.
Wide character literals: use ASCII only; if you use escape sequences such as \x treat the data you're inputting as UTF-32 code units, but don't input anything higher than 0xD800. Use \u freely but ban \U.
Wide string literals: use ASCII only; if you use escape sequences such as \x treat the data you're inputting as UTF-32 code units, but don't input anything higher than 0xD800. Use both \u and \U freely.
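The guidelines above can be illustrated with a few literals. This is a sketch only; the variable names are made up for the example.

```cpp
#include <cstring>
#include <cwchar>

// Narrow string literal: U+00E9 (e with acute accent) written as its two
// UTF-8 code units with \x escapes, so the bytes do not depend on the
// compiler's narrow execution character set.
const char narrow_cafe[] = "caf\xC3\xA9";

// Wide string literal: the same character written with \u. The compiler
// produces one wchar_t, whether wchar_t holds UTF-16 or UTF-32.
const wchar_t wide_cafe[] = L"caf\u00E9";

// Wide string literal with \U for a code point outside the BMP (U+1D11E).
// It occupies two code units where wchar_t is 16 bits and one where it
// is 32 bits.
const wchar_t wide_clef[] = L"\U0001D11E";
```

One thing to watch for with \x: a hexadecimal escape consumes every hexadecimal digit that follows it, so a literal such as "\xC3\xA9e" does not end in the letter e. Splitting it into adjacent literals, "\xC3\xA9" "e", avoids the problem.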

Modern guidelines

With C++0x introducing UTF-8, UTF-16, and UTF-32 literals, it becomes clear the way to go is to rely on the compiler to convert from the source character set to whatever encoding the strings should be in. Indeed, the compiler will convert from the source character set to UTF-8 for UTF-8 literals, UTF-16 for UTF-16 literals, UTF-32 for UTF-32 literals, the narrow execution character set for narrow literals, and the wide execution character set for wide literals.

Assuming you can reliably ensure that all compilers recognize the same source character set, you can make full use of all literal types. However, for UTF-8 source files, MSVC requires a BOM, while GCC requires that it not be present. If you can accommodate this in your environment, then definitely go for this solution, which is simpler and more powerful.
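As a small sketch of the new literal types, the following counts the code units the compiler produces for each encoding. It assumes a compiler that supports these literals (C++11 or later); the helper code_units is hypothetical, and Unicode escapes are used so that the example does not depend on the source file encoding.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template<typename Char, std::size_t N>
constexpr std::size_t code_units(const Char (&)[N]) { return N - 1; }

// U+20AC (euro sign) takes three code units in UTF-8.
constexpr std::size_t utf8_units = code_units(u8"\u20AC");

// U+1F600 lies outside the BMP: a surrogate pair in UTF-16,
// a single code unit in UTF-32.
constexpr std::size_t utf16_units = code_units(u"\U0001F600");
constexpr std::size_t utf32_units = code_units(U"\U0001F600");
```

Because the encoding is fixed by the literal prefix, these counts are the same on every conforming compiler, unlike the narrow and wide literals discussed earlier.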

Compile-time strings

The first option is to use boost::mpl::string as a UTF-8 compile-time string. Its support for multi-char character literals keeps it from being too verbose, and it can be coupled with boost::unicode::string_cp to insert Unicode code points in place of Unicode escape sequences. Any non-ASCII character must be put in its own character literal. Note, however, that multi-char character literals require int to be at least 32 bits.

The second option is to use boost::mpl::u32string as a UTF-32 compile-time string, and use boost::unicode::static_u8_encode or boost::unicode::static_u16_encode to encode it at compile-time to UTF-8 or UTF-16. boost::mpl::u16string may also be used to input UTF-16 directly. However, neither of these two sequence types supports the easier declaration syntax with multi-char character literals.

Then, the boost::mpl::c_str meta-function may be used to convert any compile-time string into a zero-terminated equivalent.

Example

See the source_input example for demonstrations.

Appendix B: Rationale

Iterators rather than streams

The library chooses to base itself upon iterator adapters rather than upon streams, even though the latter were designed for conversion facilities with buffering and can be configured with locales.

That choice was made because the iterator and range abstractions are believed to be more flexible and easier to deal with, and also considerably more efficient.

Converter concept

Centralizing conversion into a single Converter model allows eager and lazy variants of evaluation to be possible for any conversion facility.

Lazy evaluation is believed to be of great interest for two reasons. It avoids the need for memory allocations and buffers, and constructing a lazily converted range is constant-time instead of linear-time, since there is no need to actually walk the range.

Eager evaluation can still be more efficient, however, which is why it is provided as well.
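The idea of deriving both evaluation strategies from a single conversion routine can be sketched as follows. None of these names come from the library: utf8_decoder, convert_eager and lazy_converted are hypothetical, the decoder does only minimal checking, and the lazy variant is a simple cursor standing in for a real iterator adapter.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical converter: consumes one UTF-8 sequence from [it, end)
// and returns the decoded code point. Requires it != end.
struct utf8_decoder
{
    template<typename In>
    static char32_t convert_one(In& it, In end)
    {
        unsigned char b0 = static_cast<unsigned char>(*it++);
        if (b0 >= 0x80 && b0 < 0xC0)
            throw std::runtime_error("unexpected continuation byte");
        // Number of continuation bytes announced by the lead byte.
        int extra = b0 < 0x80 ? 0 : b0 < 0xE0 ? 1 : b0 < 0xF0 ? 2 : 3;
        char32_t cp = extra == 0 ? b0 : b0 & (0x3F >> extra);
        for (; extra > 0; --extra)
        {
            if (it == end)
                throw std::runtime_error("truncated UTF-8 sequence");
            cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
        }
        return cp;
    }
};

// Eager evaluation: walks the whole range now and allocates the result.
template<typename Converter, typename In>
std::vector<char32_t> convert_eager(In first, In last)
{
    std::vector<char32_t> out;
    while (first != last)
        out.push_back(Converter::convert_one(first, last));
    return out;
}

// Lazy evaluation: construction only stores two iterators (constant
// time, no allocation); each code point is decoded when asked for.
template<typename Converter, typename In>
class lazy_converted
{
public:
    lazy_converted(In pos, In end) : pos_(pos), end_(end) {}
    bool done() const { return pos_ == end_; }
    char32_t next() { return Converter::convert_one(pos_, end_); }
private:
    In pos_, end_;
};
```

Both variants reuse the same convert_one, so a new conversion only has to be written once to be available in either style.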

Appendix C: Future Work

Non-checked UTF conversion

The library only provides UTF converters that check extensively that the input is correct and that its end is not reached unexpectedly.

These checks could be skipped when the input is known to be valid, which would improve performance. boost::convert_iterator could also avoid storing the begin and end iterators in such cases.
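The difference between the two kinds of converter can be sketched with UTF-16 decoding. Both functions below are hypothetical illustrations, not library code; the unchecked one is only safe on input already known to be valid UTF-16.

```cpp
#include <stdexcept>

// Checked: validates surrogates and guards against running off the end.
// Requires it != end on entry.
template<typename In>
char32_t decode_utf16_checked(In& it, In end)
{
    char16_t u0 = *it++;
    if (u0 < 0xD800 || u0 > 0xDFFF)
        return u0;                       // not a surrogate
    if (u0 > 0xDBFF)
        throw std::invalid_argument("unpaired low surrogate");
    if (it == end)
        throw std::invalid_argument("truncated surrogate pair");
    char16_t u1 = *it;
    if (u1 < 0xDC00 || u1 > 0xDFFF)
        throw std::invalid_argument("missing low surrogate");
    ++it;
    return 0x10000 + ((char32_t(u0) - 0xD800) << 10) + (u1 - 0xDC00);
}

// Unchecked: assumes valid input, so it needs no end iterator and
// performs a single range test per code point.
template<typename In>
char32_t decode_utf16_unchecked(In& it)
{
    char16_t u0 = *it++;
    if (u0 < 0xD800 || u0 > 0xDBFF)
        return u0;                       // valid input: not a high surrogate
    char16_t u1 = *it++;
    return 0x10000 + ((char32_t(u0) - 0xD800) << 10) + (u1 - 0xDC00);
}
```

The unchecked version takes no end iterator at all, which is the property that would let an iterator adapter drop its stored begin and end.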

Fast Normalization

The Unicode standard provides a quick-check scheme to tell whether a string is in a normalized form, which could be used to avoid expensive decomposition and recomposition.
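The real quick-check algorithm relies on per-character property data and is not reproduced here. The sketch below shows only the shape of such a fast path, using a much cruder test in its place: text made exclusively of code points up to U+00FF is left unchanged by Normalization Form C, so it can be accepted without any decomposition. The names quick_check and nfc_precheck are hypothetical.

```cpp
// Outcome of the pre-check: either the text is certainly in NFC, or a
// full check (or a normalization pass) is still needed.
enum quick_check { qc_yes, qc_maybe };

// Conservative pre-check over a range of code points (not code units).
// This is a stand-in for the property-based quick check, not the
// algorithm itself: it never answers "yes" wrongly, but it answers
// "maybe" far more often than the real scheme would.
template<typename In>
quick_check nfc_precheck(In first, In last)
{
    for (; first != last; ++first)
        if (static_cast<char32_t>(*first) > 0xFF)
            return qc_maybe;
    return qc_yes;
}
```

A caller would run the expensive normalization only when the result is qc_maybe.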

Unicode String type

Future versions of the library could provide a string type that maintains the following invariants: valid UTF, stream-safe and in Normalization Form C.

Appendix D: Acknowledgements

I would like to thank Eric Niebler, who mentored this project as part of the Google Summer of Code program and provided steady help and insightful ideas throughout its development.

Graham Barnett and Rogier van Dalen deserve great thanks as well for their work on Unicode character properties; most of the parser of Unicode data was written by them.

John Maddock was also a great help by contributing preliminary on-the-fly UTF conversion which helped the library get started, while inspiration from Phil Endecott allowed UTF conversion code to be more efficient.

Finally, I thank Beman Dawes and other members of the mailing list for their interest and support.