It is often quite useful to embed strings of text directly into source files, and C++, as of the 2003 standard, provides the following ways to do so: string literals, wide string literals, character and wide character literals, and finally type lists that form compile-time strings. One has to be aware, however, of the various portability issues associated with character encoding within source files.
The first limitation is that of what character encoding the compiler expects the source files to be in, the source character set. The second one is what character encoding narrow and wide string literals will have at runtime: the execution character set, which may be different for narrow and wide strings.
Indeed, while certain compilers remain encoding-agnostic as long as the source is ASCII-compatible, others will convert the string literals from the source character set to the appropriate execution character sets. This is the case with MSVC: when it detects that a source file is in UTF-8 or UTF-16, it converts narrow string literals to ANSI and wide string literals to UTF-16. Furthermore, if it doesn't detect the character encoding of the source file, it still converts wide string literals from ANSI to UTF-16 while leaving narrow ones untouched.
Also, regardless of whether the compiler detects the character encoding of the source file or not, Unicode escape sequences, \uxxxx and \Uxxxxxxxx, will be translated to the execution character set of the literal type they're embedded in. Unfortunately, that makes them unusable portably within narrow strings, as there is no way to set the narrow execution character set to UTF-8 with MSVC, and UTF-8 is the way Unices are going.
Finally, wide characters are not well defined. In practice, they're either UTF-16 or UTF-32 code units, but their use is rather discouraged as the size of wchar_t varies greatly: 16 bits on MS Windows, usually 32 bits on Unices. Nevertheless, until the UTF-16 and UTF-32 literals coming with C++0x are available, wide string literals are probably the closest thing there is to native Unicode in the compiler. The library tools that automatically deduce the UTF encoding from the size of the value type will therefore work as expected, as they expect wchar_t to represent either UTF-16 or UTF-32 depending on its size.
Alternatively, compile-time strings may be used, which allow a great deal of flexibility as arbitrary character encoding conversion may then be performed at compile-time, but which remain more verbose to declare and increase compilation times.
We can then infer certain guidelines to write Unicode data within C++ source files in a portable way while taking a few reasonable assumptions.
The following guidelines ensure everything will go well, regardless of compiler or environment setup:

- For narrow literals: with \x, treat the data you're inputting as UTF-8 code units. Ban \u and \U.
- For wide literals that must work whatever the size of wchar_t: with \x, treat the data you're inputting as UTF-32 code units, but don't input anything higher than 0xD800. Use \u heavily but ban \U.
- For wide literals where wchar_t is known to be 32 bits: with \x, treat the data you're inputting as UTF-32 code units, but don't input anything higher than 0xD800. Use both \u and \U heavily.
With C++0x introducing UTF-8, UTF-16, and UTF-32 literals, it becomes clear that the way to go is to rely on the compiler to convert from the source character set to whatever encoding the strings should be in. Indeed, the compiler will convert from the source character set to UTF-8 for UTF-8 literals, to UTF-16 for UTF-16 literals, to UTF-32 for UTF-32 literals, to the narrow execution character set for narrow literals, and to the wide execution character set for wide literals.
Assuming you can reliably ensure that all compilers recognize the same source character set, you can make full use of all literal types freely. However, for UTF-8 source files, MSVC requires a BOM, while GCC requires that it not be present. If you can accommodate this in your environment, then definitely go for this solution, which is simpler and more powerful.
Option one is to use boost::mpl::string as a UTF-8 compile-time string. Its support for multi-char character literals keeps it from being too verbose, and it can be coupled with boost::unicode::string_cp to insert Unicode code points instead of the Unicode escape sequences. Any non-ASCII character should be put in its own character literal. Note, however, that multi-char character literals require int to be at least 32 bits.
A second option is to use boost::mpl::u32string as a UTF-32 compile-time string, and use boost::unicode::static_u8_encode or boost::unicode::static_u16_encode to encode it at compile time to UTF-8 or UTF-16. boost::mpl::u16string may also be used to input UTF-16 directly. However, neither of these two sequence types provides support for easier declaration with multi-char character literals. The boost::mpl::c_str metafunction may then be used to convert any compile-time string into a null-terminated equivalent.
See the source_input example for demonstrations.
The library chooses to base itself upon iterator adapters rather than upon streams, even though the latter were designed for conversion facilities with buffering and can be configured with locales.
That choice was made because it is believed that the iterator and range abstractions are more flexible and easier to deal with, and that they are also quite a bit more efficient.
Centralizing conversion into a single Converter model allows eager and lazy variants of evaluation to be possible for any conversion facility.
Lazy evaluation is believed to be of great interest since it avoids the need for memory allocations and buffers, and constructing a lazy conversion is constant-time instead of linear-time since there is no need to actually walk the range.
Eager evaluations can remain more efficient however, and that is why they are provided as well.
The library only provides UTF converters that do extensive checking that the input is correct and that the end is not met unexpectedly.
These checks could be avoided when the input is known to be valid, thus increasing performance. boost::convert_iterator could likewise avoid storing the begin and end iterators in such cases.
The Unicode standard provides a quick-check scheme to tell whether a string is in a normalized form, which could be used to avoid expensive decomposition and recomposition.
Future versions of the library could provide a string type that maintains the following invariants: valid UTF, stream-safe and in Normalization Form C.
I would like to thank Eric Niebler for mentoring this project as part of the Google Summer of Code program; he provided steady help and insightful ideas throughout the development of this project.
Graham Barnett and Rogier van Dalen deserve great thanks as well for their work on Unicode character properties; most of the parser for the Unicode data was written by them.
John Maddock was also a great help by contributing preliminary on-the-fly UTF conversion code that helped the library get started, while inspiration from Phil Endecott allowed the UTF conversion code to be made more efficient.
Finally, I thank Beman Dawes and other members of the mailing list for their interest and support.