C++ Small String, Big Trouble

1 Introduction

Many years ago, we had a library that produced garbled text when users entered Chinese characters. To fix this, we changed it to use UTF-8 internally throughout. The interface still used std::string as the carrier for UTF-8 strings, because std::u8string did not exist at that time. For many years everything was fine, but recently a large number of users moved to C++20 and complaints began to arrive, because code like this no longer compiles:

bool SomeFunc(const std::string& input); // Interface function
// C++20 compilation error, because u8"ss中文gg" is const char8_t[]
std::string t1 = u8"ss中文gg";
SomeFunc(t1);

2 Problems Begin to Appear

2.1 Conversion Between string and u8string

[Note that the conversions introduced here are purely between two types, and the premise for their semantic correctness is that both are genuine UTF-8 strings. Thus, this conversion is not a specific encoding conversion.]

The C++ Standard Committee may have considered this issue when introducing char8_t: it is a distinct type, but it is explicitly defined to have the same size, signedness, and alignment as unsigned char, and therefore the same size as char. This means that a char8_t* can be converted to a char* using reinterpret_cast in order to read the bytes. Therefore, with the library interface unchanged, users can adapt their code slightly to make it work, for example:

std::u8string t1 = GetPath();
std::string u8t{ reinterpret_cast<char *>(t1.data()), t1.size() };
SomeFunc(u8t); // u8t is std::string, but is a genuine UTF-8 string

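Converting back with reinterpret_cast also works in practice, although strictly speaking char8_t, unlike char and unsigned char, is not allowed to alias other types, so accessing char storage through a char8_t* is formally undefined behavior. Fortunately, both u8string and string provide constructors templated on an iterator pair, so an explicit reinterpret_cast is not necessary for conversion between the two types. For example, to convert a std::string named t1 to a u8string, you can do this: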

std::u8string u8s(t1.begin(), t1.end());

The assign member function also provides corresponding support, such as:

u8s.assign(t1.begin(), t1.end());

Additionally, the standard library’s copy() function can also be used for such copying or conversion, for example:

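// std::back_inserter appends; copying to u8s.begin() would require u8s to be resized first
std::copy(t1.begin(), t1.end(), std::back_inserter(u8s));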

These flexible conversion methods indeed reduce some people’s aversion to u8string, but this introduces a hidden danger. Consider this piece of code:

std::string t1 = "ss中文gg";
std::u8string t2{ reinterpret_cast<char8_t *>(t1.data()), t1.size() };

On Windows (with MSVC and without the /utf-8 option), t1 is actually an ordinary string in the system code page, often loosely called extended ASCII or MBCS; on Chinese Windows this is generally GBK. After this operation, t2 is a counterfeit UTF-8 string, because its type says UTF-8 while the underlying data is not. This inconsistency adds another pitfall to the already unsafe C++ environment, and inconsistencies of this kind cause many software errors, garbled text among them. Moreover, such problems are hard to catch in code review: when people see that t2 is a std::u8string, they do not question whether its content is actually UTF-8. What do you think?

2.2 Linux and Windows

Consider the following two lines of code:

std::string t1 = "ss中国gg";
std::cout << t1 << ", length = " << t1.length() << std::endl;

On Linux, the output is “ss中国gg, length = 10”, because GCC and Clang treat source files and narrow string literals as UTF-8 by default (as does the rest of a typical Linux system), and each of the Chinese characters “中” and “国” occupies 3 bytes in UTF-8. The same two lines compiled with MSVC on a Chinese Windows system output “ss中国gg, length = 8”, because the default narrow encoding there is the system code page, a multi-byte character set (MBCS) that varies with the system language. For Chinese it is GBK, in which each Chinese character occupies two bytes; to distinguish them from standard ASCII characters, the high bit of the first byte is always 1 (standard ASCII does not exceed 127).
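To make the difference concrete, the two byte sequences can be written out explicitly and dumped as hex. This is only an illustration: hex_dump is a hypothetical helper, and the byte values are the standard UTF-8 and GBK encodings of “中国”, spelled as escapes so the result does not depend on the compiler's source or execution character set.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: renders each byte as two lowercase hex digits.
inline std::string hex_dump(std::string_view s) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];
        out += digits[c & 0x0F];
    }
    return out;
}

inline void byte_length_example() {
    const std::string utf8 = "ss\xE4\xB8\xAD\xE5\x9B\xBDgg";  // what Linux compilers store
    const std::string gbk  = "ss\xD6\xD0\xB9\xFAgg";          // what MSVC stores under GBK
    assert(utf8.length() == 10);  // 2 + 3 + 3 + 2
    assert(gbk.length() == 8);    // 2 + 2 + 2 + 2
    assert(hex_dump(gbk) == "73 73 d6 d0 b9 fa 67 67");
}
```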

MSVC supports the /utf-8 compilation option, which sets both the source and the execution character set to UTF-8. Explicitly using this option makes string behave as it does on Linux, with UTF-8 as the default encoding. For example:

std::locale::global(std::locale("zh_CN.UTF8"));
std::string t1 = "ss中国gg";
std::cout << t1 << ", length = " << t1.length() << std::endl;

Note that we set the locale to “zh_CN.UTF8” (for more on locale settings, refer to this article: C++ Locale Settings). As mentioned earlier, the default local encoding on Chinese Windows is GBK, so writing UTF-8 bytes directly to the console would produce garbled text. The locale setting only ensures that the output is displayed correctly; it does not affect the value of t1.

This difference is a nightmare for developers of cross-platform libraries. The std::u8string seems to alleviate this pain because using it allows for consistent results across both systems:

std::u8string t1 = u8"aa中国gg";

Now, t1 has the same content on both Windows and Linux, which is a 10-byte UTF-8 string. So the question arises: should you use /utf-8 or std::u8string?

2.3 std::cout and std::u8string

As previously mentioned, using std::u8string to represent UTF-8 strings gives a unified and clear meaning across different systems, but do you know how std::cout reacts? It simply does not support std::u8string, even on Linux. The reason is simple: I am char, you are char8_t. After all, std::cout is essentially a std::basic_ostream<char>.

While this is somewhat amusing, the proposal for char8_t (P0482R6 [1]) clearly states that it does not aim for backward compatibility. Given this, we have to roll up our sleeves and solve it ourselves. Fortunately, this problem is easy to overcome; an overload for std::u8string can handle it:

std::ostream& operator<<(std::ostream& os, const std::u8string& str) {
    // Pass the length explicitly; streaming a bare char* would stop at the first '\0'
    return os << std::string_view(reinterpret_cast<const char*>(str.data()), str.size());
}

2.4 The Significance of u8string

Up to this point, std::u8string has not really been integrated with the rest of the standard library. It is one thing that std::regex does not support std::u8string, but even std::format, itself new in C++20 and with no backward compatibility to protect, does not support it. Despite the dissatisfaction, it is important to understand what the introduction of std::u8string and char8_t is for. They are not meant to provide UTF-8 specific functionality; rather, they address the lack of a distinct type for UTF-8 in C++, and exist as type placeholders in the same way as std::u16string and std::u32string. Therefore, even though length() returns the number of code units rather than the number of characters, C++ deems that acceptable.

For example, let’s illustrate the problems caused by the lack of UTF-8 type support. C++ provides the prefixes u8, u, U, and L to represent UTF-8, UTF-16, UTF-32, and operating system supported wide string literals, respectively. Any code designed for text encoding must be able to distinguish these strings. For wchar_t, char16_t, and char32_t string literals, they can be directly distinguished based on their types. However, distinguishing between ordinary ASCII strings and UTF-8 strings is somewhat awkward. Consider this set of interface designs:

void do_x(const char *);
void do_x_utf8(const char *);
void do_x(const wchar_t *);
void do_x(const char16_t *);
void do_x(const char32_t *);

Because both ordinary ASCII strings and UTF-8 strings have the same base type of char, they cannot be distinguished through overloading or template specialization, forcing the need for additional information through function names. If you only hard-code a simple distinction between ordinary ASCII strings and UTF-8 strings, writing a bunch of if-else statements is fine. However, when you try to use this set of interfaces for generic or abstract design, this inconsistency in naming becomes quite annoying as you cannot let the compiler handle consistency based on type.

Of course, the following two styles are not much better:

void do_x(const char *, bool is_utf8); //#1
template<bool IsUTF8>     //#2
void do_x(const char *);

From a software design perspective, both of these interfaces are poorly designed. They expect users to tell the compiler which kind of string is being passed, and in most cases such designs are the root of inconsistencies and type errors. The correct approach is to let the types decide, so that the compiler makes the distinction. However, due to the lack of a UTF-8 type, the compiler could not do this. The filesystem library introduced in C++17 has a function filesystem::u8path specifically for constructing a path from a UTF-8 file name or path, because the filesystem::path constructor cannot tell whether the user has passed in an ordinary narrow string or a UTF-8 string. The standard library preferred to add an awkward filesystem::u8path rather than modify the filesystem::path constructor like this:

filesystem::path(const char *name, bool is_utf8); // Bad design

This reflects the general software design principle described above. C++20 introduced char8_t and std::u8string to fill this type gap, and at the same time marked filesystem::u8path as deprecated; it may be removed in a future standard.

As mentioned at the beginning of this article, the addition of char8_t has indeed caused a slew of compatibility issues. Besides the conversion issue between std::string and std::u8string, changing the return type of filesystem::path::u8string() and filesystem::path::generic_u8string() from std::string to std::u8string has also been a major source of compilation errors. Interested readers can consult P1423R2 [2], which presents some methods to mitigate these compatibility impacts.
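For code that must build as both C++17 and C++20, one option in the spirit of those mitigations is to copy the result through iterators, which works regardless of which string type u8string() returns. This is a sketch, and path_to_utf8_string is a hypothetical name. Another escape hatch is to switch the new type off with the compiler options -fno-char8_t (GCC and Clang) or /Zc:char8_t- (MSVC).

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper: returns the path as UTF-8 bytes in a std::string.
// p.u8string() is a std::string in C++17 and a std::u8string in C++20;
// the iterator-pair constructor accepts either.
inline std::string path_to_utf8_string(const std::filesystem::path& p) {
    const auto u8 = p.u8string();
    return std::string(u8.begin(), u8.end());
}

inline void path_example() {
    const std::filesystem::path p = "file.txt";
    assert(path_to_utf8_string(p) == "file.txt");
}
```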

3 Others

3.1 Detecting Local System Encoding and Character Set

The native locale in C++ does not expose the local system encoding in a portable way. However, the Boost library provides the function boost::locale::util::get_system_locale(), which returns the name of the current system locale; on Windows this name carries the ANSI code page. If the code page is 936, it indicates a Simplified Chinese system. Before Windows 7, this was generally “GB18030” or “GB2312”; from Windows 8 onwards, it is all “GBK”.

Additionally, the libiconv library has a function called locale_charset() that can also obtain the local system’s character set and encoding. However, this function is not exported. If you compile the libiconv library locally, with a little modification, you can export this function, and it works quite well.
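If neither Boost nor libiconv is available, the platform APIs can be queried directly. The following is a sketch of that alternative, not the method used by either library: it relies on GetACP() on Windows and on the POSIX function nl_langinfo(CODESET) elsewhere. The name system_narrow_encoding and the "CP" prefix are choices made for this example.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#  include <windows.h>
#else
#  include <clocale>
#  include <langinfo.h>
#endif

// Hypothetical helper: names the encoding used for narrow strings on this system.
inline std::string system_narrow_encoding() {
#ifdef _WIN32
    // Active ANSI code page, e.g. "CP936" on Simplified Chinese Windows.
    return "CP" + std::to_string(GetACP());
#else
    // Side effect: switches the process-wide LC_CTYPE to the user's environment.
    std::setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);  // e.g. "UTF-8"
#endif
}

inline void encoding_example() {
    assert(!system_narrow_encoding().empty());
}
```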

3.2 Detecting Compiler Support for char8_t

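C++ provides feature-test macros for this purpose: __cpp_char8_t indicates that the compiler supports the char8_t type and the new type of u8 literals, while __cpp_lib_char8_t indicates that the standard library provides the corresponding support, such as std::u8string. When char8_t is supported, careful handling of the type differences between std::string and std::u8string is necessary.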
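A library that has to serve both old and new users can use these macros to select its UTF-8 string type. The sketch below shows one possible arrangement; utf8_string, kHasChar8T, and make_utf8_sample are hypothetical names.

```cpp
#include <cassert>
#include <string>  // also defines __cpp_lib_char8_t when the library supports it

#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
using utf8_string = std::u8string;  // C++20: distinct UTF-8 type
constexpr bool kHasChar8T = true;
#else
using utf8_string = std::string;    // earlier standards: UTF-8 by convention only
constexpr bool kHasChar8T = false;
#endif

// A u8 literal is const char[] before C++20 and const char8_t[] from C++20 on,
// so the same line initializes either alias.
inline utf8_string make_utf8_sample() {
    return utf8_string(u8"ss\u4e2d\u6587gg");
}

inline void feature_test_example() {
    assert(make_utf8_sample().size() == 10);
    assert(sizeof(utf8_string::value_type) == 1);
}
```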

References

[1] P0482R6 (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r6.html)

[2] P1423R2 (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html)
