1 Introduction
bool SomeFunc(const std::string& input); // Interface function
// C++20 compilation error: u8"ss中文gg" has type const char8_t[], which cannot initialize a std::string
std::string t2 = u8"ss中文gg";
SomeFunc(t2);
2 Problems Begin to Appear
2.1 Conversion Between string and u8string
[Note that the conversions introduced here only change the type; no bytes are transcoded. They are semantically correct only if the data on both sides is genuine UTF-8.]
std::u8string t1 = GetPath();
std::string u8t{ reinterpret_cast<char *>(t1.data()), t1.size() };
SomeFunc(u8t); // u8t is std::string, but is a genuine UTF-8 string
You can convert back with reinterpret_cast in the same way. Explicit casts are not actually required, though: both u8string and string provide templated iterator-range constructors that copy element by element. For example, to convert a string t1 to a u8string, you can do this:
std::u8string u8s(t1.begin(), t1.end());
The assign member function also provides corresponding support, such as:
u8s.assign(t1.begin(), t1.end());
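Additionally, the standard library's std::copy() can be used for such copying or conversion. The destination must either already be large enough or be written through an inserter, for example:
std::copy(t1.begin(), t1.end(), std::back_inserter(u8s)); // u8s.begin() is only valid if u8s already holds t1.size() elements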
std::string t1 = "ss中文gg";
std::u8string t2{ reinterpret_cast<char8_t *>(t1.data()), t1.size() };
On Windows platforms, t1 is actually an ordinary extended-ASCII string (also known as MBCS; on Chinese Windows this is generally GBK, one kind of extended ASCII). After this operation, t2 is a counterfeit UTF-8 string: its type says UTF-8, but the underlying bytes are not. This inconsistency adds another pitfall to an already unsafe C++ environment, and mismatches of this kind are behind many software errors, such as garbled text. They are also hard to catch in code review, because a reader who sees that t2 is a std::u8string has no reason to doubt that its content is UTF-8. What do you think?
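One way to guard against such counterfeit strings is to validate the bytes before casting. The sketch below is an illustration only: is_valid_utf8 is a hypothetical helper, not a standard function, and passing it is a necessary but not sufficient condition, since bytes in another encoding can be well-formed UTF-8 by coincidence.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `bytes` is well-formed UTF-8.
// Rejects overlong forms, surrogates, and code points above U+10FFFF.
constexpr bool is_valid_utf8(std::string_view bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; }          // plain ASCII
        std::size_t extra = 0;                        // number of continuation bytes
        char32_t cp = 0;                              // decoded code point
        char32_t min = 0;                             // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min = 0x10000; }
        else return false;                            // stray continuation or invalid lead byte
        if (bytes.size() - i <= extra) return false;  // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;     // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += extra + 1;
    }
    return true;
}

// "ss中文gg" as UTF-8 bytes is accepted; the same text as GBK bytes is rejected.
static_assert(is_valid_utf8("ss\xE4\xB8\xAD\xE6\x96\x87gg"));
static_assert(!is_valid_utf8("ss\xD6\xD0\xCE\xC4gg"));
```

The byte sequences are written as escapes so the check does not depend on how the source file itself is encoded.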
2.2 Linux and Windows
std::string t1 = "ss中国gg";
std::cout << t1 << ", length = " << t1.length() << std::endl;
On Linux, the output is “ss中国gg, length = 10”. UTF-8 is the default encoding there (source files are normally saved as UTF-8, and GCC and Clang use UTF-8 as the default execution character set), and each of the two Chinese characters in “中国” occupies 3 bytes in UTF-8. The same two lines of code on Chinese Windows yield “ss中国gg, length = 8”, because the default encoding on Windows is a type of extended ASCII known as MBCS (Multi-Byte Character Set), which varies with the system language. For Chinese it is generally GBK, in which each Chinese character occupies two bytes; to distinguish these characters from standard ASCII, the high bit of the first byte of each character is always 1 (standard ASCII does not exceed 127).
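The two lengths can be reproduced on any platform by spelling out the bytes by hand. This is a small sketch for illustration; kUtf8 and kGbk are hypothetical names, and the byte values are the UTF-8 and GBK encodings of “中国”.

```cpp
#include <string_view>

// "ss中国gg" written byte by byte, so the result does not depend on the
// source-file encoding or on compiler flags.
constexpr std::string_view kUtf8 = "ss\xE4\xB8\xAD\xE5\x9B\xBDgg";  // 3 bytes per Chinese character
constexpr std::string_view kGbk  = "ss\xD6\xD0\xB9\xFAgg";          // 2 bytes per Chinese character

static_assert(kUtf8.size() == 10);  // 2 + 3 + 3 + 2
static_assert(kGbk.size() == 8);    // 2 + 2 + 2 + 2

// In GBK the first byte of a two-byte character always has its high bit set.
static_assert((static_cast<unsigned char>(kGbk[2]) & 0x80) != 0);
static_assert((static_cast<unsigned char>(kGbk[4]) & 0x80) != 0);
```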
std::locale::global(std::locale("zh_CN.UTF8"));
std::string t1 = "ss中国gg";
std::cout << t1 << ", length = " << t1.length() << std::endl;
Note that we set the locale to “zh_CN.UTF8” (for more on locale settings, refer to this article: C++ Locale Settings). As mentioned earlier, the default local encoding on Chinese Windows is GBK, an extended-ASCII encoding, so printing t1 directly would produce garbled text. The locale call only ensures that the output is displayed correctly; it does not affect the value of t1.
std::u8string t1 = u8"aa中国gg";
Now, t1 has the same content on both Windows and Linux, which is a 10-byte UTF-8 string. So the question arises: should you use /utf-8 or std::u8string?
2.3 std::cout and std::u8string
std::ostream& operator<<(std::ostream& os, const std::u8string& str) {
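os.write(reinterpret_cast<const char*>(str.data()), static_cast<std::streamsize>(str.size())); // write by length, so embedded '\0' bytes are not cut off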
return os;
}
2.4 The Significance of u8string
void do_x(const char *);
void do_x_utf8(const char *);
void do_x(const wchar_t *);
void do_x(const char16_t *);
void do_x(const char32_t *);
Because ordinary ASCII strings and UTF-8 strings share the same base type, char, they cannot be distinguished through overloading or template specialization, so the extra information has to be carried in the function name. If you only hard-code a simple distinction between ordinary ASCII strings and UTF-8 strings, writing a bunch of if-else statements is fine. However, when you try to use this set of interfaces in generic or abstract designs, the inconsistent naming becomes quite annoying, because you cannot let the compiler enforce consistency based on type.
void do_x(const char *, bool is_utf8); //#1
template<bool IsUTF8> //#2
void do_x(const char *);
From a software design perspective, both of these interfaces are poorly designed. They expect users to tell the compiler which kind of string is being passed, and in most cases such designs are the root of inconsistencies and type errors. The correct approach is to let the type carry this information so that the compiler makes the distinction. However, without a UTF-8 type, the compiler cannot do this. The filesystem library introduced in C++17 provides a factory function, filesystem::u8path, specifically for building a path from a UTF-8 file name or path, because the filesystem::path constructor cannot tell whether the user has passed in an ordinary ASCII string or a UTF-8 string. The standard library preferred to add an awkward filesystem::u8path rather than modify the filesystem::path constructor like this:
filesystem::path(const char *name, bool is_utf8); // Bad design
This reflects the general software design principle described above. C++20 introduced char8_t and std::u8string to fill this type gap and, in the same standard, marked filesystem::u8path as deprecated; it may be removed in a future standard.
3 Others
3.1 Detecting Local System Encoding and Character Set
3.2 Detecting Compiler Support for char8_t
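One common way to detect char8_t support is through the feature-test macros __cpp_char8_t (language support) and __cpp_lib_char8_t (library support). The following is a minimal sketch; the names kHasChar8Type, kHasChar8Library and utf8_string are hypothetical and only illustrate how code might adapt to both situations.

```cpp
#include <string>  // defines __cpp_lib_char8_t when the library supports char8_t

#if defined(__cpp_char8_t)
constexpr bool kHasChar8Type = true;      // the compiler provides the char8_t type
#else
constexpr bool kHasChar8Type = false;
#endif

#if defined(__cpp_lib_char8_t)
constexpr bool kHasChar8Library = true;   // std::u8string and related library support exist
#else
constexpr bool kHasChar8Library = false;
#endif

// Hypothetical alias: a distinct UTF-8 string type when available,
// otherwise plain std::string that holds UTF-8 by convention only.
#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
using utf8_string = std::u8string;
#else
using utf8_string = std::string;
#endif

static_assert(sizeof(utf8_string::value_type) == 1);
```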