This chapter will discuss the following topics:
Character, code point, and byte representation
Unique features of binary sequences such as bytes, bytearray, and memoryview
Codecs for all Unicode and legacy character sets
Avoiding and handling encoding errors
Best practices for handling text files
Traps of default encoding and issues with standard I/O
Normalizing Unicode text for safe comparison
Utility functions for normalization, case folding, and aggressively removing diacritics
Correctly sorting Unicode text using the locale module and PyUCA library
Character metadata in the Unicode database
A dual-mode API that can handle both strings and byte sequences
Byte Overview
The new binary sequence types differ from Python 2’s str type in many ways. First, it is important to know that Python has two built-in basic binary sequence types: the immutable bytes type introduced in Python 3 and the mutable bytearray type added in Python 2.6. (Python 2.6 also introduced the bytes type, but that was merely an alias for str, which is different from Python 3’s bytes type.)
Each element of bytes or bytearray objects is an integer between 0 and 255 (inclusive), unlike Python 2’s str objects which are single characters. However, slices of binary sequences are always of the same type of binary sequence, including slices of length 1, as shown in Example 4-2.
Example 4-2 bytes and bytearray objects containing 5 bytes
>>> cafe = bytes('café', encoding='utf_8') ➊
>>> cafe
b'caf\xc3\xa9'
>>> cafe[0] ➋
99
>>> cafe[:1] ➌
b'c'
>>> cafe_arr = bytearray(cafe)
>>> cafe_arr ➍
bytearray(b'caf\xc3\xa9')
>>> cafe_arr[-1:] ➎
bytearray(b'\xa9')
➊ A bytes object can be constructed from a str object using the given encoding.
➋ Each element is an integer within range(256).
➌ Slices of a bytes object are still bytes objects, even if it is a slice of just one byte.
➍ A bytearray object does not have literal syntax but is displayed as bytearray() with a bytes sequence literal as a parameter.
➎ Slices of a bytearray object are still bytearray objects.
Accessing my_bytes[0] gives an integer, while my_bytes[:1] returns a bytes object of length 1; this should not be surprising. The only sequence type for which s[0] == s[:1] holds is str, and that behavior is the exception rather than the rule. For every other sequence type, s[i] returns a single element, while s[i:i+1] returns a sequence of the same type containing the element s[i].
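The difference between indexing and one-item slicing can be checked directly. The following is a minimal sketch; the variable names (text, data, items) are chosen only for illustration.

```python
# Compare s[0] with s[:1] for three sequence types.
text = 'café'
data = text.encode('utf_8')   # b'caf\xc3\xa9'
items = [10, 20, 30]

str_same = text[0] == text[:1]      # 'c' compared with 'c'
bytes_same = data[0] == data[:1]    # 99 compared with b'c'
list_same = items[0] == items[:1]   # 10 compared with [10]

print(str_same, bytes_same, list_same)   # True False False
```

Only str treats a single item and a one-item slice as the same value, because a str item is itself a str of length 1.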
Although binary sequences are actually sequences of integers, their literal notation reflects the fact that they often contain embedded ASCII text. Therefore, each byte value is displayed in one of three ways.
Bytes within the printable ASCII range (from space to ~) are displayed using the ASCII characters themselves.
The bytes for tab, newline, carriage return, and backslash (\) are displayed using the escape sequences \t, \n, \r, and \\.
Other byte values are displayed using hexadecimal escape sequences (for example, \x00 is a null byte).
Thus, in Example 4-2, we see b'caf\xc3\xa9': the first three bytes b'caf' are within the printable ASCII range, while the last two bytes are not.
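The three display rules can be seen side by side by building a bytes object from hand-picked integer values. This is a small sketch; the name sample is just an illustration.

```python
# One byte of each kind: printable ASCII, escaped control characters,
# the backslash, and values shown only as hexadecimal escapes.
sample = bytes([65, 126, 9, 10, 13, 92, 0, 255])

print(sample)        # b'A~\t\n\r\\\x00\xff'
print(list(sample))  # [65, 126, 9, 10, 13, 92, 0, 255]
```

The repr is only a display convention: the object still holds the eight integers passed to the constructor.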
Except for the formatting methods (format and format_map) and a few methods that only make sense for Unicode data (including casefold, isdecimal, isidentifier, isnumeric, isprintable, and encode), every other method of the str type is also available on bytes and bytearray. This means we can use familiar string methods such as endswith, replace, strip, translate, and upper on binary sequences, provided the arguments are bytes objects rather than str objects. Additionally, if a regular expression is compiled from a binary sequence rather than a string, the functions in the re module can also handle binary sequences. The % operator does not work with binary sequences in Python 3.0 to 3.4, but it is supported from Python 3.5 onward; see “PEP 461: Adding % formatting to bytes and bytearray” (https://www.python.org/dev/peps/pep-0461/).
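As a rough illustration of these points, the sketch below applies a few str-like methods, a bytes regular expression, and % formatting to a made-up header line. The data and variable names are invented for the example, and the % formatting line assumes Python 3.5 or later.

```python
import re

raw = b'  Content-Length: 42\r\n'   # invented sample data

line = raw.strip()                       # b'Content-Length: 42'
shouted = line.upper()                   # b'CONTENT-LENGTH: 42'
is_header = line.startswith(b'Content-')
zeroed = line.replace(b'42', b'0')       # arguments must be bytes, not str

# A pattern compiled from bytes can search binary sequences.
match = re.search(rb'(\d+)$', line)
length = int(match.group(1))             # int() accepts ASCII digits in bytes

# % formatting on bytes (Python 3.5+).
formatted = b'%s=%d' % (b'size', length)

# Unicode-only and formatting methods are absent from bytes.
has_casefold = hasattr(bytes, 'casefold')
has_format = hasattr(bytes, 'format')

print(line, shouted, zeroed, formatted)
print(is_header, length, has_casefold, has_format)
```

Passing a str argument to one of these methods, for example line.replace('42', '0'), raises TypeError instead of converting silently.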
Binary sequences have a class method that str does not have, called fromhex, which parses pairs of hexadecimal digits (spaces between the digit pairs are optional) to construct a binary sequence:
>>> bytes.fromhex('31 4B CE A9')
b'1K\xce\xa9'
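A short sketch of the round trip may help here. It assumes Python 3.5 or later for the hex method, and the variable names are illustrative.

```python
# Spaces between digit pairs are optional, and case does not matter.
spaced = bytes.fromhex('31 4B CE A9')
compact = bytes.fromhex('314bcea9')

# fromhex is a class method, so bytearray has it too.
mutable = bytearray.fromhex('31 4B CE A9')

hex_again = spaced.hex()          # back to a str of hexadecimal digits
as_text = spaced.decode('utf_8')  # the last two bytes encode one character

print(spaced == compact, mutable, hex_again, as_text)
```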
Instances of bytes or bytearray can also be constructed by calling their respective constructors with the following parameters.
A str object and an encoding keyword argument.
An iterable providing values between 0 and 255.
An integer, creating a binary sequence of the corresponding length filled with null bytes. [“PEP 467: Minor API improvements for binary sequences” (https://www.python.org/dev/peps/pep-0467/) proposed deprecating this constructor signature and later removing it, but the integer form is still accepted by current Python 3 releases.]
An object that implements the buffer protocol (such as bytes, bytearray, memoryview, array.array); in this case, the byte sequence from the source object is copied into the newly created binary sequence.
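The four kinds of constructor argument can be sketched as follows. The variable names are illustrative, and the integer form is shown on the assumption that the interpreter in use still accepts it.

```python
import array

from_str = bytes('café', encoding='utf_8')     # str plus an encoding
from_iterable = bytes([99, 97, 102])           # iterable of ints in range(256)
from_int = bytes(3)                            # three null bytes
from_buffer = bytearray(array.array('B', [1, 2, 3]))  # buffer protocol source

# Values outside range(256) are rejected.
try:
    bytes([256])
except ValueError:
    out_of_range_rejected = True
else:
    out_of_range_rejected = False

print(from_str, from_iterable, from_int, from_buffer, out_of_range_rejected)
```

Note that a str without an encoding argument is rejected with TypeError, so the first form always needs both parts.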
Building a binary sequence from a buffer-like object is a low-level operation that may involve type conversion. Example 4-3 demonstrates this.
Example 4-3 Initializing a bytes object with raw data from an array
>>> import array
>>> numbers = array.array('h', [-2, -1, 0, 1, 2]) ➊
>>> octets = bytes(numbers) ➋
>>> octets
b'\xfe\xff\xff\xff\x00\x00\x01\x00\x02\x00' ➌
➊ Specifies the type code h, creating an array of short integers (16 bits).
➋ octets holds a copy of the byte sequence that makes up numbers.
➌ These are the 10 bytes representing those 5 short integers.
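The exact bytes in Example 4-3 depend on the machine, because array.array stores its items in native byte order; the output shown is what a little-endian machine produces. The sketch below makes that dependency explicit. It assumes the h type code occupies 2 bytes, which holds on mainstream platforms, and the names native, packed, and restored are illustrative.

```python
import array
import struct
import sys

numbers = array.array('h', [-2, -1, 0, 1, 2])
octets = bytes(numbers)

# Pack the same values with struct, spelling out the native byte order.
native = '<' if sys.byteorder == 'little' else '>'
packed = struct.pack(native + '5h', *numbers)
print(octets == packed)

# The copy can be turned back into an array of the same type.
restored = array.array('h')
restored.frombytes(octets)
print(restored.tolist())
```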
When a bytes or bytearray object is created from a buffer-like object, the byte sequence from the source object is always copied. In contrast, memoryview objects allow binary data structures to share memory. If you want to extract structured information from binary sequences, the struct module is an important tool. The next section will use this module to handle bytes and memoryview objects.
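The contrast between copying and sharing can be seen by changing the source after the fact. This is a minimal sketch with illustrative names.

```python
source = bytearray(b'abc')

copied = bytes(source)        # independent copy of the three bytes
shared = memoryview(source)   # view onto the same memory, no copy

source[0] = ord('x')          # modify the original in place

copied_first = copied[0]      # the copy still holds the old value
shared_first = shared[0]      # the view reflects the change
print(copied_first, shared_first)   # 97 120

shared.release()              # let go of the underlying buffer
```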
Struct and Memory Views
The struct module provides functions to convert packed byte sequences into tuples composed of different type fields, as well as functions for performing the reverse conversion, turning tuples back into packed byte sequences. The struct module can handle bytes, bytearray, and memoryview objects.
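Both directions of the conversion can be sketched with a small round trip. The format string and field values below are chosen for illustration only.

```python
import struct

# Little-endian: two 3-byte strings followed by two unsigned 16-bit integers.
fmt = '<3s3sHH'

packed = struct.pack(fmt, b'GIF', b'89a', 555, 230)   # tuple fields -> bytes
fields = struct.unpack(fmt, packed)                   # bytes -> tuple
size = struct.calcsize(fmt)                           # bytes the format needs

# unpack also accepts bytearray and memoryview arguments.
from_bytearray = struct.unpack(fmt, bytearray(packed))
from_view = struct.unpack(fmt, memoryview(packed))

print(packed, fields, size)
```

struct.unpack requires the buffer length to match calcsize(fmt) exactly, which is why it is usually given a slice of the right size.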
As described in section 2.9.2, the memoryview class is not used to create or store byte sequences but to share memory, allowing you to access slices of other binary sequences, packed arrays, and data in buffers without copying byte sequences, as the Python Imaging Library (PIL) does when handling images.
Example 4-4 demonstrates how to use memoryview and struct to extract the width and height of a GIF image.
Example 4-4 Using memoryview and struct to view the header of a GIF image
>>> import struct
>>> fmt = '<3s3sHH' # ➊
>>> with open('filter.gif', 'rb') as fp:
...     img = memoryview(fp.read())  # ➋
...
>>> header = img[:10] # ➌
>>> bytes(header) # ➍
b'GIF89a+\x02\xe6\x00'
>>> struct.unpack(fmt, header) # ➎
(b'GIF', b'89a', 555, 230)
>>> del header # ➏
>>> del img
➊ Struct format: < is little-endian, 3s3s are two 3-byte sequences, HH are two 16-bit integers.
➋ Creates a memoryview object from the contents of the file in memory…
➌ …then creates another memoryview object using its slice; this does not copy the byte sequence.
➍ Converts to a byte sequence, just for display; this copies 10 bytes.
➎ Unpacks the memoryview object into a tuple containing type, version, width, and height.
➏ Deletes the reference, freeing the memory occupied by the memoryview instance.
Note that slices of memoryview objects are new memoryview objects and do not copy the byte sequence. [One of the technical reviewers of this book, Leonardo Rochael, pointed out that even fewer bytes would be copied if the image were opened as a memory-mapped file with the mmap module. This book will not discuss mmap; if you frequently read and modify binary files, you can read the mmap module documentation, “Memory-mapped file support” (https://docs.python.org/3/library/mmap.html), for further learning.]
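To see that a memoryview slice shares memory without needing an image file, the sketch below uses an in-memory bytearray holding the same 10 header bytes as Example 4-4. The buffer contents and the new width of 640 are invented for the illustration.

```python
import struct

fmt = '<3s3sHH'
buffer = bytearray(b'GIF89a+\x02\xe6\x00')   # stand-in for a file's contents

img = memoryview(buffer)
header = img[:10]                  # new memoryview, no bytes copied

before = struct.unpack(fmt, header)

# Overwrite the width field in the underlying buffer.
struct.pack_into('<H', buffer, 6, 640)

after = struct.unpack(fmt, header)   # the slice sees the new width
snapshot = bytes(header)             # this call does copy 10 bytes

header.release()
img.release()

print(before, after)
```

Because header is a view and not a copy, the change made through buffer is visible when the slice is unpacked a second time.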