Mastering Strings and Encoding in Python

Previously, we learned about Python data types and variables and briefly mentioned strings. Now let's take a closer look at strings and encoding.

Strings

Strings are a commonly used data type in Python, and they can be created using single or double quotes. To create a string, simply assign a value to a variable. For example:

var1 = 'Hello World!'
var2 = "Zhang San"

Once a string is created, you can perform various operations on it, such as accessing, escaping, updating, and formatting the string.

(1) Accessing values in a string in Python

You can use square brackets to read a single character by index or to slice out a substring. Indexes start at 0, and the end index of a slice is excluded. For example:

var1 = 'Hello World!'
var2 = "this is test"
print("var1[0]:", var1[0])
print("var2[1:5]:", var2[1:5])

After running, the console will output:

var1[0]: H

var2[1:5]: his
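The indexing rules above can be explored a little further. The following is a minimal sketch using the same sample string; the variable names (first, last, middle, head, tail) are only illustrative.

```python
var2 = "this is test"

first = var2[0]      # index 0 is the first character
last = var2[-1]      # negative indexes count from the end
middle = var2[1:5]   # positions 1 to 4; the end index 5 is excluded
head = var2[:4]      # an omitted start means "from the beginning"
tail = var2[8:]      # an omitted end means "through the last character"

# repr() makes whitespace visible, such as the trailing space in middle.
print(repr(first), repr(last), repr(middle), repr(head), repr(tail))
```

Note that var2[1:5] ends with a space character, which is easy to miss in console output.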

(2) Updating strings in Python

Python strings are immutable, so "updating" a string really means building a new one. You can slice part of a string and concatenate it with another string. For example:

var1 = 'Hello World!'
print("Updated string:", var1[:6] + 'Hallo!')

After running, the console will output:

Updated string: Hello Hallo!
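Because strings cannot be changed in place, assigning to an index fails, while slicing, concatenation, and methods such as str.replace return new strings. A short sketch, with illustrative variable names:

```python
var1 = 'Hello World!'

# Assigning to a position in a str raises TypeError.
try:
    var1[0] = 'J'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string instead; the original is left untouched.
updated = var1[:6] + 'Python!'
replaced = var1.replace('World', 'Python')

print(assignment_failed)
print(updated)
print(replaced)
print(var1)
```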

(3) Escape characters in Python

If you need to use special characters in a string, you can use a backslash (\) to escape characters in Python, as shown in Table 1-1.

For example, when a string delimited by single quotes itself contains a single quote, you need to escape that quote with a backslash:

var1 = 'I\'m a test!'
print(var1)

After running, the console will output:

I'm a test!
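A few other common escapes behave the same way. The sketch below also shows two ways to avoid escaping: switching the quote style, and using a raw string (the r prefix). The sample values are made up for illustration.

```python
quote = 'I\'m a test!'        # escaped single quote inside single quotes
same_quote = "I'm a test!"    # double quotes make the escape unnecessary
two_lines = 'first\nsecond'   # \n is a newline character
tabbed = 'name\tvalue'        # \t is a tab character
backslash = 'C:\\temp'        # \\ is one literal backslash
raw = r'C:\temp'              # raw string: backslashes are kept as written

print(quote == same_quote)
print(two_lines)
print(tabbed)
print(backslash == raw)
```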

(4) String formatting in Python

Python supports formatted string output. Although this may involve very complex expressions, the most basic usage is to insert a value into a string with formatting symbols. For example:

print("My name is %s and I am %d years old!" % ('Little Ming', 10))

After running, the console will output:

My name is Little Ming and I am 10 years old!

The commonly used string formatting symbols in Python are listed in Table 1-2.
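Besides the % operator shown above, Python also offers str.format and f-strings (Python 3.6 and later), which produce the same result. The sketch below compares the three; the variable names and the score value are illustrative only.

```python
name = 'Little Ming'
age = 10
score = 92.456

old_style = 'My name is %s and I am %d years old!' % (name, age)
with_format = 'My name is {} and I am {} years old!'.format(name, age)
with_fstring = f'My name is {name} and I am {age} years old!'

# Width and precision can be controlled in each style.
padded = '%05d' % age        # zero-padded to 5 digits
rounded = '%.1f' % score     # one decimal place
rounded_f = f'{score:.1f}'   # same precision with an f-string

print(old_style)
print(padded, rounded, rounded_f)
```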

Encoding

Strings are a data type like any other, but they raise one extra issue: encoding. Because computers can only process numbers, text must first be converted into numbers. The earliest computers were designed with 8 bits making up 1 byte, so the largest integer that 1 byte can represent is 255 (binary 11111111 = decimal 255). Representing larger integers requires more bytes: 2 bytes can represent integers up to 65535, and 4 bytes up to 4294967295.
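The limits quoted above follow from the formula 2^(8n) - 1 for n bytes. A small sketch to check the arithmetic; the helper name max_unsigned is hypothetical:

```python
def max_unsigned(n_bytes):
    """Largest unsigned integer that fits in n_bytes bytes."""
    return 2 ** (8 * n_bytes) - 1

for n_bytes in (1, 2, 4):
    print(n_bytes, max_unsigned(n_bytes))

# The binary literal from the text, converted back to decimal.
one_byte_max = int('11111111', 2)

# 256 no longer fits in 1 byte, so it needs 2.
two_bytes = (256).to_bytes(2, 'big')
print(one_byte_max, two_bytes)
```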

Initially, only 128 characters (codes 0 to 127) were encoded, covering uppercase and lowercase English letters, digits, and some symbols. This encoding is called ASCII. For example, the code for the uppercase letter A is 65, and the code for the lowercase letter z is 122.
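In Python, the built-in functions ord and chr convert between a character and its numeric code, which makes the ASCII values above easy to verify. A minimal sketch:

```python
code_a = ord('A')       # character to number
code_z = ord('z')
letter = chr(65)        # number back to character

# Encoding to ASCII yields exactly one byte per character.
ascii_bytes = 'Az'.encode('ascii')

# Characters outside the ASCII range cannot be encoded as ASCII.
try:
    '中'.encode('ascii')
    outside_ascii_failed = False
except UnicodeEncodeError:
    outside_ascii_failed = True

print(code_a, code_z, letter, list(ascii_bytes), outside_ascii_failed)
```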

However, to handle Chinese characters, clearly 1 byte is insufficient; at least 2 bytes are needed, and they cannot conflict with ASCII encoding. Therefore, China established the GB2312 encoding to include Chinese characters.

There are hundreds of languages worldwide. Japan encodes Japanese in Shift_JIS, and Korea encodes Korean in EUC-KR. Each country has its own standard, which inevitably leads to conflicts and produces garbled text when multilingual text is displayed.
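The garbled text described above is easy to reproduce: encode text with one national standard and decode the bytes with another. The sketch below uses the Python codec names 'gb2312' and 'shift_jis'; the sample text is an arbitrary choice for illustration.

```python
text = '中文'

# Bytes as a GB2312-based system would store them.
raw_bytes = text.encode('gb2312')

# A system that assumes Shift_JIS interprets the same bytes differently.
garbled = raw_bytes.decode('shift_jis')

# Decoding with the matching standard recovers the original text.
restored = raw_bytes.decode('gb2312')

print(raw_bytes)
print(garbled)
print(restored)
```

The bytes never change; only the interpretation does, which is why the same file can look correct on one system and garbled on another.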

Thus, Unicode was born. Unicode unifies the characters of all languages into a single character set, which eliminates garbled text caused by conflicting encodings. The Unicode standard continues to evolve, but the most common fixed-width representation uses 2 bytes per character (very rare characters need 4 bytes). Modern operating systems and most programming languages support Unicode directly.

The difference between ASCII and Unicode encoding is that ASCII uses 1 byte per character, while Unicode usually uses 2 bytes.

For example:

1) The letter A is encoded in ASCII, with a decimal value of 65 and a binary value of 01000001.

2) The character 0 is encoded in ASCII, with a decimal value of 48 and a binary value of 00110000. Note that the character ‘0’ and the integer 0 are different.

3) Chinese characters fall outside the range of ASCII encoding. For example, the character "中" is encoded in Unicode, with a decimal value of 20013 and a binary value of 01001110 00101101.

It follows that converting the ASCII encoding of A to Unicode only requires padding it with zeros in front, so the Unicode encoding of A is 00000000 01000001.
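These values can be checked with ord and format. The sketch below prints each code point as a 16-bit binary string, matching the 2-byte view used in the examples; '中' is the character U+4E2D.

```python
codes = {}
for ch in ('A', '0', '中'):
    code_point = ord(ch)
    # '016b' pads the binary form with leading zeros to 16 bits.
    codes[ch] = (code_point, format(code_point, '016b'))
    print(ch, code_point, codes[ch][1])

# The character '0' and the integer 0 are different things.
char_zero_code = ord('0')
int_zero = int('0')
print(char_zero_code, int_zero)
```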

A new problem arises: using Unicode uniformly makes the garbled text problem disappear, but if the text is mostly English, Unicode encoding needs twice the storage space of ASCII, which is uneconomical for storage and transmission.

Therefore, to save space, a variable-length encoding of Unicode called UTF-8 was developed. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code point. Common English letters are encoded into 1 byte, Chinese characters usually into 3 bytes, and only rare characters into 4 bytes. If the text to be transmitted contains a large number of English characters, using UTF-8 encoding can save a lot of space, as shown in Table 1-3.
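The variable length is easy to observe with str.encode. In the sketch below, the sample characters are arbitrary, and 'utf-16-le' is used as a stand-in for a 2-byte Unicode representation; that choice is an assumption made for the comparison, not something the text specifies.

```python
# One sample character for each UTF-8 length from 1 to 4 bytes.
samples = ['A', '\u00e9', '中', '\U0001F600']
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)

# The three UTF-8 bytes of '中'.
zhong_bytes = '中'.encode('utf-8')
print(zhong_bytes)

# Space comparison for mostly-English text.
english = 'Hello World!'
size_utf8 = len(english.encode('utf-8'))
size_utf16 = len(english.encode('utf-16-le'))
print(size_utf8, size_utf16)
```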

From Table 1-3, it can also be seen that UTF-8 encoding has an additional benefit: ASCII encoding can actually be seen as part of UTF-8 encoding, so software that only supports ASCII encoding can continue to work under UTF-8 encoding.
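This compatibility can be demonstrated directly: for ASCII-only text, the ASCII and UTF-8 encodings produce identical bytes. A minimal sketch with made-up sample strings:

```python
text = 'Hello, ASCII!'
ascii_bytes = text.encode('ascii')
utf8_bytes = text.encode('utf-8')
identical = ascii_bytes == utf8_bytes

# An ASCII decoder can read UTF-8 data as long as every byte is below 128.
decoded_as_ascii = utf8_bytes.decode('ascii')

# Once a non-ASCII character appears, an ASCII-only decoder fails.
mixed = 'Hi 中'.encode('utf-8')
try:
    mixed.decode('ascii')
    mixed_failed = False
except UnicodeDecodeError:
    mixed_failed = True

print(identical, decoded_as_ascii, mixed_failed)
```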

Understanding the relationship between ASCII, Unicode, and UTF-8 allows us to summarize how character encoding works in modern computer systems: in memory, text is uniformly held as Unicode, and when it is saved to disk or transmitted, it is converted to UTF-8. When you edit a file in a text editor, the UTF-8 characters read from the file are converted to Unicode characters held in memory, and after editing they are converted back to UTF-8 when the file is saved.
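In Python 3, this workflow maps onto str (Unicode text in memory) and bytes (encoded data on disk). The sketch below writes a string to a temporary file as UTF-8, inspects the raw bytes, and reads the text back; the file name sample.txt and the sample text are illustrative.

```python
import os
import tempfile

text = 'Hello, 中文'   # str: Unicode text in memory

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')

    # Saving: Unicode text is encoded to UTF-8 bytes.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # What is actually stored on disk.
    with open(path, 'rb') as f:
        raw_bytes = f.read()

    # Loading: UTF-8 bytes are decoded back to Unicode text.
    with open(path, 'r', encoding='utf-8') as f:
        loaded = f.read()

print(len(text), len(raw_bytes), loaded == text)
```

Passing encoding='utf-8' explicitly avoids depending on the platform's default encoding.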

Recommended Books for Today

This book is divided into 4 parts. Aimed at beginners in Python web crawling, it systematically explains how to develop web crawler programs in Python from scratch.

The first part is a quick start: It mainly introduces the setup of the Python environment and basic syntax knowledge, introductory knowledge of web crawlers, basic usage methods, analysis and capture of Ajax data, crawling dynamic rendering page data, setting up and using website proxies, recognizing and cracking verification codes, as well as data capture from apps and methods of data storage.

The second part is skill advancement: It mainly introduces the basic usage methods of two commonly used crawling frameworks, PySpider and Scrapy, deployment methods for crawlers, as well as the usage of common libraries for data analysis and data cleaning.

The third part is project practice: It details two comprehensive practical projects, explaining the start and practical application of Python data crawling. This part summarizes the content of the entire book, reinforcing the reader’s practical skills.

The fourth part is skill expansion: It introduces practical techniques of commonly used AI technologies from the perspectives of data crawling, data cleaning, and data analysis. By using these techniques, readers can improve the speed of writing web crawler programs and the efficiency of data analysis.
