A Detailed Explanation of HTTP

Web crawlers are a common Python application, used mainly to gather information from the internet. Since crawlers depend on the network, we need to understand how that network works.

A network consists of nodes connected by links, and the vast network formed by interconnecting many networks is called the Internet. Today we will discuss HTTP (HyperText Transfer Protocol), one of the most widely used application protocols on the Internet. Its standards were developed jointly by the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF).

This article walks through the entire process of an HTTP request (excluding DNS resolution): the origin of HTTP, the TCP/IP protocol suite, establishing a TCP connection, the client request, the server response, and closing the TCP connection. It also covers some related HTTP background. It is fairly long (about five thousand words, roughly a 20-minute read), so you may want to bookmark it for later.

1. Introduction

1. Origin

We can surf the web today thanks to the vision of computer scientist Tim Berners-Lee. On August 6, 1991, he officially launched the world’s first website (http://info.cern.ch) on a NeXT computer at the European Organization for Nuclear Research (CERN), establishing the basic concepts and technical framework of the World Wide Web and opening the era of web information. His proposal laid out the basic concepts of the web, and he gradually built all the necessary tools:

Proposed HTTP (Hypertext Transfer Protocol), which allows users to access resources by clicking hyperlinks;
Proposed using HTML (Hypertext Markup Language) as the standard for creating web pages;
Created the Uniform Resource Locator URL as the website address system, using the format http://www;
Created the first web browser, called the World Wide Web browser, which was also a web editor;
Created the first web server (http://info.cern.ch) and the first web page describing the project itself.

2. Features

The HTTP protocol has five main features:

Supports the client/server model.
Simple and fast: when the client requests a service from the server, it only needs to send the request method and path.
Flexible: HTTP allows the transmission of any type of data object. The type being transmitted is marked by Content-Type, a header in the HTTP message that identifies the content type.
Connectionless: each connection handles only one request. After the server has processed the client’s request and received the client’s acknowledgment, it disconnects. This saves transmission time. (This describes HTTP/1.0; the long connections introduced by HTTP/1.1 are covered in section 6.)
Stateless: the protocol has no memory of earlier transactions, so the server does not know the state of the client. That is, after we send an HTTP request, the server sends back data based on that request, but once the response is sent it keeps no record of the exchange (Cookies and Sessions will be discussed later).
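To make the "Flexible" feature concrete, here is a minimal Python sketch of how a server might choose the Content-Type value for a file it is about to send. The helper name content_type_for and the file names are made up for illustration; mimetypes is part of the Python standard library.

```python
import mimetypes

def content_type_for(filename):
    """Guess the Content-Type header value from a file name (hypothetical helper)."""
    mime, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is not recognised.
    return mime or "application/octet-stream"

for name in ("index.html", "logo.png", "blob.zzzunknown"):
    print(f"{name} -> Content-Type: {content_type_for(name)}")
```

The receiver reads this header to decide how to interpret the bytes in the message body.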

2. TCP/IP Protocol

We often hear the phrase: HTTP is a protocol that transmits data based on the TCP/IP protocol suite.

How should we understand that statement? The TCP/IP four-layer model (application, transport, network, and link layers) makes it clear. HTTP is an application layer protocol; the transport layer protocol it uses is TCP, and the network layer uses IP (many other protocols are involved as well), so we say that HTTP transmits data based on the TCP/IP protocol suite.

Similarly, ping uses ICMP, a network layer protocol, rather than TCP. That is why we can sometimes access the internet through a VPS but cannot ping Google: the two kinds of traffic use different protocols and can be filtered independently.

Now, how does the TCP/IP protocol suite work in general? At the sending end, data is encapsulated layer by layer, with each layer adding its own header; at the receiving end, the headers are stripped layer by layer until the data finally reaches the application layer.
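The layer-by-layer wrapping and unwrapping can be sketched in a few lines of Python. This is a toy model only: the bracketed strings stand in for real TCP, IP, and link-layer headers, and the function names and example.com host are assumptions made for illustration.

```python
def encapsulate(payload):
    """Wrap application data in toy transport, network, and link headers."""
    segment = b"[TCP]" + payload   # transport layer adds its header
    packet = b"[IP]" + segment     # network layer adds its header
    frame = b"[ETH]" + packet      # link layer adds its header
    return frame

def decapsulate(frame):
    """Strip the toy headers in reverse order on the receiving side."""
    data = frame
    for header in (b"[ETH]", b"[IP]", b"[TCP]"):
        if not data.startswith(header):
            raise ValueError(f"expected header {header!r}")
        data = data[len(header):]
    return data

request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
frame = encapsulate(request)
print(frame[:20])
print(decapsulate(frame) == request)  # True: the application data arrives unchanged
```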

3. Establishing a TCP Connection

After understanding the general working principle of the TCP/IP protocol suite, let’s see how HTTP establishes a connection.

1. TCP Header Information

As we mentioned earlier, HTTP transmits data based on the TCP/IP protocol suite, so establishing an HTTP connection really means establishing a TCP connection. Let’s look at the structure of a TCP packet: TCP Packet = TCP Header + TCP Data. The TCP header includes six classic control bits, which represent the state of the TCP connection:

URG: the urgent pointer is valid; the packet carries urgent data.
ACK: the acknowledgment number is valid; confirms receipt.
PSH: the receiving application should read the data from the TCP receive buffer immediately.
RST: resets the connection, aborting it abruptly, for example after an error or an unexpected packet.
SYN: a request to establish a connection.
FIN: the sender has finished sending and wants to close the connection.
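As a rough illustration of where these six bits live, the following Python sketch packs a minimal 20-byte TCP header with the standard struct module and reads the flags back. The port numbers and helper names are made up for the example, and the checksum is left at 0, so this is not a packet you could put on the wire.

```python
import struct

# Masks of the six classic control bits in the 16-bit offset/flags field.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def build_header(src_port, dst_port, seq, ack, flags):
    """Pack a minimal 20-byte TCP header (checksum left at 0 for illustration)."""
    offset_and_flags = 5 << 12  # data offset: five 32-bit words, no options
    for name in flags:
        offset_and_flags |= FLAG_BITS[name]
    return struct.pack("!HHIIHHHH", src_port, dst_port, seq, ack,
                       offset_and_flags, 65535, 0, 0)

def parse_flags(header):
    """Return the names of the control bits set in a raw TCP header."""
    (offset_and_flags,) = struct.unpack("!H", header[12:14])
    return {name for name, bit in FLAG_BITS.items() if offset_and_flags & bit}

syn = build_header(50000, 80, 1234567, 0, ["SYN"])
syn_ack = build_header(80, 50000, 7654321, 1234568, ["SYN", "ACK"])
print(sorted(parse_flags(syn)))      # ['SYN']
print(sorted(parse_flags(syn_ack)))  # ['ACK', 'SYN']
```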

2. Connection Establishment Process

After understanding the TCP header information, we can look at the three-way handshake that establishes a TCP connection:

The client sends the server a packet with SYN=1 and a randomly generated sequence number, for example seq=1234567. From SYN=1 the server knows that the client wants to establish a connection. (Client: I want to connect to you.)
After receiving the request, the server confirms it by replying with a packet carrying SYN=1, ACK=1, ack number=(client’s seq+1), and its own randomly generated sequence number, for example seq=7654321. (Server: Okay, you can connect.)
The client checks that the ack number is correct, i.e. its own seq+1, and that ACK=1. If so, it sends a packet with ACK=1 and ack number=(server’s seq+1). The server in turn checks that this ack number equals its own seq+1 and that ACK=1; if both hold, the connection is established. (Client: Okay, I’m here.)
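The sequence and acknowledgment arithmetic of the three steps can be traced with a small simulation. This is only a sketch of the bookkeeping: no real packets are sent, the dictionaries are an invented representation, and the function name is hypothetical.

```python
import random

MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around

def three_way_handshake(client_isn=None, server_isn=None):
    """Return the three packets of a simulated handshake as dictionaries."""
    if client_isn is None:
        client_isn = random.getrandbits(32)
    if server_isn is None:
        server_isn = random.getrandbits(32)

    # Step 1: client -> server, SYN with the client's initial sequence number.
    syn = {"from": "client", "SYN": 1, "ACK": 0, "seq": client_isn, "ack": 0}

    # Step 2: server -> client, SYN+ACK acknowledging client_isn + 1.
    syn_ack = {"from": "server", "SYN": 1, "ACK": 1,
               "seq": server_isn, "ack": (syn["seq"] + 1) % MOD}

    # The client verifies the acknowledgment before answering.
    if syn_ack["ACK"] != 1 or syn_ack["ack"] != (client_isn + 1) % MOD:
        raise ValueError("unexpected SYN+ACK")

    # Step 3: client -> server, ACK acknowledging server_isn + 1.
    ack = {"from": "client", "SYN": 0, "ACK": 1,
           "seq": (client_isn + 1) % MOD, "ack": (syn_ack["seq"] + 1) % MOD}
    return [syn, syn_ack, ack]

for packet in three_way_handshake(1234567, 7654321):
    print(packet)
```

With the example numbers from the text, the second packet carries ack=1234568 and the third carries ack=7654322.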

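Interviewer: Why does establishing a TCP connection require a three-way handshake rather than two or four steps? Answer: Three is the minimum that lets both sides confirm that the other can send and receive and has accepted its initial sequence number. With only two steps the server would never learn whether the client received its reply, so a delayed duplicate SYN could leave the server holding a connection the client never wanted. A fourth step adds nothing, because the server’s SYN and ACK can already travel in the same packet.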

4. Client Request

Once the client is connected to the server, it can start requesting resources and send HTTP requests.

1. HTTP Request Message Structure

As we mentioned earlier, TCP Packet = TCP Header + TCP Data. We have discussed the TCP header; now let’s look at the TCP data, which in our case is the HTTP request message.

2. HTTP Request Example

Let’s take a look at an actual HTTP request example:

① is the request method. HTTP/1.1 defines eight request methods: GET, POST, PUT, DELETE, HEAD, OPTIONS, TRACE, and CONNECT; PATCH was added later by RFC 5789. The two most common are GET and POST, and RESTful APIs generally use GET, POST, PUT, and DELETE.
② is the URL of the request, which together with the Host attribute in the message header forms the complete request URL.
③ is the protocol name and version number.
④ is the HTTP message header, which contains a number of attributes in the form “attribute name: attribute value” that give the server information about the client.
⑤ is the message body. For a page form, the component values are encoded as key-value pairs in a string such as param1=value1&param2=value2, so the body can carry multiple request parameters.

Not only can the message body pass request parameters, but the request URL can also pass parameters in a similar way, such as “/chapter15/user.html?param1=value1&param2=value2”.
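The five parts can be seen by assembling a request message by hand. The sketch below is illustrative only: the host example.com and the helper names build_request and parse_request are assumptions, while the path and parameters reuse the ones from the text.

```python
from urllib.parse import urlencode

def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.1 request message as text."""
    lines = [f"{method} {path} HTTP/1.1"]                          # ① method, ② URL, ③ version
    lines += [f"{name}: {value}" for name, value in headers.items()]  # ④ header
    return "\r\n".join(lines) + "\r\n\r\n" + body                  # blank line, then ⑤ body

def parse_request(raw):
    """Split a raw request back into its parts."""
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return method, path, version, headers, body

body = urlencode({"param1": "value1", "param2": "value2"})
raw = build_request(
    "POST",
    "/chapter15/user.html",
    {
        "Host": "example.com",  # assumed host, for illustration only
        "Content-Type": "application/x-www-form-urlencoded",
        "Content-Length": str(len(body)),
    },
    body,
)
print(raw)
print(parse_request(raw))
```

In practice a library such as http.client or requests builds this text for you; writing it out once makes the structure of the message visible.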

There are many request header parameters; we will not explain them all here, and only mention two that basic anti-crawling checks rely on:

User-Agent: the name and version of the operating system and browser used by the client; some websites restrict requests based on the browser.
Referer: the address of the previous webpage, indicating where this request came from; some websites restrict requests based on the source.
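A crawler typically sets both headers explicitly. The sketch below uses urllib.request from the standard library; the URLs and the User-Agent string are placeholders, and the request is only constructed, not sent.

```python
import urllib.request

req = urllib.request.Request(
    "http://example.com/list?page=2",  # placeholder URL
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
        "Referer": "http://example.com/list?page=1",
    },
)

# urllib stores header names via str.capitalize(), hence the key "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Referer"))

# urllib.request.urlopen(req) would send the request; it is omitted here
# so that the sketch runs without network access.
```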

5. Server Response

After the server receives and processes the client’s request, it returns a response to the client. The structure of the HTTP response message mirrors that of the request message: a status line in place of the request line, followed by the header, a blank line, and the body.

1. HTTP Response Message Structure

2. HTTP Response Example

3. Response Status Codes

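In the response message, pay special attention to the server’s response status code, which often comes up in interviews. Below we list only the categories; please search online for the individual status codes.

1xx: informational, the request was received and processing continues.
2xx: success, the request was received, understood, and accepted.
3xx: redirection, further action is needed to complete the request.
4xx: client error, the request is malformed or cannot be fulfilled.
5xx: server error, the server failed to fulfil a valid request.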
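The category of a status code is determined by its first digit, which makes it easy to classify in code. A minimal sketch, using the standard five classes and the http.HTTPStatus enum from the standard library for the reason phrases (the CATEGORIES table and category function are names chosen for this example):

```python
from http import HTTPStatus

CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}

def category(code):
    """Map a status code to its class using the hundreds digit."""
    return CATEGORIES.get(code // 100, "Unknown")

for code in (200, 301, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", category(code))
```

A crawler can use the same check to decide whether to parse a page (2xx), follow a redirect (3xx), or back off and retry later (5xx).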

6. Disconnecting

After the server has completed the response, the session ends. Will the connection be disconnected at this point?

1. Long and Short Connections

Whether to disconnect depends on the HTTP version:

In HTTP/1.0, after the client and server complete a request/response, the TCP connection is closed, and a new TCP connection must be established for the next request. This is called a short connection.
HTTP/1.1 was published in January 1997, less than a year after HTTP/1.0 (May 1996), and brought a new feature: after the client and server complete a request/response, the TCP connection may stay open, so the next request can reuse it directly instead of establishing a new one. This is called a long connection.

Note: A long connection refers to a single TCP connection carrying multiple HTTP request/response exchanges. Each HTTP exchange is still a single request and response, after which the session ends; the long connection is a property of the underlying TCP connection, not of HTTP messages themselves.

HTTP/1.1 has been in widespread use since around 1999, so browsers now include Connection: keep-alive in the request header to indicate that they want a long connection with the server. In HTTP/1.1 itself, connections stay open by default unless either side sends Connection: close, so the server can also decide whether it is willing to keep the connection open.
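The difference between short and long connections can be observed on a single machine. The sketch below starts a throwaway HTTP/1.1 server on 127.0.0.1 that echoes the client’s source port, then sends several requests either over one reused http.client connection or over a fresh connection each time. The Handler class and ports_seen function are invented for this example; everything else is standard library.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets http.server keep connections open

    def do_GET(self):
        # Echo the client's source port; it stays the same while the
        # underlying TCP connection is being reused.
        body = str(self.client_address[1]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def ports_seen(reuse, n=3):
    """Send n requests and return the client ports the server observed."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    ports = []
    try:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        for _ in range(n):
            if not reuse:
                # Short connection: drop the socket and open a new one.
                conn.close()
                conn = http.client.HTTPConnection(host, port, timeout=5)
            conn.request("GET", "/")
            ports.append(conn.getresponse().read().decode())
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return ports

if __name__ == "__main__":
    print("long connection :", ports_seen(reuse=True))
    print("short connection:", ports_seen(reuse=False))
```

With reuse enabled, all requests report the same client port because they share one TCP connection; without it, each request pays for a new handshake and normally shows a different port.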

2. Advantages and Disadvantages of Long Connections

For servers, long connections have both advantages and disadvantages:

Advantages: when a website has many static resources (images, CSS, JS, etc.), long connections let multiple resources be sent over a single TCP connection.
Disadvantages: if the client makes no further requests while the server keeps the connection open, the server’s resources are tied up for nothing.

Therefore, whether to enable long connections, and how long to keep them open, should be set according to the website’s own needs.

Note: Do not underestimate this TCP connection. In a complete HTTP request from a client (DNS resolution, establishing TCP connection, requesting, waiting, parsing the webpage, disconnecting TCP connection), the time taken to establish the TCP connection is still considerable.

3. Disconnecting Process

Establishing a TCP connection requires a three-way handshake, while closing a TCP connection requires a four-way handshake! As mentioned in the section on TCP header information, the FIN flag indicates that the sender wants to close the connection. Why does closing take four steps? This is an exercise for you; feel free to leave your understanding in the comments to see if it is correct.
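One way to explore this exercise is to watch a connection being closed one direction at a time. In the loopback sketch below, the client sends its data and then calls shutdown(SHUT_WR), which sends its FIN; the server notices the end of input, still sends a reply, and only then closes its own side. The function name and messages are made up for the example, which needs Python 3.8 or later.

```python
import socket
import threading

def half_close_demo():
    """Return what the client receives after it has already sent its FIN."""
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]

    def serve():
        conn, _addr = listener.accept()
        with conn:
            received = b""
            # recv() returns b"" once the client's FIN has arrived.
            while chunk := conn.recv(1024):
                received += chunk
            # The server-to-client direction is still open at this point.
            conn.sendall(b"got " + received)
        # Leaving the with-block closes the socket and sends the server's FIN.

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()

    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        client.sendall(b"hello")
        client.shutdown(socket.SHUT_WR)  # client FIN: "nothing more to send"
        reply = b""
        while chunk := client.recv(1024):
            reply += chunk

    worker.join(timeout=5)
    listener.close()
    return reply

if __name__ == "__main__":
    print(half_close_demo())  # b'got hello'
```

The reply still arrives after the client’s FIN, which suggests why the two directions are closed and acknowledged separately.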

7. Aside

1. Must-Know Interview Question: Three-Way Handshake and Four-Way Handshake in TCP

Interviewer: Why does establishing a connection require a three-way handshake while closing a connection requires a four-way handshake? (This is an exercise for you; feel free to share your insights in the comments!)

2. HTTP/2.0

HTTP/1.1 has served us for about 20 years. HTTP/2 (often written HTTP/2.0) was published in 2015, and at the time of writing it had not yet fully replaced HTTP/1.1. You can search online for information about its new features.

3. HTTP & RPC

Because of drawbacks of HTTP such as large request headers and comparatively slow responses, RPC is commonly used in the microservices era for calls between services. Interested readers can look up RPC concepts online.

4. HTTP & HTTPS

HTTP has two significant drawbacks: it is plaintext and cannot guarantee integrity, which is why it is gradually being replaced by HTTPS. We will discuss HTTPS in detail next time.

Author: Zhuge

Source: Naked Sleeping Pig