A web crawler, also known as a web spider, is a program that automatically browses the web. Before discussing web crawlers, we need to understand what a network is. A network consists of several nodes connected by links, and the vast network formed by connecting many networks together is called the Internet. Today, we will discuss HTTP (HyperText Transfer Protocol), the most widely used application protocol on the Internet. Its specifications were developed jointly by the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C) and are published by the IETF as RFCs.
This article mainly explains the entire process of an HTTP request (excluding DNS resolution): the origin of HTTP, TCP/IP protocol, establishing a TCP connection, client requests, server responses, and disconnecting the TCP connection. The article also includes related knowledge about HTTP. It is quite lengthy, so it is recommended to bookmark or share it for later reading! (Total of five thousand words, approximately 20 minutes to read)
1. Introduction
1. Origin
Today, our ability to navigate the web is thanks to the vision of computer scientist Tim Berners-Lee. On August 6, 1991, Tim Berners-Lee officially launched the world’s first website (http://info.cern.ch) on a NeXT computer at CERN, establishing the basic concepts and technical framework of the World Wide Web (which runs on top of the Internet) and ushering in the era of information on the web.
Berners-Lee’s proposal included the basic concepts of the web and gradually established all the necessary tools:
-
Proposed HTTP (Hypertext Transfer Protocol), allowing users to access resources by clicking hyperlinks;
-
Proposed using HTML (Hypertext Markup Language) as the standard for creating web pages;
-
Created the URL (Uniform Resource Locator) as the website address system, which is still in use today in the format http://www;
-
Created the first web browser, known as the World Wide Web browser, which also served as a web editor;
-
Created the first web server (http://info.cern.ch) and the first web page describing the project itself.
2. Features
The HTTP protocol has five main features:
-
Supports client/server model.
-
Simple and fast:
When a client requests a service from a server, it only needs to send the request method and path.
-
Flexible:
HTTP allows the transmission of any type of data object.
The type being transmitted is marked by the Content-Type (which indicates the content type in the HTTP packet).
-
Connectionless:
Connectionless means that each connection only handles one request.
Once the server has processed the client’s request and the client has acknowledged the response, the connection is closed.
This method saves transmission time.
-
Stateless:
Stateless means that the protocol has no memory of transaction processing; the server does not know the state of the client.
After we send an HTTP request to the server, the server sends us data based on the request, but after sending, it does not record any information (Cookies and Sessions will be discussed later).
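As a rough illustration of the "simple" and "flexible" features above, the sketch below builds a minimal request and reads a Content-Type value. The helper names and the host example.com are assumptions made for this example; note also that HTTP/1.1 requires a Host header in addition to the method and path.

```python
from email.message import Message
from typing import Optional, Tuple


def build_minimal_request(method: str, path: str, host: str) -> bytes:
    """Build a minimal HTTP/1.1 request: request line plus the Host header."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def describe_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


# Hypothetical values, used only for illustration.
request = build_minimal_request("GET", "/index.html", "example.com")
media_type, charset = describe_content_type("text/html; charset=UTF-8")
print(request)
print(media_type, charset)
```

The standard library's email.message.Message is reused here only because HTTP headers share the same "name: value" syntax as mail headers.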
2. TCP/IP Protocol
We often hear the phrase: HTTP is a protocol that transmits data based on the TCP/IP protocol suite.
How do we understand this statement? Let’s take a look at the TCP/IP four-layer model to clarify.
From the above diagram, we can clearly see that the transport layer protocol used by HTTP is the TCP protocol, while the network layer uses the IP protocol (of course, many other protocols are also used), so we say that HTTP is a protocol that transmits data based on the TCP/IP protocol suite.
We can also see that ping uses the ICMP protocol, which is why sometimes we can access the internet with a VPS but cannot ping Google, as they use different protocols.
So how does the TCP/IP protocol suite work? Let’s take a look at the following diagram:
We can see that at the data sending end, data is encapsulated layer by layer, and at the receiving end, data is unpacked layer by layer, ultimately reaching the application layer.
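The layer-by-layer wrapping and unwrapping can be sketched in a few lines of Python. This is a toy model only: the bracketed labels stand in for real TCP, IP, and link-layer headers, and the function names are made up for this example.

```python
# Layers below the application layer, in the order the sender adds them.
LAYERS = ["TCP", "IP", "ETH"]  # transport, network, link (labels only)


def encapsulate(app_data: bytes) -> bytes:
    """Sender side: each lower layer prepends its own (toy) header."""
    packet = app_data
    for layer in LAYERS:
        packet = f"[{layer}]".encode("ascii") + packet
    return packet


def decapsulate(frame: bytes) -> bytes:
    """Receiver side: strip the headers in reverse order, outermost first."""
    data = frame
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode("ascii")
        if not data.startswith(header):
            raise ValueError(f"expected {layer} header")
        data = data[len(header):]
    return data


http_message = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
frame = encapsulate(http_message)
print(frame[:17])
print(decapsulate(frame) == http_message)
```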
3. Establishing a TCP Connection
Now that we understand the general working principle of the TCP/IP protocol suite, let’s see how HTTP establishes a connection.
1. TCP Header Information
As mentioned earlier, HTTP is a protocol that transmits data based on the TCP/IP protocol suite, so establishing an HTTP connection is equivalent to establishing a TCP connection. Let’s take a look at the structure of TCP packet information.
TCP packet = TCP header information + TCP data body, and the TCP header information contains six control bits (highlighted in red in the image), which represent the state of the TCP connection:
-
URG:
Indicates that the segment carries urgent data.
-
ACK:
Indicates acknowledgment of receipt.
-
PSH:
Indicates that the receiving application should immediately read the data from the TCP receive buffer.
-
RST:
Indicates that the connection is being reset, that is, aborted immediately (for example, when a segment arrives for a connection that does not exist).
-
SYN:
Indicates a request to establish a connection.
-
FIN:
Indicates a notification to the other party that this end is closing the connection.
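The six control bits listed above occupy the low six bits of the flags byte in the TCP header. A small sketch of decoding that byte follows; the function name decode_flags is an assumption for this example, while the bit masks follow the standard TCP header layout.

```python
from typing import List

# Masks of the six classic control bits within the TCP flags byte.
TCP_FLAGS = [
    ("URG", 0x20),
    ("ACK", 0x10),
    ("PSH", 0x08),
    ("RST", 0x04),
    ("SYN", 0x02),
    ("FIN", 0x01),
]


def decode_flags(flags_byte: int) -> List[str]:
    """Return the names of the control bits that are set."""
    return [name for name, mask in TCP_FLAGS if flags_byte & mask]


print(decode_flags(0x02))  # a connection request
print(decode_flags(0x12))  # the reply to a connection request
```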
2. Connection Establishment Process
After understanding the TCP header information, we can formally look at the three-way handshake for establishing a TCP connection.
Explanation of the three-way handshake:
-
The client sends a packet with SYN=1 and a randomly generated sequence number seq=1234567 to the server. The server knows from SYN=1 that the client wants to establish a connection (Client: I want to connect to you).
-
After receiving the request, the server confirms the connection information by sending a packet with ack number=(client’s seq+1), SYN=1, ACK=1, and a randomly generated seq=7654321 (Server: Okay, you can connect).
-
After the client receives this, it checks whether the ack number is correct, i.e., the seq number it first sent plus 1, and whether the ACK bit is 1. If both are correct, the client sends ack number=(server’s seq+1), ACK=1. The server, upon receiving this, checks that the ack number is correct and that ACK=1, and the connection is established successfully (Client: Okay, I’m here).
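The number bookkeeping in the three steps can be sketched as a small simulation. This is not a real TCP stack; the function name and the dictionary layout are assumptions for this example, and the sequence numbers are the ones used in the explanation above.

```python
def three_way_handshake(client_isn: int, server_isn: int) -> list:
    """Simulate the three segments of a TCP handshake and their checks."""
    wrap = 2 ** 32  # sequence numbers are 32-bit and wrap around

    # Step 1: client -> server, SYN with the client's initial seq.
    syn = {"from": "client", "SYN": 1, "ACK": 0, "seq": client_isn, "ack": None}

    # Step 2: server -> client, SYN+ACK acknowledging client's seq + 1.
    syn_ack = {"from": "server", "SYN": 1, "ACK": 1,
               "seq": server_isn, "ack": (syn["seq"] + 1) % wrap}

    # Client-side check before replying.
    if syn_ack["ACK"] != 1 or syn_ack["ack"] != (client_isn + 1) % wrap:
        raise ValueError("client rejected SYN+ACK")

    # Step 3: client -> server, ACK acknowledging server's seq + 1.
    ack = {"from": "client", "SYN": 0, "ACK": 1,
           "seq": (client_isn + 1) % wrap, "ack": (syn_ack["seq"] + 1) % wrap}

    # Server-side check: connection is established only if this passes.
    if ack["ACK"] != 1 or ack["ack"] != (server_isn + 1) % wrap:
        raise ValueError("server rejected final ACK")

    return [syn, syn_ack, ack]


segments = three_way_handshake(1234567, 7654321)
for segment in segments:
    print(segment)
```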
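Interviewer: Why does establishing a TCP connection require a three-way handshake instead of two or four? Answer: Three is the minimum number of messages that lets both sides confirm each other’s initial sequence number. With only two, the server could not tell whether the client received its SYN+ACK, so a stale, delayed SYN could leave the server holding a connection the client never wanted; a fourth message would confirm nothing new and only waste resources.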
4. Client Request
Once the client and server are connected, the client can start requesting resources from the server and can begin sending HTTP requests.
1. HTTP Request Message Structure
As we mentioned earlier, TCP packet = TCP header information + TCP data body. We have already discussed the TCP header information, now let’s talk about the TCP data body, which is our HTTP request message.
2. HTTP Request Example
Let’s take a look at an actual HTTP request example:
-
① This is the request method. HTTP/1.1 defines eight request methods:
GET, POST, PUT, DELETE, HEAD, OPTIONS, TRACE, and CONNECT; PATCH was added later by RFC 5789. The two most common are GET and POST. If it is a RESTful interface, GET, POST, DELETE, and PUT are generally used.
-
② This is the corresponding URL address for the request, which, along with the Host attribute in the message header, forms the complete request URL.
-
③ This is the protocol name and version number.
-
④ This is the HTTP message header, which contains several attributes in the format “attribute name: attribute value”. The server uses this to obtain information about the client.
-
⑤ This is the message body, which encodes the component values from a page form into a formatted string in the form of param1=value1&param2=value2, carrying multiple request parameters’ data.
Not only can the message body pass request parameters, but the request URL can also pass parameters in a similar way, such as “/chapter15/user.html?param1=value1&param2=value2”.
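To make the five parts concrete, here is a minimal sketch that assembles a form POST as raw bytes and splits it back into request line, headers, and body. The helper names and the host example.com are assumptions for this example; the path and parameters are the ones mentioned above.

```python
from urllib.parse import urlencode


def build_form_post(path: str, host: str, params: dict) -> bytes:
    """Assemble request line, headers, blank line, and form-encoded body."""
    body = urlencode(params).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def split_request(raw: bytes):
    """Split a raw request into method, target, version, headers, body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    request_line, *header_lines = head.decode("ascii").split("\r\n")
    method, target, version = request_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return method, target, version, headers, body


raw = build_form_post(
    "/chapter15/user.html",
    "example.com",  # hypothetical host
    {"param1": "value1", "param2": "value2"},
)
method, target, version, headers, body = split_request(raw)
print(method, target, version)
print(headers)
print(body)
```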
There are many request header parameters, and I will not explain them all. I will only mention two basic anti-crawling parameters:
-
User-Agent:
The name and version of the operating system and browser used by the client. Some websites may restrict requests based on the browser.
-
Referer:
The address of the previous webpage, indicating where this request comes from. Some websites may restrict requests based on the source.
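For a crawler written in Python, these two headers can be set with the standard library's urllib.request.Request. The sketch below only builds the request object and does not send it; the URLs and the User-Agent string are placeholder values chosen for this example.

```python
from urllib.request import Request


def build_crawler_request(url: str, user_agent: str, referer: str) -> Request:
    """Create a request object carrying User-Agent and Referer headers."""
    return Request(url, headers={"User-Agent": user_agent, "Referer": referer})


req = build_crawler_request(
    "http://example.com/page.html",      # hypothetical target page
    "Mozilla/5.0 (X11; Linux x86_64)",   # placeholder browser string
    "http://example.com/",               # hypothetical previous page
)

# urllib stores header names capitalized as "User-agent" and "Referer".
print(req.get_method())
print(req.header_items())
```

Passing req to urllib.request.urlopen would actually send it over the network.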
5. Server Response
After the server receives the client’s request and processes it, it sends a response back to the client. The structure of the HTTP response message mirrors that of the request: a first line, headers, a blank line, and a message body.
1. HTTP Response Message Structure
2. HTTP Response Example
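As a stand-in illustration, the sketch below parses a made-up response into its status line, headers, and body. The sample bytes and the function name parse_response are assumptions for this example, not output captured from a real server.

```python
# A hypothetical response, written out by hand for illustration.
SAMPLE_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Length: 14\r\n"
    b"\r\n"
    b"<h1>Hello</h1>"
)


def parse_response(raw: bytes):
    """Split a raw response into version, status code, reason, headers, body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return version, int(code), reason, headers, body


version, code, reason, headers, body = parse_response(SAMPLE_RESPONSE)
print(version, code, reason)
print(headers)
print(body)
```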
3. Response Status Codes
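In the response message, we should pay special attention to the server’s response status codes, which are often asked about in interviews. Below, I will only list the categories; for detailed status codes, please search online for more information.
-
1xx:
Informational; the request was received and processing continues.
-
2xx:
Success; the request was received, understood, and accepted.
-
3xx:
Redirection; further action is needed to complete the request.
-
4xx:
Client error; the request is malformed or cannot be fulfilled.
-
5xx:
Server error; the server failed to fulfill a valid request.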
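Status codes are grouped by their first digit, so a crawler can classify a response with one integer division. A minimal sketch follows; the function name status_category is an assumption for this example, while http.HTTPStatus is part of the Python standard library.

```python
from http import HTTPStatus

CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_category(code: int) -> str:
    """Map a status code to its category using the first digit."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return CATEGORIES[code // 100]


for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```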
6. Disconnecting the Connection
After the server has completed its response, a session ends. The question is, will the connection be disconnected at this time?
1. Long and Short Connections
Whether to disconnect depends on the HTTP version:
-
In HTTP/1.0, after the client and server complete a request/response, the previously established TCP connection is disconnected. The next time a request is made, a new TCP connection must be established, which is also known as a short connection.
-
Less than a year after the release of HTTP/1.0 (RFC 1945, May 1996), HTTP/1.1 was released in January 1997 (RFC 2068), introducing a new feature:
After the client and server complete a request/response, it allows the TCP connection to remain open, meaning that the next request can directly use this TCP connection without needing to re-establish a new connection, which is known as a long connection.
Note: A long connection means that a single TCP connection carries multiple HTTP request/response exchanges. Each HTTP exchange is still one request and one response; "long" or "short" describes the underlying TCP connection, not HTTP itself.
HTTP/1.1 was revised in 1999 (RFC 2616) and became widely adopted, so browsers now carry a parameter in the request header: Connection: keep-alive, indicating that the browser asks to establish a long connection with the server. In HTTP/1.1, connections are persistent by default, and the server can still decide whether it is willing to keep the connection open.
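The decision of whether a connection stays open after a response can be sketched as a small rule based on the protocol version and the Connection header. This is a simplified model with a made-up function name; real servers also apply timeouts and request limits.

```python
def is_persistent(version: str, headers: dict) -> bool:
    """Decide whether the TCP connection stays open after this exchange."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    tokens = {t.strip().lower() for t in normalized.get("connection", "").split(",")}

    if "close" in tokens:
        return False               # either side may ask to close
    if version == "HTTP/1.1":
        return True                # persistent by default
    return "keep-alive" in tokens  # HTTP/1.0 must opt in explicitly


print(is_persistent("HTTP/1.0", {}))
print(is_persistent("HTTP/1.0", {"Connection": "keep-alive"}))
print(is_persistent("HTTP/1.1", {}))
print(is_persistent("HTTP/1.1", {"Connection": "close"}))
```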
2. Advantages and Disadvantages of Long Connections
For servers, establishing long connections has both advantages and disadvantages:
-
Advantages:
When a website has a large number of static resources (images, CSS, JS, etc.), long connections can be enabled, allowing several images to be sent through a single TCP connection.
-
Disadvantages:
When a client makes one request and then requests nothing else, the server still keeps the long connection open, so it occupies resources for no benefit, which is a serious waste.
Therefore, whether to enable long connections and the duration of long connections need to be reasonably set based on the website itself.
PS: Do not underestimate this single TCP connection; in a complete HTTP request from the client (DNS resolution, establishing TCP connection, requesting, waiting, parsing the webpage, disconnecting the TCP connection), the time taken to establish the TCP connection is still significant.
3. Disconnecting the Connection Process
While establishing a TCP connection requires a three-way handshake, disconnecting a TCP connection requires a four-way handshake! As we mentioned earlier when discussing the TCP header information, the FIN flag notifies the other party that this end is closing the connection. Why does disconnecting require a four-way handshake? This is an exercise for you; please leave your understanding in the comments to see if it is correct.
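For reference while thinking about the exercise, the four segments exchanged during a normal close can be listed in a short sketch. It only records who sends what; the function name, the role labels "active" and "passive", and the starting sequence numbers are assumptions for this example.

```python
def four_way_close(active_seq: int, passive_seq: int) -> list:
    """List the four segments of a normal TCP close (FIN uses one seq number)."""
    return [
        # 1. The side that closes first sends FIN.
        {"from": "active", "flag": "FIN", "seq": active_seq, "ack": None},
        # 2. The other side acknowledges that FIN.
        {"from": "passive", "flag": "ACK", "seq": None, "ack": active_seq + 1},
        # 3. Later, the other side sends its own FIN.
        {"from": "passive", "flag": "FIN", "seq": passive_seq, "ack": None},
        # 4. The first side acknowledges it, and the close completes.
        {"from": "active", "flag": "ACK", "seq": None, "ack": passive_seq + 1},
    ]


steps = four_way_close(100, 300)
for step in steps:
    print(step)
```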
7. Aside
1. Interview Essential Question: Three-Way Handshake and Four-Way Handshake of TCP
Interviewer: Why does establishing a connection require a three-way handshake while closing a connection requires a four-way handshake? This is an exercise for you; please leave your insights in the comments!
2. HTTP/2.0
HTTP/1.1 has served us for about 20 years. HTTP/2 (often written HTTP/2.0) was released in 2015, but at the time of writing it has not yet been widely adopted. You can search online for more information about its new features.
3. HTTP & RPC
Because HTTP requests respond relatively slowly and carry large request headers, RPC is often used to call services in the era of microservices. Interested readers can learn about RPC concepts online.
4. HTTP & HTTPS
HTTP has two significant drawbacks: it is plaintext and cannot guarantee integrity, which is why it is gradually being replaced by HTTPS. I will explain HTTPS in the next article.
[End]