Core Principles of Web Crawlers: How an HTTP Request is Completed

Author: Da Mu Jiang

https://my.oschina.net/luozhou/blog/3003053

Overview

In the previous article, “Do You Know What Happens Behind the Scenes When You Ping?”, we used a real packet capture to analyze what happens during a <span>Ping</span> (a common interview question). We learned that <span>ping</span> relies on the <span>ICMP</span> protocol and also involves <span>ARP</span> requests in a local area network. Today, we will use the same packet capture approach to examine how the familiar <span>HTTP</span> request works.

Environment Preparation

I originally wanted to capture traffic from a real website, but a production site issues so many <span>HTTP</span> requests that they interfere with the analysis, so I reduced the setup to a <span>demo</span> service that returns a string in response to an <span>HTTP</span> request.

Environment:

1. Demo service responding to HTTP requests
2. Client IP: 192.168.2.135
3. Server: 45.76.105.92
4. Packet capture tool: Wireshark
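The original demo service is not shown, so the following is only a minimal Python sketch of what such a service might look like. The path `/test`, the port `8081`, and the `text/html` content type are taken from the capture later in this article; the response body `hello` and the names `DemoHandler` and `make_server` are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    # Speak HTTP/1.1 so that connections are persistent by default.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        if self.path == "/test":
            body = b"hello"  # assumed response string, not from the article
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
        else:
            body = b"not found"
            self.send_response(404)
            self.send_header("Content-Type", "text/plain")
        # Content-Length lets the client find the end of the body
        # without the server having to close the connection.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def make_server(host="0.0.0.0", port=8081):
    """Build (but do not start) the demo server."""
    return ThreadingHTTPServer((host, port), DemoHandler)

# To run it: make_server().serve_forever()
```

Calling `make_server().serve_forever()` on the server machine would then answer `GET /test` on port 8081.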

After deploying the <span>demo</span> to the server and starting it successfully, access the following:

[Screenshot: the demo service opened in the browser]

Open the packet capture tool <span>Wireshark</span> to capture packets, and the results are as follows:

[Screenshot: Wireshark capture of the request]

From the image above, we can see that we successfully captured an <span>HTTP</span> request and response, but we also see many <span>TCP</span> requests. Next, let’s analyze what these <span>TCP</span> requests are for.

Packet Capture Analysis

A) Three-Way Handshake

Initially, the local machine sent two requests to the server. We’ll explain why there are two later; for now, let’s focus on the connection from the local port that carries the <span>HTTP</span> request, as follows:

192.168.2.135:60738---->45.76.105.92:8081

From the screenshot above, we know this is the first handshake of the <span>TCP</span> protocol. Those familiar with the <span>TCP</span> protocol know that establishing a connection involves a three-way handshake, and disconnecting involves a four-way handshake.

Let’s look at the first request:

60738 -> 8081 [SYN] Seq=0 Win=64240 Len=0 Mss=1460 Ws=256 SACK_PERM=1

Let’s analyze this packet request information:

  • <span>60738->8081</span> Port number: Source port—> Destination port

  • <span>[SYN]</span>: Synchronization handshake signal

  • <span>Seq</span>: Sequence number

  • <span>Win</span>: TCP window size

  • <span>Len</span>: Length of the data carried by the segment

  • <span>Mss</span>: Maximum segment size

  • <span>Ws</span>: Window scale factor

  • <span>SACK_PERM</span>: SACK option, where 1 indicates SACK is enabled.
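To make the field list above concrete, here is a small Python sketch that splits a Wireshark-style summary line into its ports, flags, and `name=value` fields. The function name `parse_tcp_summary` and the regular expressions are my own illustration, not part of Wireshark.

```python
import re


def parse_tcp_summary(line):
    """Split a summary such as '60738 -> 8081 [SYN] Seq=0 ...' into its parts."""
    m = re.match(r"\s*(\d+)\s*->\s*(\d+)\s*\[([^\]]+)\]\s*(.*)", line)
    if m is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    src, dst, flags, rest = m.groups()
    # Field names are upper-cased because the capture mixes 'Mss' and 'MSS'.
    fields = {k.upper(): int(v) for k, v in re.findall(r"(\w+)=(\d+)", rest)}
    return {
        "src_port": int(src),
        "dst_port": int(dst),
        "flags": [f for f in re.split(r"[,\s]+", flags.strip()) if f],
        "fields": fields,
    }


syn = parse_tcp_summary(
    "60738 -> 8081 [SYN] Seq=0 Win=64240 Len=0 Mss=1460 Ws=256 SACK_PERM=1"
)
print(syn["src_port"], syn["dst_port"], syn["flags"])
print(syn["fields"])
```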

Before explaining these concepts briefly, here is the <span>TCPHeader</span> data structure diagram, which makes the layout of the <span>TCP</span> header easier to follow:

[Figure: TCP header data structure]

1. Win: TCP Window Size is the number of bytes the sender of the segment is currently willing to receive, and it can be adjusted dynamically. This is the <span>TCP</span> sliding window, which controls the rate at which data is sent by changing the window size. In the image above the field occupies <span>2</span> bytes, or <span>16</span> bits, so the largest value it can hold is <span>2^16-1=65535</span>. Therefore, without extensions, the maximum window size the <span>TCP</span> header can express is <span>65535</span> bytes, or about <span>64KB</span>.

2. Len: Message Length is the length of the data carried by the segment. An entire <span>TCP</span> packet = <span>Header</span> + <span>packSize</span>, and <span>Len</span> counts only the <span>packSize</span> part, which in this analysis is the size of the <span>HTTP</span> message. This is why the handshake packets show <span>Len=0</span>.

3. Mss: Maximum Segment Size is the largest amount of data a single <span>TCP</span> segment may carry. To transmit efficiently, each side announces its <span>MSS</span> during connection establishment, and the value is usually derived from the <span>MTU</span> by subtracting the header sizes of the <span>IP</span> packet (<span>20Bytes</span>) and the <span>TCP</span> segment (<span>20Bytes</span>). With the common Ethernet <span>MTU</span> of <span>1500</span> bytes, the typical <span>MSS</span> is therefore <span>1460</span>, which matches the value in our packet capture.
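The arithmetic is simple enough to check in a few lines of Python. This sketch assumes IPv4 and TCP headers without options (20 bytes each); the helper names are hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu):
    """Largest TCP payload that fits into one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def mtu_for_mss(mss):
    """MTU implied by an advertised MSS, under the same assumptions."""
    return mss + IPV4_HEADER + TCP_HEADER


print(mss_for_mtu(1500))  # 1460, the value in the client's SYN
print(mtu_for_mss(1420))  # 1460
```

The second line uses the `MSS=1420` that the server advertises in its `[SYN,ACK]` in this capture; under the same assumptions, that would correspond to an MTU of 1460 on the server side.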

4. Ws: Window Scaling Adjustment Factor: As mentioned earlier, the largest window the <span>TCP</span> header can express on its own is about <span>64KB</span>, which is insufficient in today’s high-speed internet age. To support more buffered data, RFC 1323 (since replaced by RFC 7323) specifies <span>TCP</span> extension options, one of which is the window scale factor. How does this work? Each side announces its own factor in its <span>[SYN]</span> segment during the synchronization phase. In the packet data above, the client announces <span>WS=256</span>. The window field of a <span>[SYN]</span> segment is itself never scaled, so the factor first takes effect in the <span>ACK</span> phase. The packet where it takes effect is as follows:

 60738 ->8081   [ACK] Seq=1 ACK=1 Win=66560 Len=0

We find that the window size has become <span>66560</span>, larger than the <span>16</span>-bit field alone could express. Upon examining the packet details:

[Screenshot: Wireshark details of the ACK packet]

We discover that the actual window declared in the request is <span>260</span>, and the <span>WS</span> extension factor is <span>256</span>. The final calculated window size is <span>260*256=66560</span>.
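The same calculation can be written as a short Python sketch. On the wire the option carries a shift count rather than the multiplier, so `WS=256` corresponds to a shift of 8. The function names are my own, and the server-side raw value `237` is derived from `30336 / 128` rather than read from the capture.

```python
def scale_shift(factor):
    """Shift count carried in the Window Scale option for a given multiplier."""
    shift = factor.bit_length() - 1
    if factor <= 0 or (1 << shift) != factor:
        raise ValueError("the scale factor must be a power of two")
    if shift > 14:
        raise ValueError("RFC 7323 limits the shift count to 14")
    return shift


def effective_window(raw_window, factor):
    """Window in bytes: the 16-bit header value shifted left by the scale."""
    if not 0 <= raw_window <= 0xFFFF:
        raise ValueError("the window field is 16 bits wide")
    return raw_window << scale_shift(factor)


print(scale_shift(256))            # 8
print(effective_window(260, 256))  # 66560, the client's window
print(effective_window(237, 128))  # 30336, the server's window
```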

5. SACK_PERM: SACK Option: We know that <span>TCP</span> transmission has a packet acknowledgment mechanism. By default, the receiving end sends an <span>ACK</span> confirmation after receiving a packet. However, acknowledgments are cumulative, meaning that if packets <span>A</span>, <span>B</span>, and <span>C</span> are sent, and <span>A</span> and <span>C</span> are received but <span>B</span> is not, the receiver cannot acknowledge <span>C</span> until <span>B</span> arrives. <span>TCP</span> also has a timeout retransmission mechanism, so a packet that stays unacknowledged for long enough is assumed to be lost and is retransmitted. Here that means <span>C</span> may be sent again even though it already arrived, which wastes bandwidth. To solve this problem, <span>SACK</span> adds a selective acknowledgment mechanism. When <span>SACK</span> is enabled, the receiving end also reports the out-of-order blocks it has already received, so the sending end only needs to retransmit genuinely lost packets.
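The A, B, C example can be sketched in Python as follows. This is a simplified model, not how a real stack is implemented: it assumes each packet is 100 bytes, that no data arrives before the starting sequence number, and it ignores the limit on how many SACK blocks fit into one segment. The function name `ack_state` is hypothetical.

```python
def ack_state(received, start=1):
    """Return (cumulative_ack, sack_blocks) for a list of (seq, length) segments."""
    # Merge the received byte ranges into sorted, non-overlapping [left, right) blocks.
    blocks = []
    for left, right in sorted((seq, seq + length) for seq, length in received):
        if blocks and left <= blocks[-1][1]:
            blocks[-1][1] = max(blocks[-1][1], right)
        else:
            blocks.append([left, right])
    # The cumulative ACK only advances over data that is contiguous from the start.
    ack = start
    if blocks and blocks[0][0] <= start:
        ack = max(start, blocks[0][1])
        blocks = blocks[1:]
    # Whatever remains was received out of order and is reported via SACK.
    return ack, [tuple(b) for b in blocks]


A, B, C = (1, 100), (101, 100), (201, 100)

print(ack_state([A, C]))     # B is missing: (101, [(201, 301)])
print(ack_state([A, B, C]))  # everything arrived: (301, [])
```

With only `A` and `C` received, the cumulative ACK stays at 101 while the SACK block tells the sender that bytes 201 to 300 have arrived, so only `B` needs to be retransmitted.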

After this brief introduction to the basic concepts, let’s summarize the <span>HTTP</span> request process based on the packet capture, where the local port for the <span>HTTP</span> request is <span>60738</span>. The summarized process is as follows:

------------------------Request Connection--------------------------
1) 60738 -> 8081 [SYN] Seq=0 Win=64240 Len=0 Mss=1460 Ws=256 SACK_PERM=1
2) 8081 -> 60738 [SYN,ACK] Seq=0 ACK=1 Win=29200 Len=0 MSS=1420 SACK_PERM=1 WS=128
3) 60738 -> 8081  [ACK] Seq=1 ACK=1 Win=66560 Len=0
4) GET /test HTTP/1.1
5) 8081 -> 60738  [ACK] Seq=1 ACK=396 Win=30336 Len=0
6) HTTP/1.1 200 (text/html)
7) 60738 -> 8081  [ACK] Seq=396 ACK=120 Win=66560 Len=0

------------------Disconnect Connection-----------------------------
8) 60738 -> 8081 [FIN ACK] Seq=396 Ack=120 Win=66560 Len=0
9) 8081 -> 60738  [FIN ACK] Seq=120 Ack=397 Win=30336 Len=0
10) 60738 -> 8081 [ACK] Seq=397 Ack=121 Win=66560 Len=0

From the process above, we can see that <span>Packet 1</span> to <span>Packet 3</span> are clearly the three-way handshake. <span>Packet 4</span> is the <span>HTTP</span> request, <span>Packet 5</span> is the <span>TCP</span> acknowledgment of that request, <span>Packet 6</span> is the <span>HTTP</span> response, and <span>Packet 7</span> is the acknowledgment of the response.
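The <span>Seq</span> and <span>Ack</span> numbers in the listing follow a simple rule: every payload byte advances the sequence number by one, and a <span>SYN</span> or <span>FIN</span> flag also consumes one sequence number. The Python sketch below replays the capture under that rule. The lengths 395 and 119 are inferred from the Ack values (396 - 1 and 120 - 1), not read directly from the capture, and `next_seq` is a hypothetical helper name.

```python
def next_seq(seq, payload_len, syn=False, fin=False):
    """Ack number the peer is expected to send after receiving this segment."""
    return seq + payload_len + int(syn) + int(fin)


REQUEST_LEN = 395   # inferred: the server answers Ack=396 to client Seq=1
RESPONSE_LEN = 119  # inferred: the client answers Ack=120 to server Seq=1

steps = [
    ("client SYN", next_seq(0, 0, syn=True)),
    ("server SYN,ACK", next_seq(0, 0, syn=True)),
    ("client GET /test", next_seq(1, REQUEST_LEN)),
    ("server HTTP 200", next_seq(1, RESPONSE_LEN)),
    ("client FIN", next_seq(396, 0, fin=True)),
    ("server FIN", next_seq(120, 0, fin=True)),
]

for name, ack in steps:
    print(f"{name:18} -> peer replies with Ack={ack}")
```

The computed values 1, 1, 396, 120, 397, and 121 match the Ack fields of packets 2, 3, 5, 7, 9, and 10 in the listing.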

B) Four-Way Handshake

Packets <span>8</span>, <span>9</span>, and <span>10</span> were captured after I closed the browser, and closing the browser breaks the <span>TCP</span> connection. Some of you may have noticed a problem: a disconnection involves <span>4</span> handshakes, yet only three packets were captured. I assure you that I did not make a mistake; this is the actual packet capture. Why are there only three? Let’s analyze:

Under normal circumstances, a connection is broken with <span>4</span> handshakes. The <span>4</span> handshake process is illustrated in the image below:

[Figure: the TCP four-way handshake]

Analyzing this image, the handshake process is as follows:

1. The client sends a FIN to request the disconnect, entering FIN-WAIT-1 state
2. The server confirms the disconnect request, entering CLOSE-WAIT state; on receiving this ACK the client enters FIN-WAIT-2 state
3. The server sends its own FIN disconnect request, entering LAST-ACK state
4. The client confirms the server's disconnect request, entering TIME-WAIT state

We find that both <span>step 2</span> and <span>step 3</span> are sent by the server. Is it possible to combine these two and send them to the client in one go? The answer is: Yes. Section <span>4.2</span> of RFC 2581 states that an <span>ACK</span> can be delayed, as long as it is sent within <span>500ms</span> of the packet it acknowledges. A delayed <span>ACK</span> can therefore travel on the next segment the server sends, which here is its own <span>FIN</span>. Based on this, consider the following packet:

 9) 8081 -> 60738  [FIN ACK] Seq=120 Ack=397 Win=30336 Len=0

This packet combines the acknowledgment of the client’s <span>FIN</span> with the server’s own <span>FIN</span> disconnect signal. In the details of this packet, the red box indicates that <span>Packet 9</span> is the <span>ACK</span> confirmation for <span>Frame 500</span>. From the initial screenshot, we can see that Frame 500 is <span>Packet 8</span>:

 8) 60738 -> 8081 [FIN ACK] Seq=396 Ack=120 Win=66560 Len=0

[Screenshot: Wireshark details of Packet 9]

Moreover, <span>Packet 9</span> itself carries a <span>FIN</span> signal, so <span>Packet 9</span> merges the <span>ACK</span> and the <span>FIN</span>. The typical <span>4</span> handshakes are therefore reduced to <span>3</span> after merging.
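The two variants of the close can be laid out side by side in a small Python sketch. It is a plain lookup of the standard TCP state names for a close initiated by the client, not a full state machine; each entry shows the states of both ends after the segment has been delivered. The function name `close_trace` is hypothetical.

```python
def close_trace(merged):
    """Segments of a client-initiated close as
    (direction, flags, client_state_after, server_state_after)."""
    trace = [("client -> server", "FIN,ACK", "FIN-WAIT-1", "CLOSE-WAIT")]
    if merged:
        # The server acknowledges the client's FIN and sends its own FIN together.
        trace.append(("server -> client", "FIN,ACK", "TIME-WAIT", "LAST-ACK"))
    else:
        # The server acknowledges first and sends its FIN in a separate segment.
        trace.append(("server -> client", "ACK", "FIN-WAIT-2", "CLOSE-WAIT"))
        trace.append(("server -> client", "FIN,ACK", "TIME-WAIT", "LAST-ACK"))
    trace.append(("client -> server", "ACK", "TIME-WAIT", "CLOSED"))
    return trace


for merged in (False, True):
    trace = close_trace(merged)
    print(f"merged={merged}: {len(trace)} segments")
    for direction, flags, client, server in trace:
        print(f"  {direction}  [{flags}]  client={client}  server={server}")
```

The merged variant has three segments, which is the shape of packets 8, 9, and 10 in this capture.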

This concludes a complete <span>HTTP</span> request, and the entire process is illustrated as follows:

[Figure: the complete HTTP request process]

C) Keep-Alive

  • Some of you may ask: since this is a complete <span>HTTP</span> request, does every request involve a three-way handshake?

The answer is: no, the current protocol does not require it.

In <span>HTTP0.9</span> and <span>HTTP1.0</span>, each request-response opened a new connection and therefore required its own three-way handshake. Some <span>HTTP1.0</span> implementations attempted persistent connections with the <span>Keep-Alive</span> parameter, but it was not officially supported. In the <span>HTTP1.1</span> protocol, connections are persistent by default, so several requests can share one <span>TCP</span> connection. Note that the heartbeat packets discussed below are sent by <span>TCP</span>-level keep-alive, which is a separate mechanism from the <span>HTTP</span> <span>Keep-Alive</span> header. The <span>Keep-Alive</span> serves two main purposes:

1. Check for dead nodes
2. Prevent connections from being closed due to inactivity
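Before looking at the two purposes in detail, the persistent connection itself can be observed with a short Python sketch. It starts a throwaway local server that speaks HTTP/1.1, sends two requests through one `http.client.HTTPConnection`, and records the local port used for each. The handler, the `ok` body, and the function name `ports_for_two_requests` are assumptions for illustration and are not the article's demo.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # persistent connections by default

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def ports_for_two_requests():
    """Send two GET requests over one connection; return the local port of each."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    ports = []
    try:
        for _ in range(2):
            conn.request("GET", "/test")
            # Local port of the socket that carried this request.
            ports.append(conn.sock.getsockname()[1])
            conn.getresponse().read()
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return ports


ports = ports_for_two_requests()
print(ports, "same connection:", ports[0] == ports[1])
```

Both requests report the same local port, which means the second request reused the existing <span>TCP</span> connection and no second three-way handshake was needed.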
  • Check for Dead Nodes

This is primarily to quickly detect connection failures and reconnect. For instance, if nodes <span>A</span> and <span>B</span> have established a connection, and node <span>B</span> goes down for some reason, while node <span>A</span> is unaware, two scenarios can occur:

1. If node <span>B</span> has not yet recovered, node <span>A</span> will keep retransmitting until it times out and concludes that node <span>B</span> is dead.

2. If node <span>B</span> recovers before node <span>A</span> sends data, then when node <span>A</span> sends data, node <span>B</span> will not accept it and will reply with a <span>RST</span> signal (when data arrives for a closed <span>socket</span>, a <span>RST</span> packet is sent, asking the other end to close the abnormal connection without replying with an <span>ACK</span>), which tells node <span>A</span> that it needs to reconnect to node <span>B</span>.

In both scenarios, only when data is sent does node <span>A</span> realize that the other end has encountered an issue. With <span>Keep-Alive</span>, heartbeat signals are sent periodically, allowing for quick detection of the server node’s status.

  • Prevent Connections from Being Closed Due to Inactivity

We know that establishing and maintaining network connections consume resources, and the number of connections a server can establish is limited. Therefore, firewalls or operating systems may release inactive connections to save resources. The <span>Keep-Alive</span> sends a heartbeat packet periodically to inform the firewall or operating system that this connection is active and should not be terminated.

Later, I captured packets with <span>Keep-Alive</span> included, as shown in the screenshot below:

[Screenshot: Wireshark capture with Keep-Alive packets]

In the image, the last two packets are the <span>Keep-Alive</span> exchange, which the server confirms with an <span>ACK</span>. The <span>keep-alive</span> probe is simply a packet carrying one byte, and that is how <span>keep-alive</span> is implemented.
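For reference, this is roughly how an application asks the operating system to send such probes on a <span>TCP</span> socket. It is a hedged Python sketch: `SO_KEEPALIVE` is widely available, but the tuning options `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are platform-specific, so they are set only where Python exposes them. The timing values and the function name `enable_keepalive` are arbitrary assumptions, not what any browser uses.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keep-alive probes for an existing socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first probe (where supported).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between probes (where supported).
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Unanswered probes before the connection is considered dead (where supported).
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print("keep-alive enabled:",
      sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```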

Returning to the initial question: why does one <span>HTTP</span> request involve handshakes on two ports? This is unrelated to the protocol itself. The first packet capture was taken with Google Chrome, while the last one was taken with Firefox, and a careful comparison shows that Firefox performs the three-way handshake on only one port. The behavior therefore comes from the browser’s implementation. Why does Google Chrome implement it this way? My guess is that it tries to keep HTTP access as available as possible: if one port is unavailable, it can immediately switch to the other to complete the <span>HTTP</span> request and response. (This is a personal guess; if there is an authoritative answer, please feel free to discuss in the comments.)

Conclusion

  • <span>HTTP</span> requests rely on <span>TCP</span> connections, and the first connection starts with a <span>TCP</span> three-way handshake.

  • <span>HTTP</span> utilizes <span>Keep-Alive</span> for persistent connections, and heartbeat packets are sent periodically to show that the connection is still active.

  • Closing an <span>HTTP</span> connection triggers the <span>TCP</span> four-way handshake, but when the server’s <span>ACK</span> and <span>FIN</span> are merged into one packet, only three packets are exchanged.
