The Most Detailed Summary of TCP/IP Knowledge

From: Juejin, Author: Ruheng
Link: https://juejin.im/post/6844903490595061767

1. TCP/IP Model

TCP/IP (Transmission Control Protocol/Internet Protocol) is a suite of network protocols that forms the foundation of the Internet.

The reference model based on TCP/IP divides the protocols into four layers: the Link Layer, Network Layer, Transport Layer, and Application Layer. The diagram below illustrates the correspondence between the TCP/IP model and the OSI model layers.

The TCP/IP protocol suite is layered from top to bottom, and each layer's data is wrapped by the layer beneath it. The topmost layer is the Application Layer, which includes familiar protocols such as HTTP and FTP. The second layer is the Transport Layer, where the well-known TCP and UDP protocols reside. The third layer is the Network Layer, where the IP protocol adds IP addresses and other data that determine the transmission target. The fourth layer is the Data Link Layer, which adds an Ethernet header to the data to be transmitted and appends a CRC checksum in preparation for the final transmission.

The above image clearly shows the role of each layer in the TCP/IP protocol suite. Communication over TCP/IP corresponds to pushing data onto a stack and popping it off again (encapsulation and decapsulation). On the way down, the sender wraps the data in a header (and, at the link layer, a trailer) at each layer, adding the transmission information needed to reach the destination. On the way up, the receiver strips those headers and trailers layer by layer to recover the transmitted data.

The above image uses the HTTP protocol as an example to explain in detail.
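The wrapping and unwrapping described above can be sketched in a few lines of Python. This is only a toy model: the bracketed text labels stand in for real binary headers, and the names LAYERS, TRAILER, encapsulate, and decapsulate are made up for illustration.

```python
# Toy headers: real headers are binary structures, not text labels.
LAYERS = ["TCP", "IP", "ETH"]  # order in which the sender adds headers
TRAILER = b"[FCS]"             # stand-in for the link-layer trailer


def encapsulate(payload: bytes) -> bytes:
    """Sender side: each layer prepends its header; the link layer appends a trailer."""
    data = payload
    for name in LAYERS:
        data = f"[{name}]".encode() + data
    return data + TRAILER


def decapsulate(frame: bytes) -> bytes:
    """Receiver side: strip the trailer, then the headers from the outermost inward."""
    data = frame[:-len(TRAILER)]
    for name in reversed(LAYERS):
        header = f"[{name}]".encode()
        if not data.startswith(header):
            raise ValueError(f"missing {name} header")
        data = data[len(header):]
    return data


frame = encapsulate(b"GET / HTTP/1.1")
print(frame)
print(decapsulate(frame))
```

The outermost header is the one added last, which is why the receiver removes headers in the reverse of the order in which the sender added them.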

2. Data Link Layer

The Physical Layer is responsible for converting between the bit stream of 0s and 1s and the physical medium’s voltage levels or light signals. The Data Link Layer divides the sequence of 0s and 1s into data frames and transmits them from one node to a neighboring node. Nodes are uniquely identified by MAC addresses (a MAC address is a physical address; each network interface has one).

Encapsulation into Frames: Add headers and trailers to the Network Layer datagram to encapsulate it into a frame, with the frame header including the source and destination MAC addresses.
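Transparent Transmission: Achieved with zero-bit stuffing or escape characters, so that payload bytes are never mistaken for frame delimiters.
Reliable Transmission: Rarely used on links with a low error rate, but wireless links such as WLAN provide reliable transmission at this layer.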
Error Detection (CRC): The receiver detects errors; if an error is found, the frame is discarded.
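The error-detection step can be illustrated with Python's zlib.crc32. This is a sketch under simplifying assumptions: the helper names make_frame and check_frame are hypothetical, and the frame layout (payload followed by a 4-byte checksum) is not the exact Ethernet bit and byte ordering.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Sender: append a 4-byte CRC-32 of the payload as a trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Receiver: recompute the CRC; return the payload, or None to discard the frame."""
    payload, trailer = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload


frame = make_frame(b"hello")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit in transit
print(check_frame(frame))      # intact frame is accepted
print(check_frame(corrupted))  # corrupted frame is discarded
```

Note that the receiver only detects the error and drops the frame; any retransmission is left to other mechanisms.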

3. Network Layer

1. IP Protocol

The IP protocol is the core of the TCP/IP protocol suite, and all TCP, UDP, ICMP, and IGMP data is transmitted in IP datagrams. It is important to note that IP is not a reliable protocol: it provides no mechanism for handling data that was not delivered, which is considered the responsibility of upper-layer protocols such as TCP.

1.1 IP Address

In the Data Link Layer, we generally identify different nodes using MAC addresses, while at the IP layer, we also need a similar address identifier, which is the IP address.

A 32-bit IP address is divided into network bits and host bits, which reduces the number of records in the routing tables of routers. Terminals with the same network address fall within the same range, so the routing table only needs to maintain one direction for that network address in order to reach all of the corresponding terminals.

Class A IP Address: 0.0.0.0~127.255.255.255
Class B IP Address: 128.0.0.0~191.255.255.255
Class C IP Address: 192.0.0.0~223.255.255.255
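In classful addressing the class is determined by the first octet alone (A: 0 to 127, B: 128 to 191, C: 192 to 223, with 224 and above reserved for multicast and experimental use). A minimal sketch follows; address_class is a hypothetical helper name, not a standard library function.

```python
def address_class(ip: str) -> str:
    """Return the classful address class (A to E) based on the first octet."""
    first_octet = int(ip.split(".")[0])
    if first_octet < 128:
        return "A"
    if first_octet < 192:
        return "B"
    if first_octet < 224:
        return "C"
    if first_octet < 240:
        return "D"  # multicast
    return "E"      # reserved


for ip in ["10.1.2.3", "172.16.0.1", "192.168.1.1"]:
    print(ip, address_class(ip))
```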

1.2 IP Protocol Header

Here we only introduce: the eight-bit TTL field. This field specifies how many routers the data packet can pass through before being discarded. Each time an IP data packet passes through a router, the TTL value decreases by 1. When the TTL value becomes zero, the packet is automatically discarded.

The maximum value of this field is 255, meaning a packet can take at most 255 hops before being discarded. The initial value is set by the sender and varies by system; 32, 64, and 128 are typical.
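The TTL rule can be simulated as follows. This is an illustrative sketch, assuming a path made up only of routers that each decrement the TTL by exactly 1; the function name forward is hypothetical.

```python
def forward(ttl: int, routers_on_path: int):
    """Simulate a packet crossing routers; each router decrements the TTL by 1.

    Returns ("discarded", router_index) if some router drops the packet,
    or ("delivered", remaining_ttl) if it reaches the destination.
    """
    for router in range(1, routers_on_path + 1):
        ttl -= 1
        if ttl == 0:
            return ("discarded", router)
    return ("delivered", ttl)


print(forward(64, 3))  # plenty of TTL left for a short path
print(forward(2, 3))   # dropped by the second router
```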

2. ARP and RARP Protocols

ARP is a protocol for obtaining MAC addresses based on IP addresses.

ARP (Address Resolution Protocol) is a resolution protocol. Initially, a host does not know which MAC address an IP address corresponds to. When a host wants to send an IP packet, it first checks its ARP cache (an IP-to-MAC address mapping cache).

If the queried IP-MAC pair does not exist, the host sends an ARP broadcast packet to the network containing the IP address to be queried. All hosts that directly receive this broadcast will check their IP addresses, and if any host finds it meets the criteria, it prepares an ARP packet containing its MAC address to send back to the broadcasting host.

When the broadcasting host receives the ARP packet, it updates its ARP cache (the place where the IP-MAC mapping table is stored). The broadcasting host will then use the updated ARP cache to prepare the data link layer packet for transmission.
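The cache-then-broadcast logic can be sketched as below. No real packets are sent: the LAN dictionary is a made-up stand-in for the hosts that would answer a broadcast, and the addresses and the names arp_cache, broadcasts, and resolve are hypothetical.

```python
# Hypothetical LAN: which MAC address owns which IP address.
LAN = {
    "192.168.1.1": "aa:aa:aa:aa:aa:01",
    "192.168.1.20": "aa:aa:aa:aa:aa:14",
}

arp_cache = {}   # IP -> MAC mappings learned so far
broadcasts = []  # IPs we had to ask the whole LAN about


def resolve(ip: str):
    """Return the MAC address for ip, broadcasting only on a cache miss."""
    if ip in arp_cache:
        return arp_cache[ip]
    broadcasts.append(ip)  # ARP request: "who has <ip>?"
    mac = LAN.get(ip)      # only the owner of the IP replies
    if mac is not None:
        arp_cache[ip] = mac  # update the cache from the ARP reply
    return mac


print(resolve("192.168.1.20"))  # cache miss, triggers a broadcast
print(resolve("192.168.1.20"))  # answered from the cache
print(len(broadcasts))          # only one broadcast was needed
```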

The RARP protocol works in the opposite way, obtaining an IP address based on a MAC address; no further elaboration is needed.

3. ICMP Protocol

The IP protocol is not a reliable protocol; it does not guarantee that data is delivered. Therefore, naturally, the task of ensuring data delivery should be handled by other modules. One important module is the ICMP (Internet Control Message Protocol). ICMP is not a high-level protocol but a protocol at the IP layer.

When errors occur in transmitting IP data packets, such as unreachable hosts or unreachable routes, the ICMP protocol will package the error information and send it back to the host, giving the host a chance to handle the error. This is why it is said that protocols built on top of the IP layer can potentially achieve reliability.

4. Ping

Ping can be regarded as the most famous application of ICMP and is part of the TCP/IP protocol. The “ping” command can check whether the network is reachable and is very helpful in analyzing and diagnosing network failures.

For example, when we cannot access a certain website, we usually ping that website. Ping will echo some useful information. General information is as follows:

The word ping originates from sonar positioning, and the program serves a similar purpose: it uses ICMP packets to detect whether another host is reachable. It sends an ICMP Echo Request (type 8), and the host receiving the request responds with an ICMP Echo Reply (type 0).
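As an illustration of what ping puts on the wire, the sketch below builds an ICMP Echo Request (type 8, code 0; the matching Echo Reply is type 0) with the standard Internet checksum. It only constructs the bytes; actually sending them would require a raw socket and usually elevated privileges. The function names and the default payload are assumptions made for this example.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:   # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def echo_request(identifier: int, sequence: int, payload: bytes = b"ping") -> bytes:
    """Build an ICMP Echo Request: type=8, code=0, checksum, id, sequence, payload."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


packet = echo_request(identifier=1, sequence=1)
print(packet.hex())
```

A receiver can validate the packet by running the same checksum over all of its bytes; a correct packet yields 0.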

5. Traceroute

Traceroute is an important tool for detecting the routing situation between a host and the destination host, and it is also one of the most convenient tools.

The principle of Traceroute is very interesting. After obtaining the destination host’s IP address, it first sends a UDP packet with TTL=1 toward the destination host. The first router that receives this packet decreases the TTL by 1; since the TTL is now 0, the router discards the packet and sends an ICMP Time Exceeded message back to the host. The host then sends a UDP packet with TTL=2, prompting the second router to send an ICMP Time Exceeded message, and so on. When a probe finally reaches the destination host, the destination replies with an ICMP Port Unreachable message instead, because the probe is addressed to a UDP port that is unlikely to be in use; this tells Traceroute to stop. In this way Traceroute obtains the IP addresses of all routers along the path.
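The probing loop can be simulated without touching the network. In this sketch the router path and destination are invented documentation addresses, and probe is a hypothetical stand-in for sending a UDP packet and waiting for the ICMP reply (Time Exceeded from an intermediate router, Port Unreachable from the destination).

```python
# Hypothetical path: three routers, then the destination host.
PATH = ["192.0.2.1", "198.51.100.1", "203.0.113.1"]
DEST = "203.0.113.9"


def probe(ttl: int):
    """Pretend to send a UDP probe with this TTL; return (replier_ip, icmp_message)."""
    if ttl <= len(PATH):
        return PATH[ttl - 1], "time exceeded"  # TTL ran out at this router
    return DEST, "port unreachable"            # probe reached the destination


def traceroute(max_hops: int = 30):
    """Raise the TTL one hop at a time until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        ip, message = probe(ttl)
        hops.append(ip)
        if message == "port unreachable":
            break
    return hops


for hop_number, ip in enumerate(traceroute(), start=1):
    print(hop_number, ip)
```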

6. TCP/UDP

TCP and UDP are both transport layer protocols, but they have different characteristics and applications. The following table compares and analyzes them.

Message-Oriented

The message-oriented transmission method allows the application layer to specify the length of the message to UDP, which sends it as is, meaning it sends one message at a time. Therefore, the application must choose an appropriate message size. If the message is too long, the IP layer needs to fragment it, reducing efficiency. If it is too short, the IP packet may be too small.

Byte Stream-Oriented

In the byte stream-oriented case, although the interaction between the application program and TCP occurs one data block at a time (of varying sizes), TCP treats the application program as a continuous stream of unstructured bytes. TCP has a buffer; when the data block sent by the application program is too long, TCP can segment it into shorter pieces for transmission.
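The difference between the two models can be shown with a small in-memory sketch. No sockets are involved; udp_send and tcp_send are hypothetical helpers, and the mss value is chosen arbitrarily small to make the effect visible.

```python
def udp_send(messages):
    """Message-oriented: each application message becomes exactly one datagram."""
    return [bytes(m) for m in messages]


def tcp_send(messages, mss: int):
    """Byte-stream-oriented: writes are joined into one stream, then cut into
    segments of at most mss bytes, ignoring the original message boundaries."""
    stream = b"".join(messages)
    return [stream[i:i + mss] for i in range(0, len(stream), mss)]


writes = [b"hello", b"world!!"]
print(udp_send(writes))         # boundaries preserved
print(tcp_send(writes, mss=4))  # boundaries lost, only byte order is kept
```

This is why applications using TCP must define their own message framing (for example, a length prefix or a delimiter).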

Congestion control and flow control are key aspects of TCP, which will be explained later.

Some applications of TCP and UDP protocols

When Should TCP Be Used?

When there are quality requirements for network communication, for example: when all data must be accurately transmitted to the other party, this is often used in applications that require reliability, such as HTTP, HTTPS, FTP for file transfer protocols, and POP, SMTP for email transfer protocols.

When Should UDP Be Used?

When the quality requirements for network communication are low, and speed is a priority, UDP can be used.

7. DNS

DNS (Domain Name System) is a distributed database on the Internet that maps domain names to IP addresses, making it easier for users to access the Internet without having to remember the numeric IP addresses that machines use directly. The process of obtaining the IP address corresponding to a hostname is called domain name resolution (or hostname resolution). DNS queries normally run over UDP on port 53; TCP on port 53 is used for zone transfers and for responses too large for a single UDP message.
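To make the protocol a little more concrete, the sketch below builds the bytes of a DNS query for an A record, following the standard message layout (12-byte header, then the question). It does not send anything; build_query and the fixed query_id are assumptions made for this example.

```python
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a DNS query message asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A (1), QCLASS=IN (1)
    return header + question


query = build_query("example.com")
print(len(query), query.hex())
```

In practice such a message would be sent as a single UDP datagram to port 53 of a resolver.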

8. Establishment and Termination of TCP Connections

1. Three-Way Handshake

TCP is connection-oriented, meaning that before either side can send data, a connection must first be established between the two parties. In the TCP/IP protocol, the TCP protocol provides reliable connection services, and the connection is initialized through a three-way handshake. The purpose of the three-way handshake is to synchronize the sequence numbers and acknowledgment numbers between the two parties and exchange TCP window size information.

First Handshake: Establishing the connection. The client sends a connection request segment, setting SYN to 1 and Sequence Number to x; then, the client enters the SYN_SENT state, waiting for the server’s confirmation;

Second Handshake: The server receives the SYN segment. The server receives the client’s SYN segment and must confirm it by setting the Acknowledgment Number to x+1 (Sequence Number+1); at the same time, it must send its own SYN request information, setting SYN to 1 and Sequence Number to y; the server puts all the above information into a segment (the SYN+ACK segment) and sends it to the client, at which point the server enters the SYN_RECV state;

Third Handshake: The client receives the server’s SYN+ACK segment. It then sets the Acknowledgment Number to y+1 and sends an ACK segment to the server. The client enters the ESTABLISHED state once it has sent this segment, and the server enters it on receiving the segment, completing the TCP three-way handshake.
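The three steps can be traced with a small simulation. Segments are modeled as plain dictionaries rather than real TCP headers, and the function name and the initial sequence numbers passed in are arbitrary choices for illustration.

```python
def three_way_handshake(x: int, y: int):
    """Simulate the handshake with client ISN x and server ISN y.

    Returns (segments, client_state, server_state).
    """
    client, server = "CLOSED", "LISTEN"
    segments = []

    # 1) Client -> server: SYN carrying the client's initial sequence number.
    syn = {"SYN": 1, "seq": x}
    segments.append(syn)
    client = "SYN_SENT"

    # 2) Server -> client: SYN+ACK acknowledging x and carrying the server's ISN.
    syn_ack = {"SYN": 1, "ACK": 1, "seq": y, "ack": syn["seq"] + 1}
    segments.append(syn_ack)
    server = "SYN_RECV"

    # 3) Client -> server: ACK acknowledging y.
    ack = {"ACK": 1, "seq": x + 1, "ack": syn_ack["seq"] + 1}
    segments.append(ack)
    client = "ESTABLISHED"  # after sending the final ACK
    server = "ESTABLISHED"  # after receiving it

    return segments, client, server


segments, client, server = three_way_handshake(x=100, y=300)
for segment in segments:
    print(segment)
print(client, server)
```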

Why Three-Way Handshake?

To prevent invalid connection request segments from suddenly being sent to the server, resulting in errors.

A specific example: The generation of an “invalid connection request segment” occurs in a situation where the first connection request segment sent by the client does not get lost but is delayed at a network node for a long time, arriving at the server after the connection has been released. This is an already invalid segment. However, if the server receives this invalid connection request segment, it mistakenly believes it is a new connection request sent by the client.

Thus, it sends a confirmation segment to the client, agreeing to establish the connection. If we did not use the “three-way handshake,” as soon as the server sends the confirmation, a new connection would be established. Since the client has not sent a connection request, it will ignore the server’s confirmation and will not send data to the server. However, the server believes a new transport connection has been established and will continue to wait for data from the client. This would waste many of the server’s resources. The “three-way handshake” method prevents the above phenomenon from occurring. For instance, in the aforementioned situation, the client will not send a confirmation to the server. Since the server does not receive the confirmation, it knows that the client did not request to establish a connection.

2. Four-Way Termination

After the client and server establish a TCP connection through the three-way handshake, when the data transmission is complete, the TCP connection must be terminated. This is known as the mysterious “four-way termination”.

First Termination: Host 1 (which can be either the client or the server) sets the Sequence Number and sends a FIN segment to Host 2; at this point, Host 1 enters the FIN_WAIT_1 state; this indicates that Host 1 has no data to send to Host 2;

Second Termination: Host 2 receives the FIN segment sent by Host 1 and replies with an ACK segment, setting the Acknowledgment Number to Sequence Number plus 1; Host 1 enters the FIN_WAIT_2 state; Host 2 informs Host 1 that it “agrees” to the shutdown request;

Third Termination: Host 2 sends a FIN segment to Host 1, requesting to close the connection, while Host 2 enters the LAST_ACK state;

Fourth Termination: Host 1 receives the FIN segment sent by Host 2 and sends an ACK segment to Host 2, then Host 1 enters the TIME_WAIT state; after Host 2 receives Host 1’s ACK segment, it closes the connection; Host 1 waits for 2MSL (twice the Maximum Segment Lifetime) and, if no retransmitted FIN arrives from Host 2 in that time, concludes that Host 2 has closed normally and closes the connection itself.
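The state changes on each side can be written as a small transition table. This sketch covers only the normal close path described above (it omits simultaneous close); CLOSE_WAIT, which the steps do not name, is the standard state Host 2 is in between acknowledging Host 1's FIN and sending its own. The names TRANSITIONS and run are hypothetical.

```python
# (current_state, event) -> next_state, for the normal close path only.
TRANSITIONS = {
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",   # active closer starts
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",     # replies with the final ACK
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",   # passive closer replies with ACK
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}


def run(events, state="ESTABLISHED"):
    """Apply the events in order and return every state visited."""
    states = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]
        states.append(state)
    return states


host1 = run(["send FIN", "recv ACK", "recv FIN", "2MSL timeout"])
host2 = run(["recv FIN", "send FIN", "recv ACK"])
print(" -> ".join(host1))
print(" -> ".join(host2))
```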

Why Four-Way Termination?

The TCP protocol is a connection-oriented, reliable, byte-stream-based transport layer communication protocol. TCP operates in full-duplex mode, meaning that when Host 1 sends a FIN segment, it only indicates that Host 1 has no more data to send; it informs Host 2 that it has finished sending all its data. However, at this time, Host 1 can still receive data from Host 2; when Host 2 returns an ACK segment, it indicates that it has received the information that Host 1 has no data to send, but Host 2 can still send data to Host 1; when Host 2 also sends a FIN segment, it indicates that Host 2 also has no data to send, thereby informing Host 1 that it too has no data to send, and then both parties can happily terminate the TCP connection.

Why Wait for 2MSL?

MSL: Maximum Segment Lifetime, which is the longest time any segment can remain in the network before being discarded. There are two reasons:

To ensure that the TCP protocol’s full-duplex connection can be reliably closed.
To ensure that any duplicate segments from this connection disappear from the network.

First point: If Host 1 directly goes to CLOSED status, there is a risk that due to IP protocol unreliability or other network reasons, Host 2 may not receive Host 1’s final ACK. Then, after timing out, Host 2 may continue to send FIN. At this point, since Host 1 has already gone to CLOSED, it cannot find a corresponding connection for the retransmitted FIN. Therefore, Host 1 does not directly go to CLOSED but maintains a TIME_WAIT state, ensuring that it can confirm that the other party receives the ACK when it receives another FIN, thus properly closing the connection.

Second point: If Host 1 goes directly to CLOSED and then initiates a new connection to Host 2, we cannot guarantee that the new connection and the closed connection will have different port numbers. This means that it is possible for the new connection and the old connection to have the same port number. Generally, this will not cause issues, but special cases can arise: if the new connection and the already closed old connection have the same port number and some delayed data from the previous connection arrives at Host 2 after establishing the new connection, the TCP protocol will mistakenly treat that delayed data as belonging to the new connection, causing confusion with the actual data packets of the new connection. Therefore, TCP connections must remain in TIME_WAIT status for 2MSL to ensure that all data from this connection disappears from the network.

9. TCP Flow Control

If the sender sends data too quickly, the receiver may not be able to keep up, leading to data loss. Flow control ensures that the sender’s rate of transmission is not too fast, allowing the receiver to catch up.

Using the sliding window mechanism makes it easy to implement flow control on a TCP connection.

Suppose A sends data to B. During the connection setup, B informs A: “My receive window is rwnd = 400” (where rwnd represents the receiver window). Therefore, the sender’s sending window cannot exceed the value given by the receiver. Note that the TCP window is measured in bytes, not segments. Assume each segment is 100 bytes long and the initial sequence number is 1. Uppercase ACK indicates the acknowledgment bit in the header, while lowercase ack indicates the value of the acknowledgment field.

From the diagram, we can see that B performed three flow control actions. The first reduced the window to rwnd = 300, the second to rwnd = 100, and finally to rwnd = 0, meaning the sender is no longer allowed to send data. This state of pausing the sender will last until Host B issues a new window value.

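TCP maintains a persist timer for each connection. Whenever one side of a TCP connection receives a zero-window notification from the other side, it starts the persist timer. When the timer expires, that side sends a zero-window probe segment (carrying 1 byte of data), and the other side reports its current window value when it acknowledges the probe. If the window is still zero, the side that sent the probe resets its persist timer; otherwise it can resume sending. Without this timer, a lost window-update segment could leave both sides waiting for each other indefinitely.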
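The sender's side of the window arithmetic can be sketched as follows, using the 100-byte segments and byte-based window from the example. The function sendable_segments and its parameter names are hypothetical; ack is the value of the acknowledgment field (the next byte the receiver expects) and next_seq is the next byte the sender has not yet transmitted.

```python
def sendable_segments(next_seq: int, ack: int, rwnd: int, segment_size: int = 100) -> int:
    """How many more full segments the sender may transmit right now.

    The usable window covers bytes ack .. ack + rwnd - 1; bytes from ack up to
    next_seq - 1 are already in flight and count against it.
    """
    window_end = ack + rwnd  # first byte beyond the advertised window
    return max(0, (window_end - next_seq) // segment_size)


print(sendable_segments(next_seq=1, ack=1, rwnd=400))      # fresh connection
print(sendable_segments(next_seq=401, ack=201, rwnd=300))  # window shrank
print(sendable_segments(next_seq=601, ack=601, rwnd=0))    # zero window: must wait
```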

10. TCP Congestion Control

The sender maintains a state variable called the congestion window (cwnd). The size of the congestion window depends on the level of network congestion and changes dynamically. The sender sets its sending window equal to the congestion window, provided the receiver window is large enough; in general the sending window is the smaller of the two.

The principle behind controlling the congestion window is: as long as there is no congestion in the network, the congestion window increases to allow more packets to be sent. However, if network congestion occurs, the congestion window decreases to reduce the number of packets injected into the network.

The slow start algorithm:

When a host begins to send data, if it injects a large amount of data into the network immediately, it may cause network congestion, as the load of the network is not yet clear. Therefore, a better approach is to probe the network gradually, increasing the sending window size from small to large, meaning gradually increasing the congestion window value.

Typically, when starting to send segments, the congestion window (cwnd) is set to a value of one maximum segment size (MSS). For every new segment acknowledgment received, the congestion window is increased by at most one MSS. This gradual increase of the sender’s congestion window (cwnd) allows for a more reasonable rate of packet injection into the network.

After each transmission round, the congestion window (cwnd) doubles. The duration of a transmission round is essentially the round-trip time RTT. However, “transmission round” emphasizes that the segments allowed by the congestion window (cwnd) are sent continuously until the last byte sent is acknowledged.

Additionally, the “slow” in slow start does not refer to a slow growth rate of cwnd but rather that during the initial sending of segments, cwnd is set to 1, allowing the sender to send only one segment at first (to probe the network congestion) before gradually increasing cwnd.

To prevent the congestion window (cwnd) from growing too large and causing network congestion, a slow start threshold (ssthresh) state variable is set. The usage of the slow start threshold (ssthresh) is as follows:

When cwnd < ssthresh, use the slow start algorithm.
When cwnd > ssthresh, stop using the slow start algorithm and switch to the congestion avoidance algorithm.
When cwnd = ssthresh, either the slow start algorithm or the congestion avoidance algorithm can be used.

Congestion Avoidance

Congestion avoidance increases the congestion window (cwnd) gradually: for every round-trip time (RTT) that passes, the sender’s cwnd increases by 1 MSS rather than doubling. This allows cwnd to grow slowly in a linear manner, much slower than the growth rate of the congestion window in the slow start algorithm.

Whether in the slow start phase or the congestion avoidance phase, as long as the sender determines that network congestion has occurred (which is indicated by not receiving an acknowledgment), the slow start threshold (ssthresh) is set to half of the sender’s window value at the time of congestion (but not less than 2). Then, the congestion window (cwnd) is reset to 1, and the slow start algorithm is executed.

The purpose of this is to quickly reduce the number of packets sent into the network, giving congested routers enough time to process the packets queued in their buffers.

The following diagram illustrates the above process of congestion control with specific values. Now the size of the sending window is equal to that of the congestion window.
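The interplay of slow start, congestion avoidance, and the reaction to a timeout can be traced with a short simulation. This is a simplified sketch of the rules as described here, with cwnd counted in MSS units and one update per transmission round; the function name cwnd_trace, the initial ssthresh of 16, and the choice of timeout round are assumptions for illustration.

```python
def cwnd_trace(rounds: int, ssthresh: int = 16, timeout_rounds=()):
    """Return the cwnd (in MSS) used in each transmission round."""
    cwnd = 1
    trace = []
    for rnd in range(1, rounds + 1):
        trace.append(cwnd)
        if rnd in timeout_rounds:
            # Congestion detected: halve the threshold (not below 2), restart slow start.
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: double each round, but do not overshoot ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: add 1 MSS per round.
            cwnd += 1
    return trace


print(cwnd_trace(8))
print(cwnd_trace(12, timeout_rounds=(7,)))
```

In the second trace the timeout occurs while cwnd is 18, so ssthresh becomes 9 and the window restarts from 1.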

2. Fast Retransmit and Fast Recovery

Fast Retransmit

The fast retransmit algorithm requires the receiver to immediately send duplicate acknowledgments for each out-of-order segment received (to allow the sender to know as soon as possible that a segment has not reached the receiver) rather than waiting until it sends its own data to include the acknowledgment.

Suppose the receiver has received segments M1 and M2 and has sent acknowledgments for both. Now suppose the receiver did not receive M3 but subsequently received M4.

Clearly, the receiver cannot acknowledge M4 because it is an out-of-order segment. According to the reliable transmission principle, the receiver can choose to do nothing or send a timely acknowledgment for M2.

However, under the fast retransmit algorithm, the receiver should promptly send a duplicate acknowledgment for M2. This allows the sender to know as soon as possible that segment M3 has not arrived at the receiver. The sender then sends segments M5 and M6. The receiver receives these two segments and must again send a duplicate acknowledgment for M2. Thus, the sender has received four acknowledgments for M2 from the receiver, three of which are duplicate acknowledgments.

The fast retransmit algorithm also stipulates that as soon as the sender receives three duplicate acknowledgments, it should immediately retransmit the unacknowledged segment M3 without waiting for the retransmission timer for M3 to expire.

By retransmitting unacknowledged segments early, the overall throughput of the network can be improved by approximately 20%.
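The sender-side rule (retransmit after three duplicate acknowledgments) can be sketched as a counter. Here acknowledgments are modeled as segment indices, so an ack of 2 means "everything up to M2 has arrived"; this numbering and the name fast_retransmit are simplifications for illustration, since real TCP acknowledges byte offsets.

```python
def fast_retransmit(acks, threshold: int = 3):
    """Return the segment numbers the sender retransmits early.

    acks is the sequence of cumulative acknowledgments received by the sender.
    """
    last_ack, duplicates = None, 0
    retransmitted = []
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:
                retransmitted.append(ack + 1)  # first segment not yet acknowledged
        else:
            last_ack, duplicates = ack, 0
    return retransmitted


# ACKs for M1 and M2, then three duplicate ACKs for M2 caused by M4, M5, M6.
print(fast_retransmit([1, 2, 2, 2, 2]))
```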

Fast Recovery

The fast recovery algorithm is used in conjunction with fast retransmit and has the following two key points:

When the sender receives three consecutive duplicate acknowledgments, it executes the “multiplicative decrease” algorithm, halving the slow start threshold (ssthresh).
Unlike slow start, the congestion window (cwnd) is not set to 1; instead, it is set to the value of the slow start threshold (ssthresh) after halving, and then the congestion avoidance algorithm (“additive increase”) is executed, allowing the congestion window to increase slowly in a linear manner.
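The contrast between the two reactions to loss can be written down directly. This sketch follows the simplified rules given in this section, with windows in MSS units and the same floor of 2 on ssthresh used for the timeout case; actual TCP implementations additionally inflate cwnd temporarily while duplicate acknowledgments keep arriving, which is omitted here. Both function names are hypothetical.

```python
def on_timeout(cwnd: int):
    """Retransmission timer expired: halve ssthresh, restart from slow start."""
    ssthresh = max(cwnd // 2, 2)
    return 1, ssthresh  # (new cwnd, new ssthresh)


def on_three_duplicate_acks(cwnd: int):
    """Fast recovery: halve ssthresh and continue from there with additive increase."""
    ssthresh = max(cwnd // 2, 2)
    return ssthresh, ssthresh  # (new cwnd, new ssthresh)


print(on_timeout(24))
print(on_three_duplicate_acks(24))
```

Receiving duplicate acknowledgments shows that later segments are still getting through, so the network is probably not badly congested; that is the reasoning behind skipping the restart from cwnd = 1.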