Mastering TCP/IP: A Comprehensive Guide

From: Juejin, Author: Ruheng Link: https://juejin.im/post/6844903490595061767

1. TCP/IP Model

The TCP/IP protocol model (Transmission Control Protocol/Internet Protocol) consists of a series of network protocols that form the foundation of the Internet and are the core protocols of the Internet.

The reference model based on TCP/IP divides protocols into four layers: Link Layer, Network Layer, Transport Layer, and Application Layer. The diagram below shows the correspondence between the TCP/IP model and the OSI model layers.

The TCP/IP protocol suite is layered, with each layer encapsulating the one above it. The top layer is the Application Layer, which includes familiar protocols like HTTP and FTP. The second layer is the Transport Layer, where the well-known TCP and UDP protocols reside. The third layer is the Network Layer, which includes the IP protocol responsible for assigning IP addresses and other data to determine the transmission target. The fourth layer is the Data Link Layer, which adds an Ethernet protocol header to the data to be transmitted, along with CRC encoding, preparing for final data transmission.

The above diagram clearly illustrates the role of each layer in the TCP/IP protocol, and the communication process of the TCP/IP protocol corresponds to the process of data stacking and unstacking. During the stacking process, the sending party continuously encapsulates headers and trailers at each layer, adding transmission information to ensure delivery to the destination. During the unstacking process, the receiving party continuously removes headers and trailers at each layer to obtain the final transmitted data.

The above example illustrates the HTTP protocol specifically.

2. Data Link Layer

The Physical Layer is responsible for the conversion between the 0 and 1 bit streams and the physical device’s voltage levels and light flashes. The Data Link Layer is responsible for dividing the 0 and 1 sequences into data frames for transmission from one node to an adjacent node, which are uniquely identified by MAC addresses (MAC, physical address, each host has a MAC address).

Encapsulation into frames: Add headers and trailers to the network layer datagram, encapsulating it into frames, with the frame header including source and destination MAC addresses.
Transparent transmission: zero-bit padding, escape characters.
Reliable transmission: rarely used on links with very low error rates, but wireless links (WLAN) will ensure reliable transmission.
Error detection (CRC): the receiver checks for errors, and if an error is found, the frame is discarded.

3. Network Layer

1. IP Protocol

The IP protocol is the core of the TCP/IP protocol, and all TCP, UDP, ICMP, and IGMP data are transmitted in IP data format. It is important to note that IP is not a reliable protocol, meaning that it does not provide a mechanism for handling undelivered data, which is the responsibility of upper-layer protocols: TCP or UDP.

1.1 IP Address

In the Data Link Layer, we typically identify different nodes using MAC addresses, while in the IP layer we also need a similar address identifier, which is the IP address.

A 32-bit IP address is divided into network bits and address bits, which reduces the number of routing table entries in routers. With a network address, terminals with the same network address can be limited to a certain range, so the routing table only needs to maintain one entry for that network address to find the corresponding terminals.

Class A IP Address: 0.0.0.0~127.0.0.0 Class B IP Address: 128.0.0.1~191.255.0.0 Class C IP Address: 192.168.0.0~239.255.255.0

1.2 IP Protocol Header

This only introduces the eight-bit TTL field. This field specifies how many routers the data packet can pass through before being discarded. Each time an IP data packet passes through a router, the TTL value decreases by 1. When the TTL becomes zero, it is automatically discarded.

The maximum value of this field is 255, meaning a protocol packet can only pass through routers 255 times before being discarded. Depending on the system, this number may vary, usually being 32 or 64.

2. ARP and RARP Protocols

ARP is a protocol for obtaining MAC addresses based on IP addresses.

The ARP (Address Resolution Protocol) is a resolution protocol. Initially, the host does not know which interface of which host corresponds to the IP. When the host wants to send an IP packet, it first checks its ARP cache (which is an IP-MAC address mapping cache).

If the queried IP-MAC pair does not exist, the host sends an ARP broadcast packet over the network, which contains the IP address to be queried. All hosts that receive this broadcast will check their IP addresses, and if one host finds it matches, it will prepare an ARP packet containing its MAC address to send back to the host that sent the ARP broadcast.

After the broadcasting host receives the ARP packet, it updates its ARP cache (which stores the IP-MAC mapping). The broadcasting host will then use the new ARP cache data to prepare for sending data packets at the Data Link Layer.

The RARP protocol works in the opposite manner and will not be elaborated upon.

3. ICMP Protocol

The IP protocol is not a reliable protocol; it does not guarantee data delivery. Therefore, naturally, the task of ensuring data delivery should be handled by other modules. One important module is the ICMP (Internet Control Message Protocol). ICMP is not a high-level protocol but operates at the IP layer.

When errors occur in transmitting IP data packets, such as unreachable hosts or unreachable routes, the ICMP protocol will encapsulate the error information and send it back to the host. This gives the host an opportunity to handle the error, which is why it is said that protocols built on top of the IP layer can achieve reliability.

4. Ping

The ping command can be said to be the most famous application of ICMP, and it is part of the TCP/IP protocol. Using the “ping” command can check whether a network is reachable and can help us analyze and determine network faults.

For example, when we cannot access a certain website, we usually ping the site. The ping command will return some useful information. General information is as follows:

The term ping originates from sonar positioning, and the function of this program is indeed similar; it uses ICMP packets to detect whether another host is reachable. The principle is to send a request using an ICMP packet with type code 0, and the responding host will reply with an ICMP packet of type code 8.

5. Traceroute

Traceroute is an important tool for detecting the routing situation between a host and the destination host, and it is also the most convenient tool.

The principle of Traceroute is very interesting. It sends a UDP packet with TTL=1 to the destination host. The first router that receives this packet will automatically decrement the TTL by 1, and when the TTL becomes 0, the router will discard the packet and generate an ICMP message indicating that the host is unreachable. After receiving this message, the host sends a UDP packet with TTL=2 to the destination host, stimulating the second router to send an ICMP message back to the host. This continues until it reaches the destination host. Thus, Traceroute obtains the IP addresses of all routers.

6. TCP/UDP

TCP and UDP are both transport layer protocols, but they have different characteristics and applications. The following is a comparative analysis in tabular form.

Message-oriented

Message-oriented transmission means that the application layer specifies how long the message is, and UDP sends it as is, one message at a time. Therefore, the application must choose an appropriate message size. If the message is too long, the IP layer needs to fragment it, reducing efficiency. If too short, it may be too small for the IP.

Byte stream-oriented

In byte stream-oriented communication, although the application interacts with TCP one data chunk at a time (of varying sizes), TCP treats the application as a continuous stream of unstructured bytes. TCP has a buffer, and when the data chunk sent by the application is too long, TCP can segment it into smaller pieces for transmission.

Congestion control and flow control are key points of TCP, which will be explained later.

Some applications of TCP and UDP protocols

When to Use TCP?

When there is a requirement for communication quality, such as ensuring that all data is accurately transmitted to the other party, this is often used in applications that require reliability, such as protocols for transferring files like HTTP, HTTPS, FTP, and email transmission protocols such as POP and SMTP.

When to Use UDP?

When the quality of network communication is not highly demanded, and speed is prioritized, UDP can be used.

7. DNS

DNS (Domain Name System) is a distributed database on the Internet that maps domain names to IP addresses, making it easier for users to access the Internet without having to remember IP numbers that can be directly read by machines. The process of obtaining the IP address corresponding to a hostname is called domain name resolution (or hostname resolution). The DNS protocol operates over UDP and uses port 53.

8. Establishing and Terminating TCP Connections

1. Three-way Handshake

TCP is connection-oriented; before either party sends data, a connection must be established between the two. In the TCP/IP protocol, TCP provides a reliable connection service, which is initialized through a three-way handshake. The purpose of the three-way handshake is to synchronize the sequence numbers and acknowledgment numbers of both parties and exchange TCP window size information.

First Handshake: Establishing a connection. The client sends a connection request packet, setting SYN to 1 and Sequence Number to x; then, the client enters the SYN_SEND state, waiting for the server’s acknowledgment;

Second Handshake: The server receives the SYN packet. The server acknowledges the client’s SYN packet by setting Acknowledgment Number to x+1 (Sequence Number + 1); at the same time, it sends its own SYN request information, setting SYN to 1 and Sequence Number to y; the server places all this information into a single packet (the SYN+ACK packet) and sends it to the client, entering the SYN_RECV state;

Third Handshake: The client receives the server’s SYN+ACK packet. It then sets Acknowledgment Number to y+1 and sends an ACK packet to the server. After this packet is sent, both the client and server enter the ESTABLISHED state, completing the TCP three-way handshake.

Why Three-way Handshake?

To prevent a previously invalid connection request packet from suddenly being sent to the server, resulting in errors.

Specific Example: The generation of an “invalid connection request packet” occurs in a situation where the first connection request packet sent by the client has not been lost but has been delayed at a network node for a long time, arriving at the server after the connection has been released. This is an already invalid packet. However, when the server receives this invalid connection request packet, it mistakenly believes it is a new connection request from the client.

Thus, it sends an acknowledgment packet to the client, agreeing to establish the connection. If the “three-way handshake” is not used, as soon as the server sends the acknowledgment, a new connection would be established. Since the client has not sent a connection establishment request, it will ignore the server’s acknowledgment and will not send data to the server. However, the server would think that a new transport connection has been established and would be waiting for the client to send data. This would waste many of the server’s resources. The “three-way handshake” method can prevent the occurrence of the above phenomenon. For example, in the situation described above, the client will not send an acknowledgment back to the server. The server, not receiving the acknowledgment, will know that the client has not requested to establish a connection.

2. Four-way Handshake

After the client and server have established a TCP connection through the three-way handshake, when data transmission is complete, the TCP connection must be terminated. This is known as the mysterious “four-way handshake”.

First Handshake: Host 1 (which can be the client or server) sets the Sequence Number and sends a FIN packet to Host 2; at this point, Host 1 enters the FIN_WAIT_1 state; this indicates that Host 1 has no more data to send to Host 2;

Second Handshake: Host 2 receives the FIN packet sent by Host 1 and replies with an ACK packet, setting Acknowledgment Number to Sequence Number + 1; Host 1 enters the FIN_WAIT_2 state; Host 2 informs Host 1 that it “agrees” to the close request;

Third Handshake: Host 2 sends a FIN packet to Host 1, requesting to close the connection, while Host 2 enters the LAST_ACK state;

Fourth Handshake: Host 1 receives the FIN packet sent by Host 2 and sends an ACK packet back to Host 2; then Host 1 enters the TIME_WAIT state; after receiving Host 1’s ACK packet, Host 2 closes the connection; at this point, Host 1 waits for 2MSL (Maximum Segment Lifetime) to ensure no further replies are received, confirming that the server has closed normally, and then Host 1 can also close the connection.

Why Four-way Handshake?

The TCP protocol is a connection-oriented, reliable, byte-stream-based transport layer communication protocol. TCP operates in full-duplex mode, which means that when Host 1 sends a FIN packet, it only indicates that Host 1 has no more data to send; Host 1 is informing Host 2 that it has finished sending all its data; however, at this time, Host 1 can still receive data from Host 2. When Host 2 returns the ACK packet, it indicates that it knows Host 1 has no data to send, but Host 2 can still send data to Host 1. When Host 2 also sends a FIN packet, this indicates that Host 2 also has no data to send, and it tells Host 1 that it has no data to send; after this, both will happily terminate the TCP connection.

Why Wait for 2MSL?

MSL: Maximum Segment Lifetime, which is the longest time any segment can remain in the network before being discarded. There are two reasons:

To ensure that the TCP protocol’s full-duplex connection can be reliably closed.
To ensure that any duplicate data segments from this connection disappear from the network.

First Point: If Host 1 directly enters CLOSED, due to the unreliability of the IP protocol or other network reasons, Host 2 may not receive Host 1’s last ACK reply. Then, after a timeout, Host 2 will continue to send FIN; at this point, since Host 1 has already CLOSED, it cannot find a corresponding connection for the retransmitted FIN. Therefore, Host 1 does not directly enter CLOSED, but maintains the TIME_WAIT state to ensure that it receives the acknowledgment for the FIN again, thus correctly closing the connection.

Second Point: If Host 1 directly enters CLOSED and then initiates a new connection to Host 2, we cannot guarantee that this new connection will have a different port number than the closed connection. This means that it is possible for the new connection and the old connection to have the same port number. Generally, this will not cause problems, but there are special cases: if the new connection and the already closed old connection have the same port number, and some delayed data from the previous connection still lingers in the network, this delayed data may arrive at Host 2 after the new connection is established, and due to the same port number, the TCP protocol will consider that delayed data as belonging to the new connection, causing confusion with the actual new connection’s data packets. Therefore, the TCP connection must remain in the TIME_WAIT state for 2MSL to ensure that all data from this connection disappears from the network.

9. TCP Flow Control

If the sender sends data too quickly, the receiver may not be able to keep up, leading to data loss. Flow control ensures that the sender’s sending rate is not too fast, allowing the receiver to catch up.

Using the sliding window mechanism can easily implement flow control on a TCP connection.

Assume A is sending data to B. During the connection establishment, B informs A: “My receive window is rwnd = 400” (where rwnd represents the receiver window). Therefore, the sender’s sending window cannot exceed the value given by the receiver. Note that the TCP window is measured in bytes, not segments. Assuming each segment is 100 bytes long, and the initial sequence number for data segments is set to 1. Uppercase ACK indicates the acknowledgment bit in the header, while lowercase ack indicates the acknowledgment field value.

From the diagram, it can be seen that B performs flow control three times. The first time it reduces the window to rwnd = 300, the second time to rwnd = 100, and finally to rwnd = 0, meaning the sender is no longer allowed to send data. This state of pausing the sender will last until Host B sends a new window value.

TCP sets a persistence timer for each connection. Whenever one side of the TCP connection receives a zero window notification from the other side, it starts the persistence timer. If the timer expires, it sends a zero window probe packet (carrying 1 byte of data), and the receiving party resets the persistence timer.

10. TCP Congestion Control

The sender maintains a congestion window (cwnd) state variable. The size of the congestion window depends on the network’s congestion level and changes dynamically. The sender sets its sending window equal to the congestion window.

The principle for controlling the congestion window is: as long as there is no congestion in the network, the congestion window increases to allow more packets to be sent. However, if congestion occurs, the congestion window decreases to reduce the number of packets injected into the network.

Slow start algorithm:

When a host starts sending data, if it injects a large amount of data into the network immediately, it may cause network congestion since the load conditions are not yet clear. Therefore, a better method is to probe gradually by increasing the sending window from small to large, i.e., increasing the congestion window value from small to large.

Usually, when starting to send segments, the congestion window (cwnd) is set to the value of one maximum segment size (MSS). With each acknowledgment received for a new segment, the congestion window increases by at most one MSS. This method gradually increases the sender’s congestion window (cwnd), allowing packets to be injected into the network at a more reasonable rate.

With each transmission round, the congestion window (cwnd) doubles. The time taken for one transmission round is actually the round-trip time RTT. However, the term “transmission round” emphasizes that all segments allowed by the congestion window (cwnd) are sent continuously, and acknowledgments for the last byte sent are received.

Additionally, the “slow” in slow start does not refer to the slow growth rate of cwnd but rather to the initial setting of cwnd=1, causing the sender to send only one segment at the beginning (to probe the network’s congestion condition) before gradually increasing cwnd.

To prevent the congestion window (cwnd) from growing too large and causing network congestion, a slow start threshold (ssthresh) state variable is set. The usage of the slow start threshold (ssthresh) is as follows:

When cwnd < ssthresh, use the slow start algorithm.
When cwnd > ssthresh, stop using the slow start algorithm and switch to the congestion avoidance algorithm.
When cwnd = ssthresh, either the slow start algorithm or the congestion avoidance algorithm can be used.

Congestion Avoidance

Let the congestion window (cwnd) increase slowly, i.e., increase the sender’s congestion window (cwnd) by 1 for each round-trip time (RTT) instead of doubling it. This allows the congestion window (cwnd) to grow slowly in a linear manner, much slower than the growth rate of the congestion window in the slow start algorithm.

Regardless of whether in the slow start phase or the congestion avoidance phase, as long as the sender detects network congestion (indicated by the absence of acknowledgments), it should set the slow start threshold (ssthresh) to half of the sender’s window value at the moment of congestion (but not less than 2). Then, it should reset the congestion window (cwnd) to 1 and execute the slow start algorithm.

The purpose of this is to quickly reduce the number of packets sent into the network, giving the congested router enough time to process the packets queued up.

The following diagram illustrates the above congestion control process with specific values. Now the sending window size is equal to the congestion window.

2. Fast Retransmit and Fast Recovery

Fast Retransmit

The fast retransmit algorithm requires the receiver to immediately send a duplicate acknowledgment for every out-of-order segment received (so that the sender knows early that a segment has not reached the other party) rather than waiting to send an acknowledgment when it sends its own data.

Assuming the receiver has received M1 and M2 and has sent acknowledgments for both. Now suppose the receiver did not receive M3 but then received M4.

Clearly, the receiver cannot acknowledge M4 because M4 is an out-of-order segment. According to the reliable transmission principle, the receiver can choose to do nothing or send an acknowledgment for M2 at an appropriate time.

However, according to the fast retransmit algorithm, the receiver should promptly send a duplicate acknowledgment for M2, allowing the sender to know early that segment M3 has not reached the receiver. The sender then sends M5 and M6. After receiving these two packets, the receiver sends another duplicate acknowledgment for M2. Thus, the sender receives four acknowledgments for M2 from the receiver, with the last three being duplicate acknowledgments.

The fast retransmit algorithm also specifies that the sender should immediately retransmit the unacknowledged segment M3 as soon as it receives three duplicate acknowledgments, without having to wait for the retransmission timer for M3 to expire.

By retransmitting unacknowledged segments early, the overall network throughput can increase by about 20%.

Fast Recovery

Fast recovery is used in conjunction with fast retransmit and has two main points:

When the sender receives three consecutive duplicate acknowledgments, it executes a “multiplicative decrease” algorithm, halving the slow start threshold (ssthresh).
Unlike slow start, the congestion window (cwnd) is not set to 1; instead, it is set to the value of the slow start threshold (ssthresh) after halving, and then the congestion avoidance algorithm (“additive increase”) is executed to allow the congestion window to increase linearly.