
Source: Juejin, Author: Ruheng
Link: https://juejin.im/post/6844903490595061767
1. TCP/IP Model
The above diagram uses the HTTP protocol as an example for a detailed explanation.
2. Data Link Layer
- Encapsulating into frames: add a header and trailer to the network-layer datagram to form a frame. The frame header includes the source and destination MAC addresses.
- Transparent transmission: zero-bit stuffing and escape characters ensure that any bit pattern in the data can be carried.
- Reliable transmission: rarely provided on links with low error rates, but wireless links (WLAN), which have higher error rates, do provide reliable transmission at the link layer.
- Error detection (CRC): the receiver checks each frame for errors and discards any frame found to be corrupted.
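As a rough illustration of the error-detection idea, here is a minimal Python sketch that appends a CRC-32 checksum (the same checksum family Ethernet uses) to a frame and lets the receiver detect a flipped bit. The function names are invented for this example:

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Sender side: append a CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare; discard on mismatch."""
    payload, received_crc = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received_crc

frame = make_frame(b"network layer datagram")
assert check_frame(frame)                          # intact frame passes

corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # flip one bit "in transit"
assert not check_frame(corrupted)                  # receiver detects the error
```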
3. Network Layer
1. IP Protocol
The IP protocol is the core of the TCP/IP protocol, and all TCP, UDP, ICMP, IGMP data is transmitted in IP data format. It is important to note that IP is not a reliable protocol, meaning that the IP protocol does not provide a mechanism for handling data that has not been delivered, which is considered the responsibility of upper-layer protocols: TCP or UDP.
1.1 IP Address
1.2 IP Protocol Header
2. ARP and RARP Protocols
3. ICMP Protocol
The IP protocol is not a reliable protocol; it does not guarantee data delivery. Therefore, the task of ensuring data delivery should naturally fall to other modules. One important module is the ICMP (Internet Control Message Protocol). ICMP is not a high-level protocol but is a protocol at the IP layer.
When errors occur in transmitting IP packets, such as host unreachable, route unreachable, etc., the ICMP protocol will encapsulate the error information and send it back to the host, giving the host a chance to handle the error. This is why protocols built on top of the IP layer can achieve reliability.
4. Ping
Ping can be said to be the most famous application of ICMP and is part of the TCP/IP protocol. The “ping” command can check network connectivity and is very helpful for analyzing and diagnosing network faults.
For example, when we cannot access a website, we usually ping the site. The ping will echo some useful information. General information is as follows:
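To make this concrete, here is a hedged Python sketch of how a ping implementation might build an ICMP echo request (type 8) and compute the RFC 1071 Internet checksum. Actually sending it requires a raw socket and elevated privileges, so only the packet construction is shown; the function names are invented:

```python
import struct

def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data
    total = sum(struct.unpack(f"!{len(data)//2}H", data))
    total = (total & 0xFFFF) + (total >> 16) # fold carries back in
    total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """ICMP echo request: type=8, code=0, checksum over the whole message."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)  # checksum field = 0
    checksum = inet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, ident, seq) + payload

pkt = build_echo_request(ident=0x1234, seq=1, payload=b"ping!")
# A valid ICMP message checksums to zero when the checksum field is included.
assert inet_checksum(pkt) == 0
```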
5. Traceroute
Traceroute is an important and convenient tool for probing the route between a host and a destination host.
The principle behind Traceroute is quite interesting. Given the destination host's IP, it first sends a UDP packet with TTL=1, addressed to a UDP port that is unlikely to be in use. The first router to receive the packet decrements the TTL by 1; since the TTL is now 0, the router discards the packet and returns an ICMP "time exceeded" message to the host, revealing that router's IP. The host then sends a UDP packet with TTL=2, prompting the second router to reply in the same way. This continues until a packet reaches the destination host itself, which returns an ICMP "port unreachable" message; that reply tells Traceroute the destination has been reached and completes the list of router IPs along the path.
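The TTL probing can be modeled without raw sockets. The sketch below is a toy simulation (the function name and hop addresses are invented): each probe's TTL expires at one hop further along the path, intermediate routers answer with ICMP "time exceeded", and the destination answers with "port unreachable":

```python
def traceroute_sim(path: list[str]) -> list[tuple[str, str]]:
    """Toy model of traceroute's TTL probing.

    `path` is the ordered list of hop IPs ending at the destination.
    For TTL = n, the hop at index n-1 is where the TTL reaches 0.
    """
    replies = []
    for ttl in range(1, len(path) + 1):
        hop = path[ttl - 1]
        if ttl < len(path):
            # An intermediate router discards the packet and reports back.
            replies.append((hop, "ICMP time exceeded"))
        else:
            # The destination itself rejects the unlikely UDP port.
            replies.append((hop, "ICMP port unreachable"))
    return replies

hops = traceroute_sim(["10.0.0.1", "192.168.1.1", "203.0.113.7"])
assert [ip for ip, _ in hops] == ["10.0.0.1", "192.168.1.1", "203.0.113.7"]
assert hops[-1][1] == "ICMP port unreachable"
```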
6. TCP/UDP
Message-oriented
With message-oriented transmission, the application layer hands UDP a message of whatever length it chooses, and UDP sends it as is: one message per send. The application must therefore choose a suitable message size. If the message is too long, the IP layer has to fragment it, reducing efficiency; if it is too short, the IP header overhead becomes disproportionately large.
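The "one message per send" property is easy to observe with two UDP sockets on the loopback interface. This is a minimal sketch; real code would add proper error handling:

```python
import socket

# Receiver: a UDP socket bound to an ephemeral loopback port.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv_sock.bind(("127.0.0.1", 0))
recv_sock.settimeout(2)            # avoid blocking forever if a datagram is lost
addr = recv_sock.getsockname()

# Sender: each sendto() emits exactly one datagram.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(b"first message", addr)
send_sock.sendto(b"second message", addr)

# Each recvfrom() returns exactly one message -- boundaries are preserved,
# unlike TCP, where both sends could arrive merged into one byte stream.
msg1, _ = recv_sock.recvfrom(4096)
msg2, _ = recv_sock.recvfrom(4096)
assert msg1 == b"first message"
assert msg2 == b"second message"

send_sock.close()
recv_sock.close()
```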
Byte stream-oriented
When should TCP be used?
When should UDP be used?
7. DNS
8. Establishment and Termination of TCP Connections
1. Three-way Handshake
Why three-way handshake?
2. Four-way Handshake
After the client and server have established a TCP connection through the three-way handshake, when the data transmission is complete, the TCP connection must be terminated. This is known as the mysterious “four-way handshake”.
First separation: Host 1 (either the client or the server) sets a Sequence Number and sends a FIN segment to Host 2; Host 1 then enters the FIN_WAIT_1 state. This tells Host 2 that Host 1 has no more data to send.
Second separation: Host 2 receives the FIN segment from Host 1 and replies with an ACK segment whose Acknowledgment Number is the received Sequence Number plus 1; Host 2 enters the CLOSE_WAIT state, and on receiving the ACK, Host 1 enters the FIN_WAIT_2 state. Host 2 is telling Host 1 that it "agrees" to the close request.
Third separation: Host 2 sends a FIN segment to Host 1, requesting that the connection be closed in the other direction, and enters the LAST_ACK state.
Fourth separation: Host 1 receives the FIN segment from Host 2 and replies with an ACK segment, then enters the TIME_WAIT state. When Host 2 receives this ACK, it closes the connection. Host 1 waits for 2MSL; if no further FIN arrives in that time, it too closes the connection.
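The teardown above can be captured as a small table-driven state machine sketch. The dictionaries and event names are invented for illustration; CLOSE_WAIT is the standard state the passive closer occupies between receiving the first FIN and sending its own:

```python
# Transitions: (current_state, event) -> next_state.
ACTIVE_CLOSER = {   # Host 1, the side that closes first
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN"): "TIME_WAIT",    # replies with the final ACK
    ("TIME_WAIT", "2MSL timeout"): "CLOSED",
}
PASSIVE_CLOSER = {  # Host 2
    ("ESTABLISHED", "recv FIN"): "CLOSE_WAIT",  # replies with an ACK
    ("CLOSE_WAIT", "send FIN"): "LAST_ACK",
    ("LAST_ACK", "recv ACK"): "CLOSED",
}

def run(machine: dict, events: list[str], state: str = "ESTABLISHED") -> str:
    """Replay a sequence of events and return the final state."""
    for event in events:
        state = machine[(state, event)]
    return state

assert run(ACTIVE_CLOSER, ["send FIN", "recv ACK", "recv FIN", "2MSL timeout"]) == "CLOSED"
assert run(PASSIVE_CLOSER, ["recv FIN", "send FIN", "recv ACK"]) == "CLOSED"
```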
Why four-way handshake?
Why wait for 2MSL?
MSL: Maximum Segment Lifetime, which is the longest time any segment can exist in the network before being discarded. There are two reasons for this:
- To ensure that the TCP full-duplex connection can be closed reliably.
- To ensure that any stale duplicate segments from this connection disappear from the network.
The first point: If Host 1 directly goes to CLOSED, and due to the unreliability of the IP protocol or other network reasons, Host 2 does not receive Host 1’s final ACK reply. Then Host 2 will continue to send FIN after a timeout. Since Host 1 has already CLOSED, it will not find the connection corresponding to the resent FIN. Therefore, Host 1 does not directly enter CLOSED, but maintains TIME_WAIT, so that when it receives FIN again, it can ensure that the other side receives the ACK, thereby correctly closing the connection.
The second point: If Host 1 directly goes to CLOSED and then initiates a new connection to Host 2, we cannot guarantee that the new connection will have a different port number than the recently closed connection. In other words, it is possible for the new connection and the old closed connection to have the same port number. Generally, this should not cause any problems, but there are special cases: If the new connection and the already closed old connection have the same port number, and some delayed data from the previous connection still lingers in the network, this delayed data will arrive at Host 2 after the new connection is established. Since the new connection and old connection have the same port number, the TCP protocol will mistakenly consider that delayed data belongs to the new connection, thus causing confusion with the actual data packets of the new connection. Therefore, the TCP connection must wait in the TIME_WAIT state for 2MSL to ensure that all data from the current connection disappears from the network.
9. TCP Flow Control
If the sender sends data too quickly, the receiver may not be able to keep up, leading to data loss. Flow control ensures that the sender’s sending rate is not too fast, allowing the receiver to keep up with the reception.
Using the sliding window mechanism makes it easy to implement flow control on a TCP connection.
Assume A is sending data to B. At the time of connection establishment, B informs A: “My receive window is rwnd = 400” (where rwnd represents the receiver window). Therefore, the sender’s sending window cannot exceed the value of the receiver window provided by the receiver. Note that the TCP window is measured in bytes, not segments. Assume each segment is 100 bytes long, and the initial value of the data segment sequence number is set to 1. Uppercase ACK indicates the acknowledgment bit in the header, while lowercase ack indicates the acknowledgment field value.
From the diagram, we can see that B performed flow control three times. The first time the window was reduced to rwnd = 300, the second time to rwnd = 100, and finally to rwnd = 0, which prohibits the sender from sending any more data. This state of pausing the sender will last until Host B sends a new window value.
TCP maintains a persistence timer for each connection. Whenever one side receives a zero-window notification from the other, it starts the persistence timer. When the timer expires, the sender transmits a zero-window probe segment (carrying 1 byte of data); the receiver responds with its current window size. If the window is still zero, the persistence timer is restarted; otherwise the deadlock is broken and normal transmission resumes.
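The receiver-window arithmetic behind flow control can be sketched with an invented helper: the sender may only inject as many new bytes as the advertised window minus the data already in flight. The numbers below mirror the rwnd = 400 scenario described above:

```python
def allowed_to_send(rwnd: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """Bytes the sender may still inject: advertised window minus in-flight data."""
    in_flight = last_byte_sent - last_byte_acked
    return max(rwnd - in_flight, 0)

# B announced rwnd=400; A has sent 100 bytes, none acknowledged yet.
assert allowed_to_send(rwnd=400, last_byte_sent=100, last_byte_acked=0) == 300

# B shrinks the window to 100 while 100 bytes are still unacknowledged.
assert allowed_to_send(rwnd=100, last_byte_sent=300, last_byte_acked=200) == 0

# rwnd=0 forbids sending entirely until B advertises a new window.
assert allowed_to_send(rwnd=0, last_byte_sent=300, last_byte_acked=300) == 0
```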
10. TCP Congestion Control
The sender maintains a congestion window (cwnd) state variable. The size of the congestion window depends on the network’s congestion level and changes dynamically. The sender sets its sending window equal to the congestion window.
The principle of controlling the congestion window is that as long as there is no congestion in the network, the congestion window increases to send more packets. However, whenever congestion occurs, the congestion window decreases to reduce the number of packets injected into the network.
The slow start algorithm:
When a host starts sending data, if it injects a large amount of data bytes into the network immediately, it may cause network congestion, as the load situation of the network is not clear. Therefore, a better method is to probe first, gradually increasing the sending window from small to large, meaning increasing the congestion window value gradually.
Typically, when starting to send packets, the congestion window cwnd is initially set to the value of one maximum segment size (MSS). After receiving an acknowledgment for a new segment, the congestion window is increased by at most one MSS. This gradual increase in the sender’s congestion window cwnd allows packets to be injected into the network at a more reasonable rate.
With each transmission round, the congestion window cwnd doubles. A transmission round lasts roughly one round-trip time (RTT); the term "transmission round" emphasizes that all the segments permitted by the congestion window cwnd are sent back-to-back, and the round ends when the acknowledgment for the last byte sent has been received.
Moreover, the “slow” in slow start does not refer to the slow growth rate of cwnd, but rather that during the initial sending of packets, cwnd is set to 1, allowing the sender to send only one segment initially (to probe the network’s congestion situation), and then gradually increases cwnd.
To prevent the congestion window cwnd from growing too large and causing network congestion, a slow start threshold (ssthresh) state variable is also set. The usage of the slow start threshold ssthresh is as follows:
- When cwnd < ssthresh, use the slow start algorithm described above.
- When cwnd > ssthresh, stop using slow start and switch to the congestion avoidance algorithm.
- When cwnd = ssthresh, either slow start or congestion avoidance may be used.
Congestion avoidance
Slowly increase the congestion window cwnd, meaning that every round-trip time RTT, the sender’s congestion window cwnd is increased by 1, rather than doubled. This causes the congestion window cwnd to grow slowly in a linear manner, much slower than the growth rate of the congestion window in the slow start algorithm.
Whether in the slow start phase or the congestion avoidance phase, as soon as the sender detects network congestion (signaled by a retransmission timeout, i.e., an expected acknowledgment failing to arrive in time), it sets the slow start threshold ssthresh to half the sender's window value at the moment congestion occurred (but no less than 2). The congestion window cwnd is then reset to 1, and the slow start algorithm runs again.
The purpose of this is to quickly reduce the number of packets sent to the network so that the congested router has enough time to process the packets queued up.
The following diagram illustrates the process of the above congestion control with specific values. Currently, the size of the sending window is equal to the congestion window.
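The interplay of slow start, congestion avoidance, and the timeout reaction can be sketched as a small simulation. The function name and the specific ssthresh and loss-round values below are illustrative (they match the common textbook scenario of ssthresh = 16 with congestion at cwnd = 24):

```python
def cwnd_trace(ssthresh: int, loss_round: int, rounds: int) -> list[int]:
    """cwnd at the start of each transmission round: slow start doubles cwnd,
    congestion avoidance adds 1 per round, and a timeout halves ssthresh
    and resets cwnd to 1."""
    cwnd, trace = 1, []
    for rnd in range(rounds):
        trace.append(cwnd)
        if rnd == loss_round:                 # congestion detected (timeout)
            ssthresh = max(cwnd // 2, 2)      # multiplicative decrease
            cwnd = 1                          # restart slow start
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)    # slow start: double per RTT
        else:
            cwnd += 1                         # congestion avoidance: +1 per RTT
    return trace

trace = cwnd_trace(ssthresh=16, loss_round=12, rounds=16)
# Exponential growth up to ssthresh, then linear growth.
assert trace[:7] == [1, 2, 4, 8, 16, 17, 18]
# Congestion at cwnd=24; the next round restarts from 1.
assert trace[12] == 24 and trace[13] == 1
```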
2. Fast Retransmit and Fast Recovery
Fast Retransmit
The fast retransmit algorithm requires that the receiver send a duplicate acknowledgment immediately upon receiving an out-of-order segment (to inform the sender early that a segment has not reached the other party) rather than waiting to send acknowledgment with its own data.
Assuming the receiver received M1 and M2 and sent acknowledgments for both. Now suppose the receiver did not receive M3 but then received M4.
Clearly, the receiver cannot acknowledge M4, because M4 is an out-of-order segment. Under the rules of reliable transmission, the receiver may either do nothing or send a duplicate acknowledgment for M2 at an appropriate time.
However, according to the fast retransmit algorithm, the receiver should promptly send a duplicate acknowledgment for M2, allowing the sender to know early that segment M3 has not reached the receiver. The sender then sends M5 and M6. The receiver receives these two segments and also sends another duplicate acknowledgment for M2. Thus, the sender receives four acknowledgments for M2, three of which are duplicates.
The fast retransmit algorithm also stipulates that as soon as the sender receives three duplicate acknowledgments, it should immediately retransmit the unacknowledged segment M3 without waiting for the retransmission timer for M3 to expire.
Since the sender retransmits unacknowledged segments early, adopting fast retransmit can increase overall network throughput by about 20%.
Fast Recovery
- When the sender receives three duplicate acknowledgments in a row, it executes the "multiplicative decrease" algorithm, halving the slow start threshold ssthresh.
- Unlike slow start, the sender does not reset cwnd to 1. Instead, it sets cwnd to the halved ssthresh value and then runs the congestion avoidance algorithm ("additive increase"), letting the congestion window grow slowly and linearly.
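A hedged sketch contrasting the two reactions (the function names are invented): three duplicate ACKs trigger fast recovery, which skips slow start, while a retransmission timeout restarts slow start from cwnd = 1:

```python
def on_triple_dup_ack(cwnd: int, ssthresh: int) -> tuple[int, int]:
    """Fast recovery: halve ssthresh, then resume congestion avoidance
    from the halved value instead of restarting at cwnd=1."""
    ssthresh = max(cwnd // 2, 2)   # multiplicative decrease
    cwnd = ssthresh                # skip slow start
    return cwnd, ssthresh

def on_timeout(cwnd: int, ssthresh: int) -> tuple[int, int]:
    """Contrast: a retransmission timeout restarts slow start at cwnd=1."""
    return 1, max(cwnd // 2, 2)

# Three duplicate ACKs at cwnd=24: send rate is halved, not reset.
assert on_triple_dup_ack(cwnd=24, ssthresh=16) == (12, 12)
# A timeout at cwnd=24: same ssthresh, but cwnd collapses to 1.
assert on_timeout(cwnd=24, ssthresh=16) == (1, 12)
```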
