Understanding TCP/IP Fragmentation: How Does It Work?

What Are TCP Segmentation and IP Fragmentation

We know that a network is like a pipe, and pipes can have different diameters.

When a data packet wants to travel from one end of the pipe to the other, it must pass through this pipe. (Obvious, right?)

However, packet sizes vary, and to pass through the pipe, a packet cannot exceed the pipe's diameter.

The question arises: what happens when the data packet is too large?

The answer is relatively simple: the data packet will be split into smaller chunks. This way, the data can be reduced in size and transmitted smoothly.

Recall the network layering model: when sending, data passes through the transport layer first and then through the network layer.

This splitting can occur at both the transport layer and the network layer.

In the transport layer (TCP protocol), it is called segmentation.

In the network layer (IP layer), it is called fragmentation. (Note that unless specified otherwise, the IP mentioned below refers to IPv4.)

Whether it is segmentation or fragmentation, the data has to be divided according to some maximum length.

In TCP, this length is MSS.

In the IP layer, this length is MTU.

What is the relationship between MSS and MTU? This was briefly mentioned in a previous article, and here it is elaborated.

What Is MSS

MSS: Maximum Segment Size. It is the largest chunk of data that TCP hands to the IP layer in one segment, counting only the TCP payload and excluding the TCP header and TCP options. TCP uses MSS to cap how many bytes of application data go into each segment. Assuming MTU = 1500 bytes, MSS = 1500 - 20 (IP header) - 20 (TCP header) = 1460 bytes. If the application layer wants to send 2000 bytes, two segments are needed: the first carries 1460 bytes and the second carries 540 bytes.
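
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the headers are assumed to be 20 bytes each with no options.

```python
IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
TCP_HEADER = 20  # bytes, TCP header without options (assumption)


def mss_for_mtu(mtu: int) -> int:
    """MSS is what remains of the MTU after the IP and TCP headers."""
    return mtu - IP_HEADER - TCP_HEADER


def segment_sizes(payload_len: int, mss: int) -> list[int]:
    """Payload sizes of the TCP segments needed to carry payload_len bytes."""
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


mss = mss_for_mtu(1500)
print(mss, segment_sizes(2000, mss))  # 1460 [1460, 540]
```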

How to Check MSS?

We all know about the three-way handshake in TCP, and MSS is communicated to the other party during this process to inform the counterpart of the maximum TCP message data size that can be received locally (excluding TCP and IP header).

For example, in the image above, B sends its MSS to A, suggesting that A use MSS=1420 when segmenting data for B. Similarly, A advertises MSS=1372 to B. After comparison, the smaller value (1372) is used as the MSS for the connection, a process known as MSS negotiation.

Additionally, under normal circumstances, MSS + 20 (TCP header) + 20 (IP header) = MTU. The MTUs corresponding to the captured packets above are 1372+40 and 1420+40. The MTU on the same path is not necessarily symmetrical, meaning the MTU from A to B and from B to A can be different, and so can the corresponding MSS.
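
As a rough sketch of the comparison described above, the snippet below picks the smaller of the two advertised values and derives the MTU each value implies. The helper names are hypothetical, and the "take the minimum" rule follows this article's description; real stacks may apply the limit per direction.

```python
HEADERS = 40  # 20-byte TCP header + 20-byte IP header, no options (assumption)


def negotiated_mss(mss_a: int, mss_b: int) -> int:
    """The connection uses the smaller of the two advertised MSS values."""
    return min(mss_a, mss_b)


def implied_mtu(mss: int) -> int:
    """Under normal circumstances, MTU = MSS + TCP header + IP header."""
    return mss + HEADERS


print(negotiated_mss(1372, 1420))            # 1372
print(implied_mtu(1372), implied_mtu(1420))  # 1412 1460
```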

Does the MSS Negotiated During the Three-Way Handshake Change?

Yes, it can. Every time TCP's send function runs, the MSS is recalculated and the data is segmented against the current value.
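
Besides capturing packets, one way to look at the MSS a connection is currently using is the TCP_MAXSEG socket option. The sketch below is not from the original article: it opens a throwaway loopback connection and reads the option from the client socket. The helper names are hypothetical, and the value returned is platform-dependent (on loopback it is usually far larger than on Ethernet).

```python
import socket


def current_mss(sock: socket.socket) -> int:
    """Ask the kernel for the MSS currently associated with a TCP socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def loopback_mss() -> int:
    """Open a short-lived connection to ourselves and report its MSS."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        with socket.create_connection(server.getsockname()) as client:
            conn, _ = server.accept()
            with conn:
                return current_mss(client)


print(loopback_mss())
```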

What Happens If the Other Party Does Not Send MSS?

Let’s take a look at the TCP header.

In fact, MSS is carried as a TCP option, and in practice it is almost always sent. But what should the other side do if a machine’s implementation is unusual and omits it?

If the peer’s MSS is not received, the local TCP defaults to MSS = 536 bytes.

Why is it 536?

536 (data) + 20 (TCP header) + 20 (IP header) = 576 bytes

As mentioned earlier, IP fragments packets, so it must also reassemble them, and 576 bytes is exactly the minimum reassembly buffer size that every IP host must support.
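
The fallback rule can be written down as a tiny helper. This is a sketch with made-up names; `None` stands for "the peer sent no MSS option".

```python
from typing import Optional

MIN_REASSEMBLY_BUFFER = 576  # bytes every IPv4 host must be able to reassemble
HEADERS = 40                 # 20-byte IP header + 20-byte TCP header (assumption)
DEFAULT_MSS = MIN_REASSEMBLY_BUFFER - HEADERS  # 536


def send_mss(peer_mss: Optional[int]) -> int:
    """Use the peer's advertised MSS, or fall back to 536 if none was sent."""
    return DEFAULT_MSS if peer_mss is None else peer_mss


print(send_mss(None), send_mss(1460))  # 536 1460
```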

What Is MTU

MTU: Maximum Transmission Unit. The data link layer provides this value to tell the IP layer above it how large a packet it can carry, and the IP layer fragments packets accordingly. Generally, MTU = 1500 bytes. If the IP layer has 1500 bytes or fewer to send, a single IP packet is enough; if it has more than 1500 bytes, fragmentation is necessary, and every fragment carries the same ID in its IP header. So that the receiver can reassemble the fragments, each one also carries extra information, such as its offset within the original IP packet.
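
To make the fragment bookkeeping concrete, here is a minimal sketch of how an IPv4 payload could be cut up for a given MTU. The `Fragment` class and `fragment` function are hypothetical names for this example. It assumes a 20-byte IP header without options, and it uses the IPv4 rule that offsets are counted in 8-byte units, so every fragment except the last carries a multiple of 8 bytes.

```python
from dataclasses import dataclass

IP_HEADER = 20  # bytes, IPv4 header without options (assumption)


@dataclass
class Fragment:
    ident: int            # same ID on every fragment of one packet
    offset: int           # position in the original payload, in 8-byte units
    more_fragments: bool  # MF flag: True on all fragments except the last
    payload_len: int      # bytes of payload carried by this fragment


def fragment(ident: int, payload_len: int, mtu: int) -> list[Fragment]:
    """Split payload_len bytes of IP payload into fragments that fit the MTU."""
    max_data = (mtu - IP_HEADER) // 8 * 8  # round down to a multiple of 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    sent = 0
    while sent < payload_len:
        size = min(max_data, payload_len - sent)
        more = sent + size < payload_len
        fragments.append(Fragment(ident, sent // 8, more, size))
        sent += size
    return fragments


for frag in fragment(ident=1, payload_len=3000, mtu=1500):
    print(frag)
```

With a 3000-byte payload and an MTU of 1500, this yields fragments of 1480, 1480, and 40 bytes at offsets 0, 185, and 370.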

How to Check MTU

On macOS, run the ifconfig command in a terminal to see the MTU values.

$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
    ...
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ...
p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304
    ...

Here, you can see several MTUs, which can be simply understood as the processing capacity of each network card being different, hence the corresponding MTUs are also different. Of course, this value can be modified, but that is not the topic of today’s discussion.
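
If you want these values in a script rather than on screen, one simple approach is to pull the interface name and the `mtu` field out of the ifconfig text. The sketch below parses a copy of the output shown above; `parse_mtus` is a hypothetical helper, and it assumes the `name: flags=... mtu N` line layout seen here.

```python
import re

# Interface header lines copied from the ifconfig output shown above.
SAMPLE = """\
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304
"""


def parse_mtus(ifconfig_text: str) -> dict[str, int]:
    """Map each interface name to the MTU printed on its header line."""
    pattern = re.compile(r"^(\S+?):\s.*\bmtu (\d+)", re.MULTILINE)
    return {name: int(mtu) for name, mtu in pattern.findall(ifconfig_text)}


print(parse_mtus(SAMPLE))  # {'lo0': 16384, 'en0': 1500, 'p2p0': 2304}
```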

On the link from the application layer of one machine to its network card, it can generally be guaranteed that MSS < MTU.

Why Is MTU Generally 1500?

This is actually determined by transmission efficiency. Although the network we usually use feels quite stable, this is because TCP is doing various retransmissions behind the scenes to ensure reliable transmission. In reality, packets can often be lost in transit, and the larger the packet, the higher the probability of loss.

So, is a smaller packet always better? Not necessarily.

If we choose a relatively small MTU, say 300 bytes, then TCP payload = 300 - IP header - TCP header = 300 - 20 - 20 = 260 bytes, and the effective transmission efficiency = 260 / 300 ≈ 87%.

Conversely, with the Ethernet MTU of 1500 bytes, the effective transmission efficiency = 1460 / 1500 ≈ 97%, which is clearly much higher than 87%.

Thus, while smaller packets are less likely to be lost, larger packets have higher transmission efficiency, so a balance was struck at 1500.
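
The efficiency comparison is easy to reproduce for any MTU. The helper below is a sketch with a made-up name, and it assumes 40 bytes of IP and TCP headers per packet with no options.

```python
def payload_efficiency(mtu: int, headers: int = 40) -> float:
    """Fraction of each full-size packet that is TCP payload."""
    return (mtu - headers) / mtu


# Print the ratio for the two MTUs compared above, to one decimal place.
for mtu in (300, 1500):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.1%}")
```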

Why Does TCP Segment Even Though the IP Layer Fragments?

Since the IP layer already performs fragmentation, even if TCP does not segment, the data packet will still be fragmented at the IP layer, and the data can still be transmitted normally.

Since the network layer will fragment, why does TCP still need to segment? Is it redundant?

Suppose there is a large piece of data that is not segmented at the TCP layer. If any packet is lost during transmission, TCP must retransmit the entire piece (the IP layer does split the data into N smaller packets of MTU length, but the unit of TCP retransmission is still the whole piece).

If TCP instead segments this data into N segments no larger than MSS, then each segment, with its TCP header and IP header added, still fits within the MTU, and the IP layer does not need to fragment it. If a packet is lost on the transmission path, TCP retransmits only that one MSS-sized segment. This is more efficient than when TCP does not segment.

Similarly, in addition to TCP, there is also the UDP protocol at the transport layer, but UDP does not segment by itself. Therefore, when the data volume is large, it can only be passed to the IP layer for fragmentation, and then sent to the lower layer.

In other words, under normal circumstances, on the link from the transport layer of one machine to the network layer, if the transport layer segments the data, then the IP layer will not need to fragment. If the transport layer does not segment, then the IP layer may perform fragmentation.

In short, the purpose of data segmentation in TCP is to avoid fragmentation at the IP layer, while ensuring that during retransmission, only the small pieces of data after segmentation are retransmitted.
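
The retransmission argument can be put into numbers. The sketch below compares how many bytes TCP would resend after a single loss in the two cases; the function names and the 14600-byte example are illustrative only.

```python
def segment_sizes(data_len: int, mss: int) -> list[int]:
    """Payload sizes of the TCP segments needed to carry data_len bytes."""
    full, rest = divmod(data_len, mss)
    return [mss] * full + ([rest] if rest else [])


def retransmit_cost(data_len: int, mss: int, lost_segment: int) -> tuple[int, int]:
    """Bytes resent after one loss: (without TCP segmentation, with it)."""
    unsegmented = data_len  # the whole piece is the retransmission unit
    segmented = segment_sizes(data_len, mss)[lost_segment]  # only the lost segment
    return unsegmented, segmented


print(retransmit_cost(14600, 1460, lost_segment=3))  # (14600, 1460)
```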

If TCP Segments, Does the IP Layer Definitely Not Fragment?

As mentioned above, at the sending end, if TCP segments, the IP layer will not fragment.

However, there may be other network layer devices along the entire transmission path, and the MTU of these devices may be smaller than that of the sending end. In this case, even though the data packet has already been segmented at the sending end, it may still be fragmented at the IP layer.

If there are devices on the link with an even smaller MTU, fragmentation will occur again, and all fragments will be reassembled at the receiving end.

Therefore, even if TCP has segmented, it is still possible for the IP layer at other nodes on the link to fragment again, and even if the data has already been fragmented by the first IP layer, it can still be fragmented again by the IP layer of other machines.
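
To illustrate repeated fragmentation along a path, the sketch below pushes one IP packet through a series of hops with shrinking MTUs and tracks the payload size of each resulting fragment. It assumes DF = 0, 20-byte IP headers, and the IPv4 rule that non-final fragments carry a multiple of 8 bytes. The function names and the MTU values are made up for the example.

```python
IP_HEADER = 20  # bytes, IPv4 header without options (assumption)


def split(payload_len: int, mtu: int) -> list[int]:
    """Payload sizes after one router fits a packet to its outgoing MTU."""
    if payload_len + IP_HEADER <= mtu:
        return [payload_len]  # fits as-is, no fragmentation needed
    max_data = (mtu - IP_HEADER) // 8 * 8  # fragment data is a multiple of 8
    sizes = []
    while payload_len > 0:
        size = min(max_data, payload_len)
        sizes.append(size)
        payload_len -= size
    return sizes


def traverse(payload_len: int, path_mtus: list[int]) -> list[int]:
    """Fragment payload sizes that arrive after crossing every hop in order."""
    packets = [payload_len]
    for mtu in path_mtus:
        packets = [piece for packet in packets for piece in split(packet, mtu)]
    return packets


# One full TCP segment: 1460 bytes of data + 20-byte TCP header = 1480 bytes of IP payload.
print(traverse(1480, [1500, 1000, 576]))  # [552, 424, 504]
```

The packet leaves the sender whole, is split in two at the 1000-byte hop, and the larger piece is split again at the 576-byte hop. The receiver reassembles all three fragments.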

How Can the IP Layer Avoid Fragmentation?

As mentioned before, the IP layer may fragment during transmission due to different MTUs between nodes. Each fragmentation adds various information to facilitate reassembly at the receiving end. So, can the IP layer avoid fragmentation?

If there is a way to know the smallest MTU along the entire link and send data at the minimum MTU length, then fragmentation will not occur no matter which node the data reaches.

The smallest MTU along the entire link is called PMTU (Path MTU).

There is a method to obtain this PMTU, called Path MTU Discovery.

$ cat /proc/sys/net/ipv4/ip_no_pmtu_disc
0

By default, it is set to 0, meaning the PMTU discovery function is enabled. Generally, most machines are in this state.

The principle is relatively simple. First, let’s look back at the IP datagram header.

There is a highlighted flag DF (Don’t Fragment). When it is set to 1, it means this IP datagram should not be fragmented.

When a router on the link receives this datagram and finds that the length exceeds its MTU, it will check the DF of this IP datagram.

If it is 0 (fragmentation allowed), the router fragments the datagram and forwards the fragments to the next router.
If it is 1, the router discards the datagram and returns an ICMP "Destination Unreachable: Fragmentation Needed" message to the sender, which includes the MTU of the link that could not carry the datagram.

Understanding the above principle, let’s see how PMTU discovery is implemented.

The application normally sends messages via TCP, and after the transport layer TCP segments, the network layer adds the IP header with DF set to 1, and the message is sent to the lower layers.
At this time, a router on the link has a smaller MTU for various reasons.
The IP message reaches this router, which finds that the message length exceeds its MTU, and since the message has the DF flag set to prevent fragmentation, it discards the message. It then returns an ICMP error to the sender, along with its MTU.

The sender receives this ICMP message, updates its MTU, and records it in a PMTU table.
Due to TCP’s reliability, it will attempt to retransmit this message, calculating MSS based on this new MTU for segmentation, allowing the new IP packet to be successfully forwarded by the router.
If there are routers with even smaller MTUs on the path, the above process will repeat.
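
The discovery loop described in these steps can be simulated in a few lines. This is a toy model, not a real network implementation: the path is just a list of link MTUs, every packet is sent at full size with DF set, and the "ICMP reply" is simply the MTU of the first link that cannot carry the packet. All names and numbers are illustrative.

```python
from typing import Optional

HEADERS = 40  # 20-byte IP header + 20-byte TCP header (assumption)


def first_blocking_mtu(packet_len: int, path_mtus: list[int]) -> Optional[int]:
    """MTU reported by the first router that must drop a DF=1 packet, else None."""
    for mtu in path_mtus:
        if packet_len > mtu:
            return mtu  # router drops the packet and reports this MTU via ICMP
    return None  # packet reached the destination


def discover_pmtu(local_mtu: int, path_mtus: list[int]) -> tuple[int, int]:
    """Return (path MTU, number of transmissions needed to learn it)."""
    pmtu = local_mtu
    attempts = 0
    while True:
        attempts += 1
        reported = first_blocking_mtu(pmtu, path_mtus)
        if reported is None:
            return pmtu, attempts
        pmtu = reported  # record the smaller MTU and retransmit at that size


pmtu, attempts = discover_pmtu(1500, [1500, 1400, 1200])
print(pmtu, attempts, pmtu - HEADERS)  # 1200 3 1160
```

In this run the sender loses two transmissions (one per smaller link) before the third gets through, after which it segments with MSS = 1160.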

Conclusion

Data segmentation in TCP prevents fragmentation at the IP layer, and during retransmission, only the small segments of data after segmentation are retransmitted.
TCP segmentation uses MSS, while IP fragmentation uses MTU.
MSS is calculated based on MTU and can change during the three-way handshake and message sending.
IP fragmentation is a last resort and should be avoided as much as possible, especially fragmentation by intermediate devices on the link. For this reason, IPv6 prohibits fragmentation by intermediate nodes: only the source node may fragment, and reassembly happens at the destination.
After establishing a connection, if the MTU value of nodes on the path changes, the sender can update the MTU value through PMTU discovery. In this case, PMTU discovery sacrifices N transmission opportunities to obtain PMTU, and TCP can ensure reliability through retransmission, while in UDP, messages may be directly lost.