Building a TCP/IP Stack: Part 1 – Ethernet & ARP

This article is a translation of “Let’s code a TCP/IP stack, 1: Ethernet & ARP”. The translation process is also a learning experience for myself.

Writing your own TCP/IP stack may seem like a daunting task. In fact, TCP has accumulated many specifications over its more than thirty years of life. However, the core specifications appear to be quite compact—the important parts are TCP header parsing, state machines, congestion control, and retransmission timeout calculations. The most common Layer 2 and Layer 3 protocols, Ethernet and IP, pale in comparison to the complexity of TCP. In this blog series, we will implement a minimal user-space TCP/IP stack for Linux. The purpose of these posts and the software is purely educational—to learn about networking and system programming at a deeper level.

1. TUN/TAP Devices

To intercept low-level network traffic from the Linux kernel, we will use a Linux TAP device. In short, user-space networking applications commonly use TUN devices to handle Layer 3 traffic and TAP devices to handle Layer 2 traffic. A popular example is tunneling, where one packet is wrapped in the payload of another packet.

The advantage of TUN/TAP devices is that they are easy to set up from a user-space program, and they are already used by many applications, such as OpenVPN. Since we want to build a network stack starting from Layer 2, we need a TAP device. We instantiate one like this:

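#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/*
 * Taken from Kernel Documentation/networking/tuntap.txt
 * print_error() and CLEAR() are helpers defined elsewhere in the project.
 */
int tun_alloc(char *dev)
{
    struct ifreq ifr;
    int fd, err;

    /* The kernel documentation opens the clone device /dev/net/tun;
     * use that path unless a /dev/net/tap node exists on your system. */
    if( (fd = open("/dev/net/tap", O_RDWR)) < 0 ) {
        print_error("Cannot open TUN/TAP dev");
        exit(1);
    }

    CLEAR(ifr);

    /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
     *        IFF_TAP   - TAP device
     *
     *        IFF_NO_PI - Do not provide packet information
     */
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    if( *dev ) {
        /* ifr was zeroed above, so copying at most IFNAMSIZ - 1 bytes
         * keeps the name NUL-terminated. */
        strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
    }

    if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
        print_error("ERR: Could not ioctl tun: %s\n", strerror(errno));
        close(fd);
        return err;
    }

    /* The kernel may have chosen the name; report it back to the caller. */
    strcpy(dev, ifr.ifr_name);
    return fd;
}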

After this, you can use the returned file descriptor fd to read data from and write data to the Ethernet buffer of the virtual device. The IFF_NO_PI flag is crucial here; without it, the kernel prepends unnecessary packet information to every Ethernet frame. You can check the kernel source code of the TUN device driver and verify this for yourself.
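As a minimal sketch of using that descriptor, the helper below reads one frame and retries when a signal interrupts the call. The name tap_read_frame is hypothetical and is not part of the original project. It relies on the fact that each read() on a TUN/TAP descriptor returns one complete frame.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the original project).
 * Reads one frame from fd into buf, retrying if a signal interrupts read().
 * Returns the number of bytes read, or -1 on error with errno set. */
ssize_t tap_read_frame(int fd, unsigned char *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

The buffer should be large enough for a full frame: 14 bytes of Ethernet header plus the interface MTU.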

2. Ethernet Frame Format

The various Ethernet technologies are the backbone connecting computers in Local Area Networks (LANs). Like all physical technologies, the Ethernet standard has evolved significantly since its first version, published in 1980 by Digital Equipment Corporation, Intel, and Xerox.

By today's standards, the first version of Ethernet was slow: about 10 Mb/s, using half-duplex communication, which means a station can either send or receive data, but not both at the same time. This is why a Media Access Control (MAC) protocol had to be added to organize the data flow. To this day, if an Ethernet interface operates in half-duplex mode, Carrier Sense Multiple Access with Collision Detection (CSMA/CD) remains the MAC method.

The 100BASE-T Ethernet standard, which uses twisted-pair cabling, enabled full-duplex communication and higher throughput. The simultaneous rise of Ethernet switches has also rendered CSMA/CD largely obsolete. The different Ethernet standards are maintained by the IEEE 802.3 working group.

Next, we will take a look at the Ethernet frame header. It can be declared as a C structure as follows:

#include <stdint.h>
#include <linux/if_ether.h>

struct eth_hdr
{
    unsigned char dmac[6];
    unsigned char smac[6];
    uint16_t ethertype;
    unsigned char payload[];
} __attribute__((packed));

dmac and smac are self-explanatory fields. They contain the MAC addresses of the communicating parties (destination and source respectively). The ethertype field is a 2-byte field that indicates either the length or the type of the payload. Specifically, if the value of this field is greater than or equal to 1536 (0x0600), it contains the type of the payload (such as IPv4 or ARP). If the value is less than that, it contains the length of the payload.
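The type-or-length rule can be sketched as a small predicate. The function and macro names here are hypothetical. The sketch assumes the field is passed exactly as it appears on the wire, in network (big-endian) byte order, so it is converted with ntohs() before the comparison.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs */

/* Smallest field value that is interpreted as an ethertype. */
#define ETH_TYPE_MIN 1536   /* 0x0600 */

/* Hypothetical helper: field_be is the raw 2-byte field in network byte order.
 * Returns 1 if it holds a payload type, 0 if it holds a payload length. */
int eth_field_is_type(uint16_t field_be)
{
    return ntohs(field_be) >= ETH_TYPE_MIN;
}
```

For example, the IPv4 value 0x0800 and the ARP value 0x0806 are both types, while 1500 is a length.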

An Ethernet frame may also carry tags. These describe the VLAN (Virtual LAN) or QoS (Quality of Service) properties of the frame; an IEEE 802.1Q VLAN tag, for example, is inserted between the source MAC address and the ethertype field. Ethernet frame tags are excluded from our implementation, so the corresponding fields do not appear in our protocol declaration.

The payload field is a flexible array member that marks where the Ethernet frame payload begins. In our case, the payload will be an ARP or IPv4 packet. If the payload is shorter than the minimum required 46 bytes (without tags), padding bytes are appended to the end of the payload to meet the requirement.

We also include the if_ether.h Linux header file to provide a mapping between ethertypes and their hexadecimal values.

Finally, the Ethernet frame format also includes a Frame Check Sequence field at the end of the frame. It holds a Cyclic Redundancy Check (CRC) value that is used to verify the integrity of the frame. We will ignore this field in our implementation.

3. Ethernet Frame Parsing

The packed attribute in the struct declaration is an implementation detail—it is used to indicate to the GNU C compiler not to optimize the struct memory layout for alignment with padding bytes. The use of this attribute stems entirely from the way we parse the protocol buffer, which simply uses the appropriate protocol structure to cast the data buffer.

struct eth_hdr *hdr = (struct eth_hdr *) buf;
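To see what the packed attribute changes, the sketch below declares the same two fields with and without it. The struct and function names are hypothetical. It assumes a compiler that supports the GNU attribute syntax (GCC or Clang) and a platform where uint32_t is normally 4-byte aligned, which is the common case.

```c
#include <stddef.h>
#include <stdint.h>

/* Wire layout: a 6-byte MAC immediately followed by a 4-byte IP address. */
struct wire_packed {
    unsigned char mac[6];
    uint32_t ip;
} __attribute__((packed));

/* Same fields without the attribute: the compiler may insert padding
 * after mac so that ip starts on an aligned offset. */
struct wire_padded {
    unsigned char mac[6];
    uint32_t ip;
};

/* Offset of ip in each layout. */
size_t ip_offset_packed(void) { return offsetof(struct wire_packed, ip); }
size_t ip_offset_padded(void) { return offsetof(struct wire_padded, ip); }
```

In the packed struct, ip sits at offset 6, matching the bytes on the wire. In the padded struct it typically moves to offset 8, so casting a received buffer to that type would read the wrong bytes.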

A portable (though somewhat cumbersome) alternative is to serialize the protocol data manually, field by field. The in-memory struct then no longer has to match the wire layout, so the compiler is free to add padding bytes to satisfy the alignment requirements of different processors.

The overall scheme for parsing and processing incoming Ethernet frames is simple:
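A minimal sketch of the manual approach, assuming an untagged frame: each field is copied out of the byte buffer explicitly, and the 2-byte ethertype is assembled from its big-endian bytes. The type eth_fields and the function eth_parse are hypothetical names and do not come from the project.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14   /* 6 + 6 + 2 bytes, no tags */

/* Parsed view of a frame; this struct does not need to be packed. */
struct eth_fields {
    uint8_t        dmac[6];
    uint8_t        smac[6];
    uint16_t       ethertype;     /* host byte order */
    const uint8_t *payload;       /* points into the caller's buffer */
    size_t         payload_len;
};

/* Returns 0 on success, -1 if buf is too short to hold an Ethernet header. */
int eth_parse(const uint8_t *buf, size_t len, struct eth_fields *out)
{
    if (len < ETH_HDR_LEN)
        return -1;

    memcpy(out->dmac, buf, 6);
    memcpy(out->smac, buf + 6, 6);
    /* Big-endian on the wire: high byte first. */
    out->ethertype = (uint16_t)((buf[12] << 8) | buf[13]);
    out->payload = buf + ETH_HDR_LEN;
    out->payload_len = len - ETH_HDR_LEN;
    return 0;
}
```

This costs a few copies per frame, but it works regardless of struct padding, alignment, or host byte order.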

if (tun_read(buf, BUFLEN) < 0) {
    print_error("ERR: Read from tun_fd: %s\n", strerror(errno));
}

struct eth_hdr *hdr = init_eth_hdr(buf);

handle_frame(&netdev, hdr);

The handle_frame function only looks at the ethertype field of the Ethernet header and decides its next action based on that value.
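A dispatch of this kind might look like the sketch below. It is not the project's handle_frame; the enum and function names are hypothetical, and the two constants mirror the values that if_ether.h exposes as ETH_P_IP and ETH_P_ARP. The field is assumed to arrive in network byte order.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs */

#define ETHERTYPE_IPV4_VAL 0x0800   /* same value as ETH_P_IP */
#define ETHERTYPE_ARP_VAL  0x0806   /* same value as ETH_P_ARP */

enum frame_action { FRAME_ARP, FRAME_IPV4, FRAME_DROP };

/* Decide what to do with a frame based only on its ethertype field. */
enum frame_action classify_frame(uint16_t ethertype_be)
{
    switch (ntohs(ethertype_be)) {
    case ETHERTYPE_ARP_VAL:
        return FRAME_ARP;
    case ETHERTYPE_IPV4_VAL:
        return FRAME_IPV4;
    default:
        /* Anything else (IPv6, for example) is ignored by this sketch. */
        return FRAME_DROP;
    }
}
```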

4. Address Resolution Protocol

ARP (Address Resolution Protocol) is used to dynamically map a protocol address (such as an IPv4 address) to a 48-bit Ethernet address (MAC address). The key point here is that ARP can serve various L3 protocols: not just IPv4, but also others, such as CHAOS, which declares a 16-bit protocol address.

Typically, you know the IP address of some service in the LAN, but to establish actual communication you also need to know its hardware address (MAC). ARP is therefore used to broadcast a query on the network, asking the owner of an IP address to report its hardware address.

The format of an ARP message is relatively simple:

struct arp_hdr
{
    uint16_t hwtype;
    uint16_t protype;
    unsigned char hwsize;
    unsigned char prosize;
    uint16_t opcode;
    unsigned char data[];
} __attribute__((packed));

The ARP header (arp_hdr) starts with a 2-byte hwtype field that identifies the link layer type in use. In our case this is Ethernet, and the value is 0x0001.

The protype field indicates the protocol type. In our case this is IPv4, which is indicated by the value 0x0800.

The hwsize and prosize fields are each 1 byte long and contain the sizes of the hardware and protocol addresses respectively. In our case, these are 6 bytes for the MAC address and 4 bytes for the IP address.

The 2-byte opcode field declares the type of the ARP message. It can be an ARP request (1), ARP reply (2), RARP request (3), or RARP reply (4).

The data field contains the actual payload of the ARP message, which in our case holds IPv4-specific information:

struct arp_ipv4
{
    unsigned char smac[6];
    uint32_t sip;
    unsigned char dmac[6];
    uint32_t dip;
} __attribute__((packed));

These fields are easy to understand. smac and dmac are the 6-byte MAC addresses of the sender and receiver respectively. sip and dip contain the IP addresses of the sender and receiver respectively.
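Before trusting those fields, a receiver can check that a message really is Ethernet/IPv4 ARP and that the buffer is long enough. The sketch below repeats the two declarations so that it is self-contained. The function arp_is_eth_ipv4 is a hypothetical name, and the header fields are assumed to be in network byte order as received.

```c
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>   /* ntohs */

struct arp_hdr {
    uint16_t hwtype;
    uint16_t protype;
    unsigned char hwsize;
    unsigned char prosize;
    uint16_t opcode;
    unsigned char data[];
} __attribute__((packed));

struct arp_ipv4 {
    unsigned char smac[6];
    uint32_t sip;
    unsigned char dmac[6];
    uint32_t dip;
} __attribute__((packed));

/* Returns 1 if buf holds a complete Ethernet/IPv4 ARP message, 0 otherwise. */
int arp_is_eth_ipv4(const unsigned char *buf, size_t len)
{
    const struct arp_hdr *hdr = (const struct arp_hdr *) buf;

    /* 8-byte header plus 20-byte IPv4 payload; check before reading fields. */
    if (len < sizeof(struct arp_hdr) + sizeof(struct arp_ipv4))
        return 0;

    return ntohs(hdr->hwtype) == 0x0001 &&    /* Ethernet */
           ntohs(hdr->protype) == 0x0800 &&   /* IPv4 */
           hdr->hwsize == 6 &&
           hdr->prosize == 4;
}
```

The cast mirrors the parsing style used in this article, so it depends on the packed attribute in the same way.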

5. Address Resolution Algorithm

The original specification describes this simple address resolution algorithm:

?Do I have the hardware type in ar$hrd?
Yes: (almost definitely)
  [optionally check the hardware length ar$hln]
  ?Do I speak the protocol in ar$pro?
  Yes:
    [optionally check the protocol length ar$pln]
    Merge_flag := false
    If the pair <protocol type, sender protocol address> is
        already in my translation table, update the sender
        hardware address field of the entry with the new
        information in the packet and set Merge_flag to true.
    ?Am I the target protocol address?
    Yes:
      If Merge_flag is false, add the triplet <protocol type,
          sender protocol address, sender hardware address> to
          the translation table.
      ?Is the opcode ares_op$REQUEST?  (NOW look at the opcode!!)
      Yes:
        Swap hardware and protocol fields, putting the local
            hardware and protocol addresses in the sender fields.
        Set the ar$op field to ares_op$REPLY
        Send the packet to the (new) target hardware address on
            the same hardware on which the request was received.

In other words, the translation table is used to store the results of ARP so that hosts can check if they already have this entry in their cache. This avoids sending redundant ARP requests over the network.

This algorithm is implemented in arp.c.
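Independently of the project's arp.c, the steps of the algorithm can be sketched as follows. Everything here is an illustrative assumption: the arp_msg type holds an already-parsed message in host byte order, the translation table is a small fixed-size array, and arp_receive rewrites the message in place instead of sending it.

```c
#include <stdint.h>
#include <string.h>

#define ARP_ETHERNET 0x0001
#define ARP_IPV4     0x0800
#define ARP_REQUEST  1
#define ARP_REPLY    2
#define CACHE_LEN    32

/* Hypothetical parsed ARP message, all fields in host byte order. */
struct arp_msg {
    uint16_t hwtype, protype, opcode;
    uint8_t  smac[6];
    uint32_t sip;
    uint8_t  dmac[6];
    uint32_t dip;
};

struct arp_entry {
    int      used;
    uint16_t protype;
    uint32_t sip;
    uint8_t  smac[6];
};

static struct arp_entry cache[CACHE_LEN];

/* Look up <protype, ip>; returns the stored MAC or NULL if unknown. */
const uint8_t *arp_cache_lookup(uint16_t protype, uint32_t ip)
{
    for (int i = 0; i < CACHE_LEN; i++)
        if (cache[i].used && cache[i].protype == protype && cache[i].sip == ip)
            return cache[i].smac;
    return NULL;
}

/* Update an existing entry for the sender; returns 1 if one was found. */
static int cache_update(const struct arp_msg *m)
{
    for (int i = 0; i < CACHE_LEN; i++) {
        if (cache[i].used && cache[i].protype == m->protype &&
            cache[i].sip == m->sip) {
            memcpy(cache[i].smac, m->smac, 6);
            return 1;
        }
    }
    return 0;
}

/* Store the sender in the first free slot; silently drops it if full. */
static void cache_insert(const struct arp_msg *m)
{
    for (int i = 0; i < CACHE_LEN; i++) {
        if (!cache[i].used) {
            cache[i].used = 1;
            cache[i].protype = m->protype;
            cache[i].sip = m->sip;
            memcpy(cache[i].smac, m->smac, 6);
            return;
        }
    }
}

/* Returns 1 if m was rewritten into a reply that should be sent, else 0. */
int arp_receive(struct arp_msg *m, uint32_t my_ip, const uint8_t my_mac[6])
{
    if (m->hwtype != ARP_ETHERNET)   /* Do I have the hardware type? */
        return 0;
    if (m->protype != ARP_IPV4)      /* Do I speak the protocol? */
        return 0;

    int merge_flag = cache_update(m);

    if (m->dip != my_ip)             /* Am I the target protocol address? */
        return 0;
    if (!merge_flag)
        cache_insert(m);
    if (m->opcode != ARP_REQUEST)    /* Only now look at the opcode. */
        return 0;

    /* Swap: the old sender becomes the target, we become the sender. */
    memcpy(m->dmac, m->smac, 6);
    m->dip = m->sip;
    memcpy(m->smac, my_mac, 6);
    m->sip = my_ip;
    m->opcode = ARP_REPLY;
    return 1;
}
```

Note that a message addressed to another host only refreshes an entry that already exists; new entries are added only when we are the target, exactly as in the specification's ordering.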

Finally, the ultimate test of an ARP implementation is to see whether it responds correctly to ARP requests:

[saminiir@localhost lvl-ip]$ arping -I tap0 10.0.0.4
ARPING 10.0.0.4 from 192.168.1.32 tap0
Unicast reply from 10.0.0.4 [00:0C:29:6D:50:25]  3.170ms
Unicast reply from 10.0.0.4 [00:0C:29:6D:50:25]  13.309ms

[saminiir@localhost lvl-ip]$ arp
Address                  HWtype  HWaddress           Flags Mask            Iface
10.0.0.4                 ether   00:0c:29:6d:50:25   C                     tap0

The kernel’s network stack recognizes the ARP replies from our custom network stack and fills its ARP cache with an entry for our virtual network device. Success!

6. Conclusion

The minimal implementation of Ethernet frame processing and ARP is relatively simple and can be accomplished in just a few lines of code. However, the payoff is quite high, as you can populate the ARP cache of a Linux host with your own virtual Ethernet device! The source code for the project can be found on GitHub. In the next article, we will continue with ICMP echo and reply (ping) and IPv4 packet parsing.

7. References

  1. https://tools.ietf.org/html/rfc7414

  2. http://ethernethistory.typepad.com/papers/EthernetSpec.pdf

  3. https://en.wikipedia.org/wiki/IEEE_802.3

  4. https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html#Common-Type-Attributes

  5. https://github.com/chobits/tapip
