Understanding Docker Core Technologies and Implementation Principles

When virtualization technology is mentioned, Docker is usually the first thing that comes to mind. After four years of rapid development, Docker is widely used in the production environments of many companies and is no longer just a toy for the development phase. As a product deployed at scale, Docker has a mature community, a large user base, and an extensive code repository.

At the same time, as the project has grown, functionality has been split out, and components have been renamed in various PRs, it has become harder to understand Docker's overall architecture.

Although Docker currently has many components and its implementation is very complex, this article does not intend to delve into the specific implementation details of Docker; we want to discuss the core technologies that support the emergence of Docker as a virtualization technology.

Docker emerged because backend development genuinely needed a virtualization technology that could keep the development and production environments consistent. With Docker, the environment a program runs in can itself be placed under version control, eliminating results that differ only because of environmental differences. But demand alone does not produce a finished product: without suitable underlying technologies, we still could not have obtained such a polished tool. The rest of this article introduces the core technologies Docker relies on; once we understand how they are used and how they work, Docker's implementation principles become clear.

Namespaces

Namespaces are methods provided by Linux to isolate resources such as process trees, network interfaces, mount points, and inter-process communication. In daily use of Linux or macOS, we do not need to run multiple completely isolated servers, but if we start multiple services on a server, these services will actually affect each other. Each service can see the processes of other services and can access any files on the host machine, which we often do not want to see. We prefer that different services running on the same machine can achieve complete isolation, just like running on multiple different machines.

In this case, once one of the services on the server is compromised, the intruder can access all services and files on the current machine, which is also something we do not want to see. Docker achieves isolation between different containers through Linux’s Namespaces.

The Linux namespace mechanism provides the following seven different namespaces, including CLONE_NEWCGROUP, CLONE_NEWIPC, CLONE_NEWNET, CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS. Through these seven options, we can set which resources the new process should be isolated from the host machine when creating a new process.
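As a rough illustration, these seven options are plain bit masks that a runtime ORs together into the flags argument of clone(2). The sketch below is illustrative Python, not kernel code; the numeric values mirror those in Linux's sched.h, and a real program would use a libc binding rather than redefining them:

```python
# Illustrative only: constants mirror the values in Linux's <sched.h>.
CLONE_NEWNS     = 0x00020000  # mount points
CLONE_NEWCGROUP = 0x02000000  # cgroup root directory
CLONE_NEWUTS    = 0x04000000  # hostname and NIS domain name
CLONE_NEWIPC    = 0x08000000  # System V IPC, POSIX message queues
CLONE_NEWUSER   = 0x10000000  # user and group IDs
CLONE_NEWPID    = 0x20000000  # process IDs
CLONE_NEWNET    = 0x40000000  # network devices, stacks, ports

# A container runtime typically requests several namespaces at once:
flags = CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWNET

def isolated(flags: int, flag: int) -> bool:
    """Return True if `flag` is part of the combined mask."""
    return flags & flag == flag

print(isolated(flags, CLONE_NEWPID))   # True: a new PID namespace was requested
print(isolated(flags, CLONE_NEWUSER))  # False: user IDs are shared with the host
```

Passing such a combined mask to clone(2) is exactly how a new process is detached from the host's view of each resource.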

Processes

The process is a core concept in Linux and other modern operating systems: it represents a program in execution and is the unit of work in a modern time-sharing system. On every *nix operating system, we can use the ps command to print the processes currently executing. For example, on Ubuntu, the command yields the following result:

$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Apr08 ?        00:00:09 /sbin/init
root         2     0  0 Apr08 ?        00:00:00 [kthreadd]
root         3     2  0 Apr08 ?        00:00:05 [ksoftirqd/0]
root         5     2  0 Apr08 ?        00:00:00 [kworker/0:0H]
root         7     2  0 Apr08 ?        00:07:10 [rcu_sched]
root        39     2  0 Apr08 ?        00:00:00 [migration/0]
root        40     2  0 Apr08 ?        00:01:54 [watchdog/0]
...

There are many processes executing on the machine. Among those above, two are special: PID 1, the /sbin/init process, and PID 2, the kthreadd process. Both are created by the kernel's idle process (PID 0): the former performs part of the kernel's initialization and system configuration and also spawns login processes such as getty, while the latter manages and schedules the other kernel threads.

If we run a new Docker container on this Linux machine, enter its bash shell via docker exec, and print all the processes inside it, we get the following result:

root@iZ255w13cy6Z:~# docker run -it -d ubuntu
b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79
root@iZ255w13cy6Z:~# docker exec -it b809a2eb3630 /bin/bash
root@b809a2eb3630:/# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:42 pts/0    00:00:00 /bin/bash
root         9     0  0 15:42 pts/1    00:00:00 /bin/bash
root        17     9  0 15:43 pts/1    00:00:00 ps -ef

Executing ps inside the new container prints a very clean process list: only three processes, including the ps -ef we just ran. The dozens of processes on the host machine have all disappeared.

The current Docker container successfully isolates the processes inside the container from the processes on the host machine. If we print all the current processes on the host machine, we will get the following three results related to Docker:

UID        PID  PPID  C STIME TTY          TIME CMD
root     29407     1  0 Nov16 ?        00:08:38 /usr/bin/dockerd --raw-logs
root      1554 29407  0 Nov19 ?        00:03:28 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metrics-interval=0 --start-timeout 2m --state-dir /var/run/docker/libcontainerd/containerd --shim docker-containerd-shim --runtime docker-runc
root      5006  1554  0 08:38 ?        00:00:00 docker-containerd-shim b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 /var/run/docker/libcontainerd/b809a2eb3630e64c581561b08ac46154878ff1c61c6519848b4a29d412215e79 docker-runc

On the host machine, these Docker-related processes form a process tree: dockerd spawns docker-containerd, which in turn spawns a docker-containerd-shim for each container, and the shim is the parent of the container's processes.

This is achieved by passing CLONE_NEWPID to clone(2) when the container's first process is created, that is, process isolation through Linux namespaces. Any process inside the Docker container is completely unaware of the processes on the host machine. The call chain below shows where this happens inside the Docker daemon:

containerRouter.postContainersStart
└── daemon.ContainerStart
    └── daemon.createSpec
        └── setNamespaces
            └── setNamespace

Docker’s containers achieve process isolation from the host machine using the above technology. Every time we run docker run or docker start, a Spec is created to set up process isolation:

func (daemon *Daemon) createSpec(c *container.Container) (*specs.Spec, error) {
    s := oci.DefaultSpec()

    // ...
    if err := setNamespaces(daemon, &s, c); err != nil {
        return nil, fmt.Errorf("linux spec namespaces: %v", err)
    }

    return &s, nil
}

In the setNamespaces method, not only the process (pid) namespace is configured, but also the namespaces for users, networks, IPC, and UTS:

func setNamespaces(daemon *Daemon, s *specs.Spec, c *container.Container) error {
    // user
    // network
    // ipc
    // uts

    // pid
    if c.HostConfig.PidMode.IsContainer() {
        ns := specs.LinuxNamespace{Type: "pid"}
        pc, err := daemon.getPidContainer(c)
        if err != nil {
            return err
        }
        ns.Path = fmt.Sprintf("/proc/%d/ns/pid", pc.State.GetPID())
        setNamespace(s, ns)
    } else if c.HostConfig.PidMode.IsHost() {
        oci.RemoveNamespace(s, specs.LinuxNamespaceType("pid"))
    } else {
        ns := specs.LinuxNamespace{Type: "pid"}
        setNamespace(s, ns)
    }

    return nil
}

All namespace-related settings in the Spec will finally be passed as parameters in the Create function when creating a new container:

daemon.containerd.Create(context.Background(), container.ID, spec, createOptions)

All the settings related to namespaces are completed in the above two functions. Docker successfully achieves process and network isolation from the host machine through namespaces.

Network

Docker containers achieve network isolation from the host through Linux namespaces, but an isolated network that could never reach the outside world through the host would be severely limited. So although Docker can create an isolated network environment with namespaces, services inside Docker still need a way to communicate with the outside world to be useful.

Every container started using docker run actually has its own network namespace. Docker provides us with four different network modes: Host, Container, None, and Bridge modes.

In this section, we will introduce Docker's default networking mode: bridge mode. In this mode, besides allocating an isolated network namespace to each container, Docker also assigns the containers IP addresses. When the Docker daemon starts on the host, it creates a new virtual bridge called docker0, and all containers subsequently started on that host connect to this bridge.

By default, a pair of virtual network interfaces (a veth pair) is created for each container. The two ends form a data channel: one end is placed inside the container, and the other is attached to the docker0 bridge on the host. We can check the interfaces currently attached to the bridge with the following command:

$ brctl show
bridge name bridge id       STP enabled interfaces
docker0     8000.0242a6654980   no      veth3e84d4f
                                        veth9953b75

docker0 assigns each container a new IP address and sets docker0's own address as the containers' default gateway. The bridge is wired to the host's network interface through iptables rules: matching requests are forwarded to docker0 by iptables and then delivered to the corresponding container by the bridge.
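The address assignment can be sketched as follows. This is a toy model, not Docker's actual IPAM code, and it reuses the 192.168.0.0/24 subnet from this article's examples (a stock Docker install may use a different subnet, such as 172.17.0.0/16):

```python
# Toy model of bridge-mode address assignment on the docker0 subnet.
import ipaddress

subnet = ipaddress.ip_network("192.168.0.0/24")
hosts = subnet.hosts()  # usable host addresses, .1 through .254

gateway = next(hosts)                         # docker0 itself takes .1
containers = [next(hosts) for _ in range(3)]  # .2, .3, .4 for three containers

print(gateway)        # 192.168.0.1 -> set as every container's default gateway
print(containers[2])  # 192.168.0.4 -> the address the Redis container gets below
```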

$ iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  anywhere             anywhere             ADDRTYPE match dst-type LOCAL

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  anywhere             anywhere

Suppose we start a new Redis container with docker run -d -p 6379:6379 redis. Checking the iptables NAT configuration again, we see a new rule appear in the DOCKER chain:

DNAT       tcp  --  anywhere             anywhere             tcp dpt:6379 to:192.168.0.4:6379

The above rule forwards TCP packets sent from any source to port 6379 on the current machine on to 192.168.0.4:6379.
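The behavior of that rule can be modeled in a few lines. This is an illustrative simulation of what DNAT does to a packet, not the real netfilter implementation:

```python
# Illustrative model (not real iptables): a DNAT rule rewrites the
# destination of packets matching a protocol and destination port.
def dnat(packet: dict, rules: list) -> dict:
    """Apply the first matching DNAT rule to a packet, or pass it through."""
    for rule in rules:
        if packet["proto"] == rule["proto"] and packet["dport"] == rule["dport"]:
            return {**packet, "dst": rule["to_addr"], "dport": rule["to_port"]}
    return packet

# The rule installed for `docker run -p 6379:6379 redis`:
rules = [{"proto": "tcp", "dport": 6379, "to_addr": "192.168.0.4", "to_port": 6379}]

pkt = {"proto": "tcp", "dst": "127.0.0.1", "dport": 6379}
print(dnat(pkt, rules)["dst"])  # 192.168.0.4: rewritten toward the container
```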

This address is actually the IP address assigned by Docker for the Redis service. If we directly ping this IP address on the current machine, we will find it is accessible:

$ ping 192.168.0.4
PING 192.168.0.4 (192.168.0.4) 56(84) bytes of data.
64 bytes from 192.168.0.4: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.0.4: icmp_seq=2 ttl=64 time=0.043 ms
^C
--- 192.168.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.043/0.056/0.069/0.013 ms

From this series of observations, we can infer how Docker exposes a container's internal ports and forwards data packets: when a container needs to expose a service, Docker assigns the container an IP address and appends a new DNAT rule to iptables.

When we access 127.0.0.1:6379 from the host's command line with redis-cli, iptables rewrites the packets' destination address to 192.168.0.4. The redirected packets then pass the checks in the FILTER table, and on the return path the address is rewritten again so that, to the client, the reply appears to come from 127.0.0.1. So although it looks like we are requesting 127.0.0.1:6379 from outside, we are actually reaching the port exposed by the Docker container.

$ redis-cli -h 127.0.0.1 -p 6379 ping
PONG

Docker achieves network isolation through Linux namespaces and forwards data packets through iptables, allowing Docker containers to gracefully provide services to the host machine or other containers.

libnetwork

The entire networking functionality is implemented in libnetwork, a component split out of Docker, which provides a way to connect different containers and also offers applications a consistent programming interface and network-layer abstraction called the Container Network Model (CNM).

The goal of libnetwork is to deliver a robust Container Network Model that provides a consistent programming interface and the required network abstractions for applications.

The Container Network Model, the most important concept in libnetwork, is built from three main components: the Sandbox, the Endpoint, and the Network.

In the Container Network Model, each container contains a Sandbox, which stores the network stack configuration of the current container, including the container’s interfaces, routing tables, and DNS settings. Linux uses network namespaces to implement this Sandbox. Each Sandbox may have one or more Endpoints, which are virtual network cards (veth) on Linux. The Sandbox joins the corresponding network through the Endpoint, which may be the Linux bridge or VLAN mentioned above.
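The relationships between the three concepts can be sketched with a toy model. The class and field names below only mirror CNM terminology; they are not libnetwork's actual Go API:

```python
# Toy model of the Container Network Model: a Sandbox (one per container)
# joins a Network through an Endpoint (one end of a veth pair).
from dataclasses import dataclass, field

@dataclass
class Network:
    name: str                      # e.g. the default "bridge" network (docker0)
    endpoints: list = field(default_factory=list)

@dataclass
class Endpoint:
    name: str                      # e.g. a veth interface name

@dataclass
class Sandbox:
    container_id: str              # holds one container's network stack config
    endpoints: list = field(default_factory=list)

    def join(self, network: Network, endpoint: Endpoint) -> None:
        """A Sandbox joins a Network through an Endpoint."""
        self.endpoints.append(endpoint)
        network.endpoints.append(endpoint)

bridge = Network("bridge")
sandbox = Sandbox("b809a2eb3630")
sandbox.join(bridge, Endpoint("veth3e84d4f"))
print(len(bridge.endpoints))  # 1: the container is now attached to the bridge
```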

For more on libnetwork or the Container Network Model, you can read Design · libnetwork, or read the source code to learn how the model is implemented on different operating systems.

Mount Points

Linux namespaces have solved process and network isolation: inside a Docker container we can no longer see the host's other processes, and network access is restricted. But processes in the container can still access or modify other directories on the host machine, which is not something we want.

To create isolated mount point namespaces in a new process, we need to pass CLONE_NEWNS in the clone function. This way, the child process will get a copy of the parent process’s mount points. If this parameter is not passed, the child process’s read and write to the file system will synchronize back to the parent process and the entire host’s file system.

If a container needs to start, it must provide a root file system (rootfs). The container needs to use this file system to create a new process, and all binary executions must occur within this root file system.

To successfully start a container, several specific file systems, such as /proc and /sys, must be mounted inside the rootfs. Beyond these, some symbolic links must also be established so that system I/O works correctly.

To ensure that the container's processes cannot access other directories on the host machine, the root of the directory tree visible to the process must also be changed, using pivot_root or chroot as libcontainer does.

// pivot_root
put_old = mkdir(...);            // create a directory to hold the old root
pivot_root(rootfs, put_old);     // make rootfs the new root; old root moves to put_old
chdir("/");                      // switch into the new root
umount2(put_old, MNT_DETACH);    // lazily unmount the old root
rmdir(put_old);                  // remove the now-empty mount point

// chroot
mount(rootfs, "/", NULL, MS_MOVE, NULL);  // move the rootfs mount onto /
chroot(".");                              // change the process's root directory
chdir("/");                               // and switch into it

At this point, we have mounted the directories required by the container while also prohibiting the current container process from accessing other directories on the host machine, ensuring isolation of different file systems.

This description comes from libcontainer's SPEC.md, which documents the file system Docker uses. As for whether Docker really uses chroot to keep the current process from accessing host directories, the author does not have a definite answer: the Docker codebase is too large to know where to start, and searching online turned up both unanswered questions and answers that contradict the SPEC. If any reader has a clear answer, please leave a comment below the blog, thank you very much.

chroot

Here we should briefly introduce chroot (change root). In Linux, paths are resolved starting from /, the system's root directory. chroot changes what a process sees as the root directory, which can be used to restrict users: inside the new root, the structure and files of the old root cannot be accessed, establishing a directory tree completely isolated from the original system.

Content related to chroot comes from the article Understanding chroot. Readers can read this article for more detailed information.
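The path confinement that chroot provides can be modeled as follows. This sketch only captures the semantics, namely that absolute paths are re-rooted and ".." cannot climb above the new root; it is not the actual syscall:

```python
# Illustrative model of chroot-style path resolution (not the syscall).
import posixpath

def resolve(new_root: str, path: str) -> str:
    """Resolve `path` as a chroot-ed process would see it."""
    # Normalize inside the jail first, so ".." collapses at "/" and
    # cannot escape, then graft the result under the new root.
    normalized = posixpath.normpath("/" + path.lstrip("/"))
    return posixpath.join(new_root, normalized.lstrip("/"))

print(resolve("/containers/rootfs", "/etc/passwd"))
# /containers/rootfs/etc/passwd
print(resolve("/containers/rootfs", "/../../etc/passwd"))
# /containers/rootfs/etc/passwd  (the ".." cannot climb above "/")
```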

CGroups

We have used Linux namespaces to isolate newly created processes' file systems, networks, and process trees, but namespaces cannot isolate physical resources such as CPU or memory. If multiple 'containers' run on the same machine, unaware of each other and of the host, they still jointly consume the host's physical resources.

If one of these containers is executing a CPU-intensive task, it affects the performance and efficiency of tasks in the other containers: the containers influence each other and contend for resources. Once virtual resources such as processes are isolated, limiting each container's share of physical resources becomes the main remaining problem, and Control Groups (CGroups for short) are what isolate physical resources on the host machine, such as CPU, memory, disk I/O, and network bandwidth.

Each CGroup is a group of processes bound by the same standards and parameters, and CGroups form a hierarchy: a child group can inherit the standards and parameters that limit resource usage from its parent.
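The inheritance relationship can be sketched with a toy model. The kernel actually exposes cgroups as a file system rather than as objects like these, so treat this purely as an illustration:

```python
# Toy model of hierarchical cgroups: a child group falls back to its
# parent for any limit it does not set itself.
class CGroup:
    def __init__(self, name, parent=None, limits=None):
        self.name = name
        self.parent = parent
        self.limits = limits or {}

    def limit(self, key):
        """Look up a limit, inheriting from ancestors when unset."""
        if key in self.limits:
            return self.limits[key]
        return self.parent.limit(key) if self.parent else None

# A "docker" parent group with one child group per container:
docker = CGroup("docker", limits={"cpu.cfs_quota_us": 100000})
child = CGroup("9c3057f1291b", parent=docker,
               limits={"memory.limit_in_bytes": 2 ** 30})

print(child.limit("cpu.cfs_quota_us"))       # 100000: inherited from the parent
print(child.limit("memory.limit_in_bytes"))  # 1073741824: set on the child itself
```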

Linux's CGroup can allocate resources, such as the CPU time, memory, and network bandwidth mentioned above, to a group of processes. Through this allocation, CGroup provides functions such as resource limiting, prioritization, accounting, and process control.

In CGroup terminology, a task is a system process, and a CGroup is a set of processes grouped according to some criterion. In the CGroup mechanism, all resource control is implemented at the level of the CGroup, and any process can join or leave a CGroup at any time.

— Introduction to CGroup, application examples, and principle description

Linux uses the file system to implement CGroup. We can directly use the following command to see which subsystems are present in the current CGroup:

$ lssubsys -m
cpuset /sys/fs/cgroup/cpuset
cpu /sys/fs/cgroup/cpu
cpuacct /sys/fs/cgroup/cpuacct
memory /sys/fs/cgroup/memory
devices /sys/fs/cgroup/devices
freezer /sys/fs/cgroup/freezer
blkio /sys/fs/cgroup/blkio
perf_event /sys/fs/cgroup/perf_event
hugetlb /sys/fs/cgroup/hugetlb

Most Linux distributions have very similar subsystems, and the reason the above cpuset, cpu, etc. are called subsystems is that they can allocate resources for the corresponding control group and limit resource usage.

To create a new cgroup, we only need to create a new folder under the desired subsystem's directory; the kernel automatically populates it with the subsystem's control files. If Docker is installed on the Linux machine, you will find a folder named docker under each subsystem's directory:

$ ls cpu
cgroup.clone_children
...
cpu.stat
docker
notify_on_release
release_agent
tasks

$ ls cpu/docker/
9c3057f1291b53fd54a3d12023d2644efe6a7db6ddf330436ae73ac92d401cf1
cgroup.clone_children
...
cpu.stat
notify_on_release
release_agent
tasks

9c3057xxx is one of our running Docker containers: when the container starts, Docker creates a CGroup with the same identifier as the container. On the host, this CGroup sits below the docker group, which in turn sits below the root group of each subsystem.

Each CGroup has a tasks file listing the PIDs of all processes belonging to the control group. In the CPU subsystem, the cpu.cfs_quota_us file limits CPU usage: if it contains 50000, measured against the default scheduling period of 100000 microseconds in cpu.cfs_period_us, then the processes in the control group cannot together exceed 50% of one CPU.
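The arithmetic behind that 50% figure is simply the quota divided by the period. A minimal sketch, assuming the default 100000-microsecond period:

```python
# CFS bandwidth arithmetic behind cpu.cfs_quota_us: the group may run
# `quota` microseconds of CPU time in every `period` microseconds.
def cpu_cap_percent(cfs_quota_us: int, cfs_period_us: int = 100000) -> float:
    """Maximum CPU usage, in percent of one core, for a control group."""
    if cfs_quota_us < 0:          # -1 in the cgroup file means "unlimited"
        return float("inf")
    return 100.0 * cfs_quota_us / cfs_period_us

print(cpu_cap_percent(50000))    # 50.0  -> the 50% cap described above
print(cpu_cap_percent(200000))   # 200.0 -> up to two full cores
```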

If the system administrator wants to control the resource usage of a specific Docker container, they can find the corresponding child control group under the docker parent control group and change the content of the corresponding files. Of course, we can also directly use parameters at runtime to allow the Docker process to change the content of the corresponding files.

$ docker run -it -d --cpu-quota=50000 busybox
53861305258ecdd7f5d2a3240af694aec9adb91cd4c7e210b757f71153cdd274
$ cd /sys/fs/cgroup/cpu/docker/53861305258ecdd7f5d2a3240af694aec9adb91cd4c7e210b757f71153cdd274/
$ ls
cgroup.clone_children  cgroup.event_control  cgroup.procs  cpu.cfs_period_us  cpu.cfs_quota_us  cpu.shares  cpu.stat  notify_on_release  tasks
$ cat cpu.cfs_quota_us
50000

When we stop a running container using Docker, the corresponding folder for Docker’s child control group will also be removed by the Docker process. Docker’s use of CGroup is essentially just creating folders and changing file contents, but the use of CGroup does indeed solve the problem of limiting resource usage of child containers. System administrators can reasonably allocate resources for multiple containers without the issue of multiple containers competing for resources.

UnionFS

Linux namespaces and control groups solve different resource isolation problems. The former solves the isolation of processes, networks, and file systems, while the latter achieves isolation of CPU, memory, and other resources. However, there is still a very important problem that needs to be solved in Docker – that is the image.

What exactly is an image, and how is it composed and organized? This has been a source of confusion for the author since using Docker. We can easily download Docker images from remote sources and run them locally using docker run.

Docker images are essentially compressed packages. We can use the following command to export files from a Docker image:

$ mkdir rootfs && docker export $(docker create busybox) | tar -C rootfs -xvf -
$ ls rootfs
bin  dev  etc  home proc root sys  tmp  usr  var

You can see that the directory structure in this busybox image is not much different from the content in the root directory of a Linux operating system. It can be said that Docker images are just files.

Storage Drivers

Docker uses a series of different storage drivers to manage the file systems inside images and to run containers. Storage drivers are distinct from Docker volumes: the storage engine manages image storage, which can be shared among multiple containers.

To understand the storage drivers used by Docker, we first need to understand how Docker builds and stores images, and we also need to understand how Docker images are used by each container. Each image in Docker consists of a series of read-only layers, and every command in the Dockerfile creates a new layer on top of existing read-only layers:

FROM ubuntu:15.04
COPY . /app
RUN make /app
CMD python /app/app.py

Each layer records only a small set of modifications relative to the layer beneath it. The Dockerfile above builds an image with four layers, one per instruction.

When a container is created from the image by docker run, a new writable layer, the container layer, is added on top of the image's layers. All modifications made to the running container are actually writes to this read-write container layer.

The difference between a container and an image is that all images are read-only, while each container is essentially an image plus a writable layer. This means that the same image can correspond to multiple containers.

AUFS

UnionFS is a file system service for Linux that allows multiple file systems to be 'united' at the same mount point. AUFS (Advanced UnionFS) is an improved version of UnionFS that provides better performance and efficiency.

As a union file system, AUFS combines the contents of different directories into a single merged directory. The combined directories are called branches in AUFS, and the combining process is called a union mount.
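A union mount's lookup and copy-on-write behavior can be mimicked with layered dictionaries. This toy model is only an analogy for how AUFS resolves files across branches; it is not how the driver is implemented:

```python
# Toy model of a union mount: each layer is a dict of path -> content.
# Lookups fall through from the top (writable) layer to the read-only
# layers below; writes only ever touch the top layer (copy-on-write).
from collections import ChainMap

base_layer = {"/bin/sh": "busybox v1", "/etc/os-release": "ubuntu 15.04"}
app_layer = {"/app/app.py": "print('hi')"}
container_layer = {}  # the writable layer added by `docker run`

rootfs = ChainMap(container_layer, app_layer, base_layer)

rootfs["/bin/sh"] = "busybox v2"  # the write lands in the top layer only
print(rootfs["/bin/sh"])          # busybox v2: the container layer wins
print(base_layer["/bin/sh"])      # busybox v1: the image layer is untouched
print(rootfs["/etc/os-release"])  # ubuntu 15.04: fell through to the base
```

Because the read-only layers are never modified, many containers can share the same image layers on disk, which is exactly why layered images save space.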

With AUFS, each image layer and each container layer is a subdirectory under /var/lib/docker/, and the contents of all layers are stored in the /var/lib/docker/aufs/diff/ directory:

$ ls /var/lib/docker/aufs/diff/00adcccc1a55a36a610a6ebb3e07cc35577f2f5a3b671be3dbc0e74db9ca691c       93604f232a831b22aeb372d5b11af8c8779feb96590a6dc36a80140e38e764d8
00adcccc1a55a36a610a6ebb3e07cc35577f2f5a3b671be3dbc0e74db9ca691c-init  93604f232a831b22aeb372d5b11af8c8779feb96590a6dc36a80140e38e764d8-init
019a8283e2ff6fca8d0a07884c78b41662979f848190f0658813bb6a9a464a90       93b06191602b7934fafc984fbacae02911b579769d0debd89cf2a032e7f35cfa
...

And the /var/lib/docker/aufs/layers/ directory stores the metadata of image layers. Each file saves the metadata of an image layer, and finally, the /var/lib/docker/aufs/mnt/ directory contains the mount points of image or container layers, which will finally be assembled by Docker through union.

In this assembly, each image layer is built on top of another image layer, and all image layers are read-only; only the topmost container layer of each container can be read and written directly by users. All containers sit on top of underlying kernel facilities: namespaces, control groups, rootfs, and so on. This way of assembling containers provides great flexibility, and sharing the read-only image layers also reduces disk usage.

Other Storage Drivers

AUFS is just one of the storage drivers Docker supports; the others include devicemapper, overlay2, zfs, and vfs. In recent versions of Docker, overlay2 has replaced aufs as the recommended storage driver, but machines without the overlay2 driver still use aufs as Docker's default.

Different storage drivers have completely different implementations when storing images and container files. Interested readers can find the corresponding content in Docker’s official documentation Select a storage driver.

To check which storage driver is currently being used by the Docker system, simply use the following command to get the corresponding information:

$ docker info | grep Storage
Storage Driver: aufs

On the author’s Ubuntu system, since there is no overlay2 storage driver, aufs is used as Docker’s default storage driver.

Conclusion

Docker has now become a very mainstream technology and is used in the production environments of many mature companies. However, the core technologies of Docker have actually been around for many years. Linux namespaces, control groups, and UnionFS are the three major technologies that support the current implementation of Docker and are also the most important reasons for Docker’s emergence.

While learning the principles behind Docker, the author consulted many sources and picked up a great deal of knowledge about the Linux operating system along the way. Because Docker's codebase is now enormous, fully understanding its implementation details from the source alone has become very difficult, but readers who are genuinely interested in the details can start from the source code of Docker CE.

Reference

  • Chapter 4. Docker Fundamentals · Using Docker by Adrian Mount

  • TECHNIQUES BEHIND DOCKER

  • Docker overview

  • Unifying filesystems with union mounts

  • Docker Basics: AUFS

  • RESOURCE MANAGEMENT GUIDE

  • Kernel Korner – Unionfs: Bringing Filesystems Together

  • Union file systems: Implementations, part I

  • IMPROVING DOCKER WITH UNIKERNELS: INTRODUCING HYPERKIT, VPNKIT AND DATAKIT

  • Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces

  • Understanding chroot

  • Linux Init Process / PC Boot Procedure

  • Docker Networking in Detail, with a pipework Source Code Walkthrough

  • Understand container communication

  • Docker Bridge Network Driver Architecture

  • Linux Firewall Tutorial: IPTables Tables, Chains, Rules Fundamentals

  • Traversing of tables and chains

  • Analysis of Docker's Network Execution Flow (a libnetwork Source Code Walkthrough)

  • Libnetwork Design

  • Dissecting the Docker File System: AUFS and Devicemapper

  • Linux – understanding the mount namespace & clone CLONE_NEWNS flag

  • The Kernel Knowledge Behind Docker: Namespace Resource Isolation

  • Infrastructure for container projects

  • Spec · libcontainer

  • Docker Basics: Linux Namespace (Part 1)

  • Docker Basics: Linux CGroup

  • Excerpts from "Write Your Own Docker", Part 3: Linux UnionFS

  • Introduction to Docker

  • Understand images, containers, and storage drivers

  • Use the AUFS storage driver
