The Evolution of Container Technology


Author: Daniel J Walsh
Translator: Jin Lingjie
In recent years, container technology has become a hot topic not only among developers but also among many enterprises. This growing interest in containers has led to an increasing demand for enhanced security and hardening, as well as higher requirements for scalability and interoperability. These efforts are significant undertakings, and this article discusses the work Red Hat has done in enterprise-grade container support.
Preface

When I first met representatives from Docker, Inc. (Docker.io) in the fall of 2013, we were still researching how to use Docker containers in Red Hat Enterprise Linux (RHEL). (Part of the Docker project has since been renamed Moby.) During the integration of this technology into RHEL, we encountered many issues. The first major obstacle was finding a file system that supported Copy On Write (COW) to handle the layering of container images. Red Hat ultimately contributed parts of the COW implementation for Device Mapper, btrfs, and the initial version of OverlayFS. For RHEL, we defaulted to using Device Mapper, although support for OverlayFS was nearing completion at that time.

Another significant hurdle was the tool for starting containers. At that time, the upstream Docker used LXC tools to start containers, but we did not want to support the LXC toolset in RHEL. Before collaborating with upstream Docker, I worked with the libvirt team to create a tool called virt-sandbox that could start containers via libvirt-lxc.

Some at Red Hat thought it was a good idea to remove the LXC toolset and bridge container startup between the Docker daemon and libvirt using libvirt-lxc. However, this implementation raised concerns. The call hierarchy between the Docker client (docker-cli) and the container process (pid1OfContainer) was:

docker-cli → docker-daemon → libvirt-lxc → pid1OfContainer

I was not in favor of having two daemons between the tool for starting containers and the actual running container.

My team began working with upstream Docker developers to create a container runtime library called libcontainer in native Go. This library eventually became the initial implementation of the OCI runtime specification, runc.

docker-cli → docker-daemon @ pid1OfContainer

Many people mistakenly believe that when they run a container, the container process is a child of docker-cli. In fact, Docker follows a client/server architecture, and the container process is a child of the Docker daemon, launched in its own isolated environment. This client/server split can lead to instability and potential security problems, and it blocks useful system features. For example, systemd has a feature called socket activation, which lets users configure a daemon to start only when a client actually connects to its socket. This reduces memory usage and lets services run on demand. Socket activation works by having systemd listen on a TCP or Unix socket; when a connection arrives, systemd starts the service responsible for that socket and hands the listening socket over to the newly started daemon. Placing such a daemon inside a Docker-managed container breaks this: the systemd unit file would have to start the container through the Docker client command, and systemd has no way to pass the listening socket to the daemon through the Docker client.
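
For illustration, a minimal socket-activation pair of unit files (the service name and port are hypothetical) looks roughly like this; the key point is that systemd itself owns the listening socket and hands it to the ExecStart process, which it cannot do through a docker-cli invocation:

myapp.socket:

[Socket]
ListenStream=8080

[Install]
WantedBy=sockets.target

myapp.service:

[Service]
# systemd passes the listening socket to this process as an inherited
# file descriptor (the sd_listen_fds(3) convention); a daemon started
# indirectly through docker-cli never receives that descriptor.
ExecStart=/usr/sbin/myapp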

Problems like these made us realize that a different approach to running containers was necessary.

Container Orchestration Issues

The upstream Docker project has been dedicated to making container usage more convenient, and it has always been a great tool for learning about Linux containers. We can easily start a container and enter it using a simple command like docker run -ti fedora sh.

The true advantage of containers comes from starting many of them at once and assembling them into a more powerful application. The difficulty in setting up a multi-container application is that the complexity grows exponentially, and wiring the dispersed parts together with simple Docker commands quickly becomes unwieldy. How do we manage container placement on cluster nodes with limited resources? How do we manage the lifecycle of all these containers? Similar questions abound.

At the first DockerCon, at least seven companies/open-source projects demonstrated various approaches to container orchestration. We showcased Red Hat’s OpenShift project geard, which was loosely based on OpenShift v2 containers. Red Hat decided that we needed to re-examine container orchestration and possibly collaborate with others in the open-source community.

Google demonstrated Kubernetes, a container orchestration tool built on the experience gained from the orchestration tools Google had developed internally. OpenShift decided to drop geard and began collaborating with Google on Kubernetes. Kubernetes has since grown into one of the largest community projects on GitHub.

Kubernetes

Kubernetes initially used Google’s lmctfy as its container runtime library. In the summer of 2014, lmctfy was ported to support Docker. Kubernetes runs a daemon called the kubelet on each node in the cluster, so the original Kubernetes-with-Docker-1.8 workflow looked like this:

kubelet → dockerdaemon @ PID1

This goes back to the two-daemon model.

It got worse. Every time Docker released a new version, Kubernetes would break. Docker 1.10 modified the backend storage, requiring all images to be rebuilt. Docker 1.11 began starting containers through runc:

kubelet → dockerdaemon @ runc @ PID1

Docker 1.12 added yet another daemon, containerd, to start containers, primarily to meet the needs of Docker Swarm (a competitor to Kubernetes):

kubelet → dockerdaemon → containerd @ runc @ PID1

As mentioned earlier, every new Docker release broke Kubernetes functionality, so tools like Kubernetes and OpenShift ended up locked to specific older versions of Docker.

We have now evolved into a three-daemon system, and if any one of those daemons fails, the entire chain collapses.

Toward Container Standardization
CoreOS, rkt, and Other Runtime Environments

Due to various issues with the Docker runtime, several organizations went looking for alternatives. One of them was CoreOS. CoreOS built rkt (rocket), an alternative to the upstream Docker container runtime, and also introduced a standard container specification: appc (App Container). Simply put, they wanted everyone to agree on a single standard for how container images are stored and handled.

This was a warning bell for the community. When we first began collaborating with upstream Docker on containers, my biggest worry was that we would end up with multiple competing standards. I did not want a repeat of the war between the RPM format and the Debian format (deb), which has colored Linux software distribution for the last 20 years. The good news coming out of the appc proposal was that it persuaded upstream Docker and the open-source community to establish a standards body: the Open Container Initiative (OCI).

OCI has been working on two specifications:

  • OCI Runtime Specification: The runtime specification “aims to define the configuration, execution environment, and lifecycle of containers.” It defines how a container’s filesystem bundle is laid out on disk, the JSON file that describes the application inside the container, and how the container is created and run. Upstream Docker contributed libcontainer and provided runc as the default implementation of the OCI runtime specification (see the short runc sketch after this list).

  • OCI Image Format Specification: The image format specification is largely based on the upstream Docker image format and defines the format of container images stored in container registries. This standard allows application developers to package their applications in a single common format. Some of the ideas described in the appc specification have already been merged into the OCI image format specification. Both OCI specifications are close to an official 1.0 release. (The 1.0 versions were released on July 20, 2017, translator’s note.) Upstream Docker has committed to supporting the OCI image specification now that it is official, and rkt supports both OCI images and the traditional Docker image format.
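
As a rough illustration of the runtime specification in practice (the paths are illustrative), runc operates on a “bundle” directory containing a root filesystem plus a config.json describing the container:

mkdir -p /tmp/mycontainer/rootfs      # unpack a root filesystem here
cd /tmp/mycontainer
runc spec                             # generate a template config.json for the bundle
runc run mycontainer                  # create and run the container as the OCI runtime spec describes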

By providing industry standards for container images and runtimes, OCI has facilitated innovation in container-related tools and orchestration frameworks.

Abstract Runtime Interface

The Kubernetes orchestration tool is one beneficiary of these standards. As a major backer of Kubernetes, CoreOS submitted a series of patches adding support for running containers through rkt. Google and the Kubernetes community realized that applying these patches, and then adding support for yet more runtimes later, would complicate and bloat the Kubernetes codebase. So the Kubernetes team decided to define an API protocol called the Container Runtime Interface (CRI) and refactored Kubernetes to call CRI rather than calling the Docker engine directly. Anyone who wants to plug in a new container runtime now only needs to implement the server side of CRI. Kubernetes also provides a large test suite that CRI developers can run to verify that their implementation is compatible with Kubernetes. Work on the CRI abstraction is ongoing: the direct Docker engine calls are being removed from Kubernetes proper and moved into an adapter layer called the docker-shim.
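
To get a feel for the CRI boundary in practice, the cri-tools project (not covered in this article) ships crictl, a small client that speaks CRI to whatever runtime the kubelet has been pointed at; the endpoint path below is only an example (it happens to be CRI-O’s, discussed later):

crictl --runtime-endpoint unix:///var/run/crio/crio.sock pull nginx     # pull an image through the CRI implementation
crictl --runtime-endpoint unix:///var/run/crio/crio.sock images         # list images the runtime knows about
crictl --runtime-endpoint unix:///var/run/crio/crio.sock ps             # list running containers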

Container Tool Innovations
Image Repository Innovation: skopeo

A few years ago we built the atomic command-line interface as part of the Atomic project. One feature we wanted was a tool that could inspect the contents of an image while it was still in the image repository. At that time, the only way to do this was to pull the image to the local system and then read its metadata with the docker inspect command. Images can be huge, occupying gigabytes of space; letting users examine image metadata up front would allow them to decide whether the image was worth pulling at all, so we proposed adding a --remote option to docker inspect. Upstream Docker rejected the pull request, saying they did not want to complicate the Docker command-line interface and that users could implement the same functionality with their own small tools.

Our team, led by Antonio Murdaca, eventually built such a tool and named it skopeo. Antonio positioned it as more than a tool for fetching image configuration data: it implements a full two-way protocol for pulling container images from a repository to a local server and for pushing local images back to a repository.

Today skopeo is heavily used inside the atomic command-line interface, for example to check for container updates and as part of the atomic scan command. Atomic also uses skopeo, rather than the upstream Docker daemon, to pull and push images.
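
A couple of typical skopeo invocations (the image names are illustrative) look like this:

skopeo inspect docker://docker.io/library/fedora:latest                       # read image metadata without pulling the layers
skopeo copy docker://docker.io/library/fedora:latest oci:/tmp/fedora:latest   # copy an image from a registry into a local OCI layout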

containers/image Library

We had been talking with CoreOS about using skopeo from rkt, but they did not want to call out to an external helper executable; they were, however, open to consuming the same functionality as a library. So we decided to split skopeo into a library plus an executable, which became the containers/image project.

The containers/image library and skopeo are now used by several upstream projects and cloud-infrastructure tools. Both support a number of storage backends beyond Docker and offer features such as moving images between repositories. One advantage of skopeo is that it does not rely on any daemon to do its work. The containers/image library has also made it possible to build enhanced features such as container image signing.
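
For example, skopeo exposes signing through its copy command (the key ID and registries here are hypothetical):

skopeo copy --sign-by dev@example.com docker://docker.io/library/fedora:latest docker://registry.example.com/fedora:latest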

Image Processing and Scanning Innovations

As mentioned earlier, the atomic command-line interface exists to implement features that do not fit into the Docker command-line interface, as well as features we believe upstream Docker would not accept. We also want it to support other container runtimes, tools, and storage backends in an extensible way; the skopeo work described above is one demonstration of that.

One feature we wanted to add to atomic was the atomic mount command. The motivation was the ability to take content out of Docker image storage (what upstream Docker calls a graph driver) and mount it somewhere so that other tools could examine the image. Currently, the only native Docker way to look at image content is to start a container from it. If the image contains untrusted content, running its code just to inspect the image is very dangerous. A second problem with starting a container in order to inspect it is that the inspection tools you need are usually not included in the container image.

The workflow of most container image scanning tools typically looks like this: they connect to the Docker socket, execute the docker save command to create a compressed package, then extract it to disk, and finally inspect the contents. This process is quite slow.

With the atomic mount command, we can directly mount the image through the Docker graph driver. If the Docker daemon is using the device mapper, it can mount this device; if it is using the overlay filesystem, it can mount the overlay. This allows us to quickly meet our needs. Now we can simply do:

# atomic mount fedora /mnt
# cd /mnt

Then we can start inspecting the content. After the inspection is complete, execute:

# atomic umount /mnt

We integrated this feature into the atomic scan command, allowing us to build a fast image scanner.
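
With the fast mount path in place, a scan is a single command (the scanner selection shown is illustrative and depends on which scanner plugins are installed):

# atomic scan fedora
# atomic scan --scanner openscap fedora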

Tool Coordination Issues

One major issue with the atomic mount command is that it works independently of the Docker daemon, so the daemon does not know that another process is using the image. This can cause problems (for example, if someone mounts the Fedora image as above and someone else then runs docker rmi fedora, the Docker daemon fails to delete the Fedora image because the device is busy), and the daemon can be left in an unpredictable state.

containers/storage Library

To address this, we began pulling the graph-driver code out of the upstream Docker daemon into its own repository. The Docker daemon's graph driver keeps all of its locking in the daemon's memory. We wanted to move that locking into the filesystem, so that different processes could operate on container storage at the same time without funneling everything through a daemon that becomes a single point of failure.

This project was ultimately named containers/storage. It implements all the Copy On Write (COW) features containers need at run, build, and storage time, without requiring a controlling or monitoring process (that is, without a daemon). skopeo and several other tools and projects can now make better use of the storage layer, other open-source projects have begun using the containers/storage library, and at some point we hope this work can be merged back into the upstream Docker project.

Setting Sail for Innovation

Now let’s take a look at what happens when Kubernetes runs a container through the Docker daemon on a node. First, Kubernetes executes a command:

kubectl run nginx --image=nginx

This command asks Kubernetes to run the nginx application on a node. The kubelet on that node calls its container runtime and asks it to start nginx. At this point, the container runtime implementation has to complete the following steps:

  1. Check local storage for an image named nginx.

  2. If the image is not found in local storage, download it from the image repository to the local system.

  3. Extract the downloaded container image onto container storage (usually Copy On Write storage) and mount it in the appropriate location.

  4. Use a standardized container runtime to run the container.

Let’s take a look at the features relied upon in the above process:

  1. OCI Image Format: Used to define the standard storage format of images in the repository.

  2. containers/image library: Used to implement all the features needed to pull images from the image repository.

  3. containers/storage library: Provides the functionality needed to extract the OCI image format into Copy On Write storage.

  4. OCI Runtime Specification and runc: Provide the tools needed to run containers (the same tools used by the Docker daemon to run containers).

This means we can achieve the capabilities needed to use containers with these tools and libraries without relying on a large container daemon.
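
As a sketch of that claim (the umoci unpack step is illustrative; umoci is one of several tools that can explode an OCI image into a runtime bundle and is not part of the toolchain described in this article), the whole flow can be reproduced by hand:

skopeo copy docker://docker.io/library/nginx:latest oci:/tmp/nginx:latest   # pull the image with containers/image
umoci unpack --image /tmp/nginx:latest /tmp/nginx-bundle                    # unpack the OCI image into an OCI runtime bundle
runc run --bundle /tmp/nginx-bundle nginx                                   # start the container with the OCI runtime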

In a medium to large-scale DevOps-based continuous integration/continuous delivery environment, efficiency, speed, and security are critical. Once the relevant tools comply with OCI specifications, developers and operations can use the best tools in the continuous integration/continuous delivery pipeline and production environment. Most container operation tools are hidden beneath container orchestration frameworks or other higher-level container platform technologies. It is foreseeable that the choice of container runtimes and image tools will become an installation option for container platforms in the future.

System (Independent) Containers

In the Atomic project, we introduced atomic host, a new way to build operating systems where software can be updated “atomically,” and most applications run in containers. The purpose of using this platform is to demonstrate that most software can be migrated to the OCI image format and can be downloaded and installed from the image repository using standard protocols. Providing software as container images allows users’ operating systems and applications to be updated at different speeds. The traditional RPM/yum/DNF method of distributing software packages restricts application versions to the lifecycle of the host operating system.

One issue with distributing base system services as containers is that sometimes an application must start before the container runtime daemon is running. Take an example involving the Docker daemon: Kubernetes needs networking set up before it can assign pods to their own isolated network namespaces. The default daemon we use for this today is flanneld, which must be running before the Docker daemon starts so that it can set up the Docker daemon's network interfaces. flanneld, in turn, uses etcd as its data store, and etcd must be running before flanneld starts.

If we distribute etcd and flanneld as container images, we run into a chicken-and-egg problem: a container runtime daemon is needed to start the containerized applications, but those applications have to start before the container runtime daemon. I have found a few hacks to work around this, but none of them is clean. At the same time, the Docker daemon still has no good way to express startup priorities between containers. I have seen proposals in this area, but the current implementations all fall back to an old SysVInit-style way of starting services (with all the complexity that brings).

systemd

One reason for replacing SysVInit with systemd was to handle the ordering and priority of starting services, so why shouldn't containers be able to take advantage of that? In the Atomic project we decided that, when running containers on the host, there is no need for a container runtime daemon, especially for services needed early in boot. So we enhanced the atomic command-line interface to let users install container images. When a user runs atomic install --system etcd, atomic uses skopeo to pull the etcd OCI image from the image repository and then lays it down into OSTree storage. Since etcd will run in production, the image is read-only. The atomic command then grabs the systemd unit-file template from the container image and creates a unit file on disk that starts the image. That unit file typically uses runc to start the container on the host (although runc is not mandatory).

Running atomic install --system flanneld does something similar, except that this time the generated unit file for flanneld specifies that etcd must be running before flanneld starts.
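
Purely as an illustration (atomic generates the real file from a template shipped inside the image, so the names and paths here are hypothetical), the generated flanneld unit might look roughly like this:

[Unit]
Description=flanneld system container
Requires=etcd.service
After=etcd.service

[Service]
# The working directory holds the exploded OCI bundle; runc uses the
# current directory as its bundle by default.
WorkingDirectory=/var/lib/containers/atomic/flanneld.0
ExecStart=/bin/runc run flanneld
Restart=on-failure

[Install]
WantedBy=multi-user.target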

When the system boots, systemd makes sure etcd runs before flanneld, and that the container runtime starts only after flanneld is up. This lets users put the Docker daemon and Kubernetes themselves into system containers, which means you can boot an atomic host or a traditional RPM-based operating system in which the entire container orchestration framework runs as containers. This is powerful, because we know customers want to patch their container hosts continuously and independently of these components. It also minimizes the footprint of the host operating system.

Of course, there is still debate about putting traditional applications into containers so that they can run both as independent/system containers and as orchestrated containers. Consider an Apache container: we can install it with atomic install --system httpd. The container image then starts just like the RPM-based httpd service (systemctl start httpd), except that the httpd process runs inside the container. Storage can still be local, with the host's /var/www directory mounted into the container, and the container listening on port 80 of the local network. This shows that we can run traditional workloads in containers on a host without any container runtime daemon.

Building Container Images

In my opinion, the saddest thing about container innovation in the last four years is the lack of innovation in the mechanism for building container images. A container image is a compressed package consisting of the image content and some JSON files. The base image of a container is a complete root filesystem and some JSON files for description. Users can then add layers on top of it, with each added layer forming a compressed package and recording the changes in JSON files. These layers are packaged together with the base image into a container image.

Basically, everyone builds images using docker build and Dockerfile format files. Upstream Docker stopped accepting pull requests to modify and enhance the Dockerfile format and build methods years ago. Throughout the evolution of containers, Dockerfile has played an important role. Development and operations can build images in a simple and direct manner. However, I believe Dockerfile is merely a simplified bash script, and many issues remain unresolved. For example:

  • Building a container image from a Dockerfile requires the Docker daemon.

    • No one has yet created a standard tool that can create OCI format images independently of the Docker command.

    • Tools like ansible-containers and OpenShift S2I (Source2Image) still use the Docker engine under the hood.

  • Every line in a Dockerfile creates a new image layer. This can speed up builds during development, because the tooling can determine whether each line of the Dockerfile has changed and reuse the previously built layer when it has not. However, it also produces a very large number of layers (see the short illustrative Dockerfile after this list).

    • This issue has led many to request a way to merge layers to limit the number of layers. Ultimately, upstream Docker accepted some suggestions to meet this demand.

  • Pulling content from protected sites into an image usually requires secrets of some kind. For example, adding RHEL content to an image requires RHEL certificates and subscription credentials.

    • Those secrets end up stored in the image layers, and developers then have to strip them back out of the layers.

    • To allow volumes to be mounted during a build, we added a -v option to the atomic project and to the docker packages we distribute, but upstream Docker did not accept those patches.

  • Everything used during the build ends up inside the container image. So although Dockerfiles give developers who are just starting out a clear view of the whole build process, they are not an efficient mechanism for large-scale enterprise environments. And once automated container platforms are in use, users do not really care how the OCI-standard image gets built under the hood.
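
As a small illustration of the layering behavior (the package and file names are arbitrary), each instruction below produces its own layer, and cleanup performed in a later layer does not shrink the earlier ones:

FROM fedora
# Each instruction creates a new layer.
RUN dnf install -y httpd
# The package cache removed here still exists in the previous layer.
RUN dnf clean all
COPY index.html /var/www/html/
ENTRYPOINT ["/usr/sbin/httpd", "-DFOREGROUND"]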

Setting Sail with buildah

At DevConf.cz 2017, I had Nalin Dahyabhai on our team look at what we were calling containers-coreutils, a set of command-line tools built on the containers/storage and containers/image libraries that mimic the verbs of a Dockerfile. Nalin named it buildah, poking fun at my Boston accent. With a few buildah primitives we can build a container image (a complete sketch follows the list below):

  • A central security idea is to keep the operating-system image as small as possible and leave out unnecessary tools. The rationale: attackers need tools to break into applications, and if tools like gcc, make, and dnf are simply not there, an attack may be stopped or at least limited.

  • Additionally, since these images need to be pulled and pushed over the internet, it is always a good idea to reduce the image size.

  • The primary way to build images using Docker is to install or compile software into the container’s buildroot through commands.

  • Using RUN commands means those executables must exist inside the container image. For example, using the dnf command in the image requires the entire Python stack to be installed, even if the final application never uses Python.

  • ctr=$(buildah from fedora):

    • Use the containers/image library to pull the Fedora image from the image repository.

    • Return a container ID (ctr).

  • mnt=$(buildah mount $ctr):

    • Mount the newly created container image ($ctr).

    • Return the mount point path.

    • Now we can write to this mount point.

  • dnf install httpd --installroot=$mnt:

    • We can use commands from the host system to write content into the container, which means we can place keys on the host without putting them into the container, and the build tools can also be kept on the host.

    • Commands like dnf do not need to be pre-installed in the container; dependencies like Python also do not need to be included unless the application depends on them.

  • cp foobar $mnt/dir:

    • We can populate the container using any commands supported by bash.

  • buildah commit $ctr:

    • We can create layers as needed. The creation of layers is controlled by the user instead of the tool.

  • buildah config --env container=oci --entrypoint /usr/bin/httpd $ctr:

    • Commands supported in Dockerfile can also be specified.

  • buildah run $ctr dnf -y install httpd:

    • buildah also supports a run command, which does not rely on a container runtime daemon but uses runc to execute the command directly inside the container.

  • buildah build-using-dockerfile -f Dockerfile .:

    • buildah also supports building images using Dockerfile.
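
Putting the primitives together, a complete build might look like the following sketch (the package choice, file names, and final image name are illustrative):

ctr=$(buildah from fedora)                  # start from the fedora base image
mnt=$(buildah mount $ctr)                   # mount the container's root filesystem
dnf install -y --installroot=$mnt httpd     # install content from the host; dnf itself never enters the image
cp index.html $mnt/var/www/html/            # add content with ordinary host tools
buildah umount $ctr                         # unmount before committing
buildah config --entrypoint /usr/bin/httpd $ctr
buildah commit $ctr my-httpd                # commit a single layer as the image my-httpd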

We hope to convert tools like ansible-containers and OpenShift S2I to use buildah instead of relying on a container runtime daemon.

Building container images and running containers with the same runtime environment is a real problem for production, because both then have to satisfy the same security requirements, even though building an image usually needs far more privileges than running one. For example, we allow the mknod capability by default. mknod lets a process create device nodes, but very few production applications need that; dropping the mknod capability in production makes the system more secure.

Another example: we normally leave the container image read-write, because the install process needs to write packages into /usr. In production, however, I personally recommend running all containers in read-only mode, with processes allowed to write only to tmpfs or to volumes mounted into the container. By separating image building from container running, we can change these defaults and make the runtime environment far more locked down.
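
With the Docker command line, that kind of production hardening looks roughly like this (the image name and mounts are illustrative):

docker run --read-only --cap-drop=MKNOD --tmpfs /run --tmpfs /tmp -v /srv/www:/var/www:ro my-httpd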

Kubernetes Runtime Abstraction: CRI-O

Kubernetes added a set of APIs for plugging in any runtime for pods, called the Container Runtime Interface (CRI). As I have said, I am reluctant to run too many daemons on my system, but we ended up adding another one. A group led by Mrunal Patel on our team began implementing the CRI-O daemon in late 2016: a Container Runtime Interface implementation for running OCI-based applications. In theory, we could someday compile the CRI-O code directly into the kubelet and get rid of one daemon.

Unlike other container runtimes, the sole purpose of CRI-O is to meet the needs of Kubernetes. Recall the steps required for Kubernetes to run a container.

Kubernetes sends a message to kubelet notifying it to start an nginx service:

  1. kubelet calls CRI-O to inform it to run nginx.

  2. CRI-O responds to the CRI request.

  3. CRI-O finds the corresponding OCI image in the image repository.

  4. CRI-O uses containers/image to pull the image from the repository to the local system.

  5. CRI-O uses containers/storage to extract the image to local storage.

  6. CRI-O uses the OCI runtime specification to start the container, typically using runc. I mentioned earlier that this is the same way the Docker daemon uses runc to start containers.

  7. If needed, other OCI-compliant runtimes can be used instead of runc to start the container, for example VM-based runtimes such as Clear Containers or runv.

CRI-O aims to be a stable platform for running Kubernetes, and we only release new versions after they pass the full Kubernetes test suite. All pull requests submitted to https://github.com/Kubernetes-incubator/cri-o must run the entire Kubernetes test suite; pull requests that do not pass it cannot be merged. CRI-O is a fully open project, with contributors from companies such as Intel, SUSE, IBM, Google, and Hyper.sh already involved. As long as a majority of the CRI-O maintainers agree, a pull request will be accepted, even if Red Hat has no need for the patch.
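
Pointing a kubelet at CRI-O is then just configuration. At the time of writing it looks roughly like the following (the flag names have shifted across Kubernetes releases, so treat this as a sketch):

kubelet --container-runtime=remote \
        --container-runtime-endpoint=unix:///var/run/crio/crio.sock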

Final Thoughts

I hope this in-depth exploration helps readers understand the evolution of Linux containers. Linux once faced a situation where each vendor had its own standards. Docker focused on establishing a de facto standard for image building and simplifying container-related tools. The establishment of the Open Container Initiative signifies that the industry is formulating standards around core image formats and runtimes, innovating to make tools more efficient, secure, scalable, and usable. Containers allow us to verify software installations in novel ways, whether running traditional software on hosts or orchestrated microservices in the cloud. In many ways, this is just the beginning.

This article’s translation has been authorized. Original link:

https://opensource.com/article/17/7/how-linux-containers-evolved
