Understanding Docker: A Comprehensive Guide to Containerization

Author: huashiou

segmentfault.com/a/1190000019462392

Off-topic

Recently I have been learning Docker and Kubernetes. A few days ago I gave a technical presentation, and on listening to the recording I realized that a good slide deck alone is far from enough: without a well-prepared, logically rigorous script, I stumbled, skipped technical points, and contradicted myself. To address this, I now plan to write a blog post based on the slides before each technical talk, to work out the logic of my delivery in advance. I also believe each deck should be polished as much as possible, since you never know when you might need to present it again. This article follows the slides-plus-script format as a record of my recent talk; feedback is welcome.

1. Background of Docker

In regular development and project scenarios, the following situations are common:

  • Personal Development Environment

    To work on big data projects, one might need to install a CDH cluster. A common practice is to set up three virtual machines on one's computer, matching the required CDH version. After installing the cluster, a backup of the whole thing is usually made in case a clean CDH cluster is needed later, leaving six virtual machine images on the computer. When learning other technologies, such as Ambari big data clusters, three more virtual machines must be set up to avoid disturbing the existing environment, and the disk quickly fills up with virtual machine images.

  • Internal Development Environment

    In companies, projects are often carried out in small teams, typically with the operations department allocating virtual machines from their managed server resources for internal development and testing. For example, in a machine learning-related project:

  • Xiaoming set up an Ambari cluster on the virtual machine allocated by the operations department to run big data-related tasks.

  • Xiaogang wrote a machine learning algorithm in Python 3, but found that the virtual machine was running Python 2, making the algorithm incompatible. He upgraded the Python version on the virtual machine, which allowed the algorithm to run, but some functionalities of Ambari that relied on Python might have encountered errors.

  • Xiaoli developed an application and started Tomcat on the virtual machine, only to find that OpenJDK was installed, preventing Tomcat from starting. He then installed a JDK, but this might have caused errors in the Java code within Ambari.

  • Xiaozhao wanted to use server resources for performance testing, but found that virtualization severely degraded performance. He ultimately had to find a physical machine for the tests, disrupting that machine's original environment.

  • After completing the project, the installations on these virtual machines often become useless, and the next project team must apply for new virtual machines to redeploy the software.

  • Development/Testing/Production Environment

    After developers write and test code in the development environment, they submit it to the testing department. When testers run it in the testing environment, they discover bugs. Developers claim there are no bugs in the development environment, leading to repeated back-and-forth discussions with testers to resolve the issues before releasing the version. After deployment in the production environment, bugs are discovered again, resulting in further disputes between engineers and testers. Sometimes, to accommodate special production environments, code needs to be customized, leading to branching, making upgrades a nightmare.

  • Upgrading or Migrating Projects

    Each time a version is released for production, if there are multiple Tomcat applications running in the production environment, each Tomcat needs to be stopped, the WAR files replaced, and then restarted one by one, which is not only cumbersome but also prone to errors. If severe bugs arise after an upgrade, manual rollback is required. Moreover, if a project wants to migrate to the cloud, a round of testing must be conducted after deploying in the cloud, and if considering multiple cloud vendors, similar testing may need to be repeated (e.g., if the data storage component changes), which is time-consuming and labor-intensive.

Summarizing all the scenarios above, they share a common problem: there has been no technology that can shield operating-system differences and run applications without sacrificing performance, thereby solving the problem of environment dependencies. Docker was born to address this.

2. What is Docker

Docker is an application container engine. First, let's clarify what a container is. The Linux kernel provides Namespace and CGroup technologies to achieve environment isolation and resource control. Namespace is a kernel-level environment-isolation mechanism that allows a process and its child processes to run in a space isolated from the rest of the system. Note that Namespace isolates only the running space; physical resources are still shared among all processes. To control resources, Linux provides CGroup technology, which limits the resources (CPU, memory, disk I/O, and so on) that a group of processes may use. Combining these two technologies yields an independent user-space object with limited resources, which is what we call a container.
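Both mechanisms can be inspected directly on a Linux host. A minimal sketch, assuming Linux; the privileged commands are shown commented out because they require root (paths follow cgroup v2 conventions):

```shell
# Each entry under /proc/<pid>/ns identifies one namespace this shell belongs to
ls -l /proc/$$/ns

# Privileged examples (commented out -- they require root):
# unshare --uts --pid --fork /bin/bash             # start a shell in new UTS+PID namespaces
# mkdir /sys/fs/cgroup/demo                        # cgroup v2: create a control group
# echo 268435456 > /sys/fs/cgroup/demo/memory.max  # cap the group at 256 MB
# echo $$ > /sys/fs/cgroup/demo/cgroup.procs       # move this shell into the group
```

The listing shows separate entries for pid, mnt, net, uts, and so on: each can be unshared independently, which is exactly what a container engine does when it creates a container.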

Linux Container (LXC) is the containerization technology provided by the Linux system, which combines Namespace and CGroup technologies to provide users with a more user-friendly interface for containerization. LXC is merely a lightweight containerization technology that can only limit certain resources and cannot achieve network restrictions or disk space usage limitations. The dotCloud company combined LXC with the technologies listed below to implement the Docker container engine. Compared to LXC, Docker offers more comprehensive resource control capabilities and is an application-level container engine.

  • Chroot: This technology can construct a complete Linux file system within the container;

  • Veth: This technology can virtually create a network card on the host to bridge with the eth0 network card in the container, enabling network communication between the container and the host, as well as between containers;

  • UnionFS: A union file system, Docker utilizes the “Copy on Write” feature of this technology to achieve rapid container startup and minimal resource usage, which will be discussed in detail later;

  • Iptables/netfilter: These two technologies implement control over container network access policies;

  • TC: This technology is mainly used for traffic isolation and bandwidth limitation;

  • Quota: This technology is used to limit the size of disk read/write space;

  • Setrlimit: This technology is used to limit the number of processes that can be opened in the container, as well as the number of files that can be opened, etc.

It is precisely because Docker relies on these Linux kernel technologies that the host needs a kernel of at least version 3.8 to run Docker containers; the official recommendation is kernel 3.10 or above.
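Several of these controls surface directly as docker run flags. A hedged sketch, assuming a running Docker daemon; the values and container name are illustrative (--memory and --cpus map onto CGroup limits, --pids-limit caps the number of processes, in the spirit of Setrlimit):

```shell
# Start a container with explicit resource limits (values illustrative)
docker run -d --name limited \
  --memory 256m \
  --cpus 0.5 \
  --pids-limit 100 \
  nginx
```

Inspecting the started container with docker stats shows the limits being enforced by the kernel, not by the Docker engine itself.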

3. Differences from Traditional Virtualization Technologies

Traditional virtualization technologies add a software layer called Hypervisor (virtual machine monitor) between the virtual machine (VM) and hardware. The operation modes of Hypervisor can be divided into two categories:

  • Directly running on physical hardware, such as kernel-based KVM virtual machines, which require CPU support for virtualization technology;

  • Running on another operating system, such as VMWare and VirtualBox virtual machines.

Because the operating system running on the virtual machine shares hardware through the Hypervisor, the instructions issued by the VM Guest OS must be captured by the Hypervisor and translated into instructions that the physical hardware or host operating system can recognize. Virtual machines like VMWare and VirtualBox are much less performant than bare metal, while hardware-based virtual machines like KVM can achieve about 80% of the performance of bare metal. The advantages of this type of virtualization are that it provides complete isolation between different virtual machines, ensuring high security, and allows multiple operating systems (like Linux and Windows) to run on a single physical machine. However, each virtual machine is heavy, consumes a lot of resources, and starts slowly.

The Docker engine runs on the operating system and uses kernel-based technologies like LXC and Chroot to achieve container environment isolation and resource control. After the container starts, the processes inside the container interact directly with the kernel without going through the Docker engine, resulting in almost no performance loss, allowing for full utilization of bare metal performance. However, since Docker is based on Linux kernel technology for containerization, applications running inside containers can only operate on Linux kernel operating systems. Currently, the Docker engine installed on Windows actually uses the built-in Hyper-V virtualization tool to automatically create a Linux system, and the operations within the container are effectively using this virtual system.

4. Basic Concepts of Docker

Docker mainly includes the following concepts:

  • Engine: A tool for creating and managing containers, generating containers by reading images, and responsible for pulling images from the repository or submitting images to the repository;

  • Image: Similar to a virtual machine image, generally packaged from a basic operating system environment and multiple applications, serving as a template for creating containers;

  • Container: Can be viewed as a simplified Linux system environment (including root user permissions, process space, user space, and network space) along with the applications running inside it, packaged into a box;

  • Repository: A centralized place to store image files, divided into public and private repositories. Currently, the largest public repository is the official Docker Hub, along with public repositories provided by domestic platforms like Alibaba Cloud and Tencent Cloud;

  • Host: The server where the engine runs.
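These concepts map onto everyday CLI commands. A hedged sketch; image and container names are illustrative, and a running Docker daemon plus a registry login (for push) are assumed:

```shell
# Engine pulls an image from a repository (Docker Hub by default)
docker pull alpine:3
# List images stored on the host
docker images
# Engine creates and starts a container from the image
docker run -d --name demo alpine:3 sleep 300
# List containers running on this host
docker ps
# Snapshot the container back into a local image
docker commit demo my-alpine
# Push an image to a repository (requires a registry login and namespace)
docker push my-alpine
```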

5. Comparison of Docker with Virtual Machines, Git, and JVM

To provide a more intuitive understanding of Docker, let’s make three comparisons:

[Figure: Docker containers compared with virtual machines, each running a Spark cluster]

In the above image, Docker’s image repository is similar to the traditional virtual machine image repository or local file system storing images. The Docker engine starts containers to run Spark clusters (with basic Linux operating system environment included in the container), analogous to virtual machine software starting multiple virtual machines to run Spark processes. The difference is that applications in Docker containers interact directly with the kernel when utilizing physical resources, without going through the Docker engine.

Docker's repository concept works the same way as Git's: images are pushed to and pulled from a central repository, just as code is pushed to and pulled from a Git remote.

Docker's motto is "Build, Ship, and Run Any App, Anywhere": applications can be built, shipped, and run with Docker, so a single build runs anywhere. Java's motto is "Write Once, Run Anywhere": code is written once and runs everywhere. Java relies on the JVM to shield operating-system differences, while Docker relies on kernel-version compatibility to let a single image run anywhere: as long as the Linux kernel is version 3.8 or higher, the container can run.

Of course, just as in Java, if application code uses features from JDK 10, it cannot run on JDK 8. Similarly, if an application inside a container uses features from kernel version 4.18, the container may start on CentOS 7 (with kernel version 3.10), but the application’s functionalities will not run correctly unless the host operating system’s kernel is upgraded to version 4.18.

6. Docker Image File System

[Figure: layered structure of Docker images and their dependency relationships]

Docker images adopt a layered storage format, where each image can depend on other images for construction. Each layer of the image can be referenced by multiple images. The image dependency relationship in the above image shows that the K8S image is actually a combination of CentOS + GCC + GO + K8S. This layered structure allows for ample sharing of image layers, significantly reducing the space occupied by the image repository. For users, what they see as a container is actually a whole presented by Docker using UnionFS (union file system) to “join” the directories of related image layers into a single mount point. Here, we need to briefly introduce what UnionFS is:

UnionFS can mount the contents of multiple physically independent directories (also called branches) into a single directory. UnionFS allows control over the read/write permissions of these directories. Additionally, it has the “Copy on Write” feature for read-only files and directories, meaning that if a read-only file is modified, a copy of the file is first made to a writable layer (which may be a directory on the disk), and all modification operations actually modify this file copy, leaving the original read-only file unchanged.

One example of using UnionFS is Knoppix, a Linux distribution used for demonstrations, teaching, and commercial product presentations. It mounts a CD/DVD and a writable device (like a USB stick) together so that any changes made to files on the CD/DVD during the demonstration are applied to the USB stick, without altering the original contents on the CD/DVD.

There are many UnionFS implementations; the one commonly used by Docker is AUFS, an improved version of UnionFS, and alternatives include DeviceMapper, Overlay2, ZFS, and VFS. With AUFS, each image layer is stored by default under /var/lib/docker/aufs/diff. When a user starts a container, the Docker engine first creates a writable-layer directory in /var/lib/docker/aufs/diff, then uses UnionFS to mount this writable layer together with the specified image's layers (the image layers mounted read-only) onto a directory under /var/lib/docker/aufs/mnt, and finally uses technologies like LXC to achieve environment isolation and resource control, so that the application in the container runs solely against the mounted directories and files under mnt.

Utilizing the Copy on Write feature of UnionFS, when starting a container, the Docker engine essentially only adds a writable layer and constructs a Linux container, both of which consume almost no system resources. Therefore, Docker containers can start in seconds, and a single server can start thousands of Docker containers, while traditional virtual machines struggle to start dozens on a single server, and they have slow startup times. These are two significant advantages of Docker over traditional virtual machines.
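The Copy-on-Write lookup described above can be modeled in a few lines of Python. This is a toy sketch, not Docker's implementation: plain dicts stand in for layer directories, reads resolve top-down through the layers, and writes always land in the writable layer:

```python
class UnionMount:
    """Toy model of a union mount: read-only image layers plus one writable layer."""

    def __init__(self, *image_layers):
        self.layers = list(image_layers)  # read-only layers, bottom to top
        self.writable = {}                # the container's writable layer

    def read(self, path):
        # The writable layer shadows everything; upper image layers shadow lower ones.
        if path in self.writable:
            return self.writable[path]
        for layer in reversed(self.layers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-Write: image layers are never modified; writes land on top.
        self.writable[path] = data


base = {"/etc/os-release": "Debian"}    # base OS layer
app = {"/app/run.sh": "exec apache2"}   # application layer
c = UnionMount(base, app)

c.write("/etc/os-release", "patched")   # "modify" a read-only file
print(c.read("/etc/os-release"))        # -> patched (what the container sees)
print(base["/etc/os-release"])          # -> Debian (the image layer is unchanged)
```

Starting a container is cheap for the same reason: the engine only has to create the empty writable layer, never copy the image layers.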

When an application directly calls kernel functions for operation, the application itself can serve as the bottom layer for building images. However, since containers isolate the environment, applications inside the containers cannot access files on the host (unless specific directories or files are mapped into the container). In this case, application code can only use kernel functions. However, the Linux kernel only provides basic and low-level management functions such as process management, memory management, and file system management. In practical scenarios, almost all software is developed based on operating systems, and thus often relies on the operating system’s software and runtime libraries. If the next layer of these applications is directly the kernel, the applications will not run. Consequently, application images are often based on an operating system image to fulfill runtime dependencies.

In Docker, operating system images differ from the ISO images used for installing systems. ISO images contain the operating system kernel and all directories and software included in that distribution, while Docker operating system images do not include the system kernel but only contain essential directories (like /etc, /proc, etc.) and commonly used software and runtime libraries. One can view the operating system image as an application built on top of the kernel, providing a runtime environment for user-written applications by encapsulating kernel functionalities. Applications built on such images can leverage various software functionalities and runtime libraries of the corresponding operating system. Additionally, since applications are built on operating system images, even if they are transferred to another server, as long as the functionalities used by the application in the operating system image can adapt to the host’s kernel, the application can run normally. This is the reason behind the one-time build and run anywhere capability.

The following image illustrates the relationship between images and containers:

[Figure: a container's writable layer stacked on the Apache, emacs, and Debian image layers]

In the above image, the Apache application is built on the emacs image, which is based on the Debian system image. When started as a container, a writable layer is constructed on top of the Apache image layer, and all modifications to the container itself are performed in this writable layer. Debian serves as the base image for this image, providing a higher-level encapsulation of the kernel. Other images are also built on the same kernel (the following BusyBox is a minimal operating system image):

[Figure: multiple operating-system images sharing the same kernel]

This raises a question: if applications are built based on operating system images, what if the operating system images themselves occupy substantial space? This would make the distribution of images inconvenient and increase the space occupied by the image repository. Some have already considered this, constructing different operating system images for different scenarios. Below are some of the most commonly used system images.

7. Basic Operating Systems for Docker

[Figure: common base operating-system images and their sizes]

The above system images are suitable for different scenarios:

  • BusyBox: A minimal version of the Linux system, integrating over 100 commonly used Linux commands, with a size of less than 2MB, known as the “Swiss Army Knife of Linux systems,” suitable for simple testing scenarios;

  • Alpine: A lightweight Linux distribution focused on security, with more comprehensive functionality than BusyBox, under 5MB in size, recommended as the base image due to its sufficient basic functionality and small size, commonly used in production environments;

  • Debian/Ubuntu: Debian series operating systems, fully functional, around 170MB in size, suitable for development environments;

  • CentOS/Fedora: Both are Red Hat-series Linux distributions, commonly used as enterprise server operating systems, highly stable, approximately 200MB in size, suitable for production environments.
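To compare these images in practice, one can pull them and check the reported sizes (sizes vary by tag and over time; assumes a running Docker daemon):

```shell
# Pull several base images and compare their sizes
docker pull busybox
docker pull alpine
docker pull debian
docker images busybox alpine debian
# Peek inside a minimal image
docker run --rm alpine cat /etc/os-release
```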

8. Docker Persistent Storage

Based on the previously mentioned Copy-on-Write feature of UnionFS, adding, deleting, or modifying files within a container actually operates on copies of files in the writable layer. When the container is deleted, this writable layer is deleted with it, and all modifications made in the container are lost. The problem of persisting files within a container therefore needs to be solved. Docker provides two solutions:

  • Map directories from the host file system into the container’s directories, as shown in the diagram below. In this way, all files created in that directory within the container are stored in the corresponding directory on the host. When the container is closed, the host’s directory still exists, allowing for access to previously created files upon restarting the container, thus achieving file persistence within the container. Of course, it should be noted that if modifications are made to files that come with the image, those modifications cannot be saved upon closing the container since the image is read-only, unless a new image is built after modifying the files.

[Figure: mapping a host directory into a container's directory]

  • Combine the disk directories of multiple host machines into a shared storage network and map specific directories from the shared storage to specific containers, as shown in the diagram below. This way, containers can still read files created before shutdown upon restart. NFS is commonly used as a shared storage solution in production environments.

[Figure: mapping directories from shared network storage into containers]
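Both approaches are driven by the same volume flags on the CLI. A hedged sketch; paths, names, and images are illustrative, and a running Docker daemon is assumed:

```shell
# Solution 1: bind-mount a host directory into the container
docker run -d --name web -v /srv/site:/usr/share/nginx/html nginx
# Solution 2 variant: a named volume managed by Docker (volume drivers
# can back such volumes with shared storage like NFS)
docker volume create appdata
docker run -d --name app -v appdata:/data busybox sleep 3600
```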

9. Docker Image Creation Methods

There are two methods for creating images:

  • Generate a new image from a running container

When a container is running, all modifications within it are reflected in its writable layer. Docker provides a commit command that overlays the writable layer's modifications onto the container's original image to generate a new image. For example, if a Spark component is newly installed in a container and the container is then deleted, the Spark component disappears with the writable layer. But if the commit command is used to generate a new image first, that image will contain the Spark component when started as a container.

This method is relatively simple but does not allow for intuitive settings of environment variables, port listening, etc., making it suitable for simple usage scenarios.
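The commit workflow might look like the following hedged sketch (names, tags, and the installed software are illustrative; assumes a running Docker daemon):

```shell
# Start a container and make changes inside it (e.g. install software)
docker run -it --name base ubuntu /bin/bash
# ... inside the container: apt-get update && apt-get install -y <something> ...

# From the host: snapshot the container's writable layer as a new image
docker commit -m "add tools" base myrepo/base-tools:1.0
# Containers started from the new image include those changes
docker run -it myrepo/base-tools:1.0 /bin/bash
```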

  • Create a new image using a Dockerfile

A Dockerfile is a file that defines the steps for building an image. The Docker engine reads the Dockerfile with the build command and constructs the image step by step as defined. In development and production environments, building images from a Dockerfile is the mainstream approach. Below is an example Dockerfile:

# Base image
FROM ubuntu:14.04
# Maintainer signature
MAINTAINER guest
# Install the SSH service
RUN apt-get update && apt-get install -y openssh-server
# Create the directory sshd needs at runtime
RUN mkdir /var/run/sshd
# Create a user and set its password
RUN useradd -s /bin/bash -m -d /home/guest guest
RUN echo 'guest:123456' | chpasswd
# Set an environment variable
ENV RUNNABLE_USER_DIR /home/guest
# Port the container listens on by default
EXPOSE 22
# Start the SSH service when the container starts
CMD ["/usr/sbin/sshd", "-D"]

The Docker engine can construct an Ubuntu image with SSH service based on the steps defined in the above Dockerfile.
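Building and running the image might look like this (the tag and container name are illustrative; assumes a running Docker daemon):

```shell
# Run in the directory containing the Dockerfile
docker build -t ssh-ubuntu:14.04 .
# Map host port 2222 to the container's SSH port 22
docker run -d -p 2222:22 --name sshbox ssh-ubuntu:14.04
# Connect as the user created in the Dockerfile (password: 123456)
ssh -p 2222 guest@localhost
```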

10. Usage Scenarios of Docker

As a lightweight virtualization solution, Docker has a wide range of application scenarios. Below are some common scenarios:

  • Using as a lightweight virtual machine

    Containers can be created using system images like Ubuntu, functioning as virtual machines, which start faster and occupy less resources compared to traditional virtual machines, allowing for a large number of operating system containers to be started on a single machine for various testing purposes;

  • Using as a cloud host

    By combining with container management systems like Kubernetes, containers can be dynamically allocated and managed across numerous servers. Internally, it can even replace virtualization management platforms like VMWare, using Docker containers as cloud hosts;

  • Application service packaging

    In web application service development scenarios, the Java runtime environment and Tomcat server can be packaged into a base image, and after modifying the code package, a new image can be constructed, making service upgrades and version control very convenient;

  • Container as a Service (CaaS)

    The emergence of Docker has led many cloud platform providers to offer container cloud services, referred to as CaaS. Below is a comparison of IaaS, PaaS, and SaaS:

  • IaaS (Infrastructure as a Service): Provides virtual machines or other basic resources as services to users. Users can obtain virtual machines or storage resources from providers to host relevant applications while the tedious management of these infrastructures is handled by IaaS providers. The main users are system administrators and operations personnel in enterprises;

  • PaaS (Platform as a Service): Provides a development platform as a service to users. Users can conveniently write applications on a development platform that includes SDKs, documentation, and testing environments, without worrying about the management of servers, operating systems, networks, and storage during deployment or runtime, as these tedious tasks are handled by PaaS providers. The main users are enterprise developers.

  • SaaS (Software as a Service): Provides applications as services to customers. Users only need to connect to the network and use applications running in the cloud through a browser, without worrying about installation and other trivial matters, and avoiding high initial software and hardware investments. SaaS primarily targets general users.

  • CaaS (Container as a Service): Completes the functions of both IaaS and PaaS. Compared to traditional IaaS and PaaS services, CaaS offers more flexibility in underlying support than PaaS while being easier to control upper-layer applications than IaaS. Additionally, since Docker is a finer-grained virtualization service than VMs, it allows for more efficient utilization of computing resources. CaaS can be deployed on any physical machine, virtual machine, or IaaS cloud.

  • Continuous Integration and Continuous Deployment

    The internet industry advocates agile development, and continuous integration / continuous deployment (CI/CD) is its most typical model. With a Docker container cloud platform, the whole pipeline can be automated: pushing code to Git/SVN triggers the backend CaaS platform to download, compile, and build a test Docker image and replace the container serving the test environment, after which unit/integration tests run automatically in Jenkins or Hudson. Once the tests pass, the new image can be pushed online immediately, completing the service upgrade. The entire process is fully automated, which greatly simplifies operations and guarantees that online and offline environments are identical, with the online service version matching the Git/SVN release branch.
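Such a pipeline boils down to a handful of commands. A hedged sketch; the repository URL, registry host, variable name, and test script are all hypothetical placeholders a CI system would supply:

```shell
# Hypothetical steps a CI job runs after a push
git clone https://git.example.com/myapp.git && cd myapp
docker build -t registry.example.com/myapp:$GIT_COMMIT .
docker run --rm registry.example.com/myapp:$GIT_COMMIT ./run-tests.sh
# Tests passed: publish the image so the deployment stage can roll it out
docker push registry.example.com/myapp:$GIT_COMMIT
```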

  • Solving the implementation challenges of microservice architecture

    Frameworks like Spring Cloud can manage microservices, but microservices still need to run on operating systems. In applications developed under a microservice architecture, the number of microservices is often large, resulting in multiple microservices needing to be started on a single server to improve resource utilization. However, microservices may only be compatible with certain operating systems, leading to wasted resources and operational difficulties. Utilizing Docker’s environment isolation capabilities allows microservices to run in containers, resolving the aforementioned issues.

  • Executing temporary tasks

    Sometimes users only want to execute one-off tasks, but using traditional virtual machines requires setting up the environment and releasing resources after task completion, which can be cumbersome. Using Docker containers allows for the construction of temporary runtime environments, making it quick and easy to execute tasks and shut down the container afterward.

  • Multi-tenant environments

    Leveraging Docker’s environment isolation capabilities, exclusive containers can be provided for different tenants, achieving simplicity and cost-effectiveness.

11. Conclusion

The technology behind Docker is not mysterious; it is simply an application-level containerization technology that integrates achievements accumulated by many predecessors. It relies on version-compatible kernel containerization technologies across Linux distributions to achieve build-once, run-anywhere images, and uses the base operating-system image layer inside containers to shield the differences of the actual runtime environment. Developers therefore only need to ensure their application runs correctly on the chosen operating system and kernel version, without caring about the actual system differences of the runtime environment, which greatly improves efficiency and compatibility. However, as the number of running containers grows, container management becomes the next operational challenge, which is where container-management systems such as Kubernetes, Mesos, or Swarm come in; these will be discussed in a future post.
