Reduce Docker Image Size by 99% with These Two Tricks

Introduction

People new to containers are often intimidated by the size of the Docker images they build. Why does the image exceed 1 GB when all I need is an executable of a few MB? This article introduces several tricks for slimming down images without sacrificing convenience for developers and operations staff. The series is divided into three parts:

The first part focuses on multi-stage builds, which is crucial for reducing image size. In this section, I will explain the difference between static linking and dynamic linking, their impact on images, and how to avoid negative effects. There will also be a section introducing the Alpine image.

The second part will discuss appropriate reduction strategies for different languages, focusing mainly on Go, while also touching on Java, Node, Python, Ruby, and Rust. This section will also provide a detailed guide to avoiding pitfalls with the Alpine image. What? You don’t know the pitfalls of the Alpine image? Let me tell you.

The third part will explore general reduction strategies applicable to most languages and frameworks, such as using common base images, extracting executable files, and reducing the size of each layer. It will also introduce some more unusual or radical tools, such as Bazel, Distroless, DockerSlim, and UPX. These tools can yield remarkable results in specific scenarios, but they can often be counterproductive.

This article covers the first part.

The Root of All Evil

I bet that everyone who builds a Docker image with their own code for the first time will be intimidated by the image size. Let’s look at an example.

Let’s pull out the tried-and-true hello world C program:

/* hello.c */
#include <stdio.h>

int main () {
  puts("Hello, world!");
  return 0;
}

And build the image with the following Dockerfile:

FROM gcc
COPY hello.c .
RUN gcc -o hello hello.c
CMD ["./hello"]

Then you will find that the successfully built image size far exceeds 1 GB… because the image contains everything from the gcc image.

If you use the Ubuntu image, install the C compiler, and compile the program, you will get an image of about 300 MB, which is much smaller than the previous one. But it is still not small enough, as the compiled executable is only about 16 KB:

$ ls -l hello
-rwxr-xr-x   1 root root 16384 Nov 18 14:36 hello

Similarly, the Go language version of hello world will yield the same result:

package main

import "fmt"

func main () {
  fmt.Println("Hello, world!")
}

The image size built using the base image golang is 800 MB, while the compiled executable is only 2 MB:

$ ls -l hello
-rwxr-xr-x 1 root root 2008801 Jan 15 16:41 hello

Still not ideal. Is there a way to significantly reduce the image size? Read on.

To visually compare the sizes of different images, all images use the same name with different tags. For example: hello:gcc, hello:ubuntu, hello:thisweirdtrick, etc., so you can directly use the command docker images hello to list all images named hello without interference from other images.

Multi-Stage Builds

To significantly reduce the image size, multi-stage builds are essential. The idea of multi-stage builds is simple: “I don’t want to include a bunch of C or Go compilers and the entire build toolchain in the final image; I only want a compiled executable!”

Multi-stage builds can be recognized by multiple FROM instructions, where each FROM statement represents a new build stage, and the stage name can be specified using the AS parameter, for example:

FROM gcc AS mybuildstage
COPY hello.c .
RUN gcc -o hello hello.c
FROM ubuntu
COPY --from=mybuildstage hello .
CMD ["./hello"]

This example uses the base image gcc to compile the program hello.c, then starts a new build stage with ubuntu as the base image and copies the executable hello from the previous stage into the final image. The final image is 64 MB, a reduction of roughly 94% from the previous 1.14 GB:

🐳 → docker images minimage
REPOSITORY          TAG                    ...         SIZE
minimage            hello-c.gcc            ...         1.14GB
minimage            hello-c.gcc.ubuntu     ...         64.2MB

Can it be further optimized? Absolutely. Before continuing to optimize, let me remind you:

When declaring a build stage, the AS keyword is optional, and when copying files from an earlier stage you can refer to that stage by its index (starting from zero). That is, the following two lines are equivalent:

COPY --from=mybuildstage hello .
COPY --from=0 hello .

If the Dockerfile content is not very complex and there aren’t too many build stages, you can directly use the index to represent the build stage. Once the Dockerfile becomes complex with more build stages, it’s better to name each stage with the keyword AS for easier maintenance later.
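The equivalence between a stage name and its index can be sketched in Go. This is only a simplified model written for illustration, not Docker's actual parser: the function name stageIndex and the constant exampleDockerfile are made up here, and the model ignores the rule that a stage can only copy from stages declared before it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stageIndex returns the zero-based index of the build stage that a
// COPY --from=<ref> argument points at. ref may be a stage name declared
// with AS or a numeric index. Simplified model, not Docker's parser.
func stageIndex(dockerfile, ref string) (int, error) {
	names := map[string]int{} // stage name -> stage index
	count := 0                // number of FROM instructions seen so far
	for _, line := range strings.Split(dockerfile, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || !strings.EqualFold(fields[0], "FROM") {
			continue
		}
		// FROM <image> AS <name>: remember the name for this stage.
		if len(fields) >= 4 && strings.EqualFold(fields[len(fields)-2], "AS") {
			names[fields[len(fields)-1]] = count
		}
		count++
	}
	if i, ok := names[ref]; ok {
		return i, nil
	}
	if i, err := strconv.Atoi(ref); err == nil && i >= 0 && i < count {
		return i, nil
	}
	return 0, fmt.Errorf("no build stage %q", ref)
}

// exampleDockerfile is the multi-stage Dockerfile from this article.
const exampleDockerfile = `FROM gcc AS mybuildstage
COPY hello.c .
RUN gcc -o hello hello.c
FROM ubuntu
COPY --from=mybuildstage hello .
CMD ["./hello"]`

func main() {
	byName, _ := stageIndex(exampleDockerfile, "mybuildstage")
	byIndex, _ := stageIndex(exampleDockerfile, "0")
	// Both references point at the first stage.
	fmt.Println(byName, byIndex, byName == byIndex)
}
```

Names survive reordering of stages while indexes do not, which is the maintenance argument made above.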

Using Classic Base Images

I strongly recommend using classic base images in the first build stage, where classic images means images like CentOS, Debian, Fedora, and Ubuntu. You may have heard of the Alpine image; do not use it! At least not for now. I will explain its pitfalls later.

`COPY --from` Using Absolute Paths

When copying files from the previous build stage, the source path is resolved relative to the root directory of that stage, not its working directory. This causes trouble if you use the golang image as the base image for the build stage. Suppose you use the following Dockerfile to build the image:

FROM golang
COPY hello.go .
RUN go build hello.go
FROM ubuntu
COPY --from=0 hello .
CMD ["./hello"]

You will see the following error:

COPY failed: stat /var/lib/docker/overlay2/1be...868/merged/hello: no such file or directory

This is because the COPY command wants to copy /hello, but the golang image’s WORKDIR is /go, so the actual path of the executable file is /go/hello.
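The mismatch can be modeled in a few lines of Go. This is a sketch of the path rules as described above, not Docker code; the helper names resolveInStage and resolveCopyFromSource are invented for this illustration.

```go
package main

import (
	"fmt"
	"path"
)

// resolveInStage models where a relative path used inside a stage
// (for example the output of "go build") ends up: relative paths are
// resolved against the stage's WORKDIR.
func resolveInStage(workdir, p string) string {
	if path.IsAbs(p) {
		return path.Clean(p)
	}
	return path.Join(workdir, p)
}

// resolveCopyFromSource models the source side of COPY --from: the path
// is resolved against the root of the source stage, ignoring its WORKDIR.
func resolveCopyFromSource(p string) string {
	return path.Join("/", p)
}

func main() {
	built := resolveInStage("/go", "hello")  // where go build wrote the binary
	wanted := resolveCopyFromSource("hello") // where COPY --from=0 hello looks
	fmt.Println(built, wanted, built == wanted)
	// prints: /go/hello /hello false
}
```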

Of course, you can use an absolute path to solve this problem, but what if the base image changes the WORKDIR later? You will need to modify the absolute path constantly, so this solution is not very elegant. The best method is to specify the WORKDIR in the first stage and use an absolute path to copy the file in the second stage, so even if the base image changes the WORKDIR, it won’t affect the image build. For example:

FROM golang
WORKDIR /src
COPY hello.go .
RUN go build hello.go
FROM ubuntu
COPY --from=0 /src/hello .
CMD ["./hello"]

The final effect is still impressive, reducing the image size from 800 MB directly to 66 MB:

🐳 → docker images minimage
REPOSITORY     TAG                              ...    SIZE
minimage       hello-go.golang                  ...    805MB
minimage       hello-go.golang.ubuntu-workdir   ...    66.2MB

The Magic of FROM scratch

Returning to our hello world, the C version of the program is 16 kB and the Go version is 2 MB. Can we shrink the image down to that size? Can we build an image that contains only the program we need, without any extra files?

The answer is yes; you just need to change the base image of the second stage of the multi-stage build to scratch. scratch is a virtual image that cannot be pulled or run because it is completely empty. Using it means the new image is built from scratch, with no other image layers. For example:

FROM golang
COPY hello.go .
RUN go build hello.go
FROM scratch
COPY --from=0 /go/hello .
CMD ["./hello"]

The size of the image built this time is exactly 2 MB, which is perfect!

However, using scratch as a base image brings many inconveniences, let me explain.

Missing Shell

The first inconvenience of the scratch image is that there is no shell, which means that CMD/RUN statements cannot use the string (shell) form, for example:

...
FROM scratch
COPY --from=0 /go/hello .
CMD ./hello

If you create and run a container using the built image, you will encounter the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"/bin/sh\": stat /bin/sh: no such file or directory": unknown.

From the error message, it can be seen that the image does not contain /bin/sh, so the program cannot be run. This is because when you write the argument of a CMD/RUN statement as a plain string, Docker runs it through /bin/sh -c, meaning the following two statements are equivalent:

CMD ./hello
CMD ["/bin/sh", "-c", "./hello"]

The solution is quite simple: **use JSON syntax instead of string syntax.** For example, replace CMD ./hello with CMD ["./hello"], so Docker will run the program directly without putting it in the shell.
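To make the difference concrete, here is a small Go model of how the two CMD forms turn into the argument vector of the container process. It is a sketch under the assumption that anything that does not parse as a JSON array of strings is treated as the string form; commandArgv is a name made up for this example and is not part of Docker.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// commandArgv models how a CMD value becomes the argv of the container
// process. JSON syntax is used as is; string syntax is wrapped in a shell.
func commandArgv(cmd string) []string {
	trimmed := strings.TrimSpace(cmd)
	var argv []string
	if strings.HasPrefix(trimmed, "[") && json.Unmarshal([]byte(trimmed), &argv) == nil {
		return argv // JSON syntax: no shell involved
	}
	// String syntax: requires /bin/sh to exist inside the image.
	return []string{"/bin/sh", "-c", trimmed}
}

func main() {
	fmt.Printf("%q\n", commandArgv(`./hello`))     // ["/bin/sh" "-c" "./hello"]
	fmt.Printf("%q\n", commandArgv(`["./hello"]`)) // ["./hello"]
}
```

Only the first form needs /bin/sh, which is exactly the file a scratch image lacks.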

Missing Debugging Tools

The scratch image does not contain any debugging tools, such as ls, ps, ping, etc., and of course, there is no shell (as mentioned earlier), so you cannot use docker exec to enter the container or view network stack information, etc.

If you want to view files in the container, you can use docker cp. If you want to inspect or debug the network stack, you can use docker run --net container:, or use nsenter. To make debugging containers easier, Kubernetes has also introduced a concept called Ephemeral Containers^[1], which was still an alpha feature when this article was written.

Although there are many methods to help us debug containers, they complicate things, and we are pursuing simplicity; the simpler, the better.

A compromise can be to choose busybox or alpine images to replace scratch. Although they add a few MB, overall, this is just sacrificing a small amount of space for debugging convenience, which is worth it.

Missing libc

This is the most challenging problem. With scratch as the base image, the Go version of hello world runs fine, but the C version does not, and neither do more complex Go programs that use network-related packages (for example, anything that performs DNS lookups). You will encounter an error like:

standard_init_linux.go:211: exec user process caused "no such file or directory"

From the error message, it can be seen that a file is missing, but it does not tell us which files are missing. In fact, these files are the dynamic libraries required for the program to run.
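For the Go case mentioned above, the following is a minimal sketch of the kind of program meant. It assumes a default build with cgo enabled and a C compiler available (as in the golang image), in which case the net package can link against the system C library's resolver and the resulting binary is dynamically linked. The function name resolve and the default argument are choices made for this example only.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// resolve returns the addresses for host using Go's net package.
// Merely importing net is what can make the binary depend on libc
// when cgo is enabled at build time.
func resolve(host string) ([]string, error) {
	return net.LookupHost(host)
}

func main() {
	// An IP literal is returned as is; pass a host name as the first
	// argument to trigger a real lookup.
	host := "127.0.0.1"
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	addrs, err := resolve(host)
	if err != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```

Building the same file with CGO_ENABLED=0 go build is one common way to get a statically linked binary instead; the Go-specific details belong to the next article in this series.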

So, what are dynamic libraries? Why are they needed?

Static and dynamic libraries differ in how a program is linked during the linking stage of compilation. With a static library, the generated object files (.o) and the referenced library code are combined into the executable at link time; this is called static linking. A dynamic library is not copied into the executable at build time but is loaded when the program runs; this is called dynamic linking.

Most programs from the 1990s used static linking because they mostly ran from floppy disks or cassette tapes, and there were no standard libraries at that time. A running program therefore did not depend on any external library, which made it easy to port. However, on time-sharing systems like Linux, which run multiple programs concurrently from the same hard drive, these programs usually rely on the standard C library, and the advantages of dynamic linking become apparent. With dynamic linking, the executable does not contain the library code itself, only references to the library files. For example, a program that relies on a libtrigonometry.so library for the cos and sin functions will find and load libtrigonometry.so through that reference when it runs, allowing the program to call functions from that library.

The benefits of using dynamic linking are obvious:

Save disk space; different programs can share common libraries.
Save memory; shared libraries only need to be loaded into memory from the disk once and can then be shared between different programs.
Maintenance is easier; when library files are updated, there is no need to recompile all programs using that library.

Strictly speaking, it is the combination of dynamic linking and shared libraries that saves memory. In Linux, the extension for dynamic libraries is .so (shared object), while in Windows it is .DLL (Dynamic-link library)^[2].

Returning to the original question, by default, C programs use dynamic linking, and so do Go programs once they pull in cgo-backed packages such as net (the plain Go hello world above does not, which is why it ran on scratch). The C hello world program above uses the standard library file libc.so.6, so it can only run if the image contains that file. Using scratch as the base image will definitely not work, and using busybox and alpine will not work either, because busybox does not contain the standard library, and alpine uses musl libc, which is incompatible with the commonly used glibc. Later articles in this series will explain this in detail, so I won’t elaborate here.

So how do we solve the standard library problem? There are three solutions.

1. Use Static Libraries

We can have the compiler use static libraries to compile the program. There are many ways to do this; if you use gcc as the compiler, just add a parameter -static:

$ gcc -o hello hello.c -static

The compiled executable file size will be 760 kB, which is much larger than the previous 16kB because the executable file contains the library files it needs to run. The compiled program can run on the scratch image.

If you use the alpine image as the base image for compilation, the resulting executable file will be even smaller (< 100kB), which will be discussed in detail in the next article.

2. Copy Library Files into the Image

To find out which library files are needed for the program to run, you can use the ldd tool:

$ ldd hello
	linux-vdso.so.1 (0x00007ffdf8acb000)
	libc.so.6 => /usr/lib/libc.so.6 (0x00007ff897ef6000)
	/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007ff8980f7000)

From the output, it can be seen that the program only needs the libc.so.6 library file. linux-vdso.so.1 belongs to a mechanism called VDSO^[3] that speeds up certain system calls; it is provided by the kernel at run time, so there is no file to copy. ld-linux-x86-64.so.2 is the dynamic linker itself, which locates and loads all the dependent libraries when the program starts.
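The same information can also be read straight from the ELF headers without running the dynamic linker. The sketch below uses Go's standard debug/elf package to print the requested dynamic linker (the PT_INTERP segment) and the libraries the binary declares as needed. The function name inspect and the command-line handling are choices made for this example, and like ldd it will not reveal libraries that are only loaded on demand at run time.

```go
package main

import (
	"debug/elf"
	"fmt"
	"io"
	"os"
	"strings"
)

// inspect reports the dynamic linker path requested by an ELF binary and
// the shared libraries it declares as needed. An empty interp means the
// file has no PT_INTERP segment, which is typical of static binaries.
func inspect(p string) (interp string, libs []string, err error) {
	f, err := elf.Open(p)
	if err != nil {
		return "", nil, err
	}
	defer f.Close()

	for _, prog := range f.Progs {
		if prog.Type != elf.PT_INTERP {
			continue
		}
		data, err := io.ReadAll(prog.Open())
		if err != nil {
			return "", nil, err
		}
		// The segment holds a NUL-terminated path.
		interp = strings.TrimRight(string(data), "\x00")
	}
	libs, err = f.ImportedLibraries()
	return interp, libs, err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: elfdeps <binary>")
		return
	}
	interp, libs, err := inspect(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "inspect failed:", err)
		return
	}
	fmt.Println("dynamic linker:", interp)
	fmt.Println("needed libraries:", libs)
}
```

If the dynamic linker path printed here does not exist in the image, starting the program fails with the unhelpful "no such file or directory" error shown earlier.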

You can choose to copy all the library files listed by ldd into the image, but this can be difficult to maintain, especially when the program has a large number of dependent libraries. For the hello world program, copying library files is completely fine, but for more complex programs (for example, those that use DNS), you will encounter puzzling issues: glibc (GNU C library) implements DNS through a rather complex mechanism called NSS (Name Service Switch). This requires a configuration file /etc/nsswitch.conf and additional function libraries, but these function libraries are not displayed by ldd because they are loaded only after the program runs. If you want DNS resolution to work correctly, you must copy these additional library files (such as /lib64/libnss_*).

I personally do not recommend directly copying library files because it is very difficult to maintain, and there are many unknown risks that require constant changes.

3. Use `busybox:glibc` as the Base Image

There is one image that perfectly solves all these problems: busybox:glibc. It is only 5 MB in size and includes glibc and various debugging tools. If you want to choose a suitable image to run programs using dynamic linking, busybox:glibc is the best choice.

Note: If your program uses libraries other than the standard library, you still need to copy those library files into the image.

Conclusion

Finally, let’s compare the image sizes built using different methods:

Original build method: 1.14 GB
Using ubuntu image for multi-stage build: 64.2 MB
Using alpine image and static glibc: 6.5 MB
Using alpine image and dynamic libraries: 5.6 MB
Using scratch image and static glibc: 940 kB
Using scratch image and static musl libc: 94 kB

Ultimately, we reduced the image size by 99.99%.
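The percentages can be checked with a few lines of Go. This sketch takes the sizes from the list above and assumes decimal units (1 GB treated as 1,000,000 kB); reductionPercent is a helper name made up for this example.

```go
package main

import "fmt"

// reductionPercent returns how much smaller after is than before,
// expressed as a percentage of before.
func reductionPercent(beforeKB, afterKB float64) float64 {
	return (1 - afterKB/beforeKB) * 100
}

func main() {
	const originalKB = 1.14e6 // 1.14 GB, assuming decimal units

	results := []struct {
		name string
		kb   float64
	}{
		{"ubuntu, multi-stage build", 64200},
		{"alpine, static glibc", 6500},
		{"alpine, dynamic libraries", 5600},
		{"scratch, static glibc", 940},
		{"scratch, static musl libc", 94},
	}
	for _, r := range results {
		fmt.Printf("%-28s %.2f%%\n", r.name, reductionPercent(originalKB, r.kb))
	}
}
```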

However, I do not recommend using scratch as the base image because debugging can be very troublesome, but if you like it, I won’t stop you.

The next article will focus on the image reduction strategies for Go language, where a significant portion will be dedicated to discussing the Alpine image because it is so cool, and you must understand its details before using it.

Footnotes

[1] Ephemeral Containers: https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/
[2] Dynamic-link library: https://en.wikipedia.org/wiki/Dynamic-link_library
[3] VDSO: https://en.wikipedia.org/wiki/VDSO

This article is reproduced from: “Cloud Native Laboratory”, original text: https://url.hi-linux.com/VHokr, copyright belongs to the original author.
