Reduce Docker Image Size by 99% with These Techniques

Newcomers to containers are often intimidated by the size of the Docker images they build. Why does an image exceed 1 GB when all you need is an executable of a few MB? This article introduces several techniques to slim down your images without sacrificing convenience for developers and operators.

01

The Root of All Evil

I bet every first-time user who builds a Docker image with their own code will be shocked by the image size. Let’s look at an example.

Let’s pull out the classic hello world C program:

/* hello.c */
#include <stdio.h>

int main() {
  puts("Hello, world!");
  return 0;
}

And build the image using the following Dockerfile:

FROM gcc
COPY hello.c .
RUN gcc -o hello hello.c
CMD ["./hello"]

You will then find that the image builds successfully but is well over 1 GB in size, because it contains the entire contents of the gcc base image.

If you use the Ubuntu image, install a C compiler, and compile the program, the image comes out at about 300 MB, much smaller than before. Still not small enough, though, considering the compiled executable itself is less than 20 KB:

$ ls -l hello
-rwxr-xr-x   1 root root 16384 Nov 18 14:36 hello
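For reference, a sketch of what such an Ubuntu-based Dockerfile might look like (package names assumed for current Ubuntu releases):

```dockerfile
FROM ubuntu
# install only the C compiler; build-essential would pull in even more
RUN apt-get update && apt-get install -y gcc && rm -rf /var/lib/apt/lists/*
COPY hello.c .
RUN gcc -o hello hello.c
CMD ["./hello"]
```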

Similarly, the Go version of hello world suffers from the same problem:

package main

import "fmt"

func main() {
  fmt.Println("Hello, world!")
}

The image built using the base image golang has a size of 800 MB, while the compiled executable is only 2 MB:

$ ls -l hello
-rwxr-xr-x 1 root root 2008801 Jan 15 16:41 hello

Still not ideal. Is there a way to shrink the image dramatically? Read on.

To make it easy to compare image sizes, all images in this article share the same name and differ only by tag, for example hello:gcc, hello:ubuntu, hello:thisweirdtrick, and so on. That way, docker images hello lists every image named hello without interference from other images.

02

Multi-Stage Builds

To significantly reduce the image size, multi-stage builds are essential. The idea of multi-stage builds is simple: “I don’t want to include a bunch of C or Go compilers and the entire build toolchain in the final image; I just need a compiled executable!”

Multi-stage builds can be recognized by their multiple FROM instructions. Each FROM starts a new build stage, and a stage can be named with AS, for example:

FROM gcc AS mybuildstage
COPY hello.c .
RUN gcc -o hello hello.c
FROM ubuntu
COPY --from=mybuildstage hello .
CMD ["./hello"]

This example compiles hello.c in a stage based on gcc, then starts a new stage based on ubuntu and copies the executable hello from the previous stage into the final image. The final image is 64 MB, about 95% smaller than the previous 1.1 GB:

🐳 → docker images minimage
REPOSITORY          TAG                    ...         SIZE
minimage            hello-c.gcc            ...         1.14GB
minimage            hello-c.gcc.ubuntu     ...         64.2MB

Can it be optimized further? Of course. But before we continue, a couple of pointers:

You do not have to name build stages with AS. When copying files in a later stage, you can refer to an earlier stage by its sequence number, counting from zero. That is, the following two lines are equivalent:

COPY --from=mybuildstage hello .
COPY --from=0 hello .

If the Dockerfile is simple and has only a few stages, referring to them by sequence number is fine. Once the Dockerfile grows complex and the number of stages increases, it is best to name each stage with AS, which also makes later maintenance easier.
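For instance, a sketch of a multi-stage Dockerfile where every stage is named (the intermediate test stage here is hypothetical, just to show why names beat numbers once stages multiply):

```dockerfile
FROM golang AS build
WORKDIR /src
COPY hello.go .
RUN go build -o hello hello.go

# a named intermediate stage that simply runs the binary as a smoke test
FROM build AS test
RUN ./hello

FROM ubuntu AS runtime
# named stages keep COPY --from readable even as stages are added or reordered
COPY --from=build /src/hello .
CMD ["./hello"]
```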

Using Classic Base Images

I strongly recommend using classic base images for the first build stage. By classic I mean images like CentOS, Debian, Fedora, and Ubuntu: they are well maintained, and their toolchains and file layouts are well known.

Use Absolute Paths with COPY --from

When copying files from a previous build stage, paths are interpreted relative to the root of that stage's filesystem, not its WORKDIR. If you use the golang image as the base image of the build stage, you may hit exactly this problem. Suppose you build with the following Dockerfile:

FROM golang
COPY hello.go .
RUN go build hello.go
FROM ubuntu
COPY --from=0 hello .
CMD ["./hello"]

You will see an error like this:

COPY failed: stat /var/lib/docker/overlay2/1be...868/merged/hello: no such file or directory

This is because COPY tries to copy /hello, but the golang image sets WORKDIR to /go, so the executable actually lives at /go/hello.

You could work around this with an absolute path, but what if the base image changes its WORKDIR someday? You would have to keep adjusting the path, which is not elegant. The better approach is to set WORKDIR explicitly in the first stage and copy with an absolute path in the second stage. Then even if the base image changes its WORKDIR, the build is unaffected. For example:

FROM golang
WORKDIR /src
COPY hello.go .
RUN go build hello.go
FROM ubuntu
COPY --from=0 /src/hello .
CMD ["./hello"]

The final effect is still impressive, reducing the image size directly from 800 MB to 66 MB:

🐳 → docker images minimage
REPOSITORY     TAG                              ...    SIZE
minimage       hello-go.golang                  ...    805MB
minimage       hello-go.golang.ubuntu-workdir   ...    66.2MB

03

The Magic of FROM scratch

Returning to our hello world: the C version of the program is 16 KB and the Go version is 2 MB. Can we shrink the image to that size? Can we build an image that contains only the program we need, with no extra files?

The answer is yes: you just need to change the base image of the second stage of the multi-stage build to scratch. scratch is a pseudo-image that cannot be pulled or run, because it represents nothing at all! It means the new image is built from scratch, with no image layers underneath. For example:

FROM golang
COPY hello.go .
RUN go build hello.go
FROM scratch
COPY --from=0 /go/hello .
CMD ["./hello"]

The size of the image built this time is exactly 2 MB, which is perfect!

However, using scratch as the base image brings several inconveniences. Let me walk through them one by one.

Missing Shell

The first inconvenience of the scratch image is that it has no shell, which means you cannot use the string (shell) form of CMD/RUN statements, for example:

...
FROM scratch
COPY --from=0 /go/hello .
CMD ./hello

If you create and run a container using the built image, you will encounter the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: "/bin/sh": stat /bin/sh: no such file or directory": unknown.

From the error message, we can see that the image does not contain /bin/sh, so the program cannot run. When CMD/RUN is given a plain string, its arguments are executed through /bin/sh, meaning the following two statements are equivalent:

CMD ./hello
CMD /bin/sh -c "./hello"

The solution is actually quite simple: **use the JSON (exec) form instead of the string form.** For example, replacing CMD ./hello with CMD ["./hello"] makes Docker run the program directly, without wrapping it in a shell.

Missing Debugging Tools

The scratch image contains no debugging tools: no ls, ps, or ping, and of course no shell (as mentioned above). You cannot docker exec into the container, inspect its network stack, and so on.

If you want to inspect files in the container, use docker cp. To view or debug its network stack, use docker run --net container:<id> or nsenter. For more convenient container debugging, Kubernetes has also introduced Ephemeral Containers[1], though it is still an alpha feature.

There are plenty of ways to debug containers, then, but they all add complication. We are pursuing simplicity: the simpler, the better.

A good compromise is to use busybox or alpine instead of scratch. They add a few MB, but trading a small amount of space for debugging convenience is usually worthwhile.

Missing libc

This is the most difficult problem to solve. With scratch as the base image, the Go version of hello world runs smoothly, but the C version does not, and a more complex Go program (for example, one that uses network-related packages) may also fail with an error like:

standard_init_linux.go:211: exec user process caused "no such file or directory"

The error message says a file is missing, but not which one. In fact, the missing files are the dynamic libraries the program needs at run time.

So, what are dynamic libraries? Why are they needed?

Static and dynamic libraries differ in how the program is linked at the linking stage of compilation. A static library is copied into the executable at link time; a dynamic library is not linked into the target code at compile time but loaded when the program starts, hence the name dynamic linking.

Early programs were mostly statically linked, because they ran from floppy disks or tape and shared libraries were not yet widespread. A statically linked program depends on no library at run time, which makes it easy to port. But on a time-sharing system like Linux, many programs run concurrently from the same disk, and most of them use the standard C library; this is where dynamic linking shines. With dynamic linking, the executable does not contain the library code itself, only references to the library files.

The benefits of using dynamic linking are obvious:

  1. Saves disk space: different programs can share common libraries.
  2. Saves memory: a shared library is loaded from disk into memory once and then shared among programs.
  3. Easier maintenance: when a library file is updated, programs that use it do not need to be recompiled.

Strictly speaking, it is the sharing that saves memory: a dynamic library achieves this by also being a shared library. On Linux, dynamic libraries use the extension .so (shared object), while on Windows the extension is .DLL (Dynamic-link library[2]).

Returning to the initial question: by default, C programs are dynamically linked, and so are Go programs that use cgo (for example, via network-related packages). The hello world above needs the standard library file libc.so.6, so it can only run if the image contains that file. scratch is obviously out, and busybox and alpine will not work either: busybox does not ship a standard C library, and alpine uses musl libc, which is incompatible with the commonly used glibc. A future article will dig into this, so I won't elaborate here.

So how do we solve the standard library problem? There are three solutions.

1. Use Static Libraries

We can instruct the compiler to link the program statically. There are several ways to do this; with gcc, just add the -static flag:

$ gcc -o hello hello.c -static

The compiled executable is about 760 KB, much larger than the earlier 16 KB, because it now embeds the libraries it needs. The resulting program runs fine on the scratch image.

If you compile on the alpine image instead, the executable is even smaller (under 100 KB), because it statically links musl libc rather than glibc.
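Go offers a similar escape hatch: setting CGO_ENABLED=0 disables cgo, yielding a fully static binary that runs on scratch even when network packages are involved. A minimal sketch:

```dockerfile
FROM golang AS build
WORKDIR /src
COPY hello.go .
# CGO_ENABLED=0 disables cgo, so the resulting binary is statically linked
RUN CGO_ENABLED=0 go build -o hello hello.go

FROM scratch
COPY --from=build /src/hello .
CMD ["./hello"]
```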

2. Copy Library Files to the Image

To find out which library files are needed for the program to run, you can use the ldd tool:

$ ldd hello
	linux-vdso.so.1 (0x00007ffdf8acb000)
	libc.so.6 => /usr/lib/libc.so.6 (0x00007ff897ef6000)
	/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007ff8980f7000)

From the output, we can see that the program only requires libc.so.6. linux-vdso.so.1 is related to a mechanism called VDSO[3] that accelerates certain system calls, and is optional. ld-linux-x86-64.so.2 is the dynamic linker/loader itself, which resolves and loads the required libraries when the program starts.

You could copy every library file listed by ldd into the image, but that is hard to maintain, especially when a program has many dependencies. For the hello world program, copying libraries works fine; for more complex programs (for example, anything that does DNS lookups) you will run into confusing issues: glibc (the GNU C library) implements DNS through a rather complex mechanism called NSS (Name Service Switch). It requires a configuration file, /etc/nsswitch.conf, and additional libraries that ldd does not list, because they are loaded only after the program starts. If you want DNS resolution to work, you must also copy those extra libraries (/lib64/libnss_*).
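For completeness, a sketch of this copy-the-libraries approach. The library paths below are assumptions for an x86-64 Debian-based build image; verify the actual paths on your build image with ldd:

```dockerfile
FROM gcc AS build
COPY hello.c .
RUN gcc -o hello hello.c

FROM scratch
# paths below are assumptions; check the real ones with `ldd hello`
COPY --from=build /lib/x86_64-linux-gnu/libc.so.6 /lib/x86_64-linux-gnu/
COPY --from=build /lib64/ld-linux-x86-64.so.2 /lib64/
COPY --from=build /hello .
CMD ["./hello"]
```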

I personally do not recommend directly copying library files because it is very difficult to maintain, and there are many unknown risks.

3. Use busybox:glibc as the Base Image

There is one image that neatly solves all of these problems: busybox:glibc. It is only about 5 MB, and it contains glibc along with various debugging tools. If you need a base image for dynamically linked programs, busybox:glibc is the best choice.

Note: If your program uses libraries other than the standard library, you still need to copy those library files into the image.
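Putting it together, a sketch with busybox:glibc as the runtime stage. One caveat to test for yourself: the build image's glibc may be newer than the one shipped in busybox:glibc, in which case the binary may still refuse to start:

```dockerfile
FROM gcc AS build
COPY hello.c .
RUN gcc -o hello hello.c

# busybox:glibc provides glibc plus a shell and debugging tools in ~5 MB
FROM busybox:glibc
COPY --from=build /hello .
CMD ["./hello"]
```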

04

Conclusion

Finally, let’s compare the sizes of images built with different methods:

  • Original build method: 1.14 GB
  • Multi-stage build using ubuntu image: 64.2 MB
  • Using alpine image with static glibc: 6.5 MB
  • Using alpine image with dynamic libraries: 5.6 MB
  • Using scratch image with static glibc: 940 KB
  • Using scratch image with static musl libc: 94 KB

In the end, we reduced the image size by 99.99%.

However, I do not recommend using scratch as a base image, because debugging it is a real pain. But if you like it, I won't stop you.

Footnotes

[1] Ephemeral Containers: https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/
[2] Dynamic-link library: https://en.wikipedia.org/wiki/Dynamic-link_library
[3] VDSO: https://en.wikipedia.org/wiki/VDSO
