Newcomers to containers are often intimidated by the size of the Docker images they build. Why does an image exceed 1 GB when all I need is an executable of a few MB? This article introduces several techniques to help you slim down your images without sacrificing convenience for developers and operations personnel.
01
The Root of All Evil
I bet every first-time user who builds a Docker image with their own code will be shocked by the image size. Let’s look at an example.
Let’s pull out the classic hello world C program:
/* hello.c */
#include <stdio.h>

int main(void) {
    puts("Hello, world!");
    return 0;
}
And build the image using the following Dockerfile:
FROM gcc
COPY hello.c .
RUN gcc -o hello hello.c
CMD ["./hello"]
You will then find that the resulting image is well over 1 GB … because it contains the entire contents of the gcc base image.
If you instead start from the ubuntu image, install the C compiler, and compile the program, you get an image of about 300 MB, much smaller than the previous one. That is still not small enough, as the compiled executable is less than 20 KB:
$ ls -l hello
-rwxr-xr-x 1 root root 16384 Nov 18 14:36 hello
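For reference, the Ubuntu-based build could look roughly like the following. This is a sketch, not the Dockerfile actually used for the measurement above: the package names (gcc and libc6-dev) are an assumption, and the resulting size will vary with the Ubuntu release.

```dockerfile
# Sketch only: package names are assumptions, not taken from the article.
FROM ubuntu
# Install a C compiler and the C library headers.
RUN apt-get update && apt-get install -y gcc libc6-dev
COPY hello.c .
RUN gcc -o hello hello.c
CMD ["./hello"]
```

The compiler and everything it pulls in stay in the final image, which is why the result is still hundreds of MB for a 16 KB program.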
Similarly, the Go version of hello world yields the same result:
package main

import "fmt"

func main() {
	fmt.Println("Hello, world!")
}
The image built on the golang base image has a size of 800 MB, while the compiled executable is only 2 MB:
$ ls -l hello
-rwxr-xr-x 1 root root 2008801 Jan 15 16:41 hello
Still not ideal. Is there a way to significantly reduce the image size? Read on.
To make the sizes of different images easy to compare, all images use the same image name with different tags, for example hello:gcc, hello:ubuntu, hello:thisweirdtrick, and so on. You can then run the command docker images hello to list all images named hello without interference from other images.
02
Multi-Stage Builds
To significantly reduce the image size, multi-stage builds are essential. The idea of multi-stage builds is simple: “I don’t want to include a bunch of C or Go compilers and the entire build toolchain in the final image; I just need a compiled executable!”
A multi-stage build can be recognized by its multiple FROM instructions. Each FROM statement starts a new build stage, and the stage can be named using the AS parameter, for example:
FROM gcc AS mybuildstage
COPY hello.c .
RUN gcc -o hello hello.c
FROM ubuntu
COPY --from=mybuildstage hello .
CMD ["./hello"]
This example uses the base image gcc to compile the program hello.c, then starts a new build stage that uses ubuntu as the base image, copying the executable hello from the previous stage to the final image. The final image size is 64 MB, reduced by 95% from the previous 1.1 GB:
🐳 → docker images minimage
REPOSITORY TAG ... SIZE
minimage hello-c.gcc ... 1.14GB
minimage hello-c.gcc.ubuntu ... 64.2MB
Can it be further optimized? Of course. Before we continue, a few reminders:
When declaring a build stage, you don’t have to use the keyword AS; when copying files in a later stage, you can refer to an earlier build stage directly by its sequence number (starting from zero). That is, the following two lines are equivalent:
COPY --from=mybuildstage hello .
COPY --from=0 hello .
If the Dockerfile is not too complex and there are not many build stages, you can refer to a build stage directly by its sequence number. Once the Dockerfile becomes complex and the number of build stages increases, it is best to name each stage using the keyword AS, as this also makes later maintenance easier.
Using Classic Base Images
I strongly recommend using classic base images in the first build stage. Here, classic images refer to images like CentOS, Debian, Fedora, and Ubuntu.
Using Absolute Paths with COPY --from
When copying files from the previous build stage, the paths used are relative to the previous stage’s root directory, not to its working directory. If you use the golang image as the base image for the build stage, this can trip you up. Suppose you use the following Dockerfile to build the image:
FROM golang
COPY hello.go .
RUN go build hello.go
FROM ubuntu
COPY --from=0 hello .
CMD ["./hello"]
You will see an error like this:
COPY failed: stat /var/lib/docker/overlay2/1be...868/merged/hello: no such file or directory
This is because the COPY command tries to copy /hello, but the golang image’s WORKDIR is /go, so the actual path of the executable is /go/hello.
Of course, you can use the absolute path /go/hello to solve this problem, but what if the base image changes its WORKDIR? You would have to keep modifying the absolute path, so this solution is not very elegant. The best approach is to set the WORKDIR yourself in the first stage and use the matching absolute path to copy the file in the second stage. This way, even if the base image changes its WORKDIR, the build is not affected. For example:
FROM golang
WORKDIR /src
COPY hello.go .
RUN go build hello.go
FROM ubuntu
COPY --from=0 /src/hello .
CMD ["./hello"]
The final effect is still impressive, reducing the image size from 800 MB to 66 MB:
🐳 → docker images minimage
REPOSITORY TAG ... SIZE
minimage hello-go.golang ... 805MB
minimage hello-go.golang.ubuntu-workdir ... 66.2MB
03
The Magic of FROM scratch
Returning to our hello world, the C version of the program is 16 kB, and the Go version of the program is 2 MB. Can we reduce the image to such a small size? Can we build an image that only contains the program I need, with no extra files?
The answer is yes: you just need to change the base image of the second stage of the multi-stage build to scratch. scratch is a virtual image; it cannot be pulled and cannot be run, because it contains nothing! This means the new image is built from scratch, with no other image layers. For example:
FROM golang
COPY hello.go .
RUN go build hello.go
FROM scratch
COPY --from=0 /go/hello .
CMD ["./hello"]
The size of the image built this time is exactly 2 MB, which is perfect!
However, using scratch as the base image brings many inconveniences. Let me go through them one by one.
Missing Shell
The first inconvenience of the scratch image is that there is no shell, which means that you cannot use the string (shell) form of CMD/RUN statements, for example:
...
FROM scratch
COPY --from=0 /go/hello .
CMD ./hello
If you create and run a container using the built image, you will encounter the following error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: "/bin/sh": stat /bin/sh: no such file or directory": unknown.
From the error message, we can see that the image does not contain /bin/sh, so the program cannot be run. This is because when you pass a plain string to a CMD/RUN statement, the string is executed by /bin/sh, meaning the following two statements are equivalent:
CMD ./hello
CMD ["/bin/sh", "-c", "./hello"]
The solution is actually quite simple: **use JSON syntax instead of string syntax.** For example, replacing CMD ./hello with CMD ["./hello"] will allow Docker to run the program directly without going through the shell.
Missing Debugging Tools
The scratch image does not contain any debugging tools such as ls, ps, or ping, and of course there is no shell (as mentioned above). You cannot use docker exec to enter the container, nor can you view network stack information.
If you want to view files in the container, you can use docker cp; if you want to view or debug the network stack, you can use docker run --net container: or nsenter. To make debugging containers easier, Kubernetes has also introduced a new concept called Ephemeral Containers[1], but it is still an Alpha feature.
Although there are many ways to help us debug containers, they complicate matters. We are pursuing simplicity; the simpler, the better.
A compromise is to choose the busybox or alpine image instead of scratch. They add a few MB, but that is a small amount of space to sacrifice for debugging convenience, and it is well worth it.
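As a rough sketch of this compromise, the second stage of the earlier Go build could start from busybox rather than scratch. The /src working directory is carried over from the WORKDIR example earlier in the article; nothing else is assumed.

```dockerfile
FROM golang
WORKDIR /src
COPY hello.go .
RUN go build hello.go

# busybox instead of scratch: a few MB larger, but a shell and basic tools are available.
FROM busybox
COPY --from=0 /src/hello .
CMD ["./hello"]
```

With this image, docker exec can start sh inside the running container, which is not possible with scratch.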
Missing libc
This is the most difficult problem to solve. When using scratch as the base image, the Go version of hello world runs smoothly, but the C version does not, and a more complex Go program (for example, one that uses network-related packages) may also fail to run. You will encounter an error like:
standard_init_linux.go:211: exec user process caused "no such file or directory"
From the error message, we can see that a file is missing, but it doesn’t tell us which files are missing. In fact, these files are the dynamic libraries required for the program to run.
So, what are dynamic libraries? Why are they needed?
Dynamic libraries and static libraries refer to the way programs are linked during the linking stage of compilation. Static libraries are linked and packed into the executable file during the linking stage, while dynamic libraries are not linked to the target code during compilation but are loaded during program execution, thus referred to as dynamic linking.
Programs from the 90s mostly used static linking because they ran on floppy disks or tape drives, and there were no standard libraries at that time. A statically linked program does not depend on any function library at run time, which makes it easy to port. But on a time-sharing system like Linux, many programs run concurrently from the same hard drive, and most of them use the standard C library, which is where the advantages of dynamic linking show. With dynamic linking, the executable file does not contain the standard library itself but only references to the library files.
The benefits of using dynamic linking are obvious:
- Save disk space: different programs can share common libraries.
- Save memory: shared libraries only need to be loaded into memory once from disk and are then shared between different programs.
- Easier maintenance: after updating a library file, there is no need to recompile all the programs that use that library.
Strictly speaking, it is dynamic libraries combined with shared libraries that achieve the memory savings. The file extension of dynamic libraries on Linux is .so (shared object), while on Windows it is .DLL (Dynamic-link library[2]).
Returning to the initial question: by default, C programs use dynamic linking, and so do Go programs once they rely on cgo-backed packages such as net (a pure-Go hello world is linked statically, which is why it ran on scratch). The C hello world program above uses the standard library file libc.so.6, so it can run only if the image contains this file. Using scratch as the base image is therefore out of the question, and busybox and alpine will not work either: busybox does not contain the standard library, and alpine uses musl libc, which is incompatible with the commonly used glibc. Future articles will delve into this, so I won’t elaborate here.
So how do we solve the standard library problem? There are three solutions.
1. Use Static Libraries
We can instruct the compiler to link the program statically. There are many ways to do this. If you use gcc as the compiler, just add the -static flag:
$ gcc -o hello hello.c -static
The compiled executable will be 760 kB, much larger than the previous 16 kB, because it now contains the libraries required for its execution. The compiled program can run on the scratch image.
If you use the alpine image as the base image for compilation, the resulting executable will be even smaller (< 100 kB).
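Put together with a multi-stage build, a static build on scratch might look like the sketch below. The stage name build is a hypothetical label chosen for illustration, and the path /hello assumes the gcc image leaves the working directory at /, as in the earlier examples.

```dockerfile
# "build" is a hypothetical stage name.
FROM gcc AS build
COPY hello.c .
# -static links libc into the executable, so no shared libraries are needed at run time.
RUN gcc -o hello hello.c -static

FROM scratch
COPY --from=build /hello .
CMD ["./hello"]
```

The final image then contains a single file, the statically linked executable.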
2. Copy Library Files to the Image
To find out which library files the program needs at run time, you can use the ldd tool:
$ ldd hello
linux-vdso.so.1 (0x00007ffdf8acb000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007ff897ef6000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007ff8980f7000)
From the output, we can see that the only library the program requires is libc.so.6. linux-vdso.so.1 is related to a mechanism called VDSO[3], which is used to accelerate certain system calls and is optional. ld-linux-x86-64.so.2 is the dynamic linker itself, which finds and loads all the dependent library files.
You can choose to copy all the library files listed by ldd into the image, but this is very difficult to maintain, especially when the program has many dependent libraries. For the hello world program, copying library files is completely fine, but for more complex programs (for example, those that use DNS) you will run into confusing issues: glibc (the GNU C library) implements DNS through a rather complex mechanism called NSS (Name Service Switch). It requires a configuration file, /etc/nsswitch.conf, and additional libraries, but ldd does not list these libraries because they are only loaded after the program starts running. If you want DNS resolution to work correctly, you must copy these additional library files as well (/lib64/libnss_*).
I personally do not recommend directly copying library files because it is very difficult to maintain, and there are many unknown risks.
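For illustration only, copying the libraries for the C hello world could be sketched as below. The library paths are assumptions for a Debian-based gcc image on x86-64 and will differ on other distributions and architectures (the ldd output above, for instance, comes from a system that keeps them under /usr/lib), so always take the real paths from ldd in your own build stage. The stage name build is a hypothetical label.

```dockerfile
# "build" is a hypothetical stage name.
FROM gcc AS build
COPY hello.c .
RUN gcc -o hello hello.c

FROM scratch
# Assumed locations of the dynamic linker and libc; verify them with `ldd hello` in the build stage.
COPY --from=build /lib64/ld-linux-x86-64.so.2 /lib64/
COPY --from=build /lib/x86_64-linux-gnu/libc.so.6 /lib/x86_64-linux-gnu/
COPY --from=build /hello .
CMD ["./hello"]
```

Every extra dependency adds another hard-coded COPY line, which is exactly the maintenance burden described above.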
3. Use busybox:glibc as the Base Image
There is one image that solves all these problems: busybox:glibc. It is only 5 MB in size and contains glibc and various debugging tools. If you want a suitable image for running dynamically linked programs, busybox:glibc is the best choice.
Note: If your program uses libraries other than the standard library, you still need to copy those library files into the image.
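A minimal sketch of this option for the C hello world follows. The stage name build is a hypothetical label, and the path /hello again assumes the gcc image's working directory is /.

```dockerfile
# "build" is a hypothetical stage name.
FROM gcc AS build
COPY hello.c .
# Default (dynamic) linking: the executable expects libc.so.6 at run time.
RUN gcc -o hello hello.c

# busybox:glibc ships glibc, so the dynamically linked executable can start.
FROM busybox:glibc
COPY --from=build /hello .
CMD ["./hello"]
```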
04
Conclusion
Finally, let’s compare the sizes of images built with different methods:
- Original build method: 1.14 GB
- Multi-stage build using ubuntu image: 64.2 MB
- Using alpine image with static glibc: 6.5 MB
- Using alpine image with dynamic libraries: 5.6 MB
- Using scratch image with static glibc: 940 kB
- Using scratch image with static musl libc: 94 kB
In the end, we reduced the image size by 99.99%.
However, I do not recommend using scratch as the base image because it is very troublesome to debug, but if you like it, I won’t stop you.
Footnotes
[1] Ephemeral Containers: https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/
[2] Dynamic-link library: https://en.wikipedia.org/wiki/Dynamic-link_library
[3] VDSO: https://en.wikipedia.org/wiki/VDSO