How does Docker actually work? It’s a simple question that has a surprisingly complex answer. You’ve probably heard the terms “daemon” and “runtime” thrown around, but never really understood what they meant and how they fit together. If you’re like me and went wading through the source to uncover the truth, you’re not alone if you drowned in the sea of code. Let’s face it, if Docker source code was a meal, you’d be chowing down on a big bowl of spaghetti.
Like a fork that guides pasta to your mouth, this post will group and guide the digital strands of Docker into your hungry mind.
In order to better understand the present, we first need to look at the past. In 2013 Solomon Hykes of dotCloud revealed Docker to the public at the PyCon talk The future of Linux Containers. Let’s revert his git repository to January of 2013, to a simpler time in Docker’s development.
How did Docker work in 2013?
Docker is composed of two main components: a command-line application for users and a daemon which manages containers. The daemon relies on two subcomponents to perform its job: storage on the host file system for image and container data, and the LXC interface to abstract away the raw kernel calls needed to construct a Linux container.
The Docker command-line application is the human interface to managing all images and containers known to your running copy of Docker. It’s relatively simple since all of the management is done by the daemon. The app starts at the main function:
Immediately, a TCP connection is established to the address stored in the environment variable DOCKER, which is the address of the Docker daemon. The user-supplied arguments are sent, and the app then waits to print the results from a successful reply.
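A minimal sketch of that client flow in Go follows. The line-based framing and the helper names (sendCommand, fakeDaemon, roundTrip) are assumptions made for illustration and are not taken from the 2013 source; an in-memory pipe stands in for the TCP connection so the example runs without a daemon.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"os"
	"strings"
)

// sendCommand writes the user's arguments to the daemon connection as one
// space-separated line, then reads the reply until the daemon closes its
// side. The framing is an assumption made for this sketch.
func sendCommand(conn io.ReadWriter, args []string) (string, error) {
	if _, err := fmt.Fprintln(conn, strings.Join(args, " ")); err != nil {
		return "", err
	}
	reply, err := io.ReadAll(conn)
	return string(reply), err
}

// fakeDaemon is a hypothetical stand-in for the daemon: it reads one request
// line, echoes it back with a prefix, and closes the connection.
func fakeDaemon(conn net.Conn) {
	defer conn.Close()
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return
	}
	fmt.Fprintf(conn, "daemon received: %s", line)
}

// roundTrip wires the client to the fake daemon over an in-memory pipe.
// Against a real daemon the connection would instead come from something
// like net.Dial("tcp", os.Getenv("DOCKER")).
func roundTrip(args []string) (string, error) {
	client, server := net.Pipe()
	defer client.Close()
	go fakeDaemon(server)
	return sendCommand(client, args)
}

func main() {
	reply, err := roundTrip(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Print(reply)
}
```

The point of the sketch is how thin the client is: it forwards arguments and prints whatever comes back, leaving all container management to the daemon.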
In the same repo lives the code for the Docker daemon, known as dockerd. Its job is to run in the background, processing user requests and cleaning up containers. Upon start-up, dockerd listens for incoming HTTP connections on port 8080 and TCP connections on port 4242.
One such function is CmdRun, which corresponds to the docker run command.
The user will normally provide an image and command for dockerd to run. When they are omitted, the image base and command /bin/bash are used.
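The defaulting behaviour can be sketched in a few lines of Go. The function name parseRunArgs and the positional argument layout are assumptions for illustration; only the two default values come from the description above.

```go
package main

import "fmt"

// Defaults described in the post for the 2013 daemon.
const (
	defaultImage   = "base"
	defaultCommand = "/bin/bash"
)

// parseRunArgs splits "docker run" arguments into an image and a command,
// falling back to the defaults when either is omitted. Treating the first
// argument as the image is an assumption of this sketch.
func parseRunArgs(args []string) (string, []string) {
	image, cmd := defaultImage, []string{defaultCommand}
	if len(args) > 0 {
		image = args[0]
	}
	if len(args) > 1 {
		cmd = args[1:]
	}
	return image, cmd
}

func main() {
	for _, args := range [][]string{{}, {"ubuntu"}, {"ubuntu", "ls", "-l"}} {
		image, cmd := parseRunArgs(args)
		fmt.Printf("%q -> image=%s cmd=%q\n", args, image, cmd)
	}
}
```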
Find the image
Then we find the specified image by mapping the name (or id) to a location on the file system (assuming an image already exists due to a previous
Create the container
Then we create the container. dockerd creates a structure to hold all the metadata related to this container, and stores it in a list for easy access.
When creating the struct, a unique directory is made for the container at the path /var/lib/docker/containers/<ID>. Inside this path are two directories: /rootfs, the read-only root file system (the layers from the image that have been union mounted), and /rw, a separate read-write layer where the container can create temporary files.
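As a rough sketch, the directory layout could be created like this in Go. The helper names are hypothetical, and the demo uses a throwaway temporary root instead of /var/lib/docker so it can run unprivileged.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// containerDirs creates <root>/containers/<id> with its rootfs and rw
// subdirectories and returns both paths.
func containerDirs(root, id string) (string, string, error) {
	base := filepath.Join(root, "containers", id)
	rootfs := filepath.Join(base, "rootfs")
	rw := filepath.Join(base, "rw")
	for _, dir := range []string{rootfs, rw} {
		if err := os.MkdirAll(dir, 0700); err != nil {
			return "", "", err
		}
	}
	return rootfs, rw, nil
}

// layoutInTempDir exercises containerDirs under a temporary root and returns
// the two paths relative to that root. The temporary tree is removed again
// before returning.
func layoutInTempDir(id string) (string, string, error) {
	root, err := os.MkdirTemp("", "docker-sketch-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)

	rootfs, rw, err := containerDirs(root, id)
	if err != nil {
		return "", "", err
	}
	relRootfs, err := filepath.Rel(root, rootfs)
	if err != nil {
		return "", "", err
	}
	relRW, err := filepath.Rel(root, rw)
	if err != nil {
		return "", "", err
	}
	return filepath.ToSlash(relRootfs), filepath.ToSlash(relRW), nil
}

func main() {
	rootfs, rw, err := layoutInTempDir("abc123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rootfs)
	fmt.Println(rw)
}
```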
Run the container
Our container is finally created! But it’s not yet running. For that, we need to start it.
The first step is to make sure the container’s file system is mounted.
Using the AUFS union mount file system, the layers of an image are mounted read-only on top of each other to present one coherent view to the container. The read-write path is mounted as the topmost layer to provide the container with temporary storage.
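To make the layering concrete, here is a small Go sketch that builds the branch list for such a mount. It only assembles the option string, because the mount itself needs root and an AUFS-enabled kernel. The function name and example paths are assumptions, and the option format follows the AUFS "br:" branch syntax, not necessarily the exact string Docker generated.

```go
package main

import (
	"fmt"
	"strings"
)

// aufsOptions builds the branch list for an AUFS mount: the read-write
// directory first (the topmost branch), followed by the image layers as
// read-only branches in the order given.
func aufsOptions(rw string, layers []string) string {
	branches := []string{rw + "=rw"}
	for _, layer := range layers {
		branches = append(branches, layer+"=ro")
	}
	return "br:" + strings.Join(branches, ":")
}

func main() {
	opts := aufsOptions("/c/rw", []string{"/layers/2", "/layers/1"})
	// The real mount would look roughly like:
	//   mount -t aufs -o <opts> none /c/rootfs
	fmt.Println(opts)
}
```

Branches listed earlier take precedence, which is why the read-write directory comes first: writes land there, while reads fall through to the image layers underneath.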
Then, to start the container, dockerd runs another program lxc-start with the LXC template we just generated.
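A hedged sketch of what launching that external program might look like in Go is shown below. The command is only constructed, not run. The helper name and the config file path are hypothetical; the -n (name) and -f (config file) flags are standard lxc-start options.

```go
package main

import (
	"fmt"
	"os/exec"
)

// lxcStartCommand prepares (but does not run) an lxc-start invocation for a
// container, pointing it at a generated config file. Everything after "--"
// is the command to execute inside the container.
func lxcStartCommand(id, configPath string, command []string) *exec.Cmd {
	args := []string{"-n", id, "-f", configPath, "--"}
	args = append(args, command...)
	return exec.Command("lxc-start", args...)
}

func main() {
	// The config path is a made-up example for this sketch.
	cmd := lxcStartCommand("abc123", "/tmp/abc123/config.lxc", []string{"/bin/bash"})
	fmt.Println(cmd.Args)
}
```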
LXC (Linux Containers) is an abstraction layer which provides userspace applications with a simple API to create and manage containers. The truth is that containers are not a real thing: there is no object called a container inside the Linux kernel. Containers are a collection of kernel objects that work together to provide process isolation. Therefore, the simple lxc-start command actually translates into the setup and application of:
- Kernel namespaces (ipc, uts, mount, pid, network and user)
- AppArmor and SELinux profiles
- Seccomp policies
- Chroots (using pivot_root)
- Kernel capabilities
- and cgroups (control groups)
Finally, dockerd monitors the container until completion, cleaning up unnecessary data once the container has finished.
In summary, launching a container using Docker 2013 involves the following steps:
It’s been 6 years since the introduction of Docker, and the containerisation paradigm has exploded in popularity. Both small and large enterprises have adopted Docker, especially in tandem with the orchestration system Kubernetes.
Three contributors turned into over 1800 through the power of Open Source, each person bringing new ideas to the project. Eager to promote extensibility, the Open Container Initiative (OCI) was formed in 2015 to define an open standard around container formats and runtimes. The image spec outlines the structure of a container image, and the runtime spec describes the interface and behaviour that implementations should adhere to in order to run containers on their platform. As a result, the community developed a wide range of projects for container management, from native containers to ones isolated by a virtual machine. With support from Microsoft, the industry now has OCI-compliant native Windows containers as well.
All of these changes have been reflected in the moby repo. With this historical context, we can begin deconstructing the components of Docker 2019.
How does Docker work in 2019?
After 6 years and 36,207 commits the moby repo has evolved into a large collaborative project, influencing and relying upon many components.
In a very simplified view, Moby 2019 has two new main components: containerd, which supervises containers during their lifetime, and OCI-compliant runtimes (e.g. runc), which are the lowest user-level abstraction for creating containers (replacing LXC).
The control flow of the command-line application for the most part hasn’t changed. Today, HTTP(S) with a JSON body is the standard for communicating with dockerd.
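The Go sketch below shows the shape of such a request. The /containers/create path and the Image, Cmd and Id field names follow the Docker Engine API, but the structs are heavily trimmed. The base URL, the fake handler and the helper names are assumptions made so the example runs without a daemon.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// createRequest is a small subset of the JSON body for container creation.
type createRequest struct {
	Image string   `json:"Image"`
	Cmd   []string `json:"Cmd"`
}

// createResponse is a small subset of the daemon's reply.
type createResponse struct {
	ID string `json:"Id"`
}

// newCreateRequest builds the HTTP request a client would send to dockerd.
func newCreateRequest(baseURL, image string, cmd []string) (*http.Request, error) {
	body, err := json.Marshal(createRequest{Image: image, Cmd: cmd})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/containers/create", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// fakeDaemon is a hypothetical stand-in for dockerd: it decodes the body and
// answers with an ID derived from the image name.
func fakeDaemon(w http.ResponseWriter, r *http.Request) {
	var req createRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusCreated)
	json.NewEncoder(w).Encode(createResponse{ID: "fake-" + req.Image})
}

// createAgainstFake hands the request directly to the fake handler, so no
// network or running daemon is needed. The base URL is an assumed example.
func createAgainstFake(image string, cmd []string) (string, error) {
	req, err := newCreateRequest("http://localhost:2375", image, cmd)
	if err != nil {
		return "", err
	}
	rec := httptest.NewRecorder()
	fakeDaemon(rec, req)
	if rec.Code != http.StatusCreated {
		return "", fmt.Errorf("unexpected status %d", rec.Code)
	}
	var out createResponse
	if err := json.NewDecoder(rec.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ID, nil
}

func main() {
	id, err := createAgainstFake("base", []string{"/bin/bash"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("created container", id)
}
```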
The engine is still responsible for a variety of tasks, like interacting with image registries and setting up directories on the file system for use by containers. The default driver will union mount an image to a directory inside of
However, the engine is no longer responsible for managing the life cycle of running containers. As the project grew, the decision was made to split off container supervision into a separate project called containerd. This way, the Docker daemon can continue to innovate without the risk of breaking the runtime implementation.
Although docker/engine is forked from moby/moby, allowing for possible code divergence, they share the same commit tree to date.
First we create an object to store container metadata.
Then, like before, we create a root directory with both the image data and read-write layer inside for use by the container. Today the difference is that union mount file system support has grown to include OverlayFS and more. To facilitate this, a driver system abstracts away the implementation.
Finally, the container object is added to the daemon’s map of containers, for future use.
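A daemon-side registry of this kind could be sketched as a mutex-guarded map. The type names, fields and methods below are illustrative assumptions, not the daemon's actual data structures.

```go
package main

import (
	"fmt"
	"sync"
)

// Container holds the metadata the daemon keeps for one container.
// The fields are illustrative.
type Container struct {
	ID    string
	Image string
	Args  []string
}

// Store is a concurrency-safe map from container ID to container.
type Store struct {
	mu         sync.RWMutex
	containers map[string]*Container
}

// NewStore returns an empty store ready for use.
func NewStore() *Store {
	return &Store{containers: make(map[string]*Container)}
}

// Add registers a container, refusing duplicate IDs.
func (s *Store) Add(c *Container) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.containers[c.ID]; exists {
		return fmt.Errorf("container %s already registered", c.ID)
	}
	s.containers[c.ID] = c
	return nil
}

// Get returns the container with the given ID, or nil if it is unknown.
func (s *Store) Get(id string) *Container {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.containers[id]
}

func main() {
	store := NewStore()
	if err := store.Add(&Container{ID: "abc123", Image: "base", Args: []string{"/bin/bash"}}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(store.Get("abc123").Image)
}
```

The lock matters because API requests arrive concurrently, so lookups and registrations can overlap.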
This is where containerd steps in. First we request that a container be created according to the OCI specification. Then we start running a process inside of the container. All subsequent supervision is handled by containerd.
The terminology around containerd is confusing. It’s described as a runtime, but it doesn’t implement the OCI runtime spec, so it’s not a runtime in the same way that runc is. containerd is a daemon which oversees the life cycle of containers, using OCI-compliant runtimes in order to manage them. As Michael Crosby describes it, containerd is a container supervisor.
It’s designed to be the universal base layer for supervising containers, focusing on speed and simplicity.
And it is simple: all that’s required to create a container is its specification and a bundle which encodes where the root file system is.
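To give a feel for what a specification contains, the Go sketch below generates a heavily trimmed config.json. The field names (ociVersion, process.args, process.cwd, root.path, root.readonly) come from the OCI runtime spec, but a real config carries many more fields, and the struct and function names here are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Spec is a heavily trimmed view of the OCI runtime spec's config.json.
type Spec struct {
	Version string  `json:"ociVersion"`
	Process Process `json:"process"`
	Root    Root    `json:"root"`
}

// Process describes the command to run inside the container.
type Process struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

// Root points at the container's root file system inside the bundle.
type Root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly"`
}

// newSpec returns a minimal spec for running args with rootfs as the root
// file system. The working directory "/" is a choice made for this sketch.
func newSpec(rootfs string, args []string) Spec {
	return Spec{
		Version: "1.0.0",
		Process: Process{Args: args, Cwd: "/"},
		Root:    Root{Path: rootfs},
	}
}

// encodeSpec renders the spec as indented JSON, as it would appear in a
// bundle's config.json.
func encodeSpec(s Spec) ([]byte, error) {
	return json.MarshalIndent(s, "", "  ")
}

func main() {
	data, err := encodeSpec(newSpec("rootfs", []string{"/bin/sh"}))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(data))
}
```

A bundle is then just a directory holding this config.json next to the root file system that root.path refers to.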
Starting a container involves the creation and starting of a new object called a Task, which represents a process inside of a container.
Task creation is handled by the underlying container runtime. containerd multiplexes OCI runtimes, so we need to look up which runtime to use to create the task. The first and default runtime is runc. The Create call for this runtime ends up running the external process runc, but it does so indirectly using a shim.
If containerd were to crash, information about running containers would be lost. To protect against this, containerd creates a management process for each container called a shim. The shim will call an OCI runtime to create and start a container, and then perform its duty of monitoring the container to capture the exit code and manage standard IO.
In the event where containerd does crash, it can recover by communicating with the shims, and reading state from
runc is a command-line tool for spawning and running containers according to the OCI specification. Performing a similar job to LXC, it abstracts away the Linux kernel calls needed to create a container.
runc is just one implementation of the OCI runtime spec. Many more exist that can be used to create containers on a variety of systems.
Finally, to start the container, runc sends a signal to the paused process to begin executing.
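The two-phase create and start flow maps onto two runc subcommands, which the following Go sketch constructs without running them. The helper name and example paths are assumptions; in Docker the shim issues these calls, not user code.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runcCommands returns the two runc invocations for the create/start split:
// "create" sets the container up and leaves its process paused, and "start"
// tells that process to begin executing.
func runcCommands(id, bundle string) (*exec.Cmd, *exec.Cmd) {
	create := exec.Command("runc", "create", "--bundle", bundle, id)
	start := exec.Command("runc", "start", id)
	return create, start
}

func main() {
	// The ID and bundle path are made-up examples.
	create, start := runcCommands("abc123", "/tmp/bundles/abc123")
	fmt.Println(create.Args)
	fmt.Println(start.Args)
}
```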
The Visual Summary
In summary, launching a container using Docker 2019 involves the following steps:
Using containerd’s architecture diagram as a reference, we can represent the entire process visually.
On the surface, Docker and its companion projects appear chaotic, but underneath there is rigid structure and modularisation. That said, uncovering all this information was not easy: it was spread across code, blog posts, conference talks, documentation and meeting notes. Having clear “self documenting” code is a great goal to aim for, but when it comes to large systems, I don’t think it’s enough. Sometimes you just need to write down in plain language what a system looks like, and what each component is responsible for.
A big thank you to all the contributors of these projects, especially those who wrote documentation which explained these systems.
I hope this has been helpful in explaining exactly how Docker runs containers. I know that I’ll be coming back to use this as a reference many times in the future.