GOAL: This project brings together the knowledge you have gained about operating systems: you will create a small container runtime that can run a containerized process on Linux.
Throughout the course, we have been using Docker and Dev Container to run our code. But what exactly is a container?
A container is a lightweight and isolated execution environment that encapsulates an application and its dependencies. It provides a consistent and reproducible environment across different systems, allowing applications to run reliably regardless of the underlying infrastructure.
Containers are created from container images, which are self-contained packages that include the application code, runtime, system tools, libraries, and configuration files required for the application to run.
A container runtime is responsible for managing the lifecycle of containers. It provides an interface between the container and the host operating system, orchestrating the necessary resources and ensuring isolation and security within the container environment.
One of the most important features of container runtimes is to provide isolation. That is, whatever happens inside a container does not affect the host system or other containers.
Modern container runtimes provide isolation for many OS abstractions, such as CPU, memory, network, etc. In this project, we focus on two essential abstractions: processes and files.
Our container runtime should provide isolation for processes. For example, the processes running on the host should not be visible to the processes inside the container.
Let's check how Docker isolates processes. We use the ps command to see the information about running processes.
If you execute ps -A, it lists all processes that are running on the system.
PID TTY TIME CMD
1 ? 06:30:08 systemd
2 ? 00:00:44 kthreadd
3 ? 00:00:00 rcu_gp
4 ? 00:00:00 rcu_par_gp
6 ? 00:00:00 kworker/0:0H-kblockd
...
This is the result of ps -A on my server. We can see many processes are running.
Let us create a new Docker container and execute the same command inside the container:
$ docker run --rm -it alpine
$ ps -A
PID USER TIME COMMAND
1 root 0:00 /bin/sh
7 root 0:00 ps -A
The first command creates a new Docker container using the alpine image. It opens a shell inside the container, so we run ps -A. Even though many processes are running on the host system, they are not visible inside the container.
In Linux, the isolation of processes is achieved using the PID namespace. A process ID (PID) is a unique number that identifies a running process. Every process belongs to a PID namespace and it can only see processes in the same PID namespace.
Our container runtime also supports isolating the filesystem. Each container has its own filesystem and cannot affect the host or other containers.
Let's check how Docker provides filesystem isolation. We use the same alpine image.
In one terminal, let us create a container and create a file foo in its root directory.
$ docker run --rm -it alpine
$ echo foo > foo
$ ls /
bin etc home media opt root sbin sys usr
dev foo lib mnt proc run srv tmp var
Open another terminal, create a new container, and run ls.
$ docker run --rm -it alpine
$ ls /
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
Even though both containers are created from the same image, foo is not visible in the new container.
One way to achieve this is to use the overlay filesystem. The overlay filesystem is a type of union filesystem that allows multiple directories to be mounted together, presenting a single unified view. It provides a way to overlay a read-write filesystem on top of a read-only filesystem, creating a combined view that appears as a single coherent filesystem.
When a file or directory is accessed, the Overlay filesystem looks for it in the topmost layer first. If the file is found, it is returned. If not, the filesystem searches the lower layers in a specific order until it locates the file. This allows modifications to be made to the topmost layer, while the lower layers remain unchanged. Changes made to the topmost layer are stored separately, without modifying the underlying read-only layers.
Let's see how the overlay filesystem works with a simple example.
To use the overlay filesystem, we need three directories: lower, upper, and work. lower will be a read-only directory that provides an "image", upper will store all changes on top of lower, and work is used by the overlay filesystem as a workspace.
First, we use the following command to create the directories. merged is the directory at which we will mount the overlay filesystem.
mkdir lower upper work merged

Then, we create a read-only file in the lower directory.
echo "this is in lower" > lower/foo

Now, we are ready to create a new overlay filesystem. Use the following command to create an overlay filesystem and mount it at merged.
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged

We have now mounted the overlay filesystem at merged. It provides a unified view of lower and upper.
$ ls merged
foo
$ cat merged/foo
this is in lower
Let's make a change to the file through the overlay filesystem.
echo "new foo" > merged/foo

Because lower/foo is read-only, it cannot be modified in place. Instead, the overlay filesystem creates upper/foo with the updated content. Because upper/foo now exists, merged/foo refers to the updated content at upper/foo.
$ cat merged/foo
new foo
$ cat upper/foo
new foo
$ cat lower/foo
this is in lower
To provide an isolated filesystem for each container, we create an overlay filesystem for each one. The lower directory is the container image, which stores all files and directories needed for that container. Each container gets a unique upper directory, so changes made by one container do not affect other containers.
The goal of this project is to complete container.c, which creates a container from an image and executes a command inside the container.
In container.c, an image is a directory under ./images that stores all the files and directories required for the system. You can think of an image directory as a snapshot of the system root directory.
The easiest way to create an image is to use the docker export command. Let us create an image directory from the alpine docker image.
First, we create a Docker container using docker run --rm -it alpine sh. This opens a shell inside the newly created container.
Second, we need to get the ID of the container. Open a new terminal and run docker ps to see the list of running Docker containers.
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f1cf18783484 alpine "sh" 2 minutes ago Up 2 minutes boring_sanderson
Find the container and copy the container ID (f1cf18783484 in this case).
Then, we run docker export {container ID} > alpine.tar to create a tarball of the image.
docker export f1cf18783484 > alpine.tar

Finally, we extract the files in the tarball to ./images/{image name} using the following commands.
$ mkdir images/alpine
$ tar -xf alpine.tar -C images/alpine
$ ls images/alpine
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
Now, we have the image directory called alpine. We use this identifier to specify the image.
container takes three or more arguments.
$ ./container
Usage: ./container [ID] [IMAGE] [CMD]...
- The first argument (ID) specifies the unique ID of the container.
- Docker assigns this at random, but we require the user to provide one.
- The ID can be at most 16 characters (CONTAINER_ID_MAX).
- The second argument (IMAGE) specifies the image to create a container from.
- ./images/{IMAGE} must exist and store all files required for this container.
- The rest of the arguments specify the commands to run inside the container.
- There can be more than one, since the user might pass options to the command.
For example, ./container my-container alpine echo "hello world" will
- Create a container with ID my-container
- Use the image located at ./images/alpine
- Execute echo "hello world" inside the container
$ sudo ./container my-container alpine echo "hello world"
hello world
You will need to complete two functions in container.c: main and container_exec.
main is the entry point of the command-line interface. It needs to parse the command-line arguments (argv) and create a child process by calling clone with appropriate parameters.
int clone_flags = SIGCHLD | CLONE_NEWNS | CLONE_NEWPID;
int pid = clone(container_exec, &child_stack, clone_flags, &container);

clone works similarly to the fork system call and creates a child process. The child process executes the container_exec function with container as its argument, just like how we passed arguments to a new thread with pthread_create. Because the flags include CLONE_NEWNS and CLONE_NEWPID, the child process gets separate mount and PID namespaces, which provide the isolation.
Add fields to the container struct and fill in their values in main so that container_exec has enough information to create a container from an image and execute the command.
main executes container_exec in a child process with separate PID and mount namespaces. container_exec needs to
- create and mount an overlay filesystem
- call change_root
- use execvp to run the command
container_exec needs to create an overlay filesystem. The merged directory will have everything inside the image directory plus the changes made inside the container, and will be used as a root of the filesystem inside the container.
To create an overlay filesystem, use the mount function.
int mount(const char *source, const char *target,
const char *filesystemtype, unsigned long mountflags,
const void *data);

- source is often a path referring to a device. Because we are not mounting a device, use the dummy string "overlay".
- target specifies the directory at which to create the mount point. Use the merged directory path: /tmp/container/{id}/merged.
- filesystemtype specifies the type of the filesystem. Use "overlay".
- mountflags provides options. Use MS_RELATIME.
- data provides options specific to the filesystem. The overlay filesystem takes its three directories in the format lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}. Construct a string in this format and pass a pointer to it.
lowerdir should be the image directory. In principle, upperdir and workdir can be any directory, but in order for the overlay filesystem to work inside the Dev Container, those directories must be inside /tmp/container. main creates this directory.
Use /tmp/container/{id}/upper and /tmp/container/{id}/work for upperdir and workdir, respectively. For mount to succeed, these directories must already exist; use mkdir to create each one if it does not exist.
For example, if the current directory is /workspaces/project5-container, the container ID is my-container, and the image name is alpine, you need to call
mount(
"overlay",
"/tmp/container/my-container/merged",
"overlay",
MS_RELATIME,
"lowerdir=/workspaces/project5-container/images/alpine,upperdir=/tmp/container/my-container/upper,workdir=/tmp/container/my-container/work"
);

Now, the overlay filesystem is mounted at /tmp/container/{id}/merged. We want the child process to treat this directory as its root directory.
pivot_root is the system call that achieves this. Because calling pivot_root correctly is complex and tedious, we provide a helper function that does it for you.
void change_root(const char* path)

Provide the path to the merged directory to the change_root function. It will call pivot_root to change the root directory to the merged directory and ensure the container cannot access directories outside it.
change_root also does a couple of more things to ensure the container works properly, such as setting the PATH environment variable.
At this point, the child process has its own PID namespace and the overlay filesystem as its root directory. The last step is to execute the specified command.
Use execvp(3) so it can execute commands without specifying the full path to the executable.
int execvp(const char *file, char *const argv[]);

file specifies the name of the command and argv specifies the entire argument list. argv must be NULL-terminated.
For example, if the command is echo "hello world", you should call
char *argument_list[] = {"echo", "hello world", NULL};
execvp(argument_list[0], argument_list);

You can test that your container runtime works properly using the alpine image described earlier. In particular, we want to make sure that processes and the filesystem are isolated.
To check the process isolation, we can use the ps command.
$ sudo ./container my-container alpine sh
--- inside container ---
$ ps -A
PID USER TIME COMMAND
1 root 0:00 sh
2 root 0:00 ps -A
ps -A must not print the processes running on the host. The command used to create the container (sh in the above example) should have PID 1.
To check the filesystem isolation, you can use cd inside the container to try to escape the filesystem. If change_root is called properly, you should not be able to get out of the overlay filesystem.
$ sudo ./container my-container alpine sh
# inside container
$ cd /../../
$ ls
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
Any changes made inside the container should be visible at the upper directory.
$ sudo ./container my-container alpine sh
--- inside container ---
$ echo hello from container > hello.txt
$ exit
--- returned to host ---
$ sudo cat /tmp/container/my-container/upper/hello.txt
hello from container
While our container runtime is minimal, it is capable of running a variety of images. Once you are done testing with the alpine image, try using your container runtime to execute your favorite image.
For example, here is how to execute JavaScript (Node.js) using the node:18-alpine image.
# follow similar steps to create an image directory
$ docker pull node:18-alpine
$ docker run --rm -it node:18-alpine sh
# in a different terminal
$ docker ps # copy the container ID
$ docker export {container-id} > node.tar
$ mkdir images/node
$ tar -xf node.tar -C images/node
$ sudo ./container node-container node node
Welcome to Node.js v18.16.0.
Type ".help" for more information.
>
Error: Could not open history file.
REPL session history will not be persisted.
> console.log("hello, world!")
hello, world!
undefined
>

Try running your favorite programming language with your container runtime!
- make must create the container executable.
- All source files must be formatted using clang-format. Run make format to format .c and .h files.
- The filesystems can often enter wrong states. If filesystems behave strangely, try running the "Dev Containers: Rebuild Container" command in VS Code. This will recreate the Dev Container and will likely resolve the issue. You can also try restarting Docker Desktop.