@cab105
Created January 31, 2017 00:24
RKT stage1_skim flavor to support Docker
========================================
NOTES
=====
This is the second version of an initial specification outlining how to modify
Docker and the stage1_fly handler so that the two can interoperate. As a result
of a few conversations, it was determined that we would do away with the chroot
environment altogether and make a few more modifications to the stage1 handler,
further minimizing our modifications to the Docker environment.
INTRO
=====
There has been a long-standing request to get docker running inside of rkt.
Given that both are containerization products that require privileged
operations, care needs to be taken to provide an adequate abstraction of the
host environment while also maintaining the look and feel of running natively.
TERMS
=====
* pod will reference a "pod" or container as hosted by rkt.
* container will reference Docker's view of a container.
ENVIRONMENT AND BOOTSTRAPPING
=============================
Per the rkt architecture, rkt bootstraps into a pluggable environment whose
sole job is to prepare a new runtime container for execution. There are
currently two flavors:
* stage1 - This one has several offshoots, ranging from native kvm to a
CoreOS-based image, and leverages systemd to handle the spawning and
management of the contained process. This environment provides the most
isolation and, by extension, the most restrictions.
* stage1_fly - This was created to support running the kubernetes kubelet, and
provides a basic chroot environment. Only the FS and process namespace are
isolated; other things, such as the host network stack and processes, are
exposed.
The ideal would be something resembling the stage1_fly handler minus the
chroot environment, with support for more than one application per pod. There
are three reasons for removing the chroot support:
1. Docker makes use of pivot_root when setting up the overlays for its
images/containers.
2. Most of docker's storage drivers become unusable (overlay will no longer
be supported because the docker pod itself runs inside an overlay, which
leaves the vfs storage driver as the only usable interface).
3. There is no easy way to make the bind mounts visible inside the jail once
it is set up without modifying both rkt and any application that would want
to interact with the host.
In addition, the stage1_fly would need to expose, at a minimum, the host's
kernel modules so that docker can properly detect and bring up its components
(think networking for the bridge driver, or overlayfs for its own management
of filesystems).
For this, we'll create a new stage1 flavor (stage1_skim) that will remove the
need for chroot. We will still create the overlay for the image[s], but paths
referencing the applications contained within the image will need to be
augmented to reflect the new home for things such as the executable's absolute
path, its environment, and also the current working directory. The latter can
be done with the appropriate modifications during the exec into stage2.
GOING TO STAGE2
===============
Looking at the stage1_fly flavor, once the environment is set up, it will
chroot into the new environment and then exec the target executable. Without
performing the chroot, we would need to modify the execution path, working
directory, and environment path before exec'ing into the target executable.
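As a rough sketch, the path rewriting might look like the following shell
fragment. The pod directory layout under `stage1/rootfs/opt/stage2`, the
variable names, and the `augment_paths` helper are all assumptions for
illustration, not confirmed rkt internals:

```shell
#!/bin/sh
# Sketch: compute the augmented exec path, working directory, and PATH for
# an app without chrooting. The pod layout below is an assumption modeled on
# rkt's usual stage2 directory structure.
augment_paths() {
    pod_dir="$1"; app="$2"; exec_rel="$3"; workdir_rel="$4"
    rootfs="$pod_dir/stage1/rootfs/opt/stage2/$app/rootfs"
    ABS_EXEC="$rootfs$exec_rel"             # absolute path of the executable
    ABS_CWD="$rootfs$workdir_rel"           # working directory inside the overlay
    ABS_PATH="$rootfs/usr/bin:$rootfs/bin"  # PATH rewritten against the overlay
}

augment_paths /var/lib/rkt/pods/run/1234 dockerd /usr/bin/dockerd /
# stage1_skim would then do roughly:
#   cd "$ABS_CWD" && PATH="$ABS_PATH" exec "$ABS_EXEC"
echo "$ABS_EXEC"
```

The point is that every path taken from the image manifest gets prefixed with
the overlay rootfs instead of being resolved inside a chroot.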
To support multiple applications running inside a single pod, we can make use of
systemd in a similar fashion to the current stage1 handler. In this case, each
app within the pod will have its own service file to account for the same tweaks
needed above in the single-app-per-pod model. Instead of invoking systemd-nspawn,
we will invoke systemd-run on a single process as a systemd scope that will
inherit the open file descriptor from rkt and will be solely responsible for
kicking off the services we just created. In addition, we can define a per-pod
slice to bind all of the services into the same cgroup. The ultimate goal is to
provide additional resilience for processes running inside the pod, such that
failures can be caught and systemd can pick up logging. Lastly, the service
files we create for the pod will reside in /run/systemd/system, as they are
transient in nature and not designed to survive reboot.
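A minimal sketch of the per-app unit generation follows. The unit-naming
scheme keyed on the pod UUID and the `write_app_unit` helper are assumptions;
a temporary directory stands in for /run/systemd/system so the sketch can run
unprivileged:

```shell
#!/bin/sh
# Sketch: write one transient service file per app, all bound to a per-pod
# slice so they share a cgroup. Real units would go to /run/systemd/system.
UNIT_DIR=$(mktemp -d)
POD_UUID="1234-abcd"

write_app_unit() {
    app="$1"; exec_path="$2"
    cat > "$UNIT_DIR/rkt-$POD_UUID-$app.service" <<EOF
[Unit]
Description=rkt app $app in pod $POD_UUID
[Service]
ExecStart=$exec_path
Slice=rkt-$POD_UUID.slice
EOF
}

write_app_unit containerd /usr/bin/containerd
write_app_unit dockerd /usr/bin/dockerd

# stage1_skim would then start the pod as a scope in the same slice, roughly:
#   systemd-run --scope --slice="rkt-$POD_UUID.slice" -- \
#       systemctl start "rkt-$POD_UUID-dockerd.service"
ls "$UNIT_DIR"
```

The `Slice=` directive is what ties every app in the pod into the same
per-pod cgroup.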
Another issue with bringing up multiple apps in a single pod is adding
dependencies between apps. In the case of our Docker implementation, this means
ensuring that containerd is started before dockerd, so that containers running
inside docker can survive when the docker daemon goes away. Another case would
be a webapp with a dependency on something like redis, where we must ensure
that redis is up before our webapp comes into play. The default behavior we're
shooting for with the initial release is for each successive image to depend on
the previous one.
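The successive-image default can be expressed with systemd's standard
After=/Requires= ordering directives. A sketch, with hypothetical unit names
and a temporary directory standing in for /run/systemd/system:

```shell
#!/bin/sh
# Sketch: chain each app's unit to the previous image's unit, so that by
# default every successive image depends on the one before it.
UNIT_DIR=$(mktemp -d)   # stand-in for /run/systemd/system
prev=""
for app in containerd dockerd; do
    unit="rkt-pod-$app.service"
    {
        echo "[Unit]"
        echo "Description=rkt app $app"
        if [ -n "$prev" ]; then
            echo "After=$prev"     # start only once the previous app is up
            echo "Requires=$prev"  # stop propagates if the previous app stops
        fi
        echo "[Service]"
        echo "ExecStart=/usr/bin/$app"
    } > "$UNIT_DIR/$unit"
    prev="$unit"
done
cat "$UNIT_DIR/rkt-pod-dockerd.service"
```

Here dockerd's unit orders itself after containerd's; the first image in the
pod gets no dependency lines at all.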
Stopping and terminating the pod will consist of stopping the scope that we
spawned as part of the run invocation. Because of the successive service
dependencies created earlier, terminating the scope will result in all other
dependent processes terminating.
For the garbage collector, stage1_skim will need to remove all systemd
service/scope files that were created during the run phase and then execute
`systemctl daemon-reload` so that systemd has an updated view of the world.
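The garbage-collection step might then reduce to something like the following
sketch; unit names mirror the hypothetical scheme above, and a temporary
directory stands in for /run/systemd/system:

```shell
#!/bin/sh
# Sketch of the stage1_skim GC step: remove the pod's transient units and
# (in a real run) ask systemd to reload its state of the world.
UNIT_DIR=$(mktemp -d)            # stand-in for /run/systemd/system
POD_UUID="1234-abcd"
touch "$UNIT_DIR/rkt-$POD_UUID-dockerd.service" \
      "$UNIT_DIR/rkt-$POD_UUID-containerd.service"

gc_pod() {
    rm -f "$UNIT_DIR"/rkt-"$1"-*.service "$UNIT_DIR"/rkt-"$1".slice
    # systemctl daemon-reload    # refresh systemd's view (not run in this sketch)
}

gc_pod "$POD_UUID"
ls "$UNIT_DIR" | wc -l
```

Only the units belonging to the collected pod are removed, so other pods'
transient units in the same directory are untouched.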
DOCKER/CONTAINER OS MODIFICATIONS
=================================
For the ContainerOS, the docker client itself would become a wrapper script that
would invoke the docker binary running inside the appropriate rkt pod. The
current docker service would be modified to load/run the appropriate docker
image. The end user can place their docker configuration in /etc/docker just
like before, and the current dockerd wrapper scripts will work unmodified.
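A hypothetical sketch of the wrapper's core: assemble the command that
re-enters the pod and invokes the real docker binary. `rkt enter` is rkt's
subcommand for running a command inside a running pod; the pod-UUID lookup and
the helper name are assumptions:

```shell
#!/bin/sh
# Sketch of the docker client wrapper: forward the user's CLI invocation to
# the docker binary inside the rkt pod. Pod UUID discovery is elided.
docker_wrapper_cmd() {
    pod_uuid="$1"; shift
    echo "rkt enter $pod_uuid /usr/bin/docker $*"
}

# The real wrapper would end with roughly:
#   exec rkt enter "$POD_UUID" /usr/bin/docker "$@"
CMD=$(docker_wrapper_cmd 1234-abcd ps -a)
echo "$CMD"
```

From the end user's perspective, `docker ps -a` behaves as before; the wrapper
just redirects the invocation into the pod.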
On the back end, we would need to deliver the ACI image for docker and its
associated binaries, but that is outside of scope for stage1_skim.