Created
January 31, 2017 00:24
RKT stage1_skim flavor to support Docker
| RKT stage1_skim flavor to support Docker | |
| ======================================== | |
| NOTES | |
| ===== | |
| This is the second version of an initial specification I provided outlining how | |
| to modify docker and the stage1_fly handler such that both can interoperate with | |
| each other. As a result of a few conversations, it was determined that we would | |
| want to do away with the chroot environment altogether, and provide a few more | |
| modifications to the stage1 handler such that we can minimize our modifications | |
| of the Docker environment even more. | |
| INTRO | |
| ===== | |
| There has been a long-standing request for getting docker to run inside of | |
| rkt. Given that both are containerization products that require privileged | |
| operations, care needs to be taken to provide an adequate abstraction of the | |
| host environment while also maintaining the look and feel of running natively. | |
| TERMS | |
| ===== | |
| * pod will reference a "pod" or container as hosted by rkt. | |
| * container will reference Docker's view of a container. | |
| ENVIRONMENT AND BOOTSTRAPPING | |
| ============================= | |
| Based off the rkt architecture, rkt will bootstrap into a pluggable environment | |
| that has the sole job of prepping a new runtime container for execution. There | |
| are currently two flavors: | |
| * stage1 - This one has several offshoots ranging from using native kvm to a | |
| coreos based image, and leverages systemd to handle the spawning and | |
| management of the contained process. This environment provides the most | |
| isolation, and by extension, the most restrictions. | |
| * stage1_fly - This was created to support running the kubernetes kubelet, and | |
| provides a basic chroot environment. In this way, the filesystem view is | |
| the only thing that is isolated; other things such as the host network | |
| stack and processes are exposed. | |
| What would be considered ideal would be something resembling the stage1_fly | |
| handler minus the chroot environment, and with support for more than one | |
| application per pod. There are three reasons for removing the chroot support: | |
| 1. Docker makes use of pivot_root when it comes to setting up the overlays for | |
| its images/containers. | |
| 2. Most of docker's storage drivers become unusable (overlay will no longer | |
| be supported due to the docker pod running inside of an overlay, which leaves | |
| the vfs storage driver as the only usable interface). | |
| 3. There is no easy way to modify the bind mounts such that they're visible | |
| inside the jail once it is set up, without modifying both rkt and any | |
| application that would want to interact with the host. | |
| In addition, the stage1_fly would need to expose at a minimum the host's kernel | |
| modules in order for docker to ensure proper detection and bring-up of its | |
| components (think networking for the bridge driver or overlayfs for its own | |
| management of filesystems). | |
| For this, we'll create a new stage1 flavor (stage1_skim) that will remove the | |
| need for chroot. We will still create the overlay for the image(s), but paths | |
| referencing the applications contained within the image will need to be | |
| augmented to reflect the new home for things such as the executable's absolute | |
| path, its environment, and also the current working directory. This can | |
| be done with the appropriate modifications during the exec into stage2. | |
| GOING TO STAGE2 | |
| =============== | |
| Looking at the stage1_fly flavor, once the environment is set up, it will | |
| chroot into the new environment, and then exec the target executable. Without | |
| performing the chroot, we would need to modify the execution path, working | |
| directory, and environment path before exec'ing into the target executable. | |
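The path adjustments described above can be sketched as follows. This is only an illustration, not rkt code: the helper name `rebase_app` and the example rootfs path are hypothetical, and it assumes the app's image is unpacked under a single root directory on the host.

```python
import os


def rebase_app(rootfs, exec_path, workdir, env):
    """Rewrite an app's in-image paths so they resolve from the host root.

    Hypothetical helper: without a chroot, every absolute path declared
    for the app has to be prefixed with the pod's rootfs directory.
    """
    def rebase(path):
        # Strip the leading "/" so os.path.join does not discard rootfs.
        return os.path.join(rootfs, path.lstrip("/"))

    new_env = dict(env)
    if "PATH" in new_env:
        # Rebase every absolute PATH entry; leave relative entries alone.
        new_env["PATH"] = ":".join(
            rebase(p) if p.startswith("/") else p
            for p in new_env["PATH"].split(":")
        )
    return rebase(exec_path), rebase(workdir), new_env


exe, cwd, env = rebase_app(
    "/pods/example/rootfs",          # assumed location of the overlay
    "/usr/bin/dockerd",
    "/var/lib/docker",
    {"PATH": "/usr/bin:/bin"},
)
```

A caller would then `os.chdir` into the returned working directory and `os.execve` the returned executable with the returned environment, in place of the chroot-then-exec sequence.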
| To support multiple applications running inside a single pod, we can make use of | |
| systemd in a similar fashion to the current stage1 handler. In this case, each | |
| app within the pod will have its own service file to account for the same tweaks | |
| needed above in the single-app-per-pod model. Instead of invoking systemd-nspawn, | |
| we will invoke systemd-run on a single process as a systemd scope that will | |
| inherit the open file descriptor from rkt, and will be solely responsible for | |
| kicking off the services we just created. In addition, we can define a per-pod | |
| slice to allow for binding all services together into the same cgroup. The | |
| ultimate goal in this case is to provide additional resilience for processes | |
| running inside the pod such that failures can be caught, and systemd can pick | |
| up logging. Lastly, the service files we will create for the pod will reside in | |
| /run/systemd/system as they will be considered transient in nature, and not | |
| designed to survive reboot. | |
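As a rough sketch of what a per-app service file could look like, the snippet below renders one unit and computes where it would be written. The `rkt-<pod>-<app>.service` and `rkt-<pod>.slice` naming scheme and both function names are assumptions made for illustration; only the /run/systemd/system location comes from the text above. `ExecStart=`, `WorkingDirectory=`, `Environment=`, and `Slice=` are standard systemd directives.

```python
def render_service(pod_id, app, exec_path, workdir, env):
    """Return the text of a transient service unit for one app in a pod.

    Assumes pod_id contains no dashes, because systemd treats dashes in
    a slice name as hierarchy separators (rkt-abc.slice sits under
    rkt.slice).
    """
    lines = [
        "[Unit]",
        f"Description=rkt pod {pod_id} app {app}",
        "",
        "[Service]",
        f"ExecStart={exec_path}",
        f"WorkingDirectory={workdir}",
        # Every app of the pod joins the same slice, hence the same cgroup.
        f"Slice=rkt-{pod_id}.slice",
    ]
    for key in sorted(env):
        lines.append(f'Environment="{key}={env[key]}"')
    return "\n".join(lines) + "\n"


def unit_path(pod_id, app):
    """Hypothetical naming scheme for the unit file under /run."""
    return f"/run/systemd/system/rkt-{pod_id}-{app}.service"
```

The paths passed in are expected to be already rewritten to point inside the pod's overlay, since no chroot is performed.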
| Another issue for bring-up with multiple apps in a single pod is adding | |
| dependencies on other apps. In the case for our Docker implementation, this would | |
| be to ensure that containerd is started before dockerd so that containers running | |
| inside docker can survive when the docker daemon goes away. Another case would | |
| be on a webapp with a dependency on something like redis and ensuring that redis | |
| is up before our webapp comes into play. The default behavior we're shooting for | |
| with the initial release is for successive images to depend on the previous image. | |
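The default ordering described above (each image depends on the one before it) could be expressed roughly like this. The function name is hypothetical, and pairing `Requires=` with `After=` is an assumption; a weaker `Wants=` would also be a plausible choice.

```python
def chain_dependencies(units):
    """Map each unit name to the [Unit] lines tying it to its predecessor.

    The first unit has no dependencies; every later unit requires, and is
    ordered after, the unit listed immediately before it.
    """
    deps = {}
    previous = None
    for unit in units:
        if previous is None:
            deps[unit] = []
        else:
            deps[unit] = [f"Requires={previous}", f"After={previous}"]
        previous = unit
    return deps


deps = chain_dependencies(["containerd.service", "dockerd.service"])
```

With this input, dockerd.service picks up `Requires=containerd.service` and `After=containerd.service`, while containerd.service starts unconditionally.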
| Stopping and terminating the pod will consist of stopping the scope that we | |
| spawned as a part of the run invocation. Because of the successive service | |
| dependencies created earlier, terminating the scope will result in all other | |
| dependent processes terminating. | |
| With the garbage collector, stage1_skim will need to be sure to remove all | |
| systemd service/scope files that were created during the run phase, and ensure | |
| `systemctl daemon-reload` has been executed to ensure systemd has an updated | |
| state of the world. | |
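A minimal sketch of that cleanup step is shown below. It assumes the pod's unit files follow a hypothetical `rkt-<pod>-<app>.service` naming scheme; the function name and parameters are illustrative and not part of rkt.

```python
import glob
import os
import subprocess


def collect_pod_units(pod_id, unit_dir="/run/systemd/system", reload=True):
    """Delete a pod's unit files and ask systemd to reload its state.

    Returns the list of removed paths. The file name pattern is an
    assumption about how the run phase named the units.
    """
    pattern = os.path.join(unit_dir, f"rkt-{glob.escape(pod_id)}-*.service")
    removed = sorted(glob.glob(pattern))
    for path in removed:
        os.remove(path)
    if reload and removed:
        # Make systemd forget the units that were just deleted.
        subprocess.run(["systemctl", "daemon-reload"], check=True)
    return removed
```

The `reload` flag exists only so the file removal can be exercised without touching a running systemd; a real garbage collector would always reload.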
| DOCKER/CONTAINER OS MODIFICATIONS | |
| ================================= | |
| For the ContainerOS, the docker client itself would become a wrapper script that | |
| would invoke the docker binary running inside the appropriate rkt pod. The | |
| current docker service would be modified to load/run the appropriate docker | |
| image. The end user can place their docker configuration in /etc/docker just | |
| like before, and the current dockerd wrapper scripts will work unmodified. | |
| On the back end, we would need to deliver the ACI image for docker and its | |
| associated binaries, but that is outside of scope for stage1_skim. |