cab105 · January 31, 2017 00:24
diff --git a/docker_rkt_spec.txt b/docker_rkt_spec.txt
 RKT stage1_skim flavor to support Docker
 ========================================

 NOTES
 =====

 This is the second version of an initial specification I provided outlining how
 to modify docker and the stage1_fly handler so that the two can interoperate.
 As a result of a few conversations, it was determined that we would want to do
 away with the chroot environment altogether, and make a few more modifications
 to the stage1 handler so that we can minimize our modifications of the Docker
 environment even further.

 INTRO
 =====

 There has been a long-standing request to get docker running inside of rkt.
 Given that both are containerization products that require privileged
 operations, care needs to be taken to provide an adequate abstraction of the
 host environment while also maintaining the look and feel of running natively.

 TERMS
 =====

 * pod will reference a "pod" or container as hosted by rkt.
 * container will reference Docker's view of a container.

 ENVIRONMENT AND BOOTSTRAPPING
 =============================

 Based on the rkt architecture, rkt will bootstrap into a pluggable environment
 that has the sole job of prepping a new runtime container for execution.  There
 are currently two flavors:
 * stage1 - This one has several offshoots ranging from using native kvm to a
   CoreOS-based image, and leverages systemd to handle the spawning and
   management of the contained process.  This environment provides the most
   isolation, and by extension, the most restrictions.
 * stage1_fly - This was created to support running the kubernetes kubelet, and
   provides a basic chroot environment.  In this way, the FS and process
   namespace are the only things that are isolated, but other things such as
   the host network stack and processes are exposed.

 What would be considered ideal would be something resembling the stage1_fly
 handler minus the chroot environment, and with support for more than one
 application per pod.  There are three reasons for removing the chroot support:
  1. Docker makes use of pivot_root when it comes to setting up the overlays for
     its images/containers.
  2. Most of docker's storage drivers become unusable (overlay will no longer
     be supported due to the docker pod running inside of an overlay, which leaves
     the vfs storage driver as the only usable interface).
  3. There is no easy way to modify the bind mounts such that they're visible
     inside the jail once it is set up without modifying both rkt and any
     application that would want to interact with the host.

 In addition, stage1_fly would need to expose, at a minimum, the host's kernel
 modules so that docker can properly detect and bring up its components (think
 networking for the bridge driver, or overlayfs for its own management of
 filesystems).

 For this, we'll create a new stage1 flavor (stage1_skim) that will remove the
 need for chroot.  We will still create the overlay for the image[s], but paths
 referencing the applications contained within the image will need to be
 adjusted to reflect their new location: the executable's absolute path, its
 environment, and the current working directory.  These adjustments can be made
 during the exec into stage2.

 GOING TO STAGE2
 ===============

 Looking at the stage1_fly flavor, once the environment is set up, it will
 chroot into the new environment, and then exec the target executable.  Without
 performing the chroot, we would need to modify the execution path, working
 directory, and environment path before exec'ing into the target executable.
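
 As a rough illustration of the rewriting described above, the sketch below
 rebases an app's executable, working directory, and PATH entries onto the
 directory where its image was unpacked.  The function name, its parameters,
 and the example rootfs path are hypothetical and are not part of rkt; the
 real change would live in the stage1 run code.

```python
import posixpath


def rebase_for_skim(rootfs, exec_path, workdir, env):
    """Return (exec_path, workdir, env) rewritten to live under rootfs.

    Hypothetical helper: without a chroot, every absolute in-image path
    has to be prefixed with the directory holding the unpacked image.
    """
    def rebase(path):
        # Drop the leading "/" so join() appends instead of replacing.
        return posixpath.normpath(posixpath.join(rootfs, path.lstrip("/")))

    new_env = dict(env)
    if "PATH" in env:
        # Rebase every PATH entry; variables other than PATH are left alone.
        entries = [p for p in env["PATH"].split(":") if p]
        new_env["PATH"] = ":".join(rebase(p) for p in entries)
    return rebase(exec_path), rebase(workdir), new_env
```

 Only PATH is rewritten here; whether other path-like variables need the same
 treatment is left open by this spec.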

 To support multiple applications running inside a single pod, we can make use of
 systemd in a similar fashion to the current stage1 handler.  In this case, each
 app within the pod will have its own service file to account for the same tweaks
 needed above in the single-app-per-pod model.  Instead of invoking
 systemd-nspawn, we will invoke systemd-run on a single process as a systemd
 scope that will inherit the open file descriptor from rkt, and will be solely
 responsible for kicking off the services we just created.  In addition, we can
 define a per-pod slice to bind all services together into the same cgroup.  The
 ultimate goal in this case is to provide additional resilience for processes
 running inside the pod, such that failures can be caught and systemd can pick
 up logging.  Lastly, the service files we create for the pod will reside in
 /run/systemd/system, as they are considered transient in nature and are not
 designed to survive a reboot.
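
 To make the per-app service files more concrete, here is a minimal sketch
 that renders one unit and computes where it would be written.  The
 rkt-<pod>-<app>.service and rkt-<pod>.slice naming schemes, the function
 names, and the assumption that environment values contain no whitespace are
 all illustrative; only the directives themselves (Slice=, WorkingDirectory=,
 Environment=, ExecStart=) are standard systemd.

```python
def render_service(pod_uuid, app, exec_path, workdir, env):
    """Render a service unit for one app of a pod (hypothetical layout)."""
    # Dashes in a slice name denote nesting in systemd, so strip them
    # from the pod UUID to keep one flat slice per pod.
    slice_name = "rkt-" + pod_uuid.replace("-", "") + ".slice"
    lines = [
        "[Unit]",
        f"Description=rkt pod {pod_uuid} app {app}",
        "",
        "[Service]",
        f"Slice={slice_name}",
        f"WorkingDirectory={workdir}",
    ]
    # Assumes values need no quoting (no whitespace or special characters).
    lines += [f"Environment={k}={v}" for k, v in sorted(env.items())]
    lines.append(f"ExecStart={exec_path}")
    return "\n".join(lines) + "\n"


def unit_path(pod_uuid, app, unit_dir="/run/systemd/system"):
    """Location of the unit file; /run is cleared on reboot."""
    return f"{unit_dir}/rkt-{pod_uuid}-{app}.service"
```

 The exec path, working directory, and environment passed in are expected to
 be the already-rewritten values, since no chroot is performed.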

 Another issue for bringup with multiple apps in a single pod is expressing
 dependencies on other apps.  For our Docker implementation, this means
 ensuring that containerd is started before dockerd, so that containers running
 inside docker can survive when the docker daemon goes away.  Another case would
 be a webapp that depends on something like redis, where redis must be up
 before the webapp comes into play.  The default behavior we're shooting for
 with the initial release is for each successive image to depend on the
 previous image.
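
 The default "each image depends on the previous one" rule can be sketched as
 follows.  The function name and the unit names in the usage are hypothetical;
 Requires= and After= are the standard systemd directives for pulling in a
 unit and ordering after it.

```python
def chain_dependencies(units):
    """Map each unit to the [Unit] lines tying it to the unit before it.

    `units` is the ordered list of service unit names for the pod, in the
    order the images were given on the command line.
    """
    deps = {}
    previous = None
    for unit in units:
        if previous is None:
            # The first app has nothing to wait for.
            deps[unit] = []
        else:
            deps[unit] = [f"Requires={previous}", f"After={previous}"]
        previous = unit
    return deps
```

 For a pod listing containerd.service and then dockerd.service, dockerd would
 get Requires= and After= lines naming containerd.  Note that After= only
 waits for the earlier unit to be considered started, which depends on that
 unit's service type.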

 Stopping and terminating the pod will consist of stopping the scope that we
 spawned as a part of the run invocation.  Because of the successive service
 dependencies created earlier, terminating the scope will result in all other
 dependent processes terminating.

 With the garbage collector, stage1_skim will need to remove all systemd
 service/scope files that were created during the run phase, and then run
 `systemctl daemon-reload` so that systemd has an updated state of the world.
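
 A minimal sketch of that cleanup step, assuming the pod's unit files follow a
 hypothetical rkt-<pod>-<name>.service / .scope naming scheme under
 /run/systemd/system.  The function only removes the files and returns the
 reload command; running it is left to the caller.

```python
import glob
import os


def collect_pod_units(pod_uuid, unit_dir="/run/systemd/system"):
    """Remove a pod's unit files and return (removed_paths, reload_cmd)."""
    # Escape both parts so an unusual directory or pod id is matched literally.
    pattern = os.path.join(
        glob.escape(unit_dir), f"rkt-{glob.escape(pod_uuid)}-*"
    )
    removed = []
    for path in sorted(glob.glob(pattern)):
        # Only touch unit types that stage1_skim would have created.
        if path.endswith((".service", ".scope")):
            os.remove(path)
            removed.append(path)
    # systemd must re-read its unit directories after files disappear.
    reload_cmd = ["systemctl", "daemon-reload"]
    return removed, reload_cmd
```

 The caller would then run the returned command, for example with
 subprocess.run(reload_cmd, check=True), once per garbage-collection pass
 rather than once per file.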

 DOCKER/CONTAINER OS MODIFICATIONS
 =================================

 For the ContainerOS, the docker client itself would become a wrapper script that
 would invoke the docker binary running inside the appropriate rkt pod.  The
 current docker service would be modified to load/run the appropriate docker
 image.  The end user can place their docker configuration in /etc/docker just
 like before, and the current dockerd wrapper scripts will work unmodified.

 On the back end, we would need to deliver the ACI image for docker and its
 associated binaries, but that is outside the scope of stage1_skim.