Cardboard container

You may have heard about software containers, the idea of running code isolated in a lightweight, faked virtual machine; such a process talks to the same kernel as other processes on the host machine (thus is virtually as fast as a normal process), but is constrained because the kernel was asked to lie to it about certain aspects of the system (like mountpoints, processes or users) and to disallow or limit certain actions (messing with devices, allocating RAM). The technology started with the UNIX chroot syscall, which allowed obfuscating the filesystem hierarchy, and evolved by adding more restriction layers, yielding systems like Solaris zones or BSD jails. On Linux it is still more a set of kernel capabilities than something coherent, which is why there are tons of helper solutions promising to put it all together; we have OpenVZ, LXC, systemd’s nspawn and, last but not least, Docker, which was good/cool enough to make the idea popular.

Apart from making the right set of syscalls in the right order, the container engine must also deal with the fact that the contained software may depend on things it was isolated from; it is likely dynamically linked against a dozen libraries (most notably libc), probably expects certain files to exist and may even want to talk to some system services. Most container engines solve that by bundling the software with a complete filesystem of some Linux distribution, with all software dependencies installed therein, virtually independent of the host system (except for kernel version compatibility). Such an approach allows one to jump over the dependency hell problem when building an effective deployment pipeline or running odd binary-blob software (my favourite use-case), and consequently is a huge selling point for Docker and friends; however, I feel it somewhat defies the lightweightness and security aspects. All these binaries that will never be executed, l10n blobs, package managers, default wallpapers…? The attack surface of sudo, PAM, bash and suid binaries?

Obviously one may just base the container on one of those stripped-down distributions like Alpine, but it would be cool to have a container with just a single binary, wouldn’t it? Let’s start with some small C program, like the following Hello World:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    printf("Hello, world!\n");
    return 0;
}

The first thing is to compile it using static linking, so that it won’t be looking for libc *.so files:

$ gcc -static -O3 hw.c -o hw
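
To double-check, ldd should refuse to treat it as a dynamic executable; the exact message depends on the ldd version, but it looks more or less like this:

$ ldd hw
not a dynamic executable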

hw is now a self-contained binary, as ldd confirms. Now it is time to make the container image; it will be just a directory with our hw put inside:

$ mkdir hwcont
$ cp hw hwcont/.

This is basically it; systemd-nspawn can run this as it is:

$ sudo systemd-nspawn -xD hwcont /hw
Spawning container ...
Press ^] three times within 1s to kill container.
Timezone ...
Hello, world!
Container ... exited successfully.

Docker requires making a tar archive and importing it:

$ cd hwcont
$ tar -c . > ../hwcont.tar
$ cd ..
$ docker import - hwcont < hwcont.tar
$ docker run hwcont /hw
Hello, world!
$ #Clean up
$ docker rmi -f hwcont

While we are playing minimal, it is also a good idea to swap glibc for something more compact and maybe also more secure, like for instance musl. Installing it should give you a convenient gcc wrapper, musl-gcc, that links against it:

$ musl-gcc -static -O3 hw.c -o hw-musl
$ du -h hw
792K   hw
$ du -h hw-musl
12K    hw-musl

12K is a pretty good result, given that the dynamically linked binary is 8K; the Alpine image mentioned earlier is 5M. Obviously production apps usually do something, thus will be a bit heavier; also, running an arbitrary app as init is not a good idea when child processes are possible: at least some makeshift init would be necessary to reap zombies and allow the whole tree to gracefully exit in case of the main process failure.
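
For illustration, such a makeshift init could be sketched more or less like this (a rough sketch, not battle-tested, and names like init.c are arbitrary): it starts the given payload, reaps whatever children get re-parented to it, and once the payload exits asks the remaining processes to terminate:

/* init.c: a makeshift init that runs the payload, reaps zombies and shuts the tree down */
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s payload [args...]\n", argv[0]);
        return 1;
    }
    pid_t payload = fork();
    if (payload < 0) {
        perror("fork");
        return 1;
    }
    if (payload == 0) {
        /* child: become the actual application */
        execvp(argv[1], argv + 1);
        perror("execvp");
        _exit(127);
    }
    int code = 1;
    for (;;) {
        int status;
        pid_t pid = wait(&status); /* reaps any child, including re-parented orphans */
        if (pid < 0)
            break;                 /* no children left, we are done */
        if (pid == payload) {
            code = WIFEXITED(status) ? WEXITSTATUS(status) : 1;
            kill(-1, SIGTERM);     /* payload is gone; ask the rest of the tree to exit */
        }
    }
    return code;
}

Statically linked the same way (musl-gcc -static -O3 init.c -o init), dropped into the image next to hw and used as the entrypoint (e.g. /init /hw), something like this should be enough for such a toy container.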


CC-BY mbq, written 31-7-2015, last revised 28-7-2018.