Unit process isolation / namespaces

Fri Aug 16 17:56:41 UTC 2019

Hi,

I would like to present a new feature I'm working on that adds
OS-level process isolation to Unit.
For now, it implements just the basic building block of containers:
Linux namespaces.

Let me know what you think, if it's useful or not, etc.

To start using it, you just need to add a new "isolation" field to
your app's config:

    {
        "type": "external",
        "executable": "/bin/app",
        "isolation": {
            "namespaces": {
                "user": true,
                "mount": true
            }
        }
    }

The allowed namespaces are: user, mount, network, pid, uts, cgroup.
The ipc namespace is not allowed because Unit uses shared memory to
communicate with workers.
In the future, if Unit could proxy general processes (and also manage
them), we could allow the ipc namespace as well, thereby giving full
isolation.
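
To make the list above concrete, here is a minimal Python sketch of how
a "namespaces" object could be checked and turned into a clone(2) flag
mask. The flag values are the ones from <linux/sched.h>; the helper
name namespace_flags and its error handling are made up for
illustration and are not taken from the PR.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mount":   0x00020000,  # CLONE_NEWNS
    "cgroup":  0x02000000,  # CLONE_NEWCGROUP
    "uts":     0x04000000,  # CLONE_NEWUTS
    "user":    0x10000000,  # CLONE_NEWUSER
    "pid":     0x20000000,  # CLONE_NEWPID
    "network": 0x40000000,  # CLONE_NEWNET
}


def namespace_flags(namespaces):
    """Return the OR-ed clone flags for the enabled namespaces.

    Hypothetical helper: "ipc" and any unknown name are rejected,
    because they are not in the allowed list.
    """
    flags = 0
    for name, enabled in namespaces.items():
        if name not in CLONE_FLAGS:
            raise ValueError("namespace not allowed: %s" % name)
        if enabled:
            flags |= CLONE_FLAGS[name]
    return flags
```

With the config from the example, namespace_flags({"user": True,
"mount": True}) gives CLONE_NEWUSER | CLONE_NEWNS, while {"ipc": True}
raises ValueError.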

Creating Linux namespaces requires CAP_SYS_ADMIN unless they are
created together with a user namespace.
So, if you want to keep running Unit as an unprivileged user, you
need to set the "user" namespace in addition to the other ones.
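
The rule can be written down as a small check. This is only a sketch
of the reasoning in this paragraph; the function name
needs_cap_sys_admin is hypothetical, and it ignores distribution
settings that restrict unprivileged user namespaces.

```python
def needs_cap_sys_admin(namespaces):
    """Return True if the requested set needs CAP_SYS_ADMIN.

    Assumption: a set that includes the user namespace can be created
    by an unprivileged process; any other non-empty set cannot.
    """
    enabled = {name for name, on in namespaces.items() if on}
    if not enabled:
        # No namespaces requested, so nothing needs privileges.
        return False
    return "user" not in enabled
```

So {"mount": True} alone would need a privileged Unit, while
{"user": True, "mount": True} would not.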

The PR is here (still working on it): https://github.com/nginx/unit/pull/289

When using a user namespace, you can set mappings for the uid and gid
ranges inside the namespace. For uids, the file is /proc/<pid>/uid_map
and for gids it is /proc/<pid>/gid_map. This way, you can map an
unprivileged user id on the host (parent ns) to a privileged id inside
the child namespace.
I added two config fields for these mappings:

    {
        "isolation": {
            "namespaces": {
                "user": true,
                "mount": true
            },
            "uidmap": [
                {"containerID": 0, "hostID": 1000, "size": 1}
            ],
            "gidmap": [
                {"containerID": 0, "hostID": 1000, "size": 1}
            ]
        }
    }
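
For reference, each line of uid_map and gid_map has the form
"<id inside ns> <id outside ns> <length>". A minimal Python sketch of
turning a "uidmap" or "gidmap" array into that text could look like
this; render_id_map is a hypothetical name, not a function from the PR.

```python
def render_id_map(entries):
    """Render OCI-style map entries as uid_map/gid_map file content.

    Each entry becomes one "<containerID> <hostID> <size>" line.
    """
    lines = []
    for entry in entries:
        lines.append("%d %d %d" % (entry["containerID"],
                                   entry["hostID"],
                                   entry["size"]))
    return "\n".join(lines) + "\n"
```

For the example config this gives "0 1000 1\n". Note that since
Linux 3.19 an unprivileged process has to write "deny" to
/proc/<pid>/setgroups before it is allowed to write gid_map.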

Each map is an array because you can map several ranges. For now, if
you don't set a map, Unit will use a common default (the example
above, but using the process's current euid instead of 1000). Some
distributions come with /etc/subuid and /etc/subgid files containing
per-user subordinate id ranges. In the future, we could make Unit look
up a mapping from these files as well.
The config is based on the OCI Spec:
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#user-namespace-mappings
I don't like it much, so let me know if you know a better way of
configuring it.
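
As a sketch of both ideas (the default map and a possible /etc/subuid
lookup), assuming the usual "name:start:count" line format of the
subuid and subgid files. The function names are hypothetical and the
lookup is not implemented in the PR.

```python
import os


def default_id_map(host_id=None):
    """Map the given host id (default: current euid) to id 0, size 1."""
    if host_id is None:
        host_id = os.geteuid()
    return [{"containerID": 0, "hostID": host_id, "size": 1}]


def parse_subid(text, user):
    """Return the (start, count) ranges listed for user.

    text is the content of a subuid/subgid style file; blank or
    malformed lines are skipped.
    """
    ranges = []
    for line in text.splitlines():
        fields = line.strip().split(":")
        if len(fields) != 3:
            continue
        name, start, count = fields
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges
```

For example, parse_subid("alice:100000:65536\n", "alice") returns
[(100000, 65536)].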

The uid/gid mapping affects the user and group you pass in the
application config.
This leads to my first question:
If the user passes a "user" or "group" that's not mapped inside the
container, what should we do?
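
Detecting this case is the easy part. A minimal sketch, with a
hypothetical helper name, that translates a host id through a map and
returns None when the id is not covered by any range:

```python
def to_container_id(host_id, id_map):
    """Translate a host uid/gid into the namespace, or return None.

    id_map is a list of {"containerID", "hostID", "size"} entries.
    """
    for entry in id_map:
        offset = host_id - entry["hostID"]
        if 0 <= offset < entry["size"]:
            return entry["containerID"] + offset
    return None
```

With the single-entry map from the example, host id 1000 becomes 0
inside the namespace, and any other host id gives None. What Unit
should do with that None is the open question.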

I would like to keep the user experience very simple, but having to
deal with uid/gid mappings seems a bit complex.
What do you folks think about doing some automatic mapping in case the
user passes a user from the host (without setting any mapping)? Would
that be confusing?

If you think it's useful, what should the next steps be?

I would like to add a "rootfs" field to chroot applications, and also
a "mounts" field to mount additional filesystems inside the rootfs
(kernfs, tmpfs, procfs and also user-defined bind mounts from the host
filesystem).

As for the isolation mechanism, I did some experiments with FreeBSD
jails and maybe we can deliver something useful there as well.
Jails are significantly more secure than Linux namespaces, and I think
we can implement support for them relatively easily.

That's all folks!