clearly run

Run a command in a Clearly container.

Synopsis

$ clearly run [OPTION...] IMAGE -- COMMAND [ARG...]

Description

Run command COMMAND in a fully unprivileged Clearly container using the image specified by IMAGE, which can be: (1) a path to a directory, (2) the name of an image in clearly image storage (e.g. example.com:5050/foo), or (3) if the proper support is enabled, a SquashFS archive. clearly run does not use any setuid or setcap helpers, even for mounting SquashFS images with FUSE.

-a, --allow=DST

Allow traffic to peer container DST. Can be repeated.

-b, --bind=SRC[:DST]

Bind-mount SRC at guest DST. If DST is not specified, it defaults to the same path as on the host; i.e., the default is --bind=SRC:SRC. Can be repeated.

With a read-only image (the default), DST must exist. However, if --write or --write-fake is given, DST will be created as an empty directory (possibly with the tmpfs overmount trick described in “--bind creates mount points within un-writeable directories!”). In this case, DST must be entirely within the image itself, i.e., DST cannot enter a previous bind mount. For example, --bind /foo:/tmp/foo will fail because /tmp is shared with the host via bind-mount (unless $TMPDIR is set to something else or --private-tmp is given).

Most images have ten directories /mnt/[0-9] already available as mount points.

Symlinks in DST are followed, and absolute links can have surprising behavior. Bind-mounting happens after namespace setup but before pivoting into the container image, so absolute links use the host root. For example, suppose the image has a symlink /foo -> /mnt. Then, --bind=/bar:/foo will bind-mount on the host’s /mnt, which is inaccessible on the host because namespaces are already set up and also inaccessible in the container because of the subsequent pivot into the image. Currently, this problem is only detected when DST needs to be created: clearly run will refuse to follow absolute symlinks in this case, to avoid directory creation surprises.

-c, --cd=DIR

Initial working directory in container.

--cap-add=CAP

Add system capability CAP to the container. Can be repeated. The capability is added to both effective and permitted capability sets.

--cap-drop=CAP

Drop system capability CAP from the container. Can be repeated. The capability is removed from both effective and permitted capability sets. Note that CAP_NET_ADMIN is automatically dropped by default for security reasons after network setup is complete, unless it is explicitly requested with --cap-add=NET_ADMIN.

--cpus=N

Set the number of CPUs available to the container (0-1024).

--cpu-weight=WEIGHT

Set CPU weight for the container (1-10000).

-d, --detach

Detach the container into the background. Requires --name to be set.

--env-no-expand

Don’t expand variables when using --env.

--feature=FEAT

If feature FEAT is enabled, exit successfully (zero); otherwise, exit unsuccessfully (non-zero). Note this just communicates the results of configure rather than testing the feature. Valid values of FEAT are:

  • extglob: extended globs in --unset-env

  • seccomp: system calls intercepted by seccomp

  • squash: internal SquashFUSE image mounts

  • overlayfs: unprivileged overlayfs support

  • tmpfs-xattrs: user xattrs on tmpfs
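A script can use --feature to choose options that match the build. The following is a minimal sketch, not part of Clearly: the function names have_feature and writeable_flag are hypothetical, and it assumes --feature can be queried without an IMAGE argument. Because --feature only reports what configure found, a successful result does not prove that the running kernel supports the feature.

```sh
# Hypothetical helper: succeed if this build of clearly run reports
# feature $1 as enabled (exit status zero), fail otherwise.
have_feature() {
    clearly run --feature="$1" >/dev/null 2>&1
}

# Hypothetical helper: print --write-fake if the build reports unprivileged
# overlayfs support; print nothing otherwise, leaving the image read-only.
writeable_flag() {
    if have_feature overlayfs; then
        printf '%s\n' --write-fake
    fi
}
```

A caller could then write, for example, clearly run $(writeable_flag) /data/foo -- echo hello.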

-g, --gid=GID

Run as group GID within container.

-h, --host=SRC:DST

Map hostname SRC to IP address DST inside the container (e.g. google.com:1.2.3.4). Can be repeated.

--home

Bind-mount your host home directory (i.e., $HOME) at guest /home/$USER, hiding any existing image content at that path. Implies --write-fake so the mount point can be created if needed.

-i, --ip=IP

Set a static IP address for the container.

-j, --join

Use the same container (namespaces) as peer clearly run invocations.

--join-pid=PID

Join the namespaces of an existing process.

--join-ct=N

Number of clearly run peers (implies --join; default: see below).

--join-tag=TAG

Label for clearly run peer group (implies --join; default: see below).

--label=KEY=VALUE

Set container label KEY to VALUE. Can be repeated.

--memory-max=BYTES

Set memory limit for the container in bytes (up to 1024G).

-m, --mount=DIR

Use DIR for the SquashFS mount point, which must already exist. If not specified, the default is /var/lib/clearly/mount, which will be created if needed.

--name=NAME

Assign a name to the container. Required when using --detach.

--passwd

Bind-mount /etc/{passwd,group} from the host into the container.

--pids-max=N

Set maximum number of PIDs for the container (0-1024).

-p, --port=SRC:DST

Forward host port SRC to container port DST. Can be repeated.

-q, --quiet

Be quieter; can be repeated. Incompatible with -v. See the FAQ entry “How can I control Clearly’s quietness or verbosity?” for details.

-r, --runtime=DIR

Set DIR as the runtime directory.

-s, --storage=DIR

Set the storage directory. Equivalent to the same option for clearly image(1).

--sysctl=KEY=VALUE

Set kernel parameter KEY to VALUE. Can be repeated. The parameter is written to the corresponding file in /proc/sys/.

--test=TEST

Run internal test TEST. Valid values are log and log-fail.

-t, --private-tmp

By default, the host’s /tmp (or $TMPDIR if set) is bind-mounted at container /tmp. If this is specified, a new tmpfs is mounted on the container’s /tmp instead.

-e, --env, --env=FILE, --env=VAR=VALUE

Set environment variables from a newline-separated file (/clearly/environment within the image if not specified) or directly on the command line. Can be repeated. See below for details.

--env0, --env0=FILE, --env0=VAR=VALUE

Like --env, but file is null-byte separated.

-u, --uid=UID

Run as user UID within container.

--unsafe

Enable various unsafe behavior. For internal use only. Seriously, stay away from this option.

--unset-env=GLOB

Unset environment variables whose names match GLOB.

-v, --verbose

Print extra chatter; can be repeated. See the FAQ entry on verbosity for details.

--warnings=NUM

Log NUM warnings and exit.

-w, --write

Mount image read-write. By default, the image is mounted read-only. This option should be avoided for most use cases, because (1) changing images live (as opposed to prescriptively with a Dockerfile) destroys their provenance and (2) SquashFS images, which are the best-practice format on parallel filesystems, must be read-only. It is better to use --overlay (for disposable data) or bind-mount host directories (for retained data).

-o, --overlay[=SIZE]

Overlay a writeable tmpfs on top of the image. This makes the image appear read-write, but it actually remains read-only and unchanged. All data "written" to the image are discarded when the container exits.

The size of the writeable filesystem SIZE is any size specification acceptable to tmpfs, e.g. 4m for 4MiB or 50% for half of physical memory. If this option is specified without SIZE, the default is 12%. Note (1) this limit is a maximum — only actually stored files consume virtual memory — and (2) SIZE larger than memory can be requested without error (the failure happens later if the actual contents become too large).

This requires kernel support and there are some caveats. See section "Writeable overlay with --write-fake" below for details.

-W, --write-fake[=SIZE]

Overlay a writeable tmpfs on top of the image. This makes the image appear read-write, but it actually remains read-only and unchanged. All data “written” to the image are discarded when the container exits.

The size of the writeable filesystem SIZE is any size specification acceptable to tmpfs, e.g. 4m for 4MiB or 50% for half of physical memory. If this option is specified without SIZE, the default is 12%. Note (1) this limit is a maximum — only actually stored files consume virtual memory — and (2) SIZE larger than memory can be requested without error (the failure happens later if the actual contents become too large).

This requires kernel support and there are some caveats. See section “Writeable overlay with --write-fake” below for details.

-?, --help

Print help and exit.

--usage

Print a short usage message and exit.

Note: Because clearly run is fully unprivileged, it is not possible to change UIDs and GIDs within the container (the relevant system calls fail). In particular, setuid, setgid, and setcap executables do not work. As a precaution, clearly run calls prctl(PR_SET_NO_NEW_PRIVS, 1) to disable these executables within the container. This does not reduce functionality but is a "belt and suspenders" precaution to reduce the attack surface should bugs in these system calls or elsewhere arise.

Image format

clearly run supports two different image formats.

The first is a simple directory that contains a Linux filesystem tree. Such a directory can be produced in several ways:

  • Use clearly convert to go directly from clearly image or another builder to a directory.

  • Use Clearly’s tarball workflow: build or pull the image, clearly convert it to a tarball, transfer the tarball to the target system, then clearly convert the tarball to a directory.

  • Manually mount a SquashFS image, e.g. with squashfuse(1), and then unmount it after the run with fusermount -u.

  • Any other workflow that produces an appropriate directory tree.

The second is a SquashFS image archive mounted internally by clearly run, available if it’s linked with the optional libsquashfuse_ll shared library. clearly run mounts the image filesystem, services all FUSE requests, and unmounts it, all within clearly run. See --mount above to set the mount point location.

Like other FUSE implementations, Clearly calls the fusermount3(1) utility to mount the SquashFS filesystem. However, this executable does not need to be installed setuid root, and in fact clearly run actively suppresses its setuid bit if set (using prctl(2)).

Prior versions of Clearly provided wrappers for the squashfuse and squashfuse_ll SquashFS mount commands and fusermount -u unmount command. We removed these because we concluded they had minimal value-add over the standard, unwrapped commands.

Warning

Currently, Clearly unmounts the SquashFS filesystem when user command COMMAND’s process exits. It does not monitor any of its child processes. Therefore, if the user command spawns child processes and then exits before them (e.g., some daemons), those children will have the image unmounted from underneath them. In this case, the workaround is to mount/unmount using external tools. We expect to remove this limitation in a future version.

Host files and directories available in container via bind mounts

In addition to any directories specified by the user with --bind, clearly run has standard host files and directories that are bind-mounted in as well.

The following host files and directories are bind-mounted at the same location in the container. These give access to the host’s devices and various kernel facilities. (Recall that Clearly provides minimal isolation and containerized processes are mostly normal unprivileged processes.) They cannot be disabled and are required; i.e., they must exist both on host and within the image.

  • /dev

  • /proc

  • /sys

The following are optional; each is bind-mounted only if the path exists both on the host and within the image, with no error or warning otherwise.

  • /etc/hosts and /etc/resolv.conf. Because Clearly containers share the host network namespace, they need the same hostname resolution configuration.

  • /etc/machine-id. Provides a unique ID for the OS installation; matching the host works for most situations. Needed to support D-Bus, some software licensing situations, and likely other use cases.

  • /var/lib/hugetlbfs at guest /var/opt/cray/hugetlbfs, and /var/opt/cray/alps/spool. These support Cray MPI.

Additional bind mounts, which can be enabled or disabled with the options above:

  • $HOME at /home/$USER (and image /home is hidden). Makes user data and init files available.

  • /tmp (or $TMPDIR if set) at guest /tmp. Provides a temporary directory that persists between container runs and is shared with non-containerized application components.

  • Temporary files at /etc/passwd and /etc/group. Usernames and group names need to be customized for each container run.

Multiple processes in the same container with --join

By default, different clearly run invocations use different user and mount namespaces (i.e., different containers). While this has no impact on sharing most resources between invocations, there are a few important exceptions. These include:

  1. ptrace(2), used by debuggers and related tools. One can attach a debugger to processes in descendant namespaces, but not sibling namespaces. The practical effect is that, without --join, you can’t run a command with clearly run and then attach to it with a debugger also run with clearly run.

  2. Cross-memory attach (CMA) is used by cooperating processes to communicate by simply reading and writing one another’s memory. This is also not permitted between sibling namespaces. This affects various MPI implementations that use CMA to pass messages between ranks on the same node, because it’s faster than traditional shared memory.

--join is designed to address this by placing related clearly run commands (the “peer group”) in the same container. This is done by one of the peers creating the namespaces with unshare(2) and the others joining with setns(2).

To do so, we need to know the number of peers and a name for the group. These are specified by additional arguments that can (hopefully) be left at default values in most cases:

  • --join-ct sets the number of peers. The default is the value of the first of the following environment variables that is defined: OMPI_COMM_WORLD_LOCAL_SIZE, SLURM_STEP_TASKS_PER_NODE, SLURM_CPUS_ON_NODE.

  • --join-tag sets the tag that names the peer group. The default is environment variable SLURM_STEP_ID, if defined; otherwise, the PID of clearly run’s parent. Tags can be re-used for peer groups that start at different times, i.e., once all peer clearly run processes have replaced themselves with the user command, the tag can be re-used.
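The default for --join-ct amounts to a first-match search over three environment variables. A minimal sketch follows, assuming that “defined” means set (even if empty). The function name default_join_ct is hypothetical, and the real implementation may interpret the Slurm values further rather than using them verbatim.

```sh
# Hypothetical helper: print the value of the first variable in the
# documented search order that is set; fail if none of them is.
default_join_ct() {
    for name in OMPI_COMM_WORLD_LOCAL_SIZE \
                SLURM_STEP_TASKS_PER_NODE \
                SLURM_CPUS_ON_NODE; do
        # ${VAR+yes} expands to "yes" only if VAR is set. The names are
        # fixed constants, so using eval here is safe.
        eval "is_set=\${$name+yes}"
        if [ "$is_set" = yes ]; then
            eval "value=\${$name}"
            printf '%s\n' "$value"
            return 0
        fi
    done
    return 1
}
```

This can help when debugging a peer group that never completes: comparing the printed value with the number of clearly run processes actually launched on the node shows whether --join-ct needs to be given explicitly.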

Caveats:

  • One cannot currently add peers after the peer group has started, for example, if one decides later to start a debugger. (This is only required for code with bugs and is thus an unusual use case.)

  • clearly run instances race. The winner of this race sets up the namespaces, and the other peers use the winner to find the namespaces to join. Therefore, if the user command of the winner exits, any remaining peers will not be able to join the namespaces, even if they are still active. There is currently no general way to specify which clearly run should be the winner.

  • If --join-ct is too high, the winning clearly run’s user command exits before all peers join, or clearly run itself crashes, IPC resources such as semaphores and shared memory segments will be leaked. These appear as files in /dev/shm/ and can be removed with rm(1).

  • Many of the arguments given to the race losers, such as the image path and --bind, will be ignored in favor of what was given to the winner.

Writeable overlay with --write-fake

If you need the image to stay read-only but appear writeable, you may be able to use --write-fake to overlay a writeable tmpfs atop the image. This requires kernel support. Specifically:

  1. To use the feature at all, you need unprivileged overlayfs support. This is available in upstream 5.11 (February 2021), but distributions vary considerably. If you don’t have this, the container will fail to start with error “operation not permitted”.

  2. For a fully functional overlay, you need a tmpfs that supports xattrs in the user namespace. This is available in upstream 6.6 (October 2023). If you don’t have this, most things will work fine, but some operations will fail with “I/O error”, for example creating a directory with the same path as a previously deleted directory. There will also be syslog noise about xattr problems.

    (overlayfs can also use xattrs in the trusted namespace, but this requires CAP_SYS_ADMIN on the host and thus is not helpful for unprivileged containers.)
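The upstream version thresholds above can be checked roughly from a script. This is a sketch only: kernel_at_least is a hypothetical helper, and because distributions backport or disable features, a version comparison is a hint rather than proof that --write-fake will work.

```sh
# Hypothetical helper: succeed if the kernel release is at least
# MAJOR.MINOR. The release string defaults to the output of uname -r.
# Usage: kernel_at_least MAJOR MINOR [RELEASE]
kernel_at_least() {
    want_major=$1
    want_minor=$2
    rel=${3:-$(uname -r)}
    major=${rel%%.*}           # text before the first dot
    rest=${rel#*.}             # text after the first dot
    minor=${rest%%[!0-9]*}     # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

For example, kernel_at_least 5 11 corresponds to the unprivileged overlayfs requirement and kernel_at_least 6 6 to the tmpfs user xattrs requirement.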

Environment variables

clearly run leaves environment variables unchanged, i.e. the host environment is passed through unaltered, except:

  • HOME: by default (--home not specified), set to /root if it exists and to / otherwise;

  • limited tweaks to avoid significant guest breakage;

  • variables set by the user via --env;

  • variables unset by the user via --unset-env; and

  • CLEARLY_RUNNING is set.

This section describes these features.

The default tweaks happen first, then --env and --unset-env in the order specified on the command line, and then CLEARLY_RUNNING. The two options can be repeated arbitrarily many times, e.g. to add/remove multiple variable sets or add only some variables in a file.

Default behavior

By default, clearly run makes the following environment variable changes:

$CLEARLY_RUNNING

Set to Weird Al Yankovic. While a process can figure out that it’s in an unprivileged container and what namespaces are active without this hint, that can be messy, and there is no way to tell that it’s a Clearly container specifically. This variable makes such a test simple and well-defined. (Note: This variable is unaffected by --unset-env.)

$HOME

If --home is specified, then your home directory is bind-mounted into the guest at /home/$USER. If your home directory has a different path on the host, an inherited $HOME will be incorrect inside the guest, which confuses lots of software, notably Spack. Thus, with --home, $HOME is set to /home/$USER (by default, it is unchanged).

$PATH

Newer Linux distributions replace some root-level directories, such as /bin, with symlinks to their counterparts in /usr.

Some of these distributions (e.g., Fedora 24) have also dropped /bin from the default $PATH. This is a problem when the guest OS does not have a merged /usr (e.g., Debian 8 “Jessie”). Thus, we add /bin to $PATH if it’s not already present.

$TMPDIR

Unset, because this is almost certainly a host path, and that host path is made available in the guest at /tmp unless --private-tmp is given.
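A program or init file can use $CLEARLY_RUNNING to detect a Clearly container. A minimal sketch follows; the function name in_clearly is hypothetical, and the test checks only whether the variable is set rather than comparing its value.

```sh
# Hypothetical helper: succeed if CLEARLY_RUNNING is set to any value,
# including the empty string.
in_clearly() {
    [ -n "${CLEARLY_RUNNING+set}" ]
}

# Possible use in a shell init file:
#   if in_clearly; then PS1="(clearly) $PS1"; fi
```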

Setting variables with --env or --env0

The purpose of these two options is to set environment variables within the container. Values given replace any already in the environment (i.e., inherited from the host shell) or set by earlier uses of the options. These flags take an optional argument, giving three possible forms:

  1. If the argument contains an equals sign (=, ASCII 61), that sets an environment variable directly. For example, to set FOO to the string value bar:

    $ clearly run --env=FOO=bar ...
    

    Single straight quotes (', ASCII 39) around the value are stripped, though be aware that both single and double quotes are also interpreted by the shell. For example, the following is similar to the prior example; the double quotes are removed by the shell and the single quotes are removed by clearly run:

    $ clearly run --env="BAZ='qux'" ...
    
  2. If the argument does not contain an equals sign, it is a host path to a file containing zero or more variables using the same syntax as above (except with no prior shell processing).

    With --env, this file contains a sequence of assignments separated by newline (\n or ASCII 10); with --env0, the assignments are separated by the null byte (i.e., \0 or ASCII 0). Empty assignments are ignored, and no comments are interpreted. (This syntax is designed to accept the output of printenv and be easily produced by other simple mechanisms.) The file need not be seekable.

    For example:

    $ cat /tmp/env.txt
    FOO=bar
    BAZ='qux'
    $ clearly run --env=/tmp/env.txt ...
    

    For directory images only (because the file is read before containerizing), guest paths can be given by prepending the image path.

  3. If there is no argument, the file /clearly/environment within the image is used. This file is commonly populated by ENV instructions in the Dockerfile. For example, equivalently to form 2:

    $ cat Dockerfile
    [...]
    ENV FOO=bar
    ENV BAZ=qux
    [...]
    $ clearly image build -t foo .
    $ clearly convert foo /var/tmp/foo.sqfs
    $ clearly run --env /var/tmp/foo.sqfs -- ...
    

    (Note the image path is interpreted correctly, not as the --env argument.)

    At present, there is no way to use files other than /clearly/environment within SquashFS images.
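The assignment syntax accepted by --env can be summarized in a few lines of shell. This is an illustrative sketch rather than Clearly’s implementation: parse_assignment and show_env_file are hypothetical names, and the sketch assumes quotes are stripped only when the value both starts and ends with a single quote. Variable expansion is not handled here.

```sh
# Hypothetical helper: split one assignment at the first equals sign and
# store the results in shell variables NAME and VALUE.
parse_assignment() {
    case $1 in
        =*)  echo "name cannot be empty: $1" >&2; return 1 ;;
        *=*) ;;
        *)   echo "no equals separator: $1" >&2; return 1 ;;
    esac
    NAME=${1%%=*}     # everything before the first "="
    VALUE=${1#*=}     # everything after the first "="
    # Strip one pair of single quotes surrounding the value, if present.
    case $VALUE in
        \'*\') VALUE=${VALUE#\'}; VALUE=${VALUE%\'} ;;
    esac
}

# Hypothetical helper: read a newline-separated file the way --env=FILE is
# described and print each variable as NAME<TAB>VALUE. Empty lines are
# ignored and nothing is treated as a comment.
show_env_file() {
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        parse_assignment "$line" || return 1
        printf '%s\t%s\n' "$NAME" "$VALUE"
    done < "$1"
}
```

With the /tmp/env.txt file shown in form 2, show_env_file /tmp/env.txt prints FOO and bar on one line, then BAZ and qux on the next, each pair separated by a tab.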

Environment variables are expanded for values that look like search paths, unless --env-no-expand is given prior to --env. For expansion, the value is treated as a sequence of zero or more possibly-empty items separated by colons (:, ASCII 58). If an item begins with a dollar sign ($, ASCII 36), the rest of the item is the name of an environment variable. If this variable is set to a non-empty value, that value is substituted for the item; otherwise (i.e., the variable is unset or set to the empty string), the item is deleted along with one delimiter colon. The purpose of omitting empty expansions is to avoid surprising behavior such as an empty element in $PATH meaning the current directory.

For example, to set HOSTPATH to the search path in the current shell (this is expanded by clearly run, though letting the shell do it happens to be equivalent):

$ clearly run --env='HOSTPATH=$PATH' ...

To prepend /opt/bin to this current search path:

$ clearly run --env='PATH=/opt/bin:$PATH' ...

To prepend /opt/bin to the search path set by the Dockerfile, as retrieved from guest file /clearly/environment (here we really cannot let the shell expand $PATH):

$ clearly run --env --env='PATH=/opt/bin:$PATH' ...
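The expansion rule can be sketched in POSIX shell as follows. This is an illustration under assumptions, not Clearly’s code: expand_value is a hypothetical name, it reads variables from the calling shell (which includes the inherited environment), and it treats any name that is not a valid shell identifier as unset.

```sh
# Hypothetical helper: expand a colon-separated value as described above
# and print the result. The body runs in a subshell, so the working
# variables (prefixed with "_") do not leak into the caller.
expand_value() (
    _rest=$1
    _out=
    _sep=
    while :; do
        # Take the next item, up to the next colon or the end of the value.
        case $_rest in
            *:*) _item=${_rest%%:*}; _rest=${_rest#*:}; _last=no ;;
            *)   _item=$_rest; _last=yes ;;
        esac
        _keep=yes
        case $_item in
            \$*)
                _name=${_item#?}    # drop the leading dollar sign
                case $_name in
                    # Empty or non-identifier names count as unset.
                    ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*) _item= ;;
                    *) eval "_item=\${$_name-}" ;;
                esac
                # Unset or empty: delete the item and one delimiter.
                if [ -z "$_item" ]; then _keep=no; fi
                ;;
        esac
        if [ "$_keep" = yes ]; then
            _out=$_out$_sep$_item
            _sep=:
        fi
        if [ "$_last" = yes ]; then break; fi
    done
    printf '%s\n' "$_out"
)
```

For example, with BAR set to bar and UNSET unset, expand_value 'baz:$UNSET:qux' prints baz:qux and expand_value '$BAR:baz' prints bar:baz. Items that do not begin with a dollar sign are kept even when empty, so expand_value ':bar:baz::' prints :bar:baz:: unchanged.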

Examples of valid assignments, assuming that environment variable BAR is set to bar and UNSET is unset or set to the empty string:

  Assignment                      Name    Value
  FOO=bar                         FOO     bar
  FOO=bar=baz                     FOO     bar=baz
  FLAGS=-march=foo -mtune=bar     FLAGS   -march=foo -mtune=bar
  FLAGS='-march=foo -mtune=bar'   FLAGS   -march=foo -mtune=bar
  FOO=$BAR                        FOO     bar
  FOO=$BAR:baz                    FOO     bar:baz
  FOO=                            FOO     empty string
  FOO=$UNSET                      FOO     empty string
  FOO=baz:$UNSET:qux              FOO     baz:qux (not baz::qux)
  FOO=:bar:baz::                  FOO     :bar:baz::
  FOO=''                          FOO     empty string
  FOO=''''                        FOO     '' (two single quotes)

Example invalid assignments:

  Assignment   Problem
  FOO bar      no equals separator
  =bar         name cannot be empty

Example valid assignments that are probably not what you want:

  Assignment         Name   Value      Problem
  FOO="bar"          FOO    "bar"      double quotes aren’t stripped
  FOO=bar # baz      FOO    bar # baz  comments not supported
  FOO=bar\tbaz       FOO    bar\tbaz   backslashes are not special
  ␣FOO=bar           ␣FOO   bar        leading space in key
  FOO=␣bar           FOO    ␣bar       leading space in value
  $FOO=bar           $FOO   bar        variables not expanded in key
  FOO=$BAR baz:qux   FOO    qux        variable BAR baz not set

  (␣ marks a space character.)

Removing variables with --unset-env

The purpose of --unset-env=GLOB is to remove unwanted environment variables. The argument GLOB is a glob pattern (dialect fnmatch(3) with the FNM_EXTMATCH flag where supported); all variables with matching names are removed from the environment.

Warning

Because the shell also interprets glob patterns, if any wildcard characters are in GLOB, it is important to put it in single quotes to avoid surprises.

GLOB must be a non-empty string.

Example 1: Remove the single environment variable FOO:

$ export FOO=bar
$ env | grep -F FOO
FOO=bar
$ clearly run --unset-env=FOO $CLEARLY_TEST_IMGDIR/test -- env | grep -F FOO
$

Example 2: Hide from a container the fact that it’s running in a Slurm allocation, by removing all variables beginning with SLURM. You might want to do this to test an MPI program with one rank and no launcher:

$ salloc -N1
$ env | grep -E '^SLURM' | wc
   44      44    1092
$ clearly run $CLEARLY_TEST_IMGDIR/mpihello-openmpi -- /hello/hello
[... long error message ...]
$ clearly run --unset-env='SLURM*' $CLEARLY_TEST_IMGDIR/mpihello-openmpi -- /hello/hello
0: MPI version:
Open MPI v3.1.3, package: Open MPI root@c897a83f6f92 Distribution, ident: 3.1.3, repo rev: v3.1.3, Oct 29, 2018
0: init ok cn001.localdomain, 1 ranks, userns 4026532530
0: send/receive ok
0: finalize ok

Example 3: Clear the environment completely (remove all variables):

$ clearly run --unset-env='*' $CLEARLY_TEST_IMGDIR/test -- env
$

Example 4: Remove all environment variables except for those prefixed with either WANTED_ or ALSO_WANTED_:

$ export WANTED_1=yes
$ export ALSO_WANTED_2=yes
$ export NOT_WANTED_1=no
$ clearly run --unset-env='!(WANTED_*|ALSO_WANTED_*)' $CLEARLY_TEST_IMGDIR/test -- env
WANTED_1=yes
ALSO_WANTED_2=yes
$

Note that some programs, such as shells, set some environment variables even if started with no init files:

$ clearly run --unset-env='*' $CLEARLY_TEST_IMGDIR/debian_9ch -- bash --noprofile --norc -c env
SHLVL=1
PWD=/
_=/usr/bin/env
$
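To preview which variables a given GLOB would remove, the shell’s own pattern matching can stand in for fnmatch(3). This is a sketch: preview_unset is a hypothetical helper, and plain POSIX case patterns do not support the extended !(...) syntax used in Example 4.

```sh
# Hypothetical helper: print each NAME that matches GLOB, one per line.
# Usage: preview_unset GLOB NAME...
preview_unset() {
    glob=$1
    shift
    for name in "$@"; do
        # An unquoted expansion in a case pattern is matched as a glob.
        case $name in
            $glob) printf '%s\n' "$name" ;;
        esac
    done
}
```

One way to feed it the current variable names is preview_unset 'SLURM*' $(awk 'BEGIN { for (n in ENVIRON) print n }'), which lists the names that --unset-env='SLURM*' would be expected to remove.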

Examples

Run the command echo hello inside a Clearly container using the unpacked image at /data/foo:

$ clearly run /data/foo -- echo hello
hello

Run an MPI job that can use CMA to communicate:

$ srun clearly run --join /data/foo -- bar

Syslog

By default, clearly run logs its command line to syslog. (This can be disabled by configuring with --disable-syslog.) This includes: (1) the invoking real UID, (2) the number of command line arguments, and (3) the arguments, separated by spaces. For example:

Dec 10 18:19:08 mybox clearly run: uid=1000 args=7: clearly run -v /var/tmp/00_tiny -- echo hello "wor l}\$d"

Logging is one of the first things done during program initialization, even before command line parsing. That is, almost all command lines are logged, even if erroneous, and there is no logging of program success or failure.

Arguments are serialized with the following procedure. The purpose is to provide a human-readable reconstruction of the command line while also allowing each argument to be recovered byte-for-byte.

  • If an argument contains only printable ASCII bytes that are not whitespace, shell metacharacters, double quote (", ASCII 34), or backslash (\, ASCII 92), then log it unchanged.

  • Otherwise, (a) enclose the argument in double quotes and (b) backslash-escape double quotes, backslashes, and any other characters that Bash (and other POSIX shells) interpret within double quotes.

The verbatim command line typed in the shell cannot be recovered, because not enough information is provided to UNIX programs. For example, echo  'foo' is given to programs as a sequence of two arguments, echo and foo; the two spaces and single quotes are removed by the shell. The zero byte, ASCII NUL, cannot appear in arguments because it would terminate the string.
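The serialization procedure can be approximated in shell. This sketch is not Clearly’s implementation: log_quote and log_args are hypothetical names, the set of characters passed through unquoted is a deliberately conservative assumption, and the escaped characters (double quote, backslash, dollar sign, backtick) are an assumed reading of “characters interpreted within double quotes”.

```sh
# Hypothetical helper: print one argument in log form. The body runs in a
# subshell so the locale change stays local.
log_quote() (
    LC_ALL=C
    export LC_ALL
    case $1 in
        *[!A-Za-z0-9_./:=@%+,-]*)
            # Contains a byte outside the assumed safe set: escape the
            # assumed special characters, then wrap in double quotes.
            escaped=$(printf '%s' "$1" | sed 's/[\\"$`]/\\&/g')
            printf '"%s"\n' "$escaped"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
)

# Hypothetical helper: serialize all arguments, separated by spaces.
log_args() {
    line=
    sep=
    for arg in "$@"; do
        line=$line$sep$(log_quote "$arg")
        sep=' '
    done
    printf '%s\n' "$line"
}
```

For example, log_args echo hello 'wor l}$d' prints echo hello "wor l}\$d", matching the tail of the sample log line above.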

Exit status

If the user command is started successfully and exits normally, clearly run’s exit status is that of the user command. Otherwise, the exit status is one of:

31

Miscellaneous clearly run failure other than the below

49

Unable to start user command (i.e., execvp(2) failed)

84

SquashFUSE loop exited on signal before user command was complete

87

Feature queried by --feature is not available

128 + N

User command killed by signal N
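A wrapper script can translate these codes into messages. The sketch below uses a hypothetical function name, explain_status, and it is only a best guess: a user command that itself exits with 31, 49, 84, 87, or a value above 128 cannot be distinguished from the corresponding clearly run condition.

```sh
# Hypothetical helper: describe an exit status returned by clearly run.
explain_status() {
    case $1 in
        31) echo "miscellaneous clearly run failure" ;;
        49) echo "unable to start user command" ;;
        84) echo "SquashFUSE loop exited on signal" ;;
        87) echo "feature not available" ;;
        *)
            if [ "$1" -gt 128 ]; then
                echo "user command killed by signal $(($1 - 128))"
            else
                echo "user command exited with status $1"
            fi
            ;;
    esac
}
```

A typical use is to run clearly run /data/foo -- echo hello and then call explain_status $? on the next line.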