Tutorial¶
This tutorial will teach you how to create and run Clearly images, using both examples included with the source code as well as new ones you create from scratch.
This tutorial assumes that: (a) Clearly is in your path, including
Clearly’s fully unprivileged image builder clearly image
and
(b) Clearly is installed under /usr/local
. (If the second
assumption isn’t true, you will just need to modify some paths.)
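If you are not sure whether these assumptions hold, a quick shell check is enough (the path shown assumes the /usr/local prefix; yours may differ):
$ command -v clearly
/usr/local/bin/clearly
$ clearly --version
0.26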
If you want to use Docker to build images, see the FAQ.
Note
Shell sessions throughout this documentation will use the prompt $
to indicate commands executed natively on the host and >
for
commands executed in a container.
90 seconds to Clearly¶
This section is for the impatient. It shows you how to quickly build and run a “hello world” Clearly container. If you like what you see, then proceed with the rest of the tutorial to understand what is happening and how to use Clearly for your own applications.
Using a SquashFS image¶
The preferred workflow uses our internal SquashFS mounting code. Your sysadmin should be able to tell you if this is linked in.
$ cd /usr/local/share/doc/clearly/examples/hello
$ clearly image build .
inferred image name: hello
[...]
grown in 3 instructions: hello
$ clearly convert hello /var/tmp/hello.sqfs
input: clearly image hello
output: squash /var/tmp/hello.sqfs
packing ...
Parallel mksquashfs: Using 8 processors
Creating 4.0 filesystem on /var/tmp/hello.sqfs, block size 65536.
[=============================================|] 10411/10411 100%
[...]
done
$ clearly run /var/tmp/hello.sqfs -- echo "I’m in a container"
I’m in a container
Using a directory image¶
If SquashFS support is not linked in, you can create an image in plain directory format instead. Most of this tutorial uses SquashFS images, but you can adapt it by analogy with this section.
$ cd /usr/local/share/doc/clearly/examples/hello
$ clearly image build .
inferred image name: hello
[...]
grown in 4 instructions: hello
$ clearly convert hello /var/tmp/hello
input: clearly image hello
output: dir /var/tmp/hello
exporting ...
done
$ clearly run /var/tmp/hello -- echo "I’m in a container"
I’m in a container
Note
You can run perfectly well out of /tmp
, but because it is
bind-mounted automatically, the image root will then appear in multiple
locations in the container’s filesystem tree. This can cause confusion for
both users and programs.
Getting help¶
All the executables have decent help and can tell you what version of Clearly you have (if not, please report a bug). For example:
$ clearly run --help
Usage: clearly run [OPTION...] IMAGE -- COMMAND [ARG...]
Run a command in a Clearly container.
[...]
$ clearly --version
0.26
Man pages for all commands are provided in this documentation (see table of
contents at left) as well as via man(1)
.
Pull an image¶
To start, let’s obtain a container image that someone else has already built. The containery way to do this is the pull operation, which means to move an image from a remote repository into local storage of some kind.
First, browse the Docker Hub repository of official AlmaLinux images. Note the list of tags; this is a partial list of image versions that are available. We’ll use the tag “8”.
Use the Clearly program clearly image
to pull this image to
Clearly’s internal storage directory:
$ clearly image pull almalinux:8
pulling image: almalinux:8
requesting arch: amd64
manifest list: downloading: 100%
manifest: downloading: 100%
config: downloading: 100%
layer 1/1: 3239c63: downloading: 68.2/68.2 MiB (100%)
pulled image: adding to build cache
flattening image
layer 1/1: 3239c63: listing
validating tarball members
layer 1/1: 3239c63: changed 42 absolute symbolic and/or hard links to relative
resolving whiteouts
layer 1/1: 3239c63: extracting
image arch: amd64
done
$ clearly image list
almalinux:8
Images come in lots of different formats; clearly run
can use directories
and SquashFS archives. For this example, we’ll use SquashFS. We use the
command clearly convert
to create a SquashFS image from the image in
internal storage, then run it:
$ clearly convert almalinux:8 almalinux.sqfs
$ clearly run almalinux.sqfs -- /bin/bash
> pwd
/
> ls
bin clearly dev etc home lib lib64 media mnt
opt proc root run sbin srv sys tmp usr var
> cat /etc/redhat-release
AlmaLinux release 8.7 (Stone Smilodon)
> exit
What do these commands do?
Create a SquashFS-format image (clearly convert ...).
Create a running container using that image (clearly run almalinux.sqfs).
Stop processing clearly run options (--). (This is standard notation for UNIX command line programs; see the example below.)
Run the program /bin/bash inside the container, which starts an interactive shell, where we enter a few commands and then exit, returning to the host.
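The separator matters whenever the command inside the container takes options of its own. For example (output omitted), the -l here belongs to ls; without the --, it could instead be taken as an option to clearly run, depending on how the arguments are parsed:
$ clearly run almalinux.sqfs -- ls -l /
[...]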
Containers are not special¶
Many folks would like you to believe that containers are magic and special (especially if they want to sell you their container product). This is not the case. To demonstrate, we’ll create a working container image using standard UNIX tools.
Many Linux distributions provide tarballs containing installed base images, including Alpine. We can use these in Clearly directly:
$ wget -O alpine.tar.gz 'https://github.com/alpinelinux/docker-alpine/blob/v3.16/x86_64/alpine-minirootfs-3.16.3-x86_64.tar.gz?raw=true'
$ tar tf alpine.tar.gz | head -10
./
./root/
./var/
./var/log/
./var/lock/
./var/lock/subsys/
./var/spool/
./var/spool/cron/
./var/spool/cron/crontabs
./var/spool/mail
This tarball is what’s called a “tarbomb”, so we need to provide an enclosing directory to avoid making a mess:
$ mkdir alpine
$ cd alpine
$ tar xf ../alpine.tar.gz
$ ls
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
$ du -sh
5.6M .
$ cd ..
Now, run a shell in the container! (Note that base Alpine does not have Bash,
so we run /bin/sh
instead.)
$ clearly run ./alpine -- /bin/sh
> pwd
/
> ls
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
> cat /etc/alpine-release
3.16.3
> exit
Warning
Generally, you should avoid directory-format images on shared filesystems
such as NFS and Lustre, in favor of local storage such as tmpfs
and
local hard disks. This will yield better performance for you and anyone
else on the shared filesystem. In contrast, SquashFS images should work
fine on shared filesystems.
Build from Dockerfile¶
The other containery way to get an image is the build operation. This interprets a recipe, usually a Dockerfile, to create an image and place it into builder storage. We can then extract the image from builder storage to a directory and run it.
Clearly supports arbitrary image builders. In this tutorial, we use
clearly image
, which comes with Clearly, but you can also use others,
e.g. Docker or Podman. clearly image
is a big deal because it is completely
unprivileged. Other builders typically run as root or require setuid root
helper programs; this raises a number of security questions.
We’ll write a “Hello World” Python program and put it into an image we specify with a Dockerfile. Set up a directory to work in:
$ mkdir hello.src
$ cd hello.src
Type in the following program as hello.py
using your least favorite
editor:
#!/usr/bin/python3
print("Hello World!")
Next, create a file called Dockerfile
and type in the following
recipe:
FROM almalinux:8
RUN yum -y install python36
COPY ./hello.py /
RUN chmod 755 /hello.py
These four instructions say:
FROM: We are extending the almalinux:8 base image.
RUN: Install the python36 RPM package, which we need for our Hello World program.
COPY: Copy the file hello.py we just made to the root directory of the image. In the source argument, the path is relative to the context directory, which we’ll see more of below.
RUN: Make that file executable.
Note
COPY
is a standard instruction but has a number of disadvantages in
its corner cases. Clearly also has RSYNC
, which addresses
these; see its documentation for details.
Let’s build this image:
$ clearly image build -t hello -f Dockerfile .
1. FROM almalinux:8
[...]
4. RUN chmod 755 /hello.py
grown in 4 instructions: hello
This command says:
Build (clearly image build) an image named (a.k.a. tagged) “hello” (-t hello).
Use the Dockerfile called “Dockerfile” (-f Dockerfile).
Use the current directory as the context directory (.).
Now, list the images clearly image
knows about:
$ clearly image list
almalinux:8
hello
And run the image we just made:
$ cd ..
$ clearly convert hello hello.sqfs
$ clearly run hello.sqfs -- /hello.py
Hello World!
This time, we’ve run our application directly rather than starting an interactive shell.
Push an image¶
The containery way to share your images is by pushing them to a container registry. In this section, we will set up a registry on GitLab and push the hello image to that registry, then pull it back to compare.
Destination setup¶
Create a private container registry:
Browse to https://gitlab.com (or any other GitLab instance).
Log in. You should end up on your Projects page.
Click New project then Create blank project.
Name your project “test-registry”. Leave Visibility Level at Private. Click Create project. You should end up at your project’s main page.
At left, choose Settings (the gear icon) → General, then Visibility, project features, permissions. Enable Container registry, then click Save changes.
At left, choose Packages & Registries (the box icon) → Container registry. You should see the message “There are no container images stored for this project”.
At this point, we have a container registry set up, and we need to teach
clearly image
how to log into it. On gitlab.com
and some other
instances, you can use your GitLab password. However, GitLab has a thing
called a personal access token (PAT) that can be used no matter how you log
into the GitLab web app. To create one:
Click on your avatar at the top right. Choose Edit Profile.
At left, choose Access Tokens (the three-pin plug icon).
Type in the name “registry”. Tick the boxes read_registry and write_registry. Click Create personal access token.
Your PAT will be displayed at the top of the result page under Your new personal access token. Copy this string and store it somewhere safe & policy-compliant for your organization. (Also, you can revoke it at the end of the tutorial if you like.)
Push¶
We can now use clearly image push
to push the image to GitLab. (Note that
the tagging step you would need for Docker is unnecessary here, because we can
just specify a destination reference at push time.)
You will need to substitute your GitLab username for $USER
below.
When you are prompted for credentials, enter your GitLab username and copy-paste the PAT you created earlier (or enter your password).
Note
The specific GitLab path may vary depending on how your GitLab is set up.
Check the Docker examples on the empty container registry page for the
value you need. For example, if you put your container registry in a group
called “containers”, the image reference would be
gitlab.com/$USER/containers/myproj/hello:latest
.
$ clearly image push hello gitlab.com:5050/$USER/myproj/hello:latest
pushing image: hello
destination: gitlab.com:5050/$USER/myproj/hello:latest
layer 1/1: gathering
layer 1/1: preparing
preparing metadata
starting upload
layer 1/1: bca515d: checking if already in repository
Username: $USER
Password:
layer 1/1: bca515d: not present, uploading: 139.8/139.8 MiB (100%)
config: f969909: checking if already in repository
config: f969909: not present, uploading
manifest: uploading
cleaning up
done
Go back to your container registry page. You should see your image listed now!
Pull and compare¶
Let’s pull that image and see how it looks:
$ clearly image pull --auth gitlab.com:5050/$USER/myproj/hello:latest hello.2
pulling image: gitlab.com:5050/$USER/myproj/hello:latest
destination: hello.2
[...]
$ clearly image list
almalinux:8
hello
hello.2
$ clearly convert hello.2 ./hello.2
$ ls ./hello.2
bin clearly dev etc hello.py home lib lib64 media mnt
opt proc root run sbin srv sys tmp usr var
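If you want to check the round trip more thoroughly, one option is to also unpack the original hello image and compare the two trees with diff(1). (This is only a sketch: hello.1 is just an illustrative name, and minor metadata differences are possible.)
$ clearly convert hello ./hello.1
$ diff -r ./hello.1 ./hello.2
[...]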
MPI Hello World¶
In this section, we’ll build and run a simple MPI parallel program.
Image builds can be chained. Here, we’ll build a chain of four images: the
official almalinux:8
image, a customized AlmaLinux 8 image, an OpenMPI
image, and finally the application image.
Important: Many of the specifics in this section will vary from site to site. In that case, follow your site’s instructions instead.
Build base images¶
First, build two images using the Dockerfiles provided with Clearly. These two builds should take about 15 minutes total, depending on the speed of your system.
Note that Clearly infers their names from the Dockerfile name, so we
don’t need to specify -t
.
$ clearly image build \
-f /usr/local/share/doc/clearly/examples/Dockerfile.almalinux_8clearly \
/usr/local/share/doc/clearly/examples
$ clearly image build \
-f /usr/local/share/doc/clearly/examples/Dockerfile.openmpi \
/usr/local/share/doc/clearly/examples
Build image¶
Next, create a new directory for this project, and within it the following
simple C program called mpihello.c
. (Note the program contains a bug;
consider fixing it.)
#include <stdio.h>
#include <mpi.h>

int main (int argc, char **argv)
{
   int msg, rank, rank_ct;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &rank_ct);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   printf("hello from rank %d of %d\n", rank, rank_ct);

   if (rank == 0) {
      for (int i = 1; i < rank_ct; i++) {
         MPI_Send(&msg, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
         printf("rank %d sent %d to rank %d\n", rank, msg, i);
      }
   } else {
      MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("rank %d received %d from rank 0\n", rank, msg);
   }

   MPI_Finalize();
}
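The bug alluded to above appears to be that msg is never assigned before rank 0 sends it, so the value the other ranks receive is indeterminate (it merely happens to print as 0 in the sample run later). A one-line fix is to initialize it:
int msg = 0;   /* give the payload a defined value before MPI_Send() */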
Add this Dockerfile
:
FROM openmpi
RUN mkdir /hello
WORKDIR /hello
COPY mpihello.c .
RUN mpicc -o mpihello mpihello.c
(The instruction WORKDIR
changes directories; the default working
directory within a Dockerfile is /
).
Now build. The default Dockerfile is ./Dockerfile
, so we can omit
-f
.
$ ls
Dockerfile mpihello.c
$ clearly image build -t mpihello .
$ clearly image list
almalinux:8
almalinux_8clearly
mpihello
openmpi
Finally, create a squashball image and copy it to the supercomputer:
$ clearly convert mpihello mpihello.sqfs
$ scp mpihello.sqfs super-fe:
Run the container¶
We’ll run this application interactively. One could also put similar steps in a Slurm batch script.
First, obtain a two-node allocation and load Clearly:
$ salloc -N2 -t 1:00:00
salloc: Granted job allocation 599518
[...]
$ module load clearly
Then, run the application on all cores in your allocation:
$ srun -c1 clearly run ~/mpihello.sqfs -- /hello/mpihello
hello from rank 1 of 72
rank 1 received 0 from rank 0
[...]
hello from rank 63 of 72
rank 63 received 0 from rank 0
Win!
Build cache¶
clearly image
subcommands that create images, such as build and pull, can
use a build cache to speed repeated operations. That is, an image is created
by starting from the empty image and executing a sequence of instructions,
largely Dockerfile instructions but also some others like “pull” and “import”.
Some instructions are expensive to execute so it’s often cheaper to retrieve
their results from cache instead.
Let’s set up this example by first resetting the build cache:
$ clearly image build-cache --reset
$ mkdir cache-test
$ cd cache-test
Suppose we have a Dockerfile a.df
:
FROM almalinux:8
RUN sleep 2 && echo foo
RUN sleep 2 && echo bar
On our first build, we get:
$ clearly image build -t a -f a.df .
1. FROM almalinux:8
[ ... pull chatter omitted ... ]
2. RUN sleep 2 && echo foo
copying image ...
foo
3. RUN sleep 2 && echo bar
bar
grown in 3 instructions: a
Note the dot after each instruction’s line number. This means that the
instruction was executed. You can also see this in the output of the two
echo
commands.
But on our second build, we get:
$ clearly image build -t a -f a.df .
1* FROM almalinux:8
2* RUN sleep 2 && echo foo
3* RUN sleep 2 && echo bar
copying image ...
grown in 3 instructions: a
Here, instead of being executed, each instruction’s results were retrieved
from cache. A cache hit for each instruction is indicated by an asterisk
(“*
”) after the line number. Even for such a small and short
Dockerfile, this build is noticeably faster than the first.
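If you want to see the difference concretely, wrapping the two builds in time(1) works fine; with this Dockerfile the uncached build has to pay for the two sleep 2 commands, while the cached build skips them. (No figures are shown here, since they vary by system.)
$ time clearly image build -t a -f a.df .
[...]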
Let’s also try a second, slightly different Dockerfile, b.df
. The
first two instructions are the same, but the third is different.
FROM almalinux:8
RUN sleep 2 && echo foo
RUN sleep 2 && echo qux
Build it:
$ clearly image build -t b -f b.df .
1* FROM almalinux:8
2* RUN sleep 2 && echo foo
3. RUN sleep 2 && echo qux
copying image
qux
grown in 3 instructions: b
Here, the first two instructions are hits from the first Dockerfile, but the third is a miss, so Clearly restores the cached state left by the second instruction and then executes the third.
Finally, inspect the cache:
$ clearly image build-cache --tree
* (b) RUN sleep 2 && echo qux
| * (a) RUN sleep 2 && echo bar
|/
* RUN sleep 2 && echo foo
* (almalinux:8) PULL almalinux:8
* (root) ROOT
named images: 4
state IDs: 5
commits: 5
files: 317
disk used: 3 MiB
Here there are four named images: a
and b
that we built, the
base image almalinux:8
, and the empty base of everything ROOT
.
Also note that a
and b
diverge after the last common
instruction RUN sleep 2 && echo foo
.
Appendices¶
These appendices contain further tutorials that may be enlightening but are less essential to understanding Clearly.
Namespaces in Clearly¶
Let’s examine the namespace symlinks in /proc/self/ns, first on the host and then inside a Clearly container:
$ ls -l /proc/self/ns
total 0
lrwxrwxrwx 1 clearly clearly 0 Sep 28 11:24 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 11:24 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 11:24 net -> net:[4026531969]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 11:24 pid -> pid:[4026531836]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 11:24 user -> user:[4026531837]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 11:24 uts -> uts:[4026531838]
$ clearly run /var/tmp/hello -- ls -l /proc/self/ns
total 0
lrwxrwxrwx 1 clearly clearly 0 Sep 28 17:34 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 17:34 mnt -> mnt:[4026532257]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 17:34 net -> net:[4026531969]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 17:34 pid -> pid:[4026531836]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 17:34 user -> user:[4026532256]
lrwxrwxrwx 1 clearly clearly 0 Sep 28 17:34 uts -> uts:[4026531838]
The container has different mount (mnt
) and user (user
)
namespaces, but the rest of the namespaces are shared with the host. This
highlights Clearly’s focus on functionality (make your container run),
rather than isolation (protect the host from your container).
Normally, each invocation of clearly run
creates a new container, so if you
have multiple simultaneous invocations, they will not share containers. In
some cases this can cause problems with MPI programs. However, there is an
option --join
that can solve them; see the FAQ for
details.
All you need is Bash¶
In this exercise, we’ll use shell commands to create a minimal container image with a working copy of Bash, and that’s all. To do so, we need to set up a directory with the Bash binary, the shared libraries it uses, and a few other hooks needed by Clearly.
Important: Your Bash is almost certainly linked differently than described below. Use the paths from your terminal, not this tutorial. Adjust the steps below as needed. It will not work otherwise.
$ ldd /bin/bash
linux-vdso.so.1 (0x00007ffdafff2000)
libtinfo.so.6 => /lib/x86_64-linux-gnu/libtinfo.so.6 (0x00007f6935cb6000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6935cb1000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6935af0000)
/lib64/ld-linux-x86-64.so.2 (0x00007f6935e21000)
$ ls -l /lib/x86_64-linux-gnu/libc.so.6
lrwxrwxrwx 1 root root 12 May 1 2019 /lib/x86_64-linux-gnu/libc.so.6 -> libc-2.28.so
The shared libraries pointed to are symlinks, so we’ll use cp -L
to
dereference them and copy the target files. linux-vdso.so.1
is a
kernel thing, not a shared library file, so we don’t copy that.
Set up the container:
$ mkdir alluneed
$ cd alluneed
$ mkdir bin
$ mkdir dev
$ mkdir lib
$ mkdir lib64
$ mkdir lib/x86_64-linux-gnu
$ mkdir proc
$ mkdir sys
$ mkdir tmp
$ cp -pL /bin/bash ./bin
$ cp -pL /lib/x86_64-linux-gnu/libtinfo.so.6 ./lib/x86_64-linux-gnu
$ cp -pL /lib/x86_64-linux-gnu/libdl.so.2 ./lib/x86_64-linux-gnu
$ cp -pL /lib/x86_64-linux-gnu/libc.so.6 ./lib/x86_64-linux-gnu
$ cp -pL /lib64/ld-linux-x86-64.so.2 ./lib64/ld-linux-x86-64.so.2
$ cd ..
$ ls -lR alluneed
./alluneed:
total 0
drwxr-x--- 2 clearly clearly 60 Mar 31 17:15 bin
drwxr-x--- 2 clearly clearly 40 Mar 31 17:26 dev
drwxr-x--- 2 clearly clearly 80 Mar 31 17:27 etc
drwxr-x--- 3 clearly clearly 60 Mar 31 17:17 lib
drwxr-x--- 2 clearly clearly 60 Mar 31 17:19 lib64
drwxr-x--- 2 clearly clearly 40 Mar 31 17:26 proc
drwxr-x--- 2 clearly clearly 40 Mar 31 17:26 sys
drwxr-x--- 2 clearly clearly 40 Mar 31 17:27 tmp
./alluneed/bin:
total 1144
-rwxr-xr-x 1 clearly clearly 1168776 Apr 17 2019 bash
./alluneed/dev:
total 0
./alluneed/lib:
total 0
drwxr-x--- 2 clearly clearly 100 Mar 31 17:19 x86_64-linux-gnu
./alluneed/lib/x86_64-linux-gnu:
total 1980
-rwxr-xr-x 1 clearly clearly 1824496 May 1 2019 libc.so.6
-rw-r--r-- 1 clearly clearly 14592 May 1 2019 libdl.so.2
-rw-r--r-- 1 clearly clearly 183528 Nov 2 12:16 libtinfo.so.6
./alluneed/lib64:
total 164
-rwxr-xr-x 1 clearly clearly 165632 May 1 2019 ld-linux-x86-64.so.2
./alluneed/proc:
total 0
./alluneed/sys:
total 0
./alluneed/tmp:
total 0
Next, start a container and run /bin/bash
within it.
$ clearly run ./alluneed -- /bin/bash
> pwd
/
> echo "hello world"
hello world
> ls /
bash: ls: command not found
> echo *
bin dev home lib lib64 proc sys tmp
> exit
It’s not very useful since the only commands we have are Bash built-ins, but it’s a container!
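If you want to take the exercise a step further, the same ldd-and-copy procedure works for other binaries too. Here is a sketch for adding ls(1), assuming the same Debian-style library layout as above (check your own ldd output; the awk expression simply extracts the library paths ldd reports):
$ cp -pL /bin/ls ./alluneed/bin
$ for lib in $(ldd /bin/ls | awk '/=> \//{print $3}'); do cp -pL "$lib" ./alluneed/lib/x86_64-linux-gnu; done
$ clearly run ./alluneed -- /bin/ls /
bin dev home lib lib64 proc sys tmp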
Interacting with the host¶
Clearly is not an isolation layer, so containers have full access to host resources, with a few quirks. This section demonstrates how that works.
Filesystems¶
Clearly makes host directories available inside the container using bind mounts. A bind mount is somewhat like a hard link in that it causes a file or directory to appear in multiple places in the filesystem tree, but it is a property of the running kernel rather than of the filesystem.
Several host directories are always bind-mounted into the container. These
include system directories such as /dev
, /proc
, /sys
,
and /tmp
. Others can be requested with a command line option, e.g.
--home
bind-mounts the invoking user’s home directory.
Clearly uses recursive bind mounts, so for example if the host has a
variety of sub-filesystems under /sys
, as Ubuntu does, these will be
available in the container as well.
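One way to observe this from inside a container is to read the mount table the container sees; /proc/mounts works in most images (output abridged here, and it will vary by host):
$ clearly run /var/tmp/hello.sqfs -- cat /proc/mounts
[...]
Any /sys sub-filesystems bind-mounted from the host should show up among the entries.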
In addition to these, arbitrary user-specified directories can be added using
the --bind
or -b
switch. By default, mounts use the same path
as provided from the host. In the case of directory images, which are
writeable, the target mount directory will be automatically created before the
container is started:
$ mkdir /var/tmp/foo0
$ echo hello > /var/tmp/foo0/bar
$ mkdir /var/tmp/foo1
$ echo world > /var/tmp/foo1/bar
$ clearly run -b /var/tmp/foo0 -b /var/tmp/foo1 /var/tmp/hello -- bash
> cat /var/tmp/foo0/bar
hello
> cat /var/tmp/foo1/bar
world
However, as SquashFS filesystems are read-only, in this case you must provide
a destination that already exists, like those created under /mnt
:
$ clearly run -b /var/tmp/foo0 -b /var/tmp/foo1 /var/tmp/hello.sqfs -- bash
clearly run[1184427]: error: can’t mkdir: /var/tmp/hello/var/tmp/foo0: Read-only file system (misc.c:142 30)
$ clearly run -b /var/tmp/foo0:/mnt/0 -b /var/tmp/foo1:/mnt/1 /var/tmp/hello.sqfs -- bash
> ls /mnt
0 1 2 3 4 5 6 7 8 9
> cat /mnt/0/bar
hello
> cat /mnt/1/bar
world
Network¶
Clearly containers share the host’s network namespace, so most network things should be the same.
However, SSH is not aware of Clearly containers. If you SSH to a node
where Clearly is installed, you will get a shell on the host, not in a
container, even if ssh
was initiated from a container:
$ stat -L --format='%i' /proc/self/ns/user
4026531837
$ ssh localhost stat -L --format='%i' /proc/self/ns/user
4026531837
$ clearly run /var/tmp/hello.sqfs -- /bin/bash
> stat -L --format='%i' /proc/self/ns/user
4026532256
> ssh localhost stat -L --format='%i' /proc/self/ns/user
4026531837
There are a couple ways to SSH to a remote node and run commands inside a
container. The simplest is to manually invoke clearly run
in the
ssh
command:
$ ssh localhost clearly run /var/tmp/hello.sqfs -- stat -L --format='%i' /proc/self/ns/user
4026532256
Note
Recall that by default, each clearly run
invocation creates a new
container. That is, the ssh
command above has not entered an
existing user namespace ’2256
; rather, it has re-used the namespace
ID ’2256
.
Another approach is to edit your shell initialization scripts to check the command
line and exec(1)
clearly run
if appropriate. This is brittle but
avoids wrapping ssh
or altering its command line.
User and group IDs¶
Unlike Docker and some other container systems, Clearly tries to make the
container’s users and groups look the same as the host’s. This is accomplished
by bind-mounting a custom /etc/passwd
and /etc/group
into the
container. For example:
$ id -u
901
$ whoami
clearly
$ clearly run /var/tmp/hello.sqfs -- bash
> id -u
901
> whoami
clearly
More specifically, the user namespace, when created without privileges as
Clearly does, lets you map any container UID to your host UID.
clearly run
implements this with the --uid
switch. So, for example,
you can tell Clearly you want to be root, and it will tell you that
you’re root:
$ clearly run --uid 0 /var/tmp/hello.sqfs -- bash
> id -u
0
> whoami
root
But, as shown above, this doesn’t get you anything useful, because the container UID is mapped back to your UID on the host before permission checks are applied:
> dd if=/dev/mem of=/tmp/pwned
dd: failed to open '/dev/mem': Permission denied
This mapping also affects how users are displayed. For example, if a file is
owned by you, your host UID will be mapped to your container UID, which is
then looked up in /etc/passwd
to determine the display name. In
typical usage without --uid
, this mapping is a no-op, so everything
looks normal:
$ ls -nd ~
drwxr-xr-x 87 901 901 4096 Sep 28 12:12 /home/clearly
$ ls -ld ~
drwxr-xr-x 87 clearly clearly 4096 Sep 28 12:12 /home/clearly
$ clearly run /var/tmp/hello.sqfs -- bash
> ls -nd ~
drwxr-xr-x 87 901 901 4096 Sep 28 18:12 /home/clearly
> ls -ld ~
drwxr-xr-x 87 clearly clearly 4096 Sep 28 18:12 /home/clearly
But if --uid
is provided, things can seem odd. For example:
$ clearly run --uid 0 /var/tmp/hello.sqfs -- bash
> ls -nd /home/clearly
drwxr-xr-x 87 0 901 4096 Sep 28 18:12 /home/clearly
> ls -ld /home/clearly
drwxr-xr-x 87 root clearly 4096 Sep 28 18:12 /home/clearly
This UID mapping can contain only one pair: an arbitrary container UID to your
effective UID on the host. Thus, all other users are unmapped, and they show
up as nobody
:
$ ls -n /tmp/foo
-rw-rw---- 1 902 902 0 Sep 28 15:40 /tmp/foo
$ ls -l /tmp/foo
-rw-rw---- 1 sig sig 0 Sep 28 15:40 /tmp/foo
$ clearly run /var/tmp/hello.sqfs -- bash
> ls -n /tmp/foo
-rw-rw---- 1 65534 65534 843 Sep 28 21:40 /tmp/foo
> ls -l /tmp/foo
-rw-rw---- 1 nobody nogroup 843 Sep 28 21:40 /tmp/foo
User namespaces have a similar mapping for GIDs, with the same limitation ---
exactly one arbitrary container GID maps to your effective primary GID. This
can lead to some strange-looking results, because only one of your GIDs can be
mapped in any given container. All the rest become nogroup
:
$ id
uid=901(clearly) gid=901(clearly) groups=901(clearly),903(nerds),904(losers)
$ clearly run /var/tmp/hello.sqfs -- id
uid=901(clearly) gid=901(clearly) groups=901(clearly),65534(nogroup)
$ clearly run --gid 903 /var/tmp/hello.sqfs -- id
uid=901(clearly) gid=903(nerds) groups=903(nerds),65534(nogroup)
However, this doesn’t affect access. The container process retains the same GIDs from the host perspective, and as always, the host IDs are what control access:
$ ls -l /tmp/primary /tmp/supplemental
-rw-rw---- 1 sig clearly 0 Sep 28 15:47 /tmp/primary
-rw-rw---- 1 sig nerds 0 Sep 28 15:48 /tmp/supplemental
$ clearly run /var/tmp/hello.sqfs -- bash
> cat /tmp/primary > /dev/null
> cat /tmp/supplemental > /dev/null
One area where functionality is reduced is that chgrp(1)
becomes
useless. Using an unmapped group or nogroup
fails, and using a mapped
group is a no-op because it’s mapped back to the host GID:
$ ls -l /tmp/bar
-rw-rw---- 1 clearly clearly 0 Sep 28 16:12 /tmp/bar
$ clearly run /var/tmp/hello.sqfs -- chgrp nerds /tmp/bar
chgrp: changing group of '/tmp/bar': Invalid argument
$ clearly run /var/tmp/hello.sqfs -- chgrp nogroup /tmp/bar
chgrp: changing group of '/tmp/bar': Invalid argument
$ clearly run --gid 903 /var/tmp/hello.sqfs -- chgrp nerds /tmp/bar
$ ls -l /tmp/bar
-rw-rw---- 1 clearly clearly 0 Sep 28 16:12 /tmp/bar
Workarounds include chgrp(1)
on the host or fastidious use of setgid
directories:
$ mkdir /tmp/baz
$ chgrp nerds /tmp/baz
$ chmod 2770 /tmp/baz
$ ls -ld /tmp/baz
drwxrws--- 2 clearly nerds 40 Sep 28 16:19 /tmp/baz
$ clearly run /var/tmp/hello.sqfs -- touch /tmp/baz/foo
$ ls -l /tmp/baz/foo
-rw-rw---- 1 clearly nerds 0 Sep 28 16:21 /tmp/baz/foo
Apache Spark¶
This example is in examples/spark
. Build a SquashFS image of it and
upload it to your supercomputer.
Interactive¶
We need to first create a basic configuration for Spark, as the defaults in the Dockerfile are insufficient. For real jobs, you’ll want to also configure performance parameters such as memory use; see the documentation. First:
$ mkdir -p ~/sparkconf
$ chmod 700 ~/sparkconf
We’ll want to use the supercomputer’s high-speed network. For this example, we’ll find the Spark master’s IP manually:
$ ip -o -f inet addr show | cut -d/ -f1
1: lo inet 127.0.0.1
2: eth0 inet 192.168.8.3
8: eth1 inet 10.8.8.3
Your site support can tell you which to use. In this case, we’ll use 10.8.8.3.
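If you would rather not eyeball the listing, the address of a known interface can be extracted directly (eth1 is just the interface from the listing above; substitute whatever your site tells you):
$ ip -o -f inet addr show eth1 | awk '{print $4}' | cut -d/ -f1
10.8.8.3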
Create some configuration files. Replace [MYSECRET]
with a string only
you know. Edit to match your system; in particular, use local disks instead of
/tmp
if you have them:
$ cat > ~/sparkconf/spark-env.sh
SPARK_LOCAL_DIRS=/tmp/spark
SPARK_LOG_DIR=/tmp/spark/log
SPARK_WORKER_DIR=/tmp/spark
SPARK_LOCAL_IP=127.0.0.1
SPARK_MASTER_HOST=10.8.8.3
$ cat > ~/sparkconf/spark-defaults.conf
spark.authenticate true
spark.authenticate.secret [MYSECRET]
We can now start the Spark master:
$ clearly run -b ~/sparkconf /var/tmp/spark.sqfs -- /spark/sbin/start-master.sh
Look at the log in /tmp/spark/log
to see that the master started
correctly:
$ tail -7 /tmp/spark/log/*master*.out
17/02/24 22:37:21 INFO Master: Starting Spark master at spark://10.8.8.3:7077
17/02/24 22:37:21 INFO Master: Running Spark version 2.0.2
17/02/24 22:37:22 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/02/24 22:37:22 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://127.0.0.1:8080
17/02/24 22:37:22 INFO Utils: Successfully started service on port 6066.
17/02/24 22:37:22 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/02/24 22:37:22 INFO Master: I have been elected leader! New state: ALIVE
If you can run a web browser on the node, browse to
http://localhost:8080
for the Spark master web interface. Because this
capability varies, the tutorial does not depend on it, but it can be
informative. Refresh after each key step below.
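If you cannot run a browser on the node itself, one common workaround, where site policy allows, is an SSH tunnel from your workstation to the node running the master (the node name here is hypothetical, and you may need to hop through the front end):
$ ssh -L 8080:localhost:8080 node001
Then browse to http://localhost:8080 on your workstation as above.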
The Spark workers need to know how to reach the master. This is via a URL; you can get it from the log excerpt above, or consult the web interface. For example:
$ MASTER_URL=spark://10.8.8.3:7077
Next, start one worker on each compute node.
In this tutorial, we start the workers using srun
in a way that
prevents any subsequent srun
invocations from running until the Spark
workers exit. For our purposes here, that’s OK, but it’s a significant
limitation for some jobs. Alternatives include
pdsh
, which is the approach we use for the Spark tests
(examples/other/spark/test.bats
), or a simple for loop of ssh
calls. Both of these are also quite clunky and do not scale well.
$ srun sh -c " clearly run -b ~/sparkconf /var/tmp/spark.sqfs -- \
/spark/sbin/start-slave.sh $MASTER_URL \
&& sleep infinity" &
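For comparison, the ssh-loop alternative mentioned above might look roughly like the following. This is only a sketch; it assumes password-less SSH to the compute nodes, Slurm’s scontrol(1) to expand the node list, and the image already present at /var/tmp/spark.sqfs on every node.
$ for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do \
    ssh "$node" clearly run -b ~/sparkconf /var/tmp/spark.sqfs -- \
        /spark/sbin/start-slave.sh "$MASTER_URL"; \
  done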
One of the advantages of Spark is that it’s resilient: if a worker becomes unavailable, the computation simply proceeds without it. However, this can mask issues as well. For instance, this example will run perfectly fine with just one worker, or with all four workers on the same node, neither of which is what we want.
Check the master log to see that the right number of workers registered:
$ fgrep worker /tmp/spark/log/*master*.out
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:39890 with 16 cores, 187.8 GB RAM
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:44735 with 16 cores, 187.8 GB RAM
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:22445 with 16 cores, 187.8 GB RAM
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:29473 with 16 cores, 187.8 GB RAM
Despite the workers calling themselves 127.0.0.1, they really are running
across the allocation. (The confusion happens because of our
$SPARK_LOCAL_IP
setting above.) This can be verified by examining logs
on each compute node. For example (note single quotes):
$ ssh 10.8.8.4 -- tail -3 '/tmp/spark/log/*worker*.out'
17/02/24 22:52:24 INFO Worker: Connecting to master 10.8.8.3:7077...
17/02/24 22:52:24 INFO TransportClientFactory: Successfully created connection to /10.8.8.3:7077 after 263 ms (216 ms spent in bootstraps)
17/02/24 22:52:24 INFO Worker: Successfully registered with master spark://10.8.8.3:7077
We can now start an interactive shell to do some Spark computing:
$ clearly run -b ~/sparkconf /var/tmp/spark.sqfs -- /spark/bin/pyspark --master $MASTER_URL
Let’s use this shell to estimate 𝜋 (this is adapted from one of the Spark examples):
>>> import operator
>>> import random
>>>
>>> def sample(p):
... (x, y) = (random.random(), random.random())
... return 1 if x*x + y*y < 1 else 0
...
>>> SAMPLE_CT = int(2e8)
>>> ct = sc.parallelize(xrange(0, SAMPLE_CT)) \
... .map(sample) \
... .reduce(operator.add)
>>> 4.0*ct/SAMPLE_CT
3.14109824
(Type Control-D to exit.)
We can also submit jobs to the Spark cluster. This one runs the same example as included with the Spark source code. (The voluminous logging output is omitted.)
$ clearly run -b ~/sparkconf /var/tmp/spark.sqfs -- \
/spark/bin/spark-submit --master $MASTER_URL \
/spark/examples/src/main/python/pi.py 1024
[...]
Pi is roughly 3.141211
[...]
Exit your allocation. Slurm will clean up the Spark daemons.
Success! Next, we’ll run a similar job non-interactively.
Non-interactive¶
We’ll re-use much of the above to run the same computation non-interactively.
For brevity, the Slurm script at examples/other/spark/slurm.sh
is not
reproduced here.
Submit it as follows. It requires three arguments: the squashball, the image directory to unpack into, and the high-speed network interface. Again, consult your site administrators for the latter.
$ sbatch -N4 slurm.sh spark.sqfs /var/tmp ib0
Submitted batch job 86754
Output:
$ fgrep 'Pi is' slurm-86754.out
Pi is roughly 3.141393
Success! (to four significant digits)