2. Tutorial
This tutorial will teach you how to create and run Charliecloud images, using both examples included with the source code as well as new ones you create from scratch.
This tutorial assumes that Charliecloud is correctly installed as described in
the previous section, that the executables are on your $PATH
, and that
you have access to the examples in the source code.
Note
Shell sessions throughout this documentation will use the prompt $
to indicate commands executed natively on the host and >
for
commands executed in a container.
2.1. 90 seconds to Charliecloud
This section is for the impatient. It shows you how to quickly build and run a “hello world” Charliecloud container. If you like what you see, then proceed with the rest of the tutorial to understand what is happening and how to use Charliecloud for your own applications.
$ ch-build -t hello ~/charliecloud
Sending build context to Docker daemon 15.67 MB
[...]
Successfully built 1136de7d4c0a
$ ch-docker2tar hello /var/tmp
57M /var/tmp/hello.tar.gz
$ ch-tar2dir /var/tmp/hello.tar.gz /var/tmp
creating new image /var/tmp/hello
/var/tmp/hello unpacked ok
$ ch-run /var/tmp/hello -- echo "I'm in a container"
I'm in a container
2.2. Getting help
All the executables have decent help and can tell you what version of Charliecloud you have (if not, please report a bug). For example:
$ ch-run --help
Usage: ch-run [OPTION...] NEWROOT CMD [ARG...]
Run a command in a Charliecloud container.
[...]
$ ch-run --version
0.2.0+4836ac1
The help text is also collected later in this documentation; see Help text for executables.
2.3. Your first user-defined software stack
In this section, we will create and run a simple “hello, world” image. This
uses the hello
example in the Charliecloud source code. Start with:
$ cd examples/serial/hello
2.3.1. Defining your UDSS
You must first write a Dockerfile that describes the image you would like;
consult the Dockerfile documentation for details on how to do
this. Note that run-time functionality such as ENTRYPOINT
is not
supported.
We will use the following very simple Dockerfile:
FROM debian:jessie
RUN apt-get update \
&& apt-get install -y openssh-client \
&& rm -rf /var/lib/apt/lists/*
COPY examples/serial/hello hello
RUN touch /usr/bin/ch-ssh
This creates a minimal Debian Jessie image with ssh
installed. We will
encounter more complex Dockerfiles later in this tutorial.
Note
Docker does not update the base image unless asked to. Specific images can be updated manually; in this case:
$ sudo docker pull debian:jessie
There are various resources and scripts online to help automate this process.
2.3.2. Build Docker image
Charliecloud provides a convenience wrapper around docker build
that
works around some of its more irritating characteristics. In particular, it
passes through any HTTP proxy variables, and by default it uses the Dockerfile
in the current directory, rather than at the root of the Docker context
directory. (We will address the context directory later.)
The two arguments here are a tag for the Docker image and the context directory, which in this case is the Charliecloud source code.
$ ch-build -t hello ~/charliecloud
Sending build context to Docker daemon 15.67 MB
Step 1/4 : FROM debian:jessie
---> 86baf4e8cde9
[...]
Step 4/4 : RUN touch /usr/bin/ch-ssh
---> 1136de7d4c0a
Successfully built 1136de7d4c0a
Note that Docker prints each step of the Dockerfile as it’s executed.
ch-build
and many other Charliecloud commands wrap various
privileged docker
commands. Thus, you will be prompted for a password
to escalate as needed. Note however that most configurations of sudo
don’t require a password on every invocation, so just because you aren’t
prompted doesn’t mean privileged commands aren’t running.
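If you are unsure whether sudo would prompt right now, one quick check is the following (a sketch only; exact behavior depends on your sudoers policy):
$ sudo -n true && echo 'no password needed right now' || echo 'would prompt for a password'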
2.3.4. Flatten image
Next, we flatten the Docker image into a tarball, which is then a plain file
amenable to standard file manipulation commands. This tarball is placed in an
arbitrary directory, here /var/tmp
.
$ ch-docker2tar hello /var/tmp
57M /var/tmp/hello.tar.gz
2.3.5. Distribute tarball
Thus far, the workflow has taken place on the build system. The next step is
to copy the tarball to the run system. This can use any appropriate method for
moving files: scp
, rsync
, something integrated with the
scheduler, etc.
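For example, with scp (here, run-system is a placeholder for your run system's hostname):
$ scp /var/tmp/hello.tar.gz run-system:/var/tmp/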
If the build and run systems are the same, then no copy is needed. This is a typical use case for development and testing.
2.3.6. Unpack tarball
Charliecloud runs out of a normal directory rather than a filesystem image. In order to create this directory, we unpack the image tarball. This will replace the image directory if it already exists.
$ ch-tar2dir /var/tmp/hello.tar.gz /var/tmp
creating new image /var/tmp/hello
/var/tmp/hello unpacked ok
Generally, you should avoid unpacking into shared filesystems such as NFS and
Lustre, in favor of local storage such as tmpfs
and local hard disks.
This will yield better performance for you and anyone else on the shared
filesystem.
Note
You can run perfectly well out of /tmp
, but because it is
bind-mounted automatically, the image root will then appear in multiple
locations in the container’s filesystem tree. This can cause confusion for
both users and programs.
2.3.7. Activate image
We are now ready to run programs inside a Charliecloud container. This is done
with the ch-run
command:
$ ch-run /var/tmp/hello -- echo hello
hello
Symbolic links in /proc
tell us the current namespaces, which are
identified by long ID numbers:
$ ls -l /proc/self/ns
total 0
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 11:24 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 11:24 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 11:24 net -> net:[4026531969]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 11:24 pid -> pid:[4026531836]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 11:24 user -> user:[4026531837]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 11:24 uts -> uts:[4026531838]
$ ch-run /var/tmp/hello -- ls -l /proc/self/ns
total 0
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 17:34 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 17:34 mnt -> mnt:[4026532257]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 17:34 net -> net:[4026531969]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 17:34 pid -> pid:[4026531836]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 17:34 user -> user:[4026532256]
lrwxrwxrwx 1 reidpr reidpr 0 Sep 28 17:34 uts -> uts:[4026531838]
Notice that the container has different mount (mnt
) and user
(user
) namespaces, but the rest of the namespaces are shared with the
host. This highlights Charliecloud’s focus on functionality (make your UDSS
run), rather than isolation (protect the host from your UDSS).
Each invocation of ch-run
creates a new container, so if you have
multiple simultaneous invocations, they will not share containers. However,
container overhead is minimal, and containers communicate without hassle, so
this is generally of peripheral interest.
Note
The --
in the ch-run
command line is a standard argument
that separates options from non-option arguments. Without it,
ch-run
would try (and fail) to interpret ls
’s -l
argument.
These IDs are available both in the symlink target and in its inode number:
$ stat -L --format='%i' /proc/self/ns/user
4026531837
$ ch-run /var/tmp/hello -- stat -L --format='%i' /proc/self/ns/user
4026532256
You can also run interactive commands, such as a shell:
$ ch-run /var/tmp/hello -- /bin/bash
> stat -L --format='%i' /proc/self/ns/user
4026532256
> exit
Be aware that wildcards in the ch-run
command are interpreted by the
host, not the container, unless protected. One workaround is to use a
sub-shell. For example:
$ ls /usr/bin/oldfind
ls: cannot access '/usr/bin/oldfind': No such file or directory
$ ch-run /var/tmp/hello -- ls /usr/bin/oldfind
/usr/bin/oldfind
$ ls /usr/bin/oldf*
ls: cannot access '/usr/bin/oldf*': No such file or directory
$ ch-run /var/tmp/hello -- ls /usr/bin/oldf*
ls: cannot access /usr/bin/oldf*: No such file or directory
$ ch-run /var/tmp/hello -- sh -c 'ls /usr/bin/oldf*'
/usr/bin/oldfind
You have now successfully run commands within a single-node Charliecloud container. Next, we explore how Charliecloud accesses host resources.
2.4. Interacting with the host
Charliecloud is not an isolation layer, so containers have full access to host resources, with a few quirks. This section demonstrates how this works.
2.4.1. Filesystems
Charliecloud makes host directories available inside the container using bind mounts. A bind mount is somewhat like a hard link: it causes a file or directory to appear in multiple places in the filesystem tree, but it is a property of the running kernel rather than the filesystem.
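One way to see this: a bind-mounted directory such as /tmp (always mounted, as described below) has the same device and inode numbers inside and outside the container, because it is literally the same directory. (The numbers themselves vary by system.)
$ stat --format='%d:%i' /tmp
$ ch-run /var/tmp/hello -- stat --format='%d:%i' /tmp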
Several host directories are always bind-mounted into the container. These
include system directories such as /dev
, /proc
, and
/sys
; /tmp
; Charliecloud’s ch-ssh
command in
/usr/bin
; and the invoking user’s home directory (for dotfiles),
unless --no-home
is specified.
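For example, to see the home-directory bind mount and the effect of --no-home (output omitted, as it varies by system):
$ ch-run /var/tmp/hello -- ls /home
$ ch-run --no-home /var/tmp/hello -- ls /home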
Charliecloud uses recursive bind mounts, so for example if the host has a
variety of sub-filesystems under /sys
, as Ubuntu does, these will be
available in the container as well.
In addition to the default bind mounts, arbitrary user-specified directories
can be added using the --bind
or -b
switch. By default,
/mnt/0
, /mnt/1
, etc., are used for the destination in the guest:
$ mkdir /var/tmp/foo0
$ echo hello > /var/tmp/foo0/bar
$ mkdir /var/tmp/foo1
$ echo world > /var/tmp/foo1/bar
$ ch-run -b /var/tmp/foo0 -b /var/tmp/foo1 /var/tmp/hello -- bash
> ls /mnt
0 1 2 3 4 5 6 7 8 9
> cat /mnt/0/bar
hello
> cat /mnt/1/bar
world
Explicit destinations are also possible:
$ ch-run -b /var/tmp/foo0:/mnt /var/tmp/hello -- bash
> ls /mnt
bar
> cat /mnt/bar
hello
2.4.2. Network
Charliecloud containers share the host’s network namespace, so most network things should be the same.
However, SSH is not aware of Charliecloud containers. If you SSH to a node
where Charliecloud is installed, you will get a shell on the host, not in a
container, even if ssh
was initiated from a container:
$ stat -L --format='%i' /proc/self/ns/user
4026531837
$ ssh localhost stat -L --format='%i' /proc/self/ns/user
4026531837
$ ch-run /var/tmp/hello -- /bin/bash
> stat -L --format='%i' /proc/self/ns/user
4026532256
> ssh localhost stat -L --format='%i' /proc/self/ns/user
4026531837
There are several ways to SSH to a remote node and run commands inside a
container. The simplest is to manually invoke ch-run
in the
ssh
command:
$ ssh localhost ch-run /var/tmp/hello -- stat -L --format='%i' /proc/self/ns/user
4026532256
Note
Recall that each ch-run invocation creates a new container. That is, the ssh command above has not entered the existing user namespace ending in 2256; rather, the kernel has re-used that namespace ID for the new container's namespace.
Another is to use the ch-ssh
wrapper program, which adds
ch-run
to the ssh
command implicitly. It takes the
ch-run
arguments from the environment variable CH_RUN_ARGS
,
making it mostly a drop-in replacement for ssh
. For example:
$ export CH_RUN_ARGS="/var/tmp/hello --"
$ ch-ssh localhost stat -L --format='%i' /proc/self/ns/user
4026532256
$ ch-ssh -t localhost /bin/bash
> stat -L --format='%i' /proc/self/ns/user
4026532256
ch-ssh
is available inside containers as well (in /usr/bin
via
bind-mount):
$ export CH_RUN_ARGS="/var/tmp/hello --"
$ ch-run /var/tmp/hello -- /bin/bash
> stat -L --format='%i' /proc/self/ns/user
4026532256
> ch-ssh localhost stat -L --format='%i' /proc/self/ns/user
4026532258
This also demonstrates that ch-run
does not alter your environment
variables.
Warning
- CH_RUN_ARGS is interpreted very simply; the sole delimiter is spaces. It is not shell syntax. In particular, quotes and backslashes are not interpreted.
- The -t argument is required for SSH to allocate a pseudo-TTY and thus convince your shell to be interactive. Otherwise, in the case of Bash, you'll get a shell that accepts commands but doesn't print prompts, among other issues. (Issue #2.)
A third approach is to edit one's shell initialization scripts to check the command line and exec(1) ch-run if appropriate. This is brittle but avoids wrapping ssh or altering its command line.
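A minimal sketch of this idea, for illustration only (the image path and the guard variable are hypothetical, and exactly when bash reads ~/.bashrc varies by distribution):
# Hypothetical fragment for ~/.bashrc on the remote host.
if [ -n "$SSH_CONNECTION" ] && [ -z "$CH_IN_CONTAINER" ]; then
    export CH_IN_CONTAINER=yes   # ch-run preserves the environment, so this prevents recursion
    if [ -n "$BASH_EXECUTION_STRING" ]; then
        # "ssh host command" case: re-run the requested command inside the container.
        exec ch-run /var/tmp/hello -- /bin/sh -c "$BASH_EXECUTION_STRING"
    else
        # Interactive login case: replace this shell with one inside the container.
        exec ch-run /var/tmp/hello -- /bin/bash -l
    fi
fi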
2.4.3. User and group IDs
Unlike Docker and some other container systems, Charliecloud tries to make the
container’s users and groups look the same as the host’s. (This is
accomplished by bind-mounting /etc/passwd
and /etc/group
into
the container.) For example:
$ id -u
901
$ whoami
reidpr
$ ch-run /var/tmp/hello -- bash
> id -u
901
> whoami
reidpr
More specifically, the user namespace, when created without privileges as
Charliecloud does, lets you map any container UID to your host UID.
ch-run
implements this with the --uid
switch. So, for example,
you can tell Charliecloud you want to be root, and it will tell you that
you’re root:
$ ch-run --uid 0 /var/tmp/hello -- bash
> id -u
0
> whoami
root
But, this doesn’t get you anything useful, because the container UID is mapped back to your UID on the host before permission checks are applied:
> dd if=/dev/mem of=/tmp/pwned
dd: failed to open '/dev/mem': Permission denied
This mapping also affects how users are displayed. For example, if a file is
owned by you, your host UID will be mapped to your container UID, which is
then looked up in /etc/passwd
to determine the display name. In
typical usage without --uid
, this mapping is a no-op, so everything
looks normal:
$ ls -nd ~
drwxr-xr-x 87 901 901 4096 Sep 28 12:12 /home/reidpr
$ ls -ld ~
drwxr-xr-x 87 reidpr reidpr 4096 Sep 28 12:12 /home/reidpr
$ ch-run /var/tmp/hello -- bash
> ls -nd ~
drwxr-xr-x 87 901 901 4096 Sep 28 18:12 /home/reidpr
> ls -ld ~
drwxr-xr-x 87 reidpr reidpr 4096 Sep 28 18:12 /home/reidpr
But if --uid
is provided, things can seem odd. For example:
$ ch-run --uid 0 /var/tmp/hello -- bash
> ls -nd /home/reidpr
drwxr-xr-x 87 0 901 4096 Sep 28 18:12 /home/reidpr
> ls -ld /home/reidpr
drwxr-xr-x 87 root reidpr 4096 Sep 28 18:12 /home/reidpr
This UID mapping can contain only one pair: an arbitrary container UID to your
effective UID on the host. Thus, all other users are unmapped, and they show
up as nobody
:
$ ls -n /tmp/foo
-rw-rw---- 1 902 902 0 Sep 28 15:40 /tmp/foo
$ ls -l /tmp/foo
-rw-rw---- 1 sig sig 0 Sep 28 15:40 /tmp/foo
$ ch-run /var/tmp/hello -- bash
> ls -n /tmp/foo
-rw-rw---- 1 65534 65534 843 Sep 28 21:40 /tmp/foo
> ls -l /tmp/foo
-rw-rw---- 1 nobody nogroup 843 Sep 28 21:40 /tmp/foo
User namespaces have a similar mapping for GIDs, with the same limitation —
exactly one arbitrary container GID maps to your effective primary GID. This
can lead to some strange-looking results, because only one of your GIDs can be
mapped in any given container. All the rest become nogroup
:
$ id
uid=901(reidpr) gid=901(reidpr) groups=901(reidpr),903(nerds),904(losers)
$ ch-run /var/tmp/hello -- id
uid=901(reidpr) gid=901(reidpr) groups=901(reidpr),65534(nogroup)
$ ch-run --gid 903 /var/tmp/hello -- id
uid=901(reidpr) gid=903(nerds) groups=903(nerds),65534(nogroup)
However, this doesn’t affect access. The container process retains the same GIDs from the host perspective, and as always, the host IDs are what control access:
$ ls -l /tmp/primary /tmp/supplemental
-rw-rw---- 1 sig reidpr 0 Sep 28 15:47 /tmp/primary
-rw-rw---- 1 sig nerds 0 Sep 28 15:48 /tmp/supplemental
$ ch-run /var/tmp/hello -- bash
> cat /tmp/primary > /dev/null
> cat /tmp/supplemental > /dev/null
One area where functionality is reduced is that chgrp(1)
becomes
useless. Using an unmapped group or nogroup
fails, and using a mapped
group is a no-op because it’s mapped back to the host GID:
$ ls -l /tmp/bar
-rw-rw---- 1 reidpr reidpr 0 Sep 28 16:12 /tmp/bar
$ ch-run /var/tmp/hello -- chgrp nerds /tmp/bar
chgrp: changing group of '/tmp/bar': Invalid argument
$ ch-run /var/tmp/hello -- chgrp nogroup /tmp/bar
chgrp: changing group of '/tmp/bar': Invalid argument
$ ch-run --gid 903 /var/tmp/hello -- chgrp nerds /tmp/bar
$ ls -l /tmp/bar
-rw-rw---- 1 reidpr reidpr 0 Sep 28 16:12 /tmp/bar
Workarounds include chgrp(1)
on the host or fastidious use of setgid
directories:
$ mkdir /tmp/baz
$ chgrp nerds /tmp/baz
$ chmod 2770 /tmp/baz
$ ls -ld /tmp/baz
drwxrws--- 2 reidpr nerds 40 Sep 28 16:19 /tmp/baz
$ ch-run /var/tmp/hello -- touch /tmp/baz/foo
$ ls -l /tmp/baz/foo
-rw-rw---- 1 reidpr nerds 0 Sep 28 16:21 /tmp/baz/foo
This concludes our discussion of how a Charliecloud container interacts with its host and principal Charliecloud quirks. We next move on to installing software.
2.5. Installing your own software
This section covers four situations for making software available inside a Charliecloud container:
- Third-party software installed into the image using a package manager.
- Third-party software compiled from source into the image.
- Your software installed into the image.
- Your software stored on the host but compiled in the container.
Many of Docker’s Best practices for writing Dockerfiles apply to Charliecloud images as well, so you should be familiar with that document.
Note
Maybe you don’t have to install the software at all. Is there already a trustworthy image on Docker Hub you can use as a base?
2.5.1. Third-party software via package manager
This approach is the simplest and fastest way to install stuff in your image.
The examples/serial/hello
Dockerfile also seen above does this to install the
package openssh-client
:
FROM debian:jessie
RUN apt-get update \
&& apt-get install -y openssh-client \
&& rm -rf /var/lib/apt/lists/*
You can use distribution package managers such as apt-get
, as
demonstrated above, or others, such as pip
for Python packages.
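For example, a hypothetical Dockerfile line using pip (assuming pip is already installed in your image):
RUN pip install --no-cache-dir requests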
Be aware that the software will be downloaded anew each time you build the image, unless you add an HTTP cache, which is beyond the scope of this tutorial.
2.5.2. Third-party software compiled from source
Under this method, one uses RUN
commands to fetch the desired software
using curl
or wget
, compile it, and install. Our example does
this with two chained Dockerfiles. First, we build a basic Debian image
(test/Dockerfile.debian8
):
FROM debian:jessie
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install -y apt-utils
Then, we add OpenMPI with test/Dockerfile.debian8openmpi
:
FROM debian8
# OS packages needed to build OpenMPI.
RUN apt-get install -y \
file \
flex \
g++ \
gcc \
gfortran \
less \
libdb5.3-dev \
make \
wget
# Compile OpenMPI. We can't use the Debian package because
# --disable-pty-support is needed to avoid "pipe function call failed when
# setting up I/O forwarding subsystem".
ENV MPI_URL https://www.open-mpi.org/software/ompi/v1.10/downloads
ENV MPI_VERSION 1.10.5
WORKDIR /usr/src
RUN wget -nv ${MPI_URL}/openmpi-${MPI_VERSION}.tar.gz
RUN tar xf openmpi-${MPI_VERSION}.tar.gz
RUN cd openmpi-${MPI_VERSION} \
&& CFLAGS=-O3 \
CXXFLAGS=-O3 \
./configure --prefix=/usr \
--sysconfdir=/mnt/0 \
--disable-pty-support \
&& make -j$(getconf _NPROCESSORS_ONLN) install
RUN rm -Rf openmpi-${MPI_VERSION}*
So what is going on here?
- Use the latest Debian, Jessie, as the base image.
- Install a basic build system using the OS package manager.
- Download and untar OpenMPI. Note the use of variables to make adjusting the URL and MPI version easier, as well as the explanation of why we're not using apt-get, given that OpenMPI 1.10 is included in Debian.
- Build and install OpenMPI. Note the getconf trick (demonstrated below) to guess at an appropriate parallel build.
- Clean up, in order to reduce the size of layers as well as the resulting Charliecloud tarball (rm -Rf).
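The getconf trick can be tried directly on any host; it reports the number of processors currently online, which is a reasonable default for make -j:
$ getconf _NPROCESSORS_ONLN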
2.5.3. Your software stored in the image
This method covers software provided by you that is included in the image. This is recommended when your software is relatively stable or is not easily available to users of your image, for example a library rather than simulation code under active development.
The general approach is the same as installing third-party software from
source, but you use the COPY
instruction to transfer files from the
host filesystem (rather than the network via HTTP) to the image. For example,
the mpihello
Dockerfile uses this approach:
COPY /examples/mpi/mpihello /hello
WORKDIR /hello
RUN make clean && make
These Dockerfile instructions:
- Copy the host directory examples/mpi/mpihello to the image at path /hello. The host path is relative to the context directory; Docker builds have no access to the host filesystem outside the context directory. (This is so the Docker daemon can run on a different machine: the context directory is tarred up and sent to the daemon, even if it's on the same machine.) The convention for the Charliecloud examples is that the build directory is always rooted at the top of the Charliecloud source code, but we could just as easily have provided the mpihello directory itself; in that case, the source in COPY would have been "." (i.e., COPY . /hello).
- cd to /hello (the WORKDIR instruction).
- Compile our example. We include make clean to remove any leftover build files, since they would be inappropriate inside the container.
Once the image is built, we can see the results. (Install the image into
/var/tmp
as outlined above, if you haven’t already.)
$ ch-run /var/tmp/mpihello -- ls -lh /hello
total 32K
-rw-rw---- 1 reidpr reidpr 908 Oct 4 15:52 Dockerfile
-rw-rw---- 1 reidpr reidpr 157 Aug 5 22:37 Makefile
-rw-rw---- 1 reidpr reidpr 1.2K Aug 5 22:37 README
-rwxr-x--- 1 reidpr reidpr 9.5K Oct 4 15:58 hello
-rw-rw---- 1 reidpr reidpr 1.4K Aug 5 22:37 hello.c
-rwxrwx--- 1 reidpr reidpr 441 Aug 5 22:37 test.sh
We will revisit this image later.
2.5.4. Your software stored on the host
This method leaves your software on the host but compiles it in the image. This is recommended when your software is volatile or each image user needs a different version, for example a simulation code under active development.
The general approach is to bind-mount the appropriate directory and then run
the build inside the container. We can re-use the mpihello
image to
demonstrate this.
$ cd examples/mpi/mpihello
$ ls -l
total 20
-rw-rw---- 1 reidpr reidpr 908 Oct 4 09:52 Dockerfile
-rw-rw---- 1 reidpr reidpr 1431 Aug 5 16:37 hello.c
-rw-rw---- 1 reidpr reidpr 157 Aug 5 16:37 Makefile
-rw-rw---- 1 reidpr reidpr 1172 Aug 5 16:37 README
$ ch-run -b . /var/tmp/mpihello -- sh -c 'cd /mnt/0 && make'
mpicc -std=gnu11 -Wall hello.c -o hello
$ ls -l
total 32
-rw-rw---- 1 reidpr reidpr 908 Oct 4 09:52 Dockerfile
-rwxrwx--- 1 reidpr reidpr 9632 Oct 4 10:43 hello
-rw-rw---- 1 reidpr reidpr 1431 Aug 5 16:37 hello.c
-rw-rw---- 1 reidpr reidpr 157 Aug 5 16:37 Makefile
-rw-rw---- 1 reidpr reidpr 1172 Aug 5 16:37 README
A common use case is to leave a container shell open in one terminal for building, and then run using a separate container invoked from a different terminal.
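A sketch of this workflow with the mpihello example (the run command in the second terminal depends on your program):
$ ch-run -b . /var/tmp/mpihello -- /bin/bash
> cd /mnt/0 && make
(then, in a second terminal)
$ ch-run -b . /var/tmp/mpihello -- /mnt/0/hello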
2.6. Your first single-node, multi-process jobs
This is an important use case even for large-scale codes: testing and development often happen at small scale but need an environment comparable to large-scale runs.
This tutorial covers three approaches:
- Processes are coordinated by the host, i.e., one process per container.
- Processes are coordinated by the container, i.e., one container with multiple processes, using configuration files from the container.
- Processes are coordinated by the container using configuration files from the host.
In order to test approach 1, you must install OpenMPI 1.10.x on the host. In our experience, we have had success compiling from source with the same options as in the Dockerfile, but there is probably more nuance to the match than we’ve discovered.
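For reference, a plausible host-side build mirroring the Dockerfile above (the install prefix is illustrative, and your site may need additional configure options or compiler modules):
$ wget -nv https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.5.tar.gz
$ tar xf openmpi-1.10.5.tar.gz
$ cd openmpi-1.10.5
$ CFLAGS=-O3 CXXFLAGS=-O3 ./configure --prefix=$HOME/opt/openmpi --disable-pty-support
$ make -j$(getconf _NPROCESSORS_ONLN) install
$ export PATH=$HOME/opt/openmpi/bin:$PATH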
2.6.1. Processes coordinated by host
This approach does the forking and process coordination on the host. Each process is spawned in its own container, and because Charliecloud introduces minimal isolation, they can communicate as if they were running directly on the host.
For example, using mpirun
and the mpihello
example above:
$ mpirun --version
mpirun (Open MPI) 1.10.2
$ stat -L --format='%i' /proc/self/ns/user
4026531837
$ ch-run /var/tmp/mpihello -- mpirun --version
mpirun (Open MPI) 1.10.4
$ mpirun -np 4 ch-run /var/tmp/mpihello -- /hello/hello
0: init ok cn001, 4 ranks, userns 4026532256
1: init ok cn001, 4 ranks, userns 4026532267
2: init ok cn001, 4 ranks, userns 4026532269
3: init ok cn001, 4 ranks, userns 4026532271
0: send/receive ok
0: finalize ok
The advantage is that we can easily make use of host-specific things such as configurations; the disadvantage is that it introduces a close coupling between host and container that can manifest in complex ways. For example, while OpenMPI 1.10.2 worked with 1.10.4 above, both had to be compiled with the same options. The OpenMPI 1.10.2 packages that come with Ubuntu fail with "orte_util_nidmap_init failed" if run with the container's 1.10.4.
2.6.2. Processes coordinated by container
This approach starts a single container process, which then forks and coordinates the parallel work. The advantage is that this approach is completely independent of the host for dependency configuration and installation; the disadvantage is that it cannot take advantage of any host-specific things that might e.g. improve performance.
For example:
$ ch-run /var/tmp/mpihello -- mpirun -np 4 /hello/hello
0: init ok cn001, 4 ranks, userns 4026532256
1: init ok cn001, 4 ranks, userns 4026532256
2: init ok cn001, 4 ranks, userns 4026532256
3: init ok cn001, 4 ranks, userns 4026532256
0: send/receive ok
0: finalize ok
2.6.3. Processes coordinated by container using host configuration
This approach is a middle ground. The use case is when there is some host-specific configuration we want to use, but we don’t want to install the entire configured dependency on the host. It would be undesirable to copy this configuration into the image, because that would reduce its portability.
The host configuration is communicated to the container by bind-mounting the relevant directory and then pointing the application to it. There are a variety of approaches; some applications or frameworks take command-line parameters specifying the configuration path.
The approach used in our example is to set the configuration directory to
/mnt/0
. This is done in mpihello
with the --sysconfdir
argument:
RUN cd openmpi-${VERSION} \
&& CFLAGS=-O3 CXXFLAGS=-O3 \
./configure --prefix=/usr --sysconfdir=/mnt/0 \
--disable-pty-support --disable-mpi-cxx --disable-mpi-fortran \
&& make -j$(getconf _NPROCESSORS_ONLN) install
The effect is that the image contains a default MPI configuration, but if you
specify a different configuration directory with --bind
, that is
overmounted and used instead. For example:
$ ch-run -b /usr/local/etc /var/tmp/mpihello -- mpirun -np 4 /hello/hello
0: init ok cn001, 4 ranks, userns 4026532256
1: init ok cn001, 4 ranks, userns 4026532256
2: init ok cn001, 4 ranks, userns 4026532256
3: init ok cn001, 4 ranks, userns 4026532256
0: send/receive ok
0: finalize ok
A similar approach creates a dangling symlink with RUN
that is
resolved when the appropriate host directory is bind-mounted into
/mnt
.
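For example, a Dockerfile line like the following (the paths are purely illustrative) creates a link whose target does not exist at build time but resolves once a host directory is bind-mounted at /mnt/0:
RUN ln -s /mnt/0/app.conf /etc/app.conf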
2.7. Your first multi-node jobs
This section assumes that you are using a Slurm cluster with a working OpenMPI
1.10.x installation and some type of node-local storage. A tmpfs
is recommended, and we use /var/tmp
for this tutorial. (Using
/tmp
often works but can cause confusion because it’s shared by the
container and host, yielding cycles in the directory tree.)
We cover four cases:
- The MPI hello world example above, run interactively, with the host coordinating.
- Same, non-interactive.
- An Apache Spark example, run interactively.
- Same, non-interactive.
We think that container-coordinated MPI jobs will also work, but we haven’t worked out how to do this yet. (See issue #5.)
Note
The image directory is mounted read-only by default so it can be shared by
multiple Charliecloud containers in the same or different jobs. It can be
mounted read-write with ch-run -w
.
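For example (the file name here is arbitrary):
$ ch-run /var/tmp/mpihello -- touch /write-test      # fails: image mounted read-only
$ ch-run -w /var/tmp/mpihello -- touch /write-test   # succeeds: image mounted read-write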
Warning
The image can reside on most filesystems, but be aware of metadata impact. A non-trivial Charliecloud job may overwhelm a network filesystem, earning you the ire of your sysadmins and colleagues. (NFS sometimes does not work for read-only images; see issue #9.)
2.7.1. Interactive MPI hello world
First, obtain an interactive allocation of nodes. This tutorial assumes an allocation of 4 nodes (but any number should work) and an interactive shell on one of those nodes. For example:
$ salloc -N4
We also need OpenMPI 1.10.x available and with the correct mapping policy:
$ mpirun --version
mpirun (Open MPI) 1.10.5
$ export OMPI_MCA_rmaps_base_mapping_policy=
The next step is to distribute the image tarball to the compute nodes. To do
so, we run one instance of ch-tar2dir
on each node:
$ mpirun -pernode ch-tar2dir mpihello.tar.gz /var/tmp
App launch reported: 4 (out of 4) daemons - 3 (out of 4) procs
creating new image /var/tmp/mpihello
creating new image /var/tmp/mpihello
creating new image /var/tmp/mpihello
creating new image /var/tmp/mpihello
/var/tmp/mpihello unpacked ok
/var/tmp/mpihello unpacked ok
/var/tmp/mpihello unpacked ok
/var/tmp/mpihello unpacked ok
We can now activate the image and run our program:
$ mpirun ch-run /var/tmp/mpihello -- /hello/hello
App launch reported: 4 (out of 4) daemons - 48 (out of 64) procs
2: init ok cn001, 64 ranks, userns 4026532567
4: init ok cn001, 64 ranks, userns 4026532571
8: init ok cn001, 64 ranks, userns 4026532579
[...]
45: init ok cn003, 64 ranks, userns 4026532589
17: init ok cn002, 64 ranks, userns 4026532565
55: init ok cn004, 64 ranks, userns 4026532577
0: send/receive ok
0: finalize ok
Success!
2.7.2. Non-interactive MPI hello world
Production jobs are normally run non-interactively, via submission of a job script that runs when resources are available, placing output into a file.
The MPI hello world example includes such a script, slurm.sh
:
#!/bin/bash
#SBATCH --time=0:10:00
# Arguments: Path to tarball, path to image parent directory.
set -e
TAR="$1"
IMGDIR="$2"
IMG="$2/$(basename "${TAR%.tar.gz}")"
if [[ -z $TAR ]]; then
echo 'no tarball specified' 1>&2
exit 1
fi
printf 'tarball: %s\n' "$TAR"
if [[ -z $IMGDIR ]]; then
echo 'no image directory specified' 1>&2
exit 1
fi
printf 'image: %s\n' "$IMG"
# Make Charliecloud available (varies by site).
module purge
module load openmpi
module load sandbox
module load charliecloud
# Makes "mpirun -pernode" work.
export OMPI_MCA_rmaps_base_mapping_policy=
# MPI version on host.
printf 'host: '
mpirun --version | egrep '^mpirun'
# Unpack image.
mpirun -pernode ch-tar2dir $TAR $IMGDIR
# MPI version in container.
printf 'container: '
ch-run $IMG -- mpirun --version | egrep '^mpirun'
# Run the app.
mpirun ch-run $IMG -- /hello/hello
Note that this script both unpacks the image and runs it.
Submit it with something like:
$ sbatch -N4 slurm.sh ~/mpihello.tar.gz /var/tmp
207745
When the job is complete, look at the output:
$ cat slurm-207745.out
tarball: /home/reidpr/mpihello.tar.gz
image: /var/tmp/mpihello
host: mpirun (Open MPI) 1.10.5
App launch reported: 4 (out of 4) daemons - 3 (out of 4) procs
creating new image /var/tmp/mpihello
creating new image /var/tmp/mpihello
[...]
/var/tmp/mpihello unpacked ok
/var/tmp/mpihello unpacked ok
container: mpirun (Open MPI) 1.10.5
App launch reported: 4 (out of 4) daemons - 32 (out of 64) procs
2: init ok cn004, 64 ranks, userns 4026532604
3: init ok cn004, 64 ranks, userns 4026532606
4: init ok cn004, 64 ranks, userns 4026532608
[...]
63: init ok cn007, 64 ranks, userns 4026532630
30: init ok cn005, 64 ranks, userns 4026532628
27: init ok cn005, 64 ranks, userns 4026532622
0: send/receive ok
0: finalize ok
Success!
2.7.3. Interactive Apache Spark
This example is in examples/spark
. Build a tarball and upload it to
your cluster.
Once you have an interactive job, unpack the tarball.
$ srun ch-tar2dir spark.tar.gz /var/tmp
creating new image /var/tmp/spark
creating new image /var/tmp/spark
[...]
/var/tmp/spark unpacked ok
/var/tmp/spark unpacked ok
We first need to create a basic configuration for Spark, as the defaults in the Dockerfile are insufficient. (For real jobs, you'll also want to configure performance parameters such as memory use; see the documentation.) Start by creating a private configuration directory:
$ mkdir -p ~/sparkconf
$ chmod 700 ~/sparkconf
We’ll want to use the cluster’s high-speed network. For this example, we’ll find the Spark master’s IP manually:
$ ip -o -f inet addr show | cut -d/ -f1
1: lo inet 127.0.0.1
2: eth0 inet 192.168.8.3
8: eth1 inet 10.8.8.3
Your site support can tell you which to use. In this case, we’ll use 10.8.8.3.
Create some configuration files. Replace [MYSECRET]
with a string only
you know. Edit to match your system; in particular, use local disks instead of
/tmp
if you have them:
$ cat > ~/sparkconf/spark-env.sh
SPARK_LOCAL_DIRS=/tmp/spark
SPARK_LOG_DIR=/tmp/spark/log
SPARK_WORKER_DIR=/tmp/spark
SPARK_LOCAL_IP=127.0.0.1
SPARK_MASTER_HOST=10.8.8.3
$ cat > ~/sparkconf/spark-defaults.conf
spark.authenticate true
spark.authenticate.secret [MYSECRET]
We can now start the Spark master:
$ ch-run -b ~/sparkconf /var/tmp/spark -- /spark/sbin/start-master.sh
Look at the log in /tmp/spark/log
to see that the master started
correctly:
$ tail -7 /tmp/spark/log/*master*.out
17/02/24 22:37:21 INFO Master: Starting Spark master at spark://10.8.8.3:7077
17/02/24 22:37:21 INFO Master: Running Spark version 2.0.2
17/02/24 22:37:22 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/02/24 22:37:22 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://127.0.0.1:8080
17/02/24 22:37:22 INFO Utils: Successfully started service on port 6066.
17/02/24 22:37:22 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/02/24 22:37:22 INFO Master: I have been elected leader! New state: ALIVE
If you can run a web browser on the node, browse to
http://localhost:8080
for the Spark master web interface. Because this
capability varies, the tutorial does not depend on it, but it can be
informative. Refresh after each key step below.
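If you cannot run a browser on the node itself, one common workaround (not required by this tutorial) is SSH port forwarding from your workstation; cn001 here is a placeholder for the node running the master:
$ ssh -L 8080:localhost:8080 cn001
Then browse to http://localhost:8080 on your workstation as usual.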
The Spark workers need to know how to reach the master. This is via a URL; you can derive it from the above, or consult the web interface. For example:
$ MASTER_URL=spark://10.8.8.3:7077
Next, start one worker on each compute node. This is a little ugly;
mpirun
will wait until everything is finished before returning, but we
want to start the workers in the background, so we add &
and introduce
a race condition. (srun
has different, even less helpful behavior: it
kills the worker as soon as it goes into the background.)
$ mpirun -map-by '' -pernode ch-run -b ~/sparkconf /var/tmp/spark -- \
/spark/sbin/start-slave.sh $MASTER_URL &
One of the advantages of Spark is that it's resilient: if a worker becomes unavailable, the computation simply proceeds without it. However, this can mask issues as well. For example, the job will run perfectly fine with just one worker on the same node as the master, which isn't what we want.
Check the master log to see that the right number of workers registered:
$ fgrep worker /tmp/spark/log/*master*.out
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:39890 with 16 cores, 187.8 GB RAM
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:44735 with 16 cores, 187.8 GB RAM
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:22445 with 16 cores, 187.8 GB RAM
17/02/24 22:52:24 INFO Master: Registering worker 127.0.0.1:29473 with 16 cores, 187.8 GB RAM
Despite the workers calling themselves 127.0.0.1, they really are running
across the allocation. (The confusion happens because of our
$SPARK_LOCAL_IP
setting above.) This can be verified by examining logs
on each compute node. For example:
$ ssh 10.8.8.4
$ tail -3 /tmp/spark/log/*worker*.out
17/02/24 22:52:24 INFO Worker: Connecting to master 10.8.8.3:7077...
17/02/24 22:52:24 INFO TransportClientFactory: Successfully created connection to /10.8.8.3:7077 after 263 ms (216 ms spent in bootstraps)
17/02/24 22:52:24 INFO Worker: Successfully registered with master spark://10.8.8.3:7077
$ exit
We can now start an interactive shell to do some Spark computing:
$ ch-run -b ~/sparkconf /var/tmp/spark -- /spark/bin/pyspark --master $MASTER_URL
Let’s use this shell to estimate 𝜋 (this is adapted from one of the Spark examples):
>>> import operator
>>> import random
>>>
>>> def sample(p):
... (x, y) = (random.random(), random.random())
... return 1 if x*x + y*y < 1 else 0
...
>>> SAMPLE_CT = int(2e8)
>>> ct = sc.parallelize(xrange(0, SAMPLE_CT)) \
... .map(sample) \
... .reduce(operator.add)
>>> 4.0*ct/SAMPLE_CT
3.14109824
(Type Control-D to exit.)
We can also submit jobs to the Spark cluster. This one runs the same example as included with the Spark source code. (The voluminous logging output is omitted.)
$ ch-run -b ~/sparkconf /var/tmp/spark -- \
/spark/bin/spark-submit --master $MASTER_URL \
/spark/examples/src/main/python/pi.py 1024
[...]
Pi is roughly 3.141211
[...]
Exit your allocation. Slurm will clean up the Spark daemons.
Success! Next, we’ll run a similar job non-interactively.
2.7.4. Non-interactive Apache Spark
We’ll re-use much of the above to run the same computation non-interactively.
For brevity, the Slurm script at examples/other/spark/slurm.sh is not reproduced here.
Submit it as follows. It requires three arguments: the tarball, the image directory to unpack into, and the high-speed network interface. Again, consult your site administrators for the latter.
$ sbatch -N4 slurm.sh spark.tar.gz /var/tmp eth1
Submitted batch job 86754
Output:
$ fgrep 'Pi is' slurm-86754.out
Pi is roughly 3.141393
Success! (to four significant digits)