Create from config template with initial_setup

Jip-Hop 2024-02-11 18:30:47 +01:00
parent f16a91f81d
commit 8571caa431
10 changed files with 656 additions and 260 deletions


@ -45,7 +45,13 @@ Creating a jail is interactive. You'll be presented with questions which guide y
jlmkr create myjail
```
After answering a few questions you should have your first jail up and running!
After answering some questions you should have your first jail up and running!
You may also specify the path to a config template for a quick and consistent jail creation process.
```shell
jlmkr create myjail /path/to/config/template
```
### Startup Jails on Boot


@ -1,120 +0,0 @@
# Rootless podman in rootless Fedora jail
## Disclaimer
**These notes are a work in progress. Using podman in this setup hasn't been extensively tested.**
## Installation
Prerequisites: jailmaker installed and bridge networking set up.
Run `jlmkr create rootless` to create a new jail. During jail creation choose fedora 39, so we get the most recent version of podman available. Don't enable docker compatibility; we're going to enable only the required options manually.
Add `--network-bridge=br1 --resolv-conf=bind-host --system-call-filter='add_key keyctl bpf' --private-users=524288:65536 --private-users-ownership=chown` when asked for additional systemd-nspawn flags during jail creation.
We start at UID 524288, as this is the [systemd range used for containers](https://github.com/systemd/systemd/blob/main/docs/UIDS-GIDS.md#summary).
The `--private-users-ownership=chown` option will ensure the rootfs ownership is corrected.
After the jail has started, run `jlmkr stop rootless && jlmkr edit rootless`, remove `--private-users-ownership=chown` and increase the UID range to `131072` to double the number of UIDs available in the jail. We need more than 65536 UIDs available in the jail, since rootless podman also needs to be able to map UIDs. If I leave the `--private-users-ownership=chown` option, I get the following error:
> systemd-nspawn[678877]: Automatic UID/GID adjusting is only supported for UID/GID ranges starting at multiples of 2^16 with a range of 2^16
The flags look like this now:
```
systemd_nspawn_user_args=--network-bridge=br1 --resolv-conf=bind-host --system-call-filter='add_key keyctl bpf' --private-users=524288:131072
```
Start the jail with `jlmkr start rootless` and open a shell session inside the jail (as the remapped root user) with `jlmkr shell rootless`.
Then, inside the jail, start the network services (wait to get an IP address via DHCP) and install podman:
```bash
# systemd-networkd should already be enabled when using jlmkr.py from the develop branch
systemctl --now enable systemd-networkd
# Add the required capabilities to the `newuidmap` and `newgidmap` binaries.
# https://github.com/containers/podman/issues/2788#issuecomment-1016301663
# https://github.com/containers/podman/issues/2788#issuecomment-479972943
# https://github.com/containers/podman/issues/12637#issuecomment-996524341
setcap cap_setuid+eip /usr/bin/newuidmap
setcap cap_setgid+eip /usr/bin/newgidmap
# Create new user
adduser rootless
# Set password for user
passwd rootless
# Clear the subuids and subgids which have been assigned by default when creating the new user
usermod --del-subuids 0-4294967295 --del-subgids 0-4294967295 rootless
# Set a specific range, so it fits inside the number of available UIDs
usermod --add-subuids 65536-131071 --add-subgids 65536-131071 rootless
# Check the assigned range
cat /etc/subuid
# Check the available range
cat /proc/self/uid_map
dnf -y install podman
exit
```
From the TrueNAS host, open a shell as the rootless user inside the jail.
```bash
jlmkr shell --uid 1000 rootless
```
Run rootless podman as user 1000.
```bash
id
podman run hello-world
podman info
```
The output of podman info should contain:
```
graphDriverName: overlay
graphOptions: {}
graphRoot: /home/rootless/.local/share/containers/storage
[...]
graphStatus:
Backing Filesystem: zfs
Native Overlay Diff: "true"
Supports d_type: "true"
Supports shifting: "false"
Supports volatile: "true"
Using metacopy: "false"
```
## Cockpit management
Inside the rootless jail run (as root user):
```bash
dnf install cockpit cockpit-podman
systemctl enable --now cockpit.socket
ip a
```
Check the IP address of the jail and access the Cockpit web interface at https://0.0.0.0:9090, replacing 0.0.0.0 with the IP address you just found using `ip a`.
Then log in as user `rootless` with the password you created earlier. Click on `Podman containers`. In case it shows `Podman service is not active`, click `Start podman`. You can now manage your rootless podman containers in the rootless jailmaker jail using the Cockpit web GUI.
## TODO:
On the TrueNAS host run:
sudo sysctl net.ipv4.ip_unprivileged_port_start=23
> Starting the unprivileged port range at 23 keeps port 22 privileged, which prevents a process run by your user from impersonating the sshd daemon.
Actually make it persistent.
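One common way to make it persistent on a generic Linux host is a drop-in under `/etc/sysctl.d` (a sketch; TrueNAS SCALE may manage sysctls through its own UI, so treat this as an assumption):
```bash
# Persist the lowered privileged-port range across reboots (file name is an example)
echo 'net.ipv4.ip_unprivileged_port_start=23' | sudo tee /etc/sysctl.d/90-unprivileged-ports.conf
sudo sysctl --system
```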
## Additional resources:
Resources mentioning `add_key keyctl bpf`
- https://bbs.archlinux.org/viewtopic.php?id=252840
- https://wiki.archlinux.org/title/systemd-nspawn
- https://discourse.nixos.org/t/podman-docker-in-nixos-container-ideally-in-unprivileged-one/22909/12
Resources mentioning `@keyring`
- https://github.com/systemd/systemd/issues/17606
- https://github.com/systemd/systemd/blob/1c62c4fe0b54fb419b875cb2bae82a261518a745/src/shared/seccomp-util.c#L604
`@keyring` also includes `request_key` but doesn't include `bpf`

jlmkr.py

@ -17,6 +17,7 @@ import shutil
import stat
import subprocess
import sys
import tempfile
import time
import urllib.request
from collections import defaultdict
@ -273,19 +274,21 @@ def stop_jail(jail_name):
return subprocess.run(["machinectl", "poweroff", jail_name]).returncode
def parse_config(jail_config_path):
def parse_config_string(config_string):
config = configparser.ConfigParser()
try:
# Workaround to read config file without section headers
config.read_string("[DEFAULT]\n" + Path(jail_config_path).read_text())
config.read_string("[DEFAULT]\n" + config_string)
config = dict(config["DEFAULT"])
return config
def parse_config_file(jail_config_path):
try:
return parse_config_string(Path(jail_config_path).read_text())
except FileNotFoundError:
eprint(f"Unable to find config file: {jail_config_path}.")
return
config = dict(config["DEFAULT"])
return config
def add_hook(jail_path, systemd_run_additional_args, hook_command, hook_type):
if not hook_command:
@ -321,13 +324,53 @@ def start_jail(jail_name):
jail_path = get_jail_path(jail_name)
jail_config_path = get_jail_config_path(jail_name)
jail_rootfs_path = get_jail_rootfs_path(jail_name)
config = parse_config(jail_config_path)
config = parse_config_file(jail_config_path)
if not config:
eprint("Aborting...")
return 1
# Handle initial setup
initial_setup = config.get("initial_setup")
# Alternative method to set up on first boot:
# https://www.undrground.org/2021/01/25/adding-a-single-run-task-via-systemd/
# If there's no machine-id, then this is the first time the jail is started
if initial_setup and not os.path.exists(
os.path.join(jail_rootfs_path, "etc/machine-id")
):
# If the command starts with a shebang, write it to a script file and run that; otherwise run it directly
if initial_setup.startswith("#!"):
# Write a script file and call that
initial_setup_file = os.path.abspath(
os.path.join(jail_path, ".initial_setup")
)
print(initial_setup, file=open(initial_setup_file, "w"))
stat_chmod(initial_setup_file, 0o700)
cmd = [
"systemd-nspawn",
"-q",
"-D",
jail_rootfs_path,
f"--bind-ro={initial_setup_file}:/root/initial_startup",
"/root/initial_startup",
]
else:
cmd = ["systemd-nspawn", "-q", "-D", jail_rootfs_path, initial_setup]
returncode = subprocess.run(cmd).returncode
if returncode != 0:
eprint("Failed to run initial setup:")
eprint(initial_setup)
eprint()
eprint("Abort starting jail.")
return returncode
# Cleanup the initial_setup_file
Path(initial_setup_file).unlink(missing_ok=True)
systemd_run_additional_args = [
f"--unit={SYMLINK_NAME}-{jail_name}",
f"--working-directory=./{jail_path}",
@ -665,11 +708,22 @@ def check_jail_name_available(jail_name, warn=True):
return False
def create_jail(jail_name, distro="debian", release="bookworm"):
def ask_jail_name(jail_name=""):
while True:
print()
jail_name = input_with_default("Enter jail name: ", jail_name).strip()
if check_jail_name_valid(jail_name):
if check_jail_name_available(jail_name):
return jail_name
def create_jail(jail_name="", config_path=None, distro="debian", release="bookworm"):
"""
Create jail with given name.
"""
config_string = ""
print(DISCLAIMER)
if os.path.basename(os.getcwd()) != "jailmaker":
@ -705,6 +759,49 @@ def create_jail(jail_name, distro="debian", release="bookworm"):
os.makedirs(JAILS_DIR_PATH, exist_ok=True)
stat_chmod(JAILS_DIR_PATH, 0o700)
#################
# Config handling
#################
if config_path:
try:
config_string = Path(config_path).read_text()
except FileNotFoundError:
eprint(f"Unable to find file: {config_path}.")
return 1
else:
print()
if agree("Do you wish to create a jail from a config template?", "n"):
print(
dedent(
"""
A text editor will open so you can provide the config template.
- please copy your config
- paste it into the text editor
- save and close the text editor
"""
)
)
input("Press Enter to open the text editor.")
with tempfile.NamedTemporaryFile() as f:
subprocess.call([TEXT_EDITOR, f.name])
f.seek(0)
config_string = f.read().decode()
if config_string:
config = parse_config_string(config_string)
# Ask for jail name if not provided
if not (
jail_name
and check_jail_name_valid(jail_name)
and check_jail_name_available(jail_name)
):
jail_name = ask_jail_name(jail_name)
jail_path = get_jail_path(jail_name)
distro, release = config.get("initial_rootfs_image").split()
else:
print()
if not agree(f"Install the recommended image ({distro} {release})?", "y"):
print(
@ -745,18 +842,9 @@ def create_jail(jail_name, distro="debian", release="bookworm"):
release = input("Release: ")
while True:
print()
jail_name = input_with_default("Enter jail name: ", jail_name).strip()
if check_jail_name_valid(jail_name):
if check_jail_name_available(jail_name):
break
jail_name = ask_jail_name(jail_name)
jail_path = get_jail_path(jail_name)
# Cleanup in except, but only once the jail_path is final
# Otherwise we may cleanup the wrong directory
try:
print(
dedent(
f"""
@ -875,8 +963,106 @@ def create_jail(jail_name, distro="debian", release="bookworm"):
)
)
# Use mostly default settings for systemd-nspawn but with systemd-run instead of a service file:
# https://github.com/systemd/systemd/blob/main/units/systemd-nspawn%40.service.in
# Use TasksMax=infinity since this is what docker does:
# https://github.com/docker/engine/blob/master/contrib/init/systemd/docker.service
# Use SYSTEMD_NSPAWN_LOCK=0: otherwise the jail won't start after a shutdown (but why?)
# Would give "directory tree currently busy" error and I'd have to run
# `rm /run/systemd/nspawn/locks/*` and remove the .lck file from jail_path
# Disabling locking isn't a big deal as systemd-nspawn will prevent starting a container
# with the same name anyway: as long as we're starting jails using this script,
# it won't be possible to start the same jail twice
systemd_run_default_args = [
"--property=KillMode=mixed",
"--property=Type=notify",
"--property=RestartForceExitStatus=133",
"--property=SuccessExitStatus=133",
"--property=Delegate=yes",
"--property=TasksMax=infinity",
"--collect",
"--setenv=SYSTEMD_NSPAWN_LOCK=0",
]
# Always add --bind-ro=/sys/module to make lsmod happy
# https://manpages.debian.org/bookworm/manpages/sysfs.5.en.html
systemd_nspawn_default_args = [
"--keep-unit",
"--quiet",
"--boot",
"--bind-ro=/sys/module",
"--inaccessible=/sys/module/apparmor",
]
systemd_nspawn_user_args_multiline = "\n\t".join(
shlex.split(systemd_nspawn_user_args)
)
systemd_run_default_args_multiline = "\n\t".join(systemd_run_default_args)
systemd_nspawn_default_args_multiline = "\n\t".join(systemd_nspawn_default_args)
config_string = cleandoc(
f"""
startup={startup}
docker_compatible={docker_compatible}
gpu_passthrough_intel={gpu_passthrough_intel}
gpu_passthrough_nvidia={gpu_passthrough_nvidia}
"""
)
config_string += (
f"\n\nsystemd_nspawn_user_args={systemd_nspawn_user_args_multiline}\n\n"
)
config_string += cleandoc(
"""
# # Specify command/script to run on the HOST before starting the jail
# # For example to load kernel modules and config kernel settings
# pre_start_hook=#!/usr/bin/bash
# echo 'PRE_START_HOOK'
# echo 1 > /proc/sys/net/ipv4/ip_forward
# modprobe br_netfilter
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
# echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
#
# # Specify a command/script to run on the HOST after stopping the jail
# post_stop_hook=echo 'POST_STOP_HOOK'
# Specify command/script to run IN THE JAIL before starting it for the first time
# Useful to install packages on top of the base rootfs
# NOTE: this script will run in the host networking namespace and ignores
# all systemd_nspawn_user_args such as bind mounts
initial_setup=#!/usr/bin/bash
set -euo pipefail
apt-get update && apt-get -y install curl
curl -fsSL https://get.docker.com | sh
"""
)
config_string += "\n".join(
[
"",
"",
"# You generally will not need to change the options below",
f"systemd_run_default_args={systemd_run_default_args_multiline}",
"",
f"systemd_nspawn_default_args={systemd_nspawn_default_args_multiline}",
"",
"# Used by jlmkr create",
f"initial_rootfs_image={distro} {release}",
]
)
print()
##############
# Create start
##############
# Cleanup in except, but only once the jail_path is final
# Otherwise we may cleanup the wrong directory
try:
jail_config_path = get_jail_config_path(jail_name)
jail_rootfs_path = get_jail_rootfs_path(jail_name)
@ -1014,84 +1200,7 @@ def create_jail(jail_name, distro="debian", release="bookworm"):
file=open(os.path.join(preset_path, "00-jailmaker.preset"), "w"),
)
# Use mostly default settings for systemd-nspawn but with systemd-run instead of a service file:
# https://github.com/systemd/systemd/blob/main/units/systemd-nspawn%40.service.in
# Use TasksMax=infinity since this is what docker does:
# https://github.com/docker/engine/blob/master/contrib/init/systemd/docker.service
# Use SYSTEMD_NSPAWN_LOCK=0: otherwise the jail won't start after a shutdown (but why?)
# Would give "directory tree currently busy" error and I'd have to run
# `rm /run/systemd/nspawn/locks/*` and remove the .lck file from jail_path
# Disabling locking isn't a big deal as systemd-nspawn will prevent starting a container
# with the same name anyway: as long as we're starting jails using this script,
# it won't be possible to start the same jail twice
systemd_run_default_args = [
"--property=KillMode=mixed",
"--property=Type=notify",
"--property=RestartForceExitStatus=133",
"--property=SuccessExitStatus=133",
"--property=Delegate=yes",
"--property=TasksMax=infinity",
"--collect",
"--setenv=SYSTEMD_NSPAWN_LOCK=0",
]
# Always add --bind-ro=/sys/module to make lsmod happy
# https://manpages.debian.org/bookworm/manpages/sysfs.5.en.html
systemd_nspawn_default_args = [
"--keep-unit",
"--quiet",
"--boot",
"--bind-ro=/sys/module",
"--inaccessible=/sys/module/apparmor",
]
systemd_nspawn_user_args_multiline = "\n\t".join(
shlex.split(systemd_nspawn_user_args)
)
systemd_run_default_args_multiline = "\n\t".join(systemd_run_default_args)
systemd_nspawn_default_args_multiline = "\n\t".join(systemd_nspawn_default_args)
config = cleandoc(
f"""
startup={startup}
docker_compatible={docker_compatible}
gpu_passthrough_intel={gpu_passthrough_intel}
gpu_passthrough_nvidia={gpu_passthrough_nvidia}
"""
)
config += (
f"\n\nsystemd_nspawn_user_args={systemd_nspawn_user_args_multiline}\n\n"
)
config += cleandoc(
"""
# Specify command/script to run on the HOST before starting the jail
pre_start_hook=echo 'PRE_START_HOOK'
# Specify a command/script to run on the HOST after stopping the jail
post_stop_hook=#!/usr/bin/bash
echo 'POST STOP HOOK'
"""
)
config += "\n".join(
[
"",
"",
"# You generally will not need to change the options below",
f"systemd_run_default_args={systemd_run_default_args_multiline}",
"",
f"systemd_nspawn_default_args={systemd_nspawn_default_args_multiline}",
"",
"# The below is for reference only, currently not used",
f"initial_rootfs_image={distro} {release}",
]
)
print(config, file=open(jail_config_path, "w"))
print(config_string, file=open(jail_config_path, "w"))
os.chmod(jail_config_path, 0o600)
@ -1100,6 +1209,13 @@ def create_jail(jail_name, distro="debian", release="bookworm"):
cleanup(jail_path)
raise error
# In case you want to create a jail without any user interaction,
# you need to skip this final question
# echo 'y' | jlmkr create test testconfig
# TODO: make jlmkr create work cleanly without user interaction.
# Current echo 'y' workaround may cause problems when the jail name already exists
# You'd end up with a new jail called 'y'
# and the script will crash at the agree statement below
print()
if agree(f"Do you want to start jail {jail_name} right now?", "y"):
return start_jail(jail_name)
@ -1286,7 +1402,7 @@ def list_jails():
# TODO: add additional properties from the jails config file
for jail_name in jails:
config = parse_config(get_jail_config_path(jail_name))
config = parse_config_file(get_jail_config_path(jail_name))
startup = False
if config:
@ -1434,7 +1550,7 @@ def startup_jails():
start_failure = False
for jail_name in get_all_jail_names():
config = parse_config(get_jail_config_path(jail_name))
config = parse_config_file(get_jail_config_path(jail_name))
if config and config.get("startup") == "1":
if start_jail(jail_name) != 0:
start_failure = True
@ -1463,9 +1579,11 @@ def main():
help="install jailmaker dependencies and create symlink",
)
subparsers.add_parser(
create_parser = subparsers.add_parser(
name="create", epilog=DISCLAIMER, help="create a new jail"
).add_argument("name", nargs="?", help="name of the jail")
)
create_parser.add_argument("name", nargs="?", help="name of the jail")
create_parser.add_argument("config", nargs="?", help="path to config file template")
subparsers.add_parser(
name="start", epilog=DISCLAIMER, help="start a previously created jail"
@ -1539,7 +1657,7 @@ def main():
sys.exit(install_jailmaker())
elif args.subcommand == "create":
sys.exit(create_jail(args.name))
sys.exit(create_jail(args.name, args.config))
elif args.subcommand == "start":
sys.exit(start_jail(args.name))
@ -1580,7 +1698,7 @@ def main():
else:
if agree("Create a new jail?", "y"):
print()
sys.exit(create_jail(""))
sys.exit(create_jail())
else:
parser.print_usage()


@ -0,0 +1,3 @@
# Debian Docker Jail Template
Check out the [config](./config) template file. You may provide it when asked during `jlmkr create` or, if you have the template file stored on your NAS, you may provide it directly by running `jlmkr create mydockerjail /mnt/tank/path/to/docker/config`.
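Once the jail is up, a quick way to check that the docker install from `initial_setup` succeeded (a sketch, assuming the jail name from the example above and that the jail has network access):
```bash
jlmkr exec mydockerjail bash -c 'docker run --rm hello-world'
```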

templates/docker/config

@ -0,0 +1,59 @@
startup=0
gpu_passthrough_intel=0
gpu_passthrough_nvidia=0
# Use macvlan networking to provide an isolated network namespace,
# so docker can manage firewall rules
# Alternatively use --network-bridge=br1 instead of --network-macvlan
# Make sure to change eno1/br1 to the interface name you want to use
# You may want to add additional options here, e.g. bind mounts
systemd_nspawn_user_args=--network-macvlan=eno1
--resolv-conf=bind-host
--system-call-filter='add_key keyctl bpf'
# Script to run on the HOST before starting the jail
# Load kernel module and config kernel settings required for docker
pre_start_hook=#!/usr/bin/bash
echo 'PRE_START_HOOK'
echo 1 > /proc/sys/net/ipv4/ip_forward
modprobe br_netfilter
echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
# Install docker inside the jail:
# https://docs.docker.com/engine/install/debian/#install-using-the-repository
# NOTE: this script will run in the host networking namespace and ignores
# all systemd_nspawn_user_args such as bind mounts
initial_setup=#!/usr/bin/bash
set -euo pipefail
apt-get update && apt-get -y install ca-certificates curl
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update
apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# You generally will not need to change the options below
systemd_run_default_args=--property=KillMode=mixed
--property=Type=notify
--property=RestartForceExitStatus=133
--property=SuccessExitStatus=133
--property=Delegate=yes
--property=TasksMax=infinity
--collect
--setenv=SYSTEMD_NSPAWN_LOCK=0
systemd_nspawn_default_args=--keep-unit
--quiet
--boot
--bind-ro=/sys/module
--inaccessible=/sys/module/apparmor
# Used by jlmkr create
initial_rootfs_image=debian bookworm

templates/lxd/README.md

@ -0,0 +1,101 @@
# Ubuntu LXD Jail Template
Check out the [config](./config) template file. You may provide it when asked during `jlmkr create` or, if you have the template file stored on your NAS, you may provide it directly by running `jlmkr create mylxdjail /mnt/tank/path/to/lxd/config`.
Unfortunately snapd doesn't want to install from the `initial_setup` script inside the config file, so we finish the setup manually by running the following after creating and starting the jail:
```bash
jlmkr exec mylxdjail bash -c 'apt-get update &&
apt-get install -y --no-install-recommends snapd &&
snap install lxd'
# Answer yes when asked the following:
# Would you like the LXD server to be available over the network? (yes/no) [default=no]: yes
# TODO: fix ZFS
jlmkr exec mylxdjail bash -c 'lxd init &&
snap set lxd ui.enable=true &&
systemctl reload snap.lxd.daemon'
```
Then visit the LXD GUI in the browser at https://0.0.0.0:8443. To find out which IP address to use instead of 0.0.0.0, check the IP address of your jail with `jlmkr list`.
## Disclaimer
**These notes are a work in progress. Using Incus in this setup hasn't been extensively tested.**
## Installation
Create a debian 12 jail and [install incus](https://github.com/zabbly/incus#installation). Also install the `incus-ui-canonical` package to get the web interface. Run `modprobe vhost_vsock` on the TrueNAS host. Ensure the jail's config file looks like the one below:
```
startup=0
docker_compatible=1
gpu_passthrough_intel=1
gpu_passthrough_nvidia=0
systemd_nspawn_user_args=--network-bridge=br1 --resolv-conf=bind-host --bind=/dev/fuse --bind=/dev/kvm --bind=/dev/vsock --bind=/dev/vhost-vsock
# You generally will not need to change the options below
systemd_run_default_args=--property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0
systemd_nspawn_default_args=--keep-unit --quiet --boot --bind-ro=/sys/module --inaccessible=/sys/module/apparmor
```
Check out [First steps with Incus](https://linuxcontainers.org/incus/docs/main/tutorial/first_steps/).
## Create Ubuntu Desktop VM
The Incus web GUI should be running on port 8443. Create a new instance, call it `desktop`, and choose the `Ubuntu jammy desktop virtual-machine ubuntu/22.04/desktop` image.
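If you prefer the CLI, the equivalent step is roughly as follows (a sketch; the image alias is taken from the GUI label above and may differ on your image server):
```bash
incus launch images:ubuntu/22.04/desktop desktop --vm
```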
## Bind mount / virtiofs
To access files from the TrueNAS host directly in a VM created with incus, we can use virtiofs.
```bash
incus config device add desktop test disk source=/home/test/ path=/mnt/test
```
The command above (when run as the root user inside the incus jail) adds a new virtiofs mount of a test directory inside the jail to a VM named `desktop`. The `/home/test` dir resides in the jail, but you can first bind mount any directory from the TrueNAS host into the incus jail and then forward it to the VM using virtiofs. This could be an alternative to NFS mounts.
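To expose a TrueNAS dataset instead, a sketch (paths, dataset, and names are assumptions):
```bash
# In the jail config (jlmkr edit incus): bind mount a host dataset into the jail by adding
# e.g. --bind='/mnt/tank/data:/mnt/data' to systemd_nspawn_user_args, then restart the jail.
# Then, as root inside the incus jail, forward that directory to the desktop VM over virtiofs:
incus config device add desktop data disk source=/mnt/data path=/mnt/data
```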
### Benchmarks
#### Inside LXD ubuntu desktop VM with virtiofs mount
root@desktop:/mnt/test# mount | grep test
incus_test on /mnt/test type virtiofs (rw,relatime)
root@desktop:/mnt/test# time iozone -a
[...]
real 2m22.389s
user 0m2.222s
sys 0m59.275s
#### In a jailmaker jail on the host:
root@incus:/home/test# time iozone -a
[...]
real 0m59.486s
user 0m1.468s
sys 0m25.458s
#### Inside LXD ubuntu desktop VM with virtiofs mount
root@desktop:/mnt/test# dd if=/dev/random of=./test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 36.321 s, 29.6 MB/s
#### In a jailmaker jail on the host:
root@incus:/home/test# dd if=/dev/random of=./test2.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 7.03723 s, 153 MB/s
## Create Ubuntu container
To be able to create unprivileged (rootless) containers with incus inside the jail, you need to increase the number of UIDs available inside the jail. Please refer to the [Podman instructions](../podman/README.md) for more information. If you don't increase the UIDs you can only create privileged containers; in that case you'd have to change `Privileged` to `Allow` in `Security policies`.
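As a sketch, the edit looks roughly like this (values copied from the Podman instructions referenced above; the full procedure, including the one-time `--private-users-ownership=chown` step on first start, is described there):
```bash
jlmkr stop mylxdjail
jlmkr edit mylxdjail   # add e.g. --private-users=524288:131072 to systemd_nspawn_user_args
jlmkr start mylxdjail
```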
## Canonical LXD install via snap
Installing the lxd snap is an alternative to Incus. But out of the box running `snap install lxd` will cause AppArmor issues when running inside a jailmaker jail on SCALE.
## References
- [Running QEMU/KVM Virtual Machines in Unprivileged LXD Containers](https://dshcherb.github.io/2017/12/04/qemu-kvm-virtual-machines-in-unprivileged-lxd.html)

templates/lxd/config

@ -0,0 +1,55 @@
startup=0
gpu_passthrough_intel=1
gpu_passthrough_nvidia=0
# Use macvlan networking to provide an isolated network namespace,
# so lxd can manage firewall rules
# Alternatively use --network-bridge=br1 instead of --network-macvlan
# Make sure to change eno1/br1 to the interface name you want to use
# You may want to add additional options here, e.g. bind mounts
# TODO: don't use --capability=all but specify only the required capabilities
systemd_nspawn_user_args=--network-macvlan=eno1
--resolv-conf=bind-host
--capability=all
--bind=/dev/fuse
--bind=/dev/kvm
--bind=/dev/vsock
--bind=/dev/vhost-vsock
# Script to run on the HOST before starting the jail
# Load kernel module and config kernel settings required for lxd
pre_start_hook=#!/usr/bin/bash
echo 'PRE_START_HOOK'
echo 1 > /proc/sys/net/ipv4/ip_forward
modprobe br_netfilter
echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
modprobe vhost_vsock
# NOTE: this script will run in the host networking namespace and ignores
# all systemd_nspawn_user_args such as bind mounts
initial_setup=#!/usr/bin/bash
# https://discuss.linuxcontainers.org/t/snap-inside-privileged-lxd-container/13691/8
ln -s /bin/true /usr/local/bin/udevadm
# You generally will not need to change the options below
systemd_run_default_args=--property=KillMode=mixed
--property=Type=notify
--property=RestartForceExitStatus=133
--property=SuccessExitStatus=133
--property=Delegate=yes
--property=TasksMax=infinity
--collect
--setenv=SYSTEMD_NSPAWN_LOCK=0
# TODO: check if the below 2 are required
# --setenv=SYSTEMD_SECCOMP=0
# --property=DevicePolicy=auto
systemd_nspawn_default_args=--keep-unit
--quiet
--boot
--bind-ro=/sys/module
--inaccessible=/sys/module/apparmor
# Used by jlmkr create
initial_rootfs_image=ubuntu jammy

templates/podman/README.md

@ -0,0 +1,121 @@
# Fedora Podman Jail Template
Check out the [config](./config) template file. You may provide it when asked during `jlmkr create` or, if you have the template file stored on your NAS, you may provide it directly by running `jlmkr create mypodmanjail /mnt/tank/path/to/podman/config`.
## Rootless
### Disclaimer
**These notes are a work in progress. Using podman in this setup hasn't been extensively tested.**
### Installation
Prerequisites: a jail created using the [config](./config) template file.
Run `jlmkr edit mypodmanjail` and add `--private-users=524288:65536 --private-users-ownership=chown` to `systemd_nspawn_user_args`. We start at UID 524288, as this is the [systemd range used for containers](https://github.com/systemd/systemd/blob/main/docs/UIDS-GIDS.md#summary).
The `--private-users-ownership=chown` option will ensure the rootfs ownership is corrected.
After the jail has started, run `jlmkr stop mypodmanjail && jlmkr edit mypodmanjail`, remove `--private-users-ownership=chown` and increase the UID range to `131072` to double the number of UIDs available in the jail. We need more than 65536 UIDs available in the jail, since rootless podman also needs to be able to map UIDs. If I leave the `--private-users-ownership=chown` option, I get the following error:
> systemd-nspawn[678877]: Automatic UID/GID adjusting is only supported for UID/GID ranges starting at multiples of 2^16 with a range of 2^16
The flags look like this now:
```
systemd_nspawn_user_args=--network-macvlan=eno1
--resolv-conf=bind-host
--system-call-filter='add_key keyctl bpf'
--private-users=524288:131072
```
Start the jail with `jlmkr start mypodmanjail` and open a shell session inside the jail (as the remapped root user) with `jlmkr shell mypodmanjail`.
Then, inside the jail, set up the new rootless user:
```bash
# Create new user
adduser rootless
# Set password for user
passwd rootless
# Clear the subuids and subgids which have been assigned by default when creating the new user
usermod --del-subuids 0-4294967295 --del-subgids 0-4294967295 rootless
# Set a specific range, so it fits inside the number of available UIDs
usermod --add-subuids 65536-131071 --add-subgids 65536-131071 rootless
# Check the assigned range
cat /etc/subuid
# Check the available range
cat /proc/self/uid_map
exit
```
From the TrueNAS host, open a shell as the rootless user inside the jail.
```bash
jlmkr shell --uid 1000 mypodmanjail
```
Run rootless podman as user 1000.
```bash
id
podman run hello-world
podman info
```
The output of podman info should contain:
```
graphDriverName: overlay
graphOptions: {}
graphRoot: /home/rootless/.local/share/containers/storage
[...]
graphStatus:
Backing Filesystem: zfs
Native Overlay Diff: "true"
Supports d_type: "true"
Supports shifting: "false"
Supports volatile: "true"
Using metacopy: "false"
```
### Binding to Privileged Ports:
Add `sysctl net.ipv4.ip_unprivileged_port_start=23` to the `pre_start_hook` inside the config to lower the range of privileged ports. Because the range then starts at 23, port 22 stays privileged, so an unprivileged process still can't impersonate the sshd daemon. Since this lowers the range globally on the TrueNAS host, a better solution would be to specifically add the capability to bind to privileged ports.
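For illustration, the hook from the [config](./config) template would then end like this (a sketch; continuation lines of a multi-line value must stay indented):
```
pre_start_hook=#!/usr/bin/bash
    echo 'PRE_START_HOOK'
    echo 1 > /proc/sys/net/ipv4/ip_forward
    modprobe br_netfilter
    echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
    echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
    sysctl net.ipv4.ip_unprivileged_port_start=23
```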
## Cockpit Management
Install and enable cockpit:
```bash
jlmkr exec mypodmanjail bash -c "dnf -y install cockpit cockpit-podman && \
systemctl enable --now cockpit.socket && \
ip a &&
ip route | awk '/default/ { print \$9 }'"
```
Check the IP address of the jail and access the Cockpit web interface at https://0.0.0.0:9090, replacing 0.0.0.0 with the IP address you just found using `ip a`.
If you've set up the `rootless` user, you may log in with the password you created earlier. Otherwise you'd have to add an admin user first:
```bash
jlmkr exec mypodmanjail bash -c 'adduser admin
passwd admin
usermod -aG wheel admin'
```
Click on `Podman containers`. In case it shows `Podman service is not active` then click `Start podman`. You can now manage your (rootless) podman containers in the (rootless) jailmaker jail using the Cockpit web GUI.
## Additional Resources:
Resources mentioning `add_key keyctl bpf`
- https://bbs.archlinux.org/viewtopic.php?id=252840
- https://wiki.archlinux.org/title/systemd-nspawn
- https://discourse.nixos.org/t/podman-docker-in-nixos-container-ideally-in-unprivileged-one/22909/12
Resources mentioning `@keyring`
- https://github.com/systemd/systemd/issues/17606
- https://github.com/systemd/systemd/blob/1c62c4fe0b54fb419b875cb2bae82a261518a745/src/shared/seccomp-util.c#L604
`@keyring` also includes `request_key` but doesn't include `bpf`

templates/podman/config

@ -0,0 +1,53 @@
startup=0
gpu_passthrough_intel=0
gpu_passthrough_nvidia=0
# Use macvlan networking to provide an isolated network namespace,
# so podman can manage firewall rules
# Alternatively use --network-bridge=br1 instead of --network-macvlan
# Make sure to change eno1/br1 to the interface name you want to use
# You may want to add additional options here, e.g. bind mounts
systemd_nspawn_user_args=--network-macvlan=eno1
--resolv-conf=bind-host
--system-call-filter='add_key keyctl bpf'
# Script to run on the HOST before starting the jail
# Load kernel module and config kernel settings required for podman
pre_start_hook=#!/usr/bin/bash
echo 'PRE_START_HOOK'
echo 1 > /proc/sys/net/ipv4/ip_forward
modprobe br_netfilter
echo 1 > /proc/sys/net/bridge/bridge-nf-call-iptables
echo 1 > /proc/sys/net/bridge/bridge-nf-call-ip6tables
# Install podman inside the jail
# NOTE: this script will run in the host networking namespace and ignores
# all systemd_nspawn_user_args such as bind mounts
initial_setup=#!/usr/bin/bash
set -euo pipefail
dnf -y install podman
# Add the required capabilities to the `newuidmap` and `newgidmap` binaries
# https://github.com/containers/podman/issues/2788#issuecomment-1016301663
# https://github.com/containers/podman/issues/12637#issuecomment-996524341
setcap cap_setuid+eip /usr/bin/newuidmap
setcap cap_setgid+eip /usr/bin/newgidmap
# You generally will not need to change the options below
systemd_run_default_args=--property=KillMode=mixed
--property=Type=notify
--property=RestartForceExitStatus=133
--property=SuccessExitStatus=133
--property=Delegate=yes
--property=TasksMax=infinity
--collect
--setenv=SYSTEMD_NSPAWN_LOCK=0
systemd_nspawn_default_args=--keep-unit
--quiet
--boot
--bind-ro=/sys/module
--inaccessible=/sys/module/apparmor
# Used by jlmkr create
initial_rootfs_image=fedora 39