NixOS and Stateless Deployment
If I had my way, I would never deploy or administer a linux server that isn’t running NixOS.
I’m not exactly a prolific sysadmin - in my time, I’ve set up and administered servers numbering in the low tens. And yet every single time, it’s awful.
Firstly, you get rid of the notion of doing anything manually, ever. Anytime you do something manually you create a unique snowflake, and then 3 weeks (or 3 years!) down the track you tear your hair out trying to recreate whatever seemingly-unimportant thing it is you did last time that must have made it work.
So you learn about automated deployment. There is no shortage of tools, and they’re mostly pretty similar. I’ve personally used a handful of them, and learned about many more in my quest not to have an awful deployment experience.
All of these work more or less as advertised, but all of them still leave me with a pretty crappy deployment experience.
The problem
Most of those are imperative, in that they boil down to a list of steps - “install X”, “upload file A -> B”, etc. This is the obvious approach to automating deployment, kind of like a shell script is the obvious approach to automating a process. It takes what you currently do, and turns it into one or more concrete files that you can modify and replay later.
And obviously, the entire problem of server deployment is deeply stateful - your server is quite literally a state machine, and each deployment attempts to modify its current state into (hopefully) the expected target state.
Unfortunately, in such a system it can be difficult to predict how the current state will interact with your deployment scripts. Performing the same deployment to two servers that started in different states can have drastically different results. Usually one of them failing.
Puppet is a little different, in that you don’t specify what you want to happen, but rather the desired state. Instead of writing down the steps required to install the package foo, you simply state that you want foo to be installed, and puppet knows what to do to get the current system (whatever its state) into the state you asked for.
Which would be great, if it weren’t a pretty big lie.
The thing is, it’s a fool’s errand to try and specify your system state in puppet. Puppet is built on traditional linux (and even windows) systems, with their stateful package managers and their stateful file systems and their stateful user management and their stateful configuration directories, and… well, you get the idea. There are plenty of places for state to hide, and puppet barely scratches the surface.
If you deploy a puppet configuration that specifies “package foo must be installed”, but then you remove that line from your config at time t, what happens? Well, now any servers deployed before t will have foo installed, but new servers (after t) will not. You did nothing wrong, it’s just that puppet’s declarative approach is only a thin veneer over an inherently stateful system.
To correctly use puppet, you would have to specify not only what you do want to be true about a system, but also all of the possible things that you do not want to be true about a system. This includes any package that may have ever been installed, any file that may have ever been created, any users or groups that may have ever been created, etc. And if you miss any of that, well, don’t worry. You’ll find out when it breaks something.
So servers are deeply stateful. And deployment is typically imperative. This is clearly a bad mix for something that you want to be as reproducible and reliable as possible.
Puppet tries to fix the “imperative” part of deployment, but can’t really do anything about the statefulness of its hosts. Can we do better?
Well, yeah.
nix, the purely functional package manager
It started with nix, the “purely functional package manager”. From the description, you can tell that nix is not your standard package manager. Most packaging systems consist of thousands of individual packages, referenced by some package id / name, with loose versioning requirements on other packages in the repository (e.g “libfoo >= 2.3.1 <2.4”). This has worked well for OSS distributions for a long time, and it’s a pretty versatile model. But nix is different. It is not a collection of packages, each with multiple versions. It is a single, monolithic expression defined in a functional, lazy, purpose-built language.
So you don’t have a package named git. You have “the entire set of packages” (usually named pkgs), and you can access its git property, which is a specific version of git. For git, there is only one version. For other tools (like python), there are multiple different attributes for different minor versions (e.g. pkgs.python is an alias to pkgs.python2, which itself is an alias to pkgs.python27).
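As a rough sketch of what that looks like (assuming the package set is reachable on the search path as <nixpkgs>, which is the usual setup but still an assumption here, and with a made-up file name):

```nix
# pkgs-example.nix (hypothetical file name)
let
  # Import the whole package set as one attribute set.
  pkgs = import <nixpkgs> {};
in {
  # Each package is simply an attribute of that set.
  git = pkgs.git;
  # Attribute name as used in this post; it may differ in your nixpkgs.
  python = pkgs.python27;
}
```

Because the language is lazy, nothing here is built or even fully evaluated until you ask for one of the attributes.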
Okay, but why?
So far, this is not functionally different to standard package management. After all, my fedora box has explicit python2 and python2.7 packages too. But one difference is that traditional packages are global, and not user-serviceable. Say that (for whatever reason), you needed python2.7 with some additional patch. Since the entire package space exists in one expression, you can manipulate it in very flexible ways. For example:
# my-python.nix
{ pkgs }:
let base = pkgs.python27;
in base.overrideAttrs (old: {
	patches = (old.patches or []) ++ [ ./my-python.patch ];
})
Even if you’ve never read a nix expression before, you can probably guess what that does. It defines a function which takes an argument named pkgs. It binds the local variable base to pkgs.python27. And it defines the result to be base with a single overridden patches attribute. The override goes through overrideAttrs rather than a plain attribute-set merge (the // operator), because merging into an already-evaluated package changes the attribute set but not the build it describes. This just takes whatever patches the base expression has (if any), and adds the my-python.patch file. patches is a standard attribute - a list of patch files which will be applied to the source before building. So now we have a python package which is identical to the official package with just a tiny modification. We didn’t have to set up our own repository, or figure out how to make a modified .rpm from our distro’s sources.
Thankfully, I’ve never needed to run a patched version of python on a production server. But I have needed to experiment with prerelease versions of etcd on a test server, and nix makes substituting official packages for modified versions pretty trivial.
The right dependency for the job
Another interesting outcome of the packages being defined in a proper programming language is that package sets can be parameterised. Often, python libraries are version-independent - the same code will work just fine in multiple minor versions of python (e.g 2.6, 2.7), sometimes even major versions (2.x and 3.x). But any module that requires compilation must be compiled against the minor version of python that it’ll be used with. And python is pretty good at its ABI guarantees - other runtimes (like ocaml or haskell) have extremely delicate ABIs which generally mean you must compile against the exact runtime and libraries that you’ll be using at runtime.
For nix, this happens automatically. Each python package is actually a function which takes (amongst other things) the python implementation in use. This could be pkgs.python, pkgs.python27 or even my-python (that we made above). Because each python package has a build-time dependency on the exact python version used, you will never have ABI or other incompatibilities that come from using a different version at build time than run time.
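To make the "package as a function" idea concrete, here is a minimal sketch. The library, its name and its file layout are invented for illustration; only stdenv.mkDerivation and buildInputs are standard nixpkgs pieces.

```nix
# my-lib.nix (hypothetical library, for illustration only)
# The package is a function: the python implementation is an argument.
{ stdenv, python }:

stdenv.mkDerivation {
  name = "my-lib-0.1";
  src = ./.;
  # Whichever python is passed in becomes a build-time dependency,
  # so anything compiled here is built against exactly that interpreter.
  buildInputs = [ python ];
}
```

You could then instantiate it with something like pkgs.callPackage ./my-lib.nix { python = pkgs.python27; }, and swap in a different interpreter just by changing that one argument.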
And to make sure that the python version used at compile time is also used at runtime, nix doesn’t use the concept of a globally-installed python interpreter. When you pass a python implementation to a nix function, that function will actually see python as living at a path like /nix/store/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-python/bin/python (and will hard-code this path wherever needed, instead of e.g. /usr/bin/python). The xxxxxxxxs are a cryptographic hash of the inputs used to define this python implementation. Since nix is pure[1], the same inputs will produce the same output. This means the above path is immutable, which is good for both caching and consistency. And because each nix build output lives in a unique path under /nix/store, you never run into name clashes - you can have as many simultaneous versions of python (or anything else) as your system needs, although you’ll generally end up with just one or two in practice.
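The hard-coding is easy to see from inside an expression. In this sketch (the file and script names are made up, and <nixpkgs> is assumed to be on the search path), interpolating a package into a string produces its /nix/store path:

```nix
# hello-script.nix (hypothetical example)
let
  pkgs = import <nixpkgs> {};
in
# ${pkgs.python27} expands to that package's /nix/store/...-python path,
# so the generated script refers to one exact python build.
pkgs.writeScript "hello-python" ''
  #!${pkgs.python27}/bin/python
  print("hello from a pinned interpreter")
''
```

The resulting script never consults /usr/bin or $PATH to find its interpreter.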
Hardcode all the things
This definitely has some inefficiencies. For one, overriding the “default” python to the one we defined above will cause every python package that you use with it to be rebuilt, because the inputs used to build python are slightly different from the officially-built packages. But there is no way for nix to tell whether a given change affects the ABI (or API) of an implementation, so the only safe thing it can do is rebuild. That means nix will sometimes rebuild things when it doesn’t strictly need to, but it never has to worry about ABI incompatibilities. Distributions are more efficient here by only rebuilding when necessary, but at the cost of a fair bit of manpower, runtime-specific knowledge, and sometimes getting it wrong.
Because a nix implementation hardcodes paths to all its dependencies as paths under /nix/store, it’s actually incredibly easy to compute the closure of a given implementation - that is, every other implementation that it references, transitively. This means that if you build something locally, you can run nix-copy-closure --to <remote-machine> <path> to have that exact implementation, and all of its transitive dependencies, copied to a remote machine. This is obviously tremendously useful for deployment, as you can be sure that each machine will receive the exact same results, without having to deal with time-related inconsistencies (like running apt-get update at different times on different machines). It’s also extremely efficient - nix store paths are unique and immutable, so any store path that already exists doesn’t need to be copied (or even re-checked).
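If you want to look at a closure from within nix itself, one option is the closureInfo helper. This is only a sketch, and it assumes a nixpkgs recent enough to provide that helper; the file name is made up.

```nix
# closure.nix (hypothetical example)
let
  pkgs = import <nixpkgs> {};
in
# closureInfo builds a small directory describing the closure of the
# given root paths; its store-paths file lists every store path that
# git references, directly or transitively.
pkgs.closureInfo { rootPaths = [ pkgs.git ]; }
```

Every line in that list is a path that copying the closure would transfer, unless the remote machine already has it.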
The payoff: NixOS
But I haven’t even got to the most amazing part of nix yet: NixOS. NixOS came from the desire to extend nix’s features from “package management” to “the entire machine”.
With nix, every derivation (output) is pure - given the same set of inputs, you get the same output. And what’s more, the inputs are described in one language (also called nix), which is functional and strongly typed. So if you could define an entire machine using the nix language, and somehow run it, it would be like puppet on steroids - the state of the entire machine would be pure - applying a given config would always produce the same result, regardless of the previous state of the machine. This is exactly what NixOS does, and it works amazingly well.
Why is it better than those other declarative systems?
It should hopefully be obvious at this point why NixOS is better than puppet: Both are declarative, but puppet is impure and non-exhaustive - when you apply a config, puppet compares everything specified against the current state of the system. Everything not specified is left alone, which means you’re only specifying a very tiny subset of your system. With NixOS, if something is not specified, it is not present. This includes configuration files, packages, even users and groups. So NixOS is pure, declarative and exhaustive. In fact, due to nix’s purity and pervasive caching, “rebuilding the entire OS” with NixOS is often quicker than reapplying a relatively simple puppet config.
Puppet is not the only declarative system though - docker is declarative and exhaustive. It’s obviously a bit of an apples to oranges comparison (like comparing Ubuntu to VirtualBox), but the similarities are still interesting:
Building a docker image starts with a well-known state (e.g a vanilla Ubuntu-LTS image), and the steps in a Dockerfile are simply executed sequentially. When you rebuild the Dockerfile it doesn’t try and get you from the current state to the new state, it just throws away the current state and rebuilds the new state from scratch. Of course, it also uses caching so that you don’t have to wait half an hour to rebuild stuff that you haven’t changed. Unfortunately for docker, its caching mechanism is “usually good enough”, a.k.a “sometimes catastrophically broken” if you aren’t aware of its impurities[2].
So docker is both declarative and exhaustive, but its impurities still cause issues and inconsistencies between builds (as well as a lot of redundant work when rebuilding, which can be very slow). Also, because you’re pretty much running two OSes (guest and host), you need to deal with all of those issues too (in particular, making sure security updates are applied on both systems). And once you have a docker container, you still need to worry about the environment in which you’re running it, which docker doesn’t address at all.
Lumps of impurity
Of course, not everything should be stateless. If you have a database server and you change its configuration, you really do want it to keep some state around (the database!). But NixOS keeps the “OS” parts of the machine stateless, meaning that the only state you need to manage is that which you create yourself. Off the top of my head (not an exhaustive list):
- /var, /tmp, /run: stateful, unmanaged
- each user’s $HOME directory: stateful, unmanaged (rarely used on a server, though)
- Users & Groups: stateless
- Installed software: stateless
- Installed services (using systemd): stateless
- All program configuration (i.e all of /etc): stateless[3]
- Kernel & kernel modules, grub configuration: stateless (but requires a reboot to activate)
- Disk mounts: stateless
By “stateless”, I mean that the given item is generated entirely from the pure nix expression of the system, and isn’t affected by any previous state.
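To give a feel for how the stateless items in that list are expressed, here is a minimal sketch of a system configuration. It is not a complete bootable config, and the account name and disk label are invented for illustration.

```nix
# configuration.nix (a minimal sketch, not a complete bootable config)
{ config, pkgs, ... }:
{
  # Installed software
  environment.systemPackages = [ pkgs.git ];

  # Installed services (turned into systemd units)
  services.openssh.enable = true;

  # Users & groups; "deploy" is a made-up account name
  users.users.deploy = {
    isNormalUser = true;
    extraGroups = [ "wheel" ];
  };

  # Disk mounts; the label is an assumption
  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "ext4";
  };
}
```

Deleting the services.openssh.enable line and rebuilding gives you a system without sshd, rather than a system where sshd is merely no longer managed.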
NixOS also uses a bunch of impure technology under the hood, e.g when applying a new set of installed services, it still needs to look at the current installed services in order to tell systemd to stop & remove services that are not in the new system config. But this is fully automated, and the typed, single-language nature of using nix as the only configuration mechanism means that this is not a very complex task, and is generally bug-free.
The bootstrapping problem
Bootstrapping a system is always going to be different to updating a system. Many tools deal with only one side of this equation - e.g many deployment tools will allow you to provision a machine from some base image, but you’ll have to use something else (perhaps ansible or puppet) to keep that machine up to date. And in the space between, you probably need to do some custom scripting to install puppet itself, or set up admin users with appropriate SSH keys.
Nix can’t remove the difference between bootstrapping and updating, but it can dramatically reduce it. I won’t go too deeply into this right now (this post is already rather long), but one important thing about NixOS is that you can build a machine’s configuration locally, and then just push it to the actual machine using nix-copy-closure. Once pushed, a new root can be made active with a single nix command.
When bootstrapping, obviously you don’t yet have a remote machine running NixOS to push to. But that doesn’t mean you have to use completely different tools. Under the hood, the root filesystem of a NixOS machine lives as an attribute of your system config: <config>.system.build.toplevel. But there are other options - if you instead build <config>.system.build.virtualBoxImage, out will pop a VirtualBox image for that system instead. Similar attributes exist for producing an EC2 image, an LXC container, an OpenStack Nova image, and plenty more. So instead of starting from some vanilla NixOS image and then deploying your system over the top (which can be difficult since things like SSH keys and additional users won’t be set up), you can use nix to build a fully-formed image and deploy that directly.
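A rough sketch of reaching those attributes follows. The file name is made up, ./configuration.nix is assumed to exist, and the VirtualBox module path is my best understanding of where nixpkgs keeps it, so treat it as an assumption. That module also sets its own disk options, which may clash with a configuration written for real hardware.

```nix
# machine.nix (hypothetical file name)
let
  # Evaluate a NixOS configuration into a full system description.
  eval = configuration: import <nixpkgs/nixos> { inherit configuration; };
in {
  # The complete system, as a single store path.
  toplevel = (eval ./configuration.nix).config.system.build.toplevel;

  # The same config plus the VirtualBox image module (assumed path).
  vbox = (eval {
    imports = [
      ./configuration.nix
      <nixpkgs/nixos/modules/virtualisation/virtualbox-image.nix>
    ];
  }).config.system.build.virtualBoxImage;
}
```

Something like nix-build machine.nix -A toplevel should then build the system locally, ready to be pushed with nix-copy-closure.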
Unfortunately, the stuff I just described is not (yet) very accessible to new users. Right now, it’s probably better to follow the manual install instructions (which are a bit tedious) when you’re just starting out, as it is much simpler to understand what’s going on.
… so should I use it?
NixOS is not yet for the faint of heart. It’s probably very different from what you’re currently using, and you can’t step into it gradually (it’s NixOS or null). Debugging system config can occasionally be tricky, since you can’t just go in and modify a file directly - most of the OS is mounted readonly, forcing you to make modifications properly (i.e. via nix config). This is clearly what you want for deployed machines, but can be frustrating during development.
There’s also the question of security updates. There are (I believe) enough NixOS machines running in production that the nixpkgs folks are pretty keen to apply security updates ASAP, but they will probably never be as responsive as the bigger distros. And there is of course much less manpower available to test the stability of updates. But these are both network effects, which will hopefully improve the more people use it.
So if you’re game, do give it a try. Once you’ve got the hang of it, I doubt you’ll ever want to go back to deploying a “normal” linux distribution. I certainly don’t.
And if you can’t yet commit to NixOS, maybe give nix itself a try. You can install it on Linux and OSX, and it won’t interfere with any system software. It can be quite a useful tool for managing consistent development environments, and is a good way to ease into how nix works without jumping straight into NixOS.
[1] Technically, nix is not completely pure. But there are very few sources of impurity, none of which should be a problem for the standard set of packages (nixpkgs). If your own build process is sensitive to the current temperature, you probably have bigger issues than build impurity.
[2] A problematic example: running apt-get update && apt-get upgrade will apply security fixes the first time it is built, but docker will use that cached result in the future (even if it’s months old). This makes it hard to be sure that a docker image actually includes the latest security updates.
[3] There is a minor edge case where files in /etc that require special permissions (like 600 for /etc/sudoers.d/) may remain after being removed from the configuration. As long as you don’t need to make your own files in /etc with particular permissions, this won’t affect you.
