This is my first post about content-addressability in Nix — a long-awaited feature that is hopefully coming soon! In this post I will show you how this feature will improve the Nix infrastructure. I’ll come back in another post to explain the technical challenges of adding content-addressability to Nix.
Nix has a wonderful model for handling packages. Because each derivation is stored under (aka addressed by) a unique name, multiple versions of the same library can coexist on the same system without issues: each version of the library has a distinct name, as far as Nix is concerned.
What’s more, if openssl is upgraded in Nixpkgs, Nix knows that all the packages that depend on openssl (i.e., almost everything) must be rebuilt, if only so that they point at the name of the new openssl version. This way, a Nix installation will never feature a package built for one version of openssl but dynamically linked against another: as a user, it means that you will never have an undefined symbol error. Hurray!
The input-addressed store
How does Nix achieve this feat? The idea is that the name of a package is derived from all of its inputs (that is, the complete list of dependencies, as well as the package description). So if you change the git tag from which openssl is fetched, the name of openssl changes. And if the name of openssl changes, then the name of any package which has openssl in its dependencies changes too.
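To make the propagation concrete, here is a minimal sketch of input addressing. It is an illustration only: the `input_address` function, the recipe strings and the name format are assumptions made up for this example, and Nix's real store path computation is more involved.

```python
import hashlib


def input_address(name, recipe, dep_addresses):
    """Hypothetical naming scheme: hash the recipe and the names of all dependencies."""
    h = hashlib.sha256()
    h.update(name.encode())
    h.update(b"\0")
    h.update(recipe.encode())
    for dep in sorted(dep_addresses):
        h.update(b"\0")
        h.update(dep.encode())
    return f"{h.hexdigest()[:32]}-{name}"


# Changing the tag that openssl is fetched from changes the name of openssl...
openssl_old = input_address("openssl", "fetch tag A; make", [])
openssl_new = input_address("openssl", "fetch tag B; make", [])

# ...and therefore the name of everything that lists openssl as a dependency,
# even though the recipe of curl itself is untouched.
curl_old = input_address("curl", "configure; make", [openssl_old])
curl_new = input_address("curl", "configure; make", [openssl_new])

print(openssl_old != openssl_new)
print(curl_old != curl_new)
```

Both comparisons hold, because the name of a dependency is itself part of the hashed inputs: a change anywhere in the dependency graph renames everything downstream of it.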
However, this can be very pessimistic: even changes that aren’t semantically meaningful can imply mass rebuilding and downloading. As a slightly extreme example, this merge-request on Nixpkgs makes a tiny change to the way openssl is built. It doesn’t actually change openssl, yet it requires rebuilding an insane number of packages because, as far as Nix is concerned, all these packages have different names and hence are different packages. In reality, though, they aren’t.
Nevertheless, the cost of the rebuild has to be borne by the Nix infrastructure: Hydra builds all packages to populate the cache, and all the newly built packages must be stored. This costs both time and money (in CPU power and storage space).
Unnecessary rebuilds?
Most distributions, by default, don’t rebuild packages when their dependencies change, and have a (more-or-less automated) process to detect changes that require rebuilding reverse dependencies. For example, Debian tries to detect ABI changes automatically and Fedora has a more manual process. But Nix doesn’t.
The issue is that the notion of a “breaking change” is a very fuzzy one. Should we follow Debian and consider that only ABI changes are breaking? This criterion applies only to shared libraries and, as the Debian policy acknowledges, only to “well-behaved” programs. So if we follow this criterion, there is still a need for manual curation, which is precisely what Nix tries to avoid.
The content-addressed model
Quite happily, there is a criterion that avoids many useless rebuilds without sacrificing correctness: detecting when a change in a package (or in one of its dependencies) yields the exact same output.
That might seem like an edge case, but the openssl example above (and many others) shows that there’s a practical application to it.
As another example, go depends on perl for its tests, so an upgrade of perl requires rebuilding all the Go packages in Nixpkgs, although it most likely doesn’t change the output of the go derivation.
But for Nix to recognise that a package is not a new package, the new, unchanged openssl or go packages must have the same name as the old version. Therefore, the name of a package must not be derived from its inputs, which have changed; instead, it should be derived from the content of the compiled package. This is called content addressing.
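As a rough sketch of the idea, the snippet below names a package after a hash of its build output instead of its inputs. Everything in it is hypothetical: `content_address`, `build_go` and the version strings are invented for illustration, and a real store would hash a serialisation of the whole output directory rather than a single byte string.

```python
import hashlib


def content_address(name: str, output: bytes) -> str:
    """Hypothetical naming scheme: hash what the build produced, not what went into it."""
    digest = hashlib.sha256(output).hexdigest()[:32]
    return f"{digest}-{name}"


def build_go(perl_version: str) -> bytes:
    """Toy build: perl only runs the test suite, so it leaves no trace in the result."""
    _tests_run_with = perl_version  # used during the build, never written to the output
    return b"go toolchain binaries"


before = content_address("go", build_go("perl-old"))
after = content_address("go", build_go("perl-new"))

# Same output bytes, hence the same name: nothing downstream has to change.
print(before == after)
```

The flip side is that the name is only known once the build has finished, whereas an input-derived name can be computed before building anything.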
Content addressing is how you can be sure that when you and a colleague on the other side of the world type git checkout 7cc16bb8cd38ff5806e40b32978ae64d54023ce0, you actually have the exact same content in your tree. Git commits are content addressed, therefore the name 7cc16bb8cd38ff5806e40b32978ae64d54023ce0 refers to that exact tree.
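For the curious, the way Git names a single file (a “blob”) is easy to reproduce: the identifier is the SHA-1 of a short header followed by the file content. The sketch below assumes a repository using Git's default SHA-1 object format. Commits and trees are hashed in the same spirit, with a commit containing the identifier of its tree.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object id Git assigns to a file with this content (SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Identical content gives the identical name, on any machine.
print(git_blob_id(b"hello\n") == git_blob_id(b"hello\n"))
# Any change to the content gives a different name.
print(git_blob_id(b"hello\n") != git_blob_id(b"hello!\n"))
```

The same value can be obtained from Git itself with git hash-object, which is a convenient way to check the sketch against a real installation.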
Yet another example of content-addressed storage is IPFS. In IPFS, files can be stored on any number of computers, and even moved from computer to computer. The content-derived name gives each file an intrinsic name, regardless of where it is stored.
In fact, even the particular use case that we are discussing here (avoiding recompilation when a rebuilt dependency hasn’t changed) can be found in various build systems such as Bazel. In build systems, such recompilation avoidance is sometimes known as the early cutoff optimization (see the build systems a la carte paper, for example).
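Here is a toy illustration of early cutoff, under heavy simplifying assumptions: the `Builder` class, the rule format and the recipe strings are all made up for this sketch and do not correspond to how Bazel or Nix are implemented. A target is rebuilt only when its own recipe or the content of a dependency's output has changed.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Builder:
    """Toy build system; rules map a target to (recipe, dependencies, action)."""

    def __init__(self):
        self.cache = {}    # target -> (key, output)
        self.rebuilt = []  # targets whose action actually ran

    def build(self, target, rules):
        recipe, deps, action = rules[target]
        dep_outputs = [self.build(dep, rules) for dep in deps]
        # The key depends on the *content* of the dependencies' outputs,
        # not on how those outputs were produced.
        key = (digest(recipe.encode()), tuple(digest(out) for out in dep_outputs))
        cached = self.cache.get(target)
        if cached is not None and cached[0] == key:
            return cached[1]  # early cutoff: nothing relevant changed
        output = action(dep_outputs)
        self.cache[target] = (key, output)
        self.rebuilt.append(target)
        return output


def rules(openssl_recipe):
    return {
        "openssl": (openssl_recipe, [], lambda deps: b"libssl.so contents"),
        "curl": ("configure; make", ["openssl"], lambda deps: b"curl using " + deps[0]),
    }


builder = Builder()
builder.build("curl", rules("make"))
first = list(builder.rebuilt)

builder.rebuilt.clear()
builder.build("curl", rules("make  # cosmetic tweak"))
second = list(builder.rebuilt)

print(first)   # ['openssl', 'curl']
print(second)  # ['openssl']
```

On the second run the recipe of openssl has changed, so openssl is rebuilt, but its output is byte-for-byte identical, so the rebuild stops there and curl is reused from the cache.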
So all we need to do is to move the Nix store from an input-addressed model to a content-addressed model, as used by many tools already, and we will be able to save a lot of storage space and CPU usage, by rebuilding many fewer packages. Nixpkgs contributors will see their CI time improved. It could also allow serving a binary cache over IPFS.
Well, like many things with computers, this is actually way harder than it sounds (which explains why this hasn’t already been done despite being discussed nearly 15 years ago in the original paper), but we now believe that there’s a way forward… more on that in a later post.
Conclusion
A content-addressed store for Nix would help reduce the insane load that Hydra has to sustain. While content-addressing is a common technique both in distributed systems and build systems (Nix is both!), getting to the point where it was feasible to integrate content-addressing in Nix has been a long journey.
In a future post, I’ll explain why it was so hard, and how we finally managed to propose a viable design for a content-addressed Nix.