Publishing code to the world is easy these days: take your code, tack some extra metadata onto it, call the result a package and upload it to npmjs.com / crates.io / Hackage / etc. It’s also easy for other developers to include your code in their projects: build tools are available to pull your package from where you published it, along with any packages that yours in turn depends on, and build everything in the right order. Haskellers have the good fortune of having at least two great options to choose from when it comes to build tools: Stack and cabal-install. Both support building Haskell code well, but they have something else in common: they are very much Haskell-centric.
In short, these tools work great for projects that are a) meant to be open source or easily open-sourceable, b) small and c) mostly or exclusively Haskell. Over the past couple of months, we have been focused on building a solution for the opposite use case: corporate monorepos, which are a) closed source, b) large and c) almost always a mix of very many languages. It turns out that Google, Facebook, Twitter and many others have needed a solution to this use case for many years now. They came up with Bazel, Buck, Pants and many others, respectively. So all we needed to do was add Haskell support to one of these existing solutions. We chose Google’s Bazel.
Polyglot monorepos
Beyond a certain size, it’s unrealistic to expect that the most cost-effective way to implement a project is to do so using only one language. On many of our customer projects, we end up in practice with a mix of Haskell, Java, Scala, C/C++, R, Python and even FORTRAN, quite simply because it’s much cheaper to reuse existing code, or to play to the particular strengths of a programming language, than to reimplement everything using a uniform stack.
Case in point: we have multiple projects involving compute-intensive models of (bio)physical phenomena. These models are typically implemented in Haskell, with some basic numerical routines offloaded to C and FORTRAN. For faster response times, we transparently distribute these workloads on multiple machines using Apache Spark, which is written in Scala and Java. We do so using sparkle, which in turn uses inline-java and related packages for high-performance bridging between Haskell and the JVM.
What Spark wants is a full-fledged application with a Java-style `main` function as the entrypoint,
```java
public static void main(String[] args) {
    // Call Haskell code
    ...
}
```
packaged as a JAR file. So to build an application that we can run on many nodes of a large cluster, we need to build C or FORTRAN numerical routines as a library, build a Haskell binary dynamically linked to said library, then build the Java glue code as a JAR and finally perform cunning tricks to inject the Haskell binary into the resulting JAR.
Today, we do this with the help of multiple different build systems (one for each implementation language used) that have no knowledge of each other, along with some ad hoc scripting. Each build system represents build targets as nodes in a dependency graph, but each system’s dependency graph is entirely opaque to every other system. No single system has an overall view of the dependency graph. This situation has a number of drawbacks:
- Continuous integration builds are slow, due to lost parallelization opportunities. If you have many cores available, then the more detailed a build system’s knowledge of dependencies, the better it can parallelize build tasks across all those cores. For example, Stack and cabal-install can build multiple packages in parallel, but when the package dependency graph is mostly one long chain, packages will be built one after the other on a single core. It’s tempting to parallelize both at the Stack/Cabal level and at the GHC level, but that can lead to oversubscribing memory and CPU cores and, ultimately, thrashing (not to mention enduring scaling bugs that kill throughput).
- Partial rebuilds are often wrong, because it’s hard to verify that all dependencies across build systems were accurately declared. For example, if Cabal doesn’t know that Java files are also dependencies, and that Gradle needs to be called again when they change, then partial rebuilds can produce incorrect results.
- Conversely, opportunities for partial rebuilds instead of full rebuilds are sometimes lost, leading to slow rebuilds. This happens when dependencies are accurately declared, but not precisely declared. For instance, it’s a shame to rebuild all of a dependent package B, along with anything that depends on it, after a simple change to package A that did not cause any of the interface files for modules in A to change.
- The resulting system is brittle, complex and therefore hard to maintain, because it’s an accumulation of multiple build subsystems configured in ways custom to each language. You have a `stack.yaml` file in YAML syntax for Stack, a `package.cabal` file in Cabal syntax for each package, a `build.gradle` file that is a Groovy script or a `pom.xml` file in XML syntax for Java packages, `Makefile`s and autoconf-hell for C/C++, etc.
Wouldn’t it be great if, instead of this inefficient mishmash of build systems, we could have a single build system dealing with everything, using a single configuration drawn from files in a uniform and easy-to-read syntax? Wouldn’t it be great if these build system configuration files were all very short instantiations of standard rules that succinctly encapsulate best practices for how to build C/C++ libraries, Java or Scala packages, Haskell apps, etc.?
What we’re after is a uniform way to locate, navigate and build code, no matter the language. A good first step in this direction is to store all code in a single repository. Dan Luu says it this way:
With a monorepo, projects can be organized and grouped together in whatever way you find to be most logically consistent, and not just because your version control system forces you to organize things in a particular way. […] A side effect of the simplified organization is that it’s easier to navigate projects. The monorepos I’ve used let you essentially navigate as if everything is on a networked file system, re-using the idiom that’s used to navigate within projects. Multi repo setups usually have two separate levels of navigation – the filesystem idiom that’s used inside projects, and then a meta-level for navigating between projects.
All code in your entire company universe is readily accessible with a simple `cd some/code/somewhere`. It then becomes natural to make each code directory map to one code component, buildable independently and uniformly. All we need is a build system that can scale to building all the code everywhere as fast as possible, or indeed any component of it, where code components simply map to directories on the filesystem. We expect build configuration to be modular: each code component is packaged in a single directory with a single, succinct, declarative `BUILD` file describing how to build everything inside, always using the same build configuration syntax. No brittle multi-language build scripts that require specialist knowledge to hack on and induce a terrible bus factor on project maintenance and development.
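To make this concrete, here is a sketch (with hypothetical paths and names) of such a component: one directory, one short declarative description of its contents, written in the Bazel `BUILD` syntax introduced below.

```python
# some/code/somewhere/BUILD -- hypothetical example. The directory is
# the component; this file declares how to build everything inside it.
cc_library(
    name = "somewhere",
    srcs = ["somewhere.c"],
    hdrs = ["somewhere.h"],
)
```

Building the component is then uniform, no matter the language: `bazel build //some/code/somewhere`.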
That’s how development can scale to large sizes. In fact these monorepos can grow very large indeed.
Google Bazel
One large monorepo facilitates writing tooling that will work with the entire thing. Google did just that with their Blaze build system, open sourced in 2015 as Bazel (the two systems are technically distinct but share most of the code). As we touched upon above, a crucial property of building code at scale is that partial rebuilds should be correct and fast. This implies that dependencies should be declared accurately and precisely (respectively).
Bazel tries hard to offer guarantees that dependencies are at least complete (this property is called build hermeticity). It does so by sandboxing builds, making only the declared inputs to a build action available. In this way, if a build action in reality depends on any undeclared inputs, it will consistently fail, because anything outside of the sandbox is simply not available. This has an important consequence: even with huge amounts of code, developers can be very confident indeed that partial rebuilds will yield exactly the same result as full rebuilds, as should be the case when all dependencies are correctly specified. Developers never need to `make clean` just in case, so they very seldom have to wait for a full rebuild.
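As an illustration, here is a sketch using `genrule` (a generic Bazel rule; the file names are hypothetical) of a build action that cheats by reading a file it never declared:

```python
# The command reads b.txt, but only a.txt is declared as an input.
# Inside the sandbox, b.txt simply does not exist, so this target
# fails consistently instead of working by accident on some machines.
genrule(
    name = "concat",
    srcs = ["a.txt"],
    outs = ["out.txt"],
    cmd = "cat $(location a.txt) b.txt > $@",
)
```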
Better still, Bazel has good support for local and distributed caching, optimized dependency analysis and parallel execution, so you get fast, incremental builds provided dependencies are declared precisely. Bazel is clever about this. Rules can have multiple outputs produced by sequences of actions, but Bazel will rerun precisely those actions that produce the smallest subset of outputs strictly necessary for the top-level target given on the command line.
Bazel is very serious about recompilation avoidance. Because it knows the entirety of the dependency graph, it can detect precisely which parts of the build graph are invariant under a rebuild of some of the dependencies. For example, it will relink your Haskell app but won’t rebuild any Haskell code if only a C file changed without altering any C header files, even if this was a C file of a package deep in the dependency graph.
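In `BUILD` terms, the scenario just described looks something like this sketch (hypothetical targets; `haskell_binary` is one of the Haskell rules presented below):

```python
# If only fast_math.c changes, and fast_math.h does not, Bazel rebuilds
# :fastmath and relinks :app, but recompiles no Haskell module at all.
cc_library(
    name = "fastmath",
    srcs = ["fast_math.c"],
    hdrs = ["fast_math.h"],
)

haskell_binary(
    name = "app",
    srcs = ["Main.hs"],
    deps = [":fastmath"],
    prebuilt_dependencies = ["base"],
)
```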
By using Bazel, you get to piggyback on all the work towards performance tuning, improving scalability and tooling development (static analysis, documentation generators, debuggers, profilers, etc.) done by Google engineers over a period of nearly 10 years, and, since its open sourcing, by a community of engineers at companies using Bazel, such as Stripe, Uber, Asana and Dropbox.
Better still, Bazel already has support for building a variety of languages, including C/C++, Rust, Scala, Java, Objective-C, etc. By using Bazel, we get to reuse best practices for building each of these languages. And focus entirely on Haskell support.
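Returning to the Spark scenario above, all three pieces could in principle live in one `BUILD` file. The following is only a sketch with hypothetical names: `java_binary` and `cc_library` are standard Bazel rules, but packaging the Haskell binary via `resources` is an assumption on our part, standing in for the cunning tricks mentioned earlier.

```python
java_binary(
    name = "spark-app",
    srcs = ["Main.java"],
    main_class = "Main",
    # Ship the Haskell binary inside the resulting JAR.
    resources = [":model"],
)

haskell_binary(
    name = "model",
    srcs = ["Model.hs"],
    deps = [":numerics"],
    prebuilt_dependencies = ["base"],
)

cc_library(
    name = "numerics",
    srcs = ["numerics.c"],
    hdrs = ["numerics.h"],
)
```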
How we added Haskell support
Bazel uses a subset of Python syntax for `BUILD` files. Each “component” in your project typically gets one such file. Here is an example build description for inline-java, involving the creation of one C library and one Haskell library:
```python
cc_library(
    name = "bctable",
    srcs = ["cbits/bctable.c"],
    hdrs = ["cbits/bctable.h"],
)

haskell_library(
    name = "inline-java",
    src_strip_prefix = "src",
    srcs = glob(["src/**/*.hs", "src/**/*.hsc"]),
    deps = ["//jni", "//jvm", ":bctable"],
    prebuilt_dependencies = [
        "base",
        "bytestring",
        ...
        "template-haskell",
        "temporary",
    ],
)
```
That’s it! Each target definition is an instantiation of some rule (`cc_library`, `haskell_library`, etc.). Each rule is either primitive to Bazel or implemented in an extension language (also a subset of Python) called Skylark. These rules are not meant to be highly configurable or particularly general: they are meant to capture best practices once and for all. This is why `BUILD` files are typically quite small. Unlike a `Makefile`, you won’t find actual commands in a `BUILD` file. These files focus purely on the what, not the how, which is fairly constrained, by design.
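To give a flavour of the extension language, here is a minimal Skylark rule (purely illustrative; this is not how the Haskell rules are actually implemented). The implementation function declares outputs and registers actions, while Bazel takes care of scheduling, sandboxing and caching:

```python
def _hello_impl(ctx):
    # Declare the single output file and the action producing it.
    out = ctx.actions.declare_file(ctx.label.name + ".txt")
    ctx.actions.run_shell(
        outputs = [out],
        command = "echo 'Hello, %s!' > %s" % (ctx.attr.greeting, out.path),
    )
    return [DefaultInfo(files = depset([out]))]

hello = rule(
    implementation = _hello_impl,
    attrs = {"greeting": attr.string(default = "world")},
)
```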
We have rules for building Haskell libraries (packages), binaries, tests and Haddock documentation. Libraries and binaries can freely take C/C++ or Java targets as dependencies, or use preprocessors such as `hsc2hs`. Yet all in all, the entire support for building Haskell currently weighs in at barely 2,000 lines of Skylark code. The heavy lifting is done by Bazel itself, which implements once and for all, in a way common to all languages, `BUILD` file evaluation, dependency graph analysis and parallel execution.
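For instance, the test and Haddock rules can be used along the following lines (a sketch: the target names are hypothetical and the attribute sets are simplified):

```python
haskell_test(
    name = "spec",
    srcs = ["tests/Spec.hs"],
    deps = [":inline-java"],
    prebuilt_dependencies = ["base", "hspec"],
)

haskell_doc(
    name = "docs",
    deps = [":inline-java"],
)
```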
Next steps
At this point, rules_haskell, the set of Bazel rules for Haskell, is in beta. We’ve been dogfooding it internally on a few projects, and today Adjoint.io and a few others have already deployed it for their CI. TokTok is building much of their entire software stack using Bazel. We have not implemented special support for building packages from Hackage, instead relying on existing tools (in particular Nix) to provide those: we use Bazel to build our own code only, not upstream dependencies. But there are experiments underway to use Bazel for upstream dependencies as well. We encourage you to experiment with rules_haskell and report any issues you find. We’d love your feedback!
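A sketch of what the Nix side of this can look like, using Tweag’s rules_nixpkgs in the WORKSPACE file (the GHC attribute path is an assumption and pinning of the nixpkgs revision is elided; see the rules_haskell documentation for an actual setup):

```python
load("@io_tweag_rules_nixpkgs//nixpkgs:nixpkgs.bzl", "nixpkgs_package")

# GHC (and with it the Hackage packages we depend on) is provisioned by
# Nix and exposed to Bazel as an external repository; Bazel itself only
# builds first-party code.
nixpkgs_package(
    name = "ghc",
    attribute_path = "haskell.compiler.ghc822",
)
```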
In future posts, we’ll explore:
- using Bazel’s remote cache to very aggressively, yet safely, cache everything that can be cached, to further speed up CI builds,
- playing to the respective strengths of both Bazel and Nix for truly reproducible builds,
- packaging deployable binaries and containers.
Learn more about building polyglot projects using Bazel in our post Nix + Bazel = fully reproducible, incremental builds.