1 August 2022 — by Guillaume Genestier
Recompilation avoidance in rules_haskell
Haskell, Bazel, build system

Bazel and rules_haskell

Bazel is an open-source tool to build and test projects. It is particularly well-suited for multilingual monorepos. One strength of Bazel is its extensibility. Anyone can declare a new build rule (or test rule) and distribute it.

Among those rule sets, rules_haskell defines how to build Haskell code.

While programming, one compiles the same project dozens of times per day, often after modifying only one or two files. Recompiling every module each time would be extremely time-consuming and would disrupt the programmer’s workflow.

To identify what should be recompiled, Bazel uses a fairly simple strategy: each action declares the list of inputs it uses, and each time the action is executed, Bazel hashes those inputs and stores the result in a cache. When re-execution of an action is attempted, Bazel first checks for a cache hit, and calls the compiler only if an input file has been modified.
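The strategy can be pictured with a small toy model. This is only a sketch of the idea described above, not Bazel's implementation: the names `digest`, `ActionCache` and `fake_compile` are hypothetical, and file contents are plain strings.

```python
import hashlib


def digest(inputs):
    """Combine the names and contents of all declared inputs into one cache key."""
    h = hashlib.sha256()
    for name in sorted(inputs):
        h.update(name.encode() + b"\0" + inputs[name].encode() + b"\0")
    return h.hexdigest()


class ActionCache:
    """Toy action cache (hypothetical): maps an input digest to a stored output."""

    def __init__(self):
        self.results = {}  # cache key -> output of the action
        self.runs = 0      # number of times the "compiler" was really called

    def run(self, inputs, compile_fn):
        key = digest(inputs)
        if key not in self.results:  # cache miss: some input changed
            self.runs += 1
            self.results[key] = compile_fn(inputs)
        return self.results[key]


def fake_compile(inputs):
    # Stand-in for a compiler invocation.
    return "object code for " + ", ".join(sorted(inputs))


cache = ActionCache()
cache.run({"A.hs": "a = 1"}, fake_compile)  # miss: first build
cache.run({"A.hs": "a = 1"}, fake_compile)  # hit: nothing changed
cache.run({"A.hs": "a = 2"}, fake_compile)  # miss: A.hs was modified
print(cache.runs)  # prints 2
```

The second call is answered from the cache, so the stand-in compiler runs only twice for three build requests.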

One could think that, since Bazel provides such a mechanism to prevent superfluous recompilation, developers of a set of Bazel rules (like rules_haskell) need not worry about it.

Well, the story in this blog post would be quite uninteresting if things were so simple.

When one compiles a Haskell module, GHC needs the source code of the module and the interface files of all its dependencies[1]. Hence, those files are given as inputs to the Bazel action compiling a Haskell module. However, the interface file of a module contains the hashes of the interfaces of all the modules it depends on. As a result, with the previous version of rules_haskell, a modification in one file triggered the recompilation of all the modules that transitively depend on it.
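The cascade can be reproduced with a toy model. The sketch below is an assumption-laden simplification: a real interface file contains far more than this, and the graph `DEPS` and the function `interface` are hypothetical names introduced for illustration.

```python
import hashlib


def h(text):
    return hashlib.sha256(text.encode()).hexdigest()


# Hypothetical dependency graph: C imports B, and B imports A.
DEPS = {"A": [], "B": ["A"], "C": ["B"]}


def interface(module, sources):
    """Toy interface file: a hash of the module's own source, followed by
    the hashes of the interfaces of the modules it imports."""
    dep_hashes = [h(interface(dep, sources)) for dep in DEPS[module]]
    return h(sources[module]) + "|" + ",".join(dep_hashes)


before = {"A": "a = 1", "B": "b = A.a", "C": "c = B.b"}
after = dict(before, A="a = 2")  # only A.hs is modified

# B.hs is untouched, yet B's interface changes because it embeds A's hash.
b_hi_changed = interface("B", before) != interface("B", after)
print(b_hi_changed)  # prints True
```

Since B's interface is an input of the action compiling C, a cache keyed on all inputs recompiles C as well, even though neither B.hs nor C.hs changed.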

In this post, we explain how the most recent version of rules_haskell reuses the recompilation avoidance mechanism implemented in GHC to save Bazel users from useless recompilation of modules.

How GHC deals with recompilation avoidance

When compiling a module (A.hs), in addition to the object file (A.o), GHC generates an interface file (A.hi). This file is used for sharing inter-module information that would otherwise be difficult to extract from a compiled object file.

An interface file contains various pieces of information, useful in different contexts, among which:

  • The list of symbols the module exports, including the type of all symbols and the hash of their implementation,
  • The list of modules and external packages it depends on,
  • The list of orphan instances.

This file is the one used to know if a module which depends on A should be recompiled after A.hs has been modified. However, not all of the information mentioned above is useful to determine if recompilation is required. For instance, the list of packages a module depends on is not relevant when determining if the modules that depend on it should be recompiled. Similarly, the precise implementation of functions only matters if the module is compiled with inlining turned on.

The relevant bits for the recompilation avoidance mechanism are summarised in an ABI (Application Binary Interface) hash. As explained in the GHC Wiki, “When considering whether or not a module’s dependent modules need to be recompiled due to changes in the current module, a changed ABI hash is a necessary but not sufficient condition for recompilation”[2].

Example

To illustrate the mechanism, let us consider 3 simple files:

-- A.hs
module A (f1, f2, T, a) where

data T = T0 | T1

a :: T
a = T1

f1 :: a -> b -> a
f1 x y = x

f2 :: T -> T
f2 T0 = a
f2 T1 = T1

-- B.hs
module B where

import qualified A (f1, f2, T)

g1 :: A.T -> A.T
g1 x = A.f1 (A.f2 x) x

g2 :: a -> a -> a -> a
g2 x y = A.f1 (A.f1 x) (A.f1 x)

-- C.hs
module C where

import qualified B

data N = Z | S N

h :: a -> a -> a
h x = B.g2 x x

There are many changes one can make to A.hs or B.hs that change the interface file B.hi but do not affect the ABI hash of B, and therefore do not trigger the recompilation of C.hs.

Change the export list of A:

- module A (f1, f2, T, a) where
+ module A (f1, f2, T) where

Change the type of a function of A:

- f1 :: a -> b -> a
- f1 x y = x
+ f1 :: a -> b -> b
+ f1 x y = y

Inline a function of A:

+ {-# INLINE f1 #-}
  f1 :: a -> b -> a

Modify the import list of B:

- import qualified A (f1, f2, T)
+ import qualified A (f1, f2, T, a)

All those modifications affect the part of the interface file of B regarding imports, and hence change B.hi. With the previous version of rules_haskell, C, which depends on B, would therefore have been recompiled. However, those changes do not impact the ABI hash stored in B.hi. Hence, C is not affected by them and is not recompiled when using ghc --make or the most recent version of rules_haskell.

Mimicking this behaviour

Now that we have understood the mechanism used by GHC to decide if recompilation is required, we want to teach Bazel to use it.

ABI files

The information needed to know if recompilation is required is the ABI hash nested inside the interface file, whereas files are the unit Bazel considers to detect modifications. One therefore has to first extract this hash and put it in its own file.

The strategy chosen for this is to first generate the human-readable version of the interface file (using ghc --show-iface A.hi) and then store only the line containing the ABI hash into a file A.abi.
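The extraction step amounts to filtering one line out of the textual dump. The sketch below assumes that this line can be recognised by the text `ABI hash:`. The function name `extract_abi_line` is hypothetical, and the sample text is made up for illustration rather than copied from real GHC output.

```python
def extract_abi_line(show_iface_output):
    """Return the first line mentioning the ABI hash, without surrounding spaces."""
    for line in show_iface_output.splitlines():
        if "ABI hash:" in line:  # assumed marker of the relevant line
            return line.strip()
    raise ValueError("no ABI hash line found")


# Illustrative text only: the layout and the hash values are invented.
SAMPLE = """\
interface A
  interface hash: 1111aaaa
  ABI hash: 2222bbbb
  export-list hash: 3333cccc
"""

abi_file_contents = extract_abi_line(SAMPLE)
print(abi_file_contents)  # prints: ABI hash: 2222bbbb
```

In a real rule, the input text would come from running ghc --show-iface A.hi, and the returned line would be written to A.abi.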

Tweak the caching mechanism with unused_inputs_list

This new file A.abi is then added to the list of inputs required to compile the modules importing A. However, it cannot completely replace the A.hi file: whenever the modification of A.hs is significant enough to affect the ABI hash, GHC requires the whole interface file to compile the other modules.

As Bazel’s caching mechanism inspects all the inputs to know if recompilation should be triggered, adding a new file to the list of inputs can only cause recompilation to occur more often than in the previous state.

This is exactly the opposite of our goal, hence we have to somehow teach Bazel not to inspect all the inputs when deciding if a “target” should be regenerated. Fortunately, there is a mechanism in Bazel which has exactly this effect (even if it was not intended to be used this way): declaring some inputs as “unused”.

When an input occurs in the unused_inputs_list, it is not considered in the computation of the hash of inputs used to decide if regeneration of a target is required. Hence, declaring all the interface files as “unused inputs” allows us to trick the Bazel caching mechanism into inspecting only the associated ABI files[3], and not the interface files, when deciding which targets to regenerate. Furthermore, since the interface files are still in the input list, Bazel still makes them available to GHC when recompilation is needed, despite our tagging them as “unused”.
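The effect of the trick can be sketched with a toy cache whose key skips the “unused” inputs. This models the behaviour described in this post, not Bazel's actual implementation: `cache_key`, `Cache` and `compile_c` are hypothetical names, and file contents are plain strings.

```python
import hashlib


def cache_key(inputs, unused):
    """Hash only the inputs that are not declared as unused."""
    h = hashlib.sha256()
    for name in sorted(inputs):
        if name in unused:
            continue  # ignored when deciding whether to rebuild
        h.update(name.encode() + b"\0" + inputs[name].encode() + b"\0")
    return h.hexdigest()


class Cache:
    def __init__(self):
        self.results = {}
        self.runs = 0

    def run(self, inputs, unused, compile_fn):
        key = cache_key(inputs, unused)
        if key not in self.results:
            self.runs += 1
            # On a miss, the action still sees every input, "unused" ones included.
            self.results[key] = compile_fn(inputs)
        return self.results[key]


def compile_c(inputs):
    # Stand-in for compiling C.hs, which reads the interface file of B.
    return "C.o built against " + inputs["B.hi"]


cache = Cache()
unused = {"B.hi"}
cache.run({"C.hs": "src", "B.hi": "iface v1", "B.abi": "abi 1"}, unused, compile_c)
# B.hi changed but the ABI hash did not: cache hit, C is not recompiled.
out2 = cache.run({"C.hs": "src", "B.hi": "iface v2", "B.abi": "abi 1"}, unused, compile_c)
# The ABI hash changed: cache miss, and the compiler reads the new B.hi.
out3 = cache.run({"C.hs": "src", "B.hi": "iface v3", "B.abi": "abi 2"}, unused, compile_c)
print(cache.runs)  # prints 2
```

Only the change to B.abi triggers a second compilation, and that compilation is performed against the up-to-date interface file.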

It must be noted that the Bazel documentation on unused_inputs_list is very brief, but it mentions that “Any change in those files must not affect in any way the outputs of the action”. Hence, it is quite expected that the inputs listed in this field are ignored when computing the hash for caching. However, it is not clear from the documentation that Bazel may still use those inputs when recompiling.

Closing Remarks

This project was possible thanks to the generous funding from Symbiont. The work presented in this post is built upon haskell_modules, the previous work conducted by Facundo Domínguez to finely identify the inputs required to build each Haskell module.

In this post, we presented a technique to declare some inputs as “irrelevant” when Bazel decides if recompilation is required, applied to the specific case of the GHC compiler. Since this problem seems quite common (recompilation avoidance is a problem that every language has), we expect the technique to find other applications soon. In particular, we hope it will raise awareness in the Bazel community of how useful “irrelevant for caching” inputs are, and lead to a clarification of the purpose of unused_inputs_list (and, if useful, to the creation of a separate ignored_by_caching_inputs optional parameter).


  1. If the module uses Template Haskell or a plugin, this is not sufficient, as explained in the next footnote.

  2. This does not apply if the importing modules use Template Haskell or a plugin, since in this case the result of compiling a module can depend on the implementation of imported modules, not just on their interfaces. In both cases, to decide if a recompilation should be performed, GHC simply relies on the hash of the generated object file rather than just on the interface file.

  3. Even if the target module uses Template Haskell or a plugin, it is safe to hide the interface files from Bazel’s caching mechanism. Indeed, even though recompilation could then be required without any ABI hash having changed, the object files of all the modules it depends on are given as inputs to the Bazel rule compiling this kind of module. Hence, any modification affecting an object file will trigger recompilation, no matter its impact on the ABI hash.

This article is licensed under a Creative Commons Attribution 4.0 International license.