If you follow the open source community, and especially the Linux community, you might have come across the term reproducible (or repeatable) builds. The claim behind it: you should be able to verify that a published project’s binary really is the exact result of building a certain version of the project’s sources. Sounds easy, doesn’t it?
In practice it is not that easy, since non-trivial build setups tend to produce different binary results on each run. The reasons range from different build tool versions to influences from the build environment, such as the operating system, installed libraries, or even the wall clock time. So to be absolutely certain that a binary originates from a certain version of the source code and does not contain any unexpected modifications, you would have to build the sources yourself.
The idea behind the reproducible build movement is to ensure that the build process will always produce exactly the same (binary) result, regardless of the build environment and time when the build is run. This would allow you to use published binaries from any source and delegate the verification to another third party that you trust.
- Alice creates an OSS project bananas on GitHub and releases version 1.0, both as a git tag and as a binary.
- Bob likes bananas and wants to use Alice’s binary, but does not know whether it really contains only the compiler output of the source code from the 1.0 tag of the bananas repo. Who knows if Alice really published bananas and not something that just looks like bananas from the outside.
- Now Bob asks his trusted buddy Trent to build the bananas repository.
- Trent executes the build and publishes a hash over the produced binary artifact.
- Bob can now simply compare the hash of Alice’s binary with the hash published by Trent. If the hashes match, the binary is trustworthy – it’s really bananas.
In this example, Trent could even be a public service that actively builds different projects and publishes verification reports. If there is no Trent you could trust, you could still verify it on your own.
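The comparison step Bob performs can be sketched in a few lines of Java, assuming both parties have the artifact as a file (the class and method names here are illustrative, not part of any real tool):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class ArtifactHash {
    // Computes a lowercase hex SHA-256 digest over the given bytes.
    static String sha256Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Bob hashes Alice's binary and compares it to the hash Trent published.
        String aliceHash = sha256Hex(Files.readAllBytes(Path.of(args[0])));
        String trentHash = args[1];
        System.out.println(aliceHash.equals(trentHash) ? "really bananas" : "not bananas!");
    }
}
```

The whole scheme only works if the hash function is cryptographically strong – with SHA-256, matching hashes mean matching binaries for all practical purposes.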
To ensure this process works, your build process must act like a deterministic transformation from source to binary code: For the same set of source files it must always create the same data – byte exact.
This idea has some problems, though. The build environment must always be the same, no matter the machine. But not every compiler, or even every compiler version, produces the same output for the same input. This is where most of the effort in this topic is spent: describing and automating the build environment setup and making it part of the source code, so that everyone can take any version of the source code, build it, and get the same result without manually recreating the expected build environment.
This whole discussion has mostly been happening around applications and libraries related to privacy and security. Prominent examples in the Java world are the open source secure messaging app Signal and the encryption library bouncycastle – exactly the kind of software a malicious third party would have an interest in compromising, to eavesdrop on information or to weaken secure communication. But if we are honest, such malicious code could lurk inside any popular library.
The whole motivation on this topic is nicely wrapped up at reproducible-builds.org.
And what about the JVM?
Compared to C or C++ compilers and toolchains, Java is fortunately a lot simpler in these regards. There are fewer compilers, and only a handful of build tools are noteworthy.
But aren’t Java builds already reproducible?
Does javac HelloWorld.java always produce the same class file? Does javac -cp ... src/main/java/**/* always produce the same set of class files?
At first we can observe that at least the javac from the Oracle JDK behaves deterministically: invoking the same version of javac with the same input parameters and the same input Java files will always produce the same output class files. So if we pin the version of the building JDK, we can at least achieve reproducibility for our class files.
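This claim is easy to check empirically, assuming a JDK is installed: compile the same source twice and compare the resulting class files byte by byte (the class JavacDeterminism and its helper are illustrative, not an existing tool):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class JavacDeterminism {
    // Compiles the given source in a fresh temp directory and returns the class file bytes.
    static byte[] compile(String className, String source) throws Exception {
        Path dir = Files.createTempDirectory("javac-check");
        Path src = dir.resolve(className + ".java");
        Files.writeString(src, source);
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        int result = javac.run(null, null, null, src.toString());
        if (result != 0) throw new IllegalStateException("compilation failed");
        return Files.readAllBytes(dir.resolve(className + ".class"));
    }

    public static void main(String[] args) throws Exception {
        String source = "public class Hello { public static void main(String[] a) {"
                + " System.out.println(\"hi\"); } }";
        byte[] first = compile("Hello", source);
        byte[] second = compile("Hello", source);
        // Same javac version, same input: the class files are byte-identical.
        System.out.println("identical: " + Arrays.equals(first, second));
    }
}
```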
Things get tricky when we start to bundle our classes into JAR files. According to the first sentence of the JAR File Specification, JAR files are merely ZIP containers with some mandatory files and a mandatory directory structure. However, the ZIP file specification (the APPNOTE from PKWARE, or ISO/IEC 21320-1:2015) mandates that the local file modification timestamp of each ZIP entry is written into the entry’s descriptor. This modification timestamp is encoded as an MS-DOS timestamp (starting 01/01/1980, with 2-second accuracy). That means each created ZIP (and hence also each JAR) will differ on a binary level depending on the modification timestamps of the input files.
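The effect can be demonstrated with the JDK’s own java.util.zip API: archiving identical content with different entry timestamps yields different bytes (a minimal sketch; archiveWithTimestamp is a made-up helper name):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipTimestamps {
    // Packs a single entry with the given modification time and returns the archive bytes.
    static byte[] archiveWithTimestamp(byte[] content, long modificationTime) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            ZipEntry entry = new ZipEntry("HelloWorld.class");
            entry.setTime(modificationTime); // written into the entry's descriptor
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "identical class file content".getBytes();
        byte[] monday = archiveWithTimestamp(content, 1_000_000_000_000L);
        byte[] nextDay = archiveWithTimestamp(content, 1_000_100_000_000L);
        // Same input bytes, different timestamps: the archives differ.
        System.out.println("archives identical: " + Arrays.equals(monday, nextDay));
    }
}
```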
Another problem is that some JAR bundlers (e.g. the maven-archiver-plugin) may parallelize the compression of JAR file entries, which leads to a non-deterministic order of ZIP file entries and hence to binary-level differences.
Another factor is the file system’s directory order. Some operating systems enumerate directories alphabetically, while others enumerate them by the underlying inode numbers. JAR bundlers that simply compress a directory of class files may therefore also create different results on different platforms.
Make your Maven build reproducible
As mentioned before, Maven suffers from some behaviours that cause different build artifacts on a binary level.
A simple solution is to take all archive artifacts generated by Maven (JAR and WAR files) and:
- uncompress them
- sort the files in alphabetical order
- recompress them and set the modification timestamp of every entry to a constant value (e.g. 01/01/1980, the start of the MS-DOS timestamp range)
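The three steps above can be sketched with plain JDK classes. This is a simplified illustration, not the real plugin code: normalize, sampleArchive and CONSTANT_TIME are made-up names, and details such as compression levels and directory entries are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class RepackJar {
    // 01/01/1980 in epoch milliseconds (UTC), the start of the MS-DOS time range.
    static final long CONSTANT_TIME = 315_532_800_000L;

    // Uncompress all entries, sort them alphabetically, recompress with a constant timestamp.
    static byte[] normalize(byte[] archive) throws Exception {
        // 1. uncompress into a map that keeps its keys in alphabetical order
        Map<String, byte[]> entries = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry entry; (entry = in.getNextEntry()) != null; ) {
                entries.put(entry.getName(), in.readAllBytes());
            }
        }
        // 2. + 3. rewrite the entries in sorted order with a constant modification time
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                ZipEntry entry = new ZipEntry(e.getKey());
                entry.setTime(CONSTANT_TIME);
                zip.putNextEntry(entry);
                zip.write(e.getValue());
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }

    // Test helper: builds an archive with entries in the given order and timestamp,
    // deriving each entry's content from its name.
    static byte[] sampleArchive(long time, String... names) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                ZipEntry entry = new ZipEntry(name);
                entry.setTime(time);
                zip.putNextEntry(entry);
                zip.write(name.getBytes());
                zip.closeEntry();
            }
        }
        return out.toByteArray();
    }
}
```

Two archives that differ only in entry order and timestamps become byte-identical after this normalization.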
This behaviour could be supported by the Maven archiver one day – without the extra repackaging step – but until then, we can use the Reproducible Build Maven Plugin, which fulfills exactly this purpose.
Additionally, some older versions of Maven (see MSHARED-494) place the build timestamp into the generated pom.properties file. This behaviour has been fixed in maven-archiver 3.1.0 and can be avoided by using a recent Maven version or by forcing a minimal version of the Maven archiver:
<build>
  <plugins>
    <plugin>
      <artifactId>maven-jar-plugin</artifactId>
      <dependencies>
        <dependency>
          <!-- MSHARED-494: avoid timestamps in pom.properties -->
          <groupId>org.apache.maven</groupId>
          <artifactId>maven-archiver</artifactId>
          <version>3.1.1</version>
        </dependency>
      </dependencies>
    </plugin>
  </plugins>
</build>
Applying these simple steps will make your Maven build artifacts reproducible.
Beyond security: (Snapshot-)Artifact caching
Besides the trust benefits, reproducible builds give you some other nice properties. As you might know from the domain of functional programming: if a function is free of observable side effects and always returns the same output for the same input, it is pure, and you are free to memoize the result of an evaluation instead of evaluating it again and again.
What if we could see our build as such a pure function? Then it would not matter whether you recompile your code or use an existing binary that was compiled earlier – they will be the same. And to state the obvious: two executable binaries with the same bytes will always behave the same at runtime (given the same execution environment).
This idea can be highly beneficial for build tools working on projects that span multiple teams and modules. Especially in distributed build environments (which typically start once you use a CI server, local developer machines, and a shared artifact repository), you can save a lot of time and bandwidth by no longer copying SNAPSHOT artifacts around that logically have not changed since you last up- or downloaded them – using simple things like file hashes, which are already incorporated in protocols such as HTTP and its caching mechanisms.
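The caching idea can be sketched as a tiny content-addressed store, assuming artifacts are keyed by their hash before being transferred (ArtifactCache and upload are made-up names, standing in for a shared artifact repository):

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ArtifactCache {
    // Maps content hashes to already-stored artifacts, like a shared repository would.
    private final Map<String, byte[]> store = new HashMap<>();

    // Computes a lowercase hex SHA-256 digest over the given bytes.
    static String sha256Hex(byte[] data) throws Exception {
        StringBuilder hex = new StringBuilder();
        for (byte b : MessageDigest.getInstance("SHA-256").digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true if the artifact actually had to be transferred,
    // false if an identical binary was already present and could be reused.
    boolean upload(byte[] artifact) throws Exception {
        return store.putIfAbsent(sha256Hex(artifact), artifact) == null;
    }
}
```

With reproducible builds, a rebuilt-but-unchanged SNAPSHOT hashes to the same key as before, so the transfer can simply be skipped.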
Further downstream usage of your artifacts may benefit from reproducible builds and their binary stability as well. Assume your project uses Docker images to QA-test your team’s artifacts, and these images are recreated for each snapshot artifact you produce. This creates a lot of potentially big images that have to be transferred between servers and eventually cleaned up. But if parts of your build produce artifacts that are identical to those of previous builds, a Docker build can simply reuse the old images, since they are identical.
We have seen that Java builds are non-deterministic for no good reason, but that there are workarounds to fix this. That could enable a verified build ecosystem around Java dependencies. Reproducible builds are also beneficial for larger build pipelines: they help to detect unchanged dependencies and to avoid unnecessary partial rebuilds.