I’ve spent far too long thinking about and optimising the execution speed of CI jobs, mostly within GitLab CI, but the general thinking translates to all systems, there are only so many ways these things can work and generally the handbrakes across them are the same.
This is an attempt to write down some of the thinking and lessons learnt over time so others can have their CI go faster.
Guiding thoughts
- Reliable and fast CI can be a major productivity boost for teams, the more complex the system the bigger the boost.
- Overheads become a burden when work done by a given job grows and/or the number of times a job is run grows
- For a side project where you run a CI build once a week, you don’t care but for a team to get the most out of CI you should.
- There is no silver bullet for making CI fast, delays can come from anywhere. It requires regular attention, i’m trying to cover some of the major handbrakes i’ve encountered.
- When thinking about fast CI, how and where it executes it the most important bit, the execution environment.
- Ideally execution environment of a CI job should be ephemeral, independent and deterministic. Jobs shouldn’t have side-effects (unless by design like publishing an artifact) and should run the same every time given the same inputs (this is rarely the reality tho, few jobs are run in a hermetic way but we try).
- The primary way this is achieved is via an ephemeral container/machine that the CI job runs within, the execution environment is spun up fresh with some known state / set of dependencies already available like NodeJS X.Y or Java X.
- There are lots of ways to do it not like this, but in general I avoid them as they are hard to reason about and create too many other problems, so most/all of the optimisation work will assume each job is executed in an ephemeral independent environment.
- There are many ways to make things ‘faster’, throwing more resources at it is often the simplest, but almost always ‘doing less’ is what to aim for rather than just trying to brute-force it. That requires insight and time though so its a trade-off.
Topics
Future topics
- The role of a task orchestrator (important conceptually)
- need a fast, feature rich and language/ecosystem agnostic way of describing tasks and how they relate
- used in CI to wrap all executions/tasks, and avoid re-work, Gradle + Build Cache and Bazel are examples
- tempting to not bother or implement a language specific way but for bigger projects/teams/patterns doing it consistently is best
- Gradle
- Best thing ever
- Has footguns and is complex
- Worth understanding to use at scale
- More friendly and bigger community than Bazel, much better developer experience
- Can execute anything in a general sense, has deep plugins that understand the language for something like Java/JVM languages
- NodeJS / NPM / Yarn
- node_modules is the problem generally, different with PNP
- yarn helps because more deterministic
- zillions of small files that cause slowness, use mounted squashfs to go fast
- avoid
yarn install/npm ci/npm installvia lock file hashing and execution avoidance
- XCode
- pretty hostile to headless/CI/modern practices until very recently
- hard to cache than most things
- not linux so can’t use squashfs, use
hdiutilinstead. - is sensitive to changes in path, so /builds/
` that some system do is bad
- Integration testing in Kubernetes
- I use
bsycorp/kindbut haven’t looked recently atkube/kind - Want fast ephemeral cluster to deploy stuff into
- Don’t want to have to push images to registry bc that is slow/wasteful
- I use
- Building containers
- Ideally don’t want dockerd, bc privs / inefficient / more than we need
- Really just want to construct and push a tar file, why need all the ceremony?
- Jib looks like a good fit, can build simple images
- Combine with Gradle for super powers, made a plugin to make it great.
- Monorepos
- Unique problems
- Totally worth it, my preferred way to work, lots of benefits
- Creates some technical problems that need solving
- how to know what to build? cant build everything all the time it takes too long
- if you have implemented most of optimisations described here then not a problem
- Task orchestrator knows what he been built before, small changes require small builds even in monorepo
- Efficient caching means even as dependencies grow, CI stays fast. Even if
node_modulesis 5GB.