Keeping up with upstream projects

Overview

This morning I resumed work on a new service for CPS support in Merge. Since it is a brand new service, when I create the protobuf definitions and the surrounding code, go mod will grab all of the dependencies from around the internet that are needed to build it.

However, because the service is new, go mod is not constrained by the frozen module imports that other Merge projects carry, so we get the latest and greatest. That is good, except when those versions conflict with the dependencies other Merge projects import, which were the latest and greatest at the time those projects were created.
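To make the mismatch concrete, here is a rough sketch of the two go.mod files involved; the module paths and versions are invented for illustration, not taken from actual Merge projects:

    // go.mod of the new service: whatever is latest at creation time
    module example.com/merge/cps-service   // hypothetical path

    go 1.16

    require (
        google.golang.org/grpc v1.38.0
        google.golang.org/protobuf v1.26.0
    )

    // go.mod of an older Merge service: frozen at what was latest back then
    require (
        github.com/golang/protobuf v1.3.2
        google.golang.org/grpc v1.26.0
    )

The moment code from both ends up in the same build, the module graph has to settle on versions that may not have been tested by either project.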

Something that causes particular pain, and chewed up several hours of my morning, is gRPC/protobuf (as well as etcd, because they refuse to implement go modules correctly, but let's set that aside for the moment). Others have felt this pain with gRPC as well. The difficult bit is that my new service is not wrong for importing the latest stable versions of things, and existing services are not wrong for sticking with what is known to work - at least for some reasonable and well-defined duration of time. Exactly what that duration is and how we manage it is what I'd like to talk about here.
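For the etcd case specifically, the usual stop-gap I've seen (and the kind of thing I'd rather not have to sprinkle across every project) is a go.mod replace directive that pins gRPC back to a release the etcd client still builds against; the version shown is the commonly used pin, but treat it as illustrative:

    // Workaround: force the whole build back to an older gRPC so that
    // etcd's client code still compiles.
    replace google.golang.org/grpc => google.golang.org/grpc v1.26.0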

Stability is of paramount importance, and we cannot just update everything because a new release of an upstream dependency was pushed. In fact, I think that for our stable branches, an upstream dependency needs to have been in production for a while before we update to it; for testing and unstable we can incorporate new releases sooner.

However, on the other hand, the need for stability does not completely outweigh the need to stay current with our upstream projects. Otherwise you eventually wind up where I did a few years ago on a former project: going to do some seemingly simple maintenance on a component, only to find 5+ year-old dependencies that turn it into a weeks-long upgrade effort carrying extremely high operational risk. Put in a more positive light, our users benefit directly from the rate at which we can deliver new features and improvements, and these features and improvements come in large part from upstream innovation - so keeping pace is important.

Goals

I think we need to define a policy/process that governs how we interact with upstream projects in the stable/testing/unstable universes. That process needs to be underpinned by DevOps machinery to make it viable. This includes:

  • Testing machinery that allows us to evaluate the impact of pulling upstream updates
  • Automated identification of policy violations with respect to updating an upstream dependency
    • Pulling something into stable too soon
    • An upstream library becoming too stale
  • Cross project dependency management.
    • Example: all of Merge uses gRPC 1.26/1.28/master for stable/testing/unstable respectively
  • Tools to ensure go.mod files conform to cross-project policies (a rough sketch of such a tool follows this list).
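On that last point, here is a minimal sketch of what a go.mod policy checker could look like. It assumes golang.org/x/mod for parsing, and the policy table is purely illustrative, not an actual Merge policy:

    // checkmod: a sketch of a go.mod policy checker. It parses a go.mod file
    // and flags any required module whose version differs from the version
    // pinned in a cross-project policy table for the branch being checked.
    package main

    import (
        "fmt"
        "os"

        "golang.org/x/mod/modfile"
    )

    // policy maps module path -> version allowed on this branch.
    // These entries are illustrative only.
    var policy = map[string]string{
        "google.golang.org/grpc":     "v1.26.0",
        "google.golang.org/protobuf": "v1.25.0",
    }

    func main() {
        data, err := os.ReadFile("go.mod")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }

        f, err := modfile.Parse("go.mod", data, nil)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }

        violations := 0
        for _, r := range f.Require {
            if want, ok := policy[r.Mod.Path]; ok && r.Mod.Version != want {
                fmt.Printf("%s: have %s, policy requires %s\n",
                    r.Mod.Path, r.Mod.Version, want)
                violations++
            }
        }
        if violations > 0 {
            os.Exit(1)
        }
    }

Run from a project root, something like this could slot into CI to flag divergence from the per-branch policy automatically.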

Another thing worth discussing here is a return to nightly builds and testing. We did this somewhat in the beginning of Merge, but back then everything was too new - both the Merge systems themselves and the tools and techniques we used for testing. The result was that everything exploded every night, which was misery for everyone and distracted from the critical paths that needed attention to get some set of components actually working - the components that would eventually comprise an end-to-end working Merge system.