This post was originally published on the Enigma blog.
In every software engineering problem I’ve worked on, I’ve noticed a recurring tension between two highly desirable properties: flexibility and robustness. But in each situation, this tension manifests itself in different ways.
At Enigma, our goal is to build a complete set of authoritative profiles of U.S. small businesses. This requires us to integrate hundreds of different data sets and to continually create and refine machine learning–based algorithms in a highly research-driven development process.
For us, flexibility means continually integrating new data sources and algorithms, then rapidly experimenting with and validating them. Robustness means running a complex data pipeline at scale while spending minimal time on maintenance.
The tension between flexibility and robustness arises when we’re excited about a potential research breakthrough and want to rapidly test out the effects it has on our data asset. We want to quickly deploy the change in an isolated data pipeline and measure the results against what’s currently running in our production environment.
In this post, I’ll discuss the various approaches we tried and why we’re now using data branching to address this tension.
We initially tried to resolve this by having distinct production and dev pipelines. We deployed code to the dev pipeline from separate git branches and maintained copies of the data in distinct namespaces. This solution delivered a high degree of robustness, but at the cost of the flexibility we needed. The main problems we encountered were:
We needed to keep the data between the dev and prod pipelines in sync. At best, this required us to copy large data files from prod to dev. At worst, it required us to re-compute results on dev unrelated to the experiment we were running.
Data scientists conducting research needed to understand the state of our dev pipeline prior to running experiments and comparing them to prod. As a result, we frequently ran time-consuming experiments only to discover we couldn’t use the results. In practice, data scientists required the help of data engineers to run these experiments. This reduced data engineering teams’ velocity and limited data scientists’ autonomy.
As our team grew, we experienced increased contention in our dev environment whenever we wanted to run multiple experiments at the same time. Data scientists had to wait for the dev environment to “free up” before they could test their changes.
New Approach: Data Branching
Earlier this year, we began to explore lakeFS for data branching as a way to resolve this tension. Data branching overlays a git-like abstraction on top of the physical data. A data branch is a logically isolated view of the data that can be modified and merged into other branches.
Data branching makes it trivial for researchers to create an environment based on the latest production data. With a simple command, a data scientist can create an isolated data branch for their experiment that’s guaranteed to be identical to production except for the specific changes they make. This empowers data scientists to work independently of data engineers.
Data branching resolves issues of environment contention by allowing for the creation of isolated experimental environments (each experiment runs on a different branch).
LakeFS branches solve the isolation challenge in a straightforward way. Today, every developer and researcher creates separate data branches, each of which includes a complete snapshot of the prod data (at no additional storage cost). You can make your change and review its impact on the final data set without fear of interfering with someone else’s work or polluting the production data.
It’s also much easier for us to run parallel pipelines and maintain stable pipelines for customers who want to upgrade at a slower cadence.
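To make the "complete snapshot at no additional storage cost" idea concrete, here is a minimal in-memory sketch. It is a toy model, not the lakeFS API: the `ToyRepo` class, its methods, the branch name, and the file paths are all hypothetical. The assumption it illustrates is that a branch is only a mapping from logical paths to stored objects, so creating a branch copies metadata rather than data.

```python
# Toy in-memory model of data branching (illustrative only; not the lakeFS API).
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    # Physical storage: object id -> bytes. Each object is stored once.
    objects: dict = field(default_factory=dict)
    # Branch metadata: branch name -> {logical path -> object id}.
    branches: dict = field(default_factory=lambda: {"main": {}})

    def write(self, branch: str, path: str, data: bytes) -> None:
        """Store a new object and point `path` at it on one branch only."""
        object_id = f"obj-{len(self.objects)}"
        self.objects[object_id] = data
        self.branches[branch][path] = object_id

    def read(self, branch: str, path: str) -> bytes:
        """Resolve `path` through the branch's mapping and return the bytes."""
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name: str, source: str = "main") -> None:
        """Copy only the path -> object-id mapping, never the data itself."""
        self.branches[name] = dict(self.branches[source])


repo = ToyRepo()
repo.write("main", "businesses/profiles.parquet", b"prod profiles v1")
repo.write("main", "businesses/addresses.parquet", b"prod addresses v1")

objects_before = len(repo.objects)
repo.create_branch("experiment-new-matcher")
# Branching added no new physical objects.
assert len(repo.objects) == objects_before

# The experiment rewrites one data set; main keeps pointing at the old object.
repo.write("experiment-new-matcher", "businesses/profiles.parquet", b"experimental profiles")
print(repo.read("main", "businesses/profiles.parquet"))
print(repo.read("experiment-new-matcher", "businesses/profiles.parquet"))
print(repo.read("experiment-new-matcher", "businesses/addresses.parquet"))
```

In this sketch the experiment branch sees every production data set it has not touched, and only the one file it rewrote consumes extra storage. Two experiments on two branches never see each other's writes, which is the property that removes environment contention.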
In addition to making experiments easier, we saw other benefits for our production pipeline:
Branching for validation: Our data pipeline consists of a series of stages. Between each stage, we run validation logic before promoting the results to the next stage. We realized we could replace this hand-crafted promotion logic by running our candidate pipeline on a branch and merging this branch onto main if validation succeeded.
Data set tagging: Another challenge we had was determining which data set versions contributed to our final data asset. Tagging a branch provided clear semantics about the complete set of intermediate data sets that went into the pipeline. This is extremely helpful when diagnosing issues and anomalies on the final data asset.
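The two production patterns above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our actual pipeline code or the lakeFS API: branches and tags are plain dictionaries, `run_candidate_stage`, `validate`, and `promote` are hypothetical helpers, and a "merge" is modeled as main simply adopting the candidate branch's state.

```python
# Toy sketch of validate-then-merge promotion plus tagging (not the lakeFS API).
import copy


def run_candidate_stage(data: dict) -> dict:
    """Hypothetical pipeline stage: normalizes business names."""
    result = copy.deepcopy(data)
    result["profiles"] = [name.strip().title() for name in data["profiles"]]
    return result


def validate(data: dict) -> bool:
    """Hypothetical check: at least one row and no empty names."""
    return len(data["profiles"]) > 0 and all(data["profiles"])


def promote(branches: dict, tags: dict, candidate: str, tag: str) -> bool:
    """Merge `candidate` into main only if validation passes, then tag main."""
    if not validate(branches[candidate]):
        return False  # main is untouched; the branch can be inspected or dropped
    branches["main"] = copy.deepcopy(branches[candidate])
    # The tag is a frozen copy of every data set that went into this release.
    tags[tag] = copy.deepcopy(branches["main"])
    return True


branches = {"main": {"profiles": ["  acme hardware ", "blue door cafe"]}}
tags = {}

# A good candidate: the stage runs on its own branch, passes, and is promoted.
branches["candidate"] = run_candidate_stage(branches["main"])
promoted = promote(branches, tags, "candidate", "release-001")

# A bad candidate: validation fails, so main and the tags stay as they were.
branches["bad-candidate"] = {"profiles": ["Acme Hardware", ""]}
rejected = promote(branches, tags, "bad-candidate", "release-002")

print(promoted, rejected, sorted(tags))
```

The design point is that promotion becomes a single conditional merge instead of per-stage copy logic, and the tag records the exact state of all intermediate data sets at release time, which is what makes later diagnosis tractable.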
Build vs. Buy
After briefly considering implementing this ourselves as a metadata layer using git branches, we decided to partner with lakeFS to provide our data branching solution. There were a few reasons for this:
LakeFS is 100% focused on providing a solution for data branching. We like their focus on solving this one problem extremely well.
We are impressed by the caliber of people at the company, including their leadership and technical talent. My experience is that smart people with high personal integrity and a clear focus are best positioned to solve difficult problems.
They launched an open-source solution two years ago, so the core capability had been battle-tested. They were in the process of releasing a cloud-hosted version of their product, which meant less maintenance and support for our internal team.
Achievements to Date
Over the past three months, we fully migrated our data pipelines to lakeFS. Overall, it’s been a successful partnership. For the most part, the product has met or exceeded our expectations. Where it hasn’t (mainly response time on certain endpoints), the lakeFS team has been fully engaged in turning around solutions, often in a matter of days. The attentiveness and sense of urgency are refreshing.
Moving to a data branching solution has paid off quickly for us. Within a few days of completing the migration, we had already reduced testing time by 80% on two different projects. And we’re excited to see how data branching increases our product velocity in the coming quarter.