Recently, we’ve heard from several community members experimenting with new development workflows using lakeFS and dbt.
The timing isn’t surprising: dbt recently added support for big data compute engines like Spark and Trino, which are among the technologies most commonly used by lakeFS users managing a data lake over an object store.
The combination of dbt and lakeFS is natural as both tools encourage applying software engineering best practices to data.
dbt improves the data transformation process by adding model referencing, lineage, documentation, and testing capabilities. This simplifies the common processes of creating tables, writing inserts and upserts, or creating snapshots, and allows analysts to independently author data pipelines – removing the engineering bottleneck.
When it comes to managing multiple environments, the dbt documentation is clear. Multiple targets can be configured within a profile to create separate output locations for models, typically named “dev” and “prod”:
“We recommend using different schemas within one data warehouse to separate your environments. This is the easiest to set up, and is the most cost effective solution in a modern cloud-based data stack.”
Taking the idea further, the guide goes on to recommend “each user to have their own development environment.”
The Challenge with Scale
What we’ve personally experienced and observed is that once a data org starts operating at a certain scale, this approach becomes less feasible. A team with a hundred analysts and engineers cannot have everyone create their own copy of a petabyte-sized app_clickstream_fact_table on which to run dbt tests.
At the other end of the spectrum, using one dev environment to support many users will inevitably lead to conflicting changes and require coordination. Even with a single dev environment, it is non-trivial to keep it hydrated with the latest data from prod – a requirement if we want to use unit and integration tests to catch potential issues before logic changes are implemented in prod. And even if we implement a way to constantly sync the dev environment, that means copying a lot of data – again, a problem at scale.
What we need is a solution that lets us create logical copies of production data without incurring the cost of replicating data. With this functionality, it becomes possible to whip up perfectly-replicated dev environments where we can build models and run dbt tests before exposing the outputs to any downstream consumers.
Sound too good to be true? It isn’t.
By adding lakeFS to the architecture, we can take advantage of the following abstractions and obtain the above functionality:
- lakeFS branches – to create logical copies of data using metadata. Creating a new branch sourced from production provides an identical copy of the data without duplicating it.
- lakeFS pre-merge hooks – to make merging new data into production conditional on data validation tests passing
- lakeFS protected branches – to prevent users from directly writing into a branch, forcing the use of pre-merge hooks, as mentioned above.
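To build intuition for why branch creation is cheap, here is a toy Python sketch of the copy-on-write idea behind metadata-based branching. This is an illustration of the concept only, not lakeFS internals; all names are made up:

```python
# Toy illustration: a "branch" is just a mapping from logical paths to
# content-addressed objects. Objects are stored once; branching copies
# only the pointer map, never the data itself.

class ToyRepo:
    def __init__(self):
        self.objects = {}             # content hash -> data (stored once)
        self.branches = {"main": {}}  # branch name -> {path: content hash}

    def write(self, branch, path, data):
        key = hash(data)              # stand-in for a real content hash
        self.objects[key] = data
        self.branches[branch][path] = key

    def branch(self, name, source):
        # Creating a branch copies the pointer map only: O(metadata), not O(data).
        self.branches[name] = dict(self.branches[source])

    def read(self, branch, path):
        return self.objects[self.branches[branch][path]]

repo = ToyRepo()
repo.write("main", "tables/clicks/part-0001", "petabytes of rows...")
repo.branch("dev-branch-a", "main")   # instant "copy" of production
repo.write("dev-branch-a", "tables/clicks/part-0001", "transformed rows")

# main is untouched; the dev branch sees its own write; only the
# changed object was added to storage.
assert repo.read("main", "tables/clicks/part-0001") == "petabytes of rows..."
assert repo.read("dev-branch-a", "tables/clicks/part-0001") == "transformed rows"
assert len(repo.objects) == 2
```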
Curious how it works?
In the next section, we’ll walk through the workflow that shows how to implement this strategy for development and staging environments for data pipelines managed with dbt. (This architecture is based on the work of lakeFS community member Sid from paige.ai.)
Let's see it in action!
We’ll start with an environment where dbt is being used to manage data transformations performed by Spark. The first step toward a more robust development environment is to use the lakeFS-Spark integration to enable git-like operations over the data in object storage.
To configure Spark with lakeFS, we update the fs.s3a.endpoint in the hive-site.xml file as shown below.
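A minimal sketch of the relevant hive-site.xml properties follows. The endpoint and credential values are illustrative placeholders – substitute your own lakeFS endpoint and access keys:

```xml
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://lakefs.example.com</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_LAKEFS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_LAKEFS_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
```

With these set, S3A paths resolve against the lakeFS endpoint rather than S3 directly, so repository and branch names become part of the path.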
With this configuration set, we can create schemas in Spark that point to a branch in a lakeFS repository. The name of the Spark schema is then set as the schema for an output in the dbt profiles.yml config.
```yaml
# example profiles.yml for staging env on lakefs branch
lakefs-example:
  target: staging
  outputs:
    dbt-prod:
      method: thrift
      type: spark
      schema: dbt-prod
```
To show how it all links together, you can perform the following steps:
- Create a branch in lakeFS:

```shell
lakectl branch create lakefs://repo/dbt-prod -s lakefs://repo/main
```

- Create a schema in Spark using the branch:

```sql
CREATE SCHEMA `dbt-prod` LOCATION 's3://repo/dbt-prod'
```

- Copy the metadata of relevant tables into the schema:

```shell
lakectl metastore copy --from-schema main --to-schema dbt-prod --from-table my-table --to-branch dbt-prod
```

- Add a new target to dbt's profiles.yml: make sure it points to the new schema, and run dbt commands.
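For that last step, running dbt against the new target might look like the following (the target name matches the example profile; adjust to your own):

```shell
dbt run --target dbt-prod
dbt test --target dbt-prod
```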
We recommend going through these steps by creating a branch named “dbt-prod” sourced from the “main” branch in lakeFS and also creating a corresponding schema in Spark. This environment will be used as a staging area where dbt tests can run over the model outputs prior to data being exposed to downstream consumers.
If all of dbt’s tests pass, we perform a merge operation in lakeFS from the dbt-prod branch to main to atomically expose the tested outputs.
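Assuming the repository and branch names used above (which are illustrative), the atomic promotion is a single lakectl merge:

```shell
lakectl merge lakefs://repo/dbt-prod lakefs://repo/main
```

Because the merge is a metadata operation, downstream consumers reading from main see all of the tested outputs appear at once, never a partially-written state.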
The Development Workflow
Instead of using a permanent development environment with an outdated and/or partial copy of production data, we use a lakeFS branch to instantly replicate the full production data without duplicating any files.
We do this by running through the same process outlined above, but this time start by creating a personal development branch we’ll call dev-branch-a sourced from dbt-prod. Within this environment you get a full replica of production data to develop and observe the effect of changes to downstream tables and systems.
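Creating that personal branch mirrors the earlier step, just sourced from dbt-prod instead of main (repository and branch names are illustrative):

```shell
lakectl branch create lakefs://repo/dev-branch-a -s lakefs://repo/dbt-prod
```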
Unlike testing in a partial dev environment, this gives you an extra level of confidence: if a downstream dashboard, API, or Salesforce consumer looks good reading from dbt-prod, the same will be true when it reads from “main”.
Especially important for larger teams, we can create as many of these “feature” dev branches as needed, so each developer on your team can work and test changes in isolation instead of in what we call the “shared folder” scenario. The need to coordinate and communicate changes to tables in dev disappears when there is no single long-lived dev environment that multiple people work in simultaneously.
Once the tests configured in dbt and/or lakeFS succeed, we merge our code changes to the production branch in a git repository with confidence that the dbt models produced by this code won’t cause any unexpected issues – so long as we’ve configured meaningful tests for our project.
We’re truly excited about what these development and testing workflows can do to speed up development cycles of data teams.
We’re actively working to make this workflow more auto-magical, so that lakeFS branches auto-generate Hive schemas that come hydrated with the (meta)data needed to immediately run dbt transformations.
To follow this work and share your thoughts, see issue #2530 on the lakeFS GitHub!
Iddo Avneri contributed to this article.