Incidentally, these technologies are a common choice among our community members, who often use dbt and Databricks together to manage a data lake (or lakehouse) over an object store.
The combination of dbt and lakeFS just makes sense. Both of these tools represent the same approach to applying software engineering best practices to data.
But what about the mix of dbt and Databricks? What are the advantages of this combination, and how does lakeFS fit into it? Keep reading to learn more about this combo and how to make it work for your data lake.
dbt + Databricks: Why bring them together?
In essence, dbt simplifies analytics on a data lake or data lakehouse and expands the analytics capabilities of Databricks.
Here are a few benefits users of both dbt and Databricks get:
- Easy way to incorporate data analysts into the data infrastructure development process – dbt provides a streamlined process from development to production, enabling users who may be new to Git and software development standards to easily use the lakehouse.
- Reduced expense and complexity – with dbt, Delta Lake, and Databricks SQL, the data team can operate on a single platform, which simplifies architecture management and avoids redundant expenses.
- Seamless collaboration among teams – the combination of tools facilitates the development, maintenance, and distribution of data infrastructure by arranging it into modular Python functions or SQL queries. It also automates dependency management and execution sequencing.
- Easier understanding of data by line-of-business teams – teams can access lineage, document as they code, and benefit from metadata via integrations with well-known BI and data discoverability tools.
- Enhanced data governance and discovery – dbt supports Databricks Unity Catalog to help consolidate workflow-wide documentation and governance.
How does the dbt + Databricks mix work?
To bring dbt and Databricks together, you can use the dbt-databricks adapter (installable with `pip install dbt-databricks`). Before you start, make sure you have:
- An active dbt account.
- Ability to create object storage in a cloud service provider account. It can be an S3 bucket on AWS, Azure Blob Storage on Azure, or a Google Cloud Storage bucket on Google Cloud Platform.
dbt + Databricks: Getting started
Next, you need to create a workspace in Databricks. To do this, you’ll either need to register and create a Databricks account or sign in with your existing credentials.
A Databricks workspace provides access to every asset hosted on the platform. Workspaces consolidate elements such as clusters, SQL warehouses, and notebooks into a single location.
So, pick the right cloud provider region and enter the name of your workspace. Once the cloud resources are provisioned, import your data into the workspace and you're ready to get started.
Connect dbt to Databricks
You can do this in two ways:
- Partner Connect offers a streamlined configuration process for establishing a dbt account directly from the Databricks trial account. This works great if you’d like to be operational quickly.
- Alternatively, you can create a dbt account and set the Databricks connection manually. This connection method is recommended for those looking to familiarize themselves with the dbt setup procedure and customize their configuration.
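For the manual route, the connection lives in dbt's `profiles.yml`. A sketch of a Databricks profile might look like the following; the host, HTTP path, and catalog/schema names are placeholders you'd replace with values from your own workspace (the `catalog` key applies when Unity Catalog is enabled):

```yaml
# profiles.yml -- placeholder values; copy the real ones from your workspace
my_databricks_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: main            # optional; Unity Catalog name
      schema: analytics_dev    # schema (database) to build models into
      host: dbc-xxxxxxxx-xxxx.cloud.databricks.com
      http_path: /sql/1.0/warehouses/xxxxxxxxxxxxxxxx
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 4
```

Keeping the token in an environment variable, as above, avoids committing credentials to your repository.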
Perform your first commit
Now you can start development within the IDE. Your project may take a few minutes to start up initially while it establishes a Git connection, clones your repository, and verifies the connection to the warehouse.
To perform your first commit, select Commit and sync, enter a commit message, and click Commit.
This creates the initial commit in the managed repository and generates a branch to which new database code can be added. You can now execute dbt run and query data directly from your warehouse.
Create your first branch
Next, you can create a new branch. Because the primary branch is set to read-only, you need a new branch before making changes.
After clicking the icon adjacent to the models directory, choose Create file and enter a name. Paste your model's SQL into the file and click Save.
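As an illustration, a minimal staging model could look like this; the source, table, and column names here are hypothetical, and the example assumes a `raw.orders` source has been declared in your project:

```sql
-- models/stg_orders.sql (illustrative; names are placeholders)
select
    order_id,
    customer_id,
    order_date,
    amount
from {{ source('raw', 'orders') }}
where order_date is not null
```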
Enter dbt run in the command prompt located at the bottom of the screen. A successful run should let you see all your models.
You can then connect these tables and views to business intelligence (BI) tools, so that the BI tools consume transformed data rather than raw, unprocessed data.
Recommit the modifications
After building a new model, you need to commit the modifications to the project. This is how you ensure that the repository contains the most recent code.
On the left-hand side, select Commit and sync, then enter a description. Finally, merge this branch into the main branch of your repository.
What you gain by adding lakeFS into the mix
There’s no doubt about dbt’s role in improving the data transformation process. It provides teams with model referencing, lineage, documentation, and testing capabilities. All of these things streamline common processes, such as:
- Creating tables
- Writing inserts and upserts
- Creating snapshots
This allows data analysts to author data pipelines independently, removing the bottleneck created by scarce engineering resources.
dbt documentation provides clarity around managing multiple environments. You can configure multiple targets within a profile to build separate output locations for models, typically named “dev” and “prod.”
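Sketched in `profiles.yml`, the dev/prod split might look like this (hosts, schemas, and warehouse paths are placeholders):

```yaml
# profiles.yml -- separate "dev" and "prod" output locations (placeholder values)
my_project:
  target: dev                  # default; override with `dbt run --target prod`
  outputs:
    dev:
      type: databricks
      schema: dbt_dev_alice    # each developer builds into their own schema
      host: dbc-xxxxxxxx-xxxx.cloud.databricks.com
      http_path: /sql/1.0/warehouses/xxxxxxxxxxxxxxxx
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
    prod:
      type: databricks
      schema: analytics        # production output location
      host: dbc-xxxxxxxx-xxxx.cloud.databricks.com
      http_path: /sql/1.0/warehouses/xxxxxxxxxxxxxxxx
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
```

Switching targets changes only where models are built, which is exactly what makes the per-user-environment recommendation workable on small teams.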
The guide recommends “each user to have their own development environment,” but once a data org reaches a certain scale, this approach becomes unsustainable.
The challenge of scale
Let’s consider a team of a hundred analysts and engineers. Practically speaking, none of them is going to generate their own copy of a petabyte-sized app_clickstream_fact_table just to run tests against it.
At the opposite end of the spectrum, having a single development environment to accommodate a large number of users would always result in competing modifications that would need to be coordinated.
Even with a single development environment, it is difficult to keep up to date with the most recent data from production. This is a prerequisite if you want to use unit and integration tests to detect possible issues before logic changes are deployed in production.
Even if you design a means to continually sync the development environment, you’d need to copy a massive amount of data.
What you need is a system that allows teams to produce logical copies of production data without incurring the costs associated with data replication. This capability lets you create completely duplicated development environments that can construct models and run dbt tests before exposing the results to any downstream customers.
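The zero-copy idea can be sketched in a few lines of Python. This is a toy model for illustration only, not lakeFS's actual implementation: a branch is just a mapping from paths to shared, immutable data objects, so creating a branch copies metadata, never data.

```python
# Toy sketch of metadata-based ("zero-copy") branching.
# Illustration only -- not lakeFS's actual implementation.

class ToyRepo:
    def __init__(self):
        self.objects = {}             # shared, immutable "data files"
        self.branches = {"main": {}}  # branch -> {path: object_key}

    def write(self, branch, path, data):
        key = f"obj-{len(self.objects)}"
        self.objects[key] = data                # store the file once
        self.branches[branch] = {**self.branches[branch], path: key}

    def create_branch(self, name, source):
        # Copy only the path->object mapping (metadata), never the data.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch, path):
        return self.objects[self.branches[branch][path]]

repo = ToyRepo()
repo.write("main", "tables/orders/part-0.parquet", b"petabytes of rows")
repo.create_branch("dev-branch-a", "main")      # instant, no data copied

# The dev branch sees production data...
assert repo.read("dev-branch-a", "tables/orders/part-0.parquet") == b"petabytes of rows"

# ...but writes on it stay isolated from main.
repo.write("dev-branch-a", "tables/orders/part-0.parquet", b"rewritten rows")
assert repo.read("main", "tables/orders/part-0.parquet") == b"petabytes of rows"
```

However many branches you create, the underlying objects are stored once; only the small path-to-object mappings are duplicated.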
Adding lakeFS into the architecture gives the freedom to use the following abstractions and achieve the aforementioned functionality:
- lakeFS branches – to create logical copies of data using metadata
- lakeFS pre-merge hooks – to combine the process of adding new data into production with data validation tests
- lakeFS protected branches – to prevent users from directly writing into a branch, forcing the use of pre-merge hooks
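A pre-merge hook is declared in an actions file committed to the repository. A sketch, assuming the production branch is named `dbt-prod` and the webhook URL points at a CI endpoint of your own (both are placeholders):

```yaml
# _lakefs_actions/pre-merge-dbt-tests.yaml (illustrative; URL is a placeholder)
name: run dbt tests before merging to production
on:
  pre-merge:
    branches:
      - dbt-prod
hooks:
  - id: dbt_test_gate
    type: webhook
    properties:
      url: https://ci.example.com/hooks/run-dbt-tests
```

If the webhook reports failure, the merge into `dbt-prod` is blocked, so untested data never reaches the production branch.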
How does dbt + lakeFS work?
lakeFS provides version control for the data lake and creates and accesses versions using Git-like semantics. It lets you use concepts like branches to establish an isolated version of the data, commits that create a reproducible point in time, and merges to combine your changes in a single atomic operation.
Instead of employing a persistent development environment that contains an outdated and/or incomplete duplicate of production data, you can use a lakeFS branch to rapidly reproduce the whole production data without copying any files.
lakeFS does this using the same Git-like semantics described above. You can begin by creating a personal development branch called dev-branch-a, derived from the production branch dbt-prod. This environment provides a complete replica of production data for developing and testing the impact of changes on downstream tables and systems.
Unlike when testing in a partial dev environment, you can be confident that a downstream dashboard, API, or Salesforce consumer that looks right when reading from dev-branch-a will also look right when reading from dbt-prod once the changes are merged.
You may construct as many of these “feature” dev branches as needed, allowing each developer on your team to work and test changes independently, rather than in what we term the “shared folder” situation.
When you’re not dealing with a single long-lived dev environment in which numerous people work at the same time, you no longer have to deal with the need to ensure that changes to tables under development are coordinated and communicated.
Once testing in dbt and/or lakeFS is complete, you can merge your code changes into the production branch of your Git repository, confident that the dbt models generated by the code will not create any unforeseen difficulties. This holds as long as you have implemented relevant tests for your project.
Check out lakeFS docs to learn more about how it works with other tools, including dbt, Databricks, Spark, and many others.
And if you’re ready to start your adventure with data version control, go straight to the GitHub page.