Tal Sofer

Last updated on November 25, 2025

Source code versioning has been a standard practice in software engineering for a long time. Data engineers can also benefit from this approach. But versioning data just like we version code is easier said than done. Git wasn’t designed to handle massive data volumes, and judging by the rapid rise of machine learning applications, datasets will only grow bigger in the future.

This is where dbt and lakeFS come in, offering complementary solutions for modern data management. dbt focuses on managing the code for data transformations, while lakeFS provides a data version control system for the data itself. Together, they let teams build reliable, reproducible workflows for both transformations and datasets.

In this article, we dive into code version control with dbt, then move on to data version control with lakeFS, showing the distinct use cases each addresses and how data teams can combine the two for greater efficiency, collaboration, and trust in data.

What is dbt?

dbt is a solution that lets you modularize and centralize your analytics code while providing your data team with guardrails similar to those used in software engineering workflows.

Your team can collaborate on data models, version them, and test and document queries before safely deploying them to production with monitoring and visibility.

dbt compiles and runs your analytics code against your data platform, giving your team a single source of truth for metrics, insights, and business definitions. That single source of truth, paired with the ability to write tests for your data, helps catch errors as logic evolves and alerts you to issues.

Figure: dbt overview (Source: https://docs.getdbt.com/docs/introduction)

Key Features of dbt

  • Handles the boilerplate SQL needed to materialize queries as relations
  • dbt’s ref function lets you build models on top of other models, implementing transformations in stages (see the sketch after this list)
  • Document your dbt project easily using a built-in framework for creating, versioning, and sharing documentation for your dbt models
  • Test your models with assertions about the results they produce, strengthening the integrity of each model’s SQL
  • dbt includes a package manager that lets analysts publish public and private repositories of dbt code for others to reference
  • dbt snapshots capture changes to mutable source tables over time, helping you track changes and recover historical values
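
To make the ref function and materializations concrete, here is a minimal sketch of a dbt Python model. The upstream model stg_orders, the column names, and the warehouse adapter are assumptions for illustration (Python models require an adapter with DataFrame support, such as Snowflake or BigQuery), so adapt the details to your own project.

```python
# models/orders_daily.py -- a minimal dbt Python model (names are illustrative)
def model(dbt, session):
    # Tell dbt how to materialize this model (table, view, incremental, ...)
    dbt.config(materialized="table")

    # ref() resolves the upstream model and lets dbt build its dependency graph,
    # which is what allows transformations to be layered in stages.
    orders = dbt.ref("stg_orders")  # hypothetical staging model

    # On Snowpark this returns a DataFrame with a to_pandas() method; other
    # adapters expose a similar conversion.
    df = orders.to_pandas()
    daily = df.groupby("order_date", as_index=False)["amount"].sum()

    # The returned DataFrame is written back to the warehouse as this model's relation.
    return daily
```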

dbt Use Cases 

Eliminates Data Silos

Data silos emerge when data is scattered across teams and systems that don’t share it. As an organization’s data grows, centralizing it for analysis becomes increasingly important. You can be proactive by developing centralized transformation logic in your data warehouse: dbt helps you express that logic as models that different downstream processes can reuse.

Boosts Data Quality

Data engineers spend a lot of time updating, maintaining, and chasing down quality issues in data pipelines. dbt lets you clean data using plain SQL or Python, and you can build more elaborate models to filter, reshape, and automate data cleaning operations.

The source of data has a significant impact on its quality. Because upstream sources change frequently, every switch risks breaking a downstream process. To avoid this, you can standardize source references in dbt so downstream models depend on a single, consistent definition.

Machine Learning Data Preparation

With dbt’s Python models, data engineers can easily split data or engineer new features using Python’s rich data ecosystem, and explore existing data in new ways to improve the accuracy of machine learning models.
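
As a rough sketch of what that can look like, the dbt Python model below derives per-customer features and marks a deterministic holdout split. The upstream model customer_orders, its columns, and the numeric customer_id are assumptions for illustration.

```python
# models/ml_customer_features.py -- illustrative ML feature-prep model
def model(dbt, session):
    dbt.config(materialized="table")

    # Hypothetical upstream model; adjust ref() and column names to your project.
    df = dbt.ref("customer_orders").to_pandas()

    # Derive simple features: order count, total spend, and average order value.
    features = df.groupby("customer_id", as_index=False).agg(
        order_count=("order_id", "count"),
        total_spent=("amount", "sum"),
    )
    features["avg_order_value"] = features["total_spent"] / features["order_count"]

    # Deterministic train/holdout split (assumes a numeric customer_id), so that
    # downstream training jobs see the same split on every run.
    features["is_holdout"] = features["customer_id"].mod(10) == 0

    return features
```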

Data Lineage and Transparency

dbt removes the guesswork and builds trust among stakeholders by providing data lineage and documentation. Data lineage lets you trace the data’s history, origins, and transformations, while dbt documentation helps you identify the types of transformations applied, the logic behind each model, and more.

Test and Debug Data Pipelines

dbt helps you test and debug data pipelines in near real time. You can use built-in dbt features to check source freshness and to define test cases, both singular and generic tests.

Now that you know how code versioning works and which use cases it covers, let’s take a closer look at data version control, using lakeFS as the example.

What is lakeFS?

lakeFS sits on top of the data lake you want to version, serving as an additional layer that enables Git-style operations on object storage. lakeFS is an excellent choice for teams looking to:

  • build and test in isolation on object storage, 
  • manage a long-term production environment with versioning, 
  • and achieve exceptional collaboration. 

The data version control system manages structured and unstructured data alike and is format-agnostic, so it works with your existing compute engines. It was designed to scale while still delivering excellent performance.

Figure: lakeFS architecture overview (Source: lakeFS)

Key Features of lakeFS

  • lakeFS is format-agnostic
  • It integrates with a wide range of data tools and platforms
  • It ensures ACID transactions
  • It eliminates the need for data duplication by using zero-copy branching, saving storage costs
  • It performs well across data lakes of various sizes
  • It offers customizable garbage collection capabilities

lakeFS Use Cases 

Write-Audit-Publish

lakeFS makes it easier to implement the Write-Audit-Publish pattern thanks to a feature called hooks. Hooks automate data checks and validations on lakeFS branches. Specific actions, such as committing or merging, can trigger these checks, enabling a smooth, fully automated Write-Audit-Publish process on the data lake.
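
Hooks themselves are configured in the repository, but the overall flow can be sketched with lakeFS’ high-level Python package. The repository, branch, and object paths below are placeholders, and the exact method names may differ slightly between SDK versions, so treat this as an outline of the pattern rather than a drop-in script.

```python
import lakefs  # high-level lakeFS Python SDK, assumed installed and configured

repo = lakefs.repository("analytics-repo")   # placeholder repository name
main = repo.branch("main")

# WRITE: land new data on an isolated branch instead of writing to main.
staging = repo.branch("load-2024-01-01").create(source_reference="main", exist_ok=True)
with open("orders.parquet", "rb") as f:
    staging.object("raw/orders/2024-01-01.parquet").upload(data=f.read())
staging.commit(message="Load orders for 2024-01-01")

# AUDIT: run whatever validation you need against the branch (row counts, schema
# checks, dbt tests, ...). A pre-merge hook can enforce these checks automatically.
assert any(o.path.startswith("raw/orders/") for o in staging.objects())

# PUBLISH: merge the audited branch into main so consumers see the data atomically.
staging.merge_into(main)
```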

ETL Test Environment

lakeFS makes creating distinct development and test environments for ETL testing easy and cost-efficient. It uses zero-copy branching to avoid duplicating data when spinning up a new environment, which saves time on environment maintenance and lets you create as many environments as you need.

Each branch in lakeFS can be a separate ecosystem – since branches are isolated, changes to one do not affect the others.

Reproducibility

To make your data reproducible, all it takes is committing to your lakeFS repository. Once a commit exists, replicating a specific state is as simple as reading data from a path that includes the unique commit_id assigned to that commit.
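
Because lakeFS exposes an S3-compatible endpoint, any S3 client can read that commit-addressed path. In the sketch below the endpoint, credentials, repository, and commit ID are placeholders; in the lakeFS S3 gateway the bucket is the repository and the first path segment is a ref, which can be a branch name or an immutable commit ID.

```python
import boto3

# Placeholders: point these at your lakeFS server and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

commit_id = "c5f1e8..."  # placeholder: the commit_id of the state to reproduce

# Bucket = repository, key prefix = ref (here, a commit ID), so this read returns
# exactly the bytes that existed at that commit, regardless of later changes.
obj = s3.get_object(
    Bucket="analytics-repo",
    Key=f"{commit_id}/raw/orders/2024-01-01.parquet",
)
data = obj["Body"].read()
```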

Swift Error Recovery Using Rollbacks

Dealing with a critical data mistake can be challenging and stressful. The ability to roll back alleviates some of that pressure and makes it more likely you can determine what happened without causing further problems.

lakeFS lets you design your data lake so rollbacks are easy to perform. It starts with making a commit to your lakeFS repository whenever its state changes. Using the lakeFS UI or CLI, you can then quickly point the current state, or HEAD, of a branch at any historical commit, effectively performing a rollback.

lakeFS + dbt: Optimal data management

Now that we understand how dbt and lakeFS help data practitioners manage their code and data at scale, let’s take a look at some use cases where they complement one another.

Versioned data sources for dbt projects

lakeFS provides the ability to create snapshots of your raw data, ensuring dbt transformations are always operating on a consistent and reproducible dataset. This in turn guarantees that any changes in the raw data can be tracked and managed effectively.

Experimentation with branching

By running dbt against lakeFS branches, users can experiment with transformations in isolated environments. You get an intuitive way to create a branch, run a dbt model, and validate the transformation without affecting production data.
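
A rough sketch of that loop is shown below, assuming the high-level lakefs package and dbt’s programmatic dbtRunner (available in recent dbt-core versions). The repository name, branch name, and the lakefs_ref variable that your dbt sources would use to build branch-scoped paths are all hypothetical.

```python
import lakefs
from dbt.cli.main import dbtRunner  # programmatic dbt invocation (dbt-core 1.5+)

# 1. Create an isolated, zero-copy branch for the experiment (names are placeholders).
repo = lakefs.repository("analytics-repo")
repo.branch("experiment-new-model").create(source_reference="main", exist_ok=True)

# 2. Run dbt against the branch. This assumes your sources read a "lakefs_ref" var
#    to build paths such as s3://analytics-repo/<ref>/... through the S3 gateway.
run_result = dbtRunner().invoke(["run", "--vars", '{"lakefs_ref": "experiment-new-model"}'])

# 3. Validate the transformation in isolation; production data on main is untouched.
if run_result.success:
    dbtRunner().invoke(["test", "--vars", '{"lakefs_ref": "experiment-new-model"}'])
# Only after the tests pass would you merge the branch back into main.
```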

Reproducible transformations

By combining lakeFS’ data version control with dbt’s transformation logic, you can ensure that both the data and code used to generate a model are reproducible at any point in time, simplifying auditing and debugging.

Testing and validation in isolation

dbt tests can be run against isolated datasets in lakeFS branches, ensuring data quality before merging changes into production environments.

Collaboration across teams

lakeFS provides a commit history for data changes, enabling teams to track how raw data evolves over time and align it with dbt transformation changes for better collaboration and governance.

Write-Audit-Publish for data workflows

By combining dbt’s automation capabilities with lakeFS’ Write-Audit-Publish workflows, users can implement CI/CD pipelines for both data and transformations, catching errors early and promoting stable deployments.

Wrap up

By uniting dbt’s strengths in code versioning and transformation logic with lakeFS’ robust data version control, teams can achieve greater efficiency, collaboration, and trust in their data workflows. Together, lakeFS and dbt present a modern approach to data management. Their complementary pairing keeps your data and transformations consistent, reproducible, and ready for scale, whether you are building reliable ETL pipelines, experimenting with models, or improving collaboration across teams.
