Promoting ETL code to production is a straightforward process. Our code – usually stored in Git – needs to be built and tested, so we check out the code, make changes, and commit them.
Once we’re ready to promote our ETL into production, we open a pull request that might trigger automated Write-Audit-Publish testing or check for security vulnerabilities. If all goes well, we merge the branch into production and start running our ETL there.
This process works really well. Or does it?
The problem with the current ETL testing process is that we test against sample data instead of production data (and keep our fingers crossed that everything works).
It doesn’t have to be this way.
In this article, we examine how lakeFS enables testing ETL code on production data with the help of GitHub Actions.
What is lakeFS, and how does it work for ETL versioning?
lakeFS is an open-source solution that lets you work with data the way you would work with code.
It sits on top of any object storage, from Amazon S3 to on-premises solutions, providing it with a Git interface. You can now create branches in your bucket, make changes, commit those changes, or revert to the previous version if something doesn’t work. The ecosystem of tools in your environment can access this versioned data via the lakeFS API.
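To get a feel for what this looks like, here’s a minimal sketch using the high-level lakefs Python SDK (the lakefs package) – the repository name, branch name, and object path are illustrative, and credentials are assumed to come from your environment or lakectl configuration:
import lakefs

# Illustrative names; assumes the lakeFS endpoint and credentials are
# configured via environment variables or ~/.lakectl.yaml.
repo = lakefs.repository("example-repo")

# Branch out from production data - a metadata-only operation
dev = repo.branch("dev-experiment").create(source_reference="main")

# Make a change on the branch and commit it
dev.object("raw/new_file.csv").upload(data=b"id,name\n1,Ada\n")
dev.commit(message="Add a new raw file on an isolated branch")

# Didn't work out? Delete the branch - main was never touched
# (lakeFS can also revert individual commits).
repo.branch("dev-experiment").delete()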
lakeFS handles the data side of versioning and promoting ETL code to production. What this means is that you can create a branch of your data, make changes, commit them, and open a pull request – just as you would with code.
We can make this even easier by integrating GitHub Actions with lakeFS, so that a GitHub Action executes the lakeFS operations for us.
The GitHub Action creates a branch of our production data, allowing us to run the ETL against what looks like a full copy of that data (don’t worry, nothing is physically copied – more on that later!). If all tests pass, the data is merged – with zero risk to the data in production. If a test fails, we can revert the changes at any time.
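Under the hood, the workflow boils down to something like this sketch (again assuming the high-level lakefs Python SDK; run_etl_and_validations is a hypothetical placeholder for triggering the Databricks notebook we’ll look at below, and all names are illustrative):
import lakefs

def run_etl_and_validations(branch_name: str) -> bool:
    """Hypothetical placeholder: trigger the Databricks ETL + validation
    notebook against the given lakeFS branch and report success/failure."""
    ...
    return True

repo = lakefs.repository("example-repo")  # illustrative repository name
test_branch = repo.branch("etl-test").create(source_reference="main")

if run_etl_and_validations("etl-test"):
    # All validations passed: promote the tested data into production
    test_branch.merge_into(repo.branch("main"))
else:
    # Something failed: discard the isolated branch; production is untouched
    test_branch.delete()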
How does this work in practice? Let’s explore in this guide.
Promoting ETL code to production with lakeFS and GitHub Actions
The zero-copy mechanism of lakeFS
lakeFS runs on top of the object storage. Every commit is a collection of pointers to files in that bucket.
lakeFS takes advantage of the immutability of object stores and uses copy-on-write. When a new file is created, the next commit points to that additional file; files that haven’t changed are simply referenced by multiple commits pointing at the same physical objects.

This is why lakeFS can handle a massive amount of data.
Let’s say we want to create a copy of our entire production dataset and run our ETL against it. Creating that copy is a metadata-only operation, which makes it highly performant. Thanks to zero-copy, lakeFS handles hundreds of data files with significant storage cost savings: you can have multiple copies of your data without any real copies.
Automated testing practical demo
We’ll be working in the sample Git repository called lakeFS-samples-ci-cd.

Our Databricks notebook, with the ETL job built in Databricks, is checked into this Git repo.

It’s a basic ETL job built around two tables with a parent/child relationship (a small illustrative sample of each follows the list):
- A list of famous people
- A list of categories they belong to (from sports to music)
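To make the example concrete, here is a purely illustrative sample of what the two raw tables might contain (sketched in the Databricks notebook context where spark is predefined; the name column and the rows are assumptions – only country and category appear in the code below):
# Illustrative sample rows only - not the actual data from the demo repo
famous_people_raw = spark.createDataFrame(
    [("Serena Williams", "USA", "sports"),
     ("Freddie Mercury", "UK", "music")],
    ["name", "country", "category"],
)
category_raw = spark.createDataFrame(
    [("sports",), ("music",)],
    ["category"],
)
The notebook first works out where to read the data from – the production storage namespace or an isolated lakeFS branch – depending on the environment it runs in: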
ENVIRONMENT = getArgument('environment')
lakefsEndPoint = getArgument('lakefs_end_point')
repo_name = getArgument('lakefs_repo')
newBranch = getArgument('lakefs_branch')

# In prod we read from the production storage namespace; in dev we read from
# the isolated lakeFS branch created for this test run.
if ENVIRONMENT == "prod":
    DATA_SOURCE = getArgument('data_source_storage_namespace')
elif ENVIRONMENT == "dev":
    DATA_SOURCE = f"lakefs://{repo_name}/{newBranch}/delta-tables"

print(DATA_SOURCE)
df = spark.read.format("delta").load(f"{DATA_SOURCE}/famous_people_raw")
df.display()
We partition these people by country:
df.write.format("delta").partitionBy("country").save(f"{DATA_SOURCE}/famous_people")
Let’s say we’d like to add something else to our code – we will keep all the categories except for music:
from pyspark.sql.functions import col

df = spark.read.format("delta").load(f"{DATA_SOURCE}/category_raw")
df_not_music = df.filter(col("category") != "music")
df_not_music.write.format("delta").mode("overwrite").save(f"{DATA_SOURCE}/category_raw")
df_not_music.display()
This will delete the music category from the category table, but not from the table of famous people.
We can now commit the change and test it to make sure it’s all good: we create a new branch and open a pull request. Sometimes we may collect multiple commits before opening the pull request.
This will automatically run a GitHub Action that tests the ETL job in isolation in the lakeFS environment, meaning we never impact the production data. This is a zero-copy operation, so no data is copied; instead, lakeFS creates pointers to the production data.
The whole process takes only a few seconds to import the data, since it’s a metadata-only operation. The job runs on a separate branch in lakeFS, and any changes made to the data are saved in the lakeFS repository.
If multiple team members are working on the same code, lakeFS will create a separate branch for each pull request. This is what it looks like on the lakeFS UI:

Each branch is a unique, isolated environment where team members can run their code, confident that their changes will never impact production data or other people’s work.
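As a rough illustration of how that per-pull-request isolation could be wired up, the CI job might derive the lakeFS branch name from the pull request’s source branch (the naming scheme and repository name here are assumptions, not taken from the sample repo):
import os
import lakefs

# GITHUB_HEAD_REF is set by GitHub Actions for pull_request events;
# fall back to a local name when running outside CI.
pr_branch = os.environ.get("GITHUB_HEAD_REF", "local-dev").replace("/", "-")
test_branch_name = f"etl-test-{pr_branch}"

repo = lakefs.repository("example-repo")  # illustrative repository name
repo.branch(test_branch_name).create(source_reference="main")
print(f"Running the ETL against lakefs://example-repo/{test_branch_name}")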
The GitHub Action also runs data validations at the end, e.g., checking whether the number of categories in the category table and in the famous people table match:
df_category = spark.read.format("delta").load(f"{DATA_SOURCE}/category_raw")
df_category.display()

df_famous_people = spark.read.format("delta").load(f"{DATA_SOURCE}/famous_people")
df_famous_people.groupby("category").count().display()

# Check the number of distinct categories in the famous_people table
number_of_categories = df_famous_people.groupby("category").count().count()

if number_of_categories == df_category.count():
    dbutils.notebook.exit("Success")
else:
    dbutils.notebook.exit(f"Referential integrity issue. Number of categories in 'famous_people' table is {number_of_categories} while number of categories in parent 'category_raw' table is {df_category.count()}.")
The data validation check returns a success or failure message. If validation fails, the ETL code changes can’t be merged into the master/production branch.
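As a final sketch, the CI step could turn that exit value into a merge gate – how the result string is fetched from the Databricks run is out of scope here and simply assumed to be available:
import sys

# Assumed to hold the value returned by dbutils.notebook.exit(...) above,
# e.g. fetched from the Databricks job run output by the workflow.
notebook_result = "Success"

if notebook_result != "Success":
    print(f"Data validation failed: {notebook_result}")
    sys.exit(1)  # a non-zero exit fails the GitHub Action and blocks the merge
print("Data validation passed - safe to merge the lakeFS branch and the PR")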
Wrap up
When promoting ETL code to production, lakeFS allows you to run the ETL job against production data as part of a pull request. You can work in isolation and test your changes at the real scale of your production data.


