Delta Lake is one of the three new open-source table formats gaining wide adoption in the data engineering community, along with Apache Iceberg and Apache Hudi. In the recent release of lakeFS we’ve added support for comparing the state of Delta Lake tables so that you can see metadata for what has changed.
Delta Diff in Action
Let’s walk through an example of what this looks like in practice.
Loading data into the main branch
To start with, I’ll write 1000 rows of test data to the demo.users Delta table on lakeFS, and commit it to the main branch:
# Write the Delta table to the main branch of the lakeFS repository
df.write.format("delta").mode('overwrite').save('s3a://example/main/demo/users')

# Commit it with the lakeFS Python client
# (api_instance is a pre-configured lakeFS commits API client)
commit_creation = CommitCreation(
    message="Initial user data load",
)
api_instance.commit('example', 'main', commit_creation)
Modifying data on a new branch
Now I want to add some more user data, but to be prudent I’ll take a branch from main to test the addition of the new data, just in case it doesn’t work.
branch_creation = BranchCreation(
    name="modify_user_data",
    source="main",
)
api_response = api_instance.create_branch('example', branch_creation)
I can confirm from the lakeFS web interface that I’ve now got two branches – main and modify_user_data:
To start with I’m going to add in some new data, using a merge:
from pyspark import SparkFiles
from pyspark.sql.functions import col
from delta.tables import DeltaTable

new_df = spark.read.parquet("file://" + SparkFiles.get("userdata2.parquet"))

deltaTable = DeltaTable.forPath(spark, 's3a://example/modify_user_data/demo/users')
deltaTable.alias("users").merge(
    source = new_df.alias("new_users"),
    condition = "users.id = new_users.id") \
  .whenNotMatchedInsertAll() \
  .execute()
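The whenNotMatchedInsertAll clause inserts only those source rows whose join condition matches no existing target row; matched rows are left untouched. A plain-Python sketch of that semantics, using hypothetical rows rather than the Spark API:

```python
# Sketch of whenNotMatchedInsertAll merge semantics with plain dicts
# (hypothetical data, not the actual Delta/Spark API).
existing_users = [
    {"id": 1, "country": "China", "salary": 50000},
    {"id": 2, "country": "Portugal", "salary": 70000},
]
new_users = [
    {"id": 2, "country": "Portugal", "salary": 99000},  # matches id 2 -> ignored
    {"id": 3, "country": "Brazil", "salary": 40000},    # no match -> inserted
]

existing_ids = {row["id"] for row in existing_users}
merged = existing_users + [row for row in new_users if row["id"] not in existing_ids]
# merged holds ids 1, 2, 3; the existing row for id 2 keeps its original salary
```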
Now I’ll mask IP addresses for users in a given country. First I’ll check a sample of the data:
deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(5)
+--------+---------------+
| country| ip_address|
+--------+---------------+
| China| 140.35.109.83|
|Portugal| 232.234.81.197|
|Portugal| 194.224.39.215|
| China| 246.225.12.189|
| China| 80.111.141.47|
+--------+---------------+
only showing top 5 rows
Next, I run the update:
deltaTable.update(
    condition = "country == 'Portugal'",
    set = { "ip_address" : "'x.x.x.x'" })
and then check another sample of the data to make sure it’s worked:
deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(5)
+--------+---------------+
| country| ip_address|
+--------+---------------+
| China| 140.35.109.83|
|Portugal| x.x.x.x|
|Portugal| x.x.x.x|
| China| 246.225.12.189|
| China| 80.111.141.47|
+--------+---------------+
only showing top 5 rows
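As the sample shows, the update rewrites ip_address only for rows satisfying the condition. Expressed as a plain-Python sketch (hypothetical rows, not the Delta API):

```python
# Sketch of deltaTable.update(condition, set): mutate a column
# only where the condition holds, leave all other rows alone.
users = [
    {"country": "China", "ip_address": "140.35.109.83"},
    {"country": "Portugal", "ip_address": "232.234.81.197"},
]
for row in users:
    if row["country"] == "Portugal":
        row["ip_address"] = "x.x.x.x"
# the Portugal row is masked; the China row is unchanged
```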
Finally, I’m going to delete some rows, trimming out those users with a salary greater than 60k:
deltaTable.delete(col("salary") > 60000)
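In plain-Python terms (again with hypothetical rows), the delete keeps only the rows that fail the predicate:

```python
# Sketch of deltaTable.delete(predicate): retain rows where the
# predicate is false (hypothetical data, not the Delta API).
users = [{"id": 1, "salary": 55000}, {"id": 2, "salary": 75000}]
users = [row for row in users if not (row["salary"] > 60000)]
# only the row with salary 55000 survives
```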
With all of that done, I’ll commit the changes to my working branch:
commit_creation = CommitCreation(
    message="Modify user data"
)
api_instance.commit('example', 'modify_user_data', commit_creation)
Doing a Diff of the Delta Table in lakeFS
We’ve now got two versions of our users Delta table; one on the main branch:
DeltaTable.forPath(spark, 's3a://example/main/demo/users').toDF().count()
1000
and one on the modify_user_data branch:
DeltaTable.forPath(spark, 's3a://example/modify_user_data/demo/users').toDF().count()
236
Over in the lakeFS web interface we can compare the two branches and see that there’s a difference:
and we can see what the difference is too:
Pretty handy, right? As well as comparing branch to branch, we can also inspect individual commits:
How did they do that?
The great thing about developing in the open is that, as well as the code being available, you can see the considerations and decisions made in the design document. The diff is built on the commit history that Delta Lake itself records in its transaction log, combined with the versioning information that lakeFS stores.
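Delta records each table commit as a JSON file under the table’s _delta_log/ directory, and each commit carries a commitInfo action naming the operation that produced it. Here’s a loose sketch of how such a history could be read back, using a simplified toy log layout (this is illustrative only, not lakeFS’s actual implementation):

```python
import json
import os
import tempfile

# Build a toy _delta_log directory with commitInfo actions of the kind
# Delta writes (simplified; real logs carry many more fields and actions).
log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
os.makedirs(log_dir)
for version, operation in enumerate(["WRITE", "MERGE", "UPDATE", "DELETE"]):
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        f.write(json.dumps({"commitInfo": {"operation": operation}}) + "\n")

# A diff tool can replay the log files in version order to recover
# the sequence of operations between two table versions.
history = []
for name in sorted(os.listdir(log_dir)):
    with open(os.path.join(log_dir, name)) as f:
        for line in f:
            action = json.loads(line)
            if "commitInfo" in action:
                history.append(action["commitInfo"]["operation"])

print(history)  # ['WRITE', 'MERGE', 'UPDATE', 'DELETE']
```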
Go make a Diff-erence – try this out in lakeFS today!
We hope that this new feature will make it easier to work with your data in Delta and lakeFS. Give it a go today – grab lakeFS here, try out the Quickstart, and join our Slack group to let us know how you got on.
If you want to try out the examples shown above you can find the notebook on the lakeFS GitHub sample repo.