Ready to dive into the lake?

lakeFS is currently only
available on desktop.

For an optimal experience, provide your email below and one of our lifeguards will send you a link to start swimming in the lake!

lakeFS Community
Robin Moffatt
April 17, 2023

Delta Lake is one of the three new open-source table formats gaining wide adoption in the data engineering community, along with Apache Iceberg and Apache Hudi. In the recent release of lakeFS we’ve added support for comparing the state of Delta Lake tables so that you can see metadata for what has changed.

Delta Diff in Action

Let’s walk through an example of what this looks like in practice.

Loading data into the main branch

To start with, I’ll write 1000 rows of test data to the demo.users Delta table on lakeFS, and commit it to the main branch:

df.write.format("delta").mode('overwrite').save('s3a://example/main/demo/users')
commit_creation = CommitCreation(
    message="Initial user data load",
)
api_instance.commit('example', 'main', commit_creation)

Modifying data on a new branch

Now I want to add some more user data, but being prudent in my approach take a branch from main to test the addition of the new data, just in case it doesn’t work.

branch_creation = BranchCreation(
    name="modify_user_data",
    source="main",
) 

api_response = api_instance.create_branch('example', branch_creation)

I can confirm from the lakeFS web interface that I’ve now got two branches – main and modify_user_data:

To start with I’m going to add in some new data, using a merge:

new_df = spark.read.parquet("file://" + SparkFiles.get("userdata2.parquet"))
users_deltaTable = DeltaTable.forPath(spark, 's3a://example/modify_user_data/demo/users')

users_deltaTable.alias("users").merge(
    source = new_df.alias("new_users"),
    condition = "users.id = new_users.id") \
  .whenNotMatchedInsertAll() \
  .execute()

Now I’ll mask IP addresses for users in a given country. First I’ll check a sample of the data:

deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(5)
+--------+---------------+
| country|     ip_address|
+--------+---------------+
|   China|  140.35.109.83|
|Portugal| 232.234.81.197|
|Portugal| 194.224.39.215|
|   China| 246.225.12.189|
|   China|  80.111.141.47|
+--------+---------------+
only showing top 5 rows

Next, I run the update:

deltaTable.update(
    condition = "country == 'Portugal'",
    set = { "ip_address" : "'x.x.x.x'" })

and then check another sample of the data to make sure it’s worked:

deltaTable.toDF().filter(col("country").isin("Portugal", "China")).select("country","ip_address").show(5)
+--------+---------------+
| country|     ip_address|
+--------+---------------+
|   China|  140.35.109.83|
|Portugal|        x.x.x.x|
|Portugal|        x.x.x.x|
|   China| 246.225.12.189|
|   China|  80.111.141.47|
+--------+---------------+
only showing top 5 rows

Finally, I’m going to delete some rows, trimming out those users with a salary greater than 60k:

deltaTable.delete(col("salary") > 60000)

With all of that done, I’ll commit the changes to my working branch

commit_creation = CommitCreation(
    message="Modify user data"
) 

api_instance.commit('example', 'modify_user_data', commit_creation)

Doing a Diff of the Delta Table in lakeFS

We’ve now got two versions of our users Delta Table; one on the main branch:

DeltaTable.forPath(spark, 's3a://example/main/demo/users').toDF().count()
1000

and one on the modify_user_data branch:

DeltaTable.forPath(spark, 's3a://example/modify_user_data/demo/users').toDF().count()
236

Over in the lakeFS web interface we can compare the two branches and see that there’s a difference:

and we can see what the difference is too:

Pretty handy, right? As well as comparing branch to branch, we can also inspect individual commits:

How did they do that?

The great thing about developing in the open is that as well as the code being available, you can see the design considerations and decisions made in the design document. The diff is based on the history that Delta provides, married with the information that lakeFS stores too.

Go make a Diff-erence – try this out in lakeFS today!

We hope that this new feature will make it easier to work with your data in Delta and lakeFS. Give it a go today – grab lakeFS here, try out the Quickstart, and join our Slack group to let us know how you got on.

If you want to try out the examples shown above you can find the notebook on the lakeFS GitHub sample repo.

Git for Data – lakeFS

  • Get Started
    Get Started
  • The annual State of Data Engineering Report is now available. Find out what’s new in 2023 -

    +