Ready to dive into the lake?
lakeFS is currently only
available on desktop.

For an optimal experience, provide your email below and one of our lifeguards will send you a link to start swimming in the lake!

lakeFS Community
Paul Singman
Paul Singman Author

September 8, 2021

Databricks has come a long way since growing out of a Berkeley Lab in 2013 with an open-source distributed computing framework called Spark.

Fast forward eight years and in addition to the core Spark product, there are a dizzying number of new features in various stages of public preview within the Databricks platform. In case you haven’t been keeping up, these include:

Taken together, it’s interesting to note that these features provide comparable functionality to the  set of tools commonly referred to as the “Modern Data Stack”.

modern data stack vs databricks ecosystem

The result is a noticeably consolidated data stack, almost entirely contained within the Databricks ecosystem.

Some people cheer for this type of consolidation, tired of spending time fitting together pieces of an analytics puzzle that don’t necessarily want to get along. Others believe an unbundled architecture is preferable, allowing users can mix-and-match tools specialized for a specific purpose. 

In truth, there’s no clear answer of who is right. It depends largely on the execution of the different companies competing in the space. For its part, lakeFS is largely agnostic in this battle, as it fits at a foundational level with nearly any stack.

Given their positioning, Databricks sees value in growing the data lake ecosystem, which includes lakeFS. Consequently, we’ve started to collaborate more closely with members of the Databricks team, in both content and product.

Data + AI Online Meetup Recap

multi-table transactions delta lakefs

One of the first outcomes of this collaboration is a joint meetup presentation with myself and Denny Lee

The Topic: Multi-Transactional Guarantees with Delta Lake and lakeFS. 

The Key Takeaway: The version-controlled workflows enabled by lakeFS allows you to expose new data from multiple datasets in one atomic merge operation. This prevents the possibility of a consumer of the data seeing an inconsistent view, which can lead to incorrect metrics. 

lakefs consistency branch merge

After showing how to configure a Spark cluster to read/write from a lakeFS repo, I hopped into a demo of running a data validation check with a Databricks Job and lakeFS pre-merge hook.

Check out the full talk below!

About lakeFS

The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.

Our mission is to maximize the manageability of open source data analytics solutions that scale.

To learn more...

Read Related Articles.

Need help getting started?

Git for Data – lakeFS

  • Get Started
    Get Started