Databricks has come a long way since growing out of UC Berkeley’s AMPLab in 2013 with an open-source distributed computing framework called Spark.
Fast forward eight years and in addition to the core Spark product, there are a dizzying number of new features in various stages of public preview within the Databricks platform. In case you haven’t been keeping up, these include:
Taken together, these features provide functionality comparable to the set of tools commonly referred to as the “Modern Data Stack”.
The result is a noticeably consolidated data stack, almost entirely contained within the Databricks ecosystem.
Some people cheer for this type of consolidation, tired of spending time fitting together pieces of an analytics puzzle that don’t necessarily want to get along. Others believe an unbundled architecture is preferable, allowing users to mix and match tools that are each specialized for a specific purpose.
In truth, there’s no clear answer as to who is right. It depends largely on how well the different companies competing in the space execute. For its part, lakeFS is largely agnostic in this battle, as it fits at a foundational level with nearly any stack.
Given their positioning, Databricks sees value in growing the data lake ecosystem, which includes lakeFS. Consequently, we’ve started to collaborate more closely with members of the Databricks team, in both content and product.
Data + AI Online Meetup Recap
One of the first outcomes of this collaboration is a joint meetup presentation by Denny Lee and me.
The Topic: Multi-Transactional Guarantees with Delta Lake and lakeFS.
The Key Takeaway: The version-controlled workflows enabled by lakeFS allow you to expose new data from multiple datasets in one atomic merge operation. This prevents a consumer of the data from seeing an inconsistent view, which can lead to incorrect metrics.
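To make the idea concrete, here is a minimal toy model of the pattern in Python. It is not the lakeFS API: the ToyRepo class, its method names, and the dataset names are all hypothetical, and the merge is simplified to replacing the destination snapshot with the source snapshot (no conflict handling). It only illustrates why publishing through a single merge keeps readers from seeing a half-updated set of tables.

```python
from types import MappingProxyType


class ToyRepo:
    """Toy in-memory model of branch-based publishing (hypothetical, not the lakeFS API)."""

    def __init__(self):
        # Each branch points at one immutable snapshot: {dataset name -> version label}.
        self._branches = {"main": MappingProxyType({})}

    def create_branch(self, name, source="main"):
        # A new branch starts from the same snapshot as its source branch.
        self._branches[name] = self._branches[source]

    def write(self, branch, dataset, version):
        # A write builds a new snapshot for this branch only; other branches are untouched.
        updated = dict(self._branches[branch])
        updated[dataset] = version
        self._branches[branch] = MappingProxyType(updated)

    def read(self, branch):
        # Readers always get one complete snapshot of the branch.
        return dict(self._branches[branch])

    def merge(self, source, destination):
        # Simplification: the destination takes the source snapshot in one pointer swap,
        # so every dataset changes together. Real merges also handle conflicts.
        self._branches[destination] = self._branches[source]


repo = ToyRepo()
repo.write("main", "orders", "v1")
repo.write("main", "customers", "v1")

# Stage new versions of both datasets on an isolated branch.
repo.create_branch("etl-run")
repo.write("etl-run", "orders", "v2")
mid_update_view = repo.read("main")  # a consumer reading while the job is half done
repo.write("etl-run", "customers", "v2")

# Publish both datasets to consumers in one step.
repo.merge("etl-run", "main")
after_merge_view = repo.read("main")
```

In this sketch a consumer of main sees either both datasets at v1 or both at v2, never a mix of the two.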
After showing how to configure a Spark cluster to read from and write to a lakeFS repo, I hopped into a demo of running a data validation check with a Databricks Job and a lakeFS pre-merge hook.
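For readers who want a feel for the configuration step, here is a rough sketch in Python. It assumes Spark reaches lakeFS through its S3-compatible endpoint using the standard Hadoop S3A settings, which is one common setup and not necessarily the exact one shown in the talk. The helper names, endpoint URL, credentials, and repository name are placeholders I made up for illustration.

```python
def lakefs_s3a_conf(endpoint, access_key, secret_key):
    """Build Hadoop S3A settings that point Spark at an S3-compatible lakeFS endpoint.

    Assumption: lakeFS is accessed through its S3-compatible interface.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access keeps the repository name in the path, not the hostname.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def lakefs_path(repo, branch, key):
    """Compose an object path of the form s3a://<repo>/<branch>/<key>."""
    return f"s3a://{repo}/{branch}/{key.lstrip('/')}"


# Placeholder values, for illustration only.
conf = lakefs_s3a_conf("https://lakefs.example.com", "example-access-key", "example-secret-key")
orders_path = lakefs_path("example-repo", "main", "/tables/orders")
```

Each key and value in conf could then be passed to SparkSession.builder.config(key, value), or entered in the cluster’s Spark configuration. Because the branch is part of the path, switching a job from main to an isolated branch only means changing the path it reads and writes.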
Check out the full talk below!
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.