Databricks has come a long way since growing out of a UC Berkeley lab in 2013 with an open-source distributed computing framework called Apache Spark.
Fast forward eight years, and in addition to the core Spark engine, there is a dizzying number of new features in various stages of public preview within the Databricks platform. In case you haven’t been keeping up, these include:
Taken together, these features provide functionality comparable to the set of tools commonly referred to as the “Modern Data Stack”.
The result is a noticeably consolidated data stack, almost entirely contained within the Databricks ecosystem.
Some people cheer for this type of consolidation, tired of spending time fitting together pieces of an analytics puzzle that don’t necessarily want to get along. Others believe an unbundled architecture is preferable, allowing users to mix and match tools specialized for a specific purpose.
In truth, there’s no clear answer as to who is right. It depends largely on how well the different companies competing in the space execute. For its part, lakeFS is largely agnostic in this battle, as it fits at a foundational level into nearly any stack.
Given its positioning, Databricks sees value in growing the data lake ecosystem, which includes lakeFS. Consequently, we’ve started to collaborate more closely with members of the Databricks team, on both content and product.
Data + AI Online Meetup Recap
The Topic: Multi-Transactional Guarantees with Delta Lake and lakeFS.
The Key Takeaway: The version-controlled workflows enabled by lakeFS allow you to expose new data from multiple datasets in one atomic merge operation. This prevents consumers of the data from ever seeing an inconsistent view across datasets, which can lead to incorrect metrics.
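As a rough sketch of what that workflow looks like with the lakectl CLI (the repository, branch, and commit-message names below are placeholders, and the writes to the branch are elided):

```shell
# Create an isolated branch for the new data (zero-copy).
lakectl branch create lakefs://example-repo/ingest \
    --source lakefs://example-repo/main

# ... write the new partitions of *all* datasets to the ingest branch ...

# Commit the changes on the branch.
lakectl commit lakefs://example-repo/ingest \
    -m "Add daily partitions for all datasets"

# One atomic merge exposes every dataset's new data to consumers at once.
lakectl merge lakefs://example-repo/ingest lakefs://example-repo/main
```

Until the final merge, readers of `main` see none of the in-flight changes; after it, they see all of them together.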
After showing how to configure a Spark cluster to read/write from a lakeFS repo, I hopped into a demo of running a data validation check with a Databricks Job and lakeFS pre-merge hook.
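The Spark-side configuration boils down to pointing Spark’s S3A filesystem client at the lakeFS S3 gateway. Here is a minimal PySpark sketch; the endpoint, credentials, repository, and branch names are all placeholders you would replace with your own:

```python
from pyspark.sql import SparkSession

# Sketch: route Spark's S3A client through a lakeFS installation's S3 gateway.
# The endpoint URL and the access/secret keys come from your lakeFS deployment.
spark = (
    SparkSession.builder
    .appName("lakefs-demo")
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# With the gateway configured, paths take the form s3a://<repo>/<branch>/<path>,
# so reads and writes are scoped to a specific branch of the repository.
df = spark.read.parquet("s3a://example-repo/main/events/")
df.write.parquet("s3a://example-repo/ingest/events_clean/")
```

In Databricks, the same `fs.s3a.*` settings can instead be supplied in the cluster’s Spark configuration so every notebook on the cluster picks them up.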
Check out the full talk below!