It was June 27, 2022. San Francisco was bustling with 5,000+ data folks from around the world, there to attend the Data + AI Summit in person for the first time in two years.
Four days were packed with tons of information from keynotes, speakers, panels, expo booths and Databricks trainings. A flurry of new product announcements followed: the lakeFS Cloud launch, Delta Lake going completely open source, Delta 2.0, Unity Catalog, Databricks Marketplace, MLflow Pipelines, and more.
The lakeFS team was invited to give two talks at the summit. It turned out to be three lakeFS talks, thanks to Holden Karau, a Netflix engineer, who decided to use lakeFS for her demo!
- Chaos Engineering in the World of Complex Data Flow by Adi Polak, VP of Developer Experience at lakeFS.
- Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects by Oz Katz, CTO and co-creator of lakeFS.
- Tools for Assisted Apache Spark Version Migrations, from 2.1 to 3.2+ by Holden Karau, Engineer at Netflix.
The recorded sessions are available for free for everyone to (re)watch. Some of you might remember us from the cutest Axolotl you had your eyes on.
Alright! Enough about us. Our data whisperers on the expo floor have some interesting stories from the summit to tell. Read on!
Highlights & Industry Trends
- The biggest announcement was Databricks open sourcing all of Delta. Interestingly enough, just last month Snowflake announced support for the (open source) Iceberg table format. Two of the biggest behemoths in the data space are racing towards open source tech. The wind is definitely blowing towards OSS, and it is a huge win for the community.
- There was consensus among data/ML engineers that software engineering best practices ought to be brought into the world of data. We have a long way to go before this becomes standard practice, although access to the right tools could accelerate adoption.
- Implementing CI/CD for data, or enabling data versioning on data lakes, seems to be the first step in implementing engineering best practices, and it garnered interest from data/ML engineers. lakeFS helps with exactly this and is leading the data community in this space.
- There were a number of no-code/low-code tools on the expo floor. While they lower the barrier to entry for most data teams and make data engineering more cost-effective, it remains to be seen how effective they can be for productionizing data products at scale.
- In the AI/ML space, the spotlight was on operationalizing and deploying ML pipelines in production, so MLOps challenges and solutions were the talk of the town. MLOps offerings seem fragmented, with no one offering an end-to-end ML solution. Definitely a space to watch.
- Data-centric AI has been getting a lot of attention, thanks to Andrew Ng. This makes data engineering part of ML workflows, and it will be interesting to see how this shapes the data engineering landscape.
Now let us go over some of the major product/feature announcements from Databricks this year.
Product Announcements from Databricks
The week started with a keynote from Reynold Xin, co-founder of Databricks, introducing Spark Connect, a thin client abstraction for Spark.
It decouples the client from the Spark architecture by introducing a simple, stable, language-agnostic client-server protocol. With Spark Connect, developers can embed Spark in any application and expose it through the thin client.
What is even more interesting for streaming users is Project Lightspeed, the next generation of Spark Structured Streaming.
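To make the thin-client idea concrete, here is a minimal sketch of what connecting through Spark Connect can look like from Python. It assumes the client API as it later shipped in PySpark 3.4+ (`SparkSession.builder.remote` with an `sc://host:port` connection string) and a Spark Connect server already running; the helper names `spark_connect_url` and `run_remote_query` are hypothetical and only used for illustration.

```python
def spark_connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect connection string of the form sc://host:port.

    15002 is the default Spark Connect server port.
    """
    return f"sc://{host}:{port}"


def run_remote_query(host: str, sql: str, port: int = 15002):
    """Run `sql` on a remote Spark cluster through the thin client.

    Assumption: PySpark 3.4+ with the Spark Connect extras is installed
    and a Spark Connect server is listening at host:port.
    """
    # Imported lazily so the module loads even where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote(spark_connect_url(host, port)).getOrCreate()
    try:
        # The query plan is sent to the server; only results come back.
        return spark.sql(sql).collect()
    finally:
        spark.stop()
```

The application only needs the lightweight client and a network route to the server, for example `run_remote_query("localhost", "SELECT 1 AS ok")`; the Spark driver and executors stay on the cluster.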
As an increasing number of applications are using streaming data, the goal of Project Lightspeed is to offer features such as reduced latency, enhanced functionality for data processing, improved connectors for the ecosystem, and simplified deployment and troubleshooting.
The most exciting news for the open source community came when Michael Armbrust, creator of Spark SQL, announced “All of Delta is now open”. Delta Lake was open sourced in 2019 and is maintained by the Linux Foundation, but some features remained proprietary. Not anymore.
Delta Lake 2.0 was announced. It brings major improvements in query performance and functionality, including support for change data feed, Z-order clustering, and idempotent writes to Delta tables. It also supports dropping columns from a Delta table as a fast, metadata-only operation.
Enhanced data sharing enabled by Databricks Marketplace and Cleanrooms
Databricks Marketplace, powered by Delta Sharing, is an open marketplace for packaging and distributing data assets such as datasets, notebooks, dashboards and machine learning models. It allows consumers to access data products without having to be on the Databricks platform.
The innovations. just. won’t. stop. Databricks Cleanrooms are a way to share and join data across organizations in a privacy-centric way.
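As a rough illustration of the features listed above, the sketch below builds the Delta Lake SQL statements for enabling the change data feed, Z-ordering a table, and dropping a column. The table and column names are hypothetical, and the helper functions are ours rather than part of any Delta API; the statements assume a Spark session configured with Delta Lake 2.0 or later. Note that `DROP COLUMN` requires column mapping to be enabled on the table first.

```python
from typing import List


def delta_2_statements(table: str, zorder_col: str, drop_col: str) -> List[str]:
    """Return SQL statements exercising a few Delta Lake 2.0 features."""
    return [
        # Change data feed: record row-level changes for downstream readers.
        f"ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)",
        # Z-order clustering: co-locate related values to improve data skipping.
        f"OPTIMIZE {table} ZORDER BY ({zorder_col})",
        # Column mapping by name is a prerequisite for DROP COLUMN.
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.columnMapping.mode' = 'name', "
        "'delta.minReaderVersion' = '2', "
        "'delta.minWriterVersion' = '5')",
        # Metadata-only drop: existing data files are not rewritten.
        f"ALTER TABLE {table} DROP COLUMN {drop_col}",
    ]


def apply_statements(spark, statements: List[str]) -> None:
    """Run each statement, in order, on a Delta-enabled SparkSession."""
    for stmt in statements:
        spark.sql(stmt)
```

With a real session, `apply_statements(spark, delta_2_statements("events", "event_type", "debug_info"))` would run the four statements in order.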
Data teams with heterogeneous access to data run into data leak issues. With data clean rooms, data teams can share their existing data and run complex workloads in any language on the data while maintaining data privacy.
The Databricks SQL CLI now enables developers and analysts to run queries directly from their local computers. Databricks SQL also provides query federation, through which remote data sources including PostgreSQL, MySQL, Amazon Redshift and others can be queried without first extracting and loading the data.
Databricks SQL Serverless is now in Public Preview on AWS, making it easy to get started with data warehousing on the Lakehouse. Serverless frees up time, lowers costs, and allows you to focus on delivering the most value to your business rather than managing infrastructure.
Unity Catalog will be generally available on AWS and Azure in the coming weeks. It offers a centralized governance solution for all data and data assets, with built-in search and discovery, automated lineage for all workloads, and the performance and scalability needed for a lakehouse on any cloud.
This makes it easy to set and maintain fine-grained access controls for the data teams across the stack. It provides a single interface to manage permissions for all assets including auditing and lineage.
Delta Live Tables & Enzyme Optimizer
Delta Live Tables (DLT) is an ETL framework that helps developers build reliable pipelines. It offers declarative tools to develop and manage data pipelines for both batch and streaming use cases.
The developer only needs to define the transformations, whereas job orchestration, cluster management, monitoring, data quality checks and error handling are handled by Delta Live Tables. Major announcements around DLT include enhanced autoscaling and support for change data capture and slowly changing dimensions (SCD) Type 2.
Enzyme Optimizer is an automatic optimizer for incremental ETL that reduces the cost of pipelines. It automatically selects the fastest technique for incrementalizing queries.
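For readers unfamiliar with SCD Type 2, the idea is that a change never overwrites a row: the current version is closed and a new version is appended, so the full history stays queryable. The toy below shows that behavior in plain Python. It is not the DLT API (DLT expresses this declaratively), and the record layout (`seq`, `key`, `op`, `value`) and the names `DimRow` and `apply_scd2` are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass
class DimRow:
    key: str
    value: str
    valid_from: int           # sequence number at which this version became current
    valid_to: Optional[int]   # None means this row is the current version


def apply_scd2(history: List[DimRow], changes: List[Dict]) -> List[DimRow]:
    """Apply change records in 'seq' order, keeping full history (SCD Type 2)."""
    rows = [replace(r) for r in history]  # copy so the input is not mutated
    for change in sorted(changes, key=lambda c: c["seq"]):
        current = next(
            (r for r in rows if r.key == change["key"] and r.valid_to is None),
            None,
        )
        if current is not None:
            if change["op"] == "upsert" and current.value == change["value"]:
                continue  # nothing changed, keep the current version open
            current.valid_to = change["seq"]  # close the old version
        if change["op"] == "upsert":
            rows.append(DimRow(change["key"], change["value"], change["seq"], None))
    return rows


changes = [
    {"seq": 1, "key": "u1", "op": "upsert", "value": "Berlin"},
    {"seq": 2, "key": "u1", "op": "upsert", "value": "Paris"},
    {"seq": 3, "key": "u1", "op": "delete", "value": None},
]
table = apply_scd2([], changes)
```

After these three changes the table holds two closed versions for `u1` (Berlin valid from 1 to 2, Paris valid from 2 to 3) and no current row, since the final change was a delete.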
MLflow Pipelines with MLflow 2.0
MLflow Pipelines provide a standardized framework for creating production-grade ML pipelines that combine modular ML code with software engineering best practices.
Real-time ML training pipelines can be built on top of pipeline templates, making it easy to iterate and deploy rapidly while following DevOps best practices.
Databricks Workflows, the fully managed orchestration service, enables data engineers, data scientists and data analysts to build reliable data, analytics and ML workflows on any cloud without needing to manage complex infrastructure.
Workflows allows users to build ETL pipelines that are automatically managed, including ingestion and lineage, using Delta Live Tables. You can also orchestrate any combination of notebooks, SQL, Spark, ML models, and dbt as a Jobs workflow, including calls to other systems.
lakeFS is an open source project that offers data versioning at scale for object stores. Think “git for data”, only it scales to data lakes of hundreds of petabytes and integrates seamlessly with major data tools and frameworks.
Our mission is to bring engineering best practices to the world of Data & ML.