The data industry has long been waiting for a solution that would integrate the data structures and data management functions of data warehouses directly into the type of low-cost storage utilized for data lakes. Enter Databricks lakehouse, an architecture that does just that.
By merging the best of data warehouses and data lakes, a lakehouse offers a single system that lets data teams work faster, since they can use both structured and unstructured data without hopping between multiple systems.
Data lakehouses also guarantee that teams have access to the most current and comprehensive data for data science, machine learning, and business analytics initiatives.
Sounds great? Keep reading to dive into Databricks lakehouse architecture and take the first steps to implementing it.
What is the Databricks Lakehouse?
Databricks lakehouse (data lakehouse) is a new type of open data management architecture that combines the scalability, flexibility, and low cost of data lakes with the data management and ACID transactions of data warehouses. This basically lets teams carry out BI and ML tasks on any data.
But where did the idea for this architecture come from? What shortcomings of data warehouses and data lakes does lakehouse solve?
Understanding Data Lakehouse, Data Warehouse and Data Lakes
When dealing with data lakehouses, the data warehouse vs. data lake question no longer applies. Data warehouses and data lakes emerged as two important competitors in the data storage and analytics space, each with its own set of benefits and drawbacks.
The major distinction between them is that data warehouses can only handle structured and semi-structured data, while data lakes can hold an endless quantity of both structured and unstructured data.
A data lake is a centralized repository where teams can store large volumes of organized, semi-structured, and unstructured data. A data lake may handle structured data (for example, relational databases), semi-structured data (for example, JSON, XML), and unstructured data (for example, text, photos, and videos).
This makes data lakes a good choice for batch processing and large-scale data storage of raw data from a variety of sources, including sensors, logs, social media, streaming data, and others.
Lakes use a “schema-on-read” strategy, which means that structure is applied to data when it is read rather than when it is ingested. Data engineers can tailor the format and arrangement of data to suit specific use cases or analytical needs, which enables faster data exploration and analysis.
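As a rough illustration of schema-on-read (a plain-Python sketch rather than a real lake engine; the field names are made up), raw records are stored as-is and a schema is applied only at read time:

```python
import json

# Raw records land in the lake untouched -- no schema is enforced at write time.
raw_records = [
    '{"user": "a", "clicks": "3", "ts": "2024-01-01"}',
    '{"user": "b"}',  # a missing field is fine at ingest
]

def read_with_schema(lines, schema):
    """Apply a schema only when the data is read (schema-on-read)."""
    for line in lines:
        rec = json.loads(line)
        yield {field: cast(rec.get(field)) for field, cast in schema.items()}

# Different readers can apply different schemas to the same raw data.
schema = {"user": str, "clicks": lambda v: int(v) if v is not None else 0}
rows = list(read_with_schema(raw_records, schema))
print(rows)  # clicks parsed as int, missing values defaulted
```

Note that the raw files never change; only the reader's interpretation of them does, which is what lets the same lake serve many analytical needs.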
Data lakes are also very scalable. They may store petabytes or even exabytes of data, letting businesses increase their storage and processing capacities as their data requirements rise.
However, data lakes have a few limitations:
- Lakes lack centralized data governance, making it difficult to maintain data consistency and security. Without sufficient controls, they can grow cluttered, resulting in data integrity concerns.
- Since data lakes include a wide range of data types, inadequate organization can result in fragmented and segregated lakes or data swamps. This might make data finding difficult and result in duplicate or inconsistent data, impacting query performance.
- Because a data lake doesn’t enable concurrent transactions, several users attempting to access or edit data at the same time might result in data inconsistencies or integrity concerns.
To address these limitations, Databricks introduced a new architecture called a data lakehouse.
A data lakehouse is a data management design that combines data lake and data warehouse features.
Key Components of Databricks Lakehouse
Storage: Allowing Structured and Unstructured Data
Raw source data is stored in a data lake in lakehouse data storage. Data teams can use the lakehouse’s built-in data warehouse features, such as schema enforcement and indexing, to convert data for analysis, ensure data integrity, and ease governance. Moreover, they can use data quality processes to guarantee the reliability and accuracy of their data, including data profiling, cleaning, validation, and metadata management.
Since a lakehouse combines the advantages of a data lake with a data warehouse, teams no longer need separate data silos and can carry out analytics and generate insights straight from raw data without moving or duplicating data.
By default, tables produced with Databricks use the Delta Lake protocol. When you build a new Delta table, the table’s metadata is added to the metastore in the defined schema or database.
Data and table information are stored in a cloud object storage directory. Technically, the metastore reference to a Delta table is unnecessary; you can construct Delta tables by dealing directly with directory paths using Spark APIs. Some newer Delta Lake features may store extra metadata in the table directory, but every Delta table has a directory of Parquet data files alongside a _delta_log directory containing the transaction log.
The Delta Lake protocol comes with ACID guarantees per table. ACID is an acronym that stands for atomicity, consistency, isolation, and durability:
- Atomicity – ensures that each transaction is handled as a single “unit” that either succeeds or fails completely.
- Consistency – guarantees relate to how a given state of the data is observed by simultaneous operations.
- Isolation – refers to how concurrent operations can conflict with one another.
- Durability – it denotes the permanence of committed changes.
While many data processing and warehousing platforms claim to provide ACID transactions, particular guarantees vary by system.
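Atomicity in particular is often built on the ability to publish a file all-or-nothing; Delta Lake's transaction log depends on commits becoming visible as a single unit. A minimal sketch of that write-then-rename idea in plain Python (the file name and payload are hypothetical, not Delta's actual log format):

```python
import json
import os
import tempfile

def atomic_write(path, data):
    """Write data so readers see either the old file or the new one, never a partial write."""
    # Write to a temporary file in the same directory first...
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    # ...then atomically move it into place (os.replace is atomic on POSIX and Windows).
    os.replace(tmp, path)

# A reader never observes a half-written commit: the rename either happened or it didn't.
atomic_write("commit_000001.json", {"op": "append", "rows": 42})
with open("commit_000001.json") as f:
    print(json.load(f))
```

If the process crashes before the rename, the target file is untouched, which is exactly the "succeeds or fails completely" behavior atomicity promises.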
A data lakehouse aims to improve efficiency by building data warehouse structures on top of data lake technology. Storage stays fast and inexpensive, while the lakehouse approach improves data quality and removes redundancy.
The lakehouse structure includes ETL, which acts as a pipeline between the unsorted lake layer and the integrated warehouse layer.
Data Versioning: Table Level Time Travel
The users of Databricks Delta Lake can benefit from table level time travel as Delta automatically versions data and lets you retrieve any past version of that data.
This temporal data management streamlines your data pipeline by making it simple to audit, roll back data in the event of unintentional poor writes or deletes, and replicate tests and reports. This helps teams standardize on a clean, centralized, versioned big data repository in their own cloud storage.
Indexing: Faster Data Retrieval

Indexing is the process of constructing an index or metadata layer on top of a data lake, which is a massive collection of raw and unstructured data. Indexing extracts and organizes properties and information from data stored in a data lake. File names, file sizes, creation dates, data types, and even user-defined tags are all good examples.
The index is then constructed utilizing this extracted data, resulting in a searchable catalog of the data. It catalogs the data and offers it in a structured format, allowing for faster and more targeted searches and analysis. Indexing data makes it simpler to find and retrieve particular information, which speeds up data processing and analytics operations.
When a user wants to access certain data, they may use the index to quickly locate the necessary files or datasets. The index points to the location of the data in the data lake, allowing for quick data retrieval without having to scan the whole lake. This significantly reduces the amount of time necessary for data access and processing.
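The idea can be sketched with a plain-Python inverted index over hypothetical file metadata (not any particular catalog implementation):

```python
from collections import defaultdict

# Hypothetical metadata extracted from files as they land in the lake.
files = [
    {"name": "events_2024.parquet", "type": "parquet", "tags": ["clickstream"]},
    {"name": "users.json",          "type": "json",    "tags": ["pii", "users"]},
    {"name": "logs_jan.txt",        "type": "text",    "tags": ["logs"]},
]

# Build an inverted index: (attribute, value) -> set of file names.
index = defaultdict(set)
for f in files:
    index[("type", f["type"])].add(f["name"])
    for tag in f["tags"]:
        index[("tag", tag)].add(f["name"])

# A lookup now touches only the index, not every object in the lake.
print(index[("tag", "pii")])       # files tagged as containing PII
print(index[("type", "parquet")])  # all Parquet files
```

The index stays small relative to the lake, so lookups avoid a full scan even as the underlying storage grows to petabytes.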
Unity Catalog: Governance and Discovery

Unity Catalog combines data governance and discovery. It’s available in notebooks, jobs, Delta Live Tables, and Databricks SQL, and delivers capabilities and UIs for workloads and users suited to both data lakes and data warehouses.
It offers a single solution for categorizing, organizing, and managing disparate data sources, making it simpler for data engineers, data scientists, and analysts to discover and use the data they need. Unity Catalog helps teams realize the full potential of their data by providing features such as data discovery, data lineage, and governance capabilities.
Audit Logs

Audit logs are critical for a variety of reasons, ranging from compliance to cost control. They’re your official record of what’s going on in your lakehouse.
Before, platform administrators had to set up audit logging for each workspace separately. This meant that costs were higher and there could be organizational blind spots because workspaces could be created that did not have audit logging turned on.
Databricks lakehouse users can use a single account to manage all of their users, groups, workspaces, and audit logs from a single location. This makes life considerably easier for platform administrators and poses significantly fewer security risks.
After configuring audit logging at the account level, users can be certain that they get a low latency stream of all essential events occurring on their lakehouse – for all new and current workspaces established under that account.
Data Governance

Data governance is an essential component of any data management strategy, but it is especially important in a data lakehouse. It’s critical that teams working with lakehouses develop appropriate data governance procedures, rules, and technology to guarantee that data stored in a data lakehouse is accurate, trustworthy, and available for usage by authorized personnel and systems.
Here’s a list of data governance components in data lakehouses:
- Data catalog
- Data access policies
- Data retention policies
- Data compliance, lineage, and security
- Data observability
Delta Sharing

Databricks and the Linux Foundation created Delta Sharing to provide the first open-source protocol for data sharing across data, analytics, and AI. Users can securely share live data across platforms, clouds, and regions without replicating it.

With a secure hosted environment, teams can easily collaborate with customers and partners on any cloud while protecting data privacy.
Data Version Control
With frequent modifications and updates to data, ACID transactions are there to ensure that data integrity is maintained throughout batch and streaming reads and writes. It’s essential for the entire team to have a consistent picture of the data.
Data version control is an essential part of that effort.
One example of such a solution is Delta Lake, an open-source storage system that aims to improve data lake table performance and give transactional assurance.
While Delta Lake provides transactionality for structured data on a per-table basis, other solutions like the open-source lakeFS extend the lakehouse with full data version control for all data types, enabling multi-table transactionality.

Using lakeFS, you can apply Git-like operations to manage changes to data over time: branching to create an isolated version of the data for dev/test, committing to create a reproducible point in time, and merging to incorporate your changes in one atomic action.
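The real operations run through the lakeFS API or its CLI; as a rough model of the branch/commit/merge semantics only (plain Python, not lakeFS code), note that a branch is just a pointer to a commit, so creating one copies no data:

```python
class Repo:
    """Toy model of Git-like data versioning: branch, commit, merge."""

    def __init__(self):
        self.commits = {}             # commit id -> snapshot of the data
        self.branches = {"main": None}
        self._next = 0

    def branch(self, name, source="main"):
        # A branch is just a pointer -- creating one copies no data.
        self.branches[name] = self.branches[source]

    def commit(self, branch, data):
        cid = f"c{self._next}"
        self._next += 1
        self.commits[cid] = dict(data)  # a reproducible point in time
        self.branches[branch] = cid
        return cid

    def merge(self, source, target):
        # Atomically make the target branch point at the source's state.
        self.branches[target] = self.branches[source]

repo = Repo()
repo.commit("main", {"table_a": "v1"})
repo.branch("dev")                              # isolated copy for testing
repo.commit("dev", {"table_a": "v2", "table_b": "v1"})
repo.merge("dev", "main")                       # one atomic multi-table change
print(repo.commits[repo.branches["main"]])
```

Because the merge moves a single pointer, changes to multiple tables become visible on main in one atomic step, which is the multi-table transactionality described above.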
Databricks Lakehouse Pros and Cons
Data lakehouses combine the best of both worlds: a data warehouse and a data lake. They work with both structured and unstructured data, support a variety of workloads, and can benefit any member of the data team, from data engineers to data analysts to machine learning engineers.
Data lakehouse pros
- A single data repository takes less time and money to manage than a multi-solution system.
- Reduced data storage costs.
- BI tools have direct access to a larger dataset.
- There is less data transfer and redundancy.
- Schema management has been simplified.
- Because there is just one control point, data governance is simplified.
- Support for ACID-compliant transactions.
Data lakehouse cons
- Migrating from traditional data warehouses may be costly and time-consuming.
- Comes with a steep learning curve and a complex setup.
- Scala is the platform's primary language, which may be unfamiliar to some teams.
- The community is rather small compared to popular free tools.
Databricks Lakehouse Use Cases
The data lakehouse design is more flexible than the standard data warehouse or data lake architecture in that it may minimize data redundancy and increase data quality while providing lower-cost storage. The unsorted lake layer and the integrated warehouse layer are linked through ETL pipelines.
It is most commonly used with larger data sets (hundreds of terabytes and more) in environments that combine structured, semi-structured, and possibly unstructured data contributing to the same data operation.
A growing network of providers, including AWS S3, Azure Blob, GCS, MinIO, and NetApp Grid, offers options for cost-effective data storage.
Example use case: Intuit
Intuit has enterprise-scale data processing demands, with over 100 million users and sales close to $10 billion. Instead of designing distinct architectures for each big data project, which would worsen the data silo problem, Intuit developed a unified strategy that uses a lakehouse as the corporate data standard.
For real-time processing, the lakehouse architecture supports Spark Streaming as well as Flink. Analysts can use Redshift and Athena, as well as Databricks SQL and Photon-powered data science notebooks, to access the same data collection.
The most significant component of the lakehouse design is that different Intuit data personas all have a consistent perspective of the same collection of data.
Getting Started with Databricks Lakehouse
- Start with the data lake that currently maintains the majority of the company data
- Improve the quality and control of your data lake
- Data should be optimized for rapid query performance
- Provide native machine learning support
- Use open data formats and APIs to avoid lock-in
Delta Lake Integration Into Databricks Lakehouse Platform
Delta Lake is the optimized storage layer that serves as the foundation for storing data and tables in the Databricks Lakehouse Platform. This open-source software adds a file-based transaction log to Parquet data files, enabling ACID transactions and scalable metadata management.
Delta Lake is completely compatible with Apache Spark APIs and was designed for tight interaction with structured streaming, allowing users to use a single copy of data for both batch and streaming operations while also offering incremental processing at scale.
Data warehouses have a long history in decision support and business intelligence applications. But they clearly weren’t designed to handle unstructured data, semi-structured data, or data with a high variety, velocity, and volume.
Data lakes then emerged to handle raw data in a variety of formats on cheap storage for data science and machine learning. Still, they lacked critical features from the world of data warehouses: they didn’t support transactions or enforce data quality. Also, their lack of consistency/isolation made mixing appends and reads, as well as batch and streaming jobs, nearly impossible.
So data teams stitched these systems together so that BI and ML could work on data stored in both. The result: duplicate data, extra infrastructure costs, security concerns, and high operating costs.
Data lakehouses offered a way out by bringing the best of both worlds.
If you’re using a lakehouse architecture, you can enhance it with additional tools – for example, lakeFS for better data versioning capabilities.
Check out this Databricks lakeFS integration to learn more and join our Slack community to meet like-minded people experimenting with the lakehouse architecture.