Many organizations today use a complex mix of data lakes and data warehouses to build the foundation for their data-driven processes. They run parallel pipelines for handling data in planned batches or streaming data in real time, often adding new tools for analytics, business intelligence, and data science.
Databricks was designed to reduce this complexity. The Databricks architecture is known for being a single, cloud-native platform that encompasses all areas of data engineering, data management, and data analysis.
Keep reading to learn about all the architectural intricacies of Databricks and how it can help your team harness the potential of data in your organization.
What is Databricks Architecture?
The Databricks architecture is simple and cloud-native. It was designed to integrate the customer’s Databricks account with their existing cloud accounts from major providers such as AWS, Google Cloud, or Azure.
The Databricks platform is exceptionally versatile, using open-source solutions at every step to accommodate the many different ways teams explore and process data, all inside a unified platform with handy features like Auto Loader.
Technically speaking, Databricks is a general-purpose, data-agnostic hybrid Platform-as-a-Service (PaaS). Users typically deploy a single-tenant data plane (virtual network and compute) in their own cloud provider account, while the multi-tenant control plane runs within Databricks’ account, hence the name hybrid PaaS. That way, you reap the benefits of a PaaS platform while retaining control over your data processing clusters in your own cloud account.
What does it mean that Databricks is data-agnostic? Essentially, the platform doesn’t care which data you process on it. You can add source code, business logic, and datasets without ever running into platform messages asking you to “truncate user IDs” or anything like that.
Components of the Databricks Architecture
The design has two layers: the Control Plane and the Data Plane (now called the Compute Plane).
Control Plane
The Control Plane houses the back-end services that Databricks manages for your account, such as the graphical interface and the REST APIs for account administration and workspaces. Notebook commands and several other workspace configurations are also stored here, encrypted at rest.
Data Plane (currently called Compute Plane)
The Data Plane, now called the Compute Plane, is responsible for external/client communications and data processing.
Note that while it’s common to use the customer’s cloud account for both the Data Plane and data storage, Databricks also supports an architecture where the Data Plane lives in Databricks’ cloud account and the customer’s cloud account contains only the data storage.
The Compute Plane is where your data is processed. Most Databricks computations use compute resources in your own cloud account, for example your AWS account’s network and its associated compute resources. This is known as the classic compute plane. Databricks uses the classic compute plane for notebooks and jobs, as well as for pro and classic Databricks SQL warehouses.
Key Features of Databricks Security Architecture
Databricks prioritizes security, with features like encryption, access control, data governance, and architectural security measures in place to safeguard and assure data integrity.
The Databricks security architecture was designed to provide robust data protection mechanisms while also ensuring platform integrity. The design includes a variety of security measures and best practices to protect sensitive data and prevent unwanted access to a data lake, data warehouse, or lakehouse.
The essential components of Databricks security architecture are as follows:
Access Control
Databricks has powerful access control methods to govern user permissions and prevent unauthorized access. Role-based access control (RBAC) and fine-grained access control let teams limit user rights and efficiently manage data access.
Databricks includes several access control mechanisms, each used for a different type of securable object.
For example, access to workspace-level securable objects is regulated via access control lists (ACLs). You can use them to govern who has access to workspace objects (folders, notebooks, experiments, and models), clusters, pools, jobs, Delta Live Tables pipelines, alerts, dashboards, queries, and SQL warehouses. All workspace admin users, as well as users who have been delegated that authority, can manage ACLs.
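As a rough illustration of how an ACL entry might be applied programmatically, the sketch below builds (but does not send) a request for the Databricks Permissions REST API. The host, token, cluster ID, and group name are made-up placeholders, and the helper name `build_permissions_request` is hypothetical. Check the endpoint path and permission levels against the API reference for your workspace before relying on it.

```python
import json
from urllib.request import Request

# Principal fields accepted in an access_control_list entry.
PRINCIPAL_KINDS = {"user_name", "group_name", "service_principal_name"}


def build_permissions_request(host, token, object_type, object_id, grants):
    """Build a PATCH request that adds ACL entries to one workspace object.

    grants is a list of (principal_kind, principal, permission_level) tuples.
    The request is only constructed here; nothing is sent over the network.
    """
    acl = []
    for kind, principal, level in grants:
        if kind not in PRINCIPAL_KINDS:
            raise ValueError(f"unknown principal kind: {kind}")
        acl.append({kind: principal, "permission_level": level})

    body = json.dumps({"access_control_list": acl}).encode("utf-8")
    url = f"{host.rstrip('/')}/api/2.0/permissions/{object_type}/{object_id}"
    return Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    req = build_permissions_request(
        "https://example.cloud.databricks.com",
        "<personal-access-token>",
        "clusters",
        "1234-567890-abcde123",
        [("group_name", "data-engineers", "CAN_RESTART")],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To actually apply the change, the request would be passed to `urllib.request.urlopen` by a user who is allowed to manage that object.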
For account-level securable objects, Databricks offers account role-based access control. You can manage access to data securable objects using Unity Catalog or Hive metastore table access controls.
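For data securable objects, access is usually expressed as SQL `GRANT` statements. The following is a minimal sketch, assuming Unity Catalog's three-level `catalog.schema.table` naming. The helper `grant_statement`, the table name, and the group name are hypothetical, and only a small subset of privileges is whitelisted here for illustration.

```python
# A deliberately small whitelist; Unity Catalog defines more privileges.
ALLOWED_PRIVILEGES = {"SELECT", "MODIFY", "USE CATALOG", "USE SCHEMA"}
ALLOWED_SECURABLES = {"CATALOG", "SCHEMA", "TABLE"}


def _quote(identifier):
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def grant_statement(privilege, securable_type, securable_name, principal):
    """Return a GRANT statement for a dotted securable name and a principal."""
    privilege = privilege.upper()
    securable_type = securable_type.upper()
    if privilege not in ALLOWED_PRIVILEGES:
        raise ValueError(f"unsupported privilege: {privilege}")
    if securable_type not in ALLOWED_SECURABLES:
        raise ValueError(f"unsupported securable type: {securable_type}")

    quoted_name = ".".join(_quote(part) for part in securable_name.split("."))
    return f"GRANT {privilege} ON {securable_type} {quoted_name} TO {_quote(principal)}"


if __name__ == "__main__":
    # Hypothetical table and group names.
    print(grant_statement("SELECT", "TABLE", "main.sales.orders", "data-analysts"))
```

In a Databricks notebook, the resulting string could be executed with `spark.sql(...)` by a user who holds sufficient privileges on the object.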
Databricks also offers administrative roles and privileges that can be assigned directly to users, service principals, and groups. Note that access control requires the Premium plan or higher.
Data Governance
Data governance refers to the rules and processes that an organization uses to safely manage its data assets. Centralized data governance is a key component of a lakehouse, which combines data warehousing and AI use cases into a single platform.
This streamlines the current data stack by removing the data silos that formerly separated and complicated data engineering, analytics, business intelligence, data science, and machine learning.
Databricks provides tools for data governance, such as audits and compliance controls. These capabilities allow enterprises to track and monitor data access, manage the data lifecycle, and meet all their legal obligations or compliance goals.
To simplify data governance, Databricks Lakehouse provides a unified governance solution for data, analytics, and AI. By reducing the number of copies of your data and migrating to a single data processing layer where all of your data governance rules can run concurrently, you increase your chances of staying compliant or detecting breaches.
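To make the auditing idea more concrete, here is a small sketch that filters audit log records for one service. It assumes the logs are delivered as JSON lines carrying `serviceName`, `actionName`, `timestamp`, and `userIdentity.email` fields. The sample records are invented for illustration, and the function name `data_access_events` is hypothetical.

```python
import json


def data_access_events(log_lines, service_name="unityCatalog"):
    """Yield a compact summary of audit events emitted by one service.

    log_lines is any iterable of JSON strings, one audit record per line.
    Blank lines are skipped.
    """
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("serviceName") != service_name:
            continue
        yield {
            "timestamp": event.get("timestamp"),
            "user": (event.get("userIdentity") or {}).get("email"),
            "action": event.get("actionName"),
        }


if __name__ == "__main__":
    # Invented sample records, not real audit output.
    sample = [
        json.dumps({
            "timestamp": 1700000000000,
            "serviceName": "unityCatalog",
            "actionName": "getTable",
            "userIdentity": {"email": "ana@example.com"},
        }),
        json.dumps({
            "timestamp": 1700000001000,
            "serviceName": "notebook",
            "actionName": "runCommand",
            "userIdentity": {"email": "ben@example.com"},
        }),
    ]
    for summary in data_access_events(sample):
        print(summary)
```

A filter like this is only a starting point. In practice, teams often load the delivered logs into a table and query them there.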
Encryption
Databricks uses encryption at rest and in transit to protect data from unauthorized access. Encryption techniques protect data storage, network transmission, and user credentials.
Databricks has encryption options to secure data residing in a data warehouse or data lake, but note that not all security features are available on every pricing tier.
You can add a customer-managed key for managed services to help protect and control access to the following types of encrypted data:
- Notebook source files kept on the control plane.
- Interactive notebook results stored in the control plane.
- Secrets saved using the secret management APIs.
- Databricks SQL queries and query history.
- Personal access tokens or other credentials needed to configure Git integration with Databricks repositories.
You can use your own key to encrypt data in your cloud account when setting up your workspace.
Network Security
Let’s start with some basics about Databricks networking.
As mentioned above, Databricks works in two planes: the Control Plane and the Compute Plane. The former contains the back-end services that Databricks manages for your account, including the web application.
The Compute Plane is where your data is processed. There are two types of compute plane, depending on the compute you use:
- Classic Compute Plane: compute resources run in your own cloud account, for example inside your AWS account’s network.
- Serverless Compute Plane: compute resources run within your Databricks account.
Now let’s dive into Databricks network security.
Databricks includes many mechanisms that help secure workspaces and prevent data exfiltration. These safeguards block unwanted network access while preserving data security and privacy.
By default, Databricks provides a secure networking environment; however, if your company has specific requirements, you can configure network connectivity features for the connections between users, the control plane, and the compute plane. For example, you can set up features to manage access and offer private connectivity between users and their Databricks workspaces.
Classic compute resources, like clusters, are deployed in your cloud account and communicate with the control plane. You can use classic network connectivity features to deploy classic compute plane resources in your own virtual private cloud and to enable private connectivity between clusters and the control plane.
Security Best Practices
Databricks encourages a number of helpful security best practices through its Security Reference Architecture (SRA), which also includes templates for deploying workspaces with predetermined security configurations. This makes it easier for teams to comply with defined security requirements and architectural security controls.
The Security Reference Architecture (SRA), in combination with Terraform templates, simplifies the deployment of workspaces that adhere to security best practices. The official Databricks Terraform provider lets you programmatically provision workspaces and the cloud infrastructure they need. These unified Terraform templates come pre-configured with hardened security settings comparable to those used by Databricks’ most security-conscious customers.
Databricks commonly sees the following deployment patterns:
- Production or enterprise deployments are generally expected to include features such as single sign-on (SSO) and multi-factor authentication (MFA).
- Highly secure deployments typically involve sensitive data, intellectual property, or regulated industries like healthcare, life sciences, or financial services. These deployments may use a PrivateLink connection.
- Most enterprise production Databricks deployments include the common settings listed below. For small data science teams, deploying all of these may not be necessary.
If Databricks is an important element of your organization or you analyze sensitive data, Databricks recommends reviewing these action items:
- Evaluate whether multiple workspaces are needed for segmentation.
- Make sure that your S3 buckets are encrypted and public access is restricted.
- Deploy Databricks in a customer-managed VPC to have more control over the network environment. Even if you don’t need this now, choosing it up front makes the workspace easier to adapt to future requirements.
- Authenticate with single sign-on and multi-factor authentication.
- Separate accounts with administrative powers from regular user accounts.
- Run production workloads using service principals.
- Configure Databricks’ audit log delivery.
- Use token management – it lets you set the maximum token lifespan for future tokens.
- Configure the admin panel settings based on your organization’s needs.
- Use Unity Catalog for fine-grained access control and centralized governance.
- Use bucket rules or other mitigations to prevent storing production datasets in DBFS.
- Back up your notebooks stored in the control plane or save them in Git repositories.
- Store and utilize secrets securely using Databricks or a third-party service.
- Consider deploying network protections against data exfiltration.
- Restart clusters on a regular basis so that the most recent fixes are deployed.
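As an example of turning one checklist item into something scriptable, the sketch below prepares a request that caps the lifetime of future personal access tokens. It assumes the limit is set through the workspace configuration endpoint using the `maxTokenLifetimeDays` key. The host and token are placeholders, the helper name is hypothetical, and the request is built but never sent.

```python
import json
from urllib.request import Request


def build_max_token_lifetime_request(host, admin_token, max_days):
    """Build a PATCH request that sets the maximum lifetime for new tokens.

    The value is sent as a string. Tokens that already exist are not changed
    by this setting.
    """
    if not isinstance(max_days, int) or max_days <= 0:
        raise ValueError("max_days must be a positive integer")

    body = json.dumps({"maxTokenLifetimeDays": str(max_days)}).encode("utf-8")
    return Request(
        f"{host.rstrip('/')}/api/2.0/workspace-conf",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder host and token for illustration only.
    req = build_max_token_lifetime_request(
        "https://example.cloud.databricks.com", "<admin-token>", 90
    )
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Sending the request would require workspace admin credentials. Other items in the list, such as audit log delivery, can be automated in a similar way.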
Databricks was designed to enable safe cross-functional team communication while managing a considerable amount of backend services, and it definitely does the job! The unique Databricks architecture allows teams to focus on data science, analytics, and data engineering tasks without worrying about the complexity of their setup.
If you’d like to see how Databricks can be easily extended to cover data version control, check out this guide to integrating Databricks with the open-source tool lakeFS.