Iddo Avneri
January 10, 2023

Introduction

This tutorial reviews all the steps needed to configure lakeFS on Databricks.

It assumes that lakeFS is already set up and running against your storage (in this example, AWS S3), and focuses on setting up the integration between Databricks and lakeFS.

Prerequisites

  1. A lakeFS server (you can deploy one independently or use the hosted solution, lakeFS Cloud).
  2. lakeFS credentials (key and secret), or the permissions to create those credentials.
  3. AWS credentials with access to the S3 bucket, for the lakeFS Hadoop file system configuration.
  4. A Databricks workspace with the ability to run compute clusters.
  5. Permissions to manage the cluster configuration, including adding libraries.

Step 1 – Acquire lakeFS Key and Secret

In this step, we will acquire the lakeFS key and secret that will be entered into Databricks in the following steps. If you already have a key and secret, you can skip this section.

Note: To create a new access key, you need either the AuthManageOwnCredentials policy or the AuthFullAccess policy attached to your user.

Log in to lakeFS and click Administration -> Create Access Key.

A new key will be generated:

As instructed, copy the Secret Access Key and store it somewhere safe; you will not be able to access it again (though you can always create new ones).


Step 2 – Configure Spark on your cluster to communicate with lakeFS

  1. In Databricks, go to your cluster configuration page.
  2. Click Edit.
  3. Expand Advanced Options.
  4. Under the Spark tab, add the following configuration, replacing the credentials and endpoint with yours.
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.access.key AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.s3a.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.lakefs.endpoint https://lakefs.example.com/api/v1
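If you manage several clusters, it can help to generate these settings from one place. Below is a small hypothetical helper (not part of lakeFS or Databricks) that builds the same key/value pairs as a Python dict; the credential values are the placeholder examples from above:

```python
# Hypothetical helper: builds the Spark configuration entries shown above
# as a dict, e.g. for templating cluster configurations.
# The keys are the lakeFS/S3A Hadoop settings; the values below are placeholders.
def lakefs_spark_conf(lakefs_key, lakefs_secret, s3_key, s3_secret, endpoint):
    return {
        "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
        "spark.hadoop.fs.lakefs.access.key": lakefs_key,
        "spark.hadoop.fs.lakefs.secret.key": lakefs_secret,
        "spark.hadoop.fs.s3a.access.key": s3_key,
        "spark.hadoop.fs.s3a.secret.key": s3_secret,
        "spark.hadoop.fs.lakefs.endpoint": endpoint,
    }

conf = lakefs_spark_conf(
    "AKIAlakefs12345EXAMPLE",
    "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY",
    "AKIAIOSFODNN7EXAMPLE",
    "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "https://lakefs.example.com/api/v1",
)
# Print in the "key value" form expected by the cluster's Spark config box
print("\n".join(f"{k} {v}" for k, v in conf.items()))
```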

When using Delta Lake tables, the following may also be required in some versions of Databricks:

spark.hadoop.fs.s3a.bucket.<repo-name>.aws.credentials.provider shaded.databricks.org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.<repo-name>.session.token lakefs

When running Azure Databricks, the following configuration is needed as well:

spark.databricks.delta.logStore.crossCloud.fatal false

(If you don’t use this configuration on Azure Databricks, you will get the error message: “Writing to Delta table on AWS from non-AWS is unsafe in terms of providing transactional guarantees”.)

Note that on Azure Databricks, you need to use the S3A gateway as opposed to the lakeFS Hadoop file system client.

For more information, please see the documentation from Databricks.
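For reference, pointing S3A at the lakeFS S3 gateway generally looks like the following (a sketch based on the lakeFS S3-gateway configuration; replace the keys and endpoint with your own lakeFS values):

```
spark.hadoop.fs.s3a.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.s3a.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.path.style.access true
```

With this configuration, paths take the form s3a://&lt;repo&gt;/&lt;branch&gt;/&lt;path&gt; instead of lakefs:// URIs.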

Optional – Running lakeFS inside your VPC

When lakeFS runs inside your private network, your Databricks cluster needs to be able to access it. This can be done by setting up VPC peering between the two VPCs (the one where lakeFS runs and the one where Databricks runs). For this to work with Delta Lake tables, you also have to disable multi-cluster writes with:

spark.databricks.delta.multiClusterWrites.enabled false

Step 3 – Add the lakeFS Hadoop File System 

  1. Find out the latest version available here.
    For example, at the time of writing it is 0.1.9.
  2. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  3. Click “Install New”
  4. Select “Maven”
  5. Input the Maven coordinates for the client version you would like to use. For example, if the latest version is 0.1.9: io.lakefs:hadoop-lakefs-assembly:0.1.9
  6. Click Install.
  7. You should now see a line showing the installation was successful:
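As an alternative to the UI, the same library can be installed through the Databricks Libraries API (POST /api/2.0/libraries/install). A sketch of the request payload is below; the cluster ID is a placeholder you would replace with your own:

```python
import json

# Request body for the Databricks Libraries API (POST /api/2.0/libraries/install).
# "0123-456789-abcde123" is a placeholder cluster ID.
payload = {
    "cluster_id": "0123-456789-abcde123",
    "libraries": [
        {"maven": {"coordinates": "io.lakefs:hadoop-lakefs-assembly:0.1.9"}}
    ],
}
print(json.dumps(payload, indent=2))
```

You would send this payload with an authenticated POST to your workspace URL, e.g. via curl or the Databricks CLI.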

Step 4 – Configure the Python Client 

To interact with lakeFS from Python, install the lakeFS Python client:

  1. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  2. Click “Install New”
  3. Select “PyPI” and insert lakefs-client==<lakeFS version> (for example, my version of lakeFS is 0.88.0, so I’ll insert: lakefs-client==0.88.0).
    Find the latest version by visiting https://pypi.org/project/lakefs-client/
  4. You should now see a line showing the installation was successful:
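The client version should track your lakeFS server version at the major.minor level, as in the 0.88.0 example above. A small hypothetical check (assuming plain X.Y.Z version strings):

```python
def versions_compatible(server_version: str, client_version: str) -> bool:
    """Return True when server and client share the same major.minor version."""
    return server_version.split(".")[:2] == client_version.split(".")[:2]

print(versions_compatible("0.88.0", "0.88.0"))  # True
print(versions_compatible("0.88.0", "0.87.1"))  # False
```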

Step 5 – Adding the lakeFS Spark Client

Some operations in lakeFS require the Spark client (for example, some of the ways of exporting data).

  1. Find out the latest version available here.
    For example, at the time of writing it is 0.6.0.

    Note that I’m running Spark 3.1.2. If you are running a different version of Spark, choose the package for your version.
  2. Click on the latest version (for example, 0.6.0) above and, under “Files”, download the JAR:
  3. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  4. Click “Install New”
  5. Drag and drop the downloaded jar:
  6. Click Install.
  7. You should now see a line showing the installation was successful:

Restart your cluster.

Step 6 – Try it out in a notebook

Create a new notebook in Databricks and attach it to the cluster configured above. 

Alternatively, import this example notebook.

  1. Set up the lakeFS endpoint, access key, and secret:
lakefsEndPoint = 'https://YourEndPoint' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAlakeFSAccessKey'
lakefsSecretKey = 'lakeFSSecretKey'
  2. Configure the Python client:
%xmode Minimal
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

client = LakeFSClient(configuration)
  3. Configure some variables (to be used in the next command):
repo = "learn-lakefs-repo01" # the name of the lakeFS repository
storageNamespace = "s3://bucketwheretherepositorysits" # Should be unique
sourceBranch = "main"
dataPath = "product-reviews"
  4. Create a lakeFS repository with the Python client:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo,
        storage_namespace=storageNamespace,
        default_branch=sourceBranch))
  5. Write data via the lakeFS Hadoop file system client to the new repository:
import_data_path = "/databricks-datasets/amazon/test4K/"
df = spark.read.parquet(import_data_path)
df.write.format("parquet").save("lakefs://{}/{}/{}".format(repo,sourceBranch,dataPath))
  6. Commit the data with the Python client:
client.commits.commit(
    repository=repo,
    branch=sourceBranch,
    commit_creation=models.CommitCreation(
        message='Uploading initial data into lakeFS',
        metadata={'using': 'python_api'}))
  7. Read data via the lakeFS Hadoop file system client:
# Note - This example uses static strings instead of parameters for an easier read
df = spark.read.parquet("lakefs://learn-lakefs-repo01/main/product-reviews/")

df.show()
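The lakefs://{repo}/{branch}/{path} pattern above recurs constantly in notebooks; a small hypothetical helper (not part of the lakeFS client) keeps it in one place:

```python
def lakefs_uri(repo: str, ref: str, path: str) -> str:
    """Build a lakefs:// URI for the lakeFS Hadoop file system client.

    `ref` may be a branch name or a commit ID.
    """
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"

# e.g. spark.read.parquet(lakefs_uri("learn-lakefs-repo01", "main", "product-reviews/"))
print(lakefs_uri("learn-lakefs-repo01", "main", "product-reviews/"))
# lakefs://learn-lakefs-repo01/main/product-reviews/
```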

Summary 

We have configured a Databricks cluster to read data from lakeFS using the lakeFS Hadoop file system client and to execute Git-like actions via the Python client.
