Iddo Avneri

Last updated on September 10, 2024

Introduction

This tutorial will review all steps needed to configure lakeFS on Databricks. 

This tutorial assumes that lakeFS is already set up and running against your storage (in this example, AWS S3), and focuses on setting up the Databricks and lakeFS integration.

Prerequisites

  1. A lakeFS server (you can deploy one independently or use the hosted solution, lakeFS Cloud). 
  2. A Databricks workspace with the ability to run compute clusters. 
  3. If you are using Databricks on AWS: AWS credentials with access to the S3 bucket, for the lakeFS Hadoop file system configuration. If you are using Azure Databricks, you will instead configure the lakeFS Hadoop file system in presigned mode.
  4. Permissions to manage the cluster configuration, including adding libraries.

Step 1 – Acquire lakeFS Key and Secret

In this step, we will acquire the lakeFS key and secret that will be entered into Databricks in the following steps. If you already have an access key and secret, you can skip this section.

Note: To create a new access key, you need either the AuthManageOwnCredentials policy or the AuthFullAccess policy attached to your user. 

Log in to lakeFS and click Administration -> Create Access Key.

A new key will be generated:

As instructed, copy the secret access key and store it somewhere safe; you will not be able to access it again (though you can always create new ones).
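Rather than hard-coding the key pair in notebooks later on, you can store it in a Databricks secret scope and read it at runtime with dbutils. A minimal sketch; the scope and key names below are assumptions for this example:

# Read the lakeFS credentials from a Databricks secret scope (scope/key names are hypothetical)
lakefs_access_key = dbutils.secrets.get(scope="lakefs", key="access_key")
lakefs_secret_key = dbutils.secrets.get(scope="lakefs", key="secret_key")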

Step 2 – Configure Spark on your cluster to communicate with lakeFS

  1. In Databricks, go to your cluster configuration page.
  2. Click Edit.
  3. Expand Advanced Options.
  4. If you are using Databricks on AWS, then under the Spark tab, add the following configuration, replacing the credentials and endpoint with yours:
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.lakefs.endpoint https://lakefs.example.com/api/v1
spark.hadoop.fs.s3a.access.key AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.s3a.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY


If you are using Azure Databricks, then under the Spark tab, add the following configuration, replacing the credentials and endpoint with yours:

spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.lakefs.endpoint https://lakefs.example.com/api/v1
spark.hadoop.fs.lakefs.access.mode presigned


When using Delta Lake tables, the following may also be required in some versions of Databricks:

spark.hadoop.fs.s3a.bucket.<repo-name>.aws.credentials.provider shaded.databricks.org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.<repo-name>.session.token lakefs

When running Azure Databricks, the following configuration is needed as well:

spark.databricks.delta.logStore.crossCloud.fatal false

(Without this configuration on Azure Databricks, you would get the error message: "Writing to Delta table on AWS from non-AWS is unsafe in terms of providing transactional guarantees.")

For more information, please see the documentation from Databricks.
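After the cluster restarts with these settings, you can confirm from a notebook that the spark.hadoop.* keys made it into the Hadoop configuration that Spark builds. A minimal sketch (property names as configured above):

# Read back the lakeFS-related settings from the Hadoop configuration
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.lakefs.impl"))      # expected: io.lakefs.LakeFSFileSystem
print(hadoop_conf.get("fs.lakefs.endpoint"))  # expected: your lakeFS API endpoint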

Optional – Running lakeFS inside your VPC

When lakeFS runs inside your private network, your Databricks cluster needs to be able to access it. This can be done by setting up VPC peering between the two VPCs (the one where lakeFS runs and the one where Databricks runs). For this to work on Delta Lake tables, you would also have to disable multi-cluster writes with:

spark.databricks.delta.multiClusterWrites.enabled false

Step 3 – Add the lakeFS Hadoop File System 

  1. Find the latest version of the lakeFS Hadoop file system assembly, available here.
    For example, at the time of writing it is 0.1.15.
  2. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  3. Click “Install New”.
  4. Select “Maven”.
  5. Input the Maven coordinates of the client version you would like to use. For example, for version 0.1.15: io.lakefs:hadoop-lakefs-assembly:0.1.15
  6. Click Install.
  7. You should now see a line showing the installation was successful.
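If you want to verify the installation from a notebook before moving on, one option is to check that the lakeFS file system class is loadable on the driver. A sketch that assumes the cluster exposes the py4j gateway (standard, non-restricted access mode):

# Attempt to load the lakeFS Hadoop FileSystem class from the driver JVM;
# this raises an exception if the Maven library is not installed on the cluster
spark._jvm.java.lang.Class.forName("io.lakefs.LakeFSFileSystem")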

Step 4 – Configure the Python Client 

To interact with lakeFS from Python:

  1. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  2. Click “Install New”
  3. Select “PyPI” and insert lakefs==<version> (for example, if the package version you want is 0.4.0, insert lakefs==0.4.0).
    Find the latest version by visiting https://pypi.org/project/lakefs/
  4. You should now see a line showing the installation was successful.
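Once the library finishes installing and the cluster restarts, you can sanity-check the package from a notebook. A minimal sketch using only the Python standard library:

# Confirm the lakefs package is installed and report its version
import importlib.metadata
print(importlib.metadata.version("lakefs"))  # should match the version you installed, e.g. 0.4.0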

Step 5 (Optional) – Adding the lakeFS Metadata Spark Client

Some operations in lakeFS require the Spark client (for example, some ways of exporting data).

  1. Find the latest version, available here.
    For example, at the time of writing it is 0.8.0.

    Note that I’m running Spark 3.1.2. If you are running a different version of Spark, choose the corresponding package for your version.
  2. Click on the latest version (for example, 0.8.0) above, and under “Files” download the jar.
  3. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  4. Click “Install New”.
  5. Drag and drop the downloaded jar.
  6. Click Install.
  7. You should now see a line showing the installation was successful.

Restart your cluster.

Step 6 – Try it out in a notebook

Create a new notebook in Databricks and attach it to the cluster configured above. 

Alternatively, import this example notebook if you are using Databricks on AWS or import this example notebook if you are using Azure Databricks.

  1. Setting up the lakeFS endpoint, access key, and secret key:
# Set up the lakeFS endpoint, access key, and secret key, to be used later when configuring the Python client
lakefsEndPoint = 'https://YourEndPoint/' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAlakeFSAccessKey'
lakefsSecretKey = 'lakeFSSecretKey'
  2. Configuring the Python client:
import lakefs
from lakefs.client import Client

clt = Client(
   host=lakefsEndPoint,
   username=lakefsAccessKey,
   password=lakefsSecretKey,
)
  3. Configuring some variables (to be used in the next commands):
repo = "learn-lakefs-repo01" # the name of the lakeFS repository
storageNamespace = 's3://bucket-name/' + repo
sourceBranch = "main"
dataPath = "product-reviews"

Note: follow the Azure sample notebook to use Blob Storage instead of S3.
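For reference, when the repository’s storage is Azure Blob Storage, the storage namespace typically follows the https://<storage-account>.blob.core.windows.net/<container>/... pattern. A sketch with placeholder account and container names:

# Hypothetical Azure Blob Storage namespace (replace the storage account and container with yours)
storageNamespace = 'https://yourstorageaccount.blob.core.windows.net/your-container/' + repo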

  4. Creating a lakeFS repository with the Python client:

repository = lakefs.Repository(
    repo,
    client=clt).create(
        storage_namespace=storageNamespace,
        default_branch=sourceBranch,
        exist_ok=True)
branchMain = repository.branch(sourceBranch)

  5. Writing data via the lakeFS Hadoop file system client to the new repository:

import_data_path = "/databricks-datasets/amazon/test4K/"
df = spark.read.parquet(import_data_path)
df.write.format("parquet").save("lakefs://{}/{}/{}".format(repo,sourceBranch,dataPath))
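Before committing, you can optionally inspect what is staged on the branch with the Python client. A minimal sketch, assuming the lakefs SDK’s uncommitted-changes listing on a branch:

# Optional: list the uncommitted changes on the branch (attribute names may differ slightly across lakefs SDK versions)
for change in branchMain.uncommitted():
    print(change.path)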

  6. Committing the data with the Python client:

branchMain.commit(message='Uploading initial data into lakeFS',
                  metadata={'using': 'python_api'})
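To confirm the commit landed, you can walk the branch’s commit log. A short sketch, assuming the lakefs SDK’s log() generator on a branch reference:

# Print the most recent commits on the branch (max_amount limits how many are fetched)
for commit in branchMain.log(max_amount=5):
    print(commit.id, commit.message)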

  7. Reading data via the lakeFS Hadoop file system client:

# Note - this example uses static strings instead of parameters for an easier read
df = spark.read.parquet("lakefs://learn-lakefs-repo01/main/product-reviews/")
df.show()
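Because the reference segment of a lakefs:// path can be a branch, a tag, or a commit ID, you can also pin a read to a specific commit. A small sketch; the commit ID below is a placeholder you would take from your own commit log:

# Read the data exactly as it was at a given commit (placeholder ID shown)
commit_id = "abc123"  # replace with a real commit ID from your repository
df_at_commit = spark.read.parquet("lakefs://{}/{}/{}".format(repo, commit_id, dataPath))
df_at_commit.show()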

Summary

We have now successfully configured a Databricks cluster that can read and write data in lakeFS using the lakeFS Hadoop File System client and execute Git actions via Python.

If you’d like a more personalized walkthrough to learn more about lakeFS for Databricks, book a time here.
