Iddo Avneri
January 10, 2023

Introduction

This tutorial reviews all the steps needed to configure lakeFS on Databricks.

It assumes that lakeFS is already set up and running against your storage (in this example, AWS S3), and focuses on setting up the integration between Databricks and lakeFS.

Prerequisites

  1. A lakeFS server (you can deploy one independently or use the hosted solution, lakeFS Cloud).
  2. lakeFS credentials (key and secret), or the permissions to create those credentials.
  3. AWS credentials with access to the S3 bucket, for the lakeFS Hadoop file system configuration.
  4. A Databricks workspace with the ability to run compute clusters.
  5. Permissions to manage the cluster configuration, including adding libraries.

Step 1 – Acquire lakeFS Key and Secret

In this step, we will acquire the lakeFS key and secret that will be entered into Databricks in the following steps. If you already have a key and secret, you can skip this section.

Note: To create a new access key, you need either the AuthManageOwnCredentials policy or the AuthFullAccess policy attached to your user.

Log in to lakeFS and click Administration -> Create Access Key.

A new key will be generated:

As instructed, copy the Secret Access Key and store it somewhere safe; you will not be able to access it again (though you can always create new ones).


Step 2 – Configure Spark on your cluster to communicate with lakeFS

  1. In Databricks, go to your cluster configuration page.
  2. Click Edit.
  3. Expand Advanced Options.
  4. Under the Spark tab, add the following configuration, replacing the credentials and endpoint with yours.
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.lakefs.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.access.key AKIAIOSFODNN7EXAMPLE
spark.hadoop.fs.s3a.secret.key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.lakefs.endpoint https://lakefs.example.com/api/v1
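If you manage several clusters, it can help to generate these settings from one place. Below is a small hypothetical helper (not part of lakeFS or Databricks) that builds the same key/value pairs as a Python dict; the credential values are the placeholder examples from above:

```python
# Hypothetical helper: builds the Spark configuration entries shown above
# as a dict, e.g. for templating cluster configurations.
# The keys are the lakeFS/S3A Hadoop settings; the values below are placeholders.
def lakefs_spark_conf(lakefs_key, lakefs_secret, s3_key, s3_secret, endpoint):
    return {
        "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
        "spark.hadoop.fs.lakefs.access.key": lakefs_key,
        "spark.hadoop.fs.lakefs.secret.key": lakefs_secret,
        "spark.hadoop.fs.s3a.access.key": s3_key,
        "spark.hadoop.fs.s3a.secret.key": s3_secret,
        "spark.hadoop.fs.lakefs.endpoint": endpoint,
    }

conf = lakefs_spark_conf(
    "AKIAlakefs12345EXAMPLE",
    "abc/lakefs/1234567bPxRfiCYEXAMPLEKEY",
    "AKIAIOSFODNN7EXAMPLE",
    "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "https://lakefs.example.com/api/v1",
)
# Print in the "key value" form expected by the cluster's Spark config box
print("\n".join(f"{k} {v}" for k, v in conf.items()))
```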

When using Delta Lake tables, the following may also be required in some versions of Databricks:

spark.hadoop.fs.s3a.bucket.<repo-name>.aws.credentials.provider shaded.databricks.org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.<repo-name>.session.token lakefs

When running Azure Databricks, the following configuration is needed as well:

spark.databricks.delta.logStore.crossCloud.fatal false

(If you don’t use this configuration on Azure Databricks, you will get the error message: “Writing to Delta table on AWS from non-AWS is unsafe in terms of providing transactional guarantees”.)

Note that on Azure Databricks, you need to use the S3A gateway as opposed to the lakeFS Hadoop file system client.

For more information, please see the documentation from Databricks.
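For reference, pointing S3A at the lakeFS S3 gateway generally looks like the following (a sketch based on the lakeFS S3-gateway configuration; replace the keys and endpoint with your own lakeFS values):

```
spark.hadoop.fs.s3a.access.key AKIAlakefs12345EXAMPLE
spark.hadoop.fs.s3a.secret.key abc/lakefs/1234567bPxRfiCYEXAMPLEKEY
spark.hadoop.fs.s3a.endpoint https://lakefs.example.com
spark.hadoop.fs.s3a.path.style.access true
```

With this configuration, paths take the form s3a://&lt;repo&gt;/&lt;branch&gt;/&lt;path&gt; instead of lakefs:// URIs.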

Optional – Running lakeFS inside your VPC

When lakeFS runs inside your private network, your Databricks cluster needs to be able to access it. This can be done by setting up VPC peering between the two VPCs (the one where lakeFS runs and the one where Databricks runs). For this to work with Delta Lake tables, you also have to disable multi-cluster writes with:

spark.databricks.delta.multiClusterWrites.enabled false

Step 3 – Add the lakeFS Hadoop File System 

  1. Find out the latest version available here.
    For example, at the time of writing it is 0.1.9.
  2. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  3. Click “Install New”
  4. Select “Maven”
  5. Input the Maven coordinates for the client version you would like to use. For example, if the latest version is 0.1.9: io.lakefs:hadoop-lakefs-assembly:0.1.9
  6. Click Install.
  7. You should now see a line showing the installation was successful:
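As an alternative to the UI, the same library can be installed through the Databricks Libraries API (POST /api/2.0/libraries/install). A sketch of the request payload is below; the cluster ID is a placeholder you would replace with your own:

```python
import json

# Request body for the Databricks Libraries API (POST /api/2.0/libraries/install).
# "0123-456789-abcde123" is a placeholder cluster ID.
payload = {
    "cluster_id": "0123-456789-abcde123",
    "libraries": [
        {"maven": {"coordinates": "io.lakefs:hadoop-lakefs-assembly:0.1.9"}}
    ],
}
print(json.dumps(payload, indent=2))
```

You would send this payload with an authenticated POST to your workspace URL, e.g. via curl or the Databricks CLI.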

Step 4 – Configure the Python Client 

To interact with lakeFS from Python, install the lakeFS Python client:

  1. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  2. Click “Install New”
  3. Select “PyPI” and insert lakefs-client==<lakeFS version> (for example, my version of lakeFS is 0.88.0, so I’ll insert: lakefs-client==0.88.0).
    Find the latest version by visiting https://pypi.org/project/lakefs-client/
  4. You should now see a line showing the installation was successful:
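The client version should track your lakeFS server version at the major.minor level, as in the 0.88.0 example above. A small hypothetical check (assuming plain X.Y.Z version strings):

```python
def versions_compatible(server_version: str, client_version: str) -> bool:
    """Return True when server and client share the same major.minor version."""
    return server_version.split(".")[:2] == client_version.split(".")[:2]

print(versions_compatible("0.88.0", "0.88.0"))  # True
print(versions_compatible("0.88.0", "0.87.1"))  # False
```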

Step 5 – Adding the lakeFS Spark Client

Some operations in lakeFS require the Spark client (for example, some of the ways of exporting data).

  1. Find out the latest version available here.
    For example, at the time of writing it is 0.6.0.

    Note that I’m running Spark 3.1.2. If you are running a different version of Spark, choose the package for your version.
  2. Click on the latest version (for example, 0.6.0) above and, under “Files”, download the JAR:
  3. In Databricks, go to your cluster configuration page and choose the “Libraries” tab.
  4. Click “Install New”
  5. Drag and drop the downloaded jar:
  6. Click Install.
  7. You should now see a line showing the installation was successful:

Restart your cluster.

Step 6 – Try it out in a notebook

Create a new notebook in Databricks and attach it to the cluster configured above. 

Alternatively, import this example notebook.

  1. Set up the lakeFS endpoint, access key, and secret:
lakefsEndPoint = 'https://YourEndPoint' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAlakeFSAccessKey'
lakefsSecretKey = 'lakeFSSecretKey'
  2. Configure the Python client:
%xmode Minimal
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

client = LakeFSClient(configuration)
  3. Configure some variables (to be used in the next command):
repo = "learn-lakefs-repo01" # the name of the lakeFS repository
storageNamespace = "s3://bucketwheretherepositorysits" # Should be unique
sourceBranch = "main"
dataPath = "product-reviews"
  4. Create a lakeFS repository with the Python client:
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=repo,
        storage_namespace=storageNamespace,
        default_branch=sourceBranch))
  5. Write data via the lakeFS Hadoop file system client to the new repository:
import_data_path = "/databricks-datasets/amazon/test4K/"
df = spark.read.parquet(import_data_path)
df.write.format("parquet").save("lakefs://{}/{}/{}".format(repo,sourceBranch,dataPath))
  6. Commit the data with the Python client:
client.commits.commit(
    repository=repo,
    branch=sourceBranch,
    commit_creation=models.CommitCreation(
        message='Uploading initial data into lakeFS',
        metadata={'using': 'python_api'}))
  7. Read data via the lakeFS Hadoop file system client:
# Note - This example uses static strings instead of parameters for an easier read
df = spark.read.parquet("lakefs://learn-lakefs-repo01/main/product-reviews/")

df.show()
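The lakefs://{repo}/{branch}/{path} pattern above recurs constantly in notebooks; a small hypothetical helper (not part of the lakeFS client) keeps it in one place:

```python
def lakefs_uri(repo: str, ref: str, path: str) -> str:
    """Build a lakefs:// URI for the lakeFS Hadoop file system client.

    `ref` may be a branch name or a commit ID.
    """
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"

# e.g. spark.read.parquet(lakefs_uri("learn-lakefs-repo01", "main", "product-reviews/"))
print(lakefs_uri("learn-lakefs-repo01", "main", "product-reviews/"))
# lakefs://learn-lakefs-repo01/main/product-reviews/
```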

Summary 

We have configured a Databricks cluster to read data from lakeFS using the lakeFS Hadoop file system client and to execute Git-like actions via the Python client.
