Barak Amar
October 27, 2022

Overview

Our routine work with data includes developing code, choosing and upgrading compute infrastructure, and testing new and changed data pipelines. Usually, this requires running our tested pipelines in parallel to production, in order to test the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, and creating alternative paths. Such workflows are time-consuming and error-prone, and may end up damaging production or otherwise polluting the data lake.

lakeFS enables a safe and automated development environment for data without the need to copy or mock data, work on production pipelines, and without a heavy investment of DevOps efforts. In this article, we will demonstrate how to build a development and testing environment for data, working with lakeFS and using Spark. We will start by creating a repository and building a small Spark application while using lakeFS’ capabilities such as easily committing or reverting changes to data.

Why do I need multiple environments?

When working with a data lake, it’s useful to have replicas of your production environment. These replicas allow you to test and understand changes to your data without impacting consumers of the production data.

Running ETL jobs directly in production is a guaranteed way to have data issues flow into dashboards, ML models, and other consumers sooner or later. The most common approach to avoid making changes directly in production is to create and maintain a second data environment (dev/test environment) where updates are implemented first.

The issues with this approach are that it’s time-consuming and costly to maintain this separate test environment. And, for larger teams, it forces multiple people to share a single environment, requiring significant coordination.

How do I build an isolated testing environment for data with lakeFS?

lakeFS makes creating environments to test data in isolation instantaneous. This frees you from spending time on environment maintenance and makes it possible to create as many environments as needed.

In a lakeFS repository, data is always located on a branch. You can think of each branch in lakeFS as its own environment. This is because branches are isolated, meaning changes on one branch have no effect on other branches.

Objects that remain unchanged between two branches are not copied, but rather shared to both branches via metadata pointers (shallow copy) that lakeFS manages. If you make a change on one branch and want it reflected on another, you can perform a merge operation to update one branch with the changes from another.

Let’s see an example of using multiple lakeFS branches for isolation.

Using branches as isolated environments

The key difference when using lakeFS for isolated data environments is that you can create them immediately before testing a change. And once new data is merged into production, you can delete the branch – effectively deleting the old environment.

This is different from creating a long-living test environment used as a staging area to test all the updates. With lakeFS, we create a new branch for each change to production that we want to make. One benefit of this is the ability to test multiple changes at one time.

Build a testing environment for data – let’s see how it’s done

We will start by creating a repository and building a small Spark application while using lakeFS’ capabilities, such as easily committing or reverting changes to data.

Prerequisites

  • Installation of lakeFS Server – You can either spin up a server on the playground, or run one locally
  • lakectl – The official lakeFS CLI. Downloaded from our Releases page

Helpful Tip – Watch how to run lakeFS locally against you your storage in less than 3 minutes

Create Repository

First let’s create a new repository, called `example-repo` that will hold our data – the following using S3 bucket `example-bucket` as the underlying storage.

$ lakectl repo create lakefs://example-repo s3://example-bucket/example-repo -d main
Repository: lakefs://example-repo
Repository 'example-repo' created:
storage namespace: s3://example-bucket/example-repo
default branch: main
timestamp: 1660458426

Upload data and commit it

Using the following command we can download the data and upload it into the repository.

# Download data file - credit to Gutenberg EBook of Alice’s Adventures in Wonderland
# Or you can use any text file you like
$ curl -o alice.txt https://www.gutenberg.org/files/11/11-0.txt

# Upload to our repository
$ lakectl fs upload -s alice.txt lakefs://example-repo/main/alice.txt
Path: alice.txt
Modified Time: 2022-08-14 09:35:16 +0300 IDT
Size: 174313 bytes
Human Size: 174.3 kB
Physical Address: s3://barak-bucket1/example-repo/ec827db52c044d3881cdf426b99819f3
Checksum: 7e56f48cb7671ceba58e48bf232a379c
Content-Type: application/octet-stream

This is the starting point of our development, we want the data to be committed to a point in time that we can always rollback, reference, or branch from.

$ lakectl commit -m "alice in wonderland" lakefs://example-repo/main
Branch: lakefs://example-repo/main
Commit for branch "main" completed.

ID: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Message: alice in wonderland
Timestamp: 2022-08-14 09:36:08 +0300 IDT
Parents: fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc

A recommended methodology would be to update the development environment’s main branch periodically, to keep the development environment on par with production. Since branches are isolated, you can choose to update your branch, and newly updated branches can be created from the updated main branch.

Develop a Spark application on an isolated branch

Capturing the state of our repository while working on our application starts with a new branch.

$ lakectl branch create lakefs://example-repo/word-count1 -s lakefs://example-repo/main
Source ref: lakefs://example-repo/main
created branch 'word-count1' bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620

Creating a branch provides you an isolated environment with a snapshot of your repository. It is a very lightweight operation that will not copy any data and still give us an isolated space that is guaranteed not to change.

We can use lakectl to list the content of our development branch, and see the same data as in our main branch:

$ lakectl fs ls lakefs://example-repo/word-count1/
object          2022-08-14 09:35:16 +0300 IDT    174.3 kB        alice.txt

$ llakectl fs ls lakefs://example-repo/main/       
object          2022-08-14 09:35:16 +0300 IDT    174.3 kB        alice.txt

Let’s run a simple word count using Spark to produce a report based on our story. Starting with setting up the lakeFS Spark integration:

sc.hadoopConfiguration.set("fs.s3a.access.key",<lakeFS access key ID>)
sc.hadoopConfiguration.set("fs.s3a.secret.key",<lakeFS secret access key>)
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.local.lakefs.io:8000")

val branch = "s3a://example-repo/word-count1/"
val textFile = sc.textFile(branch + "alice.txt")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile(branch + "wc.report")

Access to our repository is done through the S3 interface, by setting the s3a endpoint and credentials based on our lakeFS installation.

Running the above will produce our word count report. We can use lakectl diff to see it:

$ lakectl diff lakefs://example-repo/word-count1
Ref: lakefs://example-repo/word-count1
+ added wc.report/_SUCCESS
+ added wc.report/part-00000
+ added wc.report/part-00001

Any issue we have with the generated data, we can always discard all uncommitted changes by reset the branch:

$ lakectl branch reset lakefs://example-repo/word-count1 --yes 
Branch: lakefs://example-repo/word-count1

Rollback or commit changes

Committing the changes in data is the natural next step, whenever the testing is completed. 

You can include additional metadata fields in your lakeFS commit. For instance, by specifying its git commit hash, we can reference the code that was used to build the data.

$ lakectl commit lakefs://example-repo/word-count1 -m "word count" --meta git_commit_hash=cc313c5
Branch: lakefs://example-repo/word-count1
Commit for branch "word-count1" completed.

ID: 068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df
Message: word count
Timestamp: 2022-08-14 11:36:16 +0300 IDT
Parents: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620

Using the `meta` flag we can store multiple metadata key/value pairs to help label our commit. Later we can check the log and use the referenced data.

$ lakectl log lakefs://example-repo/word-count1

ID:            068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df
Author:        barak
Date:          2022-08-14 11:36:16 +0300 IDT

	word count
	
	
		git_commit_hash = cc313c5
	
ID:            bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Author:        barak
Date:          2022-08-14 09:36:08 +0300 IDT

	alice in wonderland
	
	
ID:            fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc
Date:          2022-08-14 09:27:06 +0300 IDT

	Repository created

Any issue we have with the generated data, we can always discard all uncommitted changes by reset the branch:

$ lake ctl branch reset lakefs://example-repo/word-count1 --yes 
Branch: lakefs://example-repo/word-count1

Rollback or commit changes

Committing the changes in data is the natural next step, whenever the testing is completed. 

You can include additional metadata fields in your lakeFS commit. For instance, by specifying its git commit hash, we can reference the code that was used to build the data.

$ lakectl commit lakefs://example-repo/word-count1 -m "word count" --meta git_commit_hash=cc313c5
Branch: lakefs://example-repo/word-count1
Commit for branch "word-count1" completed.

 
ID: 068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df
Message: word count
Timestamp: 2022-08-14 11:36:16 +0300 IDT
Parents: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5

Using the `meta` flag we can store multiple metadata key/value pairs to help label our commit. Later we can check the log and use the referenced data.

d9fb61839a620
$ lakectl log lakefs://example-repo/word-count1
 
ID:            068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df

Author:        barak
Date:          2022-08-14 11:36:16 +0300 IDT
 
	word count
	
	
		git_commit_hash = cc313c5
	
ID:            bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Author:        barak
Date:          2022-08-14 09:36:08 +0300 IDT
 
	alice in wonderland
	
	
ID:            fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc
Date:          2022-08-14 09:27:06 +0300 IDT
 
	Repository created
 
 
The commit ID, for example 068ddaf6b4919e5b, can be used for addressing the repository at this point by our CLI
$ lakectl fs cat lakefs://example-repo/068ddaf6b4919e5b/wc.report/part-00000
(“Found,2)
(someone,1)
(plan.”,1)
(bone,1)
(roses.,1)
...

Or by our application as s3a address

s3a://example-repo/068ddaf6b4919e5b/wc.report

Merge and iterate on the development workflow

Once both code and data are committed, they can be reviewed together before deciding to merge our new data into the main branch.

$ lakectl merge lakefs://example-repo/word-count1 lakefs://example-repo/main
Source: lakefs://example-repo/word-count1
Destination: lakefs://example-repo/main
Merged "word-count1" into "main" to get "50d593e52cba687f2af7894189b381ebafdf7495f9e65a70ba2fc67e4e8c0677".

Merge will generate a commit on the target branch with all the changes we made.

Committing is a fast and atomic metadata operation – no data is copied during the process. Our data is stored only once. If we merge changes to multiple objects, we are guaranteed that all of them will show up on our destination branch at the same time.

Optionally, we can delete the feature branch once it is no longer required

$ lakectl branch delete lakefs://example-repo/word-count1 --yes
Branch: lakefs://example-repo/word-count1

Summary

Thank you for reading, and I hope you now feel empowered to build an isolated development and testing environment for your data lake. Those are all suggestions – what matters is the concept and methodology. 

Check out our docs for more information on how to get started with lakeFS, and join our Slack community to ask questions!

LakeFS

  • Get Started
    Get Started
  • Git for Data - What, How and Why Now?

    Read the article
    +