Barak Amar
October 27, 2020

Overview

As part of our routine work with data, we develop code, choose and upgrade compute infrastructure, and test new data. Usually this requires running parts of our production pipelines in parallel to production in order to test the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, and creating alternative paths. Such workflows are error-prone and may end up damaging production or otherwise polluting the data lake.


lakeFS enables a safe and automated development environment on your data lake, without the need to copy or mock data, work directly on production pipelines, or involve DevOps. In this article, we will show you how to create such a development environment using lakeFS and Spark. We will start by creating a repository, then build a small Spark application while using lakeFS capabilities such as committing and reverting changes to data.

Prerequisites

  • Installation of lakeFS – setting up a local lakeFS server is a one-liner. Head over to our quickstart to get one up and running.
  • lakectl – the official lakeFS CLI. Download it from our Releases page.

Agenda

  1. Create repository
  2. Import data and commit it
  3. Develop a Spark application on an isolated branch
  4. Rollback or commit changes
  5. Merge and iterate on the development workflow

Create Repository

First, let's create a new repository called `example-repo` to hold our data. The following command uses the S3 bucket `example-bucket` as the underlying storage.

$ lakectl repo create lakefs://example-repo s3://example-bucket -d main

Repository 'example-repo' created:
storage namespace: s3://example-bucket
default branch: main
timestamp: 1603797890

Import data and commit it

Using the following commands, we can download the data file and upload it to the repository.

# Download data file - credit to Gutenberg EBook of Alice’s Adventures in Wonderland
# Or you can use any text file you like
$ curl -o alice.txt https://www.gutenberg.org/files/11/11-0.txt


# Upload to our repository
$ lakectl fs upload -s alice.txt lakefs://example-repo@main/alice.txt

Path: alice.txt
Modified Time: 2020-10-27 13:25:35 +0200 IST
Size: 174693 bytes
Human Size: 174.7 kB
Checksum: 0df1d0026f6465334ac70b9a164fc726

This is the starting point of our development: we want the data committed, so there is a point in time we can always roll back to, reference, or branch from.

$ lakectl commit -m "alice in wonderland" lakefs://example-repo@main

Commit for branch "main" completed.

ID: ~wUANJrNv
Message: alice in wonderland
Timestamp: 2020-10-27 13:25:41 +0200 IST
Parents: ~wUANJrNu

A recommended methodology is to periodically update the main branch of the development environment, keeping it on par with production. Since branches are isolated, each developer can choose whether to update their existing branch, and new branches can always be created from the updated main branch.
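For example, assuming a long-lived development branch named `my-feature` (a hypothetical name for illustration), we could either merge the updated main branch into it or start a fresh branch from main:

# Merge the updated main branch into an existing development branch (hypothetical branch name)
$ lakectl merge lakefs://example-repo@main lakefs://example-repo@my-feature

# Or create a fresh branch from the updated main branch
$ lakectl branch create lakefs://example-repo@my-feature-v2 -s lakefs://example-repo@main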

[Screenshot: Creating a repository on lakeFS]

Develop a Spark application on an isolated branch

To capture the state of our repository while working on our application, we start by creating a new branch.

$ lakectl branch create lakefs://example-repo@word-count1 -s lakefs://example-repo@main

created branch 'word-count1'

Creating a branch provides you with an isolated environment holding a snapshot of your repository. It is a very lightweight operation that does not copy any data, yet it still gives us an isolated space that is guaranteed not to change.

We can use lakectl to list the content of our development branch, and see the same data as in our main branch:

$ lakectl fs ls lakefs://example-repo@word-count1
object    2020-10-25 16:55:33 +0200 IST    174.7 kB        alice.txt

$ lakectl fs ls lakefs://example-repo@main
object    2020-10-25 16:55:33 +0200 IST    174.7 kB        alice.txt

Let's run a simple word count using Spark to produce a report based on our story:

// Configure the S3A filesystem to use the lakeFS S3 gateway
sc.hadoopConfiguration.set("fs.s3a.access.key", "<lakeFS access key ID>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<lakeFS secret access key>")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.local.lakefs.io:8000")

// The repository is addressed as a bucket, and the branch as the first path element
val branch = "s3a://example-repo/word-count1/"
val textFile = sc.textFile(branch + "alice.txt")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile(branch + "wc.report")

Access to our repository goes through lakeFS's S3-compatible interface: we set the s3a endpoint and credentials to point at our lakeFS installation.

Running the above will produce our word count report. We can use lakectl diff to see it:

$ lakectl diff lakefs://example-repo@word-count1
+ added wc.report/_SUCCESS
+ added wc.report/part-00000
+ added wc.report/part-00001
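Before committing, we can also peek at the generated report directly from the branch; the exact part-file names may vary between runs:

# Inspect one of the generated part files (name may differ in your run)
$ lakectl fs cat lakefs://example-repo@word-count1/wc.report/part-00000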

If we run into any issue with the generated data, we can always discard all uncommitted changes by reverting them:

$ lakectl branch revert lakefs://example-repo@word-count1

When our feature is complete, we can commit our data to lakeFS.

You can include additional metadata fields in your lakeFS commit. For instance, by specifying the Git commit hash of the code, we can reference the exact code that was used to build the data.

$ lakectl commit lakefs://example-repo@word-count1 -m "word count" --meta git_commit_hash=cc313c5
Commit for branch "word-count1" completed.

ID: ~Awz8bEu5ysXCFCbdak
Message: word count
Timestamp: 2020-10-27 13:40:41 +0200 IST
Parents: ~Awz8bEu5ysXCFCbdaj

Using the `--meta` flag we can store multiple metadata key/value pairs to help label our commit.
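For instance, a single commit could carry both the Git hash and the Spark version used to produce the data (hypothetical keys, assuming the flag is repeated once per key=value pair):

# Attach several metadata pairs to one commit (hypothetical example keys)
$ lakectl commit lakefs://example-repo@word-count1 -m "word count" \
    --meta git_commit_hash=cc313c5 --meta spark_version=2.4.7

Later we can check the log and use the referenced data.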

$ lakectl log lakefs://example-repo@word-count1
ID: ~Awz8bEu5ysXCFCbdak
Author: barak.amar
Date: 2020-10-27 13:40:41 +0200 IST

        word count


                git_commit_hash = cc313c5

ID: ~Awz8bEu5ysXCFCbdaj
Date: 2020-10-27 13:36:42 +0200 IST

        Branch 'word-count1' created, source branch 'main'


ID: ~wUANJrNv
Author: barak.amar
Date: 2020-10-27 13:25:41 +0200 IST

        alice in wonderland
...

The commit ID, for example ~Awz8bEu5ysXCFCbdak, can be used to address the repository at this point in time from our CLI:

$ lakectl fs cat lakefs://example-repo@~Awz8bEu5ysXCFCbdak/wc.report/part-00000
...

Or by our application, as an s3a address:

"s3a://example-repo/~Awz8bEu5ysXCFCbdak/wc.report"

Merge and iterate on the development workflow

Once both code and data are committed, they can be reviewed together before deciding to merge our new data into the main branch.
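To review the data changes, we can first diff the development branch against main (assuming your lakectl version supports comparing two references):

# Compare main with the development branch before merging
$ lakectl diff lakefs://example-repo@main lakefs://example-repo@word-count1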

$ lakectl merge lakefs://example-repo@word-count1 lakefs://example-repo@main
new: 3 modified: 0 removed: 0

Merge will generate a commit on the target branch with all the changes we made.

Merging is a fast and atomic metadata operation – no data is copied during the process, and our data is stored only once. If we merge changes to multiple objects, we are guaranteed that all of them will show up on the destination branch at the same time.
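We can verify that the report is now visible on the main branch, for example by listing it:

# The merged report objects should now appear on main
$ lakectl fs ls lakefs://example-repo@main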

Optionally, we can delete the feature branch once it is no longer required:

$ lakectl branch delete lakefs://example-repo@word-count1

Summary

I hope this post was helpful and that you now feel empowered to go build your own development environment on top of your data lake. These are all suggestions – the concept and methodology are what's important.

Check out our docs for more information on how to get started with lakeFS, and feel free to join our Slack community to ask questions!
