As part of our routine work with data we develop code, choose and upgrade compute infrastructure, and test new data. Usually, this requires running parts of our production pipelines in parallel to production, testing the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, and creating alternative paths. Such workflows are error-prone, and may end up damaging production or otherwise polluting the data lake.
lakeFS enables a safe and automated development environment on your data lake without the need to copy or mock data, work on production pipelines, or involve DevOps. In this article, we will show you how to create a development environment working with lakeFS using Spark. We will start by creating a repository and building a small Spark application while using lakeFS’s capabilities such as easily committing or reverting changes to data.
- Installation of lakeFS – setting up a local lakeFS server is a one-liner. Head to our quickstart to get one up and running.
- lakectl – the official lakeFS CLI. Download it from our Releases page.
- Create repository
- Import data and commit it
- Develop a Spark application on an isolated branch
- Rollback or commit changes
- Merge and iterate on the development workflow
Create repository
First, let’s create a new repository called `example-repo` to hold our data. The following command uses the S3 bucket `example-bucket` as the underlying storage.

    $ lakectl repo create lakefs://example-repo s3://example-bucket -d main
    Repository 'example-repo' created:
    storage namespace: s3://example-bucket
    default branch: main
    timestamp: 1603797890

Import data and commit it
The following commands download the data file and upload it to the repository.

    # Download data file - credit to Gutenberg EBook of Alice’s Adventures in Wonderland
    # Or you can use any text file you like
    $ curl -o alice.txt https://www.gutenberg.org/files/11/11-0.txt

    # Upload to our repository
    $ lakectl fs upload -s alice.txt lakefs://example-repo@main/alice.txt
    Path: alice.txt
    Modified Time: 2020-10-27 13:25:35 +0200 IST
    Size: 174693 bytes
    Human Size: 174.7 kB
    Checksum: 0df1d0026f6465334ac70b9a164fc726

This is the starting point of our development. We want the data committed at a point in time that we can always roll back to, reference, or branch from.

    $ lakectl commit -m "alice in wonderland" lakefs://example-repo@main
    Commit for branch "main" completed.
    ID: ~wUANJrNv
    Message: alice in wonderland
    Timestamp: 2020-10-27 13:25:41 +0200 IST
    Parents: ~wUANJrNu

A recommended practice is to update the main branch of the development environment periodically, to keep it on par with production. Since branches are isolated, each developer can choose whether to update an existing branch, and new branches can be created from the updated main branch.
Develop a Spark application on an isolated branch
Capturing the state of our repository while working on our application starts with a new branch.

    $ lakectl branch create lakefs://example-repo@word-count1 -s lakefs://example-repo@main
    created branch 'word-count1'

Creating a branch gives us an isolated environment with a snapshot of our repository. It is a very lightweight operation that copies no data, yet still gives us an isolated space that is guaranteed not to change.
We can use lakectl to list the content of our development branch, and see the same data as in our main branch:

    $ lakectl fs ls lakefs://example-repo@word-count1
    object 2020-10-25 16:55:33 +0200 IST 174.7 kB alice.txt
    $ lakectl fs ls lakefs://example-repo@main
    object 2020-10-25 16:55:33 +0200 IST 174.7 kB alice.txt

Let’s run a simple word count using Spark to produce a report based on our story:

    // Point the S3A connector at the lakeFS endpoint
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<lakeFS access key ID>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<lakeFS secret access key>")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.local.lakefs.io:8000")

    // Paths take the form s3a://<repository>/<branch>/<object>
    val branch = "s3a://example-repo/word-count1/"
    val textFile = sc.textFile(branch + "alice.txt")

    // Count the occurrences of each word
    val counts = textFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile(branch + "wc.report")

Spark accesses our repository through the S3 interface of lakeFS: we set the s3a endpoint and credentials to those of our lakeFS installation.
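The same three settings can also be supplied when launching Spark, since Spark copies any `spark.hadoop.*` property into the Hadoop configuration. The sketch below only prints a `spark-shell` command line rather than running it. The helper name `s3a_flags` and the environment variable names are assumptions made for illustration, not something lakeFS defines.

```shell
# Sketch: build spark-shell --conf flags for the S3A settings used above.
# s3a_flags, LAKEFS_ACCESS_KEY_ID and LAKEFS_SECRET_ACCESS_KEY are hypothetical names.

s3a_flags() {
  local endpoint="$1" access_key="$2" secret_key="$3"
  # Spark forwards every spark.hadoop.* property to the Hadoop configuration.
  echo "--conf spark.hadoop.fs.s3a.endpoint=${endpoint}" \
       "--conf spark.hadoop.fs.s3a.access.key=${access_key}" \
       "--conf spark.hadoop.fs.s3a.secret.key=${secret_key}"
}

# Print the command instead of running it, so the sketch works without Spark installed.
echo "spark-shell $(s3a_flags \
  "http://s3.local.lakefs.io:8000" \
  "${LAKEFS_ACCESS_KEY_ID:-ACCESS_KEY_ID}" \
  "${LAKEFS_SECRET_ACCESS_KEY:-SECRET_ACCESS_KEY}")"
```

Passing the settings as flags keeps credentials out of the application code, at the cost of exposing them on the command line, so a configuration file may be preferable outside a local setup.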
Running the Spark job will produce our word count report. We can use `lakectl diff` to see it:

    $ lakectl diff lakefs://example-repo@word-count1
    + added wc.report/_SUCCESS
    + added wc.report/part-00000
    + added wc.report/part-00001

If we find any issue with the generated data, we can always discard all uncommitted changes by reverting them:
$ lakectl branch revert lakefs://example-repo@word-count1
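The choice between reverting and committing can be scripted. The sketch below is one possible check, not a lakeFS feature: it looks for the `_SUCCESS` marker that Spark's output committer writes when a job finishes cleanly, and prints the `lakectl` command to run next. The helper name `next_command` is hypothetical, and the diff text is pasted in as a literal string so the script runs without a lakeFS server.

```shell
# Sketch: choose between committing and reverting based on `lakectl diff` output.
# next_command is a hypothetical helper; it only prints the command to run.

next_command() {
  local branch_uri="$1" diff_output="$2"
  # A _SUCCESS marker in the diff means the Spark job completed its write.
  if printf '%s\n' "$diff_output" | grep -q 'wc\.report/_SUCCESS'; then
    echo "lakectl commit ${branch_uri} -m \"word count\""
  else
    echo "lakectl branch revert ${branch_uri}"
  fi
}

# Diff output copied from the run above.
sample_diff='+ added wc.report/_SUCCESS
+ added wc.report/part-00000
+ added wc.report/part-00001'

next_command "lakefs://example-repo@word-count1" "$sample_diff"
# prints: lakectl commit lakefs://example-repo@word-count1 -m "word count"
```

A real check would likely inspect the report contents as well. The marker only shows that the write finished, not that the numbers are right.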
When our feature is complete, we can commit our data to lakeFS.
You can include additional metadata fields in your lakeFS commit. For instance, by recording the git commit hash of our code, we can reference the code that was used to build the data.

    $ lakectl commit lakefs://example-repo@word-count1 -m "word count" --meta git_commit_hash=cc313c5
    Commit for branch "word-count1" completed.
    ID: ~Awz8bEu5ysXCFCbdak
    Message: word count
    Timestamp: 2020-10-27 13:40:41 +0200 IST
    Parents: ~Awz8bEu5ysXCFCbdaj

Using the `--meta` flag we can store multiple metadata key/value pairs to help label our commit. Later we can check the log and use the referenced data.

    $ lakectl log lakefs://example-repo@word-count1
    ID: ~Awz8bEu5ysXCFCbdak
    Author: barak.amar
    Date: 2020-10-27 13:40:41 +0200 IST
    word count
    git_commit_hash = cc313c5

    ID: ~Awz8bEu5ysXCFCbdaj
    Date: 2020-10-27 13:36:42 +0200 IST
    Branch 'word-count1' created, source branch 'main'

    ID: ~wUANJrNv
    Author: barak.amar
    Date: 2020-10-27 13:25:41 +0200 IST
    alice in wonderland

    ...

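As a small illustration of labeling commits, the sketch below assembles the commit command from a list of key/value pairs. It assumes that `--meta` may be repeated once per pair, which is worth confirming against your lakectl version. The helper name `commit_command` is hypothetical, and the script prints the command rather than running it.

```shell
# Sketch: assemble a lakectl commit command with one --meta flag per key=value pair.
# commit_command is a hypothetical helper; repeating --meta is an assumption.

commit_command() {
  local branch_uri="$1" message="$2"
  shift 2
  local cmd="lakectl commit ${branch_uri} -m \"${message}\""
  local pair
  for pair in "$@"; do
    cmd="${cmd} --meta ${pair}"
  done
  echo "$cmd"
}

# Inside a git checkout, the hash could come from: git rev-parse --short HEAD
commit_command "lakefs://example-repo@word-count1" "word count" "git_commit_hash=cc313c5"
# prints: lakectl commit lakefs://example-repo@word-count1 -m "word count" --meta git_commit_hash=cc313c5
```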
The commit ID, for example ~Awz8bEu5ysXCFCbdak, can be used to address the repository at that point in time, either by our application as an s3a address or by our CLI:

    $ lakectl fs cat lakefs://example-repo@~Awz8bEu5ysXCFCbdak/wc.report/part-00000
    ...

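For the s3a form, the commands in this post suggest a simple mapping: `lakefs://<repo>@<ref>/<path>` on the CLI corresponds to `s3a://<repo>/<ref>/<path>` in Spark. The sketch below applies that mapping with plain string handling. It assumes that the S3 interface accepts a commit ID wherever it accepts a branch name. The helper name `to_s3a` is hypothetical.

```shell
# Sketch: convert a lakectl-style URI into the s3a path used from Spark.
# to_s3a is a hypothetical helper; commit IDs in the ref position are an assumption.

to_s3a() {
  local uri="$1"
  local rest="${uri#lakefs://}"    # <repo>@<ref>/<path>
  local repo="${rest%%@*}"         # <repo>
  local ref_and_path="${rest#*@}"  # <ref>/<path>
  echo "s3a://${repo}/${ref_and_path}"
}

to_s3a "lakefs://example-repo@word-count1/alice.txt"
# prints: s3a://example-repo/word-count1/alice.txt

to_s3a "lakefs://example-repo@~Awz8bEu5ysXCFCbdak/wc.report/part-00000"
# prints: s3a://example-repo/~Awz8bEu5ysXCFCbdak/wc.report/part-00000
```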
Merge and iterate on the development workflow
Once both code and data are committed, they can be reviewed together before deciding to merge our new data into the main branch.

    $ lakectl merge lakefs://example-repo@word-count1 lakefs://example-repo@main
    new: 3 modified: 0 removed: 0

Merge will generate a commit on the target branch with all the changes we made.
Committing is a fast and atomic metadata operation – no data is copied during the process. Our data is stored only once. If we merge changes to multiple objects, we are guaranteed that all of them will show up on our destination branch at the same time.
Optionally, we can delete the feature branch once it is no longer required:
$ lakectl branch delete lakefs://example-repo@word-count1
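To iterate, the same sequence can be repeated for each new feature branch. The sketch below prints the commands from this walkthrough for a given repository and branch name, as a checklist rather than an automation. The helper name `workflow_commands` and the branch name `word-count2` are hypothetical.

```shell
# Sketch: print the branch -> run -> diff -> commit -> merge -> delete cycle.
# workflow_commands and the word-count2 branch are hypothetical; nothing is executed.

workflow_commands() {
  local repo="$1" feature="$2" base="${3:-main}"
  local src="lakefs://${repo}@${feature}"
  local dst="lakefs://${repo}@${base}"
  cat <<EOF
lakectl branch create ${src} -s ${dst}
# run the Spark job against s3a://${repo}/${feature}/
lakectl diff ${src}
lakectl commit ${src} -m "${feature}"
lakectl merge ${src} ${dst}
lakectl branch delete ${src}
EOF
}

workflow_commands example-repo word-count2
```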
I hope this post was helpful and that you now feel empowered to go build your own development environment on top of your data lake. These are all suggestions – the concept and methodology are what’s important.