As part of our routine work with data we develop code, choose and upgrade compute infrastructure, and test new data. Usually, this requires running parts of our production pipelines in parallel to production, testing the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, and creating alternative paths. Such workflows are error-prone, and may end up damaging production or otherwise polluting the data lake.
lakeFS enables a safe and automated development environment on your data lake without the need to copy or mock data, work on production pipelines, or involve DevOps. In this article, we will show you how to create a development environment working with lakeFS using Spark. We will start by creating a repository and building a small Spark application while using lakeFS’s capabilities such as easily committing or reverting changes to data.
- Installation of lakeFS – setting up a local lakeFS server is a one-liner. Head to our quickstart to get one up and running
- lakectl – The official lakeFS CLI. Downloaded from our Releases page
- Create repository
- Import data and commit it
- Develop a Spark application on an isolated branch
- Rollback or commit changes
- Merge and iterate on the development workflow
First let’s create a new repository, called `example-repo` that will hold our data – the following using S3 bucket `example-bucket` as the underlying storage.
$ lakectl repo create lakefs://example-repo s3://example-bucket/example-repo -d main Repository: lakefs://example-repo Repository 'example-repo' created: storage namespace: s3://example-bucket/example-repo default branch: main timestamp: 1660458426
Import data and commit it
Using the following command we can download the data and upload it into the repository.
# Download data file - credit to Gutenberg EBook of Alice’s Adventures in Wonderland # Or you can use any text file you like $ curl -o alice.txt https://www.gutenberg.org/files/11/11-0.txt # Upload to our repository $ lakectl fs upload -s alice.txt lakefs://example-repo/main/alice.txt Path: alice.txt Modified Time: 2022-08-14 09:35:16 +0300 IDT Size: 174313 bytes Human Size: 174.3 kB Physical Address: s3://barak-bucket1/example-repo/ec827db52c044d3881cdf426b99819f3 Checksum: 7e56f48cb7671ceba58e48bf232a379c Content-Type: application/octet-stream
This is the starting point of our development, we want the data to be committed to a point in time that we can always rollback, reference, or branch from.
$ lakectl commit -m "alice in wonderland" lakefs://example-repo/main Branch: lakefs://example-repo/main Commit for branch "main" completed. ID: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620 Message: alice in wonderland Timestamp: 2022-08-14 09:36:08 +0300 IDT Parents: fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc
A recommended methodology would be to update the development environment main branch periodically, to keep the development environment on par with production. Since branches are isolated, they can choose if to update their branch, and newly updated branches can be created from the updated main branch.
Develop a Spark application on an isolated branch
Capturing the state of our repository while working on our application starts with a new branch.
$ lakectl branch create lakefs://example-repo/word-count1 -s lakefs://example-repo/main Source ref: lakefs://example-repo/main created branch 'word-count1' bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Creating a branch provides you an isolated environment with a snapshot of your repository. It is a very lightweight operation that will not copy any data and still give us an isolated space that is guaranteed not to change.
We can use
lakectl to list the content of our development branch, and see the same data as in our main branch:
$ lakectl fs ls lakefs://example-repo/word-count1/ object 2022-08-14 09:35:16 +0300 IDT 174.3 kB alice.txt $ llakectl fs ls lakefs://example-repo/main/ object 2022-08-14 09:35:16 +0300 IDT 174.3 kB alice.txt
Let’s run a simple word count using Spark to produce a report based on our story
sc.hadoopConfiguration.set("fs.s3a.access.key",<lakeFS access key ID>) sc.hadoopConfiguration.set("fs.s3a.secret.key",<lakeFS secret access key>) sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true") sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.local.lakefs.io:8000") val branch = "s3a://example-repo/word-count1/" val textFile = sc.textFile(branch + "alice.txt") val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) counts.saveAsTextFile(branch + "wc.report")
Access to our repository is done through the S3 interface, by setting the s3a endpoint and credentials based on our lakeFS installation.
Running the above will produce our word count report. We can use
lakectl diff to see it:
$ lakectl diff lakefs://example-repo/word-count1 Ref: lakefs://example-repo/word-count1 + added wc.report/_SUCCESS + added wc.report/part-00000 + added wc.report/part-00001
Any issue we have with the generated data, we can always discard all uncommitted changes by reset the branch:
$ lakectl branch reset lakefs://example-repo/word-count1 --yes Branch: lakefs://example-repo/word-count1
When our feature is complete, we can commit our data to lakeFS.
You can include additional metadata fields in your lakeFS commit. For instance, by specifying its git commit hash, we can reference the code that was used to build the data.
$ lakectl commit lakefs://example-repo/word-count1 -m "word count" --meta git_commit_hash=cc313c5 Branch: lakefs://example-repo/word-count1 Commit for branch "word-count1" completed. ID: 068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df Message: word count Timestamp: 2022-08-14 11:36:16 +0300 IDT Parents: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Using the `meta` flag we can store multiple metadata key/value pairs to help label our commit. Later we can check the log and use the referenced data.
$ lakectl log lakefs://example-repo/word-count1 ID: 068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df Author: barak Date: 2022-08-14 11:36:16 +0300 IDT word count git_commit_hash = cc313c5 ID: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620 Author: barak Date: 2022-08-14 09:36:08 +0300 IDT alice in wonderland ID: fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc Date: 2022-08-14 09:27:06 +0300 IDT Repository created
The commit ID, for example
068ddaf6b4919e5b, can be used for addressing the repository at this point by our CLI
$ lakectl fs cat lakefs://example-repo/068ddaf6b4919e5b/wc.report/part-00000 (“Found,2) (someone,1) (plan.”,1) (bone,1) (roses.,1) ...
Or by our application as s3a address
Merge and iterate on the development workflow
Once both code and data are committed, they can be reviewed together before deciding to merge our new data into the main branch.
$ lakectl merge lakefs://example-repo/word-count1 lakefs://example-repo/main Source: lakefs://example-repo/word-count1 Destination: lakefs://example-repo/main Merged "word-count1" into "main" to get "50d593e52cba687f2af7894189b381ebafdf7495f9e65a70ba2fc67e4e8c0677".
Merge will generate a commit on the target branch with all the changes we made.
Committing is a fast and atomic metadata operation – no data is copied during the process. Our data is stored only once. If we merge changes to multiple objects, we are guaranteed that all of them will show up on our destination branch at the same time.
Optionally we can delete the feature branch once it is no longer required
$ lakectl branch delete lakefs://example-repo/word-count1 --yes Branch: lakefs://example-repo/word-count1
I hope this post was helpful and that you now feel empowered to go build your own development environment on top of your data lake. These are all suggestions – the concept and methodology is what’s important.
If you enjoyed this article, check out other related posts: