
Barak Amar

October 27, 2020

Overview

As part of our routine work with data, we develop code, choose and upgrade compute infrastructure, and test new data. Usually this requires running parts of our production pipelines in parallel to production in order to test the changes we wish to apply. Every data engineer knows this convoluted process requires copying data, manually updating configuration, and creating alternative paths. Such workflows are error-prone and may end up damaging production or otherwise polluting the data lake.


lakeFS enables a safe and automated development environment on your data lake, without the need to copy or mock data, work directly on production pipelines, or involve DevOps. In this article, we will show you how to create a development environment on lakeFS using Spark. We will start by creating a repository, then build a small Spark application while using lakeFS capabilities such as easily committing or reverting changes to data.

Prerequisites

  • Installation of lakeFS – setting up a local lakeFS server is a one-liner. Head to our quickstart to get one up and running.
  • lakectl – the official lakeFS CLI. Download it from our Releases page.
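
If you want to follow along on a local machine, the quickstart boils down to running the lakeFS server and pointing lakectl at it. A sketch, assuming Docker is installed (check the quickstart for the exact, current command and flags):

# Run a local lakeFS server (flags here follow the quickstart at the time of writing)
$ docker run --pull always -p 8000:8000 treeverse/lakefs run --local-settings

# Interactively configure lakectl with your server endpoint and credentials
$ lakectl config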

Agenda

  1. Create repository
  2. Import data and commit it
  3. Develop a Spark application on an isolated branch
  4. Rollback or commit changes
  5. Merge and iterate on the development workflow

Create Repository

First, let’s create a new repository called `example-repo` that will hold our data. The following command uses the S3 bucket `example-bucket` as the underlying storage.

$ lakectl repo create lakefs://example-repo s3://example-bucket/example-repo -d main
Repository: lakefs://example-repo
Repository 'example-repo' created:
storage namespace: s3://example-bucket/example-repo
default branch: main
timestamp: 1660458426
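
We can verify the repository exists by listing the repositories lakectl knows about (the exact output format may vary by lakectl version):

$ lakectl repo list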

Import data and commit it

Using the following commands, we can download a data file and upload it to the repository.

# Download a data file - credit to the Project Gutenberg EBook of Alice’s Adventures in Wonderland
# Or you can use any text file you like
$ curl -o alice.txt https://www.gutenberg.org/files/11/11-0.txt

# Upload to our repository
$ lakectl fs upload -s alice.txt lakefs://example-repo/main/alice.txt
Path: alice.txt
Modified Time: 2022-08-14 09:35:16 +0300 IDT
Size: 174313 bytes
Human Size: 174.3 kB
Physical Address: s3://example-bucket/example-repo/ec827db52c044d3881cdf426b99819f3
Checksum: 7e56f48cb7671ceba58e48bf232a379c
Content-Type: application/octet-stream

This is the starting point of our development: we want the data committed at a point in time that we can always roll back to, reference, or branch from.

$ lakectl commit -m "alice in wonderland" lakefs://example-repo/main
Branch: lakefs://example-repo/main
Commit for branch "main" completed.

ID: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Message: alice in wonderland
Timestamp: 2022-08-14 09:36:08 +0300 IDT
Parents: fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc

A recommended methodology is to periodically update the main branch of the development environment, to keep it on par with production. Since branches are isolated, each developer can choose when to sync their own branch, and newly created branches will start from the updated main branch. One way to sync an existing branch is sketched below.
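
For example, merging main into a development branch pulls the latest production data into it (my-dev-branch below is a placeholder for an existing branch name):

$ lakectl merge lakefs://example-repo/main lakefs://example-repo/my-dev-branch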

Develop a Spark application on an isolated branch

To capture the state of our repository while we work on our application, we start with a new branch.

$ lakectl branch create lakefs://example-repo/word-count1 -s lakefs://example-repo/main
Source ref: lakefs://example-repo/main
created branch 'word-count1' bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620

Creating a branch provides you with an isolated environment holding a snapshot of your repository. It is a very lightweight operation that does not copy any data, yet gives us an isolated space that is guaranteed not to change.

We can use lakectl to list the content of our development branch, and see the same data as in our main branch:

$ lakectl fs ls lakefs://example-repo/word-count1/
object          2022-08-14 09:35:16 +0300 IDT    174.3 kB        alice.txt

$ lakectl fs ls lakefs://example-repo/main/
object          2022-08-14 09:35:16 +0300 IDT    174.3 kB        alice.txt

Let’s run a simple word count using Spark to produce a report based on our story:

sc.hadoopConfiguration.set("fs.s3a.access.key",<lakeFS access key ID>)
sc.hadoopConfiguration.set("fs.s3a.secret.key",<lakeFS secret access key>)
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://s3.local.lakefs.io:8000")

val branch = "s3a://example-repo/word-count1/"
val textFile = sc.textFile(branch + "alice.txt")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile(branch + "wc.report")
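
(To run this – for example in spark-shell – the hadoop-aws module and its matching AWS SDK need to be on the classpath; that is what provides the S3AFileSystem configured above.)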

Access to our repository goes through lakeFS’s S3-compatible interface: we point the s3a endpoint and credentials at our lakeFS installation, and address objects as s3a://<repository>/<branch>/<path>.
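
These properties don’t have to be hardcoded: Spark forwards any spark.hadoop.* configuration to the Hadoop layer, so the same setup can be passed at submit time. A sketch (word-count.jar is a placeholder for your packaged application):

$ spark-submit \
    --conf spark.hadoop.fs.s3a.access.key=<lakeFS access key ID> \
    --conf spark.hadoop.fs.s3a.secret.key=<lakeFS secret access key> \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.endpoint=http://s3.local.lakefs.io:8000 \
    word-count.jar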

Running the above produces our word count report. We can use lakectl diff to see the uncommitted changes:

$ lakectl diff lakefs://example-repo/word-count1
Ref: lakefs://example-repo/word-count1
+ added wc.report/_SUCCESS
+ added wc.report/part-00000
+ added wc.report/part-00001

If we have any issue with the generated data, we can discard all uncommitted changes by resetting the branch:

$ lakectl branch reset lakefs://example-repo/word-count1 --yes 
Branch: lakefs://example-repo/word-count1

When our feature is complete, we can commit our data to lakeFS.

You can include additional metadata fields in your lakeFS commit. For instance, specifying the git commit hash of the application code lets us reference the code that was used to build the data.

$ lakectl commit lakefs://example-repo/word-count1 -m "word count" --meta git_commit_hash=cc313c5
Branch: lakefs://example-repo/word-count1
Commit for branch "word-count1" completed.

ID: 068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df
Message: word count
Timestamp: 2022-08-14 11:36:16 +0300 IDT
Parents: bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620

Using the `--meta` flag we can store multiple metadata key/value pairs that label our commit. Later we can check the log and use the referenced data.

$ lakectl log lakefs://example-repo/word-count1

ID:            068ddaf6b4919e5bf97a14d254ba5da64fa16e7fef438b1d65656cc9727b33df
Author:        barak
Date:          2022-08-14 11:36:16 +0300 IDT

	word count
	
	
		git_commit_hash = cc313c5
	
ID:            bf099c651bf8ee914fca2b42824f05d2f48ac17bde80f4b34a5d9fb61839a620
Author:        barak
Date:          2022-08-14 09:36:08 +0300 IDT

	alice in wonderland
	
	
ID:            fdcd03133c44f982b2f294494253d479a3409abd08d0d6e2686ea8764f0677dc
Date:          2022-08-14 09:27:06 +0300 IDT

	Repository created

The commit ID – or a unique prefix of it, such as 068ddaf6b4919e5b – can be used to address the repository at this point in time from our CLI:

$ lakectl fs cat lakefs://example-repo/068ddaf6b4919e5b/wc.report/part-00000
(“Found,2)
(someone,1)
(plan.”,1)
(bone,1)
(roses.,1)
...

Or from our application, using the s3a address s3a://example-repo/068ddaf6b4919e5b/wc.report.
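
In Spark, a commit-pinned path reads like any other. A minimal sketch, reusing the s3a configuration from before (pinnedReport is just an illustrative name):

val pinnedReport = sc.textFile("s3a://example-repo/068ddaf6b4919e5b/wc.report")
pinnedReport.take(5).foreach(println) // returns the same lines on every run, regardless of later commits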

Merge and iterate on the development workflow

Once both code and data are committed, they can be reviewed together before deciding to merge our new data into the main branch.

$ lakectl merge lakefs://example-repo/word-count1 lakefs://example-repo/main
Source: lakefs://example-repo/word-count1
Destination: lakefs://example-repo/main
Merged "word-count1" into "main" to get "50d593e52cba687f2af7894189b381ebafdf7495f9e65a70ba2fc67e4e8c0677".

The merge generates a commit on the destination branch containing all the changes we made.

Merging is a fast and atomic metadata operation – no data is copied during the process, and our data is stored only once. If we merge changes to multiple objects, we are guaranteed that all of them will show up on the destination branch at the same time.

Optionally, we can delete the feature branch once it is no longer required:

$ lakectl branch delete lakefs://example-repo/word-count1 --yes
Branch: lakefs://example-repo/word-count1

Summary

I hope this post was helpful and that you now feel empowered to go build your own development environment on top of your data lake. These are all suggestions – the concept and methodology are what’s important.

Check out our docs for more information on how to get started with lakeFS, and feel free to join our Slack community to ask questions!

