Yoni Augarten
January 5, 2021

lakeFS is an open-source tool that brings resilience and manageability to data lakes built on object storage. It provides Git-like operations over your MinIO storage environment and works with modern data frameworks such as Spark, Hive, Presto, Kafka, R, and native Python.

Common use-cases include creating a development environment without copying or mocking data, consciously ingesting new data sources, and building a resilient production environment to deploy fresh data.

In this post, I will cover how to set up lakeFS over MinIO, and give you a sense of how lakeFS makes it easy to work with data.

Prerequisites

  • MinIO Server, installed from here.
  • mc (the MinIO client), installed from here.
  • docker-compose, installed from here.
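
Before continuing, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of lakeFS or MinIO: missing_tools is a hypothetical helper, and curl is included in the check only because it is used later to fetch the compose file.

```shell
# missing_tools is a hypothetical helper: it prints the name of every
# command passed to it that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Check the tools this post relies on before lakeFS itself is installed.
missing=$(missing_tools mc docker-compose curl)
# $missing is left unquoted on purpose so the names print on one line.
[ -z "$missing" ] || echo "Please install:" $missing
```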

Installation

We will install lakeFS locally on your development machine. More installation options are available in our docs.

A production-suitable installation will require a persistent Postgres installation. However, for this example we will use a Postgres instance within a Docker container.

Create a docker-compose environment file for lakeFS, replacing <minio_access_key_id>, <minio_secret_access_key> and <minio_endpoint> with their values in your MinIO installation.

Run the following commands:

LAKEFS_CONFIG_FILE=./.lakefs-env
echo "AWS_ACCESS_KEY_ID=<minio_access_key_id>" > $LAKEFS_CONFIG_FILE
echo "AWS_SECRET_ACCESS_KEY=<minio_secret_access_key>" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_ENDPOINT=<minio_endpoint>" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_TYPE=s3" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true" >> $LAKEFS_CONFIG_FILE
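
The same file can also be written in one step with a here-document, which avoids mixing up > and >> across several echo lines. This is only a sketch: write_lakefs_env is a hypothetical helper, and the values in the usage comment are placeholders to replace with your own MinIO settings.

```shell
# write_lakefs_env is a hypothetical helper, not part of lakeFS or MinIO.
# It writes the same five settings as the echo commands above.
write_lakefs_env() {
  # $1 = MinIO access key ID
  # $2 = MinIO secret access key
  # $3 = MinIO endpoint URL
  # $4 = path of the env file to create (overwritten if it exists)
  cat > "$4" <<EOF
AWS_ACCESS_KEY_ID=$1
AWS_SECRET_ACCESS_KEY=$2
LAKEFS_BLOCKSTORE_S3_ENDPOINT=$3
LAKEFS_BLOCKSTORE_TYPE=s3
LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
EOF
}

# Usage (placeholder values):
# write_lakefs_env my-access-key my-secret-key http://minio.example:9000 ./.lakefs-env
```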

Then start lakeFS:

curl https://compose.lakefs.io | docker-compose --env-file $LAKEFS_CONFIG_FILE -f - up
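
Because docker-compose up stays in the foreground, a second terminal can poll until lakeFS answers. The sketch below assumes curl is available and that the setup URL returns a successful HTTP status once the server is ready; wait_for_url is a hypothetical helper, not part of lakeFS.

```shell
# wait_for_url is a hypothetical helper: it polls a URL with curl until
# the request succeeds or the attempts run out.
wait_for_url() {
  # $1 = URL to poll
  # $2 = maximum number of attempts (default 30)
  # $3 = seconds to sleep between attempts (default 2)
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl return a non-zero status on HTTP errors.
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (the URL is the setup page mentioned in this post):
# wait_for_url http://127.0.0.1:8000/setup && echo "lakeFS is up"
```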

Configuration

Browse to lakeFS to create an admin user: http://127.0.0.1:8000/setup

Take note of the generated access key and secret.

We will use the lakectl binary to perform lakeFS operations. Find the distribution suitable for your operating system here, and extract the lakectl binary from the tar.gz archive. Put it somewhere in your $PATH and run lakectl --version to verify.

Then run the following command to configure lakectl, using the credentials generated during setup:

lakectl config
# output:
# Config file /home/janedoe/.lakectl.yaml will be used
# Access key ID: <LAKEFS_ACCESS_KEY_ID>
# Secret access key: <LAKEFS_SECRET_KEY>
# Server endpoint URL: http://127.0.0.1:8000/api/v1

Verify that lakectl can access lakeFS with the command:

lakectl repo list
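
To script this check rather than read the output by eye, you can test the exit status instead. This is a minimal sketch that assumes lakectl exits with a non-zero status on failure, which is typical for command-line tools but not confirmed here; run_check is a hypothetical helper.

```shell
# run_check is a hypothetical helper: it runs a command quietly and
# reports success or failure based on its exit status.
run_check() {
  if "$@" >/dev/null 2>&1; then
    echo "ok: $*"
  else
    echo "failed: $*"
    return 1
  fi
}

# Usage:
# run_check lakectl repo list
```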

If no error is displayed, you can now set a MinIO alias for lakeFS:

mc alias set lakefs http://s3.local.lakefs.io:8000 <LAKEFS_ACCESS_KEY_ID> <LAKEFS_SECRET_KEY>

Example

Create a bucket in MinIO. Note that this bucket is created directly in your MinIO installation; the command below assumes myminio is an mc alias that points at your MinIO server. Later we will use lakeFS to enable versioning on this bucket.

mc mb myminio/example-bucket

Create a repository in lakeFS:

lakectl repo create lakefs://example-repo s3://example-bucket

Generate two example files:

echo "my first file" > myfile.txt
echo "my second file" > myfile2.txt

Copy the first file to the master branch, and commit:

mc cp ./myfile.txt lakefs/example-repo/master/
lakectl commit lakefs://example-repo@master -m "my first commit"

Create a branch named branch1, and copy a file to it:

lakectl branch create lakefs://example-repo@branch1 --source lakefs://example-repo@master
mc cp ./myfile2.txt lakefs/example-repo/branch1/

List master and branch1. The new file is visible only in the branch, while the older file is visible in both the branch and master:

mc ls lakefs/example-repo/master
# only myfile.txt should be listed

mc ls lakefs/example-repo/branch1
# both files should be listed

Let’s commit the branch, and merge it back to master:

lakectl commit lakefs://example-repo@branch1 -m "my second commit"
lakectl merge lakefs://example-repo@branch1 lakefs://example-repo@master

Now both files are accessible through master:

mc ls lakefs/example-repo/master

Conclusion 

Give it a try on your own. Leverage the power of Git branches to run parallel pipelines, experiment and upgrade safely. If you need a little help, check out our documentation and feel free to reach out on our public Slack channel as well.
