lakeFS is an open-source tool that brings resilience and manageability to object-storage-based data lakes. It provides Git-like operations over your MinIO storage environment and works with modern data frameworks such as Spark, Hive, Presto, and Kafka, as well as with R and native Python.
Common use-cases include creating a development environment without copying or mocking data, consciously ingesting new data sources, and building a resilient production environment to deploy fresh data.
In this post, I will cover how to set up lakeFS over MinIO and give you a sense of how lakeFS makes it easy to work with data.
We will install lakeFS locally on your development machine. More installation options are available in our docs.
A production-suitable installation requires a persistent Postgres installation. For this example, however, we will use a Postgres instance running in a Docker container.
Create a docker-compose environment file for lakeFS, replacing <minio_access_key_id>, <minio_secret_access_key> and <minio_endpoint> with the values from your MinIO installation.
Run the following commands:
LAKEFS_CONFIG_FILE=./.lakefs-env
echo "AWS_ACCESS_KEY_ID=<minio_access_key_id>" > $LAKEFS_CONFIG_FILE
echo "AWS_SECRET_ACCESS_KEY=<minio_secret_access_key>" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_ENDPOINT=<minio_endpoint>" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_TYPE=s3" >> $LAKEFS_CONFIG_FILE
echo "LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true" >> $LAKEFS_CONFIG_FILE
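If you prefer to script this step, the same five settings can be written with a single heredoc. This is only a sketch: the MINIO_* variable names and their default values are placeholders invented for illustration, not names that lakeFS or MinIO define, so substitute the real values from your MinIO installation.

```shell
#!/bin/sh
set -eu

# Assumed variable names for illustration; the defaults are placeholders only.
MINIO_ACCESS_KEY_ID="${MINIO_ACCESS_KEY_ID:-example-access-key}"
MINIO_SECRET_ACCESS_KEY="${MINIO_SECRET_ACCESS_KEY:-example-secret-key}"
MINIO_ENDPOINT="${MINIO_ENDPOINT:-http://minio.example.com:9000}"

LAKEFS_CONFIG_FILE="${LAKEFS_CONFIG_FILE:-./.lakefs-env}"

# Write all five settings in one step; the file is overwritten on each run.
cat > "$LAKEFS_CONFIG_FILE" <<EOF
AWS_ACCESS_KEY_ID=${MINIO_ACCESS_KEY_ID}
AWS_SECRET_ACCESS_KEY=${MINIO_SECRET_ACCESS_KEY}
LAKEFS_BLOCKSTORE_S3_ENDPOINT=${MINIO_ENDPOINT}
LAKEFS_BLOCKSTORE_TYPE=s3
LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
EOF

# The file holds a secret, so restrict it to the current user.
chmod 600 "$LAKEFS_CONFIG_FILE"

# Fail early if an unreplaced <placeholder> ended up in the file.
if grep -q '<.*>' "$LAKEFS_CONFIG_FILE"; then
  echo "error: unreplaced placeholder in $LAKEFS_CONFIG_FILE" >&2
  exit 1
fi

echo "wrote $(wc -l < "$LAKEFS_CONFIG_FILE" | tr -d ' ') settings to $LAKEFS_CONFIG_FILE"
```

Exporting the three MINIO_* variables before running the script overrides the placeholder defaults.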
Then start lakeFS:
curl https://compose.lakefs.io | docker-compose --env-file $LAKEFS_CONFIG_FILE -f - up
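The compose command above stays in the foreground, so any follow-up steps run from a second terminal. If you script those steps, a small retry helper can wait until lakeFS answers before continuing. The helper below is a hypothetical sketch, not part of lakeFS; it retries any command you pass to it, and the last two lines only exercise it with commands that always succeed or always fail.

```shell
#!/bin/sh
set -eu

# Retry a command until it succeeds; give up after a fixed number of attempts.
# usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 1
}

# Self-check with commands that always succeed or always fail.
wait_for 3 0 true && echo "ready"
wait_for 2 0 false || echo "gave up"
```

Against a local installation, a call might look like `wait_for 30 2 curl -fsS http://127.0.0.1:8000/setup`, assuming the setup page responds with a success status once the server is up.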
Browse to lakeFS to create an admin user: http://127.0.0.1:8000/setup
Take note of the generated access key and secret.
We will use the lakectl binary to perform lakeFS operations. Find the distribution suitable for your operating system here, and extract the lakectl binary from the tar.gz archive. Put it somewhere in your $PATH and run lakectl --version to verify.
Then run the following command to configure lakectl (use the credentials generated during setup):
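Before going further, it can help to confirm that every command-line tool this walkthrough relies on is on your PATH (mc, the MinIO client, is used in the later steps). The following is a minimal sketch; missing_tools is a hypothetical helper name, not part of lakeFS or MinIO.

```shell
#!/bin/sh
set -eu

# Print every named command that is not found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

missing="$(missing_tools docker-compose curl lakectl mc)"
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names are joined on one line.
  echo "not found on PATH:" $missing >&2
else
  echo "all tools found"
fi
```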
lakectl config
# output:
# Config file /home/janedoe/.lakectl.yaml will be used
# Access key ID: <LAKEFS_ACCESS_KEY_ID>
# Secret access key: <LAKEFS_SECRET_KEY>
# Server endpoint URL: http://127.0.0.1:8000/api/v1
Verify that lakectl can access lakeFS with the command:
lakectl repo list
If no error is displayed, you can now set a MinIO alias for lakeFS:
mc alias set lakefs http://s3.local.lakefs.io:8000 <LAKEFS_ACCESS_KEY_ID> <LAKEFS_SECRET_KEY>
Create a bucket in MinIO. Note that this bucket is created directly in your MinIO installation; the command below assumes an mc alias named myminio that points to it. Later, we will use lakeFS to enable versioning on this bucket.
mc mb myminio/example-bucket
Create a repository in lakeFS:
lakectl repo create lakefs://example-repo s3://example-bucket
Generate two example files:
echo "my first file" > myfile.txt
echo "my second file" > myfile2.txt
Copy the first file to your master branch, and commit:
mc cp ./myfile.txt lakefs/example-repo/master/
lakectl commit lakefs://example-repo@master -m "my first commit"
Create a branch named branch1, and copy a file to it:
lakectl branch create lakefs://example-repo@branch1 --source lakefs://example-repo@master
mc cp ./myfile2.txt lakefs/example-repo/branch1/
List master and the branch to see that the new file is visible only in the branch, while the older file is visible in both the branch and master.
mc ls lakefs/example-repo/master # only myfile.txt should be listed
mc ls lakefs/example-repo/branch1 # both files should be listed
Let’s commit the branch, and merge it back to master:
lakectl commit lakefs://example-repo@branch1 -m "my second commit"
lakectl merge lakefs://example-repo@branch1 lakefs://example-repo@master
Now both files are accessible through master:
mc ls lakefs/example-repo/master
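The branch, copy, commit and merge steps above can be wrapped in one reusable function. What follows is a sketch under a few assumptions: run and branch_commit_merge are hypothetical helper names, the lakectl and mc invocations are copied from the commands in this post, and the script defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
set -eu

# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
REPO="example-repo"

# Print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: create a branch from master, upload one file to it,
# commit the change, and merge the branch back into master.
branch_commit_merge() {
  branch="$1"
  file="$2"
  message="$3"
  run lakectl branch create "lakefs://${REPO}@${branch}" --source "lakefs://${REPO}@master"
  run mc cp "$file" "lakefs/${REPO}/${branch}/"
  run lakectl commit "lakefs://${REPO}@${branch}" -m "$message"
  run lakectl merge "lakefs://${REPO}@${branch}" "lakefs://${REPO}@master"
}

branch_commit_merge branch1 ./myfile2.txt "my second commit"
```

Setting DRY_RUN=0 executes the commands for real, which assumes the repository and the lakefs alias are configured as described earlier in this post.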
Give it a try on your own. Leverage the power of Git branches to run parallel pipelines, experiment and upgrade safely. If you need a little help, check out our documentation and feel free to reach out on our public Slack channel as well.