Few people start using lakeFS without first having some data collected. Consequently, it is common that after getting it up and running, one of the first things people do is import their existing data to lakeFS.
There isn’t a one-size-fits-all approach for doing this. Instead, there are ways that work great for a single file, and some that are designed to handle millions of them.
Let’s walk through, in detail, how it’s done for each situation.
Ready? Let’s go.
Two Things to Know Before We Begin
You should know the values for two critical pieces of information before we go any further.
- Your lakeFS endpoint URL: This is the address of your lakeFS installation’s S3 Gateway. If testing locally, it will likely be http://localhost:8000. If you have a cloud-deployed lakeFS installation, you should have a DNS record pointing to the server, something like lakefs.example.com. Keep this value handy, since it’ll be used in several places.
- Your lakeFS credentials: These are the Key ID and Secret Key generated when you first set up lakeFS and downloaded a lakectl.yaml file. Alternatively, your lakeFS administrator can set up a user for you and send you the Key ID and Secret Key.
These credentials will be used in the following configuration files:
With that out of the way, let’s get started!
Single Local File Copy (AWS CLI)
The Situation — The marketing expert at your company sends you a CSV file of all customers he sent promotional emails to in the past month. You would like to add this file (which currently sits in your local Downloads folder) to your data lake for availability in potential analyses of these customers.
To do this, we’ll use the AWS CLI to copy the file into our lakeFS repo.
The following command copies a file called customer_promo_2021-11.csv in my local ~/Downloads folder onto the main branch of a lakeFS repository called my-repo, under the path given in the destination URI:
aws --profile lakefs \
  --endpoint-url https://penv.lakefs.dev \
  s3 cp ~/Downloads/customer_promo_2021-11.csv s3://my-repo/main/marketing/customer_promo_2021-11.csv
In order for this to work, we need to make sure our ~/.aws/credentials file has an entry for lakeFS. Here’s what mine looks like:
[default]
aws_access_key_id=AKIAMYACTUALAWSCREDS
aws_secret_access_key=EXAMPLEj2fnHf73J9jkke/e3ea4D

[lakefs]
aws_access_key_id=AKIAJRKP6EXAMPLE
aws_secret_access_key=EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
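If you’d rather script this step than edit the file by hand, here is a minimal sketch that appends a [lakefs] profile to a credentials file only when one isn’t already there. The function name add_lakefs_profile and its argument order are my own invention for illustration; they are not part of the AWS CLI or lakeFS.

```shell
# Hypothetical helper (not part of the AWS CLI): add a [lakefs] profile
# to an AWS credentials file unless the file already defines one.
# Usage: add_lakefs_profile <credentials-file> <key-id> <secret-key>
add_lakefs_profile() {
  creds_file=$1
  key_id=$2
  secret_key=$3

  mkdir -p "$(dirname "$creds_file")"
  touch "$creds_file"

  # Leave the file untouched if a [lakefs] section is already there.
  if grep -q '^\[lakefs\]' "$creds_file"; then
    return 0
  fi

  printf '\n[lakefs]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
    "$key_id" "$secret_key" >> "$creds_file"

  # Credentials should be readable by the owner only.
  chmod 600 "$creds_file"
}

# Example, using the placeholder values from this article:
# add_lakefs_profile ~/.aws/credentials AKIAJRKP6EXAMPLE EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
```

Calling it twice is harmless, since the second call sees the existing section and does nothing.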
This works by overriding the --endpoint-url parameter of the AWS CLI, allowing you to direct the requests to your lakeFS installation.
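Since every lakeFS-bound command needs the same two flags, it can be handy to wrap them once. The sketch below is one way to do that, assuming a LAKEFS_ENDPOINT environment variable and helper names (lakefs_uri, lakefs_s3) that I made up for this example. Setting DRY_RUN=1 prints the command instead of running it, so you can check what would be sent.

```shell
# Assumed environment variable holding your lakeFS endpoint URL.
LAKEFS_ENDPOINT=${LAKEFS_ENDPOINT:-http://localhost:8000}

# Build the S3-style address of a lakeFS object: s3://<repo>/<branch>/<path>
lakefs_uri() {
  printf 's3://%s/%s/%s\n' "$1" "$2" "$3"
}

# Run any "aws s3" subcommand against lakeFS instead of AWS.
# With DRY_RUN=1 the command is printed rather than executed.
lakefs_s3() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo aws --profile lakefs --endpoint-url "$LAKEFS_ENDPOINT" s3 "$@"
  else
    aws --profile lakefs --endpoint-url "$LAKEFS_ENDPOINT" s3 "$@"
  fi
}

# Example:
# lakefs_s3 cp ~/Downloads/customer_promo_2021-11.csv \
#   "$(lakefs_uri my-repo main marketing/customer_promo_2021-11.csv)"
```

The repository name takes the place of the bucket, and the branch becomes the first path segment, which is why the same wrapper works for ls, cp, and the other s3 subcommands.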
Copy Data Without Copying Data (lakectl ingest)
As the lakeFS docs beautifully state:
The lakectl command line tool supports ingesting objects from a source object store without actually copying the data itself. This is done by listing the source bucket (and optional prefix), and creating pointers to the returned objects in lakeFS.
Note that unlike the AWS CLI file copy command above, this works for data already in an object store. We’ll show how to copy the same
customer_promo_2021-11.csv file as last time. Instead of being on our local computer though, now it’ll be located in an S3 bucket named
The parameters for the
lakectl ingest command are quite straightforward. We simply use the
--to params to point to the S3 prefix where the file(s) are located, and where in the lakeFS repo we want the objects to exist.
lakectl ingest \
  --from s3://my-beautiful-s3-bucket/customer_promo_2021-11.csv \
  --to lakefs://my-repo/main/marketing/customer_promo_2021-11.csv
This works for both single files and multiple files. If instead we had an S3 prefix /customer_promos/ in our beautiful S3 bucket with multiple CSV files, we could ingest all of them to lakeFS with the command:
lakectl ingest \
  --from s3://my-beautiful-s3-bucket/customer_promos/ \
  --to lakefs://my-repo/main/marketing/customer_promos/
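When there are several prefixes to bring over, the same command can be run in a loop. Here is a rough sketch, assuming each prefix keeps its name under a destination directory in the repo. The function ingest_prefixes and the second prefix in the example (newsletters/) are hypothetical; only the --from and --to flags come from the commands above.

```shell
# Hypothetical wrapper around the lakectl ingest command shown above.
# Usage: ingest_prefixes <source-bucket> <repo> <branch> <dest-dir> <prefix>...
ingest_prefixes() {
  src_bucket=$1
  repo=$2
  branch=$3
  dest_dir=$4
  shift 4

  for prefix in "$@"; do
    # DRY_RUN=1 prints each command instead of running lakectl.
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo lakectl ingest \
        --from "s3://$src_bucket/$prefix" \
        --to "lakefs://$repo/$branch/$dest_dir/$prefix"
    else
      # Stop at the first prefix that fails to ingest.
      lakectl ingest \
        --from "s3://$src_bucket/$prefix" \
        --to "lakefs://$repo/$branch/$dest_dir/$prefix" || return 1
    fi
  done
}

# Example:
# DRY_RUN=1 ingest_prefixes my-beautiful-s3-bucket my-repo main marketing \
#   customer_promos/ newsletters/
```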
In order for the
lakectl ingest command to work, we need to make sure our command line is set up to run
As part of the lakeFS tutorial series, I made a 4 minute video explaining exactly how to do this.
Note that in the video I use the lakectl config helper command to configure lakectl. You could also edit the $HOME/.lakectl.yaml config file directly. Using the same example lakeFS host and credentials as before, my .lakectl.yaml file looks like:
credentials:
  access_key_id: AKIAJRKP6EXAMPLE
  secret_access_key: EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
server:
  endpoint_url: https://penv.lakefs.dev
Of course, you would not use these exact values, but instead use your lakeFS credentials and lakeFS domain host.
Large-Scale Imports (lakeFS inventory imports)
For even larger data collections, the lakeFS binary comes pre-packaged with an import utility that can handle many millions of objects. It works by taking advantage of S3’s Inventory feature to create an efficient snapshot of your bucket.
The following command imports the data summarized from an S3 inventory stored in the bucket
my-beautiful-s3-bucket-inventory to a lakeFS repository named
lakefs import \
  lakefs://my-repo \
  -m s3://my-beautiful-s3-bucket-inventory/my-beautiful-bucket/my-beautiful-inventory/2021-10-25T00-00Z/manifest.json \
  --config .lakefs.yaml
Note that we cannot save the inventory of an S3 bucket in the bucket itself, as this would create a recursive mess that is better discussed in the pages of Gödel, Escher, Bach.
Anyway, for the data stored in my-beautiful-s3-bucket that we want to import to lakeFS, we create a second bucket named my-beautiful-s3-bucket-inventory (though it could be called anything) and configure S3 Inventory to store its reports there.
If the inventory import works, you’ll see a response in the terminal like this:
Inventory (2021-10-24) Files Read 1 / 1 done
Inventory (2021-10-24) Current File 1 / 1 done
Commit progress 0 done
Objects imported 1 done
Added or changed objects: 1
Commit ref:3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0
Import to branch import-from-inventory finished successfully.
To list imported objects, run:
$ lakectl fs ls lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0/
To merge the changes to your main branch, run:
$ lakectl merge lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0 lakefs://my-repo@main
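If you run imports from a script, you may want to pick the commit ID out of that output rather than copy it by hand. The sketch below assumes the output has been saved to a file with each message on its own line, and that the ID always follows the text "Commit ref:" as in the sample. Both function names are made up for this example.

```shell
# Read saved `lakefs import` output on stdin and print the commit ID
# that follows "Commit ref:" (format assumed from the sample output).
import_commit_ref() {
  sed -n 's/^Commit ref:[[:space:]]*\([0-9a-f][0-9a-f]*\).*$/\1/p' | head -n 1
}

# Print the merge command for a repository and commit ID, mirroring
# the suggestion at the end of the import output.
# Usage: merge_command <repo> <commit-id>
merge_command() {
  printf 'lakectl merge lakefs://%s@%s lakefs://%s@main\n' "$1" "$2" "$1"
}

# Example, with the import output saved to a file named import.log:
# ref=$(import_commit_ref < import.log)
# merge_command my-repo "$ref"
```

Printing the merge command instead of running it leaves a chance to inspect the imported objects first.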
And if you open the lakeFS UI, you’ll see a new import-from-inventory branch with the added objects on the latest commit.
Let’s walk through step-by-step creating this S3 Inventory, shall we?
Creating an S3 Inventory
Step 1: Go to the Management tab of the data bucket and click “Create Inventory Configuration”.
Step 2: Configure the Inventory. Fill in the inventory config name, which can be any value. For this example, I’ll go with
If you want to limit the inventory to only a certain prefix within the bucket, you can specify that under Inventory Scope. I’ll leave it blank to capture the whole bucket.
Next, I’ll find my-beautiful-s3-bucket-inventory in the S3 browser and select it as the inventory destination. Note: If you recently created the destination bucket, it can take a few minutes for it to appear in the S3 browser. Be patient.
Next, choose options for the inventory frequency, format, and status. I recommend Daily, Apache Parquet, and Enable. If you have sensitive data, you can turn Server-side encryption on. I’ll leave it off for this example.
For Additional Fields, as the lakeFS documentation states, make sure you check Size, Last modified, and ETag. Once checked, click “Create”.
Step 3: Wait! It takes around 24 hours for the first inventory report to be generated, after which we can run the
lakefs import command. Eventually you will see that the inventory job ran with a link to the generated
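The same inventory can also be set up from the command line with the AWS CLI’s s3api put-bucket-inventory-configuration command. Below is a sketch of the configuration that matches the console choices above (Daily, Parquet, enabled, with Size, Last modified, and ETag). The helper name and the inventory ID "my-inventory" are made up, and covering current object versions only is my assumption. Unlike the console, the CLI may not add the destination bucket policy that lets S3 write reports there, so check that separately.

```shell
# Print an S3 Inventory configuration as JSON.
# Usage: inventory_config_json <inventory-id> <destination-bucket>
inventory_config_json() {
  inv_id=$1
  dest_bucket=$2
  # "Current" (latest object versions only) is an assumption.
  cat <<EOF
{
  "Id": "$inv_id",
  "IsEnabled": true,
  "IncludedObjectVersions": "Current",
  "Destination": {
    "S3BucketDestination": {
      "Bucket": "arn:aws:s3:::$dest_bucket",
      "Format": "Parquet"
    }
  },
  "Schedule": { "Frequency": "Daily" },
  "OptionalFields": ["Size", "LastModifiedDate", "ETag"]
}
EOF
}

# Example ("my-inventory" is a made-up inventory ID):
# inventory_config_json my-inventory my-beautiful-s3-bucket-inventory > inventory.json
# aws s3api put-bucket-inventory-configuration \
#   --bucket my-beautiful-s3-bucket \
#   --id my-inventory \
#   --inventory-configuration file://inventory.json
```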
Running lakeFS is dependent on the configuration file documented here. By default lakeFS looks for the following configuration files:
./config.yaml
$HOME/lakefs/config.yaml
/etc/lakefs/config.yaml
$HOME/.lakefs.yaml
Saving this file in plaintext on a laptop is not a recommended practice. For temporary testing purposes, however, it is okay to take your lakeFS config file, save it to your executable path (same as the .lakectl.yaml) and try out the lakeFS import command.
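If you’re unsure which of those default locations lakeFS would pick up on your machine, a quick check like the one below can help. It is only a sketch: find_lakefs_config is a name I made up, and I’m assuming the files are tried in the order listed, which the list itself doesn’t guarantee.

```shell
# Print the first default lakeFS config file that exists, checking
# the locations in the order listed above (the order is an assumption).
find_lakefs_config() {
  for candidate in \
    ./config.yaml \
    "$HOME/lakefs/config.yaml" \
    /etc/lakefs/config.yaml \
    "$HOME/.lakefs.yaml"
  do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  # Nothing found: signal failure so callers can fall back to --config.
  return 1
}

# Example:
# find_lakefs_config || echo "no default config found; pass --config explicitly"
```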
The lakeFS project is an open source technology that provides a git-like version control interface for data lakes, with seamless integration to popular data tools and frameworks.
Our mission is to maximize the manageability of open source data analytics solutions that scale.