Paul Singman
October 26, 2021

Few people start using lakeFS without first having some data collected. Consequently, it is common that after getting it up and running, one of the first things people do is import their existing data to lakeFS.

There isn’t a one-size-fits-all approach for doing this. Instead, there are ways that work great for a single file, and some that are designed to handle millions of them.

Let’s walk through, in detail, how it’s done for each situation.

Ready? Let’s go.

Two Things to Know Before We Begin

You should know the values for two critical pieces of information before we go any further.

  1. Your lakeFS endpoint URL: This is the address of your lakeFS installation’s S3 Gateway. If testing locally, it will likely be http://localhost:8000. If you have a cloud-deployed lakeFS installation, you should have a DNS record pointing to the server, something like lakefs.example.com. Keep this value handy, since it’ll be used in several places.
  2. Your lakeFS credentials: These are the Key ID and Secret Key generated when you first set up lakeFS and downloaded a lakectl.yaml file. Alternatively, your lakeFS administrator can set up a user for you and send you the Key ID and Secret Key.

These credentials will be used in the following configuration files:

~/.aws/credentials
.lakectl.yaml

With that out of the way, let’s get started!

Single Local File Copy (AWS CLI)

The Situation — The marketing expert at your company sends you a CSV file of all customers he sent promotional emails to in the past month. You would like to add this file (which currently sits in your local Downloads folder) to your data lake for availability in potential analyses of these customers.

To do this, we’ll use the AWS CLI to copy the file into our lakeFS repo.

The Command

The following command copies a file called customer_promo_2021-11.csv in my local ~/Downloads folder onto the main branch of a lakeFS repository called my-repo under the path marketing/customer_promo_2021-11.csv.

aws --profile lakefs \
--endpoint-url https://penv.lakefs.dev \
s3 cp ~/Downloads/customer_promo_2021-11.csv s3://my-repo/main/marketing/customer_promo_2021-11.csv
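The destination URI packs the repository and branch into an S3-style path: the repository takes the place of the bucket, and the branch is the first path component. As a minimal sketch, the same command could be assembled in Python like this. The helper names lakefs_s3_uri and aws_cp_command are made up for this illustration, and the code only builds the argument list without running anything.

```python
from typing import List


def lakefs_s3_uri(repo: str, branch: str, path: str) -> str:
    """Build the S3-gateway URI for an object: s3://<repo>/<branch>/<path>."""
    return f"s3://{repo}/{branch}/{path.lstrip('/')}"


def aws_cp_command(local_file: str, repo: str, branch: str, path: str,
                   endpoint_url: str, profile: str = "lakefs") -> List[str]:
    """Return the AWS CLI argument list for copying a local file into lakeFS."""
    return [
        "aws", "--profile", profile,
        "--endpoint-url", endpoint_url,
        "s3", "cp", local_file, lakefs_s3_uri(repo, branch, path),
    ]


cmd = aws_cp_command(
    "~/Downloads/customer_promo_2021-11.csv",
    repo="my-repo",
    branch="main",
    path="marketing/customer_promo_2021-11.csv",
    endpoint_url="https://penv.lakefs.dev",
)
print(" ".join(cmd))
```

To actually run it, you would expand the ~ with os.path.expanduser and hand the list to subprocess.run, since no shell is involved to do the expansion for you.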

Configuration

In order for this to work, we need to make sure our ~/.aws/credentials file has an entry for lakeFS. Here’s what mine looks like:

[default]
aws_access_key_id=AKIAMYACTUALAWSCREDS
aws_secret_access_key=EXAMPLEj2fnHf73J9jkke/e3ea4D

[lakefs]
aws_access_key_id=AKIAJRKP6EXAMPLE
aws_secret_access_key=EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32

The command works by selecting the lakefs profile and overriding the --endpoint-url parameter of the AWS CLI, which directs the requests to your lakeFS installation instead of AWS.
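If you would rather not edit the credentials file by hand, here is a rough sketch that adds a profile with Python's standard configparser module. The function name add_aws_profile is hypothetical, and the demo writes to a scratch file rather than the real ~/.aws/credentials.

```python
import configparser
import tempfile
from pathlib import Path


def add_aws_profile(credentials_path: Path, profile: str,
                    key_id: str, secret_key: str) -> None:
    """Add or replace one profile while keeping the file's other profiles."""
    # interpolation=None keeps any '%' in a value from being treated specially.
    config = configparser.ConfigParser(interpolation=None)
    config.read(credentials_path)  # a missing file is simply skipped
    config[profile] = {
        "aws_access_key_id": key_id,
        "aws_secret_access_key": secret_key,
    }
    credentials_path.parent.mkdir(parents=True, exist_ok=True)
    with open(credentials_path, "w") as handle:
        config.write(handle)


# Demo against a scratch file so the real credentials file is untouched.
with tempfile.TemporaryDirectory() as scratch:
    path = Path(scratch) / "credentials"
    path.write_text("[default]\naws_access_key_id=AKIAMYACTUALAWSCREDS\n")
    add_aws_profile(path, "lakefs", "AKIAJRKP6EXAMPLE",
                    "EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32")
    check = configparser.ConfigParser(interpolation=None)
    check.read(path)
    profiles = check.sections()
    lakefs_key_id = check["lakefs"]["aws_access_key_id"]

print(profiles)
```

One caveat: configparser rewrites the whole file, so any comments in an existing credentials file would be lost. Back the file up first if that matters to you.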

Copy Data Without Copying Data (lakectl ingest)

As the lakeFS docs beautifully state:

The lakectl command line tool supports ingesting objects from a source object store without actually copying the data itself. This is done by listing the source bucket (and optional prefix), and creating pointers to the returned objects in lakeFS.

Note that unlike the AWS CLI file copy command above, this works for data already in an object store. We’ll show how to copy the same customer_promo_2021-11.csv file as last time. Instead of being on our local computer though, now it’ll be located in an S3 bucket named my-beautiful-s3-bucket.

The Command

The parameters for the lakectl ingest command are quite straightforward. We simply use the --from and --to params to point to the S3 prefix where the file(s) are located, and where in the lakeFS repo we want the objects to exist.

lakectl ingest \
--from s3://my-beautiful-s3-bucket/customer_promo_2021-11.csv \
--to lakefs://my-repo/main/marketing/customer_promo_2021-11.csv

This works for both single files and multiple files. If instead we had an S3 prefix /customer_promos/ in our beautiful S3 bucket with multiple CSV files, we could ingest all of them to lakeFS with the command:

lakectl ingest \
--from s3://my-beautiful-s3-bucket/customer_promos/ \
--to lakefs://my-repo/main/marketing/customer_promos/
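To picture what "creating pointers" means for a whole prefix, here is a sketch of the key mapping. It assumes each object's key, taken relative to the --from prefix, is appended to the --to path. It illustrates the idea rather than lakectl's actual implementation, and the function name and the sample keys are hypothetical.

```python
from typing import Dict, List


def plan_ingest(source_keys: List[str], from_prefix: str,
                to_uri: str) -> Dict[str, str]:
    """Map each source key under from_prefix to a destination URI under to_uri."""
    to_uri = to_uri.rstrip("/") + "/"
    plan = {}
    for key in source_keys:
        if not key.startswith(from_prefix):
            continue  # outside the prefix being ingested
        relative = key[len(from_prefix):]
        plan[key] = to_uri + relative
    return plan


# Hypothetical listing of the source bucket.
keys = [
    "customer_promos/2021-10.csv",
    "customer_promos/2021-11.csv",
    "unrelated/readme.txt",
]
plan = plan_ingest(keys, "customer_promos/",
                   "lakefs://my-repo/main/marketing/customer_promos/")
for src, dst in plan.items():
    print(src, "->", dst)
```

Only the keys under the prefix end up in the plan, and no object data is read at any point, which is the reason the approach scales better than copying.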

Configuration

In order for the lakectl ingest command to work, we need to make sure our command line is set up to run lakectl commands.

As part of the lakeFS tutorial series, I made a 4 minute video explaining exactly how to do this. 

Note that in the video I use the lakectl config helper command to configure lakectl. You could also edit the $HOME/.lakectl.yaml config file directly. Using the same example lakeFS host and credentials as before, my .lakectl.yaml file looks like:

credentials:
  access_key_id: AKIAJRKP6EXAMPLE
  secret_access_key: EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32
server:
  endpoint_url: https://penv.lakefs.dev

Of course, you would not use these exact values, but instead use your lakeFS credentials and lakeFS domain host.
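If you prefer to generate the file from a script, a small sketch might look like the following. The helper names are hypothetical, the values are assumed to be plain strings that need no YAML quoting, and the demo writes to a scratch directory. Because the file holds a secret, the sketch also restricts it to its owner.

```python
import os
import tempfile
from pathlib import Path


def render_lakectl_config(access_key_id: str, secret_access_key: str,
                          endpoint_url: str) -> str:
    """Return the YAML text for a minimal .lakectl.yaml."""
    return (
        "credentials:\n"
        f"  access_key_id: {access_key_id}\n"
        f"  secret_access_key: {secret_access_key}\n"
        "server:\n"
        f"  endpoint_url: {endpoint_url}\n"
    )


def write_private(path: Path, text: str) -> None:
    """Write text, then limit the file to its owner (read/write only)."""
    path.write_text(text)
    os.chmod(path, 0o600)


yaml_text = render_lakectl_config(
    "AKIAJRKP6EXAMPLE",
    "EXAMPLEYC5wcWOgF36peXniwEJn5kwncw32",
    "https://penv.lakefs.dev",
)

# Demo in a scratch directory; point this at your home directory for real use.
with tempfile.TemporaryDirectory() as scratch:
    write_private(Path(scratch) / ".lakectl.yaml", yaml_text)

print(yaml_text)
```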

Large-Scale Imports (lakeFS inventory imports)

For even larger data collections, the lakeFS binary comes pre-packaged with an import utility that can handle many, many millions of objects. It works by taking advantage of S3’s Inventory feature to create an efficient snapshot of your bucket.

The Command

The following command imports the data summarized from an S3 inventory stored in the bucket my-beautiful-s3-bucket-inventory to a lakeFS repository named my-repo.

lakefs import \
lakefs://my-repo \
-m s3://my-beautiful-s3-bucket-inventory/my-beautiful-bucket/my-beautiful-inventory/2021-10-25T00-00Z/manifest.json \
--config .lakefs.yaml

Note that we cannot save the inventory of an S3 bucket in the bucket itself, as this would create a recursive mess that is better discussed in the pages of Gödel, Escher, Bach.

Anyway, for the data stored in my-beautiful-s3-bucket that we want to import to lakeFS, we create a second bucket named my-beautiful-s3-bucket-inventory (though it could be called anything) and point the S3 Inventory to be stored there.

If the inventory import works, you’ll see a response in the terminal like this:

Inventory (2021-10-24) Files Read                     1 / 1    done
Inventory (2021-10-24) Current File                   1 / 1    done
Commit progress                                           0    done
Objects imported                                          1    done

Added or changed objects: 1
Commit ref:3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0
Import to branch import-from-inventory finished successfully.

To list imported objects, run:
	$ lakectl fs ls lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0/
To merge the changes to your main branch, run:
	$ lakectl merge lakefs://my-repo@3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0 lakefs://my-repo@main

And if you open the lakeFS UI, you’ll see a new import-from-inventory branch with the added objects on the latest commit.
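If you script the import, you may want to pick the commit reference out of that output instead of copying it by hand. The sketch below assumes the output keeps the "Commit ref:" line format shown above. The function names are hypothetical, and the merge command is built in the same form the import output suggests.

```python
import re
from typing import List, Optional


def commit_ref(import_output: str) -> Optional[str]:
    """Return the hash from the 'Commit ref:' line, or None if it is absent."""
    match = re.search(r"^Commit ref:\s*([0-9a-f]+)\s*$",
                      import_output, re.MULTILINE)
    return match.group(1) if match else None


def merge_command(repo: str, ref: str, branch: str = "main") -> List[str]:
    """Build the lakectl merge invocation in the form the import output suggests."""
    return ["lakectl", "merge",
            f"lakefs://{repo}@{ref}", f"lakefs://{repo}@{branch}"]


# Lines copied from the example output above.
sample = """Added or changed objects: 1
Commit ref:3c1e4222cf2ac89a5c3a9fdd99d106f8bf225e2a17ac013ffae6d19f844420d0
Import to branch import-from-inventory finished successfully.
"""

ref = commit_ref(sample)
if ref is not None:
    print(" ".join(merge_command("my-repo", ref)))
```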

Let’s walk through step-by-step creating this S3 Inventory, shall we?

Creating an S3 Inventory

Step 1: Go to the Management tab of the data bucket and click “Create Inventory Configuration”.

Step 2: Configure the Inventory. Fill in the inventory config name, which can be any value. For this example, I’ll go with my-beautiful-inventory .

If you want to limit the inventory to only a certain prefix within the bucket, you can specify that under Inventory Scope. I’ll leave it blank to capture the whole bucket.

Next, I’ll find my-beautiful-bucket-inventory from the S3 Browser and select it as the inventory destination. Note: If you recently created the destination bucket, it can take a few minutes for it to appear in the S3 Browser. Be patient.

Next, choose options for the inventory frequency, format, and status. I recommend Daily, Apache Parquet, and Enable. If you have sensitive data, you can turn Server-side encryption on; I’ll leave it off for this example.

For Additional Fields, as the lakeFS documentation states, make sure you check Size, Last modified, and ETag. Once checked, click “Create”.

Step 3: Wait! It takes around 24 hours for the first inventory report to be generated, after which we can run the lakefs import command. Eventually you will see that the inventory job ran, with a link to the generated manifest.json file.
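S3 Inventory writes each report under a predictable path in the destination bucket: the source bucket name, then the inventory configuration name, then a timestamp folder. As a small sketch, the manifest location passed to -m could be pieced together like this. The helper name is hypothetical, it assumes no destination prefix was configured, and the timestamp folder is taken as-is from what you see in the S3 console rather than computed.

```python
def manifest_uri(inventory_bucket: str, source_bucket: str,
                 config_name: str, timestamp_folder: str) -> str:
    """Build the S3 URI of an inventory report's manifest.json."""
    return (
        f"s3://{inventory_bucket}/{source_bucket}/"
        f"{config_name}/{timestamp_folder}/manifest.json"
    )


# Values taken from the lakefs import example earlier in this article.
uri = manifest_uri(
    inventory_bucket="my-beautiful-s3-bucket-inventory",
    source_bucket="my-beautiful-bucket",
    config_name="my-beautiful-inventory",
    timestamp_folder="2021-10-25T00-00Z",
)
print(uri)
```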

Configuration

Running lakeFS depends on a configuration file, which is documented in the lakeFS configuration reference. By default, lakeFS looks for the following configuration files:

./config.yaml
$HOME/lakefs/config.yaml
/etc/lakefs/config.yaml
$HOME/.lakefs.yaml
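To check which of these files exists on your machine, a quick sketch like the following can help. It assumes the locations are tried in the order listed and that the first existing file is the one used; that ordering is an assumption for illustration, not something confirmed here, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_paths(cwd: Path, home: Path) -> List[Path]:
    """The default locations from the list above, in the order shown."""
    return [
        cwd / "config.yaml",
        home / "lakefs" / "config.yaml",
        Path("/etc/lakefs/config.yaml"),
        home / ".lakefs.yaml",
    ]


def first_existing(paths: Iterable[Path]) -> Optional[Path]:
    """Return the first path that is an existing file, or None."""
    for path in paths:
        if path.is_file():
            return path
    return None


found = first_existing(candidate_paths(Path.cwd(), Path.home()))
print(found if found else "no lakeFS config file found in the default locations")
```

Passing --config, as in the import command earlier, sidesteps the lookup entirely by naming the file explicitly.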

Saving this file locally on a laptop in plaintext is not a recommended practice. For temporary testing purposes, however, it is okay to take your lakeFS config file, save it to your executable path (same as the .lakectl.yaml) and try out the lakefs import command.

Wrapping Up

Whether adding data big or small, I hope this article has been helpful for getting your lakeFS instance hydrated with data! Although we covered three ways to do so, it’s worth noting that two other methods exist – Rclone and distcp. Look out for future articles diving into how those work.
