Idan Novogroder

Last updated on November 5, 2023

When adopting a new technology in our organizational infrastructure, one of the foremost considerations is its initial cost. In other words: how many working hours will we have to invest to start using this technology in our system?

Often, this question will tip the scales in favor of using a certain solution over another. It makes sense to consider this, especially when it comes to new organizations and startups that are at a critical stage of proving their ability. This can be a significant question even in large and stable organizations that want to move quickly and avoid wasting valuable developer work on testing a new technology.

 
Given how heavily this initial cost weighs on adoption decisions, we invested significant effort in lakeFS’s import capabilities to reduce the friction of transitioning to lakeFS. This article explores some common scenarios to consider and apply when importing data into lakeFS.

Why import data?

Importing data into lakeFS offers an efficient way to bring in large volumes of data without physically duplicating it. When you import data, lakeFS creates pointers to your existing objects in the new repository. Alternatively, you can introduce data to lakeFS by copying it.

As a rule of thumb, if the data at the source location you’re ingesting from is expected to remain unaltered, importing it is a sensible approach. On the other hand, if you cannot guarantee that the files will remain static, we recommend you consider copying the files rather than importing them. 
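If you do decide to copy rather than import, the snippet below is a minimal sketch of one way to do it with boto3 and the legacy lakefs_client Python SDK (the same client used later in this article). The bucket, prefix, repository, and branch names are placeholders, and for large datasets this is exactly the slow, costly path that importing avoids.

import os
import tempfile

import boto3
import lakefs_client
from lakefs_client.client import LakeFSClient

# Placeholder endpoint and credentials -- replace with your own
configuration = lakefs_client.Configuration()
configuration.host = "http://localhost:8000"
configuration.username = "<lakeFS access key>"
configuration.password = "<lakeFS secret key>"
lakefs = LakeFSClient(configuration)

s3 = boto3.client("s3")

def copy_prefix(bucket, prefix, repo, branch):
    """Physically copy every object under an S3 prefix into a lakeFS branch."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Download the object to a temporary file, then upload its bytes to lakeFS
            with tempfile.NamedTemporaryFile(delete=False) as tmp:
                s3.download_fileobj(bucket, key, tmp)
                tmp_path = tmp.name
            with open(tmp_path, "rb") as f:
                lakefs.objects.upload_object(repository=repo, branch=branch, path=key, content=f)
            os.remove(tmp_path)

# Placeholder bucket, prefix, repository, and branch
copy_prefix("sample-data", "datasets/", "example-repo", "main")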

Common use cases for importing data include:

  1. Import an entire bucket to lakeFS: Copying millions of objects will be slow and costly. Use the import feature to start managing the whole of your data lake in lakeFS.
  2. Consolidate data from multiple locations: When you wish to work with a logical grouping of datasets distributed across various locations within a single lakeFS repository. This is common for data scientists who are training models using multiple datasets. 
  3. Continuous import: When you want to continuously introduce data from a landing zone into lakeFS to maintain version control.

Prerequisites

  • Importing is permitted for users in the Supers (lakeFS open-source) group or the SuperUsers (lakeFS Cloud/Enterprise) group.
  • The lakeFS server must have permissions to list the objects in the source bucket.
  • The source bucket must be in the same region as your repository.

Let’s dive deeper into these use cases and share a step-by-step guide on how to optimally import into lakeFS.

Use Case 1: Import an entire bucket to lakeFS

Let’s say I have a bucket that contains all my production data and I want to start managing it with lakeFS. Doing so brings order to my production data: I can commit my changes, trigger pipelines, and revert unwanted changes.

To achieve that in the lakeFS UI:

  1. On your repository’s main page, click the Import button to open the import dialog.
  2. Under Import from, fill in the URI of the bucket you want to import from.
  3. Click Import.

Alternatively, you can use the lakectl tool to perform the import from the command line:

lakectl import \
  --from s3://my-data-lake/ \
  --to lakefs://example-repo/main/
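The same whole-bucket import can also be started programmatically. Below is a minimal sketch using the lakeFS Python client’s import API (the same API used in the next use case); it assumes an authenticated LakeFSClient named lakefs and a repository object repo, and the destination prefix is a placeholder.

from lakefs_client.models import CommitCreation, ImportCreation, ImportLocation

# Assumes an authenticated LakeFSClient named `lakefs` and a repository object `repo`;
# the destination prefix is a placeholder
commit = CommitCreation(message="import my production data lake")
paths = [
    ImportLocation(type="common_prefix", path="s3://my-data-lake/", destination="production/"),
]
import_creation = ImportCreation(paths=paths, commit=commit)
create_resp = lakefs.import_api.import_start(repo.id, "main", import_creation)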

Use Case 2: Consolidate data from multiple locations

Sometimes we want to import objects from multiple buckets. For example, we may want to use lakeFS webhooks to run pipelines against multiple subsets from different data sources.

In the following example, we show how the lakeFS Python client allows for importing data from multiple locations into a single repository. We bring data from two different buckets, s3://sample-data1 and s3://sample-data2, into our repository.

# Assumes an authenticated LakeFSClient named `lakefs`, a repository object `repo`,
# and a destination branch name `sourceBranch` (see the complete source linked below)
commit = CommitCreation(message="consolidate my buckets")
paths = [
    ImportLocation(type="common_prefix", path="s3://sample-data1/", destination="dataset1/"),
    ImportLocation(type="common_prefix", path="s3://sample-data2/", destination="dataset2/"),
]
import_creation = ImportCreation(paths=paths, commit=commit)
create_resp = lakefs.import_api.import_start(repo.id, sourceBranch, import_creation)

When the process is completed, the data from each of the locations will be available under its own prefix in lakeFS.
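Note that import_start returns immediately and the import runs asynchronously. One way to find out when it has finished is to poll the import status, as the bonus example later in this post does. Here is a minimal sketch, assuming the lakefs client, repo, sourceBranch, and create_resp from the snippet above:

import time

# Assumes `lakefs`, `repo`, `sourceBranch`, and `create_resp` from the snippet above.
# Poll until the import either completes or reports an error.
while True:
    status = lakefs.import_api.import_status(repo.id, sourceBranch, create_resp.id)
    # `error` may be absent from the response when there is no error
    if hasattr(status, "error") and status.error:
        raise Exception(status.error)
    if status.completed:
        print("Import completed successfully into branch:", sourceBranch)
        break
    time.sleep(2)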

See the complete source code on GitHub.

Use Case 3: Continuous import

If your raw data is not yet managed in lakeFS, you may want to import it using a daily automated process.

Option 1: Daily append a new partition from the data source

def daily_job(dt):
    # Import a single daily partition (e.g. dt="2023-11-05") into a matching prefix
    commit = CommitCreation(message=f"import for date {dt}")
    paths = [ImportLocation(type="common_prefix", path=f"s3://sample-table/dt={dt}/", destination=f"dataset/dt={dt}/")]
    import_creation = ImportCreation(paths=paths, commit=commit)
    create_resp = lakefs.import_api.import_start(repo.id, sourceBranch, import_creation)
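A scheduler (cron, Airflow, and so on) can then call this function once a day with the partition date. For example, assuming the dt= partitions are ISO date strings:

from datetime import date, timedelta

# Hypothetical invocation from a nightly scheduled task:
# import yesterday's partition (assumes dt= partitions are ISO date strings)
daily_job((date.today() - timedelta(days=1)).isoformat())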

Option 2: Overwrite the whole data source every day

To sync the state of the data source to lakeFS, perform a simple import from the source to a path in lakeFS. The import process always cleans up the destination before syncing:

def daily_job(dt):
    # Re-import the entire source prefix; the destination is cleaned up before syncing
    commit = CommitCreation(message=f"import for date {dt}")
    paths = [ImportLocation(type="common_prefix", path="s3://sample-table/", destination="dataset/")]
    import_creation = ImportCreation(paths=paths, commit=commit)
    create_resp = lakefs.import_api.import_start(repo.id, sourceBranch, import_creation)

Bonus example 1: Import from public buckets

An interesting use case is importing data from publicly available datasets.

To perform such imports, lakeFS requires additional permissions to read from public buckets. The following AWS IAM policy allows an AWS user/role to access publicly accessible buckets, while granting no access to buckets in the user’s own account:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PubliclyAccessibleBuckets",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketVersioning",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["*"],
      "Condition": {
        "StringNotEquals": {
          "s3:ResourceAccount": "<YourAccountID>"
        }
      }
    }
  ]
}

Note: For a step-by-step walkthrough, try running the import-multiple-buckets notebook in our samples repo.

Bonus example 2: Import only objects with specific properties

Let’s say I want to import objects with specific properties, like tags, creation date, size, or owner. Here is an example of how to import objects with a specific tag using the lakeFS Python client:

#!/usr/bin/env python3

import lakefs_client
import boto3
import time
import os

from lakefs_client.models import *
from lakefs_client.client import LakeFSClient
from lakefs_client.exceptions import NotFoundException


lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io'
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

repo_name = "multi-bucket-import"

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

try:
    repo=lakefs.repositories.get_repository(repo_name)
    print("Found existing repo {repo_id} using storage namespace {repo_storage_namespace}".format(repo_id=repo.id, repo_storage_namespace=repo.storage_namespace))
except NotFoundException:
    print("Repository {repo_name} does not exist.".format(repo_name=repo_name))
    os._exit(1)  # exit with a non-zero status since the script cannot continue

sourceBranch = "main"

# Import Destinations
importDestination = "raw/" # will keep the original files in the raw directory

def get_paths_with_tag(bucket_name, prefix, tag_key, tag_value):
    # Create an S3 client
    s3 = boto3.client('s3')

    # List objects in the bucket (only the first page of up to 1,000 keys is handled here)
    objects = s3.list_objects_v2(Bucket=bucket_name, Prefix=prefix).get('Contents', [])

    # Initialize an empty list to store paths
    paths = []

    # Iterate through objects and check for the specified tag
    for obj in objects:
        # Get the tags for each object
        tags = s3.get_object_tagging(Bucket=bucket_name, Key=obj['Key'])['TagSet']

        # Check if the specified tag exists and has the correct value
        if any(tag['Key'] == tag_key and tag['Value'] == tag_value for tag in tags):
            paths.append(obj['Key'])

    return paths

# Replace bucket_name, prefix, tag_key, and tag_value below with your actual values
bucket_name = 'sample-data'
prefix = 'stanfordogsdataset/'
tag_key = 'subset'
tag_value = '1'

result = get_paths_with_tag(bucket_name, prefix, tag_key, tag_value)

print("Objects with tag '{tag_key}:{tag_value}' in bucket '{bucket_name}':")
for path in result:
    print(path)

# Start Import
commit = CommitCreation(message="import objects", metadata={"key": "value"})
paths = []
for path in result:
    # For type="object", both path and destination point at a single object;
    # keep each original key under the importDestination prefix
    paths.append(ImportLocation(type="object",
                                path="s3://{bucket_name}/{path}".format(bucket_name=bucket_name, path=path),
                                destination=importDestination + path))
import_creation = ImportCreation(paths=paths, commit=commit)
create_resp = lakefs.import_api.import_start(repo.id, sourceBranch, import_creation)

# Wait for import to finish
while True:
    status_resp = lakefs.import_api.import_status(repo.id, sourceBranch, create_resp.id)
    print(status_resp)
    # `error` may be absent from the response when there is no error
    if hasattr(status_resp, "error") and status_resp.error:
        raise Exception(status_resp.error)
    if status_resp.completed:
        print("Import completed successfully. Data imported into branch:", sourceBranch)
        break
    time.sleep(2)

Summary

lakeFS provides several convenient ways (the UI, lakectl, and the Python and Java clients) to import data from your object storage into a lakeFS repository. It does so at a rate of thousands of objects per second, without copying the data. You can use it to import a single object, an entire bucket, a subset of your data, or multiple subsets from different buckets. If you’re looking for versioning capabilities for your data lake, give lakeFS a try.
