Amit Kesarwani Author

Last updated on July 26, 2023

Introduction

If you want to migrate or clone repositories from a source lakeFS environment to a target lakeFS environment then follow this tutorial. Your source and target lakeFS environments can be running locally or in the cloud.

You can also follow this tutorial if you want to migrate/clone a source repository to a target repository within the same lakeFS environment.

If you have multiple repositories, run the following steps for each repository.

You can also use a sample Python notebook instead of this tutorial to migrate or clone a lakeFS repository.

Prerequisites

  1. A source and a target lakeFS environment (you can deploy lakeFS yourself or use the hosted solution, lakeFS Cloud)
  2. lakectl CLI
  3. Object storage for both source and target repositories
  4. azcopy, the AWS CLI, or any other tool of your choice for copying data between object stores (if your object storage is on Azure or AWS)

Step 1 – Commit Changes

  1. Commit any uncommitted data in your source repository. Refer to the sample Python notebook if you want to find uncommitted data programmatically.

Step 2 – Dump Metadata of Source Repository

  1. Dump the source repository's metadata by using the following lakectl command:
lakectl refs-dump lakefs://source-repo-name
  2. The command takes a few seconds. It dumps the refs (branches, commits and tags) to the underlying object store (where the lakeFS repository data is stored) by creating a refs_manifest.json file inside the _lakefs folder.
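
If you need to repeat the dump for several repositories, the call can be scripted. The following is a minimal Python sketch, not part of the official tooling: the helper names `refs_dump_command` and `dump_refs` and the `dry_run` switch are illustrative assumptions, and only the `lakectl refs-dump` command itself comes from this tutorial.

```python
import shlex
import subprocess


def refs_dump_command(repo_name):
    """Build the `lakectl refs-dump` command for one repository (hypothetical helper)."""
    return ["lakectl", "refs-dump", f"lakefs://{repo_name}"]


def dump_refs(repo_names, dry_run=True):
    """Print, and optionally run, the dump command for each repository."""
    for name in repo_names:
        cmd = refs_dump_command(name)
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops the loop if lakectl exits with an error
            subprocess.run(cmd, check=True)


# Dry run: only prints the command that would be executed
dump_refs(["source-repo-name"])
# prints: lakectl refs-dump lakefs://source-repo-name
```

Call `dump_refs([...], dry_run=False)` once the printed commands look right. This assumes lakectl is on your PATH and already configured for the source environment.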

Step 3 – Copy Data from Source to Target

Source and Target on Azure

If you are using Azure, copy the data from the source storage container to the target storage container by using the following azcopy command (refer to the azcopy doc to download azcopy and get started). You can authorize azcopy with Azure AD or with a Shared Access Signature (SAS) token. See the "Create SAS tokens for your storage containers" doc to create a SAS token.

azcopy copy 'https://source-storage-account-name.blob.core.windows.net/sourceContainer?SAS-Token' 'https://target-storage-account-name.blob.core.windows.net/targetContainer?SAS-Token' --recursive
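
The container URLs in the command above follow a fixed pattern: the storage account host, the container name, then the SAS token after a `?`. As a rough sketch of how these could be assembled in a script, the Python below builds both URLs and the azcopy argument list. The function names and the sample token values are hypothetical placeholders, not real credentials or an official API.

```python
def container_url(account, container, sas_token=""):
    """Return a blob container URL, appending the SAS token as the query string."""
    url = f"https://{account}.blob.core.windows.net/{container}"
    # Tolerate a token that was pasted with its leading "?"
    token = sas_token.lstrip("?")
    return f"{url}?{token}" if token else url


def azcopy_copy_command(source_url, target_url):
    """Build the argument list for a recursive container-to-container copy."""
    return ["azcopy", "copy", source_url, target_url, "--recursive"]


source = container_url("source-storage-account-name", "sourceContainer", "SOURCE-SAS-TOKEN")
target = container_url("target-storage-account-name", "targetContainer", "TARGET-SAS-TOKEN")
command = azcopy_copy_command(source, target)
print(command)
```

Passing the list to `subprocess.run(command, check=True)` would execute the copy, assuming azcopy is installed. Building an argument list instead of a shell string avoids quoting problems with the `&` characters that SAS tokens contain.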

Source and Target on AWS

If you are using AWS, copy the data from the source bucket to the target bucket by using a command like this (refer to the AWS doc):

aws s3 sync s3://sourceBucket s3://targetBucket

Source Locally and Target on Cloud

If your source lakeFS environment is running locally, by default it keeps the data under the ~/lakefs/data/block folder. The location is controlled by the blockstore.local.path configuration setting (environment variable LAKEFS_BLOCKSTORE_LOCAL_PATH).

The data for a local repository is stored in the folder that you specify when you create the repository. For example, if you ran the following command to create the repository, the data is saved in ~/lakefs/data/block/localSourceFolder:

lakectl repo create lakefs://source-repo-name local://localSourceFolder
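
To make the mapping between the storage namespace and the folder on disk concrete, here is a small Python sketch. The helper name `local_data_dir` is hypothetical, and the default block path is the one stated above; adjust it if your installation uses a different location.

```python
from pathlib import Path


def local_data_dir(storage_namespace, block_path="~/lakefs/data/block"):
    """Map a local:// storage namespace to the folder that holds its data."""
    prefix = "local://"
    if not storage_namespace.startswith(prefix):
        raise ValueError(f"not a local storage namespace: {storage_namespace}")
    # The part after local:// is a folder underneath the block path
    return Path(block_path).expanduser() / storage_namespace[len(prefix):]


print(local_data_dir("local://localSourceFolder"))
```

The printed path is the folder to copy in the steps that follow.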

Copy all files from the local source folder to the target storage (which can be either local or in the cloud) by following these steps:

If you are running lakeFS in a Docker container, first copy the files from the container to the host:

docker cp lakefs-container-name:/home/lakefs/lakefs/data/block/localSourceFolder/ localDownloadedSourceFolder/

If you are going to use AWS for the target repo, copy the data to the target bucket by using a command like this (refer to the AWS doc):

aws s3 sync ./localDownloadedSourceFolder/ s3://targetBucket

If you are going to use Azure for the target repo, copy the data to the target storage container by using the following azcopy command (refer to the azcopy doc to download azcopy and get started). You can authorize azcopy with Azure AD or with a Shared Access Signature (SAS) token.

cd localDownloadedSourceFolder
azcopy copy '*' 'https://storage-account-name.blob.core.windows.net/targetContainer?SAS-Token' --recursive

Step 4 – Create Target Bare Repository

  1. Configure lakectl to point to the target lakeFS environment. You can create multiple .lakectl.yaml files (e.g. .lakectl_source_lakefs.yaml and .lakectl_target_lakefs.yaml) to store the configurations for multiple lakeFS environments. When you run a lakectl command, add the --config flag to use a particular YAML file, e.g.
lakectl repo list --config .lakectl_target_lakefs.yaml
lakectl repo create-bare lakefs://target-repo-name s3://targetBucket  --config .lakectl_target_lakefs.yaml
  2. If you are using Azure, create a bare (empty) repo in the target lakeFS environment by using the following command:
lakectl repo create-bare lakefs://target-repo-name https://target-storage-account-name.blob.core.windows.net/targetContainer
  3. If you are using AWS, create a bare (empty) repo in the target lakeFS environment by using the following command:
lakectl repo create-bare lakefs://target-repo-name s3://targetBucket
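
For the per-environment configuration files mentioned in the first item of this step, the sketch below shows one way to generate them from Python. It assumes the usual lakectl configuration layout (a `credentials` section with `access_key_id` and `secret_access_key`, and a `server` section with `endpoint_url`). The function names, the endpoint and the key values are placeholders, so check the result against a file written by `lakectl config` in your own setup.

```python
from pathlib import Path

# Assumed layout of a lakectl configuration file; all values are placeholders.
CONFIG_TEMPLATE = """\
credentials:
  access_key_id: {access_key_id}
  secret_access_key: {secret_access_key}
server:
  endpoint_url: {endpoint_url}
"""


def render_lakectl_config(endpoint_url, access_key_id, secret_access_key):
    """Return the YAML text for one lakeFS environment."""
    return CONFIG_TEMPLATE.format(
        endpoint_url=endpoint_url,
        access_key_id=access_key_id,
        secret_access_key=secret_access_key,
    )


def write_lakectl_config(path, **settings):
    """Write the configuration file and restrict it to the current user."""
    path = Path(path)
    path.write_text(render_lakectl_config(**settings))
    path.chmod(0o600)  # the file contains a secret key
    return path


print(render_lakectl_config(
    endpoint_url="https://target-lakefs.example.com",
    access_key_id="TARGET-ACCESS-KEY-ID",
    secret_access_key="TARGET-SECRET-ACCESS-KEY",
))
```

A file written this way as .lakectl_target_lakefs.yaml can then be selected with the --config flag, as in the commands above.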

Step 5 – Restore Metadata to Target Repository

  1. If you are using Azure, restore the metadata to the target repo by using the following command (replace target-repo-name with the name of the repository that you created in the previous step):
az storage blob download --container-name targetContainer --name _lakefs/refs_manifest.json --account-name target-storage-account-name --account-key target-storage-account-key | lakectl refs-restore lakefs://target-repo-name --manifest -
  2. If you are using AWS, restore the metadata to the target repo by using the following command (replace target-repo-name with the name of the repository that you created in the previous step):
aws s3 cp s3://targetBucket/_lakefs/refs_manifest.json - | lakectl refs-restore lakefs://target-repo-name --manifest -
  3. The restore command takes a few seconds and restores the refs (branches, commits and tags) to the target repository.
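
The AWS variant of this step pipes the manifest from one command into another. If you prefer to drive it from a script, the same pipe can be reproduced with Python's subprocess module. This is a hedged sketch: the function name `restore_refs_from_s3` and the `dry_run` switch are illustrative, and the two commands are the ones shown above.

```python
import subprocess


def restore_refs_from_s3(bucket, repo_name, dry_run=True):
    """Stream refs_manifest.json from S3 into `lakectl refs-restore`."""
    download = ["aws", "s3", "cp", f"s3://{bucket}/_lakefs/refs_manifest.json", "-"]
    restore = ["lakectl", "refs-restore", f"lakefs://{repo_name}", "--manifest", "-"]
    if dry_run:
        # Return the two commands without running anything
        return download, restore
    # Connect the download's stdout to lakectl's stdin, like the shell pipe
    with subprocess.Popen(download, stdout=subprocess.PIPE) as producer:
        subprocess.run(restore, stdin=producer.stdout, check=True)
    if producer.returncode != 0:
        raise subprocess.CalledProcessError(producer.returncode, download)
    return download, restore


download, restore = restore_refs_from_s3("targetBucket", "target-repo-name")
print(" ".join(download), "|", " ".join(restore))
```

Unlike a plain shell pipe, this version also fails when the download command fails, so a missing manifest is not silently passed to lakectl as empty input.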

Step 6 – Use Target Environment

  1. Add users to the target lakeFS environment. If you are using Single Sign-On (SSO) on lakeFS Cloud, follow the SSO doc to set up SSO for the target lakeFS environment instead.
  2. Change your code/tools to use the target lakeFS environment and repository.
  3. Re-configure Garbage Collection (GC) if you defined GC for the source repository. On lakeFS Cloud, GC runs are managed automatically, but you still need to re-configure the GC rules for the target repository.

Wrap Up

In this tutorial we outlined the six key steps we recommend for migrating or cloning a lakeFS repository, using specific examples and a sample notebook to illustrate the steps. If you’re new to lakeFS, you can get started now by running it locally.
Already a fellow lakeFS-er? Share your experience and advice on our Slack Community!
