Follow this tutorial if you want to migrate or clone repositories from a source lakeFS environment to a target lakeFS environment. The source and target environments can run locally or in the cloud.
You can also follow this tutorial to migrate or clone a source repository to a target repository within the same lakeFS environment.
If you have multiple repositories, run the following steps for each repository.
You can also use a sample Python notebook instead of this tutorial to migrate or clone a lakeFS repository.
- A source and a target lakeFS environment (you can deploy them independently or use the hosted solution, lakeFS Cloud)
- lakectl CLI
- Object storage for both source and target repositories
- azcopy, the AWS CLI, or any other tool of your choice to copy data between object stores (if you are using object storage on Azure or AWS)
Step 1 – Commit Changes
- Commit any uncommitted data in your source repository. Refer to the sample Python notebook if you want to find uncommitted data programmatically.
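Before dumping metadata, you can check each branch for uncommitted changes and commit them from the CLI. A minimal sketch, assuming a branch named main and the placeholder repository name source-repo-name used throughout this tutorial:

```shell
# Show uncommitted changes on the branch (empty output means nothing to commit)
lakectl diff lakefs://source-repo-name/main

# Commit anything that is still uncommitted
lakectl commit lakefs://source-repo-name/main -m "Commit outstanding changes before migration"
```

Repeat for every branch that may have uncommitted data.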
Step 2 – Dump Metadata of Source Repository
- Dump the source repository's refs by using the following lakectl command:
lakectl refs-dump lakefs://source-repo-name
- The above command takes a few seconds and dumps the refs (branches, commits, and tags) to the underlying object store (where the lakeFS repository data is stored) by creating a refs_manifest.json file inside the _lakefs folder.
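Before moving on, you can verify that the manifest was written. A hedged example for a repository backed by S3, using the placeholder bucket name sourceBucket:

```shell
# List the _lakefs prefix; refs_manifest.json should appear here
aws s3 ls s3://sourceBucket/_lakefs/
```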
Step 3 – Copy Data from Source to Target
Source and Target on Azure
If you are using Azure, copy data from the source storage container to the target storage container by using the following azcopy command (refer to the azcopy doc to download and get started with azcopy). You can authorize azcopy with Azure AD or a Shared Access Signature (SAS) token. Use the Create SAS tokens for your storage containers doc to create a SAS token.
azcopy copy 'https://source-storage-account-name.blob.core.windows.net/sourceContainer?SAS-Token' 'https://target-storage-account-name.blob.core.windows.net/targetContainer?SAS-Token' --recursive
Source and Target on AWS
If you are using AWS, copy data from the source storage bucket to the target storage bucket by using a command like this (refer to the AWS doc):
aws s3 sync s3://sourceBucket s3://targetBucket
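If you want to preview what will be copied before transferring anything, aws s3 sync supports a dry run. A small sketch using the same placeholder bucket names:

```shell
# Print the operations sync would perform without copying any objects
aws s3 sync s3://sourceBucket s3://targetBucket --dryrun
```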
Source Locally and Target on Cloud
If your source lakeFS environment is running locally, by default it keeps the data under the ~/lakefs/data/block folder. This location is controlled by a configuration/environment variable.
Data for a local repository is found inside the folder you specify when you create the repository. For example, if you ran the following command to create a repository, the data is saved in ~/lakefs/data/block/localSourceFolder:
lakectl repo create lakefs://source-repo-name local://localSourceFolder
Copy all files from the local source folder to the target storage (which can be local or in the cloud) by following these steps:
If you are running lakeFS in a Docker container, you can copy the files from the container to the host first:
docker cp lakefs-container-name:/home/lakefs/lakefs/data/block/localSourceFolder/ localDownloadedSourceFolder/
If you are going to use AWS for the target repo, copy the data to the target storage bucket by using a command like this (refer to the AWS doc):
aws s3 sync ./localDownloadedSourceFolder/ s3://targetBucket
If you are going to use Azure for the target repo, copy the data to the target storage container by using the following azcopy command (refer to the azcopy doc to download and get started with azcopy). You can authorize azcopy with Azure AD or a Shared Access Signature (SAS) token.
cd localDownloadedSourceFolder
azcopy copy '*' 'https://storage-account-name.blob.core.windows.net/targetContainer?SAS-Token' --recursive
Step 4 – Create Target Bare Repository
- Configure lakectl to point to the target lakeFS environment. You can create multiple .lakectl.yaml files (e.g. .lakectl_source_lakefs.yaml and .lakectl_target_lakefs.yaml) to store configurations for multiple lakeFS environments. When you run any lakectl command, add the --config option flag to use a particular YAML file, e.g.:
lakectl repo list --config .lakectl_target_lakefs.yaml
lakectl repo create-bare lakefs://target-repo-name s3://targetBucket --config .lakectl_target_lakefs.yaml
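For reference, a lakectl configuration file has the following shape. This is a minimal sketch of a hypothetical .lakectl_target_lakefs.yaml; the endpoint URL and keys below are placeholders to be replaced with your target environment's values:

```yaml
# Hypothetical configuration for the target lakeFS environment
credentials:
  access_key_id: AKIAIOSFODNN7EXAMPLE    # placeholder access key
  secret_access_key: placeholder-secret-access-key
server:
  endpoint_url: https://target-lakefs.example.com/api/v1  # placeholder endpoint
```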
- If you are using Azure, create a bare/blank repo in the target lakeFS environment by using the following command:
lakectl repo create-bare lakefs://target-repo-name https://target-storage-account-name.blob.core.windows.net/targetContainer
- If you are using AWS, create a bare/blank repo in the target lakeFS environment by using the following command:
lakectl repo create-bare lakefs://target-repo-name s3://targetBucket
Step 5 – Restore Metadata to Target Repository
- If you are using Azure, restore the metadata to the target repo by using the following command (replace target-repo-name with the repository name that you created in the previous step):
az storage blob download --container-name targetContainer --name _lakefs/refs_manifest.json --account-name target-storage-account-name --account-key target-storage-account-key | lakectl refs-restore lakefs://target-repo-name --manifest -
- If you are using AWS, restore the metadata to the target repo by using the following command (replace target-repo-name with the repository name that you created in the previous step):
aws s3 cp s3://targetBucket/_lakefs/refs_manifest.json - | lakectl refs-restore lakefs://target-repo-name --manifest -
- The above command takes a few seconds and restores the refs (branches, commits, and tags) to the target repository.
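To confirm the restore succeeded, you can list the branches and commit history in the target repository. A hedged example, assuming a main branch and the .lakectl_target_lakefs.yaml config file from Step 4:

```shell
# Branches from the source repository should now appear in the target
lakectl branch list lakefs://target-repo-name --config .lakectl_target_lakefs.yaml

# Commit history should match the source repository
lakectl log lakefs://target-repo-name/main --config .lakectl_target_lakefs.yaml
```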
Step 6 – Use Target Environment
- Add users in the target lakeFS environment. If you are using Single Sign-On (SSO) on lakeFS Cloud, follow the SSO doc to implement SSO for the target lakeFS environment.
- Change your code/tools to use the target lakeFS environment and repository.
- Re-configure Garbage Collection (GC) if you defined GC rules for the source repository. If you are using lakeFS Cloud, GC is managed automatically, but you still need to re-configure the GC rules for the target repository.
In this tutorial we outlined the six key steps we recommend for migrating or cloning a lakeFS repository, with specific examples and sample notebooks to help visualize each step. If you're new to lakeFS, you can get started now by running it locally.
Already a fellow lakeFS-er? Share your experience and advice on our Slack Community!