Iddo Avneri

May 8, 2023

A step-by-step guide to running pipelines on Bronze, Silver, and Gold layers with lakeFS

Introduction

The Medallion Architecture is a data design pattern that organizes a data pipeline into three distinct tiers: bronze, silver, and gold. The bronze tier holds the raw data as it was ingested, while the silver and gold tiers each build on top of the previous tier, offering progressively more refined data.

The overall goal of the Medallion Architecture is to create a scalable, flexible, and maintainable system that can evolve over time to meet changing requirements.

One key benefit of the Medallion Architecture is that it lets you separate concerns and manage dependencies between tiers. By organizing the system into different tiers, developers can focus on specific areas of functionality, reducing the likelihood of conflicts and making it easier to test and deploy the system. Additionally, the Medallion Architecture can help improve performance, as each tier can be optimized for a specific purpose.

Another advantage is that it allows for incremental development and improvement. Developers can focus on building out the bronze tier first and then gradually add more advanced features in the silver and gold tiers. This approach can help ensure that the system meets the most critical requirements first while also giving the team flexibility to add additional features later on. 

The different layers are sometimes referred to as Landing / Staging / Curated, Raw / Transformed / Final, or Zone 0 / Zone 1 / Zone 2.

For example:

| Stage Name | Description | Example |
| --- | --- | --- |
| Zone 0 (Bronze): Raw / Landing | Initial data ingestion and storage as is, without processing or transformations | Saving logs from an application to a distributed file system |
| Zone 1 (Silver): Transformed / Staging | Cleaning, filtering, and transforming raw data into a more usable format | Parsing logs to extract useful information, joining data from different sources |
| Zone 2 (Gold): Final / Curated | Further processing and refinement of data to meet specific business requirements | Aggregating data by time intervals, enriching data with additional information |

Example scenarios by stage layer

When implementing this architecture with lakeFS, users may need to maintain separate physical storage for each stage. Even so, it is important to version control all changes made to each layer (bronze, silver, gold) and to maintain lineage between them, for several reasons.

Firstly, version control helps to keep track of changes made to the data over time, making it easier to troubleshoot issues and roll back changes if necessary. This is especially critical in complex applications where changes can have far-reaching effects.

Secondly, maintaining lineage between layers helps to ensure consistency and traceability across the system. By keeping track of which versions of each layer are used in production, developers can quickly identify the root cause of issues and make targeted fixes. This can help reduce downtime and improve overall system performance.

Finally, version control and lineage tracking improve collaboration and communication between developers working on different layers of the system. By providing a clear history of changes and dependencies, developers can work more efficiently and avoid conflicts. This is especially important in large teams where different people may be responsible for different layers of the system.

Overall, version control and lineage tracking are key practices for maintaining the integrity and reliability of complex data pipelines.

Approach

This guide outlines how to utilize lakeFS to achieve version control and lineage tracking of data through a data engineering pipeline. 

Utilize lakeFS to achieve version control and lineage tracking of data

The approach involves creating two separate repositories, one for raw data and the other for transformed data, which sit in different buckets. As data is promoted through the pipeline, commit metadata is used to reference the version of the upstream data source, allowing for lineage tracking for every dataset. 

By enabling this lineage, you can trace back to the data in the upstream bucket that was used to create the current dataset:

Lineage and data versioning for each individual file or dataset

Finally, the guide describes how to export the data in a human-readable format back to a gold bucket.

Implementation

Prerequisites

To follow this guide, you will need:

  1. A lakeFS server (you can either install one or spin one up for free on the cloud)
  2. Access to (or the ability to create) buckets on your object store. You will need a minimum of three buckets (bronze, silver, and gold).
  3. The sample-repo, which spins up a notebook you can configure to run against the lakeFS server.

Step 1: Writing the data into the bronze bucket

To set up the pipeline, the first step is to create a bronze data repository and a silver data repository. Data can be imported into the bronze data repository by creating an ingestion branch, uploading the data to the ingestion branch, committing the change, and then merging the ingestion branch into the main branch.

We will start by setting up some environment variables, and creating the bronze-data repository (note that we are using the python client API below, see full code in the sample-repo):

environment = 'dev' # Using environment as a variable FFU, see at end of this guide
lakefsEndPoint = 'https://mylakefscloud.us-east-1.lakefscloud.io' # you can spin up an instance on https://lakefs.cloud
lakefsAccessKey = 'YOURACCESSKEY'
lakefsSecretKey = 'YOURSECRETKEY'
bronzeRepo = environment + "-bronze"
bronzeRepoStorageNamespace = 's3://lakefs-' + environment + '-bronze'
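mainBranch = 'main'

# The branch and path names below are illustrative; the sample-repo uses its own values
bronzeIngestionBranch = 'ingest-bronze'
silverRepo = environment + "-silver"
silverRepoStorageNamespace = 's3://lakefs-' + environment + '-silver'
silverIngestionBranch = 'ingest-silver'
silverDataPath = 'transformed_data'

# The snippets in this guide assume a client created with the lakeFS Python client
# (pip install lakefs-client); a minimal sketch:
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint
client = LakeFSClient(configuration)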

# Using Python Client, see full example on the sample repo
client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=bronzeRepo,
        storage_namespace=bronzeRepoStorageNamespace,
        default_branch=mainBranch))

Similarly, we will spin up a silver repository. 

client.repositories.create_repository(
    repository_creation=models.RepositoryCreation(
        name=silverRepo,
        storage_namespace=silverRepoStorageNamespace,
        default_branch=mainBranch))

Once we have the repositories, we can either upload or import data into the bronze repository. It is common practice in lakeFS to import the data into a separate ingest branch:

Import the data into a separate ingest branch

Create an ingestion branch:

client.branches.create_branch(
    repository=bronzeRepo,
    branch_creation=models.BranchCreation(
        name=bronzeIngestionBranch,
        source=mainBranch))

Upload the object to the ingestion branch:

client.objects.upload_object(
    repository=bronzeRepo,
    branch=bronzeIngestionBranch,
    path=fileName, content=contentToUpload)
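
The fileName and contentToUpload variables used above are not defined in this excerpt; a minimal sketch of preparing them, assuming a local CSV file (the path and name are illustrative):

fileName = 'lakefs_test_data.csv'  # illustrative file name
contentToUpload = open(f'/data/{fileName}', 'rb')  # any file-like object opened in binary mode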

Commit the change on the ingest branch:

source = 'bronze'
target = lakefsEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + bronzeIngestionBranch + '&path=' + fileName

client.commits.commit(
    repository=bronzeRepo,
    branch=bronzeIngestionBranch,
    commit_creation=models.CommitCreation(
        message='Added my first file in ' + bronzeRepo + ' repository!',
        metadata={'using': 'python_api',
                  'source': source,
                  'target': target}))

Important

When committing the data, we have added metadata to the commit that includes source and target. In the example above, we are dynamically building the path to the file we are committing as the target. 

Finally, merge the ingestion branch into the bronze main branch:

client.refs.merge_into_branch(
    repository=bronzeRepo,
    source_ref=bronzeIngestionBranch,
    destination_branch=mainBranch)

In the lakeFS UI, you should be able to see the file you uploaded under the main branch of the bronze repository:

Uploaded repository including metadata of the source and target of the commit

This includes the metadata of the source and target of the commit. 

Step 2: Reading from the bronze bucket and transforming the data in the silver bucket, keeping lineage

To read data from lakeFS, you can use the S3A gateway (which you can set up by following the instructions in our sample repository). When we retrieve the data to be transformed, we also capture the commit ID of the branch we’re reading from using the following code:

bronzecommits = client.refs.log_commits(repository=bronzeRepo, ref=mainBranch, amount=1, objects=[fileName])

We’ll use bronzecommits later on to establish lineage between different versions of the data that reside in different buckets.
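
The df used in the transformation below is loaded through the S3A gateway; a minimal sketch, assuming a SparkSession that is already configured against the lakeFS endpoint (as set up in the sample-repo):

# Read the raw file from the bronze repository's main branch via the S3A gateway
df = spark.read.csv(f"s3a://{bronzeRepo}/{mainBranch}/{fileName}")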

As with ingestion, it’s a best practice in lakeFS to perform transformations on a separate branch. This approach facilitates rollbacks and lineage, and allows you to promote data atomically when a transformation involves multiple steps (i.e. multi-table transactions):

Pre-merge hook quality check. Perform transformations on a separate branch

Create an ingestion branch in the silver repository:

client.branches.create_branch(
    repository=silverRepo,
    branch_creation=models.BranchCreation(
        name=silverIngestionBranch,
        source=mainBranch))

Run the transformation. For simplicity, partition the data by a column:

newDataPath = f"s3a://{silverRepo}/{silverIngestionBranch}/{silverDataPath}"
df.write.partitionBy("_c0").mode("overwrite").csv(newDataPath)

And finally, commit the changes alongside the source and destination metadata:

source = lakefsEndPoint + '/repositories/' + bronzeRepo + '/object?ref=' + mainBranch + '&path=' + fileName
source_commit =  lakefsEndPoint + '/repositories/' + bronzeRepo + '/commits/' + bronzecommits.results[0].id
target = lakefsEndPoint + '/repositories/' + silverRepo + '/objects?ref=' + silverIngestionBranch + '&path=' + silverDataPath + '/'

client.commits.commit(
    repository=silverRepo,
    branch=silverIngestionBranch,
    commit_creation=models.CommitCreation(
        message='Added transformed data in ' + silverRepo + ' repository!',
        metadata={'using': 'python_api',
                  'source': source,
                  'source_commit': source_commit,
                  'target': target}))

Important

Notice how we use bronzecommits to create metadata pointing back to the commit ID of the bronze data set, i.e. the exact version of the bronze data used to create this transformed data within the silver repository.

Once the transformation is done, we can merge the data back into the silver repository’s main branch:

client.refs.merge_into_branch(
    repository=silverRepo,
    source_ref=silverIngestionBranch,
    destination_branch=mainBranch)

Reviewing this dataset in the lakeFS UI provides lineage across buckets. By clicking “Blame” for any transformed dataset, we can see in the commit metadata a link to the commit in the upstream bronze bucket that the data was derived from.
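
The same lineage is available programmatically; for example, a sketch that reads the source_commit key back from the commit we just created on the silver ingestion branch:

# Inspect the commit we created on the silver ingestion branch
silver_commit = client.refs.log_commits(
    repository=silverRepo, ref=silverIngestionBranch, amount=1).results[0]

# Its metadata links back to the exact bronze commit this dataset was built from
print(silver_commit.metadata['source_commit'])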

Step 3 (optional): Exporting the final dataset to a gold bucket

To make the latest production files accessible to data consumers who are not using lakeFS, you can export the data to an external bucket. It’s worth mentioning that you can still use the lakeFS S3 endpoint to access files from any branch or historical commit, as needed.
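
For example, a consumer who does have lakeFS access can read the dataset as of any ref through the S3A gateway; a minimal sketch (assuming the same Spark setup as above):

# Any lakeFS ref works in the path: a branch name, a tag, or a full commit ID
ref = mainBranch  # replace with a specific commit ID for a fully reproducible read
historical_df = spark.read.csv(f"s3a://{silverRepo}/{ref}/{silverDataPath}")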

Exporting data from lakeFS can be done in various ways, but one simple method is to use docker. To accomplish this, run the following command:

docker run -e LAKEFS_ACCESS_KEY_ID=YOURKEY \
-e LAKEFS_SECRET_ACCESS_KEY=YOURSECRET \
-e LAKEFS_ENDPOINT=https://your_lakefs.cloud.io \
-e AWS_ACCESS_KEY_ID=aaa -e AWS_SECRET_ACCESS_KEY=bbb \
-it  treeverse/lakefs-rclone-export:latest \
dev-silver s3://YOURS3BUCKETEXPORT/main/ --branch=main

This command exports the content of the last commit of the main branch of the dev-silver repository to your S3 bucket. The exported data is structured in a way that anyone can easily access it.
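
Once exported, consumers who don’t use lakeFS at all can read the gold copy straight from the object store; a minimal sketch with Spark (assuming the exported layout mirrors the repository paths):

# Read the exported, human-readable copy directly from the gold S3 bucket
gold_df = spark.read.csv(f"s3a://YOURS3BUCKETEXPORT/main/{silverDataPath}")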

Step 4 (optional): Separate Prod/QA/Test physical environments 

There are scenarios where companies need to have separate environments for testing, QA, or production. 

While lakeFS allows easy isolation of different environments using branching, some policies may require each environment to sit on a different bucket. To address this, we can use one or more lakeFS servers (depending on your policy) to version control the different environments:


Since the environment is a parameter in our notebook, we can easily change it and rerun the exact same pipeline on different environments. For instance:
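
A minimal sketch of that parameter change (the repository names are derived from the environment exactly as in Step 1):

# Re-running the pipeline against prod only requires changing the environment parameter
environment = 'prod'
bronzeRepo = environment + "-bronze"    # 'prod-bronze'
bronzeRepoStorageNamespace = 's3://lakefs-' + environment + '-bronze'
silverRepo = environment + "-silver"    # 'prod-silver'
silverRepoStorageNamespace = 's3://lakefs-' + environment + '-silver'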

After a simple parameter configuration, we can now run the same pipeline on the prod environment.

Summary

In this article, we explored how to leverage lakeFS to build scalable and reliable data pipelines, executing across different buckets. With its versioning capabilities and integration with popular data tools, lakeFS provides a solid foundation for managing complex data workflows.

By following the best practices outlined in this article, data engineers can easily create a pipeline that is versioned, testable, and reproducible. With lakeFS, data teams can confidently iterate on their pipelines, ensure data quality, and quickly adapt to changing business needs.
Take lakeFS for a spin and try it out yourself.
