Continuous integration (CI) of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment (CD) of data ensures the quality of data at each step of a production pipeline. Application developers have long relied on these practices. In this article we will review the benefits of CI/CD for data pipelines, learn how to implement CI/CD for data lakes using lakeFS hooks, and share some best-practice examples.
The Benefits of CI/CD for data pipelines
Data pipelines feed processed data from data lakes to downstream consumers like business dashboards and machine learning models. As more and more organizations rely on data to enable critical business decisions, data reliability and trust are of paramount concern. Continuous integration and continuous deployment of data provide a faster pace of testing and deployment, allowing for a more agile workflow and less turnaround time for changes and new releases.
How do I Achieve CI/CD for data pipelines with lakeFS?
lakeFS provides a Git-like data version control system, making the implementation of CI/CD pipelines for data simpler. lakeFS offers a feature called ‘hooks’ that automates checks and validations of data on lakeFS branches. These checks can be triggered by data operations such as committing or merging.
What exactly are lakeFS Hooks?
Functionally, lakeFS hooks are similar to Git hooks. Unlike Git hooks, however, lakeFS hooks run remotely on a server, and they are guaranteed to run when the appropriate event is triggered.
Here are some examples of the events on which lakeFS hooks can run:
- pre-commit: triggered before a commit is acknowledged
- pre-merge: triggered before a merge operation is applied
By leveraging the pre-commit and pre-merge hooks with lakeFS, you can implement CI/CD pipelines on your data lakes.
Specific trigger rules, quality checks, and the branch on which the rules are to be applied are declared in an actions.yaml file. When a specific event (say, pre-merge) occurs, lakeFS runs all the validations declared in the actions.yaml file. If any validation fails, the merge is blocked.
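For illustration, a minimal actions.yaml could look like the sketch below. The action name, branch, hook id, and webhook URL are placeholders; refer to the lakeFS hooks documentation for the full schema.

```yaml
name: pre merge format check on main
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: format_validator
    type: webhook
    description: Allow only Parquet and Delta Lake files
    properties:
      # URL of the webhooks server (placeholder)
      url: "http://lakefs-hooks:5001/webhooks/format"
```

With this file in place, every merge into main triggers the webhook, and a failing check blocks the merge.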
Currently, lakeFS allows executing hooks when two types of events occur: pre-commit events that run before a commit is acknowledged, and pre-merge events that trigger right before a merge operation. For both event types, returning an error will cause lakeFS to block the operation from happening – and will return that failure to the requesting user.
This is an extremely powerful guarantee – we can now codify and automate the rules and practices that all data lake participants have to adhere to.
This guarantee is then made available to data consumers: you’re reading from production/important/collection/? Great – you’re guaranteed never to see breaking schema changes, and all data must have passed a known set of statistical quality checks. If it’s on the main branch, it’s safe.
CI/CD for Data pipelines: Using hooks as data quality gates
Hooks run on a remote server that can serve HTTP requests from the lakeFS server. lakeFS supports two types of hooks:
- Webhooks – run remotely on a web server (e.g., a Flask server in Python)
- Airflow hooks – a DAG of complex data quality checks/tasks that runs on an Airflow server
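As a sketch, an Airflow hook is declared in the same actions.yaml format as a webhook; the Airflow URL, DAG id, and credentials below are placeholders (check the lakeFS hooks documentation for the exact property names):

```yaml
hooks:
  - id: airflow_quality_checks
    type: airflow
    properties:
      url: "http://localhost:8080"    # Airflow webserver (placeholder)
      dag_id: "data_quality_checks"   # DAG to trigger (placeholder)
      username: "admin"
      password: "admin"
```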
In the example below, we will show how to use webhooks (a Python Flask web server) to implement quality gates on your data branches – specifically, how to configure hooks to allow only Parquet and Delta Lake format files in the main branch.
This example uses an existing lakeFS environment (lakeFS running on everything bagel docker container), python flask server running on a docker container, a Jupyter notebook and sample data sets to demonstrate the integration of lakeFS hooks with Apache Spark and Python.
To understand how hooks work and how to configure hooks in your production system, refer to the documentation: Hooks. To configure lakeFS hooks with custom quality check rules, refer to the lakefs-hooks repository.
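To make the quality gate concrete, here is a sketch of the policy check such a webhook might run on the paths involved in a merge. The suffix rules, function name, and marker string are illustrative – they are not the actual lakefs-hooks implementation, and the HTTP wiring (Flask route, lakeFS payload parsing) is omitted.

```python
# Sketch of a format-enforcement check: allow only Parquet files and
# Delta Lake internals. Names and rules here are illustrative.
ALLOWED_SUFFIXES = (".parquet",)
DELTA_LOG_MARKER = "_delta_log/"

def disallowed_paths(paths):
    """Return the paths that are neither Parquet files nor Delta Lake internals."""
    return [
        p for p in paths
        if not p.endswith(ALLOWED_SUFFIXES) and DELTA_LOG_MARKER not in p
    ]

# A webhook would return a non-2xx HTTP response (blocking the merge)
# whenever this list is non-empty.
print(disallowed_paths([
    "tables/events/part-0000.parquet",
    "tables/events/_delta_log/000000.json",
    "raw/events.csv",
]))  # → ['raw/events.csv']
```

In the real setup, this kind of check runs inside the Flask server from the lakefs-hooks repository, which receives the event payload from lakeFS and answers with success or failure.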
CI/CD for Data pipelines: Implementing CI/CD pipelines with lakeFS
Before we get started, make sure Docker is installed on your machine.
Setup Webhooks server
First, recall that lakeFS webhooks need a remote server to serve HTTP requests from lakeFS. So let us set up the Python Flask server for the webhooks.
Start by cloning the lakeFS-hooks repository and building the hooks image. Run the following in your terminal:

```shell
git clone https://github.com/treeverse/lakeFS-hooks.git && cd lakeFS-hooks/
docker build -t <lakefs-hooks-image-name> .
```
Once you have built the <lakefs-hooks-image-name> image, you can add it to the lakeFS everything bagel Docker Compose file. This runs the Flask server and the lakeFS server as separate containers on the same Docker network. This simplifies the setup, but it is not mandatory to run both services in the same Compose environment.
Setup lakeFS server
To get a local lakeFS instance running on docker, you can use everything bagel docker. Let us start with cloning the lakeFS repository.
The Python Flask server image we built in the section above needs to be added to the everything bagel docker-compose.yaml file, so add the following contents to that file. For a lakeFS instance running on everything bagel, the lakeFS endpoint, access key ID, and secret key can be found in the docker-compose.yaml file under the lakefs section.
```yaml
lakefs-webhooks:
  image: <lakefs-hooks-image-name>
  container_name: lakefs-hooks
  ports:
    - 5001:5001
  environment:
    - LAKEFS_SERVER_ADDRESS=<lakefs_server_endpoint>
    - LAKEFS_ACCESS_KEY_ID=<lakefs_access_key_id>
    - LAKEFS_SECRET_ACCESS_KEY=<lakefs_secret_key>
```
Start the containers to run the lakeFS server and the hooks server:

```shell
docker compose up -d
```
The next step is to configure the custom logic for your hooks server and to configure the hooks in lakeFS.
CI/CD for Data pipelines – You’re Almost There: A few other useful examples
Webhooks allow users to add any custom behavior or validation they wish. A few useful examples:
- Metadata validation: On merge, ensure new tables or partitions are registered in a metastore
- Reproducibility guarantee: All writes to production/tables/ must also add commit metadata fields describing the Git commit hash of the job that produced them
- Schema Enforcement: Allow only backwards-compatible changes to production table schemas. Disallow certain column names that expose personally-identifiable information from being written to paths that shouldn’t contain them
- Data Quality checks: Ensure the data itself passes a set of statistical tests to verify its quality and avoid issues downstream
- Format enforcement: Ensure the organization standardizes on columnar formats for analytics
- Partial or corrupted data validation: make sure partitions are complete and have no duplications before merging them to production
Running CI/CD pipelines for data can be done rather easily. With lakeFS you can roll back the entire lake with the click of a button to minimize errors. Complex jobs spanning multiple tables/pipelines can be grouped into a single “transaction” to allow an easy undo across all affected tables, and you can easily prevent exposing bad or partially processed data to data consumers.
With the ability to branch data, you can easily achieve advanced use cases such as running parallel pipelines with different logic to experiment or conduct what-if analysis, compare large result sets for data science and machine learning, and more.