The data mesh paradigm
The data mesh paradigm was first introduced by Zhamak Dehghani in her article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”.
Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, domain-specific data consumers and views data as a product. In this architecture, each domain is responsible for handling its own data pipelines.
The tissue connecting these domains and their associated data assets is a universal interoperability layer that applies the same infrastructure, syntax, and data standards.
I couldn’t agree more with the data mesh concept, as it applies the basic application development principles I believe in to the big data domain.
So why are we struggling to apply these principles to data naturally?
- We perceive our body of data as a monolith.
- We don’t perceive our data operations as products in the known application sense. In most organizations, data operations do not have an application development lifecycle that includes requirements, quality assurance, and resilience.
- We don’t know what roles are required to create our data products. The roles we have were defined with the technologies of the past decade in mind.
Data mesh serves as a wake-up call for data teams. It’s time to shift to the trusted paradigm we know to work so well in software applications.
Data infrastructure as a platform
In the organizational structure Dehghani presented, she established a team of data infra engineers responsible for the “Data Infra as a Platform”.
This team is responsible for providing self-service tooling to ingest, process, and serve data. Unlike databases, object storage can serve as infrastructure-as-a-platform thanks to its natural scalability, throughput, and cost effectiveness.
Each data service in the mesh can use the object storage in isolation without much effort from the infrastructure team, by using their own buckets, and protecting them with permissions. But is it enough to make the mesh work in a trusted way?
No. It isn’t.
What object storage is missing to support the mesh?
- Easy setup of a data development environment for each data service
- A method to ensure data governance and best practices across all services
- Continuous deployment of quality data for each data mesh service
In other words, the data mesh paradigm is missing necessary tools to implement application development best practices on object storage.
Setting up a data lake for a data mesh service
By using lakeFS, data infrastructure teams can provide each data mesh service with its own atomic, versioned data lake over the common object storage, without duplicating data or relying on an excessive number of permissions. Furthermore, lakeFS’s Git-like operations supply the missing capabilities, such as data governance and continuous deployment of quality data.
Our goal is to create a lakeFS repository for each data mesh service. This will allow each service to work in isolation, and publish high quality data to other services/consumers.
- Protect all existing data in your object storage with read-only permissions.
- For each data service, create a repository in lakeFS, and onboard its historical input and output data. This is a metadata operation. Data is not transported. If some data sets serve several services, they’ll be onboarded to several repositories.
- Onboarding data is a commit action to the master branch of each new repository. The historical data you decide to onboard is the first version of the master branch in the repository.
- Create an onboarding script for each service, from the repositories of the services that provide its input, such that each run of the script is a new commit to the master branch, with the changes and updates to the input data.
- All set. Each service has the data it needs in isolation within its repository, with the ability to time travel between different versions of the input per commit. The master branch of the repository is now its single source of truth.
- Run the service data analysis processes that consume the input and produce the output over the lakeFS repository. Every new output is also committed to master and creates a new version that can be consumed by others directly.
- Now you need to set up a development environment and CI/CD for each data mesh service, to ensure efficient work and high quality outputs.
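The onboarding steps above can be illustrated with a minimal sketch. It is a hypothetical in-memory model, not the lakeFS API: `OBJECT_STORE`, `Repository`, and `onboard` are names invented for this example, and the paths are made up. It only aims to show why onboarding is a metadata operation (each commit stores pointers to existing objects) and how each commit on master stays readable later.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the shared object storage (address -> bytes).
OBJECT_STORE = {
    "s3://lake/sales/2021-01.parquet": b"jan",
    "s3://lake/sales/2021-02.parquet": b"feb",
}

@dataclass
class Repository:
    name: str
    # Each commit on master is a snapshot: logical path -> physical address.
    commits: list = field(default_factory=list)

    def commit(self, changes: dict) -> int:
        """Record a new master snapshot and return its commit index."""
        snapshot = dict(self.commits[-1]) if self.commits else {}
        snapshot.update(changes)
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def read(self, path: str, commit: int = -1) -> bytes:
        """Resolve a logical path as of a given commit (time travel)."""
        return OBJECT_STORE[self.commits[commit][path]]

def onboard(repo: Repository, prefix: str) -> int:
    """Commit pointers to every object under prefix; no bytes are copied."""
    pointers = {
        address[len(prefix):]: address
        for address in OBJECT_STORE
        if address.startswith(prefix)
    }
    return repo.commit(pointers)

sales = Repository("sales-service")
first = onboard(sales, "s3://lake/sales/")    # historical data
OBJECT_STORE["s3://lake/sales/2021-03.parquet"] = b"mar"
second = onboard(sales, "s3://lake/sales/")   # a later run of the onboarding script
```

In this model, the first commit still lists only the January and February files after the second run, which is the time travel property described above.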
Providing a development environment for a data mesh service
The goal is to allow developing changes to the service code, infrastructure or data in isolation, with minimal effort. The way to go about it would be:
- Create a branch from the repository master branch. “dev-environment” seems like a suitable name for it. You may wish to automate merges from master to it, so experiments can be conducted on any version of the master.
- The good practice for testing during development is to open a branch from the “dev-environment” branch and use it to experiment. Discard the branch once the experiment is done. You can conduct several experiments on the same branch sequentially using revert, or on different branches in parallel, where the results of the different experiments can be compared. Read more about this here.
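As a rough illustration of this branching scheme, here is a minimal sketch using a hypothetical in-memory `Repo` class. It does not call lakeFS, its merge ignores conflicts, and the file names and the “experiment-1” branch name are invented for the example.

```python
class Repo:
    """Hypothetical in-memory stand-in for a versioned repository."""

    def __init__(self, master_snapshot):
        self.branches = {"master": dict(master_snapshot)}

    def create_branch(self, name, source):
        # A new branch starts from the source branch's path -> value mapping.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, path, value):
        self.branches[branch][path] = value

    def merge(self, source, destination):
        # Simplified merge: source entries overwrite destination entries.
        self.branches[destination].update(self.branches[source])

    def delete_branch(self, name):
        del self.branches[name]

repo = Repo({"input/sales.csv": "v1"})
repo.create_branch("dev-environment", source="master")

# Run an experiment on a short-lived branch, then discard it.
repo.create_branch("experiment-1", source="dev-environment")
repo.write("experiment-1", "output/report.csv", "draft")
master_untouched = "output/report.csv" not in repo.branches["master"]
repo.delete_branch("experiment-1")

# Keep dev-environment up to date with a newer master.
repo.write("master", "input/sales.csv", "v2")
repo.merge("master", "dev-environment")
```

The experiment writes never reach master or “dev-environment”, so discarding the branch is all the cleanup needed.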
Continuous integration of data to the repository
The goal here is to guarantee that new data sources, or updated data from existing sources, ingested into the repository adhere to quality and engineering (format, schema, etc.) specifications.
When we described the setup of the repository for a data mesh service, we said to onboard updates to the data from the input repositories directly to master. Scratch that: it’s bad practice, because once the data is in master it might cascade into the service’s data pipelines before you validate its quality, causing quality issues, data downtime, and slow recovery and root cause analysis.
The best practice is:
- Create a branch to ingest the data. Ideally each input data set has its own ingest branch. Give it a meaningful name. For example, “daily-sales-data”.
- Use pre-merge hooks to run tests on the data to ensure best practices and data quality. You can integrate existing testing tools like Great Expectations or Monte Carlo. If the tests pass, the data is merged to master. If a test fails, a monitoring system of your choice sends an alert.
- In case of a failed test, you now have a snapshot of the repository at the time of the failure. This allows you to find the root cause of the issue faster. No data is lost, and the failing data is never exposed to master. See the example here.
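The gate described in these steps can be sketched as follows. This is a simplified, hypothetical model rather than a lakeFS hook definition: branches are plain lists of rows, `EXPECTED_SCHEMA` is an invented schema, and `validate` stands in for whatever testing tool you integrate.

```python
EXPECTED_SCHEMA = {"order_id": int, "amount": float}  # hypothetical schema

def validate(rows):
    """Return a list of problems; an empty list means the data passed."""
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            problems.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        for column, expected_type in EXPECTED_SCHEMA.items():
            if not isinstance(row[column], expected_type):
                problems.append(f"row {i}: {column} is not {expected_type.__name__}")
    return problems

def pre_merge(branches, ingest_branch, alerts):
    """Merge the ingest branch into master only if validation passes."""
    problems = validate(branches[ingest_branch])
    if problems:
        # The ingest branch is left as-is so the failure can be inspected.
        alerts.append((ingest_branch, problems))
        return False
    branches["master"] = branches["master"] + branches[ingest_branch]
    return True

branches = {"master": [], "daily-sales-data": [{"order_id": 1, "amount": 9.5}]}
alerts = []
good = pre_merge(branches, "daily-sales-data", alerts)

# A later ingest with a wrongly typed order_id is rejected before reaching master.
branches["daily-sales-data"] = [{"order_id": "2", "amount": 3.0}]
bad = pre_merge(branches, "daily-sales-data", alerts)
```

After the second call, master still holds only the valid row, and the rejected rows remain on the ingest branch for root cause analysis.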
Continuous deployment of data to the repository
The goal here is to ensure the data the service provides to other services/consumers is of high quality. A complex data service may run a DAG of hundreds of small jobs, with a total run time of a few hours.
To create an automated continuous deployment environment for a data service we need three components: version control (lakeFS), orchestration (Airflow, Dagster, or compatible), and a testing framework.
This setup works as follows:
- The orchestrator runs the DAG on a dedicated branch (we will call it the DAG branch from here on).
- Each job is performed on a branch that is created from the DAG branch.
- Once the job completes, a webhook is triggered and runs the relevant test to ensure the quality of the data.
- If the test passes, the data of the job is merged automatically into the DAG branch, and the next job starts.
- If the test fails, a webhook creates an event in an alerting system with all relevant data. The DAG stops running.
- Once the DAG has completed and every job has passed its tests, the data it produced is merged back to master. It can now be consumed by other services, or exported from the object storage to a serving layer interface.
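The flow above can be sketched in a few lines. This is an illustrative model only: branches are plain dictionaries, `run_dag` is a hypothetical helper rather than an Airflow, Dagster, or lakeFS call, and the jobs and tests are toy lambdas.

```python
def run_dag(jobs, master):
    """Run jobs in order on a DAG branch; merge to master only if all tests pass.

    jobs: list of (name, job_fn, test_fn). job_fn receives the job branch data
    and returns new entries; test_fn receives those entries and returns a bool.
    Returns (new_master, alert), where alert is None on success.
    """
    dag_branch = dict(master)            # dedicated branch for the whole run
    for name, job_fn, test_fn in jobs:
        job_branch = dict(dag_branch)    # branch created from the DAG branch
        output = job_fn(job_branch)
        job_branch.update(output)
        if not test_fn(output):
            # Stop the DAG: master is unchanged and the snapshot is kept.
            return master, {"failed_job": name, "snapshot": job_branch}
        dag_branch = job_branch          # merge the job into the DAG branch
    return dag_branch, None              # merge the DAG branch into master

master = {"raw": [1, 2, 3]}
jobs = [
    ("double", lambda d: {"doubled": [x * 2 for x in d["raw"]]},
     lambda out: len(out["doubled"]) == 3),
    ("total", lambda d: {"total": sum(d["doubled"])},
     lambda out: out["total"] > 0),
]
new_master, alert = run_dag(jobs, master)

# A job whose output fails its test leaves master untouched.
failing = [("empty", lambda d: {"rows": []}, lambda out: len(out["rows"]) > 0)]
unchanged, failure = run_dag(failing, master)
```

Only a fully successful run changes what consumers see on master. A failed run returns the name of the failing job and a snapshot that an alerting system could attach to its event.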
Two things are important to notice about the practice of continuous integration and continuous deployment of data:
- The master of each repository becomes a trusted single source of truth. Data is validated before it is saved into it, whether it’s input data, intermediate results, or output data the service provides to the organization.
- Data is tested on both sides of the interface: once before it is published to master as production data on the providing service side, and once when it is onboarded to the repository on the consuming service side.
Because lakeFS allows Git-like operations over object storage, it is a natural enabler of the data mesh for data infrastructure teams. The practices laid out here are part of what lakeFS has to offer to the data mesh, and the same practices can also be implemented with lakeFS using slightly different branching schemas.
Once you install lakeFS and play around with it, you’ll find the practice that best suits your needs and the needs of your data services developers, both on the ingesting and the consuming side of the data mesh. For more information, check out our GitHub repository and documentation.