The lakeFS team
August 10, 2022

Organizations have practically always needed data analytics, and they jumped on the analytics bandwagon as soon as the first computers appeared on the scene. In the 80s, businesses built data warehouses using relational databases as their decision-support systems (DSS). However, as companies generated more diverse data at high velocity, relational databases showed their limitations. 

This brought us to the 2000s and the Big Data trend. New solutions emerged to let teams analyze large volumes of diverse data generated with great velocity. Modern architectural and analytical patterns combined data warehousing with more recent Big Data technologies.

Still, organizations often ran into problems when deploying such analytical solutions. These were largely monolithic solutions, where a single team serves as the platform provider and handles data integration. That may work for smaller companies with a high degree of centralization. In large organizations, this setup quickly became a bottleneck, causing a massive backlog in data integration services and analytical solutions.

Here’s the lesson learned from all these decades of handling data analytics: 

Having a single team perform data ingestion on a single platform is a bad idea in large organizations. Most of them are decentralized and distributed from a business perspective, so experts are often spread out across various sectors. The old setup just doesn’t work.

This is where a new architectural pattern called data mesh comes in. The objective of data mesh is to allow distributed teams to work with and share information in a decentralized and agile way. By implementing data mesh principles such as multi-disciplinary teams that publish and consume data products, companies get to reap many benefits enabled by their data.

But what exactly is data mesh? How does it work? And how do you set up a data lake for data mesh? Keep reading this article to find out.


What is data mesh?

Data mesh is a pattern for implementing data platforms that helps to scale analytics adoption beyond a single platform and a single implementation team. 

First introduced by Zhamak Dehghani in How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, data mesh facilitates distributed data pipelines. Contrary to traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, data mesh supports distributed domain-specific data consumers that view data as a product.

Source: MartinFowler

What is the tissue connecting domains and their associated data assets? It’s a universal interoperability layer that applies the same infrastructure, syntax, and data standards.

Data mesh architecture: Understanding the core concepts

These four concepts are foundational for understanding data mesh architecture:

  • Data domains – this concept comes from Domain Driven Design (DDD), a software development paradigm used to model complex software solutions. In data mesh, a data domain is a way to define boundaries around your enterprise data. The boundaries can vary depending on your organization and its needs. Sometimes, you might choose to model data domains based on your business processes or source systems.
  •  Data products – an important component of the data mesh that applies product thinking to data. To work, a data product must provide long-term value to users, as well as be usable, valuable, and feasible. It can be delivered as an API, report, table, or dataset in a data lake. 
  • Self-service platform – a data mesh builds on a foundation of generalists who create and manage broadly usable products. Thanks to decentralization and alignment with business users who understand the data, specialized teams can develop autonomous products that don’t depend on a central platform. For the same reason, the core foundation of your mesh-based platform shouldn’t rely on specialized tools that require specialist knowledge to operate.
  •  Federated governance – when you adopt a self-serve distributed data platform, you must focus on governance. If there’s no governance, you’ll soon see silos and data duplication across your data domains. That’s why you need to implement automated policies around both platform and data needs.
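To make the federated-governance idea concrete, here is a minimal sketch of an automated policy check a platform team might run before a data product is published. Everything here (the required metadata fields, the `governance_check` function) is illustrative, not part of any real framework:

```python
# Hedged sketch of an automated governance policy. The required fields
# below are hypothetical examples of what a platform team might mandate.

REQUIRED_METADATA = {"owner", "domain", "schema_version", "pii_columns"}

def governance_check(product_metadata: dict) -> list[str]:
    """Return a list of policy violations; an empty list means the product passes."""
    violations = []
    missing = REQUIRED_METADATA - product_metadata.keys()
    for field in sorted(missing):
        violations.append(f"missing required metadata field: {field}")
    # Example data policy: PII columns must be declared explicitly as a list
    if "pii_columns" in product_metadata and not isinstance(product_metadata["pii_columns"], list):
        violations.append("pii_columns must be a list")
    return violations

# A product with complete metadata passes the policy
ok = governance_check({
    "owner": "sales-team",
    "domain": "sales",
    "schema_version": "1.2.0",
    "pii_columns": ["customer_email"],
})

# A product missing ownership information fails
bad = governance_check({"domain": "sales"})
```

Running the same check automatically across every domain is what keeps a self-serve platform from degenerating into silos and duplicated data.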

Data mesh architecture: Top reasons to implement

Data mesh is a solution to the shortcomings of monolithic data lakes. It provides greater autonomy and flexibility for data owners by facilitating data experimentation, innovation, and collaboration. At the same time, it reduces the burden on data teams to field the needs of every data consumer through a single pipeline. 

Meanwhile, the self-service platform approach of the data mesh offers data teams state-of-the-art data technologies that are ready to use with little to no investment, along with a universal and often automated approach to data standardization, product lineage, and the collection and sharing of quality metrics. Together, these benefits provide a competitive edge over traditional data architectures, which are hamstrung by a lack of standardization between ingestors and consumers.

These benefits are just the tip of the iceberg. Here are a few more that make a convincing case for data mesh:

Lower costs and greater speed

Until now, organizations have turned to centralization to process extensive data containing various data types and use cases. However, centralization requires users to import/transport data from edge locations to a central data lake to gain insights. This is time-consuming and expensive. A centralized data team may create bottlenecks, as many employees create data, and the centralized team has to prioritize which of their tickets to work on. 

Data mesh helps reduce time-to-insights with a distributed architecture that views data as a product with separate ownership by each business unit. This decentralized model allows teams to access and analyze “non-core” data quicker than ever before.

Business agility

As data volumes continue to increase, a centralized management model fails to respond at scale. Business agility is negatively affected by the amount of time it takes to get value from data and respond to change. 

Data mesh solves this problem by enabling business agility and change at scale. How? It delegates dataset ownership from the central team to the domains (individual teams or business users). This closes the gap between an event happening and its consumption and processing for analysis.

Easier compliance

Data residency and privacy guidelines can be problematic for organizations with data stored in an EU country but accessed by a user in North America. Abiding by these regulations is time-consuming, tedious, and can delay critical business intelligence that helps companies maintain a competitive advantage. 

Data mesh provides a connectivity layer that enables direct access and query capabilities by technical and non-technical users to data sets where they reside, avoiding costly data transfers and residency concerns.

Data mesh architecture: challenges

You need to be ready for a few challenges to crop up when implementing your modern data mesh architecture. Here are the most important ones.

Budget constraints

Several factors threaten the financial viability of a platform project: the inability to fund infrastructure, develop applications, deliver data products, or maintain such systems. 

Even if the platform team manages to build a tool that effectively bridges a technical gap, as data volumes grow and data products become more complex, the solution’s price tag might get too big.

Collaboration between the platform team and domains

There’s no denying that data mesh brings a lot of extra work for the domains that previously mostly consumed data reports. You need to find a way to convince them that it’s all worth the effort. And once they’re on board, you must coordinate major releases with them. 

For example, improving the platform might introduce some breaking changes. What if one domain is in the middle of testing its new applications? In this scenario, they might hold you up for several months. 

Building a data management skill set

This is a massive challenge for companies looking to implement data mesh. When you decentralize data management, you have domains manage their own data. Is this a better solution than a central team that offers integration? The answer to this question depends on the industry-specific business domains and the origin of your data.

Technical skills shortage 

Delegating full ownership to the domains means that they need to be able to commit to the project. They can hire new people or get themselves trained, but that might soon become too much for them to handle. 

Issues might start popping up all over the place as performance dramatically decreases. No tool can solve this on its own, because it takes real data engineering knowledge to understand how things work.

Monitoring data products

The team needs to have the appropriate tools to deliver data products and monitor what is going on. Some domains might lack a deeper understanding of what exactly all technical metrics mean and how they affect workloads. Your platform team needs to have the resources at hand to identify and solve problems like overutilization or ineffectiveness.

Data virtualization and duplication

The end of the single-source approach to data management is here. People now want to combine data from multiple sources and don’t want to be constrained by the limitations of a “single place.” 

This can be done in two ways – through data virtualization and duplication – and each comes with unique challenges. Data virtualization creates a semantic data model outside the data sources without physically transferring data to another database. It breaks down users’ queries, pushes the parts down to the sources, and assembles the results back together. 

Data duplication, on the other hand, requires teams to handle data transfers from original sources to downstream applications. This can increase your cloud bills by a lot. And we’re not talking only about the storage fees but also potential egress costs.

Data mesh implementation: How to transform your data lake into data mesh services?

Data infrastructure teams can use lakeFS to provide each data mesh service with its own versioned data lake over the common object store. The Git-like operations available in lakeFS provide the missing capabilities, such as data governance and continuous deployment of quality data.

Data Mesh Implementation steps:

  1. The goal here is to create a lakeFS repository for each data mesh service. This lets each service work in isolation and publish high-quality data to other services/consumers.

  2. Protect your existing data in the object store by setting read-only permissions. 

  3. Create a repository in lakeFS for each data service, and then onboard its historical input and output data. This is a metadata operation – no data is transported. If some data sets serve multiple services, they’ll be onboarded to several repositories. 

  4. Write an onboarding script for each service from the repositories of services that provide its input. Each run of this script should be a new commit to the master branch, with changes and updates to the input data. 

Now you’re all set! Each service has the data it needs in isolation within its repository. It has the ability to time travel between different versions of the input per commit. The master branch of the repository serves as its single source of truth.
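The repository-per-service setup above can be sketched with a tiny in-memory model. The `Repository` class below is a toy stand-in for the lakeFS concepts involved (branches, commits, master as the single source of truth), not the real lakeFS client API:

```python
# Toy in-memory stand-in for a lakeFS repository; names and the onboarding
# flow mirror the steps above but are purely illustrative.

class Repository:
    def __init__(self, name):
        self.name = name
        # each branch maps to a list of commits; a commit is (message, data)
        self.branches = {"master": []}

    def commit(self, branch, message, data):
        self.branches[branch].append((message, dict(data)))

    def head(self, branch="master"):
        """The latest commit on a branch - the single source of truth."""
        return self.branches[branch][-1]

# Steps 1 & 3: one repository per data mesh service
sales_repo = Repository("sales-service")

# Step 3: onboard historical data (in lakeFS, a metadata-only operation)
sales_repo.commit("master", "onboard historical input", {"orders": [100, 200]})

# Step 4: each run of the onboarding script is a new commit to master
sales_repo.commit("master", "daily input update", {"orders": [100, 200, 300]})

message, data = sales_repo.head()
```

Because every onboarding run is a commit, the service can time-travel to any earlier version of its input simply by reading an older commit.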

To analyze the service data, you need to run the processes that consume input and produce output over the lakeFS repository. New output is also committed to the master branch, which creates a new version that can be consumed by others directly. 

Now, it’s time to set up a development environment and CI/CD for each data mesh service. This is how you ensure efficient work and high-quality outputs.

Development environment for a data mesh service 

To facilitate the development of a data mesh, we need a development environment that allows changes to the service code, infrastructure, or data in isolation. 

We can create a branch from the repository master branch and name it “dev-environment.” Merges made to it from the master will allow us to experiment with any version of the master. 

We can open a branch from “dev-environment” for testing during development and discard it once experimentation is complete. We can conduct several experiments on one branch sequentially using revert or on different branches in parallel, where we can compare the results of different experiments. 
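Here is a toy model of that branching workflow. The `Repo` class and its methods are illustrative stand-ins for the branch, commit, and branch-deletion operations lakeFS exposes:

```python
# Illustrative sketch of the dev-environment branching flow described above.

class Repo:
    def __init__(self):
        self.branches = {"master": ["v1", "v2"]}   # commit history per branch

    def branch(self, name, source):
        # a new branch starts as a zero-copy snapshot of its source
        self.branches[name] = list(self.branches[source])

    def commit(self, branch, change):
        self.branches[branch].append(change)

    def discard(self, name):
        # throwing away an experiment never touches master
        del self.branches[name]

repo = Repo()
repo.branch("dev-environment", "master")          # long-lived dev branch
repo.branch("experiment-1", "dev-environment")    # short-lived test branch
repo.commit("experiment-1", "try-new-transform")

repo.discard("experiment-1")                      # experiment done, clean up
```

The key property is isolation: however many experiment branches come and go, master (and the dev-environment branch itself) is never modified.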

Providing a development environment for a data mesh service (Source: lakeFS)

In addition – here is an extensive guide on building a data development environment with lakeFS.

Continuous integration of data to the repository

When ingesting new data sources or updating existing ones into a repository, it’s important to guarantee that the data adheres to quality and engineering specifications. When we described how to set up the repository for a data mesh service, we suggested onboarding updates to the data from the input repositories directly to the master. 

This is bad practice because it might cascade into the service’s data pipelines before you validate its quality. You don’t want to end up dealing with quality issues, data downtime, or slow recovery.

Here’s what you can do instead:

  1. The best practice is to create a branch to ingest data. Ideally, each data set has its own ingestion branch. 

  2. Give it a meaningful name, for example, “daily-sales-data.” 

  3. Use pre-merge hooks to run tests on the data to ensure that good practices and quality standards are met. 

  4. If the test passes, merge the data into the master; if not, an alert will be sent out via a monitoring system of your choice. In case of failure, you will have a snapshot of the repository at the time of failure and can find the cause faster. No data is lost since it’s not exposed to the master.
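The ingestion flow above can be illustrated with a short sketch. The quality check, branch handling, and alerting here are all placeholders; in lakeFS, the check would run as a pre-merge hook on the ingestion branch:

```python
# Sketch of the ingest-branch / pre-merge-hook pattern described above.

def quality_check(rows):
    """Toy validation: every row needs a non-empty 'sale_id'."""
    return all(row.get("sale_id") for row in rows)

def ingest(master, new_rows, alerts):
    ingestion_branch = list(master)          # branch, e.g. "daily-sales-data"
    ingestion_branch.extend(new_rows)
    if quality_check(new_rows):              # stand-in for a pre-merge hook
        master[:] = ingestion_branch         # hook passed: merge into master
        return True
    alerts.append("ingestion failed validation; data kept off master")
    return False

master, alerts = [], []
ingest(master, [{"sale_id": "a1"}], alerts)   # passes, merged into master
ingest(master, [{"sale_id": ""}], alerts)     # fails, master left untouched
```

The failed batch stays on its branch as a snapshot for debugging, while master only ever contains validated data.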

Continuous deployment of data to the repository

The purpose of this infrastructure is to ensure that data provided by a service to other services or consumers are of high quality. A complex data service might execute hundreds of small jobs over several hours, so we need a continuous deployment environment that can automatically rebuild the service if any errors are detected. 

You can do that by combining version control (lakeFS) with automated workflow management (Airflow, Dagster, or compatible) and a testing framework.

Here’s how it works:

  1. Orchestration runs a DAG on a dedicated branch. Each job is performed on a branch created from the DAG branch. 

  2. Once the job is completed, a webhook is triggered to run tests that ensure the data’s quality. 

  3. If the test passes, the data of this job will be merged automatically into the DAG branch, and the next job will start. 

  4. If the test fails, a webhook will create an event in an alerting system with all relevant data. The DAG will stop running. 

  5. Once all tests pass and execution has finished successfully, its data is merged back to the master. Now it can be consumed by other services or exported from object storage to a serving layer interface.
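Here is a toy rendering of that flow. The jobs, tests, and branch handling are illustrative placeholders for what an orchestrator (Airflow, Dagster) combined with lakeFS webhooks would do:

```python
# Sketch of the continuous-deployment flow above: each job runs on its own
# branch, a post-job test gates the merge into the DAG branch, and only a
# fully green DAG is merged back to master.

def run_dag(jobs, master, alerts):
    dag_branch = list(master)                # branch off master for the DAG
    for name, job, test in jobs:
        job_branch = list(dag_branch)        # per-job branch off the DAG branch
        job(job_branch)
        if not test(job_branch):             # webhook-style quality test
            alerts.append(f"job {name} failed; DAG stopped")
            return master                    # master left untouched
        dag_branch = job_branch              # test passed: merge into DAG branch
    master[:] = dag_branch                   # all tests green: merge to master
    return master

master, alerts = ["raw"], []
jobs = [
    ("clean",  lambda b: b.append("cleaned"),  lambda b: "cleaned" in b),
    ("enrich", lambda b: b.append("enriched"), lambda b: "enriched" in b),
]
run_dag(jobs, master, alerts)
```

If any job fails, the run stops with master untouched, and the failing job's branch remains available as a snapshot for debugging.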

The master of each repository serves as a trusted single source of truth; data is validated before it’s saved into it (no matter if it’s input data, intermediate results, or output data). Moreover, data is tested on both sides of the interface.

Conclusion 

Data mesh draws its concepts from software engineering best practices such as microservices and agile development. Applying those concepts to data operations is challenging but offers great rewards if implemented successfully. The term data lake carries a connotation of a monolith, while in practice it is implemented on highly distributed technology such as object storage, which allows the platform teams of a data mesh to offer each data product an isolated data environment. The monolith can thus be split into small data lakes, one per product. To avoid data duplication, an abstraction over the data lake is required, and lakeFS provides it: each data product uses its own repository while also consuming and exposing data from and to other repositories.

Want to learn more about lakeFS? Check out our GitHub repository and join our amazing community on Slack!
