Idan Novogroder

Last updated on January 2, 2025

What kind of tooling can you use to make data mesh work? Should you go after open-source or commercial solutions? There’s no single answer to these questions. 

But the first step is to know what tools are out there that could potentially help you build your data mesh architecture. Once you have a clear idea about your options, you’ll know where to look for solutions to your unique requirements.

Explore the data mesh tools landscape to find the best match for your use cases.

Introduction to Data Mesh

Data mesh is a data architecture for organizing and distributing enterprise data. It’s founded on the idea that business domains should be able to design, access, and govern their own data products rather than depending entirely on centralized data teams.

These four principles serve as the foundation for the data mesh concept.

  1. Data ownership and architecture are domain-oriented and decentralized
  2. Data is treated as a product
  3. Data infrastructure is a self-service data platform
  4. Data governance is federated

In a decentralized, domain-oriented data mesh architecture, data products combine, process, and oversee datasets, giving unified, clean data to authorized data consumers whenever they need it.

Despite the distributed nature of data governance, where each business domain oversees its data products, the implementation of security rules and compliance standards still calls for the use of centralized data governance technologies.

Data mesh architecture
Source: https://www.datamesh-architecture.com/ 

Data mesh accelerates and democratizes data delivery through a self-service approach to data access. At the same time, the mesh obscures the underlying data intricacies from users.

Because they define and manage their own data products, domain teams can evaluate and operationalize them as they see fit. This enables domains to make faster decisions and extract more value from their data.

Data mesh also minimizes dependency on centralized IT teams, promotes domain autonomy, and prepares enterprises to use more data. Domain-based data operations, when combined with automated data governance enforcement, improve access to new, high-quality data.

Types of Data Mesh Tools & Examples

Data Orchestration Tools

Orchestration tools are important because they automate repetitive operations and relieve you of manual labor, allowing your teams to focus on more strategic objectives. They also standardize deployments across environments to reduce errors and keep everything running smoothly.

With orchestration technologies, you can streamline processes, eliminate errors, and shorten time-to-market, making your entire process more efficient and dependable.

In this section, we’ll take a closer look at Airflow, Dagster, and Prefect. Each tool provides a unique set of powerful features aimed at distinct areas of data orchestration, such as continuous delivery and integration, open-source platform capabilities, and support for DevOps and machine learning professionals. 

Apache Airflow

Airflow is a workflow orchestration solution for managing distributed data workflows. It schedules jobs across multiple servers or nodes using Directed Acyclic Graphs (DAGs) and has a robust user interface for visualizing the flow of data through a pipeline, tracking the status of each task, and inspecting log files.

Airflow can create dynamic workflows using DAGs, allowing users to create complicated dependencies and task linkages. This makes Airflow a superior choice for managing complex ML workflows and model training pipelines. Whether you’re running workflows on a single server or across several nodes, Airflow’s design allows you to scale up or down as needed, using executors such as LocalExecutor, CeleryExecutor, and KubernetesExecutor.
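As a rough sketch of what this looks like in practice, here is a minimal DAG written with Airflow's TaskFlow API (Airflow 2.x); the pipeline, task names, schedule, and data are illustrative rather than taken from any specific project:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    """Illustrative daily pipeline: extract raw orders, then transform them."""

    @task
    def extract():
        # Pull raw records from a source system (stubbed for illustration)
        return [{"order_id": 1, "amount": 42.0}]

    @task
    def transform(orders):
        # Clean/aggregate the records and return a row count
        return len(orders)

    # Declaring the dependency builds the DAG: extract runs before transform
    transform(extract())


orders_pipeline()
```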

Airflow is backed by the Apache Software Foundation, which fosters a lively community dedicated to improving the tool’s capabilities through continual development and innovation.

Dagster

Dagster delivers a life cycle-oriented approach with exceptional flexibility, notably during development and testing. Its handy scheduler, dynamic pipeline generation, and seamless integrations let you create dependable and adaptable data workflows that meet diverse engineering requirements, such as machine learning model deployment and model metric monitoring.

Dagster’s scalability allows for efficient scaling of data operations, making it a flexible option for enterprises managing complex ML workflows. It improves developer productivity and debugging capabilities, streamlining the process of orchestrating complicated data pipelines.

Dagster includes built-in observability tools that provide detailed insights into how your workflows are executed. You can monitor pipeline runs, check logs, and track the status of individual components to improve transparency and control, particularly for model training activities.

Dagster's highly modular design encourages reuse and flexibility. You can quickly build reusable pipeline components, making it easier to adapt and grow your workflows. While Prefect and Airflow also enable modular workflows, Dagster's emphasis on modularity makes it especially useful for demanding data engineering jobs.
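To illustrate that modularity, here is a minimal, hypothetical sketch of Dagster's asset-based style (recent Dagster 1.x), where a downstream asset declares its dependency simply by naming the upstream asset as a parameter:

```python
from dagster import Definitions, asset


@asset
def raw_orders():
    # Load raw records from a source system (stubbed for illustration)
    return [{"order_id": 1, "amount": 42.0}]


@asset
def cleaned_orders(raw_orders):
    # Depends on raw_orders just by naming it as a parameter
    return [o for o in raw_orders if o["amount"] > 0]


# Register the assets so they can be materialized, scheduled, and observed
defs = Definitions(assets=[raw_orders, cleaned_orders])
```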

Prefect

Prefect focuses on making it extremely simple to convert Python code into a distributed pipeline. Its robust scheduling and orchestration engine makes the process easier, though its simplicity can be a disadvantage when dealing with more complex data pipelines. Nonetheless, Prefect has swiftly gained traction and is constantly evolving, helping you overcome many of the issues that classic solutions such as Airflow face.

Prefect’s cloud-native workflows interact effortlessly with platforms such as AWS and Google Cloud Platform, providing scalability and performance optimization for modern cloud settings, particularly for production deployments.

The solution excels in managing dynamic workflows with changing requirements, giving users a lightweight yet strong solution for data orchestration. Prefect has an API that allows you to programmatically control executions, communicate with the scheduler, and manage workflows, giving you more automation and control over your data pipelines.

Prefect’s flexible scheduling feature supports time-based calendars and event-driven triggers, making it simple to schedule workflows. This adaptability ensures that your workflows can run precisely when required, whether on a set timetable or reacting to specific occurrences.
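Here is a minimal, illustrative sketch of a Prefect 2.x flow; the task names, retry settings, and data are hypothetical:

```python
from prefect import flow, task


@task(retries=3, retry_delay_seconds=10)
def fetch_orders():
    # Pull data from an upstream system (stubbed for illustration)
    return [{"order_id": 1, "amount": 42.0}]


@task
def load_orders(orders):
    # Persist the records and return how many were written
    return len(orders)


@flow(log_prints=True)
def orders_flow():
    orders = fetch_orders()
    count = load_orders(orders)
    print(f"Loaded {count} orders")


if __name__ == "__main__":
    # Runs locally; the same flow can be deployed and scheduled via Prefect's API
    orders_flow()
```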

Data Storage Tools

Next, there’s data storage. In a data mesh architecture, we typically spread data storage around the enterprise, with each team responsible for its storage requirements. 

This has certain advantages, but it can also lead to increased complexity because each team is free to choose its storage. As a result, we may wind up with various storage types employed in different parts of the organization.

With this in mind, here are the three most common data storage options used in data mesh architectures.

Amazon S3

Amazon S3 is a highly scalable storage service from AWS that is suitable for storing massive amounts of unstructured data and data lakes, as well as for use in a variety of AWS-based data architectures. It integrates with AWS Identity and Access Management (IAM) to provide access control and offers server-side encryption to safeguard data. Users can manage metadata through versioning and tagging.
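As a brief, hypothetical sketch of the versioning and tagging capabilities mentioned above, using the standard boto3 client (the bucket, key, and tag names are illustrative):

```python
import boto3

s3 = boto3.client("s3")

# Turn on object versioning for a (hypothetical) domain-owned bucket
s3.put_bucket_versioning(
    Bucket="sales-domain-data",
    VersioningConfiguration={"Status": "Enabled"},
)

# Upload an object with tags that record ownership metadata for the domain
s3.put_object(
    Bucket="sales-domain-data",
    Key="orders/2024/orders.parquet",
    Body=b"example-bytes",  # placeholder payload
    Tagging="domain=sales&owner=sales-team",
)
```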

Azure Blob 

Azure Blob Storage is a cloud-based object storage solution from Microsoft. It’s ideal for storing large amounts of unstructured data, which is data that doesn’t fit into a specific data model or specification, such as text or binary data.

Google Cloud Storage

Cloud Storage is a service for storing objects in Google Cloud. An object is an immutable piece of data, such as a file in any format. Buckets are containers that hold objects. Every bucket belongs to a project, and projects can be grouped under an organization.

Data Cataloging Tools

A data catalog is one of the most important components of data mesh design because it serves as a consolidated inventory of all data assets available within the company. It's crucial for data consumers because it allows teams across organizations and domains in the data mesh to identify data assets and understand their scope.

Collibra

Despite being data governance software, Collibra's data catalog features stand out. It lets users manage metadata such as domain terms, definitions, and classifications. You can use it to visualize data lineage and trace its flow and transformation throughout the company. It's one of the products I recommend for fostering collaboration among data stakeholders and promoting data culture.

Informatica 

The solution consistently ranks as a leader in the Gartner Magic Quadrant. Although it shares many features with Collibra, it also includes sub-products such as PowerCenter, Informatica's data integration engine, and Informatica Data Quality, which focuses on data quality. It also allows you to use keywords in data profiles, which makes data discovery easier.

Data Quality Tools

Data quality management is crucial in a data mesh, where many distinct data products are developed and consumed by numerous teams within the organization. However, each domain may no longer have a dedicated data quality management staff. As a result, ensuring the consistency and dependability of various data products requires appropriate data quality management technology.

Monte Carlo

Monte Carlo's Data Observability Platform is a comprehensive data stack solution that monitors and alerts on data issues across data warehouses, data lakes, ETL, and business intelligence. The platform employs machine learning to infer and learn from your data, proactively identifying issues, assessing their impact, and notifying those who need to know.

By automatically and immediately identifying the root cause of an issue, teams can collaborate more readily and solve problems faster. Monte Carlo also offers automatic, field-level lineage and centralized data cataloging, enabling teams to better understand the accessibility, location, health, and ownership of their data assets while adhering to strict data governance requirements.

Great Expectations

Great Expectations (GX) is an open-source, Python-based data quality management tool. It enables data teams to profile, test, and generate reports on data. The solution has a straightforward command-line interface (CLI), making it simple to set up new tests and modify existing reports.

Great Expectations can be integrated with a number of extract, transform, and load (ETL) technologies, including Airflow and databases. You can use Great Expectations to add data quality dimensions to your dataset and validate your data to ensure that it fulfills the standards for a range of data quality aspects, such as completeness, validity, consistency, uniqueness, and more. 
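Here is a small, illustrative sketch of validating a DataFrame with Great Expectations; the exact entry points vary between GX releases (this assumes a recent 0.x version with the fluent pandas API), and the column names and checks are hypothetical:

```python
import great_expectations as gx
import pandas as pd

# Hypothetical orders data with one missing amount
df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, None]})

context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(df)

# Declare expectations covering completeness and validity
validator.expect_column_values_to_not_be_null("amount")
validator.expect_column_values_to_be_between("amount", min_value=0)

# Run the checks and report whether the dataset meets the expectations
results = validator.validate()
print(results.success)
```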

Data Governance Tools

Data governance is becoming increasingly important, particularly in data mesh architectures. Data governance guarantees that data is managed in accordance with regulatory standards and previously established organizational policies. If you have the right data governance tool for data mesh, domain teams will have a considerably easier time implementing data governance principles and standards.

Alation

Alation is a data cataloging solution that helps teams identify, understand, and manage their data assets. It lets you catalog data assets, capture and add technical and business-level information about them, and steward and govern these assets. Alation also assists stakeholders in understanding the existence, composition, and use of data assets, as well as managing risk, data privacy, and compliance.

Alation comprises four key functional areas, one of which is Data governance and stewardship. This module enables companies to develop, manage, and implement policies governing who has access to data, how it is used, and how privacy and compliance are protected.

DataHub

DataHub is a modern data catalog that aims to simplify metadata management, data discovery, and data governance. It allows users to efficiently examine and comprehend their data, trace data lineage, characterize datasets, and create data contracts. 

This extensible metadata management solution is designed to help developers manage the complexity of their rapidly evolving data ecosystems and help data practitioners maximize the overall value of data within their business.

Data lineage is particularly important for successful data governance, and DataHub offers a comprehensive picture of an organization’s data lineage. This comprises details about the data’s origins, modifications, and utilization. Furthermore, DataHub includes tools for data lineage visualization, which helps data users comprehend the flow of data and identify any errors or anomalies.

Data Visualization and Reporting

Data visualization and reporting are another vital aspect of the data mesh to consider. APIs and service mesh work well for exchanging data between domains, but keep in mind that the majority of data consumers are not technical. This is where data visualization and reporting tools come in.

Tableau 

The solution stands out for its user-friendly interface and excellent visualization capabilities. Tableau is a powerful tool for understanding and visualizing complicated data relationships within a data mesh and integrates easily with major data sources.

Note that Tableau may suffer performance issues while processing huge data sets. It also lacks advanced analytics capabilities compared to Power BI.

Power BI 

Power BI integrates really well with the Microsoft environment, making it easier for users to access and analyze data. It’s useful for merging and reporting data sets within a data mesh. It also allows users to generate quick and useful reports.

However, Power BI’s advanced analytics may have limitations. When it comes to analyzing massive datasets, it may not perform as well as competitors like Tableau.

Information Sharing and Collaboration Tools

Collaboration and knowledge exchange are essential, particularly in independent and scalable multidisciplinary companies. This increases consumer and revenue potential while emphasizing the importance of excellent communication amongst entirely independent teams. 

During the move to data mesh deployment, the integration and use of collaboration tools become critical components that support an organization’s overall performance.

Git 

Git is an open-source, distributed version control system that enables numerous users to work on a project simultaneously. First developed by Linus Torvalds in 2005, it has since become the standard option for maintaining source code and tracking changes over time.

Simply put, Git is a platform that allows individuals to collaborate on software projects while tracking the changes they make to the project files. Each Git commit provides a clear, accessible snapshot of the project at a certain point, which improves code reviews and collaboration.

Notion

Notion is a collaboration platform that supports Markdown and includes kanban boards, tasks, wikis, and databases. It serves as a workspace for taking notes, managing information and data, and managing projects and tasks.

It integrates file management into a single workspace, allowing users to comment on ongoing projects, participate in discussions, and receive feedback. It’s accessible through cross-platform apps and major online browsers.

Slack

Slack is a cloud-based team communication tool created by Slack Technologies and owned by Salesforce since 2020. Slack includes numerous IRC-style features, such as persistent chat rooms called channels that are arranged by subject, private groups, and direct messaging. Slack allows users to search all content, including files, conversations, and people, and to react to any message with emojis. On the free plan, message history is limited to the last 90 days.

Data Version Control

Data versioning capabilities are critical for data mesh architectures because they directly relate to matters like data quality management, data governance, and compliance. On top of that, such tools enable smooth collaboration and precise data lineage tracking.

lakeFS

lakeFS acts as a wrapper for the parts of the data lake that you want to version. It’s an additional layer that allows Git-like actions on the object storage.

lakeFS is ideal for developing and testing in isolation over object storage, managing a long-term production environment with versioning, and enabling seamless collaboration. The version control solution can manage both structured and unstructured data and is format-agnostic, making it compatible with all existing compute engines.

Key Features of lakeFS include:

  • The ability to version data for S3 object storage 
  • Integration with any tool or framework that works with S3 (including Airbyte, Spark, Iceberg, and Delta Lake)
  • Scalability that enables you to easily manage petabyte-sized data
  • Zero-copy branching removes the requirement for data duplication while ensuring ACID transactions
  • Excellent performance for data lakes of any size
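Because lakeFS exposes an S3-compatible endpoint, a standard S3 client can read and write versioned data. The sketch below is illustrative only: the endpoint, credentials, repository, and branch names are hypothetical, and object keys are addressed as branch/path inside the repository, so writes to an experiment branch leave main untouched until you merge.

```python
import boto3

# Point a standard S3 client at a (hypothetical) lakeFS gateway endpoint
lakefs_s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",     # illustrative endpoint
    aws_access_key_id="LAKEFS_ACCESS_KEY_ID",       # illustrative credentials
    aws_secret_access_key="LAKEFS_SECRET_KEY",
)

# Bucket = lakeFS repository; key prefix = branch name
lakefs_s3.put_object(
    Bucket="sales-repo",
    Key="experiment/orders/2024/orders.parquet",
    Body=b"example-bytes",  # placeholder payload
)
```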

DVC

DVC (Data Version Control) is a project inspired by Git LFS that was designed with data scientists and researchers in mind. DVC keeps code in a Git repository and supports all common modes of data storage, including object storage on any cloud provider or on-premises hosting, while SSH access lets you reach file systems and local storage.

The data is saved and available for editing and viewing through your repository. You also get a caching layer (local cache): when you fetch a file, it is saved in the local cache to speed up subsequent access. That is why DVC is better suited for data science than Git LFS.
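As a small, illustrative example of consuming versioned data downstream, DVC's Python API lets you read a file pinned to a specific Git revision; the repository URL, path, and tag below are hypothetical:

```python
import dvc.api

# Open a tracked file at a specific version; `rev` can be any Git reference
# (tag, branch, or commit) that pins the data version you want.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example-org/ml-project",  # hypothetical repo
    rev="v1.0",                                         # hypothetical tag
) as f:
    header = f.readline()
    print(header)
```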

What's missing from DVC? It's a weaker fit if your data lives in relational databases, and when working at petabyte scale with hundreds of millions of objects, caching becomes impractical.

Top Data Mesh Tools

Databricks

Legacy systems and enterprise data warehouses have disadvantages, such as high costs, sluggish performance, and data silos. The Databricks Lakehouse Platform helps address them by bringing together the best of data lakes and data warehouses to provide governance, dependability, flexibility, openness, and performance.

Using Databricks Lakehouse capabilities, you can create a data mesh that adheres to domain-driven design principles. This means that data will be handled as a product and controlled by specific domain teams.

Unity Catalog, Databricks' data governance tool, provides a consistent solution for all data and AI assets in your lakehouse. Its cataloging features include data discovery, lineage, fine-grained access control, and auditing.

The Databricks Delta Sharing solution allows you to securely transfer data products across organizational, technological, and regional barriers. The system is excellent for large, internationally distributed enterprises with installations in several clouds and regions.

Diagram illustrating data mesh architecture with interconnected data domains, Unity Catalog for governance, and metadata/data flows for publishing and consumption.
Source: https://www.databricks.com/blog/building-data-mesh-based-databricks-lakehouse-part-2 

AWS Lake Formation

AWS Lake Formation helps implement data mesh by enabling decentralized data ownership, where teams manage their data products independently while ensuring centralized governance. It facilitates federated data governance, allowing each data domain to enforce security, manage data lineage, and ensure data quality. 

The solution also provides a central data catalog for data discovery, access control, and auditing, ensuring compliance across the organization. This approach supports scalable, autonomous data domains while maintaining strong governance, leveraging AWS tools like Glue, Athena, and Redshift to manage, share, and consume data efficiently in a data mesh architecture.

How to design a data mesh architecture
Source: https://aws.amazon.com/blogs/big-data/design-a-data-mesh-architecture-using-aws-lake-formation-and-aws-glue/ 

Snowflake

The Snowflake Data Cloud can help your team benefit from implementing a data mesh strategy in several ways. 

Domain teams require on-demand access to information and technologies that will help them at each stage of the data product lifecycle. Snowflake offers a wide range of tools for automating data transformation pipelines and developing and managing data products. 

Snowflake’s platform prioritizes ease of use, low maintenance, and quick resource scaling, enabling a genuine self-service experience. Each domain team can deploy and grow its resources based on its requirements without affecting others, reducing its reliance on an infrastructure team.

Snowflake also has numerous native cross-cloud governance controls required for federated governance. These include monitoring data lineage and object dependencies, metadata tags for data products, row-level access control, dynamic data masking for private data, and other safety measures.
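As an illustrative sketch of one such control, the snippet below creates and attaches a dynamic data masking policy through Snowflake's Python connector; the connection parameters, role, table, and column names are hypothetical:

```python
import snowflake.connector

# Connection parameters are illustrative placeholders
conn = snowflake.connector.connect(
    account="my_account",
    user="governance_admin",
    password="***",
    warehouse="GOVERNANCE_WH",
    database="SALES_DOMAIN",
    schema="PUBLIC",
)
cur = conn.cursor()

# Define a masking policy that reveals emails only to an authorized role
cur.execute("""
    CREATE MASKING POLICY IF NOT EXISTS email_mask AS (val STRING) RETURNS STRING ->
      CASE WHEN CURRENT_ROLE() IN ('SALES_ADMIN') THEN val ELSE '***MASKED***' END
""")

# Attach the policy to a column in a (hypothetical) domain-owned table
cur.execute(
    "ALTER TABLE orders MODIFY COLUMN customer_email SET MASKING POLICY email_mask"
)
```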

Diagram showing Snowflake Marketplace integration with multiple domain environments across regions, connecting data sources to internal and external consumers.
Source: https://www.snowflake.com/en/solutions/use-cases/data-mesh/ 

Why are Data Mesh Tools Important?

A centralized data platform architecture may eventually lead to dissatisfied data consumers, disconnected data producers, and an overburdened data management team. Data mesh design seeks to address these issues by granting business units significant autonomy and ownership of their data domain.

Traditional data architectures are more monolithic and require cross-team coordination to maintain and adapt. Data mesh, by contrast, reorganizes the core system's technical implementation around business domains. This eliminates central data pipelines, reducing operational bottlenecks and technological strain on the system.

A data mesh delegates data control to domain experts, who produce useful data products within a decentralized governance framework. Data consumers can also request access to data products and get permissions or adjustments directly from data owners. As a result, everyone has faster access to relevant data, which increases business agility.

Distributed data architecture shifts away from batch processing and toward real-time data streaming usage. You gain visibility into resource allocation and storage expenses, leading to improved budgeting and lower costs.

Data mesh architectures enforce data security principles within and between domains. They offer centralized monitoring and auditing of data exchange processes. For example, you can impose log and trace data requirements across all domains. Your auditors can monitor the amount and frequency of data access.

Key Features to Look for in Data Mesh Tools

Choosing the appropriate tools for a data mesh architecture is critical for creating a decentralized, scalable, and effective data ecosystem. While numerous solutions are available, certain characteristics are crucial when comparing data mesh tools. 

Here’s an overview of the most critical capabilities to consider:

Decentralized Data Ownership

A fundamental principle of data mesh is decentralized data ownership, in which individual domains govern their own data products. This organization necessitates solutions that promote autonomy, allowing domain teams to produce, update, and manage their datasets without relying on a centralized data team. 

Tools that support this decentralized structure offer more flexible and responsive data management, allowing teams to adjust data to their specific needs.

Scalability and Performance

Scalability is extremely important in a data mesh setting. Each domain’s data requirements may evolve independently, and data mesh technologies must accommodate this expansion while retaining good performance. 

Look for solutions that can handle large-scale data operations efficiently, allowing domain-specific data products to scale without sacrificing speed. High-performance data mesh technologies provide instant access to insights while handling the data demands of complicated analytics and machine learning workloads.

Integration Capabilities

To work well, a data mesh must easily interact with existing data sources, platforms, and business intelligence (BI) tools. The best data mesh solutions include extensive integration capabilities, enabling enterprises to combine data from a variety of sources such as databases, data warehouses, and data lakes. This flexibility ensures that data products from many domains stay connected, accessible, and usable for enterprise-wide analytics.

Data Governance and Security

In a decentralized system, data governance is crucial as it necessitates appropriate management across domains to guarantee quality, compliance, and security. Effective data mesh tools have strong governance features like role-based access, row- and column-level permissions, and auditing capabilities.

User Experience and Accessibility

A user-friendly interface and intuitive design make data mesh solutions more accessible to data teams across an organization. Look for solutions that simplify the user experience, allowing teams to easily generate, manage, and access data products. Tools that prioritize accessibility allow for speedier onboarding, more productive processes, and higher adoption among data teams.

Focusing on these qualities allows companies to ensure that they choose data mesh tools that not only support decentralized data ownership but also provide the performance, governance, and accessibility required to maximize the value of their data.

Best Practices for Implementing Data Mesh Tools

Define Clear Data Ownership and Accountability

The relevant domain-driven team should assign a specific owner to each data product, such as datasets, APIs, or reports. This ownership is crucial for ensuring the data’s quality and utility.

Ownership ensures that data products are created with a clear purpose and meet users' specific needs, thereby enhancing their value to the organization. It also allows for greater control over data quality, security, and lifecycle management, which contributes to the overall integrity and reliability of the data mesh.

Establish Robust Data Governance Policies

To set up federated computational governance, you need a way to manage data across many domains while keeping overall control through centralized rules.

This governance approach guarantees that data management processes are consistent and meet regulatory standards, while also providing flexibility for domain-specific adjustments.

It contains methods for policy enforcement, data quality management, and security measures that are critical to the data mesh’s integrity and safety.

Adopt Domain-Driven Design for Data Products

The process of defining data domains and products involves the categorization of the organization’s data into logical groups that mirror the organization’s utilization and handling practices.

Data domains are often associated with certain business functions or areas, allowing for more focused and efficient data administration. Data products are defined within each domain. These can contain datasets, APIs, or analytical tools specialized to the domain.

Ensure Interoperability with Standardized APIs

An API-first strategy improves communication across systems by providing a flexible and scalable means to communicate data. SQL Federation is an important component of the data mesh, but standardized APIs are critical to collaboration between data platforms and backend services.

Focus on Self-Service Capabilities for Teams

Creating a self-serve data infrastructure allows domain teams to access and manage their data independently, enabling data mesh’s decentralized nature.

This infrastructure should contain tools and platforms that enable users to extract, load, transform, analyze, and visualize data without requiring ongoing IT support. Self-service features let users complete data-related tasks quickly and efficiently, promoting a culture of data-driven decision-making.

Leverage Automation for Data Quality and Lineage

Automated data quality controls promote trust in the data mesh and guarantee that data-driven decisions are based on reliable information. Guidelines also serve as a foundation for ongoing data quality reviews and improvements, which help maintain the general health of the data ecosystem.

Invest in Scalable and Flexible Infrastructure

Selecting the appropriate technology is critical to creating a scalable and efficient data mesh. This includes selecting data storage systems, processing frameworks, and analytics tools that meet the organization’s requirements.

It’s critical to pick technologies that are compatible with existing systems and can accommodate future changes in data volume and complexity. Investing in the right technology at the outset can decrease future costs and complications involved with scaling and maintaining the data mesh.

Monitor Performance with Metrics and KPIs

Regular monitoring helps in identifying performance bottlenecks, security risks, and areas for development. Scaling the mesh entails changing infrastructure and resources to accommodate increased data volume and complexity, ensuring the mesh remains strong and responsive.

Continuously expanding the data mesh by embracing new technologies, techniques, and feedback keeps the architecture in sync with the organization's changing requirements.

Data Mesh Implementation with lakeFS

Using lakeFS, data infrastructure teams can give each data mesh service its own atomic, versioned data lake on top of the shared object storage, eliminating data duplication and wasteful permission management. Furthermore, lakeFS's Git-like operations enable otherwise missing capabilities such as data governance and the continuous delivery of quality data.

To learn more, continue with this practical guide to implementing lakeFS: Data Mesh Applied: How to Move Beyond the Data Lake with lakeFS

Conclusion

Data mesh tools play a pivotal role in enabling organizations to adopt a decentralized, domain-driven approach to data management. By empowering teams to own and manage their data as a product, these tools promote scalability, agility, and faster insights. 

Key features such as self-service, data governance, and automation are essential for ensuring that data remains accessible, trustworthy, and compliant. As businesses increasingly recognize the value of democratizing data, data mesh tools will continue to evolve, driving innovation and more effective data strategies across industries.
