Data lake platforms often lack simple ways to enforce data governance. This is especially challenging because governance requirements are complicated even without the added complexities of managing data in a data lake. Enforcing them is therefore an expensive, time-consuming, ongoing effort that requires continuous management – typically at the expense of data engineering or other DevOps activities that could benefit the business.
In this article, I’ll review how lakeFS helps achieve data governance at scale in a simple, straightforward way.
We will review the following benefits of lakeFS:
- Role-Based Access Control
- Immediate Backup and Restore of the Entire Data Lake
- Branch-Aware Managed Garbage Collection
- Data Lineage
- Auditing
What Is Data Lake Governance?
The goal of data lake governance is to apply policies, standards, and processes to the data. This allows for the creation of high-quality data and the assurance that it’s used appropriately across the organization.
Due to its flat architecture and less structured nature compared to a data warehouse, a data lake poses many challenges to all the areas of enterprise data management, including governance. However, if you want to get spot-on analytics output from your data lake, you must include it in your overall data governance initiative.
Components of Data Lake Governance
As with regulating data in other types of systems, some common first steps in data lake governance include the following:
- Documenting the business case for regulating a data lake, including data quality indicators and other ways to assess the advantages of governance efforts.
- Finding an executive or business sponsor to help you secure permission and financing for the governance initiative.
- If you don’t currently have a data governance structure in place, establishing one that comprises a governance team, data stewards, and a data governance committee made up of business executives and other data owners.
- Working with the governance committee to create data standards and policies for the data lake.
- Creating a data catalog to help end users locate and understand the data housed in the data lake. Alternatively, if you already have a catalog of other data assets, it can be expanded to include the data lake. A data catalog collects metadata and produces an inventory of available data that users can search to locate what they need. You can also include information on your organization’s data governance policy in the catalog, along with means for enforcing its rules and restrictions.
Benefits of Data Lake Governance
Data lake governance improves data quality and increases data usage for business decision-making, leading to operational improvements, better-informed business strategies, and stronger financial performance.
Here are a few specific benefits organizations see from data lake governance:
Greater Access to Data for Advanced Analytics
A well-governed data lake helps data scientists and other analytics team members find the data they require for machine learning, predictive analytics, and other data-driven applications.
More Efficient Data Preparation for Analytics
When data resides in a data lake, it’s usually left in a raw form until a team needs it for a specific application. The data preparation process often becomes time-consuming – unless you have a well-governed environment. Cleaning data upfront eliminates the need to fix errors and problems later on.
Reduced Data Management Costs
If your data lake spirals out of control, you’ll have to spend more on data processing and storage resources. Data lake governance lowers data management costs thanks to improved data accuracy, consistency, and cleanliness.
Airtight Security and Regulatory Compliance for Sensitive Data
If your data lake includes customer data used for analytics purposes by marketing or sales teams, you might be dealing with sensitive information. Strong data governance lets you ensure that all data is properly secured and isn’t exposed to any misuse.
Data Lake Governance Common Use Cases
Here are a few common use cases for data lake governance:
- Data quality management – Ensuring data accuracy, consistency, and completeness enables firms to make better decisions and increase overall operational efficiency.
- Data security and privacy – Data governance is crucial for preserving sensitive information, following privacy requirements, and preventing unwanted access or data breaches.
- Compliance and regulatory requirements – Meeting industry-specific legal regulations, such as GDPR, HIPAA, and CCPA, is critical for avoiding penalties and maintaining a good reputation.
- Data lineage and traceability – Understanding the data’s origin, flow, and transformations promotes transparency, data integrity, and auditability.
- Data access and sharing – Ensuring that the right users have access to the right data with suitable permissions is critical for collaboration and for preventing unauthorized access.
- Data lifecycle management – Implementing policies and procedures for data generation, storage, archiving, and destruction enables efficient resource usage and regulatory compliance.
- Data standardization and integration – Promoting standard data formats, definitions, and architectures improves data integration and analysis throughout the company.
- Master data management – Establishing a single, authoritative source for important company data, such as customer or product information, improves decision-making and decreases data inconsistency.
- Change management – Data governance enables businesses to manage and respond to changes in data requirements, business processes, and technologies.
Is It Possible to Achieve Data Governance in Data Lakes?
Yes, it’s possible to achieve data lake governance – but it comes with specific challenges. Companion disciplines to data governance include data quality, metadata management, and data security, all of which factor into data lake governance and the problems it poses.
Some of the most common data governance challenges teams encounter in data lakes are identifying and maintaining correct data sources, lack of coordination around data governance and quality, issues with metadata management, and potential conflicts among data lake users.
Data Lake Governance Challenges and Limitations
Here are some of the most prevalent data governance challenges found during data lake implementations.
- Identifying and maintaining the correct data sources – Many data lake systems fail to capture or make public the source metadata, casting doubt on the data lake’s contents. For example, the system of record or the business owner of data sets may not be disclosed, and clearly redundant data may present problems for data analysts. At the very least, the source metadata for every piece of data in a data lake should be recorded and made available to users so that they may understand its origin.
- Metadata management difficulties – Metadata contextualizes the content of data sets and is an essential component in making data comprehensible and usable in applications. However, many data lake implementations fail to recognize the need to apply proper definitions to the acquired data. Furthermore, because raw data is frequently fed into a data lake, many organizations overlook the processes required to validate or apply organizational data standards. The lack of effective metadata management renders data in a data lake less usable for analytics.
- Lack of coordination in data governance and quality – Failure to coordinate data lake governance and data quality work might result in poor-quality data entering a data lake. This can result in erroneous results when the data is used for analytics and business decisions, producing a loss of faith in the data lake and a widespread distrust of data throughout a company. Effective data lake deployments require data quality analysts and engineers to collaborate closely with the data governance team and corporate data stewards to apply data quality policies, profile data, and take appropriate actions to improve its quality.
- Lack of collaboration in data governance and security – Data security standards and rules that are not adequately implemented as part of the governance process can lead to problems with access to personal data protected by privacy regulations, as well as other sensitive data. Although data lakes are intended to be a relatively open source of data, security and access control mechanisms are still required, and the data governance and data security teams should collaborate during the data lake design and loading processes, as well as in ongoing data governance initiatives.
- Conflict between business divisions that use the same data lake – Distinct departments may have distinct business standards for the same data, making it impossible to reconcile data variances for reliable analytics. A strong data governance program, complete with an enterprise view of data policies, standards, processes, and definitions and an enterprise business lexicon, can help mitigate the challenges that develop when several business units use the same data lake. An organization with numerous data lakes should include each of them in the data lake governance process and allocate business data stewards to them.
How to Get Started with Data Lake Governance and lakeFS
1. Role-Based Access Control
Access to and modification of the data should be restricted according to users’ roles and responsibilities. Data lakes typically gather data from many different systems and expose data to (other) systems and users.
Therefore, restricting this access, often for subsets of the data, becomes a complicated challenge. The simplest example: a developer shouldn’t necessarily have access to production data. However, you might want to maintain a separate production environment for Disaster Recovery (DR), or to stage your production data before promoting fresh data to production. You might also need to train new ML models on production data. In addition, you might have different teams with access to different parts of the data that are deliberately managed together in the data lake. The complexity quickly builds up.
Simple yet Flexible RBAC in lakeFS
lakeFS provides a Role-Based Access Control (RBAC) capability that is granular down to the individual branch level. Since lakeFS restructures the data on the object store, all access will be done via lakeFS. Every action in the system – be it an API request, UI interaction, S3 Gateway call, or CLI command – requires a set of actions to be allowed for one or more resources.
A simple-to-operate IAM-like authorization mechanism allows granularity of which groups or individual users have access to which branches on specific repositories and what types of actions (read/write of data/metadata, etc.) they can execute.
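As a rough illustration, here is what such a policy might look like, sketched as a Python dict. The action and resource strings follow lakeFS’s IAM-like policy model as I understand it, and the repository name is hypothetical – check the lakeFS authorization docs for the authoritative format:

```python
# A sketch of a lakeFS RBAC policy granting read-only access to a
# hypothetical "prod" repository. Verify the exact action and ARN
# strings against the lakeFS documentation.
prod_read_only_policy = {
    "id": "ProdReadOnly",
    "statement": [
        {
            # Read repository, branch, and commit metadata.
            "effect": "allow",
            "action": ["fs:ReadRepository", "fs:ReadBranch", "fs:ReadCommit"],
            "resource": "arn:lakefs:fs:::repository/prod",
        },
        {
            # Read and list objects; no write or delete actions are granted.
            "effect": "allow",
            "action": ["fs:ReadObject", "fs:ListObjects"],
            "resource": "arn:lakefs:fs:::repository/prod/object/*",
        },
    ],
}
```

Attaching a policy like this to a group, rather than to individual users, keeps the permission model manageable as teams grow.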
This becomes especially powerful when combined with lakeFS hooks – enforcing, for example, Personally Identifiable Information (PII) removal on every branch created from production that will be used for isolated development or ETL testing.
2. Immediate Backup and Restore of the Entire Data Lake
A key component of data governance is restoring the service in case of an outage. In the world of data lakes, that often also means restoring the service in case the production data gets corrupted.
One option would be to regularly back up the storage. The challenges with this approach are that (1) it’s expensive, (2) it’s time-consuming, and (3) you can only restore to the point of the backup.
Another option would be adopting an open table format like Delta Lake, Hudi, or Iceberg. With open table formats, you have a historical audit of all changes per table and can run operations or query/restore a table at a specific point in time. The challenges with this approach are that (1) you often need to restore several tables together across multi-table transactions, and (2) it’s limited to structured data.
Backup and Restore with lakeFS
Usually, most of the files in a data lake are static, while a smaller subset of objects is added/removed regularly. lakeFS uses a copy-on-write mechanism to avoid data duplication. For example, creating a new branch is a metadata-only operation: no objects are actually copied. Only when an object changes does lakeFS create another version of the data in the storage.

Beyond the storage savings and performance advantages of this design, this mechanism allows speedy recovery of historical commits without taking multiple snapshots of the lake since lakeFS deduplicates the objects in the data lake over time.
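As a minimal sketch of what this looks like in practice, using the high-level lakefs Python SDK (the repository, branch, and reference names here are hypothetical):

```python
import lakefs

repo = lakefs.repository("prod-lake")

# Creating a branch is a metadata-only operation: no objects are copied,
# so this effectively takes an instant, zero-copy "backup" of main.
backup = repo.branch("backup-2024-01-01").create(source_reference="main")

# If production data later gets corrupted, restore by branching from the
# known-good reference (a branch name or a commit ID) -- again instantly,
# without copying any objects.
restored = repo.branch("main-restored").create(source_reference="backup-2024-01-01")
```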
3. Branch-Aware Managed Garbage Collection
When managing data, you need to balance the tradeoff between being able to restore and recover historical data on the one hand and deleting old data on the other. Data deletion might be required simply for cost reduction (e.g. should I keep all intermediate steps of an ETL process, or only be able to restore the output? Should I also be able to restore the raw data?).
Furthermore, regulations like GDPR might require deleting users’ information throughout your entire data stack (including older versions of the data).
Simple Configurations for Advanced GC Capabilities
lakeFS provides Garbage Collection (GC) functionality that is branch and repository aware (a managed service in lakeFS Cloud). This gives you a simple way to control how far back data should be reproducible for different types of data.
In the example below, production data will be stored (and easily restored) for 7 days, while features data will be stored for only 3 days.
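For illustration, the retention rules for that example might look like the following, sketched here as a Python dict mirroring the JSON rules format that lakeFS garbage collection accepts (the branch names are hypothetical):

```python
# Hypothetical GC retention rules: commits on "production" remain
# restorable for 7 days and on "features" for only 3 days; any other
# branch falls back to the default.
gc_rules = {
    "default_retention_days": 3,
    "branches": [
        {"branch_id": "production", "retention_days": 7},
        {"branch_id": "features", "retention_days": 3},
    ],
}
```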

Some companies require years of historical data for different regulations. In those cases, an intelligent policy (deduplication over time and deleting unnecessary files) can have a tremendous impact on your storage cost.
GDPR Support
Traditional approaches, like backup and archiving to lower storage costs, face challenges supporting the user’s right to be forgotten as part of the General Data Protection Regulation (GDPR).
Using a system like lakeFS, you can support the right to be forgotten at the cost of losing reproducibility for the deleted data, or retain reproducibility up to the point where right-to-be-forgotten deletions apply.
Read about GDPR best practices while using lakeFS
4. Data Lineage
Data lineage is a critical component of a data lake governance strategy. As data pipelines keep growing in complexity, companies face manageability challenges in capturing lineage cost-effectively and consistently.
In data lakes, data lineage often refers both to tracking the transformation of data (i.e. which data was used to create it) and to tracking the associated code (i.e. which code was used to generate it).
lakeFS provides powerful lineage through blame functionality for every object in the data lake: it ties each object to a commit and provides metadata for that commit and any other, including the correlations between those commits.
lakeFS Blame
Via the API, CLI, or UI, select any file or dataset in the system to find out which commit added this object:

Similar to “git blame”, this helps you understand which objects were changed and by whom:
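Programmatically, a minimal sketch of the same lookup with the high-level lakefs Python SDK might look like this (assuming the SDK’s log call accepts the commit log API’s objects filter; the repository and object path are hypothetical):

```python
import lakefs

main = lakefs.repository("prod-lake").branch("main")

# List the commits that touched a specific object; the most recent entry
# identifies who last changed it and in which commit.
for commit in main.log(objects=["tables/daily_events/part-0001.parquet"]):
    print(commit.id, commit.committer, commit.message)
```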

lakeFS Commit History and Metadata
Next, you can choose the commit itself to get its full context, including which other datasets/objects were changed as part of the same commit.
The commit also includes programmable metadata key-value pairs, which can be used to link commits to the associated code version or to an individual pipeline run.
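Here is a sketch of attaching such metadata at commit time, again via the high-level lakefs Python SDK (the metadata keys and values are hypothetical conventions, not anything lakeFS mandates):

```python
import lakefs

branch = lakefs.repository("prod-lake").branch("etl-daily")

# Record which code version and pipeline run produced this data version,
# so the commit can later be correlated with them during lineage queries.
branch.commit(
    message="Daily aggregation of raw events",
    metadata={
        "code_version": "abc123def",         # hypothetical Git SHA of the ETL code
        "pipeline_run_id": "run-2024-01-01",  # hypothetical orchestrator run ID
    },
)
```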

When you click on the parent commit, lakeFS will show the full set of original data that was used to create the data in this commit:

Source: lakeFS
This is a simple way to expose the entire journey of “raw data” → “transformation code” → “generated data”, which is otherwise incredibly complicated to trace.
5. Auditing
When managing data, it is critical to have a complete audit trail of actions: what data is in the lake, who is using it, and how much they’ve used it. This basic requirement is necessary to prevent and detect data leakage and to ensure privacy regulations are enforced.
lakeFS Audit Trail
lakeFS Cloud provides a fully searchable, exportable audit trail of all actions performed against the data: what resource was accessed, when, with which API, and against what part of the data. This is not only required for governance but is also an easy way to integrate with monitoring systems to quickly identify errors, outages, or security breaches.
Summary
You can increase a data lake’s value by incorporating strong data governance, metadata management, data quality, and data security protocols into the environment’s design, loading, and maintenance. Otherwise, your data lake may become more of a data swamp.
The complexity of data lakes makes data governance difficult for companies to achieve. Taking a Git-like approach to data helps apply well-established engineering best practices from code to data.

