Introduction
Data lake platforms often lack simple ways to enforce data governance. This is especially challenging because data governance requirements are complicated to begin with, even without the added complexities of managing data in a data lake. Enforcing them is therefore an expensive, time-consuming, ongoing effort requiring continuous management, typically at the expense of data engineering or other DevOps activities that could benefit the business.
In this article, I’ll review some of the ways lakeFS helps achieve governance at scale in a simple, quick, and straightforward way.
We will review the following benefits of lakeFS:
- Role-Based Access Control
- Immediate Backup and Restore of the Entire Data Lake
- Branch-Aware Managed Garbage Collection
- Data Lineage
- Auditing
What is data lake governance?
The goal of data lake governance is to apply policies, standards, and processes to the data. This allows creating high-quality data and ensuring that it’s used appropriately across the organization.
Due to its flat architecture and less structured nature compared to a data warehouse, a data lake poses challenges to all areas of enterprise data management, including governance. However, if you want spot-on analytics output from your data lake, you need to include it in your overall data governance initiative.
What are the benefits of data lake governance?
Data lake governance improves the data quality and increases data usage for business decision-making, leading to operational improvements, better-informed business strategies, and stronger financial performance.
Here are a few specific benefits organizations see from applying data lake governance:
Greater access to data for advanced analytics
A well-governed data lake helps data scientists and other analytics team members find the data they need for machine learning, predictive analytics, and other data-driven applications.
More efficient data preparation for analytics
When data resides in a data lake, it’s usually left in a raw form until a team needs it for a specific application. The data preparation process often becomes time-consuming – unless you have a well-governed environment. Cleaning data upfront eliminates the need to fix errors and problems later on.
Reduced data management costs
If your data lake spirals out of control, you’ll have to spend more on data processing and storage resources. Data lake governance lowers data management costs thanks to improved data accuracy, consistency, and cleanliness.
Airtight security and regulatory compliance for sensitive data
If your data lake includes customer data used for analytics purposes by marketing or sales teams, you might be dealing with sensitive information. Strong data governance lets you ensure that all data is properly secured and isn’t exposed to any misuse.
Is it possible to achieve data governance in data lakes?
Yes, it’s possible to achieve data lake governance – but it comes with specific challenges. Companion disciplines to data governance include data quality, metadata management and data security, all of which factor into data lake governance and the problems it poses.
Some of the most common data governance challenges teams encounter in data lakes are identifying and maintaining correct data sources, lack of coordination around data governance and quality, issues with metadata management, and potential conflicts among data lake users.
lakeFS Governance Solutions
Role-Based Access Control
Access to and modification of the data should be restricted according to users’ roles and responsibilities. Data lakes typically gather data from many different systems and expose data to other systems and users.
Restricting this access, often for subsets of the data, therefore becomes a complicated challenge. The simplest example: a developer shouldn’t necessarily have access to production data. At the same time, you might want to maintain a separate production environment for Disaster Recovery (DR), stage fresh data before promoting it to production, or train new ML models on production data. In addition, different teams might have access to different parts of data that is deliberately managed together in the data lake. The complexity quickly builds up.
Simple yet Flexible RBAC in lakeFS
lakeFS provides Role-Based Access Control (RBAC) that is granular down to the individual branch. Since lakeFS restructures the data on the object store, all access goes through lakeFS. Every action in the system – be it an API request, UI interaction, S3 Gateway call, or CLI command – requires a set of actions to be allowed for one or more resources.
A simple-to-operate, IAM-like authorization mechanism controls which groups or individual users have access to which branches in specific repositories, and what types of actions (reads/writes of data and metadata, etc.) they can execute.
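To make this concrete, here is a minimal sketch of a policy document in the JSON shape the lakeFS authorization docs describe, written as a Python dict. The repository name (“analytics”) and the branch split are hypothetical, and the exact action names should be verified against the lakeFS documentation:

```python
# A sketch of a lakeFS RBAC policy (shown as a Python dict in the JSON
# shape the authorization API accepts). Repository "analytics" and the
# branch names are hypothetical; verify action names against the docs.
dev_policy = {
    "id": "AnalyticsDevReadWrite",
    "statement": [
        {
            # Developers may read and write objects across the repository...
            "effect": "allow",
            "action": ["fs:ReadObject", "fs:WriteObject", "fs:ListObjects"],
            "resource": "arn:lakefs:fs:::repository/analytics/object/*",
        },
        {
            # ...but may not commit to the production branch.
            "effect": "deny",
            "action": ["fs:CreateCommit"],
            "resource": "arn:lakefs:fs:::repository/analytics/branch/main",
        },
    ],
}
```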
This becomes especially powerful when combined with lakeFS hooks – enforcing, for example, removal of Personally Identifiable Information (PII) on every branch out of production that will be used for isolated development or ETL testing.
Immediate Backup and Restore of the Entire Data Lake
A key component of data governance is the ability to restore service in case of an outage. In the world of data lakes, that often also means restoring service when production data gets corrupted.
One option is to back up the storage regularly. The challenges with this approach are that (1) it’s expensive, (2) it’s time-consuming, and (3) you can only restore to the point in time of the backup.
Another option is to adopt an open table format such as Delta Lake, Hudi, or Iceberg. With open table formats, you have a historical audit of all changes per table and can run operations or query/restore a table at a specific point in time. The challenges here are that (1) you often need to restore several tables together across multi-table transactions, and (2) it’s limited to structured data.
Backup And Restore With lakeFS
Usually, most of the files in a data lake are static, while a smaller subset of objects is added or removed on a regular basis. lakeFS uses a copy-on-write mechanism to avoid data duplication. For example, creating a new branch is a metadata-only operation: no objects are actually copied. Only when an object changes does lakeFS create another version of it in storage.
Beyond the storage savings and performance advantages of this design, the mechanism allows speedy recovery of historical commits without taking multiple snapshots of the lake, since lakeFS deduplicates the objects in the data lake over time.
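As a minimal sketch using the lakeFS Python SDK (`pip install lakefs`), both a “backup” and a “restore” reduce to branch operations. The repository name and commit ID are hypothetical, credentials are assumed to be configured (e.g., via ~/.lakectl.yaml), and the exact method signatures should be checked against the SDK docs:

```python
# Minimal sketch with the lakeFS Python SDK. Assumes lakeFS credentials
# are already configured; repository name and commit ID are hypothetical.
import lakefs

repo = lakefs.repository("data-lake")

# "Backing up" is a metadata-only branch creation: no objects are copied.
backup = repo.branch("backup-2024-06-01").create(source_reference="main")

# "Restoring" is just as fast: branch out from any historical commit.
restored = repo.branch("restored-main").create(source_reference="a1b2c3d4e5")
```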
Branch-Aware Managed Garbage Collection
When managing data, you need to balance the ability to restore and recover historical data on one hand against deleting old data on the other. Data deletion might be required simply for cost reduction (e.g., should I keep every intermediate step of an ETL process, or only be able to restore the output? Should I also be able to restore the raw data?). Furthermore, regulations like GDPR might require deleting users’ information throughout your entire data stack, including older versions of the data.
Simple Configurations for Advanced GC Capabilities
lakeFS provides garbage collection (GC) functionality that is branch- and repository-aware (available as a managed service in lakeFS Cloud). This gives you a simple way to control how far back data should be reproducible for different types of data.
In the example below, production data is retained (and easily restored) for 7 days, while features data is retained for only 3 days.
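Here is a sketch of the matching GC rules, following the JSON structure described in the lakeFS garbage collection documentation (branch names are illustrative):

```python
# GC retention rules for the example above: data on "main" (production)
# stays restorable for 7 days, "features" for 3. Structure follows the
# lakeFS GC rules format; branch names are illustrative.
gc_rules = {
    "default_retention_days": 3,
    "branches": [
        {"branch_id": "main", "retention_days": 7},
        {"branch_id": "features", "retention_days": 3},
    ],
}
```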
Some companies must keep years of historical data to satisfy various regulations. In those cases, an intelligent policy (deduplicating over time and deleting unnecessary files) can have a tremendous impact on your storage costs.
GDPR Support
Traditional approaches such as backup, or archiving to lower storage costs, struggle to support the user’s right to be forgotten under the General Data Protection Regulation (GDPR).
Using a system like lakeFS, you have the flexibility to choose: fully support the right to be forgotten at the cost of some reproducibility, or keep reproducibility for everything except the data that must be forgotten.
Read more about it here.
Data Lineage
Data lineage is a critical component of a data governance strategy for data lakes. As data lakes grow ever more complex, companies face manageability challenges in capturing lineage in a cost-effective and consistent manner.
In data lakes, data lineage often refers both to tracking the transformation of data (i.e., what data was used to create the data) and to the associated code (i.e., what code was used to generate it).
lakeFS provides powerful lineage through blame functionality for every object in the data lake: each object is tied to the commit that introduced it, with metadata available for that commit, any other commit, and the relationships between those commits.
lakeFS Blame
Via the API, CLI, or UI, select any file or dataset in the system to find out which commit added the object. Similar to “git blame”, this helps you understand which objects were changed and by whom.
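A hedged sketch of the same lookup via the Python SDK, assuming the SDK’s log call forwards the commit-log API’s object filter (the repository name and object path are hypothetical):

```python
# "lakeFS blame": list the commits that touched a given object, newest
# first. Assumes log() accepts the commit-log API's `objects` filter;
# repository and object path are hypothetical.
import lakefs

repo = lakefs.repository("data-lake")
for commit in repo.branch("main").log(objects=["datasets/users.parquet"]):
    print(commit.id[:12], commit.committer, commit.message)
```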
lakeFS Commit History and Metadata
Next, you can open the commit itself to get its full context, including which other datasets and objects were changed as part of the same commit.
The commit also includes programmable metadata: key-value pairs that can reference the associated code version or an individual pipeline run.
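For example, a pipeline can stamp every commit it produces with the code version and run that generated the data. A sketch using the lakeFS Python SDK, where the key names are arbitrary and the values hypothetical:

```python
# Attach metadata tying a commit to the code version and pipeline run
# that produced it. Key names are arbitrary; values are hypothetical.
import lakefs

repo = lakefs.repository("data-lake")
repo.branch("etl-daily").commit(
    message="Daily user aggregation",
    metadata={
        "git_commit": "9fceb02",                    # code that generated the data
        "pipeline_run_id": "scheduled__2024-06-01", # orchestrator run ID
    },
)
```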
When you click on the parent commit, lakeFS shows the full set of original data that was used to create the data in this commit.
This is a simple way to capture the entire journey of raw data → transformation code → generated data, which is otherwise incredibly complicated to troubleshoot.
Auditing
When managing data, it is critical to have a complete audit trail of actions: what data is in the lake, who is using it, and how much they’ve used. This basic requirement is necessary to prevent or detect data leakage and to ensure privacy regulations are enforced.
lakeFS Audit Trail
lakeFS Cloud provides a fully searchable, exportable audit trail of all actions performed against the data: which resource was accessed, when, through which API, and against which part of the data. This is not only required for governance; it is also an easy way to integrate with monitoring systems to quickly identify errors, outages, or security breaches.
Summary
The complexity of data lakes presents difficult challenges for companies trying to achieve data governance. Taking a Git-like approach to data helps apply well-established engineering best practices from code to data.
For more information, visit https://lakefs.io.