Amit Kesarwani

Last updated on February 13, 2025

Metadata enforcement is a broad term that can refer to different aspects of managing and controlling metadata. Let’s explore a few key areas:

Understanding Metadata Enforcement

1 – Data Privacy and Protection:

  • Compliance with regulations: GDPR, CCPA, and other data privacy laws often mandate specific metadata handling practices.
  • Preventing data breaches: Ensuring accurate and complete metadata can aid in incident response and recovery.
  • Protecting sensitive information: Identifying and controlling metadata associated with sensitive data is crucial.

2 – Data Governance and Quality:

  • Data standardization: Enforcing consistent metadata schemas and standards improves data usability.
  • Data quality: Ensuring metadata accuracy and completeness enhances data reliability.
  • Metadata lifecycle management: Controlling metadata creation, modification, and deletion.

3 – Legal and Compliance:

  • Discoverability: Metadata can be essential in legal proceedings.
  • Auditability: Metadata can be used to track data changes and comply with regulations.

Challenges in Metadata Enforcement

  • Metadata complexity: The variety and volume of metadata can be overwhelming.
  • Data silos: Metadata inconsistencies can arise from data stored in different systems.
  • Changing regulations: Data privacy laws and industry standards evolve rapidly.
  • Technological limitations: Tools and technologies for metadata management may be insufficient.

Strategies for Effective Metadata Enforcement

  • Metadata management policies: Develop clear guidelines for metadata creation, usage, and retention.
  • Metadata governance: Establish a centralized authority for metadata oversight.
  • Metadata audits: Regularly assess metadata quality and compliance.
  • Metadata tools: Utilize software to automate metadata enforcement and improve efficiency.
  • Employee training: Educate staff about the importance of metadata and best practices.

We will focus on utilizing software to automate metadata enforcement, so that data can’t be promoted to production if metadata policies are not met.

Promoting software code to production is a straightforward process with tools like Git, but promoting data to production is not. In this article, we will examine how lakeFS enables metadata enforcement when promoting data to production, with the help of Git-like Actions.

What is lakeFS, and how does it work for Metadata Enforcement?

lakeFS is an open-source data versioning system that lets you work with data the way you would work with code. 

lakeFS sits on top of any object storage, from Amazon S3 to on-premises storage, and provides it with a Git-like interface. You can create branches in your storage bucket, make changes to data, commit those changes, promote data to production, or revert to a previous version if something doesn’t work.

lakeFS Actions, similar to GitHub Actions, allow you to automate any process, including metadata enforcement. Data is merged to production only if all metadata checks pass.
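
To make this concrete, here is a minimal sketch of that branch-commit-merge flow using the high-level lakeFS Python SDK (the lakefs package). The repository name, path, and data are hypothetical, and the client is assumed to already be configured with your lakeFS endpoint and credentials:

import lakefs

# A hypothetical repository that already exists on your object store
repo = lakefs.Repository("example-repo")

# Create an isolated branch from main and change data on it
branch = repo.branch("new-sales-data").create(source_reference="main")
branch.object("datasets/sales/2025-02.parquet").upload(data=b"...parquet bytes...")

# Commit the change on the branch
branch.commit(message="Add February sales data")

# Promote to production by merging into main
# (a promoted change can later be rolled back with Branch.revert if needed)
branch.merge_into(repo.branch("main"))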

How does this work in practice? Let’s break it down with a demo.

Metadata Enforcement Demo

Step 1: Run a lakeFS Server

You will run lakeFS and the Jupyter demo notebook in Docker containers. Clone the lakeFS-samples Git repository and start the lakeFS server along with the Jupyter notebook server:

git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples
docker compose --profile local-lakefs up

Now go to the lakeFS UI (http://localhost:8000/) in your browser and log in using the following credentials:

Access Key ID: AKIAIOSFOLKFSSAMPLES

Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
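
If you prefer to script against the local server instead of (or in addition to) the UI, the same credentials can be used with the high-level lakeFS Python SDK. This is an illustrative sketch and assumes the lakefs package is installed (pip install lakefs):

import lakefs
from lakefs.client import Client

# Point a client at the local lakeFS server using the credentials above
clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFOLKFSSAMPLES",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

# List the repositories visible to this user, just to verify the connection
for repo in lakefs.repositories(client=clt):
    print(repo)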


Step 2: Open Jupyter Demo Notebook

Access the Jupyter notebook UI by going to http://localhost:8888/ in your browser and open the hooks-metadata-validation notebook from the list of notebooks in the Jupyter UI.


Step 3: Run Demo Notebook

You can run the demo notebook as is, without making any changes, but let’s look at how lakeFS Actions and metadata enforcement work in this demo.

Metadata is enforced by using a software program. This demo uses the Lua scripting language for this purpose: the lakeFS server uses an embedded Lua VM to execute Lua programs via Lua Hooks. You can also write this program in any programming language and call it via webhooks. For example, the following Lua script checks commit metadata to validate that mandatory metadata fields are present and that their values match the required patterns:

regexp = require("regexp")

for k, props in pairs(args) do
    -- let's see that we indeed have this key in our metadata
    local current_value = action.commit.metadata[k]
    if current_value == nil then
        error("missing mandatory metadata field: " .. k)
    end
    if props.pattern and not regexp.match(props.pattern, current_value) then
        error("current value for commit metadata field " .. k .. " does not match pattern: " .. props.pattern .. " - got: " .. current_value)
    end
end

You can also review the dataset metadata validation Lua script (dataset_validator.lua), which validates the existence of mandatory metadata describing a dataset.

You execute these Lua scripts via lakeFS Hooks. You will upload the following hooks configuration YAML file to check for mandatory metadata before data is merged into the main branch. The hooks config file must be uploaded under the “_lakefs_actions” prefix in the lakeFS repository:

name: Validate Commit Metadata and Dataset Metadata Fields
on:
    pre-merge:
        branches:
            - main
hooks:
    - id: check_commit_metadata
      type: lua
      properties:
          script_path: scripts/commit_metadata_validator.lua
          args:
              notebook_url: {"pattern": "github.com/.*"}
              spark_version: {}
    - id: validate_datasets
      type: lua
      properties:
          script_path: scripts/dataset_validator.lua
          args:
              prefix: 'datasets/'
              metadata_file_name: dataset_metadata.yaml
              fields:
                  - name: contains_pii
                    required: true
                    type: boolean
                  - name: approval_link
                    required: true
                    type: string
                    match_pattern: 'https?:\/\/.*'
                  - name: rank
                    required: true
                    type: number
                  - name: department
                    type: string
                    choices: ['hr', 'it', 'other']
The above configuration ensures that the metadata included with a lakeFS commit contains the “notebook_url” and “spark_version” fields, and that the “notebook_url” value matches the required pattern (a github.com URL).

It also ensures that a metadata file named “dataset_metadata.yaml” exists either in the same directory as the modified dataset or in any parent directory, and that this file contains all required metadata fields. Values of certain fields must match a pattern or come from a defined list of values, e.g. the department value must be one of ‘hr’, ‘it’, or ‘other’.
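
Putting it all together, here is a hedged sketch of the promotion flow that the demo automates, again using the high-level lakeFS Python SDK. The repository name, local file paths, and metadata values are illustrative, and the sample notebook may attach the commit metadata slightly differently (for example, to the merge operation itself):

import lakefs

repo = lakefs.Repository("example-repo")   # hypothetical repository
branch = repo.branch("ingest-sales").create(source_reference="main")

# Upload the hooks configuration and Lua scripts into the repository itself
# (the local paths below are hypothetical; the YAML is the config shown above)
branch.object("_lakefs_actions/metadata-validation.yaml").upload(
    data=open("hooks/metadata-validation.yaml", "rb").read())
branch.object("scripts/commit_metadata_validator.lua").upload(
    data=open("hooks/commit_metadata_validator.lua", "rb").read())
branch.object("scripts/dataset_validator.lua").upload(
    data=open("hooks/dataset_validator.lua", "rb").read())

# Add a dataset plus the mandatory dataset_metadata.yaml under the 'datasets/' prefix
branch.object("datasets/sales/q1.parquet").upload(data=b"...parquet bytes...")
branch.object("datasets/sales/dataset_metadata.yaml").upload(data=b"""\
contains_pii: false
approval_link: https://example.com/approvals/123
rank: 1
department: it
""")

# The commit carries the mandatory metadata fields required by the hook
branch.commit(
    message="Add Q1 sales dataset",
    metadata={
        "notebook_url": "https://github.com/treeverse/lakeFS-samples",
        "spark_version": "3.5.0",
    },
)

# Pre-merge hooks on main run here; the merge is rejected if any check fails
branch.merge_into(repo.branch("main"))

If the commit metadata or any dataset_metadata.yaml file fails validation, lakeFS rejects the merge and surfaces the error raised by the Lua script, so you can see which check failed.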

lakeFS automates metadata enforcement, and if metadata policies are not met, data can’t be promoted to production:

  • Metadata management policies: lakeFS lets you define metadata policies via a simple and configurable YAML file.
  • Metadata governance: lakeFS acts as a centralized authority for metadata oversight.
  • Metadata audits: lakeFS checks metadata quality and compliance on every promotion.
  • Metadata tools: lakeFS uses hooks and scripts to automate metadata enforcement.

Want to learn more?

If you have questions about lakeFS, then drop us a line or join the discussion on the lakeFS Slack channel.
