Amit Kesarwani

Last updated on February 13, 2025

Metadata enforcement is a broad term that can refer to different aspects of managing and controlling metadata. Let’s explore a few key areas:

Understanding Metadata Enforcement

1 – Data Privacy and Protection:

  • Compliance with regulations: GDPR, CCPA, and other data privacy laws often mandate specific metadata handling practices.
  • Preventing data breaches: Ensuring accurate and complete metadata can aid in incident response and recovery.
  • Protecting sensitive information: Identifying and controlling metadata associated with sensitive data is crucial.

2 – Data Governance and Quality:

  • Data standardization: Enforcing consistent metadata schemas and standards improves data usability.
  • Data quality: Ensuring metadata accuracy and completeness enhances data reliability.
  • Metadata lifecycle management: Controlling metadata creation, modification, and deletion.

3 – Legal and Compliance:

  • Discoverability: Metadata can be essential in legal proceedings.
  • Auditability: Metadata can be used to track data changes and comply with regulations.

Challenges in Metadata Enforcement

  • Metadata complexity: The variety and volume of metadata can be overwhelming.
  • Data silos: Metadata inconsistencies can arise from data stored in different systems.
  • Changing regulations: Data privacy laws and industry standards evolve rapidly.
  • Technological limitations: Tools and technologies for metadata management may be insufficient.

Strategies for Effective Metadata Enforcement

  • Metadata management policies: Develop clear guidelines for metadata creation, usage, and retention.
  • Metadata governance: Establish a centralized authority for metadata oversight.
  • Metadata audits: Regularly assess metadata quality and compliance.
  • Metadata tools: Utilize software to automate metadata enforcement and improve efficiency.
  • Employee training: Educate staff about the importance of metadata and best practices.

We will focus on utilizing software to automate metadata enforcement, so that data can’t be promoted to production if metadata policies are not met.

Promoting software code to production is a straightforward process with tools like Git, but promoting data to production is not. In this article, we will examine how lakeFS enables metadata enforcement when promoting data to production, with the help of Git-like Actions.

What is lakeFS, and how does it work for Metadata Enforcement?

lakeFS is an open-source data versioning system that lets you work with data the way you would work with code. 

lakeFS sits on top of any object storage, from Amazon S3 to on-premises storage, and provides it with a Git-like interface. You can create branches in your storage bucket, make changes to data, commit those changes, promote data to production, or revert to a previous version if something doesn’t work.

lakeFS Actions, similar to GitHub Actions, allow you to automate any process, including metadata enforcement. Data is merged to production only if all metadata checks pass.
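
To make this concrete, here is a minimal sketch of that branch-commit-merge flow using the high-level lakeFS Python SDK (the lakefs package). The repository name, path, and data are hypothetical, and the client is assumed to already be configured with your lakeFS endpoint and credentials:

import lakefs

# A hypothetical repository that already exists on your object store
repo = lakefs.Repository("example-repo")

# Create an isolated branch from main and change data on it
branch = repo.branch("new-sales-data").create(source_reference="main")
branch.object("datasets/sales/2025-02.parquet").upload(data=b"...parquet bytes...")

# Commit the change on the branch
branch.commit(message="Add February sales data")

# Promote to production by merging into main
# (a promoted change can later be rolled back with Branch.revert if needed)
branch.merge_into(repo.branch("main"))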

How does this work in practice? Let’s break it down with a demo.

Metadata Enforcement Demo

Step 1: Run a lakeFS Server

You will run lakeFS and the Jupyter demo notebook in Docker containers. Clone the lakeFS-samples Git repository and start the lakeFS server along with the Jupyter notebook server:

git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples
docker compose --profile local-lakefs up

Now go to the lakeFS UI (http://localhost:8000/) in your browser and log in using the following credentials:

Access Key ID: AKIAIOSFOLKFSSAMPLES

Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
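
If you prefer to script against the local server instead of (or in addition to) the UI, the same credentials can be used with the high-level lakeFS Python SDK. This is an illustrative sketch and assumes the lakefs package is installed (pip install lakefs):

import lakefs
from lakefs.client import Client

# Point a client at the local lakeFS server using the credentials above
clt = Client(
    host="http://localhost:8000",
    username="AKIAIOSFOLKFSSAMPLES",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
)

# List the repositories visible to this user, just to verify the connection
for repo in lakefs.repositories(client=clt):
    print(repo)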


Step 2: Open Jupyter Demo Notebook

Access the Jupyter notebook UI by going to http://localhost:8888/ in your browser and open the hooks-metadata-validation notebook from the list of notebooks in the Jupyter UI.


Step 3: Run Demo Notebook

You can run the demo notebook as is, without making any changes, but let’s look at how lakeFS Actions and metadata enforcement work in this demo.

Metadata is enforced by using a software program. This demo uses the Lua scripting language for this purpose: the lakeFS server uses an embedded Lua VM to execute Lua programs via Lua Hooks. You can also write this program in any programming language and call it via webhooks. For example, the following Lua script checks commit metadata to validate that mandatory metadata fields are present and that their values match the required patterns:

regexp = require("regexp")

for k, props in pairs(args) do
    -- let's see that we indeed have this key in our metadata
    local current_value = action.commit.metadata[k]
    if current_value == nil then
        error("missing mandatory metadata field: " .. k)
    end
    if props.pattern and not regexp.match(props.pattern, current_value) then
        error("current value for commit metadata field " .. k .. " does not match pattern: " .. props.pattern .. " - got: " .. current_value)
    end
end

You can also review the dataset metadata validation Lua script (dataset_validator.lua), which validates the existence of mandatory metadata describing a dataset.

You execute these Lua scripts via lakeFS Hooks. You will upload the following hooks configuration YAML file to check for mandatory metadata before data is merged into the main branch. The hooks config file must be uploaded under the “_lakefs_actions” prefix in the lakeFS repository:

name: Validate Commit Metadata and Dataset Metadata Fields
on:
    pre-merge:
        branches:
            - main
hooks:
    - id: check_commit_metadata
      type: lua
      properties:
          script_path: scripts/commit_metadata_validator.lua
          args:
              notebook_url: {"pattern": "github.com/.*"}
              spark_version: {}
    - id: validate_datasets
      type: lua
      properties:
          script_path: scripts/dataset_validator.lua
          args:
              prefix: 'datasets/'
              metadata_file_name: dataset_metadata.yaml
              fields:
                  - name: contains_pii
                    required: true
                    type: boolean
                  - name: approval_link
                    required: true
                    type: string
                    match_pattern: 'https?:\/\/.*'
                  - name: rank
                    required: true
                    type: number
                  - name: department
                    type: string
                    choices: ['hr', 'it', 'other']
The above configuration ensures that the metadata included with a lakeFS commit contains the “notebook_url” and “spark_version” fields, and that the “notebook_url” value matches the required pattern (a github.com URL).

It also ensures that a metadata file named “dataset_metadata.yaml” exists either in the same directory as the modified dataset or in any parent directory, and that this file contains all required metadata fields. Values of certain fields must match a pattern or come from a defined list of values, e.g. the department value must be one of ‘hr’, ‘it’, or ‘other’.
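
Putting it all together, here is a hedged sketch of the promotion flow that the demo automates, again using the high-level lakeFS Python SDK. The repository name, local file paths, and metadata values are illustrative, and the sample notebook may attach the commit metadata slightly differently (for example, to the merge operation itself):

import lakefs

repo = lakefs.Repository("example-repo")   # hypothetical repository
branch = repo.branch("ingest-sales").create(source_reference="main")

# Upload the hooks configuration and Lua scripts into the repository itself
# (the local paths below are hypothetical; the YAML is the config shown above)
branch.object("_lakefs_actions/metadata-validation.yaml").upload(
    data=open("hooks/metadata-validation.yaml", "rb").read())
branch.object("scripts/commit_metadata_validator.lua").upload(
    data=open("hooks/commit_metadata_validator.lua", "rb").read())
branch.object("scripts/dataset_validator.lua").upload(
    data=open("hooks/dataset_validator.lua", "rb").read())

# Add a dataset plus the mandatory dataset_metadata.yaml under the 'datasets/' prefix
branch.object("datasets/sales/q1.parquet").upload(data=b"...parquet bytes...")
branch.object("datasets/sales/dataset_metadata.yaml").upload(data=b"""\
contains_pii: false
approval_link: https://example.com/approvals/123
rank: 1
department: it
""")

# The commit carries the mandatory metadata fields required by the hook
branch.commit(
    message="Add Q1 sales dataset",
    metadata={
        "notebook_url": "https://github.com/treeverse/lakeFS-samples",
        "spark_version": "3.5.0",
    },
)

# Pre-merge hooks on main run here; the merge is rejected if any check fails
branch.merge_into(repo.branch("main"))

If the commit metadata or any dataset_metadata.yaml file fails validation, lakeFS rejects the merge and surfaces the error raised by the Lua script, so you can see which check failed.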

lakeFS automates metadata enforcement, and if metadata policies are not met, data can’t be promoted to production:

  • Metadata management policies: lakeFS lets you define metadata policies via a simple and configurable YAML file.
  • Metadata governance: lakeFS acts as a centralized authority for metadata oversight.
  • Metadata audits: lakeFS checks metadata quality and compliance on every promotion.
  • Metadata tools: lakeFS uses hooks and scripts to automate metadata enforcement.

Want to learn more?

If you have questions about lakeFS, then drop us a line or join the discussion on the lakeFS Slack channel.
