Oz Katz
December 20, 2022

TLDR – lakeFS hooks just became so much easier to use.
As of v0.87.1, lakeFS includes a Lua Virtual Machine that allows you to run hooks directly in lakeFS, without relying on any external component.

What are lakeFS hooks?

lakeFS allows users to define logic that runs before or after important events in the system. This idea is very powerful – it allows users to build workflows and enforce a set of behaviors that everyone has to abide by. You can think of it as “CI/CD for data”.

This helps create explicit standards around changes to data, enforce quality requirements, and define strong data contracts between teams.

See this example:


Here, a pre-merge hook is defined on the main branch – it ensures that only a set of allowed formats can ever end up on the main branch. While simple, this is a very powerful guarantee! No more slow-to-process CSV or JSON files in production! Of course, real-world examples could be much more complex – validating data or schema changes, or enforcing other quality standards.
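As a sketch, such a pre-merge hook definition might look like the following. The hook id, webhook URL, and parameter names here are illustrative placeholders, not taken from a real deployment; the on/hooks structure follows the lakeFS actions format:

```yaml
name: allowed formats on main
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: format_validator
    type: webhook
    properties:
      # Illustrative endpoint: a web server you operate that inspects the
      # changed objects' file extensions and fails the merge on disallowed formats
      url: "http://format-checker.example.com/webhooks/format"
      query_params:
        allow: ["parquet", "orc"]
```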

Hooks have been around for a while now (initially released in lakeFS v0.33.0, in March 2021), first supporting execution of webhooks and later adding Airflow hooks that trigger DAGs on certain events.

Making hooks more approachable

While webhooks and Airflow hooks are great and can be very powerful, they have one clear downside: they add a dependency on an external system.

With webhooks, users need to run and maintain a web server and make it accessible to lakeFS. This is not always possible, for both security and operational reasons.

Airflow, on the other hand, isn’t always part of the stack, and while it makes sense to trigger DAGs after a commit or a merge, it isn’t well suited to running synchronously in order to validate changes.

This meant that, for most users, experimenting with hooks and grasping their usefulness was hard – especially when first getting started or running a POC.

Introducing Lua Hooks

As of lakeFS v0.87.1, users can now configure self-contained hooks that require no integration and no external components. lakeFS now embeds a Lua Virtual Machine that can execute a controlled, secure set of operations that allow users to build hooks with minimal effort.

name: pre commit metadata field check
on:
  pre-commit:
    branches:
      - main
      - dev
hooks:
  - id: ensure_commit_metadata
    type: lua
    properties:
      args:
        notebook_url: {"pattern": "my-jupyter.example.com/.*"}
        spark_version: {}
      script: |
        regexp = require("regexp")
        for k, props in pairs(args) do
          local current_value = action.commit.metadata[k]
          if current_value == nil then
            error("missing mandatory metadata field: " .. k)
          end
          if props.pattern and not regexp.match(props.pattern, current_value) then
            error("commit metadata field " .. k .. " does not match pattern: " .. props.pattern)
          end
        end

In this simple example, we set a pre-commit hook on the dev and main branches. The hook ensures that every commit to those branches contains a spark_version field and a notebook_url field matching a specific pattern.

This is very useful for reproducibility: We can easily trace back every piece of data in the system to the notebook or job that created it!

As you can see, the Lua code required is quite simple and straightforward, even if you’ve never programmed in Lua before.
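If you’d like to experiment with the validation logic outside lakeFS before wiring it into a hook, the same check can be sketched in plain Python. This is a hypothetical local harness, not lakeFS code: `args` stands in for the hook’s `args` block and `commit_metadata` for `action.commit.metadata`.

```python
import re

# Mirrors the hook's `args` block: each key is a mandatory metadata field,
# optionally constrained by a regular expression.
args = {
    "notebook_url": {"pattern": "my-jupyter.example.com/.*"},
    "spark_version": {},
}

def check_commit_metadata(commit_metadata, args):
    """Raise ValueError when a mandatory metadata field is missing or malformed."""
    for field, props in args.items():
        value = commit_metadata.get(field)
        if value is None:
            raise ValueError("missing mandatory metadata field: " + field)
        pattern = props.get("pattern")
        # Like an unanchored Go-style regexp match, re.search matches anywhere
        # in the string.
        if pattern and not re.search(pattern, value):
            raise ValueError(
                "commit metadata field %s does not match pattern: %s" % (field, pattern)
            )

# A commit carrying both fields, with a matching notebook_url, passes silently:
check_commit_metadata(
    {"spark_version": "3.3.1", "notebook_url": "my-jupyter.example.com/notebooks/etl"},
    args,
)
```

A commit missing spark_version, or with a notebook_url pointing elsewhere, raises an error – exactly the behavior the Lua hook uses to reject the commit.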

Real world example

Let’s show a more complex example to understand the value:

name: auto symlink
on:
  post-create-branch:
    branches: ["view-*"]
  post-commit:
    branches: ["view-*"]
hooks:
  - id: symlink_creator
    type: lua
    properties:
      script_path: scripts/s3_hive_manifest_exporter.lua
      args:
        # Export configuration
        aws_access_key_id: "AKIA..."
        aws_secret_access_key: "..."
        aws_region: us-east-1
        export_bucket: example-repo
        export_path: lakefs_tables
        sources:
          - tables/my-table/

This hook does something pretty cool! When a user creates a branch whose name matches view-* (or commits to one), lakeFS automatically exports a set of symlink.txt files to S3, allowing Trino and AWS Athena to access data on the newly created branch with no additional configuration required!
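For context: a Hive-style symlink manifest is simply a plain-text object listing the storage locations of the underlying data files, which engines like Trino and Athena can read in place of listing a directory. An exported manifest might look roughly like this – the paths below are purely illustrative:

```text
# e.g. s3://example-repo/lakefs_tables/my-table/symlink.txt (illustrative)
s3://my-storage-bucket/path/to/data-file-1.parquet
s3://my-storage-bucket/path/to/data-file-2.parquet
```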

As you can see, we don’t embed the Lua script directly into the hook .yaml file – instead we reference a Lua script with script_path – a relative path to an object stored in the same repository.

You can see the script in the lakeFS examples/hooks/ directory on GitHub.

Getting started

Getting started with Lua hooks requires no additional configuration. If you’re running lakeFS v0.87.1 or above, all you have to do is configure a new hook of type: lua – lakeFS will automatically recognize and execute it for you.

If you don’t have a running lakeFS environment, you can spin up a local lakeFS instance with this one-liner:

docker run --pull always -p 8000:8000 treeverse/lakefs run --local-settings

Alternatively, see the quickstart guide for more ways to get started with lakeFS!

More information

To learn more about how to use hooks, see the hooks reference in the documentation. Additionally, the examples/hooks/ directory is a good starting point – feel free to copy and adapt these examples to your use case.

We’d love your feedback! Are hooks useful for you? How could they be improved? What other features would you like to see? Join our community and the discussion on the lakeFS Slack.
