In the ever-evolving world of data, lakeFS continues to push the boundaries of what’s possible in managing data lakes. A great example of this is the support that we recently added for writing hooks in Lua.
This feature opens up a whole new realm of possibilities for data lake administrators, enabling them to customize and automate actions based on specific events that occur within their data lake.
In this blog post, we will explore the benefits and potential applications of writing Lua hooks in lakeFS.
What are Hooks in lakeFS?
Hooks in lakeFS provide a flexible way to extend lakeFS functionality and tailor it to specific use cases. They serve as event-driven triggers that allow users to execute custom logic or actions when certain events occur within the data lake environment. If you’re familiar with using GitHub Actions to do things like check code, run tests, and enforce standards, hooks in lakeFS are a very similar concept.
lakeFS supports three different types of hooks: Lua, Webhook, and Airflow. Lua hooks have the benefit of running within an embedded VM in the lakeFS server, so they do not require any external services.
By writing hooks in Lua, users can define their own scripts to be executed during various stages of the data lifecycle: before committing a change or merging it into production, after creating an experimentation branch, when creating a new tag – or pretty much anything that happens in a lakeFS repository.
Why Lua?
Lua is a lightweight, embeddable scripting language known for its simplicity and versatility.
It offers a clean and expressive syntax, making it easy to write and maintain scripts.
It is used across a variety of platforms, from game engines to in-memory databases, to let users plug custom behaviors into the host system.
By choosing Lua as the scripting language for hooks, lakeFS provides users with a powerful yet approachable tool for customizing their data lake workflows.
The Lua implementation is also built with security in mind: because of Lua’s small footprint and customizability, it was adapted to run in a way that isolates its VM from the host’s file system and allows it to run in a narrow and tightly controlled context.
The Power of Lua Hooks in lakeFS
Customized Workflows
Hooks enable users to create tailored workflows that align with their unique data management requirements.
By leveraging Lua’s flexibility, users can define custom actions and logic that are executed automatically in response to specific events.
This level of customization allows for efficient and streamlined data operations.
Automation
Writing hooks in Lua empowers users to automate repetitive tasks or complex operations within the data lake ecosystem.
Whether it’s sending notifications, updating a data catalog, or starting a downstream process when new data arrives – hooks offer an avenue to automate processes and ensure consistent and reliable results.
Zero Ops!
Since Lua hooks are executed within lakeFS, there’s no need to configure or host any additional infrastructure.
Lua hooks are available in lakeFS version 0.87.0 and later, so you can start running Lua hooks immediately!
What can you do with Lua hooks in lakeFS?
Schema Validation
Users can write hooks to perform schema checks, ensuring that incoming data adheres to predefined rules or standards. This can involve validating file formats, checking metadata consistency, and ensuring backwards compatibility.
The included Lua runtime even ships with a Parquet schema reader that could be used to read and parse a Parquet file’s structure and columns.
Schema validation hooks are typically configured to execute pre-merge to a production branch.
This provides a very strong guarantee to downstream consumers of the data: consumers either see validated, correct data, or no data at all. Since merges in lakeFS are atomic, consumers will never be exposed to breaking changes or consistency issues that could arise from wrong or malformed schema.
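As a rough sketch of such a check, the action file below runs before every merge into `main` and fails the merge if a Parquet file is missing a required column. The file name, object path, column names, and branch name are hypothetical, and the `lakefs.get_object` and `parquet.get_schema` calls reflect our understanding of the Lua library that ships with lakeFS, so verify them against the documentation for your version:

```yaml
# contents of `_lakefs_actions/validate_schema.yaml` (hypothetical file name)
---
name: Validate Parquet schema
description: Block merges into main when a required column is missing.
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: check_required_columns
    type: lua
    properties:
      args:
        # Hypothetical object path and column names, for illustration only
        path: tables/users/part-0000.parquet
        required_columns: ["user_id", "created_at"]
      script: |
        -- Module names are assumed from the lakeFS Lua library
        local lakefs = require("lakefs")
        local parquet = require("encoding/parquet")

        -- In a pre-merge hook, source_ref is the ref being merged in
        local code, content = lakefs.get_object(action.repository_id, action.source_ref, args.path)
        if code ~= 200 then
          error("could not read " .. args.path .. ": HTTP " .. tostring(code))
        end

        -- Collect the column names found in the file
        local present = {}
        for _, column in ipairs(parquet.get_schema(content)) do
          present[column.name] = true
        end

        -- Raising an error fails the hook, which blocks the merge
        for _, name in ipairs(args.required_columns) do
          if not present[name] then
            error("required column '" .. name .. "' is missing from " .. args.path)
          end
        end
```

Checking a single, well-known file keeps the sketch short; a production hook would more likely walk every file changed by the merge.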
Ensuring privacy and compliance
Building on the previous use case, it is possible to define rules for where personally identifiable information (PII) may be located.
A common example would be to make sure that all tables containing PII (recognizable by column names such as user_name, email, ssn, …) are stored under a /private prefix. As a pre-merge hook, we can ensure this is enforced on our production branch with only a few lines of code.
In conjunction with the strong RBAC model in lakeFS, we can then define access control rules that only allow specific personnel to access such sensitive data.
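Here is one way the location rule could be sketched as a pre-merge hook. It walks the objects that a merge would change and fails if a Parquet file outside the private prefix contains a PII column. The file name, prefix, and column list are hypothetical, and the `lakefs.diff_refs`, `lakefs.get_object`, `parquet.get_schema`, and `strings` helpers are assumed from the lakeFS Lua library and should be checked against your version:

```yaml
# contents of `_lakefs_actions/enforce_pii_location.yaml` (hypothetical file name)
---
name: Enforce PII location
description: Block merges into main that add PII columns outside the private prefix.
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: check_pii_columns
    type: lua
    properties:
      args:
        # Hypothetical values; object paths are written here without a leading slash
        private_prefix: private/
        pii_columns: ["user_name", "email", "ssn"]
      script: |
        local lakefs = require("lakefs")
        local parquet = require("encoding/parquet")
        local strings = require("strings")

        -- Build a lookup set of the column names we treat as PII
        local is_pii = {}
        for _, name in ipairs(args.pii_columns) do
          is_pii[name] = true
        end

        local after = ""
        local has_more = true
        while has_more do
          -- List what the merge would change on the destination branch
          local code, resp = lakefs.diff_refs(action.repository_id, action.branch_id, action.source_ref, after)
          if code ~= 200 then
            error("could not diff refs: HTTP " .. tostring(code))
          end
          for _, entry in pairs(resp.results) do
            local outside_private = entry.path_type == "object"
              and entry.type ~= "removed"
              and strings.has_suffix(entry.path, ".parquet")
              and not strings.has_prefix(entry.path, args.private_prefix)
            if outside_private then
              local get_code, content = lakefs.get_object(action.repository_id, action.source_ref, entry.path)
              if get_code ~= 200 then
                error("could not read " .. entry.path .. ": HTTP " .. tostring(get_code))
              end
              for _, column in ipairs(parquet.get_schema(content)) do
                if is_pii[column.name] then
                  error("PII column '" .. column.name .. "' found outside " .. args.private_prefix .. ": " .. entry.path)
                end
              end
            end
          end
          -- Continue with the next page of diff results, if any
          has_more = resp.pagination.has_more
          after = resp.pagination.next_offset
        end
```

Matching on exact column names is the simplest possible rule; pattern matching would catch variants such as `user_email` at the cost of more false positives.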
Custom Notifications and Alerts
Hooks can be used to send notifications or alerts when specific events occur.
For example, a hook could be triggered to notify users when a commit is made to a critical branch, enabling timely communication and collaboration among team members.
We can additionally write conditional hooks that are executed on failure – useful for notifying an on-call engineer or for updating a monitoring system.
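A failure notification could be sketched roughly as follows. The first hook is an illustrative check, and the second runs only when an earlier hook has failed. The `if: failure()` condition, the `net/http` and `encoding/json` packages, and the shape of `http.request` are assumptions based on our reading of the lakeFS documentation; we also believe `net/http` has to be explicitly enabled in the server configuration. The file name, the `owner` metadata key, and the webhook URL are hypothetical:

```yaml
# contents of `_lakefs_actions/notify_on_failure.yaml` (hypothetical file name)
---
name: Notify on failed commit checks
on:
  pre-commit:
    branches:
      - main
hooks:
  - id: require_owner_metadata
    type: lua
    properties:
      script: |
        -- Illustrative check: commits to main must carry an "owner" metadata key
        if not (action.commit.metadata and action.commit.metadata.owner) then
          error("commit is missing 'owner' metadata")
        end
  - id: alert_on_call
    type: lua
    # Assumed syntax: run this hook only if an earlier hook failed
    if: failure()
    properties:
      args:
        # Hypothetical alerting endpoint
        webhook_url: https://alerts.example.com/lakefs
      script: |
        local http = require("net/http")
        local json = require("encoding/json")

        -- Describe where the failure happened
        local payload = json.marshal({
          text = "lakeFS hook failure in " .. action.repository_id .. " on branch " .. action.branch_id,
        })

        -- Post the alert; the table-style call is assumed from the lakeFS docs
        local code, body = http.request{
          url = args.webhook_url,
          method = "POST",
          headers = { ["Content-Type"] = "application/json" },
          body = payload,
        }
        if code < 200 or code >= 300 then
          error("alert webhook returned HTTP " .. tostring(code))
        end
```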
Data Discoverability
Using post-merge and post-commit hooks, we can define rules to automatically register new data written to lakeFS into an external system. This could be a metastore update to register new partitions or some other data catalog tool.
Since hooks are passed an action object describing the commit, author, branch and other information about the change, it is also a great way to update Data Lineage systems in which consistency can often be harder to ensure.
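To make the action object more concrete, the sketch below assembles a small lineage event from it after every merge into `main` and logs it as JSON. The field names on `action` and the `encoding/json` package are taken from our understanding of the lakeFS Lua runtime, and the file name and event layout are hypothetical. A real hook would send the event to the catalog or lineage system instead of printing it:

```yaml
# contents of `_lakefs_actions/log_lineage_event.yaml` (hypothetical file name)
---
name: Log lineage event
description: Print a JSON description of every merge into main.
on:
  post-merge:
    branches:
      - main
hooks:
  - id: build_lineage_event
    type: lua
    properties:
      script: |
        local json = require("encoding/json")

        -- Gather the fields a catalog or lineage system would typically want
        local event = {
          repository = action.repository_id,
          branch = action.branch_id,
          commit_id = action.commit_id,
          committer = action.commit.committer,
          message = action.commit.message,
        }

        -- Here we only log the event; the output shows up in the hook's run log
        print(json.marshal(event))
```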
Anatomy of a Lua Hook
Deploying a Lua hook to a lakeFS repository is pretty simple. Like any other hook, we first need to write a yaml action file describing which events our hook will be triggered for, give it a name, and of course – define which Lua scripts should be executed.
The Lua code itself could be either embedded directly into our yaml definition file, or placed as its own independent file within our lakeFS repository, with the definition file linking to it.
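For the second option, the action file points at a script stored in the repository instead of embedding it. A minimal sketch, assuming the property is named `script_path` (as we understand it from the lakeFS documentation) and using a hypothetical file name and script location:

```yaml
# contents of `_lakefs_actions/log_commit_from_file.yaml` (hypothetical file name)
---
name: Log commit from script file
on:
  post-commit:
hooks:
  - id: log_commit
    type: lua
    properties:
      # Hypothetical path to a Lua file stored in the same lakeFS repository
      script_path: scripts/log_commit.lua
```

Keeping the script in its own file makes it easier to reuse across several actions and to review changes to the logic separately from the trigger configuration.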
A Real Hello World example
Here’s the Hello World example of a Lua hook.
It is very simple: after every commit, it will print to its output the ID of the generated commit and the branch it was committed on:
# contents of `_lakefs_actions/log_commit_and_branch.yaml`
---
name: Log commit ID and Branch
description: Write out every commit ID and branch name after a commit has been made.
on:
  post-commit:
hooks:
  - id: print_commit_branch
    type: lua
    properties:
      script: |
        print("We have a new commit!")
        print("Commit ID: ", action.commit_id)
        print("Branch ID: ", action.branch_id)
Once our action file is uploaded to a lakeFS repository, we should commit our changes.
Since this is a post-commit hook, it should execute immediately after we commit.
To see its output, we should navigate to the Actions tab in our repository in the lakeFS UI.
We should now have a new post-commit entry for our branch, showing the output printed by our hook.
Success! We’ve deployed and executed our first Lua hook.
Conclusion
The introduction of writing hooks in Lua within lakeFS marks an exciting development in the realm of data lake management.
By leveraging the power of Lua scripting, lakeFS users can now customize and automate their workflows to fit their unique needs.
Hooks enable enhanced flexibility, automation, integration, and extensibility, opening up a multitude of possibilities for improving data operations within the lakeFS ecosystem.
Next Steps
You can get started with Lua hooks by setting up a local lakeFS installation and following the steps above!