Ariel Shaqed (Scolnicov), Principal Software Engineer at lakeFS

Last updated on April 26, 2024


“How do I integrate X with lakeFS” is an evergreen question on lakeFS Slack. lakeFS takes a “tooling-first” strategy to data management: it slots into your existing lineup of tools. So a significant part of our work is devoted to leveraging lakeFS together with these other tools to improve our integrations. Our latest addition is Project Syphon!

Airflow was one of our earliest integrations: My colleague Itai published the original Air and Water 2 years ago. Airflow runs workflows, known as “DAGs”. Our users typically run DAGs to generate or merge data. lakeFS is a great way to track consistent versions of data, and the lakeFS Airflow provider makes it easy for Airflow DAGs to operate on lakeFS repositories. Now it’s time to deepen our integrations!

An Airflow DAG that modifies data will typically commit or merge to a branch. Airflow tracks DAG runs, their logs, their intermediate data, and additional user-defined metadata. lakeFS does something similar for data: it tracks versions of data, along with additional user-defined metadata. Until now, our users faced the challenge of recovering the exact connection between Airflow DAG runs and the commits that they created on lakeFS.

The latest release of the lakeFS provider for Airflow, codename “Project Syphon”, adds deep linking and metadata between Airflow DAG runs and the lakeFS commits that they create. It connects the two for interactive or programmatic use. Our Airflow operators now add formatted metadata to your commits. lakeFS uses this metadata to link to Airflow DAGs.

You can also leverage these new capabilities to add your own actionable metadata to any commits that you perform from any tool, not only from Airflow. lakeFS now defines a general, forwards-compatible format for metadata keys. By defining metadata keys that match this format, your metadata will interoperate with lakeFS.

How are you integrating these new features into your workflows? How can we improve them and help you? Please let us know on lakeFS Slack or by opening an issue!

Using the lakeFS Airflow provider

Links are generated by lakeFS provider versions 0.47.0 and above. To use them, ensure your Airflow requirements.txt includes
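For example, a minimal requirements.txt entry (assuming the provider's PyPI distribution name is airflow-provider-lakefs, pinned to the first release that generates links):

```text
airflow-provider-lakefs>=0.47.0
```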


That is pretty much all. Using LakeFSCommitOperator or LakeFSMergeOperator normally in your DAGs will add metadata to your commits.
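A minimal sketch of such a DAG follows. The connection id, repository, and branch names are placeholders, and the parameter names follow the provider's published example DAG, so they may differ across versions:

```
# Sketch: a DAG whose commit task creates a lakeFS commit with linked
# metadata. "lakefs_default", the repo, and the branch are placeholders.
from datetime import datetime

from airflow import DAG
from lakefs_provider.operators.commit_operator import LakeFSCommitOperator

with DAG(
    dag_id="example_lakefs_commit",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    commit = LakeFSCommitOperator(
        task_id="commit",
        lakefs_conn_id="lakefs_default",
        repo="example-repo",
        branch="main",
        msg="Commit created from Airflow",
        metadata={"committed_from": "airflow-dag"},
    )
```

No extra configuration is needed: the operator itself attaches the DAG-run metadata to the commit.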

Any version of lakeFS can serve commit metadata to clients. Starting at version 0.100.0, the lakeFS GUI can additionally interpret Airflow provider metadata to link back to Airflow.

Metadata on lakeFS commits

When examining a commit in lakeFS, the UI shows this metadata along with an “Open Airflow UI” button that links to the DAG run in the Airflow UI.

Image of lakeFS GUI showing Airflow run metadata generated by lakeFS, including links to Airflow runs.

The “Open Airflow UI” button links to the generating DAG run on Airflow. Metadata from the DAG run on Airflow is added to the commit, in keys starting with “::lakefs::”.

When examining the task of a LakeFSCommitOperator or a LakeFSMergeOperator in an Airflow workflow, the UI shows a “lakeFS” button.

Image: examining the task of a LakeFSCommitOperator in the Airflow UI.

This button links to the generated commit on lakeFS.

How it works

Knowing how all this works can help you build additional tasks on top of these features. Part of the tooling-first approach of both Airflow and lakeFS is to provide simple interfaces that can readily be used for integration and reliable data extraction. We built all these features by combining existing tools:

  • Extra links in Airflow tasks
  • Commit metadata in lakeFS
  • A new convention for creating link buttons from commit metadata

By leveraging existing features in lakeFS and in Airflow, and adding a small useful convention, we ensure that other tools can also leverage these features.

lakeFS structured metadata

When the lakeFS Airflow provider creates a commit, it creates metadata from the DAG run. All of this metadata uses the key prefix ::lakefs::Airflow::. We ask that you create metadata with the key prefix ::lakefs:: in accordance with specific guidelines. Doing so will allow your metadata to continue to be useful as we add more lakeFS-side features to commit metadata.

Fields added by the Airflow provider mostly parallel the fields documented in the DagRun REST API. We use structured metadata keys, ensuring that keys from different products can be separated and correctly handled automatically in most circumstances.
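Concretely, the metadata attached to a commit might look like the following sketch. The values here are made up for illustration; the exact field set depends on the DAG run:

```python
# Illustrative sketch of commit metadata written by the lakeFS Airflow
# provider, following the "::lakefs::LABEL::KEY[TYPE]" convention.
# All values below are invented for the example.
commit_metadata = {
    "::lakefs::Airflow::dag_id": "daily_sales_etl",
    "::lakefs::Airflow::dag_run_id": "scheduled__2024-04-25T00:00:00+00:00",
    "::lakefs::Airflow::run_type": "scheduled",
    "::lakefs::Airflow::logical_date[iso8601]": "2024-04-25T00:00:00+00:00",
    "::lakefs::Airflow::url[url:ui]": "https://airflow.example.com/dags/daily_sales_etl/grid",
}

# Every key shares the provider's prefix, so clients can filter reliably.
airflow_keys = {k: v for k, v in commit_metadata.items()
                if k.startswith("::lakefs::Airflow::")}
print(len(airflow_keys))  # 5
```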

Structured metadata keys have the format "::lakefs::LABEL::KEY[TYPE]":

  • LABEL should be a human-readable name of the generating process or product. The Airflow provider uses Airflow.
  • KEY should describe the value.
  • [TYPE] should describe the type of the value. Currently we define several types:
    • To specify a plain string, just use "::lakefs::LABEL::KEY".

      The Airflow provider places multiple string fields from the DAG run, including "::lakefs::Airflow::dag_id", "::lakefs::Airflow::dag_run_id", "::lakefs::Airflow::run_type", and "::lakefs::Airflow::note".
    • [url:ui] specifies that the value contains a URL for a UI. The lakeFS UI identifies such metadata and creates a clickable button.

      The Airflow provider places the DAG run UI URL in "::lakefs::Airflow::url[url:ui]". This allows the lakeFS UI to generate the "Open Airflow UI" button.
    • [url:id] specifies that the value contains an identifying URL. This can be used by other clients to access the generating resource.

      The Airflow provider places the URL of the DAG run in the Airflow API in "::lakefs::Airflow::url[url:id]". Clients that need to access the Airflow DAG run can use this URL as their base.
    • [iso8601] specifies that the value contains a timestamp in ISO 8601 format.

      The Airflow provider places all its timestamps in such keys, including "::lakefs::Airflow::logical_date[iso8601]", "::lakefs::Airflow::data_interval_start[iso8601]", "::lakefs::Airflow::data_interval_end[iso8601]", and "::lakefs::Airflow::last_scheduling_decision[iso8601]".
    • [boolean] specifies a boolean value, encoded as the string true or false.
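The key convention above can be parsed mechanically. A minimal sketch of such a parser (my own helper, not part of any lakeFS client) using a regular expression:

```python
import re

# "::lakefs::LABEL::KEY" with an optional trailing "[TYPE]" suffix.
# This regex is a sketch of the convention described above, not an
# official lakeFS definition.
STRUCTURED_KEY = re.compile(
    r"^::lakefs::(?P<label>[^:]+)::(?P<key>[^\[\]]+?)(?:\[(?P<type>[^\]]+)\])?$"
)

def parse_structured_key(key: str):
    """Split a structured metadata key into (label, key, type).

    Returns None if the key does not follow the convention;
    type is None for plain string keys.
    """
    m = STRUCTURED_KEY.match(key)
    if m is None:
        return None
    return m.group("label"), m.group("key"), m.group("type")

print(parse_structured_key("::lakefs::Airflow::url[url:ui]"))
# ('Airflow', 'url', 'url:ui')
print(parse_structured_key("::lakefs::Airflow::dag_id"))
# ('Airflow', 'dag_id', None)
```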

The Airflow provider creates the “lakeFS” link from a task to the commit that it generates. This is implemented as an Airflow extra link. Generating the link requires additional data, which is stored on XCom under the key lakefs_commit. The value stored under that key is a Python dict:

{
    'base_url': '',
    'repo': 'example-repo',
    'commit_digest': '51ee91ec0ffee...'
}
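From that XCom value, the link target can be composed. A rough sketch of the idea (my own helper, not the provider's actual code), assuming the lakeFS UI exposes commits at /repositories/&lt;repo&gt;/commits/&lt;digest&gt;:

```python
# Sketch: build a lakeFS commit URL from the dict stored on XCom under
# the "lakefs_commit" key. The URL pattern is an assumption about the
# lakeFS UI routes, and the digest below is an illustrative value.
def lakefs_commit_url(xcom_value: dict) -> str:
    base = xcom_value["base_url"].rstrip("/")
    repo = xcom_value["repo"]
    digest = xcom_value["commit_digest"]
    return f"{base}/repositories/{repo}/commits/{digest}"

print(lakefs_commit_url({
    "base_url": "https://lakefs.example.com/",
    "repo": "example-repo",
    "commit_digest": "51ee91ec0ffee",
}))
# https://lakefs.example.com/repositories/example-repo/commits/51ee91ec0ffee
```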
