lakeFS Community
Iddo Avneri, Author

Iddo has a strong software development background. He started his...

Published on March 7, 2024

lakeFS is a data version control solution that lets data practitioners manage data as code using Git-like operations and build reproducible, high-quality data pipelines. Getting started with lakeFS is simple through its Quickstart guide, but many users want tailored examples that integrate with their existing tech stack or address specific use cases. To meet these needs, lakeFS offers a collection of samples covering various integrations and scenarios. In this post, we’ll explore some of the lakeFS samples available and how they support data version control at scale.

lakeFS Samples

The lakeFS samples repository hosts a plethora of examples designed to demonstrate the versatility and power of lakeFS.
These samples are split between two directories:

  • 00_notebooks: A collection of notebooks that run together under a single Docker Compose setup and cover different use cases.
  • 01_standalone_examples: Standalone examples that run separately, each typically demonstrating an integration with another data or ML tool.


These samples can run against lakeFS Cloud (lakefs.cloud) or against open-source lakeFS.

To make things even easier, you can run many of the samples with a local-lakefs profile:

docker compose --profile local-lakefs up

This spins up the notebook environment together with a local lakeFS OSS server and MinIO, giving you a self-contained environment for experimentation and learning.
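
Once the containers are starting, a notebook or script may need to wait until lakeFS answers requests. The sketch below polls a health endpoint using only the Python standard library. The base URL `http://localhost:8000` and the `/api/v1/healthcheck` path are assumptions about the local setup rather than details from this post, so adjust them to match your compose file.

```python
import time
import urllib.request
from typing import Callable

# Assumed values for a local setup; adjust to your docker compose configuration.
DEFAULT_BASE_URL = "http://localhost:8000"
HEALTH_PATH = "/api/v1/healthcheck"


def healthcheck_url(base_url: str = DEFAULT_BASE_URL) -> str:
    """Join the base URL and the (assumed) health endpoint path."""
    return base_url.rstrip("/") + HEALTH_PATH


def wait_until_ready(
    base_url: str = DEFAULT_BASE_URL,
    attempts: int = 30,
    delay_s: float = 2.0,
    opener: Callable = urllib.request.urlopen,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint; return True on the first 2xx response."""
    url = healthcheck_url(base_url)
    for _ in range(attempts):
        try:
            with opener(url, timeout=5) as response:
                if 200 <= response.status < 300:
                    return True
        except OSError:
            # Connection refused, timeout, or HTTP error: server not ready yet.
            pass
        sleep(delay_s)
    return False
```

Calling `wait_until_ready()` at the top of a script returns `True` as soon as the server responds and `False` if it never does within the given number of attempts. The `opener` and `sleep` parameters exist only so the function can be exercised without a running server.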

[Image: lakeFS Samples getting started]


Let’s review some of these samples:

Notebook Samples (00_notebooks):

  • Integration of lakeFS with Spark and Python (spark-demo.ipynb):
    A fundamental example showcasing the use of Spark for data processing within the lakeFS environment. This example highlights the usage of branching, a core feature of lakeFS, to isolate and execute ETL processes, ensuring data integrity and reproducibility.
    [Image: Integrating lakeFS with Spark and Python]

Want to learn more about this use case? Check out this webinar on how to create a dev/test environment using Spark and Python.
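
To make the branching idea concrete, here is a small conceptual model in plain Python. It is not the lakeFS API and not the notebook's code; `ToyRepo` and its methods are hypothetical names invented for this sketch. It shows the property the Spark example relies on: a branch starts as a pointer to an existing commit, so an ETL run can write and commit in isolation while `main` stays unchanged until a merge.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    id: int
    message: str
    parent: Optional[int]
    snapshot: Dict[str, str]  # path -> content as of this commit


class ToyRepo:
    """In-memory toy: branches are named pointers to immutable commits."""

    def __init__(self) -> None:
        self.commits: Dict[int, Commit] = {0: Commit(0, "initial", None, {})}
        self.branches: Dict[str, int] = {"main": 0}
        self.uncommitted: Dict[str, Dict[str, str]] = {"main": {}}

    def _add_commit(self, branch: str, message: str, snapshot: Dict[str, str]) -> int:
        commit_id = len(self.commits)
        self.commits[commit_id] = Commit(commit_id, message, self.branches[branch], snapshot)
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name: str, source: str) -> None:
        # Only the pointer is copied; no object data is duplicated.
        self.branches[name] = self.branches[source]
        self.uncommitted[name] = {}

    def write(self, branch: str, path: str, content: str) -> None:
        self.uncommitted[branch][path] = content

    def read(self, branch: str, path: str) -> Optional[str]:
        if path in self.uncommitted[branch]:
            return self.uncommitted[branch][path]
        return self.commits[self.branches[branch]].snapshot.get(path)

    def commit(self, branch: str, message: str) -> int:
        head = self.commits[self.branches[branch]]
        snapshot = {**head.snapshot, **self.uncommitted[branch]}
        self.uncommitted[branch] = {}
        return self._add_commit(branch, message, snapshot)

    def merge(self, source: str, dest: str) -> int:
        # Simplification: the source wins on every path; no conflict detection.
        src = self.commits[self.branches[source]].snapshot
        dst = self.commits[self.branches[dest]].snapshot
        return self._add_commit(dest, f"merge {source} into {dest}", {**dst, **src})


repo = ToyRepo()
repo.write("main", "raw/events.csv", "raw rows")
repo.commit("main", "load raw data")

repo.create_branch("etl-dev", source="main")
repo.write("etl-dev", "curated/events.csv", "cleaned rows")
repo.commit("etl-dev", "run ETL")

print(repo.read("main", "curated/events.csv"))  # None: main is untouched
repo.merge("etl-dev", "main")
print(repo.read("main", "curated/events.csv"))  # cleaned rows
```

A real system also has to handle conflicts, large objects, and concurrent writers, all of which this toy ignores.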

  • Prevent unintended schema change (hooks-schema-validation.ipynb):
    Building upon the previous example, this notebook emphasizes the role of lakeFS in enforcing CI/CD practices to prevent inadvertent schema changes. It shows how to implement hooks within lakeFS that enforce schema validation, so that unexpected schema changes are not promoted to production.
    [Image: lakeFS sample: Prevent unintended schema change]
  • Avoid leaking PII data (hooks-schema-and-pii-validation.ipynb):
    Expanding even further on the previous notebook, this example showcases lakeFS’s capabilities in preventing the exposure of Personally Identifiable Information (PII).
    💡 Pro Tip: Combined with lakeFS RBAC, this architecture helps you design a highly secure, versioned data lake.
  • Only allow specific file formats in the data lake (hooks-webhooks-demo.ipynb):
    A third example of utilizing hooks, this time to enforce constraints on file formats, fostering better collaboration and adherence to data contracts.
  • Integration of lakeFS with Delta Lake & Iceberg (delta-lake.ipynb / delta-lake-python.ipynb / iceberg-lakefs-basic.ipynb / iceberg-lakefs-nyc.ipynb):
    This set of notebooks showcases how lakeFS integrates with popular open table formats such as Apache Iceberg and Delta Lake. lakeFS is format-agnostic, so users can combine it with these formats. The examples walk through the integration process and illustrate various use cases, including multi-table transactions.
  • Version Control of multi-buckets pipelines (version-control-of-multi-buckets-pipelines.ipynb):
    This example walks through a medallion architecture, common in data engineering, and demonstrates how to implement it with lakeFS. This specific implementation uses separate buckets, since physical separation of environments is often required.
    [Image: lakeFS sample: Version Control Multi-Bucket Pipelines]

Learn more about how to version control Medallion Architecture pipelines.

  • Import into a lakeFS repository from multiple paths (import-multiple-buckets.ipynb):
    Provides a quick and easy method to introduce data into a lakeFS repository without copying it, streamlining the process of working with large-scale datasets.
  • Data Lineage with lakeFS (data-lineage.ipynb):
    Illustrates how lakeFS enables data lineage tracking at scale, akin to running Git blame on big datasets, so users can trace where data came from and how it changed.
    [Image: lakeFS sample: Data Lineage]
  • Reprocess and Backfill Data with new ETL logic (reprocess-backfill-data.ipynb):
    Reprocessing is a common, manual, and error-prone task that data engineers occasionally need to perform. This notebook demonstrates a simple, low-risk way to run it.
  • lakeFS Role-Based Access Control (rbac-demo.ipynb):
    Walks through the granular RBAC capabilities available in lakeFS Cloud.
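
The three hook notebooks in the list above (schema validation, PII, and file formats) share one pattern: before a change is promoted, a check inspects it and rejects it if a rule is broken. The following is a minimal sketch of such checks in plain Python. It is not the code from the notebooks and not the lakeFS hooks API; the allowed file suffixes, the PII name hints, and the dict-based schema representation are hypothetical choices made for illustration.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List

# Hypothetical rules, chosen only for this illustration.
ALLOWED_SUFFIXES = {".parquet", ".orc"}
PII_NAME_HINTS = {"email", "ssn", "phone"}


def format_violations(paths: Iterable[str]) -> List[str]:
    """Return paths whose file suffix is not on the allow-list."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() not in ALLOWED_SUFFIXES]


def schema_violations(current: Dict[str, str], proposed: Dict[str, str]) -> List[str]:
    """Describe every difference between two column-name -> type mappings."""
    problems = []
    for column, col_type in current.items():
        if column not in proposed:
            problems.append(f"column removed: {column}")
        elif proposed[column] != col_type:
            problems.append(f"type changed: {column} {col_type} -> {proposed[column]}")
    for column in proposed:
        if column not in current:
            problems.append(f"column added: {column}")
    return problems


def pii_violations(columns: Iterable[str]) -> List[str]:
    """Flag column names containing one of the PII hints (a crude name-based heuristic)."""
    return [c for c in columns if any(hint in c.lower() for hint in PII_NAME_HINTS)]


def pre_merge_check(
    changed_paths: Iterable[str],
    current_schema: Dict[str, str],
    proposed_schema: Dict[str, str],
) -> List[str]:
    """Collect all rule violations; an empty list means the merge may proceed."""
    problems = [f"disallowed file format: {p}" for p in format_violations(changed_paths)]
    problems += schema_violations(current_schema, proposed_schema)
    problems += [f"possible PII column: {c}" for c in pii_violations(proposed_schema)]
    return problems


problems = pre_merge_check(
    changed_paths=["tables/orders/part-0.parquet", "tables/orders/notes.csv"],
    current_schema={"order_id": "bigint", "amount": "double"},
    proposed_schema={"order_id": "string", "amount": "double", "customer_email": "string"},
)
for problem in problems:
    print(problem)
```

In a hook-based setup, a non-empty result would cause the promotion to be rejected, which keeps the rule enforcement in one place instead of relying on every pipeline author to remember it.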

Standalone Examples (01_standalone_examples):

  • Airflow / Dagster / Prefect:
    These examples demonstrate the integration of lakeFS with popular orchestration tools such as Airflow, Dagster, and Prefect, showing how lakeFS adds version control to workflow management.
    For each orchestration system, there are two examples: one that wraps existing DAGs in lakeFS branches for isolation and easy rollback, and one that version-controls the data throughout a DAG execution for additional troubleshooting and reproducibility.
  • ML Examples:
    • Labelbox Integration:
      Imports data into a lakeFS repository based on Labelbox labeling, facilitating seamless data management in ML pipelines.
    • llm-openai-langchain-integration:
      Provides reproducibility and data version control for LangChain and LLM/OpenAI Models, ensuring consistency and reliability.
    • Image Segmentation:
      Demonstrates ML data version control and reproducibility at scale, using Git to version code and lakeFS to version data, and highlights why ML projects need to manage both.

      For an in-depth review of the ML reproducibility capabilities of lakeFS, check out this webinar.
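
The wrapping pattern described for the orchestration examples above (run an existing job on its own branch, merge on success, discard on failure) can be sketched as a Python context manager. This is an illustration under stated assumptions, not code from the samples: the `VersionedStore` interface, `isolated_run`, and `RecordingStore` are hypothetical names, and a real integration would call the lakeFS client or the orchestrator's lakeFS provider instead of the fake store used here.

```python
from contextlib import contextmanager
from typing import Iterator, List, Protocol


class VersionedStore(Protocol):
    """Hypothetical minimal interface a versioned data store would need to offer."""

    def create_branch(self, name: str, source: str) -> None: ...
    def commit(self, branch: str, message: str) -> str: ...
    def merge(self, source: str, dest: str) -> None: ...
    def delete_branch(self, name: str) -> None: ...


@contextmanager
def isolated_run(store: VersionedStore, run_id: str, base: str = "main") -> Iterator[str]:
    """Run a job on a throwaway branch; merge on success, discard on failure."""
    branch = f"run-{run_id}"
    store.create_branch(branch, source=base)
    try:
        yield branch
    except Exception:
        # Rollback is just dropping the branch; the base branch never changed.
        store.delete_branch(branch)
        raise
    else:
        store.commit(branch, f"pipeline run {run_id}")
        store.merge(branch, base)
        store.delete_branch(branch)


class RecordingStore:
    """Fake store that records calls, so the sketch runs without a server."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def create_branch(self, name: str, source: str) -> None:
        self.calls.append(f"create {name} from {source}")

    def commit(self, branch: str, message: str) -> str:
        self.calls.append(f"commit {branch}")
        return "fake-commit-id"

    def merge(self, source: str, dest: str) -> None:
        self.calls.append(f"merge {source} into {dest}")

    def delete_branch(self, name: str) -> None:
        self.calls.append(f"delete {name}")


store = RecordingStore()
with isolated_run(store, run_id="42") as branch:
    pass  # the existing DAG or job would read and write against `branch` here
print(store.calls)
```

Deleting the branch on failure is one design choice; keeping it instead would preserve the failed run's data for troubleshooting, at the cost of having to clean up later.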

Conclusion:

The lakeFS samples repository offers a rich tapestry of examples, catering to diverse use cases and integration scenarios. Whether you’re looking to streamline data processing workflows, enforce data governance policies, or enhance reproducibility in ML pipelines, lakeFS provides the tools and resources necessary for effective data version control. By leveraging these samples, data engineers and scientists can harness the full potential of lakeFS to build robust, scalable, and secure data lakes. Dive into the world of data version control with lakeFS and unlock new possibilities for your data management endeavors.
