lakeFS Blog

Data Engineering Project

Advancing lakeFS: Version Data At Scale With Spark

Tal Sofer
March 24, 2022

Combining lakeFS and Spark provides a new standard for scale and elasticity to distributed data pipelines. When integrating two technologies, the aim should be to expose the strengths of each as much as possible. With this philosophy in mind, we are excited to announce the beta release of the lakeFS FileSystem! This native Hadoop FileSystem …

Advancing lakeFS: Version Data At Scale With Spark Read More »

Data Engineering Integrations

Air & Water: The Airflow and lakeFS Integration

Itai Admi
May 27, 2021

Today we are excited to announce the official release of the lakeFS Airflow provider! What this package does is allow you to easily integrate lakeFS functionality to your Airflow DAGs. The library is published on PyPI so it can easily be installed in your project via the command: pip install airflow-provider-lakefs Once installed, you are …

Air & Water: The Airflow and lakeFS Integration Read More »

Data Engineering

Solving Data Reproducibility

Paul Singman
November 21, 2022

Debugging an issue is never fun, but why make it harder? In this post, we show how reproducing data is possible whether interacting with a single file, entire table, or data repository. Introducing Data Reproducibility There are two types of issues in the world — reproducible and unreproducible.  A reproducible issue is one where the original conditions for …

Solving Data Reproducibility Read More »

Go

Managing Multiple Go Versions with Go

Barak Amar
April 6, 2022

Updated on April 5, 2022 As a user of the Go programming language, I’ve found it useful to enable the running multiple versions within a single project. If this is something you’ve tried or have considered, great! In this post I’ll present the when and the how of enabling multiple Go versions. Finally, we’ll conclude …

Managing Multiple Go Versions with Go Read More »

Data Engineering Project

The State of Data Engineering in 2021

Einat Orr, PhD.
May 11, 2022

Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it. What’s …

The State of Data Engineering in 2021 Read More »

Project

Concrete Graveler: Splitting for Reuse

Ariel Shaqed (Scolnicov)
May 19, 2021

Welcome to another episode “Concrete Graveler”, our deep-dive into the implementation of Graveler, the committed object storage for lakeFS. Graveler is our versioned object store, inspired by Git. It is designed to store orders of magnitude more objects than Git does.  The last episode focused on how we store a single commit — a snapshot …

Concrete Graveler: Splitting for Reuse Read More »

Data Engineering

Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared

Oz Katz
March 26, 2022

Introduction When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility. It is inspiring that by simply changing the format data is stored in, we can unlock new functionality and improve the …

Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared Read More »

Git for Data – lakeFS

  • Get Started
    Get Started
  • LIVE: Develop Spark pipelines against production data on February 15 -

    Register Now
    +