Project

New in lakeFS: Data Retention Policies

Yoni Augarten, Guy Hardonag
July 26, 2021

“I can remember everything. That’s my curse, young man. It’s the greatest curse that’s ever been inflicted on the human race: memory.” — Jedediah Leland, Citizen Kane (1941)

lakeFS makes data corruption easy to avoid and fix by allowing you to travel back in time to any state of your data. This new capability has …

Data Engineering Project

Advancing lakeFS: Version Data At Scale With Spark

Tal Sofer
June 23, 2021

Combining lakeFS and Spark brings a new standard of scale and elasticity to distributed data pipelines. When integrating two technologies, the aim should be to expose the strengths of each as much as possible. With this philosophy in mind, we are excited to announce the beta release of the lakeFS FileSystem! This native Hadoop FileSystem …
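The lakeFS FileSystem mentioned above is registered like any other Hadoop FileSystem. A minimal sketch of the Spark configuration, assuming a hypothetical endpoint and credentials (exact property names may differ between lakeFS versions — check the docs for your release):

```
# Register the lakeFS FileSystem for the lakefs:// scheme
spark.hadoop.fs.lakefs.impl        io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.endpoint    https://lakefs.example.com/api/v1
spark.hadoop.fs.lakefs.access.key  <lakeFS access key>
spark.hadoop.fs.lakefs.secret.key  <lakeFS secret key>
```

Data is then addressed by repository and branch, e.g. `spark.read.parquet("lakefs://example-repo/main/events/")`, with data reads and writes going directly to the underlying object store while only metadata operations hit the lakeFS server.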

Data Engineering Project

The State of Data Engineering in 2021

Einat Orr, PhD.
June 1, 2021

Let’s start with the obvious: the lakeFS project doesn’t exist in isolation. It belongs to a larger ecosystem of data engineering tools and technologies adjacent and complementary to the problems we are solving. What better way to visualize our place in this ecosystem, I thought, than by creating a cross-sectional LUMAscape to depict it. What’s …

Project

Concrete Graveler: Splitting for Reuse

Ariel Shaqed (Scolnicov)
May 19, 2021

Welcome to another episode of “Concrete Graveler”, our deep dive into the implementation of Graveler, the committed object storage for lakeFS. Graveler is our versioned object store, inspired by Git and designed to store orders of magnitude more objects than Git does. The last episode focused on how we store a single commit — a snapshot …

Project

Power Amazon EMR Applications with Git-like Operations Using lakeFS

Itai Admi
May 19, 2021

This article provides a detailed explanation of how to use lakeFS with Amazon EMR. Today it’s common to manage a data lake using cloud object stores like AWS S3, Azure Blob Storage, or Google Cloud Storage as the underlying storage service. Each cloud provider offers a set of managed services to simplify the way …
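One common pattern for pointing EMR applications at lakeFS is to route S3A traffic through the lakeFS S3-compatible gateway, so existing `s3a://` jobs work unchanged. A sketch with hypothetical hostnames and a hypothetical repository (the endpoint and credentials come from your own lakeFS installation):

```
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=https://lakefs.example.com \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.access.key="$LAKEFS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$LAKEFS_SECRET_ACCESS_KEY" \
  my_job.py s3a://example-repo/main/input/
```

Here the bucket name position is reinterpreted as `<repository>`, and the first path segment as the branch, so the same job can run against `main` or against an experiment branch by changing only the path.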

Data Engineering Project

lakeFS Hooks: Implementing CI/CD for Data using Pre-merge Hooks

Oz Katz
March 2, 2021

Continuous integration of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures the quality of data at each step of a production pipeline. In this blog, I will present lakeFS webhooks and showcase a …
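Pre-merge hooks of this kind are declared in YAML files committed under the `_lakefs_actions/` path of the repository itself. A minimal sketch — the check name and webhook URL are hypothetical, and the exact schema should be confirmed against the lakeFS hooks documentation for your version:

```yaml
name: pre-merge checks            # runs before merges into the listed branches
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: schema_check              # hypothetical validation served by your own service
    type: webhook
    properties:
      url: http://hooks.example.com/lakefs/schema-check
```

If the webhook responds with a failure, lakeFS rejects the merge, so consumers reading from `main` only ever see data that passed the checks.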

Project

Concrete Graveler: Committing Data to Pebble SSTables

Ariel Shaqed (Scolnicov)
April 25, 2021

Introduction In our recent version of lakeFS, we switched to basing metadata storage on immutable files stored on S3 and other common object stores. Our design is inspired by Git, but targets object stores rather than filesystems, with (much) larger repositories holding machine-generated commits. The design document is informative but by nature omits much …

Project

Tiers in the Cloud: How lakeFS caches immutable data on local-disk

Itai Admi
May 19, 2021

Introduction We recently released the first version of lakeFS backed by Pebble’s SSTable library (Pebble is a Go storage engine inspired by RocksDB). The release introduced a new data model that is much closer to Git’s: instead of using a PostgreSQL server that quickly becomes a bottleneck, committed metadata now lives on the object store itself. Early on we realized that …
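The caching idea in this excerpt leans on immutability: once written, a committed SSTable never changes, so a cached copy can never go stale and no invalidation is needed. A minimal read-through disk cache in that spirit — an illustrative sketch, not lakeFS’s actual implementation; `fetch` stands in for an object-store GET:

```python
import hashlib
import os


class ImmutableDiskCache:
    """Read-through local-disk cache for immutable objects.

    Because entries never change upstream, a cache hit is always valid:
    there is no invalidation or TTL logic at all.
    """

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # key -> bytes, e.g. an object-store GET
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # Hash the key so any string becomes a safe, fixed-length filename.
        return os.path.join(self.cache_dir,
                            hashlib.sha256(key.encode()).hexdigest())

    def get(self, key):
        path = self._path(key)
        try:
            with open(path, "rb") as f:
                return f.read()          # hit: serve straight from local disk
        except FileNotFoundError:
            data = self.fetch(key)       # miss: fetch from the remote store
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, path)        # atomic rename: no partial files visible
            return data
```

A write-once/read-many cache like this also evicts safely at any time: deleting a file merely forces the next `get` to re-fetch, never serves wrong data.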

Data Engineering Project

Why Data Versioning as an Infrastructure Matters

Einat Orr, PhD.
March 24, 2021

The demand for infrastructure that contributes to the collection, storage, and analysis of data is growing with the increasing amounts of data managed by organizations. Every organization that manages data pipelines to extract insights from data encounters the need for reproducibility, safe experimentation, and means to ensure data quality. The path to answering these needs …
