Project


Power Amazon EMR Applications with Git-like Operations Using lakeFS

Itai Admi
April 1, 2021

This article provides a detailed explanation of how to use lakeFS with Amazon EMR. Today, it’s common to manage a data lake using cloud object stores like AWS S3, Azure Blob Storage, or Google Cloud Storage as the underlying storage service. Each cloud provider offers a set of managed services to simplify the way …


Data Engineering, Project

lakeFS Hooks: Implementing CI/CD for Data using Pre-merge Hooks

Oz Katz
March 2, 2021

Continuous integration of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures the quality of data at each step of a production pipeline. In this blog, I will present lakeFS’s webhooks and showcase a …


Project

Concrete Graveler: Committing Data to Pebble SSTables

Ariel Shaqed (Scolnicov)
February 16, 2021

Introduction: In our recent version of lakeFS, we switched to basing metadata storage on immutable files stored on S3 and other common object stores. Our design is inspired by Git, but for object stores rather than filesystems, and with (much) larger repositories holding machine-generated commits. The design document is informative but by nature omits much …


Project

Tiers in the Cloud: How lakeFS caches immutable data on local-disk

Itai Admi
April 1, 2021

Introduction: We recently released the first version of lakeFS backed by the sstable library of Pebble, a RocksDB-inspired key-value store. The release introduced a new data model which is now much closer to Git. Instead of using a PostgreSQL server that quickly becomes a bottleneck, committed metadata now lives on the object store itself. Early on we realized that …


Data Engineering, Project

Why Data Versioning as an Infrastructure Matters

Einat Orr, PhD.
March 24, 2021

The demand for infrastructure that contributes to the collection, storage, and analysis of data is growing with the increasing amounts of data managed by organizations. Every organization that manages data pipelines to extract insights from data encounters the need for reproducibility, safe experimentation, and means to ensure data quality. The path to answering these needs …


Go, Project

Loosely Coupled Monolith vs Tightly Coupled Microservices

Barak Amar
December 14, 2020

TL;DR: With some thoughtful engineering, we can achieve a lot of the benefits that come with a microservice-oriented architecture, while retaining the simplicity and low operating cost of being a monolith. What is lakeFS? lakeFS is an open-source tool that delivers resilience and manageability to object-storage based data lakes. lakeFS provides Git-like capabilities …


Data Engineering, Project

Data Mesh Applied: How to Move Beyond the Data Lake with lakeFS

Einat Orr, PhD.
March 24, 2021

The data mesh paradigm: The Data Mesh paradigm was first introduced by Zhamak Dehghani in her article "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh". Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, …


Project

System Tests: Lessons Learned From Developing For OSS Project

Itai Admi
March 8, 2021

Overview: In this article, I will try to cover some do’s and don’ts for system testing from the perspective of an open-source project. To keep things simple, it all boils down to running the system as our customers would: think of the different use-cases of your system, the environment where it runs, the configuration options, …


Data Engineering, Project

Building A Data Development Environment with lakeFS

Barak Amar
March 8, 2021

Overview: As part of our routine work with data, we develop code, choose and upgrade compute infrastructure, and test new data. Usually, this requires running parts of our production pipelines in parallel to production, testing the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, …


Project

The lakeFS Playground – Interactive Data Versioning Learning

Guy Hardonag
March 8, 2021

If you’re interested in playing around and exploring lakeFS, you can now easily get started using the Katacoda playground which provides a personalized sandboxed environment – all from your browser, without installing anything.  lakeFS is an open source platform that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, …

