lakeFS Blog

Data Engineering

Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared

Oz Katz
August 20, 2021

Introduction When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility. It is inspiring that by simply changing the format data is stored in, we can unlock new functionality and improve the …

People

Why I’m Joining lakeFS

Paul Singman
April 6, 2021

Thoughts on a personal journey into the world of developer advocacy at an open-source data project. In March of 2021, I chose to leave the data team at Equinox Media and join a nascent open-source project lakeFS as the first developer advocate. In this post, I share a few reasons why I’m excited about starting this …

Data Engineering

3 Data Lake Anti-Patterns to Avoid

Paul Singman
May 19, 2021

Rid yourself of these troubling habits and start the journey towards data lake mastery! Introduction Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes though, the promise of technological performance can overshadow an unpleasant developer experience. This is troublesome since I believe the developer experience is as …

Data Engineering

Data Lakes: The Definitive Guide

Paul Singman
May 27, 2021

What is a Data Lake? A data lake is a system of technologies that allows data stored in file or blob objects to be queried. When employed effectively, data lakes enable the analysis of structured and unstructured data assets at tremendous scale and cost-efficiency. The number of organizations employing data lake architectures has increased exponentially since …

Project

Power Amazon EMR Applications with Git-like Operations Using lakeFS

Itai Admi
May 19, 2021

This article provides a detailed explanation of how to use lakeFS with Amazon EMR. Today it’s common to manage a data lake using cloud object stores like AWS S3, Azure Blob Storage, or Google Cloud Storage as the underlying storage service. Each cloud provider offers a set of managed services to simplify the way …

Data Engineering, Project

lakeFS Hooks: Implementing CI/CD for Data using Pre-merge Hooks

Oz Katz
March 2, 2021

Continuous integration of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures the quality of data at each step of a production pipeline. In this blog, I will present lakeFS’s webhooks, and showcase a …

Data Engineering

Data Quality Testing: Ways to Test Data Validity and Accuracy

Einat Orr, PhD.
May 19, 2021

Introduction If Sisyphus had been a data analyst or a data scientist, the boulder she’d be rolling up the hill would have been her data quality assurance. Even if all engineering processes of ingesting, processing, and modeling are working impeccably, the ability to test data quality at any stage of the data pipeline, and being …

Project

Concrete Graveler: Committing Data to Pebble SSTables

Ariel Shaqed (Scolnicov)
April 25, 2021

Introduction In our recent version of lakeFS, we switched to basing metadata storage on immutable files stored on S3 and other common object stores. Our design is inspired by Git, but for object stores rather than filesystems, and with (much) larger repositories holding machine-generated commits. The design document is informative but by nature omits much …

Project

Tiers in the Cloud: How lakeFS caches immutable data on local-disk

Itai Admi
May 19, 2021

Introduction We recently released the first version of lakeFS backed by the sstable library of Pebble, a RocksDB-inspired key-value store. The release introduced a new data model which is now much closer to Git. Instead of using a PostgreSQL server that quickly becomes a bottleneck, committed metadata now lives on the object store itself. Early on we realized that …
