


Best Practices

Best Practices Data Engineering

lakeFS Hooks: Implementing Write-Audit-Publish for Data Using Pre-Merge Hooks

Oz Katz

Write-Audit-Publish (continuous integration/continuous deployment of data) is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures data quality at each step of a production pipeline. In this blog, I will present lakeFS’s webhooks, and […]
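To make the pre-merge idea concrete, here is a rough sketch (not taken from the post) of what a lakeFS actions file might look like; the action name, hook id, and webhook URL are all placeholders, and the file would live under `_lakefs_actions/` in the repository:

```yaml
# _lakefs_actions/pre-merge-checks.yaml — illustrative sketch; names and URL are placeholders
name: pre merge format and schema checks
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: format_validator
    type: webhook
    properties:
      url: "http://<validator-host>/webhooks/format"
```

If the webhook returns a failure, the merge to `main` is blocked — so consumers reading from `main` only ever see audited data.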

Best Practices Data Engineering Tutorials

Concrete Graveler: Committing Data to Pebble SSTables

Ariel Shaqed (Scolnicov)

In our recent version of lakeFS, we switched to basing metadata storage on immutable files stored on S3 and other common object stores. Our design is inspired by Git, but for object stores rather than filesystems, and with (much) larger repositories holding machine-generated commits. The design document is informative but by nature omits

Best Practices Tutorials

Working with Embed in Go 1.16

Barak Amar

The new Go 1.16 embed directive helps us keep a single binary by bundling our static content into it. This post will cover how to work with the embed directive by applying it to a demo application. Why Embed One of the benefits of using Go is having your application compiled into a single self-contained binary. Having a

Best Practices Data Engineering

Loosely Coupled Monolith vs Tightly Coupled Microservices

Barak Amar

TL;DR With some thoughtful engineering, we can achieve many of the benefits that come with a microservice-oriented architecture, while retaining the simplicity and low operating cost of a monolith. What is lakeFS? lakeFS is an open source tool that delivers resilience and manageability to object-storage based data lakes. lakeFS provides Git-like capabilities

Best Practices Data Engineering

Data Mesh Applied: How to Move Beyond the Data Lake with lakeFS

Einat Orr, PhD

The Data Mesh paradigm was first introduced by Zhamak Dehghani in her article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed,

Best Practices Data Engineering

System Tests: Lessons Learned From Developing for an OSS Project

Itai Admi

Overview In this article, I will try to cover some do’s and don’ts for system testing from the perspective of an open-source project. To keep things simple, it all boils down to running the system as our customers would: think of the different use-cases of your system, the environment where it runs, the configuration options,

Best Practices Data Engineering Tutorials

Building A Data Development Environment with lakeFS

Barak Amar

Overview As part of our routine work with data we develop code, choose and upgrade compute infrastructure, and test new data. Usually, this requires running parts of our production pipelines in parallel to production, testing the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration,

Best Practices Data Engineering

How to Manage Your Data the Way You Manage Your Code

Einat Orr, PhD

50 years ago it was very hard to collaborate over code. When developing large scale software projects it was difficult to manage changes to source code over time, as revision control tools were only starting to enter mainstream computing. The adoption of version control tools, first centralized and then distributed, changed all that, and now

Best Practices

Improving Postgres Performance Tenfold Using Go Concurrency

Tzahi Yaacobovicz

In this article I will show how Go concurrency enabled us to cut through a daunting DB performance barrier. This blog post continues our journey to big data performance. The first post in this series discussed in-process caching in Go. The Pain lakeFS is a versioned directory over object stores like AWS S3 and GCS
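The underlying pattern is worth sketching. The toy worker pool below (my own illustration, not code from the post) fans tasks out over a channel to a fixed number of goroutines — the same shape used to issue many database statements concurrently instead of one at a time:

```go
package main

import (
	"fmt"
	"sync"
)

// processAll fans work out to nWorkers goroutines reading task indexes from
// a shared channel. Each "task" here is a stand-in for a database operation.
// Every worker writes to a distinct slice index, so no extra locking is needed.
func processAll(tasks []int, nWorkers int, handle func(int) int) []int {
	in := make(chan int)
	results := make([]int, len(tasks))
	var wg sync.WaitGroup
	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range in {
				results[i] = handle(tasks[i])
			}
		}()
	}
	for i := range tasks {
		in <- i // distribute work; workers pick indexes up as they free up
	}
	close(in)
	wg.Wait()
	return results
}

func main() {
	tasks := []int{1, 2, 3, 4, 5}
	out := processAll(tasks, 3, func(v int) int { return v * v })
	fmt.Println(out) // each element squared, computed concurrently
}
```

With a real driver, `handle` would execute an INSERT or UPDATE on a pooled connection; the throughput win comes from overlapping many round trips to Postgres.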

Best Practices Tutorials

In-process Caching In Go: Scaling lakeFS to 100k Requests/Second

Barak Amar

This is the first in a series of posts describing our journey of scaling lakeFS. In this post we describe how adding an in-process cache to our Go server sped up our authorization flow. Background lakeFS is an open-source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build
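The core idea can be sketched in a few lines. The cache below is a minimal illustration under my own assumptions (a mutex-guarded map with a load-on-miss helper), not lakeFS's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a minimal in-process cache: an RWMutex-guarded map held in the
// server's own memory, so a hit costs a map lookup instead of a DB query.
type Cache struct {
	mu    sync.RWMutex
	items map[string]string
}

func NewCache() *Cache {
	return &Cache{items: make(map[string]string)}
}

// GetOrLoad returns the cached value for key, calling load only on a miss.
func (c *Cache) GetOrLoad(key string, load func(string) string) string {
	c.mu.RLock()
	v, ok := c.items[key]
	c.mu.RUnlock()
	if ok {
		return v // served from memory, no round trip
	}
	v = load(key) // e.g. fetch credentials/policies from the database
	c.mu.Lock()
	c.items[key] = v
	c.mu.Unlock()
	return v
}

func main() {
	c := NewCache()
	loads := 0
	load := func(k string) string {
		loads++
		return "policy-for-" + k
	}
	fmt.Println(c.GetOrLoad("user:1", load)) // miss: calls load
	fmt.Println(c.GetOrLoad("user:1", load)) // hit: no load call
	fmt.Println("loads:", loads)             // prints: loads: 1
}
```

In an authorization flow, the `load` callback would be the expensive per-request lookup; caching it bounds database traffic regardless of request rate (a production version would also add expiry and size limits).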

Best Practices Tutorials

From Zero to Versioned Data in Spark

Guy Hardonag

This tutorial aims to give you a fast start with lakeFS and its Git-like terminology in Spark. It covers the following: This simple flow gives a sneak peek at how seamless and easy it is to make changes to data using lakeFS. Once you get the value of a resilient data flow, you can
