lakeFS Blog

Data Engineering

Object Storage: Everything You Need to Know

Yael Rivkind
March 24, 2021

While object storage is not a novel technology, it can still be overwhelming when getting started. Here’s a definitive guide to object-based storage with everything you need to know. What is object storage? At its core, object storage or object-based storage represents a data storage architecture that allows you to store large amounts of unstructured …

Data Engineering

Chaos Data Engineering

Oz Katz
May 19, 2021

Modern data lakes are a complexity tar pit. They involve many moving parts: distributed computation engines, running on virtualized servers connected by a software-defined network, running on top of distributed object stores, orchestrated by a distributed stream processor or pipeline execution engine. These moving parts fail. All the time. Handling these failures is not …

Project

System Tests: Lessons Learned From Developing For OSS Project

Itai Admi
March 8, 2021

Overview In this article, I will try to cover some do’s and don’ts for system testing from the perspective of an open-source project. To keep things simple, it all boils down to running the system as our customers would: think of the different use-cases of your system, the environment where it runs, the configuration options, …

Data Engineering Project

Building A Data Development Environment with lakeFS

Barak Amar
August 14, 2022

Overview As part of our routine work with data we develop code, choose and upgrade compute infrastructure, and test new data. Usually, this requires running parts of our production pipelines in parallel to production, testing the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, …

Project

The lakeFS Katacoda Sandbox Environment – Interactive Data Versioning Learning

Guy Hardonag
March 2, 2022

If you’re interested in playing around and exploring lakeFS, you can now easily get started using the Katacoda demo which provides a personalized sandboxed environment – all from your browser, without installing anything. lakeFS is an open source platform that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, …

Data Engineering Project

Introducing lakeview: A Visibility Tool for AWS S3 Based Data Lakes

Oz Katz
May 19, 2021

Lakeview is a new open source visibility tool for AWS S3 based data lakes. Think of it as ncdu, but for petabyte-scale data. Its goal is to provide you with an easy way to see the total size of your S3 bucket (prefix) storage. Instead of scanning billions of objects using the S3 API, which …

Data Engineering Project

How to Manage Your Data the Way You Manage Your Code

Einat Orr, PhD.
May 9, 2021

Fifty years ago it was very hard to collaborate over code. When developing large-scale software projects it was difficult to manage changes to source code over time, as revision control tools were only starting to enter mainstream computing. The adoption of version control tools, first centralized and then distributed, changed all that, and now …

Go Project

Improving Postgres Performance Tenfold Using Go Concurrency

Tzahi Yaacobovicz
March 8, 2021

In this article I will show how Go concurrency enabled us to cut through a daunting DB performance barrier. This blog post continues our journey to big data performance. The first post on this issue discussed in-process caching in Go. The Pain lakeFS is a versioned directory over object stores like AWS S3 and GCS …

Go Project

In-process Caching In Go: Scaling lakeFS to 100k Requests/Second

Barak Amar
March 8, 2021

This is the first in a series of posts describing our journey of scaling lakeFS. In this post we describe how adding an in-process cache to our Go server sped up our authorization flow. Background lakeFS is an open-source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build …

Data Engineering

Diary of a Data Engineer

Oz Katz
May 19, 2021

A glimpse into the life of a data engineer. Day 1: Finally, an easy one Got a pretty simple task for a change – read a new type of event stream generated by sales, and publish it to the data lake. Sounds like a straightforward ETL. I estimate this as one day of work. I …
