lakeFS Blog

Integrations

Building Reproducible Data Pipelines with Airflow and lakeFS

Guy Hardonag
May 27, 2021

Update (May 26th, 2021): We officially released the lakeFS Airflow provider. Read all about it in the latest blog post. In this post, we’ll see how easy it is to use lakeFS with an existing Airflow DAG, to make every step in a pipeline completely reproducible in both code and data. This is done without …

Go

Working with Embed in Go 1.16 Version

Barak Amar
May 19, 2021

The new Golang v1.16 embed directive helps us keep a single binary and bundle our static content. This post covers how to work with the embed directive by applying it to a demo application. Why Embed: One of the benefits of using Go is having your application compiled into a single self-contained binary. Having a …

Data Engineering

Ensuring Data Quality in a Data Lake Environment

Einat Orr, PhD.
March 24, 2021

The quality of the data we introduce determines the overall reliability of our data lake, and the ingestion stage is a critical point for ensuring the soundness of our service and data. The same way software engineers apply automatic testing to new code, data engineers should continuously test newly ingested data while ensuring they meet …

Integrations

Git-like Operations Over MinIO with lakeFS

Yoni Augarten
May 7, 2021

lakeFS is an open source tool that delivers resilience and manageability to object-storage based data lakes. lakeFS provides Git-like operations over your MinIO storage environment and works seamlessly with modern data frameworks such as Spark, Hive, Presto, Kafka, R, and native Python. Common use-cases include creating a development environment without copying or mocking …

Data Engineering Project

Why Data Versioning as an Infrastructure Matters

Einat Orr, PhD.
March 24, 2021

The demand for infrastructure that contributes to the collection, storage, and analysis of data is growing with the increasing amounts of data managed by organizations. Every organization that manages data pipelines to extract insights from data encounters the need for reproducibility, safe experimentation, and means to ensure data quality. The path to answering these needs …

Go Project

Loosely Coupled Monolith vs Tightly Coupled Microservices

Barak Amar
May 19, 2021

TL;DR: With some thoughtful engineering, we can achieve many of the benefits of a microservice-oriented architecture while retaining the simplicity and low operating cost of a monolith. What is lakeFS? lakeFS is an open source tool that delivers resilience and manageability to object-storage based data lakes. lakeFS provides Git-like capabilities …

Data Engineering Project

Data Mesh Applied: How to Move Beyond the Data Lake with lakeFS

Einat Orr, PhD.
March 24, 2021

The data mesh paradigm: The Data Mesh paradigm was first introduced by Zhamak Dehghani in her article How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, …

Data Engineering

Object Storage: Everything You Need to Know

Yael Rivkind
March 24, 2021

While object storage is not a novel technology, it can still be overwhelming when getting started. Here’s a definitive guide to object-based storage with everything you need to know. What is object storage? At its core, object storage or object-based storage represents a data storage architecture that allows you to store large amounts of unstructured …

Data Engineering

Chaos Data Engineering

Oz Katz
May 19, 2021

Modern data lakes are a complexity tar pit. They involve many moving parts: distributed computation engines, running on virtualized servers connected by a software-defined network, running on top of distributed object stores, orchestrated by a distributed stream processor or pipeline execution engine. These moving parts fail. All the time. Handling these failures is not …
