



Product

Advancing lakeFS: Version Data At Scale With Spark

Tal Sofer

Combining lakeFS and Spark brings a new standard of scale and elasticity to distributed data pipelines. When integrating two technologies, the aim should be to expose the strengths of each as much as possible. With this philosophy in mind, we are excited to announce the release of the lakeFS FileSystem! This native Hadoop FileSystem implementation […]
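The excerpt above announces a native Hadoop FileSystem implementation for Spark. As a hedged sketch of what the Spark-side wiring for such a FileSystem might look like (the endpoint, credentials, and exact configuration keys below are placeholders for illustration, not taken from the post), a `spark-defaults.conf` fragment could read:

```properties
# Hypothetical spark-defaults.conf fragment wiring Spark to the lakeFS
# Hadoop FileSystem; endpoint and credential values are placeholders.
spark.hadoop.fs.lakefs.impl          io.lakefs.LakeFSFileSystem
spark.hadoop.fs.lakefs.endpoint      https://lakefs.example.com/api/v1
spark.hadoop.fs.lakefs.access.key    <lakefs-access-key-id>
spark.hadoop.fs.lakefs.secret.key    <lakefs-secret-access-key>
```

With configuration along these lines, jobs address data through branch-scoped paths of the form `lakefs://<repo>/<branch>/<object>`, so the same job can run against different branches of the same data.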

Product

Air & Water: The Airflow and lakeFS Integration

Itai Admi

Today we are excited to announce the official release of the lakeFS Airflow provider! This package lets you easily integrate lakeFS functionality into your Airflow DAGs. The library is published on PyPI, so it can be installed in your project with the command: pip install airflow-provider-lakefs. Once installed, you are […]

Product Tutorials

Power Amazon EMR Applications with Git-like Operations Using lakeFS

Itai Admi

This article will provide a detailed explanation of how to use lakeFS with Amazon EMR. Today, it’s common to manage a data lake using cloud object stores like AWS S3, Azure Blob Storage, or Google Cloud Storage as the underlying storage service. Each cloud provider offers a set of managed services to simplify the way […]

Product

Building Reproducible Data Pipelines with Airflow and lakeFS

Guy Hardonag

Update (May 26th, 2021): We officially released the lakeFS Airflow provider. Read all about it in the latest blog post. In this post, we’ll see how easy it is to use lakeFS with an existing Airflow DAG, to make every step in a pipeline completely reproducible in both code and data. This is done without […]
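The excerpt describes making every step of an Airflow DAG reproducible in both code and data. One common way this is done with lakeFS is a branch-per-run pattern: each DAG run writes to its own lakeFS branch, so any run's inputs and outputs can be revisited later. The helpers below are a minimal illustrative sketch of that naming convention (the function names and path format are assumptions for illustration, not the provider's API):

```python
# Hypothetical helpers sketching the branch-per-run pattern: each Airflow
# DAG run gets a dedicated lakeFS branch, keeping every run reproducible.
# Names and the path format are illustrative, not the provider's API.

def run_branch(dag_id: str, run_id: str) -> str:
    """Derive a lakeFS-safe branch name for a specific DAG run."""
    # lakeFS branch names avoid characters like ':' and '+' that appear
    # in Airflow run IDs, so normalize them to '-'.
    safe = run_id.replace(":", "-").replace("+", "-")
    return f"{dag_id}-{safe}"

def lakefs_path(repo: str, branch: str, key: str) -> str:
    """Build an object path of the form lakefs://repo/branch/key."""
    return f"lakefs://{repo}/{branch}/{key}"
```

A task at the start of the DAG would create the branch, downstream tasks would read and write under its `lakefs_path`, and a final task would commit (and optionally merge) the branch.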

Product

Git-like Operations Over MinIO with lakeFS

Yoni Augarten

lakeFS is an open source tool that delivers resilience and manageability to object-storage based data lakes. lakeFS provides Git-like operations over your MinIO storage environment and works seamlessly with all modern data frameworks such as Spark, Hive, Presto, Kafka, R, and native Python. Common use cases include creating a development environment without copying or mocking […]

Product

The Quick Guide for Running Presto Locally on S3

Guy Hardonag

This post aims to cover our experience running Presto in a local environment with the ability to query Amazon S3 and other S3-compatible systems. TL;DR: If you just want to use the environment, you can skip to the example. Context: As part of developing lakeFS, we needed to ensure that its API […]
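The excerpt describes pointing a local Presto at S3-compatible storage. In Presto's Hive connector this is done through a catalog properties file; the fragment below is a hedged sketch (the endpoint, metastore address, and credential values are placeholders, not taken from the post):

```properties
# Hypothetical etc/catalog/hive.properties for a local Presto querying an
# S3-compatible endpoint (e.g. MinIO); values are placeholders.
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.endpoint=http://localhost:9000
hive.s3.aws-access-key=<access-key>
hive.s3.aws-secret-key=<secret-key>
hive.s3.path-style-access=true
```

Path-style access is typically required for non-AWS endpoints, since local S3-compatible servers usually don't resolve bucket-name subdomains.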
