Data Engineering

Thoughts on the Future of the Databricks Ecosystem

Paul Singman
September 8, 2021

Databricks has come a long way since growing out of a Berkeley lab in 2013 with an open-source distributed computing framework called Spark. Fast forward eight years and, in addition to the core Spark product, there is a dizzying number of new features in various stages of public preview within the Databricks platform. In case …

Data Engineering

The Docker Everything Bagel™ – Spin Up A Local Data Stack

Paul Singman
August 25, 2021

Introduction: An important part of developing an open-source project like lakeFS is assisting and advising our users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally. This means recreating the environment, running the same code, and raising the same error. In complex, modern data stacks …

Data Engineering

Hive Metastore – Why It’s Still Here and What Can Replace It?

Einat Orr, PhD.
August 19, 2021

Hive & Hadoop, a Brief History: Apache Hive burst onto the scene in 2010 as a component of the Hadoop ecosystem, when Hadoop was the novel and innovative way of doing big data analytics. Hive implemented a SQL interface to Hadoop. Its architecture consisted of two main services: A Query Engine …

Data Engineering

Making Sure Your Data Lifecycle Management Makes Sense

Paul Singman, Einat Orr, PhD.
July 15, 2021

What is Data Lifecycle Management? Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table. Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing this process is …

Data Engineering Project

Advancing lakeFS: Version Data At Scale With Spark

Tal Sofer
June 23, 2021

Combining lakeFS and Spark brings a new standard of scale and elasticity to distributed data pipelines. When integrating two technologies, the aim should be to expose the strengths of each as much as possible. With this philosophy in mind, we are excited to announce the beta release of the lakeFS FileSystem! This native Hadoop FileSystem …

Data Engineering Integrations

Air & Water: The Airflow and lakeFS Integration

Itai Admi
May 27, 2021

Today we are excited to announce the official release of the lakeFS Airflow provider! This package lets you easily integrate lakeFS functionality into your Airflow DAGs. The library is published on PyPI, so it can easily be installed in your project via the command:

pip install airflow-provider-lakefs

Once installed, you are …

Data Engineering

Solving Data Reproducibility

Paul Singman
May 24, 2021

Debugging an issue is never fun, but why make it harder? In this post, we show how reproducing data is possible whether you are working with a single file, an entire table, or a whole data repository. Introducing Data Reproducibility: There are two types of issues in the world, reproducible and unreproducible. A reproducible issue is one where the original conditions for …
