Data Engineering

Data Engineering

The Guide to Data Versioning

Paul Singman
November 25, 2021

“I have never lied to you, I have always told you some version of the truth.”
“The truth doesn’t have versions, okay?”
Something’s Gotta Give (2003)

Jack Nicholson and Diane Keaton discuss data versioning in Something’s Gotta Give.

Introduction

A version of something is defined as “a particular form in which some details are different …

Data Engineering

Data Versioning – Does It Mean What You Think It Means?

Einat Orr, PhD.
November 24, 2021

Introduction

When we first thought about a tagline for our open source project lakeFS, we instinctively gravitated to terms like “Data versioning”, “Manage data the way you manage code”, “Git for data”, or any variation of the three that is grammatically correct. We were very pleased with ourselves for 5 minutes, or maybe 7, before …

Data Engineering Hive Metastore

Takeaways From the Future of Metadata After Hive Metastore Roundtable

Paul Singman
November 16, 2021

Overview of Hive’s Metastore

Let’s get right into it. This is not an objective recap of every topic covered at the Future of Metadata After Hive Roundtable last week. But it is a summary of what I found most interesting from the discussion between panelists Lior Ebel, Ryan Blue, Seshu Adunuthula and host Oz Katz. Watch the full talk below! …

Data Engineering

3 Ways to Add Data to lakeFS

Paul Singman
October 26, 2021

Few people start using lakeFS without first having some data collected. Consequently, once lakeFS is up and running, one of the first things people do is import their existing data into it. There isn’t a one-size-fits-all approach for doing this. Instead, there are ways that work great for a single file, …

Data Engineering

Thoughts on the Future of the Databricks Ecosystem

Paul Singman
September 8, 2021

Databricks has come a long way since growing out of UC Berkeley’s AMPLab in 2013 with an open-source distributed computing framework called Spark. Fast forward eight years and, in addition to the core Spark product, there are a dizzying number of new features in various stages of public preview within the Databricks platform. In case …

Data Engineering

The Docker Everything Bagel™ – Spin Up A Local Data Stack

Paul Singman
August 25, 2021

Introduction

An important part of developing an open source project like lakeFS is assisting and advising our users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally. This means recreating the environment, running the same code, and raising the same error. In complex, modern data stacks …

Data Engineering

Hive Metastore – Why It’s Still Here and What Can Replace It?

Einat Orr, PhD.
November 9, 2021

Hive & Hadoop: A Brief History

Apache Hive burst onto the scene in 2010 as a component of the Hadoop ecosystem, when Hadoop was the novel and innovative way of doing big data analytics. What Hive did was implement a SQL interface to Hadoop. Its architecture consisted of two main services: A Query Engine …
