Paul Singman

Data Engineering

Thoughts on the Future of the Databricks Ecosystem

Paul Singman
May 11, 2022

Databricks has come a long way since growing out of a Berkeley lab in 2013 with an open-source distributed computing framework called Spark. Fast forward eight years and, in addition to the core Spark product, there is a dizzying number of new features in various stages of public preview within the Databricks platform. In case …

Data Engineering

The Docker Everything Bagel™ – Spin Up A Local Data Stack

Paul Singman
May 24, 2022

Update Dec 16, 2021: Part II of the Everything Bagel series is published! Introduction: An important part of developing an open-source project like lakeFS is assisting and advising our users. When they run into an issue and feel pain, we want to feel that pain, too. Quite literally. This means recreating …

Data Engineering

Making Sure Your Data Lifecycle Management Makes Sense

Paul Singman, Einat Orr, PhD.
May 11, 2022

What is Data Lifecycle Management? Datasets are the foundational output of a data team. They do not appear out of thin air. No one has ever snapped their fingers and created an orders_history table. Instead, useful sets of data are created and maintained through a process that involves several predictable steps. Managing …

Integrations Machine Learning

Build Reproducible Experiments with Kubeflow and lakeFS

Tal Sofer, Paul Singman
November 21, 2022

Introducing Kubeflow and lakeFS: Kubeflow is a cloud-native ML platform that simplifies the training and deployment of machine learning pipelines on Kubernetes. An ML project using Kubeflow consists of isolated components for each stage of the ML lifecycle. Each component of a Kubeflow pipeline is packaged as a Docker image and executed in a …

Data Engineering

Solving Data Reproducibility

Paul Singman
November 21, 2022

Debugging an issue is never fun, but why make it harder? In this post, we show how reproducing data is possible whether interacting with a single file, an entire table, or a data repository. Introducing Data Reproducibility: There are two types of issues in the world: reproducible and unreproducible. A reproducible issue is one where the original conditions for …

People

Why I’m Joining lakeFS

Paul Singman
April 6, 2021

Thoughts on a personal journey into the world of developer advocacy at an open-source data project. In March of 2021, I chose to leave the data team at Equinox Media and join lakeFS, a nascent open-source project, as its first developer advocate. In this post, I share a few reasons why I’m excited about starting this …

Data Engineering

3 Data Lake Anti-Patterns to Avoid

Paul Singman
May 19, 2021

Rid yourself of these troubling habits and start the journey towards data lake mastery! Introduction: Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes, though, the promise of technological performance can overshadow an unpleasant developer experience. This is troublesome since I believe the developer experience is as …

Data Engineering

Data Lakes: The Definitive Guide

Paul Singman
May 27, 2021

What is a Data Lake? A data lake is a system of technologies that allows for the querying of data in file or blob objects. When employed effectively, data lakes enable the analysis of structured and unstructured data assets at tremendous scale and cost-efficiency. The number of organizations employing data lake architectures has increased exponentially since …
