Data Engineering

Data Engineering

Towards Effective DataOps

Paul Singman
May 11, 2022

Gain the confidence to mess with your data without making a mess of your data. “If it hurts, do it more often” is a wise piece of advice that DevOps engineers often repeat. Unless you are a masochist, following this advice will naturally lead you to find ways to make the repeated process less painful. …

Data Engineering

Clearing the mess – How to ensure data quality with versioning

The lakeFS team
May 11, 2022

The last decade saw an unprecedented rise in the number of organizations that base their decisions and operations on data. The number of digital products that collect and process data and use it to fuel decision-making algorithms for enhancing future services is also growing at a very fast pace. That’s why data and data quality …

Data Engineering

5 Painful mistakes data engineers make, and how to avoid them

The lakeFS team
May 11, 2022

In today’s world of data engineering, we need to store more than just simple text information in relational or non-relational databases, tables or documents. Data formats include email, images, video, web pages, audio files, datasets, sensor data and other types of media content. Basically, a big chunk of unstructured data. Studies have shown that somewhere …

Data Engineering

Closing the Gap: Lifecycle Management for Data Products

Einat Orr, PhD.
March 7, 2022

As data practitioners, we use many different terms to talk about what we do – we call it business intelligence, analytics, data pipelines, or insights. But there’s one term that captures what we do really well: delivering products. When I was leading a 200-person engineering team at SimilarWeb, I couldn’t help but notice about …

Data Engineering

Level Up Your Data Lake

Paul Singman
May 11, 2022

What is the Basic Data Lake? A data lake is primarily two things: an object store and the objects being stored. It might look something like this: Even with this basic setup, your data is in a good position to support all three of the main use cases for data: 1. BI Analytics 2. Data-Intensive APIs …

Data Engineering

How Easy It Is to Re-use Old Pandas Code in Spark 3.2?

Paul Singman
May 11, 2022

In October, it was announced that the Pandas API was being integrated with Spark. This was particularly exciting news for a Pandas-baby like myself, whose first exposure to data analytics was Pandas-based notebook tutorials. Spark 3.2 has been out for several months now and a curiosity has been building inside me – how easy it is to …

Data Engineering, Integrations

The Everything Bagel II: Versioned Data Lake Tables with lakeFS and Trino

Paul Singman
May 11, 2022

Introduction: Dockerize Your Data Pipeline I can remember times when my company started using a new technology — be it Redis, Kafka, or Spark — and in order to try it out I found myself staring at a screen like this: At the time I thought nothing of doing this. And even wore it as a badge of pride …

Data Engineering

The Guide to Data Versioning

Paul Singman
March 8, 2022

“I have never lied to you, I have always told you some version of the truth.” “The truth doesn’t have versions, okay?” (Something’s Gotta Give, 2003) Jack Nicholson and Diane Keaton discuss data versioning in Something’s Gotta Give. Introduction: A version of something is defined as “a particular form in which some details are different …

Data Engineering

Data Versioning – Does It Mean What You Think It Means?

Einat Orr, PhD.
May 11, 2022

Introduction: When we first thought about a tagline for our open source project lakeFS, we instinctively gravitated to terms like “Data versioning”, “Manage data the way you manage code”, “Git for data”, or any variation of the three that is grammatically correct. We were very pleased with ourselves for 5 minutes, or maybe 7, before …
