Data Engineering

Data Engineering

Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared

Oz Katz
April 13, 2021

Introduction: When building a data lake, there is perhaps no more consequential decision than the format data will be stored in. The outcome will have a direct effect on its performance, usability, and compatibility. It is inspiring that by simply changing the format data is stored in, we can unlock new functionality and improve the …

Data Engineering

3 Data Lake Anti-Patterns to Avoid

Paul Singman
March 31, 2021

Rid yourself of these troubling habits and start the journey towards data lake mastery! Introduction: Data lakes offer tantalizing performance upside, which is a major reason for their high rate of adoption. Sometimes, though, the promise of technological performance can overshadow an unpleasant developer experience. This is troublesome since I believe the developer experience is as …

Data Engineering

Data Lakes: The Definitive Guide

Paul Singman
March 22, 2021

What is a Data Lake? A data lake is a system of technologies that allows for the querying of data in file or blob objects. When employed effectively, it enables the analysis of structured and unstructured data assets at tremendous scale and cost-efficiency. The number of organizations employing data lake architectures has increased exponentially since …

Data Engineering Project

lakeFS Hooks: Implementing CI/CD for Data using Pre-merge Hooks

Oz Katz
March 2, 2021

Continuous integration of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures the quality of data at each step of a production pipeline. In this blog, I will present lakeFS’s webhooks, and showcase a …

Data Engineering

Data Quality Testing: Ways to Test Data Validity and Accuracy

Einat Orr, PhD.
April 1, 2021

Introduction: If Sisyphus had been a data analyst or a data scientist, the boulder she’d be rolling up the hill would have been her data quality assurance. Even if all engineering processes of ingesting, processing, and modeling are working impeccably, the ability to test data quality at any stage of the data pipeline, and being …

Data Engineering

Ensuring Data Quality in a Data Lake Environment

Einat Orr, PhD.
March 24, 2021

The quality of the data we introduce determines the overall reliability of our data lake, and the ingestion stage is a critical point for ensuring the soundness of our service and data. The same way software engineers apply automatic testing to new code, data engineers should continuously test newly ingested data while ensuring they meet …

Data Engineering Project

Why Data Versioning as an Infrastructure Matters

Einat Orr, PhD.
March 24, 2021

The demand for infrastructure that contributes to the collection, storage, and analysis of data is growing with the increasing amounts of data managed by organizations. Every organization that manages data pipelines to extract insights from data encounters the need for reproducibility, safe experimentation, and means to ensure data quality. The path to answering these needs …

Data Engineering Project

Data Mesh Applied: How to Move Beyond the Data Lake with lakeFS

Einat Orr, PhD.
March 24, 2021

The data mesh paradigm: The Data Mesh paradigm was first introduced by Zhamak Dehghani in her article "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh". Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, …
