Data Engineering

Data Engineering

Ensuring Data Quality in a Data Lake Environment

Einat Orr, PhD.
January 19, 2021

The quality of the data we introduce determines the overall reliability of our data lake, and the ingestion stage is a critical point for ensuring the soundness of our service and data. The same way software engineers apply automatic testing to new code, data engineers should continuously test newly ingested data while ensuring they meet …

Data Engineering Project

Why Data Versioning as an Infrastructure Matters

Einat Orr, PhD.
January 8, 2021

The demand for infrastructure that contributes to the collection, storage, and analysis of data is growing with the increasing amounts of data managed by organizations. Every organization that manages data pipelines to extract insights from data encounters the need for reproducibility, safe experimentation, and means to ensure data quality. The path to answering these needs …

Data Engineering Project

Data Mesh Applied: How to Move Beyond the Data Lake with lakeFS

Einat Orr, PhD.
December 10, 2020

The data mesh paradigm. The Data Mesh paradigm was first introduced by Zhamak Dehghani in her article "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh". Unlike traditional monolithic data infrastructures that handle the consumption, storage, transformation, and output of data in one central data lake, a data mesh supports distributed, …

Data Engineering

Object Storage: Everything You Need to Know

Yael Rivkind
January 17, 2021

While object storage is not a novel technology, it can still be overwhelming when getting started. Here’s a definitive guide to object-based storage with everything you need to know. What is object storage? At its core, object storage, or object-based storage, is a data storage architecture that allows you to store large amounts of unstructured data …

Data Engineering

Chaos Data Engineering

Oz Katz
December 13, 2020

Modern data lakes are a complexity tar pit. They involve many moving parts: distributed computation engines, running on virtualized servers connected by a software-defined network, running on top of distributed object stores, orchestrated by a distributed stream processor or pipeline execution engine. These moving parts fail. All the time. Handling these failures is not …

Data Engineering Project

Building A Data Development Environment with lakeFS

Barak Amar
December 30, 2020

Overview: As part of our routine work with data, we develop code, choose and upgrade compute infrastructure, and test new data. Usually, this requires running parts of our production pipelines in parallel to production, testing the changes we wish to apply. Every data engineer knows that this convoluted process requires copying data, manually updating configuration, …

Data Engineering Project

Introducing lakeview: A Visibility Tool for AWS S3 Based Data Lakes

Oz Katz
December 30, 2020

Lakeview is a new open-source visibility tool for AWS S3 based data lakes. Think of it as ncdu, but for petabyte-scale data. Its goal is to provide you with an easy way to see the total size of your S3 bucket (prefix) storage. Instead of scanning billions of objects using the S3 API, which …

Data Engineering Project

How to Manage Your Data the Way You Manage Your Code

Einat Orr, PhD.
October 5, 2020

Fifty years ago, it was very hard to collaborate over code. When developing large-scale software projects, it was difficult to manage changes to source code over time, as revision control tools were only starting to enter mainstream computing. The adoption of version control tools, first centralized and then distributed, changed all that, and now …

Data Engineering

Diary of a Data Engineer

Oz Katz
December 30, 2020

Day 1: Finally, an easy one. Got a pretty simple task for a change – read a new type of event stream generated by sales, and publish it to the data lake. Sounds like a straightforward ETL. I estimate this as one day of work. I can reuse a bunch of code from similar jobs …
