Oz Katz

Data Engineering

Chaos Data Engineering

Oz Katz
December 13, 2020

Modern data lakes are a complexity tar pit. They involve many moving parts: distributed computation engines, running on virtualized servers connected by a software-defined network, running on top of distributed object stores, orchestrated by a distributed stream processor or pipeline execution engine. These moving parts fail. All the time. Handling these failures is not …

Data Engineering, Project

Introducing lakeview: A Visibility Tool for AWS S3 Based Data Lakes

Oz Katz
December 30, 2020

Lakeview is a new open source visibility tool for AWS S3 based data lakes. Think of it as ncdu, but for petabyte-scale data. Its goal is to provide you with an easy way to see the total storage size of your S3 bucket or prefix. Instead of scanning billions of objects using the S3 API, which …

Data Engineering

Diary of a Data Engineer

Oz Katz
December 30, 2020

Day 1: Finally, an easy one

Got a pretty simple task for a change – read a new type of event stream generated by sales, and publish it to the data lake. Sounds like a straightforward ETL. I estimate this as one day of work. I can reuse a bunch of code from similar jobs …
