Oz Katz

Data Engineering Project

lakeFS Hooks: Implementing CI/CD for Data using Pre-merge Hooks

Oz Katz
March 2, 2021

Continuous integration of data is the process of exposing data to consumers only after ensuring it adheres to best practices such as format, schema, and PII governance. Continuous deployment of data ensures the quality of data at each step of a production pipeline. In this blog, I will present lakeFS’s web hooks, and showcase a …
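The pre-merge idea described above can be sketched as a small webhook validator: before completing a merge, lakeFS calls a configured URL, and a non-2xx response blocks the merge. The payload shape and the Parquet-only rule below are illustrative assumptions, not the exact lakeFS webhook schema.

```python
# Hedged sketch of a pre-merge validation check.
# Assumption: the hook receives a payload listing changed objects;
# the real lakeFS webhook schema may differ.

def paths_are_parquet(paths):
    """Pass only if every changed object is a Parquet file."""
    return all(p.endswith(".parquet") for p in paths)

def handle_pre_merge(payload):
    """Return (http_status, message). A 200 lets the merge proceed;
    any non-2xx status fails the merge."""
    paths = [obj["path"] for obj in payload.get("objects", [])]
    if paths_are_parquet(paths):
        return 200, "ok"
    return 400, "non-Parquet files found"

# A merge introducing a CSV file would be rejected:
status, msg = handle_pre_merge({"objects": [{"path": "events/day=1/x.csv"}]})
```

The same pattern extends to schema checks or PII scans: each check is just another function the hook runs before answering.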


Data Engineering

Chaos Data Engineering

Oz Katz
May 19, 2021

Modern data lakes are a complexity tar pit. They involve many moving parts: distributed computation engines, running on virtualized servers connected by a software-defined network, running on top of distributed object stores, orchestrated by a distributed stream processor or pipeline execution engine. These moving parts fail. All the time. Handling these failures is not …


Data Engineering Project

Introducing lakeview: A Visibility Tool for AWS S3 Based Data Lakes

Oz Katz
May 19, 2021

Lakeview is a new open source visibility tool for AWS S3 based data lakes. Think of it as ncdu, but for petabyte-scale data. Its goal is to provide you with an easy way to see the total size of your S3 bucket (prefix) storage. Instead of scanning billions of objects using the S3 API, which …
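An ncdu-style per-prefix rollup can be computed from a bulk object listing (for example, rows of an S3 Inventory report) rather than per-object API calls. The function below is a hypothetical sketch of that aggregation, not Lakeview's actual implementation.

```python
from collections import defaultdict

def size_by_prefix(objects, depth=1):
    """Aggregate object sizes by key prefix, ncdu-style.

    `objects` is an iterable of (key, size_in_bytes) pairs,
    e.g. rows pulled from a bulk inventory listing.
    `depth` controls how many path components form the prefix.
    """
    totals = defaultdict(int)
    for key, size in objects:
        prefix = "/".join(key.split("/")[:depth])
        totals[prefix] += size
    return dict(totals)

# Example: three objects under two top-level prefixes.
listing = [
    ("logs/2021/a.gz", 100),
    ("logs/2021/b.gz", 50),
    ("tables/t.parquet", 500),
]
```

Increasing `depth` drills down into sub-prefixes, the same way ncdu lets you descend into directories.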


Data Engineering

Diary of a Data Engineer

Oz Katz
May 19, 2021

A glimpse into the life of a data engineer. Day 1: Finally, an easy one. Got a pretty simple task for a change – read a new type of event stream generated by sales, and publish it to the data lake. Sounds like a straightforward ETL. I estimate this as one day of work. I …

