Iddo Avneri

Use Cases

Data Lake Governance at Scale with lakeFS

Iddo Avneri
January 30, 2023

Introduction Often, data lake platforms lack simple ways to enforce data governance. This is especially challenging since data governance requirements are complicated to begin with, even without the added complexities of managing data in a data lake. Therefore, enforcing them is an expensive, time-consuming ongoing effort, requiring continuous management. Typically, at the expense of data …


Integrations Tutorials

Databricks and lakeFS Integration: Step-by-Step Configuration Tutorial

Iddo Avneri
January 11, 2023

Introduction This tutorial reviews all the steps needed to configure lakeFS on Databricks. It assumes that lakeFS is already set up and running against your storage (in this example, AWS S3) and focuses on setting up the Databricks and lakeFS integration. Prerequisites Step 1 – Acquire lakeFS Key and Secret In this step, …


Data Engineering

ETL Testing: A Practical Guide

Iddo Avneri
January 16, 2023

What is ETL Testing? ETL testing is the process of evaluating and verifying that the ETL (Extract, Transform, Load) processes work correctly. What is ETL? An ETL process Extracts data in potentially many different structured or unstructured formats from multiple sources into a centralized repository. Then, an ETL process Transforms the data to a format …


Integrations Use Cases

Troubleshoot and Reproduce Data with Apache Airflow

Iddo Avneri
December 6, 2022

Apache Airflow enables you to build multistep workflows across multiple technologies. Its programmatic approach to scheduling and monitoring workflows helps users build complicated ETLs on their data that would otherwise be difficult to automate. This enabled the evolution of ETLs from simple single steps to complicated, parallelized, multi-step advanced transformations: The challenge …


Data Engineering Use Cases

How to Develop Spark ETL Pipelines in Isolation

Amit Kesarwani, Vino SD, Iddo Avneri
November 7, 2022

You’re bound to ask yourself this question at some point: Do I need to test the Spark ETLs I’m developing? The answer is yes; you certainly should – and not just with unit testing but also integration, performance, load, and regression testing. Naturally, the scale and complexity of your data set matter a lot, so …


Git for Data – lakeFS
