Event Type: Webinar

Develop ETL pipelines with zero copy prod data on top of AWS EMR Serverless

Ankit Srinivas
November 8, 2022

Delivering high-quality data products requires strict testing of pipelines before deploying them to production. Today, to test against quality data, one must either use a subset of the production data or create multiple copies of the entire dataset. Testing against sample data is not good enough. The alternative, however, is costly …

Promote only high-quality data to production

Ankit Srinivas
November 8, 2022

Engineering best practices dictate having an isolated staging environment. And yet today, data transformation is most often done directly on production data. Moreover, even if the code and infrastructure don’t change, the data might, and those changes introduce potential quality issues. In this webinar, you will learn: How to create a staging environment for your data …

Get familiar with lakeFS playground using Notebooks

Ankit Srinivas
November 3, 2022

In this session, we will review a quick way to get started with lakeFS, taking advantage of the playground. 1. Spin up an environment. 2. Configure the lakeFS command line interface to work against the playground. 3. Spin up a Docker container with Spark, Python, and Jupyter Notebook pre-configured. 4. Configure Spark and Python to …

Develop Spark ETL pipelines with no risk against production data

Ankit Srinivas
October 28, 2022

Delivering high-quality data products requires strict testing of pipelines before deploying them to production. Today, to test against quality data, one must either use a subset of the data or create multiple copies of the entire dataset. Testing against sample data is not good enough. The alternative, however, is costly and …

Achieve Multi-Table Transactions On Delta tables

Ankit Srinivas
October 19, 2022

Data engineers typically need to implement custom logic in scripts to guarantee two or more data assets (tables) are updated synchronously. This logic often requires extensive rewrites or periods during which data is unavailable or not synchronized. We will demonstrate a way to run data transformation in isolation across multiple tables, without ever creating a …

Troubleshoot and Reproduce Data with Apache Airflow

Ankit Srinivas
September 27, 2022

One property of data pipelines is that they rarely stay still. Instead, there are near-constant updates to some aspect of the infrastructure they run on, or to the logic they use to transform data. Efficiently applying the necessary changes to a pipeline requires running it in parallel to production to test the …

Develop Spark ETL pipelines with no risk against production data

Ankit Srinivas
September 7, 2022

Delivering high-quality data products requires strict testing of pipelines before deploying them to production. Today, to test against quality data, one must either use a subset of the data or create multiple copies of the entire dataset. Testing against sample data is not good enough. The alternative, however, is costly and …

Data Superstream: Data Lakes and Warehouses

Paul Singman
February 9, 2022

Gain insight into how to increase the scalability, speed, and availability of your data, along with best practices for utilizing your data warehouse, data lake, or data lakehouse.
