Develop Spark ETL pipelines with no risk against production data

Description

Delivering high-quality data products requires rigorous testing of pipelines before they are deployed to production. Today, testing against production data means either using a small subset of it or creating multiple copies of the entire data set. Testing against sample data is not enough: the complexity, scale, and variety of a production data environment are precisely what challenge an ETL pipeline's correctness and performance, so your testing environment must let you run your end-to-end pipeline against production data. With lakeFS, you get an isolated view of the entire production data set with zero copying.

You will learn how to:

  1. Set up your environment in under 5 minutes
    • Integrate lakeFS and Spark
    • Execute Git-like actions using the lakeFS Python client
  2. Create multiple isolated testing environments without copying data
  3. Run multiple tests in your environment using Git-like operations such as branch, commit, and revert (see the sketch after this list)
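
The whole workflow fits in a short script. Below is a minimal sketch of these steps, assuming a lakeFS installation reachable at a placeholder endpoint, an existing repository named example-repo with a main branch, and the lakefs_client and pyspark Python packages; the credentials, branch name, transform, and Parquet paths are illustrative stand-ins, not webinar material.

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient
from pyspark.sql import SparkSession

LAKEFS_ENDPOINT = "http://localhost:8000"  # placeholder lakeFS installation
ACCESS_KEY = "<lakefs-access-key-id>"      # placeholder credentials
SECRET_KEY = "<lakefs-secret-access-key>"
REPO = "example-repo"                      # placeholder repository

# lakeFS client for the Git-like operations (branch, commit, revert, ...)
conf = lakefs_client.Configuration(host=LAKEFS_ENDPOINT)
conf.username = ACCESS_KEY
conf.password = SECRET_KEY
client = LakeFSClient(conf)

# Create an isolated, zero-copy testing environment: a branch of production.
client.branches.create_branch(
    repository=REPO,
    branch_creation=models.BranchCreation(name="etl-test", source="main"),
)

# Point Spark at the lakeFS S3 gateway; objects are addressed as
# s3a://<repository>/<branch>/<object path>.
spark = (
    SparkSession.builder.appName("etl-under-test")
    .config("spark.hadoop.fs.s3a.endpoint", LAKEFS_ENDPOINT)
    .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Run the ETL against the full production data set, isolated on the branch.
events = spark.read.parquet(f"s3a://{REPO}/etl-test/raw/events/")
curated = events.dropDuplicates().filter("event_time IS NOT NULL")
curated.write.mode("overwrite").parquet(f"s3a://{REPO}/etl-test/curated/events/")

# Commit the branch so this test run is reproducible...
client.commits.commit(
    repository=REPO,
    branch="etl-test",
    commit_creation=models.CommitCreation(message="ETL test run on prod data"),
)

# ...and tear the environment down when done. (client.branches.revert_branch
# can likewise undo commits on a branch you want to keep.)
client.branches.delete_branch(repository=REPO, branch="etl-test")
```

Because a lakeFS branch only records metadata pointers, creating and discarding test environments like this is effectively instantaneous regardless of how large the production data set is.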

Speakers:

lakeFS
