Data pipelines rarely stay still. Instead, there are near-constant updates to some aspect of the infrastructure they run on, or to the logic they use to transform data.
Applying changes to a pipeline efficiently requires running it in parallel with production to test the effect of each change. And once a change is deployed, the production environment must be continuously monitored for data issues and failed ETL jobs. Most data engineers would agree that the best way to do this is far from a solved problem.
Most attempts at troubleshooting failed ETL jobs are manual and error-prone, which makes it difficult to meet the SLA requirements of data products. This approach also fails to ensure the integrity and availability of data for downstream applications while the issues are being resolved. A more effective approach is to revert the production data to a consistent state (i.e., a state before the data issue occurred), restoring data availability immediately.
The open source project lakeFS makes reverting production data to an error-free state a simple one-line command. lakeFS supports Git-like branch, commit, and revert operations on the data lake, which enables a safe and repeatable way to troubleshoot production issues.
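As a rough sketch of what this workflow can look like with the `lakectl` CLI (the repository and branch names below are hypothetical, and the commands assume a configured lakeFS installation):

```shell
# Create an isolated branch to test pipeline changes against production data
# without touching the main branch
lakectl branch create lakefs://example-repo/etl-test \
  --source lakefs://example-repo/main

# ... run the modified pipeline against the etl-test branch ...

# If a bad load has already reached production, inspect the commit history
# to find the offending commit
lakectl log lakefs://example-repo/main

# Revert the production branch's data to its pre-issue state
lakectl branch revert lakefs://example-repo/main <commit-id>
```

Because the branch is a metadata-level operation rather than a copy of the data, creating a test environment and reverting a bad commit are both fast regardless of data volume.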
In this session, we will showcase how to use lakeFS to quickly analyze and troubleshoot failed Airflow jobs, thereby improving data integrity and trustworthiness.