Case Study
How EPCOR Built Write-Audit-Publish for Data Pipelines Using lakeFS
Company
EPCOR Distribution and Transmission Inc. (EPCOR) operates in the electric power generation, transmission, and distribution industry in Canada.
Problem
EPCOR experienced data quality challenges in its data lake; on the development side, the team faced challenges with data consistency.
Results
By using lakeFS, the EPCOR team gained the ability to branch data, enabling advanced use cases such as running parallel pipelines with different logic for experimentation and analysis, and ensuring data quality and consistency in its development cycles.
The company
EPCOR Distribution and Transmission Inc. (EPCOR) operates in the electric power generation, transmission, and distribution industry in Canada. Over the years, the company has modernized its grid infrastructure to support improved meter reading and system control. In doing so, EPCOR started accumulating vast amounts of data that could be leveraged to support electric system planning, reliability, and customer service initiatives.
One thing holding the company back was the limitations of traditional on-premise data operations, along with the common data quality challenges shared by many organizations. This was the initial driver for EPCOR Distribution and Transmission's digital journey to implement a data analytics platform that could support advanced analytical use cases and progress the company's overall data maturity.
One such use case allows the company to aggregate hourly meter reads from over 400,000 customer sites across the entire distribution system and provide visualizations to the various engineering teams at multiple levels:
- The Customer Connections Services team now has a dashboard to help them support solar panel and electric vehicle upgrade requests
- The System Planning team can look at year-round hourly level system loading and better forecast future loads
- The Asset Management team can also monitor historical loads on assets and better predict and prevent future asset failures
EPCOR’s data lake architecture
The tech stack EPCOR chose runs Databricks on top of Azure Cloud to enable fast data processing at scale.
The company moved data across multiple zones, ingested data from eight different source systems, and had a lake growing by approximately 100,000,000 records nightly. This cutting-edge data lake architecture presented new challenges.
The challenges
Challenge: Data quality in production
On the production side, EPCOR experienced data quality challenges in the data lake. The system had sequential ETL processes running against multiple tables, which the team wanted to be able to revert in cases where just one step failed logically. To try and solve this challenge, the team wrote scripts for individual rollbacks for every step of the process.
However, this approach presented several limitations:
- The team had to continuously monitor the progress to initiate a manual execution of scripts to revert each predecessor pipeline.
- The team had to manually keep the "undo" logic in sync with pipeline changes, which increased maintenance effort.
- The team couldn’t easily diagnose and debug an issue at a past point in time without impacting the production pipeline.
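The fragility of the per-step approach can be sketched as follows. The step and undo functions here are hypothetical stand-ins for EPCOR's actual pipeline logic; the point is that every step needs a matching, hand-written undo that must be kept in sync with it, and someone must notice a failure and trigger the rollback:

```python
# Sketch of hand-maintained rollback logic for a sequential ETL.
# The steps and their "undo" twins are hypothetical; each undo must be
# written and maintained by hand to mirror its step exactly.

def step_ingest(tables):
    tables["raw"] = ["r1", "r2"]

def undo_ingest(tables):
    tables.pop("raw", None)  # must mirror step_ingest exactly

def step_aggregate(tables):
    tables["hourly_agg"] = [f"agg({r})" for r in tables["raw"]]

def undo_aggregate(tables):
    tables.pop("hourly_agg", None)  # must mirror step_aggregate exactly

PIPELINE = [(step_ingest, undo_ingest), (step_aggregate, undo_aggregate)]

def run_with_manual_rollback(tables):
    completed_undos = []
    try:
        for step, undo in PIPELINE:
            step(tables)
            completed_undos.append(undo)
    except Exception:
        # On any failure, every predecessor step must be reverted
        # by replaying its undo script in reverse order.
        for undo in reversed(completed_undos):
            undo(tables)
        raise
    return tables
```

Each new pipeline step doubles the code to maintain, which is exactly the burden the team wanted to eliminate.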
Challenge: Consistency in development
On the development side, the EPCOR team was experiencing challenges with consistency. They didn’t want developers to develop and test against the production environment. However, they wanted developers to build and test against similar scales, data variety, etc. Replicating the entire lake in a non-production environment and keeping it current took significant effort and cost.
Additionally, when the development team worked on multiple changes, it was at times impossible to ensure all data requirements were met at the same time, and testing changes would occasionally corrupt the environment. As a result, developers frequently encountered conflicting data states.
Adopted solution
Challenge solved: Data quality in production
With lakeFS, the EPCOR team avoided a significant amount of the effort that went into building and maintaining per-step rollback scripts. Complex jobs are grouped across multiple tables and pipelines into a single "transaction", allowing an easy undo across all affected tables. To achieve this, the team runs ETLs on separate branches and merges the results back only once they pass validation:
This way, bad or only partially processed data is never exposed to the data consumers.
Challenge solved: Consistency in development
Instead of replicating the entire lake into a non-production environment, developers can branch from production data to get an isolated, up-to-date environment at full scale and variety for building and testing, without the cost and effort of maintaining a separate copy.
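The branch-per-ETL pattern can be sketched with a toy, in-memory model. This illustrates the write-audit-publish flow described above, not the actual lakeFS API; the branch name, tables, and quality check are all hypothetical:

```python
import copy

# Toy in-memory model of write-audit-publish on branches.
# Illustrates the pattern, not the real lakeFS API.

class Repo:
    def __init__(self):
        self.branches = {"main": {}}  # branch name -> {table: rows}

    def branch(self, name, source="main"):
        # Creating a branch snapshots the source state in isolation.
        self.branches[name] = copy.deepcopy(self.branches[source])

    def merge(self, source, dest="main"):
        # Publish: dest atomically takes on the source branch's state.
        self.branches[dest] = copy.deepcopy(self.branches[source])

def run_etl(repo, branch):
    # Write: every step of the ETL lands on the isolated branch.
    data = repo.branches[branch]
    data["raw"] = ["r1", "r2"]
    data["hourly_agg"] = [f"agg({r})" for r in data["raw"]]

def audit(repo, branch):
    # Audit: hypothetical quality check run before anything is published.
    data = repo.branches[branch]
    return "hourly_agg" in data and len(data["hourly_agg"]) == len(data["raw"])

def write_audit_publish(repo, branch="etl-run"):
    repo.branch(branch)
    run_etl(repo, branch)
    if audit(repo, branch):
        repo.merge(branch)           # publish to data consumers
    else:
        del repo.branches[branch]    # discard; main is never touched
```

Because the ETL writes only to its own branch, a failed or partial run is discarded as a unit, and consumers reading from main never see intermediate states.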
Results
With the ability to branch data, EPCOR can now more easily achieve advanced use cases such as running parallel pipelines with different logic to experiment or conduct what-if analysis, comparing large result sets for data science and machine learning, and more.
