Case Study
How EPCOR Built Write-Audit-Publish for Data Pipelines Using lakeFS

Authors: Cory Matheson, Raghvendra Verma, Stephen Seewald
Last updated on November 11, 2024
Company

EPCOR Distribution and Transmission Inc. (EPCOR) operates in the electric power generation, transmission, and distribution industry in Canada.

Problem

EPCOR faced data quality challenges in its production data lake and consistency challenges on the development side.

Results

With lakeFS, the ability to branch data let the EPCOR team achieve advanced use cases such as running parallel pipelines with different logic for experimentation and analysis, while ensuring data quality and consistency throughout its development cycles.

The company

EPCOR Distribution and Transmission Inc. (EPCOR) operates in the electric power generation, transmission, and distribution industry in Canada. Over the years, the company has modernized its grid infrastructure to support improved meter reading and system control. In doing so, EPCOR started accumulating vast amounts of data that could be leveraged to support electric system planning, reliability, and customer service initiatives. 

One thing holding the company back was the limitations of traditional on-premises data operations, along with the data quality challenges common to many organizations. This was the initial driver for EPCOR Distribution and Transmission’s digital journey to implement a data analytics platform that could support advanced analytical use cases and advance its overall data maturity.

One such use case allows the company to aggregate hourly meter reads from more than 400,000 customer sites across the entire distribution system and provide visualizations to the various engineering teams at multiple levels:

  • The Customer Connections Services team now has a dashboard to help them support solar panel and electric vehicle upgrade requests
  • The System Planning team can look at year-round hourly level system loading and better forecast future loads
  • The Asset Management team can also monitor historical loads on assets and better predict and prevent future asset failures


EPCOR’s data lake architecture

The tech stack EPCOR chose runs Databricks on top of Azure Cloud to enable fast data processing at scale.

The company moved data across multiple zones, ingested data from eight different source systems, and had a lake growing by approximately 100,000,000 records nightly. This cutting-edge data lake architecture presented new challenges.


The challenges

Challenge: Data quality in production

On the production side, EPCOR experienced data quality challenges in the data lake. The system ran sequential ETL processes against multiple tables, which the team wanted to be able to revert whenever even a single step failed logically. To address this, the team wrote individual rollback scripts for every step of the process.

However, this approach presented several limitations:

  1. The team had to monitor pipeline progress continuously and manually execute the scripts to revert each predecessor pipeline.
  2. The “undo” logic had to be maintained by hand to stay in sync with pipeline changes, which increased effort.
  3. The team couldn’t easily diagnose and debug an issue at a past point in time without impacting the production pipeline.
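The maintenance burden of this approach can be seen in a small sketch (names and structure are hypothetical, not EPCOR's actual scripts): every forward step needs a hand-written inverse, and the two must be kept in sync manually as pipelines evolve.

```python
# Hypothetical illustration of the manual-rollback pattern EPCOR moved away from:
# each ETL step is paired with a hand-maintained undo, applied in reverse order
# when any later step fails.

def run_with_manual_rollback(steps):
    """steps: list of (apply_fn, undo_fn) pairs.

    On failure, run the undo of every step that already completed,
    newest first, then re-raise the original error.
    """
    completed_undos = []
    try:
        for apply_fn, undo_fn in steps:
            apply_fn()
            completed_undos.append(undo_fn)
    except Exception:
        # Every undo here must exactly mirror its step -- this is the
        # logic the team had to keep in sync with pipeline changes.
        for undo_fn in reversed(completed_undos):
            undo_fn()
        raise
```

Each new pipeline step doubles the code to maintain (the step plus its inverse), which is precisely the overhead that commit-based rollback removes.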


Challenge: Consistency in development

On the development side, the EPCOR team was experiencing challenges with consistency. They didn’t want developers to develop and test against the production environment, yet they did want developers to build and test against data of similar scale and variety. Replicating the entire lake in a non-production environment and keeping it current took significant effort and cost.

Additionally, when the development team worked on multiple changes, it was at times impossible to satisfy every change’s data requirements simultaneously, and testing changes would occasionally corrupt the environment. As a result, developers frequently encountered conflicting data states.
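The reason branching avoids the cost of replicating the lake is copy-on-write semantics: a branch starts as a zero-copy view of its parent, and only changed objects are stored. The toy class below (not lakeFS code, just an illustration of the mechanism) shows why creating such a branch is cheap regardless of lake size.

```python
from collections import ChainMap


class DataBranch:
    """Toy copy-on-write branch: reads fall through to the parent,
    writes stay local, so creating a branch copies no data at all."""

    def __init__(self, parent=None):
        self._local = {}  # only objects written on *this* branch
        parent_view = parent._view if parent is not None else {}
        # ChainMap checks the local layer first, then the parent chain.
        self._view = ChainMap(self._local, parent_view)

    def read(self, key):
        return self._view[key]

    def write(self, key, value):
        self._local[key] = value
```

A developer branch created from production sees all production data immediately, while its writes remain invisible to production, which is the isolation property EPCOR wanted for development and testing.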


Adopted solution

Challenge solved: Data quality in production

With lakeFS, the EPCOR team found it could avoid much of the effort of building and maintaining these custom rollback solutions.


Challenge solved: Consistency in development

Complex jobs spanning multiple tables and pipelines are grouped into a single “transaction,” allowing an easy undo across all affected tables. To achieve this, the team runs each ETL on a separate branch.

This way, bad or only partially processed data is never exposed to the data consumers. 
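This write-audit-publish flow can be sketched as a single function. The callables below stand in for lakeFS branch operations (roughly `lakectl branch create`, `lakectl merge`, and `lakectl branch delete` in the lakeFS CLI); the function names are hypothetical, and the sketch shows the pattern rather than EPCOR's implementation.

```python
def write_audit_publish(create_branch, run_etls, audit, merge, delete_branch):
    """Run all ETLs on an isolated branch; publish (merge) only if the
    audit passes, so consumers never see partial or bad results."""
    branch = create_branch()          # zero-copy snapshot of main
    try:
        run_etls(branch)              # write: all tables, one branch
        if not audit(branch):         # audit: data quality checks
            raise ValueError("audit failed; nothing was published")
        merge(branch)                 # publish: one atomic merge to main
    finally:
        delete_branch(branch)         # branches are cheap; always clean up
```

Because the merge is a single atomic operation across every affected table, the “undo” for a failed run is simply not merging, replacing the per-step rollback scripts described earlier.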

Results

With the ability to branch data, EPCOR can now more easily achieve advanced use cases such as running parallel pipelines with different logic for experimentation and what-if analysis, comparing large result sets for data science and machine learning, and more.
