It is no secret that modern businesses run on big data: if your business were a car, big data would be the engine that powers it. Every business wants to leverage its data to make better-informed decisions that accelerate its success, but with the volume, velocity, and variety of data growing exponentially, this has become ever more challenging.
Enter the world of data lakes. A data lake is a repository that holds large amounts of data in its native, raw format. Data is processed only when it is needed (in contrast to a data warehouse, which processes all data on ingestion). This makes data lakes an efficient option for storage, resource management, and data preparation.
Many businesses have embarked upon a digital journey to better harness their big data. EPCOR was one such company. EPCOR Distribution and Transmission Inc. serves over 428,000 customers in the City of Edmonton and has taken the lead on EPCOR’s Data Analytics journey.
EPCOR Distribution and Transmission’s Digital Journey
By Stephen Seewald, Manager – Risk & Performance – EPCOR
Raghvendra Verma, Project Manager – EPCOR
Cory Matheson, Technical Lead – EPCOR
Like many industries, the electricity industry is changing rapidly. As consumer demand for new and green technologies rises, the traditional power grid faces new challenges that affect how we must plan and operate in order to prudently deliver safe and reliable service to customers. Over the years, EPCOR Distribution and Transmission Inc. (EPCOR) has modernized its grid infrastructure to support improved meter reading and system control. In doing so, EPCOR has accumulated vast amounts of data that could be leveraged to support electric system planning, reliability, and customer service initiatives.

What held us back were the limitations of traditional on-premises data operations and the data quality challenges common to many organizations. This was the initial driver for EPCOR Distribution and Transmission's digital journey: implementing a data analytics platform that could support advanced analytical use cases and advance our overall data maturity.

One such use case allows us to aggregate hourly meter reads from over 400K customer sites across the entire distribution system and provide visualizations at multiple levels to the various engineering teams. Our Customer Connections Services team now has a dashboard to help them support solar panel and electric vehicle upgrade requests, the System Planning team can examine year-round, hourly system loading to better forecast future loads, and Asset Management can monitor historical loads on assets to better predict and prevent future asset failures.
Pre lakeFS Architecture
The tech stack we chose runs Databricks on top of Azure Cloud to enable fast data processing at scale.
At the time we decided to look into lakeFS, we were moving data across multiple zones, ingesting data from 8 different source systems, and growing the lake by approximately 100 million records nightly.
This cutting-edge data lake architecture presented new challenges.
Challenges posed by the data lake architecture
Before we implemented lakeFS, our project team had encountered a number of challenges that highlighted some deficiencies in our dev-test-prod workflows.
On the production side, we were experiencing data quality challenges in our data lake. The system ran sequential ETL processes against multiple tables, and we wanted to be able to logically revert them in cases where just one step failed. To address this, we had written scripts to individually roll back each step of the process; however, this approach had several limitations:
- We had to continuously monitor progress so we could manually run the scripts that reverted each predecessor pipeline.
- We had to manually maintain the “undo” logic to stay in sync with any pipeline changes, thereby increasing effort.
- We were unable to easily diagnose and debug an issue at a past point in time without impacting the production pipeline.
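To make the maintenance burden concrete, here is a minimal sketch (our illustration, not EPCOR's actual scripts) of what per-step undo logic looks like: every pipeline step needs a matching, hand-maintained rollback, and a mid-run failure means replaying the rollbacks in reverse. The step and table names below are purely illustrative.

```python
# Sketch of the pre-lakeFS approach: each pipeline step carries a
# hand-written "undo" that must be kept in sync with it by hand.
# Step and table names are illustrative.

def run_with_manual_rollback(steps, state):
    """steps: list of (apply, undo) callables; returns True on full success."""
    completed = []
    for apply_step, undo_step in steps:
        try:
            apply_step(state)
            completed.append(undo_step)
        except Exception:
            # Manually revert every predecessor step, newest first.
            for undo in reversed(completed):
                undo(state)
            return False
    return True

def bump_meter_reads(s): s["meter_reads"] = "v2"
def undo_meter_reads(s): s["meter_reads"] = "v1"
def summarize(s): raise RuntimeError("bad data")  # second step fails

tables = {"meter_reads": "v1", "load_summary": "v1"}
ok = run_with_manual_rollback(
    [(bump_meter_reads, undo_meter_reads), (summarize, lambda s: None)],
    tables,
)
print(ok, tables)  # False {'meter_reads': 'v1', 'load_summary': 'v1'}
```

Each new pipeline change means revisiting both the step and its undo, which is exactly the synchronization effort described above.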
On the development side, we were experiencing consistency challenges.
We didn’t want our developers developing and testing against the production environment, but we did want them to work against data of similar scale and variety. Replicating the entire lake to a non-prod environment and keeping it current took significant effort and cost.
Additionally, when the development team worked on multiple changes at once, it was sometimes impossible to satisfy every change's data requirements simultaneously, and testing one change would occasionally corrupt the environment, leaving developers to frequently encounter conflicting data states.
Our project team looked into alternatives for addressing these issues. Initially, we read about lakeFS, a Git-like data version control system that could potentially address many of our challenges.
At the time, lakeFS did not offer a hosted solution or a trial version. So we had to spin up a Proof of Concept (POC) environment on our own.
The installation was simple: it consisted of setting up a PostgreSQL database (v11 or higher), and we chose to run lakeFS during the POC phase on an Azure Container Instance, started with a single docker command.
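The exact command we used isn't reproduced here; the following is a representative invocation based on the lakeFS documentation of that era, with placeholders for the Postgres connection string, encryption secret, and Azure storage account:

```shell
# Placeholder values throughout -- substitute your own Postgres host,
# secret, and Azure storage account before running.
docker run --name lakefs -p 8000:8000 \
  -e LAKEFS_DATABASE_CONNECTION_STRING="postgres://user:pass@<postgres-host>:5432/lakefs" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="<random-secret>" \
  -e LAKEFS_BLOCKSTORE_TYPE="azure" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="<account-name>" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCESS_KEY="<account-key>" \
  treeverse/lakefs run
```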
Concerns that came up during the POC:
Compatibility with various ADLS versions
We learned that lakeFS supports Azure Blob Storage through its S3 gateway. This means our Azure Databricks workloads must proxy through our lakeFS instance using the lakeFS S3 API. We were also advised that the lakeFS roadmap includes native ADLS Gen2 support in Azure Databricks via the Hadoop file system client.
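The practical consequence for addressing data through the S3 gateway: the repository behaves like a bucket, and the branch name becomes the first component of every object key. A small sketch of this convention (repo, branch, and endpoint names are ours, not from this deployment):

```python
# Through the lakeFS S3 gateway, the repository acts as the bucket and the
# branch is the first path component of the key. Names are illustrative.

def lakefs_s3_uri(repo: str, branch: str, path: str) -> str:
    """Build an s3:// URI that addresses an object on a lakeFS branch."""
    return f"s3://{repo}/{branch}/{path.lstrip('/')}"

# Spark/Databricks is then pointed at the gateway rather than at storage
# directly, e.g. via Hadoop S3A settings (values illustrative):
#   fs.s3a.endpoint          = https://lakefs.example.internal
#   fs.s3a.path.style.access = true

print(lakefs_s3_uri("meter-data", "main", "raw/2021/07/reads.parquet"))
# s3://meter-data/main/raw/2021/07/reads.parquet
```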
While the installation approach described above was quick, we allocated very few resources to lakeFS during the POC phase, and it did not scale well.
We ran a load test against our preexisting ADLS Gen2 instance, pushing more than 300 GB of partitioned data, and observed a couple of issues:
- Some uploads timed out or took substantially longer than baseline. This was caused by the resource limits set on the container; we addressed it by deploying an AKS instance with several gateways.
- We saw some errors relating to multipart uploads. These did not reproduce after deploying lakeFS on a standard Azure Blob Storage instance.
Before committing to lakeFS in our tech stack, we wanted to make sure it was supported and stable enough to host our workloads for years to come. Treeverse had listed a number of customers already running lakeFS as part of their tech stacks, and at the end of the POC we met with a customer that had been using it for a couple of years. The lakeFS Slack community is also a useful place to engage with other users of the product.
With lakeFS, we found that we could avoid a significant amount of effort in building and maintaining homegrown rollback and replication solutions.
Here are the two main value points we get today out of lakeFS:
- Version control on prod. We can now roll back the entire lake at the click of a button (or with a simple API call from our automated pipelines), minimizing the time spent in a corrupted state.
- Complex jobs are grouped together across multiple tables/pipelines into a single “transaction” to allow an easy undo across all affected tables.
To achieve the latter, we run each ETL on a separate branch and merge the branch back only after the entire job succeeds.
This way, we never expose bad or only partially processed data to the data consumers.
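The pattern can be sketched in a few lines. This is an in-memory simulation of the branch-and-merge semantics (a real pipeline would create the branch, commit, and merge through the lakeFS API); all step and table names are illustrative:

```python
# In-memory sketch of "ETL as a transaction": run every step on a private
# branch, and only merge back into the main lake if all steps succeed.
# A failed run leaves production data completely untouched.

def run_etl_transaction(lake: dict, steps) -> bool:
    branch = dict(lake)          # branch off production state
    try:
        for step in steps:
            step(branch)         # each step sees its predecessors' output
    except Exception:
        return False             # discard the branch; lake is unchanged
    lake.clear()
    lake.update(branch)          # "merge": consumers see all changes at once
    return True

lake = {"meter_reads": "day-1", "load_summary": "day-1"}

def ingest(b): b["meter_reads"] = "day-2"
def summarize(b): raise RuntimeError("bad partition")   # second step fails

print(run_etl_transaction(lake, [ingest, summarize]))   # False
print(lake["meter_reads"])                              # still 'day-1'
```

A single undo then becomes "drop the branch," regardless of how many tables and steps the job touched.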
With the ability to branch data, we can now more easily achieve advanced use cases such as running parallel pipelines with different logic to experiment or conduct what-if analysis, compare large result sets for data science and machine learning, and more.
Post lakeFS Architecture
We evaluated a couple different options for hosting lakeFS. We ended up spinning up an Azure Kubernetes Service (AKS) cluster to host the application and database:
- PostgreSQL database hosted on a 2-core VM with 8 GB of memory (supporting roughly up to 10K requests/second).
- lakeFS application hosted on an auto-scaling cluster with 4 cores and 16 GB of memory per VM. Pod Topology Spread Constraints were applied to ensure pods are distributed across VMs to leverage the network throughput capacity of each VM.
- Spot nodes utilized to keep costs low during periods of high utilization relating to larger batch runs.
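For reference, a Pod Topology Spread Constraint of the kind mentioned above looks like the following in a pod spec (the label and skew values are illustrative, not our exact configuration):

```yaml
# Spread lakeFS pods evenly across nodes so each VM's network
# throughput is available to a gateway pod. Values illustrative.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # one spread domain per VM
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: lakefs
```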
Areas for improvement for lakeFS
We ran into two issues with lakeFS:
- The first, a bug when merging empty branches, was resolved within 24 hours.
- The second issue was addressed by the recent release of Garbage Collection support for Azure. This new feature allows us to permanently delete objects that have been removed from all branches once a configurable retention period has passed, thereby reducing the overall data footprint and storage costs.
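Garbage collection is driven by retention rules set per repository. A rules file of the documented shape looks like this (the retention values are illustrative, not our actual configuration), applied with `lakectl gc set-config`:

```json
{
  "default_retention_days": 21,
  "branches": [
    { "branch_id": "main", "retention_days": 28 }
  ]
}
```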