A brief study, in collaboration with the Paige Engineering team, on how lakeFS enables Paige to use Iceberg and dbt in production.
Paige (Pathology Artificial Intelligence Guidance Engine) is an AI-driven healthcare technology company that revolutionizes clinical diagnosis and treatment in oncology. It uses data from millions of images (digitized glass slides) and its proprietary computational pathology technology to diagnose different types of cancer, thus making treatments efficient.
At Paige, we train machine learning algorithms to help diagnose cancer by identifying patterns in highly sensitive image data stored in the AWS Cloud. Paige’s data science and machine learning team leverages lakeFS for versioning the data lake to enforce reproducibility of machine learning experiments, in compliance with FDA regulations.
lakeFS enabled us to streamline and run 200+ dbt models in production, increase data deployment velocity, efficiently reproduce ML experiments, increase the productivity of the data teams, and adhere to FDA compliance requirements.
Data at Paige
The Data Platform team enables the AI development and Analytics teams within Paige to manage the medical data sourced from different vendors and partners. We train computer-vision models on the medical data containing pathology images (digitized glass slides) to diagnose cancer. We add about 2,000-3,000 images a day to the training data set, which comes to about 2-3 TB in total. These images are immutable.
The medical data also includes raw textual diagnosis data in CSV files, such as diagnostic reports from laboratory information systems (LIS), doctors’ notes and diagnoses, patients’ medical histories, genomic data, and demographic information. We enhance the textual data with additional features that we automatically extract from the images. We then supplement every medical image with additional information such as the presence of a tumor, the size of the tumor, and whether the image is positive for a particular test. Based on the textual diagnosis data, we append these new features to the CSV. The enhanced CSV data set by itself amounts to about 10 GB per day.
Both datasets are stored in AWS S3 and updated daily.
Machine Learning at Paige
We run the feature creation and data enhancement steps hourly, as ETL jobs. These jobs produce the enhanced CSV files described in the previous section.
Later, we take the CSV files we’ve just created, encode them, and use them to predict the output labels. The Computer Vision models then use these output labels, along with the pathology image data, to diagnose cancer.
Furthermore, to support the process, we build around 200 dbt tables containing the enhanced training datasets along with the predicted labels. This is critical because the ML teams use these tables to train the 200+ ML models with the business goal of diagnosing cancer. The Analytics teams also use this enhanced dataset to check the validity of ML training experiments. Does the training data have enough samples to produce statistically significant results? Can we use BI dashboards for different model trainings? We carry out correlation and statistical analysis to understand the population and estimation, and to answer many more important business questions.
Because these tables hold image-processing outputs and text metadata, some of them exceed 200 GB in size and grow at a rate of ~10 GB per day.
ML Requirements and Challenges
Capture each version of the ML training features:
ML researchers experiment with different feature creation and data enhancement methods to test different label prediction logic. This requires the teams to capture each version of the feature set and its corresponding label prediction, so that they can track every experiment that was run and arrive at highly accurate cancer prediction ML models.
Comply with FDA requirements:
Since the FDA and the healthcare industry consider AI and software as medical devices, there is a lot of emphasis on responsible AI, explainable AI, removing bias in prediction, and so on. Thus, healthcare tech needs to adhere to strict regulatory compliance requirements.
As a healthcare company subject to FDA oversight and HIPAA requirements, we must plan for data security, governance, and reproducibility of experiments upfront, because the ramifications of not complying can be severe.
Make ML experiments reproducible:
Suppose we have an AI model trained to detect a specific type of breast cancer. How was it trained? What exact data and information went into training that model? Are we sure there is no bias, there are no mistakes in the pipeline, and all the data was properly collected, versioned and saved (i.e., with appropriate user consents and so on)? That is why ML experiments must be reproducible.
Our partner network of hospitals and laboratories shares medical images and diagnostic data with us on a daily cadence. So, we need a way of capturing the state of the training dataset at a specific point in time. This way we can use the same training dataset as the previous iteration while experimenting with different hyper-parameter options. Only then can we study the effect of hyper-parameter tuning on ML training.
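To make the idea of pinning a training run to a point-in-time snapshot concrete, here is a minimal sketch. It assumes the data is read through an S3-compatible endpoint in which the repository takes the place of the bucket and the first path segment is a ref (a branch name or a commit ID), which is how lakeFS addresses objects. The repository name, commit ID, file path, and hyper-parameter values are hypothetical placeholders, not values from our platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """One experiment: a fixed data snapshot plus the hyper-parameter under test."""
    repo: str            # hypothetical repository name
    ref: str             # branch name or commit ID; a commit ID never moves
    dataset_path: str    # path of the dataset inside the repository
    learning_rate: float

    def dataset_uri(self) -> str:
        # Repository in the bucket position, ref as the first path segment.
        return f"s3://{self.repo}/{self.ref}/{self.dataset_path}"


# Placeholder commit ID. Pointing at a commit rather than a branch means that
# tomorrow's ingest cannot change what these runs read.
SNAPSHOT = "c7a632d0a2b1"

run_a = TrainingRun("training-data", SNAPSHOT, "features/enhanced.csv", learning_rate=1e-3)
run_b = TrainingRun("training-data", SNAPSHOT, "features/enhanced.csv", learning_rate=1e-4)

# Both runs resolve to the same objects, so any difference in the results
# can be attributed to the hyper-parameter alone.
print(run_a.dataset_uri())
print(run_a.dataset_uri() == run_b.dataset_uri())
```

Recording the ref next to the hyper-parameters is the design choice that matters here: the pair is enough to rerun the experiment later on exactly the same input.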
Since data versioning and reproducibility are a must for our ML platform, we assessed a number of mature data versioning tools available in the ecosystem. We built POCs to evaluate the different features offered by these tools and check how well they integrate with our system.
Initially, we used Git for data versioning. Although Git LFS (Large File Storage) worked for a few hundred MB of data, it didn’t scale once the size of the CSV files went beyond the source code scale (~200 MB).
Our team then evaluated DVC for data versioning at scale. However, DVC required invasive changes to the existing ML training pipelines, and the learning curve for ML engineers to use DVC was very steep as well.
- Originally, we stored the pathology images data in AWS S3.
- DVC requires downloading the data locally for versioning, so we also needed to implement a cache for the ML training pipelines.
- We implemented an LRU (Least Recently Used) cache to download the images to local fast storage on MinIO during ML training, and we used DVC primarily to manage the file system in the local storage.
- The data team was building an entire data platform to cater to multiple AI and analytics teams within Paige. However, DVC worked well only for the AI use-cases.
- DVC comes bundled with its own orchestrator for ML pipelines. But our team uses Prefect to orchestrate both the ETL pipelines and ML training workloads. Maintaining DVC orchestration in addition to Prefect created redundancy and overhead.
- We were only looking for a versioning engine that works for broader use cases, and DVC wasn’t a good fit for that requirement.
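For readers unfamiliar with the LRU cache mentioned in the list above, the following is a generic sketch of the technique, not the cache we ran in production. It assumes a `fetch` callable that stands in for downloading an image from remote storage; the class name, the callable, and the slide keys are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class LRUCache:
    """Keeps at most `capacity` items and evicts the least recently used one."""

    def __init__(self, capacity: int, fetch: Callable[[str], bytes]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.fetch = fetch            # called on a cache miss, e.g. a download
        self._items = OrderedDict()   # key -> bytes, oldest entry first

    def get(self, key: str) -> bytes:
        if key in self._items:
            self._items.move_to_end(key)       # mark as most recently used
            return self._items[key]
        value = self.fetch(key)                # miss: load from the slow store
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)    # drop the least recently used entry
        return value

    def __contains__(self, key: str) -> bool:
        return key in self._items


# Usage with a fake download function that records what was fetched.
downloads = []

def fake_download(key: str) -> bytes:
    downloads.append(key)
    return key.encode()

cache = LRUCache(capacity=2, fetch=fake_download)
for key in ["slide-1", "slide-2", "slide-1", "slide-3"]:
    cache.get(key)

# "slide-2" was the least recently used entry when "slide-3" arrived.
print("slide-1" in cache, "slide-2" in cache, "slide-3" in cache)
```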
lakeFS offered exactly what we were looking for: a versioning engine that serves broader use cases and isn’t limited to AI/ML workloads.
- Our data resides in a hybrid environment. Pathology images reside in the AWS S3 environment managed by the FSx file system. We store the textual ML training data and enhanced features in AWS S3 as CSV files. ML model training happens in a local fast storage on-premises. But we run the analytics workloads in AWS and dbt.
- We needed a tool that works equally well for AWS S3 and for our on-prem data environment.
- lakeFS fit the bill exactly, as it works with any object store that has an S3-like API. All we need to do is point lakeFS to a different bucket URL path to switch between AWS and on-prem, so making lakeFS work with our existing workflows required only minimal changes.
- DVC, on the other hand, required invasive changes to the existing ML training pipelines.
- lakeFS offered flexibility and ease of use, which is an imperative requirement for our data platform.
Following the POC, we concluded that lakeFS is the best solution for us among the existing options. In the next section, you will learn more about how we integrated lakeFS into our data architecture.
Data Architecture with lakeFS
lakeFS versions the raw ML training data together with the enhanced features data. We use Prefect (a data flow automation tool) for orchestrating our ETL pipelines and ML training jobs. At the compute layer, we have Spark jobs running on AWS EMR. On the warehousing front, we store tables in Apache Iceberg format and build them with dbt for further analytics workloads. As for BI, Looker is the primary tool we use to visualize the ML experiments and statistical analysis.
So, how did we go about integrating lakeFS? Our central data platform has used lakeFS as a core versioning engine in production for a year now. We went through an interesting journey and decided to implement a daily CI/CD cadence for data product release.
What does a daily CI/CD cadence for data product release mean? Our pipeline automatically creates a lakeFS data branch, saves the incremental daily data load on it, commits it, and runs a dbt build. It also runs the relevant tests against the daily data. If the tests succeed, our automation merges the ingest data branch into main.
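The sequence of steps in this daily release can be sketched as follows. This is an illustration of the control flow only, under stated assumptions: `VersionedLake` is a hypothetical interface rather than the lakeFS SDK, the `ingest-<day>` branch naming is invented for the example, and `RecordingLake` is an in-memory stand-in that only records the order of calls. A real pipeline would back these methods with the lakeFS API or `lakectl`.

```python
from typing import Callable, Optional, Protocol


class VersionedLake(Protocol):
    """Hypothetical interface over a data versioning engine."""
    def create_branch(self, name: str, source: str) -> None: ...
    def commit(self, branch: str, message: str) -> str: ...
    def merge(self, source: str, destination: str) -> str: ...


def daily_release(
    lake: VersionedLake,
    day: str,
    ingest: Callable[[str], None],      # writes the daily load to the branch
    dbt_build: Callable[[str], None],   # builds the dbt models against the branch
    run_tests: Callable[[str], bool],   # data quality checks on the branch
) -> Optional[str]:
    """Return the release commit ID on main, or None if the tests failed."""
    branch = f"ingest-{day}"            # hypothetical naming scheme
    lake.create_branch(branch, source="main")
    ingest(branch)
    lake.commit(branch, f"daily load {day}")
    dbt_build(branch)
    lake.commit(branch, f"dbt build {day}")
    if not run_tests(branch):
        return None                     # main is untouched by the failed load
    return lake.merge(branch, "main")


class RecordingLake:
    """In-memory stand-in that only records the order of calls."""
    def __init__(self) -> None:
        self.calls = []

    def create_branch(self, name: str, source: str) -> None:
        self.calls.append(("create_branch", name, source))

    def commit(self, branch: str, message: str) -> str:
        self.calls.append(("commit", branch, message))
        return f"commit-{len(self.calls)}"

    def merge(self, source: str, destination: str) -> str:
        self.calls.append(("merge", source, destination))
        return f"merge-{len(self.calls)}"


lake = RecordingLake()
release = daily_release(
    lake, "2023-01-15",
    ingest=lambda b: None, dbt_build=lambda b: None, run_tests=lambda b: True,
)
print(release, [call[0] for call in lake.calls])
```

The important property is that a failed test returns before the merge, so consumers reading from main never see a partial or invalid daily load.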
The daily commit id is the product release that ML engineers refer to. By leveraging the lakeFS commit log, we audit the changes to the training data and track the lineage. This enables us to identify the training data set that corresponds to a specific label logic. This is critical for us because it lets us reproduce the experiments and adhere to the FDA requirements, as discussed earlier.
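One way to picture the lookup from label logic to training data is to attach the label logic version to each commit as metadata and then search the commit log for it. The sketch below assumes commit records shaped like plain dictionaries with `id`, `message`, and `metadata` keys, listed newest first. The `label_logic` metadata key, the commit IDs, and the version strings are hypothetical.

```python
from typing import Iterable, Optional


def find_release(commits: Iterable[dict], label_logic: str) -> Optional[str]:
    """Return the ID of the newest commit tagged with the given label logic."""
    for commit in commits:  # assumed to be ordered newest first
        if commit.get("metadata", {}).get("label_logic") == label_logic:
            return commit["id"]
    return None


# A hypothetical commit log, newest first.
commit_log = [
    {"id": "f3e9", "message": "daily load 2023-01-17", "metadata": {"label_logic": "v3"}},
    {"id": "b71c", "message": "daily load 2023-01-16", "metadata": {"label_logic": "v2"}},
    {"id": "a4d0", "message": "daily load 2023-01-15", "metadata": {"label_logic": "v2"}},
    {"id": "09ab", "message": "initial import", "metadata": {}},
]

print(find_release(commit_log, "v2"))  # newest commit that used label logic v2
print(find_release(commit_log, "v9"))  # no such release
```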
We also leverage branch protection rules. They enable us to improve the reliability and quality of data in the main branch as we experiment and test data on dedicated feature data branches. Of course, lakeFS also makes sure that all the data versions are available to us at any time.
Let’s dive deeper into the integration and implementation of lakeFS in our system:
Effective data ingestion with a dirty ingest branch and a protected main branch
- On the ML front, lakeFS is used to version the raw textual diagnosis data as well as the feature enhanced training data set in AWS S3. We use the lakeFS S3 gateway to configure the AWS S3-lakeFS connectivity.
- Running dbt builds directly on the incremental daily load in the production environment caused data availability issues and put data integrity at risk.
- To overcome the data integrity issue, our production data lives on the main branch and has branch protection rules enabled. So, it’s safeguarded from any modifications from upstream or downstream applications. This ensures high availability of production data at all times and increased data reliability for downstream ML/analytics workloads.
- On the lakeFS data repository, we create a dirty ingest branch to run the ETL pipelines that transform the input data and run the dbt builds as well. Once the dbt builds are completed, the enhanced input data undergoes several data quality checks. If the tests are successful, we then merge it to the production branch.
- We used lakeFS hooks to run these quality checks and to enable CI/CD workflows for the AWS S3 data lake.
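As an example of the kind of data quality check that could gate the merge described above, here is a minimal sketch. It is not one of our production checks: the function name, the column names, and the rules (required columns present, a minimum row count, no empty required values) are assumptions chosen for illustration. When wired into a pre-merge hook, a non-empty result would be the signal to fail the hook and block the merge.

```python
import csv
import io
from typing import List, Sequence


def check_daily_csv(text: str, required_columns: Sequence[str], min_rows: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the load passes."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []

    missing = [c for c in required_columns if c not in header]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")

    rows = list(reader)
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {len(rows)}")

    # Only check values for required columns that actually exist in the header.
    present = [c for c in required_columns if c in header]
    for number, row in enumerate(rows, start=1):
        empty = [c for c in present if not (row[c] or "").strip()]
        if empty:
            problems.append(f"data row {number}: empty value in {', '.join(empty)}")
    return problems


# Hypothetical column names, loosely based on the features described earlier.
REQUIRED = ["slide_id", "tumor_present"]

good_load = "slide_id,tumor_present,tumor_size_mm\nS1,1,12.5\nS2,0,\n"
bad_load = "slide_id,tumor_size_mm\nS1,12.5\n,3.0\n"

print(check_daily_csv(good_load, REQUIRED))
print(check_daily_csv(bad_load, REQUIRED))
```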
Multi-table transactions with lakeFS, Iceberg, and dbt
- On the warehousing side, lakeFS enables us to treat datasets as data products that get deployed every day into dbt tables.
- Initially, the data platform used Hive for metadata management. However, Hive doesn’t offer schema versioning and provides only one view of the catalog at any point in time.
- We implemented schema versioning in Hive by manually creating a new schema every time a table’s version was updated. This was an error-prone process that left many stale schemas and tables in the Hive metastore.
- By using lakeFS and Iceberg together, we store the data along with metadata in the data lake itself, versioned by lakeFS.
- By leveraging lakeFS’ capability to update the entire data lake atomically, we update the metadata along with the data and then promote both to production atomically.
- This ensures the data integrity in the warehouse, and the high availability of dbt tables to downstream analytics users.
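The atomic promotion described in the list above can be illustrated with a toy model. This is a conceptual sketch only and not how lakeFS is implemented: it assumes a commit is an immutable snapshot of every path in the lake and a branch is just a name pointing at one commit, so promoting a branch is a single pointer update. The class, the table names, and the metadata values are hypothetical.

```python
from types import MappingProxyType


class TinyLake:
    """Toy model: commits are immutable snapshots, branches are pointers."""

    def __init__(self, initial: dict):
        self._commits = [MappingProxyType(dict(initial))]
        self._branches = {"main": 0}

    def read(self, branch: str):
        # Readers always see one complete snapshot.
        return self._commits[self._branches[branch]]

    def branch(self, name: str, source: str = "main") -> None:
        self._branches[name] = self._branches[source]

    def commit(self, branch: str, changes: dict) -> None:
        snapshot = dict(self.read(branch))
        snapshot.update(changes)
        self._commits.append(MappingProxyType(snapshot))
        self._branches[branch] = len(self._commits) - 1

    def promote(self, source: str, destination: str = "main") -> None:
        # A single pointer update: every table changes together or not at all.
        # (This toy version only handles the fast-forward case.)
        self._branches[destination] = self._branches[source]


lake = TinyLake({"slides/metadata.json": "v1", "labels/metadata.json": "v1"})
lake.branch("ingest")
lake.commit("ingest", {"slides/metadata.json": "v2"})   # first table updated
lake.commit("ingest", {"labels/metadata.json": "v2"})   # second table updated

before = dict(lake.read("main"))   # main still shows both tables at v1
lake.promote("ingest")
after = dict(lake.read("main"))    # main now shows both tables at v2
print(before, after)
```

At no point does a reader of main see one table at v2 and the other at v1, which is the multi-table guarantee that matters for downstream analytics.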
By using lakeFS, we increased the robustness of our data platform that serves our data scientists, ML engineers, data engineers, and analysts. Implementing lakeFS immensely accelerated the data team’s productivity, helped achieve higher data deployment velocity (daily), increased data availability, and improved compliance with FDA regulations.