Case Study

How Paige AI Uses lakeFS and dbt to Revolutionize AI-Powered Cancer Diagnosis

Sander Hartlage and Sid Senthilinathan

Last updated on April 25, 2025
Company

Paige is an AI-driven healthcare technology company that revolutionizes clinical diagnosis and treatment in oncology. Paige trains machine learning algorithms to help diagnose cancer by identifying patterns in highly sensitive image data stored in the AWS Cloud.

Problem

Paige AI uses data from millions of images (digitized glass slides) and its proprietary computational pathology technology to diagnose different types of cancer. The company needed a data version control system that could handle massive volumes of data for ML training and CI/CD data pipelines while remaining compliant with FDA requirements.

Results

By using lakeFS, Paige increased the robustness of its data platform, which serves data scientists, ML engineers, data engineers, and analysts. Implementing lakeFS greatly accelerated the data team’s productivity and increased deployment velocity and availability, while improving compliance with FDA regulations.

The company

Paige (Pathology Artificial Intelligence Guidance Engine) is an AI-driven healthcare technology company that revolutionizes clinical diagnosis and treatment in oncology. Paige trains machine learning algorithms to help diagnose cancer by identifying patterns in highly sensitive image data stored in the AWS Cloud. 

The company uses data from millions of images (digitized glass slides) and its proprietary computational pathology technology to diagnose different types of cancer, making treatment more efficient.


Data management practices at Paige

The Data Platform team enables the AI development and Analytics teams within Paige to manage the medical data sourced from different vendors and partners. They train computer-vision models on the medical data containing pathology images (digitized glass slides) to diagnose cancer. The team adds about 2,000-3,000 images (roughly 2-3 TB) a day to the training data set, and these images are immutable once ingested.

The medical data also includes raw textual diagnosis data, such as diagnostic reports from Laboratory Information Systems (LIS), doctors’ notes and diagnoses, patients’ medical histories, genomic data, and demographic information, stored as CSV files. The company enhances this textual data with additional features that are automatically extracted from the images.

Every medical image is supplemented with additional information, such as the presence of a tumor, the size of the tumor, and whether the image is positive for a particular test. Based on the textual diagnosis data, these new features are appended to the CSV. The enhanced CSV data set by itself amounts to about 10 GB per day.

Both datasets are stored in AWS S3 and updated daily.
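
To make the enhancement step concrete, here is a minimal pandas sketch; the file names, join key, and feature columns are hypothetical, since Paige’s actual schema isn’t public.

```python
import pandas as pd

# Illustrative only: file names and columns are hypothetical.
diagnoses = pd.read_csv("diagnoses.csv")            # raw textual diagnosis data
image_features = pd.read_csv("image_features.csv")  # features extracted from slides

# Join on a shared slide identifier and append the image-derived columns.
enhanced = diagnoses.merge(
    image_features[["slide_id", "tumor_present", "tumor_size_mm"]],
    on="slide_id",
    how="left",
)
enhanced.to_csv("diagnoses_enhanced.csv", index=False)
```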


Machine Learning at Paige

Paige runs the feature creation and data enhancement steps hourly as ETL jobs. These jobs produce the CSV files discussed in the previous section. The team then encodes the newly created CSV files and uses them to predict the output labels, which the computer vision models consume along with the pathology image data to diagnose cancer.
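
Paige’s exact encoders and label models aren’t public; the following sketch, continuing the file names from the previous example, only illustrates the shape of this encode-and-predict step with made-up columns and a placeholder model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative only: columns, encoding, and model choice are hypothetical.
df = pd.read_csv("diagnoses_enhanced.csv").dropna()

# Encode categorical diagnosis fields into numeric features.
X = pd.get_dummies(df[["report_code", "tissue_type"]])
y = df["tumor_present"]

# Fit a simple classifier and append the predicted output labels.
model = LogisticRegression(max_iter=1000).fit(X, y)
df["predicted_label"] = model.predict(X)
df.to_csv("diagnoses_labeled.csv", index=False)
```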

To support this process, Paige builds around 200 dbt tables containing the enhanced training datasets along with the predicted labels. This is critical, since the ML teams use these tables to train more than 200 ML models with the business goal of diagnosing cancer. The Analytics teams also use this enhanced dataset to check the validity of ML training experiments.

Because they combine image-derived features with text metadata, some of these tables exceed 200 GB in size and grow at a rate of roughly 10 GB per day.


The challenges

Challenge 1: CI/CD for data pipelines

Data integrity at Paige was challenged by data availability problems that resulted from running dbt builds directly on the incremental daily load in the production environment. The team needed a solution that would protect data on the main branch against changes made by upstream or downstream applications.

This would guarantee consistently available, high-quality production data and improve data reliability for downstream workloads such as machine learning and analytics.


Challenge 2: Compliance with FDA requirements

Since the FDA and the healthcare industry treat AI-driven software as a medical device, there is a strong emphasis on responsible AI, explainable AI, and removing bias from predictions. Healthcare technology must adhere to strict regulatory compliance requirements.

As a healthcare company subject to FDA oversight and HIPAA requirements, Paige must plan for data security, governance, and reproducibility of experiments upfront, and the ramifications of non-compliance can be severe.


Challenge 3: Capturing each version of the ML training features and making ML experiments reproducible

ML researchers experiment with different feature creation and data enhancement methods to test different label prediction logic. This requires the teams to capture each version of the feature set and its corresponding label predictions so they can track all running experiments and arrive at highly accurate cancer prediction models.

Suppose we have an AI model trained to detect a specific type of breast cancer. How was it trained? What exact data and information went into training that model? Are we sure there is no bias, no mistakes in the pipeline, and that all the data was properly collected, versioned, and saved (with appropriate patient consent and so on)?

This is why ML experiments must be reproducible.

Paige’s partner network of hospitals and laboratories shares medical images and diagnostic data with the company daily. Paige needed a way to capture the state of the training dataset at a specific point in time, so it could reuse the exact training dataset from a previous iteration while experimenting with different hyperparameter options. Only then could the team isolate the effect of hyperparameter tuning on ML training.


Systems evaluated: Git-LFS, DVC, lakeFS

Since data versioning and reproducibility are a must for its ML platform, Paige assessed several mature data versioning systems available in the ecosystem, building POCs to compare the solutions’ features and to gauge how well each integrated with its stack.


Git-LFS

Initially, the team used Git with LFS (Large File Storage) for data versioning. Although Git LFS worked for a few hundred MBs of data, it stopped scaling once the CSV files grew beyond source-code scale (~200 MB).


DVC

Paige next evaluated DVC for data versioning at scale. However, DVC required invasive changes to the existing ML training pipelines, and the learning curve for ML engineers was steep.

  • Originally, Paige stored the pathology image data in AWS S3.
  • DVC requires downloading the data locally for versioning, so the team also needed to implement a cache for the ML training pipelines.
  • The team built an LRU (Least Recently Used) cache that downloads images to fast local storage on MinIO during ML training, using DVC primarily to manage the file system in that local storage (a simplified sketch of this pattern follows the list).
  • The data team was building an entire data platform to serve multiple AI and analytics teams within Paige, but DVC worked well only for the AI use cases.
  • DVC bundles its own orchestrator for ML pipelines, while Paige uses Prefect to orchestrate both the ETL pipelines and the ML training workloads; maintaining DVC orchestration alongside Prefect created redundancy and overhead.
  • Paige was looking for a versioning engine that works across broader use cases, and DVC wasn’t a good fit for that requirement.
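
As a rough illustration of that caching pattern, here is a simplified Python sketch. Note that functools.lru_cache only bounds repeated downloads within a process; Paige’s real cache also evicted files from local disk. The bucket and path names are made up.

```python
import os
from functools import lru_cache

import boto3

s3 = boto3.client("s3")
LOCAL_ROOT = "/mnt/fast-storage"  # hypothetical MinIO-backed local storage

@lru_cache(maxsize=4096)
def fetch_image(key: str) -> str:
    """Download a slide image once; repeated calls reuse the local copy."""
    local_path = os.path.join(LOCAL_ROOT, key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file("pathology-images", key, local_path)  # hypothetical bucket
    return local_path
```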


lakeFS

lakeFS was the only system evaluated that scaled to Paige’s data volumes while requiring minimal changes to existing pipelines, and it worked equally well beyond AI/ML workloads. The next section describes the adopted solution in detail.


Adopted solution

lakeFS offered exactly what Paige was looking for: a versioning engine for use cases broader than AI/ML workloads.

Paige’s data resides in a hybrid environment. Pathology images live in AWS S3, accessed through the FSx file system, while the textual ML training data and enhanced features are stored in AWS S3 as CSV files. ML model training happens on fast local storage on-premises, while analytics workloads run in AWS using dbt.

The company needed a tool that worked equally well across this hybrid environment: AWS S3 and on-prem storage. Paige had evaluated another data version control solution, DVC, but found that it required invasive changes to the existing ML training pipelines and presented several other limitations.

lakeFS fit the bill because it works with any object store that exposes an S3-compatible API. To switch between AWS and on-prem, all Paige needed to do was point lakeFS at a different bucket URL, which meant minimal changes to existing workflows.
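
To make this concrete: lakeFS exposes an S3-compatible gateway, so a standard S3 client such as boto3 works unchanged. A minimal sketch, with illustrative endpoint, credentials, and repository names:

```python
import boto3

# The same client code works against AWS or on-prem; only the endpoint
# and repository/branch prefix change. All names here are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # lakeFS S3 gateway
    aws_access_key_id="<lakeFS access key>",
    aws_secret_access_key="<lakeFS secret key>",
)

# With the gateway, the bucket is the lakeFS repository and the key is
# prefixed by the branch (or commit): <repository>/<branch>/<path>.
obj = s3.get_object(
    Bucket="pathology-data",
    Key="main/datasets/diagnoses/latest.csv",
)
print(obj["ContentLength"])
```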

Challenge solved: CI/CD for data pipelines

Paige uses lakeFS to version the raw textual diagnosis data and the feature-enhanced training data set in AWS S3.

Running dbt builds directly on the incremental daily load in the production environment had posed a data integrity challenge. To overcome this, Paige’s production data lives on the main branch with branch protection rules enabled, safeguarding it from modifications by upstream or downstream applications. This ensures high availability of production data at all times and increases data reliability for downstream ML and analytics workloads.

On the lakeFS data repository, Paige creates a dirty ingest branch on which it runs the ETL pipelines that transform the input data, as well as the dbt builds. Once the dbt builds complete, the enhanced input data undergoes several data quality checks; if the tests succeed, the branch is merged into production. Paige also uses lakeFS hooks to run these quality checks and to enable CI/CD workflows over the AWS S3 data lake.
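
A minimal sketch of this ingest workflow, assuming the high-level lakeFS Python SDK and hypothetical repository and branch names; the ETL steps, dbt build, and quality checks are stubbed out:

```python
import lakefs

def run_quality_checks(branch) -> bool:
    """Stub for the dbt-test style validations Paige runs (hypothetical)."""
    return True

repo = lakefs.repository("pathology-data")

# 1. Branch off main for the daily incremental load (the "dirty" ingest branch).
ingest = repo.branch("ingest-2025-04-25").create(source_reference="main")

# 2. ETL pipelines and dbt builds write to the ingest branch (omitted), then commit.
ingest.commit(message="Daily incremental load + dbt build")

# 3. Merge into production (main) only if the quality checks pass.
if run_quality_checks(ingest):
    ingest.merge_into(repo.branch("main"))
```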


Challenge solved: Compliance with FDA requirements

The data pipeline automatically creates a lakeFS data branch, saves the incremental daily data load to it, commits it, and runs a dbt build along with the relevant tests against the daily data. If the tests succeed, the automation merges the ingest branch into main.

Each daily commit ID is the product release that ML engineers refer to. By leveraging the lakeFS commit log, Paige can audit changes to the training data and track its lineage, which lets the team identify the training data set corresponding to a specific label logic. This is critical because it allows the team to reproduce experiments and adhere to FDA requirements.
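
A sketch of that audit trail, again assuming the high-level lakeFS Python SDK and a hypothetical repository name:

```python
import lakefs

repo = lakefs.repository("pathology-data")  # hypothetical repository name

# Walk the commit log on main; each daily merge commit acts as a release
# that ML engineers can pin their training runs to.
for commit in repo.ref("main").log():
    print(commit.id, commit.message, commit.metadata)
```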


Challenge solved: Making ML experiments reproducible

Paige also leverages lakeFS branch protection rules. This improves the reliability and quality of data on the main branch while the team experiments and tests on dedicated feature data branches. lakeFS also ensures that all data versions are available to Paige at any time.
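
For example, a training run can record the commit ID it was built from and later reload exactly the same data. A minimal sketch under the same SDK assumption, with placeholder commit ID and path:

```python
import lakefs
import pandas as pd

repo = lakefs.repository("pathology-data")

# Reference the exact commit recorded with the original experiment.
pinned = repo.ref("<commit-id-from-experiment-log>")

# Read the training features exactly as they existed at that commit.
with pinned.object("datasets/training/features.csv").reader() as f:
    df = pd.read_csv(f)
```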


Data architecture with lakeFS

lakeFS versions the raw ML training data together with the enhanced feature data. Paige uses Prefect (a data flow automation tool) to orchestrate its ETL pipelines and ML training jobs. At the compute layer, Spark jobs run on AWS EMR.
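
A sketch of how such a Spark job might read lakeFS-versioned data through Hadoop’s S3A connector; the endpoint, credentials, and paths are illustrative:

```python
from pyspark.sql import SparkSession

# All endpoints, keys, and paths below are illustrative.
spark = (
    SparkSession.builder
    .appName("paige-etl")
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<lakeFS access key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakeFS secret key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Paths follow s3a://<repository>/<branch-or-commit>/<path>.
df = spark.read.csv("s3a://pathology-data/main/datasets/training/", header=True)
df.show(5)
```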

On the warehousing front, the team stores tables in Apache Iceberg format, built with dbt, for further analytics workloads. For BI, Looker is the primary tool used to visualize ML experiments and statistical analysis.


Results

By using lakeFS, Paige increased the robustness of its data platform, which serves data scientists, ML engineers, data engineers, and analysts. Implementing lakeFS greatly accelerated the data team’s productivity, raised data deployment velocity to a daily cadence, increased data availability, and improved compliance with FDA regulations.
