Vino SD
September 1, 2022

PAIGE.AI (Pathology Artificial Intelligence Guidance Engine) is an AI-driven healthcare technology company that revolutionized clinical diagnosis and treatment in oncology. It uses proprietary computational pathology technology with data from millions of images (digitized glass slides) to recognize tissue patterns and diagnose cancer. 

In recent years, the healthcare industry has been digitizing its data and applying advanced AI technology to diagnostic studies. Digitization also brings significant challenges in data lifecycle management. According to PharmTech, it increases data governance and data integrity issues, which makes it harder to comply with stringent FDA regulatory requirements for the reproducibility of healthcare studies.

Architecture

  • AWS
  • Apache Iceberg
  • lakeFS
  • dbt

AI at PAIGE

Paige trains advanced ML algorithms to diagnose cancer, using highly sensitive data from its data sources in the AWS cloud.

The training data consists of more than 4 million clinical images (digitized glass slides) stored in an Amazon S3 data lake. Petabytes of these digitized images are transformed by Spark jobs running on Amazon EMR (Elastic MapReduce) clusters. dbt builds then run on top of this data before it is used by predictive AI algorithms.

Challenge

Initially, Paige stored its data in ORC format. To prevent consumers from seeing partial data, the team dropped the tables while dbt builds were running and restored them only once the jobs were complete. As a result, there were periods when the tables were not available.

To overcome this data availability challenge, Paige adopted the Iceberg table format. However, Paige found that its Iceberg setup lacked atomicity for branch operations: creating a new branch on an Iceberg table required manually updating the metastore with the new table's metadata, and merging branches required a manual metastore update as well. This manual process was error-prone and resulted in data integrity issues.

As a healthcare company subject to FDA regulation and HIPAA requirements, Paige must plan for data security, governance and reproducibility of experiments upfront, since the ramifications of non-compliance can be severe.

Solution

Data ingestion in lakeFS

Before adopting lakeFS, Paige used the ORC format, which caused data availability issues. Moving to the Iceberg format solved that problem but introduced manual metastore updates, a cumbersome, error-prone and non-atomic process.

Paige then chose lakeFS to complement the Iceberg format and tackle these challenges. Paige implemented CI/CD on its data lake so that it could work with Iceberg and dbt effectively. Using lakeFS data versioning, Paige created multiple branches from the data repository: a dirty ingest branch and a production branch. Branch protection rules configured in lakeFS preserve the production data, which keeps it highly available. Meanwhile, dbt builds run on the dirty ingest branch. The data in the ingest branch is then tested, and it is merged into the production branch only if the tests succeed. The data quality tests are implemented with lakeFS hooks, which enable CI/CD workflows for data.
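The branch, test and merge flow described above can be illustrated with a small in-memory sketch. This is a toy model written for this article, not the lakeFS API and not Paige's pipeline: `ToyRepo`, `no_empty_tables`, and the table names are all hypothetical, and the quality check stands in for whatever a real pre-merge hook would run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Snapshot = Dict[str, List]  # table name -> rows


@dataclass
class ToyRepo:
    """In-memory stand-in for a versioned data repository (NOT the lakeFS API)."""
    branches: Dict[str, Snapshot] = field(default_factory=dict)

    def create_branch(self, name: str, source: str) -> None:
        # A branch starts as a copy of its source; the source is untouched.
        self.branches[name] = {t: list(rows) for t, rows in self.branches[source].items()}

    def merge(self, source: str, dest: str, check: Callable[[Snapshot], bool]) -> bool:
        # Gate the merge on a quality check, as a pre-merge hook would.
        candidate = self.branches[source]
        if not check(candidate):
            return False  # dest is left exactly as it was
        # Swap in all tables with a single assignment, so a reader of dest
        # sees either the old state or the new one, never a mix.
        self.branches[dest] = {t: list(rows) for t, rows in candidate.items()}
        return True


def no_empty_tables(snapshot: Snapshot) -> bool:
    """Hypothetical data quality test: every table has at least one row."""
    return all(len(rows) > 0 for rows in snapshot.values())


repo = ToyRepo(branches={"production": {"slides": [1, 2], "diagnoses": ["a"]}})
repo.create_branch("ingest", source="production")

# Simulate a half-finished build on the ingest branch: the merge is rejected.
repo.branches["ingest"]["slides"] = []
failed = repo.merge("ingest", "production", no_empty_tables)
production_after_failure = repo.branches["production"]["slides"]

# Finish the build on ingest, then merge again: both tables move together.
repo.branches["ingest"]["slides"] = [1, 2, 3]
repo.branches["ingest"]["diagnoses"] = ["a", "b"]
merged = repo.merge("ingest", "production", no_empty_tables)
```

The point of the sketch is the ordering: consumers keep reading the production branch while the build runs on ingest, and production changes only when the check passes.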

Paige uses the lakeFS S3 gateway to access the data in its AWS S3 buckets. Because lakeFS commits are atomic across multiple tables, they keep the data in a consistent state and overcome the data integrity issues.
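As a rough illustration of what access through the S3 gateway looks like: the gateway exposes a repository as the bucket name and the branch (or other ref) as the first component of the object key, so an S3-compatible client such as Spark's S3A connector only needs to point at the lakeFS endpoint. The sketch below builds such paths and the matching S3A settings. The repository name, branch names, endpoint and credentials are placeholders, and the helper functions are hypothetical, not part of any lakeFS library.

```python
from typing import Dict


def lakefs_gateway_uri(repository: str, ref: str, path: str, scheme: str = "s3a") -> str:
    """Build a gateway-style URI: <scheme>://<repository>/<ref>/<path>."""
    return f"{scheme}://{repository}/{ref}/{path.lstrip('/')}"


def s3a_gateway_conf(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Spark settings that direct the Hadoop S3A connector at a custom endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style access keeps the repository name in the path, not the hostname.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Placeholder values for illustration only.
ingest_uri = lakefs_gateway_uri("example-repo", "ingest", "/tables/slides/")
production_uri = lakefs_gateway_uri("example-repo", "production", "tables/slides/")
conf = s3a_gateway_conf("https://lakefs.example.com", "<access-key>", "<secret-key>")
```

With this addressing, the same job can be pointed at the ingest branch or the production branch by changing only the ref segment of the path.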

Result

lakeFS eases the burden on Paige's ML and data teams by improving productivity, increasing data availability and improving compliance with FDA regulations. With the help of lakeFS, Paige builds over 200 dbt models in production and keeps a record of data changes in production for auditing purposes, in line with FDA regulations.
