How Karius Used lakeFS to Comply with FDA Regulations for Disease Diagnostic Studies
Karius is a Forbes AI-50 listed life sciences company based in California that uses genomics and AI to advance infectious disease diagnostics, applying advanced ML algorithms to data-intensive, complex genomics analysis in real time.
In the healthcare industry, the reproducibility of studies and experiments is a pronounced challenge.
With lakeFS enabling data versioning and atomic commits on the data lake, the Karius team was able to overcome the reproducibility crisis in an error-free and robust manner.
The company
Karius is a Forbes AI-50 listed life sciences company based in California that uses genomics and AI to advance infectious disease diagnostics. It uses advanced machine learning algorithms for data-intensive, complex genomics analysis in real time.
Karius uses disease data from NCBI (National Center for Biotechnology Information) APIs and their proprietary reference genome databases for machine learning studies. Disease and genome data from different input sources and data formats are ingested into AWS S3 for consumption by downstream analytics workloads.
The challenge
Reproducibility of experiments
For decades, one of the most significant challenges in the healthcare industry has been the reproducibility of studies/experiments. With the advent of machine learning (ML) and the use of black-box ML models in healthcare studies, the reproducibility crisis is even more pronounced.
Challenges to reproducibility often include:
- Hypothesis testing
- Inherent randomness in the data analysis
- Incomplete documentation
- Restricted access to the underlying data and code
Add to this the duplication of data across multiple ML experiments, and the reproducibility and manageability challenges multiply.
Karius’s approach to reproducibility
Karius has been using various MLOps tools to effectively reproduce, track, and monitor machine learning studies.
However, a tool that bundles model compute, experiment tracking, and data versioning requires customers to adopt all of its features or none of them. This hampers the flexibility of data teams in picking the best tool for each task.
It became clear to the data team that an alternative solution that decouples compute from the rest of the ML training workflow was needed.
Adopted solution
Challenge solved: Reproducibility of experiments
Storing the versions of data, code, model artifacts, and metrics for every experiment atomically is essential to ensuring reproducibility. Complex, proprietary input data formats and a petabyte-scale data lake amplify these challenges.
Without atomically storing the different components of these experimental studies, retrieving the data and code from a specific point in time becomes an error-prone manual process of matching timestamps by hand.
To mitigate these issues, Karius uses lakeFS on top of an S3 object store to enable data versioning on the ML training data. The training data is periodically ingested into a dirty ingest branch which is then cleaned, transformed, quality-checked, and merged to a protected main branch.
The data teams then branch out of the main branch to run various machine learning experiments and the model artifacts are then captured. With lakeFS, atomic commits can then be performed on data, code, model artifacts, and metrics all in one place.
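The workflow above maps directly onto lakeFS branch operations. As a minimal sketch using the `lakectl` CLI (the repository and branch names here are hypothetical, chosen only to illustrate the ingest-clean-merge-experiment flow described in the text):

```shell
# Create a short-lived ingest branch off the protected main branch
# and land raw data there, isolated from consumers of main.
lakectl branch create lakefs://karius-repo/ingest \
  --source lakefs://karius-repo/main

# After cleaning, transforming, and quality-checking the data on the
# ingest branch, snapshot it as a single atomic commit.
lakectl commit lakefs://karius-repo/ingest \
  -m "Ingest and QC reference genome batch"

# Promote the validated data to main in one atomic merge.
lakectl merge lakefs://karius-repo/ingest lakefs://karius-repo/main

# Branch out of main for an ML experiment; data, code, model
# artifacts, and metrics are then committed together on this branch.
lakectl branch create lakefs://karius-repo/exp-001 \
  --source lakefs://karius-repo/main
lakectl commit lakefs://karius-repo/exp-001 \
  -m "Model artifacts and metrics for exp-001"
```

Because each commit is atomic, the experiment branch pins data, code, and artifacts to a single immutable reference rather than a collection of timestamps.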

When there is an audit, the team can simply check out the training data, code, and model artifacts from a specific commit and reproduce the experiment in no time.
Results
With lakeFS enabling data versioning and atomic commits on the data lake, the Karius team was able to overcome the crisis of reproducibility in an error-free and robust manner. It also reduced data duplication across multiple experiments, thus preserving data integrity and ensuring FDA regulatory compliance.