How Karius Used lakeFS to Comply with FDA Regulations for Disease Diagnostic Studies
Karius is a Forbes AI-50 listed life sciences company based in California that uses genomics and AI to advance infectious disease diagnostics, applying advanced ML algorithms to data-intensive, complex genomics analysis in real time.
In the healthcare industry, the reproducibility of studies and experiments is a pronounced challenge.
With lakeFS enabling data versioning and atomic commits on the data lake, the Karius team was able to overcome the reproducibility crisis in an error-free and robust manner.
The company
Karius is a Forbes AI-50 listed life sciences company based in California that uses genomics and AI to advance infectious disease diagnostics. It uses advanced machine learning algorithms for data-intensive, complex genomics analysis in real time.
Karius uses disease data from NCBI (National Center for Biotechnology Information) APIs and their proprietary reference genome databases for machine learning studies. Disease and genome data from different input sources and data formats are ingested into AWS S3 for consumption by downstream analytics workloads.
The challenge
Reproducibility of experiments
For decades, one of the most significant challenges in the healthcare industry has been the reproducibility of studies/experiments. With the advent of machine learning (ML) and the use of black-box ML models in healthcare studies, the reproducibility crisis is even more pronounced.
Challenges to reproducibility often include:
- Hypothesis testing
- Inherent randomness in the data analysis
- Incomplete documentation
- Restricted access to the underlying data and code
Add to this the duplication of data across multiple ML experiments, and the reproducibility and manageability challenges multiply.
Karius’s approach to reproducibility
Karius has been using various MLOps tools to effectively reproduce, track, and monitor machine learning studies.
However, a tool that bundles model compute, experiment tracking, and data versioning requires customers to adopt all of its features or none of them. This hampers the flexibility of data teams in picking the best tool for each task.
It became clear to the data team that an alternative solution that decouples compute from the rest of the ML training workflow was needed.
Adopted solution
Challenge solved: Reproducibility of experiments
Storing the versions of data, code, model artifacts, and metrics for every experiment atomically is essential to ensuring reproducibility. Complex, proprietary input data formats and a petabyte-scale data lake amplify these challenges.
Without atomically storing the different components of these experimental studies, retrieving the data and code from a specific point in time becomes an error-prone manual process of matching timestamps by hand.
To mitigate these issues, Karius uses lakeFS on top of an S3 object store to enable data versioning on the ML training data. The training data is periodically ingested into a dirty ingest branch which is then cleaned, transformed, quality-checked, and merged to a protected main branch.
The data teams then branch out of the main branch to run various machine learning experiments and the model artifacts are then captured. With lakeFS, atomic commits can then be performed on data, code, model artifacts, and metrics all in one place.
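The workflow above maps directly onto lakeFS branch operations. As a minimal sketch using the `lakectl` CLI (the repository and branch names here are hypothetical, chosen only to illustrate the ingest-clean-merge-experiment flow described in the text):

```shell
# Create a short-lived ingest branch off the protected main branch
# and land raw data there, isolated from consumers of main.
lakectl branch create lakefs://karius-repo/ingest \
  --source lakefs://karius-repo/main

# After cleaning, transforming, and quality-checking the data on the
# ingest branch, snapshot it as a single atomic commit.
lakectl commit lakefs://karius-repo/ingest \
  -m "Ingest and QC reference genome batch"

# Promote the validated data to main in one atomic merge.
lakectl merge lakefs://karius-repo/ingest lakefs://karius-repo/main

# Branch out of main for an ML experiment; data, code, model
# artifacts, and metrics are then committed together on this branch.
lakectl branch create lakefs://karius-repo/exp-001 \
  --source lakefs://karius-repo/main
lakectl commit lakefs://karius-repo/exp-001 \
  -m "Model artifacts and metrics for exp-001"
```

Because each commit is atomic, the experiment branch pins data, code, and artifacts to a single immutable reference rather than a collection of timestamps.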

When there is an audit, the team can simply check out the training data, code, and model artifacts from a specific commit and reproduce the experiment in no time.
Results
With lakeFS enabling data versioning and atomic commits on the data lake, the Karius team was able to overcome the crisis of reproducibility in an error-free and robust manner. It also reduced data duplication across multiple experiments, thus preserving data integrity and ensuring FDA regulatory compliance.