Vino SD
August 10, 2022

A brief case study, written in collaboration with the Karius team, on how lakeFS enabled Karius to run its experimental studies effectively.

Karius is a California-based life sciences company, listed on the Forbes AI 50, that uses genomics and AI to advance infectious disease diagnostics. It applies advanced machine learning algorithms to complex, data-intensive genomic analysis in real time.

Karius uses disease data from the NCBI (National Center for Biotechnology Information) APIs and its own proprietary reference genome databases for its machine learning studies. Disease and genome data arrive from different input sources in different formats and are ingested into AWS S3 for consumption by downstream analytics workloads.

Challenges

For decades, one of the most significant challenges in the healthcare industry has been the reproducibility of studies and experiments. With the advent of machine learning (ML) and the use of black-box ML models in healthcare studies, the reproducibility crisis has become even more pronounced. Common obstacles include multiple hypothesis testing, inherent randomness in the data analysis, incomplete documentation, and restricted access to the underlying data and code. Add the duplication of data across multiple ML experiments, and the difficulty of reproducing and managing those experiments increases manifold.

Karius had been using various MLOps tools to reproduce, track and monitor its machine learning studies. However, a tool that bundles model compute, experiment tracking, and data versioning requires customers to adopt all of its features or none of them, which prevents data teams from picking the best tool for each task. It became clear to the data team that they needed an alternative solution that decouples compute from the rest of the ML training workflow.

Solution

Storing the versions of data, code, model artifacts, and metrics for every experiment atomically is essential to ensure reproducibility. Complex, proprietary input data formats and a petabyte-scale data lake amplify the challenge. When the components of an experimental study are not stored atomically, retrieving the data and code as they were at a specific point in time becomes an error-prone manual process that relies on timestamps.

To mitigate these issues, Karius uses lakeFS on top of the S3 object store to version its ML training data. The training data is periodically ingested into a "dirty" ingest branch, where it is cleaned, transformed and quality-checked before being merged into a protected main branch. The data teams then branch out from main to run their machine learning experiments and capture the resulting model artifacts. With lakeFS, data, code, model artifacts and metrics can all be committed atomically in one place.

Figure: Data ingestion in lakeFS
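The ingest, promote and experiment flow described above can be illustrated with a toy in-memory model. This is only a sketch of the branching semantics, not the lakeFS API or Karius's pipeline: the `ToyRepo` class, the branch names, the paths and the file contents are all hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """In-memory stand-in for a versioned data lake (NOT the lakeFS API)."""
    commits: dict = field(default_factory=dict)   # commit id -> commit record
    branches: dict = field(default_factory=dict)  # branch -> head commit id (or None)
    staged: dict = field(default_factory=dict)    # branch -> uncommitted {path: content}

    def create_branch(self, name, source=None):
        # A new branch starts at the head of its source; no data is copied.
        self.branches[name] = self.branches.get(source) if source else None
        self.staged[name] = {}

    def snapshot(self, ref):
        """Resolve a branch name or commit ID to its committed {path: content} view."""
        commit_id = self.branches.get(ref, ref)
        return dict(self.commits[commit_id]["snapshot"]) if commit_id else {}

    def write(self, branch, path, content):
        # Staged writes are invisible to readers until commit() is called.
        self.staged[branch][path] = content

    def commit(self, branch, message):
        """Record every staged path in one step: readers see all of them or none."""
        snap = {**self.snapshot(branch), **self.staged[branch]}
        parent = self.branches[branch]
        payload = json.dumps([parent, message, snap], sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = {"snapshot": snap, "message": message, "parent": parent}
        self.branches[branch] = commit_id
        self.staged[branch] = {}
        return commit_id

    def merge(self, source, dest, message):
        """Naive merge: source paths overwrite dest paths (no conflict detection)."""
        self.staged[dest].update(self.snapshot(source))
        return self.commit(dest, message)


repo = ToyRepo()
repo.create_branch("main")
repo.create_branch("ingest", source="main")

# 1. Land raw data on the ingest branch, then promote it to main after checks.
repo.write("ingest", "genomes/ref.fa", "ACGT")  # placeholder content
repo.commit("ingest", "ingest: cleaned and quality-checked")
repo.merge("ingest", "main", "promote checked data to main")

# 2. Run an experiment on its own branch and commit everything together.
repo.create_branch("exp-1", source="main")
repo.write("exp-1", "code/train.py", "print('train')")
repo.write("exp-1", "models/model.bin", "weights-v1")
repo.write("exp-1", "metrics/eval.json", '{"score": "placeholder"}')
exp_commit = repo.commit("exp-1", "experiment 1: data, code, model, metrics")

print(sorted(repo.snapshot(exp_commit)))  # every input and output of the experiment
print(sorted(repo.snapshot("main")))      # main is untouched by the experiment
```

The point of the sketch is that the experiment's commit ID names one consistent snapshot of data, code, model and metrics, while main only ever contains data that passed the quality checks.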

During an audit, the team can simply check out the training data, code and model artifacts from a specific commit and reproduce the experiment quickly, with no need to reconstruct its inputs from timestamps.
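As a minimal sketch of what "checking out a specific commit" can look like in practice: lakeFS addresses objects as repository, then ref, then path, where the ref may be a branch name or a commit ID. Pinning the ref to a commit ID therefore gives a read location that never changes. The repository name, commit ID and object paths below are hypothetical, and `pinned_uri` is an illustrative helper rather than part of any lakeFS library.

```python
def pinned_uri(repo: str, ref: str, path: str, scheme: str = "lakefs") -> str:
    """Build a URI of the form <scheme>://<repo>/<ref>/<path>.

    `ref` may be a branch name (moves over time) or a commit ID (immutable).
    """
    return f"{scheme}://{repo}/{ref}/{path.lstrip('/')}"


REPO = "example-repo"      # hypothetical repository name
AUDIT_COMMIT = "a1b2c3d4"  # hypothetical commit ID recorded for the experiment

# Hypothetical locations of the experiment's components inside the repository.
components = ["data/train.parquet", "code/train.py", "models/model.bin"]

for path in components:
    # Reading through a commit ID always returns the same bytes,
    # whereas reading through "main" returns whatever main points to today.
    print(pinned_uri(REPO, AUDIT_COMMIT, path))
```

Because the commit ID is part of every path, the audit reads exactly the inputs the original run used, even if main has moved on since.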

Results

With lakeFS providing data versioning and atomic commits on the data lake, the Karius team was able to overcome the reproducibility crisis in an error-free and robust manner. lakeFS also reduced data duplication across experiments, thus preserving data integrity and ensuring FDA regulatory compliance.
