TL;DR
- CytoReason uses lakeFS to version its data products and intermediate data, and overcame the lack of a native Nextflow–lakeFS integration by developing a custom plugin that removes the intermediate cloud-storage hop.
- The integration allows CytoReason to treat its disease models as versioned, reproducible data products ready for consumption, exploration, and application.
- The integration transformed the model release cycle – releases are faster, cleaner, and models are far more reproducible.
CytoReason is a technology company transforming biopharma’s decision-making—from trial and error to data-driven—through its AI platform of computational disease models. Leveraging an extensive database of public and proprietary data, the company maps human diseases tissue by tissue and cell by cell. Researchers at leading pharma companies, including Pfizer and Sanofi, rely on CytoReason’s technology to make data-driven decisions across the entire drug development lifecycle.
Our core product is the disease model itself: rich, structured data delivered via APIs and UI. We design these models so that extensive analysis can be consumed and applied directly in research and development. Our process combines expert biologists and bioinformaticians who analyze data through visualizations and code, supported by internal machine learning and AI systems that enhance evaluation accuracy.
Recently, we re-architected our platform to integrate lakeFS, enabling robust data versioning and streamlining our workflows.
Nextflow is widely adopted in the bioinformatics community, and we have incorporated it into our model creation pipeline. While Nextflow supports publishing results to cloud storage, it doesn’t connect seamlessly with lakeFS.
To address this, we built custom integrations that allow us to run Nextflow pipelines and publish results directly into our data infrastructure.
Our setup: lakeFS for data versioning and Nextflow for building data pipelines
lakeFS setup
We treat our product – that is, our structured disease model data – as a versioned release, much like software packages. lakeFS provides an ideal solution for managing releases with precision and consistency, and one of our main use cases is versioning and tracking product releases.
Additionally, we use lakeFS to version intermediate datasets generated during pipeline execution. This allows us to snapshot specific analysis stages, internally release them with corresponding data, and reuse them in downstream workflows.
Our goal is to enable bioinformaticians to run complex pipelines composed of multiple analyses, approve the results biologically, and store them in a lakeFS branch. Once approved, we tag the branch to create a versioned release, such as 1.2.1 or 1.3.0, depending on whether we’re fixing an analysis or adding new datasets to the disease model. In essence, it’s like retraining and versioning a model.
Nextflow setup
While platforms like Airflow are great for building complex, data-centric pipelines, bioinformatics workflows demand a hybrid approach that combines data-driven execution with domain-specific scientific logic. Nextflow, widely adopted across the bioinformatics community, is the engine we built our model creation pipeline around.
Our pipeline is substantial and primarily written in R, reflecting the deep expertise of our developers. Much of our domain-specific analytical logic is tightly coupled with this R codebase, so we continue to maintain and extend it to preserve the integrity and reproducibility of our scientific workflows. At the same time, we’re expanding to support Python and other languages to diversify and scale our analytical capabilities.
Nextflow is a perfect fit for this approach because it’s language-agnostic. It allows us to decouple the workflow orchestration from the actual analysis logic. Our workflows are written in Nextflow DSL, while the analysis modules – what we call “analysis producers” – can be implemented in any language, whether R, Python, or others.
Each analysis module generates structured data outputs, which we refer to as “evidences.” They represent biological insights, such as gene expression differences between disease and healthy states, cell-type abundance shifts, or pathway-level perturbations. By modularizing our pipeline this way, we maintain flexibility, reproducibility, and scalability across diverse analytical tasks.
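To make this concrete, here is a minimal sketch of one such producer wrapped in a Nextflow process. All names in it (the process, the run_dge.R script, the dataset identifier, and the file paths) are hypothetical placeholders; the point is that the orchestration stays in Nextflow DSL while the analysis itself is an R script whose output becomes an evidence.

```groovy
// Minimal sketch of an "analysis producer" wrapped in a Nextflow process.
// All names (process, script, dataset id, file paths) are hypothetical.

process differentialExpression {
    input:
    tuple val(dataset_id), path(expression_matrix)

    output:
    tuple val(dataset_id), path('evidence_dge.parquet')

    script:
    """
    # The producer itself is plain R; Nextflow only orchestrates it.
    run_dge.R --input ${expression_matrix} --output evidence_dge.parquet
    """
}

workflow {
    // Pair each dataset id with its expression matrix (illustrative input).
    datasets_ch = Channel.of(
        ['GSE12345', file('data/GSE12345_expression.rds')]
    )
    differentialExpression(datasets_ch)
}
```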
lakeFS + Nextflow?
At the core of our product is what we call E2: a structured collection of evidences and entities. This forms the foundation of our disease model: a rich dataset generated through complex analyses orchestrated by Nextflow and versioned using lakeFS.
Each analysis produces evidences, such as gene expression differences between disease and healthy states, pathway shifts, or cell-type abundance changes. We wanted these evidences pushed to lakeFS as part of our product release process. From there, they would be consumed via our computational environment (R, Python), visualizations, UI components, or notebooks provided to our customers.
However, to make that happen, we had to extend Nextflow’s native storage integrations to work with a modern data platform like lakeFS.
Challenge: Nextflow lacked a native integration with lakeFS
By default, Nextflow pipelines publish results to cloud storage like GCS or S3. However, we needed a direct integration with lakeFS to streamline versioning and product delivery, bypassing generic cloud buckets.
To close this gap, we built a custom integration that lets our Nextflow pipelines publish results directly into our data infrastructure.
This integration now allows us to treat our disease models as versioned, reproducible data products ready for consumption, exploration, and application.
Solution: custom integration plugin connecting Nextflow and lakeFS
Fortunately, Nextflow’s plugin architecture is highly flexible. So we developed a custom plugin focused on two key tasks:
- Pulling data from lakeFS branches to initiate analysis
- Pushing results back into lakeFS as versioned outputs
The plugin bridges the gap between pipeline execution and data versioning, allowing us to treat each analysis as a reproducible, traceable unit within our disease modeling framework.
The plugin introduces a lakeFS protocol, similar to how Nextflow supports GCS (gs://) and S3 (s3://) URLs. With this protocol, we can use lakefs:// followed by a path to directly pull data from or push data to lakeFS, just like any other supported storage backend.
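As a hedged illustration of what this looks like from a workflow, with a made-up repository ("disease-models") and branch ("staging"), reading an input directly from a lakeFS branch might look like this:

```groovy
// Sketch only: "disease-models" and "staging" are hypothetical repository
// and branch names. With the plugin, lakefs:// behaves like gs:// or s3://.

params.counts = 'lakefs://disease-models/staging/datasets/GSE12345/counts.tsv'

workflow {
    // Pull an input file directly from a lakeFS branch.
    Channel
        .fromPath(params.counts)
        .view { f -> "staged input: ${f}" }
}
```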
Our plugin primarily leverages Nextflow’s publishDir mechanism, which allows us to define where and how outputs from each process are stored. For example, if a process generates gene expression differences for a specific comparison, we can dynamically construct the output path using parameters like dataset ID. This enables structured, partitioned storage—similar to Hive-style directory partitioning (e.g., dataset_id=12345/condition=healthy/).
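A sketch of how that routing might be declared; the repository, branch, and process names below are placeholders rather than our actual layout:

```groovy
// Hypothetical example of publishing an evidence to a lakeFS branch using
// Hive-style partitioning; repository, branch, and process names are illustrative.

process cellAbundance {
    // Route the output by dataset, e.g. .../dataset_id=GSE12345/
    publishDir "lakefs://disease-models/staging/evidences/cell_abundance/dataset_id=${dataset_id}/", mode: 'copy'

    input:
    tuple val(dataset_id), path(expression_matrix)

    output:
    path 'evidence_cell_abundance.parquet'

    script:
    """
    run_cell_abundance.R --input ${expression_matrix} --output evidence_cell_abundance.parquet
    """
}
```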
By teaching Nextflow to understand and interact with lakeFS paths, we created a streamlined workflow where bioinformaticians can run analyses, publish results directly to lakeFS, and version them as part of our disease model pipeline. This eliminates the need to manually move data from GCS to lakeFS and ensures that every analysis is traceable, reproducible, and ready for consumption.
Result: Faster, cleaner, and more reproducible models
Our main goal in developing the lakeFS plugin for Nextflow was to move away from low-level cloud storage like GCS and toward a more structured, versioned data management approach. GCS is great for raw storage, but it lacks the visibility and organizational capabilities we need for reproducible bioinformatics workflows.
While lakeFS helps us save storage by avoiding unnecessary data duplication, the real value lies in the time it saves during model development and release. Previously, bioinformaticians and biologists would invest significant effort into generating and validating disease models, but the release process itself was slow, manual, and painful – especially after biological validation.
One of the biggest bottlenecks was the lack of robust data validation. With lakeFS, we now leverage pre-commit and post-commit actions to enforce validation checks automatically. Once a model is biologically approved, we simply tag the lakeFS branch to create a new version, streamlining the release process dramatically.
This shift has transformed how we release disease models: faster, cleaner, and far more reproducible.
Use cases for Nextflow + lakeFS
Streamlining updates and scaling models
Instead of regenerating the entire disease model from scratch, we build it incrementally. We take the existing model and simply add new dataset analyses to it. This offers two major advantages:
- Efficiency – We avoid recalculating the entire model across dozens of datasets (sometimes 30, 40, or more), which saves significant time and compute.
- Smart versioning – Our release process typically doesn’t involve copying data from staging to production. Instead, we can link staging data to lakeFS branches before approving the production version. This means the new version of the disease model can be served as a reference to validated data, eliminating duplication and overhead. The exception is the dedicated environments we create for customers who ask for separate backing storage (and therefore a separate lakeFS instance).
By treating lakeFS repositories as versioned data products, we streamline updates, maintain reproducibility, and scale our model evolution with minimal friction.
Directory of experiment configurations
Now that we have a directory of experiment configurations, we can traverse lakefs://config/ and load all relevant files to kick off multiple comparisons across disease states. This flexibility is critical for scaling our analyses.
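Here is a sketch of that traversal, under two explicit assumptions: the repository and branch names ("disease-models", "main") are placeholders, and the plugin is assumed to support directory listing over lakeFS paths.

```groovy
// Sketch only: repository/branch names are placeholders, and directory
// listing over lakefs:// paths is assumed to be supported by the plugin.

workflow {
    // Discover every experiment configuration stored under the config prefix.
    def config_dir = file('lakefs://disease-models/main/config/')

    Channel
        .fromList(config_dir.listFiles().toList())
        // Each configuration file kicks off one comparison downstream.
        .view { cfg -> "launching comparison for ${cfg.name}" }
}
```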
Routing outputs dynamically
We also leverage Nextflow’s publishDir mechanism to dynamically route outputs based on parameters like dataset ID, using Hive-style partitioning (e.g., dataset_id=12345/condition=healthy/). This structure makes it easy to integrate with data platforms like Trino or DuckDB, though our focus remains on lakeFS for its versioning and reproducibility.
Linking for zero-copy
One of the most powerful features we’re now using is lakeFS linking. Instead of copying intermediate data to BigQuery or other buckets just to organize it, we link to the original data stored in GCS. We use this during the research and development phase, before a disease model is approved, to dramatically accelerate model exploration and testing while also saving costs.
This linking also enables us to reuse the same intermediate data across multiple model versions. For instance, if we generate new gene expression or cell abundance analyses on a new dataset, we can update the disease model by linking to the relevant staging data. No need to copy or restructure anything.
Easy permissions and built-in admin panel
lakeFS also gives us fine-grained permission controls, so we can manage visibility across teams and collaborators. What’s more, features we previously built custom admin panels for – like version tagging, feature toggling, and model updates – are now handled natively by lakeFS. We no longer need to build or maintain separate UIs for version management; it’s all integrated.
Next steps
We are advancing toward a tighter lakeFS–Nextflow integration by adding full support for the Nextflow import mechanism, enabling workflows to resolve and ingest lakeFS-backed assets natively.
Our roadmap includes glob-pattern resolution for lakeFS paths, allowing broad file discovery across versioned datasets. Support for the AWS lakeFS backend will be introduced with minimal overhead, as storage and retrieval operations can be delegated to the underlying Nextflow cloud provider plugins.
Additionally, the plugin will expose lakeFS SDK functions within workflow contexts, enabling direct programmatic interaction – such as branch operations, commits, and metadata queries – without leaving the Nextflow runtime.