Computer Vision in Healthcare: Incorporating Data Version Control Into Your ML Workflow

Joe Pringle

February 18, 2026

Based on my presentation at PyData Global 2025. My colleague Yoav recently wrote about why reproducibility matters so much in healthcare AI and how data version control addresses the gap. This post is a follow up to that, with an overview of how to incorporate data version control into your existing ML workflow to address […]

Best Practices Data Engineering Machine Learning Tutorials

Building Compliant and Reproducible ML Pipelines

Itai Gilo

February 3, 2026

Based on my presentation at PyData Global 2025 When we – engineers – hear the word “compliance,” we tend to roll our eyes. We want to build features, not fill out forms. But here’s good news: the exact same tools that help you debug your code can also keep you out of trouble. In this

Best Practices Product Tutorials

Adding Data Version Control Capabilities to MATLAB with lakeFS

Joe Pringle

November 17, 2025

Many lakeFS customers in the aerospace, automotive, healthcare & life sciences, and manufacturing industries also are heavy users of MATLAB. lakeFS solves a range of data ops challenges for these organizations by serving as a “control plane” for AI-ready data – versioning complex data pipelines, tracking metadata and lineage, and enabling team collaboration through git-like

Best Practices Product Tutorials

Versioned Data with Apache Iceberg Using lakeFS Iceberg REST Catalog

Amit Kesarwani

August 14, 2025

lakeFS Enterprise offers a fully standards-compliant implementation of the Apache Iceberg REST Catalog, enabling Git-style version control for structured data at scale. This integration allows teams to use Iceberg-compatible tools like Spark, Trino, and PyIceberg without any vendor lock-in or proprietary formats. By treating Iceberg tables as versioned entities within lakeFS repositories and branches, users

Best Practices Machine Learning Product Tutorials

A Single Pane of Glass to Your Data: Multiple Storage Backends Support in lakeFS

Tal Sofer

May 13, 2025

Today’s organizations don’t just use a single data storage solution – they operate across on-prem servers, multiple cloud providers, and hybrid environments. This distributed approach has become necessary, but it comes with significant costs: teams struggle with siloed tools, duplicated processes, and an endless cycle of environment management that diverts focus from delivering actual value.

Best Practices Product Tutorials

How to Avoid Data Breaches by using RBAC

Amit Kesarwani

February 18, 2025

Introduction Role-Based Access Control (RBAC) is an effective way to minimize the risk of data breaches by ensuring users only have access to the data and systems necessary for their job roles. Here’s how you can use RBAC to avoid data breaches: 1. Principle of Least Privilege (PoLP) 2. Define Clear Roles and Responsibilities 3.

Product Tutorials

How To Get Started with lakeFS Enterprise: Step-by-Step Tutorial

Amit Kesarwani

November 27, 2024

What is lakeFS Enterprise? lakeFS Enterprise is a commercially-supported version of lakeFS, offering additional features and functionalities that meet the needs of organizations from a production-grade system. Why did we build lakeFS Enterprise? lakeFS Enterprise was built for organizations that require the support, security standards and features required of a production-grade system and are not

Best Practices Product Tutorials

Collaborating Over Data: Introducing Pull Requests in lakeFS

Oz Katz, Itai Gilo

October 21, 2024

In modern software development, Pull Requests (PRs) are a fundamental tool for collaborating on code. They allow teams to review, discuss, and merge changes in a controlled and transparent way. But what if you could apply that same concept to data? At lakeFS, we’re excited to introduce Pull Requests for data — a new feature

Best Practices Tutorials

Automated Testing in Isolated Environments with GitHub Actions and lakeFS

Amit Kesarwani

September 24, 2024

Promoting ETL code for production is a straightforward process. We have our code – usually stored in Git – and want to build and test it. We can check out the code, write changes, and commit them. Once we’re ready to promote our ETL into production, we run a pull request that might run automated

Best Practices Product Tutorials

Guide To The lakeFS File Representation

Iddo Avneri

September 22, 2024

Once you start using lakeFS, the files on your object store will form a new representation. The names and paths of the files on the object store will no longer look the same. This article provides a high-level overview of the lakeFS file representation to help you understand the lakeFS file representation and how it

Best Practices Product Tutorials

Metadata Enforcement: Step-by-Step Tutorial

Amit Kesarwani

August 27, 2024

Metadata enforcement is a broad term that can refer to different aspects of managing and controlling metadata. Let’s explore few key areas: Understanding Metadata Enforcement 1 – Data Privacy and Protection: 2 – Data Governance and Quality: 3 – Legal and Compliance Challenges in Metadata Enforcement Strategies for Effective Metadata Enforcement We will focus on

Best Practices Tutorials

Delta Time Travel in Databricks: How It Works

Tal Sofer

August 21, 2024

Databricks Delta Lake includes a number of time travel features to let you access any previous version of the extensive data that Delta automatically versions and stores in your data lake. This makes it simple to audit, roll back data in the event of unintentional poor writes or deletes, and reproduce reports and trials. How

Tutorials

Pick up the Slack with lakeFS