This post recaps a comprehensive tutorial published by Alex Merced from Dremio and Tal Sofer from lakeFS, highlighting how version control transforms multimodal data management for AI teams.
The Challenge: Keeping Diverse Data Types in Sync and Queryable
Modern AI pipelines consume more than just structured data. Training sets include images, model artifacts, logs, and metadata tables – all evolving at different rates and living across disparate systems. The fundamental problem isn’t storage or processing; it’s keeping these diverse assets synchronized and enabling unified analysis across them.
When data scientists need to trace which exact version of training images produced a specific model result, or when ML engineers want to experiment with new preprocessing logic without risking production datasets, traditional data management approaches fall short. You end up with manual versioning schemes, duplicated storage, or worse: undocumented data drift.
The solution requires two capabilities working together: lakeFS keeps your datasets in sync through version control, while Dremio unlocks unified querying across structured and unstructured data. Together, they transform how teams manage and analyze multimodal datasets.
Version Control as the Foundation
The core insight demonstrated in the Dremio blog post is deceptively simple: apply Git workflows to your entire data lake. Not just tables, but images, logs, model binaries – everything.
With lakeFS, you get:
- Zero-copy branching for isolated experimentation
- Atomic commits across multiple data types simultaneously
- Merge workflows that promote validated changes to production
- Tags and references for reproducible snapshots
The practical impact is immediate. Teams can spin up experimental branches, test transformations on real data volumes, and merge confidently, all without duplicating petabytes of storage or coordinating complex freeze windows. Most importantly, all assets are versioned holistically and stay synchronized: when you reference a specific commit or tag, you get a consistent snapshot across images, tables, models, and metadata. This holistic versioning is what makes true reproducibility possible.
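To make this concrete, here is a minimal sketch of committing and tagging mixed assets with the lakeFS high-level Python SDK (the `lakefs` package); the repository name, paths, and exact method calls are assumptions rather than the tutorial's code:

```python
import lakefs

# Hypothetical repository; assumes lakeFS credentials are configured via
# environment variables or ~/.lakectl.yaml.
repo = lakefs.repository("multimodal-pd12m")
main = repo.branch("main")

# Stage assets of different types, then capture them in a single atomic commit.
main.object("metadata/ingest-notes.txt").upload(data="ingest run notes")
main.object("models/baseline/MODEL_CARD.md").upload(data="placeholder model card")
main.commit(message="Add ingest notes and baseline model card")

# A tag pins that snapshot permanently for reproducible reads.
repo.tag("baseline").create("main")

# Reads through the tag keep returning the same bytes, however 'main' evolves.
print(repo.ref("baseline").object("metadata/ingest-notes.txt").reader().read())
```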
The Multimodal Architecture Pattern
The tutorial walks through an elegant architecture that leverages three complementary technologies:
lakeFS, a control plane for AI-ready data, provides the version control layer, managing both structured tables (via its Iceberg REST Catalog) and unstructured objects (via S3-compatible APIs) under unified snapshots. This ensures that when you reference a specific commit or tag, you’re getting a consistent view across all data types.
Apache Iceberg brings transactional guarantees and high-performance access to structured datasets stored on object storage. The lakeFS Iceberg REST Catalog extends Iceberg’s capabilities by making every table operation version-aware. Namespace conventions encode repository and branch information directly, so queries are automatically pinned to exact snapshots.
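As a rough illustration of that version-awareness, here is a minimal PyIceberg sketch against the lakeFS Iceberg REST Catalog; the endpoint path, token, and namespace layout (assumed here to encode repository and branch) are placeholders, and the tutorial documents the exact convention:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical lakeFS endpoint and credentials; adjust to your installation.
catalog = load_catalog(
    "lakefs",
    **{
        "type": "rest",
        "uri": "https://lakefs.example.com/iceberg/api",  # assumed catalog path
        "token": "<lakeFS-access-token>",
    },
)

# The identifier is assumed to encode branch and namespace, so the read is
# pinned to that branch's current snapshot rather than a mutable location.
table = catalog.load_table("main.pd12m.image_metadata")
print(table.scan(limit=10).to_arrow())
```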
Dremio serves as the query engine that ties everything together, enabling high-performance SQL across versioned Iceberg tables and AI-powered analysis of unstructured files. The combination removes the need for data movement while maintaining governance.
Real-World Implementation: The PD12M Example
The tutorial demonstrates this architecture using the PD12M public domain image dataset. The workflow progression is instructive:
- Repository creation establishes a `multimodal-pd12m` repo with an `ingest` working branch
- Zero-copy import registers millions of S3-hosted images as lakeFS objects without duplication
- Metadata transformation rewrites image URLs to reference lakeFS paths, creating logical connections (a sketch of this step appears below)
- Iceberg table creation stores the transformed metadata via the lakeFS REST Catalog
- Branch merge and tagging promotes the ingestion to `main` and creates a `baseline` tag for permanent reference
What makes this powerful is the atomicity: the baseline tag captures both the complete image collection and the metadata table in perfect alignment. Anyone querying that tag gets exactly the same data, whether today or six months from now.
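The metadata-transformation step above amounts to rewriting each image URL so it points at the lakeFS-managed copy of the object. A minimal sketch with pandas, assuming hypothetical column names and path prefixes:

```python
import pandas as pd

# Hypothetical input: PD12M metadata with the original public image URLs.
df = pd.read_parquet("pd12m_metadata.parquet")

# Assumed prefixes: the source bucket and the lakeFS path under which the
# zero-copy import registered the same objects (repo/branch/prefix).
SRC_PREFIX = "s3://pd12m-public/images/"
DST_PREFIX = "s3://multimodal-pd12m/ingest/images/"

# Point every metadata row at the versioned lakeFS object.
df["image_url"] = df["image_url"].str.replace(SRC_PREFIX, DST_PREFIX, regex=False)

# The transformed frame is then written as an Iceberg table through the
# lakeFS REST Catalog, as described in the workflow above.
df.to_parquet("pd12m_metadata_lakefs.parquet")
```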
Branching Enables Safe Experimentation
The blog also covers experimental workflows: data scientists can create feature branches, run transformations, test new preprocessing pipelines, or add derived columns – all without touching production data.
Because branching is metadata-only, the cost approaches zero. Teams can maintain dozens of active experiments simultaneously, each isolated but working with full-scale data volumes. When a branch proves valuable, a simple merge operation promotes it. Failed experiments are abandoned without cleanup overhead.
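A minimal sketch of that loop, again using the high-level lakeFS Python SDK with hypothetical names (the tutorial's own code may differ):

```python
import lakefs

repo = lakefs.repository("multimodal-pd12m")  # hypothetical repository

# Branch creation copies no data: it is a metadata-only pointer at 'main'.
exp = repo.branch("resize-experiment").create(source_reference="main")

# Run transformations against full-scale data on the isolated branch.
exp.object("images/processed/README.txt").upload(data="resized to 512px")
exp.commit(message="Test 512px resize pipeline")

# Keep it: a merge promotes the validated changes into production.
exp.merge_into(repo.branch("main"))

# Or drop it: deleting the branch abandons the experiment with no storage cleanup.
# exp.delete()
```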
This mirrors modern software development practices but solves a harder problem: coordinating changes across structured and unstructured data that may total petabytes.
Querying Versioned Data at Scale
Connecting Dremio to the lakeFS REST Catalog creates version-aware query routing. When you specify a repository branch or tag in your SQL, Dremio automatically fetches data from that exact snapshot while reading actual files directly from object storage.
The result is reproducible analytics. Queries against the baseline tag return identical results indefinitely, even as the main branch continues evolving. For compliance, auditing, or debugging, this provides an immutable data foundation that traditional lakes lack.
Some examples demonstrate joining Iceberg metadata tables with unstructured image references, all within a single versioned context. The query engine handles the complexity while developers work with straightforward SQL.
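For a sense of what such a query can look like, here is a hedged sketch that submits tag-pinned SQL through Dremio's REST SQL endpoint; the source name (`lakefs_catalog`), table path, column names, and auth header are assumptions, not the tutorial's exact identifiers:

```python
import requests

DREMIO_URL = "https://dremio.example.com"  # hypothetical Dremio endpoint
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # auth scheme varies by deployment

# Tag-pinned query: the table path is assumed to encode the lakeFS repository
# and the 'baseline' tag, so results stay identical even as 'main' evolves.
sql = """
SELECT m.image_id, m.caption, m.image_url
FROM lakefs_catalog."multimodal-pd12m"."baseline".pd12m.image_metadata AS m
LIMIT 100
"""

# Dremio's SQL REST API returns a job id; results are then fetched via the jobs API.
resp = requests.post(f"{DREMIO_URL}/api/v3/sql", json={"sql": sql}, headers=HEADERS)
resp.raise_for_status()
print("Submitted job:", resp.json()["id"])
```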
AI Functions Unlock Unstructured Analysis
One particularly innovative section covers Dremio’s AI functions operating on lakeFS-managed files. By connecting Dremio as an S3-compatible source pointing at lakeFS, teams can use functions like AI_GENERATE and AI_CLASSIFY directly on versioned PDFs, images, or documents.
The example shows extracting structured recipe metadata from PDF files in SQL, with each extraction tied to a specific lakeFS path. This closes the loop between raw unstructured data ingestion and structured analysis; all with full version control backing every step.
Getting Started
The tutorial provides complete working examples using Python, but the concepts apply to any language or tool that can interact with S3 and Iceberg REST catalogs. The barrier to entry is remarkably low: lakeFS can run anywhere, from local development to cloud-native deployments.
For teams already using Iceberg or planning multimodal AI pipelines, the investment pays immediate dividends in reliability, reproducibility, and development velocity.
Read the complete technical walkthrough on Dremio’s blog for detailed code samples, configuration steps, and advanced query patterns. Special thanks to Alex Merced for the comprehensive tutorial and for collaborating to demonstrate these patterns in practice.



