This post recaps a comprehensive tutorial published by Alex Merced from Dremio and Tal Sofer from lakeFS, highlighting how version control transforms multimodal data management for AI teams.
The Challenge: Keeping Diverse Data Types in Sync and Queryable
Modern AI pipelines consume more than just structured data. Training sets include images, model artifacts, logs, and metadata tables – all evolving at different rates and living across disparate systems. The fundamental problem isn’t storage or processing; it’s keeping these diverse assets synchronized and enabling unified analysis across them.
When data scientists need to trace which exact version of training images produced a specific model result, or when ML engineers want to experiment with new preprocessing logic without risking production datasets, traditional data management approaches fall short. You end up with manual versioning schemes, duplicated storage, or worse: undocumented data drift.
The solution requires two capabilities working together: lakeFS keeps your datasets in sync through version control, while Dremio unlocks unified querying across structured and unstructured data. Together, they transform how teams manage and analyze multimodal datasets.
Version Control as the Foundation
The core insight demonstrated in the Dremio blog post is deceptively simple: apply Git workflows to your entire data lake. Not just tables, but images, logs, model binaries – everything.
With lakeFS, you get:
- Zero-copy branching for isolated experimentation
- Atomic commits across multiple data types simultaneously
- Merge workflows that promote validated changes to production
- Tags and references for reproducible snapshots
The practical impact is immediate. Teams can spin up experimental branches, test transformations on real data volumes, and merge confidently, all without duplicating petabytes of storage or coordinating complex freeze windows. Most importantly, all assets are versioned holistically and stay synchronized: when you reference a specific commit or tag, you get a consistent snapshot across images, tables, models, and metadata. This holistic versioning is what makes true reproducibility possible.
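To make this concrete, here is a minimal sketch of committing and tagging mixed assets with the lakeFS high-level Python SDK (the `lakefs` package); the repository name, paths, and exact method calls are assumptions rather than the tutorial's code:

```python
import lakefs

# Hypothetical repository; assumes lakeFS credentials are configured via
# environment variables or ~/.lakectl.yaml.
repo = lakefs.repository("multimodal-pd12m")
main = repo.branch("main")

# Stage assets of different types, then capture them in a single atomic commit.
main.object("metadata/ingest-notes.txt").upload(data="ingest run notes")
main.object("models/baseline/MODEL_CARD.md").upload(data="placeholder model card")
main.commit(message="Add ingest notes and baseline model card")

# A tag pins that snapshot permanently for reproducible reads.
repo.tag("baseline").create("main")

# Reads through the tag keep returning the same bytes, however 'main' evolves.
print(repo.ref("baseline").object("metadata/ingest-notes.txt").reader().read())
```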
The Multimodal Architecture Pattern
The tutorial walks through an elegant architecture that leverages three complementary technologies:
lakeFS, a control plane for AI-ready data, provides the version control layer, managing both structured tables (via its Iceberg REST Catalog) and unstructured objects (via S3-compatible APIs) under unified snapshots. This ensures that when you reference a specific commit or tag, you’re getting a consistent view across all data types.
Apache Iceberg brings transactional guarantees and high-performance access to structured datasets stored on object storage. The lakeFS Iceberg REST Catalog extends Iceberg’s capabilities by making every table operation version-aware. Namespace conventions encode repository and branch information directly, so queries are automatically pinned to exact snapshots.
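As a rough illustration of that version-awareness, here is a minimal PyIceberg sketch against the lakeFS Iceberg REST Catalog; the endpoint path, token, and namespace layout (assumed here to encode repository and branch) are placeholders, and the tutorial documents the exact convention:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical lakeFS endpoint and credentials; adjust to your installation.
catalog = load_catalog(
    "lakefs",
    **{
        "type": "rest",
        "uri": "https://lakefs.example.com/iceberg/api",  # assumed catalog path
        "token": "<lakeFS-access-token>",
    },
)

# The identifier is assumed to encode branch and namespace, so the read is
# pinned to that branch's current snapshot rather than a mutable location.
table = catalog.load_table("main.pd12m.image_metadata")
print(table.scan(limit=10).to_arrow())
```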
Dremio serves as the query engine that ties everything together, enabling high-performance SQL across versioned Iceberg tables and AI-powered analysis of unstructured files. The combination removes the need for data movement while maintaining governance.
Real-World Implementation: The PD12M Example
The tutorial demonstrates this architecture using the PD12M public domain image dataset. The workflow progression is instructive:
- Repository creation establishes a `multimodal-pd12m` repo with an `ingest` working branch
- Zero-copy import registers millions of S3-hosted images as lakeFS objects without duplication
- Metadata transformation rewrites image URLs to reference lakeFS paths, creating logical connections (a sketch of this step appears below)
- Iceberg table creation stores the transformed metadata via the lakeFS REST Catalog
- Branch merge and tagging promotes the ingestion to `main` and creates a `baseline` tag for permanent reference
What makes this powerful is the atomicity: the baseline tag captures both the complete image collection and the metadata table in perfect alignment. Anyone querying that tag gets exactly the same data, whether today or six months from now.
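The metadata-transformation step above amounts to rewriting each image URL so it points at the lakeFS-managed copy of the object. A minimal sketch with pandas, assuming hypothetical column names and path prefixes:

```python
import pandas as pd

# Hypothetical input: PD12M metadata with the original public image URLs.
df = pd.read_parquet("pd12m_metadata.parquet")

# Assumed prefixes: the source bucket and the lakeFS path under which the
# zero-copy import registered the same objects (repo/branch/prefix).
SRC_PREFIX = "s3://pd12m-public/images/"
DST_PREFIX = "s3://multimodal-pd12m/ingest/images/"

# Point every metadata row at the versioned lakeFS object.
df["image_url"] = df["image_url"].str.replace(SRC_PREFIX, DST_PREFIX, regex=False)

# The transformed frame is then written as an Iceberg table through the
# lakeFS REST Catalog, as described in the workflow above.
df.to_parquet("pd12m_metadata_lakefs.parquet")
```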
Branching Enables Safe Experimentation
The blog also covers experimental workflows: data scientists can create feature branches, run transformations, test new preprocessing pipelines, or add derived columns – all without touching production data.
Because branching is metadata-only, the cost approaches zero. Teams can maintain dozens of active experiments simultaneously, each isolated but working with full-scale data volumes. When a branch proves valuable, a simple merge operation promotes it. Failed experiments are abandoned without cleanup overhead.
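A minimal sketch of that loop, again using the high-level lakeFS Python SDK with hypothetical names (the tutorial's own code may differ):

```python
import lakefs

repo = lakefs.repository("multimodal-pd12m")  # hypothetical repository

# Branch creation copies no data: it is a metadata-only pointer at 'main'.
exp = repo.branch("resize-experiment").create(source_reference="main")

# Run transformations against full-scale data on the isolated branch.
exp.object("images/processed/README.txt").upload(data="resized to 512px")
exp.commit(message="Test 512px resize pipeline")

# Keep it: a merge promotes the validated changes into production.
exp.merge_into(repo.branch("main"))

# Or drop it: deleting the branch abandons the experiment with no storage cleanup.
# exp.delete()
```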
This mirrors modern software development practices but solves a harder problem: coordinating changes across structured and unstructured data that may total petabytes.
Querying Versioned Data at Scale
Connecting Dremio to the lakeFS REST Catalog creates version-aware query routing. When you specify a repository branch or tag in your SQL, Dremio automatically fetches data from that exact snapshot while reading actual files directly from object storage.
The result is reproducible analytics. Queries against the baseline tag return identical results indefinitely, even as the main branch continues evolving. For compliance, auditing, or debugging, this provides an immutable data foundation that traditional lakes lack.
Some examples demonstrate joining Iceberg metadata tables with unstructured image references, all within a single versioned context. The query engine handles the complexity while developers work with straightforward SQL.
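For a sense of what such a query can look like, here is a hedged sketch that submits tag-pinned SQL through Dremio's REST SQL endpoint; the source name (`lakefs_catalog`), table path, column names, and auth header are assumptions, not the tutorial's exact identifiers:

```python
import requests

DREMIO_URL = "https://dremio.example.com"  # hypothetical Dremio endpoint
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # auth scheme varies by deployment

# Tag-pinned query: the table path is assumed to encode the lakeFS repository
# and the 'baseline' tag, so results stay identical even as 'main' evolves.
sql = """
SELECT m.image_id, m.caption, m.image_url
FROM lakefs_catalog."multimodal-pd12m"."baseline".pd12m.image_metadata AS m
LIMIT 100
"""

# Dremio's SQL REST API returns a job id; results are then fetched via the jobs API.
resp = requests.post(f"{DREMIO_URL}/api/v3/sql", json={"sql": sql}, headers=HEADERS)
resp.raise_for_status()
print("Submitted job:", resp.json()["id"])
```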
AI Functions Unlock Unstructured Analysis
One particularly innovative section covers Dremio’s AI functions operating on lakeFS-managed files. By connecting Dremio as an S3-compatible source pointing at lakeFS, teams can use functions like AI_GENERATE and AI_CLASSIFY directly on versioned PDFs, images, or documents.
The example shows extracting structured recipe metadata from PDF files in SQL, with each extraction tied to a specific lakeFS path. This closes the loop between raw unstructured data ingestion and structured analysis; all with full version control backing every step.
Getting Started
The tutorial provides complete working examples using Python, but the concepts apply to any language or tool that can interact with S3 and Iceberg REST catalogs. The barrier to entry is remarkably low: lakeFS can run anywhere, from local development to cloud-native deployments.
For teams already using Iceberg or planning multimodal AI pipelines, the investment pays immediate dividends in reliability, reproducibility, and development velocity.
Read the complete technical walkthrough on Dremio’s blog for detailed code samples, configuration steps, and advanced query patterns. Special thanks to Alex Merced for the comprehensive tutorial and for collaborating to demonstrate these patterns in practice.



