

Tal Sofer, Author

Tal Sofer is a product manager at Treeverse, the company...

Last updated on December 18, 2025

Frequently Asked Questions

Multimodal data refers to datasets that integrate multiple types of inputs – such as text, images, audio, and sensor data – into a unified system for analysis. Unstructured data, on the other hand, refers to data that lacks a predefined schema or organization, such as raw text documents, videos, or images. While unstructured data can exist in multiple formats, the key distinction is that multimodal systems are specifically designed to process and integrate different data types together, extracting insights from their relationships and interactions.

lakeFS applies Git-like version control to the data lake (object storage), treating a collection of heterogeneous files (images, text files, audio segments, Parquet files) as a single, unified dataset. This allows teams to use a commit to capture the exact state and alignment of all modalities simultaneously. If you link a product image, its review text, and its inventory record together, lakeFS ensures that all three are tracked, branched, and reverted cohesively by that single commit ID.
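The idea of one commit ID pinning every modality at once can be illustrated with a toy model. The sketch below is not how lakeFS computes commit IDs; it is a simplified, standard-library illustration that assumes a snapshot is a mapping from object path to bytes, and the paths and the `commit_id` helper are hypothetical.

```python
import hashlib


def object_hash(data: bytes) -> str:
    """Content hash of a single object (image, text file, record, ...)."""
    return hashlib.sha256(data).hexdigest()


def commit_id(snapshot: dict[str, bytes]) -> str:
    """Derive one ID from every (path, content hash) pair in the snapshot.

    Toy model only: changing any object in any modality changes the ID.
    """
    digest = hashlib.sha256()
    for path in sorted(snapshot):  # sort so insertion order does not matter
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(object_hash(snapshot[path]).encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical product dataset: image, review text, and inventory record.
v1 = {
    "images/sku-1.jpg": b"<jpeg bytes>",
    "reviews/sku-1.txt": b"Great product",
    "inventory/sku-1.json": b'{"stock": 12}',
}

# Editing only the review still yields a new ID for the whole dataset.
v2 = dict(v1)
v2["reviews/sku-1.txt"] = b"Great product, fast delivery"

c1 = commit_id(v1)
c2 = commit_id(v2)
```

Because the ID covers all three objects together, reverting to `c1` in this model means restoring the image, the review, and the inventory record as a unit rather than file by file.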

Organizations can enforce data governance principles, implement access controls, anonymize sensitive information, and ensure that data handling procedures comply with standards such as GDPR and HIPAA. Special attention should be paid to modality-specific requirements, such as biometric data in audio and video, cross-modal re-identification risks, and the challenges of exercising data subject rights across integrated multimodal systems.

lakeFS guarantees reproducibility by linking the exact, immutable version (the commit ID) of the entire multimodal dataset to a specific model training run. The same version of the medical images, patient records, and sensor logs used to train a diagnostic model can then be retrieved and tested months later, which removes any ambiguity about which data the model was trained on and keeps all modalities in the pipeline consistent.
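One way to record that link is to store the commit ID alongside the training run and address every input through it. The sketch below is a minimal illustration, assuming the `lakefs://<repository>/<ref>/<path>` URI form with a commit ID used as the ref; the repository name, commit ID, paths, and manifest layout are all hypothetical placeholders, and no lakeFS client is called.

```python
import json


def pinned_uri(repo: str, commit_id: str, path: str) -> str:
    """Address an object at a fixed commit rather than a moving branch."""
    return f"lakefs://{repo}/{commit_id}/{path.lstrip('/')}"


def build_run_manifest(run_name: str, repo: str, commit_id: str, paths: list[str]) -> dict:
    """Collect everything needed to re-read the same data later."""
    return {
        "run": run_name,
        "repository": repo,
        "data_commit": commit_id,
        "inputs": [pinned_uri(repo, commit_id, p) for p in paths],
    }


manifest = build_run_manifest(
    run_name="diagnostic-model-001",  # hypothetical run name
    repo="medical-data",              # hypothetical repository
    commit_id="a1b2c3d4",             # placeholder, not a real commit ID
    paths=["images/", "records/patients.parquet", "sensors/"],
)

# Serialize so the manifest can be stored next to the trained model.
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

Storing the manifest with the model artifacts means a later evaluation can read the same three inputs through the commit ID, regardless of what has since been committed to the branch.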

lakeFS offers comprehensive, tamper-proof audit logs that track all data operations (who accessed, changed, or merged which multimodal asset and when). When combined with Role-Based Access Control (RBAC) and metadata tagging, this provides the traceability needed to demonstrate compliance (e.g., proving that a sensitive dataset version was never used in production) and maintain transparency across teams.
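The compliance check mentioned above (showing that a sensitive dataset version was never used in production) amounts to a filter over audit records. The sketch below assumes a simplified, hypothetical record schema; it does not reflect the actual lakeFS audit log format, and the `AuditEvent` fields and `used_in` helper are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit record; real audit logs will have other fields."""
    user: str
    action: str       # e.g. "read", "commit", "merge"
    commit_id: str    # dataset version the action touched
    environment: str  # e.g. "staging", "production"


def used_in(events: list[AuditEvent], commit_id: str, environment: str) -> list[AuditEvent]:
    """Return every event where the given version was touched in the environment."""
    return [
        e for e in events
        if e.commit_id == commit_id and e.environment == environment
    ]


# Illustrative records only.
events = [
    AuditEvent("alice", "read", "c-sensitive", "staging"),
    AuditEvent("bob", "read", "c-public", "production"),
    AuditEvent("carol", "merge", "c-sensitive", "staging"),
]

# An empty result is the evidence that the version never reached production.
production_hits = used_in(events, "c-sensitive", "production")
```

The check is only as strong as the completeness and integrity of the underlying log, which is why the audit trail matters for this kind of claim.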
