TL;DR
- If your models interact with the physical world – cars, drones, cameras, robots – the ability to reproduce a specific moment in time is essential. It’s how you debug, certify, and ship safely.
- lakeFS brings Git-like mechanisms (branch, commit, merge, tag) to your object store, allowing you to freeze exact sensor states, replay incidents, and promote only what has passed validation – all without copying petabytes of data.
Software-only systems can be rerun from the source, but physics-bound workflows face a tougher challenge. Once a moment is gone, it’s gone. Sensor drift, hardware changes, and environmental uniqueness make it impossible to recreate the exact conditions. For audits, safety, and machine learning, you need full data provenance, including raw data, derived outputs, and context.
lakeFS makes time travel possible for data lakes. It lets you branch, snapshot, and track every element of your pipeline (raw payloads, configs, calibrations, and code). This allows you to replay rare edge cases and validate across evolving systems with confidence.
Keep reading to see:
- Why physical systems need time travel capabilities
- How lakeFS helps teams build time-travel-ready pipelines
- Why lakeFS offers a better approach to data versioning than labeling solutions
Why “Physics‑Bound” Workflows Need Data Versioning, Labeling, and Time Travel
In software-only systems, rerunning from the source is often enough to reproduce results. But physical systems operate in a world that doesn’t rewind. Once a moment passes – a near-miss at dusk, a specific rain pattern, or a subtle lens occlusion – it’s gone.
Hardware evolves silently: sensors drift, calibrations shift, and firmware updates alter behavior without changing a single line of code. These changes make reproducibility a challenge, especially when audits or safety reviews demand proof of exactly what data and logic led to a decision. Moreover, edge cases are rare and expensive to capture, so teams need to branch, protect, and replay them across evolving pipelines.
lakeFS addresses this challenge by turning your data lake into a versioned, immutable timeline. It enables zero-copy branching, atomic snapshots, and rich metadata tagging, so you can isolate and replay edge cases across changing pipelines.
True reproducibility requires more than raw data; you need the raw payloads, all derived artifacts, and the context (calibrations, configs, third‑party data, code refs, and seeds). With lakeFS, you can recreate any point in time with precision and confidence.
lakeFS: Practical Building Blocks for Time‑Travel‑Ready Pipelines
lakeFS enables time travel because it transforms your data lake into a versioned, immutable timeline. Here are the essential features of lakeFS for physics-bound projects:
Zero-Copy Branches & Atomic Commits
Create lightweight branches instantly over your existing object store, whether it’s S3, GCS, or Azure. Every commit is atomic, ensuring consistency even in distributed environments.
Tags for Golden Datasets & Production Cutovers
Mark specific commits as “golden” to lock in trusted datasets for production use, model training, or regulatory reporting. These tags serve as clean handoff points for downstream systems.
Commit Metadata for Full Context
Attach rich metadata to each commit: calibration parameters, configuration files, and external dependencies. This makes every version fully reproducible and auditable.
Isolated CI/CD for Data
Run Spark, Trino, Flink, or DBT jobs against isolated branches. Validate changes in a safe sandbox before merging to production. This opens the door to real CI/CD for data workflows, not just code.
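To make this concrete, here is a minimal PySpark sketch that reads from an isolated branch through the lakeFS S3 gateway and runs a validation check before anything is merged. The endpoint, credentials, repository (`sensor-lake`), branch name, and the sanity check itself are illustrative placeholders, not prescribed values.

```python
from pyspark.sql import SparkSession

# Point the S3A connector at the lakeFS S3 gateway instead of AWS S3.
# Endpoint, keys, repository, and branch names below are placeholders.
spark = (
    SparkSession.builder
    .appName("validate-on-lakefs-branch")
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# lakeFS paths are addressed as s3a://<repository>/<branch>/<path>
df = spark.read.parquet("s3a://sensor-lake/validation-2025-08-15/features/")

# Example validation gate: fail the job on this branch before any merge to production
assert df.filter("speed_mps < 0").count() == 0, "negative speeds found on validation branch"
```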
Rollback Safety
If a pipeline breaks or a model drifts, simply revert to a previous commit. Recovery takes seconds, not hours—no need to reprocess or manually restore data.
Example promotion pattern that works for physics-bound workflows
- Ingest → Snapshot – Capture a stable snapshot of raw or processed data at a specific point in time.
- Branch per Experiment or Incident – Create a dedicated branch for each model experiment, data anomaly, or operational incident.
- Validate in Isolation – Run metrics checks, bias audits, and safety validations without affecting production.
- Tag as Golden – Once validated, tag the branch (e.g. prod-2025-08-15) to mark it as ready for deployment.
- Merge and Monitor – Merge into production and monitor performance with full lineage and rollback capability.
- Immutable Audit Trail – Preserve the incident branch as-is for compliance, postmortems, or forensic analysis.
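The pattern above can be scripted end to end. Below is a hedged sketch using the high-level `lakefs` Python SDK; the repository, branch, and tag names are invented for illustration, and exact method signatures (the tag-creation call in particular) may differ between SDK versions.

```python
import lakefs

# Assumes credentials come from the default client configuration (env vars / lakectl config)
repo = lakefs.repository("sensor-lake")

# Branch per experiment or incident, cut from the ingestion snapshot on main
branch = repo.branch("exp-glare-fix").create(source_reference="main")

# Validate in isolation: run metrics/bias/safety checks, write their reports onto the
# branch, then commit the validated state with its context pinned as commit metadata
branch.commit(
    message="Validation passed: glare preprocessing fix",
    metadata={"calibration_bundle": "cal-2025-06-01", "pipeline_ref": "git:abc123"},
)

# Tag as golden, then merge into production; the incident branch itself stays untouched
repo.tag("prod-2025-08-15").create(branch)   # create() argument form may vary by SDK version
branch.merge_into(repo.branch("production"))
```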
Data Labeling for Time Travel
Data annotation supports time travel by labeling time-series data (such as physiological signals or system logs) so that specific temporal events or trends can be identified and used for machine learning training.
Data labeling tools provide interfaces to annotate data (images, videos, text, etc.) and manage labeling workflows (assignments, reviews, automation). Many leading solutions let users keep the raw images or data being labeled in cloud object storage (like Amazon S3), rather than uploading everything to the tool’s own storage. Since lakeFS operates on top of object stores, this capability is the key to integrating it with labeling tools.
Some common data labeling platforms include:
Labelbox –
A popular training data platform offering labeling UIs for images, text, etc., with a data Catalog for managing assets. Labelbox lets you keep data in your own cloud bucket via IAM Delegated Access, meaning “your data stays your data” – you can point Labelbox at an S3 bucket and label the content without moving it. Labelbox will use signed URLs to securely display and annotate the images, so you maintain control of the underlying files, making it feasible to use a lakeFS-backed bucket as the data source for Labelbox.
Dataloop –
An end-to-end AI data platform with a strong focus on computer vision. Dataloop provides an annotation toolset and touts advanced dataset management features. Notably, Dataloop can connect to external storage like S3 for managing dataset files. It even offers data versioning features similar to code version control—including “virtual” dataset versions that don’t duplicate storage, data branching to create subsets based on properties (with a timeline/history of changes), and sandboxed experimentation (clone, merge, compare versions).
Open-Source Tools (Label Studio, CVAT) –
For teams that require on-premises labeling, open-source frameworks like Label Studio or CVAT are common. These can be deployed with access to cloud storage or network file systems (e.g. using S3 or MinIO as the backend). They typically lack built-in dataset versioning but can be combined with external version control systems such as lakeFS.
Other Enterprise Solutions –
Platforms like SuperAnnotate and Scale AI also support large-scale labeling with enterprise features. Many provide storage integrations similar to Labelbox’s (e.g., SuperAnnotate allows storing datasets on-premises or in your cloud buckets, rather than only on their servers). Interestingly, some of these platforms advertise dataset version control as part of their feature set. This again highlights that managing versions of data is a recognized need in data labeling workflows – though the implementations are usually specific to each platform.
Why Object Storage Matters In Data Labeling
Most labeling tools dealing with image/video data rely on object stores for scalability. Object stores (like S3) handle millions of objects and heavy I/O, which is essential when datasets contain tens of millions of images or hours of video. By using object storage integrations, labeling tools can stream data for annotation without copying it into a proprietary database.
lakeFS is designed to work on top of these same storage systems, acting as a version control layer. So, if a labeling tool can work with your cloud bucket, it can likely work with a lakeFS repository (since lakeFS presents an S3-compatible endpoint). The images or data can reside in a lakeFS branch, and the labeling tool will read/write them as if it were a normal bucket.
This setup establishes a foundation for seamless integration: labelers continue using the tool’s user interface, while all data changes (new labels, new images, etc.) are automatically stored in versioned storage.
Data Labeling Tools vs. lakeFS: Key Differences in Data Version Control Capabilities
Many data labeling platforms offer some level of dataset management, but how do those capabilities compare to a dedicated version control system like lakeFS? Let’s examine a few key differences and complementary aspects:
Scope of Version Control
A labeling tool’s “versioning” (if available) is usually confined to the annotated data and the dataset subsets managed within that platform. For example, Dataloop can version the labeled dataset in its system, allowing the cloning or merging of annotated data slices.
However, lakeFS can version all data in the pipeline – not just the images and labels, but also related structured data, model predictions, augmentation code outputs, etc. lakeFS treats the data lake holistically. This capability is crucial for heterogeneous data scenarios where you have unstructured data (e.g., images, sensor logs) and structured data (e.g., databases of metadata) that need to stay in sync.
A labeling tool might manage the images and their labels, but it won’t track your CSV of patient info or whether your ML features were updated in tandem. lakeFS will track any file in the repository, enabling consistent snapshots of all the data.
In complex projects (say, autonomous driving), you might have images, LiDAR point clouds, and scenario metadata; lakeFS can version all of it together, whereas the labeling platform might only cover the images and some label files.
Branching and Experimentation Workflows
Some advanced labeling platforms have introduced branching concepts for datasets – for example, creating a “branch” with only certain classes or a training/validation split. But these are often limited to the data within their system. lakeFS provides full Git-like branching to the underlying storage. This means you can spin up an independent copy of your entire dataset as a branch in seconds (no physical copy).
Labelers or data scientists may, for instance, work on a branch to test new labeling guidelines or add a new set of images without affecting the main dataset. Multiple labeling efforts can proceed in parallel on different branches, a level of isolation difficult to achieve in traditional labeling tools. Once the new annotations are validated, you’d merge the branch back to main in lakeFS (just like merging a feature branch in Git).
Data branching at scale is a core strength of lakeFS, ensuring experiments or new data additions don’t disrupt others until ready. Labeling tools typically don’t offer multi-branch concurrent workflows in this manner; at best, they might let you duplicate a dataset (which could be expensive in storage or clunky to keep in sync).
History and Lineage of Annotations
Labeling tools often record annotation metadata, like who labeled an item and maybe a history of changes on a per-item basis. However, these tools may not maintain a comprehensive version history that reflects the entire state of the dataset over time.
lakeFS treats each commit as a point-in-time snapshot of all files, providing a global version history. This provenance tracking is incredibly important when labels evolve. For example, if labeling guidelines change or errors are corrected, a system like lakeFS can maintain the old version and the new version of the labels, making it clear what changed.
lakeFS enables end-to-end provenance; in contrast, if a labeling tool without proper dataset versioning is used, once you update a label, the previous state might be lost (unless you exported backups manually). Some platforms might let you export a “dataset version” at a point in time, but doing this consistently and for all data modalities is easier if you use a dedicated version control system.
Collaboration and Access Control
Labeling platforms excel at collaborative annotation: assigning tasks to multiple labelers, reviewing and approving labels, etc. They also often provide user roles and permissions for who can view or label data.
lakeFS’s role is different: it adds collaboration at the data infrastructure level – multiple users can safely collaborate on a data repository via branches, and fine-grained access controls can restrict which branches or paths a user can access. lakeFS can serve as the backend where, say, the “raw” data branch is read-only for annotators (to ensure they only label approved data), while a “staging” branch is where new data is ingested.
Using lakeFS actions or CI/CD, you could even automate processes. For instance, when a labeler finishes annotating a batch on a branch, it triggers a merge into the main dataset and notifies the training pipeline. The key is that lakeFS offers programmatic, reproducible data ops (hooks, API integration) that complement the human-centric collaboration of the labeling tool.
Reproducibility for ML Training
Once data is labeled, training a model should be reproducible and auditable. If you only rely on a labeling tool, you might export a dataset (images + labels) as a snapshot for training. But if you later realize that the model has issues, can you easily retrieve the exact same data again, especially if the labeling project has continued to evolve?
With lakeFS, every commit or tag can serve as a reference for training data. For example, you can tag dataset-v1 at the moment you train your model. Even after more data is labeled in the future, dataset-v1 remains available for audits or rollbacks. Industry best practices recommend using data versioning tools like lakeFS in ML pipelines:
“Just as code versioning enables tracking changes, data versioning tools like DVC or lakeFS allow you to snapshot and manage different versions of datasets (e.g., training v1, v2 after new data added, etc.). Maintaining a clear history of changes improves data integrity and reproducibility. If a model’s performance drops after an update, you can pinpoint which data version or change might have caused it.”
In high-stakes domains, this capability is not optional. It’s necessary for building trust and achieving compliance.
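As a concrete illustration, a training job can read directly from such a tag through the lakeFS S3 gateway and record the reference alongside the model artifacts. This is a minimal sketch using boto3; the endpoint, credentials, and repository name (`training-data`) are placeholders, while `dataset-v1` is the tag from the example above.

```python
import boto3

# The lakeFS S3 gateway looks like a regular S3 endpoint: the "bucket" is the repository
# and the first path segment is a ref (branch, tag, or commit ID). Values are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<LAKEFS_ACCESS_KEY>",
    aws_secret_access_key="<LAKEFS_SECRET_KEY>",
)

# List training data pinned to the dataset-v1 tag -- this view stays stable even
# after the labeling project moves on.
resp = s3.list_objects_v2(Bucket="training-data", Prefix="dataset-v1/images/")
training_keys = [obj["Key"] for obj in resp.get("Contents", [])]

# Record the exact data reference next to the model artifacts for later audits
model_card = {"data_ref": "lakefs://training-data/dataset-v1", "num_objects": len(training_keys)}
```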
Scaling and Performance
As the scale of data grows, the differences become more pronounced. Labeling platforms are primarily designed to manage the labeling process, but their dataset versioning and project duplication features may not be suitable for extreme scales, such as billions of objects, and could involve significant data copying behind the scenes.
lakeFS is designed specifically for large-scale data lakes, leveraging the underlying object store’s scalability. For instance, creating a branch in lakeFS is a constant-time operation no matter how many files the repository contains, because it doesn’t copy data – it just creates a new branch pointer (like a lightweight metadata reference). This means you can have many parallel versions of a massive dataset without extra cost.
By contrast, if a labeling tool required physically cloning data for a “version,” it would quickly become infeasible as data sizes explode. Moreover, lakeFS’s server can handle a high throughput of read/write operations using object store APIs, so it can serve data to the labeling tool efficiently.
In practice, you can integrate lakeFS such that the labeling tool experiences minimal difference – it’s labeling data from an S3 API, unaware that lakeFS is intercepting to provide version control.
Summary
Labeling tools might offer some version control or dataset management features (especially newer platforms that advertise “data curation” or “dataset versioning” capabilities). These features are valuable but generally limited to the scope of the annotation process.
In contrast, lakeFS provides deeper version control at the storage level, which is broader and often more rigorous. The good news is that they are not mutually exclusive – in fact, using them together yields a powerful combination.
Integrating lakeFS with Labeling Workflows
How can teams practically combine lakeFS with a data labeling tool? There are a few integration patterns, which we’ll illustrate with examples:
1. lakeFS as the Storage Backend for the Labeling Tool
In this approach, the labeling tool reads and writes data directly from a lakeFS repository (via S3 API or equivalent). This is possible if the tool supports custom S3 endpoints or if you configure the tool to use your cloud storage, which lakeFS manages.
For example, with Labelbox you can set up an S3 bucket integration. Instead of pointing it to a raw S3 bucket, you could point it to the lakeFS endpoint for your repository (which internally forwards to the real bucket). The images that require labeling would be stored in a lakeFS branch (say the main branch, under a path like s3://<repo>/main/unlabeled/ for raw images). Labelbox will fetch those images via lakeFS, and when labelers add annotations or perhaps new derived image files (e.g. annotated masks), those too can be written back to the lakeFS branch.
Essentially, lakeFS acts as a versioned object store – the labeling tool doesn’t need to know about the versioning; it just does its job storing files. Meanwhile, every change is tracked in lakeFS commits. One could commit after a labeling session or even commit each new label file as it arrives.
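To illustrate what “the labeling tool just sees a bucket” looks like from the data side, here is a hedged sketch: objects are read and written through the lakeFS S3 gateway with an ordinary S3 client, and a commit is cut once the session ends. The endpoint, credentials, repository (`vision-repo`), and paths are placeholders, and the final commit call assumes the high-level `lakefs` SDK.

```python
import boto3
import lakefs

# Any S3 client (including the one embedded in a labeling tool) can target the lakeFS gateway
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # placeholder lakeFS endpoint
    aws_access_key_id="<LAKEFS_ACCESS_KEY>",
    aws_secret_access_key="<LAKEFS_SECRET_KEY>",
)

# The repository is exposed as a bucket; the first path segment is the branch
s3.download_file("vision-repo", "main/unlabeled/frame_000123.jpg", "/tmp/frame_000123.jpg")

# A label artifact written back to the same branch path
s3.upload_file("/tmp/frame_000123_mask.png", "vision-repo", "main/labels/frame_000123_mask.png")

# Cut a commit when the labeling session is done, turning the staged writes into a version
lakefs.repository("vision-repo").branch("main").commit(message="Labeling session 2025-08-15")
```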
2. Importing Data from Labeling Tool into lakeFS (post-labeling)
If direct integration is not possible, another pattern is export-and-import. Teams can perform labeling in the tool as usual (perhaps the tool stores data internally or in its cloud during labeling), then export the labeled dataset and import it into lakeFS for version control. lakeFS has convenient APIs and even a UI for importing data from an object store path into a repository.
For example, with Labelbox, you can export annotations and maybe the referenced images to an S3 location, then use lakeFS to import that path into a repository commit. This gives you an immutable snapshot of the labeled data at a specific point in time.
Every time you complete a new round of labeling, you import it as a new commit (or into a branch). Over time, you accumulate a history of dataset versions in lakeFS (v1, v2, … corresponding to labeling rounds).
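Here is a hedged sketch of this export-and-import loop, assuming the high-level `lakefs` Python SDK (upload and commit signatures may differ slightly by version); the repository, file names, and metadata keys are invented for illustration.

```python
import lakefs

main = lakefs.repository("vision-repo").branch("main")

# Upload the annotations exported from the labeling tool into the repository
with open("labelbox_export_round_7.json", "rb") as f:
    main.object("labels/round-7/annotations.json").upload(data=f.read())

# Each labeling round becomes one commit -- an immutable dataset version (v1, v2, ...)
main.commit(
    message="Import labeling round 7 from Labelbox",
    metadata={"labeling_round": "7", "export_date": "2025-08-15"},
)
```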
3. Synchronizing Updates and Feedback
Integration can also be two-way. Suppose you find a mistake in the labels after some model training – for example, a certain object was consistently mislabeled. With lakeFS, you could branch the dataset, fix those labels (perhaps using a script or even manually editing annotation files), and test your model. If the fix works, you’d want those corrections to propagate back to the labeling tool’s database (so its UI reflects the latest truth).
Many labeling platforms have APIs to update or import corrected annotations. You can script an integration where a lakeFS commit (with corrected labels) triggers a sync to the labeling tool via API, so its internal records update. Conversely, if labelers update something in the tool, a webhook could notify a process to fetch that change and commit to lakeFS.
Designing such a feedback loop ensures your single source of truth (lakeFS repository) stays aligned with the labeling tool’s view. In practice, this is advanced usage, but the point is that lakeFS’s open APIs and the APIs of labeling tools make it possible to automate consistency between systems.
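As one possible shape for such a loop, the snippet below sketches a tiny webhook receiver that commits incoming label updates to lakeFS. The web framework choice, payload fields, and endpoint path are assumptions for illustration only and do not reflect any specific labeling tool’s API.

```python
import lakefs
from flask import Flask, request

app = Flask(__name__)
repo = lakefs.repository("vision-repo")

@app.post("/labeling-webhook")               # hypothetical endpoint the labeling tool calls
def on_label_update():
    payload = request.get_json()             # payload shape is an assumption, not a real schema
    branch = repo.branch("labeling_experiment")

    # Write the updated annotation where the rest of the pipeline expects it
    branch.object(f"labels/{payload['item_id']}.json").upload(data=payload["annotation"].encode())

    # One commit per delivery keeps things simple; batching (see best practices below) is usually better
    branch.commit(message=f"Label update for {payload['item_id']} from labeling tool")
    return {"status": "committed"}, 200
```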
4. Using Branches for Labeling Experiments
Another integration idea is to leverage lakeFS branches directly in the labeling workflow. For instance, you might have a production branch containing the currently trusted dataset and a labeling_experiment branch where annotators are currently adding new data or trying a different annotation style.
The labeling tool could be configured (via different credentials or endpoints) to point to the labeling_experiment branch for its data I/O. Labelers work there without fear of messing up production data.
Once everything is satisfactory (perhaps after QA and model testing), you merge labeling_experiment into production in lakeFS. This one merge operation might publish hundreds of new labels to production in one atomic action – much safer than manually replacing files in a bucket. If something turned out wrong, you could revert the merge commit to undo the changes.
This branch and merge workflow, familiar from software development, can bring order and control to what is otherwise a potentially chaotic process of updating datasets.
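A minimal sketch of that flow with the high-level `lakefs` SDK, using the branch names from the example above (method signatures may vary slightly by version):

```python
import lakefs

repo = lakefs.repository("vision-repo")

# Annotators' tooling is pointed at this branch, e.g. s3://vision-repo/labeling_experiment/
repo.branch("labeling_experiment").create(source_reference="production")

# ... annotation, QA, and model testing write changes to the experiment branch ...

# Publishing is one atomic merge rather than file-by-file copies into a shared bucket
repo.branch("labeling_experiment").merge_into(repo.branch("production"))
```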
Best Practices for Integrating lakeFS with Your Labeling Workflow
| Best Practice | Description |
|---|---|
| Consistent ID or Naming | Ensure that data items (images, etc.) have consistent identifiers between lakeFS and the labeling tool. Often, labeling tools use an internal ID or the file name. Using a clear naming scheme for files in lakeFS (or maintaining a mapping) will help trace items across systems. |
| Automation | Use the APIs/SDKs. For example, Labelbox provides a Python SDK to programmatically import data and export annotations. lakeFS has an API/CLI for commits and branches. Writing a small script or using a Jupyter notebook (as shown in lakeFS sample repositories) can automate the import of labeled data into a new commit, etc. |
| Batching | It might be inefficient to commit every single label change. Instead, accumulate a batch of changes (maybe wait until a labeling job or project is complete) and then commit a bulk update to lakeFS. This corresponds to how one might accumulate file changes and commit once with a message like “Labeled 500 new images for Class X.” |
| Metadata and Schema | Keep label metadata versioned as well. If the labeling ontology (the set of labels or classes) changes, treat that as a versioned artifact too. lakeFS can version JSON/YAML config files (e.g. a label schema). This way, you know which version of the label definitions was in effect for any given labeled dataset commit. This is vital if labels get renamed or added over time. |
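Tying the last two rows together, here is a hedged sketch of a batched commit that also versions the label schema; the file names and metadata keys are illustrative, and the calls assume the high-level `lakefs` SDK.

```python
import lakefs

branch = lakefs.repository("vision-repo").branch("main")

# Version the ontology itself alongside the labels it governs
with open("label_schema_v3.yaml", "rb") as f:
    branch.object("schemas/label_schema.yaml").upload(data=f.read())

# One bulk commit per completed labeling job, not one commit per label change
branch.commit(
    message="Labeled 500 new images for Class X",
    metadata={"label_schema_version": "v3", "labeling_job": "job-2025-08-15"},
)
```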
By integrating at the storage level, lakeFS does not interfere with the user experience of the labeling tool. Annotators continue using the same interface to draw boxes or segment images. But the data they produce is captured in a disciplined way.
The organization benefits from having an audit trail and the ability to roll back or compare datasets as they evolve. In regulated industries, this can help with compliance, proving exactly what data a model was trained on, even if the labeling platform’s state has moved on.
Time Travel with lakeFS: 6 Examples Across Various Sectors
1. Autonomous Driving & Robotics
High-dimensional sensor data – video, lidar, radar, and IMU – along with map tiles and scenario labels, is always evolving. To support safety investigations, regression analysis, and homologation, it’s essential to reconstruct the exact input state with precision.
A typical lakeFS pattern for this use case would look as follows:
- Cut a snapshot at ingestion time for each log set: /raw/2025-06-14/run-8421 → commit abc123
- Branch per investigation: incident-2025-06-14-8421-glare
- Freeze and link calibration bundles, HD map versions, and labeling guidelines to the branch (commit metadata)
- Run training/validation pipelines against the branch
- Promote the passing commit back to a stable mainline via merge/tag
Example: Reproducing a disengagement
Imagine that the navigation stack regresses at low sun angles. The team branches the exact versions of sensor capture, calibration, and map tiles. Next, they replay the pipeline – perception to tracking to planner – compare metrics across commits, fix the glare preprocessing, and merge. No re-ingestion, no guesswork.
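Here is a hedged sketch of what that incident branch might look like in code, assuming the high-level `lakefs` Python SDK; the repository name, references, and metadata values echo the pattern above and are illustrative only.

```python
import lakefs

repo = lakefs.repository("av-logs")

# Branch per investigation, cut from the ingestion-time snapshot of the run (commit abc123)
incident = repo.branch("incident-2025-06-14-8421-glare").create(source_reference="abc123")

# ... replay perception -> tracking -> planner, writing derived outputs (metrics,
# annotated frames) back onto the incident branch ...

# Pin the context that made this moment what it was
incident.commit(
    message="Replay of run-8421 disengagement at low sun angle",
    metadata={
        "calibration_bundle": "cal-2025-06-01",
        "hd_map_version": "tiles-v42",
        "labeling_guidelines": "guide-v7",
        "pipeline_ref": "git:deadbeef",
    },
)
```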
Here are the components the team needs to version for autonomy:
- Raw sensor logs (video frames, lidar packets), time sync, and frame indices
- Preprocessing outputs (undistorted frames, range images, BEV transforms)
- Camera/lidar calibrations, extrinsics/intrinsics, firmware versions
- HD map tiles and localization references used at train/test time
- Label sets and labeling policies/guides in force at that time
- Pipeline code refs/configs, container digests, random seeds
- Model artifacts and evaluation metrics, tied to the same commit
2. Drones & Aerial Autonomy
Flight logs integrate EO/IR video, lidar, GPS/IMU data, gimbal angles, and airspace constraints. For BVLOS approvals, incident reports, and customer SLAs, precise reconstruction is often required.
Example: Corridor‑mapping re‑flight
A mapping provider branches the client’s mission data, DSM/orthos, and lens calibration, then re-evaluates the stitching parameters. For future audits, a single commit reveals the exact assets and configurations used to generate the deliverable.
Components the team needs to version:
- Raw telemetry and sensor payloads, camera/gimbal metadata
- Flight plans, geofences, and NOTAM overlays used in planning
- Photogrammetry/SLAM intermediate artifacts
- Versioned deliverables (DSM/DTM/orthos/point clouds) tied to source commit
3. Computer Vision in the Physical World
Vision models are very sensitive to lighting conditions, camera placement, optical characteristics, and labeling instructions – especially in domains like inspection, retail, warehousing, and agriculture.
Example: Production line false rejects
Imagine a new lens hood that changes glare. The team rolls back to the last golden dataset, branches the flagged window, re‑labels with updated guidelines, and validates a preprocessing change. The winning branch is merged and tagged for roll‑out.
Versioning essentials are as follows:
- Camera stream captures (or frame dumps) with timing metadata
- Preprocessing outputs (crops, masks, augmentations)
- Label sets with versioned instructions and rubrics
- Site‑specific configs (exposure, ROI) and hardware part numbers
4. Manufacturing & Industrial IoT
When machines fail, teams need the exact state of sensor buffers, controller versions, and set points in order to reproduce conditions and test fixes offline.
Example: Predictive maintenance
Suppose a plant branches the 72‑hour window of vibration and temperature streams, controller logs, and feature store outputs. It then compares feature drift across commits, adjusts filtering, and promotes a corrected feature pipeline without overwriting the incident record.
Here’s everything the team needs to version:
- Raw time‑series from PLC/SCADA and derived features
- Controller firmware/config snapshots
- Maintenance actions and operator notes (commit metadata)
- Model binaries and thresholds tied to a specific data commit
5. Energy, Utilities & Climate
Grid events, satellite passes, and extreme weather can’t be repeated. This is why investigations and models must anchor to immutable data cuts.
Example: Storm‑driven load forecast miss
Operations creates branches from meter and SCADA data windows, integrates weather reanalysis, and captures topology snapshots. They then replay the forecasting model to identify where it underperformed, quantify the errors, and tag the corrective updates for future storm scenarios. Every step is linked to specific, named commits for full traceability.
6. Life Sciences & Healthcare
From imaging to omics, trial phases and protocol changes require strict data lineage and auditability.
Example: Imaging + clinical metadata
Imagine a research group links DICOM series, derived segmentations, and cohort definitions through synchronized commits. This setup ensures that reviewers can precisely reproduce figures, even as the dataset evolves and expands over time.
Beyond the physical world
Versioning isn’t just for the physical world. It’s a game-changer across finance, media, and SaaS. A/B tests, backtests, and fraud detection models all gain precision and agility through branch-and-merge workflows. But when digital systems intersect with the physical world – where atoms meet bits – versioning shifts from merely useful to absolutely essential.
Anti‑patterns to avoid
Copying “folders” of petabytes to test a hypothesis
This is a common practice across teams that don’t have a zero-copy solution in place. By copying datasets every time they want to run an experiment, teams greatly increase storage costs. Other common issues are slow iteration cycles and a lack of clear lineage. Without version control, hypotheses become very difficult to reproduce or audit.
Mutating shared buckets while debugging an incident
Live edits to shared data stores pose the risk of corrupting production pipelines, masking root causes, and making reliable postmortem analysis impossible.
Treating labels, calibrations, and map tiles as “out-of-band” from data
When critical context lives outside the data versioning system, reproducibility suffers. Models trained on “invisible” dependencies can’t be trusted or validated.
Promoting models without a commit ID for their training/eval data
This anti-pattern breaks data traceability. If performance degrades or bias emerges, teams can’t pinpoint the exact data state that shaped the model.
Conclusion: Version the Moment, Not Just the Files
In physics‑bound systems, you’re not only managing data but also the truth about a moment in time. lakeFS lets you checkpoint that instant, run experiments on branches without copies, and merge only what you trust. Every commit is a replayable story: the raw signals, the derived artifacts, and the exact context.


