

Tal Sofer

Tal Sofer is a product manager at Treeverse, the company behind lakeFS.

Last updated on October 23, 2025

In today’s data-driven world, businesses don’t just rely on data – they are built on it. But as data infrastructure sprawls across on-prem systems, multiple cloud providers, and third-party platforms, a new challenge is taking center stage: distributed data management.

It’s a silent bottleneck with loud consequences.

Challenges in Distributed Data Management 

Managing data across clouds, regions, and on-prem systems isn’t just a technical challenge – it affects how teams collaborate, govern, and access their data day-to-day. As organizations scale their infrastructure, the complexity of managing data scales with it. The result is a set of operational and organizational challenges that slow down innovation and increase risk.

In this article, we’ll break down the key challenges that make distributed data so difficult to work with, and why they demand more than manual fixes.

The Data Is Everywhere – And That’s A Problem

Modern data teams need fast, reliable access to datasets – but when data is scattered across cloud buckets, on-prem storage, and partner networks, the process slows to a crawl. Engineers and analysts spend hours tracking down the right version, requesting permissions, validating formats, or just trying to figure out where a dataset actually lives.

This isn’t just an inconvenience – it directly affects time-to-data, putting critical project deadlines at risk.

Governance Breaks Across Environments

Data governance is only as strong as its weakest link. In distributed environments, managing access control, audit trails, and compliance rules across multiple storage backends is inconsistent and error-prone.

Security teams struggle to enforce policies uniformly, and compliance becomes an ongoing game of catch-up rather than a controlled process. The complexity grows with every new system, region, or cloud account added to the mix.

Distributed Data = Inconsistent Data

One of the most frustrating and hard-to-detect problems of distributed data is inconsistency. Teams may assume they’re working off the same dataset when, in reality, the data has been copied, transformed, or updated differently across locations. The result? Conflicting outputs, unreliable models, and lost trust in the data itself.
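To make the problem concrete, here is a minimal sketch of the kind of drift check teams end up writing: it compares object checksums across two copies of the “same” dataset in two S3-compatible buckets. It assumes boto3; the bucket names and prefix are hypothetical, and ETags are only a reliable equality signal for non-multipart uploads – exactly the kind of caveat these scripts accumulate.

```python
# Minimal drift check between two copies of the "same" dataset.
# Bucket names and prefix are hypothetical.
import boto3

s3 = boto3.client("s3")

def list_etags(bucket: str, prefix: str) -> dict[str, str]:
    """Map object key (relative to prefix) to its ETag."""
    etags = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            etags[obj["Key"][len(prefix):]] = obj["ETag"]
    return etags

primary = list_etags("analytics-us-east", "datasets/customers/")
replica = list_etags("analytics-eu-west", "datasets/customers/")

missing = primary.keys() - replica.keys()
extra = replica.keys() - primary.keys()
changed = {k for k in primary.keys() & replica.keys() if primary[k] != replica[k]}
print(f"missing in replica: {len(missing)}, extra: {len(extra)}, changed: {len(changed)}")
```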

Copying Data Isn’t Just Risky – It’s Expensive

Beyond the operational headaches, there’s a very real financial cost. Moving data between environments, especially in hybrid or multi-cloud architectures, can trigger cloud egress fees, increase storage usage, and demand significant engineering effort to maintain. What starts as a simple copy job quickly turns into a recurring expense that scales with the business. You don’t just lose time – you lose money.
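The math is easy to underestimate. Here is a back-of-the-envelope sketch; every number in it is illustrative, so check your provider’s actual pricing:

```python
# Recurring egress cost of re-copying a dataset across clouds (illustrative).
dataset_gb = 5_000        # 5 TB dataset (hypothetical)
copies_per_month = 4      # weekly re-sync
egress_per_gb = 0.09      # illustrative internet egress rate, USD

monthly = dataset_gb * copies_per_month * egress_per_gb
print(f"~${monthly:,.0f}/month, ~${monthly * 12:,.0f}/year")  # ~$1,800/month, ~$21,600/year
```

And that figure covers egress alone – duplicated storage and the engineering time spent keeping the copies in sync come on top.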

Collaboration Gets Lost in Translation

In distributed environments, cross-team data collaboration suffers – not because teams don’t want to work together, but because the infrastructure makes it hard. Accessing a dataset stored in another cloud or on-prem system often requires waiting for approvals, navigating different access controls, or physically copying the data over. These data access delays slow down projects and introduce friction into otherwise simple workflows.

Even discovering existing datasets can be a challenge. Without a unified view across storage systems, teams often aren’t aware of what data already exists, leading to duplicated effort and missed opportunities. And when teams attempt to collaborate using the “same” dataset across different environments, they frequently end up building their own processes just to keep datasets in sync – replicating data manually, managing custom sync jobs, and maintaining separate versions across clouds or regions.

The result is a fragile, high-maintenance model for collaboration that doesn’t scale as data volume and team count grow.

ML Workflows Are Especially Vulnerable

Machine learning pipelines are highly sensitive to their input data. In many organizations, training data is stored in one system, labeled in another, and referenced in a third. Each step introduces a risk of mismatch.

Even minor inconsistencies, like a missing file or outdated record, can lead to hours of debugging or worse: incorrect model behavior. For MLOps teams, managing this complexity manually is a recurring operational headache.

Manual Solutions Don’t Scale

To make distributed data usable, many teams resort to short-term fixes: copying datasets, syncing buckets, writing scripts to stitch things together. These workarounds can work – until they don’t.
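A typical workaround looks something like the sketch below (bucket names are hypothetical): a scheduled copy loop from one bucket to another. It works on the happy path, but there is no locking, no verification, and no handling of deletes – the fragility shows up as soon as anything changes mid-sync.

```python
# A typical one-off "sync" workaround: copy anything newer from SRC to DST.
import boto3

s3 = boto3.client("s3")
SRC, DST = "prod-data-us", "ml-data-eu"  # hypothetical buckets

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC, Prefix="features/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        try:
            head = s3.head_object(Bucket=DST, Key=key)
            if head["LastModified"] >= obj["LastModified"]:
                continue  # destination is already up to date
        except s3.exceptions.ClientError:
            pass  # object not present in destination yet
        s3.copy({"Bucket": SRC, "Key": key}, DST, key)
```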

As environments grow more complex, so does the overhead. DataOps, MLOps, and data platform teams often spend hours maintaining a fragmented system of workarounds: writing and debugging sync scripts, manually tracking dataset locations and versions, and coordinating data transfers across environments. All of this wastes time that could be spent building pipelines and shipping models.

Expert Tip: Managing Multiple Storage Systems with lakeFS

Oz Katz Co-founder & CTO

Oz Katz is the CTO and Co-founder of lakeFS, an open source platform that delivers resilience and manageability to object-storage based data lakes. Oz engineered and maintained petabyte-scale data infrastructure at analytics giant SimilarWeb, which he joined after the acquisition of Swayy.

In my experience, lakeFS’s multi-storage backend support helps overcome many of these challenges, giving you:

Unified Data Access

Interact with data across AWS, Azure, GCP, and any S3-compatible environment through a single API and namespace. With lakefs:// as the universal interface, teams can use the same tools and processes regardless of where the data resides, reducing friction and enabling smooth, consistent workflows.
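Because lakeFS also exposes an S3-compatible endpoint, existing S3 tooling can read through that single namespace without changes. A minimal sketch with boto3 – the endpoint, credentials, repository, and object path below are hypothetical:

```python
# Read a logical dataset through lakeFS regardless of where the underlying
# objects physically live. Endpoint and credentials are hypothetical.
import boto3

lakefs = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # lakeFS S3-compatible gateway
    aws_access_key_id="AKIA...",                # lakeFS access key
    aws_secret_access_key="...",
)

# Keys follow <ref>/<object-path> within the repository "bucket"; the ref
# can be a branch name or an immutable commit ID.
obj = lakefs.get_object(Bucket="analytics", Key="main/datasets/customers/part-0000.parquet")
data = obj["Body"].read()
```

Pinning the ref to an immutable commit ID instead of a branch name is what makes reads reproducible for downstream consumers such as training jobs.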

Centralized Governance

Apply access rules, security policies, and audit mechanisms uniformly across all cloud environments using lakeFS RBAC and hook capabilities; no need to maintain siloed governance setups in each platform.
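As one illustration, a pre-merge webhook can enforce the same policy no matter which cloud backs the repository: lakeFS calls an HTTP endpoint you define (wired up via a YAML action file under _lakefs_actions/ in the repository), and a non-2xx response blocks the merge. Below is a minimal sketch of such a receiver in Flask; the policy and the exact payload field names are assumptions for illustration.

```python
# Minimal webhook receiver for a lakeFS pre-merge action (sketch).
# Any non-2xx response blocks the merge. Policy and payload field
# names below are assumptions for illustration.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/pre-merge", methods=["POST"])
def pre_merge():
    event = request.get_json(force=True)
    # Hypothetical policy: only allow merges into main from release-* branches.
    if event.get("branch_id") == "main" and not str(event.get("source_ref", "")).startswith("release-"):
        return "merges to main must come from a release-* branch", 400
    return "ok", 200

if __name__ == "__main__":
    app.run(port=8080)
```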

Lineage Across Storage Systems

Maintain end-to-end visibility into data transformations and movement, even when your pipelines span multiple cloud and storage systems.

Lower Operational Overhead

Fewer lakeFS deployments mean less complexity and upkeep. Consolidating your control layer leads to a leaner architecture and reduces the operational load on your team.

Distributed Complexity Amplifies Existing Problems

Even in centralized environments, data management is a tough problem to solve. Tasks like versioning, access control, and data quality are already complex – but distributed systems add another layer of difficulty.

The challenges we’ve just explored don’t disappear in a distributed setup – they intensify. What might be manageable in a single environment quickly becomes overwhelming when multiplied across clouds, regions, and teams. Operational friction increases, governance becomes harder to enforce, and simple workflows demand more coordination and oversight.

Distributed data doesn’t just introduce new problems – it amplifies the existing ones, turning everyday data tasks into infrastructure projects.

What Effective Distributed Data Management Looks Like

In this article, we looked at the realities of managing data across clouds, regions, and systems – and why today’s approaches fall short.

It’s not a question of whether distributed data needs to be managed, but how to do it in a way that’s scalable, reliable, and efficient.

The right solution won’t eliminate complexity, but it will abstract it away – making distributed environments feel unified and predictable. Done well, it will improve critical metrics like:

  • Time-to-data: How quickly can someone get access to a dataset they need?
  • Operational overhead: How much manual work is needed to keep things in sync?
  • Governance consistency: Are policies enforceable across environments?
  • Cost control: Can we reduce duplication and avoid unnecessary data movement?
  • Data quality: Is the data consistent, reliable, and fit for use across all environments?

Teams that solve this well will move faster, collaborate with confidence, and turn their data into a true competitive advantage. Those that don’t will stay busy managing complexity – while others move ahead.
