
Why Data Is Killing Your AI Project and What to Do About It

Iddo Avneri


Published on November 4, 2025

This article is a summary of a joint session, Why Data Is Killing Your AI Project and What to Do About It, featuring a panel of experts including John Darrington, Senior Lead Software Engineer at Exostellar, and Ion Koutsouris, Software Engineer (Data & ML).


The past year saw companies and teams weather a rollercoaster of highs and lows, riding the hype cycle of AI excitement followed by disillusionment. Most teams are continually oscillating between the promise of new possibilities and the reality of persistent challenges AI brings.

The MIT report State of AI in Business 2025 found that 95% of AI pilots are failing. That is a sobering assessment.

What challenges do teams encounter on their path to AI operationalization? Gartner found that one of the biggest hurdles is data that is not AI-ready: in its survey, 57% of companies said their data is not ready for AI. Another survey, from EY, called out data infrastructure as the number one roadblock for AI in the enterprise: 83% of respondents said AI adoption would be faster if they had a stronger data infrastructure.

What can teams do to deliver AI projects at the speed and quality they need? If data is a blocker, data version control is one of the key enablers. Here’s why.

What is killing enterprise AI projects?


Low team velocity due to manual data preparation

If you look at the AI/MLOps lifecycle in its entirety – from data preparation to training to running the model and all the steps in between – you’ll see a big velocity problem. Data preparation takes a massive amount of time. Data practitioners likely spend 60–80% of their time on this before they even begin AI and model development. 

Collaboration

How do you ensure that the part of the data you work on doesn’t interfere with your colleagues’ data? How do you prevent data from being copied multiple times across the team, putting control and compliance at risk? Data presents several collaboration challenges that severely impact AI projects.

Data quality 

Poor-quality data leads to poor-quality models. Quality issues typically surface at a later stage of the process – but the later you discover them, the harder they are to fix. In the worst case, teams discover issues only once the model is live. Quality encompasses not only source data but also metadata, so it’s essential that both remain reliable and that context isn’t lost.

Reproducibility

Another related issue is reproducibility. How do you know which specific dataset powered which model? How can you pull that same complete dataset again if you need to, which is a common requirement for governance? You want to track ownership, track access, and ensure data hygiene. When data management is manual, it delays progress and hinders project success, likely contributing to the abandonment rate mentioned by MIT.

Compliance

Lastly, compliance – the thing that haunts every team, particularly in regulated industries but certainly across many companies. You may be ready to go live, yet you’re held back by audit checks and balances that still need to be fulfilled – and getting the paperwork done and the checks completed is time-consuming, complex, and error-prone. If you don’t have an automated way to support that, it’s very painful.

What can we do to deliver AI projects at the speed and quality we need? Enter data version control 

A strong data foundation is essential for building compliant, efficient, and scalable AI systems. With the right infrastructure, teams can work effectively, and benefits compound as operations grow.

Data version control plays a critical role by creating a centralized, accessible source of truth. It enables isolated experimentation, pipeline execution, and model training through branching, similar to software development workflows. This approach allows for tracking changes, maintaining lineage, and tying data to code and models – tasks that are otherwise manual and time-consuming.
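
To make that workflow concrete, here is a minimal sketch of branch-based experimentation using the lakeFS Python SDK. The repository name, branch names, object path, and commit metadata are hypothetical, and exact method signatures may vary between SDK versions – treat this as a sketch rather than a definitive implementation.

```python
import lakefs  # high-level lakeFS Python SDK (pip install lakefs)

# Hypothetical repository backed by an object-store bucket.
repo = lakefs.repository("ml-datasets")

# Create an isolated, zero-copy branch off main for an experiment.
experiment = repo.branch("exp-feature-cleanup").create(source_reference="main")

# Write a new version of a dataset file on the experiment branch only.
experiment.object("training/features.parquet").upload(
    data=b"...parquet bytes produced by the pipeline..."
)

# Commit the change with metadata that ties the data to code and model runs.
experiment.commit(
    message="Recompute features with cleanup step",
    metadata={"git_commit": "abc1234", "pipeline": "feature-build-v2"},
)

# Promote to main once checks pass; main is untouched until the merge.
experiment.merge_into(repo.branch("main"))
```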

While not identical to Git, data versioning at scale introduces Git-like benefits tailored to large datasets, such as working locally with cloud-based data and running automated checks during data promotion. Use cases include schema validation, format checks, and column-level rules, with new applications emerging regularly.
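
As one example of such a check, a promotion gate might verify that a table’s schema on the candidate branch still matches what downstream consumers expect. The sketch below reads the Parquet schema directly from a lakeFS branch through its S3-compatible endpoint; the endpoint, credentials, repository, and paths are placeholders, and in practice this kind of check would typically be wired into an automated pre-merge hook rather than run by hand.

```python
import pyarrow.parquet as pq
from pyarrow import fs

# lakeFS exposes an S3-compatible endpoint, so object-store tooling can read
# directly from a branch using paths of the form <repo>/<branch>/<key>.
lakefs_fs = fs.S3FileSystem(
    endpoint_override="https://lakefs.example.com",  # placeholder endpoint
    access_key="AKIA...",                            # placeholder credentials
    secret_key="...",
)

EXPECTED_COLUMNS = {"user_id", "event_time", "label"}

def schema_ok(repo: str, branch: str, key: str) -> bool:
    """Return True if the Parquet file on the given branch has the expected columns."""
    with lakefs_fs.open_input_file(f"{repo}/{branch}/{key}") as f:
        schema = pq.ParquetFile(f).schema_arrow
    return EXPECTED_COLUMNS.issubset(set(schema.names))

# Run the check against the candidate branch before promoting it to main.
if not schema_ok("ml-datasets", "exp-feature-cleanup", "training/features.parquet"):
    raise SystemExit("Schema check failed: do not promote this branch")
```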

Versioning also simplifies governance. As AI regulations evolve, organizations must maintain detailed lineage from raw data through all transformations, often over long timeframes. Version control makes such maintenance feasible and reliable.

lakeFS is an infrastructure layer that enables Git-like operations such as branching, committing, and rollback on top of cloud and on-premises object stores like S3, Azure Blob Storage, Google Cloud Storage, MinIO, and others. It provides a versioning interface for any data format, including Iceberg tables, Parquet files, images, and videos, without requiring data to flow through lakeFS itself.

This architecture allows for seamless integration with existing compute environments and supports end-to-end data versioning. Users can create zero-copy branches for isolated experimentation, even on massive datasets, with no additional storage cost. For example, a 200-petabyte data lake can be branched instantly for individual pipelines.

lakeFS also delivers built-in lineage and compliance. Every change is logged and attributed – whether by a user or service account – enabling full reproducibility, traceability, and auditability across data transformations and experiments. This makes it a powerful tool for enterprise AI, where scale, governance, and agility are critical.
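
To show what that audit trail looks like in practice, here is a hedged sketch of walking a branch’s commit history with the lakeFS Python SDK: each commit records a committer, message, timestamp, and arbitrary metadata, which is the raw material for lineage and audits. The repository name is a placeholder and attribute names may differ slightly by SDK version.

```python
import lakefs

repo = lakefs.repository("ml-datasets")  # placeholder repository

# Walk the commit history of main: who changed what, when, and why.
for commit in repo.ref("main").log(max_amount=20):
    print(commit.id, commit.committer, commit.creation_date, commit.message)
    # Arbitrary key/value metadata recorded at commit time, e.g. the code
    # version or pipeline run that produced the data.
    print("  metadata:", commit.metadata)
```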

lakeFS has seen widespread adoption across industries. It supports production environments in some of the world’s largest data lakes. Examples include:

  • Lockheed Martin, using lakeFS to ensure reproducibility and compliance in large-scale AI pipelines for defense and aerospace.
  • Volvo, transitioning from local systems to lakeFS for scalable infrastructure.
  • ARM, implementing lakeFS to regain control over sprawling datasets and improve performance.

Early adoption of data version control leads to faster development, better compliance, reduced costs, and improved customer outcomes. The lesson: investing in AI data infrastructure early pays off.

Data version control in real-world cases

Building a full end-to-end data management platform 

– John Darrington, Senior Lead Software Engineer at Exostellar

One of the bigger projects I worked on – I talked about it at the Databricks Data + AI Summit back in 2024 – had to do with building a full end-to-end data management platform that would allow researchers in the field and different laboratories to collaborate, coordinate on data, bring it into a centralized location, and then allow us to build things like machine learning on top of it. 

Our goal was to foster collaboration in a way that made it cost-effective and more efficient for them to collaborate with this system than to do things independently.

In the government space, three data management challenges stood out:

  • Trash data – dealing with extra files and not knowing which data is actually important, which requires curation.
  • Disparate, bespoke file formats – formats that aren’t in wide use, with naming schemes that might clash within namespaces.
  • Permissions management, isolation, and control of the data – ensuring visibility and access for the individuals who want to work with it across organizations.

The adoption of lakeFS tends to be smooth and straightforward. The system is designed to integrate easily with existing object storage, requiring minimal configuration, often just hostname updates and a few parameter changes. lakeFS provides extensive documentation and intuitive frameworks that make the learning curve manageable. This transparency and accessibility help teams quickly understand its value and apply it effectively.
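
A rough sketch of what that “hostname update” can look like in practice: existing S3 client code is pointed at the lakeFS endpoint and a repository/branch-prefixed path, with everything else left unchanged. The endpoint, keys, repository, and object keys below are placeholders.

```python
import boto3

# The only changes versus plain S3 are the endpoint URL and the
# <repository>/<branch>/<key> layout of the object path.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # placeholder lakeFS endpoint
    aws_access_key_id="AKIA...",                # placeholder lakeFS access key
    aws_secret_access_key="...",
)

# Read an object from the 'main' branch of the 'ml-datasets' repository.
obj = s3.get_object(Bucket="ml-datasets", Key="main/training/features.parquet")
data = obj["Body"].read()

# Writes target a branch the same way; nothing else in the code changes.
s3.put_object(
    Bucket="ml-datasets",
    Key="exp-feature-cleanup/training/features.parquet",
    Body=data,
)
```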

Lessons learned

  • Do get your permissions and management strategy down prior to opening your tools and environments to end users. If you don’t have a good story for how management is going to happen within lakeFS or another tool for versioning, or your compute layers, and you open the gates, you can’t easily close those gates and correct that foundation. You’ll be constantly building the train tracks underneath you as you go. 
  • Do understand the end user who’s actually looking at and computing on this data. How are they using it? Is it cloud compute or a local machine using DuckDB? Once you know the end user, you can set up the best tooling and access patterns – see the sketch after this list.
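
As a concrete example of the local-machine path, the sketch below queries a Parquet dataset on a lakeFS branch from DuckDB over the S3-compatible endpoint. The endpoint, credentials, repository, and paths are placeholders.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Point DuckDB's S3 support at the lakeFS endpoint (placeholders below).
con.execute("SET s3_endpoint='lakefs.example.com'")
con.execute("SET s3_access_key_id='AKIA...'")
con.execute("SET s3_secret_access_key='...'")
con.execute("SET s3_url_style='path'")

# Query a dataset directly from the 'main' branch of the 'ml-datasets' repo.
df = con.execute(
    "SELECT label, count(*) AS n "
    "FROM read_parquet('s3://ml-datasets/main/training/*.parquet') "
    "GROUP BY label"
).df()
print(df)
```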

Communicating the business value of data infrastructure

When talking to stakeholders, I use the cluttered desktop analogy. Many executives have a very cluttered desktop with files and shortcuts. I ask how they feel if somebody sees their desktop. Usually, it’s embarrassment. They have a vested interest in presenting a unified structure for data.

Then I ask how hard it is to find things on their desktop. ‘Oh, it’s easy, I know where things are.’ Imagine if a thousand people were using your desktop. How challenging is it for them to find things? How much more efficiently could they complete their tasks and comprehend your problem space? When I use this analogy about data structure, they see that it not only looks appealing but also speeds up data processing.

You can take the analogy further: do you want them to delete your files? Do you want them to work with them? Do you want them to have access to your entire desktop? 


We start talking about a massive distributed file system and wanting to utilize the cheapest way to manage and store it, typically object storage. You need tooling on top to provide a more unified experience. Once they understand it on a small scale, they’re smart enough to take it from there on ROI and value.

Integrating multiple data sources from different systems 

– Ion Koutsouris, Software Engineer (Data & ML)

We wanted to build a model that would give insights into all kinds of maintenance-related activities of the systems. This meant we had to integrate multiple data sources from different systems: machine systems, operational information, and supply chains. There were various datasets and a lot of volume.

The complexity was that five or six people were working at the same time on a single algorithm, meaning we all had to fine-tune small components. If someone upstream changed something, it might affect the downstream input of my next step. At the beginning, there wasn’t really a way of working collaboratively at that scale.

Eventually, we looked into ways to properly version our datasets, orchestrate our tasks, and modularize our code – to split pipelines into proper sub-steps that can be executed separately. If those are executed separately, you can start off from a later point. We also looked at different ways of storing data. 

Initially, we saved in Parquet, but we moved to Delta Lake. We examined ways to optimize the storage format of our lakehouse and improve the efficiency of our compute runtime, so that the six of us could collaborate on the same algorithm without stepping on each other’s toes, which happened frequently at first.
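
To illustrate the Parquet-to-Delta move, here is a small sketch using the delta-rs Python bindings (the deltalake package): each write becomes a table version, and earlier versions remain readable. The local path and example data are placeholders; against a lakehouse on object storage you would pass a URI and storage options instead.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "/tmp/maintenance_events"  # placeholder location

# Version 0: initial load (what used to be a plain Parquet dump).
write_deltalake(table_path, pd.DataFrame({"asset": ["pump-1"], "status": ["ok"]}))

# Version 1: an upstream step appends new records.
write_deltalake(
    table_path,
    pd.DataFrame({"asset": ["pump-2"], "status": ["degraded"]}),
    mode="append",
)

# Downstream steps can pin the exact version they were developed against.
v0 = DeltaTable(table_path, version=0).to_pandas()
latest = DeltaTable(table_path).to_pandas()
print(len(v0), len(latest))  # 1, 2
```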

We discovered lakeFS while exploring data versioning solutions that could support collaborative work on shared datasets. While Delta Lake offers built-in versioning, it lacked certain capabilities needed for team-based workflows. lakeFS stood out due to its abstraction layer that sits above object storage rather than being embedded within a specific lakehouse format. This design provides greater flexibility and functionality. After initial experimentation, it became clear that lakeFS addressed the core challenges and was the right fit for the organization’s needs.

Lessons learned

  • Don’t just select the most common tool in the industry. Understand your use case – what you need and what you want to achieve – before selecting a random, popular tool and slapping it on.
  • Do create many modular components you can reuse across your codebase. If you can use certain UDFs for data transformations across multiple datasets, set up a repo or library you can share – see the sketch after this list.
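
A minimal sketch of what such a shared component might look like: small, dataset-agnostic transformation functions collected in an internal library so every pipeline applies them the same way. The module, function names, and rules here are hypothetical.

```python
# shared_transforms.py – hypothetical internal library of reusable steps
import pandas as pd

def normalize_timestamps(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Parse a timestamp column and convert it to UTC, the same way everywhere."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column], utc=True)
    return out

def drop_incomplete_rows(df: pd.DataFrame, required: list[str]) -> pd.DataFrame:
    """Drop rows missing any of the columns every downstream step depends on."""
    return df.dropna(subset=required)

# Any pipeline can compose the same steps instead of re-implementing them:
# df = drop_incomplete_rows(normalize_timestamps(raw, "event_time"), ["asset_id"])
```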

Communicating the business value of data infrastructure

What works well is showing a working example and an actual improvement, then showing the contrast. I used this for internal architectural changes: I simply started, built a proper demo, and showed the value of the change and how long it would take – laying out the trade-offs of continuing in the current direction versus making the improvement.

The future of data infrastructure engineering for AI

Composable data systems are becoming increasingly prevalent in data engineering and the broader data ecosystem. Tools such as Iceberg, along with lakeFS supporting the Iceberg catalog, exemplify this trend. The emergence of interoperable data systems is a positive development, enabling flexible architecture design by allowing components to be integrated seamlessly. This shift is largely driven by open standards like Apache Arrow and storage formats such as Parquet, which collectively facilitate this modularity.
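
As a small illustration of that interoperability, the sketch below reads a Parquet file into an Arrow table and hands the same in-memory data to two different engines without rewriting it. The file path is a placeholder.

```python
import duckdb
import polars as pl
import pyarrow.parquet as pq

# One Parquet file, one in-memory Arrow table...
events = pq.read_table("/tmp/maintenance_events.parquet")  # placeholder path

# ...consumed by different engines that all speak Arrow.
polars_df = pl.from_arrow(events)                               # hand off to Polars
summary = duckdb.sql("SELECT count(*) AS n FROM events").df()   # DuckDB scans the Arrow table
print(polars_df.shape, summary)
```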

In AI engineering, the pace of innovation is accelerating rapidly. Numerous projects are emerging, often addressing similar challenges. While this proliferation makes it difficult to track developments, many initiatives are likely to be short-lived due to limited value delivery, with only a few standout solutions enduring. The current abundance of AI gateways exemplifies this dynamic, with the expectation that only a select few will persist over time. We continue to observe the field’s evolving trajectory.

In hardware-focused domains, both efficiency and data quality are anticipated to become central concerns in the near term. Organizations managing extensive hardware resources and ambitious workloads – such as model training, inference, and data storage – face increasing pressure to optimize tool usage and operational efficiency.

Concurrently, data quality is gaining attention. The exponential growth in data generation has led to a rise in low-quality inputs. The ability to assess, control, and maintain data quality is becoming a critical factor in the success of data-driven enterprises. Poor-quality data undermines outcomes, regardless of the sophistication of the tools employed.

Wrap up

At lakeFS, we are on a mission to close the data infrastructure gap that has become a huge bottleneck for many AI initiatives. We are doing this with a highly scalable, Git-like data version control system designed for enterprise use cases. 
To get more insights about the value of data engineering for AI projects, get in touch with one of our experts.
