
Case Study

How Microsoft’s Overture Maps Team Uses lakeFS to Enable Collaboration and Data Quality

Vara Ghanta, Author

Vara is a Principal Software Engineering Manager at Microsoft, Bing...

Last updated on January 14, 2026
Company

Founded in December 2022 by Amazon, Meta, Microsoft, and TomTom, and hosted by The Linux Foundation, The Overture Maps Foundation (Overture) is a collaborative effort to enable current and next-generation interoperable open map services and products. The Overture datasets are intended to work like base maps on top of which other companies build mapping applications, whether open or proprietary.

Problem

The Overture Data Platform team needed to create a robust data platform for Overture that would seamlessly integrate diverse open data sources from civic and ecosystem members and ensure the delivery of a high-quality, comprehensive open dataset.

Solution

By implementing lakeFS, Overture’s Data Platform team standardized its dataset version management, with easy access provided by the lakeFS portal. This improved data quality and team collaboration, while saving tremendous amounts of storage thanks to lakeFS’s zero-copy clone approach.

The company

Founded in December 2022 by Amazon, Meta, Microsoft, and TomTom, and hosted by The Linux Foundation, The Overture Maps Foundation (Overture) is a collaborative effort to enable current and next-generation interoperable open map services and products.

Overture has so far released open source datasets for 2.3 billion buildings in the world, nearly 54 million places of interest, 86 million kilometers of roads, national and regional administrative boundaries, and contextual layers including land and water data to help complete display maps when needed. The Overture datasets are intended to work like base maps on top of which other companies build mapping applications, whether open or proprietary. On the Microsoft side, Microsoft Maps is the unit collaborating with Overture.

The Overture Data Platform team recognized the need to create a robust data platform for Overture that would seamlessly integrate diverse open data sources from civic and ecosystem members and ensure the delivery of a high-quality, comprehensive open dataset.


The challenges

Building a data platform based on open-source solutions

The team wanted to avoid using proprietary solutions to build the pipeline, opting instead for publicly available, preferably open-source solutions, thus ensuring portability and reducing complexity.

“Starting something fresh, we wanted to see what was publicly available and reusable, something mature enough to meet our requirements,” said Vara Ghanta, Principal Software Engineering Manager at Microsoft. It was important for the team to give back to the open source community and provide the ability to contribute back when developing new things. “Were we to find additional needs, we would contribute them back to the community and get them implemented,” Ghanta said.

Overture’s approach is to leverage public and open-source technologies rather than build custom solutions, thereby accelerating development. With that goal in mind, the Microsoft team started evaluating available options. Since Overture brings multiple companies together, the team recognized that others must be dealing with similar problems. The Microsoft team asked in internal Slack channels for recommendations and was directed to the open-source data version control solution lakeFS.

“One of our engineers reviewed the options, and we felt that lakeFS worked very well for what we were looking to do,” said Vara Ghanta.



Dataset version management

Overture processes various open data sources. In order to produce a high-quality open dataset, the team needed to build solid pipelines and implement a system to help manage feed versions throughout the pipeline that would support all the different data sources and formats. 

“When we start processing something that doesn’t meet the quality bar, we look at how we roll back feeds and track them beginning to end. Looking at the final data product, we need to review all the provided data specs, the engineered features generated during processing, and so on,” Ghanta said. 

“Tracking is important not just for debugging, but also for cloning datasets into testing environments, or what we call mini-environments, where we can stream sets of data to test new features and set up new code. So, version management of datasets was a crucial feature required for building that data platform,” Ghanta said.


Managing multiple data pipelines

At Overture, data is organized into different themes: transportation, places, buildings, divisions, etc. Therefore, data processing is divided into several pipelines, and the outputs of all those pipelines are assembled into a central data pipeline. 

This added a complication to the existing challenge of managing dataset versions. The team looked at different options, from lightweight ones like simple folder conventions to sophisticated ones like Apache Iceberg, which supports versioning. They also examined other common data version control systems such as DVC, but ruled it out because it could not scale to meet their requirements. Ultimately, the team was looking for a single solution that would cover existing and new pipelines and make dataset management more efficient.

lakeFS proved to be the only solution that met these demands. After delving deeper into its features and comparing them with the alternatives, the team chose lakeFS.



Adopted solution

Building a data platform based on open-source solutions

The Overture Data Platform team implemented the open source solution lakeFS across pipelines. “One of the most impactful features is the portal, which allows the team to browse various digests and perform several operations directly from it,” Ghanta said. “lakeFS works across both Azure and AWS ecosystems, supporting S3 paths and making it intuitive and easy to integrate with other solutions.”
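The S3-path compatibility Ghanta mentions comes from lakeFS’s S3-compatible gateway: objects are addressed as s3://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;, so existing S3 tooling works unchanged. A configuration sketch, where the endpoint, repository, and prefix names are hypothetical, not Overture’s actual setup:

```shell
# List a branch's contents with the standard AWS CLI, pointed at a
# (hypothetical) lakeFS endpoint. The first path segment is the lakeFS
# repository, the second is the branch; the rest is the object prefix.
aws s3 ls "s3://overture-data/main/theme=buildings/" \
    --endpoint-url https://lakefs.example.com
```

Because the branch name is just a path segment, the same command pointed at a candidate branch inspects its in-progress version of the data.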

lakeFS helped the team manage dozens of datasets more easily. For example, if someone asks what’s happening with a particular release, the lakeFS portal enables them to check if their dataset went through, whether updates were applied, and where the release stands.


Dataset version management

Version management of feeds was a crucial requirement for the Overture Data Platform team, and lakeFS provides that out of the box. The team can now track the “last known version” of the data it processed, as well as candidate versions of data. Once the candidate version meets all requirements, it can be promoted to the next table version. These processes are intuitive to implement, using the Git-like features available through lakeFS APIs.
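The candidate-to-promotion flow described above can be pictured with a toy model in plain Python. This is an illustration of the pattern, not the lakeFS API: a branch is just a mapping from logical paths to immutable object references, so creating a candidate branch copies references rather than data, and promotion updates pointers once checks pass.

```python
# Toy model of zero-copy branching and candidate promotion.
# Illustrative only -- real lakeFS exposes equivalent Git-like
# operations (branch, commit, merge) through its APIs.

class Repo:
    def __init__(self):
        # branch name -> {logical path: object id}; object ids stand in
        # for immutable files in object storage.
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Zero-copy: only the mapping of references is copied, not data.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, path, object_id):
        self.branches[branch][path] = object_id

    def promote(self, candidate, target="main", check=lambda refs: True):
        # Promote the candidate only if it passes the quality check.
        refs = self.branches[candidate]
        if not check(refs):
            raise ValueError(f"{candidate} failed quality checks")
        self.branches[target] = dict(refs)

repo = Repo()
repo.write("main", "buildings/part-0.parquet", "obj-1")
repo.create_branch("candidate")                 # shares obj-1, no copy
repo.write("candidate", "buildings/part-1.parquet", "obj-2")
repo.promote("candidate", check=lambda refs: len(refs) == 2)
print(sorted(repo.branches["main"].values()))   # ['obj-1', 'obj-2']
```

In real lakeFS, promotion would be a merge of the candidate branch into the target rather than a pointer overwrite, but the zero-copy property is the same: until new data is written, a branch adds no storage.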


A format agnostic versioning system

lakeFS turned out to cover a broad set of dataset-versioning scenarios across the Overture Data Platform team’s pipelines.

“The main reason is that the data format of the data we’re working on can be anything. For example, if it’s data from OpenStreetMap, it’s a protocol binary format, which is like a Git file. Some files are tabular; some are not. lakeFS allows managing versions for any type of feed, while Iceberg is mostly for tabular data. Also, Iceberg isn’t lightweight. Importing data into Iceberg takes quite a bit of time, depending on the dataset size. With lakeFS, excluding the data transfer time, tracking feeds in lakeFS is pretty fast,” Ghanta said.


Implementing lakeFS

The implementation process started with the creation of a wiki page documenting and comparing all available options. “There was a scenario where data engineers process data in a candidate branch, run tests, and then promote it to production. That clicked for us. It was probably the closest thing to an ‘aha moment,’ where we thought, ‘Yeah, this looks like a good, promising solution’,” said Ghanta.

Integration began with one of the first pipelines: the central data pipeline. More data processing pipelines are being built, however, and they all have similar requirements for dataset version management, so the team plans to build a standard set of libraries and utilities so that the whole infrastructure uses consistent technologies – with lakeFS for dataset version management.


Results

Thanks to implementing lakeFS, Overture’s Data Platform team benefits from efficient dataset version management covering a broad set of scenarios across data pipelines, with easy access provided by the lakeFS portal – all while saving tremendous amounts of storage thanks to lakeFS’s zero-copy clone approach.

“People asked, ‘Why do we even need to set up lakeFS? Why don’t we just have the latest folder and then a bunch of other folders?’ In other words, why don’t we build a lakehouse solution independently? Obviously, it looks simple, but as the system becomes more and more complex, it becomes a nightmare to manage data versions throughout the infrastructure. We found a few solutions, but they didn’t seem promising. lakeFS was the only system that met our requirements, enabling us to build a standardized data platform,” said Ghanta.

