Case Study

How G+D Used lakeFS Data Version Control to Democratize Their Data Infrastructure

Jonas Goltz

Jonas joined Giesecke+Devrient (G+D) in 2018, with a focus on...

Last updated on January 14, 2026
Company

Giesecke+Devrient (G+D) is a global leader in digital security, specializing in secure mobile connections, legal identities, critical infrastructure, payments, and currency technology. With over 14,000 employees in 40 countries, G+D harnesses AI/ML to enhance security, combat fraud, and predict system failures.

Problem

At G+D, multiple teams crunch data to produce insights and predictions across several different projects. These projects and teams rely on the same datasets, and those datasets evolve over time. This posed three major challenges: cross-team collaboration, pipeline management and data preprocessing, and airtight security for data management.

Solution

By adopting the lakeFS data version control system, Giesecke+Devrient set up a data management infrastructure across their organization, building a foundation for full adoption across all business divisions. This allowed for better collaboration, more efficient work, and scalable data versioning.

The company

Giesecke+Devrient is a German technology company that has served for decades as a trusted partner for governments, public authorities, and enterprises all over the world. Operating in 40 countries across the globe, the company employs over 14,000 people. It specializes in digital security for mobile connections, legal identities, and critical infrastructure, provides contemporary solutions for payments and banking, and is a global leader in currency technology.

G+D embraces the power of artificial intelligence and machine learning (AI/ML) to revolutionize the security landscape. Their AI/ML initiative focuses on developing cutting-edge solutions that enhance security across various aspects of their business. The possibilities in this space are endless. Imagine leveraging facial recognition and machine learning to verify documents and identities, combating fraud through anomaly detection in financial transactions, and even predicting potential failures in security systems for proactive maintenance.

Jonas Goltz, Data Scientist at G+D, is part of the company’s corporate center and manages multiple ML/AI initiatives from across the organization. The company uses Azure Cloud, stores its data on ADLS Gen2, and runs proprietary Python applications using PyTorch and TensorFlow. Pipelines are orchestrated by Flyte, and in some cases Databricks is used with Spark and MLflow.

By harnessing the power of AI/ML, G+D is poised to significantly strengthen security measures and create a safer and more efficient future.


The challenges

Struggling to centralize a decentralized infrastructure

At G+D, multiple teams crunch data to produce insights and predictions across several different projects. These projects and teams rely on the same datasets, and those datasets evolve over time. This posed three major challenges:

Enabling cross-team collaboration

Each team working on the data sought isolation from the other teams in order to work and communicate clearly. To achieve this, teams copied data into isolated storage containers, struggling with the time, cost, and accuracy of copying large datasets. Updating a team’s copy to the current state wasn’t simple either, and teams naturally ended up with many copies of the data and very little ability to collaborate.

“Since developers typically need to modify and write data, multiple copies of the data are often created to ensure that no one overwrites previous datasets, while still maintaining the ability to write new data,” said Jonas Goltz.


Pipeline management & data preprocessing

When running data pipelines from preprocessing through model training and deployment, the team struggled with the manual work of creating parallel workflows based on different preprocessed inputs. Again, data had to be copied, orchestration and computation had to be configured to run several times over different datasets, and results had to be written out to satisfy all these requirements. The team lacked clear data lineage connecting the data, the code in Git, and the model parameters in MLflow, making it a manual and tedious task to compare or reproduce their data pipelines.


Airtight security for data management

Any data provided to the data lake and the teams at G+D is bound by strict security requirements, driven by the sensitive data the company manages and its strong security culture. Any system adopted by G+D would therefore have to integrate with its central access management and implement access rights based on the need-to-know principle.


Realization: We need a data version control system

Jonas identified that the pains the company was experiencing could be resolved using a data version control system that would allow for:

  1. Collaboration
  2. Reproducibility
  3. High security for data access

He then embarked on the journey of selecting a system that would best suit G+D’s needs.


Selecting the right data version control system

G+D sought a solution that would allow the creation of a centralized repository on top of their data lake. Azure’s own data versioning tooling was the first option to come to mind, but after testing it proved inefficient in terms of both cost and performance, leading the team to seek an alternative.

G+D next turned to the open-source project DVC, but it also proved insufficient: it was unable to scale and could only be used for data science work done locally.

“When building the infrastructure, especially when taking orchestration and pipeline training into consideration, we realized that a decentralized tool such as DVC couldn’t meet our needs, as it would create copies of our data everywhere. We decided to opt for a solution that provides proper branching capabilities, where a centralized repository would allow us to read and write data. For containerized setup, this was especially important,” said Jonas Goltz.

G+D then turned to their community and partners, comprising both private companies and universities, for suggestions on a scalable system. They recommended lakeFS, a data version control system that was part of their MLOps suite.


Adopted solution: lakeFS scalable data version control

Challenge solved: Enabling cross-team collaboration

lakeFS facilitates team collaboration within a single repository in a cost-efficient way, thanks to its zero-copy mechanism: “G+D has many projects where people work with the same data. With lakeFS there is no longer a need for copying the data across projects. Using only one repository for three or four projects at a time that consumes this data is extremely convenient and something we simply didn’t have. lakeFS gives us the ability to build something we were previously lacking,” added Jonas Goltz.

The scalability of lakeFS was a powerful differentiator that ultimately made it an excellent match for the massive data volumes of the organization: “We have billions of data points in these datasets and managing them with any other solution would be really difficult.”
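
lakeFS exposes this branching model through its CLI and SDKs. Below is a minimal sketch of the zero-copy workflow, assuming the high-level lakeFS Python SDK (`pip install lakefs`); the repository, branch, and path names are hypothetical, not G+D’s actual setup:

```python
import lakefs  # credentials are read from lakectl config or LAKECTL_* env vars

# A branch is metadata only: it gives a project an isolated, writable
# view of the shared data without duplicating a single object.
repo = lakefs.repository("shared-datasets")  # hypothetical repository
branch = repo.branch("project-a").create(source_reference="main")

# The project writes its derived data to its own branch...
with open("train.parquet", "rb") as f:
    branch.object("features/train.parquet").upload(data=f.read())

# ...and commits, producing an immutable snapshot that other teams and
# pipelines can reference without it changing underneath them.
branch.commit(message="Project A: preprocessed features")
```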


Challenge solved: Pipeline management & data preprocessing

lakeFS also addressed a key aspect of data preprocessing: “It’s really helpful when you essentially build different preprocessing steps to support the way you work. Everything data-related is in lakeFS and the pre-processing creates new datasets in different branches. And then we have our MLflow and orchestration tools that basically consume data from there,” said Jonas Goltz.
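
Downstream consumers don’t need lakeFS-specific code to read from those branches. One common route is the lakeFS S3-compatible gateway, where paths follow `s3://<repository>/<ref>/<object-key>`. A sketch with pandas, in which the endpoint, credentials, and names are placeholders rather than G+D’s configuration:

```python
import pandas as pd  # reading s3:// URLs additionally requires the s3fs package

# Read a dataset from a lakeFS branch through the S3-compatible gateway.
# <ref> may be a branch name or, for fully reproducible reads, a commit ID.
df = pd.read_parquet(
    "s3://shared-datasets/project-a/features/train.parquet",
    storage_options={
        "key": "<lakefs-access-key-id>",       # placeholder credentials
        "secret": "<lakefs-secret-access-key>",
        "client_kwargs": {"endpoint_url": "https://lakefs.example.internal"},
    },
)
```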

The team can trace and track where data comes from (data lineage) and easily separate the data from the actual models. “Otherwise, you would have plenty of branches in every single data repository, which at some point gets confusing because it’s hard to differentiate between different datasets,” he added.
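
One way to make that lineage explicit is to record the lakeFS commit alongside the Git revision on every MLflow run. A minimal sketch using standard MLflow tracking; the helper and tag names are illustrative, not G+D’s actual code:

```python
import mlflow

def start_tracked_run(repo: str, branch: str, data_commit: str, git_sha: str):
    """Open an MLflow run tagged with the exact code and data versions."""
    run = mlflow.start_run()
    # Data version: an immutable lakeFS commit ID, not a mutable path.
    mlflow.set_tag("lakefs_repository", repo)
    mlflow.set_tag("lakefs_branch", branch)
    mlflow.set_tag("lakefs_commit_id", data_commit)
    # Code version: the Git revision the pipeline ran from.
    mlflow.set_tag("git_commit", git_sha)
    return run
```

Comparing or reproducing two pipeline runs then amounts to diffing two lakeFS commits and two Git commits, rather than guessing which copy of the data a model was trained on.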

Another area where lakeFS brought benefits is pipeline management, through its integrations with other pipeline management tools. G+D cooperates with various other organizations and business partners, and offers a frontend service where these partners can upload data on their own so that it is integrated into a data pipeline.


Challenge solved: Airtight security for data management

Since security is the core specialization of G+D, the centralized data repository and the versioning system on top of it needed to meet strict security requirements. By implementing lakeFS Enterprise, the company could establish role-based access control (RBAC) policies fully integrated with its Azure cloud environment.

“Access to the tool is restricted via Microsoft Entra. Using lakeFS, we have admins controlling this for the main branches within our company. Each division has its own development teams and admins, and each admin team can create their own buckets and repositories, along with their own policies and groups, as needed,” said Jonas Goltz.

G+D has three different groups of users, each mapped to its own set of access policies (a sketch of one such policy follows the list):

  • Analysts/non-technical users: they can upload data, restricted to certain branches, typically using the web interface to upload files.
  • Developers: they have access to the data relevant to them and are able to modify and write data.
  • Admins: they are restricted to their divisions; they have permissions to remove repositories, make deeper changes, and assign new developers to their own projects.
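
For illustration, a developer-scoped policy in lakeFS’s IAM-style policy format might look like the sketch below; the policy ID, repository name, and chosen actions are hypothetical, not G+D’s actual configuration:

```json
{
  "id": "ProjectADevelopers",
  "statement": [
    {
      "action": ["fs:ReadObject", "fs:WriteObject", "fs:ListObjects"],
      "effect": "allow",
      "resource": "arn:lakefs:fs:::repository/project-a/object/*"
    },
    {
      "action": ["fs:CreateBranch", "fs:ReadBranch", "fs:ListBranches"],
      "effect": "allow",
      "resource": "arn:lakefs:fs:::repository/project-a/branch/*"
    },
    {
      "action": ["fs:DeleteRepository"],
      "effect": "deny",
      "resource": "arn:lakefs:fs:::repository/project-a"
    }
  ]
}
```

Policies like this can then be attached to the Entra-mapped groups above, keeping access aligned with the need-to-know principle.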


“G+D is a highly secure organization, restricting many of the tools that can be shared. Moving data around can be a real pain and having a system like lakeFS that gives the teams access according to their permissions makes it much easier to upload data and have their pipelines ready without all the hassle of security restrictions. That’s a really nice feature,” said Jonas Goltz.


Results

A solid partnership

G+D currently uses lakeFS and its data versioning capabilities across multiple projects, including early-stage projects where a centralized location is needed for business divisions to share access to unstructured and structured data that comes from different customers in the form of Excel files.

“You have a centralized system for a decentralized organizational structure, which is highly beneficial. G+D itself is a complex organization with over 120 companies and numerous distributed teams. It’s challenging to get an overview, but lakeFS provides a centralized solution and toolchain for machine learning and data science. This approach is much more efficient than having each division use its own system. Now data is fairly distributed; previously, no one had access to it or even knew it existed, nor could they use it.”


By adopting the lakeFS data version control system, Giesecke+Devrient set up a data management infrastructure across their organization, building a foundation for full adoption across all business divisions. This allows for better collaboration, more efficient work, and scalable data versioning.
