Case Study
How Volvo Cars Streamlines its ML Platform, Enabling Reproducibility with lakeFS
Problem
Volvo Cars manages massive volumes of data and needs a scalable solution to enable collaboration and experiment reproducibility on its ML platform.
Solution
Implementing lakeFS for scalable data version control, fine-grained access control, and reproducibility of ML experiments.
Result
Adopting lakeFS data version control helped Volvo Cars streamline workflows, enhance collaboration, and improve productivity, enabling faster time-to-insights for ML projects.
The company
Volvo Cars is an automotive industry leader with a reputation for producing vehicles with an excellent safety record. To achieve this goal, the company develops modern automotive solutions based on machine learning (ML).
The ML Platform Engineering team serves as a global center of excellence for data and machine learning initiatives across the entire organization. The team enables, supports, and consults on all the data science, MLOps, and data engineering projects from different business organizations throughout the Machine Learning project lifecycle.
The challenge
Challenge 1: Massive data volumes
At Volvo Cars, ML teams work with hundreds of terabytes (TBs) of sensor data generated by engines and image/video data from the cars. All the data is stored in Amazon S3 in different formats:
- structured IoT data in tabular format (some in Parquet file format),
- unstructured video and image data encoded in proprietary formats,
- and other data sets that mix structured and unstructured data.
Some of the stored video files reach as much as ~80 TB in size, which requires teams to manage data effectively without duplicating it.
To address the data versioning requirements, the Volvo ML platform initially used DVC for both data versioning and running MLOps workflows. DVC worked well while the Volvo ML platform was in a beta phase, with very few users experimenting and prototyping their ML projects. As the user base grew and data scaled to a few TB per object, managing data with DVC brought a few challenges:
- DVC versioning works at the file level – as the number of files grows, the versioning metadata stored in the data repository grows proportionally, and keeping track of it becomes a problem. This metadata overhead adds to the complexity of projects for data scientists and ML engineers.
- Working with data locally – another challenge with DVC is that it requires engineers to download data objects to local compute, run ETL jobs, and then upload the changes back to the remote repository. Working with big data files in a local environment created performance bottlenecks that needed to be addressed.
When the bottlenecks resulting from using DVC and MinIO for data versioning became intolerable, Volvo Cars turned to lakeFS to resolve these challenges and to achieve better data manageability.
Challenge 2: Collaboration
Data scientists and ML engineers from various business units within Volvo Cars Engineering are the primary users of the Volvo ML platform. Multiple teams need to work on the same data set for different analytics or machine learning use cases, according to their business requirements, without interrupting each other or fully copying the data.
Given the scale of data these teams work with, data manageability becomes a challenge. It’s imperative for data to be available in an isolated manner without forcing team members to copy the data.
Another requirement is fine-grained access control of data for different teams.
Challenge 3: Experiment reproducibility
Additionally, data scientists and engineers use the ML platform to run hundreds of automated ML experiments that require effective tracking and reproducibility.
To achieve that, one needs to store the data, code, model artifacts, metrics, model deployment endpoints, and model inferences – all versioned together and stored atomically.
Adopted solution
Challenge solved: Massive data volumes
lakeFS was built to give data practitioners a scalable data version control solution for data products spanning petabytes. It offers high-performance version control for enterprise-level operations, matching the unique requirements of Volvo Cars.
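Part of the reason this scales is that lakeFS branches are metadata pointers rather than copies, so creating an isolated working environment on top of a multi-terabyte repository copies no objects. The sketch below illustrates the pattern with the high-level lakeFS Python SDK; the repository and branch names are placeholders rather than Volvo Cars' actual setup, and exact method names may differ between SDK versions.

```python
import lakefs  # high-level lakeFS Python SDK (pip install lakefs)

# Assumes lakeFS credentials are already configured (e.g. via ~/.lakectl.yaml or env vars).
repo = lakefs.repository("sensor-data")  # placeholder repository name

# Creating a branch is a metadata-only operation: no objects are copied,
# even if "main" references hundreds of terabytes of data.
exp = repo.branch("exp-engine-anomaly").create(source_reference="main")

# Work against the branch in full isolation; "main" stays untouched until a merge.
for obj in exp.objects(prefix="telemetry/"):
    print(obj.path, obj.size_bytes)
```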
Challenge solved: Collaboration
When a new data scientist or ML engineer onboards onto the ML platform, their user is automatically added to certain user groups. By enforcing access controls on lakeFS repositories through these user groups, the new user gains access to specific repos that contain tutorial data sets and code. This reduces friction for new users of the platform, and lakeFS helped improve this onboarding experience.
The platform engineering team developed a homegrown MLOps toolkit used by different data science and ML teams internally. The toolkit improves the productivity of data science teams by abstracting away the complex data infrastructure underneath it. It consists of a Python package that includes dependencies (like lakeFS) required for an ML project.
When a data scientist spins up a Jupyter notebook to work on a project, it comes pre-configured with the lakeFS client along with access and secret keys. By simply importing the package into the notebook and using it, the time to value of an ML project improves several times over.
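The toolkit itself is internal to Volvo Cars, but the underlying pattern is a common one: inject the lakeFS endpoint and keys into the notebook environment and read versioned data through lakeFS's S3-compatible gateway. Below is a minimal sketch of what such a pre-configured session might look like using boto3; the environment variable names, repository, and object paths are illustrative assumptions, not the toolkit's actual API.

```python
import io
import os

import boto3
import pandas as pd

# The internal toolkit would inject these values; the variable names are illustrative.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["LAKEFS_ENDPOINT"],
    aws_access_key_id=os.environ["LAKEFS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["LAKEFS_SECRET_ACCESS_KEY"],
)

# With the lakeFS S3 gateway, the "bucket" is the repository and the key
# starts with a ref (branch, tag, or commit ID), followed by the object path.
resp = s3.get_object(Bucket="sensor-data", Key="main/telemetry/engine/part-0000.parquet")
df = pd.read_parquet(io.BytesIO(resp["Body"].read()))
print(df.shape)
```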
Challenge solved: Experiment reproducibility
The data, model artifacts, metrics, deployment endpoints, and predictions are all versioned together atomically using lakeFS. The code used to train ML models is versioned with Git, and the commit hash is stored in a .lakeFS directory in the data repo.
By using the commit hash or version number from lakeFS metadata, users can trace back to a specific commit on Git and lakeFS. This opens the door to easy reproduction of ML experiments by checking out the data from any specific commit.
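As an illustration of this pattern (not Volvo Cars' exact code): commit the data and artifact snapshot with the Git hash of the training code attached as lakeFS commit metadata, then later read the data back at that exact commit. The sketch below uses the high-level lakeFS Python SDK; repository, branch, and path names are placeholders, and exact method names may vary across SDK versions.

```python
import subprocess

import lakefs

repo = lakefs.repository("ml-experiments")   # placeholder repository name
branch = repo.branch("exp-engine-anomaly")

# Attach the Git commit hash of the training code to the data/model snapshot.
# (Assumes the branch has uncommitted changes from the training run.)
git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
branch.commit(
    message="Snapshot: training data, model artifacts, and metrics for a run",
    metadata={"git_commit": git_sha},
)
commit_id = branch.get_commit().id            # the commit just created

# Later: reproduce the experiment by reading data exactly as it was at that commit.
ref = repo.ref(commit_id)
with ref.object("datasets/train.parquet").reader(mode="rb") as reader:
    train_bytes = reader.read()
```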
Results
By implementing lakeFS, Volvo Cars benefits from a data version control layer that enables smooth cross-team collaboration, operations on massive volumes of data, and reproducibility of machine learning experiments.
“With lakeFS, we have streamlined data science and MLOps workflows, adapted data access controls for different data teams, accelerated productivity and reduced time-to-insights for ML engineering projects.”
