Case Study
How NASA Uses Cloud-Friendly Zarr Stores to Manage Dynamic Data with lakeFS
Company
NASA is a U.S. federal agency focused on space programs and exploration. Its Goddard Earth Sciences (GES) DISC team is working on GeoZarr, a cloud-native format for geospatial data, building on the Zarr format for efficient access to compressed n-dimensional arrays.
Problem
NASA needed a solution to ensure consistent data in a dynamic Zarr store and manage version control for earth science datasets in the cloud, enabling reproducibility and reliable use for scientific publications.
Solution
lakeFS helped NASA keep its data consistent by applying engineering best practices to data management, so that every user sees the same datasets. It also enabled Git-like version control in the cloud, giving NASA's dynamic Zarr datasets reproducibility in line with cloud standards.
The company
Founded in 1958, the National Aeronautics and Space Administration (NASA) is an independent agency of the United States federal government responsible for civil space programs, aeronautics research, and space exploration.
The NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) is one of the data centers of NASA's Science Mission Directorate. One of its teams is currently involved in developing the geospatial Zarr Standard (GeoZarr).
Zarr is a cloud-native data format for n-dimensional arrays. It stores an array as separately compressed chunks, so a reader can fetch only the chunks it needs rather than the whole array. Zarr has become increasingly popular for geospatial applications and has been adopted as an OGC Community Standard.
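The chunking idea can be illustrated with a minimal sketch. This is not the Zarr library or its on-disk layout; it is a toy, standard-library-only model in which the function names, the chunk length, and the dictionary "store" are all assumptions made for illustration.

```python
import zlib
from array import array

CHUNK_LEN = 4  # elements per chunk (hypothetical value for this sketch)


def write_chunked(values, chunk_len=CHUNK_LEN):
    """Split a 1-D sequence into fixed-size chunks and compress each one separately."""
    store = {}
    for start in range(0, len(values), chunk_len):
        raw = array("d", values[start:start + chunk_len]).tobytes()
        store[str(start // chunk_len)] = zlib.compress(raw)
    # Metadata describing the logical array, kept apart from the chunk data.
    meta = {"shape": [len(values)], "chunks": [chunk_len]}
    return meta, store


def read_element(meta, store, index):
    """Decompress only the chunk that holds `index`, not the whole array."""
    chunk_len = meta["chunks"][0]
    chunk = array("d")
    chunk.frombytes(zlib.decompress(store[str(index // chunk_len)]))
    return chunk[index % chunk_len]


meta, store = write_chunked([float(i) for i in range(10)])
value = read_element(meta, store, 9)  # touches only the last chunk
```

Because metadata and chunks are stored as separate objects, a reader depends on both being in agreement, which matters for the consistency challenge described later.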
The geospatial Zarr Standard (GeoZarr) sets flexible and inclusive rules for the Zarr cloud-native format that match the different requirements of the geospatial domain. These conventions provide a clear and consistent framework for organizing and characterizing data, ensuring unambiguous representation.
The challenges
Ensuring data consistency
When a user tried to read data while an update was in progress, they would receive inconsistent shape information or incorrect results from their calculation code. The team was looking for a way to manage a dynamic Zarr store so that users always see consistent data.
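The failure mode can be reproduced in a small simulation. This is a hedged sketch, not NASA's pipeline: the store layout, `append_in_place`, and `is_consistent` are hypothetical names, and the generator simply pauses between the metadata write and the chunk writes to show what a concurrent reader could observe.

```python
def append_in_place(store, new_values, chunk_len):
    """Append to a store in two separate steps, pausing after each one."""
    # Step 1: the declared shape is updated first.
    store["meta"] = {"shape": store["meta"]["shape"] + len(new_values)}
    yield "metadata written"
    # Step 2: the new chunks arrive only afterwards.
    next_key = len(store["chunks"])
    for i in range(0, len(new_values), chunk_len):
        store["chunks"][next_key] = new_values[i:i + chunk_len]
        next_key += 1
    yield "chunks written"


def is_consistent(store):
    """Check that the declared shape matches the number of stored elements."""
    stored = sum(len(chunk) for chunk in store["chunks"].values())
    return stored == store["meta"]["shape"]


store = {"meta": {"shape": 4}, "chunks": {0: [1, 2], 1: [3, 4]}}
update = append_in_place(store, [5, 6], chunk_len=2)

next(update)                        # only the metadata has changed so far
mid_update = is_consistent(store)   # a reader arriving now sees a mismatch
next(update)                        # chunks are now in place
after_update = is_consistent(store)
```

A reader that arrives between the two steps sees a shape that promises more data than the store holds, which is the kind of inconsistency described above.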
Reproducibility in the cloud
NASA was looking to manage dynamic data for their earth science datasets in a cloud-optimized format and have a version control system for reproducibility. This would enable end users to use the data for scientific publications easily.
Adopted solution
Challenge solved: Ensuring data consistency
The team tested the performance of lakeFS on its data and use case and found that the solution allowed it to implement engineering best practices for managing data. This ensured data consistency across datasets for all users.
Challenge solved: Reproducibility in the cloud
By enabling Git-like operations such as branching, committing, and merging on top of data lakes, lakeFS provides NASA teams with reproducibility capabilities aligned with best practices and standards in cloud environments.
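How a Git-like model addresses both challenges can be sketched with a toy in-memory repository. This is not the lakeFS API or its implementation; `ToyRepo` and all of its methods are hypothetical, and the merge shown is a simple fast-forward. It only illustrates the idea that updates happen on an isolated branch, become visible in one step, and that every commit can be read back later by its ID.

```python
import hashlib
import json


class ToyRepo:
    """Toy model: branches are names pointing at immutable snapshots."""

    def __init__(self):
        self.commits = {}  # commit ID -> snapshot (dict of object key -> value)
        self.branches = {"main": self._store({})}

    def _store(self, snapshot):
        payload = json.dumps(snapshot, sort_keys=True).encode()
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        self.commits[commit_id] = dict(snapshot)
        return commit_id

    def read(self, ref):
        """Read a snapshot by branch name or by commit ID."""
        commit_id = self.branches.get(ref, ref)
        return dict(self.commits[commit_id])

    def create_branch(self, name, source):
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        snapshot = self.read(branch)
        snapshot.update(changes)
        self.branches[branch] = self._store(snapshot)
        return self.branches[branch]

    def merge(self, source, dest):
        # Fast-forward only: dest starts pointing at source's commit in one step.
        self.branches[dest] = self.branches[source]


repo = ToyRepo()
v1 = repo.commit("main", {"shape": 4, "chunk/0": [1, 2], "chunk/1": [3, 4]})

repo.create_branch("ingest", "main")
repo.commit("ingest", {"chunk/2": [5, 6]})  # partial update, isolated on the branch
during = repo.read("main")                  # readers of main still see version v1
repo.commit("ingest", {"shape": 6})
repo.merge("ingest", "main")                # the complete update appears at once

latest = repo.read("main")
pinned = repo.read(v1)                      # an analysis can cite and reread v1
```

Readers of `main` never observe the half-finished update, and a publication can record a commit ID such as `v1` to reproduce its results against exactly the data it used.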
Further Reading
GES DISC Collaborations with Open-Source Communities to Migrate Data Collections to Cloud-Friendly Zarr Stores
Read the full case study, authored by Dieu My Nguyen, on zenodo.org, and explore the GitHub project NASA-IMPACT/zarr-lakefs.