Case Study

How NASA Uses Cloud-Friendly Zarr Stores to Manage Dynamic Data with lakeFS

Author: Iddo Avneri

Iddo has a strong software development background. He started his...

Last updated on September 26, 2024

Company

NASA is a U.S. federal agency focused on space programs and exploration. Its Goddard Earth Sciences (GES) DISC team is working on GeoZarr, a cloud-native format for geospatial data, building on the Zarr format for efficient access to compressed n-dimensional arrays.

Problem

NASA needed a solution to ensure consistent data in a dynamic Zarr store and manage version control for earth science datasets in the cloud, enabling reproducibility and reliable use for scientific publications.

Solution

lakeFS helped NASA apply engineering best practices to data management, keeping datasets consistent for all users. It also brought Git-like version control to the cloud, giving NASA’s dynamic Zarr datasets reproducibility in line with cloud standards.

The company

Founded in 1958, the National Aeronautics and Space Administration (NASA) is an independent agency of the United States federal government responsible for civil space programs, aeronautics research, and space exploration.

The NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) is one of the NASA Science Mission Directorate's data centers. One of its teams is currently involved in developing the geospatial Zarr standard (GeoZarr).

Zarr is a cloud-native data format for n-dimensional arrays. It stores an array as individually compressed chunks, so a reader can fetch only the chunks it needs rather than the whole array. Zarr has become increasingly popular for geospatial applications and has been adopted as an OGC Community Standard.
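To make the chunking idea concrete, here is a minimal sketch using only the Python standard library. It is not the Zarr library or its API; the function names chunk_array and read_element are hypothetical, and a one-dimensional array stands in for the n-dimensional case. It shows an array stored as separately compressed chunks, where reading one element decompresses only the chunk that holds it.

```python
import struct
import zlib


def chunk_array(values, chunk_size):
    """Split a flat list of floats into fixed-size, individually compressed chunks."""
    chunks = {}
    for start in range(0, len(values), chunk_size):
        block = values[start:start + chunk_size]
        raw = struct.pack(f"<{len(block)}d", *block)   # little-endian float64
        chunks[start // chunk_size] = zlib.compress(raw)
    return chunks


def read_element(chunks, chunk_size, index):
    """Read one element by decompressing only the chunk that contains it."""
    raw = zlib.decompress(chunks[index // chunk_size])
    block = struct.unpack(f"<{len(raw) // 8}d", raw)
    return block[index % chunk_size]


values = [float(i) for i in range(1000)]
chunks = chunk_array(values, chunk_size=100)

print(len(chunks))                      # 10 chunk objects
print(read_element(chunks, 100, 250))   # 250.0, read from chunk 2 only
```

In an object store, each chunk would be a separate object, which is why partial reads of large arrays stay cheap.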

The geospatial Zarr Standard (GeoZarr) sets flexible and inclusive rules for the Zarr cloud-native format that match the different requirements of the geospatial domain. These conventions provide a clear and consistent framework for organizing and characterizing data, ensuring unambiguous representation.

The challenges

Ensuring data consistency

When a user read data while an update was in progress, they could receive inconsistent shape information or incorrect results from their computation code. The team was looking for a way to manage a dynamic Zarr store so that users always see consistent data.
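The failure mode can be illustrated with a small simulation. This is a sketch under stated assumptions, not NASA's pipeline: the store is a plain dictionary, the update writes data first and shape metadata second, and the observe hook is a hypothetical stand-in for a reader that arrives between those two steps.

```python
def append_rows(store, new_rows, observe=None):
    """Append rows in two steps: write chunk data, then update the shape metadata.

    `observe` is a hypothetical hook called between the two steps to stand in
    for a concurrent reader.
    """
    store["chunks"].extend(new_rows)                # step 1: new data lands
    if observe is not None:
        observe(store)                              # a reader arrives mid-update
    store["meta"]["shape"] = len(store["chunks"])   # step 2: metadata catches up


def read_snapshot(store):
    """Return (declared_length, actual_length) as a reader would see them."""
    return store["meta"]["shape"], len(store["chunks"])


store = {"meta": {"shape": 3}, "chunks": [1.0, 2.0, 3.0]}
seen = []
append_rows(store, [4.0, 5.0], observe=lambda s: seen.append(read_snapshot(s)))

mid_update = seen[0]
after_update = read_snapshot(store)
print(mid_update)     # (3, 5): declared shape lags behind the data
print(after_update)   # (5, 5): consistent again once the update finishes
```

A reader that trusts the declared shape in the middle state would silently ignore the new rows, and one that trusts the data would disagree with the metadata.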


Reproducibility in the cloud

NASA wanted to manage dynamic earth science datasets in a cloud-optimized format under a version control system that provides reproducibility. This would make it easy for end users to rely on the data in scientific publications.


Adopted solution


Challenge solved: Ensuring data consistency

The team tested the performance of lakeFS against its own data and use case and found that the solution made it possible to apply engineering best practices to data management. This keeps datasets consistent for all users.


Challenge solved: Reproducibility in the cloud 

By enabling Git-like operations (such as branching, committing, and merging) on top of data lakes, lakeFS gives NASA teams reproducibility capabilities that align with best practices and standards in cloud environments.
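The Git-like model can be sketched as a toy in-memory repository. This is an illustration only and not the lakeFS API: the ToyRepo class and its methods are hypothetical, commits are identified by a content hash, and merge is a simple fast-forward. It shows two properties discussed above: an update staged on a branch is invisible to readers of main until it is merged, and a recorded commit ID keeps returning the same data afterwards.

```python
import hashlib
import json


class ToyRepo:
    """Hypothetical in-memory model of Git-like versioning for a data store."""

    def __init__(self, initial):
        self.commits = {}     # commit id -> immutable serialized snapshot
        self.branches = {}    # branch name -> commit id
        self.branches["main"] = self._commit(initial)

    def _commit(self, data):
        payload = json.dumps(data, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits[commit_id] = payload
        return commit_id

    def branch(self, name, source="main"):
        self.branches[name] = self.branches[source]

    def commit(self, branch, data):
        self.branches[branch] = self._commit(data)
        return self.branches[branch]

    def merge(self, source, target="main"):
        # One pointer swap: readers see the old snapshot or the new one, never a mix.
        self.branches[target] = self.branches[source]

    def read(self, ref):
        # `ref` may be a branch name or a commit id.
        commit_id = self.branches.get(ref, ref)
        return json.loads(self.commits[commit_id])


repo = ToyRepo({"shape": 3, "chunks": [1.0, 2.0, 3.0]})
published = repo.branches["main"]          # commit id cited in a publication

repo.branch("ingest")
repo.commit("ingest", {"shape": 5, "chunks": [1.0, 2.0, 3.0, 4.0, 5.0]})
before_merge = repo.read("main")           # still the 3-element snapshot
repo.merge("ingest")
after_merge = repo.read("main")            # now the 5-element snapshot
pinned = repo.read(published)              # the cited version is unchanged
```

Because readers only ever resolve a branch to a complete snapshot, the partially updated state never becomes visible on main, and pinning a commit ID gives a stable reference for later reanalysis.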


Further Reading


GES DISC Collaborations with Open-Source Communities to Migrate Data Collections to Cloud-Friendly Zarr Stores

Read the full case study, authored by Dieu My Nguyen, on zenodo.org, and explore the GitHub project NASA-IMPACT/zarr-lakefs.
