How the Department of Energy Ensures Proper Data Governance for AI Model Development
The Idaho National Laboratory (INL) is a Department of Energy-funded laboratory that was searching for an efficient data storage pathway for the open-source data lake behind Project Alexandria.
By integrating storage technologies like Delta Lake and lakeFS, which operate at massive scale, Project Alexandria established consistent data management practices and improved data accessibility, security, and long-term stewardship.
This case study is a summary of the talk presented at Data+AI Summit 2024:
Project Alexandria: A Digital Library for Research Data
The organization
The Idaho National Laboratory (INL) is a Department of Energy-funded laboratory with over 6,000 employees. INL was searching for an efficient data storage pathway for its open-source data lake developed for Project Alexandria.
Project Alexandria is a data management platform built for the modular management and search of project and venture data across the DOE-funded research space. The platform is the result of a collaboration among twelve national laboratories in the Department of Energy, with INL leading the effort.
The challenges
Big data requirements
To power Project Alexandria, INL examined previous approaches to storing large volumes of scientific data. Much of this data was time series data, which is structured data with potentially millions of rows and thousands of columns.
Previously, the team used TimescaleDB on PostgreSQL. However, PostgreSQL limits tables to 1,600 columns, which made it unsuitable for such wide time series data. INL then came across Delta Lake, an open-source storage layer originally developed by Databricks, which promised to match the project’s unique requirements.
Data management and versioning
Inconsistent data management practices and reliance on file structures to hold metadata made data retrieval and analysis difficult.
The team behind Project Alexandria also needed a data versioning solution that could support the project’s scale, spanning multi-gigabyte or terabyte files.
Another important challenge for the team was providing a staging area for data creators, so that new data that didn’t meet their data quality requirements would never be accepted into the catalog.
Adopted solution
Challenge solved: Big data requirements
Delta Lake allows the team to carry out ACID transactions on Parquet files in object storage. INL found it a useful tool for storing and manipulating large volumes of data with more structure, with features like time travel built in.
Challenge solved: Data management and versioning
lakeFS allows the team to version and manage data as if they were working with source code in a GitHub repository, and it scales quickly to a practically unlimited number of files.
In his presentation, John Darrington said: “lakeFS gives you the ability to branch, roll back, and curate your data while at the same time serving as the source of truth. It just works very well and very fast.”
The team only catalogs the data on the main branch of a lakeFS repository. Before data reaches main, it must pass an additional review step: a data curator or steward determines whether the data is ready to be cataloged or whether another action, such as normalization, is necessary.
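The staging-and-promotion gate above can be sketched as a small, stdlib-only promotion function. The names and the quality check are hypothetical illustrations, not INL’s actual rules; in Project Alexandria the staging area is a lakeFS branch and promotion is a merge into main.

```python
# Stdlib-only sketch of a curation gate. All names are hypothetical;
# the real workflow uses lakeFS branches and merges.

def is_normalized(record: dict) -> bool:
    """Toy quality check: every record must carry a unit and a timestamp."""
    return "unit" in record and "timestamp" in record

def promote(staging: list, main: list) -> list:
    """Move records that pass the check into main; return the rejects."""
    rejected = [r for r in staging if not is_normalized(r)]
    main.extend(r for r in staging if is_normalized(r))
    return rejected

main_branch = []
staging_branch = [
    {"sensor": "a", "value": 1.0, "unit": "K", "timestamp": "2024-06-01T00:00:00Z"},
    {"sensor": "b", "value": 2.0},  # missing metadata -> needs normalization first
]

rejected = promote(staging_branch, main_branch)
print(len(main_branch), len(rejected))  # 1 1
```

The point of the gate is that main only ever contains records a curator has accepted; rejected records stay on the staging branch until they are fixed.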
The core platforms of Project Alexandria include:
- Ingest Platform: Developed by Idaho National Laboratory for data and metadata collection. It enforces metadata collection and writes data to Delta Lake or JSON, depending on the need, supporting object storage such as S3, Azure Blob, or on-premises deployments.
- lakeFS: A version control system for data that allows teams to branch, roll back, and curate data. It integrates with the Ingest Platform to provide a staging area for data.
- DataHub: A metadata catalog platform originally developed by LinkedIn. It provides an efficient catalog for easier data access.
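The Ingest Platform’s metadata enforcement, described in the list above, might look like the following stdlib-only sketch. The required field names and the JSON output path are assumptions for illustration, not INL’s actual schema; the real platform can also write Delta Lake tables.

```python
# Sketch of an ingest step that refuses payloads with incomplete metadata.
# Field names are assumed, not Project Alexandria's actual schema.
import json

REQUIRED_METADATA = {"project", "creator", "created_at"}  # assumed fields

def ingest(payload: dict, metadata: dict) -> str:
    """Serialize payload + metadata to JSON, rejecting incomplete metadata."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"metadata incomplete: {sorted(missing)}")
    return json.dumps({"metadata": metadata, "data": payload})

record = ingest(
    {"temperature_k": 293.15},
    {"project": "alexandria-demo", "creator": "jdoe", "created_at": "2024-06-01"},
)
```

Rejecting records at ingest, rather than after cataloging, is what keeps the downstream catalog searchable: every record in DataHub is guaranteed to carry the metadata needed to find it.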
Results
By integrating advanced data storage solutions and enforcing consistent data management practices, Project Alexandria improves data accessibility and security across different projects and ventures and ensures long-term data stewardship.
Project Alexandria aims to allow projects and ventures to focus on data collection while handling long-term storage and management. Since technologies such as Delta Lake and lakeFS can operate on a massive scale, they can continue supporting Project Alexandria in the future.
In his presentation, John Darrington said:
“I remember sitting in a presentation for lakeFS and thinking to myself, ‘Hey, this is really neat.’ They explained that lakeFS is like this GitHub for data. You get the ability to branch, roll back, and curate your data while at the same time serving as the source of truth. Throw this on top of object storage, and suddenly you can version and work with this data as if you’re working with source code, even with the multi-gigabyte and terabyte files you’re dealing with. lakeFS just works very well and very fast.”
For the full presentation, watch the Data+AI Summit 2024 talk, Project Alexandria: A Digital Library for Research Data.