How the Department of Energy Ensures Proper Data Governance for AI Model Development
The Idaho National Laboratory (INL) is a Department of Energy-funded laboratory that was searching for an efficient data storage pathway for the open-source data lake behind Project Alexandria.
By integrating storage technologies like Delta Lake and lakeFS, which operate at massive scale, Project Alexandria established consistent data management practices and improved data accessibility, security, and long-term stewardship.
This case study is a summary of the talk presented at Data+AI Summit 2024:
Project Alexandria: A Digital Library for Research Data
The organization
The Idaho National Laboratory (INL) is a Department of Energy-funded laboratory with over 6,000 employees. INL was searching for an efficient data storage pathway for its open-source data lake developed for Project Alexandria.
Project Alexandria is a data management platform built for the modular management and search of project and venture data across the DOE-funded research space. The platform is the result of a collaboration among twelve national laboratories in the Department of Energy, with INL leading the effort.
The challenges
Big data requirements
To power Project Alexandria, INL examined previous approaches to storing large volumes of scientific data. Much of this data was time series data, which is structured data with potentially millions of rows and thousands of columns.
Previously, the team used TimescaleDB on PostgreSQL. However, PostgreSQL limits tables to 1,600 columns, which made it unsuitable for such wide time series data. INL then came across Delta Lake, an open-source storage layer originally developed by Databricks, which promised to match the project’s unique requirements.
Data management and versioning
Inconsistent data management practices and reliance on file structures to hold metadata made data retrieval and analysis difficult.
The team behind Project Alexandria also needed a data versioning solution that could support the project’s scale, spanning multi-gigabyte or terabyte files.
Another important challenge for the team was providing a staging area for data creators, so that new data that didn’t meet their data quality requirements would never be accepted into the catalog.
Adopted solution
Challenge solved: Big data requirements
Delta Lake allows the team to carry out ACID transactions on Parquet files in object storage. INL found it a useful tool for storing and manipulating large volumes of data with more structure, with features like time travel built in.
Challenge solved: Data management and versioning
lakeFS allows the team to version and manage data as if they were working with source code in a GitHub repository, and it scales quickly to a practically unlimited number of files.
In his presentation, John Darrington said: “lakeFS gives you the ability to branch, roll back, and curate your data while at the same time serving as the source of truth. It just works very well and very fast.”
The team only catalogs the data on the main branch of a lakeFS repository. Before data reaches main, it must pass an additional review step: a data curator or steward determines whether the data is ready to be cataloged or whether another action, such as normalization, is necessary.
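The staging-and-promotion gate above can be sketched as a small, stdlib-only promotion function. The names and the quality check are hypothetical illustrations, not INL’s actual rules; in Project Alexandria the staging area is a lakeFS branch and promotion is a merge into main.

```python
# Stdlib-only sketch of a curation gate. All names are hypothetical;
# the real workflow uses lakeFS branches and merges.

def is_normalized(record: dict) -> bool:
    """Toy quality check: every record must carry a unit and a timestamp."""
    return "unit" in record and "timestamp" in record

def promote(staging: list, main: list) -> list:
    """Move records that pass the check into main; return the rejects."""
    rejected = [r for r in staging if not is_normalized(r)]
    main.extend(r for r in staging if is_normalized(r))
    return rejected

main_branch = []
staging_branch = [
    {"sensor": "a", "value": 1.0, "unit": "K", "timestamp": "2024-06-01T00:00:00Z"},
    {"sensor": "b", "value": 2.0},  # missing metadata -> needs normalization first
]

rejected = promote(staging_branch, main_branch)
print(len(main_branch), len(rejected))  # 1 1
```

The point of the gate is that main only ever contains records a curator has accepted; rejected records stay on the staging branch until they are fixed.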
The core platforms of Project Alexandria include:
- Ingest Platform: Developed by Idaho National Laboratory for data and metadata collection. It enforces metadata collection and writes data to Delta Lake or JSON, depending on the need, supporting object storage such as S3, Azure Blob, or on-premises deployments.
- lakeFS: A version control system for data that allows teams to branch, roll back, and curate data. It integrates with the Ingest Platform to provide a staging area for data.
- DataHub: A metadata catalog platform originally developed by LinkedIn. It provides an efficient catalog for easier data access.
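The Ingest Platform’s metadata enforcement, described in the list above, might look like the following stdlib-only sketch. The required field names and the JSON output path are assumptions for illustration, not INL’s actual schema; the real platform can also write Delta Lake tables.

```python
# Sketch of an ingest step that refuses payloads with incomplete metadata.
# Field names are assumed, not Project Alexandria's actual schema.
import json

REQUIRED_METADATA = {"project", "creator", "created_at"}  # assumed fields

def ingest(payload: dict, metadata: dict) -> str:
    """Serialize payload + metadata to JSON, rejecting incomplete metadata."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        raise ValueError(f"metadata incomplete: {sorted(missing)}")
    return json.dumps({"metadata": metadata, "data": payload})

record = ingest(
    {"temperature_k": 293.15},
    {"project": "alexandria-demo", "creator": "jdoe", "created_at": "2024-06-01"},
)
```

Rejecting records at ingest, rather than after cataloging, is what keeps the downstream catalog searchable: every record in DataHub is guaranteed to carry the metadata needed to find it.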
Results
By integrating advanced data storage solutions and enforcing consistent data management practices, Project Alexandria improves data accessibility and security across different projects and ventures and ensures long-term data stewardship.
Project Alexandria aims to allow projects and ventures to focus on data collection while handling long-term storage and management. Since technologies such as Delta Lake and lakeFS can operate on a massive scale, they can continue supporting Project Alexandria in the future.
In his presentation, John Darrington said:
“I remember sitting in a presentation for lakeFS and thinking to myself, ‘Hey, this is really neat.’ They explained that lakeFS is like this GitHub for data. You get the ability to branch, roll back, and curate your data while at the same time serving as the source of truth. Throw this on top of object storage, and suddenly you can version and work with this data as if you’re working with source code, even with the multi-gigabyte and terabyte files you’re dealing with. lakeFS just works very well and very fast.”
For the full presentation, watch the Data+AI Summit 2024 talk, Project Alexandria: A Digital Library for Research Data.