Data Version Control for Hugging Face Datasets

Idan Novogroder

Last updated on May 6, 2024

Home > Blog > Data Version Control for Hugging Face Datasets

Hugging Face Datasets (???? Datasets) is a library that allows easy access and sharing of datasets for audio, computer vision, and natural language processing (NLP).

It takes only a single line of code to load a dataset and then use Hugging Face’s advanced data processing algorithms to prepare it for deep learning model training.

Data versioning is a key capability for teams working with these types of datasets. And this is where lakeFS comes in.

lakeFS lets you process huge datasets with zero-copy reads and no memory limits to achieve maximum speed and efficiency. It also has a strong interaction with the Hugging Face Hub, which allows you to simply import and exchange datasets with the larger machine learning community.

Keep reading to learn more about how to add data version control to your Hugging Face Datasets using lakeFS.

Hugging Face Datasets: What are they?

Hugging Face, a company specializing in natural language processing technology, created a collection called Hugging Face Datasets. These are pre-processed and ready-to-use datasets for different NLP, computer vision, and audio applications.

The library aims to make it easier for teams to get and modify datasets, allowing researchers and developers to experiment with alternative models and compare their performance. It offers a consistent interface for accessing a wide range of information, including text categorization, machine translation, question-answering, summarization, and more.

Hugging Face Datasets provides massive datasets from various sources, including academic research, popular benchmark projects, and real-world applications. These datasets have been carefully vetted, processed, and standardized to guarantee consistency and use. The package also includes utilities for data preparation, splitting, shuffling, and obtaining extra resources, such as pre-trained models.

The Hugging Face Datasets library works well with other popular NLP libraries, such as Hugging Face Transformers, allowing you to combine datasets with cutting-edge NLP models seamlessly.

Why use data version control for Hugging Face Datasets?

Reproducibility

Data changes rapidly. This makes maintaining an accurate record of its current condition over time challenging. Teams often keep only one state of their data: the present state.

Frequent data changes make it difficult to debug a data problem, validate machine learning training accuracy (when re-running a model on different data yields different outputs), or comply with data audits.

Data lake best practices call for reproducibility – a capability that allows us to time travel between distinct versions of the data. This, in turn, lets us take snapshots of the data at various periods and under varied conditions.

Exposing a Git-like interface to data enables tracking of more than just the present status of the data. It facilitates actions like branching and committing over big datasets. The result is repeatable, atomic, and versioned data lake activities, which lead to improved data management.

Parallel experimentation

ML practitioners face problems managing the expanding complexity of ML models and the ever-increasing volume of data. Efficient data management and version control are increasingly important for successful machine learning operations.

This is especially true for parallel ML, which entails running experiments in parallel with different parameters (for example, using different optimizers or epochs).

Version control solutions like lakeFS can boost your ML experiments and streamline the development pipeline. For example, by combining lakeFS with MinIO, a high-speed object storage solution, you can fully realize the promise of parallel ML without sacrificing performance or scalability.

Collaboration

One of the most difficult aspects of working with a large number of people on a single project is version control—managing the many contributions your group makes to shared working documents.

Your contributors may be located worldwide or in the same room, working concurrently or asynchronously. No matter how your organization is structured, the efforts of multiple contributors must be combined into a single project.

Version control regulates this process by storing a history of modifications and who made them. It allows you to reverse or go back to older versions of documents and see how various contributors’ contributions have altered the project over time. This is why data versioning is so important for fostering good collaboration practices in a team.

How to use lakeFS with Hugging Face Datasets

The ???? Datasets library supports access to cloud storage providers via fsspec FileSystem implementations.

lakefs-spec is a community implementation of an fsspec Filesystem to help users fully leverage lakeFS data version control capabilities.

Here’s a step-by-step guide to using lakeFS with Hugging Face Datasets.

1. Installation

Copy Code

pip install lakefs-spec

2. Configuration

If you already configured the lakeFS Python SDK and/or lakectl, you should get a $HOME/.lakectl.yaml file. This file contains your access credentials and endpoint for your lakeFS environment.

Otherwise, proceed with installing lakectl and run lakectl config to set up your access credentials.

3. Reading a dataset

To read a dataset, all we have to do is use a lakefs://... URI when calling load_dataset:

Copy Code

>>> from datasets import load_dataset
>>> 
>>> dataset = load_dataset('csv', data_files='lakefs://example-repository/my-branch/data/example.csv')

That’s it! This should automatically load the lakefs-spec implementation you just installed. It will use the $HOME/.lakectl.yaml file to read its credentials, so there’s no need to pass additional configuration.

4. Saving and loading

Once you have loaded a dataset, you can now save it using the save_to_disk method as usual:

Copy Code

>>> dataset.save_to_disk('lakefs://example-repository/my-branch/datasets/example/')

At this point, you might want to commit that change to lakeFS and tag it. That way, you can easily share it with your colleagues.

You can do it through the UI or lakectl, but let’s do it with the lakeFS Python SDK:

Copy Code

>>> import lakefs
>>>
>>> repo = lakefs.repository('example-repository')
>>> commit = repo.branch('my-branch').commit(
...     'saved my first huggingface Dataset!',
...     metadata={'using': '????'})
>>> repo.tag('alice_experiment1').create(commit)

Now, other people in your team can load this exact dataset by using the tag:

Copy Code

>>> from datasets import load_from_disk
>>>
>>> dataset = load_from_disk('lakefs://example-repository/alice_experiment1/datasets/example/')

Conclusion

Data versioning is an essential capability for teams dealing with Hugging Face Datasets. lakeFS allows you to process large datasets with zero-copy reads and no memory limitations, resulting in maximum speed and efficiency. It also has a strong connection to the Hugging Face Hub, allowing you to easily import and exchange datasets with the greater machine learning community.

Check out this documentation page to learn more about Hugging Face integration with lakeFS.