Hugging Face acquired XetHub to build an internal data version control system. XetHub is a collaborative development platform created in 2021 by former Apple researchers to improve the efficiency of machine learning teams working with huge datasets and models.
The platform provides Git-like version management for repositories up to terabytes in size, facilitating team collaboration, change tracking, and reproducibility in machine learning workflows. Thanks to its ability to handle the scaling demands of continuously growing tools, files, and artifacts, XetHub attracted a sizable user base in just three years, including well-known names like Tableau and Gather AI.
With the acquisition, XetHub’s data and model handling capabilities will be folded into the Hugging Face Hub, giving the model- and dataset-sharing platform a better storage and versioning backend.
By upgrading its storage backend with XetHub’s technology, Hugging Face aims to let developers host larger models and datasets than they can today, with less effort.
This raises two questions:
- How will Hugging Face Datasets benefit from this acquisition?
- Why does a company like Hugging Face need data versioning capabilities?
Keep reading to find out.
What Is Hugging Face Datasets?
Hugging Face specializes in natural language processing technologies. To this end, the company created the Hugging Face Datasets library, which provides pre-processed datasets ready for use in a variety of NLP, computer vision, and audio applications.
Hugging Face Datasets compiles enormous datasets from a variety of sources, including academic research, popular benchmark projects, and real-world applications. These datasets have been thoroughly verified, processed, and standardized to ensure consistency and usability. The library also offers utilities for data preparation, splitting, shuffling, and accessing other resources, such as pre-trained models.
The library aims to make it easier for teams to access and manipulate datasets, allowing researchers and developers to test different models and evaluate their performance. It provides a uniform interface for datasets across many tasks, such as text classification, machine translation, question answering, summarization, and more.
The Hugging Face Datasets library is compatible with other popular NLP libraries, such as Hugging Face Transformers, allowing you to mix datasets with cutting-edge NLP models effortlessly.
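For example, here is a minimal sketch of that workflow, assuming the public imdb dataset and the bert-base-uncased checkpoint:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load a public dataset from the Hugging Face Hub (IMDB used here as an example).
dataset = load_dataset("imdb", split="train")

# Pair it with a pre-trained tokenizer from the Transformers library.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the whole split with a batched map call, the idiomatic Datasets pattern.
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

print(tokenized[0]["input_ids"][:10])
```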
Why Is Data Version Control Important For Hugging Face?
Here’s what the CTO of Hugging Face shared in a LinkedIn post:
“I am super excited to announce that we’ve acquired XetHub!
XetHub has developed technologies to enable Git to scale to TeraByte-size repositories.
Under the hood they’ve been adding file chunking and deduplication inside Git.
This will help us unlock the next 5 years of growth of HF datasets and models by switching to our own, better version of LFS as storage backend for the Hub’s repos.
In the announcement blogpost (read it here: https://lnkd.in/e-jxSeCf), I also shared a few cool stats about where the Hugging Face Hub is today:
• number of repos: 1.3m models, 450k datasets, 680k spaces
• total cumulative size: 12PB stored in LFS (280M files) / 7.3 TB stored in git (non-LFS)
• Hub’s daily number of requests: 1B
• daily Cloudfront bandwidth: 6PB”
The last part of the post points to the sheer scale of Hugging Face’s operation.
The Hugging Face Hub has been using Git LFS (Large File Storage) as its storage backend. That system was bound to hit its limits eventually, as the AI ecosystem keeps producing ever-larger volumes of data.
That’s because Git LFS caps individual files at 5 GB and repositories at 10 GB. In contrast, the XetHub platform supports individual files larger than 1 TB, with overall repository sizes reaching well over 100 TB. As a result, the HF Hub will be able to hold even bigger models, files, and datasets than it can today.
Furthermore, the extra storage and transfer capabilities XetHub brings will make the Hub more efficient to use. For example, thanks to the platform’s content-defined chunking and deduplication features, users updating a dataset will be able to upload only the chunks containing the new rows instead of re-uploading the whole set of files (which takes a lot of time). The same applies to model repositories.
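To get an intuition for why that works, here is an illustrative Python sketch of content-defined chunking and deduplication (not XetHub’s actual implementation): chunk boundaries are derived from a fingerprint of a sliding window over the content itself, so appending a few rows only produces a handful of new chunks, while everything before the edit deduplicates away.

```python
import hashlib
import random

def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3FF):
    """Yield content-defined chunks: cut wherever a fingerprint of the trailing
    `window` bytes matches a bit pattern (average chunk of roughly mask+1 bytes)."""
    start = 0
    for i in range(window, len(data)):
        digest = hashlib.blake2b(data[i - window:i], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:   # boundary condition
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]                               # final partial chunk

def dedup_store(chunks):
    """Keep one copy of each unique chunk, keyed by its content hash."""
    return {hashlib.sha256(c).hexdigest(): c for c in chunks}

random.seed(0)
original = bytes(random.getrandbits(8) for _ in range(200_000))
updated = original + bytes(random.getrandbits(8) for _ in range(2_000))  # small append

store = dedup_store(chunk_boundaries(original))
new_chunks = {
    h: c for h, c in dedup_store(chunk_boundaries(updated)).items() if h not in store
}
print(f"chunks in original: {len(store)}, chunks to upload after update: {len(new_chunks)}")
```

Because the boundaries depend only on local content, the unchanged prefix of the file produces exactly the same chunks as before, and only the chunks around the appended rows need to be transferred.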
Your Use Case Is Likely Similar If Your Data Operations Are Large Enough
Here are the benefits of data version control to teams working with massive datasets:
Reproducibility
Frequent data changes make it difficult to fix a data problem, evaluate machine learning training accuracy (when a model is re-run on changing data, the outputs differ), or comply with data audits.
Data lake best practices advocate for reproducibility, which lets us time travel between different versions of the data and capture snapshots of it over time and under different settings.
Exposing a Git-like interface to data allows for tracking of more than just the current state of the data. It makes it easier to branch and commit across large datasets. The end result is repeatable, atomic, and versioned data lake actions, which lead to better data management.
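On the Hugging Face side, the Datasets library already supports a simple form of this: you can pin a dataset to a specific revision of its Hub repository, so a training run can later be repeated on exactly the same snapshot. The dataset name and revision below are placeholders:

```python
from datasets import load_dataset

# Pin the dataset to a specific commit (or tag/branch) of its Hub repository,
# so re-running the experiment later loads exactly the same data snapshot.
# "username/my-dataset" and the revision hash are placeholders.
dataset = load_dataset(
    "username/my-dataset",
    split="train",
    revision="a1b2c3d4",  # commit hash, tag, or branch name on the Hub
)
```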
Parallel experimentation
ML practitioners encounter challenges in handling the growing complexity of ML models and the ever-increasing volume of data. Effective data management and version control are becoming increasingly critical for successful machine learning operations.
This is especially true for parallel ML, which involves performing tests with various parameters simultaneously. Version control tools can improve your ML experiments and ease the development process.
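As a rough sketch of what that can look like in practice, the snippet below runs two hypothetical experiments in parallel, each pinned to a different revision (or branch) of the same Hub dataset. The dataset name, branch names, and training step are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from datasets import load_dataset

# Hypothetical experiment configs: each pins the same dataset to a different
# branch or revision, so runs can proceed in parallel without interfering.
EXPERIMENTS = {
    "baseline": "main",
    "cleaned-labels": "experiment/clean-labels",  # placeholder branch name
}

def run_experiment(name: str, revision: str) -> str:
    data = load_dataset("username/my-dataset", split="train", revision=revision)
    # ... train and evaluate a model on `data` here ...
    return f"{name}: trained on {len(data)} rows from revision '{revision}'"

with ThreadPoolExecutor() as pool:
    for result in pool.map(lambda kv: run_experiment(*kv), EXPERIMENTS.items()):
        print(result)
```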
Collaboration
One of the most difficult aspects of working with a large team on a single project is managing the numerous modifications the group makes to shared working documents – and that is exactly the job of version control.
Your contributors might be situated anywhere on the globe or in the same room, working simultaneously or asynchronously. No matter how your company is organized, the work of many contributors must be merged into a single project.
Version control governs this process by keeping track of changes and who made them. It allows you to revert to previous versions of documents and explore how different contributions have changed the project over time. This is why data versioning is critical for establishing effective team communication.
Data Version Control Is Required With Any Data Lake At Scale
Data versioning is an essential capability for teams working with Hugging Face datasets – or with any other kind of massive data volumes.
The open-source data version control solution lakeFS enables you to process big datasets with zero-copy reads and no memory constraints, resulting in maximum speed and efficiency. It also has a strong integration with Hugging Face, which allows you to quickly import and exchange datasets with the larger machine learning community.
The lakeFS integration with Hugging Face enables more powerful version control for datasets and model training pipelines, giving teams the ability to confidently reproduce results, verify experiments, and maintain consistency across different environments or iterations. Furthermore, the integration boosts team collaboration: a shared, versioned dataset and model history allows teams to work on the same data and model versions with full traceability, ensuring consistent and reproducible results.
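One common way to wire this up – shown here as a sketch with placeholder names and credentials – is to read data straight from a lakeFS branch through its S3-compatible endpoint, using the storage_options argument of load_dataset. The repository, branch, endpoint, and keys below are assumptions for illustration, and the s3fs package is required:

```python
from datasets import load_dataset

# lakeFS exposes an S3-compatible endpoint, so a repository/branch/path maps to
# an s3:// URI. All names and credentials below are placeholders.
lakefs_storage_options = {
    "key": "AKIA...",                 # lakeFS access key (placeholder)
    "secret": "...",                  # lakeFS secret key (placeholder)
    "client_kwargs": {"endpoint_url": "https://lakefs.example.com"},
}

# Load a Parquet file from the "experiment-1" branch of the "ml-data" repository.
dataset = load_dataset(
    "parquet",
    data_files="s3://ml-data/experiment-1/datasets/train.parquet",
    storage_options=lakefs_storage_options,
    split="train",
)
```

Because the branch name is part of the path, switching an experiment to a different version of the data is just a matter of pointing at a different branch or commit, with no data copied.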


