

Nadav Steindler

Nadav Steindler is a software engineer specializing in high-performance backend...

Published on April 10, 2025

Training a Large Language Model (LLM) such as ChatGPT or DeepSeek is a complicated, data-intensive process. Because the discipline is young, best practices and toolchains for LLM training are still taking shape. As a highly scalable data version control system, lakeFS has a lot to contribute to this field. In this article, we will go step by step through the major phases of LLM training and show where lakeFS can add value.

Data Collection & Preprocessing

LLM training begins with the most data-intensive phase, in which large swathes of the internet are indexed, downloaded, and serialized. Web crawlers discover webpages, and web scrapers download their content into data lakes. The raw data from this process is petabytes in size, but filtering, cleaning, and compressing reduce it by an order of magnitude, to hundreds of terabytes. This data is then serialized into a token format optimized for the subsequent steps of LLM training, and it should be retained throughout the lifetime of the model to satisfy legal and regulatory requirements.
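To make the filtering step concrete, here is a minimal sketch of a cleaning pass over scraped pages. It drops pages that are too short to be useful and exact duplicates; real pipelines add language detection, quality scoring, near-duplicate detection, and PII removal, and the function name and threshold here are illustrative only.

```python
import hashlib

def preprocess(pages, min_length=200):
    """Toy cleaning pass: drop short pages and exact duplicates.
    A real pipeline would also tokenize the surviving text."""
    seen = set()
    cleaned = []
    for page in pages:
        text = page.strip()
        if len(text) < min_length:
            continue  # too short to be useful training text
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier page
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```

Even this naive exact-duplicate filter illustrates why the post-cleaning dataset is so much smaller than the raw crawl: the web is full of mirrored and boilerplate content.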

How can we effectively manage a huge, dynamic dataset derived from a constantly changing resource like the internet? One option is to store a separate copy of the data for each snapshot. This approach duplicates huge quantities of identical data and can easily drive monthly storage costs into the millions of dollars.

lakeFS offers a more efficient approach to versioning large datasets. Different versions of the data are represented as commits and branches, while the backing data is not duplicated unless it changes. Smart hashing algorithms allow the system to quickly identify whether a particular object has changed. For LLM development, this reduces storage costs without deleting the training data of active models.
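The idea behind hash-based deduplication can be sketched in a few lines. This toy content-addressed store keeps one copy of each unique object and lets every snapshot reference it by hash; it illustrates the copy-on-write principle described above, not the actual lakeFS implementation.

```python
import hashlib

class SnapshotStore:
    """Toy content-addressed store: snapshots map paths to content
    hashes, and identical content is stored exactly once."""
    def __init__(self):
        self.blobs = {}      # hash -> bytes, shared across all snapshots
        self.snapshots = []  # list of {path: hash} manifests

    def commit(self, files):
        manifest = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # store only if new
            manifest[path] = digest
        self.snapshots.append(manifest)
        return len(self.snapshots) - 1           # snapshot id
```

Committing a second snapshot in which only one file changed adds only that one new object to storage; everything unchanged is shared between the two snapshots.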

Pre-Training

The second step of LLM training is called Pre-Training, where a generic model is trained on the serialized internet data from the previous step. This is the most financially costly step, as it requires heavy processing on large data and can take tens of millions of GPU-hours to complete. The resulting model can generate original text but is not trained for a particular task; it is merely the base model that serves as a starting point for subsequent training.

As with other machine learning data version control use cases, the best practice in lakeFS is to store the model adjacent to its data. This makes it easy to track which version of the data a particular model was trained on, throughout the change history.
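One simple way to keep model and data linked is a training manifest committed alongside the model artifact. The sketch below records a model's checksum together with the identifier of the data commit it was trained on; the field names are illustrative, not a lakeFS schema.

```python
import hashlib
import json

def model_manifest(model_bytes, data_commit_id, params):
    """Toy training manifest: records which data commit a model was
    trained on, so the pair can be stored side by side."""
    return json.dumps({
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "data_commit": data_commit_id,
        "hyperparameters": params,
    }, sort_keys=True)
```

With such a manifest versioned next to the weights, any model in the history can be traced back to the exact data snapshot that produced it.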

Fine-Tuning

Fine-tuning involves additional training on top of a base model to turn it into a truly useful tool, e.g. an AI assistant, code generator, or research tool. This is an iterative process of training on different specialized datasets and evaluating the results. It is a more intricate process than pre-training, but it does not incur the same extensive hardware costs. Today, many data science teams do only fine-tuning, starting from a generic base model and specializing it for a particular use case.

The lakeFS version control model of commits and branches is well suited to this sort of non-linear analytical process. Different fine-tuning experiments can be tried, each in its own branch, and the most successful one can be merged back into the mainline. lakeFS also supports rollback: if problems are discovered in a particular model late in the training process, the full training history is available, the model can be rolled back to just before the problematic training step, and the rest of the training can be replayed from that point on.
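The branch-per-experiment workflow described above can be sketched with a toy in-memory history. This is a deliberately simplified model of branching, merging, and rollback to show the shape of the workflow; it is not how lakeFS implements these operations (lakeFS exposes them through its API, SDKs, and the `lakectl` CLI).

```python
class ExperimentHistory:
    """Toy per-branch linear history: commit, branch off, merge back,
    and roll back to an earlier point."""
    def __init__(self):
        self.branches = {"main": []}  # branch name -> list of commit labels

    def branch(self, name, source="main"):
        self.branches[name] = list(self.branches[source])  # cheap copy of history

    def commit(self, branch, label):
        self.branches[branch].append(label)

    def merge(self, source, dest="main"):
        self.branches[dest] = list(self.branches[source])  # fast-forward only

    def rollback(self, branch, n):
        self.branches[branch] = self.branches[branch][:n]  # keep first n commits
```

An unsuccessful experiment simply leaves its branch unmerged, while a rollback discards the problematic commits so training can be replayed from the last good state.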

Evaluation & Safety Checks

This final step of model development tests a model against performance benchmarks and compliance requirements before release. Failures may require remediation via additional fine-tuning or a rollback to some prior step.

LLM training is still a relatively new field. Currently, evaluation is primarily done on model outputs; however, it seems likely this will expand to include evaluation of a model's complete data lineage, to ensure compliance with data and licensing rules. For example, a model's data lineage could be checked for the use of personal data (GDPR), private health data (HIPAA), race or gender data (ECOA), or simply data that must be licensed before use. This sort of evaluation will become mandatory as LLM applications spread to regulated industries.
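A lineage audit of this kind could be as simple as scanning the tagged sources in a model's data history for restricted categories. The sketch below assumes each training source carries a list of tags; the tag names and function are hypothetical, meant only to show the shape of such a check.

```python
def lineage_violations(lineage, flagged_tags=frozenset({"pii", "phi", "unlicensed"})):
    """Toy lineage audit: given (source, tags) records for a model's
    training data, return the sources carrying restricted tags."""
    return sorted(src for src, tags in lineage if flagged_tags & set(tags))
```

Run against a versioned data history, a check like this could gate a release: a model whose lineage contains flagged sources fails the compliance step before deployment.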

Conclusion

The field of LLMs is a new one, and the toolchains are still being developed. As a highly-scalable data version control system, lakeFS has the potential to contribute to this toolchain with solutions for experimentation, compliance, and storage cost savings.
