From Zero to Versioned Data in Spark

Guy Hardonag

Last updated on February 5, 2026



This tutorial gives you a fast start with lakeFS and shows how to use its Git-like operations from Spark. It covers the following:

A quick start for installing lakeFS with Docker Compose.
How to create a repository, add files to it, create a branch, and make changes to the repository using Spark jobs.
How to review changes before exposing them to consumers by merging to master.
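To make the branch idea concrete before the walkthrough, here is a minimal sketch of how a Spark job might address the same data on two branches. It assumes Spark reaches lakeFS through its S3-compatible endpoint, where an object is addressed as s3a://<repository>/<branch>/<path>. The helper lakefs_uri, the repository name example-repo, the branch name my-experiment, and the events/ prefix are all hypothetical names chosen for illustration.

```python
def lakefs_uri(repo: str, branch: str, path: str = "") -> str:
    """Build an s3a:// URI for `path` on `branch` of `repo`.

    Assumes the lakeFS S3-compatible addressing scheme:
    s3a://<repository>/<branch>/<path>
    """
    # Sanity check for this sketch only: keep the first two URI
    # segments unambiguous by rejecting empty names and slashes.
    for name, value in (("repo", repo), ("branch", branch)):
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"s3a://{repo}/{branch}/{path.lstrip('/')}"


REPO = "example-repo"  # hypothetical repository name

# The same logical dataset, seen from two branches.
master_events = lakefs_uri(REPO, "master", "events/")
branch_events = lakefs_uri(REPO, "my-experiment", "events/")

# With a SparkSession already configured for the lakeFS endpoint
# (configuration not shown here), a job could read from master and
# write its changes to the isolated branch:
#   df = spark.read.parquet(master_events)
#   df.write.mode("overwrite").parquet(branch_events)
```

Because only the branch segment of the path changes, a job can be pointed at an isolated branch without touching the rest of its logic, and consumers reading from master see nothing until the branch is merged.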

This simple flow gives a sneak peek at how seamless and easy it is to make changes to data using lakeFS. Once you see the value of a resilient data flow, you can map it to many use cases within your data architecture, from validating writes of raw data to providing a safety net for your ETL pipelines or your ML (or other algorithmic logic) pipelines. You can pull the trigger: your master data lake is safe.

For more detailed information, check out our documentation.