In the world of data management and data version control, understanding the relationships between different versions of your data is crucial.
Just like in software development, where version control systems like Git help developers track changes in their codebase, data versioning tools such as lakeFS are indispensable for tracking changes in data lakes and object storage systems.
One of the key features that sets lakeFS apart is its ability to visualize the commit graph, providing a clear and insightful representation of how your data evolves over time.
In this post, we’ll explore how you can harness the power of the command line to visualize your lakeFS commit graph and gain valuable insights into your data.
What is a commit graph and how to use it
In the context of data versioning, a commit graph is a representation of the historical changes made to your data. It shows how different versions of data objects are related to one another via a series of commits.
Before we dive into the technical details, let’s discuss why visualizing a commit graph is so important. Visualizing this graph can help you with:
- Tracking data changes: Understand when and how data changes occurred, making it easier to pinpoint issues or analyze trends.
- Debugging: Quickly identify when and where data inconsistencies or errors were introduced.
- Auditing: Trace data lineage for compliance and auditing purposes.
- Collaboration: Enhance collaboration among data professionals by providing a clear visual representation of data changes.
lakectl – the lakeFS command line client
To interact with lakeFS from the command line, we use lakectl, the official command line client for lakeFS.
lakectl provides a wide range of functionalities for managing and exploring your lakeFS repositories.
For our commit graph visualization, we will focus on using the lakectl log command.
Using lakectl to list lakeFS commits
Let’s start by setting up lakectl and lakeFS. If you already have a running lakeFS server, skip to “using lakectl log”!
Run lakeFS locally
docker run --name lakefs --pull always \
  --rm --publish 8000:8000 \
  treeverse/lakefs:latest \
  run --quickstart
This will start a local lakeFS server, listening on http://localhost:8000.
💡Tip: If you don’t have Docker installed locally, or wish to use a service instead, you can start a trial on lakeFS Cloud for free!
Once running, lakeFS prints its startup details in your terminal, including an Access Key ID and a Secret Access Key.
Note them down! We’ll use them soon to set up lakectl.
On a Mac, you can install lakectl from the command line (if you are using Homebrew):
brew tap treeverse/lakefs
brew install lakefs
Otherwise, simply download it by following the installation instructions in the official lakeFS documentation.
Once we have lakectl installed, we need to tell it how to connect to our running lakeFS server.
For that, we’ll use the lakectl config command and enter the Access Key ID and Secret Access Key we wrote down in the steps above.
Let’s see that it works by running lakectl repo list.
This command prints out the list of available repositories on our lakeFS installation. Since this is a brand new installation of lakeFS, we don’t have any repositories yet! Let’s create one.
Creating a sample repository
Let’s do this step from the lakeFS UI by visiting http://localhost:8000 (our lakeFS server URL). We’ll have to log in with the credentials we used above.
Once signed in, we should be greeted with a welcome screen that offers to create a repository with sample data.
Click the button and 🤩! You have successfully created your first lakeFS repository.
Let’s validate this by running lakectl repo list again.
Great, our repository is ready for us to use!
Using lakectl log
Now that we have a lakeFS installation, a repository, and a working lakectl client, let’s explore the lakectl log command.
Passing a reference URI (branch, tag, or commit identifier) to lakectl log should show all commits leading up to the one we passed:
lakectl log lakefs://quickstart/main
In this case, we see two commits on our main branch.
The first one is a special commit that exists in all lakeFS repositories, the creation commit.
Since we chose to create a repository with sample data, we see that 1 second after the creation of our repository, another commit was created, adding sample data.
--dot Flag to Output the Log as a Graph
Now, here’s where the magic happens. You can visualize the commit graph by adding the --dot flag to the lakectl log command:
lakectl log --dot lakefs://quickstart/main > commit_graph.dot
As you can see, this is exactly the same as the command above, but with the addition of the --dot flag. We redirect the output to a file because it is in Graphviz DOT format, which is meant to be rendered rather than read in the terminal.
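If Graphviz is installed locally, the file can also be rendered without leaving the terminal. The following is a minimal sketch, assuming the Graphviz dot tool is on your PATH. The stand-in graph it writes when commit_graph.dot is missing is hypothetical and is not the exact DOT that lakectl emits; it exists only so the snippet runs on its own.

```shell
# Render a DOT file to SVG with the Graphviz `dot` tool, if it is installed.
dot_file="commit_graph.dot"

# Stand-in graph so the snippet runs even without lakectl output.
# Hypothetical content: NOT the exact DOT that lakectl emits.
if [ ! -f "$dot_file" ]; then
  cat > "$dot_file" <<'EOF'
digraph {
  "commit_b" -> "commit_a";
}
EOF
fi

if command -v dot >/dev/null 2>&1; then
  # -Tsvg selects SVG output; -o names the output file.
  dot -Tsvg "$dot_file" -o commit_graph.svg
  echo "wrote commit_graph.svg"
else
  echo "Graphviz 'dot' not found; install Graphviz or use an online viewer"
fi
```

The resulting commit_graph.svg can be opened in any browser. Swapping -Tsvg for -Tpng produces a PNG instead.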
Viewing the DOT graph in your browser
To view the commit graph, you need a DOT file viewer. There are various options available, including online tools and desktop applications. One popular choice is the Graphviz online visualizer.
Load your commit_graph.dot file into this tool, and you’ll see an interactive visualization of your lakeFS commit graph.
You can zoom, pan, and explore the graph to gain insights into your data history.
Cool! That’s a nice graphical visualization. But kind of pointless for such a simple graph. Let’s create a more complex example to simulate how a real-world data lake would evolve:
# Create a zero-clone branch of our data lake
$ lakectl branch create lakefs://quickstart/my-etl --source lakefs://quickstart/main

# Make changes on our branch
$ touch example_empty_file.txt
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/example/1.txt
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/example/2.txt
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/other/1.txt

# Commit said changes
$ lakectl commit -m "generated 3 empty text files!" lakefs://quickstart/my-etl

# Let's make another change
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/example/3.txt
$ lakectl commit -m "another update to example dataset" lakefs://quickstart/my-etl

# Merge all changes, atomically, into our main branch
$ lakectl merge lakefs://quickstart/my-etl lakefs://quickstart/main
We should now have a slightly more complex commit graph. Let’s see it:
lakectl log --dot lakefs://quickstart/main > commit_graph.dot
Viewing the generated graph gives us a nice overview of how our data evolves over time.
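The shape of the history we just built can also be sketched by hand in DOT. The file below is a hand-written illustration, not the output of lakectl: the node names and the merge label are made up, and drawing edges from each commit to its parent is an assumption chosen for readability.

```shell
# Hand-written sketch of the history built above (NOT lakectl output).
# Node names are hypothetical; edges point from a commit to its parent.
cat > history_sketch.dot <<'EOF'
digraph history {
  rankdir=LR;
  creation [label="repository created"];
  sample   [label="added sample data"];
  etl1     [label="generated 3 empty text files!"];
  etl2     [label="another update to example dataset"];
  merged   [label="merge my-etl into main"];

  sample -> creation;
  etl1   -> sample;
  etl2   -> etl1;
  merged -> sample;
  merged -> etl2;
}
EOF

# Count the parent links in the sketch.
edge_count="$(grep -c -- '->' history_sketch.dot)"
echo "edges: $edge_count"
```

The merge commit is the only node with two parents: the tip of main before the merge, and the last commit on my-etl.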
In a real-world scenario, we typically want to look at how a specific table, collection or file changes, not the whole data lake.
Let’s filter this down by looking only at changes that were made to the datasets/example/ prefix:
lakectl log --dot --prefixes datasets/example/ lakefs://quickstart/main > commit_graph.dot
As we can see, this is a partial graph: it only displays commits that modified that specific location. This is very useful when trying to understand how, when, and by whom a dataset or table was changed.
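To compare several locations, the same command can be run once per prefix. This is a rough sketch rather than an official workflow: the dot_name_for_prefix helper and the output file names are hypothetical, the two prefixes are the ones we uploaded to earlier, and the lakectl call is skipped when lakectl is not installed.

```shell
# Hypothetical helper: turn a prefix like "datasets/example/" into a file name.
dot_name_for_prefix() {
  # Strip the trailing slash, then replace remaining slashes with underscores.
  trimmed="${1%/}"
  printf 'commit_graph_%s.dot\n' "$(printf '%s' "$trimmed" | tr '/' '_')"
}

# Write one filtered commit graph per prefix.
for prefix in datasets/example/ datasets/other/; do
  out="$(dot_name_for_prefix "$prefix")"
  if command -v lakectl >/dev/null 2>&1; then
    lakectl log --dot --prefixes "$prefix" lakefs://quickstart/main > "$out"
  else
    echo "lakectl not found; would write $out"
  fi
done
```

Each resulting file can be viewed the same way as commit_graph.dot.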
A nice property of this commit graph is that the nodes representing commits are clickable! Clicking a node opens the relevant commit in the lakeFS UI.
Visualizing the commit graph of your lakeFS repository from the command line is a powerful way to gain deeper insights into your data versioning.
With the lakectl command line client, you can easily list commits and generate DOT format files for visualization.
And by understanding how your data evolves over time, you can improve data quality, streamline collaboration, and ensure compliance.
To learn more about using lakectl and lakeFS in general, a good next step would be to complete the lakeFS Quickstart tutorial. You’ve already done the first step!