lakeFS Community
Oz Katz Author

Oz Katz is the CTO and Co-founder of lakeFS, an...

Last updated on October 26, 2023

In the world of data management and data version control, understanding the relationships between different versions of your data is crucial. 

Just like in software development, where version control systems like Git help developers track changes in their codebase, data versioning tools such as lakeFS are indispensable for tracking changes in data lakes and object storage systems.

One of the key features that sets lakeFS apart is its ability to visualize the commit graph, providing a clear and insightful representation of how your data evolves over time. 

In this post, we’ll explore how you can harness the power of the command line to visualize your lakeFS commit graph and gain valuable insights into your data.

What is a commit graph and how to use it 

In the context of data versioning, a commit graph is a representation of the historical changes made to your data. It shows how different versions of data objects are related to one another via a series of commits.

Before we dive into the technical details, let’s discuss why visualizing a commit graph is so important. Visualizing this graph helps with:

  • Tracking data changes: Understand when and how data changes occurred, making it easier to pinpoint issues or analyze trends.
  • Debugging: Quickly identify when and where data inconsistencies or errors were introduced.
  • Auditing: Trace data lineage for compliance and auditing purposes.
  • Collaboration: Enhance collaboration among data professionals by providing a clear visual representation of data changes.

lakectl – the lakeFS command line client

To interact with lakeFS from the command line, we use lakectl, the official command line client for lakeFS. lakectl provides a wide range of functionalities for managing and exploring your lakeFS repositories. 

For our commit graph visualization, we will focus on using the lakectl log command.

Using lakectl to list lakeFS commits

Let’s start by setting up lakectl and lakeFS. If you already have a running lakeFS server, skip ahead to “Using lakectl log”!

Run lakeFS locally

docker run --name lakefs --pull always \
             --rm --publish 8000:8000 \
             treeverse/lakefs:latest \
             run --quickstart

This will start a local lakeFS server, listening on http://localhost:8000.

💡Tip: If you don’t have Docker installed locally, or wish to use a service instead, you can start a trial on lakeFS Cloud for free!

Once running, you should see something like this in your terminal:

Docker run: lakeFS running in Quickstart mode

Note the Access Key ID and Secret Access Key there! We’ll use them soon to set up lakectl.

Install lakectl

On a Mac, if you use Homebrew, you can install lakectl from the command line:

brew tap treeverse/lakefs
brew install lakefs

Otherwise, download it by following the instructions in the official lakeFS documentation.

Configuring lakectl

Once we have lakectl installed, we need to tell it how to connect to our running lakeFS server.

For that, we’ll use the lakectl config command and use the Access Key ID and Secret Access Key we wrote down in the steps above:

Configuring lakectl on Mac
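
lakectl config asks for these values interactively. As a non-interactive sketch, the same settings can live in a YAML configuration file, which lakectl reads from ~/.lakectl.yaml by default. The snippet below writes such a file to a scratch path. The key values are placeholders rather than real credentials, the file name lakectl.example.yaml is hypothetical, and the /api/v1 suffix on the endpoint is an assumption based on a default local setup.

```shell
# Sketch: write a lakectl configuration file by hand.
# Placeholder values: replace them with the keys printed by your lakeFS server.
CONFIG_FILE="${CONFIG_FILE:-./lakectl.example.yaml}"

cat > "$CONFIG_FILE" <<'EOF'
credentials:
  access_key_id: YOUR_ACCESS_KEY_ID
  secret_access_key: YOUR_SECRET_ACCESS_KEY
server:
  endpoint_url: http://localhost:8000/api/v1
EOF

# The file holds secrets, so restrict it to the current user
chmod 600 "$CONFIG_FILE"

echo "Wrote $CONFIG_FILE"
```

To use it, either copy the file to ~/.lakectl.yaml or point lakectl at it with the --config flag.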

Let’s see that it works by running lakectl repo list:

Running lakectl repo list

This command prints out the list of available repositories on our lakeFS installation. Since this is a brand new installation of lakeFS, we don’t have any repositories yet! Let’s create one.

Creating a sample repository

Let’s do this step from the lakeFS UI by visiting http://localhost:8000 (our lakeFS server URL). We’ll have to log in with the credentials we used above.

Once signed in, we should be greeted with a screen that looks like this:

Welcome screen lakeFS UI

Click the button to create the sample repository and 🤩! You have successfully created your first lakeFS repository.

Let’s validate this through lakectl repo list again:

Validate repository through lakectl repo list

Great, our repository is ready for us to use!

Using lakectl log

Now that we have a lakeFS installation, a repository, and a working lakectl client, let’s explore lakectl log!

Passing a reference URI (branch, tag, or commit identifier) to lakectl log shows all commits leading up to that reference. For example, lakectl log lakefs://quickstart/main lists the commits on the main branch of our quickstart repository.

In this case, we see two commits on our main branch.

The first one is a special commit that exists in all lakeFS repositories, the creation commit. 

Since we chose to create a repository with sample data, we see that 1 second after the creation of our repository, another commit was created, adding sample data.

Using the --dot Flag to Output the Log as a Graph

Now, here’s where the magic happens. You can visualize the commit graph by adding the --dot flag to the lakectl log command:

lakectl log --dot lakefs://quickstart/main > commit_graph.dot

As you can see, this is exactly the same as the command above, but with the addition of the --dot flag. We’re redirecting the output to a file since it is in Graphviz DOT format.

Viewing the DOT graph in your browser

To view the commit graph, you need a DOT file viewer. There are various options available, including online tools and desktop applications. One popular choice is the Graphviz online visualizer.

Upload the commit_graph.dot file to this tool, and you’ll see an interactive visualization of your lakeFS commit graph. 
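
If you would rather not upload the file, a local Graphviz installation can render it too. The sketch below assumes the Graphviz dot command-line tool is installed and skips rendering otherwise. To stay self-contained, it creates a small hand-written stand-in graph (sample_graph.dot, a hypothetical file that is not real lakectl output) when no input file exists.

```shell
# Sketch: render a DOT file to SVG locally with the Graphviz `dot` tool.
# The sample graph below is a hand-written stand-in, NOT real lakectl output.
DOT_FILE="${DOT_FILE:-sample_graph.dot}"

if [ ! -f "$DOT_FILE" ]; then
cat > "$DOT_FILE" <<'EOF'
digraph sample {
    rankdir = "BT";
    "commit-2" -> "commit-1";
}
EOF
fi

if command -v dot >/dev/null 2>&1; then
    # -Tsvg selects SVG output; -o names the output file
    dot -Tsvg "$DOT_FILE" -o "${DOT_FILE%.dot}.svg"
    echo "Rendered ${DOT_FILE%.dot}.svg"
else
    echo "Graphviz is not installed; skipping render"
fi
```

Setting DOT_FILE=commit_graph.dot before running the snippet renders the file generated by lakectl instead, and the resulting SVG opens in any browser.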

You can zoom, pan, and explore the graph to gain insights into your data history.

Cool! That’s a nice graphical visualization, but it’s kind of pointless for such a simple graph. Let’s create a more complex example to simulate how a real-world data lake would evolve:

# Create a zero-copy branch of our data lake
$ lakectl branch create lakefs://quickstart/my-etl --source lakefs://quickstart/main

# Make changes on our branch
$ touch example_empty_file.txt
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/example/1.txt
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/example/2.txt
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/other/1.txt

# Commit said changes
$ lakectl commit -m "generated 3 empty text files!" lakefs://quickstart/my-etl

# Let's make another change
$ lakectl fs upload --source example_empty_file.txt lakefs://quickstart/my-etl/datasets/example/3.txt
$ lakectl commit -m "another update to example dataset" lakefs://quickstart/my-etl

# Merge all changes, atomically, into our main branch
$ lakectl merge lakefs://quickstart/my-etl lakefs://quickstart/main

We should now have a slightly more complex commit graph. Let’s see it:

lakectl log --dot lakefs://quickstart/main > commit_graph.dot

Viewing the generated graph gives us a nice overview of how our data evolves over time.

In a real-world scenario, we typically want to look at how a specific table, collection or file changes, not the whole data lake. 

Let’s filter this down by looking only at changes that were made to the datasets/example/ collection:

lakectl log --dot --prefixes datasets/example/ lakefs://quickstart/main > commit_graph.dot

As we can see, this is a partial graph: it only displays commits that modified that specific location. This is very useful when trying to understand how, when, and by whom a dataset or table was changed.

A nice property of this commit graph is that the nodes representing commits are clickable! Clicking a node opens the relevant commit in the lakeFS UI.
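
For the curious, Graphviz supports hyperlinks through node attributes such as URL, and SVG output preserves them. The sketch below shows the general mechanism with a hand-written node. The commit ID, the UI path in the URL, and the file name clickable_sample.dot are all hypothetical, and the exact attributes lakectl emits may differ.

```shell
# Sketch: a hand-written DOT node carrying a hyperlink.
# The commit ID and URL are hypothetical, NOT real lakectl output.
cat > clickable_sample.dot <<'EOF'
digraph sample {
    "abc123" [label="abc123", URL="http://localhost:8000/repositories/quickstart/commits/abc123"];
}
EOF

# SVG output keeps the link, so the node is clickable in a browser
if command -v dot >/dev/null 2>&1; then
    dot -Tsvg clickable_sample.dot -o clickable_sample.svg
    echo "Rendered clickable_sample.svg"
else
    echo "Graphviz is not installed; skipping render"
fi
```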

Wrap up

Visualizing the commit graph of your lakeFS repository from the command line is a powerful way to gain deeper insights into your data versioning. 

With the lakectl command line client, you can easily list commits and generate DOT format files for visualization.

And by understanding how your data evolves over time, you can improve data quality, streamline collaboration, and ensure compliance. 

To learn more about using lakectl and lakeFS in general, a good next step would be to complete the lakeFS Quickstart tutorial. You’ve already done the first step!
