lakeFS Transactions: Maintain Data Integrity Using ACID Principles

Nir Ozeri

Last updated on June 10, 2024


We recently introduced the new High Level Python SDK, which provides a friendlier interface for interacting with lakeFS, as part of our ongoing effort to make life simpler for data professionals.

In this article, we will introduce you to a cool new addition to the High Level SDK: Transactions!

Read on to learn what lakeFS transactions are and how they can help us make data transformations much easier while making sure we adhere to the ACID principles.

What Are Transactions

In the world of databases, a transaction is a unit of work that consists of one or more operations, such as inserting, updating, or deleting records. The key feature of a transaction is its atomicity, meaning that all the operations within one transaction are treated as a single, indivisible unit. Transactions follow the ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring that operations are reliable and maintain data integrity.

Atomicity guarantees that either all the operations within a transaction are successfully completed, or none of them are, preventing partial updates that could lead to inconsistent states. Consistency ensures that the database remains in a valid state before and after a transaction. Isolation ensures that transactions are executed independently of each other, preventing interference. Durability ensures that once a transaction is committed, its changes are permanent and survive system failures. These properties collectively contribute to the reliability and integrity of database systems.
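To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the transfer function, and the 100-unit limit are all made up for illustration; the point is only that a failure between the two updates leaves no partial change behind.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one atomic unit of work."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        # Simulated failure between the two updates (illustrative rule)
        if amount > 100:
            raise ValueError("transfer limit exceeded")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 500), ("bob", 0)]
)
conn.commit()

transfer(conn, "alice", "bob", 50)  # both updates are applied
try:
    transfer(conn, "alice", "bob", 200)  # fails midway, nothing is applied
except ValueError:
    pass

print(balances(conn))  # alice keeps 450, bob keeps 50
```

The second transfer had already debited the first account when it failed, yet the rollback restored it. This is the behavior we want to reproduce on a data lake.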

lakeFS Transactions

While lakeFS doesn’t directly implement traditional database transactions, it introduces a versioning system and branching mechanisms to manage changes in the data lake, which allows simulation of transactional concepts.

In the context of lakeFS, we can think of a “transaction” as a series of changes made to our data lake, such as adding, updating, or deleting objects. The versioning mechanism allows us to track these changes over time and, if needed, revert to a previous state, providing a level of consistency and traceability.

In practice, we can simulate transactions in lakeFS by branching out from our source branch, performing all the operations required as part of the transaction, and, when done, merging our side branch back into the source branch.

The following diagram demonstrates a transactional flow in lakeFS:

Diagram demonstrating transactional flow in lakeFS

Let’s break down the lakeFS steps which simulate a transaction:

1. Create a side branch from the source branch

This allows us to change the data freely while maintaining the integrity of the source branch (Consistency).

2. Perform object operations on the side branch

All the required data transformations are performed here, away from the source branch (Isolation).

3. Commit the changes

For changes to persist, we must commit them to the side branch (Durability).

4. Merge the changes back to the source branch

The merge operation ensures that all the changes are introduced to the source branch simultaneously (Atomicity). In case of failure, the source branch will not be affected.

5. Cleanup: delete the side branch

An optional (though recommended) step that keeps our data lake (and lakeFS) tidy!
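The steps above can be sketched by hand with the High Level SDK's branch operations. This is only an illustration of the flow, not how the SDK implements transactions: run_as_transaction and its parameters are hypothetical names, and repo is assumed to behave like the object returned by lakefs.repository(...).

```python
def run_as_transaction(repo, source_name, tx_name, work, message):
    """Branch out, apply `work`, commit, merge back, and clean up.

    `work` is a caller-supplied function that receives the side branch
    and performs the object operations on it.
    """
    source = repo.branch(source_name)
    # 1. Create a side branch from the source branch
    tx_branch = repo.branch(tx_name).create(source_reference=source_name)
    try:
        # 2. Perform object operations on the side branch
        work(tx_branch)
        # 3. Commit the changes to the side branch
        tx_branch.commit(message=message)
        # 4. Merge the changes back to the source branch
        tx_branch.merge_into(source)
    finally:
        # 5. Cleanup. If anything above failed, nothing was merged,
        #    so the source branch is left untouched.
        tx_branch.delete()
```

With the quickstart repository, this could be called as run_as_transaction(lakefs.repository("quickstart"), "main", "my-tx", work, "My change"), where "my-tx" is an arbitrary branch name and work is something like lambda b: b.object("my_test").upload(data="This is a test"). Note that this sketch always deletes the side branch; keeping it around after a failure can be useful for debugging.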

Rollback

In lakeFS terms, rolling back a transaction simply means that we don’t merge the side branch back into the source branch (don’t forget to delete the side branch).

Isolation Level

lakeFS’s High-Level SDK transaction mechanism closely emulates the “Read Committed” isolation level found in traditional database systems. In this mode, a transaction reads only committed data, safeguarding against dirty reads; this is done by utilizing lakeFS’s version control mechanism (branching and merging).

However, it’s important to note that this isolation level does not guarantee a “Repeatable Read”. Throughout the transaction’s duration, other actors may modify and commit data to the branch, so reading the same data twice may return different results. Such changes can also cause the transaction to fail: if they lead to a merge conflict when we try to finalize the transaction, the merge will not complete.

Transaction Logs

Transaction logs record all transactions and the database modifications made by each transaction. They are crucial for system recovery in case of failures, enabling rollback and rollforward operations, providing an audit trail for compliance, facilitating point-in-time recovery, and aiding in performance tuning.

In order to implement this mechanism in lakeFS we had to… do nothing!

Luckily, lakeFS is a data version control engine, and that’s exactly what it does.

Each committed transaction is represented by a commit in our branch.

If we’d like to see the modifications made by a transaction, we can look at the changes introduced by that commit.

To roll back the transaction, simply revert the corresponding commit on the branch.
An audit trail is also available.
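As a small sketch of reading this "transaction log" programmatically, the helper below walks a branch's commit log. recent_commits is a hypothetical name, and branch is assumed to behave like a High Level SDK branch object, whose log() yields commits carrying an id, a committer, and a message.

```python
def recent_commits(branch, limit=5):
    """Return (short id, committer, message) for the newest commits."""
    return [
        (commit.id[:8], commit.committer, commit.message)
        for commit in branch.log(max_amount=limit)
    ]
```

For example, recent_commits(lakefs.repository("quickstart").branch("main")) would list the latest commits on main, including any commits produced by transactions.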

Using lakeFS Transactions

Let’s create a setup for a quick example of transactions, using lakeFS quickstart and the High Level SDK:

Prerequisites

Before installing the new Python SDK, make sure you have the following configured:

A relatively recent version of Python (>= 3.9).
A running lakeFS installation, version 1.0 or later. To bring up a local lakeFS environment, follow the quickstart guide.
The PIL library (provided by the Pillow package), used in the basic example.

Installing the new SDK

In your favorite terminal, IDE or notebook, run the following command:


pip install lakefs

Configuring a lakeFS Client

For a quick configuration of the lakeFS client, we’ll export the following environment variables:


export LAKECTL_SERVER_ENDPOINT_URL=http://127.0.0.1:8000/
export LAKECTL_CREDENTIALS_ACCESS_KEY_ID=AKIAIOSFOLQUICKSTART
export LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

See the documentation for additional configuration options.
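Before running the examples, it can help to confirm that the three variables are actually visible to the Python process. The check below is a hypothetical convenience helper and not part of the SDK; since other configuration options exist, a missing variable is not necessarily an error.

```python
import os

# The three variables exported above
REQUIRED_VARS = (
    "LAKECTL_SERVER_ENDPOINT_URL",
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
)


def missing_lakefs_settings(env=None):
    """Return the names of the variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_lakefs_settings()
if missing:
    print("Not set in this environment:", ", ".join(missing))
```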

Using lakeFS Transactions – Example

In this example we will use our quickstart repository and PIL, the Python Imaging Library, to perform some transformations on the data located under the images prefix and write the manipulated data back to our lakeFS repository.
To make sure that the transformation occurs atomically, we’ll use transactions.


import lakefs
from PIL import Image

prod_branch = lakefs.repository("quickstart").branch("main")
with prod_branch.transact(commit_message="Transforming images") as tx:
    for obj_info in tx.objects(prefix="images/"):
        obj = tx.object(obj_info.path)
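        # Read the image from lakeFS and apply some transformations
        with obj.reader() as reader:
            image = Image.open(reader)
            # Remember the source format: the transformed image no longer
            # carries it, and save() may not be able to infer it from the stream
            image_format = image.format
            image = image.transpose(Image.FLIP_LEFT_RIGHT)  # Flip horizontally
            image = image.convert('L')  # Grayscale

        # Save the image in its original format and upload it to lakeFS
        with obj.writer() as writer:
            image.save(writer, format=image_format)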

After the transaction completes, the commit log shows two new commits on our main branch: the first is the commit of the changes made on the transaction branch, and the second is the merge commit from the transaction branch into our main branch.

Looking at a file from the images prefix before and after the transaction, we can see an example of the committed changes.

Images showing the same file before and after the transformation

Now, let’s see what happens when an error occurs during a transaction:


import lakefs

main = lakefs.repository("quickstart").branch("main")
with main.transact(commit_message="Upload and fail") as tx:
    tx.object("my_test").upload("This is a test")
    raise ValueError("Something bad happened")

In the example above we create a transaction, upload a new file, and expect the changes to be merged into our main branch. Unfortunately, an exception is raised during the transaction.
Looking at the commit log after running the code, we can see that the main branch was left unaffected.

Great! Our production branch is protected from partial changes.

For failure analysis purposes, we do not automatically delete the transaction branch on failure, so it remains in the repository's branch list until we remove it.
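Once a failed transaction has been analyzed, its branch can be removed. Here is a minimal sketch for finding and deleting such leftovers. The helper names are hypothetical, and the naming prefix of transaction branches is an assumption you should verify against the branch list of your own repository before deleting anything.

```python
def leftover_branches(repo, prefix):
    """Return the ids of branches whose name starts with `prefix`."""
    return [branch.id for branch in repo.branches(prefix=prefix)]


def delete_branches(repo, branch_ids):
    """Delete the given branches once they are no longer needed."""
    for branch_id in branch_ids:
        repo.branch(branch_id).delete()
```

Listing first and deleting in a separate step leaves room to review the branch names (and inspect their contents) before anything is removed.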

In Summary

Transactions are a foundational element in databases. Having been a cornerstone of databases for quite some time, the concept has been adopted in most data domains as a way to uphold ACID principles. Data version control and lakeFS are no different: while not explicitly called out, transactional behavior is an inseparable part of lakeFS.

The recent integration of explicit transaction functionality into lakeFS’s High Level SDK fortifies this relationship, empowering data professionals to uphold best practices while seamlessly managing and versioning their data.

Transactions are available in the High Level Python SDK, from version 0.2.0 onwards.