
Oz Katz (CTO and Co-founder of lakeFS) and Nir Ozeri (Software Engineer at lakeFS)

Last updated on April 26, 2024

Since its inception, lakeFS shipped with a full featured Python SDK. For each new version of lakeFS, this SDK is automatically generated, relying on the OpenAPI specification published by the given version.

While this always ensured the Python SDK shipped with all possible features, the automatically generated code wasn’t always the nicest (or most Pythonic) to work with:

  1. Credentials must be explicitly configured without support for e.g. environment variables or configuration files
  2. No IO abstractions – users had to implement their own logic to support pre-signed URLs, convert between types to allow using the retrieved content with popular libraries that expect file-like objects, etc.
  3. No built-in pagination, requiring manual offset management and server fetching
  4. Unintuitive API structure at times – request and response fields were mapped to auto-generated Python classes

With the recent release of lakeFS 1.0, we felt it’s time to improve that. 

We worked hard over the last few months, building a much nicer abstraction layer on top of the generated code: one that is simpler to use, ties better into the rest of the Python data ecosystem – and, perhaps more importantly, makes common tasks much easier to accomplish.

In this blog post, I’ll cover some of the nice quality-of-life improvements offered by the latest SDK, accompanied by useful code examples for each one.

I urge you to follow the examples and give them a try yourself!

Prerequisites

Before installing the latest Python SDK, make sure you have the following configured:

  1. A relatively recent version of Python (>= 3.9)
  2. A running lakeFS installation (version >= 1.0)

💡Tip: You can try lakeFS easily, without installing anything, at https://lakefs.cloud/

Installing the new SDK

Installation is simpler than ever. In your favorite terminal, IDE or notebook, run the following command:
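
pip install lakefs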

This simple command will install the latest available SDK version with all required dependencies (spoiler: there aren’t many!).

Configuring a lakeFS Client

The new SDK will try to automatically configure the endpoint and credentials to use. Once imported, it will try to discover these, in order, from the following places:

  1. Any configuration passed explicitly by the user when initializing a client
  2. Otherwise, check for a local lakectl configuration file (~/.lakectl.yaml) and use its configuration
  3. Otherwise, check for the LAKECTL_SERVER_ENDPOINT_URL, LAKECTL_CREDENTIALS_ACCESS_KEY_ID and LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY environment variables

This means that if the environment variables are set, or ~/.lakectl.yaml exists, no configuration code is needed; the default, automatically initialized client can be used:
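
import lakefs

# with environment variables or ~/.lakectl.yaml in place, module-level
# calls simply use the default client, for example:
for repo in lakefs.repositories():
    print(repo.id)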

Explicitly setting configuration values

You can also create a lakefs.Client object with explicit connection details:

from lakefs.client import Client

client = Client(host="https://<org_id>.lakefscloud.io", ...)

Jump to “Using a custom client with subresources” below to see how to use a custom client.

Using the lakeFS Python SDK

The new SDK follows a resource-based approach that lends itself to lakeFS’s hierarchical model: objects are read from and written to version references (commits, branches, tags); references are part of repositories; and repositories are tied to a lakeFS installation. Let’s see this in practice:

Listing lakeFS repositories

import lakefs

for repo in lakefs.repositories():
    print(repo.id)

All SDK methods that list resources can optionally accept pagination/query parameters (max_amount, prefix, after). For example:

import lakefs

for repo in lakefs.repositories(prefix='test-repo-', max_amount=5):
    print(repo.id)
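
The after parameter lets a listing resume past a given key. A small sketch, with a hypothetical repository ID:

import lakefs

# resume listing from repositories whose IDs sort after 'test-repo-5'
for repo in lakefs.repositories(after='test-repo-5'):
    print(repo.id)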

💡Tip: You don’t need to manually paginate through results! All listing operations are Python generators that will retrieve more results as they are consumed behind the scenes. Had enough results? Simply break out of the for-loop.
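
For example, here is a quick sketch of early termination; pagination happens lazily behind the scenes:

import lakefs

for i, repo in enumerate(lakefs.repositories()):
    print(repo.id)
    if i == 2:
        break  # stop early; no further pages are fetched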

Selecting a specific repository to work with

import lakefs

repo = lakefs.repository('example-repo')

Creating a repository if it doesn’t exist yet

To create a new repository, call the Repository.create() method. By default, if the repository already exists, we’ll get an error when trying to create it again. Passing exist_ok=True will return the existing repository instead of creating one if the repository ID already exists:

import lakefs

repo = lakefs.repository('example-repo').create(
    storage_namespace='s3://my-s3-bucket/example-repo/', 
    exist_ok=True)

Creating & listing branches and tags

Listing branches works using the same patterns as above: we need a Repository object, from which we can call the branches() or tags() method:

import lakefs

repo = lakefs.repository('example-repo')

# create a new branch from "main"
repo.branch('dev').create(source_reference_id='main')  # source_reference_id could also be a commit ID or a tag

# list all branches in the repository
for branch in repo.branches():
    print(branch.id)

# same idea for tags
for tag in repo.tags():
    print(tag.id)

# we can also pass optional query parameters when listing:
for tag in repo.tags(prefix='dev-'):
    print(tag.id)

# or with a list comprehension
tag_ids = [tag.id for tag in repo.tags()]

Reading a reference’s commit log

This works by calling the log() method on any Reference (or subclass such as Branch and Tag):

import lakefs

branch = lakefs.repository('example-repo').branch('main')

# Read the latest 10 commits
for commit in branch.log(max_amount=10):
    print(commit.message)

Listing objects

import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('main')

# Passing a delimiter will return lakefs.CommonPrefix and lakefs.ObjectInfo objects
for entry in branch.objects(delimiter='/', prefix='my_directory/'):
    print(entry.path)

# to list recursively, omit the delimiter. 
# Listing will return only lakefs.ObjectInfo objects
for entry in branch.objects(max_amount=100):
    print(f'{entry.path} (size: {entry.size_bytes:,})')

# let's calculate the size of a directory!
total_bytes = sum(entry.size_bytes for entry in branch.objects(prefix='my_directory/'))

Reading objects

One of the biggest benefits of the new SDK is how it handles object I/O. The new SDK exposes a reader and writer that are Pythonic file-like objects, immediately usable by the vast majority of libraries that read, parse or write data.

Behind the scenes, these readers and writers are efficient; they will probe the lakeFS server to understand if pre-signed URLs are supported, and if so, will utilize those by default – so that no data has to go through the lakeFS server. Let’s see a few examples:

import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('main')

r = branch.object('data/example.yaml').reader(mode='r')  # default mode is 'rb', returning bytes
data = r.read()  # contents of object, as a string
r.close()

# use as a context manager:
with branch.object('data/example.yaml').reader(mode='r') as r:
    print(r.read())

# line buffering is also supported
with branch.object('data/example.yaml').reader(mode='r') as r:
    for line in r:
        print(line)

# Being file-like, we can simply pass a reader to libraries that accept those:
import yaml  # pip install PyYAML

with branch.object('data/example.yaml').reader(mode='r') as r:
    data = yaml.safe_load(r)

# Using with python's csv reader
import csv

with branch.object('data/my_table.csv').reader(mode='r') as r:
    tbl_reader = csv.reader(r)
    for line in tbl_reader:
        print(line)

# Binary files work well too! Let's parse an image file
from PIL import Image  # pip install Pillow

with branch.object('data/images/my_image.png').reader() as r:
    img = Image.open(r)
    print(f'image size: {img.size[0]} by {img.size[1]} pixels')

# and of course, parquet files!
import pandas as pd  # pip install pandas pyarrow

with branch.object('data/sales.parquet').reader() as r:
    df = pd.read_parquet(r)

Of course, we don’t have to read from a branch – we can use tags or commits as well:

import lakefs

# read from a tag
tag = lakefs.repository('example-repo').tag('v1')
with tag.object('data/example.txt').reader(mode='r') as r:
    print(r.read())

# ...or a commit:
commit = lakefs.repository('example-repo').commit('abc123')
with commit.object('data/example.txt').reader(mode='r') as r:
    print(r.read())

We can also “force” the SDK to either use pre-signed URLs or download data directly from the lakeFS server by explicitly setting the pre_sign argument:

import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('main')

with branch.object('data/example.yaml').reader(mode='r', pre_sign=True) as r:
    print(r.read()) # would fail if underlying object store doesn't support pre-signed URLs

with branch.object('data/example.yaml').reader(mode='r', pre_sign=False) as r:
    print(r.read()) # force lakeFS to proxy the data through the lakeFS server

Writing objects

Similar to reading, writing with the new SDK will, by default, probe the repository’s storage for pre-signed URL support and, if available, use it. This improves performance by allowing the client to write directly to the object store without any data going through the lakeFS server.

Regardless of pre-signed mode support, the writer interface exposes a file-like object, so we can integrate with many libraries out of the box:

import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('main')

# simple upload
w = branch.object('data/story.txt').writer(mode='w')
w.write('hello world!\n')
w.close()

# this is functionally equivalent to
branch.object('data/story.txt').upload('hello world!\n', mode='w')

# writers are context managers
with branch.object('data/story.txt').writer(mode='w') as w:
    w.write('hello world!\n')

# let's write some yaml
import yaml # pip install PyYAML

with branch.object('data/example.yaml').writer(mode='w') as w:
    yaml.safe_dump({'foo': 1, 'bar': ['a', 'b', 'c']}, w)

# How about reading an image, resizing it, and writing back?
from PIL import Image # pip install Pillow

image_object = branch.object('data/images/my_image.png')

# read, resize, write back
with image_object.reader() as r:
    img = Image.open(r).resize((300, 300))

with image_object.writer() as w:
    img.save(w, format='png')

Unlike reading, writer() requires an object derived from a branch, since this is the only type of reference that supports writing (tags and commits in lakeFS are immutable).
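
To make the distinction concrete, here’s a minimal sketch (the path is hypothetical):

import lakefs

repo = lakefs.repository('example-repo')

# writing goes through a branch-derived object:
with repo.branch('main').object('data/notes.txt').writer(mode='w') as w:
    w.write('branches are writable\n')

# objects derived from a tag or commit are read-only: they expose
# reader() but not writer(), since those references are immutable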

Diffing, committing and merging

Using the new SDK, versioning operations become much easier. Let’s see the uncommitted changes on a branch:

import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('dev')

# non-recursive: show prefixes ("directories") that changed
for change in branch.uncommitted(delimiter='/'):
    print(change.path)

# check if there are changes to a specific prefix
prefix_is_dirty = any(branch.uncommitted(prefix='some_prefix/'))

# see all uncommitted changes on our dev branch
for change in branch.uncommitted():
    print(change)

Committing operates at the branch level:

import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('dev')

branch.commit(message='added a country column to the sales table')

Diffing between any two references:

import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('dev')

# can also pass a commit or another branch
for change in branch.diff(repo.tag('v1')): 
    print(f'{change.type} {change.path}')

# shorthand form, no need to instantiate a tag object
for change in branch.diff('v1'):
    print(f'{change.type} {change.path}')

Lastly, let’s merge a branch into another:

import lakefs

repo = lakefs.repository('example-repo')
src = repo.branch('dev')
dst = repo.branch('main')

if any(dst.diff(src)):
    src.merge_into(dst)

Importing data into lakeFS

The new SDK also makes it much easier to import existing data from the object store into lakeFS:

import time
import lakefs

repo = lakefs.repository('example-repo')
branch = repo.branch('dev')

branch.import_data(commit_message='added public S3 data')                    \
    .prefix('s3://example-bucket1/path1/', destination='datasets/path1/') \
    .prefix('s3://example-bucket1/path2/', destination='datasets/path2/') \
    .run()

# run() is a convenience method that blocks until the import is reported as done,
# raising an exception if it fails. We can call start() and status() ourselves for
# an async version of the above:
importer = branch.import_data(commit_message='added public S3 data')          \
    .prefix('s3://example-bucket1/path1/', destination='datasets/path1/') \
    .prefix('s3://example-bucket1/path2/', destination='datasets/path2/')

importer.start()
status = importer.status()

while not status.completed:
    time.sleep(3)  # or whatever interval you choose
    status = importer.status()

if status.error:
    raise RuntimeError(f'import failed: {status.error}')

print(f'imported a total of {status.ingested_objects} objects!')

Using a custom client with subresources

In the examples above we used the default client, initialized from either environment variables or the values in ~/.lakectl.yaml. If we want to construct our own client, explicitly setting configuration values on it, we can pass it when initializing any resource.

Any sub-resource derived from the resource we passed the custom client to will also inherit the custom client:

import lakefs
from lakefs.client import Client

custom_client = Client(host='http://localhost:8000', ...)
custom_client.config.ssl_ca_cert = ''
custom_client.config.proxy = 'https://my-proxy.example.com:8080'

# usage
repo = lakefs.Repository('example-repo', client=custom_client)
branch = repo.branch('main') # will inherit custom_client

# nested sub-resource, will also inherit custom_client
reader = repo.tag('v1').object('foo/bar.txt').reader(mode='r') 

Conclusion

The new Python SDK provides a greatly simplified experience for lakeFS users. It makes common tasks not only easier, but also more performant by optimizing I/O when possible. 

The new SDK is open source (Apache 2.0 licensed) and is available today for all lakeFS >= 1.0 users. 
As always, if you have any questions, feedback or suggestions – we’re happy to hear from you on the lakeFS Slack!
