Since its inception, lakeFS has shipped with a full-featured Python SDK. For each new version of lakeFS, this SDK is automatically generated from the OpenAPI specification published by that version.
While this always ensured the Python SDK shipped with all possible features, the automatically generated code wasn’t always the nicest (or most Pythonic) to work with:
- Credentials had to be configured explicitly, with no support for e.g. environment variables or configuration files
- No IO abstractions – users had to implement their own logic to support pre-signed URLs, convert between types to allow using the retrieved content with popular libraries that expect file-like objects, etc.
- No built-in pagination, requiring manual offset management and server fetching
- Unintuitive API structure at times – request and response fields were mapped to auto-generated Python classes
With the recent release of lakeFS 1.0, we felt it was time to improve that.
We worked hard over the last few months, building a much nicer abstraction layer on top of the generated code. One that is simpler to use, ties better into the rest of the Python Data ecosystem – and perhaps more importantly, makes common tasks much easier to accomplish.
In this blog post, I’ll cover some of the nice quality of life improvements offered by the latest SDK, accompanied by useful code examples for each one.
I urge you to follow the examples and give them a try yourself!
Prerequisites
Before installing the latest Python SDK, make sure you have the following configured:
- A relatively recent version of Python installed (>= 3.9)
- A running lakeFS installation (version >= 1.0)
Tip: You can try lakeFS easily without installing anything on https://lakefs.cloud/
Installing the new SDK
Installation is simpler than ever. In your favorite terminal, IDE or notebook, run the following command:
pip install lakefs
This simple command will install the latest currently available SDK version with all required dependencies (spoiler: there aren’t many!).
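If you want to confirm which version was installed, you can query the package metadata from Python; for example:
from importlib import metadata
# print the version of the installed lakefs distribution
print(metadata.version('lakefs'))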
Configuring a lakeFS Client
The new SDK will try to automatically configure the endpoint and credentials to be used. Once imported, it will discover these from the following places, in order:
- Any configuration passed explicitly by the user when initializing a client
- Otherwise, check for a local lakectl configuration file (~/.lakectl.yaml) and use its configuration
- Otherwise, check for the LAKECTL_SERVER_ENDPOINT_URL, LAKECTL_CREDENTIALS_ACCESS_KEY_ID and LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY environment variables
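For example, a minimal environment-variable setup might look like the following (the endpoint and credential values below are placeholders):
import os
# illustrative values only – set these before importing lakefs so the default client can pick them up
os.environ['LAKECTL_SERVER_ENDPOINT_URL'] = 'http://localhost:8000'
os.environ['LAKECTL_CREDENTIALS_ACCESS_KEY_ID'] = '<access-key-id>'
os.environ['LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY'] = '<secret-access-key>'
import lakefs  # the SDK will discover the endpoint and credentials set above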
This means that if the environment variables are set, or ~/.lakectl.yaml exists, no configuration code is needed; the default, automatically initialized client can be used:
import lakefs
Explicitly setting configuration values
You can also create a lakefs.Client object with explicit connection details:
import lakefs
client = lakefs.Client(host="https://<org_id>.lakefscloud.io", ...)
Jump to Using a custom client with subresources below to see how to use a custom client.
Using the lakeFS Python SDK
The new SDK follows a resource-based approach that lends itself to the lakeFS hierarchical model: objects are read from and written to version references (commits, branches, tags); these references are part of repositories, and repositories are tied to a lakeFS installation. Let’s see this in practice:
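As a quick sketch of that hierarchy, resources are simply chained off one another (the repository and object path below are illustrative):
import lakefs
# repository -> reference (branch, tag or commit) -> object
obj = lakefs.repository('example-repo').branch('main').object('datasets/example.csv')
print(obj.path)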
Listing lakeFS repositories
import lakefs
for repo in lakefs.repositories():
    print(repo.id)
All SDK methods that list resources can optionally accept pagination/query parameters (max_amount, prefix, after). For example:
import lakefs
for repo in lakefs.repositories(prefix='test-repo-', max_amount=5):
    print(repo.id)
Tip: You don’t need to manually paginate through results! All listing operations are Python generators that will retrieve more results behind the scenes as they are consumed. Had enough results? Simply break out of the for-loop.
Selecting a specific repository to work with
import lakefs
repo = lakefs.repository('example-repo')
Creating a repository if it doesn’t exist yet
To create a new repository, call the Repository.create() method. By default, if the repository already exists, we’ll get an error trying to create it again. Passing exist_ok=True will return the existing repository instead of raising an error if the repository ID already exists:
import lakefs
repo = lakefs.repository('example-repo').create(
    storage_namespace='s3://my-s3-bucket/example-repo/',
    exist_ok=True)
Creating & listing branches and tags
Listing branches works using the same patterns as above. We need a Repository object, from which we can call the branches() or tags() methods:
import lakefs
repo = lakefs.repository('example-repo')
# create a new branch from "main"
repo.branch('dev').create(source_reference_id='main') # source_reference_id could also be a commit ID or a tag
# list all branches in the repository
for branch in repo.branches():
    print(branch.id)
# same idea for tags
for tag in repo.tags():
    print(tag.id)
# we can also pass optional query parameters when listing:
for tag in repo.tags(prefix='dev-'):
    print(tag.id)
# or with a list comprehension
tag_ids = [tag.id for tag in repo.tags()]
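Creating a tag follows the same pattern; here is a minimal sketch, assuming Tag.create() accepts a source reference (a branch name, commit ID or another tag) just like Branch.create():
# create an immutable tag pointing at the current head of "main"
repo.tag('v1').create('main')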
Reading a reference’s commit log
This works by calling the log() method on any Reference (or subclass such as Branch and Tag):
import lakefs
branch = lakefs.repository('example-repo').branch('main')
# Read the latest 10 commits
for commit in branch.log(max_amount=10):
    print(commit.message)
Listing objects
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('main')
# Passing a delimiter will return lakefs.CommonPrefix and lakefs.ObjectInfo objects
for entry in branch.objects(delimiter='/', prefix='my_directory/'):
    print(entry.path)
# to list recursively, omit the delimiter.
# Listing will return only lakefs.ObjectInfo objects
for entry in branch.objects(max_amount=100):
    print(f'{entry.path} (size: {entry.size_bytes:,})')
# let's calculate the size of a directory!
total_bytes = sum([entry.size_bytes for entry in branch.objects(prefix='my_directory/')])
Reading objects
One of the biggest benefits of the new SDK is how it handles object I/O. The new SDK exposes a reader and writer that are Pythonic file-like objects, immediately usable by the vast majority of libraries that read, parse or write data.
Behind the scenes, these readers and writers are efficient; they will probe the lakeFS server to understand if pre-signed URLs are supported, and if so, will utilize those by default – so that no data has to go through the lakeFS server. Let’s see a few examples:
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('main')
r = branch.object('data/example.yaml').reader(mode='r') # default mode is 'rb', returning bytes
data = r.read() # contents of object, as a string
r.close()
# use as a context manager:
with branch.object('data/example.yaml').reader(mode='r') as r:
    print(r.read())
# line buffering is also supported
with branch.object('data/example.yaml').reader(mode='r') as r:
    for line in r:
        print(line)
# Being file-like, we can simply pass a reader to libraries that accept those:
import yaml # pip install PyYAML
with branch.object('data/example.yaml').reader(mode='r') as r:
    data = yaml.safe_load(r)
# Using with python's csv reader
import csv
with branch.object('data/my_table.csv').reader(mode='r') as r:
    tbl_reader = csv.reader(r)
    for line in tbl_reader:
        print(line)
# Binary files work well too! let's parse an image file
from PIL import Image # pip install Pillow
with branch.object('data/images/my_image.png').reader() as r:
    img = Image.open(r)
    print(f'image size: {img.size[0]} by {img.size[1]} pixels')
# and of course, parquet files!
import pandas as pd # pip install pandas pyarrow
with branch.object('data/sales.parquet').reader() as r:
    df = pd.read_parquet(r)
Of course, we don’t have to read from a branch – we can use tags or commits as well:
import lakefs
# read from a tag
tag = lakefs.repository('example-repo').tag('v1')
with tag.object('data/example.txt').reader(mode='r') as r:
    print(r.read())
# ...or a commit:
commit = lakefs.repository('example-repo').commit('abc123')
with commit.object('data/example.txt').reader(mode='r') as r:
    print(r.read())
We can also “force” the SDK to either use pre-signed URLs or download data directly from the lakeFS server by explicitly setting the pre_sign argument:
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('main')
with branch.object('data/example.yaml').reader(mode='r', pre_sign=True) as r:
    print(r.read()) # would fail if underlying object store doesn't support pre-signed URLs
with branch.object('data/example.yaml').reader(mode='r', pre_sign=False) as r:
    print(r.read()) # force lakeFS to proxy the data through the lakeFS server
Writing objects
Similar to reading, writing with the new SDK will, by default, probe the repository’s storage for pre-signed URL support and, if enabled, use it. This improves performance by allowing the client to write directly to the object store without any data going through the lakeFS server.
Regardless of pre-signed mode support, the writer interface exposes a file-like object, so we can integrate with many libraries out of the box:
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('main')
# simple upload
w = branch.object('data/story.txt').writer(mode='w')
w.write('hello world!\n')
w.close()
# this is functionally equivalent to
branch.object('data/story.txt').upload('hello world!\n', mode='w')
# writers are context managers
with branch.object('data/story.txt').writer(mode='w') as w:
    w.write('hello world!\n')
# let's write some yaml
import yaml # pip install PyYAML
with branch.object('data/example.yaml').writer(mode='w') as w:
    yaml.safe_dump({'foo': 1, 'bar': ['a', 'b', 'c']}, w)
# How about reading an image, resizing it, and writing back?
from PIL import Image # pip install Pillow
image_object = branch.object('data/images/my_image.png')
# read, resize, write back
with image_object.reader() as r:
    img = Image.open(r).resize((300, 300))
with image_object.writer() as w:
    img.save(w, format='png')
Unlike reading, writer() requires an object derived from a branch, since this is the only type of reference that supports writing (tags and commits in lakeFS are immutable).
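As with readers, you can set the pre_sign argument explicitly when writing; a short sketch, assuming writer() accepts the same pre_sign flag as reader():
# force the upload to go through the lakeFS server instead of a pre-signed URL
with branch.object('data/story.txt').writer(mode='w', pre_sign=False) as w:
    w.write('hello world!\n')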
Diffing, committing and merging
Using the new SDK, versioning operations become much easier. Let’s see the uncommitted changes on a branch:
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('dev')
# non-recursive, show prefix ("directories") that changed:
for change in branch.uncommitted(delimiter='/'):
    print(change.path)
# check if there are changes to a specific prefix
prefix_is_dirty = any(branch.uncommitted(prefix='some_prefix/'))
# see all uncommitted changes on our dev branch
for change in branch.uncommitted():
    print(change)
Committing operates at the branch level:
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('dev')
branch.commit(message='added a country column to the sales table')
Diffing between any two references:
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('dev')
branch.commit(message='added a country column to the sales table')
# can also pass a commit or another branch
for change in branch.diff(repo.tag('v1')):
    print(f'{change.type} {change.path}')
# shorthand form, no need to instantiate a tag object
for change in branch.diff('v1'):
    print(f'{change.type} {change.path}')
Lastly, let’s merge a branch into another:
import lakefs
repo = lakefs.repository('example-repo')
src = repo.branch('dev')
dst = repo.branch('main')
if any(dst.diff(src)):
    src.merge_into(dst)
Importing data into lakeFS
The new SDK also makes it much easier to import existing data from the object store into lakeFS:
import time
import lakefs
repo = lakefs.repository('example-repo')
branch = repo.branch('dev')
branch.import_data(commit_message='added public S3 data') \
.prefix('s3://example-bucket1/path1/', destination='datasets/path1/') \
.prefix('s3://example-bucket1/path2/', destination='datasets/path2/') \
.run()
# run() is a convenience method that blocks until the import is reported as done,
# raising an exception if it fails. We can call start() and status() ourselves for
# an async version of the above:
importer = branch.import_data(commit_message='added public S3 data') \
.prefix('s3://example-bucket1/path1/', destination='datasets/path1/') \
.prefix('s3://example-bucket1/path2/', destination='datasets/path2/')
importer.start()
status = importer.status()
while not status.completed:
    time.sleep(3) # or whatever interval you choose
    status = importer.status()
if status.error:
    ... # handle!
print(f'imported a total of {status.ingested_objects} objects!')
Using a custom client with subresources
In the examples above we used the default client, initialized from either environment variables or the values in ~/.lakectl.yaml. If we want to construct our own client and explicitly set configuration values on it, we can pass it when initializing any resource.
Any sub-resource derived from the resource we passed the custom client to will also inherit that client:
import lakefs
from lakefs.client import Client
custom_client = Client(host='http://localhost:8000', ...)
custom_client.config.ssl_ca_cert = ''
custom_client.config.proxy = 'https://my-proxy.example.com:8080'
# usage
repo = lakefs.Repository('example-repo', client=custom_client)
branch = repo.branch('main') # will inherit custom_client
# nested sub-resource, will also inherit custom_client
reader = repo.tag('v1').object('foo/bar.txt').reader(mode='r')
Conclusion
The new Python SDK provides a greatly simplified experience for lakeFS users. It makes common tasks not only easier, but also more performant by optimizing I/O when possible.
The new SDK is open source (Apache 2.0 licensed) and is available today for all lakeFS >= 1.0 users.
As always, if you have any questions, feedback or suggestions – we’re happy to hear from you on the lakeFS Slack!


