Amit Kesarwani

Last updated on January 27, 2026

lakeFS Enterprise offers a fully standards-compliant implementation of the Apache Iceberg REST Catalog, enabling Git-style version control for structured data at scale. This integration allows teams to use Iceberg-compatible tools like Spark, Trino, and PyIceberg without any vendor lock-in or proprietary formats.

By treating Iceberg tables as versioned entities within lakeFS repositories and branches, users can create isolated environments to test schema changes, ingest new data, or experiment safely without impacting production. Branches are metadata-only and zero-copy, making them fast and storage-efficient. Once validated, changes can be merged across branches using conflict-aware operations.

The REST Catalog ensures efficient performance by routing table access directly between compute engines and object stores, bypassing lakeFS in the data path. Governance is also enhanced – every change is recorded as a commit, allowing full traceability and rollback. Role-Based Access Control (RBAC) and audit logs support compliance and security needs.

This tutorial includes a PyIceberg example, showing how to register and access Iceberg tables via lakeFS using standard APIs.

In short, lakeFS’s REST Catalog for Iceberg brings version control, reproducibility, and safe collaboration to modern data lake architectures built on Apache Iceberg.

What you’ll learn in this tutorial

  • How to run lakeFS Enterprise locally
  • How to create and manage an Iceberg table backed by lakeFS
  • How to read and query the table using PyIceberg
  • How to version changes using lakeFS branches and commits

Prerequisites

Make sure you have:

  • Docker & Docker Compose
  • Git

Step 1: Clone the lakeFS Samples

git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples/02_lakefs_enterprise

Step 2: Start lakeFS Enterprise

The lakeFS Enterprise Sample is the quickest way to experience lakeFS Enterprise features, including the lakeFS Iceberg REST Catalog, in a containerized environment. This Docker-based setup is ideal if you want to interact with lakeFS without the hassle of integration and experiment with it without writing code.

By running the lakeFS Enterprise Sample, you get a ready-to-use environment that includes the following containers:

  • lakeFS Enterprise (includes additional features like lakeFS Iceberg REST Catalog)
  • Postgres: used by lakeFS as a KV (Key-Value) store
  • MinIO container: used as S3-compatible object storage connected to lakeFS
  • Jupyter notebooks setup: pre-populated with notebooks that demonstrate lakeFS Enterprise’s capabilities
  • Apache Spark: a Spark client that can be used instead of PyIceberg to interact with the Iceberg tables you’ll manage with lakeFS

Contact lakeFS to get a token for lakeFS Enterprise, then log in to the Treeverse Docker Hub with the granted token so that the proprietary lakeFS Enterprise image can be pulled:

docker login -u externallakefs

Run the following command to provision the lakeFS Enterprise server along with MinIO for your object store and the Jupyter notebook server.
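A typical invocation for a Docker Compose based sample like this one is shown below; the exact command may differ, so check the README in the 02_lakefs_enterprise directory for the authoritative invocation and any required environment variables, such as the lakeFS Enterprise token:

docker compose up -d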

Step 3: Log in to lakeFS

Go to the lakeFS UI (http://localhost:8084) and log in with the following credentials:

  • Access Key ID: AKIAIOSFOLKFSSAMPLES
  • Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
login to lakeFS

Step 4: Create a lakeFS Repository

Go to the Jupyter UI (http://localhost:8894) and open the “iceberg-books” notebook from the File Browser panel on the side:

Iceberg notebook

In the notebook, run the cells until you reach the one that creates the lakeFS repository named “lakefs-py-iceberg”:

repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)

branchMain = repo.branch(mainBranch)
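This snippet, and the steps that follow, rely on configuration variables defined earlier in the notebook. A minimal sketch of those definitions (the endpoint and storage namespace values here are assumptions that may differ in your environment; the credentials are the ones from Step 3):

import lakefs

lakefsEndPoint = "http://lakefs:8000"   # assumed in-network address of the lakeFS server
lakefsAccessKey = "AKIAIOSFOLKFSSAMPLES"
lakefsSecretKey = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

storageNamespace = "s3://example"       # assumed MinIO bucket backing the repository
repo_name = "lakefs-py-iceberg"
mainBranch = "main"
devBranch = "dev"
icebergNamespace = "lakefs_demo"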

Refresh the lakeFS UI and you will see that a repository named “lakefs-py-iceberg” has been created:

lakefs-py-iceberg

Step 5: Configure the Iceberg REST Catalog

Continue running the cells in the notebook to configure the lakeFS Iceberg REST Catalog:

  • The Iceberg REST catalog API is exposed at “/iceberg/api” in the lakeFS server
  • Use lakeFS access key and secret for authentication
  • Use MinIO endpoint and credentials so PyIceberg client can access S3-compatible object storage
from pyiceberg.catalog.rest import RestCatalog

catalog = RestCatalog(
    name="my_catalog",
    **{
        'prefix': 'lakefs',
        'uri': f'{lakefsEndPoint}/iceberg/api',
        'oauth2-server-uri': f'{lakefsEndPoint}/iceberg/api/v1/oauth/tokens',
        'credential': f'{lakefsAccessKey}:{lakefsSecretKey}',
        's3.endpoint': 'http://minio:9000',
        's3.access-key-id': 'minioadmin',
        's3.secret-access-key': 'minioadmin',
        's3.region': 'us-east-1',
        's3.force-virtual-addressing': False,
    }
)

Step 6: Create the Iceberg Namespace and Tables Using PyIceberg

Create “lakefs_demo” Iceberg namespace in the lakeFS repository’s “main” branch:

lakefs_demo_ns = (repo_name, mainBranch, icebergNamespace)
catalog.create_namespace(lakefs_demo_ns)
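As a quick check (not part of the original notebook excerpt), you can list the namespaces visible under the repository and branch:

# List namespaces under the (repository, branch) prefix
print(catalog.list_namespaces((repo_name, mainBranch)))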

Go to the lakeFS UI and click on the “lakefs-py-iceberg” repository to open it. Next, click on “_lakefs_tables” > “iceberg” > “namespaces” and you will see that the “lakefs_demo” namespace was created in the repository’s “main” branch.

Create tables in the “lakefs_demo” namespace on the lakeFS repository’s “main” branch. The notebook creates three tables; the “authors” table is shown below:

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, IntegerType, StringType

# create authors table
authors_schema = Schema(
    NestedField(
        field_id=1,
        name="id",
        field_type=IntegerType(),
        required=True
    ),
    NestedField(
        field_id=2,
        name="name",
        field_type=StringType(),
        required=True
    ),
)
table_authors = (repo_name, mainBranch, icebergNamespace, 'authors')

catalog.create_table(
    identifier=table_authors,
    schema=authors_schema
)
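As an optional verification (not in the original post), you can also list the tables registered under the namespace directly from PyIceberg:

# List tables under the lakefs_demo namespace on the main branch
print(catalog.list_tables((repo_name, mainBranch, icebergNamespace)))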

Go back to the lakeFS UI and click on “lakefs_demo” > “tables” and you will see that the tables were created in the “lakefs_demo” namespace:

lakeFS demo namespace

Step 7: Insert Sample Data

Insert sample data into all three tables. The “authors” table is shown here:

import pyarrow as pa

# Insert data into the authors table
authors_data = [
    {"id": 1, "name": "J.R.R. Tolkien"},
    {"id": 2, "name": "George R.R. Martin"},
    {"id": 3, "name": "Agatha Christie"},
    {"id": 4, "name": "Isaac Asimov"},
    {"id": 5, "name": "Stephen King"},
]

authors_arrow_schema = pa.schema([
    pa.field("id", pa.int8(), nullable=False),
    pa.field("name", pa.string(), nullable=False),
])
authors_arrow_table = pa.Table.from_pylist(authors_data, schema=authors_arrow_schema)
authors_table = catalog.load_table(table_authors)
authors_table.append(authors_arrow_table)
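To read the data back with PyIceberg, you can scan the table into an Arrow table (a quick sanity check, not shown in the original excerpt; .to_pandas() works as well):

# Scan the authors table on the main branch and print its contents
print(authors_table.scan().to_arrow())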

Step 8: Create a Branch in lakeFS

Create a “dev” branch sourced from the “main” branch in the lakeFS repository:

branchDev = repo.branch(devBranch).create(source_reference=mainBranch)

Click on the “Branches” tab in the lakeFS UI and you will see that the “dev” branch was created:

branches in main default

Step 9: Change Data in the New Branch

Delete a few records from the “book_sales” table in the “dev” branch:

table_book_sales = (repo_name, devBranch, icebergNamespace, 'book_sales')
book_sales_table = catalog.load_table(table_book_sales)
book_sales_table.delete(delete_filter="id IN (1, 2, 6, 10, 15)")
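You can also compare the two branches from PyIceberg itself; this is an illustrative check, and the deleted rows should only be missing on “dev”:

# Compare row counts between the main and dev branches after the delete
main_sales = catalog.load_table((repo_name, mainBranch, icebergNamespace, 'book_sales'))
dev_sales = catalog.load_table((repo_name, devBranch, icebergNamespace, 'book_sales'))
print(main_sales.scan().to_arrow().num_rows, dev_sales.scan().to_arrow().num_rows)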

Now compare data between “main” and “dev” branches:

compare dev and main branches

Step 10: Merge and Revert the Changes

Merge the data changes from the “dev” branch into the “main” branch:

res = branchDev.merge_into(branchMain)

You also have the option to revert/roll back the changes on the “main” branch:

branchMain.revert(parent_number=1, reference=mainBranch)
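After the revert, reloading the table from the “main” branch should show the original rows again (an illustrative check):

# Reload book_sales from main and confirm the pre-merge row count is back
restored = catalog.load_table((repo_name, mainBranch, icebergNamespace, 'book_sales'))
print(restored.scan().to_arrow().num_rows)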

Step 11: Try Other Samples

You can also try the “iceberg-books-spark” and “iceberg-books-trino” notebooks, which use the Spark and Trino clients, respectively, instead of PyIceberg.

Step 12: Shut Everything Down

Once you’re finished, you can run the following to remove the Docker containers created in Step 2 above:
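With the Docker Compose setup from Step 2, shutting down and removing the containers is typically:

docker compose down

Add --volumes if you also want to remove the data volumes used by MinIO and Postgres.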

Summary

With PyIceberg and lakeFS you can now:

  • Read and manage Iceberg tables entirely in Python
  • Use lakeFS to isolate, test and version Iceberg tables
  • Merge validated changes just like in Git workflows
  • Revert the changes (if needed)
