
Tal Sofer, Product Manager at Treeverse

Published on July 10, 2024

One common problem data teams face today is how to avoid stepping on teammates’ toes when working in data environments. Data assets are often handled like a shared folder that anybody can access, edit, and write to. This causes conflicting changes or accidental overwrites, leading to data inconsistencies or loss of work.

But this problem isn’t new. Software development teams have been facing this for years when collaborating on source code. Luckily, they also developed a solution: version control.

Today it’s hard to imagine a successful development project without source version control.

And the data world is quickly catching up. A data version control system is the enabler for data collaboration. If you are part of a data team with two or more members, or you’re looking to improve collaboration within your data environment, keep reading to learn about the impact of good collaboration in data environments and how you can achieve it using lakeFS.

What Is Data Collaboration?

Software engineers collaborate during development. They usually write code on their own, documenting it as they go. After honing their code in their preferred programming environment, they share their work with others through an independent source control platform (such as GitHub, GitLab, etc.).

Sounds like a breeze, doesn’t it? Well, this doesn’t hold for data teams.

Data practitioners rarely travel in a straight line from point A to point B. They respond to inquiries, figure out underlying issues, perform research, and generally make their way through unknown territory without a clear destination or roadmap in sight. Because of this, working together with teammates and stakeholders happens far earlier in the process.

Data collaboration can be categorized into one of two groups:

  • Team collaboration: usually occurs asynchronously and places importance on aspects like quality assurance, monitoring, and recording a project’s progress.
  • Organizational collaboration: projects are shared with technical and non-technical stakeholders, collectively shaping the way businesses save, arrange, utilize, and benefit from data projects over time.

Optimal Data Collaboration

What does optimal data collaboration actually look like? Here are four key considerations for teams looking to boost their cooperation:

[Figure: Data collaboration with lakeFS]

1. Simultaneous teamwork

Multiple people, often across different teams, must be able to work on the same data assets without conflicts or lost work.

2. Maintaining data governance

Organizations adhere to data governance policies to protect privacy and follow regulations such as GDPR. Data access control is a critical part of data governance processes. Data collaboration should be consistent with governance policies and ensure the right data is accessible to the right people. This is where Role-Based Access Control (RBAC) and data policies come in handy.

3. Sharing specific data versions  

Communication is a critical part of collaboration. When data changes constantly, referring to a dataset by name is not enough: we need to specify the version of the data we used, so that we are actually talking about the same dataset. Today’s data is not the data we had a week ago. Just as we specify the software version when reporting an issue in a software product, we need to specify the data version when communicating about data.

Data versions make it easy to demonstrate work and experiments. Team members should share reliable, static references to ensure consistency and enhance collaboration.
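For example, a teammate can be pointed at an immutable commit ID, or a tag naming it, rather than at a moving branch. A minimal sketch using the high-level lakefs Python SDK (the repository name, tag, and object path are illustrative):

import lakefs

repo = lakefs.Repository("data-collaboration-repo")

# A commit ID is an immutable reference to the exact data you used
commit_id = repo.branch("main").get_commit().id

# Optionally give that version a human-readable name with a tag
repo.tag("q3-report-input").create(commit_id)

# Anyone reading through the commit ID (or tag) sees exactly the same data,
# no matter how many commits land on main afterwards
obj = repo.ref(commit_id).object("shopping_transactions/raw/rawdata.csv")
print(obj.reader(mode='r').read())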

4. Clear change history

Team members should be able to track which changes were made to the data, when, and why. By having access to change history, you can maintain a transparent and accountable data modification record and achieve full reproducibility of past events.
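In lakeFS, this history is the commit log. A minimal sketch, again assuming the high-level lakefs Python SDK and an illustrative repository name:

import lakefs

repo = lakefs.Repository("data-collaboration-repo")

# Walk the commit log of main: who changed the data, when, and why
for commit in repo.ref("main").log(max_amount=5):
    print(commit.id[:12], commit.committer, commit.creation_date, commit.message)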

How can you implement these in practice? By integrating data version control!

Git’s success in the world of software development can be attributed to its strong support for the engineering best practices that developers demand, specifically:

  • Ability to work together while the project is being developed
  • In the event of a mistake, restoring the code repository to a prior version
  • Reproducing and resolving problems with a certain code version
  • Integrating and releasing new code continuously (CI/CD)

Thanks to lakeFS, data practitioners can now manage their data through a simple, intuitive Git-like interface that offers precisely the advantages they have been missing:

  • Work on data assets simultaneously, without conflicts or losing work.
  • Easily revert to a previous version of data in case of errors, ensuring users have access to high-quality data.
  • Simplify troubleshooting by maintaining a detailed history of data changes and identifying specific versions that may have introduced issues.
  • Reproduce and investigate issues using specific data versions.
  • Implement and continuously run checks and validations on data before integrating it into the production environment, preventing errors from reaching end users.
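The last point deserves a concrete sketch. Below is one way to express this branch-validate-merge pattern with the high-level lakefs Python SDK; the branch name, prefix, and validation rule are illustrative and not part of the demo that follows:

import lakefs

repo = lakefs.Repository("data-collaboration-repo")

# Ingest into an isolated branch instead of writing straight to production
staging = repo.branch("daily-ingest").create(source_reference="main", exist_ok=True)
# ... write new objects to the staging branch here ...

# Run whatever checks you need before the data can reach production
paths = [obj.path for obj in staging.objects(prefix="shopping_transactions/raw/")]
assert all(p.endswith(".csv") for p in paths), "unexpected file format on branch"

# Only validated data is promoted; a failed check never touches main
staging.commit(message="Validated raw shopping transactions ingest")
staging.merge_into(repo.branch("main"))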

How to Use lakeFS to Enable Data Collaboration

The lakeFS versioning engine allows for Git-like operations on top of object storage. 

Integrating these processes into your data lake pipelines brings the same collaboration and organization benefits that source control brings to application code.

In this section, we will demonstrate how to use lakeFS Git-like operations and RBAC capabilities to enable data collaboration within your organization.

Sample Notebook 

You will run the sample notebook in a local Docker container. 

The diagram below demonstrates the setup created in the sample notebook: 

[Figure: Data collaboration setup diagram]

Prerequisites

To follow along, you will need Docker installed and running, since the sample notebook runs in a local container.

Step 1: Acquire lakeFS Access Key and Secret Key

In this step, you will acquire the lakeFS access key and secret key that you will use in the following steps. If you already have an access key and secret key, you can skip this step.

Note: To create a new access key, you need either the AuthManageOwnCredentials policy or the AuthFullAccess policy attached to your user.

Log in to lakeFS and go to Administration -> Create Access Key.

A new key will be generated:

[Figure: Creating an access key]

As instructed, copy the Secret Access Key and store it somewhere safe. You will not be able to access it again (but you will be able to create new ones).
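These credentials are what the demo’s Python clients authenticate with. For reference, a client can be constructed along these lines, assuming the lakefs_sdk package and a local lakeFS endpoint (the values are placeholders):

import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

configuration = lakefs_sdk.Configuration(
    host='http://localhost:8000/api/v1',    # your lakeFS endpoint
    username='<lakeFS Access Key ID>',      # access key ID from this step
    password='<lakeFS Secret Access Key>',  # secret key from this step
)
client = LakeFSClient(configuration)

# Sanity check: list the repositories visible to these credentials
print(client.repositories_api.list_repositories())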

Step 2: Demo Setup: Clone Samples Repo and Run a Demo Environment

*Please note: the code examples below are reference snippets for the steps described and are not standalone programs.

git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples/
docker compose --profile local-lakefs up

Step 3: Edit Notebook Configuration

Edit the following configuration in the notebook:

  • Change the lakeFS endpoint and credentials:
lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
  • Storage Information: Since you are going to create a lakeFS repository in the demo, you will need to change the storage namespace to point to a unique path. If you are using your own bucket, insert the path to your bucket. If you are using a bucket in lakeFS Cloud, you will want to use a subdirectory of the storage namespace of the sample repository that was automatically created for you.

    For example, if you log in to your lakeFS Cloud and see a sample repository, add a subdirectory to its existing storage namespace path (in this case, s3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/), i.e. insert:

storageNamespace = 's3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/image-segmentation-repo/'
  • Install lakectl: Install and configure lakectl (the lakeFS command-line tool) on your computer.

Step 4: lakeFS Environment Setup

You can run the Setup cells in the notebook without changing anything. During setup, you will do the following:

  • Define a few variables
  • Install & import Python libraries
  • Create an admin user, attach it to the pre-configured Admins group, and create a lakeFS Python client for the admin to use

Step 5: Run the Demo

This section includes a high-level overview of the demo flow with key code snippets; you can find the full code reference here.

  • An admin user creates a “data-collaboration” repository and protects its main branch
# Create the repository
repo = lakefs.Repository(
    "data-collaboration-repo",
    client=admin1LakefsClient).create(
        storage_namespace=f"{storageNamespace}/{repo_name}",
        default_branch=mainBranch,
        exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

# Protect the main branch: it can only be changed through merges
admin1Client.repositories_api.set_branch_protection_rules(
    repository=repo_name,
    branch_protection_rule=[models.BranchProtectionRule(
        pattern=mainBranch)])
  • An admin user creates two developer users (“developer1” and “developer2”) and adds them to the Developers group, which is pre-configured by lakeFS.
# Create users
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(id='developer1'))

admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(id='developer2'))

# Find the ID of the pre-configured "Developers" group by paging through
# the group list
groupNameDevelopers = 'Developers'
groupIdDevelopers = None

has_more = True
next_offset = ""
while has_more and groupIdDevelopers is None:
    groups = superUserClient.auth_api.list_groups(after=next_offset)
    for r in groups.results:
        if r.name == groupNameDevelopers:
            groupIdDevelopers = r.id
            break
    has_more = groups.pagination.has_more
    next_offset = groups.pagination.next_offset

# Attach users to the "Developers" group
admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer1')

admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer2')
  • An admin user creates a data scientist user and a Data Scientists group, and adds the user to the group. The admin attaches pre-configured lakeFS policies to the new group.
# Create a Data Scientists group
DataScientistsGroup = admin1Client.auth_api.create_group(
    group_creation=models.GroupCreation(id='DataScientists'))

# Attach pre-configured policies to the new group
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='AuthManageOwnCredentials')

admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSReadWriteAll')

admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='RepoManagementReadAll')

# Create a data scientist user
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(id='data_scientist1'))

# Add the user to the Data Scientists group
admin1Client.auth_api.add_group_membership(
    group_id=DataScientistsGroup.id,
    user_id='data_scientist1')
  • The admin user creates an “FSBlockMergingToMain” policy and attaches it to the Data Scientists group to prevent data scientists from introducing changes to production.
# Create an FSBlockMergingToMain policy that denies commits (and therefore
# merges) into any repository's main branch
admin1Client.auth_api.create_policy(
    policy=models.Policy(
        id='FSBlockMergingToMain',
        statement=[models.Statement(
            effect="deny",
            resource="arn:lakefs:fs:::repository/*/branch/main",
            action=["fs:CreateCommit"],
        )]
    )
)

# Attach the policy to the Data Scientists group
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSBlockMergingToMain')
  • When a data scientist tries to merge their changes to production, the merge operation fails. 
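For reference, a failed attempt would look roughly like this, assuming a lakefs_sdk client authenticated with data_scientist1's credentials (the client variable and branch name are illustrative, not from the notebook):
import lakefs_sdk

try:
    # dataScientist1Client is a hypothetical client using data_scientist1's keys
    dataScientist1Client.refs_api.merge_into_branch(
        repository=repo_name,
        source_ref='experiment-branch',
        destination_branch=mainBranch)
except lakefs_sdk.exceptions.ApiException as e:
    # The FSBlockMergingToMain policy denies fs:CreateCommit on main,
    # so lakeFS rejects the merge with HTTP 403
    print('Merge blocked:', e.status, e.reason)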
  • “Developer1” and “Developer2” are collaborating on a raw shopping transactions dataset without stepping on each other’s toes:
    • “Developer1” is testing their work on production data in isolation. Their work touches the raw shopping transactions dataset.
# Developer1 creates a branch from main to test changes in isolation 
branchTransformationsChange = repo.branch('transformations-change').create(source_reference=mainBranch)
print("transformations-change ref:", branchTransformationsChange.get_commit().id)
    • At the same time, “Developer2” is ingesting new data to the raw shopping transactions dataset.
# Developer2 creates a branch from main to ingest new data to the raw shopping transactions dataset
branchSecondIngestion = repo.branch('ingest-shopping-transactions-2').create(source_reference=mainBranch)
print("ingest-shopping-transactions-2 ref:", branchSecondIngestion.get_commit().id)

# Ingest data to the isolated branch
contentToUpload = open(f"/data/{fileName}", 'r').read()
branchSecondIngestion.object('shopping_transactions/raw/rawdata2.csv').upload(data=contentToUpload, mode='wb', pre_sign=False)

ref = branchSecondIngestion.commit(message='Ingested raw shopping transactions data!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

# Merging into main promotes the ingested data to production
res = branchSecondIngestion.merge_into(branchMain)
print(res)
    • The changes made by “Developer2” don’t impact “Developer1”’s work:
# A comparison between "Developer1"'s work branch and "Developer2"'s ingestion
# branch shows only the data introduced by "Developer2". Meaning, the ingestion
# had no impact on the data "Developer1" is working on
diff = branchTransformationsChange.diff(other_ref=branchSecondIngestion)

Path                                    Path Type      Size(Bytes)  Type
--------------------------------------  -----------  -------------  ------
shopping_transactions/raw/rawdata2.csv  object                   9  added
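When “Developer1” is ready to pick up the newly ingested data, they can merge main back into their work branch. A minimal sketch, not part of the notebook’s verbatim flow:

# Developer1 pulls the freshly merged production data into their work branch;
# their in-progress changes are preserved
res = branchMain.merge_into(branchTransformationsChange)
print(res)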

The Future of Data Collaboration

Teams using lakeFS can already work simultaneously on different branches without conflicts, maintain data governance through role-based access control (RBAC), and share specific data versions to ensure consistency. The clear change history allows teams to track modifications, understand dataset evolution, and maintain accountability. However, there are still areas of data collaboration that remain uncovered.

In software development, collaboration thrives on the ability to comment and annotate—features often missing in data environments. What if lakeFS introduced pull requests for data? This feature would enable users to leave comments or notes on specific parts of a dataset, fostering discussions and promoting knowledge sharing within the team. Bringing this level of interaction to data management could change how teams collaborate on and understand their data.

By incorporating pull requests and annotations, lakeFS can further enhance the collaborative experience it already provides. Teams can review changes, discuss potential impacts, and make informed decisions, all within a unified platform. This approach elevates data collaboration to new heights, enabling more efficient, transparent, and cohesive teamwork in data-driven projects.
