One common problem data teams face today is how to avoid stepping on teammates’ toes when working in data environments. Data assets are often handled like a shared folder that anybody can access, edit, and write to. This causes conflicting changes or accidental overrides, leading to data inconsistencies or loss of work.
But this problem isn’t new. Software development teams have been facing this for years when collaborating on source code. Luckily, they also developed a solution: version control.
Today it’s hard to imagine a successful development project without source version control.
And the data world is quickly catching up. A data version control system is the enabler for data collaboration. If you are part of a data team with two or more members, or you’re looking to improve collaboration within your data environment, keep reading to learn about the impact of good collaboration in data environments and how you can achieve it using lakeFS.
What Is Data Collaboration?
Data collaboration refers to cooperation both among the data practitioners on a team and between data teams and the broader organization. It requires teams to work closely together, sharing information and insights to attain a common goal.
Data cooperation is an essential business component because it allows companies to remain competitive in an environment where the amount of data available grows exponentially.
Software engineers collaborate during development. They usually write code on their own, explaining and documenting it as they go. After honing their code in their preferred programming environment, they share their work with others through a source control platform (such as GitHub or GitLab).
Sounds like a breeze, doesn’t it? Well, this doesn’t hold for data teams.
Data practitioners don’t travel from point A to point B. They respond to inquiries, figure out underlying issues, perform research, and generally make their way through unknown territory without a clear destination or roadmap in sight. Because of this, working together with teammates and stakeholders happens far earlier in the process.
Data collaboration can be categorized into one of two groups:
| Collaboration Type | Description |
|---|---|
| Team collaboration | Usually occurs asynchronously and places importance on aspects like quality assurance, monitoring, and recording a project’s progress. |
| Organizational collaboration | This is where projects are shared with stakeholders who are technical and non-technical, collectively shaping the way businesses save, arrange, utilize, and benefit from data projects over time. |
Benefits of Successful Data Collaboration
Data collaboration is all about exchanging information and ideas among team members to foster a collaborative environment and make informed decisions.
Enhanced decision-making
Collaboration gives teams access to real-time data, allowing them to make informed decisions consistent with the organization’s overall objectives. Sharing data allows teams to lessen the risk of making decisions based on limited, obsolete, or incomplete information.
Streamlined workflows
Collaboration helps organizations streamline workflows by eliminating the need to enter data manually. Data collaboration tools and technology enable the automation of operations, saving time spent on manual data entry while boosting data accuracy and dependability.
Increased innovation
This type of collaboration promotes a climate of innovation. By offering teams a variety of viewpoints, data collaboration encourages creative problem-solving. Working together allows teams to develop new ideas, which leads to better decision-making and business outcomes.
Data Collaboration Best Practices
What does optimal data collaboration actually look like? Here are four key considerations for teams looking to boost their cooperation:

1. Simultaneous teamwork
Multiple people across different teams must be able to work on the same data assets without conflicts or lost work.
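In lakeFS terms (demonstrated in detail later in this post), each person works on an isolated branch of the data and merges back when ready. Here is a minimal sketch using the high-level lakeFS Python SDK; the repository and branch names are hypothetical:
import lakefs

# Hypothetical repository and branch names, for illustration only
repo = lakefs.Repository("data-collaboration-repo")

# Each teammate branches off main and works in isolation
alice_branch = repo.branch("alice-transformations").create(source_reference="main")
bob_branch = repo.branch("bob-ingestion").create(source_reference="main")

# Changes land on main only when a teammate merges their branch back
alice_branch.merge_into(repo.branch("main"))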
2. Maintaining data governance
Organizations adhere to data governance policies to protect privacy and follow regulations such as GDPR. Data access control is a critical part of data governance processes. Data collaboration should be consistent with governance policies and ensure the right data is accessible to the right people. This is where Role-Based Access Control (RBAC) and data policies come in handy.
3. Sharing specific data versions
Communication is a critical part of collaboration. When data changes constantly, referring to a dataset by name alone is not enough: we need to specify the version of the data we used, so that we are actually talking about the same dataset. Today’s data is not the data we had a week ago. Just as we specify the software version when reporting an issue in a software product, we need to specify the data version when we communicate about data.
Data versions make it easy to demonstrate work and experiments. Team members should share reliable, static references to data to ensure consistency and enhance collaboration.
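For example, with the high-level lakeFS Python SDK (used in the demo below), a teammate can pin a read to an exact commit ID instead of a moving branch. This is a minimal sketch; the repository name, commit ID, and object path are hypothetical:
import lakefs

# Hypothetical repository, commit ID, and path, for illustration only
repo = lakefs.Repository("data-collaboration-repo")
pinned_version = "c2085f1"  # the exact data version shared with a teammate

# Read the dataset exactly as it existed at that commit, regardless of
# any changes made to the branch since
obj = repo.ref(pinned_version).object("shopping_transactions/raw/rawdata.csv")
data = obj.reader(mode='r').read()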
4. Clear change history
Team members should be able to track which changes were made to the data, when, and why. By having access to change history, you can maintain a transparent and accountable data modification record and achieve full reproducibility of past events.
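As a sketch of what this looks like with the high-level lakeFS Python SDK (the repository name is hypothetical), the commit log answers exactly these questions:
import lakefs

# Hypothetical repository name, for illustration only
repo = lakefs.Repository("data-collaboration-repo")

# Walk the change history of main: who changed the data, when, and why
for commit in repo.branch("main").log(max_amount=10):
    print(commit.id, commit.committer, commit.creation_date, commit.message)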
How can you implement these in practice? By integrating data version control!
Git’s success in the world of software development can be attributed to its strong support for the engineering best practices that developers demand, specifically:
- Ability to work together while the project is being developed
- In the event of a mistake, restoring the code repository to a prior version
- Reproducing and resolving problems with a certain code version
- Integrating and releasing new code continuously (Write-Audit-Publish)
Thanks to lakeFS, data practitioners can now effortlessly manage their data through a simple, intuitive Git-like interface that delivers precisely the advantages they have been missing:
- Work on data assets simultaneously, without conflicts or lost work
- Easily revert to a previous version of the data in case of errors, ensuring users always have access to high-quality data (see the sketch after this list)
- Simplify troubleshooting by maintaining a detailed history of data changes and identifying the specific versions that introduced issues
- Reproduce and investigate issues using specific data versions
- Implement and continuously run checks and validations on data before integrating it into the production environment, preventing errors from reaching end users
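As an example of the revert capability, here is a minimal sketch assuming the high-level lakeFS Python SDK; the repository name and commit ID are hypothetical:
import lakefs

# Hypothetical repository and commit ID, for illustration only
repo = lakefs.Repository("data-collaboration-repo")
main = repo.branch("main")

bad_commit_id = "c2085f1"  # the commit that introduced the bad data

# Create a new commit on main that undoes the bad commit,
# restoring consumers of main to high-quality data
main.revert(bad_commit_id)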
Challenges of Effective Data Collaboration
Siloed data
Departmental data silos in large businesses can hamper the free flow of information. Overcoming these impediments requires cultural shifts, organizational restructuring, and technology solutions that encourage cross-departmental data access and collaboration.
Data security
As data sources and cyberattacks become increasingly sophisticated, data security becomes a top requirement for data both in transit and at rest. Businesses are challenged with protecting sensitive information from unauthorized access, data breaches, and accidental disclosure.
Strong encryption, firewalls, and intrusion detection systems are essential, but they are expensive to implement and require frequent updates to keep up with new threats. A centralized system for managing all data – especially personally identifiable data – helps to ensure security and compliance with regulations such as the GDPR.
Costly and complex data pipelines
Data availability, quality assurance, and protection all entail expenses. Infrastructure charges, data preparation activities, and the adoption of security and privacy safeguards all incur costs. Businesses must balance these concerns with the potential benefits of data sharing to remain profitable.
Scalability
The massive amount of data generated nowadays creates storage, transmission, and processing issues. Deciding what data to share, guaranteeing fast data transfer, and making sense of large datasets necessitate advanced storage systems, efficient data pipelines, and strong analytics tools.
How to Implement Data Collaboration with lakeFS
The lakeFS versioning engine allows for Git-like operations on top of object storage.
The same collaboration and organization benefits that source control brings to application code become available when you integrate these processes into your data lake pipelines.
In this section, we will demonstrate how to use lakeFS Git-like operations and RBAC capabilities to enable data collaboration within your organization.
Sample Notebook
You will run the sample notebook in a local Docker container.
The diagram below demonstrates the setup created in the sample notebook:
Prerequisites
- Docker installed on your local machine
- This sample flow requires a lakeFS Cloud or lakeFS Enterprise setup
Step 1: Acquire lakeFS Access Key and Secret Key
In this step, you will acquire the lakeFS Access Key and Secret Key that will later on be used in the following steps. If you already have a Secret Key, you can skip this section.
Note: To create a new access key, you either need the AuthManageOwnCredentials Policy or the AuthFullAccess Policy attached to your user.
Log in to lakeFS and click Administration -> Create Access Key.
A new key will be generated:
As instructed, copy the Secret Access Key and store it somewhere safe. You will not be able to access it again (but you will be able to create new ones).
Step 2: Demo Setup: Clone Samples Repo and Run a Demo Environment
*Please note: the code examples provided are references for the steps described and do not work standalone.
git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples/
docker compose --profile local-lakefs up
Step 3: Edit Notebook Configuration
Edit the following configuration in the notebook:
- Change lakeFS Cloud endpoint and credentials:
lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
- Storage Information: Since you are going to create a lakeFS repository in the demo, you will need to change the storage namespace to point to a unique path. If you are using your own bucket, insert the path to your bucket. If you are using a bucket in lakeFS Cloud, you will want to create a repository in a subdirectory of the sample repository that was automatically created for you.
For example, if you log in to your lakeFS Cloud and see:
Add a subdirectory to the existing path (in this case, s3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/). i.e. insert:
storageNamespace = 's3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/image-segmentation-repo/'
- Install lakectl: Install and configure lakectl (the lakeFS command-line tool) on your computer.
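Configuration is interactive: running lakectl config prompts for the credentials from Step 1 and your endpoint, roughly as follows:
lakectl config
# Config file /home/user/.lakectl.yaml will be used
# Access key ID: <lakeFS Access Key>
# Secret access key: <lakeFS Secret Key>
# Server endpoint URL: <lakeFS Endpoint URL>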
Step 4: lakeFS Environment Setup
You can run the Setup cells in the notebook without changing anything. During setup, you will do the following:
- Define a few variables
- Install & import Python libraries
- Create an admin user, attach it to the pre-configured Admins group, and create a lakeFS Python client for that user
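As a minimal sketch of that last step, assuming the high-level lakeFS Python SDK (the endpoint and credential values are the placeholders from Step 3):
import lakefs
from lakefs.client import Client

# Values configured in Step 3 (placeholders here)
lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'

# A client bound to the admin user's credentials; repository operations
# in the demo run with this user's permissions
admin1LakefsClient = Client(
    host=lakefsEndPoint,
    username=lakefsAccessKey,
    password=lakefsSecretKey)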
Step 5: Run the Demo
This section includes a high-level overview of the demo flow with key code snippets; you can find the full code reference here.
- An admin user creates a “data-collaboration” repository:
repo = lakefs.Repository("data-collaboration-repo", client=admin1LakefsClient).create(
    storage_namespace=f"{storageNamespace}/{repo_name}",
    default_branch=mainBranch,
    exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)
- The admin protects the main branch by enforcing no direct writes to it:
admin1Client.repositories_api.set_branch_protection_rules(
    repository=repo_name,
    branch_protection_rule=[models.BranchProtectionRule(
        pattern=mainBranch)])
- An admin user creates two developer users (“developer1” and “developer2”) and adds them to the Developers group, which is pre-configured by lakeFS.
# Create users
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='developer1'))
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='developer2'))

# Find the pre-configured "Developers" group by paging through all groups
groupNameDevelopers = 'Developers'
has_more = True
next_offset = ""
while has_more:
    groups = superUserClient.auth_api.list_groups(after=next_offset)
    for r in groups.results:
        if r.name == groupNameDevelopers:
            groupIdDevelopers = r.id
            break
    has_more = groups.pagination.has_more
    next_offset = groups.pagination.next_offset

# Attach users to the "Developers" group
admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer1')
admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer2')
- An admin user creates a data scientist user and a DataScientists group, and adds the user to the group. The admin attaches pre-configured lakeFS policies to the new group.
# Create a DataScientists group
DataScientistsGroup = admin1Client.auth_api.create_group(
    group_creation=models.GroupCreation(
        id='DataScientists'))

# Attach pre-configured policies to the new group
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='AuthManageOwnCredentials')
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSReadWriteAll')
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='RepoManagementReadAll')

# Create a data scientist user
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='data_scientist1'))

# Attach the created user to the DataScientists group
admin1Client.auth_api.add_group_membership(
    group_id=DataScientistsGroup.id,
    user_id='data_scientist1')
- The admin user creates an “FSBlockMergingToMain” policy and attaches it to the DataScientists group to prevent data scientists from introducing changes to production.
# Create an FSBlockMergingToMain policy
admin1Client.auth_api.create_policy(
    policy=models.Policy(
        id='FSBlockMergingToMain',
        statement=[models.Statement(
            effect="deny",
            resource="arn:lakefs:fs:::repository/*/branch/main",
            action=["fs:CreateCommit"],
        )]
    )
)

# Attach the policy to the DataScientists group
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSBlockMergingToMain')
- When a data scientist tries to merge their changes to production, the merge operation fails.
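As a sketch of what that failure looks like from the data scientist’s side, assuming the SDK surfaces the policy denial as a ForbiddenException (the client, branch name, and keys are illustrative):
import lakefs
from lakefs.client import Client
from lakefs.exceptions import ForbiddenException

# Hypothetical client authenticated with data_scientist1's access keys
dataScientist1Client = Client(
    host='<lakeFS Endpoint URL>',
    username='<data_scientist1 Access Key>',
    password='<data_scientist1 Secret Key>')

repo = lakefs.Repository("data-collaboration-repo", client=dataScientist1Client)
experiment = repo.branch("ds-experiment")  # hypothetical work branch

try:
    experiment.merge_into(repo.branch("main"))
except ForbiddenException as err:
    # FSBlockMergingToMain denies fs:CreateCommit on main, so the merge fails
    print("Merge blocked by policy:", err)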
- “Developer1” and “Developer2” are collaborating on a raw shopping transactions dataset without stepping on each other’s toes:
  - “Developer1” is testing their work on production data in isolation. Their work touches the raw shopping transactions dataset.
# Developer1 creates a branch from main to test changes in isolation
branchTransformationsChange = repo.branch('transformations-change').create(source_reference=mainBranch)
print("transformations-change ref:", branchTransformationsChange.get_commit().id)
  - At the same time, “Developer2” is ingesting new data into the raw shopping transactions dataset.
# Developer2 creates a branch from main to ingest new data into the raw shopping transactions dataset
branchSecondIngestion = repo.branch('ingest-shopping-transactions-2').create(source_reference=mainBranch)
print("ingest-shopping-transactions-2 ref:", branchSecondIngestion.get_commit().id)

# Data ingestion to an isolated branch
contentToUpload = open(f"/data/{fileName}", 'rb').read()
branchSecondIngestion.object('shopping_transactions/raw/rawdata2.csv').upload(data=contentToUpload, mode='wb', pre_sign=False)

ref = branchSecondIngestion.commit(message='Ingested raw shopping transactions data!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

# A merge is the data ingestion to production
res = branchSecondIngestion.merge_into(branchMain)
print(res)
  - The changes made by “Developer2” don’t impact “Developer1”’s work.
# A comparison between "Developer1"'s work branch and "Developer2"'s ingestion
# branch shows only the data introduced by "Developer2". Meaning, those changes
# had no impact on the data "Developer1" is working on.
diff = branchTransformationsChange.diff(other_ref=branchSecondIngestion)

Path                                    Path Type  Size(Bytes)  Type
--------------------------------------  ---------  -----------  -----
shopping_transactions/raw/rawdata2.csv  object     9            added

The Future of Data Collaboration
Teams using lakeFS can already work simultaneously on different branches without conflicts, maintain data governance through role-based access control (RBAC), and share specific data versions to ensure consistency. The clear change history allows teams to track modifications, understand dataset evolution, and maintain accountability. However, there are still areas of data collaboration that remain uncovered.
In software development, collaboration thrives on the ability to comment and annotate—features often missing in data environments. What if lakeFS introduced pull requests for data? This feature would enable users to leave comments or notes on specific parts of a dataset, fostering discussions and promoting knowledge sharing within the team. Bringing this level of interaction to data management could change how teams collaborate on and understand their data.
By incorporating pull requests and annotations, lakeFS can further enhance the collaborative experience it already provides. Teams can review changes, discuss potential impacts, and make informed decisions, all within a unified platform. This approach elevates data collaboration to new heights, enabling more efficient, transparent, and cohesive teamwork in data-driven projects.