One common problem data teams face today is how to avoid stepping on teammates’ toes when working in data environments. Data assets are often handled like a shared folder that anybody can access, edit, and write to. This causes conflicting changes or accidental overrides, leading to data inconsistencies or loss of work.
But this problem isn’t new. Software development teams have been facing this for years when collaborating on source code. Luckily, they also developed a solution: version control.
Today it’s hard to imagine a successful development project without source version control.
And the data world is quickly catching up. A data version control system is the enabler for data collaboration. If you are part of a data team with two or more members, or you’re looking to improve collaboration within your data environment, keep reading to learn about the impact of good collaboration in data environments and how you can achieve it using lakeFS.
What Is Data Collaboration?
Data collaboration refers to cooperation both among the data practitioners on a team and between data teams and the broader organization. It requires teams to work closely together, sharing information and insights to attain a common goal.
Data cooperation is an essential business component because it allows companies to remain competitive in an environment where the amount of data available grows exponentially.
Software engineers collaborate during development. They usually write code on their own, explaining and documenting it as they go. After honing their code in their preferred programming environment, they share their work with others through a source control platform (such as GitHub or GitLab).
Sounds like a breeze, doesn’t it? Well, this doesn’t hold for data teams.
Data practitioners don’t travel from point A to point B. They respond to inquiries, figure out underlying issues, perform research, and generally make their way through unknown territory without a clear destination or roadmap in sight. Because of this, working together with teammates and stakeholders happens far earlier in the process.
Data collaboration can be categorized into one of two groups:
| Collaboration Type | Description |
|---|---|
| Team collaboration | Usually occurs asynchronously and places importance on aspects like quality assurance, monitoring, and recording a project’s progress. |
| Organizational collaboration | This is where projects are shared with stakeholders who are technical and non-technical, collectively shaping the way businesses save, arrange, utilize, and benefit from data projects over time. |
Benefits of Successful Data Collaboration
Data collaboration is all about exchanging information and ideas among team members to foster a collaborative environment and make informed decisions.
Enhanced decision-making
Collaboration gives teams access to real-time data, allowing them to make informed decisions consistent with the organization’s overall objectives. Sharing data allows teams to lessen the risk of making decisions based on limited, obsolete, or incomplete information.
Streamlined workflows
Collaboration helps organizations streamline workflows by eliminating the need to enter data manually. Data collaboration tools and technology enable the automation of operations, saving time spent on manual data entry while boosting data accuracy and dependability.
Increased innovation
This type of collaboration promotes a climate of innovation. By offering teams a variety of viewpoints, data collaboration encourages creative problem-solving. Working together allows teams to develop new ideas, which leads to better decision-making and business outcomes.
Data Collaboration Best Practices
What does optimal data collaboration actually look like? Here are four key considerations for teams looking to boost their cooperation:

1. Simultaneous teamwork
Multiple people across different teams must be able to work on the same data assets without conflicts or lost work.
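In lakeFS terms (demonstrated in detail later in this post), each person works on an isolated branch of the data and merges back when ready. Here is a minimal sketch using the high-level lakeFS Python SDK; the repository and branch names are hypothetical:
import lakefs

# Hypothetical repository and branch names, for illustration only
repo = lakefs.Repository("data-collaboration-repo")

# Each teammate branches off main and works in isolation
alice_branch = repo.branch("alice-transformations").create(source_reference="main")
bob_branch = repo.branch("bob-ingestion").create(source_reference="main")

# Changes land on main only when a teammate merges their branch back
alice_branch.merge_into(repo.branch("main"))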
2. Maintaining data governance
Organizations adhere to data governance policies to protect privacy and follow regulations such as GDPR. Data access control is a critical part of data governance processes. Data collaboration should be consistent with governance policies and ensure the right data is accessible to the right people. This is where Role-Based Access Control (RBAC) and data policies come in handy.
3. Sharing specific data versions
Communication is a critical part of collaboration. When data changes constantly, referring to a dataset by name alone is not enough: we need to specify the version of the data we used, so that we are actually talking about the same dataset. Today’s data is not the data we had a week ago. Just as we specify the software version when reporting an issue in a software product, we need to specify the data version when we communicate about data.
Data versions make it easy to demonstrate work and experiments. Team members should share reliable, static references to data to ensure consistency and enhance collaboration.
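For example, with the high-level lakeFS Python SDK (used in the demo below), a teammate can pin a read to an exact commit ID instead of a moving branch. This is a minimal sketch; the repository name, commit ID, and object path are hypothetical:
import lakefs

# Hypothetical repository, commit ID, and path, for illustration only
repo = lakefs.Repository("data-collaboration-repo")
pinned_version = "c2085f1"  # the exact data version shared with a teammate

# Read the dataset exactly as it existed at that commit, regardless of
# any changes made to the branch since
obj = repo.ref(pinned_version).object("shopping_transactions/raw/rawdata.csv")
data = obj.reader(mode='r').read()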
4. Clear change history
Team members should be able to track which changes were made to the data, when, and why. By having access to change history, you can maintain a transparent and accountable data modification record and achieve full reproducibility of past events.
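As a sketch of what this looks like with the high-level lakeFS Python SDK (the repository name is hypothetical), the commit log answers exactly these questions:
import lakefs

# Hypothetical repository name, for illustration only
repo = lakefs.Repository("data-collaboration-repo")

# Walk the change history of main: who changed the data, when, and why
for commit in repo.branch("main").log(max_amount=10):
    print(commit.id, commit.committer, commit.creation_date, commit.message)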
How can you implement these in practice? By integrating data version control!
Git’s success in the world of software development can be attributed to its strong support for the engineering best practices that developers demand, specifically:
- Ability to work together while the project is being developed
- In the event of a mistake, restoring the code repository to a prior version
- Reproducing and resolving problems with a certain code version
- Integrating and releasing new code continuously (Write-Audit-Publish)
Thanks to lakeFS, data practitioners can now effortlessly manage their data through a simple, intuitive Git-like interface that delivers precisely the advantages they have been missing:
- Work on data assets simultaneously, without conflicts or lost work
- Easily revert to a previous version of the data in case of errors, ensuring users always have access to high-quality data (see the sketch after this list)
- Simplify troubleshooting by maintaining a detailed history of data changes and identifying the specific versions that introduced issues
- Reproduce and investigate issues using specific data versions
- Implement and continuously run checks and validations on data before integrating it into the production environment, preventing errors from reaching end users
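As an example of the revert capability, here is a minimal sketch assuming the high-level lakeFS Python SDK; the repository name and commit ID are hypothetical:
import lakefs

# Hypothetical repository and commit ID, for illustration only
repo = lakefs.Repository("data-collaboration-repo")
main = repo.branch("main")

bad_commit_id = "c2085f1"  # the commit that introduced the bad data

# Create a new commit on main that undoes the bad commit,
# restoring consumers of main to high-quality data
main.revert(bad_commit_id)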
Challenges of Effective Data Collaboration
Siloed data
Departmental data silos in large businesses can hamper the free flow of information. Overcoming these impediments requires cultural shifts, organizational restructuring, and technology solutions that encourage cross-departmental data access and collaboration.
Data security
As data sources and cyberattacks become increasingly sophisticated, data security becomes a top requirement for data both in transit and at rest. Businesses are challenged with protecting sensitive information from unauthorized access, data breaches, and accidental disclosure.
Strong encryption, firewalls, and intrusion detection systems are essential, but they are expensive to implement and require frequent updates to keep up with new threats. A centralized system for managing all data – especially personally identifiable data – helps to ensure security and compliance with regulations such as the GDPR.
Costly and complex data pipelines
Data availability, quality assurance, and protection all entail expenses. Infrastructure charges, data preparation activities, and the adoption of security and privacy safeguards all incur costs. Businesses must balance these concerns with the potential benefits of data sharing to remain profitable.
Scalability
The massive amount of data generated nowadays creates storage, transmission, and processing issues. Deciding what data to share, guaranteeing fast data transfer, and making sense of large datasets necessitate advanced storage systems, efficient data pipelines, and strong analytics tools.
How to Implement Data Collaboration with lakeFS
The lakeFS versioning engine allows for Git-like operations on top of object storage.
The same collaboration and organization benefits that source control brings to application code become available when you integrate these processes into your data lake pipelines.
In this section, we will demonstrate how to use lakeFS Git-like operations and RBAC capabilities to enable data collaboration within your organization.
Sample Notebook
You will run the sample notebook in a local Docker container.
The diagram below demonstrates the setup created in the sample notebook:
Prerequisites
- Docker installed on your local machine
- This sample flow requires a lakeFS Cloud or lakeFS Enterprise setup
Step 1: Acquire lakeFS Access Key and Secret Key
In this step, you will acquire the lakeFS Access Key and Secret Key that will later on be used in the following steps. If you already have a Secret Key, you can skip this section.
Note: To create a new access key, you either need the AuthManageOwnCredentials Policy or the AuthFullAccess Policy attached to your user.
Log in to lakeFS and click Administration -> Create Access Key.
A new key will be generated:
As instructed, copy the Secret Access Key and store it somewhere safe. You will not be able to access it again (but you will be able to create new ones).
Step 2: Demo Setup: Clone Samples Repo and Run a Demo Environment
*Please note: the code examples provided are references for the steps described and do not work standalone.
git clone https://github.com/treeverse/lakeFS-samples.git
cd lakeFS-samples/
docker compose --profile local-lakefs up
Step 3: Edit Notebook Configuration
Edit the following configuration in the notebook:
- Change lakeFS Cloud endpoint and credentials:
lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'
- Storage Information: Since you are going to create a lakeFS repository in the demo, you will need to change the storage namespace to point to a unique path. If you are using your own bucket, insert the path to your bucket. If you are using a bucket in lakeFS Cloud, you will want to create a repository in a subdirectory of the sample repository that was automatically created for you.
For example, if you log in to your lakeFS Cloud and see:
Add a subdirectory to the existing path (in this case, s3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/). i.e. insert:
storageNamespace = 's3://lakefs-sample-us-east-1-production/AROA5OU4KHZHHFCX4PTOM:028298a734a7a198ccd5126ebb31d7f1240faa6b64c8fcd6c4c3502fd64b6645/image-segmentation-repo/'
- Install lakectl: Install and configure lakectl (the lakeFS command-line tool) on your computer.
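Configuration is interactive: running lakectl config prompts for the credentials from Step 1 and your endpoint, roughly as follows:
lakectl config
# Config file /home/user/.lakectl.yaml will be used
# Access key ID: <lakeFS Access Key>
# Secret access key: <lakeFS Secret Key>
# Server endpoint URL: <lakeFS Endpoint URL>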
Step 4: lakeFS Environment Setup
You can run the Setup cells in the notebook without changing anything. During setup, you will do the following:
- Define a few variables
- Install & import Python libraries
- Create an admin user, attach it to the pre-configured Admins group, and create a lakeFS Python client for that user
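As a minimal sketch of that last step, assuming the high-level lakeFS Python SDK (the endpoint and credential values are the placeholders from Step 3):
import lakefs
from lakefs.client import Client

# Values configured in Step 3 (placeholders here)
lakefsEndPoint = '<lakeFS Endpoint URL>'
lakefsAccessKey = '<lakeFS Access Key>'
lakefsSecretKey = '<lakeFS Secret Key>'

# A client bound to the admin user's credentials; repository operations
# in the demo run with this user's permissions
admin1LakefsClient = Client(
    host=lakefsEndPoint,
    username=lakefsAccessKey,
    password=lakefsSecretKey)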
Step 5: Run the Demo
This section includes a high-level overview of the demo flow with key code snippets; you can find the full code reference here.
- An admin user creates a “data-collaboration” repository:
repo = lakefs.Repository("data-collaboration-repo", client=admin1LakefsClient).create(
    storage_namespace=f"{storageNamespace}/{repo_name}",
    default_branch=mainBranch,
    exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)
- The admin protects the main branch by enforcing no direct writes to it:
admin1Client.repositories_api.set_branch_protection_rules(
    repository=repo_name,
    branch_protection_rule=[models.BranchProtectionRule(
        pattern=mainBranch)])
- An admin user creates two developer users (“developer1” and “developer2”) and adds them to the Developers group, which is pre-configured by lakeFS.
# Create users
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='developer1'))
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='developer2'))

# Find the pre-configured "Developers" group by paging through all groups
groupNameDevelopers = 'Developers'
has_more = True
next_offset = ""
while has_more:
    groups = superUserClient.auth_api.list_groups(after=next_offset)
    for r in groups.results:
        if r.name == groupNameDevelopers:
            groupIdDevelopers = r.id
            break
    has_more = groups.pagination.has_more
    next_offset = groups.pagination.next_offset

# Attach users to the "Developers" group
admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer1')
admin1Client.auth_api.add_group_membership(
    group_id=groupIdDevelopers,
    user_id='developer2')
- An admin user creates a data scientist user and a DataScientists group, and adds the user to the group. The admin attaches pre-configured lakeFS policies to the new group.
# Create a DataScientists group
DataScientistsGroup = admin1Client.auth_api.create_group(
    group_creation=models.GroupCreation(
        id='DataScientists'))

# Attach pre-configured policies to the new group
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='AuthManageOwnCredentials')
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSReadWriteAll')
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='RepoManagementReadAll')

# Create a data scientist user
admin1Client.auth_api.create_user(
    user_creation=models.UserCreation(
        id='data_scientist1'))

# Attach the created user to the DataScientists group
admin1Client.auth_api.add_group_membership(
    group_id=DataScientistsGroup.id,
    user_id='data_scientist1')
- The admin user creates an “FSBlockMergingToMain” policy and attaches it to the DataScientists group to prevent data scientists from introducing changes to production.
# Create an FSBlockMergingToMain policy
admin1Client.auth_api.create_policy(
    policy=models.Policy(
        id='FSBlockMergingToMain',
        statement=[models.Statement(
            effect="deny",
            resource="arn:lakefs:fs:::repository/*/branch/main",
            action=["fs:CreateCommit"],
        )]
    )
)

# Attach the policy to the DataScientists group
admin1Client.auth_api.attach_policy_to_group(
    group_id=DataScientistsGroup.id,
    policy_id='FSBlockMergingToMain')
- When a data scientist tries to merge their changes to production, the merge operation fails.
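As a sketch of what that failure looks like from the data scientist’s side, assuming the SDK surfaces the policy denial as a ForbiddenException (the client, branch name, and keys are illustrative):
import lakefs
from lakefs.client import Client
from lakefs.exceptions import ForbiddenException

# Hypothetical client authenticated with data_scientist1's access keys
dataScientist1Client = Client(
    host='<lakeFS Endpoint URL>',
    username='<data_scientist1 Access Key>',
    password='<data_scientist1 Secret Key>')

repo = lakefs.Repository("data-collaboration-repo", client=dataScientist1Client)
experiment = repo.branch("ds-experiment")  # hypothetical work branch

try:
    experiment.merge_into(repo.branch("main"))
except ForbiddenException as err:
    # FSBlockMergingToMain denies fs:CreateCommit on main, so the merge fails
    print("Merge blocked by policy:", err)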
- “Developer1” and “Developer2” are collaborating on a raw shopping transactions dataset without stepping on each other’s toes:
  - “Developer1” is testing their work on production data in isolation. Their work touches the raw shopping transactions dataset.
# Developer1 creates a branch from main to test changes in isolation
branchTransformationsChange = repo.branch('transformations-change').create(source_reference=mainBranch)
print("transformations-change ref:", branchTransformationsChange.get_commit().id)
  - At the same time, “Developer2” is ingesting new data into the raw shopping transactions dataset.
# Developer2 creates a branch from main to ingest new data into the raw shopping transactions dataset
branchSecondIngestion = repo.branch('ingest-shopping-transactions-2').create(source_reference=mainBranch)
print("ingest-shopping-transactions-2 ref:", branchSecondIngestion.get_commit().id)

# Data ingestion to an isolated branch
contentToUpload = open(f"/data/{fileName}", 'rb').read()
branchSecondIngestion.object('shopping_transactions/raw/rawdata2.csv').upload(data=contentToUpload, mode='wb', pre_sign=False)

ref = branchSecondIngestion.commit(message='Ingested raw shopping transactions data!', metadata={'using': 'python_sdk'})
print_commit(ref.get_commit())

# A merge is the data ingestion to production
res = branchSecondIngestion.merge_into(branchMain)
print(res)
  - The changes made by “Developer2” don’t impact “Developer1”’s work.
# A comparison between "Developer1"'s work branch and "Developer2"'s ingestion
# branch shows only the data introduced by "Developer2". Meaning, those changes
# had no impact on the data "Developer1" is working on.
diff = branchTransformationsChange.diff(other_ref=branchSecondIngestion)

Path                                    Path Type  Size(Bytes)  Type
--------------------------------------  ---------  -----------  -----
shopping_transactions/raw/rawdata2.csv  object     9            added

The Future of Data Collaboration
Teams using lakeFS can already work simultaneously on different branches without conflicts, maintain data governance through role-based access control (RBAC), and share specific data versions to ensure consistency. The clear change history allows teams to track modifications, understand dataset evolution, and maintain accountability. However, there are still areas of data collaboration that remain uncovered.
In software development, collaboration thrives on the ability to comment and annotate—features often missing in data environments. What if lakeFS introduced pull requests for data? This feature would enable users to leave comments or notes on specific parts of a dataset, fostering discussions and promoting knowledge sharing within the team. Bringing this level of interaction to data management could change how teams collaborate on and understand their data.
By incorporating pull requests and annotations, lakeFS can further enhance the collaborative experience it already provides. Teams can review changes, discuss potential impacts, and make informed decisions, all within a unified platform. This approach elevates data collaboration to new heights, enabling more efficient, transparent, and cohesive teamwork in data-driven projects.