Collaborating Over Data: Introducing Pull Requests in lakeFS

Oz Katz
Last updated on October 21, 2024

In modern software development, Pull Requests (PRs) are a fundamental tool for collaborating on code. They allow teams to review, discuss, and merge changes in a controlled and transparent way. 

But what if you could apply that same concept to data? At lakeFS, we’re excited to introduce Pull Requests for data — a new feature that brings data collaboration, transparency, and governance to data engineering, data science and machine learning workflows.

Let’s dive into what this feature means for you, why it’s important, and how it’s going to revolutionize your data-driven projects.

What Are Pull Requests?

In the world of software development, a Pull Request is a mechanism for proposing changes to a codebase. 

When a developer makes changes in a separate branch, they can create a Pull Request to ask their peers (or other project maintainers) to review and merge those changes into the main branch. During the process, reviewers can leave comments, suggest improvements, and even run tests to ensure everything works as expected.

The Pull Request system is invaluable because it creates a structured process for change: discussions happen in the open, reviews are documented, and changes are only merged when they’re fully approved. This ensures that every change is scrutinized, improving quality and fostering collaboration.

But what if you could apply this same approach to changes in your data?

Why Pull Requests for Data?

Data is rapidly becoming one of the most critical assets for organizations, and like code, it undergoes changes frequently. However, managing data changes has historically lacked the same level of governance, transparency, and control that exists for code. Data changes can be risky—small errors in a dataset can propagate into models, dashboards, and decision-making processes, leading to costly mistakes.

This is where Pull Requests for data in lakeFS come in.

With Pull Requests, data practitioners can propose changes to datasets in a controlled environment. Whether it’s updating a dataset, adding new records, or modifying metadata, these changes can be reviewed before they are merged into the main data branch. Teams can leave comments, perform validations, run data quality tests, and ensure that the data is correct before it becomes part of the production workflow.

By adding Pull Requests to data workflows, lakeFS offers the same benefits seen in software development:

Benefit | What It Does
Collaboration | Multiple team members can collaborate on data changes, review each other’s work, and leave feedback
Transparency | Every change is visible and documented, providing a clear audit trail
Quality Control | Changes are reviewed and tested before they are merged, reducing the risk of introducing errors into the data

This building block also allows us to implement the Write-Audit-Publish pattern in a very simple way:

Write-Audit-Publish data patterns
Source: Mehdi Ouazza on X.com

This pattern is very powerful: it separates the physical act of persisting data to storage from the logical act of exposing it to readers. By separating these two concerns, we can add a crucial step: making sure that the data written actually meets our data quality standards, regulatory constraints, and privacy requirements.

To learn more about the Write-Audit-Publish pattern, visit Data Engineering Patterns: Write-Audit-Publish (WAP) by Robin Moffatt.
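
To make the three stages concrete, here is a minimal in-memory sketch of Write-Audit-Publish in Python. It is a toy model, not lakeFS code: the WapTable class, the write_audit_publish helper, and the no_missing_ids check are all hypothetical names introduced for illustration. The point it shows is that readers only ever see data that has passed the audit step.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


class WapTable:
    """Toy in-memory table: readers only ever see the published snapshot."""

    def __init__(self) -> None:
        self._published: List[Row] = []
        self._staged: Dict[str, List[Row]] = {}

    def write(self, staging_id: str, rows: List[Row]) -> None:
        # Write: persist the new rows, but keep them hidden from readers.
        self._staged[staging_id] = list(rows)

    def audit(self, staging_id: str, checks: List[Check]) -> bool:
        # Audit: every check must pass on the staged rows.
        rows = self._staged[staging_id]
        return all(check(rows) for check in checks)

    def publish(self, staging_id: str) -> None:
        # Publish: expose the staged rows to readers in a single step.
        self._published = self._published + self._staged.pop(staging_id)

    def discard(self, staging_id: str) -> None:
        # Drop staged rows that failed the audit.
        self._staged.pop(staging_id, None)

    def read(self) -> List[Row]:
        return list(self._published)


def no_missing_ids(rows: List[Row]) -> bool:
    """Example quality check: every row must carry an id."""
    return all(row.get("id") is not None for row in rows)


def write_audit_publish(
    table: WapTable, staging_id: str, rows: List[Row], checks: List[Check]
) -> bool:
    """Run the three stages in order; return True only if the data was published."""
    table.write(staging_id, rows)
    if table.audit(staging_id, checks):
        table.publish(staging_id)
        return True
    table.discard(staging_id)
    return False
```

In this analogy, the staging area plays the role of a branch, the audit plays the role of the review, and publishing corresponds to merging into the main branch.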

Using Pull Requests in lakeFS

Using Pull Requests in lakeFS is intuitive and familiar to those who have worked with Git-based workflows. Let’s take a look at an illustrated example:

Create a Branch

Start by creating a new branch where you can safely experiment with changes to your data lake:

Make Changes

Modify the data, whether it’s updating records, adding new data files, or performing transformations. In our case, we’ll keep things simple by uploading a new dataset directly from the lakeFS UI to the newly created branch:

Commit Changes

Let’s commit the changes we just made to our branch:

Open a Pull Request

When your changes are ready for review, head over to the Pull Requests tab in your repository and open a Pull Request. You can add a description of the changes.

Pro Tip: Use markdown!

Test and Validate

Run validation checks or automated data quality tests to ensure that the changes meet your standards.

Every Pull Request is assigned a unique ID. You can grab the URL and share it with others to review the change.

As with any lakeFS reference, reviewers can read from the source branch and query, test, and modify it as necessary before merging.
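
As an illustration of the kind of validation a reviewer might run against the uploaded dataset, here is a small sketch that uses only Python's standard csv module. The function name validate_csv, the column names, and the specific rules are assumptions made for this example; real checks depend on your data, and this is not a lakeFS feature.

```python
import csv
import io
from typing import List


def validate_csv(text: str, required_columns: List[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data passed."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []

    # Schema check: every required column must appear in the header.
    problems = [f"missing column: {c}" for c in required_columns if c not in header]
    if problems:
        return problems

    row_count = 0
    # Line 1 is the header, so data starts at line 2
    # (assumes no quoted fields that span several lines).
    for line_no, row in enumerate(reader, start=2):
        row_count += 1
        for column in required_columns:
            # Short rows yield None for missing fields, so normalize to "".
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty value in {column}")

    if row_count == 0:
        problems.append("file has no data rows")
    return problems
```

A reviewer could run a check like this on the file read from the source branch and paste the resulting list of problems, if any, into the discussion before approving the merge.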

Merge (or Close)

Once the review is complete and all checks have passed, merge the changes into the main branch. 

The data is now updated in a controlled and transparent manner.
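
The whole flow (branch, change, open a Pull Request, validate, merge or close) can be summarized with a toy in-memory model. This is a sketch for intuition only: ToyRepo, PullRequest, and their methods are hypothetical names, not the lakeFS API, and the model makes simplifying assumptions (commits are not modeled, and a merge simply copies the source branch over the destination).

```python
from dataclasses import dataclass, field
from typing import Dict

Snapshot = Dict[str, str]  # object path -> content (stand-in for real objects)


@dataclass
class PullRequest:
    id: int
    source: str
    destination: str
    title: str
    status: str = "open"  # one of "open", "merged", "closed"


@dataclass
class ToyRepo:
    branches: Dict[str, Snapshot] = field(default_factory=lambda: {"main": {}})
    pulls: Dict[int, PullRequest] = field(default_factory=dict)

    def create_branch(self, name: str, source: str = "main") -> None:
        # A new branch starts as a copy of its source branch.
        self.branches[name] = dict(self.branches[source])

    def put(self, branch: str, path: str, content: str) -> None:
        # Changes land on the branch only; other branches are untouched.
        self.branches[branch][path] = content

    def open_pull(self, source: str, destination: str, title: str) -> PullRequest:
        # Each Pull Request gets a unique, sequential ID in this toy model.
        pr = PullRequest(len(self.pulls) + 1, source, destination, title)
        self.pulls[pr.id] = pr
        return pr

    def diff(self, pr: PullRequest) -> Dict[str, str]:
        # What a reviewer would look at: paths added, removed, or changed.
        src, dst = self.branches[pr.source], self.branches[pr.destination]
        changes: Dict[str, str] = {}
        for path in sorted(src.keys() | dst.keys()):
            if path not in dst:
                changes[path] = "added"
            elif path not in src:
                changes[path] = "removed"
            elif src[path] != dst[path]:
                changes[path] = "changed"
        return changes

    def merge(self, pr: PullRequest, checks_passed: bool) -> bool:
        # Merge only an open Pull Request whose checks have passed.
        if pr.status != "open" or not checks_passed:
            return False
        self.branches[pr.destination] = dict(self.branches[pr.source])
        pr.status = "merged"
        return True

    def close(self, pr: PullRequest) -> None:
        # Closing abandons the proposal without touching the destination.
        if pr.status == "open":
            pr.status = "closed"
```

The key property the model captures is that the main branch stays unchanged until a Pull Request is explicitly merged, and that a failed check leaves it exactly as it was.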

Stay Tuned: What’s Coming Next?

Pull Requests are just the beginning. The lakeFS team is continuously working to improve and expand the feature sets around data collaboration. In the coming months, you can look forward to:

  • Automated Data Validation Checks: Automatically validate data changes as part of the Pull Request workflow to catch issues before they’re merged
  • Enhanced Review Tools: More advanced comparison tools to easily spot differences in large datasets and gain insights into how data has changed
  • Richer Collaboration: Better collaboration between authors and reviewers through inline comments, discussions, and approval processes

These upcoming features will further streamline collaboration and governance in data operations, helping teams to work smarter, faster, and with more confidence in the quality of their data.

Try It Out Yourself

Excited about the new Pull Request feature? Try it out today! Start by creating a branch, making some changes, and opening your first Pull Request in lakeFS.

If you’re new to lakeFS, now is the perfect time to explore what it can do for your data management workflows. Download lakeFS or book a demo to see how Pull Requests and versioning can transform the way you handle your data!
